CS253 Hashing
Leonardo da Vinci’s Mona Lisa and John the Baptist
Inclusion
To use unordered_set or unordered_multiset, you need to:
#include <unordered_set>
To use unordered_map or unordered_multimap, you need to:
#include <unordered_map>
To use the class hash:
#include <functional>
Hashing in General
To store an object in a hash table:
- Combine the bits of the object into a single number,
the hash value.
- The bits that make up the real value, e.g., in a string,
the chars, not the pointer.
- Use that number (mod N) as an index into an array
of N buckets.
- Each bucket is a collection of data with the same hash value.
- If N is large enough, each bucket only contains a few values.
- A good hash container adjusts the number of buckets dynamically.
- It can take a lot of space, but it’s fast: an O(1)
index to the bucket, then an O(n) search within the bucket,
where n is the number of items in that bucket.
A good hash will keep the bucket size small.
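As a minimal sketch of the first two steps above, assuming std::hash<std::string> as the hash function. The helper name bucket_index is hypothetical, not part of the standard library:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: map a string to one of n_buckets buckets.
// std::hash works on the characters of the string, not on a pointer to them.
std::size_t bucket_index(const std::string &s, std::size_t n_buckets) {
    assert(n_buckets > 0);                               // cannot take mod 0
    const std::size_t h = std::hash<std::string>()(s);   // the hash value
    return h % n_buckets;                                // index into the array
}
```

The actual index for any given string depends on the implementation's hash function, so only the range 0 to n_buckets-1 can be relied on.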
Typical Hash Table
A hash table starts like this, an array of seven (for instance)
pointers, all initially null (indicated by ●):
- Why seven? A prime number gives us the best chance of
spreading out input data with a pattern.
- If our array size were even, and data were all multiples of 10,
then half of our buckets would be unused.
- Seven is ludicrously small for actual code, but fits on a screen.
A real hash table might have thousands of buckets.
Typical Hash Table
After adding "animal" and "vegetable":
0 | 1      | 2 | 3         | 4 | 5 | 6 |
● |        | ● |           | ● | ● | ● |
  | ⇓      |   | ⇓         |   |   |   |
  | animal |   | vegetable |   |   |   |
"animal" hashed to 22 (somehow), which is 1 (mod 7), so it’s in bucket 1.
"vegetable" hashed to 9823439, which is 3 (mod 7), so it’s in bucket 3.
- We’re not delving into the details of the string ⇒ unsigned hash
function used.
Typical Hash Table
After adding "mineral":
0 | 1      | 2 | 3                 | 4 | 5 | 6 |
● |        | ● |                   | ● | ● | ● |
  | ⇓      |   | ⇓                 |   |   |   |
  | animal |   | mineral vegetable |   |   |   |
"mineral" hashed to 3671, which is 3 (mod 7), so it also goes into bucket 3.
- Since that bucket was non-empty, it was added to the list
for that bucket.
- It doesn’t matter where in the linked list we add the new item,
so we added it at the start, which is easy.
Typical Hash Table
0 | 1      | 2 | 3                 | 4 | 5 | 6 |
● |        | ● |                   | ● | ● | ● |
  | ⇓      |   | ⇓                 |   |   |   |
  | animal |   | mineral vegetable |   |   |   |
- To traverse the table:
    for each pointer in the array:
        for each node in that linked list:
            process that item
- The input order (animal/vegetable/mineral) may not resemble the
output order (animal/mineral/vegetable).
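The insert-at-front and traversal steps can be sketched as a toy table. This is an illustration only: the name ToyTable is made up, it uses std::forward_list for the linked lists, and it does not check for duplicates or ever rehash.

```cpp
#include <array>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <vector>

// Toy table for illustration: seven buckets, each a singly linked list.
struct ToyTable {
    std::array<std::forward_list<std::string>, 7> buckets;

    // Add at the start of the bucket's list, which is the easy place.
    void insert(const std::string &s) {
        const std::size_t i = std::hash<std::string>()(s) % buckets.size();
        buckets[i].push_front(s);
    }

    // Visit every bucket in turn, and every node within each bucket.
    std::vector<std::string> traverse() const {
        std::vector<std::string> out;
        for (const auto &bucket : buckets)
            for (const auto &item : bucket)
                out.push_back(item);
        return out;
    }
};
```

The order that traverse() returns depends only on which bucket each string lands in, which is why it need not match the insertion order.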
Expanding the Table
- Of course, if our seven-pointer table gets too many items, then the
linked lists will get too long for an efficient linear search.
- When that happens, rehash : expand the table to seventeen
(another prime) pointers and rearrange everything.
- Prime numbers are useful! Who’d’ve thought?
- Increasing the table by big jumps makes rehashing occur less often,
but wastes more space. It’s a trade-off!
- The scheme of roughly doubling the container size should remind
you of vector’s memory allocation technique.
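One way to rehash without duplicating the data is to relink the existing list nodes into a new, larger array. The sketch below assumes buckets held in a std::vector of std::forward_list; the alias Buckets and the free function rehash are hypothetical names for this illustration, not the library's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <vector>

using Buckets = std::vector<std::forward_list<std::string>>;

// Hypothetical helper: move every item into a table with new_size buckets.
// splice_after relinks existing nodes, so this loop copies no strings.
Buckets rehash(Buckets old, std::size_t new_size) {
    assert(new_size > 0);
    Buckets fresh(new_size);
    for (auto &bucket : old)
        while (!bucket.empty()) {
            const std::size_t i =
                std::hash<std::string>()(bucket.front()) % new_size;
            // Unlink the first node of bucket and put it at the front of fresh[i].
            fresh[i].splice_after(fresh[i].before_begin(),
                                  bucket, bucket.before_begin());
        }
    return fresh;
}
```

Because the old table is taken by value, a caller that no longer needs it can pass it with std::move to avoid a copy at the call site.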
So What?
- This is all very nice, and good for several assignments and a quiz
in Data Structures.
- It’s tricky, and easy to get wrong:
- What’s a good hash function for strings?
- What if a list is empty?
- How do we rehash without completely duplicating all the data?
- How do we traverse the container?
- How do we compute the next largest prime number?
- Fortunately, the C++ hash containers, unordered_set,
unordered_multiset, unordered_map, and unordered_multimap,
do all of the heavy lifting for you.
- Their semantics are similar to set, multiset,
map, and multimap, except for ordering.
Hashing in C++
unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
for (auto n : p)
    cout << n << ' ';
19 17 13 11 7 5 3 2
- How many buckets were used? Who cares?
- What was the hash function used? Who cares!
- When does it rehash? Who cares?
- These all have default implementation-dependent answers,
which can be queried & changed:
- Might set a large initial number of buckets, if you know that
lots of data is coming.
- Your data might not hash well with the default hash function,
so you write your own.
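Both adjustments can be sketched together. The functor SimpleStringHash and the function make_words are hypothetical names, and the multiply-by-31 scheme is only a simple stand-in for whatever hash suits your data; nothing here claims it beats the default.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical hash functor: a simple multiply-and-add over the characters.
struct SimpleStringHash {
    std::size_t operator()(const std::string &s) const {
        std::size_t h = 0;
        for (unsigned char c : s)
            h = h * 31 + c;          // unsigned arithmetic wraps on overflow
        return h;
    }
};

// A set that starts with at least 1000 buckets and hashes with SimpleStringHash.
std::unordered_set<std::string, SimpleStringHash> make_words() {
    std::unordered_set<std::string, SimpleStringHash> words(1000);
    words.insert("animal");
    words.insert("vegetable");
    words.insert("mineral");
    return words;
}
```

The constructor argument is the initial bucket count, and the second template argument replaces the default std::hash<std::string>.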
I Care
OK, let’s say that we care. We can find out:
unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
cout << "Buckets: " << p.bucket_count() << '\n'
<< "Size: " << p.size() << '\n'
<< "Load: " << p.load_factor() << " of "
<< p.max_load_factor() << '\n';
for (size_t b = 0; b<p.bucket_count(); b++)
    if (p.bucket_size(b))
        cout << "Bucket " << b << ": "
             << p.bucket_size(b) << " items\n";
for (auto n : p)
    cout << n << ' ';
Buckets: 13
Size: 8
Load: 0.615385 of 1
Bucket 0: 1 items
Bucket 2: 1 items
Bucket 3: 1 items
Bucket 4: 1 items
Bucket 5: 1 items
Bucket 6: 1 items
Bucket 7: 1 items
Bucket 11: 1 items
19 17 13 11 7 5 3 2
Variable Number of Buckets
The number of buckets (usually prime) increases,
based on how much data the hash contains:
unordered_set<int> us;
for (int r = 1; r <= 1e6; r*=10) {
    us.reserve(r);
    cout << setw(8) << r << ' '
         << setw(8) << us.bucket_count() << '\n';
}
1 2
10 11
100 103
1000 1031
10000 10273
100000 107897
1000000 1056323
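The unordered_set::reserve(n) method asks for enough buckets to hold at
least n items without exceeding the maximum load factor. With the default
maximum load factor of 1, that means at least n buckets, but the
implementation is free to allocate more.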
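A sketch contrasting reserve() with its sibling rehash(): reserve(n) is sized in items, so it takes the maximum load factor into account, while rehash(n) is sized directly in buckets. The two wrapper function names are hypothetical.

```cpp
#include <cstddef>
#include <unordered_set>

// Bucket count after reserve(n) on an empty set with the given max load factor.
std::size_t buckets_after_reserve(std::size_t n, float max_load) {
    std::unordered_set<int> s;
    s.max_load_factor(max_load);
    s.reserve(n);      // room for n items: at least n / max_load buckets
    return s.bucket_count();
}

// Bucket count after rehash(n) on an empty set.
std::size_t buckets_after_rehash(std::size_t n) {
    std::unordered_set<int> s;
    s.rehash(n);       // at least n buckets
    return s.bucket_count();
}
```

The exact counts returned are implementation-dependent; only the lower bounds in the comments are guaranteed.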
Load Factor
- A hash table has a load factor,
defined as the average number of items per bucket.
- If this gets too large, the hash table rehashes
(allocates more buckets, puts everything in the new proper buckets).
- Any bucket may contain many items, due to a poor hash function,
or unlucky data.
- unordered_set::load_factor():
Returns the current load factor for this hash table, defined as
unordered_set::size()/unordered_set::bucket_count().
- unordered_set::max_load_factor():
Returns/sets the maximum load factor tolerated before rehashing.
Load Factor Demo
unordered_multiset<double> us;
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size() << '\n'
<< us.bucket_count() << '\n'
<< us.load_factor() << '/' << us.max_load_factor() << '\n';
1000000
1447153
0.691012/1
Real time: 373 ms
unordered_multiset<double> us;
us.max_load_factor(10);
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size() << '\n'
<< us.bucket_count() << '\n'
<< us.load_factor() << '/' << us.max_load_factor() << '\n';
1000000
126271
7.91947/10
Real time: 719 ms
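Note the time/space tradeoff: the higher maximum load factor used far fewer
buckets, but the run took longer. Also, drand48() is not the best way to
generate random numbers.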
What are the Hash Values?
The process of hashing is converting any value
(integer, floating-point, vector, set, struct MyData, etc.)
to an unsigned number, as uniquely as we can.
We can find out the hash values for this implementation:
cout << hex << setfill('0')
<< setw(16) << hash<int>()(253) << '\n'
<< setw(16) << hash<int>()(-253) << '\n'
<< setw(16) << hash<double>()(253.0) << '\n'
<< setw(16) << hash<float>()(253.0F) << '\n'
<< setw(16) << hash<double>()(-0.0) << '\n'
<< setw(16) << hash<double>()(0.0) << '\n'
<< setw(16) << hash<long>()(253L) << '\n'
<< setw(16) << hash<unsigned>()(253U) << '\n'
<< setw(16) << hash<char>()('a') << '\n'
<< setw(16) << hash<bool>()(true) << '\n'
<< setw(16) << hash<string>()("253") << '\n'
<< setw(16) << hash<string>()("") << '\n'
<< setw(16) << hash<int *>()(new int) << '\n';
00000000000000fd
ffffffffffffff03
a6e6c311a0093ae9
3363ec8d00f382ce
0000000000000000
0000000000000000
00000000000000fd
00000000000000fd
0000000000000061
0000000000000001
1a5e026e774daa8e
553e93901e462a6e
00000000012122c0
Not everything
Not all standard types are hashable:
cout << hash<ostream>()(cout); // 🦡
c.cc:1: error: use of deleted function ‘std::hash<std::basic_ostream<char> >::hash()’
int a[] = {11,22};
cout << hash<int[]>()(a); // 🦡
c.cc:2: error: use of deleted function ‘std::hash<int []>::hash()’
User-defined Types
It certainly doesn’t know how to hash your types.
Why not? Just crunch all the bits in the user-defined type into a hash value!
- Well, that would work for a struct or class that contained only scalars.
- What about a vector?
- All that a vector contains in the object itself is a pointer
and a couple of lengths. The real data is off in the heap.
- How is the compiler supposed to know how much data is in the heap?
Sure, we know that it corresponds to one of the lengths, but it
would be unreasonable to expect the compiler to know that.
- A list would be even worse—the data is all over the place!
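A person who knows how the container is laid out can do what the compiler cannot: walk the real data. Here is a minimal sketch for a vector<int>, where the functor name VectorHash and the multiplier 31 are arbitrary choices for illustration:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical functor: hash a vector<int> by walking its elements,
// since the vector object itself holds only a pointer and bookkeeping.
struct VectorHash {
    std::size_t operator()(const std::vector<int> &v) const {
        std::size_t h = v.size();
        for (int n : v)
            h = h * 31 + std::hash<int>()(n);   // order-sensitive mixing
        return h;
    }
};
```

Two vectors with equal contents produce the same value, regardless of where their data lives in the heap.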
User-defined Types
It certainly doesn’t know how to hash your types:
struct Point {
    float x, y;
};
int main() {
    Point p = {1.2, 3.4};
    cout << hash<Point>()(p); // 🦡
}
c.cc:7: error: use of deleted function ‘std::hash<Point>::hash()’
However, it can be taught.
User-defined Types
- Well, fine.
- What does unordered_set need to work with a type?
- a hash functor (to tell which bucket to go into)
- an equality comparison functor (to see if two values are the same)
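Both functors can be handed to unordered_set directly as its second and third template arguments. In this sketch the names PointHash, PointEqual, and PointSet are hypothetical; any callable types with these signatures would serve.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point { float x, y; };

// Tells the container which bucket a Point goes into.
struct PointHash {
    std::size_t operator()(const Point &p) const {
        return std::hash<float>()(p.x) ^ std::hash<float>()(p.y);
    }
};

// Tells the container whether two Points are the same value.
struct PointEqual {
    bool operator()(const Point &a, const Point &b) const {
        return a.x == b.x && a.y == b.y;
    }
};

// These arguments take the place of std::hash<Point> and std::equal_to<Point>.
using PointSet = std::unordered_set<Point, PointHash, PointEqual>;
```

With this approach nothing is added to the std:: namespace and Point needs no operator==.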
User-defined Types
We can create a template specialization for std::hash<Point>:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
int main() {
    cout << hash<Point>()(p);
}
11708950365973905104
User-defined Types
Still fails; needs ==:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
int main() {
    unordered_set<Point> us;
    us.insert(p); // 🦡
}
In file included from /usr/local/gcc/11.2.0/include/c++/11.2.0/string:48,
from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/locale_classes.h:40,
from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/ios_base.h:41,
from /usr/local/gcc/11.2.0/include/c++/11.2.0/ios:42,
from /s/bach/a/class/cs000/public_html/pmwiki/cookbook/c++-includes.h:5,
from <command-line>:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h: In instantiation of ‘constexpr bool std::equal_to<_Tp>::operator()(const _Tp&, const _Tp&) const [with _Tp = Point]’:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:1614: required from ‘bool std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::_M_equals(const _Key&, std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code, const std::__detail::_Hash_node_value<_Value, typename _Traits::__hash_cached::value>&) const [with _Key = Point; _Value = Point; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code = long unsigned int; typename _Traits::__hash_cached = std::__detail::_Hashtable_traits<true, true, true>::__hash_cached]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:1819: required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_before_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr = std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<Point, true> > >::__node_base*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:793: required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr = std::allocator<std::__detail::_Hash_node<Point, true> >::value_type*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:2084: required from ‘std::pair<typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator, bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_insert(_Arg&&, const _NodeGenerator&, std::true_type) [with _Arg = const Point&; _NodeGenerator = std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<Point, true> > >; _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; typename _Traits::__constant_iterators = std::__detail::_Hashtable_traits<true, true, true>::__constant_iterators; std::true_type = std::integral_constant<bool, true>]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:843: required from ‘std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::insert(const value_type&) [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type = std::pair<std::__detail::_Node_iterator<Point, true, true>, bool>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::value_type = Point]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/unordered_set.h:422: required from ‘std::pair<typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator, bool> std::unordered_set<_Value, _Hash, _Pred, _Alloc>::insert(const value_type&) [with _Value = Point; _Hash = std::hash<Point>; _Pred = std::equal_to<Point>; _Alloc = std::allocator<Point>; typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; std::unordered_set<_Value, _Hash, _Pred, _Alloc>::value_type = Point]’
c.cc:12: required from here
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h:356: error: no
match for ‘operator==’ in ‘__x == __y’ (operand types are ‘const
Point’ and ‘const Point’)
User-defined Types
Now, unordered_set works with a Point:
struct Point { float x, y; } p = {1.2, 3.4};
template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
        return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};
bool operator==(const Point &a, const Point &b) {
    return a.x==b.x && a.y==b.y;
}
// or could’ve specialized std::equal_to<Point>
int main() {
    unordered_set<Point> us;
    us.insert(p);
}
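One property of combining hashes with a plain XOR is that it is symmetric: the points (a, b) and (b, a) always get the same hash value, and any point with x equal to y hashes to 0. The sketch below shows the XOR version next to one possible order-sensitive alternative. The function names and the multiplier 31 are arbitrary choices for illustration, not a recommended standard technique.

```cpp
#include <cstddef>
#include <functional>

struct Point { float x, y; };

// The plain XOR combination, for comparison.
std::size_t xor_point(const Point &p) {
    return std::hash<float>()(p.x) ^ std::hash<float>()(p.y);
}

// Sketch of an order-sensitive combination: scale the first hash before
// adding the second, so that swapping x and y usually changes the result.
std::size_t hash_point(const Point &p) {
    const std::size_t hx = std::hash<float>()(p.x);
    const std::size_t hy = std::hash<float>()(p.y);
    return hx * 31 + hy;     // unsigned arithmetic wraps on overflow
}
```

Either function could serve as the body of operator() in the std::hash<Point> specialization; whether the difference matters depends on how often your data contains mirrored or diagonal points.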
The Rules
- Usually, messing around in the std:: namespace is forbidden.
- However, you may specialize templates in the std:: namespace
for your own types.