CS253: Software Development with C++

Fall 2022

Hashing

CS253 Hashing

Leonardo da Vinci’s Mona Lisa and John the Baptist

Inclusion

To use unordered_set or unordered_multiset, you need to:

    
#include <unordered_set>

To use unordered_map or unordered_multimap, you need to:

    
#include <unordered_map>

To use the class hash:

    
#include <functional>

Hashing in General

To store an object in a hash table:

Typical Hash Table

A hash table starts like this, an array of seven (for instance) pointers, all initially null (indicated by ●):

  0   1   2   3   4   5   6
  ●   ●   ●   ●   ●   ●   ●

Typical Hash Table

After adding "animal" and "vegetable":

[Diagram: buckets 0 through 6; "animal" and "vegetable" each hang from a different bucket.]

Typical Hash Table

After adding "mineral":

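[Diagram: buckets 0 through 6; "mineral" shares a bucket with one of the earlier strings, so that bucket now holds a chain of two items.]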
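
The pictures above can be sketched in code. This is a minimal illustration, not how any standard library implements its containers: the class name TinyHashTable and its member names are made up for this example, and it assumes separate chaining with new items added at the head of a bucket's list.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical teaching class: a fixed-size hash table of strings.
class TinyHashTable {
    struct Node {
        std::string value;
        std::unique_ptr<Node> next;   // next item in the same bucket, or null
    };
    std::vector<std::unique_ptr<Node>> buckets_;  // every bucket starts null

  public:
    explicit TinyHashTable(std::size_t bucket_count = 7)
        : buckets_(bucket_count) {
        assert(bucket_count > 0);     // bucket_of() divides by this
    }

    // Which bucket does this string belong in?
    std::size_t bucket_of(const std::string &s) const {
        return std::hash<std::string>()(s) % buckets_.size();
    }

    // Is s somewhere in its bucket's chain?
    bool contains(const std::string &s) const {
        for (const Node *n = buckets_[bucket_of(s)].get(); n; n = n->next.get())
            if (n->value == s)
                return true;
        return false;
    }

    // Add s at the head of its bucket's chain, unless it is already there.
    bool insert(const std::string &s) {
        if (contains(s))
            return false;
        auto &head = buckets_[bucket_of(s)];
        head = std::make_unique<Node>(Node{s, std::move(head)});
        return true;
    }
};
```

Inserting "animal", "vegetable", and "mineral" into a TinyHashTable and then calling contains() only ever walks one bucket's chain, never the whole table.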

Expanding the Table

So What?

Hashing in C++

unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
for (auto n : p)
    cout << n << ' ';
19 17 13 11 7 5 3 2 

I Care

OK, let’s say that we care. We can find out:

unordered_set<int> p = {2, 3, 5, 7, 11, 13, 17, 19};
cout << "Buckets: " << p.bucket_count() << '\n'
     << "Size: " << p.size() << '\n'
     << "Load: " << p.load_factor() << " of "
     << p.max_load_factor() << '\n';
for (size_t b = 0; b<p.bucket_count(); b++)
    if (p.bucket_size(b))
        cout << "Bucket " << b << ": "
             << p.bucket_size(b) << " items\n";
for (auto n : p)
    cout << n << ' ';
Buckets: 13
Size: 8
Load: 0.615385 of 1
Bucket 0: 1 items
Bucket 2: 1 items
Bucket 3: 1 items
Bucket 4: 1 items
Bucket 5: 1 items
Bucket 6: 1 items
Bucket 7: 1 items
Bucket 11: 1 items
19 17 13 11 7 5 3 2 
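
If we care about one particular key, bucket() tells us where it lives (or would live). A small sketch; the helper name found_in_own_bucket is made up for this example:

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>

// True if key is stored in the bucket that us.bucket(key) names.
// Assumes us.bucket_count() > 0, which holds for any non-empty set.
bool found_in_own_bucket(const std::unordered_set<int> &us, int key) {
    const std::size_t b = us.bucket(key);
    // begin(b) and end(b) iterate over just that one bucket.
    return std::find(us.begin(b), us.end(b), key) != us.end(b);
}
```

For the set of primes above, this should be true for every element and false for a value such as 4 that was never inserted.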

Variable Number of Buckets

The number of buckets (usually prime) increases, based on how much data the hash table contains:

unordered_set<int> us;
for (int r = 1; r <= 1e6; r*=10) {
    us.reserve(r);
    cout << setw(8) << r << ' '
         << setw(8) << us.bucket_count() << '\n';
}
       1        2
      10       11
     100      103
    1000     1031
   10000    10273
  100000   107897
 1000000  1056323

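The unordered_set::reserve(n) method asks for enough buckets to hold at least n elements without exceeding the maximum load factor. With the default maximum load factor of 1, that means at least n buckets. The implementation is free to allocate more.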
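
reserve() counts elements, whereas rehash() counts buckets. A sketch that makes the difference visible; the two helper function names are made up for this example:

```cpp
#include <cstddef>
#include <unordered_set>

// Bucket count of a fresh set after reserve(n), for a given max load factor.
std::size_t buckets_after_reserve(std::size_t n, float max_load) {
    std::unordered_set<int> us;
    us.max_load_factor(max_load);
    us.reserve(n);        // room for n elements
    return us.bucket_count();
}

// Bucket count of a fresh set after rehash(n).
std::size_t buckets_after_rehash(std::size_t n) {
    std::unordered_set<int> us;
    us.rehash(n);         // at least n buckets
    return us.bucket_count();
}
```

With a maximum load factor of 4, reserve(1000) only has to provide 250 buckets, while rehash(1000) always has to provide at least 1000. The exact counts are up to the implementation.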

Load Factor

unordered_set::load_factor()
Returns the current load factor for this hash table, defined as unordered_set::size()/unordered_set::bucket_count().
unordered_set::max_load_factor()
Returns/sets maximum load factor tolerated before rehashing.
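
The definition of load factor can be checked by hand. A minimal sketch; the function name load_factor_by_hand is made up for this example:

```cpp
#include <unordered_set>

// Recompute the load factor from its definition: size / bucket count.
// Assumes bucket_count() is not zero, which holds for any non-empty set.
float load_factor_by_hand(const std::unordered_set<int> &us) {
    return static_cast<float>(us.size()) /
           static_cast<float>(us.bucket_count());
}
```

The result should match us.load_factor(), and should not exceed us.max_load_factor().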

Load Factor Demo

unordered_multiset<double> us;
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size()         << '\n'
     << us.bucket_count() << '\n'
     << us.load_factor()  << '/' << us.max_load_factor() << '\n';
1000000
1447153
0.691012/1

Real time: 373 ms

unordered_multiset<double> us;
us.max_load_factor(10);
for (int i=0; i<1e6; i++)
    us.insert(drand48());
cout << us.size()         << '\n'
     << us.bucket_count() << '\n'
     << us.load_factor()  << '/' << us.max_load_factor() << '\n';
1000000
126271
7.91947/10

Real time: 719 ms

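Note the time/space tradeoff: fewer buckets save memory, but the longer chains make each operation slower. drand48() is not the best way to generate random numbers.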
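
The same experiment can be written with the <random> header instead of drand48(). A sketch only: the function name random_fill, the choice of mt19937_64, and the fixed seed are assumptions made for this example.

```cpp
#include <cstddef>
#include <random>
#include <unordered_set>

// Fill a multiset with n pseudo-random doubles in [0, 1).
// The fixed seed is arbitrary; it just makes runs repeatable.
std::unordered_multiset<double> random_fill(std::size_t n, float max_load) {
    std::mt19937_64 gen(253);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::unordered_multiset<double> us;
    us.max_load_factor(max_load);   // set before inserting anything
    for (std::size_t i = 0; i < n; i++)
        us.insert(dist(gen));
    return us;
}
```

Calling random_fill(1000000, 1) and random_fill(1000000, 10) and printing bucket_count() and load_factor() for each reproduces the comparison above; the timings will vary by machine.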

What are the Hash Values?

The process of hashing is converting any value (integer, floating-point, vector, set, struct MyData, etc.) to an unsigned number, as uniquely as we can.

We can find out the hash values for this implementation:

cout << hex << setfill('0')
     << setw(16) << hash<int>()(253)       << '\n'
     << setw(16) << hash<int>()(-253)      << '\n'
     << setw(16) << hash<double>()(253.0)  << '\n'
     << setw(16) << hash<float>()(253.0F)  << '\n'
     << setw(16) << hash<double>()(-0.0)   << '\n'
     << setw(16) << hash<double>()(0.0)    << '\n'
     << setw(16) << hash<long>()(253L)     << '\n'
     << setw(16) << hash<unsigned>()(253U) << '\n'
     << setw(16) << hash<char>()('a')      << '\n'
     << setw(16) << hash<bool>()(true)     << '\n'
     << setw(16) << hash<string>()("253")  << '\n'
     << setw(16) << hash<string>()("")     << '\n'
     << setw(16) << hash<int *>()(new int) << '\n';
00000000000000fd
ffffffffffffff03
a6e6c311a0093ae9
3363ec8d00f382ce
0000000000000000
0000000000000000
00000000000000fd
00000000000000fd
0000000000000061
0000000000000001
1a5e026e774daa8e
553e93901e462a6e
00000000012122c0
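
In the output above, 0.0 and -0.0 hash to the same value even though their bit patterns differ. That is what a hash table needs: values that compare equal must hash equally. A small sketch to check this for any hashable type; the helper name same_hash is made up for this example:

```cpp
#include <functional>
#include <string>

// Do a and b produce the same std::hash value?
template <typename T>
bool same_hash(const T &a, const T &b) {
    return std::hash<T>()(a) == std::hash<T>()(b);
}
```

For example, same_hash(0.0, -0.0) should be true, because 0.0 == -0.0 is true. The reverse does not hold: two different values are allowed to share a hash value.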

Not everything

Not all standard types are hashable:

cout << hash<ostream>()(cout);  // 🦡
c.cc:1: error: use of deleted function ‘std::hash<std::basic_ostream<char> 
   >::hash()’
int a[] = {11,22};
cout << hash<int[]>()(a);  // 🦡
c.cc:2: error: use of deleted function ‘std::hash<int []>::hash()’

User-defined Types

It certainly doesn’t know how to hash your types.

Why not? Just crunch all the bits in the user-defined type into a hash value!
  • Well, that would work for a struct or class that contained only scalars.
  • What about a vector?
  • All that a vector contains in the object itself is a pointer and a couple of lengths. The real data is off in the heap.
  • How is the compiler supposed to know how much data is in the heap? Sure, we know that it corresponds to one of the lengths, but it would be unreasonable to expect the compiler to know that.
  • A list would be even worse—the data is all over the place!
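
The programmer, unlike the compiler, does know where the data is. As a sketch of what that looks like for a vector<int>: the names VectorHash and VectorSet are made up for this example, and the multiply-by-31 mixing step is just one simple choice, not a recommendation.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical hasher: folds every element of the vector into one value.
struct VectorHash {
    std::size_t operator()(const std::vector<int> &v) const {
        std::size_t h = v.size();
        for (int n : v)
            h = h * 31 + std::hash<int>()(n);  // order of elements matters
        return h;
    }
};

// A set of vectors. The hasher is passed as the second template argument,
// since std::hash is not provided for std::vector<int>.
using VectorSet = std::unordered_set<std::vector<int>, VectorHash>;
```

vector already has ==, so a VectorSet needs nothing else: inserting {11, 22} twice stores it once.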

User-defined Types

It certainly doesn’t know how to hash your types:

struct Point {
    float x, y;
};

int main() {
    Point p = {1.2, 3.4};
    cout << hash<Point>()(p);  // 🦡
}
c.cc:7: error: use of deleted function ‘std::hash<Point>::hash()’

However, it can be taught.

User-defined Types

We can create a template specialization for std::hash<Point>:

struct Point { float x, y; } p = {1.2, 3.4};

template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
       return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};

int main() {
    cout << hash<Point>()(p);
}
11708950365973905104
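
Combining with ^ is simple, but it is symmetric: swapping x and y gives the same hash, and any Point with x equal to y hashes to zero. The sketch below shows that, along with one possible alternative. The function names are made up for this example, and the mixing formula is an assumption in the style of Boost's hash_combine, not part of the standard library.

```cpp
#include <cstddef>
#include <functional>

struct Point { float x, y; };

// The combination used above: symmetric in x and y.
std::size_t xor_hash(const Point &p) {
    return std::hash<float>()(p.x) ^ std::hash<float>()(p.y);
}

// One possible alternative: fold the second hash into the first with
// shifts and a constant, so that the order of the members matters.
std::size_t mixed_hash(const Point &p) {
    std::size_t seed = std::hash<float>()(p.x);
    seed ^= std::hash<float>()(p.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}
```

Both are valid hash functions. The difference is only in how often unequal points collide, which affects speed, not correctness.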

User-defined Types

This still fails, because unordered_set also needs == to compare keys:

struct Point { float x, y; } p = {1.2, 3.4};

template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
       return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};

int main() {
    unordered_set<Point> us;
    us.insert(p);  // 🦡
}
In file included from /usr/local/gcc/11.2.0/include/c++/11.2.0/string:48,
                 from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/locale_classes.h:40,
                 from /usr/local/gcc/11.2.0/include/c++/11.2.0/bits/ios_base.h:41,
                 from /usr/local/gcc/11.2.0/include/c++/11.2.0/ios:42,
                 from /s/bach/a/class/cs000/public_html/pmwiki/cookbook/c++-includes.h:5,
                 from <command-line>:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h: In instantiation of ‘constexpr bool std::equal_to<_Tp>::operator()(const _Tp&, const _Tp&) const [with _Tp = Point]’:
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:1614:   required from ‘bool std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::_M_equals(const _Key&, std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code, const std::__detail::_Hash_node_value<_Value, typename _Traits::__hash_cached::value>&) const [with _Key = Point; _Value = Point; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _Traits>::__hash_code = long unsigned int; typename _Traits::__hash_cached = std::__detail::_Hashtable_traits<true, true, true>::__hash_cached]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:1819:   required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_before_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_base_ptr = std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<Point, true> > >::__node_base*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:793:   required from ‘std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_find_node(std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type, const key_type&, std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code) const [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__node_ptr = std::allocator<std::__detail::_Hash_node<Point, true> >::value_type*; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::size_type = long unsigned int; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::key_type = Point; std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__hash_code = long unsigned int]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable.h:2084:   required from ‘std::pair<typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator, bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::_M_insert(_Arg&&, const _NodeGenerator&, std::true_type) [with _Arg = const Point&; _NodeGenerator = std::__detail::_AllocNode<std::allocator<std::__detail::_Hash_node<Point, true> > >; _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; typename std::__detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; typename _Traits::__constant_iterators = std::__detail::_Hashtable_traits<true, true, true>::__constant_iterators; std::true_type = std::integral_constant<bool, true>]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/hashtable_policy.h:843:   required from ‘std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::insert(const value_type&) [with _Key = Point; _Value = Point; _Alloc = std::allocator<Point>; _ExtractKey = std::__detail::_Identity; _Equal = std::equal_to<Point>; _Hash = std::hash<Point>; _RangeHash = std::__detail::_Mod_range_hashing; _Unused = std::__detail::_Default_ranged_hash; _RehashPolicy = std::__detail::_Prime_rehash_policy; _Traits = std::__detail::_Hashtable_traits<true, true, true>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::__ireturn_type = std::pair<std::__detail::_Node_iterator<Point, true, true>, bool>; std::__detail::_Insert_base<_Key, _Value, _Alloc, _ExtractKey, _Equal, _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::value_type = Point]’
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/unordered_set.h:422:   required from ‘std::pair<typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator, bool> std::unordered_set<_Value, _Hash, _Pred, _Alloc>::insert(const value_type&) [with _Value = Point; _Hash = std::hash<Point>; _Pred = std::equal_to<Point>; _Alloc = std::allocator<Point>; typename std::_Hashtable<_Value, _Value, _Alloc, std::__detail::_Identity, _Pred, _Hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<std::__not_<std::__and_<std::__is_fast_hash<_Hash>, std::__is_nothrow_invocable<const _Hash&, const _Tp&> > >::value, true, true> >::iterator = std::__detail::_Insert_base<Point, Point, std::allocator<Point>, std::__detail::_Identity, std::equal_to<Point>, std::hash<Point>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, true, true> >::iterator; std::unordered_set<_Value, _Hash, _Pred, _Alloc>::value_type = Point]’
c.cc:12:   required from here
/usr/local/gcc/11.2.0/include/c++/11.2.0/bits/stl_function.h:356: error: no 
   match for ‘operator==’ in ‘__x == __y’ (operand types are ‘const 
   Point’ and ‘const Point’)

User-defined Types

Now, unordered_set works with a Point:

struct Point { float x, y; } p = {1.2, 3.4};

template <>
struct std::hash<Point> {
    size_t operator()(const Point &p) const {
       return hash<float>()(p.x) ^ hash<float>()(p.y);
    }
};

bool operator==(const Point &a, const Point &b) {
    return a.x==b.x && a.y==b.y;
}
// or could’ve specialized std::equal_to<Point>

int main() {
    unordered_set<Point> us;
    us.insert(p);
}
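
An alternative to specializing std::hash and defining a global operator== is to hand the container its own hash and equality types as template arguments. A sketch; the names PointHash, PointEqual, and PointSet are made up for this example:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Point { float x, y; };

// Hypothetical functor: how to hash a Point.
struct PointHash {
    std::size_t operator()(const Point &p) const {
        return std::hash<float>()(p.x) ^ std::hash<float>()(p.y);
    }
};

// Hypothetical functor: when two Points count as the same key.
struct PointEqual {
    bool operator()(const Point &a, const Point &b) const {
        return a.x == b.x && a.y == b.y;
    }
};

// Second and third template arguments: the hasher and the equality test.
using PointSet = std::unordered_set<Point, PointHash, PointEqual>;
```

This keeps the hashing rules local to one container type, which can be handy when different containers want to treat the same type differently.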

The Rules