Data Structures with Generic Types in Python

ChainedHashTable: Hashing with Chaining

Hash tables overview

Hash tables are an efficient method of storing a small number, n, of integers from a large range $U = \{0,\ldots,2^{w}-1\}$. The term hash table covers a broad range of data structures. The two most common implementations of hash tables are:

hashing with chaining
linear probing

Very often, hash tables store data that are not integers. In this case, an integer hash code is associated with each data item and is used in the hash table. There are various ways to generate such hash codes.

Some of the methods used for hashing require random choices of integers in some specific range. In the code samples, some of these “random” integers are hard-coded constants. These constants were obtained using random bits generated from atmospheric noise.

The `ChainedHashTable` structure

A ChainedHashTable data structure uses hashing with chaining to store data as an array, t, of lists. An integer, n, keeps track of the total number of items in all lists. See the figure below:

The hash value of a data item x, denoted hash(x), is a value in the range $\{0,\ldots,\text{t.length}-1\}$. All items with hash value i are stored in the list at t[i]. To ensure that lists don’t get too long, we maintain the invariant

$$n \leq \text{t.length}$$

so that the average number of elements stored in one of these lists is $n/\text{t.length} \le 1$.

To add an element, x, to the hash table, we first check if the length of t needs to be increased and, if so, we grow t. With this out of the way, we hash x to get an integer, i, in the range $\{0,\ldots,\text{t.length}-1\}$, and we append x to the list t[i].

Growing the table, if necessary, involves doubling the length of t and reinserting all elements into the new table. This strategy is exactly the same as the one used in the implementation of ArrayStack and the same result applies: The cost of growing is only constant when amortized over a sequence of insertions.

Besides growing, the only other work done when adding a new value x to a ChainedHashTable involves appending x to the list t[hash(x)]. For any of the list implementations, this takes only constant time.
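The add operation and the doubling strategy described above can be sketched as follows. This is a minimal illustration, not the course's own listing: the method names `_hash` and `_resize`, the initial table length of 2, and the use of Python's built-in `hash` reduced modulo the table length are all assumptions made for the example.

```python
class ChainedHashTable:
    """Minimal sketch: an array t of lists plus a counter n."""

    def __init__(self):
        self.d = 1                                  # table length is 2**d (assumed start)
        self.t = [[] for _ in range(1 << self.d)]   # array of lists
        self.n = 0                                  # total items over all lists

    def _hash(self, x):
        # Placeholder hash (an assumption): Python's built-in hash,
        # reduced to the range {0, ..., len(t) - 1}.
        return hash(x) % len(self.t)

    def _resize(self):
        # Double the length of t and reinsert every element.
        self.d += 1
        old = self.t
        self.t = [[] for _ in range(1 << self.d)]
        for bucket in old:
            for x in bucket:
                self.t[self._hash(x)].append(x)

    def add(self, x):
        # Grow first if adding x would break the invariant n <= len(t).
        if self.n + 1 > len(self.t):
            self._resize()
        self.t[self._hash(x)].append(x)
        self.n += 1


table = ChainedHashTable()
for value in range(100):
    table.add(value)
print(table.n, len(table.t))
```

Because the table only grows when `n + 1` would exceed its length, the invariant holds after every call to `add`.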

To remove an element, x, from the hash table, we iterate over the list t[hash(x)] until we find x so that we can remove it.

This takes time proportional to the length of the list t[hash(x)].
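The removal step can be sketched as a standalone function over a table of lists. The function name `remove`, the `hash_fn` parameter, and the choice to return `None` when x is absent are assumptions for illustration.

```python
def remove(t, x, hash_fn):
    """Remove one occurrence of x from the table t (a list of lists).

    hash_fn is assumed to map x to an index in {0, ..., len(t) - 1}.
    Returns the removed item, or None if x is not present.
    """
    bucket = t[hash_fn(x)]
    for i, y in enumerate(bucket):
        if y == x:
            del bucket[i]       # scanning the list dominates the running time
            return y
    return None


# Usage with a toy hash function (an assumption): x modulo the table length.
t = [[], [5, 9], [], []]
removed = remove(t, 9, lambda x: x % len(t))
print(removed, t)
```

The loop visits at most every element of one list, which matches the stated running time.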

The performance of a hash table depends critically on the choice of the hash function. A good hash function will spread the elements evenly among the t.length lists, so that the expected size of the list t[hash(x)] is $O(n/ \text{t.length}) = O(1)$ . On the other hand, a bad hash function will hash all values (including x) to the same table location, in which case the size of the list t[hash(x)] will be n. In the next section, we describe a good hash function.

Multiplicative hashing

Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic and integer division. It uses the div operator, which calculates the integral part of a quotient, while discarding the remainder. Formally, for any integers $a \ge 0$ and $b \ge 1$ , $a \text{ div }b = \lfloor a/b \rfloor$ .

In multiplicative hashing, we use a hash table of size $2^{d}$ for some integer d (called the dimension). The formula for hashing an integer $x \in \{0,\ldots,2^{w}-1\}$ is

$$\text{hash}(x) = ((z \cdot x) \text{ mod } 2^{w}) \text{ div } 2^{w-d}$$

Here, z is a randomly chosen odd integer in $\{1,\ldots,2^{w}-1\}$. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo $2^{w}$, where w is the number of bits in an integer. See the figure below:

Note: This is true for most programming languages, including C, C#, C++, and Java. Notable exceptions are Python and Ruby, in which the result of a fixed-length $w$-bit integer operation that overflows is promoted to a variable-length representation.
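Because Python integers do not wrap around, the reduction modulo $2^{w}$ has to be written out explicitly. The sketch below evaluates the formula for w = 32 and d = 8, using the multiplier z = 4102541685 and the input x = 42 from this section's worked example. The function name `multiplicative_hash` is an assumption for illustration.

```python
W = 32            # number of bits in an integer
D = 8             # dimension: the table has 2**D slots
Z = 4102541685    # odd multiplier taken from the worked example


def multiplicative_hash(x, z=Z, w=W, d=D):
    """Compute ((z * x) mod 2**w) div 2**(w - d)."""
    product = (z * x) % (1 << w)    # explicit reduction modulo 2**w
    return product >> (w - d)       # div 2**(w - d) keeps the top d bits


print(multiplicative_hash(42))                 # 30
print(format(multiplicative_hash(42), "08b"))  # 00011110
```

The right shift is equivalent to the div operator here because the product is non-negative.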

The operation of the multiplicative hash function with w = 32 and d = 8 is shown below:

The following lemma, whose proof is deferred until later in this section, shows that multiplicative hashing does a good job of avoiding collisions:

Lemma 1: Let $x$ and $y$ be any two values in $\{0,\ldots,2^{w}-1\}$ with $x \neq y$. Then $\Pr\{\text{hash}(x) = \text{hash}(y)\} \le 2/2^{d}$.
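For small parameters, the bound in Lemma 1 can be checked by brute force, since every odd multiplier z can be enumerated. The sketch below is an empirical sanity check, not a proof. The parameters w = 6 and d = 3 and the function names are assumptions chosen to keep the enumeration small.

```python
from itertools import combinations


def multiplicative_hash(x, z, w, d):
    """Compute ((z * x) mod 2**w) div 2**(w - d)."""
    return ((z * x) % (1 << w)) >> (w - d)


def worst_collision_probability(w, d):
    """Largest collision probability over all pairs x != y.

    The probability is taken over z chosen uniformly at random
    from the odd integers in {1, ..., 2**w - 1}.
    """
    odd_multipliers = range(1, 1 << w, 2)
    worst = 0.0
    for x, y in combinations(range(1 << w), 2):
        collisions = sum(
            multiplicative_hash(x, z, w, d) == multiplicative_hash(y, z, w, d)
            for z in odd_multipliers
        )
        worst = max(worst, collisions / len(odd_multipliers))
    return worst


w, d = 6, 3  # hypothetical small parameters
print(worst_collision_probability(w, d), "<=", 2 / 2 ** d)
```

The first printed value is the worst observed collision probability and the second is the bound $2/2^{d}$ from the lemma.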

With Lemma 1, the performance of remove(x) and find(x) is easy to analyze:

Lemma 2: For any data value x, the expected length of the list t[hash(x)] is at most $n_{x} + 2$ , where $n_{x}$ is the number of occurrences of x in the hash table.

Proof: Let $S$ be the (multi-)set of elements stored in the hash table that are not equal to $x$ . For an element $y \in S$ , define the indicator variable

$$I_y = \begin{cases} 1 & \text{if } \text{hash}(x) = \text{hash}(y) \\ 0 & \text{otherwise} \end{cases}$$

and notice that, by Lemma 1, $E[I_y] \le 2 / 2^{d} = 2 / \text{t.length}$ ...

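Multiplicative hashing with w = 32 and d = 8:

| Quantity | Binary value |
| --- | --- |
| $2^{w}$ (4294967296) | 100000000000000000000000000000000 |
| $z$ (4102541685) | 11110100100001111101000101110101 |
| $x$ (42) | 00000000000000000000000000101010 |
| $z \cdot x$ | 10100000011110010010000101110100110010 |
| $(z \cdot x) \text{ mod } 2^{w}$ | 00011110010010000101110100110010 |
| $((z \cdot x) \text{ mod } 2^{w}) \text{ div } 2^{w-d}$ | 00011110 |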
