ChainedHashTable: Hashing with Chaining



Hash tables overview

Hash tables are an efficient method of storing a small number, n, of integers from a large range U = \{0,\ldots,2^{w}-1\}. The term hash table includes a broad range of data structures. This lesson focuses on two of the most common implementations of hash tables:

  • Hashing with chaining
  • Linear probing

Very often, hash tables store types of data that are not integers. In this case, an integer hash code is associated with each data item and is used in the hash table. Such hash codes can be generated in various ways.

Some of the methods used for hashing require random choices of integers in some specific range. In the code samples, some of these random integers are hard-coded constants. These constants were obtained using random bits generated from atmospheric noise.

The ChainedHashTable structure

A ChainedHashTable data structure uses hashing with chaining to store data as an array, t, of lists. An integer, n, keeps track of the total number of items in all lists. See the figure below:

An example of a ChainedHashTable with n = 14 and t.length = 16; in this example hash(x) = 6

To initialize a ChainedHashTable, we use the following variables:

List<T>[] t;
int n;

The hash value of a data item x, denoted hash(x), is a value in the range \{0,\ldots,\text{t.length}-1\}. All items with hash value i are stored in the list at t[i]. To ensure that lists don’t get too long, we maintain the invariant

n \le \text{t.length}

so that the average number of elements stored in one of these lists is n/\text{t.length} \le 1.

To add an element, x, to the hash table, we first check if the length of t needs to be increased and, if so, we grow t. With this out of the way we hash x to get an integer, i, in the range \{0,\ldots,\text{t.length}-1\}, and we append x to the list t[i]:

boolean add(T x) {
    if (find(x) != null) return false;   // no duplicates allowed
    if (n + 1 > t.length) resize();      // grow t if necessary
    t[hash(x)].add(x);
    n++;
    return true;
}

Growing the table, if necessary, involves doubling the length of t and reinserting all elements into the new table. This strategy is exactly the same as the one used in the implementation of ArrayStack and the same result applies: The cost of growing is only constant when amortized over a sequence of insertions.
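As a concrete sketch of this growing strategy, the following minimal, self-contained class (the class name and constructor are hypothetical, not part of the lesson's implementation; a simple modular hash stands in for the multiplicative hash described later) doubles t and reinserts every element whenever the invariant n ≤ t.length would be violated:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of hashing with chaining; grow() doubles t and rehashes.
public class ChainingDemo<T> {
    List<T>[] t;
    int n;

    @SuppressWarnings("unchecked")
    public ChainingDemo() {
        t = (List<T>[]) new List[2];
        for (int i = 0; i < t.length; i++) t[i] = new ArrayList<>();
        n = 0;
    }

    // Simple modular hash for the demo only; the lesson uses
    // multiplicative hashing in practice.
    int hash(Object x) {
        return (x.hashCode() & 0x7fffffff) % t.length;
    }

    @SuppressWarnings("unchecked")
    void resize() {
        List<T>[] old = t;
        t = (List<T>[]) new List[2 * old.length];   // double the length
        for (int i = 0; i < t.length; i++) t[i] = new ArrayList<>();
        for (List<T> list : old)                    // reinsert all elements
            for (T x : list)
                t[hash(x)].add(x);
    }

    boolean add(T x) {
        if (find(x) != null) return false;
        if (n + 1 > t.length) resize();
        t[hash(x)].add(x);
        n++;
        return true;
    }

    T find(Object x) {
        for (T y : t[hash(x)])
            if (y.equals(x))
                return y;
        return null;
    }

    public static void main(String[] args) {
        ChainingDemo<Integer> h = new ChainingDemo<>();
        for (int i = 0; i < 10; i++) h.add(i);
        // starting from length 2, ten insertions force doublings to 16
        System.out.println(h.t.length);   // prints 16
        System.out.println(h.find(7));    // prints 7
    }
}
```

Note that resize() is called before the new element is appended, so every insertion ends with n ≤ t.length holding again.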

Besides growing, the only other work done when adding a new value x to a ChainedHashTable involves appending x to the list t[hash(x)]. For any of the list implementations, this takes only constant time.

To remove an element, x, from the hash table, we iterate over the list t[hash(x)] until we find x so that we can remove it:

T remove(T x) {
    Iterator<T> it = t[hash(x)].iterator();
    while (it.hasNext()) {
        T y = it.next();
        if (y.equals(x)) {
            it.remove();
            n--;
            return y;
        }
    }
    return null;
}

This takes O(n_{\text{hash}(x)}) time, where n_i denotes the length of the list stored at t[i].

Searching for the element x in a hash table is similar. We perform a linear search on the list t[hash(x)]:

T find(Object x) {
    for (T y : t[hash(x)])
        if (y.equals(x))
            return y;
    return null;
}

Again, this takes time proportional to the length of the list t[hash(x)].

The performance of a hash table depends critically on the choice of the hash function. A good hash function will spread the elements evenly among the t.length lists, so that the expected size of the list t[hash(x)] is O(n/\text{t.length}) = O(1). On the other hand, a bad hash function will hash all values (including x) to the same table location, in which case the size of the list t[hash(x)] will be n. In the next section, we describe a good hash function.

Multiplicative hashing

Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic and integer division. It uses the div operator, which calculates the integral part of a quotient while discarding the remainder. Formally, for any integers a \ge 0 and b \ge 1, a \text{ div } b = \lfloor a/b \rfloor.
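For non-negative operands, Java's integer division operator / already computes a div b, since truncation toward zero coincides with the floor in that case. A quick sketch:

```java
public class DivDemo {
    public static void main(String[] args) {
        // 7 div 2 = floor(7/2) = 3
        System.out.println(7 / 2);       // prints 3
        // 1000 div 2^8 = floor(1000/256) = 3
        System.out.println(1000 / 256);  // prints 3
    }
}
```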

In multiplicative hashing, we use a hash table of size 2^{d} for some integer d (called the dimension). The formula for hashing an integer x \in \{0,\ldots,2^{w}-1\} is

\text{hash}(x) = ((z \cdot x) \bmod 2^{w}) \text{ div } 2^{w-d}

Here, z is a randomly chosen odd integer in \{1,\ldots,2^{w}-1\}. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo 2^{w}, where w is the number of bits in an integer. See the figure below:

Note: This is true for most programming languages including C, C#, C++, and Java. Notable exceptions are Python and Ruby, in which the result of a fixed-length w-bit integer operation that overflows is upgraded to a variable-length representation.

The operation of the multiplicative hash function with w = 32 and d = 8 is shown below:

The Multiplicative Hash Function

2^w (4294967296)                        100000000000000000000000000000000
z (4102541685)                           11110100100001111101000101110101
x (42)                                   00000000000000000000000000101010
z · x                              10100000011110010010000101110100110010
(z · x) mod 2^w                          00011110010010000101110100110010
((z · x) mod 2^w) div 2^(w-d)                                    00011110

Furthermore, integer division by 2wd2^{w-d} is equivalent to dropping the rightmost w − d bits in a binary representation (which is implemented by shifting the bits right by w − d using the >>> operator). In this way, the code that implements the above formula is simpler than the formula itself:

int hash(Object x) {
    return (z * x.hashCode()) >>> (w - d);
}
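Plugging the worked example above into this code (w = 32, d = 8, z = 4102541685, x = 42) is a quick sanity check. In this standalone sketch, w, d, and z are inlined as local constants, and z is written as the hex literal 0xF487D175 because 4102541685 does not fit in a signed int literal:

```java
public class MultiplicativeHashDemo {
    public static void main(String[] args) {
        final int w = 32, d = 8;
        final int z = 0xF487D175;  // bit pattern of 4102541685
        int x = 42;
        // Java int multiplication is already modulo 2^32, and >>>
        // drops the rightmost w - d = 24 bits
        int h = (z * x) >>> (w - d);
        System.out.println(h);     // prints 30, i.e. binary 00011110
    }
}
```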

The following lemma, whose proof is deferred until later in this section, shows that multiplicative hashing does a good job of avoiding collisions:

Lemma 1: Let x and y be any two values in \{0,\ldots,2^{w}-1\} with x \neq y. Then \Pr\{\text{hash}(x) = \text{hash}(y)\} \le 2/2^{d}.

With Lemma 1, the performance of remove(x) and find(x) is easy to analyze:

Lemma 2: For any data value x, the expected length of the list t[hash(x)] is at most n_{x} + 2, where n_{x} is the number of occurrences of x in the hash table.

Proof: Let S be the (multi-)set of elements stored in the hash table that are not equal to x. For an element y \in S, define the indicator variable

I_y = \begin{cases} 1 & \text{if } \text{hash}(x) = \text{hash}(y) \\ 0 & \text{otherwise} \end{cases}

and notice that, by Lemma 1, E[I_y] \le 2/2^{d} = 2/\text{t.length} ...