
Hash tables are a good solution to the dictionary problem.
If we have n key-value pairs, maintain a table H of size N > n. The ratio $\frac{n}{N}$ is called the load factor $\alpha$. Design a hash function $h$ from keys to indices into H.

As set/delete operations are performed, n will vary. This requires an array-resizing technique, as with dynamic arrays, to keep H approximately the correct size; rebuilding the data structure when N changes may also require choosing a new hash function.
The primary problem of hash tables is collisions: a collision occurs when two distinct keys are mapped to the same index.
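
As a minimal sketch of such a resizing step (assuming a chained table; the 2/3 threshold, the doubling factor, and the make_hash factory are illustrative choices, not part of the notes):

    def maybe_grow(H, h, n, make_hash):
        # H: list of cells, each a list of (key, value) pairs; n: number of pairs.
        if n / len(H) <= 2 / 3:          # load factor alpha still acceptable
            return H, h
        new_N = 2 * len(H)               # grow geometrically, as with dynamic arrays
        new_h = make_hash(new_N)         # a new size generally needs a new hash function
        new_H = [[] for _ in range(new_N)]
        for cell in H:
            for k, v in cell:
                new_H[new_h(k)].append((k, v))   # re-insert under the new function
        return new_H, new_h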

For hash chaining and linear probing, the average time per key is O(1), but there is usually at least one key that is much slower. In hash chaining, the worst key typically takes time $O(\log n / \log\log n)$; in linear probing, the worst key typically takes time $O(\log n)$. Cuckoo hashing is an attempt to guarantee constant time for search operations.

Operations
    def search(k):
        i = h(k)
        if H[i] contains a pair for key k:
            return the value
        else:
            raise an exception

    def set(k, v):
        store (k, v) in H[h(k)]

    def delete(k):
        i = h(k)
        if H[i] contains a pair for key k:
            clear it
        else:
            raise an exception
Dictionary Problem
  • data = collection of key-value pairs; at most one pair for each key
  • operations:
      • search(key): return the value associated with the key, or an exception if no pair is present
      • set(key, value):
          • add a new key-value pair if the key is not present
          • overwrite the old value if the key is present
      • delete(key): remove the key and its value
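
Python's built-in dict solves exactly this problem (with KeyError playing the role of the exceptions above):

    d = {}
    d["apple"] = 1        # set: key absent, adds a new pair
    d["apple"] = 2        # set: key present, overwrites the old value
    print(d["apple"])     # search: prints 2
    del d["apple"]        # delete: removes the key and its value
    # d["apple"]          # search on a missing key would raise KeyError
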
Hash Functions

A hash function maps the K possible keys to N indices. To make the mathematics clean, we would like h to be chosen uniformly at random from all such functions; that is, the output for each key should be uniformly distributed over the indices of the hash table, independently of the other keys.

This ideal is unrealistic and not possible to implement efficiently, since storing a truly random function amounts to:

  • making a table T of K random numbers, one per possible key
  • returning T[k] as h(k)

which requires space proportional to the number of possible keys rather than the number of keys actually stored.
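
For intuition, here is that construction shrunk to 16-bit keys so the table actually fits in memory (for 64-bit keys, T would need $2^{64}$ entries):

    import random

    N = 11                                        # table size (toy value)
    K = 2**16                                     # pretend keys are 16-bit
    T = [random.randrange(N) for _ in range(K)]   # one random index per possible key

    def h(k):
        return T[k]    # perfectly uniform, but uses O(K) space and O(K) random bits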

Cryptographic hash functions

Cryptographic hash functions can be relied on to give functions for which current practical computing resources cannot find collisions; the output can then be reduced modulo N to index a table of size N.

  • benefit: works with high quality (collisions are computationally infeasible to find)
  • problem: too slow
  • solution: use them as pseudorandom number generators to seed faster hashing schemes
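
For example, a sketch using Python's hashlib (SHA-256 is an arbitrary choice of cryptographic hash):

    import hashlib

    N = 1009   # table size

    def h(key: bytes) -> int:
        digest = hashlib.sha256(key).digest()     # 32-byte cryptographic digest
        return int.from_bytes(digest, "big") % N  # reduce to an index into H

    print(h(b"apple"))   # deterministic table index in range(N)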

k-independent hash functions

For every $k$-tuple of distinct keys, there are $N^k$ different tuples of indices they might get mapped to, and each should be equally likely: $\Pr[h(x_1) = i_1, \ldots, h(x_k) = i_k] = N^{-k}$.
Here $k$ is a small integer depending on which hashing scheme is used.

  • Hash chaining requires only 2-independence
  • Linear probing is guaranteed constant expected time with 5-independence
      • 4-independent functions exist which make it behave badly
  • Cuckoo hashing requires $\Theta(\log n)$-independence

Generating a k-independent hash function

  • Choose (non-randomly) a prime number $p \gg N$
  • Choose randomly $k$ numbers $a_0, \ldots, a_{k-1}$ in the range $0$ to $p-1$
  • def h(x):
        return $((a_0 + a_1 x + a_2 x^2 + \cdots + a_{k-1} x^{k-1}) \bmod p) \bmod N$
  • Construction takes constant time (for constant k); evaluating the polynomial takes O(k) arithmetic operations
  • Drawback: relies on multiplication, which is a relatively slow operation
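
A runnable sketch of this construction (the parameter choices are illustrative; $2^{61}-1$ is a Mersenne prime):

    import random

    def make_k_independent_hash(k, N, p=2**61 - 1):
        # p is a prime much larger than N.
        a = [random.randrange(p) for _ in range(k)]   # k random coefficients

        def h(x):
            # Evaluate the degree-(k-1) polynomial at x via Horner's rule, mod p.
            result = 0
            for coeff in reversed(a):
                result = (result * x + coeff) % p
            return result % N                          # reduce to a table index

        return h

    h = make_k_independent_hash(k=5, N=1009)
    print(h(42))   # index in range(1009)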

Tabulation Hashing

Need to know in advance how large the keys will be, e.g. 32-bit numbers:

  • Build four tables $T_i$ of 256 random numbers each
  • To hash a key k: partition k into 4 bytes $k_0, k_1, k_2, k_3$ and return $T_0[k_0]$ ^ $T_1[k_1]$ ^ $T_2[k_2]$ ^ $T_3[k_3]$, where ^ is bit-wise exclusive OR
      • normal addition should work in place of XOR, but is unproven
  • It is 3-independent but not 4-independent.
  • Works well (guarantees constant expected time) for hash chaining, linear probing, and cuckoo hashing.

For 32-bit keys, 4 bytes per key:

  • number of tables: t = 4
  • each table has $2^8 = 256$ entries, one per possible byte value
  • total of $4 \times 2^8 = 1024$ random numbers needed
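
A runnable sketch for 32-bit keys (table entries are 32-bit random values here; reduce the result mod the table size as needed):

    import random

    # Four tables of 256 random numbers each, one table per byte of the key.
    T = [[random.randrange(2**32) for _ in range(256)] for _ in range(4)]

    def h(k):
        # Partition the 32-bit key into 4 bytes and XOR the table lookups.
        result = 0
        for i in range(4):
            byte_i = (k >> (8 * i)) & 0xFF
            result ^= T[i][byte_i]
        return result

    print(h(0xDEADBEEF))
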
Implementations

Hash Chaining

Remember multiple values within each cell of H: each cell stores a collection of key-value pairs.
Search, set, and delete all require iterating through the collection to find the matching key.
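
A minimal chained hash table in Python (a sketch, with Python's built-in hash standing in for a 2-independent function):

    class ChainedHashTable:
        def __init__(self, N=1024):
            self.N = N
            self.H = [[] for _ in range(N)]   # each cell: a list of (key, value) pairs

        def search(self, k):
            for key, value in self.H[hash(k) % self.N]:
                if key == k:
                    return value
            raise KeyError(k)

        def set(self, k, v):
            cell = self.H[hash(k) % self.N]
            for i, (key, _) in enumerate(cell):
                if key == k:
                    cell[i] = (k, v)          # key present: overwrite the old value
                    return
            cell.append((k, v))               # key not present: add a new pair

        def delete(self, k):
            cell = self.H[hash(k) % self.N]
            for i, (key, _) in enumerate(cell):
                if key == k:
                    del cell[i]
                    return
            raise KeyError(k)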

Collisions

  • E[time/op] = average over choices of hash function
  • $= O(1 + E[\text{length of } H[h(k)]])$
  • $= O(1 + E[\text{number of collisions with } k])$
  • $= O(1 + \sum_{\text{other keys } q} \Pr[q \text{ collides with } k])$, by linearity of expectation
  • $= O(1 + (n-1) \cdot \frac{1}{N})$
  • $= O(1 + \alpha)$

Linear Probing

Each cell of H stores a single pair. Try to store (k, v) in H[h(k)]; if that cell is full, move on to H[h(k)+1], H[h(k)+2], etc., wrapping around the array.

  • Deletion is required to maintain this invariant property: from h(k) to the actual position of k, all cells of H must be nonempty.
  • Easy to implement
      • a single level of random-access structure
  • Doesn't work well when alpha is close to 1

    def set(k, v):
        i = h(k)
        while H[i] is nonempty and contains the wrong key:
            i = (i + 1) mod N
        store (k, v) in H[i]

    def search(k):
        i = h(k)
        while H[i] is nonempty and contains the wrong key:
            i = (i + 1) mod N
        if H[i] is nonempty:
            return the value in H[i]
        else:
            raise an exception

    def delete(k):
        i = h(k)
        while H[i] is nonempty and contains the wrong key:
            i = (i + 1) mod N
        clear H[i]
        # restore the invariant: shift later pairs in the block back if
        # their probe path passes through the emptied cell
        j = i
        repeat:
            j = (j + 1) mod N
            if H[j] is empty: stop
            if h(H[j].key) does not lie cyclically in the range (i, j]:
                move H[j] into H[i]; i = j
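
A runnable Python version of the above (a sketch, with Python's built-in hash standing in for a 5-independent function; deletion here uses the simpler re-insert-the-rest-of-the-block strategy rather than in-place shifting):

    class LinearProbingTable:
        def __init__(self, N=1024):
            self.N = N
            self.H = [None] * N        # each cell: one (key, value) pair or None

        def _probe(self, k):
            i = hash(k) % self.N
            while self.H[i] is not None and self.H[i][0] != k:
                i = (i + 1) % self.N
            return i                   # slot holding k, or first empty slot

        def set(self, k, v):
            self.H[self._probe(k)] = (k, v)

        def search(self, k):
            cell = self.H[self._probe(k)]
            if cell is None:
                raise KeyError(k)
            return cell[1]

        def delete(self, k):
            i = self._probe(k)
            if self.H[i] is None:
                raise KeyError(k)
            self.H[i] = None
            # Restore the invariant: re-insert every pair in the block after i.
            j = (i + 1) % self.N
            while self.H[j] is not None:
                pair, self.H[j] = self.H[j], None
                self.set(*pair)
                j = (j + 1) % self.N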

Collisions

  • time/op ≈ length of the maximal contiguous block of nonempty cells containing h(k)
  • E[time/op] = $\sum_{\text{possible blocks}} \Pr[\text{block is the maximal contiguous nonempty block}] \cdot (\text{length of the block})$
  • for each length L:
      • there are L different blocks of length L containing h(k)
      • each block has probability $\leq c^{-L}$ of being entirely full, for some constant $c > 1$ that depends on alpha
  • so each length L contributes at most $L \cdot c^{-L} \cdot L = L^2/c^L$, and the total time is $O\left(\sum_{L=1}^{n} \frac{L^2}{c^L}\right)$, which is O(1)

Cuckoo Hashing

In [http://en.wikipedia.org/wiki/Cuckoo_hashing Cuckoo Hashing], make two tables H0 and H1 and choose two hash functions h0 and h1. For search, look in Hi[hi(k)] for each i in {0, 1}. For delete, look in both tables and clear the matching cell. The complicated operation is set().

  • guaranteed constant-time search
  • slower than linear probing in practice

    def set(k, v):
        i = 0
        while (k, v) is a nonempty pair:
            swap (k, v) with Hi[hi(k)]
            i = 1 - i

  • If the loop runs too many times (more than $c \log n$ iterations), rebuild the data structure with a new hash function.
  • The hash table does not necessarily need to be resized, but a new hash function must be chosen.
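
A minimal sketch of this scheme, using two random salts with Python's built-in hash as stand-ins for independent hash functions (the loop bound is an illustrative choice of $c \log n$):

    import math, random

    class CuckooHashTable:
        def __init__(self, N=1024):
            self.N = N
            self.H = [[None] * N, [None] * N]            # two tables H0 and H1
            self.salts = [random.random(), random.random()]

        def _h(self, i, k):
            return hash((self.salts[i], k)) % self.N     # stand-in for h0, h1

        def search(self, k):
            for i in (0, 1):
                cell = self.H[i][self._h(i, k)]
                if cell is not None and cell[0] == k:
                    return cell[1]
            raise KeyError(k)

        def set(self, k, v):
            # (check-and-overwrite of an already-present key omitted for brevity)
            pair, i = (k, v), 0
            for _ in range(2 * int(math.log(self.N, 2)) + 10):  # ~c log n tries
                slot = self._h(i, pair[0])
                pair, self.H[i][slot] = self.H[i][slot], pair   # swap, evicting occupant
                if pair is None:
                    return
                i = 1 - i                                 # evicted pair goes to other table
            raise RuntimeError("insertion looped too long: rebuild with new hash functions")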

We do need to resize the hash table in order to maintain $n < N$.
Intuitive explanation via a graph: draw the hash table cells as vertices, and each key as an edge between the two cells it can occupy.
Each connected component can support at most one key per vertex; equivalently, if each edge is directed toward the cell its key actually occupies, each connected component can support at most one cycle. We don't actually need two separate hash tables, as long as these requirements are met.

Why not just compare the number of edges to the number of vertices in each component?
…or the average degree of the vertices in each component?

Erdős–Rényi model:

  • Fixed set of vertices
  • Edges are added one by one in independent uniform random order
  • When the number of edges is less than $\epsilon n$ for small $\epsilon$:
      • the typical number of components of size s is $\sim n/c^s$, for some c which is a function of $\epsilon$
      • so with high probability no component has more than $k \log n$ edges, for some k which is a function of $\epsilon$
  • So if the set operation loops more than $\log n$ times, either:
      • you hit a very large component (happens a very small fraction of the time), or
      • you are in an infinite loop (happens depending on the hash functions)
  • $\Pr[\text{forming a cycle with a new edge}] = \Theta(1/n)$, where n is the number of edges
  • $\Pr[\text{a component of size } s \text{ contains a cycle}] = 1/c^s$
  • $\Pr[\text{forming a 2-cycle component}] = \Theta(1/n^2)$
  • Conclusions: over a sequence of n operations,
      • with probability $1 - \Theta(1/n)$ everything works fine
      • with probability $\Theta(1/n)$ we have to rebuild
  • Expected time:
      • O(n) if it works
      • $+\, O(\frac{1}{n} \cdot n)$ for the first rebuild
      • $+\, O(\frac{1}{n^2} \cdot n)$ for the second rebuild
      • = O(n) total