Michael Data

Set Data Type

[http://en.wikipedia.org/wiki/Set_(abstract_data_type) Set Data Types] store objects without any particular ordering.

Operations

  • membership query: return binary answer
  • add(obj): add the object to the set if not present
  • remove(obj): remove the object from the set if present
  • intersect(set1,set2): return objects present in both sets
  • union(set1,set2): return objects present in either set (inclusive or)
  • difference(set1,set2): return objects in set1 which are not present in set2
  • subset(set1,set2): return binary answer, positive if all objects in set1 are in set2
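
For concreteness, Python's built-in set type already provides each of these operations; a minimal illustration (the values are hypothetical):

  a = {1, 2, 3}
  b = {2, 3, 4}
  print(2 in a)        # membership query -> True
  a.add(4)             # add(obj): no effect if already present
  a.discard(4)         # remove(obj): no error if absent
  print(a & b)         # intersect -> {2, 3}
  print(a | b)         # union -> {1, 2, 3, 4}
  print(a - b)         # difference -> {1}
  print({2, 3} <= b)   # subset -> True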

Implementations

Bit Map

Hash Map

Bloom Filters

[http://en.wikipedia.org/wiki/Bloom_filter Bloom Filters] are very space-efficient set structures, but they allow false positives on membership tests. They use $O(n)$ bits, i.e. $O\left(\frac{n}{word}\right)$ words of space. They also cannot handle intersection or element removal.

They are useful as quick-but-inaccurate whitelists placed in front of more rigorous filters. They are also useful when logging a large number of events that each carry some kind of identifier: filtering through a Bloom filter keeps the overall size of the log files down.

Structure:

  • a set S of values to represent
  • a parameter k, which typically equals 3 or 4
  • a table of $N > nk$ bits ($\frac{N}{word}$ words)
  • a hash function mapping each element to a k-tuple of locations
  • add(x): change each of its k bits to one; a 1-bit is stored in ALL k locations, not just one
  • union(s,t): bitwise Boolean OR, the same as on a bitmap
  • member(x,s): loop through all k positions of h(x) and check that they are all one (a Python sketch of these operations follows)
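
A minimal sketch of these operations in Python, deriving the k locations from salted SHA-256 hashes (the hashing scheme is an illustrative choice, not part of the notes):

  import hashlib

  class BloomFilter:
      def __init__(self, N, k):
          self.N = N              # table size in bits, chosen with N > nk
          self.k = k              # number of locations per element
          self.bits = [0] * N

      def _positions(self, x):
          # the k-tuple of table locations for x, from k salted hashes
          for i in range(self.k):
              digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
              yield int(digest, 16) % self.N

      def add(self, x):
          for pos in self._positions(x):
              self.bits[pos] = 1          # set ALL k bits, not just one

      def member(self, x):
          # a True answer may be a false positive; False is always correct
          return all(self.bits[pos] for pos in self._positions(x))

      def union(self, other):
          # bitwise Boolean OR, same as on a bitmap
          out = BloomFilter(self.N, self.k)
          out.bits = [a | b for a, b in zip(self.bits, other.bits)]
          return out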

False-Positive Rate

Increasing N reduces the false-positive rate, but costs more space and slows down whole-table operations such as union.

  • Given k, n, and N:
  • let p = the probability that a given bit equals 1
  • if the table has pN 1-bits and (1-p)N 0-bits, the probability of a false positive is $p^k$
  • $p \le \frac{nk}{N}$, converging to equality as $\frac{N}{n}$ grows larger
  • it is easier to calculate the probability of a cell being zero: $(1-p) = \left(1-\frac{1}{N}\right)^{nk} \approx \left(\frac{1}{e}\right)^{\frac{nk}{N}}$, the probability that every added object misses this cell
  • this uses $\left(1-\frac{1}{N}\right)^{N} \approx \frac{1}{e}$ for very large N
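
Putting these bullet points together gives the standard estimate: $p = 1 - \left(1-\frac{1}{N}\right)^{nk} \approx 1 - e^{-\frac{nk}{N}}$, and therefore $\Pr[\text{false positive}] = p^{k} \approx \left(1 - e^{-\frac{nk}{N}}\right)^{k}$.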

Increasing k requires more hash collisions for a false-positive query, but also fills up more of the total array, increasing the probability of each individual collision.

  • For a single insertion, the probability that a particular bit is not set is $1 - \frac{1}{N}$
  • For k hash functions, this becomes $\left(1 - \frac{1}{N}\right)^{k}$
  • After inserting n elements, it becomes $\left(1 - \frac{1}{N}\right)^{nk}$

Since this probability (that a particular cell is set to 1 after storing n elements) is known, the joint distribution of the cell values follows a [http://en.wikipedia.org/wiki/Multinomial_distribution Multinomial distribution]. So the expected number of cells filled in the final array is $N\left( 1 - \left( 1 - \frac{1}{N}\right)^{nk}\right)$.
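
As a quick numeric check of these formulas (parameter values chosen arbitrarily):

  import math

  n, k, N = 1000, 4, 8000             # items, hash positions, table bits (N > nk)
  p = 1 - (1 - 1 / N) ** (n * k)      # probability that a given bit is 1
  print(p, 1 - math.exp(-n * k / N))  # ~0.3937 exactly, ~0.3935 via the e approximation
  print(p ** k)                       # false-positive rate, ~0.024
  print(N * p)                        # expected number of 1-bits, ~3149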

Counting Bloom Filter

Allows removals. Each cell stores a count of the number of set elements mapped to it, instead of just a 1 or a 0. Membership tests are the same: check that all k counts are greater than zero. False positives are still possible. Adding an element increments all k of its counts; removing it decrements them.

With high probability, all counts are $O\left(\frac{\log(nk)}{\log\log(nk)}\right)$, so $O(\log\log n)$ bits per cell suffice to keep the counts from overflowing.
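
A counting variant of the earlier Bloom filter sketch (same assumed hashing scheme):

  import hashlib

  class CountingBloomFilter:
      def __init__(self, N, k):
          self.N, self.k = N, k
          self.counts = [0] * N        # small counters instead of single bits

      def _positions(self, x):
          for i in range(self.k):
              digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
              yield int(digest, 16) % self.N

      def add(self, x):
          for pos in self._positions(x):
              self.counts[pos] += 1    # increment all k counts

      def remove(self, x):
          for pos in self._positions(x):
              self.counts[pos] -= 1    # decrement all k counts

      def member(self, x):
          # same test as before: all k counts nonzero (false positives possible)
          return all(self.counts[pos] > 0 for pos in self._positions(x))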

Invertible Bloom Filter

NOT space-efficient relative to a standard Bloom Filter. Based on the counting idea, but able to “read back” the members of small sets with high probability. Has useful applications in streaming data structures, e.g. the straggler detection problem.

As in the counting Bloom Filter, maps keys to multiple cells in a table.
Each cell stores the number of keys that hit it, the sum of those keys, and the sum of their hashes (using a hash function independent of the one used to map keys to cells).

A cell is 'pure' if: $hash\left(\frac{\sum keys}{\# keys}\right) = \frac{\sum hash(keys)}{\# keys}$. A pure cell has a high probability of having been hit by only a single distinct key. With high probability, if the number of distinct keys stored in the IBF is less than a constant times the table size, the entire set can be correctly decoded.

  • add(x):
  • for each of the k cells x maps to:
  • # keys += 1
  • sum keys += x
  • sum hashes += hash(x)
  • remove(x):
  • same as add, but with minus operations
  • list(): (a Python sketch follows this list)
  • while there is a pure cell 'c'
  • x = c→(sum keys) / c→(# keys)
  • report c→(# keys) copies of x
  • remove(x)
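
A sketch in Python, assuming integer keys, a salted SHA-256 for cell addressing, and an independent checksum hash; the subtract method at the end is an addition of this sketch, included for the set-reconciliation use described under Sketches below:

  import hashlib

  def _h(x):
      # checksum hash, independent of the cell-addressing hashes
      return int(hashlib.sha256(b"chk:" + str(x).encode()).hexdigest(), 16) % (1 << 32)

  class IBF:
      def __init__(self, N, k):
          self.N, self.k = N, k
          self.count = [0] * N      # "# keys"
          self.key_sum = [0] * N    # "sum keys"
          self.hash_sum = [0] * N   # "sum hashes"

      def _cells(self, x):
          for i in range(self.k):
              digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
              yield int(digest, 16) % self.N

      def _update(self, x, sign):
          for c in self._cells(x):
              self.count[c] += sign
              self.key_sum[c] += sign * x
              self.hash_sum[c] += sign * _h(x)

      def add(self, x):
          self._update(x, +1)

      def remove(self, x):
          self._update(x, -1)       # same as add, but with minus operations

      def _pure(self, c):
          # the cell looks like n copies of one key: hash(keySum/n) = hashSum/n
          n = self.count[c]
          return n > 0 and self.key_sum[c] % n == 0 and \
              self.hash_sum[c] == n * _h(self.key_sum[c] // n)

      def list(self):
          out = []
          while True:
              pure = [c for c in range(self.N) if self._pure(c)]
              if not pure:
                  return out
              c = pure[0]
              x, n = self.key_sum[c] // self.count[c], self.count[c]
              out.extend([x] * n)   # report n copies of x
              for _ in range(n):
                  self.remove(x)

      def subtract(self, other):
          # cell-wise minus, used for set reconciliation below
          d = IBF(self.N, self.k)
          d.count = [a - b for a, b in zip(self.count, other.count)]
          d.key_sum = [a - b for a, b in zip(self.key_sum, other.key_sum)]
          d.hash_sum = [a - b for a, b in zip(self.hash_sum, other.hash_sum)]
          return d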

Sketches

Invertible Bloom Filters

Similar to the sketches used in streaming data structures, IBFs can serve as a concise representation of certain large objects.

Represent large sets by their Invertible Bloom Filters.
If you have sets A and B, compute decode(IBF(A) - IBF(B)), where the minus operation is a cell-wise subtraction. If the difference between the sets is small, the result is an IBF of that difference, which can be decoded. This has applications in version control systems for identifying changes between locations using an amount of data proportional to the change.
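
Using the hypothetical IBF sketch above (its simple pure-cell test only recovers keys on the positive side, i.e. elements present in the first set and missing from the second; handling the negative side is symmetric):

  a, b = IBF(64, 3), IBF(64, 3)
  for x in range(1, 11):
      a.add(x)                 # A = {1, ..., 10}
  for x in range(1, 9):
      b.add(x)                 # B = {1, ..., 8}
  diff = a.subtract(b)         # cell-wise minus of the two IBFs
  print(sorted(diff.list()))   # -> [9, 10] with high probability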

Min Hash

Sketch for sets that allows estimation of set similarity, regardless of the size of the difference. Originally developed as a way for AltaVista to identify duplicate web pages.

Uses the [http://en.wikipedia.org/wiki/Jaccard_index Jaccard Index]: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. J = 0 if A and B are disjoint, and equals 1 if they are the same.

single-element sketch:

  • sketch(S) = the element x of S for which hash(x) is minimum
  • easy to compute and easy to update with additions to S
  • can't handle removal operations
  • Pr[sketch(A) = sketch(B)] = Pr[smallest hash in A union B is also in A intersect B]
  • = $\frac{|A \cap B|}{|A \cup B|}$, assuming no ties
  • so the indicator variable that equals 1 when the two sketches agree and 0 otherwise has expected value J(A,B)
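
A minimal sketch of the single-element version (the SHA-256 hash is an illustrative choice):

  import hashlib

  def h(x):
      return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

  def sketch(s):
      # the element of s whose hash value is minimum
      return min(s, key=h)

  A = {"a", "b", "c", "d"}
  B = {"b", "c", "d", "e"}
  # over a random choice of hash function, Pr[sketch(A) == sketch(B)]
  # equals J(A,B) = 3/5; one fixed hash yields one sample of the indicator
  print(sketch(A) == sketch(B))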

[http://en.wikipedia.org/wiki/MinHash Min Hash] does this, but with a slightly larger sketch for finer-grained Jaccard estimation.

  • choose parameter k
  • sketch(A) = { the k keys in A with smallest hashes }
  • useful property: sketch(A union B) = sketch(sketch(A) union sketch(B))

The Jaccard similarity is then estimated as $\hat{J}(A,B) = \frac{|sketch(A \cup B) \cap sketch(A) \cap sketch(B)|}{|sketch(A \cup B)|}$, where $|sketch(A \cup B)| = k$.
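
A sketch of the k-element version and the resulting Jaccard estimate (hash and parameter choices are illustrative):

  import hashlib

  def h(x):
      return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

  def sketch(s, k=100):
      # the k elements of s with the smallest hash values
      return set(sorted(s, key=h)[:k])

  def jaccard_estimate(A, B, k=100):
      # sketch(A union B) = sketch(sketch(A) union sketch(B)),
      # so the estimate needs only the two sketches, not the full sets
      sa, sb = sketch(A, k), sketch(B, k)
      s_union = sketch(sa | sb, k)
      return len(s_union & sa & sb) / len(s_union)

  A = set(range(0, 1500))
  B = set(range(500, 2000))
  print(jaccard_estimate(A, B))   # true J(A,B) = 1000/2000 = 0.5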