Michael Data

Streaming is a set of conditions:

  • * Input data is so large that it cannot fit into memory
  • * Have to solve a problem on it using one (or a few) passes
  • * Maintain a “sketch”, a data structure much smaller than the input which is easily updated but retains important information
Majority Voting Problem

Election with large number n of voters.
Each can write in one arbitrary name as their vote
The election is won by a candidate with greater than n/2 votes (strict majority)
How to find the winner using very little internal storage?

Solutions

In a pass where a non-winner is the current candidate, votes for the winner will not be counted for the winner later. However, throwing away that vote along with the candidate preserves a total $> \frac{n^'}{2}$ the true winner has.

  • count = 0
  • candidate = None
  • for vote in stream:
  • if count == 0:
  • candidate = vote
  • if candidate == vote:
  • count = count + 1
  • else:
  • count = count - 1
  • second pass through stream to count total votes and verify a winner exists
Straggler Detection Problem

An office building has one entrance with a guard.
Most workers arrive in the morning. They may go out for lunch or breaks. They leave in the evening.
The guard wants to know who is still in the building. (not how many, but the identities)

This is a dynamic set problem with add and remove operations. When the set is small, need to report all of its elements.

Easier version

How to identify the missing number from 1-100 out of the remaining 99?

  • * missing one = $\left[ \sum_{i = 1}^{100} i \right] - \left[ \sum x \right]$ for inputs x

Invertible Bloom Filter

Use an IBF to keep track during the day. Then at the end of the day use the IBF's decode() function.