Michael Data

[http://en.wikipedia.org/wiki/Binary_search_tree Binary Search Trees] are trees with two or less children per node and maintain the following properties for each node:

  • * All children to the left have a smaller value
  • * All children to the right have a greater value
  • * All subtrees are also Binary Search Trees

Left/right are arbitrary labels to distinguish between the two sub-trees.

A balanced tree with $n$ nodes will use a number of comparisons $\log_2 n - O(1)$ and is optimal for comparison-based algorithms.
Binary search can handle more general queries than hash tables. (e.g. Can quickly find the next value greater than a query value)

The longest path from the root to a leaf is the height of the tree. So a balanced tree will have height $O(\log_2 (n))$. A balanced tree of height $h$ has at most $2^h - 1$ nodes. Updating strategies should maintain this balance.

Construction

Balanced:

  1. For values $X$, sort $X$ into an array $A$
  2. Create a root from the median of $X$, $A[m = floor(|X|/2)]$
  3. Define two sub-trees $X_1$ and $X_2$

#* $X_1 = A[0:m-1]$ and $X_2 = A[m+1:|X|]$

  1. Create binary search trees for $X_1$ and $X_2$, and make them sub-trees of this root

Search for value x begins at the root.

  • * compare x with value of this node

*# If x is less, recurse at the left child of the current node
*# if x is greater, recurse at the right child of the current node
*# if they are equal, return the current node

  • * If a null child is opened, then the query x is not present in the tree

If all keys are equally likely to be searched, then $O(log(n))$ is the best search time we can do. Otherwise if the input is non-random, then other tree strategies are better: [http://en.wikipedia.org/wiki/Weight-balanced_tree weight-balanced trees] or rebalance gradually after each update via rotation (e.g. AVL Trees, Red Black Trees, Rank Balanced Trees)

Update Strategies

Naive Insertion

Given a new value to insert, perform a binary search. Insert the new value in a new node wherever the search ends with a null child reference.

Constructing a tree with this insertion method is very sensitive to insertion order. It is terrible if done in sorted order because it will produce a deviant tree. It can work pretty well if the insertion order is a random permutation of the variables. This is because when all permutations are equally likely, more balanced trees are more likely to occur. It is a result of the fact that multiple permutations can produce balancing effects, while deviant shapes occur with fewer permutations.

Rotation

Strategy for rebalancing the tree gradually after each insertion/deletion operation.

Maintains the binary search quality while reducing the over-all height of the tree.

Representations

Implicit Tree

Alternate representation is an implicit complete binary tree. An array A of n items which are interpreted as a tree. This is an implicit data structure in that the logical structure is just an array of items, but the key information that makes it a tree is in the ordering within the array.

For each non-root node x, priority(parent(x)) \leq priority(x).

  • * children(array[i]) = array[2i+1] , array[2i+2]
  • * parent(array[i>0]) = array[floor( (i-1)/2)]

Explicit Tree

Create a node object for each tree node, with the following fields:

  • * data value
  • * left child pointer to another node
  • * right child pointer to another node
  • * additional information used to balance the tree
Applications

Text completion: once a prefix has been typed, can search through a tree of known previous searches to provide suggestions.

Machine learning nearest neighbor classification. Can use tree search to improve search for the nearest neighboring value.

Splay Trees

Balanced binary search trees aren't ideal for an adversarial case model: The “adversary” doesn't make random requests, but always requests the key stored at the bottom of the tree.

Should use strategies that take advantage of nonuniform input:

  • * Access pattern varies over time
  • Called locality of access because choices are dependent on recent requests

The motivation for a splay tree is to move accessed keys closer to the root on access.
Function is identical to a regular binary search tree, with the addition of a splay(node) routine which is called at the end of every search. It is also called after a failed search.

  • search(x): usual binary search, then splay(n) on the last node n from the search
  • insert(x): usual binary search, then insert x at the final node, then splay(x)
  • split(x): perform splay(x), but then remove x and form two separate trees from its children
  • concat(t1,t2): search for last element of first tree n1, then splay(n1) do the same for t2, then combine
  • splay(x): perform rotations ending with x at the root
  • while distance(root to x) >= 2:
  • double-rotate(x,x.parent,x.parent.parent)
  • if x is not root:
  • rotate(x,parent)

Amortized analysis of splay(x):

  • * for simplicity, assume only deletion operations
  • This makes it so that time(search) ~ O(time(splay()))
  • * Assume that each key has a positive weight associated with it (e.g. access probability)
  • Let $W = \sum_i^n w_i$ be the sum of these weights
  • * Define rank(x) = $\lceil \log_2 \sum_{\text{descendent of x} w_i} w_i \rceil$
  • * potential function $\phi = \sum_i^n rank(i) $

The amortized time to splay key i will be $O\left(\log \frac{W}{w_i}\right)$

  • * before splay, rank($k_i$) = $\log w_i$
  • * after splay, rank($k_i$) = $\log W$
  • * Plugging in $w = 1$ for all nodes:
  • W = n
  • $\log(\frac{W}{w_i}) = \log n$
  • Splay trees have amortized time $O(\log n)$
  • * Plugging in $w_i = p_i$ for all nodes, where $p_i$ is probability of key i being searched
  • W = 1
  • $ E(search) = \sum_i \log(\frac{W}{w}) p_i = - p_i \log p_i $ = entropy of the distribution over requests

Splay trees are at least as good (asymptotically) as an unchanging binary search tree which is optimized for a particular access pattern.

In special cases, a double-rotation can make no change to the rank of the target node. This requires the lowest node in the original tree to have at least half of the total weight. It is fairly contrived, but due to the special potential function the potential will still increase after a double-rotation occurs.

also see [http://en.wikipedia.org/wiki/Finger_trees Finger Trees]

Competitive Analysis

Applications in cache management: want to kick out an item in the cache in order to make room for a new one. If you knew the future, the correct answer would be the one that isn't accessed for the longest amount of time. An approximate heuristic for this is to kick out the least-recently used one instead. This heuristic is the worst decision process when elements are accessed in a round-robin order. See competetive ratio