Class website. For each week, write summaries of the listed reading material: a 1-page summary for each article, including an objective summary of the article's content followed by my commentary.

Information Retrieval: its modern development has been driven mainly by the world-wide web. Classic assumptions: a static corpus; retrieve information relevant to a query.

- Use some relevance score function $R(Q,D)$ between each query and document.

Web assumptions: dynamic content, adversarial entities, advertising, non-text documents

- Results must be ranked
  - Today's standard is to present only the top 10
- Results must be found quickly

Real-time search: parse potential strings for a match at query time. On the web: pre-process potential strings and prepare an index in advance.

- There are millions of words online
- There are on the order of 100 million active websites

Search engines need to

- Figure out what the user wants, not rely on the user to learn a system
- Balance paid entries with relevant entries

Search Engine Optimization: the practice of making content rank higher in search engines by tailoring it to their crawlers. This can be done legitimately or illegitimately. Search engines provide guidelines.

- Legitimate
  - Make relevant content
  - Update relevant keywords and tags
- Unethical
  - Spam fake pages and links within the local site
  - Spam fake comments on other sites pointing to the local site
  - Spam irrelevant keywords or tags

Old-generation search engines relied heavily on text term-frequency/inverse-document-frequency ratios. This was exploited through "keyword stuffing" in website SEO. Scripts are used to exploit systems that count the number of clicks on a search result or on an advertisement. To exploit link-based rankings, unrelated sites may do a "link exchange," where they link to each other.

[http://www.fleiner.com/bots/ How to create crawler traps]

Web data collection: data dumps, URL downloads, web APIs, crawling.

Crawler traps: a web server responds with ever-changing URLs and content. May be intentional or unintentional.

Duplicate detection: studies find that 30% of web content is a duplicate of some of the other 70%. Exact duplicates online are very rare. Near-duplicate detection is an open problem, presently addressed through `fingerprinting` techniques.
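As an illustration of the fingerprinting idea (a minimal sketch, not the exact scheme from lecture): hash word shingles, keep the smallest hashes as a compact fingerprint, and compare fingerprints by overlap. The function names, shingle size, and threshold are all my own assumptions.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprint(text, k=3, keep=4):
    """Fingerprint: the `keep` smallest shingle hashes (a min-hash-style sample)."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles(text, k))
    return set(hashes[:keep])

def near_duplicate(a, b, threshold=0.5):
    """High Jaccard overlap between fingerprints suggests near-duplication."""
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb) >= threshold
```

Comparing fixed-size fingerprints avoids comparing every pair of documents word-by-word.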

Crawlers don't go to the **Deep Web**: content behind login forms, JavaScript, not linked from elsewhere. The Deep Web is estimated to be larger than the shallow web.

The web is almost 20 years old, so laws that apply to it tend to be outdated or untested. There are more useful conventions than actual laws.

- Privacy policies help communicate the tracking and data habits of a site
- Sites frequently define their own `Terms of Use` or `Terms of Service`
- **Fair Use**
Guest lecture by Chen Li. Where traditional databases have an index architecture for queries, web search relies on dynamic content that is not as easily indexed. Databases are too slow for web search, and unoptimized for ranking.

Vertical search: a search engine focused on specific content. Horizontal search: a general search engine that tries to cover everything.

Requirements imposed by web searches: instant results, fuzzy search, personalization.

See http://labs.google.com/people/jeff Optimization opportunities: don't store documents in separate files, to lessen open/close times. The total time to read and then decompress compressed data is usually less than the time to read the uncompressed data.

Sparse matrices represent bag-of-words models for each known document.
The **index** is another vector-space representation of term-document pairs.
A (term, [list]) pair is called a **posting**, where [list] is the list of all documents in which the term occurs.
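The postings above can be sketched in a few lines; the toy corpus and function name are my own.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of doc IDs (a posting)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_index(docs)
# index["home"] -> [0, 1, 2]; index["july"] -> [1, 2]
```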

The index must then be sorted in order to optimize future searches. Block merge sort strategies handle this task when system memory is limited.

**BSBI** (Blocked Sort-Based Indexing) uses this sorted list of postings to answer queries [http://swtch.com/~rsc/regexp/regexp4.html RE2]. Sorting the term-document pairs is $O(T \log T)$ where $T$ is the number of term-document pairs.

Quiz #2

Distributed indexing: “Map-reduce”. Phase one (map) splits the input workload among multiple parsers, which create independent indexes. Phase two (reduce) uses “inverters” over all the resulting indexes to combine them. It is a parallel implementation of merge sort.
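A single-process sketch of the two phases: "parsers" emit (term, doc_id) pairs, and an "inverter" groups the sorted pairs into postings. The function names and toy data are my own.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    """Parser: emit (term, doc_id) pairs for one document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter: group sorted (term, doc_id) pairs into postings."""
    pairs.sort()  # the shuffle/sort step brings equal terms together
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}

docs = {0: "big data big ideas", 1: "big indexes"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# index["big"] -> [0, 1]
```

In a real map-reduce run, the map calls and the reduce groups would execute on different machines; the sort is what lets independent inverters each own a disjoint range of terms.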

Within the context of the query, we want to rank *all* documents in the corpus. (In practice, heuristics are used to narrow down the set that is actually processed.)

Index structure: greater-length n-grams create larger lexicons. 3-grams tend to work for smaller data sets (e.g., Google Code Search). Support proximity matching by ordering words in the index by their order within the document.

Compression: when doing mostly reads, reading compressed data from disk and decompressing it tends to be faster than reading the uncompressed data from disk.

Lisp style: Partition the file into groups, using one inverter.

Hadoop style: two mappings. The first splits the sorting workload; the second maps to a set of inverters which create inverted indices. In this case, each inverter is responsible for only a subset of the "types".

Hadoop is an open-source architecture for large-scale distributed and data-intensive processing.

The file system uses a "namenode" in addition to distributed "data nodes". The namenode acts as a server of data locations, returning the data node and block location of the target file.

Boolean retrieval: retrieve a bit-vector for each query word over documents (1 if the document contains the word; 0 otherwise). Perform Boolean AND/OR over the resulting bit-vectors to produce the list of primary documents to return. Performing a Boolean operation over two sorted posting lists is linear in the number of postings, in contrast to being linear in the total number of documents. Starting with the most restrictive expressions helps eliminate documents early in the process.
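The linear-time AND over two sorted posting lists is a merge; a minimal sketch (the function name is mine):

```python
def intersect(p1, p2):
    """AND of two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller doc ID
        else:
            j += 1
    return out

# intersect([1, 3, 5, 8], [2, 3, 8, 9]) -> [3, 8]
```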

Frequency ranking: now, instead of a bit vector over documents, each document returns an integer count. These values must be mapped to some ranking score that allows them to be compared; log(frequency) is usually used. Inverse document frequency is used to counter-balance words that occur frequently throughout the corpus. IDF is only useful for ranking terms relative to each other in multi-word queries.
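A minimal sketch of the log-frequency and IDF weights described above, assuming base-10 logarithms and the common $1 + \log \mathrm{tf}$ form (conventions I've chosen, not stated in the notes):

```python
import math

def tf_weight(count):
    """Log-scaled term frequency: 1 + log10(count); 0 if the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(doc_freq, n_docs):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

# A term appearing in 1 of 1,000,000 docs is far more discriminative
# than one appearing in 100,000 of them:
# idf(1, 10**6) -> 6.0; idf(100000, 10**6) -> 1.0
```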

[http://en.wikipedia.org/wiki/Jaccard_coefficient Jaccard Coefficient] is an alternative ranking method that doesn't account for frequency.

[http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html#fig:ROC-curve Normalized discounted cumulative gain]: $NDCG(Q,k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^k \frac{2^{R(j,m)} - 1}{\log_2(1+m)}$

Precision: the fraction of retrieved documents that are relevant.

Recall: the fraction of relevant documents that are retrieved.

Plot documents in a vector space defined over term frequencies. In this space, a "similarity" measure can be computed easily using the cosine distance between vectors. Cosine is ideal because it decreases monotonically from +1 to -1 as the angle goes from 0° to 180°. $\cos(\theta) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)|\,|V(d_2)|}$. In this case, vector lengths should be normalized.

[http://en.wikipedia.org/wiki/Cosine_similarity Cosine Similarity]: $a \cdot b = \|a\|\,\|b\| \cos\theta$, so $\cos\theta = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2}\,\sqrt{\sum_{i=1}^n B_i^2}}$

To perform queries, treat the query as a short document. Return documents ranked by distance to the query in the vector space. This is a very fast operation relative to other ranking methods.
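A minimal sketch of query-as-document ranking by cosine similarity, using raw term counts for brevity (the vectors would normally hold TF-IDF weights); the toy corpus and names are my own.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def vectorize(text):
    """Treat a text (or a query) as a bag-of-words count vector."""
    return Counter(text.lower().split())

docs = ["sweet sorrow parting", "sweet sweet nurse", "parting is such sorrow"]
query = vectorize("sweet sorrow")
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(query, vectorize(docs[i])),
                reverse=True)
```

Dividing by both vector norms is what makes the score length-normalized, so long documents don't win merely by repeating terms.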

`Gold standard` for evaluation: does relevance have to be determined through human judgement? Results depend on the task evaluated, are usually restricted to binary responses, and there may not be good agreement on the "answer".

Popular trend is to use the [http://en.wikipedia.org/wiki/Harmonic_Mean Harmonic Mean] of precision and recall: $F = \frac{2PR}{P+R}$. A more general form allows weighting between the two: $F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$
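The two formulas can be written directly; `f_beta` is a name I chose.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# f_beta(0.5, 0.5) -> 0.5; beta > 1 weights recall more heavily
```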

Another method is to use `precision-at-N` or `recall-at-N` as a way of focusing on the primary results. A simplified and weaker method is to compute the average precision over some top-N range. When this average is averaged over multiple queries, the result is called **mean average precision** (**MAP**).

Discounted gain: $DG_i = R_i / \log_2(i)$

Discounted cumulative gain: $DCG_p = \sum_{i=1}^p \frac{2^{R_i} - 1}{\log_2(1+i)}$

Normalized discounted cumulative gain:
$NDCG_p = \sum_{i=1}^p \frac{2^{R_i} - 1}{\log_2(1+i)} \Big/ \sum_{i=1}^p \frac{2^{R_i^{ideal}} - 1}{\log_2(1+i)}$
where $R_i^{ideal}$ is the relevance score of the document at rank $i$ in an ideal ranking.

Assign relevance scores inversely to the Google result rank, so the top 5 Google results get relevance scores of 5, 4, 3, 2, 1 in that order.
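Using that relevance assignment, a minimal NDCG sketch (function names are mine):

```python
import math

def dcg(relevances):
    """DCG with the 2^R - 1 gain and log2(1 + i) discount (i is 1-based rank)."""
    return sum((2**r - 1) / math.log2(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# A ranking that matches the ideal order scores exactly 1.0:
# ndcg([5, 4, 3, 2, 1]) -> 1.0
```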

Cosine similarity scores are usually computed over dimensions defined by the TF-IDF score of each term. A heap helps when retrieving the top K results.
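A sketch of heap-based top-K retrieval using Python's `heapq`; the scores here are made-up values.

```python
import heapq

# scores: doc_id -> cosine score (hypothetical values)
scores = {0: 0.81, 1: 0.63, 2: 0.35, 3: 0.90, 4: 0.12}

# A heap retrieves the top K in O(N log K) instead of sorting all N scores.
top_k = heapq.nlargest(3, scores.items(), key=lambda kv: kv[1])
# -> [(3, 0.90), (0, 0.81), (1, 0.63)]
```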

Heuristics for improving performance: we want to avoid computing the cosine distance for every document.

- Find a set $A$ of contenders such that $K < |A| \ll N$.
  - $A$ doesn't necessarily contain the top $K$, but has a good chance to.
- Precompute a **champions list** for each term: the docs in which it has high frequency.
- Do a stopword filter on input terms before using any; the filter can be based on an IDF threshold.
- Filter documents that contain fewer of the distinct query terms.
- Use query-independent scores related to the `authority` of the source; up-weight authoritative sources such as Wikipedia.

A recent trend is to use machine learning over these features to better tune document rankings.

Graph heuristics:

- PageRank: take advantage of human-generated links between websites to infer the "goodness" of a site. For a known good site, if another site links to it, then the linker is probably also good.
- Can estimate link density between sites by performing a random walk and counting visits to each node.
- Can unify different words used as anchor text based on common hyperlink references.
- Can try to identify a "hub" which is good at connecting to experts.
  - $h(x) \leftarrow \sum_{x \to y} a(y)$: the hub score for page $x$ is the sum of the authority scores of all pages that $x$ points to.
  - $a(x) \leftarrow \sum_{y \to x} h(y)$: the authority score for page $x$ is the sum of the hub scores of the pages that point to it.
  - In practice, about 5 iterations through the graph tend to stabilize the scores, which are guaranteed to converge if well-scaled.
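The hub/authority iteration can be sketched as below; the toy graph and function name are my own, and the per-iteration normalization stands in for the "well-scaled" condition mentioned above.

```python
def hits(links, iterations=5):
    """links: page -> list of pages it links to. Returns (hub, authority) scores."""
    nodes = set(links) | {y for ys in links.values() for y in ys}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(x) <- sum of h(y) over pages y that point to x
        auth = {n: sum(hub[y] for y, outs in links.items() if n in outs)
                for n in nodes}
        # h(x) <- sum of a(y) over pages y that x points to
        hub = {n: sum(auth[y] for y in links.get(n, ())) for n in nodes}
        # rescale so scores stay bounded across iterations
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            if norm:
                for n in scores:
                    scores[n] /= norm
    return hub, auth

links = {"hub1": ["expert1", "expert2"], "hub2": ["expert1"]}
hub, auth = hits(links)
# expert1 has two in-links, so its authority exceeds expert2's
```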

Latent semantic analysis (LSA) and latent semantic indexing (LSI).

`LSI`, developed around 1989, is the use of `LSA` for information retrieval indexing. LSA always reduces the dimensions of the data; the goal is to do so in a way that preserves useful trends in the data. Singular value decomposition is a useful principal component analysis tool which extracts the eigenvectors and eigenvalues from the data matrix.

[http://en.wikipedia.org/wiki/Singular_value_decomposition SVD]: $D = U \Sigma V^T$
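A minimal LSA sketch with NumPy (assumed available): decompose a toy term-document matrix and keep only the top-$k$ singular values, giving a low-rank "latent" approximation.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
D = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Keep only the k largest singular values: the best rank-k approximation
# of D in the least-squares sense.
k = 2
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Queries are then compared against documents in the reduced k-dimensional space rather than the full term space.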

Industry stuff. Web IR is big $$$.