Search Engines and Information Retrieval

Properties of Text: Zipf's law, Benford's law, Heaps' law, Polya urn, capture-recapture
Vector Space Model: tf-idf weighting, cosine similarity, beyond bag-of-words
Vocabulary mismatch: tokenization, stemming, synonyms, GVSM, pseudo-feedback
Indexing: inverted index, proximity, v-byte and delta-encoding, doc-at-a-time vs. term-at-a-time
Building a web crawler, index freshness | Extracting web content, Finn's algorithm
Locality-Sensitive Hashing (LSH), near-duplicate detection, Adler32 checksum
Evaluation: Cranfield paradigm, Recall/Precision, F-measure, NDCG, query logs
PageRank algorithm, Hubs and Authorities, link spam, anchor-text
Probabilistic Model: assumptions, derivation, estimation, 2-Poisson, BM25
Relevance Model: derivation, estimation, cross-language retrieval
Language Models: Good-Turing estimation, Jelinek-Mercer and Dirichlet smoothing
Classification: PA algorithm, SVM, SMO algorithm, Learning-to-Rank, click logs