Search Engines and Information Retrieval
Properties of Text: Zipf's law, Benford's law, Heaps' law, Polya urn, capture-recapture
Vector Space Model: tf-idf weighting, cosine similarity, beyond bag-of-words
Vocabulary mismatch: tokenization, stemming, synonyms, GVSM, pseudo-feedback
Indexing: inverted index, proximity, v-byte and delta-encoding, doc-at-a-time vs. term-at-a-time
Building a web crawler, index freshness

Extracting web content, Finn's algorithm
Locality Sensitive Hashing (LSH), near-duplicate detection, Adler-32 checksum
Evaluation: Cranfield paradigm, Recall/Precision, F-measure, NDCG, query logs
PageRank algorithm, Hubs and Authorities, link spam, anchor text
Probabilistic Model: assumptions, derivation, estimation, 2-Poisson, BM25
Relevance Model: derivation, estimation, cross-language retrieval
Language Models: Good-Turing estimation, Jelinek-Mercer and Dirichlet smoothing
Classification: PA algorithm, SVM, SMO algorithm, Learning-to-Rank, click logs