Building a Web Crawler | Extracting Web Content

Building a web crawler:

  1. Sources of data
  2. Blogs, tweets, news feeds
  3. The algorithm
  4. Inside an HTTP request
  5. robots.txt
  6. Keeping the index fresh

Extracting web content:

  1. Overview
  2. Extracting content from XML
  3. Extracting content from HTML
  4. Content and the DOM tree
  5. Tag plateau algorithm (Finn's method)
  6. Tag plateau: maximum subsequence sum
  7. Clarification