Document Selection

From the one hundred billion documents on the internet, let's retrieve the top one hundred thousand that are relevant to the searcher's query.

Previously, you saw the layered model approach, which we will adopt for search ranking. Let’s zoom in on its first step, document selection, as shown below:

The layered model approach

From the one hundred billion documents on the internet, we want to use information retrieval techniques to retrieve the top one hundred thousand that are relevant to the searcher’s query.

Let’s get some terminology out of the way before we start.

📝 Information retrieval is the science of searching for information in documents. It focuses on comparing the query text with the document text and determining what makes a good match.
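
To make the idea of comparing query text with document text concrete, here is a minimal sketch in Python. It assumes the simplest possible notion of a "good match": counting how often the query's terms occur in a document. The function names (`tokenize`, `match_score`, `select_documents`) and the sample documents are hypothetical, chosen only for illustration. A real search engine would use an index and richer relevance signals instead of scanning every document.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_score(query, document):
    """Count how many times the query's distinct terms occur in the document.

    This plain term-overlap score is an illustrative assumption,
    not the scoring method used by any particular search engine.
    """
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[term] for term in set(tokenize(query)))


def select_documents(query, documents, k):
    """Return the ids of up to k documents with a non-zero score, best first."""
    scored = [(match_score(query, text), doc_id) for doc_id, text in documents.items()]
    relevant = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Sort by descending score; break ties by document id for a stable order.
    relevant.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in relevant[:k]]


# Hypothetical toy corpus standing in for the documents on the internet.
documents = {
    "d1": "Italian restaurants in Seattle",
    "d2": "Best hiking trails near Seattle",
    "d3": "How to cook Italian pasta at home",
}

top_documents = select_documents("italian restaurants", documents, k=2)
```

In this toy corpus, `d1` contains both query terms and `d3` contains one, so they are selected in that order, while `d2` shares no terms with the query and is dropped. Document selection at web scale follows the same shape: score cheaply, keep only the top `k`, and leave finer ranking to later stages.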

Documents

Document types are as follows:

  • Web pages
  • Emails
  • Books
  • News stories
  • Scholarly papers
  • Text messages
  • Word™ documents
  • PowerPoint™ presentations
...