How to implement full text search

, , , , , , , , , , , , ,

Basic approach

  1. tokenize into words
  2. case folding
  3. stemming (convert each word into its root form; porter stemmer very popular stemmer for English
  4. remove stop words
  5. for each word left create posting list per word

Queries

  • OR queries do an intersection
  • AND queries do a union

Phrase Queries

  • Bad way: Do an AND query, then examine every document in the result set in a second pass and eliminate those that lack the phrase. Very bad performance is possible.
  • Good way: Update posting list to store document:position (multiple occurances included multiple times)

Basic Operations

  1. Insert a new document: Break the document into words; Append the document ID to the posting list for each word
  2. Perform a query: Lookup the posting list for each word in the query; Combine posting lists
  3. Keeping the Working Set Small
  4. Limit size of the lexicon (stemming, stop words)
  5. Compress the posting list aggressively
  6. Spill stationary parts of posting list into a separate table that is not cached

 

This website uses IntenseDebate comments, but they are not currently loaded because either your browser doesn't support JavaScript, or they didn't load fast enough.

Comments are closed.