How to implement full text search


, , , , , , , , , , , , ,

Basic approach

  1. tokenize into words
  2. case folding
  3. stemming (convert each word into its root form; porter stemmer very popular stemmer for English
  4. remove stop words
  5. for each word left create posting list per word

Queries

  • OR queries do an intersection
  • AND queries do a union

Phrase Queries

  • Bad way: Do an AND query, then examine every document in the result set in a second pass and eliminate those that lack the phrase. Very bad performance is possible.
  • Good way: Update posting list to store document:position (multiple occurances included multiple times)

Basic Operations

  1. Insert a new document: Break the document into words; Append the document ID to the posting list for each word
  2. Perform a query: Lookup the posting list for each word in the query; Combine posting lists
  3. Keeping the Working Set Small
  4. Limit size of the lexicon (stemming, stop words)
  5. Compress the posting list aggressively
  6. Spill stationary parts of posting list into a separate table that is not cached

 

How to uninstall Adobe Creative Suite 2.0 applications manually