How to implement full text search
Basic approach
- tokenize into words
- case folding
- stemming (convert each word into its root form; porter stemmer very popular stemmer for English
- remove stop words
- for each word left create posting list per word
Queries
- OR queries do an intersection
- AND queries do a union
Phrase Queries
- Bad way: Do an AND query, then examine every document in the result set in a second pass and eliminate those that lack the phrase. Very bad performance is possible.
- Good way: Update posting list to store document:position (multiple occurances included multiple times)
Basic Operations
- Insert a new document: Break the document into words; Append the document ID to the posting list for each word
- Perform a query: Lookup the posting list for each word in the query; Combine posting lists
- Keeping the Working Set Small
- Limit size of the lexicon (stemming, stop words)
- Compress the posting list aggressively
- Spill stationary parts of posting list into a separate table that is not cached