stemming (convert each word into its root form; porter stemmer very popular stemmer for English
remove stop words
for each word left create posting list per word
Queries
OR queries do an intersection
AND queries do a union
Phrase Queries
Bad way: Do an AND query, then examine every document in the result set in a second pass and eliminate those that lack the phrase. Very bad performance is possible.
Good way: Update posting list to store document:position (multiple occurances included multiple times)
Basic Operations
Insert a new document: Break the document into words; Append the document ID to the posting list for each word
Perform a query: Lookup the posting list for each word in the query; Combine posting lists
Keeping the Working Set Small
Limit size of the lexicon (stemming, stop words)
Compress the posting list aggressively
Spill stationary parts of posting list into a separate table that is not cached