Building A Python-Based Search Engine
Presenter: Daniel Lindsley
Track: IV
Description:
Search is an increasingly common request in all types of applications as the amount of data all of us deal with continues to grow. The technology/architecture behind search engines is wildly different from what many developers expect. This talk will give a solid grounding in the fundamentals of providing search using Python to flesh out these concepts in a simple library.
Primary author of Haystack
The Goal
- teach you how search works
- increase your comfort with other engines
Why should you care about search?
- Standard crawlers have to scrape HTML
- getting good content out of that can be a nightmare
- You know the data model better than they do
- Maybe it’s not a web app at all
Core Concepts
- Document-based
- never grep through a string
Inverted Index
Terms
Engine - the black box you hand a query to get results back
Documents - a text blob with optional metadata
Corpus - the collection of all the documents
Stopword - words that are filler (and, the, of)
Stemming - finding the root of the word
Segments - the data that make up the overall index
Relevance - the algorithm used to rank the results
Faceting - providing counts of results matching certain criteria
Boost - a way to push up certain type of results to the top
Indexing
- Documents
- They are not a row in the db
- think blob of text + metadata
- text quality is the most important thing
- flat, not relational
- denormalize, denormalize, denormalize!!!!
- Tokenization
- the process of taking the text blob and breaking it into smaller, word-sized chunks
- split on whitespace, lowercase, strip stopwords
- the point is to normalize the tokens to work with when querying
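The tokenization steps above can be sketched in a few lines. This is a minimal, illustrative version (the stopword list and function name are placeholders, not microsearch's actual API):

```python
# Minimal tokenizer: split on whitespace, lowercase, strip stopwords.
# The stopword set here is a tiny illustrative sample.
STOPWORDS = {"and", "the", "of", "a", "an", "to", "in"}

def tokenize(text):
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

tokenize("The Quick Brown Fox and the Lazy Dog")
# → ['quick', 'brown', 'fox', 'lazy', 'dog']
```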
- Stemming
- the process of finding the root word
- find root words to handle spelling variants or pluralization
- these become the terms in the inverted index
- Cons: really only works on the language it was built for
- very hard to make work cross language
- how do we solve this? - n-grams
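Before moving on to n-grams, the stemming step can be sketched as a toy suffix-stripper. Real engines use a proper Porter or Snowball stemmer; this naive version only illustrates the idea of reducing tokens to a common root:

```python
def naive_stem(token):
    # Strip a few common English suffixes (toy example only —
    # use a real Porter/Snowball stemmer in practice).
    for suffix in ("ing", "ly", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            if suffix == "ies":
                return token[:-3] + "y"
            return token[:-len(suffix)]
    return token

naive_stem("searching")  # → 'search'
naive_stem("ponies")     # → 'pony'
```

Note the stated con: these suffix rules are English-only, which is exactly why the talk turns to n-grams next.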
- n-grams
- passes a “window” over the tokenized data
- these become terms in the index
- example: gram size of 3 on “hello world”
- [‘hel’, ‘ell’, ‘llo’, ‘wor’, ‘orl’, ‘rld’]
- edge n-grams - typically uses multiple gram sizes
- this works great for autocomplete
- can also work across other languages
- cons: generate a lot more terms and storage can be a problem
- initially quality of search can suffer a little
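Both flavors of n-gram described above fit in a couple of list comprehensions (function names and default sizes here are illustrative):

```python
def ngrams(token, size=3):
    # Pass a sliding "window" of `size` characters over the token.
    return [token[i:i + size] for i in range(len(token) - size + 1)]

def edge_ngrams(token, min_size=2, max_size=5):
    # Edge n-grams: prefixes at multiple gram sizes — handy for autocomplete.
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]

ngrams("hello")       # → ['hel', 'ell', 'llo']
edge_ngrams("hello")  # → ['he', 'hel', 'hell', 'hello']
```

The storage con is visible even here: one 5-character token produced seven terms between the two schemes.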
- inverted index
- the heart of the engine - it’s how things work
- like a dictionary - keys matter (terms from all docs)
- stores position & document IDs
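The dictionary-like shape described above — terms as keys, positions and document IDs as values — can be built like this (a sketch, not microsearch's actual code):

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to {doc_id: [positions]} — the inverted index shape:
    # keys are terms from all docs, values record where each term appears.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {"doc1": "hello search world", "doc2": "hello again"}
index = build_index(docs)
index["hello"]  # → {'doc1': [0], 'doc2': [0]}
```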
- Segments
- with a huge data structure it really can’t be in one file
- lots of ways to do this
- many follow Lucene
- flat files and then hash the keys
- terms will always be sorted
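Hashing a term to pick its segment file can look like this (the digest length and `.index` extension are assumptions for illustration; the Lucene-style approach the talk describes only requires that the mapping be deterministic):

```python
import hashlib

def segment_name(term, length=6):
    # Hash the term and use a short prefix of the hex digest to pick
    # the flat segment file its postings live in. Within each file,
    # terms are kept sorted.
    return hashlib.md5(term.encode("utf-8")).hexdigest()[:length] + ".index"

segment_name("hello")  # → '5d4140.index'
```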
- Searching
Three main components
- Query Parser
- take a user’s hand-written query and parse it into something the engine can tackle
- do this the same way as you do with the index
- Index reading
- per-term, hash the term to get the right file
- rip through & collect positions and document IDs
- Scoring
- we have matching terms now, but they are unordered
- lots of choices
- BM25, phrase-based scoring, Google’s PageRank
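As one concrete choice from that list, a single-term BM25 score can be sketched as below. Parameter names and the k1/b defaults follow the common textbook formulation; this is not presented as microsearch's exact scorer:

```python
import math

def bm25_score(term_freq, doc_length, avg_doc_length, total_docs,
               docs_with_term, k1=1.5, b=0.75):
    # IDF: rare terms across the corpus are worth more.
    idf = math.log((total_docs - docs_with_term + 0.5) /
                   (docs_with_term + 0.5) + 1)
    # TF: saturates with k1, normalized by document length via b.
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_length / avg_doc_length))
    return idf * tf
```

A term that appears in 10 of 1000 documents scores noticeably higher than one appearing in 500 of them, which is the behavior you want from relevance ranking.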
- Faceting
- for a given field, collect all terms
- count the length of the unique document ids for each
- order by descending count
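The three faceting steps above amount to a group-and-count. This sketch works over a plain list of documents rather than the inverted index (field names and the `facet` helper are illustrative):

```python
from collections import Counter

def facet(documents, field):
    # For a given field, count the documents carrying each value,
    # then order by descending count.
    counts = Counter(doc[field] for doc in documents)
    return counts.most_common()

docs = [
    {"id": 1, "author": "daniel"},
    {"id": 2, "author": "daniel"},
    {"id": 3, "author": "alice"},
]
facet(docs, "author")  # → [('daniel', 2), ('alice', 1)]
```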
- Boost
- during the scoring process
- If a condition is met, alter the score accordingly
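A minimal shape for that condition-based boosting, applied during scoring (the `(condition, factor)` pair structure is an assumption, not a fixed API):

```python
def apply_boost(score, doc, boosts):
    # Each boost is a (condition, factor) pair: if the document
    # meets the condition, multiply its score by the factor.
    for condition, factor in boosts:
        if condition(doc):
            score *= factor
    return score

boosts = [(lambda d: d.get("is_sticky"), 2.0)]
apply_boost(1.0, {"is_sticky": True}, boosts)   # → 2.0
apply_boost(1.0, {"is_sticky": False}, boosts)  # → 1.0
```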
- More like this
- collect all the terms for a given document
- use them to find other documents containing like terms
- sort based on how many times a document is seen in the set
- more complete solutions use NLP to increase quality
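The term-overlap version of "more like this" (without the NLP refinements) can be sketched against the inverted index shape from earlier; the function name and index layout are illustrative:

```python
from collections import Counter

def more_like_this(doc_id, index, limit=5):
    # Collect the terms of the given document, then count how many
    # of those terms each other document shares; rank by overlap.
    terms = [t for t, postings in index.items() if doc_id in postings]
    seen = Counter()
    for term in terms:
        for other_id in index[term]:
            if other_id != doc_id:
                seen[other_id] += 1
    return [d for d, _ in seen.most_common(limit)]

index = {
    "python": {"a": [0], "b": [1]},
    "search": {"a": [1], "c": [0]},
    "web":    {"b": [0]},
}
more_like_this("a", index)  # 'b' and 'c' each share one term with 'a'
```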
https://github.com/toastdriven/microsearch
http://speakerdeck.com/daniellindsley