can analyze, index and store documents and has querying support for term, fuzzy and prefix. Here are the six different files, with their layouts explained, needed to support its indexing, querying and document storage capabilities. Resin Document addresses [index_version].da Documents are stored in a doc file. To find one in a file one must first look up its starting location in the file and find out its length. The da file contains the addresses of documents and their sizes, each record ordered by document ID, a integer that is incremented from zero and assigned to each document at every upsert session. One file per upsert session. da Size of a document address: sizeof(long) + sizeof(int) Documents [index_version].doc Documents of variable length are stored along side and without delimiter. QuickLZ compression in fast mode is used by default, gzip by supplying the compression flag. One doc file per upsert session. Size of a document: sizeof(int) + variable sized string * numOfFields Index info [index_version].ix Contains the document count and a compression flag. One per upsert session. Size: sizeof(int) + sizeof(byte) Primary keys [index_name].pk Contains the hashed value of the primary key of each document, together with a flag stating if the document is deleted or live. When writing an index supplying a primary key is required only if you need to delete or update by that key. Documents without primary key can only be deleted by term and can never be updated. One file per upsert session. Size of a document record: sizeof(UInt64) + sizeof(byte) Postings [index_version].pos A posting is a term count and a reference to a document. Postings are grouped by term. One postings file per upsert session. Size of a posting: 2 * sizeof(int) Search trees [index_version]-[field_hash].tri At the heart (brain?) of each search engine, the search tree, here in the form of a doubly-chained prefix tree, the perfect merger between a trie and a binary search tree. Enough about trees. Each trie node, its char value and its meta-data is stored in the file. Meta-data bool EndOfWordbool HasChildbool HasSiblingshort Depth *int Weight (numOfChildren + numOfSiblings + 1)PostingsAddress Address (only if EndOfWord, same size as a document address) It’s a lot of information but needed to be able to scan the file from disk as if it was a left-child right-sibling search tree that exists in-memory. The weight of a node is needed in order to determine how many nodes to skip once a scan has reached a dead end but need to move on to the next leg. The posting address found at each EndOfWord has a size and leads to a location in the postings file that marks the beginning of the postings list associated with that particular term. One trie file per document field per upsert session. Size of a node: Size of a end of word node: sizeof(char) + 3 * sizeof(byte) + sizeof(int) + sizeof(short) + SizeOfPostingsAddress() SizeOfNode() + SizeOfPostingsAddress() File updates No file updates are necessary because of a versioning scheme that at querying time detects which files are live and which are dead. Conclusion If these are the files Resin, Lucene and, in one form or another, Google needs, no more and no less, then it is far from rocket science, what they do there, over at Google’s. Cheers!

File format of a search engine

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Building a cupboard

0–100 in Django: Starting an app the right way

The Noonification: How to Deal With Flapping or Broken Tests (11/29/2023)

The Noonification: OpenAI is Sam Altman; Sam Altman is OpenAI (11/30/2023)

The Noonification: Panda Power (11/28/2023)

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

Building a cupboard

0–100 in Django: Starting an app the right way

The Noonification: How to Deal With Flapping or Broken Tests (11/29/2023)

The Noonification: OpenAI is Sam Altman; Sam Altman is OpenAI (11/30/2023)

The Noonification: Panda Power (11/28/2023)

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps