IIR Chapter 4. Index Construction.
1. Hardware basics.
- the technique of keeping frequently used disk data in main memory caching
- seek time
- buffer
2. BSBI - Blocked sort-based indexing.
- seek time
- buffer
2. BSBI - Blocked sort-based indexing.
- Collection segmented into parts of equal size to reduce disk seeks
- TermID-docID pairs for each part in memory
- Store intermediate results on disk
- merge all intermediate result into final index
3. SPIMI - Single-pass in-memory indexing.
A more scalable alternative than BSBI. Uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
4. Distributed indexing.
Good for web search engines. Uses MapReduce technique. A master node directs the process of assigning and reassigning tasks to individual worker nodes (multiple machines). The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. The reduce phase makes sure that all values for a given key will be stored close together, so that they can be read and processed quickly.
5. Dynamic indexing.
For collections that are modified frequently with documents being added, deleted, and updated. One way to achieve this is to periodically reconstruct the index from scratch. Another way is to maintain two indexes: a large main index and a small auxiliary index that stores new documents. The auxiliary index is kept in memory. Searches are run across both indexes and results merged.
Having multiple indexes complicates the maintenance of collection-wide statistics, so some large search engines use a reconstruction-from-scratch strategy.
- TermID-docID pairs for each part in memory
- Store intermediate results on disk
- merge all intermediate result into final index
3. SPIMI - Single-pass in-memory indexing.
A more scalable alternative than BSBI. Uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
4. Distributed indexing.
Good for web search engines. Uses MapReduce technique. A master node directs the process of assigning and reassigning tasks to individual worker nodes (multiple machines). The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. The reduce phase makes sure that all values for a given key will be stored close together, so that they can be read and processed quickly.
5. Dynamic indexing.
For collections that are modified frequently with documents being added, deleted, and updated. One way to achieve this is to periodically reconstruct the index from scratch. Another way is to maintain two indexes: a large main index and a small auxiliary index that stores new documents. The auxiliary index is kept in memory. Searches are run across both indexes and results merged.
Having multiple indexes complicates the maintenance of collection-wide statistics, so some large search engines use a reconstruction-from-scratch strategy.
IIR Chapter 5. Index Compression.
Compression of dictionary and inverted index is essential for efficient IR systems. Main reason is potential cutting of the cost of storing the index by 75%. Also increased use of caching and faster transfer of data from disk to memory.
1. Dictionary Compression.
- Dictionary as a string
- Blocked Storage
- Front Coding
2. Postings File Compression.
- Variable byte (VB) encoding
- γ codes
No comments:
Post a Comment