IS2140 Reading and Muddiest Points by ARB148: Week 2 Reading

1.2

Steps for building an inverted index:

Collect the documents to be indexed;

Tokenize the text, turning each document into a list of tokens;

Do linguistic preprocessing;

Index the documents;
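The four steps above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the tokenizer is a naive assumption (lowercase, split on non-alphanumeric characters), and the names tokenize and build_inverted_index are made up for this note.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase the text and split on
    # runs of characters that are not ASCII letters or digits.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    # docs maps docID -> raw text (step 1: the collected documents).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):      # steps 2 and 3, both very crude here
            index[token].add(doc_id)      # step 4: record term -> docID
    # Each postings list is kept sorted by docID.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(index["july"])
```

Keeping each postings list sorted by docID is what makes merge-style intersection of two lists possible later on.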

2.1 Convert the byte sequence into a linear sequence of characters and determine document units for indexing.
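A small sketch of the first half of 2.1, assuming the encoding is already known (in practice it may have to be detected or read from metadata). The function name decode_document is hypothetical.

```python
def decode_document(raw, encoding="utf-8"):
    # Turn a byte sequence into a character sequence.
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising an exception.
    return raw.decode(encoding, errors="replace")

print(decode_document("café".encode("utf-8")))
print(decode_document(b"caf\xe9", encoding="latin-1"))
```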

2.2 Determining the vocabulary of terms:

2.2.1 Tokenization has to take care of:

language identification;

hyphens;

compounds;

How each of these is handled depends heavily on the language of the document.

2.2.2 Dropping common terms: stop words

- create stop list.
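One common way to create a stop list is to take the most frequent terms in the collection and then filter them out of every token stream. The sketch below assumes that approach; the cutoff k and the function names are illustrative, and a real stop list is usually also reviewed by hand.

```python
from collections import Counter

def build_stop_list(token_lists, k):
    # Count how often each term occurs across the whole collection
    # and keep the k most frequent terms as the stop list.
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {term for term, _ in counts.most_common(k)}

def remove_stop_words(tokens, stop_list):
    # Drop every token that appears in the stop list.
    return [t for t in tokens if t not in stop_list]

token_lists = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "on", "the", "log"],
]
stop_list = build_stop_list(token_lists, k=2)
print(remove_stop_words(token_lists[0], stop_list))
```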

2.2.3 Normalization

Canonicalize tokens so that matches occur despite superficial differences in the character sequences of the tokens.
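As an illustration, a normalizer might fold case, strip accents, and delete periods and hyphens so that variants fall into the same equivalence class. The particular set of rules below is an assumption chosen for the example; which rules help depends on the collection and language.

```python
import unicodedata

def normalize(token):
    # Decompose accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop the combining marks (accents, diacritics).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case, then delete periods and hyphens.
    return stripped.casefold().replace(".", "").replace("-", "")

print(normalize("U.S.A."), normalize("Naïve"), normalize("anti-discriminatory"))
```

The same function has to be applied to both indexed tokens and query tokens, otherwise the matches it is meant to create will not occur.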

2.2.4 Stemming and lemmatization

Main goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
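To make stemming concrete, here is a sketch of only the first group of rules from the Porter stemmer (SSES to SS, IES to I, SS to SS, S to nothing). It is a fragment for illustration, not a complete stemmer, and the function name is made up.

```python
def porter_step_1a(word):
    # Apply the first matching rule, checking the longest suffix first.
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

print([porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]])
```

Note that the output "poni" is not a dictionary word. A stemmer only chops suffixes, whereas a lemmatizer would use vocabulary and morphological analysis to return "pony".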

2.3 Skip pointers (a skip list) let us avoid processing parts of a postings list during intersection that cannot contain a match.
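A sketch of postings intersection with skip pointers. Real skip pointers are stored in the list at index time; here they are simulated by assuming a pointer at every position that is a multiple of roughly sqrt(P), where P is the list length. The function names are hypothetical, and both lists are assumed sorted and duplicate-free.

```python
import math

def advance(postings, pos, skip, target):
    # A simulated skip pointer exists only at positions that are
    # multiples of `skip`. Follow pointers while the landing docID
    # does not overshoot the target; otherwise move one step.
    moved = False
    while (pos % skip == 0 and pos + skip < len(postings)
           and postings[pos + skip] <= target):
        pos += skip
        moved = True
    return pos if moved else pos + 1

def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 43, 60, 71]))
```

The trade-off is in the spacing: more skips mean shorter jumps that are taken more often, fewer skips mean longer jumps that rarely apply.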

3.1 Search structures for dictionaries:

Hashing and Search Trees.
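The practical difference shows up with prefix lookups. In the sketch below a Python dict stands in for hashing, and a sorted list searched with bisect stands in for a search tree (an assumption made to keep the example short; a real dictionary would more likely use a B-tree). The vocabulary and the name prefix_range are illustrative.

```python
import bisect

terms = sorted(["automata", "automate", "automatic", "autumn", "month", "moon"])

# Hashing: constant-time exact lookup, but no notion of term order,
# so there is no efficient way to find all terms with a given prefix.
hashed = {term: position for position, term in enumerate(terms)}
print("moon" in hashed)

def prefix_range(sorted_terms, prefix):
    # Ordered structure: all terms sharing a prefix are contiguous,
    # so two binary searches bound the matching range.
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + chr(0x10FFFF))
    return sorted_terms[lo:hi]

print(prefix_range(terms, "automat"))
```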

3.2 Wildcard queries.

Used when the user:

is uncertain of the spelling of a query term;

is aware of multiple variant spellings of a term;

is not sure whether the search engine performs stemming;

is uncertain of the correct rendition of a foreign word or phrase.
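One way to answer such queries is a permuterm index: every rotation of term + "$" points back to the term, and a query X*Y is rotated to Y$X and treated as a prefix lookup. The sketch below assumes exactly one * per query and uses a linear scan over the rotations for brevity, where a real system would use a search tree. All names are hypothetical.

```python
def permuterm_rotations(term):
    # All rotations of the term with the end marker "$" appended.
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm_index(vocabulary):
    # Map every rotation back to the term it came from.
    index = {}
    for term in vocabulary:
        for rotation in permuterm_rotations(term):
            index[rotation] = term
    return index

def wildcard_lookup(query, index):
    # Assumes the query contains exactly one "*".
    head, tail = query.split("*")
    key = tail + "$" + head   # rotate so the wildcard falls at the end
    return sorted({term for rotation, term in index.items()
                   if rotation.startswith(key)})

index = build_permuterm_index(["man", "moon", "moron", "mat", "name"])
print(wildcard_lookup("m*n", index))
```

The cost of this scheme is dictionary size, since each term contributes one entry per character plus one.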

3.3 Spelling correction.

Two steps to solving this problem:

edit distance;

k-gram overlap.
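Both measures can be sketched briefly. The edit distance below is the standard Levenshtein dynamic program with unit costs. The k-gram part pads each term with "$" boundary markers and compares bigram sets with the Jaccard coefficient; the padding and k=2 are assumptions for this example.

```python
def edit_distance(s, t):
    # Levenshtein distance: minimum number of single-character
    # insertions, deletions, and substitutions turning s into t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete from s
                            curr[j - 1] + 1,            # insert into s
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def kgrams(term, k=2):
    # Set of k-grams of the term, with "$" marking both boundaries.
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    # Overlap of two sets: |A ∩ B| / |A ∪ B| (0.0 if both are empty).
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(edit_distance("cat", "dog"))
print(jaccard(kgrams("bord"), kgrams("lord")))
```

A plausible pipeline uses k-gram overlap as a cheap filter to pick candidate terms, then ranks the survivors by edit distance, which is more expensive to compute.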

3.4 Phonetic correction.

Use Soundex algorithms, which map terms that sound alike to the same short code.
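A sketch of one simple Soundex variant: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop zeros, and pad to four characters. Other variants treat the first letter and the letters H and W differently, so codes may not match other implementations. The input is assumed to be a non-empty alphabetic string.

```python
LETTER_GROUPS = (
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
)
CODES = {ch: digit for letters, digit in LETTER_GROUPS for ch in letters}

def soundex(term):
    term = term.upper()
    # Encode every letter after the first; ignore non-letters.
    digits = [CODES[ch] for ch in term[1:] if ch in CODES]
    # Collapse each run of identical consecutive digits to one digit.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Remove zeros, pad with trailing zeros, keep four characters.
    tail = "".join(d for d in collapsed if d != "0")
    return (term[0] + tail + "000")[:4]

print(soundex("Hermann"), soundex("Robert"), soundex("Rupert"))
```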


Friday, January 10, 2014
