IR 8. Evaluation in information retrieval
8.1 Information retrieval system evaluation
A test collection consists of three things:
- A document collection
- A test suite of information needs, expressible as queries
- A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair (a small representation sketch follows this list).
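A minimal sketch, assuming Python, of how such a test collection might be represented in memory; the document texts, query, and the qrels-style dictionary layout are illustrative assumptions, not a prescribed format.

```python
# Illustrative (assumed) tiny test collection: a document collection,
# a set of information needs expressed as queries, and binary relevance
# judgments for query-document pairs.

documents = {
    "d1": "information retrieval evaluation methods",
    "d2": "cooking recipes for beginners",
    "d3": "precision and recall in search engines",
}

queries = {
    "q1": "evaluating information retrieval systems",
}

# TREC-style qrels: 1 = relevant, 0 = nonrelevant (binary judgments).
qrels = {
    ("q1", "d1"): 1,
    ("q1", "d2"): 0,
    ("q1", "d3"): 1,
}
```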
8.2 Standard test collections
- The Cranfield collection.
- Text Retrieval Conference (TREC)
- GOV2
- NII Test Collections for IR Systems (NTCIR)
- Cross Language Evaluation Forum (CLEF)
- Reuters-21578 and Reuters-RCV1
- 20 Newsgroups.
8.3 Evaluation of unranked retrieval sets
- Precision (P) is the fraction of retrieved documents that are relevant
- Recall (R) is the fraction of relevant documents that are retrieved
- Accuracy, the fraction of classifications that are correct.
- F measure, the weighted harmonic mean of precision and recall (a computation sketch for these four measures follows this list)
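A small sketch, assuming the retrieved and relevant document IDs for one query are available as Python sets, of how these unranked-set measures could be computed; the function name and the balanced beta = 1 default are illustrative choices.

```python
def unranked_eval(retrieved, relevant, collection_size, beta=1.0):
    """Precision, recall, accuracy, and F measure for one query.

    retrieved, relevant: sets of document IDs; collection_size: total
    number of documents in the collection (needed only for accuracy).
    """
    tp = len(retrieved & relevant)          # relevant and retrieved
    fp = len(retrieved - relevant)          # retrieved but not relevant
    fn = len(relevant - retrieved)          # relevant but not retrieved
    tn = collection_size - tp - fp - fn     # neither retrieved nor relevant

    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    accuracy = (tp + tn) / collection_size

    # Weighted harmonic mean of precision and recall:
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives F1.
    b2 = beta ** 2
    f = ((b2 + 1) * precision * recall / (b2 * precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, accuracy, f
```

For example, `unranked_eval({"d1", "d2"}, {"d1", "d3"}, 10)` gives precision 0.5, recall 0.5, accuracy 0.8, and F1 0.5.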
8.4 Evaluation of ranked retrieval results
- Precision-recall curve
- Interpolated precision
- 11-point interpolated average precision. For each information need, the interpolated precision is measured at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0.
- Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels.
- Precision at k: measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents (a sketch of these ranked measures follows this list).
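A hedged sketch of how these ranked-retrieval measures might be computed for a single ranked list, assuming the ranking is an ordered list of document IDs and the relevant set is known; the function names are illustrative, and MAP is simply the mean of `average_precision` over all information needs.

```python
def precision_at_k(ranking, relevant, k):
    """Precision at a fixed cutoff k (e.g. k = 10 or 30)."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def precision_recall_points(ranking, relevant):
    """(recall, precision) pairs after each retrieved document."""
    points, hits = [], 0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

def interpolated_precision(points, r):
    """Highest precision found at any recall level >= r (0 if none)."""
    return max((p for rec, p in points if rec >= r), default=0.0)

def eleven_point_interpolated(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranking, relevant)
    return [interpolated_precision(points, r / 10) for r in range(11)]

def average_precision(ranking, relevant):
    """Average of the precision values at the ranks of relevant documents;
    relevant documents that are never retrieved contribute zero."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```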
8.5 Assessing relevance
To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system.
8.6 A broader perspective: System quality and user utility
How satisfied is each user with the results the system gives for each information need that they pose?
System issues:
- How fast does it index?
- How fast does it search?
- How expressive is its query language? How fast is it on complex queries?
- How large is its document collection, in terms of both the number of documents and the breadth of topics it covers?
User utility:
User happiness is measured based on the relevance of the results, the speed of the system, and its user interface.
Refining a deployed system:
If an IR system has been built and is in use by a large number of users, its builders can evaluate possible changes by deploying variant versions of the system (A/B testing) and recording measures indicative of user satisfaction with one variant versus the others as they are used.
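As an illustration of this kind of evaluation, a minimal sketch that compares a satisfaction proxy such as clickthrough rate between two deployed variants; the logging format, variant names, and the choice of clickthrough as the measure are assumptions for the example.

```python
from collections import defaultdict

def clickthrough_rate_by_variant(logs):
    """Given interaction logs as (variant, clicked) pairs, return the
    fraction of result pages that received a click for each variant."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for variant, was_clicked in logs:
        shown[variant] += 1
        clicked[variant] += int(was_clicked)
    return {v: clicked[v] / shown[v] for v in shown}

# Hypothetical usage: variant "B" ranks results differently than "A".
logs = [("A", True), ("A", False), ("B", True), ("B", True), ("B", False)]
print(clickthrough_rate_by_variant(logs))  # {'A': 0.5, 'B': 0.666...}
```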
8.7 Results snippets
Make the results list informative enough that the user can do a final ranking of the documents for themselves based on relevance to their information need.
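A small sketch of one common way to produce such a snippet, a keyword-in-context window of the document around the first occurrence of a query term; the window size, tokenization, and function name are illustrative assumptions, not the method the chapter prescribes.

```python
def snippet(doc_text, query_terms, window=10):
    """Return a short keyword-in-context snippet: a window of words
    around the first query-term match, or the document opening if no
    query term occurs in the document."""
    words = doc_text.split()
    lowered = [w.lower().strip(".,;:!?") for w in words]
    for i, w in enumerate(lowered):
        if w in query_terms:
            start = max(0, i - window // 2)
            return " ".join(words[start:start + window]) + " ..."
    return " ".join(words[:window]) + " ..."

# Hypothetical usage:
print(snippet("Evaluation in information retrieval measures precision and recall.",
              {"precision"}))
```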