3 Inverted File Index

Word stemming

Word stemming is to eliminate the commonly used words from the original documents.

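Answer

Word stemming reduces a word to its stem (root form), so that words with the same meaning but different inflections are merged into one term. This shrinks the index and improves retrieval relevance. Eliminating commonly used words is stop-word removal, not stemming, so the answer is \(\textnormal{F}\).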
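As a rough illustration of how stemming merges inflected forms into one index term, here is a minimal sketch. The suffix list and the `stem` and `index_terms` helpers are assumptions made up for this example; this is a toy suffix stripper, not the Porter algorithm or any library's stemmer.

```python
# Toy suffix-stripping stemmer: an illustrative assumption, not a real algorithm.
SUFFIXES = ("ing", "es", "ed", "s")  # hypothetical rule list, checked in order


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # do not turn "process" into "proces"
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(words):
    """Map each stem to the set of surface forms that collapse onto it."""
    groups = {}
    for w in words:
        groups.setdefault(stem(w), set()).add(w)
    return groups


groups = index_terms(["index", "indexes", "indexed", "indexing"])
```

All four surface forms end up under the single term `index`, which is the index-size reduction the answer refers to.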

Stop Words

Stop words should be ignored when creating inverted file indices, since they appear rarely in articles, and are not useful for indexing.

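Answer

Stop words are words so common that they appear in almost every text and usually carry no specific meaning, so they can be removed from the original text before the inverted index is built and matched. The statement is wrong to say that they appear rarely. Note also that stop words cannot simply be kicked out of an existing inverted index, and a word should not be treated as a stop word merely because it occurs frequently. The answer is \(\textnormal{F}\).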
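Removing stop words before indexing can be sketched as follows. The `STOP_WORDS` set and the `build_index` helper are hypothetical names for this example; a real system would use a curated stop list rather than a frequency cutoff.

```python
from collections import defaultdict

# Hypothetical stop list, chosen by hand for this example.
STOP_WORDS = {"a", "the", "is", "of", "and"}


def build_index(docs):
    """Build term -> sorted list of document ids, skipping stop words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "the index of the book",
    2: "the book is a novel",
}
index = build_index(docs)
```

The stop words are filtered from the text stream, so they never receive a posting list in the first place.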

Term Access

While accessing a term by hashing in an inverted file index, range searches are expensive.

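Answer

Hashing finds a single term very quickly (\(O(1)\)), but a hash table does not keep terms in order, so looking up a range of terms is comparatively slow. The answer is \(\textnormal{T}\).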

While accessing a term stored in a B+ tree in an inverted file index, range searches are expensive.

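Answer

A B+ tree keeps terms in sorted order, so a range search only needs to locate the first term and then scan forward. The answer is \(\textnormal{F}\).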
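The contrast between the two term-access questions above can be sketched in a few lines. This is only an approximation under stated assumptions: a Python `dict` stands in for the hash table, and a sorted list searched with `bisect` stands in for the ordered leaves of a B+ tree; the function names are made up for this example.

```python
from bisect import bisect_left, bisect_right

postings = {"apple": [1, 4], "banana": [2], "cherry": [3, 4], "date": [1]}


def range_search_hash(table, lo, hi):
    """Hash table: keys are unordered, so every key must be examined."""
    return sorted(t for t in table if lo <= t <= hi)


# Sorted term list: a stand-in for the ordered leaf level of a B+ tree.
sorted_terms = sorted(postings)


def range_search_sorted(terms, lo, hi):
    """Ordered structure: find both ends by binary search, then slice."""
    return terms[bisect_left(terms, lo):bisect_right(terms, hi)]


hash_hits = range_search_hash(postings, "b", "d")
tree_hits = range_search_sorted(sorted_terms, "b", "d")
```

Both calls return the same terms, but the hashed version inspects all keys while the ordered version touches only the matching range plus two binary searches.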

Distributed Indexing

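Document-partitioned vs. term-partitioned

  • Document-partitioned: each node holds a subset of the documents, for example a certain range of document indices
  • Term-partitioned: each node holds the documents that contain the terms in a certain range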

In distributed indexing, document-partitioned strategy is to store on each node all the documents that contain the terms in a certain range.

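Answer

The statement describes the term-partitioned strategy, not the document-partitioned one. The answer is \(\textnormal{F}\).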

For the document-partitioned strategy in distributed indexing, each node contains a subset of all documents that have a specific range of index.

Answer

\(\textnormal{T}\)
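The two partitioning strategies in the questions above can be compared with a small sketch. The function names, the `boundaries` parameter, and the assumption that document ids run from 0 to n-1 are all choices made for this example only.

```python
from bisect import bisect_right
from collections import defaultdict

docs = {0: "red apple", 1: "green apple", 2: "red wine", 3: "white wine"}


def document_partitioned(docs, num_nodes):
    """Each node indexes only its own contiguous range of document ids.

    Assumes ids are 0..len(docs)-1 (an assumption of this sketch).
    """
    per_node = -(-len(docs) // num_nodes)  # ceiling division
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            nodes[doc_id // per_node][term].append(doc_id)
    return [dict(n) for n in nodes]


def term_partitioned(docs, boundaries):
    """Each node holds the complete postings for the terms in its range."""
    nodes = [defaultdict(list) for _ in range(len(boundaries) + 1)]
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            nodes[bisect_right(boundaries, term)][term].append(doc_id)
    return [dict(n) for n in nodes]


by_doc = document_partitioned(docs, 2)
by_term = term_partitioned(docs, ["m"])
```

With document partitioning the term `red` appears on both nodes, each listing only local documents. With term partitioning it lives on exactly one node, together with its full posting list.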

Dynamic Indexing

In the dynamic indexing situation, the main index is usually updated when a new document comes to the document collection.

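Answer

In dynamic indexing, a new document is first written to an auxiliary (incremental) index, which is merged into the main index only after it reaches a threshold. The main index is not updated immediately, so the answer is \(\textnormal{F}\).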
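A minimal sketch of this scheme follows, assuming a document-count threshold for the merge. The `DynamicIndex` class, its `merge_threshold` parameter, and the method names are hypothetical and only illustrate the idea.

```python
class DynamicIndex:
    """Main index plus a small auxiliary index that absorbs new documents."""

    def __init__(self, merge_threshold=2):
        self.main = {}  # term -> set of doc ids
        self.aux = {}   # same layout, holds recent documents only
        self.aux_docs = 0
        self.merge_threshold = merge_threshold  # hypothetical tuning parameter

    def add_document(self, doc_id, text):
        """New documents go to the auxiliary index, never straight to main."""
        for term in text.lower().split():
            self.aux.setdefault(term, set()).add(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self._merge()

    def _merge(self):
        """Fold the auxiliary index into the main index and reset it."""
        for term, ids in self.aux.items():
            self.main.setdefault(term, set()).update(ids)
        self.aux = {}
        self.aux_docs = 0

    def search(self, term):
        """Queries consult both indices so recent documents are visible."""
        return sorted(self.main.get(term, set()) | self.aux.get(term, set()))
```

A search has to look in both indices, which is the price paid for not rebuilding the main index on every insertion.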

Threshold

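Document vs. query

  • Document: retrieve only the top x documents ranked by weight; for Boolean queries this may miss some relevant documents
  • Query: sort the query terms by their frequency in ascending order and search using only the first few terms in that order; the size of the threshold is chosen according to the actual situation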
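The two kinds of thresholding listed above can be sketched as follows. The function names, the `fraction` parameter, and the sample weights and frequencies are assumptions invented for this example.

```python
import heapq


def threshold_documents(scores, x):
    """Document thresholding: keep only the x highest-weighted documents."""
    return heapq.nlargest(x, scores, key=scores.get)


def threshold_query(query_terms, doc_freq, fraction):
    """Query thresholding: sort terms by ascending frequency, keep a prefix."""
    ranked = sorted(query_terms, key=lambda t: doc_freq.get(t, 0))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
top_docs = threshold_documents(scores, 2)

doc_freq = {"the": 1000, "inverted": 12, "index": 40, "file": 300}
kept_terms = threshold_query(["the", "inverted", "file", "index"], doc_freq, 0.5)
```

Document thresholding truncates the result list, while query thresholding truncates the query itself, keeping the rarest and therefore most selective terms.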

In a search engine, thresholding for query retrieves the top k documents according to their weights.

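Answer

Retrieving the top k documents according to their weights is thresholding for documents, not for queries. The answer is \(\textnormal{F}\).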

Measurement

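Retrieval performance evaluation

  • Data retrieval performance evaluation: mainly considers metrics such as response time and the space taken by the index
  • Information retrieval performance evaluation: mainly considers how relevant the answers are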

When evaluating the performance of data retrieval, it is important to measure the relevancy of the answer set.

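Answer

Relevance of the answer set is what information retrieval evaluation measures; data retrieval evaluation looks at response time and index space. The answer is \(\textnormal{F}\).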

Precision measures the quality of all the retrieved documents.

Answer

\(\textnormal{T}\)

Recall is more important than precision when evaluating the explosive detection in airport security.

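Answer

In explosive detection at airport security, missing an explosive directly threatens public safety and the consequences are extremely severe, so missed detections matter more than false alarms. The answer is \(\textnormal{T}\).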

Two spam mail detection systems are tested on a dataset with 10000 ordinary mails and 2000 spam mails. System A detects 300 ordinary mails and 1600 spam mails, and system B detects 315 ordinary mails and 1800 spam mails. If our primary concern is to keep the important mails safe, which of the following is correct?

\(\textnormal{A}\). Precision is our primary concern and system A is better.

\(\textnormal{B}\). Recall is our primary concern and system B is better.

\(\textnormal{C}\). Precision is our primary concern and system B is better.

\(\textnormal{D}\). Recall is our primary concern and system A is better.

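Answer

The higher the precision, the fewer ordinary mails are misclassified as spam, and the safer the ordinary mails are. System B has the higher precision (1800/2115 versus 1600/1900 for system A), so the answer is \(\textnormal{C}\).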
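The arithmetic behind this comparison can be checked with a short sketch. It assumes that spam is the relevant class and that the mails a system detects form its retrieved set; the `precision_recall` helper is a name made up for this example.

```python
def precision_recall(flagged_ordinary, flagged_spam, total_spam):
    """Treat spam as the relevant class and flagged mails as the retrieved set."""
    retrieved = flagged_ordinary + flagged_spam
    precision = flagged_spam / retrieved  # share of flagged mails that are spam
    recall = flagged_spam / total_spam    # share of all spam that was flagged
    return precision, recall


p_a, r_a = precision_recall(300, 1600, 2000)  # system A
p_b, r_b = precision_recall(315, 1800, 2000)  # system B
```

System A comes out at roughly 0.842 precision and 0.8 recall, system B at roughly 0.851 precision and 0.9 recall, so B is ahead on both measures and precision is the one that protects ordinary mails.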