[contentanalysis] 关于Lucene的讨论

  • From: "Wight911" <wight911@xxxxxxxxx>
  • To: <contentanalysis@xxxxxxxxxxxxx>
  • Date: Sun, 13 Jul 2008 18:53:49 +0800

下面片段是我在一篇SIGMOD文章中看到的、关于倒排表索引(inverted list)的描
述。

请对照评价一下Lucene目前的版本。

我提问的原因是,根据我们的讨论,Lucene当前版本似乎就不是Matrix存储,是什么存
储?

 

Conceptually, an inverted list is a two-dimensional table,

where the i-th row represents indexed keyword Ki and the

j-th column represents instance Ij . The cell at the i-th row

and j-th column, denoted (Ki, Ij), records the number of

occurrences, called occurrence count, of keyword Ki in the

attributes of instance Ij . If the cell (Ki,Ij) is not zero, we

say instance Ij is indexed on Ki. The keywords are ordered

in alphabetic order, and the instances are ordered by their

identifiers. Table 1 shows the inverted list for our example

triple base.

…(这里省略一小段无关文字)

In practice, an inverted list is seldom stored as a matrix.

There are multiple ways to store an inverted list, such as a

sorted array, a prefix B-tree or a Patricia trie. In addition,

[42] describes techniques for compression of inverted lists.

The extensions we describe are orthogonal to these physical

implementations.

 

Other related posts:

  • » [contentanalysis] 关于Lucene的讨论