Lucene学习总结之四:Lucene索引过程分析(4)

Stella981
• 阅读 569
【转载 http://blog.csdn.net/forfuture1978/article/details/5279200
6、关闭IndexWriter对象

代码:

writer.close();

--> IndexWriter.closeInternal(boolean)

      --> (1) 将索引信息由内存写入磁盘: flush(waitForMerges, true, true);
      --> (2) 进行段合并: mergeScheduler.merge(this);

对段的合并将在后面的章节进行讨论,此处仅仅讨论将索引信息由写入磁盘的过程。

代码:

IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes)

--> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes)

      --> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes)

将索引写入磁盘包括以下几个过程:

  • 得到要写入的段名:String segment = docWriter.getSegment();
  • DocumentsWriter将缓存的信息写入段:docWriter.flush(flushDocStores);
  • 生成新的段信息对象:newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
  • 准备删除文档:docWriter.pushDeletes();
  • 生成cfs段:docWriter.createCompoundFile(segment);
  • 删除文档:applyDeletes();
6.1、得到要写入的段名

代码:

SegmentInfo newSegment = null;

final int numDocs = docWriter.getNumDocsInRAM();//文档总数

String docStoreSegment = docWriter.getDocStoreSegment();//存储域和词向量所要要写入的段名,"_0"   

int docStoreOffset = docWriter.getDocStoreOffset();//存储域和词向量要写入的段中的偏移量

String segment = docWriter.getSegment();//段名,"_0"

在Lucene的索引文件结构一章做过详细介绍,存储域和词向量可以和索引域存储在不同的段中。

6.2、将缓存的内容写入段

代码:

flushedDocCount = docWriter.flush(flushDocStores);

此过程又包含以下两个阶段;

  • 按照基本索引链关闭存储域和词向量信息
  • 按照基本索引链的结构将索引结果写入段

6.2.1、按照基本索引链关闭存储域和词向量信息

代码为:

closeDocStore();

flushState.numDocsInStore = 0;

其主要是根据基本索引链结构,关闭存储域和词向量信息:

  • consumer(DocFieldProcessor).closeDocStore(flushState);
    • consumer(DocInverter).closeDocStore(state);
      • consumer(TermsHash).closeDocStore(state);
        • consumer(FreqProxTermsWriter).closeDocStore(state);
        • if (nextTermsHash != null) nextTermsHash.closeDocStore(state);
          • consumer(TermVectorsTermsWriter).closeDocStore(state);
      • endConsumer(NormsWriter).closeDocStore(state);
    • fieldsWriter(StoredFieldsWriter).closeDocStore(state);

其中有实质意义的是以下两个closeDocStore:

  • 词向量的关闭:TermVectorsTermsWriter.closeDocStore(SegmentWriteState)

void closeDocStore(final SegmentWriteState state) throws IOException {

                     if (tvx != null) {
            //为不保存词向量的文档在tvd文件中写入零。即便不保存词向量,在tvx, tvd中也保留一个位置
            fill(state.numDocsInStore - docWriter.getDocStoreOffset());
            //关闭tvx, tvf, tvd文件的写入流
            tvx.close();
            tvf.close();
            tvd.close();
            tvx = null;
            //记录写入的文件名,为以后生成cfs文件的时候,将这些写入的文件生成一个统一的cfs文件。
            state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
            state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
            state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
            //从DocumentsWriter的成员变量openFiles中删除,未来可能被IndexFileDeleter删除
            docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_INDEX_EXTENSION);
            docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_FIELDS_EXTENSION);
            docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.VECTORS_DOCUMENTS_EXTENSION);
            lastDocID = 0;
        }    
}

  • 存储域的关闭:StoredFieldsWriter.closeDocStore(SegmentWriteState)

public void closeDocStore(SegmentWriteState state) throws IOException {

    //关闭fdx, fdt写入流

    fieldsWriter.close();
    --> fieldsStream.close();
    --> indexStream.close();
    fieldsWriter = null;
    lastDocID = 0;

    //记录写入的文件名
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
    state.flushedFiles.add(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
    state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_EXTENSION);
    state.docWriter.removeOpenFile(state.docStoreSegmentName + "." + IndexFileNames.FIELDS_INDEX_EXTENSION);
}

6.2.2、按照基本索引链的结构将索引结果写入段

代码为:

consumer(DocFieldProcessor).flush(threads, flushState);

    //回收fieldHash,以便用于下一轮的索引,为提高效率,索引链中的对象是被复用的。

    Map> childThreadsAndFields = new HashMap>();
    for ( DocConsumerPerThread thread : threads) {
        DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread;
        childThreadsAndFields.put(perThread.consumer, perThread.fields());
perThread.trimFields(state);
    }

    //写入存储域

    --> fieldsWriter(StoredFieldsWriter).flush(state);

    //写入索引域

    --> consumer(DocInverter).flush(childThreadsAndFields, state);

    //写入域元数据信息,并记录写入的文件名,以便以后生成cfs文件

    --> final String fileName = state.segmentFileName(IndexFileNames.FIELD_INFOS_EXTENSION);

    --> fieldInfos.write(state.directory, fileName);

    --> state.flushedFiles.add(fileName);

此过程也是按照基本索引链来的:

  • consumer(DocFieldProcessor).flush(…);
    • consumer(DocInverter).flush(…);
      • consumer(TermsHash).flush(…);
        • consumer(FreqProxTermsWriter).flush(…);
        • if (nextTermsHash != null) nextTermsHash.flush(…);
          • consumer(TermVectorsTermsWriter).flush(…);
      • endConsumer(NormsWriter).flush(…);
    • fieldsWriter(StoredFieldsWriter).flush(…);

6.2.2.1、写入存储域

代码为:

StoredFieldsWriter.flush(SegmentWriteState state) {
    if (state.numDocsInStore > 0) {
      initFieldsWriter();
      fill(state.numDocsInStore - docWriter.getDocStoreOffset());
    }
    if (fieldsWriter != null)
      fieldsWriter.flush();
  }

从代码中可以看出,是写入fdx, fdt两个文件,但是在上述的closeDocStore已经写入了,并且把state.numDocsInStore置零,fieldsWriter设为null,在这里其实什么也不做。

6.2.2.2、写入索引域

代码为:

DocInverter.flush(Map>, SegmentWriteState)

    //写入倒排表及词向量信息

    --> consumer(TermsHash).flush(childThreadsAndFields, state);

    //写入标准化因子

    --> endConsumer(NormsWriter).flush(endChildThreadsAndFields, state);

6.2.2.2.1、写入倒排表及词向量信息

代码为:

TermsHash.flush(Map>, SegmentWriteState)

    //写入倒排表信息

    --> consumer(FreqProxTermsWriter).flush(childThreadsAndFields, state);

   //回收RawPostingList

    --> shrinkFreePostings(threadsAndFields, state);

    //写入词向量信息

    --> if (nextTermsHash != null) nextTermsHash.flush(nextThreadsAndFields, state);

          --> consumer(TermVectorsTermsWriter).flush(childThreadsAndFields, state);

6.2.2.2.1.1、写入倒排表信息

代码为:

FreqProxTermsWriter.flush(Map                                       Collection>, SegmentWriteState)

(a) 所有域按名称排序,使得同名域能够一起处理

    Collections.sort(allFields);

    final int numAllFields = allFields.size();

(b) 生成倒排表的写对象

    final FormatPostingsFieldsConsumer consumer = new FormatPostingsFieldsWriter(state, fieldInfos);

    int start = 0;

(c) 对于每一个域

    while(start < numAllFields) {

(c-1) 找出所有的同名域

        final FieldInfo fieldInfo = allFields.get(start).fieldInfo;

        final String fieldName = fieldInfo.name;

        int end = start+1;

        while(end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName))

            end++;

        FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end-start];

        for(int i=start;i

            fields[i-start] = allFields.get(i);

            fieldInfo.storePayloads |= fields[i-start].hasPayloads;

        }

(c-2) 将同名域的倒排表添加到文件

        appendPostings(fields, consumer);

(c-3) 释放空间

        for(int i=0;i

            TermsHashPerField perField = fields[i].termsHashPerField;

            int numPostings = perField.numPostings;

            perField.reset();

            perField.shrinkHash(numPostings);

            fields[i].reset();

        }

        start = end;

    }

(d) 关闭倒排表的写对象

    consumer.finish();

(b) 生成倒排表的写对象

代码为:

public FormatPostingsFieldsWriter(SegmentWriteState state, FieldInfos fieldInfos) throws IOException {
    dir = state.directory;
    segment = state.segmentName;
    totalNumDocs = state.numDocs;
    this.fieldInfos = fieldInfos;
    //用于写tii,tis
    termsOut = new TermInfosWriter(dir, segment, fieldInfos, state.termIndexInterval);
    //用于写freq, prox的跳表 
    skipListWriter = new DefaultSkipListWriter(termsOut.skipInterval, termsOut.maxSkipLevels, totalNumDocs, null, null);
    //记录写入的文件名,
    state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_EXTENSION));
    state.flushedFiles.add(state.segmentFileName(IndexFileNames.TERMS_INDEX_EXTENSION)); 
    //用以上两个写对象,按照一定的格式写入段
    termsWriter = new FormatPostingsTermsWriter(state, this);
}

对象结构如下:

consumer    FormatPostingsFieldsWriter  (id=119)  //用于处理一个域
    dir    SimpleFSDirectory  (id=126)   //目标索引文件夹
    totalNumDocs    8   //文档总数
    fieldInfos    FieldInfos  (id=70)  //域元数据信息  
    segment    "_0"   //段名
    skipListWriter    DefaultSkipListWriter  (id=133)  //freq, prox中跳表的写对象  
    termsOut    TermInfosWriter  (id=125)  //tii, tis文件的写对象
    termsWriter    FormatPostingsTermsWriter  (id=135)  //用于添加词(Term)
        currentTerm    null   
        currentTermStart    0   
        fieldInfo    null   
        freqStart    0   
        proxStart    0   
        termBuffer    null   
        termsOut    TermInfosWriter  (id=125)   
        docsWriter    FormatPostingsDocsWriter  (id=139)  //用于写入此词的docid, freq信息
            df    0   
            fieldInfo    null   
            freqStart    0   
            lastDocID    0   
            omitTermFreqAndPositions    false   
            out    SimpleFSDirectory$SimpleFSIndexOutput  (id=144)   
            skipInterval    16   
            skipListWriter    DefaultSkipListWriter  (id=133)   
            storePayloads    false   
            termInfo    TermInfo  (id=151)   
            totalNumDocs    8    
            posWriter    FormatPostingsPositionsWriter  (id=146)  //用于写入此词在此文档中的位置信息  
                lastPayloadLength    -1   
                lastPosition    0   
                omitTermFreqAndPositions    false   
                out    SimpleFSDirectory$SimpleFSIndexOutput  (id=157)   
                parent    FormatPostingsDocsWriter  (id=139)   
                storePayloads    false   

  • FormatPostingsFieldsWriter.addField(FieldInfo field)用于添加索引域信息,其返回FormatPostingsTermsConsumer用于添加词信息
  • FormatPostingsTermsConsumer.addTerm(char[] text, int start)用于添加词信息,其返回FormatPostingsDocsConsumer用于添加freq信息
  • FormatPostingsDocsConsumer.addDoc(int docID, int termDocFreq)用于添加freq信息,其返回FormatPostingsPositionsConsumer用于添加prox信息
  • FormatPostingsPositionsConsumer.addPosition(int position, byte[] payload, int payloadOffset, int payloadLength)用于添加prox信息

(c-2) 将同名域的倒排表添加到文件

代码为:

FreqProxTermsWriter.appendPostings(FreqProxTermsWriterPerField[], FormatPostingsFieldsConsumer) {

    int numFields = fields.length;

    final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields];

    for(int i=0;i

      FreqProxFieldMergeState fms = mergeStates[i] = new FreqProxFieldMergeState(fields[i]);

      boolean result = fms.nextTerm(); //对所有的域,取第一个词(Term)

    }

(1) 添加此域,虽然有多个域,但是由于是同名域,只取第一个域的信息即可。返回的是用于添加此域中的词的对象。

    final FormatPostingsTermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo);

    FreqProxFieldMergeState[] termStates = new FreqProxFieldMergeState[numFields];

    final boolean currentFieldOmitTermFreqAndPositions = fields[0].fieldInfo.omitTermFreqAndPositions;

(2) 此while循环是遍历每一个尚有未处理的词的域,依次按照词典顺序处理这些域所包含的词。当一个域中的所有的词都被处理过后,则numFields减一,并从mergeStates数组中移除此域。直到所有的域的所有的词都处理完毕,方才退出此循环。

    while(numFields > 0) {

(2-1) 找出所有域中按字典顺序的下一个词。可能多个同名域中,都包含同一个term,因而要遍历所有的numFields,得到所有的域里的下一个词,numToMerge即为有多少个域包含此词。

      termStates[0] = mergeStates[0];

      int numToMerge = 1;

      for(int i=1;i

        final char[] text = mergeStates[i].text;

        final int textOffset = mergeStates[i].textOffset;

        final int cmp = compareText(text, textOffset, termStates[0].text, termStates[0].textOffset);

        if (cmp < 0) {

          termStates[0] = mergeStates[i];

          numToMerge = 1;

        } else if (cmp == 0)

          termStates[numToMerge++] = mergeStates[i];

      }

(2-2) 添加此词,返回FormatPostingsDocsConsumer用于添加文档号(doc ID)及词频信息(freq)

      final FormatPostingsDocsConsumer docConsumer = termsConsumer.addTerm(termStates[0].text, termStates[0].textOffset);

(2-3) 由于共numToMerge个域都包含此词,每个词都有一个链表的文档号表示包含这些词的文档。此循环遍历所有的包含此词的域,依次按照从小到大的循序添加包含此词的文档号及词频信息。当一个域中对此词的所有文档号都处理过了,则numToMerge减一,并从termStates数组中移除此域。当所有包含此词的域的所有文档号都处理过了,则结束此循环。

      while(numToMerge > 0) {

(2-3-1) 找出最小的文档号

        FreqProxFieldMergeState minState = termStates[0];

        for(int i=1;i

          if (termStates[i].docID < minState.docID)

            minState = termStates[i];

        final int termDocFreq = minState.termFreq;

(2-3-2) 添加文档号及词频信息,并形成跳表,返回FormatPostingsPositionsConsumer用于添加位置(prox)信息

        final FormatPostingsPositionsConsumer posConsumer = docConsumer.addDoc(minState.docID, termDocFreq);

//ByteSliceReader是用于读取bytepool中的prox信息的。

        final ByteSliceReader prox = minState.prox;

        if (!currentFieldOmitTermFreqAndPositions) {

          int position = 0;

(2-3-3) 此循环对包含此词的文档,添加位置信息

          for(int j=0;j

            final int code = prox.readVInt();

            position += code >> 1;

            final int payloadLength;

// 如果此位置有payload信息,则从bytepool中读出,否则设为零。

            if ((code & 1) != 0) {

              payloadLength = prox.readVInt();

              if (payloadBuffer == null || payloadBuffer.length < payloadLength)

                payloadBuffer = new byte[payloadLength];

              prox.readBytes(payloadBuffer, 0, payloadLength);

            } else

              payloadLength = 0;

//添加位置(prox)信息

              posConsumer.addPosition(position, payloadBuffer, 0, payloadLength);

          }

          posConsumer.finish();

        }

(2-3-4) 判断退出条件,上次选中的域取得下一个文档号,如果没有,则说明此域包含此词的文档已经处理完毕,则从termStates中删除此域,并将numToMerge减一。然后此域取得下一个词,当循环到(2)的时候,表明此域已经开始处理下一个词。如果没有下一个词,说明此域中的所有的词都处理完毕,则从mergeStates中删除此域,并将numFields减一,当numFields为0的时候,循环(2)也就结束了。

        if (!minState.nextDoc()) {//获得下一个docid

          //如果此域包含此词的文档已经没有下一篇docid,则从数组termStates中移除,numToMerge减一。

          int upto = 0;

          for(int i=0;i

            if (termStates[i] != minState)

              termStates[upto++] = termStates[i];

          numToMerge--;

          //此域则取下一个词(term),在循环(2)处来参与下一个词的合并

          if (!minState.nextTerm()) {

            //如果此域没有下一个词了,则此域从数组mergeStates中移除,numFields减一。

            upto = 0;

            for(int i=0;i

              if (mergeStates[i] != minState)

                mergeStates[upto++] = mergeStates[i];

            numFields--;

          }

        }

      }

(2-4) 经过上面的过程,docid和freq信息虽已经写入段文件,而跳表信息并没有写到文件中,而是写入skip buffer里面了,此处真正写入文件。并且词典(tii, tis)也应该写入文件。

      docConsumer(FormatPostingsDocsWriter).finish();

    }

    termsConsumer.finish();

  }

(2-3-4) 获得下一篇文档号代码如下:

public boolean nextDoc() {//如何获取下一个docid

  if (freq.eof()) {//如果bytepool中的freq信息已经读完

    if (p.lastDocCode != -1) {//由上述缓存管理,PostingList里面还存着最后一篇文档的文档号及词频信息,则将最后一篇文档返回

      docID = p.lastDocID;

      if (!field.omitTermFreqAndPositions)

        termFreq = p.docFreq;

      p.lastDocCode = -1;

      return true;

    } else

      return false;//没有下一篇文档

  }

  final int code = freq.readVInt();//如果bytepool中的freq信息尚未读完

  if (field.omitTermFreqAndPositions)

    docID += code;

  else {

    //读出文档号及词频信息。

    docID += code >>> 1;

    if ((code & 1) != 0)

      termFreq = 1;

    else

      termFreq = freq.readVInt();

  }

  return true;

}

(2-3-2) 添加文档号及词频信息代码如下:

FormatPostingsPositionsConsumer FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq) {

    final int delta = docID - lastDocID;

    //当文档数量达到skipInterval倍数的时候,添加跳表项。

    if ((++df % skipInterval) == 0) {

      skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);

      skipListWriter.bufferSkip(df);

   }

   lastDocID = docID;

   if (omitTermFreqAndPositions)

     out.writeVInt(delta);

   else if (1 == termDocFreq)

     out.writeVInt((delta<<1) | 1);

   else {

     //写入文档号及词频信息。

     out.writeVInt(delta<<1);

     out.writeVInt(termDocFreq);

   }

   return posWriter;

}

(2-3-3) 添加位置信息:

FormatPostingsPositionsWriter.addPosition(int position, byte[] payload, int payloadOffset, int payloadLength) {

    final int delta = position - lastPosition;

    lastPosition = position;

    if (storePayloads) {

        //保存位置及payload信息

        if (payloadLength != lastPayloadLength) {

            lastPayloadLength = payloadLength;

            out.writeVInt((delta<<1)|1);

            out.writeVInt(payloadLength);

        } else

            out.writeVInt(delta << 1);

            if (payloadLength > 0)

                out.writeBytes(payload, payloadLength);

    } else

        out.writeVInt(delta);

}

(2-4) 将跳表和词典(tii, tis)写入文件

FormatPostingsDocsWriter.finish() {

//将跳表缓存写入文件

    long skipPointer = skipListWriter.writeSkip(out);

    if (df > 0) {

//将词典(terminfo)写入tii,tis文件

      parent.termsOut(TermInfosWriter).add(fieldInfo.number, utf8.result, utf8.length, termInfo);

    }

  }

将跳表缓存写入文件:

DefaultSkipListWriter(MultiLevelSkipListWriter).writeSkip(IndexOutput)  {

    long skipPointer = output.getFilePointer();

    if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;

    //正如我们在索引文件格式中分析的那样, 高层在前,低层在后,除最低层外,其他的层都有长度保存。

    for (int level = numberOfSkipLevels - 1; level > 0; level--) {

      long length = skipBuffer[level].getFilePointer();

      if (length > 0) {

        output.writeVLong(length);

        skipBuffer[level].writeTo(output);

      }

    }

    //写入最低层

    skipBuffer[0].writeTo(output);

    return skipPointer;

  }

将词典(terminfo)写入tii,tis文件:

  • tii文件是tis文件的类似跳表的东西,是在tis文件中每隔indexInterval个词提取出一个词放在tii文件中,以便很快的查找到词。
  • 因而TermInfosWriter类型中有一个成员变量other也是TermInfosWriter类型的,还有一个成员变量isIndex来表示此对象是用来写tii文件的还是用来写tis文件的。
  • 如果一个TermInfosWriter对象的isIndex=false则,它是用来写tis文件的,它的other指向的是用来写tii文件的TermInfosWriter对象
  • 如果一个TermInfosWriter对象的isIndex=true则,它是用来写tii文件的,它的other指向的是用来写tis文件的TermInfosWriter对象

TermInfosWriter.add (int fieldNumber, byte[] termBytes, int termBytesLength, TermInfo ti) {

    //如果词的总数是indexInterval的倍数,则应该写入tii文件

    if (!isIndex && size % indexInterval == 0)

      other.add(lastFieldNumber, lastTermBytes, lastTermBytesLength, lastTi);

    //将词写入tis文件

    writeTerm(fieldNumber, termBytes, termBytesLength);

    output.writeVInt(ti.docFreq);                       // write doc freq

    output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers

    output.writeVLong(ti.proxPointer - lastTi.proxPointer);

    if (ti.docFreq >= skipInterval) {

      output.writeVInt(ti.skipOffset);

    }

    if (isIndex) {

      output.writeVLong(other.output.getFilePointer() - lastIndexPointer);

      lastIndexPointer = other.output.getFilePointer(); // write pointer

    }

    lastFieldNumber = fieldNumber;

    lastTi.set(ti);

    size++;

  }

6.2.2.2.1.2、写入词向量信息

代码为:

TermVectorsTermsWriter.flush (Map>
                                              threadsAndFields, final SegmentWriteState state) {

    if (tvx != null) {

      if (state.numDocsInStore > 0)

        fill(state.numDocsInStore - docWriter.getDocStoreOffset());

      tvx.flush();

      tvd.flush();

      tvf.flush();

    }

    for (Map.Entry> entry :
                                                                                                                                      threadsAndFields.entrySet()) {

      for (final TermsHashConsumerPerField field : entry.getValue() ) {

        TermVectorsTermsWriterPerField perField = (TermVectorsTermsWriterPerField) field;

        perField.termsHashPerField.reset();

        perField.shrinkHash();

      }

      TermVectorsTermsWriterPerThread perThread = (TermVectorsTermsWriterPerThread) entry.getKey();

      perThread.termsHashPerThread.reset(true);

    }

  }

从代码中可以看出,是写入tvx, tvd, tvf三个文件,但是在上述的closeDocStore已经写入了,并且把tvx设为null,在这里其实什么也不做,仅仅是清空postingsHash,以便进行下一轮索引时重用此对象。

6.2.2.2.2、写入标准化因子

代码为:

NormsWriter.flush(Map> threadsAndFields,

                           SegmentWriteState state) {

    final Map> byField = new HashMap>();

    for (final Map.Entry> entry :
                                                                                                                           threadsAndFields.entrySet()) {

     //遍历所有的域,将同名域对应的NormsWriterPerField放到同一个链表中。

      final Collection fields = entry.getValue();

      final Iterator fieldsIt = fields.iterator();

      while (fieldsIt.hasNext()) {

        final NormsWriterPerField perField = (NormsWriterPerField) fieldsIt.next();

        List l = byField.get(perField.fieldInfo);

        if (l == null) {

            l = new ArrayList();

            byField.put(perField.fieldInfo, l);

        }

        l.add(perField);

    }

    //记录写入的文件名,方便以后生成cfs文件。

    final String normsFileName = state.segmentName + "." + IndexFileNames.NORMS_EXTENSION;

    state.flushedFiles.add(normsFileName);

    IndexOutput normsOut = state.directory.createOutput(normsFileName);

    try {

      //写入nrm文件头

      normsOut.writeBytes(SegmentMerger.NORMS_HEADER, 0, SegmentMerger.NORMS_HEADER.length);

      final int numField = fieldInfos.size();

      int normCount = 0;

      //对每一个域进行处理

      for(int fieldNumber=0;fieldNumber

        final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);

        //得到同名域的链表

        List toMerge = byField.get(fieldInfo);

        int upto = 0;

        if (toMerge != null) {

          final int numFields = toMerge.size();

          normCount++;

          final NormsWriterPerField[] fields = new NormsWriterPerField[numFields];

          int[] uptos = new int[numFields];

          for(int j=0;j

            fields[j] = toMerge.get(j);

          int numLeft = numFields;

          //处理同名的多个域

          while(numLeft > 0) {

            //得到所有的同名域中最小的文档号

            int minLoc = 0;

            int minDocID = fields[0].docIDs[uptos[0]];

            for(int j=1;j

              final int docID = fields[j].docIDs[uptos[j]];

              if (docID < minDocID) {

                minDocID = docID;

                minLoc = j;

              }

            }

            // 在nrm文件中,每一个文件都有一个位置,没有设定的,放入默认值

            for (;upto<minDocID;upto++)

              normsOut.writeByte(defaultNorm);

            //写入当前的nrm值

            normsOut.writeByte(fields[minLoc].norms[uptos[minLoc]]);

            (uptos[minLoc])++;

            upto++;

            //如果当前域的文档已经处理完毕,则numLeft减一,归零时推出循环。

            if (uptos[minLoc] == fields[minLoc].upto) {

              fields[minLoc].reset();

              if (minLoc != numLeft-1) {

                fields[minLoc] = fields[numLeft-1];

                uptos[minLoc] = uptos[numLeft-1];

              }

              numLeft--;

            }

          }

          // 对所有的未设定nrm值的文档写入默认值。

          for(;upto

            normsOut.writeByte(defaultNorm);

        } else if (fieldInfo.isIndexed && !fieldInfo.omitNorms) {

          normCount++;

          // Fill entire field with default norm:

          for(;upto

            normsOut.writeByte(defaultNorm);

        }

      }

    } finally {

      normsOut.close();

    }

  }

6.2.2.3、写入域元数据

代码为:

FieldInfos.write(IndexOutput) {

    output.writeVInt(CURRENT_FORMAT);

    output.writeVInt(size());

    for (int i = 0; i < size(); i++) {

      FieldInfo fi = fieldInfo(i);

      byte bits = 0x0;

      if (fi.isIndexed) bits |= IS_INDEXED;

      if (fi.storeTermVector) bits |= STORE_TERMVECTOR;

      if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;

      if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;

      if (fi.omitNorms) bits |= OMIT_NORMS;

      if (fi.storePayloads) bits |= STORE_PAYLOADS;

      if (fi.omitTermFreqAndPositions) bits |= OMIT_TERM_FREQ_AND_POSITIONS;

      output.writeString(fi.name);

      output.writeByte(bits);

    }

}

此处基本就是按照fnm文件的格式写入的。

6.3、生成新的段信息对象

代码:

newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());

segmentInfos.add(newSegment);

6.4、准备删除文档

代码:

docWriter.pushDeletes();

    --> deletesFlushed.update(deletesInRAM);

此处将deletesInRAM全部加到deletesFlushed中,并把deletesInRAM清空。原因上面已经阐明。

6.5、生成cfs段

代码:

docWriter.createCompoundFile(segment);

newSegment.setUseCompoundFile(true);

代码为:

DocumentsWriter.createCompoundFile(String segment) {

    CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, segment + "." + IndexFileNames.COMPOUND_FILE_EXTENSION);

    //将上述中记录的文档名全部加入cfs段的写对象。

    for (final String flushedFile : flushState.flushedFiles)

      cfsWriter.addFile(flushedFile);

    cfsWriter.close();

  }

6.6、删除文档

代码:

applyDeletes();

代码为:

boolean applyDeletes(SegmentInfos infos) {

  if (!hasDeletes())

    return false;

  final int infosEnd = infos.size();

  int docStart = 0;

  boolean any = false;

  for (int i = 0; i < infosEnd; i++) {

    assert infos.info(i).dir == directory;

    SegmentReader reader = writer.readerPool.get(infos.info(i), false);

    try {

      any |= applyDeletes(reader, docStart);

      docStart += reader.maxDoc();

    } finally {

      writer.readerPool.release(reader);

    }

  }

  deletesFlushed.clear();

  return any;

}

  • Lucene删除文档可以用reader,也可以用writer,但是归根结底还是用reader来删除的。
  • reader的删除有以下三种方式:
    • 按照词删除,删除所有包含此词的文档。
    • 按照文档号删除。
    • 按照查询对象删除,删除所有满足此查询的文档。
  • 但是这三种方式归根结底还是按照文档号删除,也就是写.del文件的过程。

private final synchronized boolean applyDeletes(IndexReader reader, int docIDStart)

  throws CorruptIndexException, IOException {

  final int docEnd = docIDStart + reader.maxDoc();

  boolean any = false;

  //按照词删除,删除所有包含此词的文档。

  TermDocs docs = reader.termDocs();

  try {

    for (Entry entry: deletesFlushed.terms.entrySet()) {

      Term term = entry.getKey();

      docs.seek(term);

      int limit = entry.getValue().getNum();

      while (docs.next()) {

        int docID = docs.doc();

        if (docIDStart+docID >= limit)

          break;

        reader.deleteDocument(docID);

        any = true;

      }

    }

  } finally {

    docs.close();

  }

  //按照文档号删除。

  for (Integer docIdInt : deletesFlushed.docIDs) {

    int docID = docIdInt.intValue();

    if (docID >= docIDStart && docID < docEnd) {

      reader.deleteDocument(docID-docIDStart);

      any = true;

    }

  }

  //按照查询对象删除,删除所有满足此查询的文档。

  IndexSearcher searcher = new IndexSearcher(reader);

  for (Entry entry : deletesFlushed.queries.entrySet()) {

    Query query = entry.getKey();

    int limit = entry.getValue().intValue();

    Weight weight = query.weight(searcher);

    Scorer scorer = weight.scorer(reader, true, false);

    if (scorer != null) {

      while(true)  {

        int doc = scorer.nextDoc();

        if (((long) docIDStart) + doc >= limit)

          break;

        reader.deleteDocument(doc);

        any = true;

      }

    }

  }

  searcher.close();

  return any;

}

点赞
收藏
评论区
推荐文章
blmius blmius
3年前
MySQL:[Err] 1292 - Incorrect datetime value: ‘0000-00-00 00:00:00‘ for column ‘CREATE_TIME‘ at row 1
文章目录问题用navicat导入数据时,报错:原因这是因为当前的MySQL不支持datetime为0的情况。解决修改sql\mode:sql\mode:SQLMode定义了MySQL应支持的SQL语法、数据校验等,这样可以更容易地在不同的环境中使用MySQL。全局s
皕杰报表之UUID
​在我们用皕杰报表工具设计填报报表时,如何在新增行里自动增加id呢?能新增整数排序id吗?目前可以在新增行里自动增加id,但只能用uuid函数增加UUID编码,不能新增整数排序id。uuid函数说明:获取一个UUID,可以在填报表中用来创建数据ID语法:uuid()或uuid(sep)参数说明:sep布尔值,生成的uuid中是否包含分隔符'',缺省为
待兔 待兔
4个月前
手写Java HashMap源码
HashMap的使用教程HashMap的使用教程HashMap的使用教程HashMap的使用教程HashMap的使用教程22
Jacquelyn38 Jacquelyn38
3年前
2020年前端实用代码段,为你的工作保驾护航
有空的时候,自己总结了几个代码段,在开发中也经常使用,谢谢。1、使用解构获取json数据let jsonData  id: 1,status: "OK",data: 'a', 'b';let  id, status, data: number   jsonData;console.log(id, status, number )
Wesley13 Wesley13
3年前
Java获得今日零时零分零秒的时间(Date型)
publicDatezeroTime()throwsParseException{    DatetimenewDate();    SimpleDateFormatsimpnewSimpleDateFormat("yyyyMMdd00:00:00");    SimpleDateFormatsimp2newS
Stella981 Stella981
3年前
KVM调整cpu和内存
一.修改kvm虚拟机的配置1、virsheditcentos7找到“memory”和“vcpu”标签,将<namecentos7</name<uuid2220a6d1a36a4fbb8523e078b3dfe795</uuid
Wesley13 Wesley13
3年前
mysql设置时区
mysql设置时区mysql\_query("SETtime\_zone'8:00'")ordie('时区设置失败,请联系管理员!');中国在东8区所以加8方法二:selectcount(user\_id)asdevice,CONVERT\_TZ(FROM\_UNIXTIME(reg\_time),'08:00','0
Wesley13 Wesley13
3年前
00:Java简单了解
浅谈Java之概述Java是SUN(StanfordUniversityNetwork),斯坦福大学网络公司)1995年推出的一门高级编程语言。Java是一种面向Internet的编程语言。随着Java技术在web方面的不断成熟,已经成为Web应用程序的首选开发语言。Java是简单易学,完全面向对象,安全可靠,与平台无关的编程语言。
Stella981 Stella981
3年前
Django中Admin中的一些参数配置
设置在列表中显示的字段,id为django模型默认的主键list_display('id','name','sex','profession','email','qq','phone','status','create_time')设置在列表可编辑字段list_editable
Wesley13 Wesley13
3年前
MySQL部分从库上面因为大量的临时表tmp_table造成慢查询
背景描述Time:20190124T00:08:14.70572408:00User@Host:@Id:Schema:sentrymetaLast_errno:0Killed:0Query_time:0.315758Lock_
Python进阶者 Python进阶者
10个月前
Excel中这日期老是出来00:00:00,怎么用Pandas把这个去除
大家好,我是皮皮。一、前言前几天在Python白银交流群【上海新年人】问了一个Pandas数据筛选的问题。问题如下:这日期老是出来00:00:00,怎么把这个去除。二、实现过程后来【论草莓如何成为冻干莓】给了一个思路和代码如下:pd.toexcel之前把这