
Exploring: Is an Elasticsearch document's _id the same as Lucene's docid?

article 2024/11/15 8:29:58

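1. Preface

While working with the development team on optimizing our Elasticsearch usage, one of the developers asked me an interesting question after a meeting: is the _id field we write to ES the same thing as the docid used inside Lucene?

Are the two related in any way?

At the time I knew little about Lucene, so I could only answer honestly: they are probably not the same concept, but I had not yet worked out whether they are related, and I would follow up once I had a conclusion.

Let's work through that question now.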

2. Lucene's docid

Let's start with how the official Lucene documentation defines the docid.

Lucene's index is composed of segments, each of which contains a subset of all the documents in the index, and is a complete searchable index in itself, over that subset.  As documents are written to the index, new segments are created and flushed to directory storage.  Segments are immutable;  updates and deletions may only create new segments and do not modify existing ones.  Over time, the writer merges groups of smaller segments into single larger ones in order to maintain an index that is efficient to search, and to reclaim dead space left behind by deleted (and updated) documents.

Each document is identified by a 32-bit number, its "docid," and is composed of a collection of Field values of diverse types (postings, stored fields, doc values, and points).  Docids come in two flavors: global and per-segment.  A document's global docid is just the sum of its per-segment docid and that segment's base docid offset.  External, high-level APIs only handle global docids, but internal APIs that reference a LeafReader, which is a reader for a single segment, deal in per-segment docids.

Docids are assigned sequentially within each segment (starting at 0). Thus the number of documents in a segment is the same as its maximum docid;  some may be deleted, but their docids are retained until the segment is merged.  When segments merge, their documents are assigned new sequential docids.  Accordingly, docid values must always be treated as internal implementation, not exposed as part of an application, nor stored or referenced outside of Lucene's internal APIs.

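The key points:

A docid is a 32-bit number whose only purpose is to identify a document.

Each document is composed of a collection of Field values of various types (postings, stored fields, doc values, and points).

Docids come in two flavors: global and per-segment. A document's global docid is simply the sum of its per-segment docid and that segment's base docid offset.

Docids are assigned sequentially within each segment, starting at 0.

A deleted document's docid is retained until its segment is merged; when segments merge, the surviving documents are assigned new sequential docids.

Finally, the documentation stresses that docids must be treated as an internal implementation detail of Lucene: they should not be exposed to applications, nor stored or referenced outside Lucene's internal APIs.

The documentation describes in detail how docids are generated and used, but that last sentence sums it up: a docid is just Lucene's internal mechanism for identifying documents and has nothing to do with the outside world.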
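The relationship between global and per-segment docids can be sketched in a few lines of Python. This is only a toy model of the arithmetic described in the quoted documentation, not Lucene code; the function names `segment_bases`, `to_global`, and `to_local` are made up for illustration.

```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Return the base docid offset of each segment.

    The base of a segment is the total number of docids
    in all segments that come before it.
    """
    bases = []
    total = 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases


def to_global(bases, segment_index, local_docid):
    """Global docid = segment base offset + per-segment docid."""
    return bases[segment_index] + local_docid


def to_local(bases, global_docid):
    """Map a global docid back to (segment index, per-segment docid)."""
    # Find the last segment whose base is <= global_docid.
    segment_index = bisect_right(bases, global_docid) - 1
    return segment_index, global_docid - bases[segment_index]


# Three hypothetical segments holding 3, 5 and 2 documents.
bases = segment_bases([3, 5, 2])
print(bases)
print(to_global(bases, 1, 2))
print(to_local(bases, 8))
```

Because the bases depend on how many segments exist and how large each one is, the same document can end up with a different global docid whenever the segment layout changes, which is one reason docids are unsuitable as external identifiers.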

3. The ES _id

Now let's look at the _id field in ES.

Each document has an _id that uniquely identifies it, which is indexed so that documents can be looked up either with the GET API or the ids query. The _id can either be assigned at indexing time, or a unique _id can be generated by Elasticsearch. This field is not configurable in the mappings.

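The ES definition of _id is clear: it uniquely identifies a document. Its value can be specified at indexing time; if it is not, ES generates one automatically. In other words, _id is the unique primary key of an ES document.

ES provides this because Lucene itself does not require indexed data to carry a unique id. This is confirmed in a blog post by Michael McCandless, one of Lucene's core developers: Lucene does not require indexed documents to have a unique primary key. The original text reads: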

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id.

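From Lucene's point of view, a metadata field such as _id is just a field that ES defines on top of Lucene, and it is not fundamentally different from any other field.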

public AbstractIdFieldType() {
    // Arguments after NAME: isIndexed, isStored, hasDocValue
    super(NAME, true, true, false, TextSearchInfo.SIMPLE_MATCH_ONLY, Collections.emptyMap());
}

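As the ES source code shows, _id is a special field that is indexed in the inverted index and stored, but has no columnar doc values.

Fetching a document by _id with the GET API and querying by _id through the search API are therefore essentially the same from Lucene's point of view: both are lookups on the _id field.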

# GET operation
GET my-index-000001/_doc/2

# search operation
GET my-index-000001/_search
{
  "query": {
    "terms": {
      "_id": [ "2" ]
    }
  }
}

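4. Summary

In summary, although both the _id field and the docid are used to identify documents, the two are not coupled to each other.

The _id field is a special field that ES uses as the unique primary key of a document. The docid is an internal Lucene implementation detail that serves Lucene's own lookup process.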
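The difference can be illustrated with a small toy model in Python. It is a sketch under simplifying assumptions, not how Lucene or ES is implemented: a segment is modeled as a plain list whose positions play the role of per-segment docids, deletion is a flag on the document, and the helpers `merge` and `find_docid` are hypothetical names chosen for this example.

```python
def merge(segments):
    """Merge segments into one: drop deleted documents and
    renumber the survivors sequentially from 0."""
    merged = []
    for segment in segments:
        for doc in segment:
            if not doc["deleted"]:
                merged.append(dict(doc))
    return merged


def find_docid(segment, doc_id):
    """Return the per-segment docid (list position) of the live
    document whose _id equals doc_id, or None if there is none."""
    for docid, doc in enumerate(segment):
        if doc["_id"] == doc_id and not doc["deleted"]:
            return docid
    return None


# Two hypothetical segments; document "b" has been deleted
# but still occupies docid 1 until its segment is merged.
segment_a = [{"_id": "a", "deleted": False}, {"_id": "b", "deleted": True}]
segment_b = [{"_id": "c", "deleted": False}]

# Global docid of "c" before the merge: base of segment_b + local docid.
before = len(segment_a) + find_docid(segment_b, "c")
after = find_docid(merge([segment_a, segment_b]), "c")
print(before, after)
```

In this model the document with _id "c" is found by the same _id before and after the merge, while the number standing in for its docid changes once the deleted document is dropped and the survivors are renumbered.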

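Of course, ES does not rely on the _id field alone to keep a document unique across multiple versions. For more on this, see《Elasticsearch内核解析 - 数据模型篇》(Elasticsearch Internals: The Data Model):

https://zhuanlan.zhihu.com/p/34680841

Readers interested in optimizing their use of the _id field can refer to "Choosing a fast unique identifier (UUID) for Lucene":

https://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html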

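About the author

金多安 (Jin Duoan) is an Elastic Certified Engineer and senior Elastic operations engineer, a guest of the 死磕 Elasticsearch Knowledge Planet community and one of its most active technical experts, and managing editor of the 搜索客 community daily.

Reviewed and lightly edited by 铭毅天下.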

