
Using DocValues and Stored Fields in Lucene

article 2025/1/30 16:33:14

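1. Design purpose

Stored Fields:
- Store the original field values of each document so that the document can be returned in full after a search.
- Suited to content shown directly to the user, such as a title or body text.
DocValues:
- Store data in a column-oriented layout to support efficient sorting, aggregation, and grouping.
- Suited to fields that are read often for computation, such as numbers and dates.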

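2. Storage structure

Stored Fields:
- Stored per document (row), similar to row-oriented storage.
- All field values of one document are kept together, which suits reading a whole document at once.
DocValues:
- Stored per field (column), similar to column-oriented storage.
- All documents' values for one field are kept together, which suits fast access to every value of that field.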
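
The two layouts can be sketched in plain Java. This is a simplified in-memory model, not Lucene's on-disk format; the class `LayoutSketch` and its method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of row-oriented vs. column-oriented storage.
// Hypothetical illustration only; Lucene's real formats are more involved.
final class LayoutSketch {

    // Row-oriented (stored-fields style): all fields of one document sit together.
    static List<Map<String, Object>> buildRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("Phone", 2999L));
        rows.add(row("Laptop", 5999L));
        return rows;
    }

    private static Map<String, Object> row(String title, long price) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("title", title);
        fields.put("price", price);
        return fields;
    }

    // Column-oriented (doc-values style): all values of one field sit together,
    // indexed by docID.
    static long[] buildPriceColumn() {
        return new long[] {2999L, 5999L};
    }

    // Reading one whole document touches a single row.
    static Map<String, Object> readDocument(List<Map<String, Object>> rows, int docId) {
        return rows.get(docId);
    }

    // Scanning one field touches a single contiguous array.
    static long sumColumn(long[] column) {
        long sum = 0;
        for (long value : column) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(readDocument(buildRows(), 1));
        System.out.println(sumColumn(buildPriceColumn()));
    }
}
```

Summing prices from the row layout would require visiting every document's map, whereas the column layout reads one array.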

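3. Data access

Stored Fields:
- Accessed by document ID (docID); suited to retrieving the full content of a single document.
- For example: searcher.doc(docID) returns all stored fields of that document.
DocValues:
- Accessed by field name; suited to operating on one field's values across many documents.
- For example: DocValues.getNumeric(leafReader, field) returns a NumericDocValues iterator over the numeric values of that field in one segment.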
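
Since Lucene 7, doc values are read through a forward-only iterator rather than by random access. The sketch below mimics that access pattern in plain Java so it runs without Lucene; `ColumnIteratorSketch` is a hypothetical stand-in, and only the `nextDoc()` / `longValue()` method names and the `NO_MORE_DOCS` sentinel are borrowed from Lucene's API.

```java
// Hypothetical stand-in for a per-field doc-values iterator. It only mimics
// the forward-only nextDoc()/longValue() access pattern, not Lucene internals.
final class ColumnIteratorSketch {
    // Same sentinel value Lucene's DocIdSetIterator uses to signal exhaustion.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;  // ascending docIDs that have a value
    private final long[] values; // values[i] belongs to docIds[i]
    private int pos = -1;

    ColumnIteratorSketch(int[] docIds, long[] values) {
        this.docIds = docIds;
        this.values = values;
    }

    // Moves to the next document that has a value for this field.
    int nextDoc() {
        pos++;
        return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
    }

    // Value of the current document; only valid after nextDoc() returned a docID.
    long longValue() {
        return values[pos];
    }

    // Scans the whole column once and returns the largest value,
    // or Long.MIN_VALUE when the column is empty.
    static long max(ColumnIteratorSketch column) {
        long max = Long.MIN_VALUE;
        while (column.nextDoc() != NO_MORE_DOCS) {
            max = Math.max(max, column.longValue());
        }
        return max;
    }

    public static void main(String[] args) {
        // docID 1 has no price, so the iterator skips it.
        ColumnIteratorSketch prices =
            new ColumnIteratorSketch(new int[] {0, 2}, new long[] {2999L, 1999L});
        System.out.println(max(prices));
    }
}
```

Documents without a value never appear in the iteration, which is why a loop over `0..maxDoc` is not needed.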

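4. Use cases

Stored Fields:
- Use them when the original content of a document has to be displayed.
- For example: the title, summary, and body shown for a search result.
DocValues:
- Use them when a field has to be sorted, aggregated, or grouped.
- For example: sorting by price, filtering by a date range, or computing the maximum value of a field.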
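
A common combination is to order hits using a column and then fetch the stored values for display. The following plain-Java sketch models that split; the class name, the arrays, and the sample titles are illustrative assumptions, not Lucene code.

```java
import java.util.Arrays;

// Hypothetical model: a price column drives the ordering, and a separate
// per-document store supplies the text to display.
final class SortThenDisplaySketch {

    // Returns docIDs ordered by ascending value in the given column.
    static Integer[] sortDocIdsBy(long[] column) {
        Integer[] docIds = new Integer[column.length];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i;
        }
        Arrays.sort(docIds, (Integer a, Integer b) -> Long.compare(column[a], column[b]));
        return docIds;
    }

    // Looks up the stored title of each document in price order.
    static String[] titlesInPriceOrder(String[] storedTitles, long[] priceColumn) {
        Integer[] order = sortDocIdsBy(priceColumn);
        String[] titles = new String[order.length];
        for (int i = 0; i < order.length; i++) {
            titles[i] = storedTitles[order[i]];
        }
        return titles;
    }

    public static void main(String[] args) {
        String[] titles = {"Phone", "Laptop", "Tablet"};
        long[] prices = {2999L, 5999L, 1999L};
        System.out.println(Arrays.toString(titlesInPriceOrder(titles, prices)));
    }
}
```

Only the price column is touched while sorting; the stored titles are read once per result that is actually shown.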

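5. Performance characteristics

Stored Fields:
- Efficient when reading a whole document, but a poor fit for sorting and aggregation, because reading one field means loading each document's stored fields in turn.
- Usually take more space, because the original values are kept (Lucene compresses them in blocks, but long text still dominates).
DocValues:
- Efficient for sorting, aggregation, and grouping, because the column layout suits batch scans.
- Usually take less space, because the values of each field are encoded and compressed together.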
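
To give a feel for why a numeric column can be stored compactly, the sketch below estimates how many bits each value needs after subtracting the column minimum and dividing by the greatest common divisor of the offsets. Lucene's doc-values codecs use related ideas such as bit packing and GCD compression, but the real encoding depends on the codec and version; `PackedBitsSketch` is a hypothetical, simplified model that assumes the value range fits in a `long`.

```java
// Hypothetical model of compact numeric encoding; not Lucene's actual codec.
final class PackedBitsSketch {

    // Greatest common divisor of two non-negative longs (gcd(0, x) == x).
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Bits needed per value once each value is stored as (value - min) / gcd.
    // Assumes max - min does not overflow a long.
    static int bitsPerValue(long[] values) {
        long min = Long.MAX_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
        }
        long divisor = 0;
        long maxOffset = 0;
        for (long v : values) {
            long offset = v - min;
            divisor = gcd(divisor, offset);
            maxOffset = Math.max(maxOffset, offset);
        }
        if (maxOffset == 0) {
            return 0; // all values are equal: nothing to store per document
        }
        long maxQuotient = maxOffset / divisor;
        return 64 - Long.numberOfLeadingZeros(maxQuotient);
    }

    public static void main(String[] args) {
        // Offsets are 1000, 4000, 0; their GCD is 1000; quotients are 1, 4, 0.
        System.out.println(bitsPerValue(new long[] {2999L, 5999L, 1999L}));
    }
}
```

For the three sample prices the quotients are 1, 4, and 0, so 3 bits per document suffice in this model instead of a full 64-bit long.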

6. Side-by-side example

Suppose a document collection has the following fields:

title
content (body text)
price
date

Stored Fields

Storage layout:

doc1: {title: "Phone", content: "This is a smartphone", price: 2999, date: "2023-10-01"}
doc2: {title: "Laptop", content: "High-performance laptop", price: 5999, date: "2023-10-02"}

Use case:
- After a search, display each document's title and content.

DocValues

Storage layout:

title: [null, null]  # analyzed text fields normally have no DocValues
price: [2999, 5999]
date: ["2023-10-01", "2023-10-02"]

Use cases:
- Sort by price: price: [2999, 5999].
- Filter by date: date: ["2023-10-01", "2023-10-02"].
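
String columns such as date are typically held as a sorted dictionary of unique values plus one small integer (an ordinal) per document, which is the idea behind Lucene's SortedDocValues. The plain-Java sketch below models that structure; `OrdinalColumnSketch` is a hypothetical name and the code is an illustration, not Lucene's implementation.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical model of a dictionary-encoded string column.
final class OrdinalColumnSketch {
    final String[] dictionary; // sorted unique values
    final int[] ordinals;      // ordinals[docId] = index into dictionary

    OrdinalColumnSketch(String[] valuesByDoc) {
        // TreeSet removes duplicates and sorts in natural String order.
        dictionary = new TreeSet<>(Arrays.asList(valuesByDoc)).toArray(new String[0]);
        ordinals = new int[valuesByDoc.length];
        for (int docId = 0; docId < valuesByDoc.length; docId++) {
            ordinals[docId] = Arrays.binarySearch(dictionary, valuesByDoc[docId]);
        }
    }

    // Resolves a document's ordinal back to its string value.
    String valueOf(int docId) {
        return dictionary[ordinals[docId]];
    }

    public static void main(String[] args) {
        OrdinalColumnSketch dates = new OrdinalColumnSketch(
            new String[] {"2023-10-02", "2023-10-01", "2023-10-02"});
        System.out.println(Arrays.toString(dates.dictionary));
        System.out.println(Arrays.toString(dates.ordinals));
    }
}
```

Because the dictionary is sorted, comparing two documents by date reduces to comparing two integers, and repeated values are stored only once.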

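7. Summary comparison

Feature	Stored Fields	DocValues
Storage layout	Row-oriented (per document)	Column-oriented (per field)
Data access	By document ID	By field name
Typical use	Displaying original document content	Sorting, aggregation, grouping
Performance	Efficient for reading a whole document	Efficient for sorting and aggregation
Storage size	Larger	Smaller (encoded and compressed)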

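Recommendations

If you need to display the original content of a document, use Stored Fields.
If you need to sort, aggregate, or group on a field, use DocValues.
In some cases both can be used on the same field, for example a stored field for display and DocValues for sorting.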

Example scenario

Suppose we have a document collection with the following fields:

id (document ID)
price (numeric)
date (string)

We need to sort and aggregate on these fields.

Java example code


import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;

public class DocValuesExample {

    public static void main(String[] args) throws IOException {
        // 1. Create an in-memory directory (a real application can use FSDirectory).
        //    ByteBuffersDirectory replaces RAMDirectory, which is deprecated and was removed in Lucene 9.
        Directory directory = new ByteBuffersDirectory();

        // 2. Create the analyzer and the index writer.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(directory, config);

        // 3. Add documents.
        addDocument(writer, 1, 2999, "2023-10-01");
        addDocument(writer, 2, 5999, "2023-10-02");
        addDocument(writer, 3, 1999, "2023-10-03");
        writer.commit();

        // 4. Open an index reader and a searcher.
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);

        // 5. Sort with DocValues (ascending by price).
        //    NumericDocValuesField stores a long, so the sort type is LONG.
        Sort sort = new Sort(new SortField("price", SortField.Type.LONG));
        TopDocs topDocs = searcher.search(new MatchAllDocsQuery(), 10, sort);

        System.out.println("Results sorted by price:");
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            int docId = scoreDoc.doc;
            // Stored fields are loaded per hit, for display only.
            // Newer Lucene releases offer searcher.storedFields().document(docId) instead.
            Document doc = searcher.doc(docId);
            System.out.println("Document ID: " + doc.get("id") + ", price: " + doc.get("price"));
        }

        // 6. Aggregate with DocValues (compute the highest price).
        //    price was indexed as NumericDocValuesField, so it must be read as
        //    NumericDocValues; reading it as BinaryDocValues would not work.
        long maxPrice = Long.MIN_VALUE;
        NumericDocValues priceValues = MultiDocValues.getNumericValues(reader, "price");
        if (priceValues != null) { // null when no segment has values for the field
            // The iterator only visits documents that have a value.
            while (priceValues.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                maxPrice = Math.max(maxPrice, priceValues.longValue());
            }
        }
        System.out.println("Highest price: " + maxPrice);

        // 7. Release resources.
        reader.close();
        writer.close();
        directory.close();
    }

    // Builds one document and adds it to the index.
    private static void addDocument(IndexWriter writer, int id, int price, String date) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", String.valueOf(id), Field.Store.YES)); // indexed and stored
        doc.add(new NumericDocValuesField("price", price)); // DocValues field (sorting and aggregation)
        doc.add(new SortedDocValuesField("date", new BytesRef(date))); // DocValues field (sorting)
        doc.add(new StoredField("price", price)); // stored field (display)
        doc.add(new StoredField("date", date)); // stored field (display)
        writer.addDocument(doc);
    }
}

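Code walkthrough

Indexing:
- NumericDocValuesField holds numeric fields (such as price).
- SortedDocValuesField holds string fields (such as date).
- StoredField holds the field values that need to be displayed.
Sorting:
- Sort and SortField sort the results on the price field.
Aggregation:
- The doc values of the price field are iterated to compute the highest price.
Display:
- The values kept by StoredField are used to show each document.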

Output

Results sorted by price:
Document ID: 3, price: 1999
Document ID: 1, price: 2999
Document ID: 2, price: 5999
Highest price: 5999
