Elasticsearch Built-in Analyzers

article 2025/3/12 21:20:26

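Elasticsearch (ES) ships with a number of built-in analyzers that turn text into tokens. An analyzer is made up of zero or more character filters, exactly one tokenizer, and zero or more token filters, applied in that order. The sections below walk through commonly used built-in analyzers, each with an example.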
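The three-stage structure can be sketched in plain Python to make the order of operations concrete. This is only an illustration, not how Elasticsearch is implemented: the function analyze and the three stand-in components (replace_ampersand, whitespace_tokenizer, lowercase) are hypothetical names made up for this sketch.

```python
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run the three analysis stages in order and return the final tokens."""
    for char_filter in char_filters:      # stage 1: rewrite the raw text
        text = char_filter(text)
    tokens = tokenizer(text)              # stage 2: split the text into tokens
    for token_filter in token_filters:    # stage 3: change, add, or drop tokens
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real components.
def replace_ampersand(text: str) -> str:
    return text.replace("&", " and ")


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


print(analyze("Tom & Jerry", [replace_ampersand], whitespace_tokenizer, [lowercase]))
# prints ['tom', 'and', 'jerry']
```

Each built-in analyzer can be read as one fixed choice of these three stages.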

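1. Standard Analyzer

The default analyzer; it works reasonably well for most languages.
Processing steps:
1. The standard tokenizer splits text on word boundaries as defined by the Unicode Text Segmentation algorithm (UAX #29) and drops most punctuation.
2. The lowercase token filter converts each token to lowercase.

Example: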

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]
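The two steps can be imitated in Python for this particular sentence. Treat it as a rough sketch only: the real tokenizer follows Unicode Text Segmentation (UAX #29), which the regular expression below does not implement. It merely keeps apostrophes inside a word and splits on whitespace and other punctuation. The name standard_like is hypothetical.

```python
import re
from typing import List

# Rough stand-in for the standard tokenizer: runs of word characters,
# optionally joined by apostrophes inside a word (so "dog's" stays whole).
# This is NOT UAX #29; it only mimics the behaviour seen in this example.
_WORD = re.compile(r"\w+(?:'\w+)*")


def standard_like(text: str) -> List[str]:
    tokens = _WORD.findall(text)          # step 1: tokenize
    return [t.lower() for t in tokens]    # step 2: lowercase token filter


print(standard_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# prints ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```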

2. Simple Analyzer

Splits text at any non-letter character (digits, punctuation, whitespace), discards those characters, and lowercases each token.

Example:

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]
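The letters-only rule explains why "2" disappears and "dog's" becomes two tokens. A minimal Python approximation, assuming a regular expression is close enough for illustration (simple_like is a hypothetical name, not an Elasticsearch API):

```python
import re
from typing import List


def simple_like(text: str) -> List[str]:
    # [^\W\d_]+ matches runs of letters only:
    # word characters minus digits and the underscore.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


print(simple_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# prints ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```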

3. Whitespace Analyzer

Splits on whitespace only. It does not change case and leaves punctuation attached to the tokens.

Example:

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["The", "2", "QUICK", "Brown-Foxes", "jumped", "over", "the", "lazy", "dog's", "bone."]
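For comparison, the same behaviour in Python is a plain whitespace split, which is why the trailing period stays attached to "bone.". The name whitespace_like is hypothetical and the snippet is only an illustration:

```python
from typing import List


def whitespace_like(text: str) -> List[str]:
    # str.split() with no argument splits on runs of whitespace
    # and leaves case and punctuation untouched.
    return text.split()


print(whitespace_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# prints ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```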

4. Keyword Analyzer

Emits the entire input as a single token, with no tokenization at all.

Example:

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]

5. Stop Analyzer

Works like the simple analyzer, but also removes stop words. By default it uses the English stop word list ("the", "and", "a", and so on).

Example:

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]
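A small Python sketch shows the extra filtering step on top of letters-only tokenization. STOP_WORDS here is an illustrative subset chosen for this example, not the full default list, and stop_like is a hypothetical name:

```python
import re
from typing import List

# Illustrative subset only; the real default English list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def stop_like(text: str) -> List[str]:
    # Same letters-only, lowercased tokens as the simple analyzer ...
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    # ... followed by a stop word filter.
    return [t for t in tokens if t not in STOP_WORDS]


print(stop_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# prints ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```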

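6. Pattern Analyzer

Splits text with a regular expression. The expression matches the separators between tokens, not the tokens themselves.

Example: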

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

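By default it splits on non-word characters (the pattern \W+), which is why the digit "2" survives, and lowercases the tokens: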

["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]
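Because the pattern describes separators, the closest Python equivalent is re.split rather than re.findall. The sketch below is an approximation under that assumption; pattern_like and its parameters are hypothetical names for this illustration, and re.ASCII is used so that \W treats only ASCII letters, digits, and the underscore as word characters.

```python
import re
from typing import List


def pattern_like(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    # The pattern matches the gaps between tokens, so we split on it
    # and drop the empty strings left at the edges.
    pieces = re.split(pattern, text, flags=re.ASCII)
    tokens = [p for p in pieces if p]
    return [t.lower() for t in tokens] if lowercase else tokens


print(pattern_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# prints ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']

# A different separator pattern: any non-word character or an underscore.
print(pattern_like("John_Smith@foo-bar.com", r"\W|_"))
# prints ['john', 'smith', 'foo', 'bar', 'com']
```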

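7. Language Analyzers

A family of analyzers tuned for specific languages, such as english, french, and german. Chinese, Japanese, and Korean text is handled by the cjk analyzer.

Example (English):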

POST _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["2", "quick", "brown", "fox", "jump", "over", "lazi", "dog", "bone"]

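8. ICU Analyzer

Built on the ICU (International Components for Unicode) library, with multilingual tokenization support. Unlike the analyzers above, icu_analyzer is not part of the core distribution: it is provided by the analysis-icu plugin, which must be installed before the request below will work.

Example: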

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]

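9. Fingerprint Analyzer

Tokenizes and lowercases the text, folds characters to ASCII, then sorts the tokens, removes duplicates, and joins them into a single token, the "fingerprint".

Example: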

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

["2 bone brown dog's foxes jumped lazy over quick the"]
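The fingerprint analyzer emits one token, not a list of words. A rough Python sketch of the sort, de-duplicate, and join steps follows. The tokenizing regular expression and the NFKD-based ASCII folding are simplifications assumed for illustration, and fingerprint_like and ascii_fold are hypothetical names.

```python
import re
import unicodedata
from typing import List


def ascii_fold(token: str) -> str:
    # Simplified folding: decompose accented characters, drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint_like(text: str, separator: str = " ") -> List[str]:
    # Tokenize (keeping word-internal apostrophes), lowercase, and fold.
    tokens = [ascii_fold(t.lower()) for t in re.findall(r"\w+(?:'\w+)*", text)]
    # De-duplicate, sort, and join into a single output token.
    return [separator.join(sorted(set(tokens)))]


print(fingerprint_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# prints ["2 bone brown dog's foxes jumped lazy over quick the"]
```

Two inputs that differ only in word order, case, accents, or repeated words produce the same fingerprint, which makes the result useful as a key for grouping near-duplicates.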

Summary

Each built-in analyzer suits a different scenario. Pick the one that matches your requirements, or define a custom analyzer when none of them fits.
