Built-in Analyzers in Elasticsearch

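Elasticsearch (ES) ships with a number of built-in analyzers that turn text into searchable tokens. An analyzer is typically made up of zero or more character filters, exactly one tokenizer, and zero or more token filters, applied in that order. The most commonly used built-in analyzers are listed below, each with an example: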
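
To make the three-stage structure concrete, here is a minimal Python sketch of such a pipeline. It is not how Elasticsearch is implemented; the function names (analyze, strip_html, whitespace_tokenizer, lowercase) are hypothetical and exist only for illustration.

```python
import re
from typing import Callable, List

def analyze(text: str,
            char_filters: List[Callable[[str], str]],
            tokenizer: Callable[[str], List[str]],
            token_filters: List[Callable[[List[str]], List[str]]]) -> List[str]:
    """Run text through character filters, one tokenizer, then token filters."""
    for char_filter in char_filters:      # stage 1: rewrite the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # stage 2: split into tokens
    for token_filter in token_filters:    # stage 3: modify, drop, or add tokens
        tokens = token_filter(tokens)
    return tokens

def strip_html(text: str) -> str:
    # Hypothetical character filter: naively drop anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()

def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]

tokens = analyze("<b>Quick</b> Brown FOX",
                 [strip_html], whitespace_tokenizer, [lowercase])
print(tokens)  # ['quick', 'brown', 'fox']
```

Each built-in analyzer can be read as one fixed choice for these three slots.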


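1. Standard Analyzer

  • The default analyzer; a reasonable choice for most languages.
  • Processing steps:
    1. The Standard Tokenizer splits text on word boundaries as defined by the Unicode Text Segmentation algorithm (UAX #29) and drops most punctuation.
    2. The Lowercase Token Filter converts each token to lowercase.
  • Example
    POST _analyze
    {
      "analyzer": "standard",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output (the hyphen splits Brown-Foxes, while dog's stays in one piece)
    ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]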
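
The two steps can be roughly imitated in Python. This is only an approximation under a simplifying assumption: a regular expression stands in for the full Unicode Text Segmentation rules, so it will diverge from Elasticsearch on many inputs (for example decimal numbers or non-Latin scripts). The name standard_like_analyze is hypothetical.

```python
import re
from typing import List

def standard_like_analyze(text: str) -> List[str]:
    """Rough stand-in for the standard analyzer on plain English text."""
    # Keep runs of word characters, allowing an apostrophe inside a word
    # (so "dog's" survives), then lowercase every token.
    tokens = re.findall(r"\w+(?:'\w+)*", text)
    return [token.lower() for token in tokens]

print(standard_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```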
    

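2. Simple Analyzer

  • Splits text at every non-letter character (digits, punctuation, whitespace), discards those characters, and converts the tokens to lowercase. Digits are therefore dropped entirely.
  • Example
    POST _analyze
    {
      "analyzer": "simple",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output (the "2" is gone, and dog's is split into "dog" and "s")
    ["the", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]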
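
The letters-only rule is easy to sketch in Python. The helper name simple_like_analyze is hypothetical, and the regular expression is an approximation of the letter tokenizer rather than its actual implementation.

```python
import re
from typing import List

def simple_like_analyze(text: str) -> List[str]:
    """Keep only runs of letters, then lowercase them."""
    # [^\W\d_] means "a word character that is neither a digit nor an
    # underscore", which is to say a letter.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

print(simple_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```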
    

3. Whitespace Analyzer

  • Splits text on whitespace only. It does not change case and leaves punctuation attached to the tokens.
  • Example
    POST _analyze
    {
      "analyzer": "whitespace",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output (note the trailing period in "bone.")
    ["The", "2", "QUICK", "Brown-Foxes", "jumped", "over", "the", "lazy", "dog's", "bone."]
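
A practical consequence is that exact token lookups become sensitive to case and punctuation. The short Python sketch below imitates the splitting with str.split; whitespace_like_analyze is a hypothetical name used only here.

```python
from typing import List

def whitespace_like_analyze(text: str) -> List[str]:
    """Split on runs of whitespace and change nothing else."""
    return text.split()

ws_tokens = whitespace_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")

# Neither a lowercased term nor a term without the trailing period
# is present in the token list.
print("quick" in ws_tokens)  # False
print("bone" in ws_tokens)   # False
print("bone." in ws_tokens)  # True
```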
    

4. Keyword Analyzer

  • Emits the entire input as a single token, with no splitting or other processing.
  • Example
    POST _analyze
    {
      "analyzer": "keyword",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output
    ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
    

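5. Stop Analyzer

  • Works like the Simple Analyzer, but also removes stop words. By default it uses the English stop word list ("the", "and", "a", and so on).
  • Example
    POST _analyze
    {
      "analyzer": "stop",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output (both occurrences of "the" are removed)
    ["quick", "brown", "foxes", "jumped", "over", "lazy", "dog", "s", "bone"]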
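
In Python terms, the stop analyzer adds one filtering step on top of the letters-only tokenization. The sketch below is an approximation; ILLUSTRATIVE_STOPWORDS is a small hand-picked subset chosen for this example, not the full list Elasticsearch uses, and stop_like_analyze is a hypothetical name.

```python
import re
from typing import List, Set

# Assumption: a short illustrative subset, not the real default stop word list.
ILLUSTRATIVE_STOPWORDS: Set[str] = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def stop_like_analyze(text: str,
                      stopwords: Set[str] = ILLUSTRATIVE_STOPWORDS) -> List[str]:
    """Letters-only tokens, lowercased, with stop words removed."""
    tokens = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in stopwords]

print(stop_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['quick', 'brown', 'foxes', 'jumped', 'over', 'lazy', 'dog', 's', 'bone']
```

Passing a different set as stopwords shows how a custom list changes the result, for example adding "over" removes that token as well.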
    

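6. Pattern Analyzer

  • Splits text wherever a regular expression matches; the expression describes the separators, not the tokens.
  • Example
    POST _analyze
    {
      "analyzer": "pattern",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    By default the pattern is \W+ (one or more non-word characters) and tokens are lowercased, so digits are kept while hyphens and apostrophes act as separators:
    ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog", "s", "bone"]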
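
The same behavior can be sketched with Python's re.split. Two caveats: Elasticsearch uses Java regular expressions, whose syntax differs from Python's in some details, and pattern_like_analyze is a hypothetical helper. Its pattern and lowercase arguments are modeled on the analyzer's parameters of the same names.

```python
import re
from typing import List

def pattern_like_analyze(text: str,
                         pattern: str = r"\W+",
                         lowercase: bool = True) -> List[str]:
    """Split text on a separator pattern and optionally lowercase the tokens."""
    # re.split can yield empty strings at the edges (e.g. after a final "."),
    # so those are filtered out.
    tokens = [token for token in re.split(pattern, text) if token]
    return [token.lower() for token in tokens] if lowercase else tokens

print(pattern_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']

# A custom separator pattern: split on commas and semicolons only.
print(pattern_like_analyze("Red,Green;Blue", pattern=r"[,;]", lowercase=False))
# ['Red', 'Green', 'Blue']
```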
    

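7. Language Analyzers

  • A family of analyzers tuned for specific languages, such as english, french, and german; Chinese, Japanese, and Korean text is covered by the cjk analyzer.
  • Example (English)
    POST _analyze
    {
      "analyzer": "english",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output (stop words are removed, the possessive 's is stripped, and the remaining words are stemmed, e.g. "foxes" to "fox" and "lazy" to "lazi")
    ["2", "quick", "brown", "fox", "jump", "over", "lazi", "dog", "bone"]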
    

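8. ICU Analyzer

  • Built on the ICU (International Components for Unicode) library, with support for tokenizing many languages.
  • Strictly speaking this one is not bundled with Elasticsearch: it is provided by the analysis-icu plugin, which must be installed before the request below will work.
  • Example
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output
    ["the", "2", "quick", "brown", "foxes", "jumped", "over", "the", "lazy", "dog's", "bone"]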
    

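9. Fingerprint Analyzer

  • Lowercases and normalizes the text, removes duplicate tokens, sorts the rest, and joins them into a single token: the "fingerprint". This is useful for detecting duplicate or near-duplicate text.
  • Example
    POST _analyze
    {
      "analyzer": "fingerprint",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

    Output (a single token; the duplicate "the" appears only once)
    ["2 bone brown dog's foxes jumped lazy over quick the"]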
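
The fingerprinting idea (normalize, deduplicate, sort, join) can be sketched in a few lines of Python. This is an approximation under stated assumptions: a regular expression replaces the real tokenizer, and stripping combining marks after NFKD decomposition stands in for ASCII folding. The names ascii_fold_like and fingerprint_like_analyze are hypothetical; the separator argument is modeled on the analyzer's separator parameter.

```python
import re
import unicodedata
from typing import List

def ascii_fold_like(token: str) -> str:
    """Approximate ASCII folding by dropping combining marks (ö becomes o)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fingerprint_like_analyze(text: str, separator: str = " ") -> List[str]:
    """Return a one-element list holding the sorted, de-duplicated tokens."""
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    unique = {ascii_fold_like(token) for token in tokens}  # set removes duplicates
    return [separator.join(sorted(unique))]

print(fingerprint_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ["2 bone brown dog's foxes jumped lazy over quick the"]
```

Because word order and repetition no longer matter, two inputs that differ only in those respects produce the same fingerprint.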
    

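Summary

Each of the analyzers above suits a different scenario. Choose the one that matches how your text should be split and normalized, or define a custom analyzer when none of them fits.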

