
Elasticsearch Analyzers Explained

What Is an Analyzer

1. Analyzer overview: an analyzer is a means of analyzing and processing text. Its basic logic is to split the original document into smaller-grained terms according to predefined tokenization rules; the granularity depends on the analyzer's rules.
The most common Chinese analyzer is IK, which offers two modes of different granularity: ik_max_word and ik_smart. For English, the standard analyzer is the usual choice.
ik_max_word splits the text at the finest granularity, exhausting every possible word combination; it suits term queries.
ik_smart splits the text at the coarsest granularity; it suits phrase queries.
The following requests show these analyzers in action:

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_max_word"
}

Response:
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "努力学习",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "努力",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "力学",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "学习",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "编程",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_smart"
}

Response:
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "努力学习",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "编程",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "standard"
}

Response:
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "努",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "力",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "学",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "习",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "编",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "程",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    }
  ]
}

When Analyzers Take Effect

  1. At index time, text fields are analyzed with the analyzer configured in the mapping, and the resulting terms are stored in the inverted index (see the mapping sketch after this list);
  2. At query time, a full-text query analyzes the search input and matches the resulting terms against those produced at index time.
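
A minimal sketch of how this is configured (the index name my_blog and the field content are hypothetical, and the IK plugin must be installed): analyzer applies at index time, while search_analyzer overrides it at query time. If search_analyzer is omitted, the index-time analyzer is also used for queries.

#analyzer is used when indexing, search_analyzer when querying
PUT my_blog
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}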

Anatomy of an Analyzer

Tokenizer: defines the tokenization (word-splitting) logic
Token Filter: processes the individual terms produced by tokenization
Character Filter: preprocesses characters in the raw text before tokenization
Note: analysis never modifies the source document (_source); it only affects the inverted index and the search terms. A sketch combining all three components follows.
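
As a hedged sketch of how the three components compose (the index name test_custom_analyzer and the component names strip_html, english_stop, and my_analyzer are hypothetical), a custom analyzer wires together zero or more character filters, exactly one tokenizer, and zero or more token filters:

#custom analyzer = character filter(s) + one tokenizer + token filter(s)
PUT test_custom_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_html": { "type": "html_strip" }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["strip_html"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  }
}

GET test_custom_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["<p>The QUICK Fox</p>"]
}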

Character Filter

Definition: preprocessing applied before tokenization, filtering out useless characters.
There are three kinds of character filters:

Character Filter - HTML Strip Character Filter

Strips HTML tags.

html_strip
Parameter: escaped_tags, the HTML tags to preserve
"type": "html_strip"

DELETE test_html_strip_filter
# character filter
PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
              "a"
            ]
        }
      }
    }
  }
}
 
GET test_html_strip_filter/_analyze
{
  "tokenizer": "standard",
  "char_filter": ["my_char_filter"],
  "text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}

Response:
{
  "tokens" : [
    {
      "token" : "I'm",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "so",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

The request above strips all HTML tags from the text, keeping only the <a> tag because it is listed in escaped_tags (which is why the tokens "a" still appear in the output).

Character Filter - Mapping Character Filter

Replaces specified characters or strings according to a mapping table defined in the index's analysis settings, thereby filtering out particular characters.
"type": "mapping"

##Mapping Character Filter 
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter":{
          "type":"mapping",
          "mappings":[
            "臭 => *",
            "傻=> *",
            "逼=> *"
            ]
        }
      },
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个臭傻逼"
}

Response:
{
  "tokens" : [
    {
      "token" : "你就是个***",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    }
  ]
}

Character Filter - Pattern Replace Character Filter

"type": "pattern_replace" performs regex-based replacement, with pattern giving the regular expression and replacement the substitution string.

##Pattern Replace Character Filter 
#17611001200
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter":{
          "type":"pattern_replace",
          "pattern":"(\\d{3})\\d{4}(\\d{4})",
          "replacement":"$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手机号是17611001200"
}

Response:
{
  "tokens" : [
    {
      "token" : "您的手机号是184****6831",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

Token Filter

Token filters process the terms produced after tokenization: for example case conversion, stopword removal, or synonym handling. Elasticsearch ships with many built-in token filters that cover most day-to-day needs, and third-party or custom token filters are supported as well.

Case conversion and stopwords with the standard tokenizer

# convert to uppercase
GET _analyze
{
  "tokenizer": "standard", 
  "filter": ["uppercase"],
  "text": ["www elastic co guide"]
}
 
# convert to lowercase
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["WWW ELASTIC CO GUIDE"]
}

Stopwords
Stopwords are the terms dropped after tokenization completes. They can be customized; the built-in stop filter defaults to the English stopword list (_english_).
The stopword definitions can be found in the analyzer plugin's configuration files.


GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"], 
  "text": ["what are you doing"]
}

(Figure: the IK analyzer's built-in stopword dictionary files.)
Custom stopwords

### custom stop filter
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": [
            "www"
          ],
          "ignore_case": true
        }
      }
    }
  }
}
GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_filter"], 
  "text": ["What www WWW are you doing"]
}
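
For reference (not verified against a live cluster), the response should look roughly like this: www and WWW are both removed because ignore_case is true, and the surviving tokens keep position gaps where the stopwords were dropped:

{
  "tokens" : [
    {
      "token" : "What",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "you",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "doing",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}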

Synonyms

# synonyms
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["good, nice => excellent"]
        }
      }
    }
  }
}
 
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_synonym"], 
  "text": ["good"]
}
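
Again not verified against a live cluster, but given the rule good, nice => excellent the filter replaces matching input tokens with the target, so the response should look roughly like:

{
  "tokens" : [
    {
      "token" : "excellent",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}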

Tokenizer

The tokenizer is a core component of an analyzer. Its job is tokenization, i.e. splitting the raw text into fine-grained pieces; each resulting piece is called a term. You can think of the tokenizer as a set of predefined splitting rules. Elasticsearch ships with many built-in tokenizers, the default being standard. A quick comparison of two other built-in tokenizers follows.
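
As a quick comparison (output omitted; easy to run in Kibana Dev Tools), the same text run through two other built-in tokenizers shows how the rules differ: whitespace splits only on whitespace, while keyword emits the entire input as a single term.

GET _analyze
{
  "tokenizer": "whitespace",
  "text": ["www.elastic.co guide"]
}

GET _analyze
{
  "tokenizer": "keyword",
  "text": ["www.elastic.co guide"]
}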

