Elasticsearch Analyzers Explained
What Is an Analyzer
1. Overview: an analyzer is a means of processing and analyzing text. Its basic logic is to split the original document into smaller tokens (terms) according to predefined tokenization rules; the granularity of the terms depends on those rules.
The most widely used Chinese analyzer is ik, which comes in two granularities: ik_max_word and ik_smart. For English text, the standard analyzer is the common choice.
ik_max_word splits the text at the finest granularity, exhausting every possible word combination; it suits Term Query.
ik_smart splits the text at the coarsest granularity; it suits Phrase queries.
The _analyze calls below show each analyzer in action:
GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_max_word"
}
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "努力学习",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "努力",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "力学",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "学习",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "编程",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_smart"
}
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "努力学习",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "编程",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "standard"
}
{
  "tokens" : [
    {
      "token" : "布",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "布",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "努",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "力",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "学",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "习",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "编",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "程",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    }
  ]
}
When Analyzers Take Effect
- At index time: text fields are tokenized with the analyzer configured in the mapping, and the resulting terms are stored in the inverted index. A common setup is sketched below.
- At search time: for full-text queries, the search input is tokenized as well, and those terms are matched against the terms produced at index time.
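A common pattern with the ik analyzers is to index with ik_max_word (fine-grained, better recall) and search with ik_smart (coarse-grained, better precision) by declaring both in the mapping. This is a minimal sketch; the index name goods and the field title are invented for illustration, and the ik plugin must be installed:
PUT goods
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
If search_analyzer is omitted, the analyzer configured for indexing is also used at search time.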
Components of an Analyzer
Tokenizer: defines the tokenization (word-splitting) logic.
Token Filter: processes the individual terms produced after tokenization.
Character Filter: processes individual characters before tokenization.
Note: an analyzer never modifies the source document; analysis only affects the inverted index and the search terms.
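The three components run in a fixed order: character filters first, then the tokenizer, then token filters. A minimal sketch of a custom analyzer wiring all three together (the index and component names are invented for illustration):
PUT test_custom_analyzer
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["is", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      }
    }
  }
}
GET test_custom_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["<p>This is a Test</p>"]
}
The _analyze call should return only the lowercased terms this and test: the HTML tags are stripped, the text is tokenized, the tokens are lowercased, and the stop words is and a are removed.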
Character Filter
Definition: preprocessing applied before tokenization to filter out unwanted characters.
There are three kinds of character filters:
Character filter - HTML tag filter: HTML Strip Character Filter
Strips HTML tags, configured with "type": "html_strip".
Parameter: escaped_tags, a list of HTML tags to keep.
DELETE test_html_strip_filter
# character filter
PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "a"
          ]
        }
      }
    }
  }
}
GET test_html_strip_filter/_analyze
{
  "tokenizer": "standard",
  "char_filter": ["my_char_filter"],
  "text": ["<p>I'm so <a>happy</a>!</p>"]
}
Result:
{
  "tokens" : [
    {
      "token" : "I'm",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "so",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
The request above strips every HTML tag from the text except <a>, which is preserved because it is listed in escaped_tags.
Character filter - character mapping filter: Mapping Character Filter
Replaces specific characters with configured substitutes, declared as a char_filter in the index's analysis settings, so that particular characters can be filtered or masked.
"type": "mapping"
##Mapping Character Filter
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "臭 => *",
            "傻 => *",
            "逼 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个臭傻逼"
}
{
  "tokens" : [
    {
      "token" : "你就是个***",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    }
  ]
}
Character filter - regex replace filter: Pattern Replace Character Filter
"type": "pattern_replace" performs regular-expression replacement.
##Pattern Replace Character Filter
#17611001200
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手机号是17611001200"
}
{
  "tokens" : [
    {
      "token" : "您的手机号是176****1200",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
Token Filter
Token filters process terms after tokenization is complete: converting case, removing stop words, applying synonyms, and so on. Elasticsearch ships with many built-in token filters that cover most day-to-day needs, and custom or third-party filters are supported as well.
Case conversion and stop words with the standard tokenizer
# convert to uppercase
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": ["www elastic co guide"]
}
# convert to lowercase
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["WWW ELASTIC CO GUIDE"]
}
Stop words
Stop words are terms that are dropped after tokenization. The stop word list can be customized.
The stop word definitions can be found in the analyzer plugin's configuration files.
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": ["what are you doing"]
}
(The IK analyzer ships its own stop word list in its plugin configuration.)
Custom stop words
### custom filter
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": [
            "www"
          ],
          "ignore_case": true
        }
      }
    }
  }
}
GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_filter"],
  "text": ["What www WWW are you doing"]
}
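Besides a hand-maintained list, the stop filter also accepts one of Elasticsearch's predefined language lists through the stopwords parameter, such as _english_. A quick sketch using an inline filter definition in _analyze:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": "_english_"
    }
  ],
  "text": ["what are you doing"]
}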
Synonyms
# synonyms
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["good, nice => excellent"]
        }
      }
    }
  }
}
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_synonym"],
  "text": ["good"]
}
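The rule "good, nice => excellent" uses explicit-mapping syntax: both good and nice are replaced by excellent at analysis time, which is what the _analyze call above returns. Listing terms without => declares them as equivalent instead, so each term expands to all of its synonyms. A sketch of the equivalent-synonym form, using an inline filter definition:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["good, nice"]
    }
  ],
  "text": ["good"]
}
Here good should come back alongside nice at the same position.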
Tokenizer
The tokenizer is the core component of an analyzer. Its main job is tokenization: splitting the original text into fine-grained pieces, each of which is called a term (token). You can think of a tokenizer as a set of predefined splitting rules. Elasticsearch ships with many built-in tokenizers; the default is standard.
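To get a feel for the differences, run the same input through two built-in tokenizers. standard splits on word boundaries and drops punctuation (and splits CJK text into single characters), while whitespace splits only on whitespace and keeps punctuation attached to the tokens:
# default tokenizer: word boundaries, punctuation dropped
GET _analyze
{
  "tokenizer": "standard",
  "text": ["Hello, World! 布布努力学习编程"]
}
# whitespace only: "Hello," and "World!" stay intact, the Chinese text is one token
GET _analyze
{
  "tokenizer": "whitespace",
  "text": ["Hello, World! 布布努力学习编程"]
}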