Elasticsearch Mapping and Analysis
Contents
Type deprecation
Why
Mapping
Query the mapping of an index
Create an index with mapping
Add a field with mapping
Data migration
1. Create a new index with the correct mapping
2. Reindex data into that index
Analysis
POST _analyze
Custom dictionary
ik analyzer
circuit_breaking_exception
Type deprecation
ES 6.x: type deprecation begins; new indices are limited to a single type
ES 7.x: types are deprecated but still supported
ES 8.x: types are removed entirely
After the deprecation, each index holds only one kind of document.
If you need to distinguish different kinds of documents, there are two approaches:
- create a separate index per kind
- add a custom field to each document that marks its kind
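A minimal sketch of the custom-field approach: mark each document's kind with a keyword field and filter on it at query time. The index name content and the field name doc_kind are made up for illustration:

```
PUT /content
{
  "mappings": {
    "properties": {
      "doc_kind": { "type": "keyword" },
      "title":    { "type": "text" }
    }
  }
}

GET /content/_search
{
  "query": {
    "bool": {
      "filter": [ { "term": { "doc_kind": "article" } } ]
    }
  }
}
```

The filter clause replaces what a type used to do, without risking field-type conflicts inside one index.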
Why
Elasticsearch's underlying storage (Lucene) is organized per index, not per type.
Within one index, documents of different types could declare fields with the same name but different data types; such field-type conflicts lead to inconsistent data and query errors.
GET /bank/_search
{
"query": {
"match": {
"address": "mill lane"
}
},
"_source": ["account_number","address"]
}
As the query shows, searches are issued against an index without naming a type. If different types declared conflicting address fields, the query could not resolve them consistently.
Mapping
Mapping defines how documents and their fields are stored and indexed.
Mapping is the mechanism Elasticsearch uses to define document structure and field types. It is analogous to a table schema in a relational database: it describes which fields a document contains, each field's data type (text, numeric, date, and so on), and other per-field options (whether the field is analyzed, whether it is indexed, etc.).
Mapping is one of Elasticsearch's core concepts; it determines how data is stored, indexed, and queried.
Query the mapping of an index
_mapping
GET /bank/_mapping
{
"bank" : {
"mappings" : {
"properties" : {
"account_number" : {
"type" : "long"
},
"address" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"age" : {
"type" : "long"
},
"balance" : {
"type" : "long"
},
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"employer" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"firstname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"gender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"lastname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"state" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
- A text field can carry a sub-field named keyword of type keyword. The keyword sub-field stores the exact, unanalyzed value for term-level matching.
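A hedged illustration of the difference, using the bank index above; the exact address value in the second query is assumed for demonstration:

```
# "match" analyzes the query text and searches the analyzed field:
# any document whose address contains the token "mill" matches.
GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}

# "term" on the keyword sub-field requires the whole stored value
# to match exactly, character for character.
GET /bank/_search
{
  "query": { "term": { "address.keyword": "198 Mill Lane" } }
}
```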
Create an index with mapping
PUT /{indexName}
PUT /my_index
{
"mappings": {
"properties": {
"account_number": {
"type": "long"
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"city": {
"type": "keyword"
}
}
}
}
Add a field with mapping
- PUT /{indexName}/_mapping with a mappings.properties-style request body
PUT /my_index/_mapping
{
"properties": {
"state": {
"type": "keyword",
"index": false
}
}
}
- "index": false means the field is stored but not indexed, so it cannot be searched on. The default is true.
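If a search is attempted on the non-indexed field, Elasticsearch rejects it because no inverted index was built for it. A sketch against the my_index created above:

```
GET /my_index/_search
{
  "query": { "term": { "state": "TX" } }
}
```

This fails with an error along the lines of "Cannot search on field [state] since it is not indexed."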
Data migration
ES does not support changing the mapping of an existing field. To change it, create a new index with the desired mapping and migrate the data into it.
1. Create a new index with the correct mapping
PUT /my_bank
{
"mappings": {
"properties": {
"account_number": {
"type": "long"
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "integer"
},
"balance": {
"type": "long"
},
"city": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"employer": {
"type": "keyword"
},
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"gender": {
"type": "keyword"
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"state": {
"type": "keyword"
}
}
}
}
2. Reindex data into that index
POST _reindex
{
"source": {
"index": "bank",
"type": "account"
},
"dest": {
"index": "my_bank"
}
}
- Specifying a type in the _reindex source was deprecated in 7.x and removed in ES 8.0; on modern versions, omit it.
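On 7.x and later, the same migration is written without the type parameter:

```
POST _reindex
{
  "source": { "index": "bank" },
  "dest":   { "index": "my_bank" }
}
```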
Analysis
Analysis splits text into individual tokens (terms).
POST _analyze
Standard analyzer
POST _analyze
{
"analyzer": "standard",
"text": ["it's test data","hello world"]
}
Response
{
"tokens" : [
{
"token" : "it's",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "data",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "hello",
"start_offset" : 15,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "world",
"start_offset" : 21,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
Custom dictionary
Under the nginx html directory, create es/term.text and add the custom terms, one per line.
Configure the ik remote dictionary in /elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml.
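A sketch of the relevant entry in IKAnalyzer.cfg.xml; the nginx address below is an assumption for illustration and must point at wherever es/term.text is actually served:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary files, relative to this config -->
    <entry key="ext_dict"></entry>
    <!-- local extension stopword files -->
    <entry key="ext_stopwords"></entry>
    <!-- remote extension dictionary: hypothetical nginx host serving es/term.text -->
    <entry key="remote_ext_dict">http://192.168.56.10/es/term.text</entry>
</properties>
```

IK polls the remote dictionary periodically, so word-list updates are picked up without a restart; restarting Elasticsearch is needed only after changing the config file itself.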
Test
POST _analyze
{
"analyzer": "ik_smart",
"text": "尚硅谷项目谷粒商城"
}
尚硅谷 and 谷粒商城 are entries from the term.text dictionary.
Response
{
"tokens" : [
{
"token" : "尚硅谷",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "项目",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "谷粒商城",
"start_offset" : 5,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 2
}
]
}
ik analyzer
Chinese-language analyzer plugin.
GitHub:
https://github.com/infinilabs/analysis-ik
Install: enter the Elasticsearch docker container and install the plugin (restart ES afterwards so it loads):
bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/7.4.2
Uninstall:
elasticsearch-plugin remove analysis-ik
Test
POST _analyze
{
"analyzer": "ik_smart",
"text": "我要成为java高手"
}
Response
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "要",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "成为",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "java",
"start_offset" : 4,
"end_offset" : 8,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "高手",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 4
}
]
}
circuit_breaking_exception
The circuit breaker tripped: serving the request would push JVM memory use over the parent breaker's limit, so ES rejects it with HTTP 429.
{
"error": {
"root_cause": [
{
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
"bytes_wanted": 124604192,
"bytes_limit": 123273216,
"durability": "PERMANENT"
}
],
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
"bytes_wanted": 124604192,
"bytes_limit": 123273216,
"durability": "PERMANENT"
},
"status": 429
}
Check the ES logs
docker logs elasticsearch
Check Elasticsearch's memory usage
GET /_cat/nodes?v&h=name,heap.percent,ram.percent
- If heap.percent or ram.percent is near 100%, the node is short on memory.
Increase the Elasticsearch heap
Remove and recreate the container with larger -Xms and -Xmx values (e.g. -Xmx256m):
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms64m -Xmx256m" \
  -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.4.2