Elasticsearch Mapping and Analysis
Contents
Type deprecation
Why
Mapping
Query the mapping of an index
Create an index with mapping
Add a field with mapping
Data migration
1. Create a new index with the correct mapping
2. Reindex data into that index
Analysis
POST _analyze
Custom dictionary
ik analyzer
circuit_breaking_exception
Type deprecation
ES 6.x: type deprecation begins; new indices are limited to a single type
ES 7.x: types are deprecated but still supported
ES 8.x: types are removed entirely
After the deprecation, each index holds only one kind of document.
If you need to distinguish different kinds of documents, there are two approaches:
- create a separate index per kind
- add a custom field to each document that marks its kind
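A minimal sketch of the custom-field approach: mark each document's kind with a keyword field and filter on it at query time. The index name content and the field name doc_kind are made up for illustration:

```
PUT /content
{
  "mappings": {
    "properties": {
      "doc_kind": { "type": "keyword" },
      "title":    { "type": "text" }
    }
  }
}

GET /content/_search
{
  "query": {
    "bool": {
      "filter": [ { "term": { "doc_kind": "article" } } ]
    }
  }
}
```

The filter clause replaces what a type used to do, without risking field-type conflicts inside one index.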
Why
Elasticsearch's underlying storage (Lucene) is organized per index, not per type.
Within one index, documents of different types could declare fields with the same name but different data types; such field-type conflicts lead to inconsistent data and query errors.
GET /bank/_search
{
"query": {
"match": {
"address": "mill lane"
}
},
"_source": ["account_number","address"]
}
As the query shows, searches are issued against an index without naming a type. If different types declared conflicting address fields, the query could not resolve them consistently.
Mapping
Mapping defines how documents and their fields are stored and indexed.
Mapping is the mechanism Elasticsearch uses to define document structure and field types. It is analogous to a table schema in a relational database: it describes which fields a document contains, each field's data type (text, numeric, date, and so on), and other per-field options (whether the field is analyzed, whether it is indexed, etc.).
Mapping is one of Elasticsearch's core concepts; it determines how data is stored, indexed, and queried.
Query the mapping of an index
_mapping
GET /bank/_mapping
{
"bank" : {
"mappings" : {
"properties" : {
"account_number" : {
"type" : "long"
},
"address" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"age" : {
"type" : "long"
},
"balance" : {
"type" : "long"
},
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"employer" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"firstname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"gender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"lastname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"state" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
- A text field can carry a sub-field named keyword of type keyword. The keyword sub-field stores the exact, unanalyzed value for term-level matching.
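A hedged illustration of the difference, using the bank index above; the exact address value in the second query is assumed for demonstration:

```
# "match" analyzes the query text and searches the analyzed field:
# any document whose address contains the token "mill" matches.
GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}

# "term" on the keyword sub-field requires the whole stored value
# to match exactly, character for character.
GET /bank/_search
{
  "query": { "term": { "address.keyword": "198 Mill Lane" } }
}
```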
Create an index with mapping
PUT /{indexName}
PUT /my_index
{
"mappings": {
"properties": {
"account_number": {
"type": "long"
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"city": {
"type": "keyword"
}
}
}
}
Add a field with mapping
- PUT /{indexName}/_mapping with a mappings.properties-style request body
PUT /my_index/_mapping
{
"properties": {
"state": {
"type": "keyword",
"index": false
}
}
}
- "index": false means the field is stored but not indexed, so it cannot be searched on. The default is true.
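If a search is attempted on the non-indexed field, Elasticsearch rejects it because no inverted index was built for it. A sketch against the my_index created above:

```
GET /my_index/_search
{
  "query": { "term": { "state": "TX" } }
}
```

This fails with an error along the lines of "Cannot search on field [state] since it is not indexed."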
Data migration
ES does not support changing the mapping of an existing field. To change it, create a new index with the desired mapping and migrate the data into it.
1. Create a new index with the correct mapping
PUT /my_bank
{
"mappings": {
"properties": {
"account_number": {
"type": "long"
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "integer"
},
"balance": {
"type": "long"
},
"city": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"employer": {
"type": "keyword"
},
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"gender": {
"type": "keyword"
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"state": {
"type": "keyword"
}
}
}
}
2. Reindex data into that index
POST _reindex
{
"source": {
"index": "bank",
"type": "account"
},
"dest": {
"index": "my_bank"
}
}
- Specifying a type in the _reindex source was deprecated in 7.x and removed in ES 8.0; on modern versions, omit it.
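On 7.x and later, the same migration is written without the type parameter:

```
POST _reindex
{
  "source": { "index": "bank" },
  "dest":   { "index": "my_bank" }
}
```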
Analysis
Analysis splits text into individual tokens (terms).
POST _analyze
Standard analyzer
POST _analyze
{
"analyzer": "standard",
"text": ["it's test data","hello world"]
}
Response
{
"tokens" : [
{
"token" : "it's",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "test",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "data",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "hello",
"start_offset" : 15,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "world",
"start_offset" : 21,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
Custom dictionary
Under the nginx html directory, create es/term.text and add the custom terms, one per line.
Configure the ik remote dictionary in /elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml.
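A sketch of the relevant entry in IKAnalyzer.cfg.xml; the nginx address below is an assumption for illustration and must point at wherever es/term.text is actually served:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary files, relative to this config -->
    <entry key="ext_dict"></entry>
    <!-- local extension stopword files -->
    <entry key="ext_stopwords"></entry>
    <!-- remote extension dictionary: hypothetical nginx host serving es/term.text -->
    <entry key="remote_ext_dict">http://192.168.56.10/es/term.text</entry>
</properties>
```

IK polls the remote dictionary periodically, so word-list updates are picked up without a restart; restarting Elasticsearch is needed only after changing the config file itself.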
Test
POST _analyze
{
"analyzer": "ik_smart",
"text": "尚硅谷项目谷粒商城"
}
尚硅谷 and 谷粒商城 are entries from the term.text dictionary.
Response
{
"tokens" : [
{
"token" : "尚硅谷",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "项目",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "谷粒商城",
"start_offset" : 5,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 2
}
]
}
ik analyzer
Chinese-language analyzer plugin.
GitHub:
https://github.com/infinilabs/analysis-ik
Install: enter the Elasticsearch docker container and install the plugin (restart ES afterwards so it loads):
bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/7.4.2
Uninstall:
elasticsearch-plugin remove analysis-ik
Test
POST _analyze
{
"analyzer": "ik_smart",
"text": "我要成为java高手"
}
Response
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "要",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "成为",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "java",
"start_offset" : 4,
"end_offset" : 8,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "高手",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 4
}
]
}
circuit_breaking_exception
The circuit breaker tripped: serving the request would push JVM memory use over the parent breaker's limit, so ES rejects it with HTTP 429.
{
"error": {
"root_cause": [
{
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
"bytes_wanted": 124604192,
"bytes_limit": 123273216,
"durability": "PERMANENT"
}
],
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
"bytes_wanted": 124604192,
"bytes_limit": 123273216,
"durability": "PERMANENT"
},
"status": 429
}
Check the ES logs
docker logs elasticsearch
Check Elasticsearch's memory usage
GET /_cat/nodes?v&h=name,heap.percent,ram.percent
- If heap.percent or ram.percent is near 100%, the node is short on memory.
Increase the Elasticsearch heap
Remove and recreate the container with larger -Xms and -Xmx values (e.g. -Xmx256m):
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms64m -Xmx256m" \
  -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.4.2