
Elasticsearch Mapping and Tokenization

Contents

Deprecation of Type

Why

Mapping

Viewing the mapping of an index

Creating an index with a mapping

Adding a field with a mapping

Data migration

1. Create a new index with the correct mapping

2. Reindex the data into that index

Tokenization

POST _analyze

Custom dictionary

IK analyzer

circuit_breaking_exception


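Deprecation of Type

ES 6.x: types start to be deprecated; a newly created index may contain only one type.

ES 7.x: types are deprecated and played down in the APIs, but still accepted.

ES 8.x: types are removed entirely.

Once types are gone, each index holds only one kind of document.

If you need to tell different kinds of documents apart, there are two options:

  • Create a separate index for each kind.
  • Add a custom field to the documents that records the kind.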
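
The second option can be sketched in Python as a small helper that builds a search body restricted to one kind of document. This is only an illustration: the field name doc_kind and the helper search_body are assumptions made for the example, not something Elasticsearch defines.

```python
import json


def search_body(doc_kind, field, text):
    """Build a search body that matches `text` in `field` for one document kind."""
    return {
        "query": {
            "bool": {
                # Full-text condition, scored as usual.
                "must": [{"match": {field: text}}],
                # `doc_kind` is a hypothetical keyword field added to every document.
                "filter": [{"term": {"doc_kind": doc_kind}}],
            }
        }
    }


body = search_body("account", "address", "mill lane")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a GET /{indexName}/_search request. Putting the kind in a filter clause keeps it out of relevance scoring.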

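Why

Elasticsearch's underlying storage engine (Lucene) is organized by index, not by type.

Within a single index, documents of different types could declare fields with the same name but different data types. Such field type conflicts lead to inconsistent data and query errors.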

GET /bank/_search
{
  "query": {
    "match": {
      "address": "mill lane"
    }
  },
  "_source": ["account_number","address"]
}

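As the query shows, a search targets an index and does not name a type. If different types each defined their own address field, the query would run into a conflict.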
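
To make the conflict concrete, here is a rough Python sketch that checks a set of per-type field declarations for names that are mapped to more than one data type. The two types and their fields are made up for illustration, and find_conflicts is a hypothetical helper, not an Elasticsearch API.

```python
def find_conflicts(type_mappings):
    """Return {field: {type_name: field_type}} for fields declared with differing types."""
    seen = {}
    for type_name, fields in type_mappings.items():
        for field, field_type in fields.items():
            seen.setdefault(field, {})[type_name] = field_type
    # A field conflicts when the types sharing its name disagree on the data type.
    return {f: d for f, d in seen.items() if len(set(d.values())) > 1}


# Hypothetical index with two types that share field names.
mappings = {
    "account": {"address": "text", "age": "long"},
    "branch": {"address": "keyword", "age": "long"},
}
conflicts = find_conflicts(mappings)
print(conflicts)
```

Here address is flagged because one type treats it as text and the other as keyword, while age is fine because both agree on long.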


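Mapping

A mapping defines how documents and their fields are stored and searched.

A mapping is the mechanism Elasticsearch uses to define document structure and field types. It is comparable to a table schema in a relational database: it describes which fields a document contains, each field's data type (text, numeric, date, and so on), and other field attributes (such as whether the field is analyzed or indexed).

Mapping is one of Elasticsearch's core concepts; it determines how data is stored, indexed, and queried.

Viewing the mapping of an index

Use the _mapping endpoint:

GET /bank/_mapping

 Response
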
{
  "bank" : {
    "mappings" : {
      "properties" : {
        "account_number" : {
          "type" : "long"
        },
        "address" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "age" : {
          "type" : "long"
        },
        "balance" : {
          "type" : "long"
        },
        "city" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "email" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "employer" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "firstname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "gender" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "state" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}
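  • A text field can carry a sub-field, here named keyword, whose type is keyword. A keyword field stores the exact, unanalyzed value; "ignore_above": 256 means values longer than 256 characters are not indexed in that sub-field.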
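
The practical difference between the text field and its keyword sub-field shows up in how they are queried. The Python sketch below builds both kinds of request body; the helper names and the sample value "198 Mill Lane" are assumptions for illustration only.

```python
import json


def full_text_query(field, text):
    """Analyzed search: the query text is tokenized and matched against the text field."""
    return {"query": {"match": {field: text}}}


def exact_query(field, value):
    """Exact search: the whole value is compared against the keyword sub-field."""
    return {"query": {"term": {f"{field}.keyword": value}}}


analyzed = full_text_query("address", "mill lane")
exact = exact_query("address", "198 Mill Lane")  # illustrative value, not from the source
print(json.dumps(analyzed))
print(json.dumps(exact))
```

The match query finds documents whose address contains either term, whereas the term query on address.keyword only finds documents whose address is exactly the given string.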

Creating an index with a mapping

PUT /{indexName}

PUT /my_index
{
  "mappings": {
    "properties": {
      "account_number": {
        "type": "long"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "city": {
        "type": "keyword"
      }
    }
  }
}

Adding a field with a mapping

  • PUT /{indexName}/_mapping with a request body that contains properties
PUT /my_index/_mapping
{
  "properties": {
    "state": {
      "type": "keyword",
      "index": false
    }
  }
}
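  • "index": false means the field is not indexed and therefore cannot be searched, although its value is still returned in _source. The default is true.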

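Data migration

 Elasticsearch does not allow changing the mapping of a field that already exists (new fields can still be added, as shown above). To change an existing mapping, create a new index and migrate the data into it.

1. Create a new index with the correct mapping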

PUT /my_bank
{
  "mappings": {
    "properties": {
      "account_number": {
        "type": "long"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "age": {
        "type": "integer"
      },
      "balance": {
        "type": "long"
      },
      "city": {
        "type": "keyword"
      },
      "email": {
        "type": "keyword"
      },
      "employer": {
        "type": "keyword"
      },
      "firstname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "gender": {
        "type": "keyword"
      },
      "lastname": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "state": {
        "type": "keyword"
      }
    }
  }
}

2. Reindex the data into that index

POST _reindex
{
  "source": {
    "index": "bank",
    "type": "account"
  },
  "dest": {
    "index": "my_bank"
  }
}
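  • The type parameter in source is deprecated in 7.x and no longer accepted in ES 8.0; on indices without types, simply leave it out.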
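
Because the type parameter is only relevant on older clusters, a client can build the _reindex body so that it is included only when asked for. The following Python sketch is a minimal illustration; reindex_body is a hypothetical helper name.

```python
import json


def reindex_body(source_index, dest_index, source_type=None):
    """Build a _reindex request body; `type` is added only when explicitly given."""
    source = {"index": source_index}
    if source_type is not None:
        # Only meaningful on older clusters that still use mapping types.
        source["type"] = source_type
    return {"source": source, "dest": {"index": dest_index}}


legacy = reindex_body("bank", "my_bank", source_type="account")
typeless = reindex_body("bank", "my_bank")
print(json.dumps(typeless, indent=2))
```

The typeless body is what would be posted to POST _reindex on a cluster where types are gone.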


Tokenization

        Tokenization splits text into individual terms (tokens).

POST _analyze

Standard analyzer

POST _analyze
{
  "analyzer": "standard",
  "text": ["it's test data","hello world"]
}

 Response

{
  "tokens" : [
    {
      "token" : "it's",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "data",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hello",
      "start_offset" : 15,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "world",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
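
The fields in such a response (token, start_offset, end_offset, position) can be reproduced with a very rough Python approximation. This sketch is not the standard analyzer, which follows Unicode text segmentation rules; it only splits on runs of ASCII letters, digits, and apostrophes and lowercases them, and simple_tokenize is a name chosen for the example.

```python
import re


def simple_tokenize(text):
    """Split text into lowercase word tokens with character offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[0-9A-Za-z']+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the text
        })
    return tokens


tokens = simple_tokenize("it's test data")
for t in tokens:
    print(t)
```

For this input the sketch yields the same three tokens and offsets as the first three entries of the response above.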

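Custom dictionary

Under the nginx/html directory, create es/term.text and add the dictionary entries, one per line.

Configure the IK remote dictionary in /elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml by pointing the remote_ext_dict entry at the URL that serves term.text.

 Test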

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "尚硅谷项目谷粒商城"
}

 尚硅谷 and 谷粒商城 are entries in the term.text dictionary.

 Response

{
  "tokens" : [
    {
      "token" : "尚硅谷",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "项目",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "谷粒商城",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
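
Why dictionary entries change the result can be illustrated with forward maximum matching, a simple dictionary-based segmentation technique. This is a simplified sketch and not IK's actual algorithm; the base dictionary below is an assumption made up for the example, and the output for it does not claim to be what IK returns without the custom entries.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "尚硅谷项目谷粒商城"
base = {"项目", "商城"}                # hypothetical built-in words
custom = base | {"尚硅谷", "谷粒商城"}  # plus the entries from term.text

without_custom = forward_max_match(text, base)
with_custom = forward_max_match(text, custom)
print(without_custom)
print(with_custom)
```

With only the base words, the unknown names fall apart into single characters; once the custom entries are present, they are kept whole, which mirrors the response shown above.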

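IK analyzer

        Chinese word segmentation.

GitHub repository

https://github.com/infinilabs/analysis-ik

    Installation

    Run the following inside the Elasticsearch Docker container to install the IK plugin, then restart Elasticsearch so the plugin is loaded:

    bin/elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/7.4.2

    Uninstalling the plugin

    elasticsearch-plugin remove analysis-ik

    Test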

    POST _analyze
    {
      "analyzer": "ik_smart",
      "text": "我要成为java高手"
    }

    Response 

    {
      "tokens" : [
        {
          "token" : "我",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "CN_CHAR",
          "position" : 0
        },
        {
          "token" : "要",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "CN_CHAR",
          "position" : 1
        },
        {
          "token" : "成为",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "java",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "ENGLISH",
          "position" : 3
        },
        {
          "token" : "高手",
          "start_offset" : 8,
          "end_offset" : 10,
          "type" : "CN_WORD",
          "position" : 4
        }
      ]
    }

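    circuit_breaking_exception

    The circuit breaker has been tripped: serving the request would push memory usage past the parent breaker's limit, so Elasticsearch rejects it with status 429.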

    {
        "error": {
            "root_cause": [
                {
                    "type": "circuit_breaking_exception",
                    "reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
                    "bytes_wanted": 124604192,
                    "bytes_limit": 123273216,
                    "durability": "PERMANENT"
                }
            ],
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<http_request>] would be [124604192/118.8mb], which is larger than the limit of [123273216/117.5mb], real usage: [124604192/118.8mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1788/1.7kb, in_flight_requests=0/0b, accounting=225547/220.2kb]",
            "bytes_wanted": 124604192,
            "bytes_limit": 123273216,
            "durability": "PERMANENT"
        },
        "status": 429
    }
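
The bytes_wanted and bytes_limit fields make it easy to see how far over the limit the request was. The Python sketch below parses a trimmed copy of the error above (only the fields it needs are kept) and reports the difference; breaker_overshoot is a hypothetical helper name.

```python
import json

# Trimmed copy of the error response shown above.
error_json = """
{
  "error": {
    "type": "circuit_breaking_exception",
    "bytes_wanted": 124604192,
    "bytes_limit": 123273216
  },
  "status": 429
}
"""


def breaker_overshoot(response_text):
    """Return bytes_wanted - bytes_limit, or None if this is not a breaker error."""
    error = json.loads(response_text).get("error", {})
    if error.get("type") != "circuit_breaking_exception":
        return None
    return error["bytes_wanted"] - error["bytes_limit"]


over = breaker_overshoot(error_json)
print(f"over the limit by {over} bytes ({over / 1024 / 1024:.2f} MiB)")
```

A small overshoot like this one on a limit of roughly 117.5mb suggests the heap itself is too small rather than that a single request is unusually large.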

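    Check the Elasticsearch logs

    docker logs elasticsearch

    Check Elasticsearch memory usage

    GET /_cat/nodes?v&h=name,heap.percent,ram.percent
    • If heap.percent or ram.percent is close to 100%, the node is running short of memory.

     Increase the Elasticsearch heap

    Remove and recreate the container with larger -Xms and -Xmx values (here -Xmx is raised to 256m).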

    docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
      -e "discovery.type=single-node" \
      -e ES_JAVA_OPTS="-Xms64m -Xmx256m" \
      -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
      -v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
      -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
      -d elasticsearch:7.4.2

