Notes on Elasticsearch hot-cold separation: indices that could not be allocated correctly

April 10, 2020

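Preface

Our Elasticsearch deployment started out as a single node that a colleague had set up. After taking it over I began using mappings, clustering, and other features, but many old indices had picked up odd settings during the single-node days, which made adding new features painful. For example, when I enabled hot-cold separation, the old indices could not be allocated correctly.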

1. The problem

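Once an ILM (index lifecycle management) policy is configured, Elasticsearch runs it automatically and moves indices between the nodes of the cluster as the policy dictates. After it had been running for a while, though, I noticed that the old indices, the ones shipped in by Logstash without an explicit mapping, never got through ILM, while all other data was handled normally.
At the time I did not know Elasticsearch well, so all I could do was copy the error message into a search engine. Almost none of the answers were satisfying; the most common advice was to upgrade Elasticsearch. So I ended up writing this post.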

2. Checking the ILM status of the index

Because ILM runs on its own, the place to start is the ILM explain API, which shows where the index is stuck:

`GET so1n_index/_ilm/explain`

The response is below. so1n_index stands in for the real index name, here and in the rest of this post.

{
  "indices" : {
    "so1n_index" : {
      "index" : "so1n_index",
      "managed" : true,
      "policy" : "so1n_index_policy",
      "lifecycle_date_millis" : 1585008002962,
      "age" : "35.41d",
      "phase" : "cold",
      "phase_time_millis" : 1587686411329,
      "action" : "allocate",
      "action_time_millis" : 1587686419823,
      "step" : "check-allocation",
      "step_time_millis" : 1587686428607,
      "step_info" : {
        "message" : "Waiting for [2] shards to be allocated to nodes matching the given filters",
        "shards_left_to_allocate" : 2,
        "all_shards_active" : true,
        "actual_replicas" : 1
      },
      "phase_execution" : {
        "policy" : "so1n_index_policy",
        "phase_definition" : {
          "min_age" : "31d",
          "actions" : {
            "freeze" : { },
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "cold"
              }
            },
            "set_priority" : {
              "priority" : 10
            }
          }
        },
        "version" : 3,
        "modified_date_in_millis" : 1587663954354
      }
    }
  }
}

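The output shows that the policy so1n_index_policy moves an index into the cold phase once its age passes the min_age of 31 days, and that the allocate action of that phase is stuck in the check-allocation step with the message Waiting for [2] shards to be allocated to nodes matching the given filters. In other words, ILM got as far as reallocating the shards of so1n_index and stalled there, so the next thing to find out is why check-allocation never passes.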
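When many indices are managed, reading each explain response by hand gets tedious. The following is a minimal Python sketch, not part of the original troubleshooting, that picks out the indices waiting in check-allocation from an ILM explain response that has already been parsed into a dict. The function name stuck_allocations and the trimmed sample are illustrative assumptions; only the field names come from the response above.

```python
def stuck_allocations(explain_response):
    """Return {index_name: shards_left} for indices waiting in check-allocation."""
    stuck = {}
    for name, info in explain_response.get("indices", {}).items():
        if info.get("step") != "check-allocation":
            continue
        # step_info is only a progress report; 0 shards left means nothing is stuck.
        left = info.get("step_info", {}).get("shards_left_to_allocate", 0)
        if left > 0:
            stuck[name] = left
    return stuck


# Trimmed copy of the response shown above.
sample = {
    "indices": {
        "so1n_index": {
            "managed": True,
            "phase": "cold",
            "action": "allocate",
            "step": "check-allocation",
            "step_info": {
                "shards_left_to_allocate": 2,
                "all_shards_active": True,
            },
        }
    }
}

print(stuck_allocations(sample))  # {'so1n_index': 2}
```

Fetching the response (for example from `GET */_ilm/explain`) is left out on purpose so the sketch stays independent of any particular HTTP client.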

3. Checking the allocation of the index

The cluster allocation explain API reports directly why a shard cannot be allocated:

GET /_cluster/allocation/explain
{
  "index": "so1n_index",
  "shard": 0,
  "primary": true
}

The response:

{
  "index" : "so1n_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "so1n_id_1",
    "name" : "so1n-elastic-node-1",
    "transport_address" : "10.142.0.1:9300",
    "attributes" : {
      "ml.machine_memory" : "7839637504",
      "xpack.installed" : "true",
      "box_type" : "hot",
      "ml.max_open_jobs" : "20"
    }
  },
  "can_remain_on_current_node" : "no",
  "can_remain_decisions" : [
    {
      "decider" : "filter",
      "decision" : "NO",
      "explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"cold",_id:"so1n_id_3"]"""
    }
  ],
  "can_move_to_other_node" : "no",
  "move_explanation" : "cannot move shard to another node, even though it is not allowed to remain on its current node",
  "node_allocation_decisions" : [
    {
      "node_id" : "so1n_id_2",
      "node_name" : "so1n-elastic-node-2",
      "transport_address" : "10.142.0.2:9300",
      "node_attributes" : {
        "ml.machine_memory" : "7839653888",
        "ml.max_open_jobs" : "20",
        "box_type" : "cold",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"cold",_id:"so1n_id_3"]"""
        }
      ]
    }
  ]
}

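This reveals an extra filter on the index: a shard may only be placed on a node whose id is so1n_id_3 and whose box_type is cold. The candidate node does have box_type cold, but its id is so1n_id_2, so it is rejected. The shard cannot stay on its current node either, because that node has box_type hot.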
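To make the decision above concrete, here is a simplified Python model of how a set of require filters is checked against a node. It is only a sketch under stated assumptions: it handles exact string matches and the built-in _id attribute, whereas Elasticsearch's real filter decider also supports wildcards, comma-separated values, and other built-in attributes. The function name unmet_require_filters is hypothetical.

```python
def unmet_require_filters(node_id, node_attrs, require):
    """Return the require filters that the node does not satisfy.

    Simplified model: every filter must match exactly. "_id" is compared
    with the node id; any other key is compared with a custom node attribute.
    """
    unmet = {}
    for key, wanted in require.items():
        actual = node_id if key == "_id" else node_attrs.get(key)
        if actual != wanted:
            unmet[key] = wanted
    return unmet


# Filters reported by the allocation explain response above.
require = {"box_type": "cold", "_id": "so1n_id_3"}

# Current node: wrong box_type and wrong id.
print(unmet_require_filters("so1n_id_1", {"box_type": "hot"}, require))
# {'box_type': 'cold', '_id': 'so1n_id_3'}

# Candidate node: box_type matches, but the id filter still fails.
print(unmet_require_filters("so1n_id_2", {"box_type": "cold"}, require))
# {'_id': 'so1n_id_3'}
```

Because require filters are combined with AND, a single stray _id filter is enough to rule out every node except the one it names.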
Next, look at the index settings themselves (they can be viewed in Kibana under the index's settings):

{
  "index.blocks.read_only_allow_delete": "false",
  "index.query.default_field": [
    "*"
  ],
  "index.refresh_interval": "1s",
  "index.write.wait_for_active_shards": "1",
  "index.lifecycle.name": "so1n_index_policy",
  "index.routing.allocation.require._id": "so1n_id_3",
  "index.routing.allocation.require.box_type": "cold",
  "index.blocks.write": "true",
  "index.priority": "10",
  "index.number_of_replicas": "1"
}

There is one extra entry:

`"index.routing.allocation.require._id": "so1n_id_3"`

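I never configured this setting. My guess was that, because the index had no mapping, Elasticsearch added index.routing.allocation.require._id on its own when data was sent to node so1n_id_3. I have not confirmed this, and Elasticsearch does not normally add allocation filters while indexing documents. A more common source is a shrink operation: the ILM shrink action sets index.routing.allocation.require._id to gather all shards of an index on one node before shrinking it.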
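Since the same leftover may exist on other old indices, it helps to list every index that carries it. The sketch below is an illustration rather than the method used in the original investigation. It assumes the settings were fetched with `GET */_settings?flat_settings=true` and parsed into a dict; the function name pinned_indices and the second sample index are hypothetical.

```python
PINNED_SETTING = "index.routing.allocation.require._id"


def pinned_indices(settings_response):
    """Return {index_name: node_id} for indices pinned to a single node id."""
    return {
        name: body["settings"][PINNED_SETTING]
        for name, body in settings_response.items()
        if PINNED_SETTING in body.get("settings", {})
    }


# Shape of a flat_settings response, trimmed; other_index is made up.
sample_settings = {
    "so1n_index": {
        "settings": {
            "index.routing.allocation.require._id": "so1n_id_3",
            "index.routing.allocation.require.box_type": "cold",
        }
    },
    "other_index": {"settings": {"index.number_of_replicas": "1"}},
}

print(pinned_indices(sample_settings))  # {'so1n_index': 'so1n_id_3'}
```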

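4. The fix

As shown above, the old indices carry:

`"index.routing.allocation.require._id": "so1n_id_3"`

which is what kept them from being allocated. Our workload does not need this restriction, and I am sure that no index in this cluster needs index.routing.allocation.require._id, so it can be cleared everywhere with the request below. (If some indices do need the setting, replace * with a narrower wildcard pattern that matches only the indices to change.)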

PUT */_settings
{
   "index.routing.allocation.require._id": null
}
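When only some indices should be changed, it is worth previewing what a narrower pattern would touch before sending the request. The following Python sketch builds the request line and body for a given pattern and lists the known index names it matches. It is an assumption-laden illustration: the helper build_unpin_request and the sample index names are hypothetical, and fnmatch only approximates Elasticsearch's index wildcard matching (it agrees for plain * patterns).

```python
import json
from fnmatch import fnmatchcase


def build_unpin_request(pattern, index_names):
    """Return (request_line, json_body, matched_names) for clearing the _id filter."""
    matched = sorted(n for n in index_names if fnmatchcase(n, pattern))
    # Setting a value to null removes the setting from the index.
    body = json.dumps({"index.routing.allocation.require._id": None})
    return f"PUT {pattern}/_settings", body, matched


# Hypothetical index names, for illustration only.
names = ["so1n_index", "so1n_index_2", "other_index"]

line, body, matched = build_unpin_request("so1n_*", names)
print(line)     # PUT so1n_*/_settings
print(body)     # {"index.routing.allocation.require._id": null}
print(matched)  # ['so1n_index', 'so1n_index_2']
```

The printed request line and body can be pasted into the Kibana console once the matched list looks right.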

After the change, call explain again:

GET /_cluster/allocation/explain
{
  "index": "so1n_index",
  "shard": 0,
  "primary": true
}

The response shows that Elasticsearch is already relocating the shard in the background:

{
  "index" : "so1n_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "relocating",
  "current_node" : {
    "id" : "so1n_id_1",
    "name" : "so1n-elastic-node-1",
    "transport_address" : "10.142.0.1:9300",
    "attributes" : {
      "ml.machine_memory" : "7839637504",
      "xpack.installed" : "true",
      "box_type" : "hot",
      "ml.max_open_jobs" : "20"
    }
  },
  "explanation" : "the shard is in the process of relocating from node [so1n-elastic-node-1] to node [so1n-elastic-node-4], wait until relocation has completed"
}

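Once relocation has finished, the same call returns a rebalance_explanation field that reads: cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance
At this point every index follows the hot-cold separation rules and ILM runs to completion.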

Author: So1n
Link to this post: http://so1n.me/2020/04/10/%E8%AE%B0%E4%B8%80%E6%AC%A1ElasticSearch%E5%86%B7%E7%83%AD%E5%88%86%E7%A6%BB%E7%B4%A2%E5%BC%95%E6%97%A0%E6%B3%95%E6%AD%A3%E7%A1%AE%E5%88%86%E9%85%8D%E7%9A%84%E9%97%AE%E9%A2%98/index.html
License: all articles on this blog are published under the BY-NC-SA license. Please credit the source when reposting.
