Troubleshooting an ElasticSearch Hot-Cold Index That Would Not Allocate Correctly


Preface

ElasticSearch was originally a single node set up by a colleague. After taking it over I started using mappings, a multi-node cluster, and other features, but many of the old indices had picked up odd settings back in the single-node days, which made adding new features troublesome. For example, when I enabled hot-cold separation, the old indices could not be allocated correctly.

1. The problem

Once ILM is configured on ElasticSearch, it runs on its own according to the policy and migrates indices across the cluster accordingly. After it had been running for a while, I noticed that the old indices written by Logstash without a mapping never completed ILM, while the rest of the data was handled normally.
At the time I did not know Es very well, so all I could do was copy the error message and search for it. Almost none of the answers were satisfactory; the most common advice was simply to upgrade the ElasticSearch version, hence yet another post.

2. Checking the index's ILM status

Since ILM runs automatically on its own, we can find out where things go wrong by checking its execution status:

GET so1n_index/_ilm/explain

The response is shown below; so1n_index is used in place of the real index name (here and throughout).

{
  "indices" : {
    "so1n_index" : {
      "index" : "so1n_index",
      "managed" : true,
      "policy" : "so1n_index_policy",
      "lifecycle_date_millis" : 1585008002962,
      "age" : "35.41d",
      "phase" : "cold",
      "phase_time_millis" : 1587686411329,
      "action" : "allocate",
      "action_time_millis" : 1587686419823,
      "step" : "check-allocation",
      "step_time_millis" : 1587686428607,
      "step_info" : {
        "message" : "Waiting for [2] shards to be allocated to nodes matching the given filters",
        "shards_left_to_allocate" : 2,
        "all_shards_active" : true,
        "actual_replicas" : 1
      },
      "phase_execution" : {
        "policy" : "so1n_index_policy",
        "phase_definition" : {
          "min_age" : "31d",
          "actions" : {
            "freeze" : { },
            "allocate" : {
              "include" : { },
              "exclude" : { },
              "require" : {
                "box_type" : "cold"
              }
            },
            "set_priority" : {
              "priority" : 10
            }
          }
        },
        "version" : 3,
        "modified_date_in_millis" : 1587663954354
      }
    }
  }
}

From the output we can see that the policy so1n_index_policy is configured so that once an index's min_age exceeds 31 days it enters the cold phase. While executing the allocate action, the check-allocation step reports Waiting for [2] shards to be allocated to nodes matching the given filters. In other words, ILM runs into trouble exactly when it tries to reallocate so1n_index's shards, and it is stuck at the check-allocation step, so all we need to work out is why check-allocation does not pass.
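For reference, the cold phase shown in the phase_definition above corresponds to an ILM policy roughly like the one below. This is only a sketch reconstructed from the explain output; any hot or delete phases the real policy may have are omitted here.

PUT _ilm/policy/so1n_index_policy
{
  "policy": {
    "phases": {
      "cold": {
        "min_age": "31d",
        "actions": {
          "freeze": {},
          "allocate": {
            "require": {
              "box_type": "cold"
            }
          },
          "set_priority": {
            "priority": 10
          }
        }
      }
    }
  }
}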

3. Checking the index's allocation

Use the cluster allocation explain API directly to see why allocation fails:

GET /_cluster/allocation/explain
{
  "index": "so1n_index",
  "shard": 0,
  "primary": true
}

The response is as follows:

{
  "index" : "so1n_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "so1n_id_1",
    "name" : "so1n-elastic-node-1",
    "transport_address" : "10.142.0.1:9300",
    "attributes" : {
      "ml.machine_memory" : "7839637504",
      "xpack.installed" : "true",
      "box_type" : "hot",
      "ml.max_open_jobs" : "20"
    }
  },
  "can_remain_on_current_node" : "no",
  "can_remain_decisions" : [
    {
      "decider" : "filter",
      "decision" : "NO",
      "explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"cold",_id:"so1n_id_3"]"""
    }
  ],
  "can_move_to_other_node" : "no",
  "move_explanation" : "cannot move shard to another node, even though it is not allowed to remain on its current node",
  "node_allocation_decisions" : [
    {
      "node_id" : "so1n_id_2",
      "node_name" : "so1n-elastic-node-2",
      "transport_address" : "10.142.0.2:9300",
      "node_attributes" : {
        "ml.machine_memory" : "7839653888",
        "ml.max_open_jobs" : "20",
        "box_type" : "cold",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"cold",_id:"so1n_id_3"]"""
        }
      ]
    }
  ]
}

Here we can see the culprit: the index carries an extra allocation filter, so a shard may only be placed on a node whose id is so1n_id_3 and whose box_type is cold. The candidate node does have box_type cold, but its id is so1n_id_2, so the filter can never be satisfied.
The next step is to look at the index settings themselves (they can be viewed in Kibana's index settings):
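They can also be pulled with the standard get-settings API; the flat_settings parameter returns the keys in the same dot notation as shown below (in the raw response they are additionally wrapped under the index name and a settings key):

GET so1n_index/_settings?flat_settings=true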

{
  "index.blocks.read_only_allow_delete": "false",
  "index.query.default_field": [
    "*"
  ],
  "index.refresh_interval": "1s",
  "index.write.wait_for_active_shards": "1",
  "index.lifecycle.name": "so1n_index_policy",
  "index.routing.allocation.require._id": "so1n_id_3",
  "index.routing.allocation.require.box_type": "cold",
  "index.blocks.write": "true",
  "index.priority": "10",
  "index.number_of_replicas": "1"
}

And indeed there is one extra setting here:

"index.routing.allocation.require._id": "so1n_id_3"

This line was never configured by hand. My guess is that back when no mapping was used, Es automatically added index.routing.allocation.require._id to the index when the data was routed to the so1n_id_3 node.

4. The fix

From the above we can see that because the old indices carry:

"index.routing.allocation.require._id": "so1n_id_3"

they can no longer be allocated. The business does not need this restriction, and it is safe to say that no index on this Es cluster requires index.routing.allocation.require._id, so it can be removed across the board with the request below. (If some indices do still need index.routing.allocation.require._id, scope the update with an index pattern instead; a scoped example follows the snippet.)

PUT */_settings
{
  "index.routing.allocation.require._id": null
}
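If only part of the indices should drop the setting, the same update can be applied to an index pattern instead of *. The prefix below is purely illustrative, assuming the old Logstash indices share a name like logstash-*:

PUT logstash-*/_settings
{
  "index.routing.allocation.require._id": null
}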

After the change, call explain again:

GET /_cluster/allocation/explain
{
  "index": "so1n_index",
  "shard": 0,
  "primary": true
}

The response shows that Es is already migrating the index in the background:

{
  "index" : "so1n_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "relocating",
  "current_node" : {
    "id" : "so1n_id_1",
    "name" : "so1n-elastic-node-1",
    "transport_address" : "10.142.0.1:9300",
    "attributes" : {
      "ml.machine_memory" : "7839637504",
      "xpack.installed" : "true",
      "box_type" : "hot",
      "ml.max_open_jobs" : "20"
    }
  },
  "explanation" : "the shard is in the process of relocating from node [so1n-elastic-node-1] to node [so1n-elastic-node-4], wait until relocation has completed"
}

Once the relocation has finished, querying again shows a rebalance_explanation field of: cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance.
With that, the hot-cold separation rules apply to all indices and everything runs as expected.
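As a final sanity check, the standard _cat/shards endpoint can be used to confirm where the shards of this index now live (so1n_index is still the placeholder name used throughout this post):

GET _cat/shards/so1n_index?v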
