Solution to MongoDB Chunk Migration Failure: Unable to acquire X lock
After adding new shards to our production MongoDB cluster (v4.4.6-ent, 5 shards, 3 replica set members per shard), we found that the balancer was not working. sh.status() displayed many chunk migration errors:
...
  balancer:
        Currently enabled: yes
        Currently running: no
        Failed balancer rounds in last 5 attempts: 0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        { "_id" : "X", "primary" : "mongo-1", "partitioned" : true, "version" : { "uuid" : UUID("xxx"), "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0 231
                                mongo-1 327
                                mongo-2 230
                                mongo-3 208
...
Obviously, the chunks are unbalanced across shards (327 vs. 208).
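For reference, the same per-shard counts can be pulled directly from the config database (a minimal sketch for a 4.4 cluster, where chunk documents still carry the ns field):
// run on mongos: chunk count per shard for X.A
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "X.A" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
])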
Since the balancer is enabled, we tried to debug the issue through mongodb.log on the config server. There are many migration-failure log entries (sensitive information masked):
{
"t": {
"$date": "2021-09-27T05:05:46.090+00:00"
},
"s": "I",
"c": "SHARDING",
"id": 21872,
"ctx": "Balancer",
"msg": "Migration failed",
"attr": {
"migrateInfo": "X.A: [{ Uuid: MinKey }, { Uuid: \"xxx\" }), from mongo-a-1, to mongo-a-4",
"error": "LockTimeout: Unable to acquire X lock on '{13328793763114131834: Collection, 1799578717045662184, X.A}' within 500ms. opId: 669456657, op: MoveChunk, connId: 0."
}
}
Because the above error happens on the first chunk to be migrated, the whole balancing process is blocked by it, which leaves shard mongo-1 holding much more data than the other shards. It seems that the failure is caused by the Unable to acquire X lock on collection error.
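Besides mongodb.log, the migration failure history can also be checked in the config changelog (a sketch; we mainly relied on the log itself):
// run on mongos: recent moveChunk events for the collection, newest first
db.getSiblingDB("config").changelog.find({ ns: "X.A", what: /moveChunk/ }).sort({ time: -1 }).limit(5)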
We tried to manually move the chunk according to this, but got the same error:
MongoDB Enterprise mongos> sh.moveChunk("X.A", {"Uuid": "XX"}, "mongo-3" )
{
"ok" : 0,
"errmsg" : "Unable to acquire X lock on '{13328793763114131834: Collection, 1799578717045662184, X.A}' within 500ms. opId: 719103624, op: MoveChunk, connId: 0.",
"code" : 24,
"codeName" : "LockTimeout",
"operationTime" : Timestamp(1632890816, 39),
"$clusterTime" : {
"clusterTime" : Timestamp(1632890816, 39),
"signature" : {
"hash" : BinData(0,"/="),
"keyId" : NumberLong("6952576454697680898")
}
}
}
Increase maxTransactionLockRequestTimeoutMillis?
We found a similar discussion here, which suggests setting maxTransactionLockRequestTimeoutMillis to a larger value. However, we think 500 ms should be long enough to acquire the lock, and changing the parameter may have side effects on the production environment, so we did not test whether it works.
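For the record, if we had tried it, the parameter can be changed at runtime with setParameter; the 3000 ms value below is only an illustration, not a recommendation:
// run on the mongod nodes of the source shard (not mongos); untested by us in production
db.adminCommand({ setParameter: 1, maxTransactionLockRequestTimeoutMillis: 3000 })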
Jumbo Chunks?
According to this post, jumbo chunks might lead to migration failures. So we checked the collection's chunks with sh.status(true); NO jumbo chunks were found.
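Jumbo chunks can also be queried directly from the config database, which avoids scanning the long sh.status(true) output (a minimal sketch):
// run on mongos: list chunks of X.A flagged as jumbo
db.getSiblingDB("config").chunks.find({ ns: "X.A", jumbo: true })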
Restart the Source Mongod
We also tried to check the current operations and locks via db.currentOp(), but that did not give us a clue.
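For reference, db.currentOp() accepts a filter document, so one way to narrow the output to operations stuck on locks (a sketch; we did not spot anything suspicious either way) is:
// show only operations that are currently waiting for a lock
db.currentOp({ waitingForLock: true })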
Finally, we restarted the source mongod process through MongoDB Ops Manager to change the primary of the shard. Surprisingly, it worked! A few minutes later, the chunks were evenly distributed across shards:
...
  balancer:
        Currently enabled: yes
        Currently running: no
        Balancer active window is set between 17:45 and 11:59 server local time
        Failed balancer rounds in last 5 attempts: 0
        Migration Results for the last 24 hours:
                236 : Success
                2 : Failed with error 'aborted', from mongo-4 to mongo-0
  databases:
        { "_id" : "X", "primary" : "mongo-1", "partitioned" : true, "version" : { "uuid" : UUID("xxx"), "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0 207
                                mongo-1 207
                                mongo-2 207
                                mongo-3 207
                                mongo-4 208
...
We guess that the root cause is that the original mongod process had some inconsistent internal state. Restarting is just a simple way to force another member to become primary, which bypasses the inconsistent state of the original mongod.
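Since the effect of the restart here is essentially forcing a primary election, stepping down the primary directly might achieve the same result with less disruption (we did not test this alternative):
// run on the current primary of the source shard's replica set
rs.stepDown()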
We also saw another kind of migration failure in the log:
{
"t": {
"$date": "2021-09-29T02:43:35.602+00:00"
},
"s": "I",
"c": "SHARDING",
"id": 21872,
"ctx": "Balancer",
"msg": "Migration failed",
"attr": {
"migrateInfo": "X.A: [{ Uuid: \"xxx\" }, { Uuid: \"xxx\" }), from mongo-a-1, to mongo-a-4",
"error": "Overflow: Invalid advance (1) past end of buffer[0] at offset: 0"
}
}
The error Overflow: Invalid advance past end of buffer at offset can be resolved by manually restarting the source mongod as well.