MongoDB Chunk Migration Failed Solution: Unable to acquire X lock

After adding new shards to our production MongoDB cluster (v4.4.6-ent, 5 shards with 3 replica set members each), we found that the balancer was not working. sh.status() showed many chunk migration errors:

        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
        {  "_id" : "X",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("xxx"),  "lastMod" : 1 } }
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208

The chunks are clearly unbalanced across shards (327 on mongo-1 vs. 208 on mongo-3). Since the balancer was enabled, we looked at mongodb.log on the config server to debug the issue. It contained many "Migration failed" entries (sensitive information masked):
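The imbalance can also be measured directly rather than read off sh.status(). The following is a minimal sketch, not part of our original debugging session. It assumes the MongoDB 4.4 layout of config.chunks, where each chunk document stores the collection namespace in `ns` and the owning shard in `shard`. The helper names `chunksPerShard` and `chunkSpread` are made up for illustration.

```javascript
// Count chunks per shard for one sharded collection.
// `configDb` is assumed to be db.getSiblingDB("config") in the mongo shell.
// Assumption: MongoDB 4.4, where config.chunks documents carry `ns` and `shard`.
function chunksPerShard(configDb, namespace) {
  return configDb.chunks.aggregate([
    { $match: { ns: namespace } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { _id: 1 } }
  ]).toArray();
}

// Difference between the most loaded and the least loaded shard.
function chunkSpread(counts) {
  const values = counts.map(c => c.chunks);
  return Math.max(...values) - Math.min(...values);
}

// Counts copied from the sh.status() output above.
const observed = [
  { _id: "mongo-0", chunks: 231 },
  { _id: "mongo-1", chunks: 327 },
  { _id: "mongo-2", chunks: 230 },
  { _id: "mongo-3", chunks: 208 }
];
const spread = chunkSpread(observed); // 327 - 208 = 119
```

In the shell, `chunkSpread(chunksPerShard(db.getSiblingDB("config"), "X.A"))` would give the live figure for the collection.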

    {
        "t": {
            "$date": "2021-09-27T05:05:46.090+00:00"
        },
        "s": "I",
        "c": "SHARDING",
        "id": 21872,
        "ctx": "Balancer",
        "msg": "Migration failed",
        "attr": {
            "migrateInfo": "X.A: [{ Uuid: MinKey }, { Uuid: \"xxx\" }), from mongo-a-1, to mongo-a-4",
            "error": "LockTimeout: Unable to acquire X lock on '{13328793763114131834: Collection, 1799578717045662184, X.A}' within 500ms. opId: 669456657, op: MoveChunk, connId: 0."
        }
    }

Because this error occurred on the first chunk to be migrated, it blocked the whole balancing process, leaving shard mongo-1 with much more data than the other shards. The direct cause appears to be the failure to acquire an exclusive (X) lock on the collection.

We then tried to move a chunk manually, but got the same error:

    MongoDB Enterprise mongos> sh.moveChunk("X.A", {"Uuid": "XX"}, "mongo-3" )
    {
            "ok" : 0,
            "errmsg" : "Unable to acquire X lock on '{13328793763114131834: Collection, 1799578717045662184, X.A}' within 500ms. opId: 719103624, op: MoveChunk, connId: 0.",
            "code" : 24,
            "codeName" : "LockTimeout",
            "operationTime" : Timestamp(1632890816, 39),
            "$clusterTime" : {
                    "clusterTime" : Timestamp(1632890816, 39),
                    "signature" : {
                            "hash" : BinData(0,"/="),
                            "keyId" : NumberLong("6952576454697680898")
                    }
            }
    }

Increase maxTransactionLockRequestTimeoutMillis?

We found a similar discussion online that suggests increasing maxTransactionLockRequestTimeoutMillis. However, 500ms should be long enough to acquire the lock, and changing the parameter might have side effects in a production environment, so we did not test whether it works.
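For reference, a server parameter can be read without changing anything, which is a low-risk first step. The sketch below only builds the command documents for the standard getParameter and setParameter admin commands. The helper names are made up for illustration, and we did not run the setParameter variant. Note that the MongoDB documentation describes this parameter as the lock wait time for multi-document transactions, so whether it influences moveChunk at all remains untested here.

```javascript
// Build the admin command that reads one server parameter.
function getParameterCommand(name) {
  return { getParameter: 1, [name]: 1 };
}

// Build the admin command that changes one server parameter at runtime.
function setParameterCommand(name, value) {
  return { setParameter: 1, [name]: value };
}

const PARAM = "maxTransactionLockRequestTimeoutMillis";
const readCmd = getParameterCommand(PARAM);

// In the mongo shell, connected to a shard's mongod (assumption: the
// parameter is set per mongod, not through mongos):
//   db.adminCommand(readCmd)
//   db.adminCommand(setParameterCommand(PARAM, <new value in ms>))
```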

Jumbo Chunks?

According to another post, jumbo chunks can cause migration failures, so we checked the collection's chunks with sh.status(true). No jumbo chunks were found.
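With several hundred chunks, scanning the sh.status(true) output by eye is error-prone. As a sketch of an alternative, the same check can be expressed as a filter on the chunk metadata. It assumes that a chunk flagged as jumbo carries `jumbo: true` in config.chunks. The sample documents below are hypothetical and only show the shape.

```javascript
// Return only the chunks that the balancer has flagged as jumbo.
function jumboChunks(chunks) {
  return chunks.filter(c => c.jumbo === true);
}

// Hypothetical chunk documents, shaped like config.chunks entries on 4.4.
const sampleChunks = [
  { ns: "X.A", shard: "mongo-1", jumbo: true },
  { ns: "X.A", shard: "mongo-1" },
  { ns: "X.A", shard: "mongo-3", jumbo: false }
];
const flagged = jumboChunks(sampleChunks);

// Equivalent query in the mongo shell (assumes the 4.4 `ns` field):
//   db.getSiblingDB("config").chunks.find({ ns: "X.A", jumbo: true })
```

In our case the result was empty, which matches what sh.status(true) showed.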

Restart the Source Mongod

We also checked the current operations and locks with db.currentOp(), but found no clue.
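The raw db.currentOp() output is long, so narrowing it to the affected collection helps. The sketch below is one way to do that, assuming the `ns`, `opid`, `op`, `secs_running`, and `waitingForLock` fields that currentOp reports for each entry of `inprog`. The function name and the sample data are hypothetical. It lists operations on one namespace, longest-running first, since a long-lived operation is a plausible candidate for holding a conflicting lock.

```javascript
// Summarize operations on one namespace, longest-running first.
function opsOnNamespace(inprog, namespace) {
  return inprog
    .filter(op => op.ns === namespace)
    .sort((a, b) => (b.secs_running || 0) - (a.secs_running || 0))
    .map(op => ({
      opid: op.opid,
      op: op.op,
      secs_running: op.secs_running || 0,
      waitingForLock: op.waitingForLock === true
    }));
}

// Hypothetical sample shaped like db.currentOp().inprog entries.
const sampleOps = [
  { opid: 1, op: "query", ns: "X.A", secs_running: 3, waitingForLock: false },
  { opid: 2, op: "command", ns: "X.A", secs_running: 900, waitingForLock: false },
  { opid: 3, op: "update", ns: "X.B", secs_running: 5000, waitingForLock: true }
];
const suspects = opsOnNamespace(sampleOps, "X.A");

// In the mongo shell, on the source shard's primary:
//   opsOnNamespace(db.currentOp().inprog, "X.A")
```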

Finally, we restarted the source mongod process through MongoDB Ops Manager, which changed the shard's primary. Surprisingly, it worked! A few minutes later, the chunks were evenly distributed across shards:

        Currently enabled:  yes
        Currently running:  no
                Balancer active window is set between 17:45 and 11:59 server local time
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                236 : Success
                2 : Failed with error 'aborted', from mongo-4 to mongo-0
        {  "_id" : "X",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("xxx"),  "lastMod" : 1 } }
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                                mongo-0       207
                                mongo-1       207
                                mongo-2       207
                                mongo-3       207
                                mongo-4       208

Our guess is that the original mongod process was in an inconsistent state. Restarting it is simply an easy way to force another member to become primary, which bypasses the inconsistent state of the original mongod.
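To confirm that the primary really moved after the restart, the replica set status can be compared before and after. The sketch below assumes the `members[].name` and `members[].stateStr` fields of rs.status(). The host names are placeholders, not our real hosts. The standard way to force an election without a restart is rs.stepDown(), but we did not test whether a step-down alone clears the error.

```javascript
// Return the name of the member currently reported as PRIMARY, or null.
function currentPrimary(status) {
  const primary = status.members.find(m => m.stateStr === "PRIMARY");
  return primary ? primary.name : null;
}

// Placeholder snapshots shaped like rs.status() output.
const beforeRestart = {
  members: [
    { name: "host-a:27017", stateStr: "PRIMARY" },
    { name: "host-b:27017", stateStr: "SECONDARY" },
    { name: "host-c:27017", stateStr: "SECONDARY" }
  ]
};
const afterRestart = {
  members: [
    { name: "host-a:27017", stateStr: "SECONDARY" },
    { name: "host-b:27017", stateStr: "PRIMARY" },
    { name: "host-c:27017", stateStr: "SECONDARY" }
  ]
};
const primaryChanged =
  currentPrimary(beforeRestart) !== currentPrimary(afterRestart);

// In the mongo shell, connected to a member of the source shard:
//   currentPrimary(rs.status())
//   rs.stepDown()   // untested alternative to a full restart
```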

    {
        "t": {
            "$date": "2021-09-29T02:43:35.602+00:00"
        },
        "s": "I",
        "c": "SHARDING",
        "id": 21872,
        "ctx": "Balancer",
        "msg": "Migration failed",
        "attr": {
            "migrateInfo": "X.A: [{ Uuid: \"xxx\" }, { Uuid: \"xxx\" }), from mongo-a-1, to mongo-a-4",
            "error": "Overflow: Invalid advance (1) past end of buffer[0] at offset: 0"
        }
    }

The error "Overflow: Invalid advance (1) past end of buffer[0] at offset: 0" shown above can also be resolved by manually restarting the source mongod.