MongoDB Change Stream Brings the Cluster Down

Posted on 2021-07-22 Edited on 2023-01-24 In MongoDB

MongoDB change streams are a nice feature. They allow applications to access real-time data changes without the complexity and risk of tailing the oplog.

Recently, when we used a change stream to replicate data from one sharded cluster to another, it immediately made the cluster unstable (several nodes went down and a primary change was triggered). Read/write latency then increased significantly.
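For readers who have not built this kind of replication job, here is a minimal sketch of what it can look like with pymongo. It is not our actual application: the `apply_change` and `replicate` names, the URIs, and the choice to upsert the full document are all assumptions made for illustration, and resume tokens and error handling are left out.

```python
from typing import Any, Mapping, Optional


def apply_change(event: Mapping[str, Any], target: Any) -> Optional[str]:
    """Apply one change event to a target collection-like object.

    `target` only needs replace_one() and delete_one(), which a pymongo
    Collection provides. Returns the action taken, or None if skipped.
    """
    op = event.get("operationType")
    key = event.get("documentKey")
    if op in ("insert", "replace", "update"):
        doc = event.get("fullDocument")
        if doc is None:
            # With updateLookup the document may already have been deleted;
            # the delete event that follows will handle it.
            return None
        target.replace_one(key, doc, upsert=True)
        return "upsert"
    if op == "delete":
        target.delete_one(key)
        return "delete"
    # drop, rename, invalidate, ... are ignored in this sketch.
    return None


def replicate(source_uri: str, target_uri: str, db: str, coll: str) -> None:
    """Hypothetical driver loop; needs pymongo and two reachable clusters."""
    # Imported lazily so apply_change can be exercised without pymongo.
    from pymongo import MongoClient

    source = MongoClient(source_uri)[db][coll]
    target = MongoClient(target_uri)[db][coll]
    # full_document="updateLookup" asks the server to attach the current
    # version of the document to update events.
    with source.watch(full_document="updateLookup") as stream:
        for event in stream:
            apply_change(event, target)
```

Keeping `apply_change` free of any driver import makes it easy to test against a fake collection before pointing `replicate` at a real cluster.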

Observations

Observations on our production environment:

The cluster went down immediately after the change stream application started
mongod restarted
Mongos connections significantly increased (about 3x)
Service latency notably increased (from milliseconds to seconds)

However, the issue does not always occur. The change stream application works fine on two other sharded clusters (test environments with little traffic).

We struggled to debug the issue. We tried upgrading the MongoDB client drivers, but that did not help. Since the issue only happened in production, it was really hard to reproduce. :-(

Finally, we found this link: https://jira.mongodb.org/browse/SERVER-50769 . It was a perfect match:

Sharded cluster, MongoDB v4.4.3-ent
Heavy use of transactions
Server restarts after starting the change stream

After requesting the mongod logs, we found logs like this:

{"t":{"$date":"2021-07-20T06:36:56.044+00:00"},"s":"F", "c":"-", "id":23079, "ctx":"conn198","msg":"Invariant failure","attr":{"expr":"_currentApplyOps.getArrayLength() > 0","file":"src/mongo/db/pipeline/document_source_change_stream_transform.cpp","line":535}}
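Since mongod 4.4 writes its log as one JSON object per line, entries like the one above can be picked out with a few lines of Python. The following is only a sketch: the `find_invariant_failures` helper and the `InvariantFailure` tuple are names made up for this post, and the filter relies solely on the fields visible in the entry above (`s`, `msg`, `attr`).

```python
import json
from typing import Iterable, Iterator, NamedTuple


class InvariantFailure(NamedTuple):
    timestamp: str
    expr: str
    file: str
    line: int


def find_invariant_failures(lines: Iterable[str]) -> Iterator[InvariantFailure]:
    """Yield fatal "Invariant failure" entries from structured mongod logs."""
    for raw in lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not a JSON log line (blank line, truncated entry, ...)
        if not isinstance(entry, dict):
            continue
        # "s" is the severity field; "F" marks a fatal entry.
        if entry.get("s") != "F" or entry.get("msg") != "Invariant failure":
            continue
        attr = entry.get("attr", {})
        yield InvariantFailure(
            timestamp=entry.get("t", {}).get("$date", ""),
            expr=attr.get("expr", ""),
            file=attr.get("file", ""),
            line=attr.get("line", 0),
        )
```

Passing an open log file to `find_invariant_failures` and checking whether any result has a `file` ending in `document_source_change_stream_transform.cpp` is enough to tell whether a restart matches this ticket.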

Mixed feelings...

Solution

Upgrade MongoDB to 4.4.4 or later.

See the release notes:

SERVER-50769: server restarted after expr{"expr":"_currentApplyOps.getArrayLength() > 0","file":"src/mongo/db/pipeline/document_source_change_stream_transform.cpp"
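Before starting a change stream application against a cluster, it may be worth checking the server version first. The sketch below encodes only what this post establishes, namely that 4.4.3 was affected and 4.4.4+ contains the fix. It says nothing about other release series, and the `parse_version` and `needs_upgrade` helpers are hypothetical names.

```python
import re
from typing import Tuple

# From this post: the fix is available in 4.4.4 and later.
FIXED_IN = (4, 4, 4)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse strings such as "4.4.3", "v4.4.3-ent" into (4, 4, 3)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized MongoDB version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def needs_upgrade(version: str) -> bool:
    """True for 4.4.x releases older than 4.4.4; False for everything else."""
    major, minor, patch = parse_version(version)
    return (major, minor) == FIXED_IN[:2] and patch < FIXED_IN[2]
```

With pymongo, the version string can be read from `MongoClient(...).server_info()["version"]` and passed to `needs_upgrade`.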