MongoDB Change Stream Brings the Cluster Down

MongoDB change streams are a nice feature. They allow applications to access real-time data changes without the complexity and risk of tailing the oplog.

Recently, when we used a change stream to replicate data from one sharded cluster to another, it immediately made the cluster unstable: several nodes crashed, which triggered a primary change, and read/write latency increased significantly.
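For context, a replication application of this kind can be sketched roughly as follows. This is a minimal sketch and not our actual application: the function names apply_change and replicate are hypothetical, and it assumes PyMongo-style collection objects (watch, replace_one, delete_one) for the source and target collections.

```python
def apply_change(event, target):
    """Apply one change stream event to the target collection."""
    op = event["operationType"]
    # documentKey holds _id (plus the shard key on a sharded collection).
    key = event["documentKey"]
    if op in ("insert", "replace", "update"):
        # With full_document="updateLookup", update events carry the
        # current document; it can be missing if the document has since
        # been deleted.
        doc = event.get("fullDocument")
        if doc is not None:
            target.replace_one(key, doc, upsert=True)
    elif op == "delete":
        target.delete_one(key)
    # Other event types (drop, rename, invalidate, ...) are ignored here.


def replicate(source, target, resume_token=None):
    """Copy changes from source to target until the stream ends."""
    with source.watch(full_document="updateLookup",
                      resume_after=resume_token) as stream:
        for event in stream:
            apply_change(event, target)
            resume_token = event["_id"]  # token to resume from after a restart
    return resume_token
```

In a real deployment, source and target would be collections obtained through the mongos of each cluster, and the resume token would be persisted somewhere durable.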

Observations

Observations in our production environment:

  • The cluster went down immediately after the change stream application started
  • Mongod restarted
  • Mongos connections significantly increased (about 3x)
  • Service latency notably increased (from milliseconds to seconds)

However, the issue does not always occur. The same change stream application works fine on two other sharded clusters (test environments with little traffic).

We struggled to debug the issue. We tried upgrading the MongoDB client drivers, but that did not help. Since the issue only happened in production, it was really hard to reproduce. :-(

Finally, we found this ticket: https://jira.mongodb.org/browse/SERVER-50769 . A perfect match:

  • Sharded cluster, MongoDB v4.4.3-ent
  • Heavy use of transactions
  • Server restarts after starting a change stream

After requesting the mongod logs, we found entries like this:

{"t":{"$date":"2021-07-20T06:36:56.044+00:00"},"s":"F", "c":"-", "id":23079, "ctx":"conn198","msg":"Invariant failure","attr":{"expr":"_currentApplyOps.getArrayLength() > 0","file":"src/mongo/db/pipeline/document_source_change_stream_transform.cpp","line":535}}
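Entries like this are easy to miss in a large log file. Here is a small sketch of how one might scan a mongod log for them. It assumes the structured JSON log format shown above (one JSON object per line) and matches on severity "F" and the message "Invariant failure"; the function name find_invariant_failures is made up for illustration.

```python
import json


def find_invariant_failures(lines):
    """Yield (timestamp, expr, file, line) for fatal invariant failures."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("s") == "F" and entry.get("msg") == "Invariant failure":
            attr = entry.get("attr", {})
            yield (
                entry.get("t", {}).get("$date"),
                attr.get("expr"),
                attr.get("file"),
                attr.get("line"),
            )


# Usage (the log path is an assumption):
#   with open("/var/log/mongodb/mongod.log") as f:
#       for hit in find_invariant_failures(f):
#           print(hit)
```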

Mixed feelings...

Solution

Upgrade MongoDB to 4.4.4 or later.

See the release notes:

SERVER-50769: server restarted after expr{"expr":"_currentApplyOps.getArrayLength() > 0","file":"src/mongo/db/pipeline/document_source_change_stream_transform.cpp"
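Before starting a change stream application against a cluster, it may be worth checking the server version first. The sketch below compares a version string such as "4.4.3-ent" against 4.4.4. The helper names are hypothetical, and the comparison is only meaningful for the 4.4 series discussed here; other release lines may have received the fix in a different patch version.

```python
import re

# First 4.4.x release containing the fix, per the solution above.
FIXED_IN = (4, 4, 4)


def parse_version(version):
    """Turn '4.4.3-ent' into (4, 4, 3), ignoring any suffix."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version, fixed_in=FIXED_IN):
    """Return True if the version is at or above the fixed release.

    Assumption: a plain tuple comparison; this does not know about
    backports to older release lines.
    """
    return parse_version(version) >= fixed_in


# Usage with PyMongo (the connection string is an assumption):
#   from pymongo import MongoClient
#   version = MongoClient("mongodb://localhost:27017").server_info()["version"]
#   print(has_fix(version))
```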