MongoDB Agent Host Discovery Issue
I have a sharded cluster (2 shards with 3 mongods each, 3 config servers, and 2 mongoses) deployed via MongoDB Ops Manager.
Last week, one of the shard hosts was shown as a grey diamond
(hovering over it displayed "Last Ping: Never"). In addition, on the Ops Manager Servers page,
a single server listed two processes (e.g. sharddb-0 and sharddb-config). However, the cluster itself still worked fine,
and the host sharddb-0-0 (shard 0, replica 0) was still listed in the
mongo shell by sh.status() and rs.status().
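For reference, this is roughly how the check from the shell looks. It is only a sketch: the mongos address is a placeholder, and only the shard member hostname is taken from the logs below.

# Ask any mongos for the sharding overview; shard 0 should still list
# sharddb-0-0 among its members if the topology is intact.
mongosh "mongodb://<mongos-host>:27017" --eval 'sh.status()'

# Ask the shard-0 member directly; every member, including sharddb-0-0,
# should report a healthy state (PRIMARY/SECONDARY).
mongosh "mongodb://sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017" \
  --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'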
What's wrong with the cluster?
Problem Analysis
It looked like one of the servers had dropped out of Ops Manager's
monitoring (even though its agent log was still being collected on time). At first glance I
thought the issue was caused by the host itself, so I deleted the
corresponding pod as well as its PVC in Kubernetes and waited for the
host to be recreated. No luck.
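For completeness, the recovery attempt looked roughly like this. A sketch only: the namespace mongodb comes from the FQDNs in the logs, but the PVC name is an assumption that depends on how the StatefulSet's volume claim template is named.

# Delete the suspect pod; the StatefulSet recreates it automatically.
kubectl -n mongodb delete pod sharddb-0-0

# Also drop its persistent volume claim so the member does a full
# resync from its replica set peers (PVC name is hypothetical).
kubectl -n mongodb delete pvc data-sharddb-0-0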
Monitoring Logs
I checked the automation agent log:
[2021-01-15T10:13:03.299+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
[2021-01-15T10:13:06.661+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:850] Received discovery responses from 20/20 requests after 3s
[2021-01-15T10:13:58.298+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-15T10:13:58.298+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
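To pull just the discovery lines out of the agent log on a given pod, something like the following works. The log path is an assumption based on the automation agent's default log directory, so adjust it to wherever your deployment actually writes logs.

# Show how many hosts the monitoring component discovers each cycle
# (log path is assumed; check your agent's configured log location).
kubectl -n mongodb exec sharddb-0-0 -- \
  grep 'discovery.monitor' /var/log/mongodb-mms-automation/automation-agent.log | tail -n 20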
It looks like the issue was caused by mongodb-mms-automation-agent host discovery.
The cluster has 11 hosts in total (2 * 3 + 3 + 2 = 11), yet the agent only performs
discovery with 10 hosts and sends 20 requests (apparently 2 per host) instead of the
expected 22. In other words, one host, sharddb-0-0, is missing from discovery.
System Warnings
After that, I found some system warnings on the Ops Manager admin page:
Automation metrics found conflicting canonical hosts=[sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017] within all ping's hosts=[sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017]. Skipping this sample.
Backup agent warnings:
com.xgen.svc.mms.svc.alert.AlertProcessingSvc [runAlertCheck:187] Alert check failed: groupId=5ff879a85d86101b89072b19, event=BACKUP_AGENT_DOWN, target=unassigned java.lang.IllegalStateException: Duplicate key sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local (attempted merging values BackupAgentAudit{Id: 5ff8802c39e579142fd18bcd, GroupId: 5ff879a85d86101b89072b19, Hostname: sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local, Version: 10.14.17.6445} and BackupAgentAudit{Id: 5ff8802d39e579142fd18d2f, GroupId: 5ff879a85d86101b89072b19, Hostname: sharddb-0-1.sharddb-sh.mongodb.svc.cluster.local, Version: 10.14.17.6445})
The above logs indicate that the host sharddb-0-0 conflicts with another host
(sharddb-0-1, judging from the duplicate-key warning). But how do we make the
automation agent discover the missing host?
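Before changing anything, it can help to cross-check which hosts Ops Manager's monitoring actually knows about. One way is the Monitoring API's hosts endpoint; this is a sketch that assumes you have a project-level API key pair, and the base URL and group ID are placeholders.

# List the hosts Ops Manager is monitoring for this project; a missing
# sharddb-0-0 entry here confirms the discovery gap.
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  "https://<ops-manager-host>/api/public/v1.0/groups/{GROUP-ID}/hosts?pretty=true"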
Solution
After a week of grinding, I accidentally found the culprit: conflicting host mappings. What are host mappings? Refer to: Host Mappings
Go to Deployment -> More -> Host Mappings to check the existing host mappings. If you find something like this:
| Host | Alias |
|---|---|
| sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017 | sharddb-0-1:27017 |
that means sharddb-0-0 is mapped to sharddb-0-1, and the conflicting-hosts issue occurs.
Solution: delete the conflicting host mappings, or simply delete all host mappings. (Don't worry, the mappings will be rebuilt automatically a few minutes later.)
My guess at the root cause is that Ops Manager doesn't update the host mappings in a timely manner: if host A obtains a new IP address that was previously used by host B, the stale mapping causes this issue.
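If you suspect the same IP-reuse scenario, you can compare the pods' current IPs against what the cluster DNS returns for the member hostnames. A rough sketch: the namespace and service name are taken from the FQDNs in the logs, and the throwaway busybox pod used for the lookup is hypothetical.

# Show the current pod IPs, which change whenever a pod is recreated.
kubectl -n mongodb get pods -o wide | grep sharddb

# Resolve the per-pod DNS name from inside the cluster; a mismatch with
# what Ops Manager has recorded points at a stale host mapping.
kubectl -n mongodb run dns-check --rm -it --restart=Never --image=busybox -- \
  nslookup sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local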