MongoDB Agent Host Discovery Issue
I have a sharded cluster (2 shards with 3 mongods each, 3 config servers, and 2 mongos routers) deployed by MongoDB Ops Manager.
Last week, the status of one shard host was shown as a grey diamond
(hover text: "Last Ping: Never"). In addition, on Ops Manager's Servers page,
one server listed two processes (e.g. sharddb-0 and
sharddb-config). However, the cluster still works well, and
the host sharddb-0-0 (shard 0, replica 0) is listed in the
mongo shell by sh.status() and rs.status().
What's wrong with the cluster?
Problem Analysis
It looks like one of the servers dropped out of Ops Manager's
monitoring (although its agent log was still collected on time).
At first glance, I thought the issue was caused by the host itself,
so I deleted the corresponding pod as well as its PVC in Kubernetes
and waited for the host to recover. No luck.
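For reference, the pod/PVC recreation step can be sketched as below. The pod name, PVC prefix, and namespace are assumptions based on typical StatefulSet naming, so adjust them to your deployment; the commands are echoed as a dry run, and removing `echo` actually executes them.

```shell
# Hypothetical names; adjust to your StatefulSet and namespace.
POD="sharddb-0-0"
NS="mongodb"
# PVC names usually follow <volumeClaimTemplate-name>-<pod-name>;
# "data-volume" is an assumed template name.
PVC="data-volume-${POD}"

# Dry run: print the commands. Remove 'echo' to actually run them.
echo kubectl delete pod "$POD" -n "$NS"
echo kubectl delete pvc "$PVC" -n "$NS"
```

After the deletion, the StatefulSet controller recreates the pod (and a fresh PVC), which is what "waiting for the host to be recovered" relies on.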
Monitoring Logs
I checked the automation agent log:
[2021-01-15T10:13:03.299+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
[2021-01-15T10:13:06.661+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:850] Received discovery responses from 20/20 requests after 3s
[2021-01-15T10:13:58.298+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-15T10:13:58.298+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
It looks like the issue lies in
mongodb-mms-automation-agent host discovery.
The cluster has 11 hosts in total (2 * 3 + 3 + 2 = 11), but the agent
only performed discovery with 10 hosts. The log shows 20 requests for
those 10 hosts (2 per host), so for 11 hosts the agent should send 22
requests instead of 20. That's why it never discovers the host
sharddb-0-0.
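The host arithmetic above can be double-checked with a quick calculation; the 2-requests-per-host ratio is inferred from the "20/20 requests" for "10 hosts" in the agent log:

```shell
# Topology from the cluster description.
SHARDS=2; MONGODS_PER_SHARD=3; CONFIG_SERVERS=3; MONGOSES=2
TOTAL_HOSTS=$((SHARDS * MONGODS_PER_SHARD + CONFIG_SERVERS + MONGOSES))

# The agent log shows 20 requests for 10 hosts, i.e. 2 discovery requests per host.
EXPECTED_REQUESTS=$((TOTAL_HOSTS * 2))

echo "hosts=$TOTAL_HOSTS expected_requests=$EXPECTED_REQUESTS"  # hosts=11 expected_requests=22
```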
System Warnings
After that, I found some system warnings on the admin page:
Automation metrics found conflicting canonical hosts=[sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017] within all ping's hosts=[sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017]. Skipping this sample.
Backup agent warnings:
com.xgen.svc.mms.svc.alert.AlertProcessingSvc [runAlertCheck:187] Alert check failed: groupId=5ff879a85d86101b89072b19, event=BACKUP_AGENT_DOWN, target=unassigned java.lang.IllegalStateException: Duplicate key sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local (attempted merging values BackupAgentAudit{Id: 5ff8802c39e579142fd18bcd, GroupId: 5ff879a85d86101b89072b19, Hostname: sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local, Version: 10.14.17.6445} and BackupAgentAudit{Id: 5ff8802d39e579142fd18d2f, GroupId: 5ff879a85d86101b89072b19, Hostname: sharddb-0-1.sharddb-sh.mongodb.svc.cluster.local, Version: 10.14.17.6445})
The above logs indicate that the host sharddb-0-0
conflicts with another host. But how do we make the automation agent
discover the missing host?
Solution
After a week of grinding, I accidentally found the culprit: conflicting host mappings. What are host mappings? See: Host Mappings

Go to Deployment -> More -> Host Mappings to check
the existing host mappings. If you find something like this:
| Host | Alias |
|---|---|
| sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017 | sharddb-0-1:27017 |
This means sharddb-0-0 is mapped to
sharddb-0-1, which triggers the conflicting-hosts issue.
Solution: delete the conflicting host mappings, or simply delete all host mappings. (Don't worry: the mappings will be reconstructed a few minutes later.)

I guess the root cause is that Ops Manager doesn't update the host mappings in time: if host A obtains a new IP that was previously used by host B, the stale mapping triggers this issue.
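One way to spot such IP reuse is to compare the pods' current IPs against the addresses shown on the Host Mappings page. A minimal sketch, assuming the pods live in a `mongodb` namespace (the command is echoed as a dry run; remove `echo` to actually run it):

```shell
NS="mongodb"  # hypothetical namespace of the MongoDB pods

# 'kubectl get pods -o wide' lists each pod's current IP; compare these
# against the addresses on the Host Mappings page to catch reused IPs.
echo kubectl get pods -n "$NS" -o wide
```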