I have a sharded cluster (2 shards with 3 mongods each, 3 config servers, and 2 mongos routers) deployed by MongoDB Ops Manager.

Last week, the status of one of the shard hosts was shown as a grey diamond (hover text: "Last Ping: Never"). In addition, on the Ops Manager servers page, one server had two processes (e.g. sharddb-0 and sharddb-config). However, the cluster still worked well, and the host sharddb-0-0 (shard 0, replica 0) was listed by sh.status() and rs.status() in the mongo shell. What's wrong with the cluster?

# Problem Analysis

It looks like one of the servers is outside Ops Manager's monitoring (even though its agent log was still collected in time). At first glance, I thought the issue was caused by the host itself, so I deleted the corresponding pod as well as its PVC in Kubernetes and waited for the host to recover. No luck.
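For reference, the pod and PVC deletion looks like the sketch below. The namespace and resource names are assumptions derived from the host names in this cluster; adjust them to your deployment. The commands are printed rather than executed, since they require a live cluster:

```shell
# Assumed names based on the sharddb-0-0 host; not taken from the actual deployment.
NAMESPACE="mongodb"
POD="sharddb-0-0"
PVC="data-volume-sharddb-0-0"

# Print the commands to run against the cluster:
echo "kubectl delete pod ${POD} -n ${NAMESPACE}"
echo "kubectl delete pvc ${PVC} -n ${NAMESPACE}"
```

Note that deleting the PVC wipes that member's data; the replica set then resyncs it from the other members.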

## Monitoring Logs

I checked the automation agent log:

```
[2021-01-15T10:13:03.299+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
[2021-01-15T10:13:06.661+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:850] Received discovery responses from 20/20 requests after 3s
[2021-01-15T10:13:58.298+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-15T10:13:58.298+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
```
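A quick way to spot the mismatch is to extract the discovered-host count and the request count from the agent log and compare them. Here is a small sketch using the log excerpt above as sample input:

```shell
# Sample lines copied from the automation agent log above.
LOG='[2021-01-15T10:13:03.299+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:793] Performing discovery with 10 hosts
[2021-01-15T10:13:06.661+0000] [discovery.monitor.info] [monitoring/discovery.go:discover:850] Received discovery responses from 20/20 requests after 3s'

# Pull out the number of discovered hosts and the number of requests.
hosts=$(echo "$LOG" | grep -o 'discovery with [0-9]* hosts' | grep -oE '[0-9]+')
requests=$(echo "$LOG" | grep -oE '[0-9]+/[0-9]+ requests' | cut -d/ -f1)
echo "discovered hosts: $hosts, requests: $requests"
# → discovered hosts: 10, requests: 20
```

With 11 hosts in the cluster, 10 discovered hosts already signals that one is missing.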

It looks like the issue was caused by the mongodb-mms-automation-agent's host discovery.

Since there are 11 hosts in total (2 × 3 + 3 + 2 = 11), the mongodb-mms-automation-agent should be sending 22 requests instead of 20. That's why it doesn't discover the host sharddb-0-0.
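The expected counts can be sanity-checked with quick arithmetic. The two-requests-per-host ratio is inferred from the log above (20 requests for 10 discovered hosts), not stated anywhere in the agent documentation:

```shell
# Cluster topology: 2 shards x 3 mongods, 3 config servers, 2 mongos routers.
total_hosts=$((2 * 3 + 3 + 2))
# The agent log shows 20 requests for 10 hosts, i.e. 2 requests per host.
expected_requests=$((total_hosts * 2))
echo "total hosts: $total_hosts, expected requests: $expected_requests"
# → total hosts: 11, expected requests: 22
```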

## System Warnings

After that, I found some system warnings in the admin page:

```
Automation metrics found conflicting canonical hosts=[sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017] within all ping's hosts=[sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017]. Skipping this sample.
```

Backup agent warnings:

```
com.xgen.svc.mms.svc.alert.AlertProcessingSvc [runAlertCheck:187] Alert check failed: groupId=5ff879a85d86101b89072b19, event=BACKUP_AGENT_DOWN, target=unassigned
java.lang.IllegalStateException: Duplicate key sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local
  (attempted merging values
   BackupAgentAudit{Id: 5ff8802c39e579142fd18bcd, GroupId: 5ff879a85d86101b89072b19, Hostname: sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local, Version: 10.14.17.6445}
   and
   BackupAgentAudit{Id: 5ff8802d39e579142fd18d2f, GroupId: 5ff879a85d86101b89072b19, Hostname: sharddb-0-1.sharddb-sh.mongodb.svc.cluster.local, Version: 10.14.17.6445})
```

The above logs indicate that the host sharddb-0-0 conflicts with some other host. But how do we make the automation agent discover the missing host?

# Solution

After a week of grinding, I accidentally found the culprit: conflicting host mappings. What are host mappings? Refer to: Host Mappings

Go to Deployment -> More -> Host Mappings to check the existing host mappings. If you find something like this:

| Host | Alias |
| --- | --- |
| sharddb-0-0.sharddb-sh.mongodb.svc.cluster.local:27017 | sharddb-0-1:27017 |

it means sharddb-0-0 is mapped to sharddb-0-1, which is exactly what triggers the conflicting-hosts issue.

Solution: delete the conflicting host mappings, or simply delete all host mappings. (Don't worry: the mappings will be reconstructed a few minutes later.)
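If you want to confirm from the command line what monitoring actually sees before touching the mappings, the Ops Manager public API's Hosts resource can list the monitored hosts for a project. The URL, project ID, and API key names below are placeholders, and as far as I know there is no public API endpoint for host mappings themselves, so the deletion itself happens in the UI as described above:

```shell
# Placeholder values; substitute your Ops Manager URL, project ID, and API key pair.
OPSMGR_URL="https://opsmanager.example.com"
PROJECT_ID="5ff879a85d86101b89072b19"

# Print the curl invocation (the API uses HTTP digest auth):
echo "curl --digest -u 'PUBLIC_KEY:PRIVATE_KEY' ${OPSMGR_URL}/api/public/v1.0/groups/${PROJECT_ID}/hosts"
```

A duplicate or missing entry in that list lines up with the "conflicting canonical hosts" warning seen earlier.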

I guess the root cause is that Ops Manager doesn't update the host mappings in a timely manner: if host A obtains a new IP address that was previously used by host B, the issue occurs.