MongoDB frequently switching primary

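We are running a Mongo 2.6 replica set with 3 members: primary, secondary, and arbiter. Almost every day our MongoDB switches which server is the primary, and this causes all connections to the DB to be dropped. That would be perfectly fine if one of the servers had really gone down, but in every case the "down" server does not appear to have actually been down. This happens all the time.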

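Here is what we know:

  1. The mongod process was not restarted or shut down on any of the 3 servers.
  2. The servers kept reporting to New Relic the whole time.
  3. In the mongo logs we see frequent heartbeat failures.
  4. At no point were the servers under very high load. I do see a burst of CPU for about 10 minutes every hour, but that does not line up with the failures.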

Below is the output of show log rs, run from a shell connected to the current primary.

 2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
 2015-05-17T15:05:49.358+0000 [rsBackgroundSync] replSet syncing to: server1:27017
 2015-05-17T15:05:56.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
 2015-05-17T22:11:36.638+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
 2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
 2015-05-17T22:11:37.495+0000 [rsMgr] not electing self, we are not freshest
 2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up
 2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
 2015-05-17T22:11:39.140+0000 [rsBackgroundSync] replSet syncing to: server1:27017
 2015-05-17T22:11:39.147+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
 2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
 2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet syncing to: server1:27017
 2015-05-17T23:05:47.876+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
 2015-05-18T10:05:46.821+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
 2015-05-18T10:05:46.822+0000 [rsBackgroundSync] replSet syncing to: server1:27017
 2015-05-18T10:05:51.014+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
 2015-05-18T22:12:11.433+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
 2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
 2015-05-18T22:12:11.507+0000 [rsMgr] replSet info electSelf 3
 2015-05-18T22:12:14.708+0000 [rsMgr] replSet PRIMARY
 2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is up
 2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
 2015-05-18T22:12:21.610+0000 [rsHealthPoll] replSet member server1:27017 is now in state ROLLBACK
 2015-05-18T22:12:23.612+0000 [rsHealthPoll] replSet member server1:27017 is now in state SECONDARY
 2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (xxxx), connection attempt failed
 2015-05-19T22:13:24.127+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (xxxx) failed, connection attempt failed
 2015-05-19T22:13:29.267+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
 2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state

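You can see that we are getting frequent heartbeat failures and "down" notices, but in every case the server is back up within a few seconds. Other than trial and error, I am not sure where to start looking to find out what could be causing this.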
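One way to start narrowing this down is to pull the timestamps of the DOWN transitions out of the replica set log and see whether they cluster at a particular time of day. In the excerpt, the two DOWN transitions fall at 22:11 and 22:12 UTC on consecutive days, which may hint at a scheduled job. The following is a minimal Python sketch, assuming the Mongo 2.6 log line format shown in the excerpt; the function names `down_events` and `by_hour` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
# Assumption: timestamps are logged in UTC with a +0000 offset, as in the excerpt.
DOWN_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d{3}\+0000 "
    r"\[rsHealthPoll\] replSet member (\S+) is now in state DOWN"
)


def down_events(lines):
    """Return a (timestamp, member) pair for every 'is now in state DOWN' line."""
    events = []
    for line in lines:
        match = DOWN_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
            events.append((ts, match.group(2)))
    return events


def by_hour(events):
    """Count DOWN events per hour of day, using the hour as logged."""
    return Counter(ts.hour for ts, _ in events)


# Three lines copied from the log excerpt; only two are DOWN transitions.
sample = [
    "2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN",
    "2015-05-17T22:11:37.495+0000 [rsMgr] not electing self, we are not freshest",
    "2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN",
]

if __name__ == "__main__":
    for hour, count in sorted(by_hour(down_events(sample)).items()):
        print(f"{hour:02d}:00 UTC  {count} DOWN event(s)")
```

Run against a longer stretch of the log, a strong peak in a single hour would point toward something scheduled (a backup, a cron job) rather than random network trouble.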

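I have seen this often, and the cause has always been outside the mongod process: DNS resolver problems, TCP/IP stack problems, network links, physical hardware, and so on. Step outside the mongod process. Check your host OS for network errors, check the physical links (if physical hardware is part of the equation), and check whether your cloud provider places the two servers in different regions. This is most likely something on the host OS and has nothing to do with MongoDB itself.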
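A check like the ones suggested above can be run from outside mongod with nothing but the standard library. This is a rough Python sketch, not a diagnostic tool: it times a name lookup and a plain TCP connect to a member. The host `server1` and port `27017` come from the logs; the function names `resolve_time` and `probe` are hypothetical.

```python
import socket
import time


def resolve_time(host, port):
    """Return seconds taken to resolve host, or None if resolution fails."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, port)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return time.monotonic() - start


def probe(host, port, timeout=2.0):
    """Return seconds taken for a TCP connect, or None if the attempt fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # refused, timed out, unreachable, unresolved
        return None


if __name__ == "__main__":
    # Host and port as they appear in the replica set log.
    host, port = "server1", 27017
    print("resolve:", resolve_time(host, port))
    print("connect:", probe(host, port))
```

Running this every few seconds from each member toward the others, and logging the results with timestamps, gives a record that can be compared with the heartbeat failures in the mongod log. If the probe fails or slows down at the same moments, the problem is below MongoDB.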

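This has been resolved. The core problem was that our managed hosting provider was running VMWare snapshots as a backup mechanism. These snapshots were causing the VM to temporarily freeze; I believe the technical term is VM stun.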

Once these snapshots were disabled, we no longer had any issues.
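A pause like this can sometimes be spotted from inside the guest by sampling a clock at a fixed interval and looking for gaps much larger than the interval. The sketch below is Python and rests on an assumption: that the guest's clock jumps forward after the pause, which depends on the hypervisor and on how the guest clock is resynchronized. The names `sample_clock` and `find_gaps` are made up for illustration.

```python
import time


def sample_clock(count, interval, clock=time.time, sleep=time.sleep):
    """Record `count` clock readings, sleeping `interval` seconds between them."""
    readings = []
    for _ in range(count):
        readings.append(clock())
        sleep(interval)
    return readings


def find_gaps(readings, interval, tolerance):
    """Return (start, gap) pairs where consecutive readings are more than
    interval + tolerance seconds apart, i.e. the sampler itself was stalled."""
    gaps = []
    for prev, cur in zip(readings, readings[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, delta))
    return gaps


if __name__ == "__main__":
    # Sample once per second for a minute and report anything that
    # overshoots the interval by more than half a second.
    readings = sample_clock(count=60, interval=1.0)
    for start, gap in find_gaps(readings, interval=1.0, tolerance=0.5):
        print(f"stall of {gap:.2f}s starting at {time.ctime(start)}")
```

A heavily loaded host can also delay the sampler, so a gap on its own is not proof of a hypervisor pause. It is one more data point to compare against the times of the heartbeat failures.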