DRBD/Corosync cluster: second node keeps trying to become primary

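We are facing a problem with a DRBD/Corosync cluster.

On one node, which is primary, all resources (the MySQL service, DRBD) work fine. But the second node keeps trying to become primary.

The error log on the second node looks like this: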

    lrmd: [25272]: info: RA output: (mysql-drbd:0:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config
    Oct 1 16:39:39 node2 lrmd: [25272]: info: RA output: (mysql-drbd:0:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config
    Oct 1 16:39:39 node2 lrmd: [25272]: info: RA output: (mysql-drbd:0:promote:stderr) Command 'drbdsetup 0 primary' terminated with exit code 11
    Oct 1 16:39:39 node2 drbd[25416]: ERROR: mysql-disk: Called drbdadm -c /etc/drbd.conf primary mysql-disk
    Oct 1 16:39:39 node2 drbd[25416]: ERROR: mysql-disk: Called drbdadm -c /etc/drbd.conf primary mysql-disk
    Oct 1 16:39:39 node2 drbd[25416]: ERROR: mysql-disk: Exit code 11
    Oct 1 16:39:39 node2 drbd[25416]: ERROR: mysql-disk: Exit code 11
    Oct 1 16:39:39 node2 drbd[25416]: ERROR: mysql-disk: Command output:
    Oct 1 16:39:39 node2 drbd[25416]: ERROR: mysql-disk: Command output:

The Master/Slave status reported by the cluster is not right either. See the `crm status` output from each node below.

Node 1:

    [root@node1 ~]# crm status
    ============
    Last updated: Thu Oct 2 09:01:30 2014
    Stack: openais
    Current DC: node1 - partition WITHOUT quorum
    Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
    2 Nodes configured, 2 expected votes
    4 Resources configured.
    ============
    Online: [ node1 ]
    OFFLINE: [ node2 ]
    mysql-vip (ocf::heartbeat:IPaddr2): Started node1
    Master/Slave Set: mysql-drbd-ms
        Masters: [ node1 ]
        Stopped: [ mysql-drbd:1 ]
    mysql-fs (ocf::heartbeat:Filesystem): Started node1
    mysql-server (ocf::heartbeat:mysql): Started node1
    You have new mail in /var/spool/mail/root

Node 2:

    [root@node2 ~]# crm status
    ============
    Last updated: Thu Oct 2 09:03:04 2014
    Stack: openais
    Current DC: node2 - partition WITHOUT quorum
    Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
    2 Nodes configured, 2 expected votes
    4 Resources configured.
    ============
    Online: [ node2 ]
    OFFLINE: [ node1 ]
    Master/Slave Set: mysql-drbd-ms
        mysql-drbd:0 (ocf::linbit:drbd): Slave node2 (unmanaged) FAILED
        Stopped: [ mysql-drbd:1 ]
    Failed actions:
        mysql-drbd:0_promote_0 (node=node2, call=7, rc=-2, status=Timed Out): unknown exec error
        mysql-drbd:0_stop_0 (node=node2, call=13, rc=6, status=complete): not configured

The DRBD status looks fine on both nodes.

Node 1 (Primary):

    [root@node1 ~]# service drbd status
    drbd driver loaded OK; device status:
    version: 8.3.8 (api:88/proto:86-94)
    GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by [email protected], 2010-06-04 08:04:09
    m:res          cs         ro                 ds                 p  mounted  fstype
    0:mysql-disk   Connected  Primary/Secondary  UpToDate/UpToDate  C

Node 2 (Secondary):

    [root@node2 ~]# service drbd status
    drbd driver loaded OK; device status:
    version: 8.3.8 (api:88/proto:86-94)
    GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by [email protected], 2010-06-04 08:04:09
    m:res          cs         ro                 ds                 p  mounted  fstype
    0:mysql-disk   Connected  Secondary/Primary  UpToDate/UpToDate  C

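This happens because you have not configured cluster fencing (STONITH), and your cluster is now in a split-brain state.

You now have a cluster with two DCs (each `crm status` above names its own node as "Current DC"), and each node is trying to start the resources.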
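To check whether fencing really is switched off, one option is to look for the `stonith-enabled` property in the output of `crm configure show`. The helper below is only a rough sketch: the name `stonith_state` is made up for illustration, and it assumes the property is printed as `stonith-enabled="false"` or `stonith-enabled="true"` (with or without quotes).

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): reads `crm configure show`
# output on stdin and reports how the stonith-enabled property is set.
stonith_state() {
    # Take the first stonith-enabled=... token, then strip the key and quotes.
    value=$(grep -o 'stonith-enabled="\{0,1\}[a-z]*' | head -n 1 | sed 's/.*=//; s/"//g')
    case "$value" in
        true)  echo "enabled" ;;
        false) echo "disabled" ;;
        *)     echo "unset" ;;   # property not present in the input
    esac
}
```

On a cluster node it would be used as `crm configure show | stonith_state`.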

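It looks like the Corosync instances on your two nodes cannot communicate with each other. That is why each node marks only itself as Online and the other node as OFFLINE.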
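A quick way to confirm this is to compare the Online list that each node reports. The sketch below assumes the `Online: [ ... ]` line format seen in the `crm status` output in the question; `online_nodes` is a hypothetical helper name, not a Pacemaker command.

```shell
#!/bin/sh
# Hypothetical helper: print the node names from the "Online: [ ... ]" line
# of `crm status` output read on stdin, one name per line.
online_nodes() {
    sed -n 's/.*Online: \[ \([^]]*\) \].*/\1/p' | tr ' ' '\n'
}
```

Running `crm status | online_nodes` on both nodes should print the same two names on a healthy cluster. If each node prints only its own name, the nodes do not see each other.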

I would suggest trying unicast instead of multicast.

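Stop Corosync on both nodes.
Update Corosync to a version that supports unicast (1.4.1).
Change your Corosync configuration, adding the settings shown after these steps.
Start Corosync.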

    interface {
        member {
            memberaddr: <node1 IP>
        }
        member {
            memberaddr: <node2 IP>
        }
        ringnumber: 0
        bindnetaddr: <Network address of your nodes>
        mcastport: 5405
    }

    transport: udpu

Please comment out the line that sets `mcastaddr`.
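Put together, the pieces above might end up in a `totem` section like the one printed by the sketch below. This is an illustration under stated assumptions, not a drop-in file: `udpu_totem` is a hypothetical helper, the addresses in the example call are placeholders, and any other `totem` options already in your corosync.conf (for example `secauth`) would need to be carried over by hand.

```shell
#!/bin/sh
# Hypothetical helper: print a totem section for unicast (udpu) transport.
# Arguments: <bindnetaddr> <node IP> [<node IP> ...]
# No mcastaddr line is emitted, since udpu does not use multicast.
udpu_totem() {
    bindnetaddr=$1
    shift
    echo "totem {"
    echo "    version: 2"
    echo "    transport: udpu"
    echo "    interface {"
    # One member block per node address
    for ip in "$@"; do
        echo "        member {"
        echo "            memberaddr: $ip"
        echo "        }"
    done
    echo "        ringnumber: 0"
    echo "        bindnetaddr: $bindnetaddr"
    echo "        mcastport: 5405"
    echo "    }"
    echo "}"
}

# Example with placeholder addresses (assumptions, replace with your own):
udpu_totem 192.168.1.0 192.168.1.11 192.168.1.12
```

The same file should be used on both nodes, so that each node lists itself and its peer as members.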

Allow ports 5404 and 5405 through the iptables firewall, and start Corosync on both nodes.
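As a rough sketch of the firewall step: Corosync traffic is UDP, so rules along the following lines should be enough. The helper `corosync_fw_rules` is hypothetical and only prints the commands so they can be reviewed first. Inserting with `-I` at the top of the `INPUT` chain is an assumption about your rule set.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one iptables rule per UDP port given.
corosync_fw_rules() {
    for port in "$@"; do
        echo "iptables -I INPUT -p udp --dport $port -j ACCEPT"
    done
}

# Review the printed rules; as root they can then be piped to sh.
corosync_fw_rules 5404 5405
```

Rules added this way do not survive a reboot by themselves. On RHEL/CentOS-style systems, `service iptables save` is the usual way to persist them.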

Thanks.