Monitoring system resources with Pacemaker and Corosync: cloned resource monitor returns "not running"

Setup: OS: CentOS 7, latest versions of Corosync, Pacemaker & PCS – two-node active/active cluster with a virtual IP – Exim runs on both nodes for remote mail (SMTP), nothing special in the configuration – when Exim fails on one of the nodes, that node should no longer serve the virtual IP until Exim is up and running again.
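For completeness, the cluster itself was set up the usual way, roughly like this (the property settings at the end are the typical two-node-test defaults, not necessarily exactly what I used):

    # Authenticate the two nodes and create/start the cluster
    pcs cluster auth testvm101 testvm102
    pcs cluster setup --name smtp_cluster testvm101 testvm102
    pcs cluster start --all

    # Typical two-node test settings (assumed): no fencing, ignore quorum loss
    pcs property set stonith-enabled=false
    pcs property set no-quorum-policy=ignore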

What I am trying to get working: – a cloned ocf:heartbeat:IPaddr2 resource for the virtual IP – a cloned systemd:exim resource that watches Exim, with the on-fail="standby" option.
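The resources were created roughly like this (the 30s monitor interval matches the exim:0_monitor_30000 operation in the logs below; the IP address is a placeholder, and the clone options are reconstructed from the "(unique)" flag in the pcs status output):

    # Cloned virtual IP resource; 192.168.1.100 is a placeholder address
    pcs resource create virtual_ip ocf:heartbeat:IPaddr2 \
        ip=192.168.1.100 cidr_netmask=24 \
        op monitor interval=30s
    pcs resource clone virtual_ip globally-unique=true clone-max=2 clone-node-max=2

    # Cloned Exim resource; a failed monitor should put the node in standby
    pcs resource create exim systemd:exim \
        op monitor interval=30s on-fail=standby
    pcs resource clone exim globally-unique=true clone-max=2 clone-node-max=2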

The problem: initially everything works as it should. When one of the nodes fails to run Exim, it is stopped correctly and the node no longer serves the virtual IP. The problem is that after stopping and starting one of the nodes, Exim starts up again (as it should), but the monitor returns "not running". When the Exim resource is configured without on-fail="standby", everything works as designed and I can start/stop Exim and either of the nodes as much as I like.

Messages in the log:

    Jan 28 16:17:30 testvm101 crmd[14183]: notice: process_lrm_event: LRM operation exim:0_monitor_30000 (call=141, rc=7, cib-update=211, confirmed=false) not running
    Jan 28 16:17:30 testvm101 crmd[14183]: warning: status_from_rc: Action 20 (exim:0_monitor_30000) on testvm101 failed (target: 0 vs. rc: 7): Error
    Jan 28 16:17:30 testvm101 crmd[14183]: warning: update_failcount: Updating failcount for exim:0 on testvm101 after failed monitor: rc=7 (update=value++, time=1422458250)

Output of pcs status:

    [root@testvm101 ~]# pcs status
    Cluster name: smtp_cluster
    Last updated: Wed Jan 28 16:31:44 2015
    Last change: Wed Jan 28 16:17:13 2015 via cibadmin on testvm101
    Stack: corosync
    Current DC: testvm101 (1) - partition with quorum
    Version: 1.1.10-32.el7_0.1-368c726
    2 Nodes configured
    4 Resources configured

    Node testvm101 (1): standby (on-fail)
    Online: [ testvm102 ]

    Full list of resources:

     Clone Set: virtual_ip-clone [virtual_ip] (unique)
         virtual_ip:0  (ocf::heartbeat:IPaddr2):  Started testvm102
         virtual_ip:1  (ocf::heartbeat:IPaddr2):  Started testvm102
     Clone Set: exim-clone [exim] (unique)
         exim:0  (systemd:exim):  Started testvm102
         exim:1  (systemd:exim):  Started testvm102

    Failed actions:
        exim:0_monitor_30000 on testvm101 'not running' (7): call=141, status=complete, last-rc-change='Wed Jan 28 16:17:30 2015', queued=6ms, exec=15002ms

As far as I can tell, Exim was running and working according to systemd at the time of these messages. I already tried specifying a start-delay option in the hope that it would make a difference (it did not); see below.
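For reference, this is roughly what I checked and what I tried (the 15s delay is just an example value):

    # systemd itself reports the service as active at the time of the failure
    systemctl status exim.service

    # Re-declare the monitor operation with a start-delay (made no difference)
    pcs resource update exim op monitor interval=30s on-fail=standby start-delay=15s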

Running pcs resource cleanup exim-clone clears the failcount, and everything works fine until the monitor action fires for the first time; then the node marked standby is swapped for the other one…

Example: status after the Exim monitor failed on node testvm102:

    [root@testvm101 ~]# pcs status
    ...
    Node testvm102 (2): standby (on-fail)
    Online: [ testvm101 ]

    Full list of resources:

     Clone Set: virtual_ip-clone [virtual_ip] (unique)
         virtual_ip:0  (ocf::heartbeat:IPaddr2):  Started testvm101
         virtual_ip:1  (ocf::heartbeat:IPaddr2):  Started testvm101
     Clone Set: exim-clone [exim] (unique)
         exim:0  (systemd:exim):  Started testvm101
         exim:1  (systemd:exim):  Started testvm101

    Failed actions:
        exim:0_monitor_30000 on testvm102 'not running' (7): call=150, status=complete, last-rc-change='Wed Jan 28 16:33:59 2015', queued=5ms, exec=15004ms

I run a resource cleanup for the exim resource to reset the failcount:

    [root@testvm101 ~]# pcs resource cleanup exim-clone
    Resource: exim-clone successfully cleaned up

After a while the status looks fine again (and actually is fine):

    [root@testvm101 ~]# pcs status
    ...
    Online: [ testvm101 testvm102 ]

    Full list of resources:

     Clone Set: virtual_ip-clone [virtual_ip] (unique)
         virtual_ip:0  (ocf::heartbeat:IPaddr2):  Started testvm101
         virtual_ip:1  (ocf::heartbeat:IPaddr2):  Started testvm102
     Clone Set: exim-clone [exim] (unique)
         exim:0  (systemd:exim):  Started testvm101
         exim:1  (systemd:exim):  Started testvm102

The next time the monitor action runs, the check fails on the other node:

    [root@testvm101 ~]# pcs status
    ...
    Node testvm101 (1): standby (on-fail)
    Online: [ testvm102 ]

    Full list of resources:

     Clone Set: virtual_ip-clone [virtual_ip] (unique)
         virtual_ip:0  (ocf::heartbeat:IPaddr2):  Started testvm102
         virtual_ip:1  (ocf::heartbeat:IPaddr2):  Started testvm102
     Clone Set: exim-clone [exim] (unique)
         exim:0  (systemd:exim):  Started testvm102
         exim:1  (systemd:exim):  Started testvm102

    Failed actions:
        exim:0_monitor_30000 on testvm101 'not running' (7): call=176, status=complete, last-rc-change='Wed Jan 28 16:37:10 2015', queued=0ms, exec=0ms

Maybe there is something I have overlooked?

Thanks for any help