Linux-HA + dm-multipath: removing paths causes a segfault, a kernel NULL pointer dereference, and STONITH

So I am setting up a Linux-HA cluster running:

* Pacemaker 1.1.5
* OpenAIS 1.1.4
* multipath-tools 0.4.9
* OpenSuSE 11.4, kernel 2.6.37

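The cluster configuration passed a health check by LinBit, so I am fairly confident in it.

Multipath is in use because we have an LSI SAS array connected to each host through 2 HBAs (4 paths in total per host). What I want to do now is test the failover functionality by removing paths from the multipath setup.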

The multipath paths are as follows:

    pgsql-data (360080e50001b658a000006874e398abe) dm-0 LSI,INF-01-00
    size=6.0T features='0' hwhandler='1 rdac' wp=rw
    |-+- policy='round-robin 0' prio=0 status=active
    | |- 4:0:0:1 sda 8:0 active undef running
    | `- 5:0:0:1 sde 8:64 active undef running
    `-+- policy='round-robin 0' prio=0 status=enabled
      |- 4:0:1:1 sdc 8:32 active undef running
      `- 5:0:1:1 sdg 8:96 active undef running

To simulate a lost path, I echo 1 to /sys/block/{path}/device/state, which causes the path to show up as failed/faulty in multipath, as shown here:

    pgsql-data (360080e50001b658a000006874e398abe) dm-0 LSI,INF-01-00
    size=6.0T features='0' hwhandler='1 rdac' wp=rw
    |-+- policy='round-robin 0' prio=0 status=active
    | |- 4:0:1:1 sdc 8:32 failed faulty offline
    | `- 5:0:1:1 sdg 8:96 active undef running
    `-+- policy='round-robin 0' prio=0 status=enabled
      |- 4:0:0:1 sda 8:0 active undef running
      `- 5:0:0:1 sde 8:64 active undef running
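For anyone wanting to repeat the test, the sysfs write can be wrapped in a small shell helper. This is only a sketch under assumptions: `set_path_state` and its optional third argument (an alternative sysfs root, there purely so the function can be tried against a scratch directory) are hypothetical names, and it writes the word offline, the value mentioned in the follow-up further down, rather than 1. Writing running is assumed to be how the path is brought back.

```shell
#!/bin/sh
# Hypothetical helper: set_path_state DEV STATE [SYSFS_ROOT]
# Writes STATE ("offline" or "running") to <root>/block/DEV/device/state.
# SYSFS_ROOT defaults to /sys; the override is an assumption made for testing.
set_path_state() {
    dev=$1
    state=$2
    root=${3:-/sys}

    # Only accept the two values this sketch knows about.
    case "$state" in
        offline|running) ;;
        *) echo "unsupported state: $state" >&2; return 2 ;;
    esac

    file="$root/block/$dev/device/state"
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi

    printf '%s\n' "$state" > "$file"
}
```

Typical use would be `set_path_state sdc offline`, then `multipath -l` to inspect the result, and `set_path_state sdc running` to restore the path. It needs root on a real system.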

However, watching /var/log/messages, I notice that the rdac checker says the path is still up:

    multipathd: pgsql-data: sdc - rdac checker reports path is up

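Also, look back at the multipath -l output: notice that the failed path is still in the active group? It should have been moved to the enabled group, and an active/running path from the enabled group should have taken its place (in active).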
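To make that mismatch easier to spot, the `multipath -l` listing can be reduced to one line per path showing which group it sits in. The function below is a rough sketch, not part of multipath-tools: `path_groups` is a hypothetical name, and the parsing assumes the line-per-path layout that multipath-tools 0.4.9 prints in this thread (a `status=` field on each path-group line, and an H:C:T:L address followed by device name, major:minor, and dm state on each path line).

```shell
#!/bin/sh
# Hypothetical helper: reads `multipath -l` output on stdin and prints
#   <device> group=<path-group status> dm_state=<dm path state>
path_groups() {
    awk '
        /status=/ {
            # Path-group header: remember its status (active or enabled).
            for (i = 1; i <= NF; i++)
                if ($i ~ /^status=/) { split($i, kv, "="); group = kv[2] }
            next
        }
        {
            # Path line: locate the H:C:T:L field; the device name is the
            # next field and the dm state is three fields after it.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) {
                    print $(i + 1), "group=" group, "dm_state=" $(i + 3)
                    break
                }
        }
    '
}
```

Used as `multipath -l | path_groups`, any line that reports a failed dm state alongside `group=active` is a path that failed without its group being demoted.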

Now, if we take the other path in that group, sdg, offline as well, not only does rdac still report the path as up, but the multipath resource goes into a FAILED state in the cluster and neither of the two active/enabled paths takes over. The result is a segfault, a kernel error about being unable to dereference a NULL pointer, and the cluster STONITHs the node.

    db01-primary:/home/kendall/scripts # crm resource show
    db01-secondary-stonith (stonith:external/ipmi) Started
    db01-primary-stonith (stonith:external/ipmi) Started
    Master/Slave Set: master_drbd [drbd_pg_xlog]
    Masters: [ db01-primary ]
    Slaves: [ db01-secondary ]
    Resource Group: ha-pgsql
    multipathd (lsb:/etc/init.d/multipathd) Started FAILED
    pgsql_mp_fs (ocf::heartbeat:Filesystem) Started
    pg_xlog_fs (ocf::heartbeat:Filesystem) Started
    ha-DBIP-mgmt (ocf::heartbeat:IPaddr2) Started
    ha-DBIP (ocf::heartbeat:IPaddr2) Started
    postgresql (ocf::heartbeat:pgsql) Started
    incron (lsb:/etc/init.d/incron) Started
    pgbouncer (lsb:/etc/init.d/pgbouncer) Stopped
    pager-email (ocf::heartbeat:MailTo) Stopped
    db01-primary:/home/kendall/scripts # multipath -l
    pgsql-data (360080e50001b658a000006874e398abe) dm-0 LSI,INF-01-00
    size=6.0T features='0' hwhandler='1 rdac' wp=rw
    |-+- policy='round-robin 0' prio=0 status=enabled
    | |- 4:0:1:1 sdc 8:32 failed faulty offline
    | `- 5:0:1:1 sdg 8:96 failed faulty offline
    `-+- policy='round-robin 0' prio=0 status=active
      |- 4:0:0:1 sda 8:0 active undef running
      `- 5:0:0:1 sde 8:64 active undef running

Here is an excerpt from /var/log/messages showing the kernel bug:

    Aug 17 15:30:40 db01-primary multipathd: 8:96: mark as failed
    Aug 17 15:30:40 db01-primary multipathd: pgsql-data: remaining active paths: 2
    Aug 17 15:30:40 db01-primary kernel: [ 1833.424180] sd 5:0:1:1: rejecting I/O to offline device
    Aug 17 15:30:40 db01-primary kernel: [ 1833.424281] device-mapper: multipath: Failing path 8:96.
    Aug 17 15:30:40 db01-primary kernel: [ 1833.428389] sd 4:0:0:1: rdac: array , ctlr 1, queueing MODE_SELECT command
    Aug 17 15:30:40 db01-primary multipathd: dm-0: add map (uevent)
    Aug 17 15:30:41 db01-primary kernel: [ 1833.804418] sd 4:0:0:1: rdac: array , ctlr 1, MODE_SELECT completed
    Aug 17 15:30:41 db01-primary kernel: [ 1833.804437] sd 5:0:0:1: rdac: array , ctlr 1, queueing MODE_SELECT command
    Aug 17 15:30:41 db01-primary kernel: [ 1833.808127] sd 5:0:0:1: rdac: array , ctlr 1, MODE_SELECT completed
    Aug 17 15:30:42 db01-primary multipathd: pgsql-data: sda - rdac checker reports path is up
    Aug 17 15:30:42 db01-primary multipathd: 8:0: reinstated
    Aug 17 15:30:42 db01-primary kernel: [ 1835.639635] device-mapper: multipath: adding disabled device 8:32
    Aug 17 15:30:42 db01-primary kernel: [ 1835.639652] device-mapper: multipath: adding disabled device 8:96
    Aug 17 15:30:42 db01-primary kernel: [ 1835.640666] BUG: unable to handle kernel NULL pointer dereference at (null)
    Aug 17 15:30:42 db01-primary kernel: [ 1835.640688] IP: [<ffffffffa01408a3>] dm_set_device_limits+0x23/0x140 [dm_mod]

There is also a stack trace, available at http://pastebin.com/gifMj7gu

multipath.conf is available at http://pastebin.com/dw9pqF3Z

Does anyone have any insight into this, and/or how to proceed?

I can reproduce this every time.

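OK, as it turns out, setting "offline" in /sys/block/{dev}/device/state is not enough to make rdac report the path as down. Last night I spent some time with the unit, pulling SAS cables and watching how the system behaved. That works properly. Not quite "as expected", because when an active path fails it is not replaced by one from the enabled group, but that is a separate issue. Failover also works as expected: once the last path is lost, the cluster shuts down the database and its related resources and moves them to the secondary node.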

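If you find yourself in a similar situation, you can try setting the multipath hwhandler to "0" in multipath.conf. You have to set it in the device {} section. This basically disables path checking, so once a device goes offline, it really is offline.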
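As a rough illustration of where that setting goes, a device section might look like the fragment below. The vendor and product strings are copied from the multipath -l output in the question, and `hardware_handler` is the multipath.conf keyword that multipath -l displays as hwhandler. Everything else in the real multipath.conf (linked above on pastebin) is unknown here, so treat this as a sketch to merge into an existing device entry, not a complete configuration.

```
devices {
    device {
        # Strings as reported by multipath -l for this array
        vendor  "LSI"
        product "INF-01-00"
        # "0" means no hardware handler (the listing shows '1 rdac')
        hardware_handler "0"
    }
}
```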