Cannot switch DRBD to secondary

I'm running DRBD with OCFS2 on CentOS 5 and planning to use Pacemaker. After a while, I ran into a DRBD split-brain problem.

 version: 8.3.13 (api:88/proto:86-96)
 GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by [email protected], 2012-05-07 11:56:36
  1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
     ns:0 nr:0 dw:112281991 dr:797551 al:99 bm:6401 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:60

I cannot switch my DRBD to secondary:

 # drbdadm secondary r0
 1: State change failed: (-12) Device is held open by someone
 Command 'drbdsetup 1 secondary' terminated with exit code 11
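Before reaching for `lsof`, the kernel's own view of who holds the device can be checked. A minimal sketch (the `list_holders` helper and the `fuser` fallback are my own additions, not from this thread):

```shell
#!/bin/sh
# List candidate holders of a block device: kernel-level holders from sysfs,
# plus userspace openers via fuser when it is available.
list_holders() {
    dev="$1"                     # e.g. drbd1
    sysdir="${2:-/sys/block}"    # overridable so the sketch can be tested
    if [ -d "$sysdir/$dev/holders" ]; then
        ls "$sysdir/$dev/holders"        # other kernel devices stacked on top
    fi
    # fuser shows userspace processes; kernel threads will NOT appear here
    command -v fuser >/dev/null 2>&1 && fuser -v "/dev/$dev" 2>&1
    return 0
}
```

For example, `list_holders drbd1` would print any device-mapper nodes stacked on the DRBD device, which is exactly the case a dangling LVM or multipath mapping produces.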

My DRBD resource configuration:

 resource r0 {
     syncer {
         rate 1000M;
         verify-alg sha1;
     }
     disk {
         on-io-error detach;
     }
     handlers {
         pri-lost-after-sb "/usr/lib/drbd/notify-split-brain.sh root";
     }
     net {
         allow-two-primaries;
         after-sb-0pri discard-younger-primary;
         after-sb-1pri call-pri-lost-after-sb;
         after-sb-2pri call-pri-lost-after-sb;
     }
     startup {
         become-primary-on both;
     }
     on serving_4130 {
         device /dev/drbd1;
         disk /dev/sdb1;
         address 192.168.4.130:7789;
         meta-disk internal;
     }
     on MT305-3182 {
         device /dev/drbd1;
         disk /dev/xvdb1;
         address 192.168.3.182:7789;
         meta-disk internal;
     }
 }

OCFS2 service status:

 # service ocfs2 status
 Configured OCFS2 mountpoints:  /data

lsof shows that there is a DRBD-related process:

 # lsof | grep drbd
 COMMAND     PID  USER  FD   TYPE     DEVICE  SIZE  NODE  NAME
 drbd1_wor  7782  root  cwd  DIR      253,0   4096     2  /
 drbd1_wor  7782  root  rtd  DIR      253,0   4096     2  /
 drbd1_wor  7782  root  txt  unknown                      /proc/7782/exe

And `/proc/7782/exe` is a dead symlink:

 # ls -l /proc/7782/exe
 ls: cannot read symbolic link /proc/7782/exe: No such file or directory
 lrwxrwxrwx 1 root root 0 May  4 09:56 /proc/7782/exe

 # ps -ef | awk '$2 == "7782" { print $0 }'
 root      7782     1  0 Apr22 ?        00:00:20 [drbd1_worker]

Note that the process name is wrapped in square brackets.

From man ps:

 args       COMMAND   command with all its arguments as a string.
                      Modifications to the arguments may be shown. The
                      output in this column may contain spaces. A process
                      marked <defunct> is partly dead, waiting to be fully
                      destroyed by its parent. Sometimes the process args
                      will be unavailable; when this happens, ps will
                      instead print the executable name in brackets.
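The bracketed name together with the dead `/proc/<pid>/exe` link is the signature of a kernel thread, which no `kill -9` can reclaim. A quick way to classify a PID, as a sketch (the helper name is mine, not from the thread):

```shell
#!/bin/sh
# A process whose /proc/<pid>/exe cannot be resolved is either a kernel
# thread (like [drbd1_worker]) or already gone; signals won't help there.
classify_pid() {
    if readlink "/proc/$1/exe" >/dev/null 2>&1; then
        echo "userspace"
    else
        echo "kernel thread (or exited)"
    fi
}
```

For example, `classify_pid 7782` would report the DRBD worker as a kernel thread, which is why the device must be released some other way.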

So, the final question is: in this situation, how can we recover DRBD manually, without rebooting?


Reply to @andreask:

My partition table:

 # df -h
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/mapper/VolGroup00-LogVol00
                        35G  6.9G   27G  21% /
 /dev/xvda1             99M   20M   74M  22% /boot
 tmpfs                 1.0G     0  1.0G   0% /dev/shm
 /dev/drbd1            100G  902M  100G   1% /data

Device names:

 # dmsetup ls --tree -o inverted
  (202:2)
  ├─VolGroup00-LogVol01 (253:1)
  └─VolGroup00-LogVol00 (253:0)

Note the block device (253:0), the same as in the lsof output above:

 # lvdisplay
   --- Logical volume ---
   LV Name                /dev/VolGroup00/LogVol00
   VG Name                VolGroup00
   LV UUID                vCd152-amVZ-GaPo-H9Zs-TIS0-KI6j-ej8kYi
   LV Write Access        read/write
   LV Status              available
   # open                 1
   LV Size                35.97 GB
   Current LE             1151
   Segments               1
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     256
   Block device           253:0
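Translating lsof's DEVICE column (`253,0`) into a device name can be scripted against sysfs instead of eyeballing lvdisplay output. A minimal sketch (the helper is my own; it assumes `/sys/dev/block` is available and falls back to the raw number otherwise):

```shell
#!/bin/sh
# Map an lsof DEVICE value like "253,0" to its kernel device name via sysfs.
lsof_dev_to_name() {
    devnum=$(printf '%s' "$1" | tr ',' ':')   # lsof uses "maj,min"; sysfs uses "maj:min"
    if [ -e "/sys/dev/block/$devnum" ]; then
        basename "$(readlink -f "/sys/dev/block/$devnum")"
    else
        echo "$devnum"                        # device not present; return the number
    fi
}
```

For example, `lsof_dev_to_name '253,0'` would print the dm device name backing the root filesystem here.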

Reply to @Doug:

 # vgdisplay
   --- Volume group ---
   VG Name               VolGroup00
   System ID
   Format                lvm2
   Metadata Areas        1
   Metadata Sequence No  3
   VG Access             read/write
   VG Status             resizable
   MAX LV                0
   Cur LV                2
   Open LV               2
   Max PV                0
   Cur PV                1
   Act PV                1
   VG Size               39.88 GB
   PE Size               32.00 MB
   Total PE              1276
   Alloc PE / Size       1276 / 39.88 GB
   Free  PE / Size       0 / 0
   VG UUID               OTwzII-AP5H-nIbH-k2UA-H9nw-juBv-wcvmBq

Update: Fri May 17 16:08:16 ICT 2013

Here are some thoughts from Lars Ellenberg:

If the file system is still mounted... oh, well. Unmount it. Not lazily, but for real.

I'm sure that OCFS2 has already been unmounted.

If NFS is involved, try:

 killall -9 nfsd
 killall -9 lockd
 echo 0 > /proc/fs/nfsd/threads

No, NFS is not involved.

If LVM/dmsetup/kpartx/multipath/udev is involved, try:

 dmsetup ls --tree -o inverted 

and check whether there are any dependencies on drbd in there.
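That dependency check can be made mechanical; a tiny sketch (the filter name is mine) that works on live output or on a saved dump:

```shell
#!/bin/sh
# Report whether a `dmsetup ls --tree -o inverted` dump (on stdin) mentions drbd.
dm_tree_has_drbd() {
    if grep -qi 'drbd'; then
        echo "drbd dependency found"
    else
        echo "no drbd dependency"
    fi
}
```

Usage would be `dmsetup ls --tree -o inverted | dm_tree_has_drbd`.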

As the output above shows, LVM has nothing to do with DRBD:

pvdisplay -m

   --- Physical volume ---
   PV Name               /dev/xvda2
   VG Name               VolGroup00
   PV Size               39.90 GB / not usable 20.79 MB
   Allocatable           yes (but full)
   PE Size (KByte)       32768
   Total PE              1276
   Free PE               0
   Allocated PE          1276
   PV UUID               1t4hkB-p43c-ABex-stfQ-XaRt-9H4i-51gSTD

   --- Physical Segments ---
   Physical extent 0 to 1148:
     Logical volume      /dev/VolGroup00/LogVol00
     Logical extents     0 to 1148
   Physical extent 1149 to 1275:
     Logical volume      /dev/VolGroup00/LogVol01
     Logical extents     0 to 126

fdisk -l

 Disk /dev/xvda: 42.9 GB, 42949672960 bytes
 255 heads, 63 sectors/track, 5221 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes

    Device Boot      Start         End      Blocks   Id  System
 /dev/xvda1   *           1          13      104391   83  Linux
 /dev/xvda2              14        5221    41833260   8e  Linux LVM

 Disk /dev/xvdb: 107.3 GB, 107374182400 bytes
 255 heads, 63 sectors/track, 13054 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes

    Device Boot      Start         End      Blocks   Id  System
 /dev/xvdb1               1       13054   104856223+  83  Linux

If loop/cryptoloop/etc. is involved, check whether one of them is still accessing it.

If some virtualization technology is in use, shut down/destroy every container/VM that may have accessed that DRBD during its lifetime.

No, none of these.

Sometimes it is just udev, or something equivalent, racing.

I have already disabled the multipath rules and even stopped udevd; nothing changed.

Sometimes it is a unix domain socket or something similar that is still held open (it does not necessarily show up in lsof/fuser).

If so, how can we find out which unix socket it is?
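One way to hunt for such a socket (an assumed approach, not something suggested in the thread) is to cross-reference the socket inodes in `/proc/net/unix` with the file descriptors of suspect processes:

```shell
#!/bin/sh
# Print the socket inode numbers a given PID holds open, by reading its
# /proc fd symlinks (entries look like "socket:[12345]").
pid_socket_inodes() {
    for fd in /proc/"$1"/fd/*; do
        link=$(readlink "$fd" 2>/dev/null) || continue
        case "$link" in
            socket:*) printf '%s\n' "$link" | sed 's/^socket:\[\(.*\)\]$/\1/' ;;
        esac
    done
    return 0
}

# Turn /proc/net/unix-formatted input (on stdin) into "inode path" pairs,
# so the inodes found above can be matched to socket paths.
unix_inode_paths() {
    awk 'NR > 1 { print $7, $8 }'
}
```

For example, `pid_socket_inodes 1234` lists that process's socket inodes, and `unix_inode_paths < /proc/net/unix` maps inodes to paths (anonymous sockets have no path column).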


Update: Fri May 22 22:10:41 ICT 2013

Here is the stack trace of the DRBD worker process, dumped via the magic SysRq key:

 kernel: drbd1_worker  S ffff81007ae21820     0  7782      1          7795  7038 (L-TLB)
 kernel:  ffff810055d89e00 0000000000000046 000573a8befba2d6 ffffffff8008e82f
 kernel:  00078d18577c6114 0000000000000009 ffff81007ae21820 ffff81007fcae040
 kernel:  00078d18577ca893 00000000000002b1 ffff81007ae21a08 000000017a590180
 kernel: Call Trace:
 kernel:  [<ffffffff8008e82f>] enqueue_task+0x41/0x56
 kernel:  [<ffffffff80063002>] thread_return+0x62/0xfe
 kernel:  [<ffffffff80064905>] __down_interruptible+0xbf/0x112
 kernel:  [<ffffffff8008ee84>] default_wake_function+0x0/0xe
 kernel:  [<ffffffff80064713>] __down_failed_interruptible+0x35/0x3a
 kernel:  [<ffffffff885d461a>] :drbd:.text.lock.drbd_worker+0x2d/0x43
 kernel:  [<ffffffff885eca37>] :drbd:drbd_thread_setup+0x127/0x1e1
 kernel:  [<ffffffff800bab82>] audit_syscall_exit+0x329/0x344
 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
 kernel:  [<ffffffff885ec910>] :drbd:drbd_thread_setup+0x0/0x1e1
 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

I'm not sure whether this OCFS2 heartbeat region is what prevents DRBD from switching to secondary:

 kernel: o2hb-C3E41CA2 S ffff810002536420     0  9251     31          3690 (L-TLB)
 kernel:  ffff810004af7d20 0000000000000046 ffff810004af7d30 ffffffff80063002
 kernel:  1400000004000000 000000000000000a ffff81007ec307a0 ffffffff80319b60
 kernel:  000935c260ad6764 0000000000000fcd ffff81007ec30988 0000000000027e86
 kernel: Call Trace:
 kernel:  [<ffffffff80063002>] thread_return+0x62/0xfe
 kernel:  [<ffffffff8006389f>] schedule_timeout+0x8a/0xad
 kernel:  [<ffffffff8009a41d>] process_timeout+0x0/0x5
 kernel:  [<ffffffff8009a97c>] msleep_interruptible+0x21/0x42
 kernel:  [<ffffffff884b3b0b>] :ocfs2_nodemanager:o2hb_thread+0xd2c/0x10d6
 kernel:  [<ffffffff80063002>] thread_return+0x62/0xfe
 kernel:  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
 kernel:  [<ffffffff884b2ddf>] :ocfs2_nodemanager:o2hb_thread+0x0/0x10d6
 kernel:  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
 kernel:  [<ffffffff80032632>] kthread+0xfe/0x132
 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
 kernel:  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
 kernel:  [<ffffffff80032534>] kthread+0x0/0x132
 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

"I'm not sure whether this OCFS2 heartbeat region is what prevents DRBD from switching to secondary."

Maybe. Have you tried to kill that heartbeat region by following this guide?

 # /etc/init.d/o2cb offline serving
 Stopping O2CB cluster serving: Failed
 Unable to stop cluster as heartbeat region still active

OK. First, list the OCFS2 volumes along with their labels and UUIDs:

 # mounted.ocfs2 -d
 Device      FS     Stack  UUID                              Label
 /dev/sdb1   ocfs2  o2cb   C3E41CA2BDE8477CA7FF2C796098633C  data_ocfs2
 /dev/drbd1  ocfs2  o2cb   C3E41CA2BDE8477CA7FF2C796098633C  data_ocfs2

Second, check whether there are still any references to this device:

 # ocfs2_hb_ctl -I -d /dev/sdb1
 C3E41CA2BDE8477CA7FF2C796098633C: 1 refs
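When scripting this recovery, the reference count can be parsed out so the kill step only runs when stale references actually exist. A sketch (the helper name is mine; it assumes the `UUID: N refs` output format shown above):

```shell
#!/bin/sh
# Parse the reference count from `ocfs2_hb_ctl -I -d <dev>` output on stdin.
hb_refs() {
    awk '/refs/ { print $2 }'
}
```

For example: `refs=$(ocfs2_hb_ctl -I -d /dev/sdb1 | hb_refs)`, then kill the region only if `[ "$refs" -gt 0 ]`.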

Try to kill it:

 # ocfs2_hb_ctl -K -d /dev/sdb1 ocfs2 

Then stop the cluster stack:

 # /etc/init.d/o2cb stop
 Stopping O2CB cluster serving: OK
 Unmounting ocfs2_dlmfs filesystem: OK
 Unloading module "ocfs2_dlmfs": OK
 Unmounting configfs filesystem: OK
 Unloading module "configfs": OK

and bring the device back to the secondary role:

 # drbdadm secondary r0
 # drbd-overview
 1:r0 StandAlone Secondary/Unknown UpToDate/DUnknown r-----

Now you can recover from the split brain as usual:

 # drbdadm -- --discard-my-data connect r0
 # drbd-overview
 1:r0 WFConnection Secondary/Unknown UpToDate/DUnknown C r-----

On the other node (the split-brain survivor):

 # drbdadm connect r0
 # drbd-overview
 1:r0 SyncSource Primary/Secondary UpToDate/Inconsistent C r---- /data ocfs2 100G 1.9G 99G 2%
     [>....................] sync'ed:  3.2% (753892/775004)K delay_probe: 28
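Resync progress can be watched in a script by extracting the percentage from the `drbd-overview` line; a sketch (my own helper, assuming the `sync'ed: N%` format shown above):

```shell
#!/bin/sh
# Pull the resync percentage out of a drbd-overview line on stdin.
sync_percent() {
    sed -n "s/.*sync'ed: *\([0-9.]*\)%.*/\1/p"
}
```

For example, a loop could poll `drbd-overview | sync_percent` and only restart OCFS2 on the victim once the output is empty (i.e. the resync has finished).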

On the split-brain victim:

 # /etc/init.d/o2cb start
 Loading filesystem "configfs": OK
 Mounting configfs filesystem at /sys/kernel/config: OK
 Loading filesystem "ocfs2_dlmfs": OK
 Mounting ocfs2_dlmfs filesystem at /dlm: OK
 Starting O2CB cluster serving: OK

 # /etc/init.d/ocfs2 start
 Starting Oracle Cluster File System (OCFS2)                [  OK  ]

Confirm that the mount point is up and running:

 # df -h /data/
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/drbd1            100G  1.9G   99G   2% /data

A common reason DRBD cannot demote a resource is an active device-mapper device on top of it... such as a volume group. You can check this, e.g. with:

 dmsetup ls --tree -o inverted