mount.ocfs2: Transport endpoint is not connected while mounting…?

I replaced a dead node in an OCFS2 cluster running on top of dual-primary DRBD. All of the steps went fine:

/proc/drbd

    version: 8.3.13 (api:88/proto:86-96)
    GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by [email protected], 2012-05-07 11:56:36
     1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
        ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
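For a quicker check of the same state, drbdadm can report the role and disk state directly. A minimal sketch, assuming the DRBD resource is named r0 (substitute the real resource name):

    # query DRBD directly; "r0" is a placeholder for the actual resource name
    drbdadm role r0     # expect: Primary/Primary
    drbdadm dstate r0   # expect: UpToDate/UpToDate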

Everything worked until I tried to mount the volume:

    # mount -t ocfs2 /dev/drbd1 /data/webroot/
    mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1
    on /data/webroot/. Check 'dmesg' for more information on this error.

/var/log/kern.log

    kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
    kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
    kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
    kernel: ocfs2: Unmounting device (147,1) on (node 1)

(Status -107 is -ENOTCONN, "Transport endpoint is not connected", which matches the mount error.) Here is the kernel log on node 0 (192.168.3.145):

    kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
    kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
    kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
    kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107

I'm sure /etc/ocfs2/cluster.conf is identical on both nodes:

/etc/ocfs2/cluster.conf

    node:
        ip_port = 7777
        ip_address = 192.168.3.145
        number = 0
        name = SVR233NTC-3145.localdomain
        cluster = cpc

    node:
        ip_port = 7777
        ip_address = 192.168.2.93
        number = 1
        name = SVR022-293.localdomain
        cluster = cpc

    cluster:
        node_count = 2
        name = cpc
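A quick way to confirm the two copies really are byte-for-byte identical (a sketch, assuming ssh access between the nodes; hostnames as in the file):

    # compare the local copy's checksum against the peer's
    md5sum /etc/ocfs2/cluster.conf
    ssh SVR233NTC-3145.localdomain 'md5sum /etc/ocfs2/cluster.conf'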

They can connect to each other:

    # nc -z 192.168.3.145 7777
    Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!

But the O2CB heartbeat is not active on the new node (192.168.2.93):

/etc/init.d/o2cb status

    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster cpc: Online
      Heartbeat dead threshold = 31
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
    Checking O2CB heartbeat: Not active

Here are the results of running tcpdump while starting ocfs2 on node 1:

    1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
    2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
    3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
    4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
    5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
    6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181

The RST flag is sent on every 6th packet: the handshake succeeds, node 0 accepts 32 bytes of data, then immediately resets the connection.
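For reference, a capture like the one above can be reproduced with something along these lines (the interface name eth0 is an assumption):

    # watch the o2net interconnect traffic; 7777 is ip_port from cluster.conf
    tcpdump -nn -i eth0 'tcp port 7777'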

What else can I do to debug this situation?

PS:

OCFS2 versions on node 0:

  • ocfs2-tools-1.4.4-1.el5
  • ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5

OCFS2 versions on node 1:

  • ocfs2-tools-1.4.4-1.el5
  • ocfs2-2.6.18-308.el5-1.4.7-1.el5

Update 1 – Sun Dec 23 18:15:07 ICT 2012

Are both nodes on the same LAN segment? No routers etc.?

No, they are two VMware servers on different subnets.

Oh, and while I remember – are the hostnames/DNS all set up and working correctly?

Sure, I added the hostname and IP address of each node to /etc/hosts:

    192.168.2.93    SVR022-293.localdomain
    192.168.3.145   SVR233NTC-3145.localdomain

They can connect to each other by hostname:

    # nc -z SVR022-293.localdomain 7777
    Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!

    # nc -z SVR233NTC-3145.localdomain 7777
    Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!
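Name resolution can also be checked on its own, independently of the port test:

    # confirm the resolver actually returns the /etc/hosts entries
    getent hosts SVR022-293.localdomain
    getent hosts SVR233NTC-3145.localdomain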

Update 2 – Mon Dec 24 18:32:15 ICT 2012

Found a clue: my colleague manually edited the /etc/ocfs2/cluster.conf file while the cluster was running, so the dead node's information is still kept in /sys/kernel/config/cluster/:

    # ls -l /sys/kernel/config/cluster/cpc/node/
    total 0
    drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
    drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain

(the dead node being SVR150-4107.localdomain in this case)
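In principle, a stale and unreferenced node entry can be removed straight through configfs, since the o2cb tools manage these objects as plain directories (a sketch; rmdir fails with "Device or resource busy" while anything still holds a reference to the node):

    # configfs objects are directories; drop the dead node's entry
    rmdir /sys/kernel/config/cluster/cpc/node/SVR150-4107.localdomain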

I was going to stop the cluster in order to remove the dead node, but got the following error:

    # /etc/init.d/o2cb stop
    Stopping O2CB cluster cpc: Failed
    Unable to stop cluster as heartbeat region still active

I'm certain that ocfs2 is already stopped:

    # mounted.ocfs2 -f
    Device      FS     Nodes
    /dev/sdb    ocfs2  Not mounted
    /dev/drbd1  ocfs2  Not mounted
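The kernel's own mount table can serve as a cross-check (no output means nothing is mounted):

    # any mounted ocfs2 filesystem would show up here
    grep ocfs2 /proc/mounts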

and there are no references left:

    # ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
    12963EAF4E16484DB81ECB0251177C26: 0 refs

I also unloaded the ocfs2 kernel module to be really sure:

    # ps -ef | grep [o]cfs2
    root     12513    43  0 18:25 ?        00:00:00 [ocfs2_wq]
    # modprobe -r ocfs2
    # ps -ef | grep [o]cfs2
    # lsof | grep ocfs2
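For completeness, lsmod shows which OCFS2-related modules are still loaded; the cluster stack keeps modules such as ocfs2_nodemanager and ocfs2_dlmfs loaded while O2CB is online, so only the filesystem module itself goes away here:

    # the filesystem module is gone, but the cluster-stack modules remain
    lsmod | grep -i ocfs2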

But nothing changed:

    # /etc/init.d/o2cb offline
    Stopping O2CB cluster cpc: Failed
    Unable to stop cluster as heartbeat region still active

So the final question is: how do I remove the stale node information without rebooting?


Update 3 – Mon Dec 24 22:41:51 ICT 2012

Here are all of the running heartbeat regions:

    # ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
    drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2

and the reference count of this heartbeat region:

    # ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
    72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs

Trying to kill it:

    # ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
    ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

Any ideas?

Oh yeah! Problem solved.

Notice the UUIDs:

    # mounted.ocfs2 -d
    Device      FS     Stack  UUID                              Label
    /dev/sdb    ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
    /dev/drbd1  ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1

but:

    # ls -l /sys/kernel/config/cluster/cpc/heartbeat/
    drwxr-xr-x 2 root root 0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2

The active heartbeat region's UUID no longer matches the volume's. This is probably because I "accidentally" force-reformatted the OCFS2 volume, which generated a new UUID. The problem I faced is similar to this one on the Ocfs2-users mailing list.

It is also the cause of the following error:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl cannot find any device with UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.

Then an idea came to me: can I change the UUID of an OCFS2 volume?

Looking through the tunefs.ocfs2 man page:

    Usage: tunefs.ocfs2 [options] <device> [new-size]
           tunefs.ocfs2 -h|--help
           tunefs.ocfs2 -V|--version
    [options] can be any mix of:
            -U|--uuid-reset[=new-uuid]

So I ran the following command:

    # tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
    WARNING!!! OCFS2 uses the UUID to uniquely identify a file system.
    Having two OCFS2 file systems with the same UUID could, in the least,
    cause erratic behavior, and if unlucky, cause file system damage.
    Please choose the UUID with care.
    Update the UUID ?yes

Verify:

    # tunefs.ocfs2 -Q "%U\n" /dev/drbd1
    72EF09EA3D0D4F51BDC00B47432B1EB2

Trying to kill the heartbeat region again to see what happens:

    # ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
    # ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
    72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs
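Each -K call releases only a single reference, so the repetition can be scripted. A minimal sketch, not from the original session, assuming the -I output format shown above:

    # release heartbeat references until the region reports zero
    UUID=72EF09EA3D0D4F51BDC00B47432B1EB2
    while [ "$(ocfs2_hb_ctl -I -u "$UUID" | awk '{print $2}')" -gt 0 ]; do
        ocfs2_hb_ctl -K -u "$UUID"
    done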

I kept killing it until I saw 0 refs, then took the cluster offline:

    # /etc/init.d/o2cb offline cpc
    Stopping O2CB cluster cpc: OK

and stopped it:

    # /etc/init.d/o2cb stop
    Stopping O2CB cluster cpc: OK
    Unloading module "ocfs2": OK
    Unmounting ocfs2_dlmfs filesystem: OK
    Unloading module "ocfs2_dlmfs": OK
    Unmounting configfs filesystem: OK
    Unloading module "configfs": OK

Then I started it again to see whether the node information had been updated:

    # /etc/init.d/o2cb start
    Loading filesystem "configfs": OK
    Mounting configfs filesystem at /sys/kernel/config: OK
    Loading filesystem "ocfs2_dlmfs": OK
    Mounting ocfs2_dlmfs filesystem at /dlm: OK
    Starting O2CB cluster cpc: OK

    # ls -l /sys/kernel/config/cluster/cpc/node/
    total 0
    drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
    drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain

OK, then on the peer node (192.168.2.93), I tried to start OCFS2:

    # /etc/init.d/ocfs2 start
    Starting Oracle Cluster File System (OCFS2)                [  OK  ]

Thanks to Sunil Mushran, because this thread helped me solve the problem.

The lessons learned:

  1. IP addresses, ports, etc. can only be changed when the cluster is offline. See the FAQ, and the sketch after this list.
  2. Never force-reformat an OCFS2 volume.
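For reference, a sketch of that offline change sequence, using the init scripts shown above and the cluster name cpc from this setup:

    # run on every node: unmount OCFS2 volumes, then take the cluster offline
    /etc/init.d/ocfs2 stop
    /etc/init.d/o2cb offline cpc
    # edit /etc/ocfs2/cluster.conf identically on all nodes, then bring it back
    /etc/init.d/o2cb online cpc
    /etc/init.d/ocfs2 start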