Bond fails to bring up the backup NIC after the primary fails

Our production server has 4 NICs, bonded in pairs into 2 bonds.

External network: bond0 (eth0 active, eth1 backup)
Internal network: bond1 (eth2 active, eth3 backup)

We had eth0 and eth2 fail at the same time:

    Mar 3 10:38:16 localhost kernel: [93739227.917537] tg3 0000:02:00.0 eth0: 0x000068b0: 0xe0011514, 0x00000000, 0x00000000, 0x00000000
    Mar 3 10:38:16 localhost kernel: [93739227.930035] tg3 0000:02:00.0 eth0: 0x000068e0: 0x00000000, 0x00000000, 0x00000000, 0x0001c2cc
    Mar 3 10:38:16 localhost kernel: [93739227.942529] tg3 0000:02:00.0 eth0: 0x000068f0: 0x00ff000e, 0x00ff0000, 0x00000000, 0x04444444
    ...
    Mar 3 10:38:17 localhost kernel: [93739228.141585] tg3 0000:02:00.0 eth0: 4: NAPI info [0000000a:0000000a:(0000:0000:01ff):04dc:(04dc:04dc:0000:0000)]
    Mar 3 10:38:17 localhost kernel: [93739228.201559] bonding: bond0: link status definitely down for interface eth0, disabling it
    Mar 3 10:38:17 localhost kernel: [93739228.216343] tg3 0000:02:00.0 eth0: Link is down
    Mar 3 10:38:18 localhost kernel: [93739229.253266] bonding: bond0: now running without any active interface !
    Mar 3 10:38:18 localhost kernel: [93739229.253331] tg3 0000:08:00.0 eth2: transmit timed out, resetting
    Mar 3 10:38:19 localhost kernel: [93739230.509553] tg3 0000:08:00.0 eth2: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800010
    Mar 3 10:38:19 localhost kernel: [93739230.521603] tg3 0000:08:00.0 eth2: 0x00000010: 0xd90a000c, 0x00000000, 0xd90b000c, 0x00000000
    Mar 3 10:38:19 localhost kernel: [93739230.533658] tg3 0000:08:00.0 eth2: 0x00000020: 0xd90c000c, 0x00000000, 0x00000000, 0x200314e4
    Mar 3 10:38:19 localhost kernel: [93739230.545704] tg3 0000:08:00.0 eth2: 0x00000030: 0xdd000000, 0x00000048, 0x00000000, 0x0000010f
    Mar 3 10:38:19 localhost kernel: [93739230.557755] tg3 0000:08:00.0 eth2: 0x00000040: 0x00000000, 0xa5000000, 0xc8035001, 0x64002008
    Mar 3 10:38:19 localhost kernel: [93739230.569808] tg3 0000:08:00.0 eth2: 0x00000050: 0x818c5803, 0x78000000, 0x0086a005, 0x00000000
    ...
    Mar 3 10:38:23 localhost kernel: [93739234.611688] tg3 0000:08:00.0 eth2: 4: Host status block [00000001:000000df:(0000:0000:0a0f):(0000:0000)]
    Mar 3 10:38:23 localhost kernel: [93739234.624030] tg3 0000:08:00.0 eth2: 4: NAPI info [000000c4:000000c4:(0000:0000:01ff):09d4:(01d4:01d4:0000:0000)]
    Mar 3 10:38:23 localhost kernel: [93739234.699205] bonding: bond1: link status definitely down for interface eth2, disabling it
    Mar 3 10:38:23 localhost kernel: [93739234.738410] tg3 0000:08:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
    Mar 3 10:38:23 localhost kernel: [93739234.850735] tg3 0000:08:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
    Mar 3 10:38:23 localhost kernel: [93739234.977285] tg3 0000:08:00.0 eth2: Link is down
    Mar 3 10:38:25 localhost kernel: [93739236.081087] bonding: bond1: now running without any active interface !
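
The bonding messages are easy to lose among the tg3 register dumps. Here is a minimal Python sketch that pulls out only the bonding events. It assumes the two message formats seen in the log above; other bonding driver versions may word them differently. The function name `bonding_events` and the event labels are my own, not anything the kernel provides.

```python
import re
import sys

# Patterns copied from the two bonding messages in the log above.
# Assumption: other kernel/bonding versions may phrase these differently.
LINK_DOWN = re.compile(
    r"bonding: (?P<bond>\S+): link status definitely down "
    r"for interface (?P<slave>\S+), disabling it"
)
NO_ACTIVE = re.compile(
    r"bonding: (?P<bond>\S+): now running without any active interface"
)


def bonding_events(lines):
    """Return (bond, event, slave) tuples for bonding messages in kernel log lines."""
    events = []
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            events.append((match.group("bond"), "slave_down", match.group("slave")))
            continue
        match = NO_ACTIVE.search(line)
        if match:
            # No slave name is present in this message.
            events.append((match.group("bond"), "no_active_slave", None))
    return events


if __name__ == "__main__":
    # Reads kernel log lines from standard input.
    for bond, event, slave in bonding_events(sys.stdin):
        print(bond, event, slave or "-")
```

A `slave_down` event followed directly by `no_active_slave` on the same bond, with no failover message in between, is the pattern shown in this log for both bond0 and bond1.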

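1) Since this hit 2 different networks at the same time, we suspect a hardware problem (the motherboard, or a micro-cut in the power supply, i.e. a failing PSU). Feel free to tell me whether you agree or disagree with my diagnosis ;)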

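2) The whole point of configuring the bonds as active-backup is to keep a hot-spare NIC ready in case of failure. As you can see here, it looks like the backup was never brought up, or even considered. I checked ifconfig at the time of the incident, and eth1 and eth3 (the backups) were correctly attached to their respective bonds.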

What could have prevented the bonds from switching over to the hot-spare cards?

Edit: full network configuration:

    bond0     Link encap:Ethernet  HWaddr 90:b1:1c:xxxxx
              inet addr:195.178.186.222  Bcast:195.178.xxxxxxx  Mask:255.255.255.224
              inet6 addr: fe80::92xxxxa:4b1e/64 Scope:Link
              UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
              RX packets:11806289 errors:0 dropped:563346 overruns:0 frame:0
              TX packets:15209428 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:2314496738 (2.3 GB)  TX bytes:17247449206 (17.2 GB)

    bond1     Link encap:Ethernet  HWaddr 00:10:1xxxx:ce
              inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
              inet6 addr: fe80::210:18ff:fed3:b1ce/64 Scope:Link
              UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
              RX packets:161091053340 errors:0 dropped:1071 overruns:0 frame:13821
              TX packets:112926434041 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:99357307904176 (99.3 TB)  TX bytes:45744253012472 (45.7 TB)

    eth0      Link encap:Ethernet  HWaddr 90:b1:xxxxxx4b:1e
              UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
              RX packets:11806289 errors:0 dropped:563346 overruns:0 frame:0
              TX packets:15209428 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:2314496738 (2.3 GB)  TX bytes:17247449206 (17.2 GB)
              Interrupt:16

    eth1      Link encap:Ethernet  HWaddr 90:b1:1xxxxxx:1e
              UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
              Interrupt:17

    eth2      Link encap:Ethernet  HWaddr 00:10:xxxxx1:ce
              UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
              RX packets:161091053340 errors:0 dropped:1070 overruns:0 frame:13821
              TX packets:112926434041 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:99357307904176 (99.3 TB)  TX bytes:45744253012472 (45.7 TB)
              Interrupt:48

    eth3      Link encap:Ethernet  HWaddr 00:10xxxb1:ce
              UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
              Interrupt:52

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:6935638599 errors:0 dropped:0 overruns:0 frame:0
              TX packets:6935638599 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:18028725295176 (18.0 TB)  TX bytes:18028725295176 (18.0 TB)

Here is /proc/net/bonding/bond0 (bond1 is similar):

    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: eth0 (primary_reselect always)
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0

    Slave Interface: eth0
    MII Status: up
    Speed: 1000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 90:b1:1c:4a:4b:1e
    Slave queue ID: 0

    Slave Interface: eth1
    MII Status: down
    Speed: Unknown
    Duplex: Unknown
    Link Failure Count: 0
    Permanent HW addr: 90:b1:1c:4a:4b:1f
    Slave queue ID: 0
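
Note that in this output eth1 reports MII Status: down while eth0 is the active slave, and a standby whose link is down cannot take over. A check like this can be automated. The following is a minimal Python sketch that parses the text of /proc/net/bonding/<bond> and lists the non-active slaves whose MII status is not "up". It assumes the "Key: value" layout shown above; the function names `parse_bonding_status` and `standby_problems` are hypothetical helpers of mine, not part of any bonding tool.

```python
import sys


def parse_bonding_status(text):
    """Return (active_slave, {slave: mii_status}) from /proc/net/bonding/<bond> text."""
    active = None
    slaves = {}
    current = None  # slave section we are currently inside, if any
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or a line without "Key: value"
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            active = value
        elif key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any slave section
            # and is skipped because current is still None at that point.
            slaves[current] = value
    return active, slaves


def standby_problems(text):
    """List standby slaves (not the active one) whose MII status is not 'up'."""
    active, slaves = parse_bonding_status(text)
    return sorted(
        name for name, status in slaves.items()
        if name != active and status != "up"
    )


if __name__ == "__main__":
    # Usage: pass the path to a bonding status file, e.g. /proc/net/bonding/bond0
    with open(sys.argv[1]) as handle:
        broken = standby_problems(handle.read())
    if broken:
        print("standby slaves without link:", ", ".join(broken))
        sys.exit(1)
    print("all standby slaves report link up")
```

Run against the bond0 output above, this would report eth1. Running it periodically (from cron or a monitoring agent) would flag a dead standby before the active slave fails.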

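I finally figured it out when another problem came up. When the original failure happened, we called the datacenter and had them dispatch a technician to verify the ports and cables. The answer we got was that everything was fine and the ports were blinking.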

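When I went to the datacenter myself for that other problem, I took a look at the cabling behind the machine... eth0 and eth2 were cabled correctly, but eth1 and eth3 were not even plugged in! How could they miss that!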

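The moral of the story: if a hot spare is configured but fails to take over on failover, and there is nothing about it in the logs, then it is a cabling or port problem. Also, always check things yourself; don't trust others to do it for you, because most of them won't care.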