Machines with bonded interfaces do not receive multicast on all slave interfaces

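After upgrading our machines from RHEL 6.6 to RHEL 6.7, we observed a problem where 4 of our 30 machines only receive multicast traffic on one of their two slave interfaces. It is unclear whether the upgrade itself is related or whether the reboot that came with it triggered the behavior; reboots are rare.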

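We expect to receive a large amount of multicast traffic for the group 239.0.10.200 on 4 different ports. If we check the ethtool statistics on a problematic machine, we see the following output: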

Healthy interface:

    # ethtool -S eth0 |grep mcast
    [0]: rx_mcast_packets: 294
    [0]: tx_mcast_packets: 0
    [1]: rx_mcast_packets: 68
    [1]: tx_mcast_packets: 0
    [2]: rx_mcast_packets: 2612869
    [2]: tx_mcast_packets: 305
    [3]: rx_mcast_packets: 0
    [3]: tx_mcast_packets: 0
    [4]: rx_mcast_packets: 2585571
    [4]: tx_mcast_packets: 0
    [5]: rx_mcast_packets: 2571341
    [5]: tx_mcast_packets: 0
    [6]: rx_mcast_packets: 0
    [6]: tx_mcast_packets: 8
    [7]: rx_mcast_packets: 9
    [7]: tx_mcast_packets: 0
    rx_mcast_packets: 7770152
    tx_mcast_packets: 313

Broken interface:

    # ethtool -S eth1 |grep mcast
    [0]: rx_mcast_packets: 451
    [0]: tx_mcast_packets: 0
    [1]: rx_mcast_packets: 0
    [1]: tx_mcast_packets: 0
    [2]: rx_mcast_packets: 5
    [2]: tx_mcast_packets: 304
    [3]: rx_mcast_packets: 0
    [3]: tx_mcast_packets: 0
    [4]: rx_mcast_packets: 5
    [4]: tx_mcast_packets: 145
    [5]: rx_mcast_packets: 0
    [5]: tx_mcast_packets: 0
    [6]: rx_mcast_packets: 5
    [6]: tx_mcast_packets: 10
    [7]: rx_mcast_packets: 0
    [7]: tx_mcast_packets: 0
    rx_mcast_packets: 466
    tx_mcast_packets: 459
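
To spot the affected machines without reading the counters by eye, the comparison above could be automated. The following is only a sketch: it assumes the bnx2x per-queue line format shown above (`[N]: rx_mcast_packets: V`), and the function names and the 1% threshold are made up for illustration.

```python
import re


def total_rx_mcast(ethtool_output: str) -> int:
    """Sum the per-queue rx_mcast_packets counters from `ethtool -S` output."""
    # Per-queue lines look like "[2]: rx_mcast_packets: 2612869".
    # The un-prefixed total line is deliberately not matched, so nothing is counted twice.
    per_queue = re.findall(r"\[\d+\]:\s*rx_mcast_packets:\s*(\d+)", ethtool_output)
    return sum(int(value) for value in per_queue)


def looks_starved(healthy_total: int, suspect_total: int, ratio: float = 0.01) -> bool:
    """Return True if the suspect slave saw less than `ratio` of the healthy slave's multicast.

    The 1% default is an arbitrary illustrative threshold, not a recommended value.
    """
    return suspect_total < healthy_total * ratio
```

Fed with the two outputs above, `total_rx_mcast` gives 7770152 for eth0 and 466 for eth1, so `looks_starved(7770152, 466)` is true. The text could come from `subprocess.run(["ethtool", "-S", "eth0"], capture_output=True, text=True).stdout`.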

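The multicast traffic is expected from 10 other machines. If we check (using tcpdump) which hosts a broken machine receives multicast from, it only receives from a subset (3-6) of the expected hosts.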

Configuration

Linux version:

    # uname -a
    Linux ab31 2.6.32-573.3.1.el6.x86_64 #1 SMP Mon Aug 10 09:44:54 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

ifconfig:

    # ifconfig -a
    bond0     Link encap:Ethernet  HWaddr 4C:76:25:97:B1:75
              inet addr:10.91.20.231  Bcast:10.91.255.255  Mask:255.255.0.0
              inet6 addr: fe80::4e76:25ff:fe97:b175/64 Scope:Link
              UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
              RX packets:18005156 errors:0 dropped:0 overruns:0 frame:0
              TX packets:11407592 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:10221086569 (9.5 GiB)  TX bytes:2574472468 (2.3 GiB)

    eth0      Link encap:Ethernet  HWaddr 4C:76:25:97:B1:75
              inet6 addr: fe80::4e76:25ff:fe97:b175/64 Scope:Link
              UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
              RX packets:13200915 errors:0 dropped:0 overruns:0 frame:0
              TX packets:3514446 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:9386669124 (8.7 GiB)  TX bytes:339950822 (324.2 MiB)
              Interrupt:34 Memory:d9000000-d97fffff

    eth1      Link encap:Ethernet  HWaddr 4C:76:25:97:B1:75
              inet6 addr: fe80::4e76:25ff:fe97:b175/64 Scope:Link
              UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
              RX packets:4804241 errors:0 dropped:0 overruns:0 frame:0
              TX packets:7893146 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:834417445 (795.7 MiB)  TX bytes:2234521646 (2.0 GiB)
              Interrupt:36 Memory:da000000-da7fffff

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:139908 errors:0 dropped:0 overruns:0 frame:0
              TX packets:139908 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:210503939 (200.7 MiB)  TX bytes:210503939 (200.7 MiB)

Network configuration:

    # cat /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=10.91.20.231
    NETMASK=255.255.0.0
    GATEWAY=10.91.1.25
    ONBOOT=yes
    BOOTPROTO=none
    USERCTL=no
    BONDING_OPTS="miimon=100 mode=802.3ad"

    # cat /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE="eth0"
    HWADDR="4C:76:25:97:B1:75"
    BOOTPROTO=none
    ONBOOT="yes"
    USERCTL=no
    MASTER=bond0
    SLAVE=yes

    # cat /etc/sysconfig/network-scripts/ifcfg-eth1
    DEVICE="eth1"
    HWADDR="4C:76:25:97:B1:78"
    BOOTPROTO=none
    ONBOOT="yes"
    USERCTL=no
    MASTER=bond0
    SLAVE=yes

Driver information (identical for eth1):

    # ethtool -i eth0
    driver: bnx2x
    version: 1.710.51-0
    firmware-version: FFV7.10.17 bc 7.10.11
    bus-info: 0000:01:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: yes

Adapter:

    # lspci|grep Ether
    01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
    01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)

/proc/net/bonding/bond0:

    $ cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: IEEE 802.3ad Dynamic link aggregation
    Transmit Hash Policy: layer2 (0)
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0

    802.3ad info
    LACP rate: slow
    Min links: 0
    Aggregator selection policy (ad_select): stable
    Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 33
        Partner Key: 5
        Partner Mac Address: 00:01:09:06:09:07

    Slave Interface: eth0
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 4c:76:25:97:b1:75
    Aggregator ID: 1
    Slave queue ID: 0

    Slave Interface: eth1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 4c:76:25:97:b1:78
    Aggregator ID: 1
    Slave queue ID: 0

Other information

  • Restarting the broken interface (ifconfig down, then ifconfig up) fixes the problem

  • Occasionally during boot we see the following message in our syslog (we do not use IPv6); however, the problem also occurs when this message is not logged

     Oct 2 11:27:51 ab30 kernel: bond0: IPv6 duplicate address fe80::4e76:25ff:fe87:9d75 detected! 
  • syslog output during configuration:

     Oct 5 07:44:31 ab31 kernel: bonding: bond0 is being created...
     Oct 5 07:44:31 ab31 kernel: bonding: bond0 already exists
     Oct 5 07:44:31 ab31 kernel: bond0: Setting MII monitoring interval to 100
     Oct 5 07:44:31 ab31 kernel: bond0: Setting MII monitoring interval to 100
     Oct 5 07:44:31 ab31 kernel: ADDRCONF(NETDEV_UP): bond0: link is not ready
     Oct 5 07:44:31 ab31 kernel: bond0: Setting MII monitoring interval to 100
     Oct 5 07:44:31 ab31 kernel: bond0: Adding slave eth0
     Oct 5 07:44:31 ab31 kernel: bnx2x 0000:01:00.0: firmware: requesting bnx2x/bnx2x-e2-7.10.51.0.fw
     Oct 5 07:44:31 ab31 kernel: bnx2x 0000:01:00.0: eth0: using MSI-X IRQs: sp 120 fp[0] 122 ... fp[7] 129
     Oct 5 07:44:31 ab31 kernel: bnx2x 0000:01:00.0: eth0: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
     Oct 5 07:44:31 ab31 kernel: bond0: Enslaving eth0 as a backup interface with an up link
     Oct 5 07:44:31 ab31 kernel: bond0: Adding slave eth1
     Oct 5 07:44:31 ab31 kernel: bnx2x 0000:01:00.1: firmware: requesting bnx2x/bnx2x-e2-7.10.51.0.fw
     Oct 5 07:44:31 ab31 kernel: bnx2x 0000:01:00.1: eth1: using MSI-X IRQs: sp 130 fp[0] 132 ... fp[7] 139
     Oct 5 07:44:31 ab31 kernel: bnx2x 0000:01:00.1: eth1: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
     Oct 5 07:44:31 ab31 kernel: bond0: Enslaving eth1 as a backup interface with an up link
     Oct 5 07:44:31 ab31 kernel: ADDRCONF(NETDEV_UP): bond0: link is not ready
     Oct 5 07:44:31 ab31 kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
  • The bond0 interface has joined the multicast group, as shown by ip maddr:

     ...
     4: bond0
         inet 239.0.10.200 users 16
     ...
  • Everything works on other machines on the same network. However, it seems (not 100% confirmed) that the working machines have a different network adapter:

     01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20) 
  • When checking our switch statistics, we can see data being sent to both interfaces

What we have tried so far

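  • As suggested in "Linux kernel not passing through multicast UDP packets", we investigated whether there was an rp_filter problem. However, changing these flags made no difference for us.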

  • Downgrading the kernel to the one used before the RedHat upgrade: no change.
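
For anyone repeating the rp_filter check, the flags live under /proc/sys/net/ipv4/conf, and for source validation the kernel uses the larger of the `all` value and the per-interface value. A small sketch that reads them is shown here; the function names are made up, and the directory is a parameter only so that the helper can be pointed somewhere else.

```python
from pathlib import Path


def read_rp_filter(conf_dir: str = "/proc/sys/net/ipv4/conf") -> dict:
    """Return {interface_name: rp_filter_value} for every entry under conf_dir."""
    values = {}
    for entry in sorted(Path(conf_dir).glob("*/rp_filter")):
        values[entry.parent.name] = int(entry.read_text().strip())
    return values


def effective_rp_filter(values: dict, interface: str) -> int:
    """Return max(conf/all, conf/<interface>), the value the kernel applies."""
    return max(values.get("all", 0), values.get(interface, 0))
```

For example, `effective_rp_filter(read_rp_filter(), "bond0")` shows whether strict (1) or loose (2) filtering is in effect on the bond even when only `all` was changed.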

Any hints on how to troubleshoot this further are appreciated. Please let me know if more information is needed.

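We are seeing this problem on Dell blade servers. After working with Dell support, it appears that we use IGMPv3 EXCLUDE filtering when joining the multicast groups. The switch in the blade chassis apparently does not support EXCLUDE mode. The recommendation is to switch to IGMPv3 INCLUDE filter mode.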

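However, we have since stopped using multicast on our platform, so we will probably not get around to trying these changes. Therefore, I cannot say for certain that this is the root cause.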
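
For reference, a plain `IP_ADD_MEMBERSHIP` join is reported over IGMPv3 as EXCLUDE mode with an empty source list, while a source-specific join is reported as INCLUDE mode. The sketch below shows roughly what the recommended change could look like on Linux. It is untested against the switch in question: the sender address passed in is hypothetical (the real sender hosts are not listed here), and the numeric fallback 39 and the struct field order are the Linux ones only.

```python
import socket

# Linux value of IP_ADD_SOURCE_MEMBERSHIP from <linux/in.h>. Python's socket
# module does not export this name on every platform, hence the fallback.
IP_ADD_SOURCE_MEMBERSHIP = getattr(socket, "IP_ADD_SOURCE_MEMBERSHIP", 39)


def pack_ip_mreq_source(group: str, source: str, interface: str = "0.0.0.0") -> bytes:
    """Build a Linux `struct ip_mreq_source`: group, local interface, source."""
    return (
        socket.inet_aton(group)
        + socket.inet_aton(interface)
        + socket.inet_aton(source)
    )


def join_from_source(sock: socket.socket, group: str, source: str,
                     interface: str = "0.0.0.0") -> None:
    """Join `group` for traffic from `source` only (IGMPv3 INCLUDE mode)."""
    sock.setsockopt(
        socket.IPPROTO_IP,
        IP_ADD_SOURCE_MEMBERSHIP,
        pack_ip_mreq_source(group, source, interface),
    )
```

With 10 senders, `join_from_source` would be called once per sender address on the same socket, instead of the single any-source join.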