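I have a CentOS 6.6 server (Linux 2.6.32-504) running on a brand-new Dell PowerEdge R320. The other day, I started noticing problems in /var/log/messages related to one of my NICs.
Note: this is a server cluster that requires eth0 to be connected by crossover to the redundant backup server, so we use eth1 to communicate with the customer's network and the outside world. Swapping NICs for troubleshooting would therefore be very difficult. I will also note that I am not seeing any problems on eth0.
I would like to narrow this down as much as possible: is it a hardware failure, the Linux kernel not liking the Dell hardware, or something else? I am not a Linux admin by trade, so I am green in this area.
Google searches seem to raise more questions than answers (the results seem to apply only to other distros, but they do point to the possibility of a kernel/firmware issue). I humbly ask for this community's expertise and ideas!
One thing I have noticed is that the interface comes back up at 100Mb/half (when it is actually a 1Gb/full link).
I will provide as much information as I can, which is hopefully enough.
Here is the snippet from /var/log/messages where I noticed the problem. Does this mean anything to anyone?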
Mar 22 16:10:07 ind1un043 kernel: ------------[ cut here ]------------
Mar 22 16:10:07 ind1un043 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
Mar 22 16:10:07 ind1un043 kernel: Hardware name: PowerEdge R320
Mar 22 16:10:07 ind1un043 kernel: NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
Mar 22 16:10:07 ind1un043 kernel: Modules linked in: nls_utf8 hfsplus(U) mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu coretemp 8021q garp stp llc wctc4xxp(U) dahdi_transcode(U) wcb4xxp(U) wctdm(U) wcfxo(U) wctdm24xxp(U) wcte11xp(U) wct1xxp(U) wcte12xp(U) dahdi_voicebus(U) wct4xxp(U) oct612x(U) dahdi(U) crc_ccitt ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput iTCO_wdt iTCO_vendor_support microcode dcdbas sb_edac edac_core ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Mar 22 16:10:07 ind1un043 kernel: Pid: 9, comm: ksoftirqd/1 Not tainted 2.6.32-504.el6.x86_64 #1
Mar 22 16:10:07 ind1un043 kernel: Call Trace:
Mar 22 16:10:07 ind1un043 kernel: <IRQ> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81074ee6>] ? warn_slowpath_fmt+0x46/0x50
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8147df7b>] ? dev_watchdog+0x26b/0x280
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81060d7c>] ? scheduler_tick+0xcc/0x260
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8147dd10>] ? dev_watchdog+0x0/0x280
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81087db7>] ? run_timer_softirq+0x197/0x340
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff810b0275>] ? tick_dev_program_event+0x65/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff810b034a>] ? tick_program_event+0x2a/0x30
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d765>] ? irq_exit+0x85/0x90
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81533c0a>] ? smp_apic_timer_interrupt+0x4a/0x60
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Mar 22 16:10:07 ind1un043 kernel: <EOI> [<ffffffff8107d4c5>] ? ksoftirqd+0xd5/0x110
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d3f0>] ? ksoftirqd+0x0/0x110
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Mar 22 16:10:07 ind1un043 kernel: ---[ end trace c154004f7af06fb3 ]---
Mar 22 16:10:07 ind1un043 kernel: tg3 0000:02:00.1: eth1: transmit timed out, resetting
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800010
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000010: 0xd90d000c, 0x00000000, 0xd90e000c, 0x00000000
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000020: 0xd90f000c, 0x00000000, 0x00000000, 0x04f71028
... (TOO MANY CHARACTERS TO POST ON SERVER FAULT, SO STRIPPING) ...
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00007020: 0x00000000, 0x00000000, 0x00000406, 0x10004000
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00007030: 0x000e0000, 0x00004af8, 0x00170030, 0x00000000
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0: Host status block [00000001:00000063:(0000:0695:0000):(0000:01b8)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0: NAPI info [00000063:00000063:(01a5:01b8:01ff):0000:(0095:0000:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 1: Host status block [00000001:000000ac:(0000:0000:0000):(083a:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 1: NAPI info [000000ac:000000ac:(0000:0000:01ff):083a:(003a:003a:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 2: Host status block [00000001:0000004f:(089e:0000:0000):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 2: NAPI info [00000046:00000046:(0000:0000:01ff):0895:(0095:0095:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 3: Host status block [00000001:00000012:(0000:0000:0000):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 3: NAPI info [00000012:00000012:(0000:0000:01ff):0fbc:(07bc:07bc:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 4: Host status block [00000001:000000d4:(0000:0000:0742):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 4: NAPI info [000000d4:000000d4:(0000:0000:01ff):0742:(0742:0742:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: tg3_stop_block timed out, ofs=1400 enable_bit=2
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: tg3_stop_block timed out, ofs=c00 enable_bit=2
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is down
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Before that, though, here is a snippet from /var/log/messages where the NICs initialize normally during boot:
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 14:18:77:32:77:3f
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: dma_rwctrl[00000001] dma_mask[64-bit]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 14:18:77:32:77:40
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: dma_rwctrl[00000001] dma_mask[64-bit]
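Here is a snippet from /var/log/messages where the NICs come up normally, though I do not know in what context (it comes right after the init shown in the snippet above):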
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Link is up at 1000 Mbps, full duplex
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Flow control is on for TX and on for RX
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: EEE is disabled
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
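And here is a snippet from /var/log/messages where the NIC seems to hit some other issue shortly afterward, a renegotiation or something similar. This happens randomly a few times over the following days: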
Mar 21 16:01:42 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Mar 21 16:01:44 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
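With this many log lines, it can help to pull out only the events that matter. The following is a minimal Python sketch, not anything from the tg3 driver or CentOS itself: it scans syslog-style lines for tg3 "Link is up" messages below an expected speed/duplex and for NETDEV WATCHDOG transmit timeouts. The function name `scan`, the regular expressions, and the expected values (1000 Mbps, full duplex) are assumptions chosen to match the message formats shown in the snippets above.

```python
import re

# Assumed format, taken from the log lines quoted in this post:
#   "<timestamp> <host> kernel: tg3 <pci-addr>: <iface>: Link is up at <N> Mbps, <full|half> duplex"
LINK_UP = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8}).*?\btg3 \S+ (?P<iface>\w+): "
    r"Link is up at (?P<speed>\d+) Mbps, (?P<duplex>full|half) duplex"
)

# Assumed format: "NETDEV WATCHDOG: <iface> (<driver>): transmit queue <N> timed out"
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\w+) \((?P<driver>\w+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def scan(lines, expected_speed=1000, expected_duplex="full"):
    """Return (degraded_links, timeouts) found in an iterable of log lines."""
    degraded = []  # (timestamp, iface, speed, duplex) for links below expectation
    timeouts = []  # (iface, queue) for each watchdog transmit timeout
    for line in lines:
        m = LINK_UP.search(line)
        if m:
            speed = int(m.group("speed"))
            duplex = m.group("duplex")
            if speed < expected_speed or duplex != expected_duplex:
                degraded.append((m.group("ts"), m.group("iface"), speed, duplex))
            continue
        m = WATCHDOG.search(line)
        if m:
            timeouts.append((m.group("iface"), int(m.group("queue"))))
    return degraded, timeouts


# Three lines copied from the snippets above, used here as sample input.
SAMPLE = [
    "Mar 22 16:10:07 ind1un043 kernel: NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out",
    "Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Link is up at 1000 Mbps, full duplex",
    "Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex",
]

degraded, timeouts = scan(SAMPLE)
for ts, iface, speed, duplex in degraded:
    print(f"{ts}: {iface} came up at {speed} Mbps, {duplex} duplex")
for iface, queue in timeouts:
    print(f"watchdog timeout on {iface}, transmit queue {queue}")
```

To run it against the real log, pass an open file instead of `SAMPLE`, for example `scan(open("/var/log/messages"))`; reading that file usually requires root.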
Here is the output of lspci | grep -i net:
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
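I don't know whether this is exactly an "answer", but it did solve my problem as far as this particular situation is concerned. It appears to be related to ACPI.
I edited my /etc/grub.conf file and added the following to the end of the kernel line: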
noapic acpi=off pci=noacpi
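I no longer see any errors in the logs.
Could a power-saving feature be shutting the interface down and bringing it back in a slower mode? Or could it be some bug between the hardware/firmware and this kernel? I don't know, but this at least works around my problem (this is a production server, and we never want its networking to go down).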
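After rebooting, it is worth confirming that the parameters added to the kernel line actually reached the running kernel. Below is a minimal Python sketch for that check, under the assumption that the running kernel's command line can be read from /proc/cmdline (standard on Linux). The helper names `missing_params` and `read_cmdline` are hypothetical, and the example command line is made up for illustration; only the three parameters come from this answer.

```python
# The three parameters appended to the kernel line in /etc/grub.conf.
WANTED = ("noapic", "acpi=off", "pci=noacpi")


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted parameters that do not appear in the kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Hypothetical command line for illustration; on the server itself,
# use missing_params(read_cmdline()) instead.
EXAMPLE = "ro root=/dev/sda1 noapic acpi=off pci=noacpi"

missing = missing_params(EXAMPLE)
if missing:
    print("not active:", " ".join(missing))
else:
    print("all three parameters are active")
```

An empty result only shows that the parameters were passed at boot; whether they are the right long-term fix for the tg3 timeouts is a separate question.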