When the server crashes, this is what shows up in /var/log/messages:
Dec 21 19:47:45 localhost kernel: ------------[ cut here ]------------
Dec 21 19:47:45 localhost kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Dec 21 19:47:45 localhost kernel: Hardware name: KGP(M)E-D16
Dec 21 19:47:45 localhost kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Dec 21 19:47:45 localhost kernel: Modules linked in: ipt_REDIRECT iptable_nat nf_nat xt_multiport xt_owner ext3 jbd nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables autofs4 sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 e1000e microcode serio_raw k10temp edac_core edac_mce_amd i2c_piix4 i2c_core sg shpchp ext4 mbcache jbd2 sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_conntrack]
Dec 21 19:47:45 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.32-220.el6.x86_64 #1
Dec 21 19:47:45 localhost kernel: Call Trace:
Dec 21 19:47:45 localhost kernel: <IRQ> [<ffffffff81069b77>] ? warn_slowpath_common+0x87/0xc0
Dec 21 19:47:45 localhost kernel: [<ffffffff81069c66>] ? warn_slowpath_fmt+0x46/0x50
Dec 21 19:47:45 localhost kernel: [<ffffffff8144a54d>] ? dev_watchdog+0x26d/0x280
Dec 21 19:47:45 localhost kernel: [<ffffffff8144a2e0>] ? dev_watchdog+0x0/0x280
Dec 21 19:47:45 localhost kernel: [<ffffffff8107c957>] ? run_timer_softirq+0x197/0x340
Dec 21 19:47:45 localhost kernel: [<ffffffff810a0b70>] ? tick_sched_timer+0x0/0xc0
Dec 21 19:47:45 localhost kernel: [<ffffffff8102ad2d>] ? lapic_next_event+0x1d/0x30
Dec 21 19:47:45 localhost kernel: [<ffffffff81072161>] ? __do_softirq+0xc1/0x1d0
Dec 21 19:47:45 localhost kernel: [<ffffffff81095770>] ? hrtimer_interrupt+0x140/0x250
Dec 21 19:47:45 localhost kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Dec 21 19:47:45 localhost kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Dec 21 19:47:45 localhost kernel: [<ffffffff81071f45>] ? irq_exit+0x85/0x90
Dec 21 19:47:45 localhost kernel: [<ffffffff814f4de0>] ? smp_apic_timer_interrupt+0x70/0x9b
Dec 21 19:47:45 localhost kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
Dec 21 19:47:45 localhost kernel: <EOI> [<ffffffff810375ab>] ? native_safe_halt+0xb/0x10
Dec 21 19:47:45 localhost kernel: [<ffffffff810145dd>] ? default_idle+0x4d/0xb0
Dec 21 19:47:45 localhost kernel: [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
Dec 21 19:47:45 localhost kernel: [<ffffffff814d411a>] ? rest_init+0x7a/0x80
Dec 21 19:47:45 localhost kernel: [<ffffffff81c1ff76>] ? start_kernel+0x424/0x430
Dec 21 19:47:45 localhost kernel: [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
Dec 21 19:47:45 localhost kernel: [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
Dec 21 19:47:45 localhost kernel: ---[ end trace 1c035fe603219926 ]---
Dec 21 19:47:45 localhost kernel: e1000e 0000:03:00.0: eth0: Reset adapter
Dec 21 19:47:46 localhost abrt-dump-oops: Reported 1 kernel oopses to Abrt
Dec 21 19:47:46 localhost abrtd: Directory 'oops-2012-12-21-19:47:46-12170-0' creation detected
Dec 21 19:47:47 localhost abrtd: Can't open file '/var/spool/abrt/oops-2012-12-21-19:47:46-12170-0/uid': No such file or directory
Dec 21 19:47:54 localhost kernel: Bridge firewalling registered
Dec 21 19:49:05 localhost abrtd: Sending an email...
Dec 21 19:49:05 localhost abrtd: Email was sent to: root@localhost
Dec 21 19:49:05 localhost abrtd: New problem directory /var/spool/abrt/oops-2012-12-21-19:47:46-12170-0, processing
Dec 21 19:49:05 localhost abrtd: Can't open file '/var/spool/abrt/oops-2012-12-21-19:47:46-12170-0/uid': No such file or directory
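It looks like a piece of hardware named KGP(M)E-D16 stopped, or something like that. A Google search shows that this is the motherboard.
What else should I check? I have already reported this to fdcservers.net.
They claim it is a kernel bug, not a hardware problem. Which kernel bug? Why would it crash the server? What should I do?
Checking the network card driver, I get this: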
root@host [/var/log]# ethtool -i eth0
driver: e1000e
version: 1.9.5-k
firmware-version: 1.8-0
bus-info: 0000:03:00.0
root@host [/var/log]# ethtool -i eth1
driver: e1000e
version: 1.9.5-k
firmware-version: 1.8-0
bus-info: 0000:02:00.0
root@host [/var/log]# ethtool -i eth2
Cannot get driver information: No such device
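That said, "Hardware name: KGP(M)E-D16" refers to an ASUS motherboard. Also, if you search for "Hardware name: KGP(M)E-D16", this page ranks in the top 3.
The problem is spelled out in the trace itself. net/sched/sch_generic.c:261 is line 261 of the generic packet scheduler.
Note that this is a WARNING, not a panic, and it is raised here: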
Dec 21 19:47:45 localhost kernel: [<ffffffff8144a54d>] ? dev_watchdog+0x26d/0x280
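So the network device timed out. As the source code says, some transmit queue was blocked and the timer expired. It was supposed to hold the device for a specific amount of time, but the counter ran out. This is the relevant part of the code: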
if (!mod_timer(&dev->watchdog_timer,
               round_jiffies(jiffies +
                             dev->watchdog_timeo)))
        dev_hold(dev);
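As you can see, there is a watchdog timer, and its interval is measured in jiffies. When this timer expires while a transmit queue is still stalled, the kernel throws the warning.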
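To make the timeout condition concrete, here is a rough user-space model of the check in Python. It is only a sketch, not the kernel code: the function names mirror kernel identifiers for readability, and the values of HZ (1000) and the timeout (5 seconds) are assumptions for illustration, not values read from this system.

```python
# Simplified model of the NETDEV WATCHDOG condition (illustration only).
JIFFIES_BITS = 64
MASK = (1 << JIFFIES_BITS) - 1


def time_after(a, b):
    """Return True if tick count a is later than b, tolerating wraparound."""
    # Same idea as the kernel macro: the signed difference (b - a) is negative.
    return ((b - a) & MASK) >> (JIFFIES_BITS - 1) == 1


def watchdog_fires(jiffies, trans_start, watchdog_timeo, queue_stopped):
    """True if a stopped queue has not transmitted for more than watchdog_timeo."""
    deadline = (trans_start + watchdog_timeo) & MASK
    return bool(queue_stopped) and time_after(jiffies, deadline)


HZ = 1000               # assumed tick rate
WATCHDOG_TIMEO = 5 * HZ  # assumed driver timeout (5 seconds)

# Last transmit at tick 1000, queue still stopped at tick 6001: warning fires.
print(watchdog_fires(6001, 1000, WATCHDOG_TIMEO, queue_stopped=True))
# Same timing, but the queue is running normally: no warning.
print(watchdog_fires(6001, 1000, WATCHDOG_TIMEO, queue_stopped=False))
```

The point of the model is that the warning needs both conditions at once: the queue is stopped and nothing has been transmitted for longer than the timeout. That is why it points at the NIC, its driver, or its firmware failing to complete transmits rather than at the timer code.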
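This has to do with your network card or its driver. Unless they can prove it, I would reject the kernel-bug theory outright. There is no way to tell unless someone has reported this exact call trace, or Intel knows about this trace, on the same hardware, the same driver, and the same firmware. In short, no experienced person would call this a kernel bug without examining a kernel dump or vmcore. The part of the kernel that handles timers is carefully designed, and e1000 is not a driver that is hard to troubleshoot.
I don't want to dismiss your provider, but that is my position. It is worth checking the output of ethtool -S ethX to see whether there are dropped packets, overruns, timeouts, and so on.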
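If the counter list is long, a small filter helps. The following Python sketch keeps only the non-zero counters whose names look like drops, errors, overruns, or timeouts. The keyword list and the sample text are assumptions for illustration (the sample numbers are made up, not taken from this server); the real counter names depend on the driver.

```python
import re
import subprocess

# Keywords that usually indicate trouble; adjust for your driver's counter names.
SUSPECT = re.compile(r"drop|err|timeout|missed|fifo|over|no_buffer", re.IGNORECASE)


def suspicious_counters(ethtool_output):
    """Return {counter: value} for non-zero counters that match SUSPECT."""
    found = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # skips the "NIC statistics:" header and blank lines
        if int(value) > 0 and SUSPECT.search(name):
            found[name.strip()] = int(value)
    return found


def read_stats(iface):
    """Run `ethtool -S <iface>` and return its text output (usually needs root)."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return result.stdout


# Made-up sample in the usual `ethtool -S` layout, for illustration only.
SAMPLE = """NIC statistics:
     rx_packets: 1000
     tx_packets: 900
     rx_errors: 0
     tx_dropped: 3
     tx_timeout_count: 1
"""

print(suspicious_counters(SAMPLE))
```

On the real machine you would call suspicious_counters(read_stats("eth0")) a few times over several minutes. Counters that keep climbing, especially anything timeout-related, support the NIC or driver explanation.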