Linux kernel oops - source unknown

I'm hoping someone can help explain what is happening here:

[ 2081.280253] BUG: unable to handle kernel paging request at ffff8801ad287000
[ 2081.280262] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280272] PGD 1e30067 PUD 39ab067 PMD 3b15067 PTE 0
[ 2081.280277] Oops: 0000 [#4] SMP
[ 2081.280281] last sysfs file: /sys/devices/xen-backend/vbd-5-51715/uevent
[ 2081.280285] CPU 1
[ 2081.280286] Modules linked in: tun md5 ip6table_filter ip6_tables iptable_filter ip_tables x_tables usbbk gntdev netbk blkbk blkback_pagemap blktap xenbus_be evtchn nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bridge stp llc edd sbs sbshc max6650 lm75 coretemp domctl snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device adm1021 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx dm_mod snd_hda_codec_hdmi 8250_pci snd_hda_codec_realtek snd_hda_intel snd_hda_codec ir_lirc_codec lirc_dev ir_sony_decoder ir_jvc_decoder snd_hwdep ir_rc6_decoder ir_rc5_decoder rc_rc6_mce sg ir_nec_decoder nouveau ttm tpm_tis tpm mceusb ir_core i2c_i801 e1000e snd_pcm pcspkr tpm_bios iTCO_wdt iTCO_vendor_support snd_timer 8250 serial_core snd soundcore snd_page_alloc ext4 jbd2 crc16 drm_kms_helper drm i2c_algo_bit i2c_core video output ehci_hcd usbcore button xenblk cdrom xennet fan processor thermal thermal_sys hwmon ata_generic
[ 2081.280350]
[ 2081.280354] Pid: 6623, comm: block Tainted: GD 2.6.37.6-0.5-xen #1 /DQ67OW
[ 2081.280359] RIP: e030:[<ffffffff8000f549>] [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280365] RSP: e02b:ffff88006bb0dd98 EFLAGS: 00010246
[ 2081.280368] RAX: 0000000000000000 RBX: ffff8801ad286e00 RCX: ffff88006bb0dfd8
[ 2081.280371] RDX: ffff88006bae4440 RSI: 0000000000000200 RDI: ffff88006bae4440
[ 2081.280375] RBP: ffff88006bae4440 R08: ffff88006bb0df58 R09: 0000000000000000
[ 2081.280378] R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000011
[ 2081.280381] R13: ffff88006bb0df58 R14: 00007fffc379b800 R15: 00007fffc379b638
[ 2081.280388] FS: 00007f89c8b00700(0000) GS:ffff8801e651d000(0000) knlGS:0000000000000000
[ 2081.280391] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2081.280394] CR2: ffff8801ad287000 CR3: 000000006bb10000 CR4: 0000000000002660
[ 2081.280398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2081.280408] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2081.280412] Process block (pid: 6623, threadinfo ffff88006bb0c000, task ffff88006bae4440)
[ 2081.280415] Stack:
[ 2081.280417] 00007fffc379b800 ffff88006bae4440 0000000000000011 ffffffff8000f90a
[ 2081.280422] ffff88006bb0dee8 ffff88006bae4998 0000000000000011 ffffffff80006a22
[ 2081.280426] ffff8801d88d65c0 ffff88006bae4440 ffff88006bb0de68 000000116bae4440
[ 2081.280431] Call Trace:
[ 2081.280438] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[ 2081.280444] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[ 2081.280449] [<ffffffff80006dd1>] handle_signal+0x201/0x2b0
[ 2081.280454] [<ffffffff80006f09>] do_signal+0x89/0x1b0
[ 2081.280459] [<ffffffff800070b5>] do_notify_resume+0x65/0x90
[ 2081.280464] [<ffffffff8000770e>] int_signal+0x12/0x17
[ 2081.280471] [<00007f89c7fb1090>] 0x7f89c7fb1090
[ 2081.280474] Code: 00 00 41 54 55 53 48 8b 9f 10 05 00 00 48 85 db 0f 84 9c 00 00 00 48 8b 47 08 f6 40 14 01 0f 85 ef 00 00 00 48 8b 05 37 55 89 00 <48> 8b ab 00 02 00 00 48 89 c2 48 21 ea 48 39 d0 74 75 48 89 e8
[ 2081.280499] RIP [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280504] RSP <ffff88006bb0dd98>
[ 2081.280506] CR2: ffff8801ad287000
[ 2081.284005] ---[ end trace 56e37f97ef72fda4 ]---

This is a newly built server running openSUSE 11.4, kernel 2.6.37.6-0.5-xen, on an i2500 with 8GB of RAM.

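I have tried a few different kernels (as updates happened to become available through zypper), and I have tried the two RAM sticks (4GB each) individually and swapped their positions. The DQ67OW motherboard has integrated graphics; in case the integrated graphics was consuming memory the kernel didn't know about, I have also tried a discrete graphics card. The error can occur on any CPU core.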

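It doesn't seem to be triggered by any specific activity. I am running mdadm raid5, and usually the 'block' process is the one that triggers the oops, but bash and udevd have triggered it too.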

It seems that if the oops hits a critical enough process, the whole server hangs with the Caps Lock and Scroll Lock LEDs flashing.

The processor, motherboard and RAM are all new. I expect this is caused either by a hardware fault or by a driver bug. Perhaps the network driver...?

Any suggestions on how to narrow down the culprit would be great.

Cheers,

Paul

Follow-up trace:

[17836.273843] BUG: unable to handle kernel paging request at ffff8801ad287000
[17836.273853] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.273863] PGD 1e30067 PUD 39ab067 PMD 3b15067 PTE 0
[17836.273868] Oops: 0000 [#6] SMP
[17836.273871] last sysfs file: /sys/devices/xen-backend/vbd-6-51715/statistics/wr_sect
[17836.273875] CPU 1
[17836.273876] Modules linked in: usb_storage uas tun md5 ip6table_filter ip6_tables iptable_filter ip_tables x_tables usbbk gntdev netbk blkbk blkback_pagemap blktap xenbus_be evtchn nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bridge stp llc edd sbs sbshc max6650 lm75 coretemp domctl snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device adm1021 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx dm_mod snd_hda_codec_hdmi 8250_pci snd_hda_codec_realtek snd_hda_intel snd_hda_codec ir_lirc_codec lirc_dev ir_sony_decoder ir_jvc_decoder snd_hwdep ir_rc6_decoder ir_rc5_decoder rc_rc6_mce sg ir_nec_decoder nouveau ttm tpm_tis tpm mceusb ir_core i2c_i801 e1000e snd_pcm pcspkr tpm_bios iTCO_wdt iTCO_vendor_support snd_timer 8250 serial_core snd soundcore snd_page_alloc ext4 jbd2 crc16 drm_kms_helper drm i2c_algo_bit i2c_core video output ehci_hcd usbcore button xenblk cdrom xennet fan processor thermal thermal_sys hwmon ata_generic
[17836.273940]
[17836.273943] Pid: 9479, comm: bash Tainted: GD 2.6.37.6-0.5-xen #1 /DQ67OW
[17836.273949] RIP: e030:[<ffffffff8000f549>] [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.273954] RSP: e02b:ffff88002afebd98 EFLAGS: 00010246
[17836.273957] RAX: 0000000000000000 RBX: ffff8801ad286e00 RCX: ffff88002afebfd8
[17836.273960] RDX: ffff88002ad62800 RSI: 0000000000000200 RDI: ffff88002ad62800
[17836.273964] RBP: ffff88002ad62800 R08: ffff88002afebf58 R09: 0000000000000000
[17836.273967] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000011
[17836.273970] R13: ffff88002afebf58 R14: 00007fff522ce400 R15: 00007fff522ce238
[17836.273976] FS: 00007f5908ab2700(0000) GS:ffff8801e651d000(0000) knlGS:0000000000000000
[17836.273979] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[17836.273982] CR2: ffff8801ad287000 CR3: 00000000fa6a2000 CR4: 0000000000002660
[17836.273986] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17836.273989] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17836.273993] Process bash (pid: 9479, threadinfo ffff88002afea000, task ffff88002ad62800)
[17836.273996] Stack:
[17836.273998] 00007fff522ce400 ffff88002ad62800 0000000000000011 ffffffff8000f90a
[17836.274003] ffff88002afebee8 ffff88002ad62d58 0000000000000011 ffffffff80006a22
[17836.274007] ffff8801d91c4e80 ffff88002ad62800 ffff88002afebe68 000000112ad62800
[17836.274011] Call Trace:
[17836.274019] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[17836.274025] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[17836.274030] [<ffffffff80006dd1>] handle_signal+0x201/0x2b0
[17836.274035] [<ffffffff80006f09>] do_signal+0x89/0x1b0
[17836.274040] [<ffffffff800070b5>] do_notify_resume+0x65/0x90
[17836.274046] [<ffffffff8000770e>] int_signal+0x12/0x17
[17836.274052] [<00007f5907ecfd80>] 0x7f5907ecfd80
[17836.274055] Code: 00 00 41 54 55 53 48 8b 9f 10 05 00 00 48 85 db 0f 84 9c 00 00 00 48 8b 47 08 f6 40 14 01 0f 85 ef 00 00 00 48 8b 05 37 55 89 00 <48> 8b ab 00 02 00 00 48 89 c2 48 21 ea 48 39 d0 74 75 48 89 e8
[17836.274081] RIP [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.274085] RSP <ffff88002afebd98>
[17836.274088] CR2: ffff8801ad287000
[17836.274091] ---[ end trace 56e37f97ef72fda6 ]---

This is usually caused by bad memory, but as you say, it could also be a software bug. (It is the kernel-space equivalent of a segfault.)

Run memtest overnight. Once the package is installed, it should show up as a boot option.

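If that reveals nothing, it is probably software. Compare the different crash logs and see whether the address reported on the first line, or the call trace given partway through, have anything in common. If they are all very similar, it is probably a software bug. Report it as a kernel bug to the distribution and see what help you get.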
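As a rough illustration of that comparison, here is a minimal Python sketch that pulls the faulting address, the faulting instruction symbol, the process name and the kernel call-trace symbols out of two oops logs and reports which of them match. The function names (`summarize_oops`, `shared_fields`) are made up for this example, the regular expressions assume the 2.6.37-era x86_64 oops layout shown in the question, and the sample strings are shortened excerpts of the two traces posted here.

```python
import re

# Shortened excerpts of the two traces from the question (not the full logs).
OOPS_BLOCK = """\
[ 2081.280253] BUG: unable to handle kernel paging request at ffff8801ad287000
[ 2081.280262] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[ 2081.280354] Pid: 6623, comm: block Tainted: GD 2.6.37.6-0.5-xen #1 /DQ67OW
[ 2081.280431] Call Trace:
[ 2081.280438] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[ 2081.280444] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[ 2081.280471] [<00007f89c7fb1090>] 0x7f89c7fb1090
[ 2081.280474] Code: 00 00 41 54 55 53
[ 2081.280499] RIP [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
"""

OOPS_BASH = """\
[17836.273843] BUG: unable to handle kernel paging request at ffff8801ad287000
[17836.273853] IP: [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
[17836.273943] Pid: 9479, comm: bash Tainted: GD 2.6.37.6-0.5-xen #1 /DQ67OW
[17836.274011] Call Trace:
[17836.274019] [<ffffffff8000f90a>] save_i387_xstate+0x1aa/0x210
[17836.274025] [<ffffffff80006a22>] __setup_rt_frame+0x2f2/0x370
[17836.274052] [<00007f5907ecfd80>] 0x7f5907ecfd80
[17836.274055] Code: 00 00 41 54 55 53
[17836.274081] RIP [<ffffffff8000f549>] __sanitize_i387_state+0x29/0x120
"""


def summarize_oops(text):
    """Extract the fields worth comparing between crashes from one oops log."""
    fault = re.search(r"unable to handle kernel paging request at ([0-9a-f]+)", text)
    ip = re.search(r"IP: \[<[0-9a-f]+>\] (\S+)", text)
    comm = re.search(r"comm: (\S+)", text)

    # Only look at the text between "Call Trace:" and the "Code:" dump,
    # so the trailing "RIP [<...>]" line is not counted as a frame.
    after_trace = text.split("Call Trace:", 1)[1] if "Call Trace:" in text else ""
    trace_text = after_trace.split("Code:", 1)[0]
    # Kernel frames look like "[<addr>] symbol+0xoff/0xlen"; the bare
    # user-space return address has no symbol and is skipped.
    frames = re.findall(r"\[<[0-9a-f]+>\] ([A-Za-z_]\w*)\+0x", trace_text)

    return {
        "fault_addr": fault.group(1) if fault else None,
        "ip": ip.group(1) if ip else None,
        "comm": comm.group(1) if comm else None,
        "call_trace": tuple(frames),
    }


def shared_fields(a, b):
    """Return the fields that have identical values in both summaries."""
    return {key: value for key, value in a.items() if b.get(key) == value}


first = summarize_oops(OOPS_BLOCK)
second = summarize_oops(OOPS_BASH)
print("identical fields:", sorted(shared_fields(first, second)))
```

On these two excerpts the faulting address, the faulting symbol and the call-trace symbols all match, and only the process name differs.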

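Although I didn't run memtest for very long, I was suspicious of the openSUSE install. It was a clean install, but my hunch was a kernel issue or something along those lines.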

So I installed Debian on a different partition, moved my virtual machines and everything else over, and there hasn't been a single failure since.

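I think the most likely contributing factor is that the Debian Xen kernel is 2.6.32, while the openSUSE one is 2.6.37. It could be a bug in the kernel, or just an incompatibility in the configuration.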

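When I get time, I will compare the .configs. It has been running for a few days now; I used to get an oops about once an hour on average, and now I don't…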
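For the .config comparison, something along these lines should be enough. This is only a sketch, assuming both files use the standard kernel .config format (`CONFIG_FOO=value` lines and `# CONFIG_FOO is not set` comments). The helper names `parse_kconfig` and `diff_kconfig` are invented for this example, and the two sample snippets are made-up values for illustration, not the real Debian or openSUSE configs.

```python
import re

# Made-up sample snippets for illustration only; real files would be read
# from wherever each distribution installs its kernel config.
SAMPLE_OLD = """\
CONFIG_XEN=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_HZ=250
"""

SAMPLE_NEW = """\
CONFIG_XEN=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
CONFIG_HZ=250
CONFIG_MD_RAID456=m
"""


def parse_kconfig(text):
    """Map each option name to its value; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.match(r"# (CONFIG_\w+) is not set$", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(old, new):
    """Return {option: (old_value, new_value)} for every option that differs.

    None means the option does not appear in that file at all.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        old_value, new_value = old.get(name), new.get(name)
        if old_value != new_value:
            changed[name] = (old_value, new_value)
    return changed


differences = diff_kconfig(parse_kconfig(SAMPLE_OLD), parse_kconfig(SAMPLE_NEW))
for name, (old_value, new_value) in differences.items():
    print(f"{name}: {old_value} -> {new_value}")
```

Since the two kernels are several releases apart, many differences will simply be options that exist in one version and not the other, so it probably makes sense to look first at options present in both files with different values.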