在我们的一个OpenSUSE(11.3)虚拟化服务器上,我们最近遇到了一些稳定性问题。 服务器负载非常重,最多有100个虚拟域,全部都在LVM快照上运行。 我们有服务器突然挂起(仍在ping响应,无法login)。
服务器将大量来自LVM的消息写入日志中,但是我不确定这些消息是否正常。 这是服务器挂起之前的日志示例:
Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:26:35 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:27:15 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:27:15 xen lvm[3384]: dm_task_run failed, errno = 22, Invalid argument Oct 28 13:28:25 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:30:45 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:30:55 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:30:55 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:05 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:15 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:21 xen kernel: [76866.784142] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen Oct 28 13:31:21 xen kernel: [76866.784158] sr 2:0:0:0: [sr0] CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00 Oct 28 13:31:21 xen kernel: [76866.784183] ata3.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in Oct 28 13:31:21 xen kernel: [76866.784186] res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout) Oct 28 13:31:21 xen kernel: [76866.784192] ata3.00: status: { DRDY } Oct 28 13:31:25 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:25 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:26 xen kernel: [76871.824061] ata3: link is slow to respond, please be patient (ready=0) Oct 28 13:31:31 xen kernel: [76876.808057] ata3: device not ready (errno=-16), forcing hardreset Oct 28 13:31:31 xen kernel: [76876.808080] ata3: soft resetting link Oct 28 13:31:31 xen kernel: [76876.972455] ata3: nv_mode_filter: 0x739f&0x739f->0x739f, BIOS=0x7000 (0xc0000000) ACPI=0x701f (60:900:0x11) Oct 28 13:31:31 xen kernel: [76876.988343] ata3.00: configured for UDMA/33 Oct 28 13:31:31 xen kernel: [76876.988971] ata3: EH complete Oct 28 13:31:35 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:31:45 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:32:05 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:32:35 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:35:55 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:36:05 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 28 13:36:25 xen lvm[3384]: Another thread is handling an event. Waiting...
在这一点上,我有四个问题:
编辑:在下一个悬挂我有更多的信息:
Oct 30 01:59:25 xen lvm[3384]: Another thread is handling an event. Waiting... Oct 30 02:01:15 xen kernel: [208260.160037] BUG: soft lockup - CPU#3 stuck for 67s! [qemu-dm:12987] Oct 30 02:01:15 xen kernel: [208260.160053] BUG: soft lockup - CPU#4 stuck for 67s! [qemu-dm:26429] Oct 30 02:01:15 xen kernel: [208260.160070] Modules linked in: Oct 30 02:01:15 xen kernel: [208260.160074] Modules linked in: bridge stp llc bridge usbbk stp llc gntdev usbbk gntdev netbk netbk blkbk blkbk blkback hcd usbcore sd_mod dm_snapshot xenblk ehci_hcd cdrom xennet usbcore edd dm_mod dm_snapshot fan pata_amd xenblk sata_nv libata cdrom scsi_mod thermal xennet processor thermal_sys edd hwmon Oct 30 02:01:15 xen kernel: [208260.160276] dm_modCPU 4 Oct 30 02:01:15 xen kernel: [208260.160283] Modules linked in: fan bridge stp pata_amd llc usbbk gntdev netbk sata_nv blkbk blkback_pagemap libata blktap domctl xenbus_be scsi_mod evtchn nfs lockd thermal fscache nfs_acl auth_rpcgss processor sunrpc fuse loop thermal_sys tg3 pcspkr i2c_nforce2 hwmon i2c_core ghes 8250_pnp Oct 30 02:01:15 xen kernel: [208260.160342] CPU 3 8250 Oct 30 02:01:15 xen kernel: [208260.160346] Modules linked in: shpchp bridge stp llc k10temp usbbk gntdev netbk hed blkbk blkback_pagemap blktap domctl sr_mod xenbus_be evtchn nfs forcedeth pci_hotplug lockd serial_core serio_raw sg fscache nfs_acl button auth_rpcgss sunrpc fuse usbhid hid linear ohci_hcd loop sd_mod ehci_hcd tg3 pcspkr i2c_nforce2 usbcore i2c_core ghes 8250_pnp dm_snapshot 8250 shpchp k10temp xenblk hed sr_mod cdrom forcedeth xennet edd dm_mod pci_hotplug serial_core fan serio_raw sg pata_amd button usbhid hid sata_nv libata linear scsi_mod ohci_hcd sd_mod thermal ehci_hcd processor thermal_sys hwmon usbcore dm_snapshot Oct 30 02:01:15 xen kernel: [208260.160471] xenblk Oct 30 02:01:15 xen kernel: [208260.160477] cdrom
唯一的问题是我的答案是第一个:这个消息对我的 LVM来说并不正常。
谷歌search消息并没有带来很多比这个线程 ,一个人有LVM挂在驱动器故障,显然lvm无法检测到驱动器出于某种原因失败(进一步的post进入原因,但我不熟悉多path或SAN来了解正在发生的事情)。 如果你不使用多path,它可能不会为你做很多事情。
根据我的经验,驱动器重置是驱动器出现故障的标志,但是通常失败的驱动器会在内核决定尝试重置之前logging一些超时错误。 尝试阅读所涉及的驱动器的SMART信息,并查看驱动器是否认为有问题? 这也可能是一个驱动器或控制器固件的问题,因为第一个日志似乎意味着该驱动器突然停止响应,但也许超时错误丢失在lvm消息stream之前的某处。 如果ata3不是一个新的驱动器,那么固件问题的可能性较小。 有趣的是,驱动器在重置后以UDMA / 33模式(旧的PATA速度)模式注册,这意味着驱动器没有正确复位,这可以解释为什么LVM保持挂起状态。