I have a strange error in my logs. It starts like this:
:39:35 host1 kernel: [54674279.243416] mpt2sas0: fault_state(0x2651)!
:39:35 host1 kernel: [54674279.243543] mpt2sas0: sending diag reset !!
:39:36 host1 kernel: [54674280.481215] mpt2sas0: diag reset: SUCCESS
:39:36 host1 kernel: [54674280.713443] mpt2sas0: LSISAS2008: FWVersion(07.15.08.00), ChipRevision(0x03), BiosVersion(07.02.03.00)
:39:36 host1 kernel: [54674280.713451] mpt2sas0: Dell 6Gbps SAS HBA: Vendor(0x1000), Device(0x0072), SSVID(0x1028), SSDID(0x1F1C)
:39:36 host1 kernel: [54674280.713455] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
:39:36 host1 kernel: [54674280.713518] mpt2sas0: sending port enable !!
:39:43 host1 kernel: [54674287.616666] mpt2sas0: port enable: SUCCESS
:39:43 host1 kernel: [54674287.616814] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.617657] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.617735] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.617807] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.617810] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.617813] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.617816] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.617818] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.617833] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.618468] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.618543] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.618614] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.618617] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.618619] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.618622] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.618624] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.618632] mpt2sas0: _base_fault_reset_work: hard reset: success
:39:43 host1 kernel: [54674287.618639] mpt2sas0: removing unresponding devices: start
:39:43 host1 kernel: [54674287.618642] mpt2sas0: removing unresponding devices: complete
:39:43 host1 kernel: [54674287.618654] mpt2sas0: scan devices: start
:39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
:39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
The last message repeats many times per second. Other information that I think is relevant:
This is a Dell machine running Linux, connected to a Dell disk array.
# uname -a
Linux host1 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:48:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
# modinfo -F version mpt2sas
10.100.00.00
# lspci | grep LSI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)
08:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
With more debugging enabled in mpt2sas, this is the result:
mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
phy-7:4: refresh: parent sas_addr(0x590b11c007729400), link_rate(0x08), phy(4) attached_handle(0x0000), sas_addr(0x0000000000000000)
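Other machines connected to different volumes on the same disk array work normally. The disk array and the iDRAC give no clues in their logs; everything appears to be normal. A Google search turns up some horror stories in which the RAID eventually drops all of its disks. The problem is not associated with exceptionally high load.

The behavior persists for hours.

Red Hat seems to have a very similar issue, but apparently no solution (?) yet: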
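To put a number on "many times per second", the failure lines can be counted per second of kernel uptime. Below is a minimal sketch, assuming the log lines keep the bracketed kernel timestamp shown above; the function name `failures_per_second` and the regular expression are my own and not part of any tool.

```python
import re
from collections import Counter

# Matches a bracketed kernel timestamp such as "[54674287.619530]" followed
# by the repeating _scsih_add_device() failure message from the log above.
# The pattern is an assumption based on the lines quoted in this post.
FAILURE_RE = re.compile(
    r"\[(\d+)\.\d+\]\s+mpt2sas\d+: failure at \S+/_scsih_add_device\(\)!"
)


def failures_per_second(lines):
    """Count failure messages per whole second of kernel uptime."""
    counts = Counter()
    for line in lines:
        # finditer also copes with several log entries joined on one line.
        for match in FAILURE_RE.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


# The two failure lines quoted above, used as sample input.
sample = [
    ":39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at "
    "/build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!",
    ":39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at "
    "/build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!",
]

for second, count in sorted(failures_per_second(sample).items()):
    print(second, count)
```

Passing an open log file instead of `sample` would give the rate over the whole incident; which file holds the kernel messages depends on the syslog configuration.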
https://access.redhat.com/solutions/1990653
Unfortunately, I cannot reboot the machine to run experiments.