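I'm having a problem here. I have an Ubuntu Linux server with two SAS drives configured in software RAID 1 (created with mdadm). The RAID runs fine for about a day: I can run cat /proc/mdstat and it shows that both disks are active and everything is healthy. Then, unexpectedly, the second disk fails and the array goes into degraded mode.
I then remove the disk from the array, reboot the server, and re-add the disk to the array. The RAID rebuilds itself, and I have a healthy RAID 1 running again on the same disks. Then, after roughly 12-24 hours, the second drive fails again.
The drives are brand new, so I'd like to think the hardware is fine. Here is the output I captured from kern.log and syslog when the disk failed.
Can anyone interpret this, or does anyone have an idea what might be happening?
Thanks!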
kern.log:
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.180815] sd 2:0:0:0: Attached scsi generic sg1 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.181086] sd 2:0:1:0: Attached scsi generic sg2 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.181376] sd 2:0:1:0: [sdb] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182584] sd 2:0:1:0: [sdb] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182591] sd 2:0:1:0: [sdb] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.182835] sd 2:0:0:0: [sda] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.183802] sd 2:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.185146] sd 2:0:0:0: [sda] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.185151] sd 2:0:0:0: [sda] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.188191] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.191403] sd 2:0:1:0: [sdb] Attached SCSI disk
Feb 28 20:34:55 CSTEP-APPS20 kernel: [ 9.299351] sd 2:0:0:0: [sda] Attached SCSI disk
Mar 1 09:01:22 CSTEP-APPS20 kernel: [44807.010040] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:01:32 CSTEP-APPS20 kernel: [44817.560056] sd 2:0:1:0: [sdb] CDB: Test Unit Ready: 00 00 00 00 00 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
And syslog:
Mar 1 09:01:43 CSTEP-APPS20 kernel: [44827.860060] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
Mar 1 09:01:43 CSTEP-APPS20 kernel: [44827.860070] mptbase: ioc0: Initiating recovery
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.470073] scsi target2:0:0: Beginning Domain Validation
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
Mar 1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
Mar 1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:15 CSTEP-APPS20 kernel: [44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
Mar 1 09:02:15 CSTEP-APPS20 kernel: [44860.553915] mptbase: ioc0: Initiating recovery
Mar 1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380429] end_request: I/O error, dev sdb, sector 55297928
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380562] __ratelimit: 24 callbacks suppressed
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380566] raid1: sdb1: rescheduling sector 55295880
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380695] end_request: I/O error, dev sdb, sector 55297984
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380817] raid1: sdb1: rescheduling sector 55295936
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381019] end_request: I/O error, dev sdb, sector 63983488
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381142] md: super_written gets error=-5, uptodate=0
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381146] raid1: Disk failure on sdb1, disabling device.
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.381148] raid1: Operation continuing on 1 devices.
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398144] scsi target2:0:0: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398226] scsi target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI WRFLOW PCOMP (6.25 ns, offset 127)
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.398295] scsi target2:0:1: Beginning Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648493] scsi target2:0:1: Domain Validation Initial Inquiry Failed
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648623] scsi target2:0:1: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648691] scsi target2:0:1: asynchronous
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.648760] scsi target2:0:8: Beginning Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.649386] scsi target2:0:8: Ending Domain Validation
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.649458] scsi target2:0:8: asynchronous
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653384] RAID1 conf printout:
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653390] --- wd:1 rd:2
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653395] disk 0, wo:0, o:1, dev:sda1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.653399] disk 1, wo:1, o:0, dev:sdb1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693763] RAID1 conf printout:
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693767] --- wd:1 rd:2
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.693771] disk 0, wo:0, o:1, dev:sda1
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.714266] raid1: sda1: redirecting sector 55295880 to another mirror
Mar 1 09:02:45 CSTEP-APPS20 kernel: [44890.719943] raid1: sda1: redirecting sector 55295936 to another mirror
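It looks like the device /dev/sdb is going offline. You may have a cabling problem, but it could also be the disk itself. A conflict between the disk firmware and the controller is possible as well.
I would run the manufacturer's diagnostics on the disks right away. Being brand new does not put them above suspicion. (In fact, I would suspect brand-new disks a little more than disks that have already been running for a few months.)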
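As a cross-check when reading the log, the Read(10) CDB bytes can be decoded by hand to see which sectors the stalled commands were addressing. The following is only a minimal sketch: decode_read10 is a hypothetical helper name, and the 2048-sector offset is an assumption inferred from the difference between the end_request sector and the raid1 sector in the same log, not something read from the partition table.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count). Byte 0 is the opcode (0x28),
    bytes 2-5 are the big-endian LBA, bytes 7-8 the transfer length.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count


# Assumption: offset between whole-disk sectors (sdb) and the sector
# numbers raid1 reports for sdb1, inferred from the log above.
ASSUMED_OFFSET = 2048

for cdb in ("28 00 03 4b c7 88 00 00 10 00",
            "28 00 03 4b c7 c0 00 00 80 00"):
    lba, count = decode_read10(cdb)
    print(f"disk sector {lba}, {count} blocks, md sector {lba - ASSUMED_OFFSET}")
```

The decoded values (55297928 and 55297984) line up with the two end_request I/O errors, and subtracting the assumed offset gives the sectors raid1 rescheduled (55295880 and 55295936). In other words, the same two reads timed out, survived every reset level, and were finally failed with DID_NO_CONNECT; the log by itself does not say whether the disk, the cable, or the controller is at fault.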
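I don't understand why you assume the drives are good. Even new drives fail. In my professional experience, infant mortality among hard drives is just as common as death from old age. That is why many shops run a burn-in on their equipment.
Swap the drive for a known-good one and see what happens, or at least check the number of bad blocks with SMART or a diagnostic tool.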
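One way to do the SMART check is with smartctl from smartmontools. The sketch below is hedged: it assumes smartmontools is installed and the script runs as root, and it assumes the report contains the "SMART Health Status" and "Elements in grown defect list" lines that smartctl usually prints for SCSI/SAS drives. The exact wording can vary by drive and version, so treat the patterns as a starting point. The function names are hypothetical.

```python
import re
import subprocess


def run_smartctl(device):
    """Return the text of `smartctl -a <device>` (needs root and smartmontools)."""
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def summarize(report):
    """Pull the health status and grown defect count out of a smartctl report.

    Assumption: the labels below match what smartctl prints for this drive;
    a missing field is returned as None rather than guessed.
    """
    health = re.search(r"SMART Health Status:\s*(.+)", report)
    defects = re.search(r"Elements in grown defect list:\s*(\d+)", report)
    return {
        "health": health.group(1).strip() if health else None,
        "grown_defects": int(defects.group(1)) if defects else None,
    }


if __name__ == "__main__":
    # /dev/sdb is the device that dropped out in the question.
    print(summarize(run_smartctl("/dev/sdb")))
```

A grown defect count that keeps rising between runs would point at the disk; a clean report on a drive that still drops off the bus makes cabling, termination, or the controller more likely suspects.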