CentOS RAID 1 drives fail during raid-check

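I have a pair of HP DL320e servers, each with two WD Red 6TB drives configured in a software RAID 1 array. The DL320e has an onboard RAID controller, which is disabled in favour of Linux software RAID.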

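Both machines appear to work fine and the RAID arrays look healthy, except that every time raid-check runs (Sundays at 1 am from the default weekly crontab, and also when I run raid-check manually) one drive goes offline. After this, the device files for the "failed" drive have been removed (e.g. /dev/sda2), but after a cold boot they reappear, and the "failed" drive can be added back into the array, where it seems to work normally.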

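This has been happening ever since the (brand new) machines and disks were installed a few months ago. According to smartctl, none of the drives has had any bad sectors reallocated, so on the basis of several posts elsewhere I tried using hdparm to write to the sectors identified in /var/log/messages, in order to force the drive to detect and remap the bad sectors. This had no effect.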

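I also tried using dd to write zeros over the whole of /dev/sdb2 and /dev/sdb3. This completed without raising any errors and did not cause any bad sectors to be reallocated, but it does seem to indicate that the whole drive surface can be written successfully.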

I have run all of the SMART diagnostics using smartctl, and every test ran to completion.

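Since these machines were both installed new, the failure occurs on both systems, and at least 3 of the 4 drives have "failed" (both drives in one machine have failed at different times), I do not believe this is caused by bad hardware. Indeed, the fact that a dd from /dev/zero completed over the whole of a failed drive shows that the drive can be written across its entire surface.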

The drives are configured with 3 partitions: biosboot, /boot, and root + /home.

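The logs from the two servers are more or less identical, although they report different sector numbers, and the sector numbers reported on the same server also differ from week to week.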

/proc/mdstat reports:

    sh-4.2# cat /proc/mdstat
    Personalities : [raid1]
    md126 : active raid1 sda3[0] sdb3[1]
          5859876672 blocks super 1.2 [2/2] [UU]
          bitmap: 2/44 pages [8KB], 65536KB chunk

    md127 : active raid1 sda2[2] sdb2[3]
          511936 blocks super 1.0 [2/2] [UU]

    unused devices: <none>
    sh-4.2#

Time passes until 1 am, and then:

    WARNING: Your hard drive is failing
    Device: /dev/sda [SAT], unable to open device
    sh-4.2# cat /proc/mdstat
    Personalities : [raid1]
    md126 : active raid1 sda3[0](F) sdb3[1]
          5859876672 blocks super 1.2 [2/1] [_U]
          bitmap: 5/44 pages [20KB], 65536KB chunk

    md127 : active raid1 sda2[2](F) sdb2[3]
          511936 blocks super 1.0 [2/1] [_U]

    unused devices: <none>
    sh-4.2#
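
For anyone wanting to spot this state automatically, the following is a minimal sketch that scans /proc/mdstat text for arrays with a missing member, i.e. a status such as `[2/1] [_U]`. The function name `degraded_arrays` and the embedded `SAMPLE` text (copied from the output above) are illustrative assumptions, not part of any existing tool.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: member_status} for arrays with a missing member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines end in e.g. "[2/1] [_U]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            result[current] = status.group(3)
    return result


# Sample text taken from the degraded output shown in this post.
SAMPLE = """\
Personalities : [raid1]
md126 : active raid1 sda3[0](F) sdb3[1]
      5859876672 blocks super 1.2 [2/1] [_U]
      bitmap: 5/44 pages [20KB], 65536KB chunk

md127 : active raid1 sda2[2](F) sdb2[3]
      511936 blocks super 1.0 [2/1] [_U]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat read with `open("/proc/mdstat").read()`.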

/var/log/messages reports:

    Jun 7 01:00:01 1000 kernel: md: data-check of RAID array md126
    Jun 7 01:00:01 1000 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
    Jun 7 01:00:01 1000 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
    Jun 7 01:00:01 1000 kernel: md: using 128k window, over a total of 5859876672k.
    Jun 7 01:00:07 1000 kernel: md: delaying data-check of md127 until md126 has finished (they share one or more physical units)
    Jun 7 01:01:01 1000 systemd: Starting Session 1544 of user root.
    Jun 7 01:01:01 1000 systemd: Started Session 1544 of user root.
    Jun 7 01:03:43 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
    Jun 7 01:03:43 1000 kernel: ata1: SError: { CommWake }
    Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
    Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:00:80:1b:70/00:00:03:00:00/40 tag 0 ncq 65536 in
             res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun 7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
    Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
    Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:08:00:1c:70/00:00:03:00:00/40 tag 1 ncq 65536 in
             res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun 7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
    Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
    Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:10:00:0d:70/00:00:03:00:00/40 tag 2 ncq 65536 in
             res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
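
In case it helps with reading these entries: the `cmd` field is libata's dump of the ATA taskfile registers, so the LBA, sector count and NCQ tag of each timed-out read can be recovered from it. Below is a minimal sketch of that decoding. The helper names `decode_ncq_cmd` and `outstanding_tags` are made up for illustration, and the register layout (command/feature:nsect:lbal:lbam:lbah/hob registers/device) is my understanding of the libata output, so treat it as an assumption to verify rather than a reference.

```python
import re

# Matches the taskfile dump printed after "cmd", e.g.
# "60/80:00:80:1b:70/00:00:03:00:00/40".
TASKFILE_RE = re.compile(
    r"cmd ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)


def decode_ncq_cmd(line):
    """Decode a libata 'cmd' line for an NCQ command into a dict.

    Returns None when the line carries no taskfile dump.
    """
    match = TASKFILE_RE.search(line)
    if match is None:
        return None
    command = int(match.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in match.group(2).split(":")
    )
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24-47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # For NCQ commands the sector count travels in the feature registers
    # (0 means 65536) and the tag sits in the upper five bits of nsect.
    sectors = ((hob_feature << 8) | feature) or 65536
    return {"command": command, "lba": lba, "sectors": sectors, "tag": nsect >> 3}


def outstanding_tags(sact):
    """Count the NCQ tags still in flight according to an SAct bitmask."""
    return bin(sact).count("1")


if __name__ == "__main__":
    sample = "ata1.00: cmd 60/80:00:80:1b:70/00:00:03:00:00/40 tag 0 ncq 65536 in"
    print(decode_ncq_cmd(sample))
    print(outstanding_tags(0x7FFFFFFF))
```

Under these assumptions the first failed read is 128 sectors starting at LBA 57678720. Assuming 512-byte logical sectors, that is consistent with the `ncq 65536` byte count in the same entry. `SAct 0x7fffffff` has 31 bits set, which fits the tags running from 0 to 30.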

The same failed-command entries then repeat with the tag value increasing up to 30, followed by:

    Jun 7 01:07:10 1000 kernel: ata1.00: cmd 60/80:f0:00:cd:7f/00:00:06:00:00/40 tag 30 ncq 65536 in
             res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun 7 01:07:10 1000 kernel: ata1.00: status: { DRDY }
    Jun 7 01:07:10 1000 kernel: ata1: hard resetting link
    Jun 7 01:07:11 1000 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    Jun 7 01:07:11 1000 kernel: ata1.00: configured for UDMA/133
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:07:11 1000 kernel: ata1: EH complete
    Jun 7 01:09:53 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
    Jun 7 01:09:53 1000 kernel: ata1: SError: { CommWake }
    Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
    Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in
             res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
    Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
    Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:08:00:f7:dd/00:00:08:00:00/40 tag 1 ncq 65536 in
             res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }

The pattern then repeats again up to tag 30:

    Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
    Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:f0:00:f6:dd/00:00:08:00:00/40 tag 30 ncq 65536 in
             res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
    Jun 7 01:09:53 1000 kernel: ata1: hard resetting link
    Jun 7 01:09:59 1000 kernel: ata1: link is slow to respond, please be patient (ready=0)
    Jun 7 01:10:01 1000 systemd: Starting Session 1545 of user root.
    Jun 7 01:10:01 1000 systemd: Started Session 1545 of user root.
    Jun 7 01:10:03 1000 kernel: ata1: COMRESET failed (errno=-16)
    Jun 7 01:10:03 1000 kernel: ata1: hard resetting link
    Jun 7 01:10:04 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
    Jun 7 01:10:09 1000 kernel: ata1: hard resetting link
    Jun 7 01:10:09 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
    Jun 7 01:10:09 1000 kernel: ata1: limiting SATA link speed to 1.5 Gbps
    Jun 7 01:10:14 1000 kernel: ata1: hard resetting link
    Jun 7 01:10:14 1000 kernel: ata1: SATA link down (SStatus 0 SControl 310)
    Jun 7 01:10:14 1000 kernel: ata1.00: disabled
    Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
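
The SStatus and SControl values in the link messages are hexadecimal register dumps, and splitting them into fields shows what the link was doing at each reset. The sketch below assumes the usual SATA register layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). The function names and the wording of the lookup tables are mine, not the kernel's.

```python
# Field meanings for the SATA SStatus register (assumed standard layout).
DET = {
    0: "no device detected",
    1: "device present, no PHY communication",
    3: "device present, PHY communication established",
    4: "PHY offline",
}
SPD = {0: "no negotiated speed", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}
IPM = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}


def decode_sstatus(text):
    """Split an SStatus value, given as the hex string from the log."""
    value = int(text, 16)
    return {
        "det": DET.get(value & 0xF, "reserved"),
        "spd": SPD.get((value >> 4) & 0xF, "reserved"),
        "ipm": IPM.get((value >> 8) & 0xF, "reserved"),
    }


def scontrol_speed_limit(text):
    """Return the speed cap requested in an SControl value (hex string)."""
    limit = (int(text, 16) >> 4) & 0xF
    return "no limit" if limit == 0 else SPD.get(limit, "reserved")


if __name__ == "__main__":
    print(decode_sstatus("133"))         # link state after the first reset
    print(decode_sstatus("0"))           # link state after COMRESET failed
    print(scontrol_speed_limit("310"))   # cap applied before the last retry
```

Read this way, `SStatus 133` is a drive present on an active 6.0 Gbps link, while `SStatus 0` means the port no longer detects any device at all. `SControl 310` is the kernel capping the link at 1.5 Gbps for its final attempt.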

A little more of the log:

    Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
    Jun 7 01:10:14 1000 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
    Jun 7 01:10:14 1000 kernel: Sense Key : Aborted Command [current] [descriptor]
    Jun 7 01:10:14 1000 kernel: Descriptor sense data with sense descriptors (in hex):
    Jun 7 01:10:14 1000 kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
    Jun 7 01:10:14 1000 kernel: 00 00 00 00
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
    Jun 7 01:10:14 1000 kernel: Add. Sense: No additional sense information
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] CDB:
    Jun 7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd f6 80 00 00 00 80 00 00
    Jun 7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148764288
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: rejecting I/O to offline device
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] killing request
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
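
As a cross-check on the sector numbers, the Read(16) CDB can be decoded by hand: bytes 2 to 9 hold the starting LBA and bytes 10 to 13 the transfer length, both big-endian. A minimal sketch, where `decode_read16` is a hypothetical helper name:

```python
def decode_read16(cdb_hex):
    """Return (lba, blocks) from a SCSI READ(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


if __name__ == "__main__":
    print(decode_read16("88 00 00 00 00 00 08 dd f6 80 00 00 00 80 00 00"))
```

For the CDB above this gives LBA 148764288 and a length of 128 blocks, which matches the sector reported in the `end_request` line. So the I/O errors here are simply the outstanding reads being failed once the device was disabled, not additional bad sectors.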

And finally:

    Jun 7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd fb 00 00 00 00 80 00 00
    Jun 7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148765440
    Jun 7 01:10:14 1000 kernel: ata1: EH complete
    Jun 7 01:10:14 1000 kernel: md: super_written gets error=-5, uptodate=0
    Jun 7 01:10:14 1000 kernel: md/raid1:md126: Disk failure on sda3, disabling device.
    md/raid1:md126: Operation continuing on 1 devices.
    Jun 7 01:10:14 1000 kernel: ata1.00: detaching (SCSI 0:0:0:0)
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] Stopping disk
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] START_STOP FAILED
    Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
    Jun 7 01:10:14 1000 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Jun 7 01:10:14 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
    Jun 7 01:10:14 1000 kernel: md: md126: data-check interrupted.
    Jun 7 01:10:14 1000 kernel: md: super_written gets error=-19, uptodate=0
    Jun 7 01:10:14 1000 kernel: md/raid1:md127: Disk failure on sda2, disabling device.
    md/raid1:md127: Operation continuing on 1 devices.
    Jun 7 01:10:15 1000 kernel: md: md127 still in use.
    Jun 7 01:10:15 1000 kernel: md: md126 still in use.
    Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
    Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
    Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
    Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
    Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
    Jun 7 01:20:01 1000 systemd: Created slice user-0.slice.
    Jun 7 01:20:01 1000 systemd: Starting Session 1546 of user root.
    Jun 7 01:20:01 1000 systemd: Started Session 1546 of user root.
    Jun 7 01:30:01 1000 systemd: Created slice user-0.slice.
    Jun 7 01:30:01 1000 systemd: Starting Session 1547 of user root.
    Jun 7 01:30:01 1000 systemd: Started Session 1547 of user root.
    Jun 7 01:36:58 1000 smartd[977]: Device: /dev/sda [SAT], open() failed: No such device
    Jun 7 01:36:58 1000 smartd[977]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
    Jun 7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root produced unexpected output (80 bytes) to STDOUT/STDERR:
    Jun 7 01:36:58 1000 smartd[977]: /usr/libexec/smartmontools/smartdnotify: line 13: /dev/pts/0: Permission denied
    Jun 7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful

If anyone can suggest what might be going wrong here, I would be very grateful.