mdadm did not detect a disk failure

I use software RAID (RAID 1) on my server. Last Saturday one of the disks failed, as shown by the following errors in my log:

    Mar 16 08:38:40 storage-1 kernel: [694968.826388] ata2.01: status: { DRDY ERR }
    Mar 16 08:38:40 storage-1 kernel: [694968.826412] ata2.01: error: { UNC }
    Mar 16 08:38:40 storage-1 kernel: [694968.848390] ata2.00: configured for UDMA/133
    Mar 16 08:38:40 storage-1 kernel: [694968.864359] ata2.01: configured for UDMA/133
    Mar 16 08:38:40 storage-1 kernel: [694968.864366] sd 1:0:1:0: [sdc] Unhandled sense code
    Mar 16 08:38:40 storage-1 kernel: [694968.864368] sd 1:0:1:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Mar 16 08:38:40 storage-1 kernel: [694968.864371] sd 1:0:1:0: [sdc] Sense Key : Medium Error [current] [descriptor]
    Mar 16 08:38:40 storage-1 kernel: [694968.864374] Descriptor sense data with sense descriptors (in hex):
    Mar 16 08:38:40 storage-1 kernel: [694968.864376] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Mar 16 08:38:40 storage-1 kernel: [694968.864382] 05 10 b7 3f
    Mar 16 08:38:40 storage-1 kernel: [694968.864384] sd 1:0:1:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed
    Mar 16 08:38:40 storage-1 kernel: [694968.864388] sd 1:0:1:0: [sdc] CDB: Read(10): 28 00 05 10 b7 3f 00 00 90 00
    Mar 16 08:38:40 storage-1 kernel: [694968.864393] end_request: I/O error, dev sdc, sector 84981567
    Mar 16 08:38:40 storage-1 kernel: [694968.864421] raid1: sdc1: rescheduling sector 84981504
    Mar 16 08:38:40 storage-1 kernel: [694968.864451] ata2: EH complete
    Mar 16 08:38:40 storage-1 kernel: [694973.825824] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Mar 16 08:38:40 storage-1 kernel: [694973.825854] ata2.01: failed command: READ DMA
    Mar 16 08:38:40 storage-1 kernel: [694973.825880] ata2.01: cmd c8/00:20:3f:ba:10/00:00:00:00:00/f5 tag 0 dma 16384 in
    Mar 16 08:38:40 storage-1 kernel: [694973.825882] res 51/40:20:3f:ba:10/00:00:00:00:00/f5 Emask 0x9 (media error)

However, when I check with cat /proc/mdstat, mdadm has not detected this disk failure. The disk is still an active member of the md3 array, like this:

    rivo@storage-1:~$ cat /proc/mdstat
    Personalities : [raid1]
    md3 : active raid1 sdc1[0] sdd1[1]
          976759936 blocks [2/2] [UU]

This caused slow I/O, which made the server slow to access.

Does anyone know why mdadm did not detect this disk failure and automatically remove the failed disk from the RAID?

Is there a way to configure mdadm better so that this kind of failure is detected in the future?

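mdadm does not monitor drives for problems; it only knows whether a disk is running and can be synced. This is not an exact explanation; perhaps someone else knows more and will write about it. For better drive monitoring, use smartmontools and its daemon, smartd. If you want to receive mail when an error is detected, the configuration file (/etc/smartd.conf) should contain something like this: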

    /dev/sda -d ata -H -m [email protected]
    /dev/sdb -d ata -H -m [email protected]
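To see what md itself considers failed, the status line in /proc/mdstat can be checked from a script. The following is only a minimal sketch, assuming the usual `[2/2] [UU]` status format in which `_` marks a missing or faulty member; the function name `degraded_arrays` is hypothetical. It only reports arrays that md has already degraded, so an array that still shows `[UU]` while a member is throwing read errors, as in the question, would not be listed.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status
# field (e.g. [UU] or [U_]) contains "_", meaning a member is missing
# or has been marked faulty. Reads mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }          # remember the current array name
        /\[[U_]+\]/ {                       # status line such as "[2/2] [UU]"
            match($0, /\[[U_]+\]/)
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

With the output shown in the question, the function prints nothing, which matches what mdadm reported at the time.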

To check drive information, use smartctl:

    smartctl -a /dev/sda
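Besides SMART data, the kernel log in the question already showed the problem. As a rough sketch, the I/O errors could be counted per device; this assumes log lines in the `I/O error, dev sdc, sector N` form seen above (other kernel versions may word the message differently), and `io_errors_by_dev` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count "I/O error, dev <name>" kernel messages
# per block device. Reads log text on stdin and prints "<dev> <count>".
io_errors_by_dev() {
    awk '
        /I\/O error, dev / {
            for (i = 1; i < NF; i++)
                if ($i == "dev") {
                    d = $(i + 1)
                    sub(/,$/, "", d)        # strip the trailing comma: "sdc," -> "sdc"
                    count[d]++
                }
        }
        END { for (d in count) print d, count[d] }
    ' | sort
}

# Usage on a live system:
#   dmesg | io_errors_by_dev
```

If one member keeps showing up here while the array still reports `[UU]`, it can be taken out of the array by hand, for example with `mdadm /dev/md3 --fail /dev/sdc1 --remove /dev/sdc1` (device names taken from the question).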