Debian, mdadm, degraded array: disk becomes a spare after being re-added

Tonight I received this message, generated by mdadm on my server:

    This is an automatically generated mail message from mdadm

    A DegradedArray event had been detected on md device /dev/md3.

    Faithfully yours, etc.

    PS The /proc/mdstat file currently contains the following:

    Personalities : [raid1]
    md4 : active raid1 sdb4[0] sda4[1]
          474335104 blocks [2/2] [UU]
    md3 : active raid1 sdb3[2](F) sda3[1]
          10000384 blocks [2/1] [_U]
    md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
          4000064 blocks [2/2] [UU]
    md1 : active raid1 sdb1[0] sda1[1]
          48064 blocks [2/2] [UU]

I removed /dev/sdb3 from /dev/md3 and re-added it. It rebuilt for a while and then became a spare device, so now I have these stats:

    cat /proc/mdstat
    Personalities : [raid1]
    md4 : active raid1 sdb4[0] sda4[1]
          474335104 blocks [2/2] [UU]
    md3 : active raid1 sdb3[2](S) sda3[1]
          10000384 blocks [2/1] [_U]
    md2 : active (auto-read-only) raid1 sdb2[0] sda2[1]
          4000064 blocks [2/2] [UU]
    md1 : active raid1 sdb1[0] sda1[1]
          48064 blocks [2/2] [UU]


    mdadm -D /dev/md3
    /dev/md3:
            Version : 0.90
      Creation Time : Sat Jun 28 14:47:58 2008
         Raid Level : raid1
         Array Size : 10000384 (9.54 GiB 10.24 GB)
      Used Dev Size : 10000384 (9.54 GiB 10.24 GB)
       Raid Devices : 2
      Total Devices : 2
    Preferred Minor : 3
        Persistence : Superblock is persistent

        Update Time : Sun Sep 4 16:30:46 2011
              State : clean, degraded
     Active Devices : 1
    Working Devices : 2
     Failed Devices : 0
      Spare Devices : 1

               UUID : 1c32c34a:52d09232:fc218793:7801d094
             Events : 0.7172118

        Number   Major   Minor   RaidDevice State
           0       0        0        0      removed
           1       8        3        1      active sync   /dev/sda3

           2       8       19        -      spare   /dev/sdb3

These are the last entries in /var/log/messages:

    Sep 4 16:15:45 ogw2 kernel: [1314646.950806] md: unbind<sdb3>
    Sep 4 16:15:45 ogw2 kernel: [1314646.950820] md: export_rdev(sdb3)
    Sep 4 16:17:00 ogw2 kernel: [1314721.977950] md: bind<sdb3>
    Sep 4 16:17:00 ogw2 kernel: [1314722.011058] RAID1 conf printout:
    Sep 4 16:17:00 ogw2 kernel: [1314722.011064] --- wd:1 rd:2
    Sep 4 16:17:00 ogw2 kernel: [1314722.011070] disk 0, wo:1, o:1, dev:sdb3
    Sep 4 16:17:00 ogw2 kernel: [1314722.011073] disk 1, wo:0, o:1, dev:sda3
    Sep 4 16:17:00 ogw2 kernel: [1314722.012667] md: recovery of RAID array md3
    Sep 4 16:17:00 ogw2 kernel: [1314722.012673] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
    Sep 4 16:17:00 ogw2 kernel: [1314722.012677] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
    Sep 4 16:17:00 ogw2 kernel: [1314722.012684] md: using 128k window, over a total of 10000384 blocks.
    Sep 4 16:20:25 ogw2 kernel: [1314927.480582] md: md3: recovery done.
    Sep 4 16:20:27 ogw2 kernel: [1314929.252395] ata2.00: configured for UDMA/133
    Sep 4 16:20:27 ogw2 kernel: [1314929.260419] ata2.01: configured for UDMA/133
    Sep 4 16:20:27 ogw2 kernel: [1314929.260437] ata2: EH complete
    Sep 4 16:20:29 ogw2 kernel: [1314931.068402] ata2.00: configured for UDMA/133
    Sep 4 16:20:29 ogw2 kernel: [1314931.076418] ata2.01: configured for UDMA/133
    Sep 4 16:20:29 ogw2 kernel: [1314931.076436] ata2: EH complete
    Sep 4 16:20:30 ogw2 kernel: [1314932.884390] ata2.00: configured for UDMA/133
    Sep 4 16:20:30 ogw2 kernel: [1314932.892419] ata2.01: configured for UDMA/133
    Sep 4 16:20:30 ogw2 kernel: [1314932.892436] ata2: EH complete
    Sep 4 16:20:32 ogw2 kernel: [1314934.828390] ata2.00: configured for UDMA/133
    Sep 4 16:20:32 ogw2 kernel: [1314934.836397] ata2.01: configured for UDMA/133
    Sep 4 16:20:32 ogw2 kernel: [1314934.836413] ata2: EH complete
    Sep 4 16:20:34 ogw2 kernel: [1314936.776392] ata2.00: configured for UDMA/133
    Sep 4 16:20:34 ogw2 kernel: [1314936.784403] ata2.01: configured for UDMA/133
    Sep 4 16:20:34 ogw2 kernel: [1314936.784419] ata2: EH complete
    Sep 4 16:20:36 ogw2 kernel: [1314938.760392] ata2.00: configured for UDMA/133
    Sep 4 16:20:36 ogw2 kernel: [1314938.768395] ata2.01: configured for UDMA/133
    Sep 4 16:20:36 ogw2 kernel: [1314938.768422] sd 1:0:0:0: [sda] Unhandled sense code
    Sep 4 16:20:36 ogw2 kernel: [1314938.768426] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Sep 4 16:20:36 ogw2 kernel: [1314938.768431] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
    Sep 4 16:20:36 ogw2 kernel: [1314938.768438] Descriptor sense data with sense descriptors (in hex):
    Sep 4 16:20:36 ogw2 kernel: [1314938.768441] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Sep 4 16:20:36 ogw2 kernel: [1314938.768454] 01 ac b6 4a
    Sep 4 16:20:36 ogw2 kernel: [1314938.768459] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
    Sep 4 16:20:36 ogw2 kernel: [1314938.768468] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b5 f8 00 03 80 00
    Sep 4 16:20:36 ogw2 kernel: [1314938.768527] ata2: EH complete
    Sep 4 16:20:38 ogw2 kernel: [1314940.788406] ata2.00: configured for UDMA/133
    Sep 4 16:20:38 ogw2 kernel: [1314940.796394] ata2.01: configured for UDMA/133
    Sep 4 16:20:38 ogw2 kernel: [1314940.796415] ata2: EH complete
    Sep 4 16:20:40 ogw2 kernel: [1314942.728391] ata2.00: configured for UDMA/133
    Sep 4 16:20:40 ogw2 kernel: [1314942.736395] ata2.01: configured for UDMA/133
    Sep 4 16:20:40 ogw2 kernel: [1314942.736413] ata2: EH complete
    Sep 4 16:20:42 ogw2 kernel: [1314944.548391] ata2.00: configured for UDMA/133
    Sep 4 16:20:42 ogw2 kernel: [1314944.556393] ata2.01: configured for UDMA/133
    Sep 4 16:20:42 ogw2 kernel: [1314944.556414] ata2: EH complete
    Sep 4 16:20:44 ogw2 kernel: [1314946.372392] ata2.00: configured for UDMA/133
    Sep 4 16:20:44 ogw2 kernel: [1314946.380392] ata2.01: configured for UDMA/133
    Sep 4 16:20:44 ogw2 kernel: [1314946.380411] ata2: EH complete
    Sep 4 16:20:46 ogw2 kernel: [1314948.196391] ata2.00: configured for UDMA/133
    Sep 4 16:20:46 ogw2 kernel: [1314948.204391] ata2.01: configured for UDMA/133
    Sep 4 16:20:46 ogw2 kernel: [1314948.204411] ata2: EH complete
    Sep 4 16:20:48 ogw2 kernel: [1314950.144390] ata2.00: configured for UDMA/133
    Sep 4 16:20:48 ogw2 kernel: [1314950.152392] ata2.01: configured for UDMA/133
    Sep 4 16:20:48 ogw2 kernel: [1314950.152416] sd 1:0:0:0: [sda] Unhandled sense code
    Sep 4 16:20:48 ogw2 kernel: [1314950.152419] sd 1:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Sep 4 16:20:48 ogw2 kernel: [1314950.152424] sd 1:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
    Sep 4 16:20:48 ogw2 kernel: [1314950.152431] Descriptor sense data with sense descriptors (in hex):
    Sep 4 16:20:48 ogw2 kernel: [1314950.152434] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    Sep 4 16:20:48 ogw2 kernel: [1314950.152447] 01 ac b6 4a
    Sep 4 16:20:48 ogw2 kernel: [1314950.152452] sd 1:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
    Sep 4 16:20:48 ogw2 kernel: [1314950.152461] sd 1:0:0:0: [sda] CDB: Read(10): 28 00 01 ac b6 48 00 00 08 00
    Sep 4 16:20:48 ogw2 kernel: [1314950.152523] ata2: EH complete
    Sep 4 16:20:48 ogw2 kernel: [1314950.575325] RAID1 conf printout:
    Sep 4 16:20:48 ogw2 kernel: [1314950.575332] --- wd:1 rd:2
    Sep 4 16:20:48 ogw2 kernel: [1314950.575337] disk 0, wo:1, o:1, dev:sdb3
    Sep 4 16:20:48 ogw2 kernel: [1314950.575341] disk 1, wo:0, o:1, dev:sda3
    Sep 4 16:20:48 ogw2 kernel: [1314950.575344] RAID1 conf printout:
    Sep 4 16:20:48 ogw2 kernel: [1314950.575347] --- wd:1 rd:2
    Sep 4 16:20:48 ogw2 kernel: [1314950.575350] disk 1, wo:0, o:1, dev:sda3

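So I don't understand why this device (sdb3) became a spare and why the RAID did not sync.

Can anyone point out what I should do?

Update: I forgot to mention that /dev/md3 is mounted as the / (root) partition and contains all system directories except /boot.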

It looks like MD kept the wrong device. sda is going bad: it threw an unrecoverable read error while its blocks were being read to resync sdb.
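To make the log easier to read, here is a small sketch that decodes the hex bytes from the two failing reads. It assumes the last four bytes of the descriptor sense data (`01 ac b6 4a`) are the failing LBA, and that the Read(10) CDB carries the start LBA in bytes 2-5 and the sector count in bytes 7-8, which is the usual SCSI layout. The variable and function names are made up for illustration.

```shell
# Values copied from the kernel log in the question (hex bytes joined together).
bad_lba=$((0x01acb64a))        # last four sense bytes: 01 ac b6 4a
read1_start=$((0x01acb5f8))    # first failing Read(10), LBA bytes: 01 ac b5 f8
read1_count=$((0x0380))        # first failing Read(10), length bytes: 03 80
read2_start=$((0x01acb648))    # second failing Read(10), LBA bytes: 01 ac b6 48
read2_count=$((0x0008))        # second failing Read(10), length bytes: 00 08

# in_range LBA START COUNT: succeeds if START <= LBA < START + COUNT
in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -lt $(($2 + $3)) ]
}

echo "failing sector on sda: $bad_lba"
in_range "$bad_lba" "$read1_start" "$read1_count" && echo "inside first read"
in_range "$bad_lba" "$read2_start" "$read2_count" && echo "inside second read"
```

Both reads fail on the same sector, which points to one unreadable spot on sda rather than a flaky link. Note that this is a whole-disk sector number, not an offset inside sda3; the partition table would be needed to map it.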

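Has the data on sda changed since sdb was removed? If not, you might be in luck: even after the failed resync, the filesystem on sdb may still be in a consistent state, so try to get MD to assemble the array with sdb.

That's a bit of a long shot, though. More likely, you're about to find out how well your backup strategy works.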

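Note that all of your MD arrays are at risk, not just the one that is "officially" degraded, because they are all built on the same two physical devices: sda and sdb. I certainly hope you have proper backups and/or a system recovery procedure in place in case things go pear-shaped. As Shane Madden points out, the log of the resync shows a worrying error that may indicate sda itself is not healthy.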
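Since every array shares the same two disks, it is worth checking all of them at once. The sketch below lists arrays whose member-status field in /proc/mdstat contains an underscore (for example `[_U]`). The function name `degraded_arrays` is hypothetical, and it assumes the usual two-line-per-array mdstat layout.

```shell
# Print the name of every md array whose status field shows a missing member.
# Reads /proc/mdstat by default; pass a file name to check a saved copy.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the array this block is about
        /\[[U_]+\]/ && name != "" {        # status line, e.g. "... [2/1] [_U]"
            if (index($NF, "_")) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Usage on the affected server:
#   degraded_arrays
```

With the mdstat contents from the question this prints only `md3`, but a clean `[UU]` says nothing about the health of the underlying disk, so a SMART check of sda (for instance `smartctl -H -A /dev/sda`, if smartmontools is installed) is a sensible companion.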

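The best thing to do is pull sdb and replace it immediately. If you don't have a replacement handy, order one (and perhaps use the intervening time to take one last full backup of all the arrays while they are still good!). Your replacement drive will need to be partitioned appropriately, and its partitions then added to each of your four arrays. Hopefully everything goes well and all the arrays resync.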
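The partition-and-add step could look roughly like the sketch below. It is a dry run by default and only prints the commands. It assumes the surviving disk is /dev/sda, the replacement shows up as /dev/sdb, both use an MBR partition table that `sfdisk` can copy, and partition N belongs to /dev/mdN as in the question. The `run` and `plan_rebuild` helpers are hypothetical names.

```shell
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 only after double-checking every device name.
DRY_RUN=${DRY_RUN:-1}
GOOD=${GOOD:-/dev/sda}   # assumption: the disk that still holds the data
NEW=${NEW:-/dev/sdb}     # assumption: the name the replacement disk receives

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

plan_rebuild() {
    # Copy the partition layout from the surviving disk to the new one.
    run "sfdisk -d $GOOD | sfdisk $NEW"
    # Add each new partition to its array; md starts the recovery itself.
    for n in 1 2 3 4; do
        run "mdadm /dev/md$n --add $NEW$n"
    done
}

plan_rebuild
```

Getting `GOOD` and `NEW` the wrong way round would overwrite the only good copy, which is why the sketch refuses to do anything real unless `DRY_RUN=0` is set. Recovery progress can then be followed in /proc/mdstat.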

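However, if Shane is right and further errors from the failing sda prevent a proper reassembly/resync, then the next thing to try is to pull sda, replace it with the old sdb (which may still be good), and see whether the combination of the old sdb and the new replacement drive resyncs successfully.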

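Finally, if none of the above works, one last thing to try (before a complete system rebuild and restore) is to replace the drive controller. I have seen drive controllers flake out and cause problems for otherwise healthy arrays. One way to test whether the controller is the cause of the MD errors is to put one of the "failed" drives into another Linux machine that has a known-good controller and the mdadm tools installed. Since all of your arrays are RAID1, the array on any single drive should assemble into a usable (albeit degraded) state, and you can then check the filesystems, take backups, and so on.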
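On the test machine, inspecting a single RAID1 member might go something like the sketch below. It is again a dry run that only prints commands. The device name /dev/sdc3, the md node /dev/md9, and the mount point /mnt/rescue are placeholders for whatever is free on that machine, and `inspect_member` is a hypothetical helper. Everything is opened read-only so the member is not modified.

```shell
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

# inspect_member PARTITION [MD_DEVICE] [MOUNT_POINT]
inspect_member() {
    part=$1
    md=${2:-/dev/md9}          # placeholder: any unused md device node
    mnt=${3:-/mnt/rescue}      # placeholder: any empty directory
    run "mdadm --examine $part"                        # show the member's superblock
    run "mdadm --assemble --run --readonly $md $part"  # start degraded, read-only
    run "fsck -n $md"                                  # check only, change nothing
    run "mkdir -p $mnt"
    run "mount -o ro $md $mnt"
}

inspect_member /dev/sdc3
```

If the member assembles and the filesystem checks out here but not on the original server, that points toward the controller or cabling rather than the drive. A member whose superblock marks it as a spare may refuse to start on its own, in which case the `--examine` output is the place to look first.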