mdadm stops rebuilding a RAID5 array at 99.9%

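I recently installed three new disks in my QNAP TS-412 NAS.

These three new disks were supposed to be combined with the disk already present into a 4-disk RAID5 array, so I started the migration process.

After several attempts (each taking about 24 hours), the migration appeared to work, but left the NAS unresponsive.

At that point I reset the NAS. Everything went downhill from there: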

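  • The NAS boots, but marks the first disk as failed and removes it from all arrays, leaving them degraded.
  • I ran checks on the disk and could not find any problems (which would be odd anyway, since it is almost new).
  • The admin interface did not offer any recovery options, so I figured I would just do it manually.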

I have successfully rebuilt all the QNAP internal RAID1 arrays (/dev/md4, /dev/md13 and /dev/md9) using mdadm, leaving only the RAID5 array, /dev/md0.

I have now tried this several times, using these commands:

 mdadm -w /dev/md0 

(Required because the NAS mounted the array read-only after /dev/sda3 was removed from it; the array cannot be modified in RO mode.)

 mdadm /dev/md0 --re-add /dev/sda3 

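After that the array starts rebuilding. It stalls at 99.9%, though, while the system is extremely slow and/or unresponsive. (Logging in over SSH fails most of the time.)

Current state of things: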

 [admin@nas01 ~]# cat /proc/mdstat
 Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
 md4 : active raid1 sdd2[2](S) sdc2[1] sdb2[0]
       530048 blocks [2/2] [UU]
 md0 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
       8786092608 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
       [===================>.]  recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec
 md13 : active raid1 sda4[0] sdb4[1] sdd4[3] sdc4[2]
       458880 blocks [4/4] [UUUU]
       bitmap: 0/57 pages [0KB], 4KB chunk
 md9 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
       530048 blocks [4/4] [UUUU]
       bitmap: 2/65 pages [8KB], 4KB chunk
 unused devices: <none>

(It has now been stalled at 2928697160/2928697536 for hours.)
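To put the stall position in perspective, here is a minimal sketch that extracts the two counters from the recovery line and prints how much is left. The function name `recovery_remaining` is made up for illustration; the counters in /proc/mdstat are in 1 KiB blocks.

```shell
#!/bin/sh
# Hypothetical helper: print (total - done) for a /proc/mdstat recovery line.
recovery_remaining() {
    # $1: a line such as "... recovery = 99.9% (2928697160/2928697536) ..."
    pair=$(printf '%s\n' "$1" |
        sed -n 's/.*(\([0-9][0-9]*\)\/\([0-9][0-9]*\)).*/\1 \2/p')
    [ -n "$pair" ] || return 1   # no "(done/total)" pair found
    set -- $pair                 # $1 = done, $2 = total
    echo $(( $2 - $1 ))
}

line='[===================>.]  recovery = 99.9% (2928697160/2928697536) finish=0.0min speed=110K/sec'
echo "remaining: $(recovery_remaining "$line") KiB"
```

So the rebuild is stuck with only a few hundred KiB left on a member of roughly 2.7 TiB.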

 [admin@nas01 ~]# mdadm -D /dev/md0
 /dev/md0:
         Version : 01.00.03
   Creation Time : Thu Jan 10 23:35:00 2013
      Raid Level : raid5
      Array Size : 8786092608 (8379.07 GiB 8996.96 GB)
   Used Dev Size : 2928697536 (2793.02 GiB 2998.99 GB)
    Raid Devices : 4
   Total Devices : 4
 Preferred Minor : 0
     Persistence : Superblock is persistent
     Update Time : Mon Jan 14 09:54:51 2013
           State : clean, degraded, recovering
  Active Devices : 3
 Working Devices : 4
  Failed Devices : 0
   Spare Devices : 1
          Layout : left-symmetric
      Chunk Size : 64K
  Rebuild Status : 99% complete
            Name : 3
            UUID : 0c43bf7b:282339e8:6c730d6b:98bc3b95
          Events : 34111
     Number   Major   Minor   RaidDevice State
        4       8        3        0      spare rebuilding   /dev/sda3
        1       8       19        1      active sync   /dev/sdb3
        2       8       35        2      active sync   /dev/sdc3
        3       8       51        3      active sync   /dev/sdd3

After inspecting /mnt/HDA_ROOT/.logs/kmsg, it turns out that the actual problem appears to be with /dev/sdb3 instead:

 <6>[71052.730000] sd 3:0:0:0: [sdb] Unhandled sense code
 <6>[71052.730000] sd 3:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
 <6>[71052.730000] sd 3:0:0:0: [sdb] Sense Key : 0x3 [current] [descriptor]
 <4>[71052.730000] Descriptor sense data with sense descriptors (in hex):
 <6>[71052.730000] 72 03 00 00 00 00 00 0c 00 0a 80 00 00 00 00 01
 <6>[71052.730000] 5d 3e d9 c8
 <6>[71052.730000] sd 3:0:0:0: [sdb] ASC=0x0 ASCQ=0x0
 <6>[71052.730000] sd 3:0:0:0: [sdb] CDB: cdb[0]=0x88: 88 00 00 00 00 01 5d 3e d9 c8 00 00 00 c0 00 00
 <3>[71052.730000] end_request: I/O error, dev sdb, sector 5859367368
 <4>[71052.730000] raid5_end_read_request: 27 callbacks suppressed
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246784 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246792 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246800 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246808 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246816 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246824 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246832 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246840 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246848 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5:md0: read error not correctable (sector 5857246856 on sdb3).
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.
 <4>[71052.730000] raid5: some error occurred in a active device:1 of md0.

The above sequence repeats at a steady rate for various (random) sectors in the 585724XXXX range.
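The numbers in the kernel log can be cross-checked with plain shell arithmetic. The sketch below decodes the READ(16) CDB (opcode 0x88: bytes 2-9 are the starting LBA, bytes 10-13 the transfer length) and compares the whole-disk sector with the first sector reported on sdb3. The partition offset it prints is only an inference, assuming the first "read error not correctable" line corresponds to the start of the failed request; it was not read from the partition table.

```shell
#!/bin/sh
# Values copied from the kmsg excerpt; variable names are illustrative only.
cdb_lba=$((0x000000015d3ed9c8))   # CDB bytes 2-9: starting LBA on sdb
cdb_len=$((0x000000c0))           # CDB bytes 10-13: length in 512-byte sectors

disk_sector=5859367368            # "end_request: I/O error, dev sdb, sector ..."
part_sector=5857246784            # first "read error not correctable" on sdb3

echo "CDB start LBA:      $cdb_lba"
echo "CDB length:         $cdb_len sectors ($(( cdb_len / 2 )) KiB)"
# Assumption: both sector numbers refer to the same physical location.
part_offset=$(( disk_sector - part_sector ))
echo "implied sdb3 start: sector $part_offset"
```

The decoded LBA matches the sector in the end_request line, so the drive is reporting a medium error (sense key 0x3) for a single 96 KiB read, which md then reports in 8-sector steps.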

My questions are:

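  • Why does it stall so close to the end, while still using so many resources that the system stalls (the md0_raid5 and md0_resync processes are still running)?
  • Is there any way to see what is causing it to fail/stall? <- Probably due to the sdb3 errors.
  • How can I get the operation to complete without losing my 3TB of data? (Like skipping the troublesome sectors on sdb3, but keeping the intact data?)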

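It probably stalls before finishing because it needs some kind of status to be returned by the faulty disk, but is not getting it.

Regardless, all of your data is (or should be) intact with only 3 of the 4 disks.

You say it ejects the faulty disk from the array, so it should still be running, albeit in degraded mode.

Can you mount it?

You can force the array to run by doing the following: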

  • Print out the details of the array: mdadm -D /dev/md0
  • Stop the array: mdadm --stop /dev/md0
  • Re-create the array and force md to accept it: mdadm -C -n md0 --assume-clean /dev/sd[abcd]3

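This step is completely safe as long as:

  • you do not write to the array, and
  • you use exactly the same creation parameters as before.

That last flag will prevent the rebuild and skip any integrity testing.
You should then be able to mount it and recover your data.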
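Since the re-create step depends on matching the original parameters exactly, here is a hedged sketch that assembles a fully spelled-out `mdadm --create` command line from the values shown by `mdadm -D` above (level 5, 64K chunk, metadata 1.0, left-symmetric layout, devices in RaidDevice order). The helper `build_create_cmd` is hypothetical and only prints the command, so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) an mdadm --create command.
# usage: build_create_cmd MD_DEVICE LEVEL CHUNK_KIB METADATA LAYOUT MEMBER...
build_create_cmd() {
    md=$1 level=$2 chunk=$3 meta=$4 layout=$5
    shift 5                      # the remaining arguments are member devices
    echo "mdadm --create $md --assume-clean --level=$level" \
         "--raid-devices=$# --chunk=$chunk --metadata=$meta" \
         "--layout=$layout $*"
}

# Values taken from the mdadm -D output; members listed in RaidDevice order.
build_create_cmd /dev/md0 5 64 1.0 left-symmetric \
    /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```

Printing instead of executing is a deliberate choice here: any mismatch in device order, chunk size or metadata version would make the re-created array present scrambled data.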

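The obvious approach would be to replace the failing disk, re-create the array and replay the backup you made before the array expansion operation.

But since you do not seem to have that option, this would be the next best thing to do: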

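  • Get a Linux system with enough space to hold the raw space of all the disks (12TB, if I got the numbers right).
  • Copy the data from your disks to this system. The destination can be files or block devices; it does not matter much to mdraid. For your failing sdb3 device, you may need to use ddrescue instead of plain dd to copy the data.
  • Try to reassemble and rebuild the array from there.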
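As a rough sketch of the first two steps, the snippet below checks the 12TB estimate from the "Used Dev Size" reported by `mdadm -D` and prints one ddrescue command per member partition. The helper `imaging_plan` and the target directory /mnt/images are assumptions for illustration; the commands are only printed, not executed.

```shell
#!/bin/sh
# Space estimate: "Used Dev Size" (in KiB) times the number of members.
used_dev_kib=2928697536
members=4
total_bytes=$(( used_dev_kib * 1024 * members ))
echo "raw space needed: $(( total_bytes / 1000000000 )) GB"

# Hypothetical helper: print one ddrescue command per member partition.
# usage: imaging_plan TARGET_DIR PARTITION...
imaging_plan() {
    dir=$1
    shift
    for part in "$@"; do
        # ddrescue INFILE OUTFILE MAPFILE; -d reads the source with direct I/O
        echo "ddrescue -d /dev/$part $dir/$part.img $dir/$part.map"
    done
}

imaging_plan /mnt/images sda3 sdb3 sdc3 sdd3
```

The map file lets ddrescue resume an interrupted copy and records which areas of sdb3 could not be read.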

Also, take a look at this blog page for some hints on how to assess the situation of a multiple-device failure in a RAID 5 array.