Linux software RAID becomes unresponsive after removing a disk from the server

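I'm running a CentOS 7 machine (stock kernel: 3.10.0-327.36.3.el7.x86_64) with a software RAID-10 across 16x 1 TB SSDs (more precisely, there are two RAID arrays on the disks; one of them provides the host's swap partition). Last week, one SSD failed: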

    13:18:07 kvm7 kernel: sd 1:0:2:0: attempting task abort! scmd(ffff887e57b916c0)
    13:18:07 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
    13:18:07 kvm7 kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
    13:18:07 kvm7 kernel: scsi target1:0:2: enclosure_logical_id(0x500304801c14a001), slot(2)
    13:18:10 kvm7 kernel: sd 1:0:2:0: task abort: SUCCESS scmd(ffff887e57b916c0)
    13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Sense Key : Not Ready [current]
    13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Add. Sense: Logical unit not ready, cause not reportable
    13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
    13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
    13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
    13:18:11 kvm7 kernel: md: super_written gets error=-5, uptodate=0
    13:18:11 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
    13:19:27 kvm7 kernel: sd 1:0:2:0: device_blocked, handle(0x000b)
    13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronizing SCSI cache
    13:19:29 kvm7 kernel: md: md3 still in use.
    13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    13:19:29 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
    13:19:29 kvm7 kernel: md: md2 still in use.
    13:19:29 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
    13:19:29 kvm7 kernel: md: unbind<sdk3>
    13:19:29 kvm7 kernel: md: export_rdev(sdk3)
    13:19:29 kvm7 kernel: md: unbind<sdk2>
    13:19:29 kvm7 kernel: md: export_rdev(sdk2)

/proc/mdstat looked as expected (one failed member), and the virtual machines kept running normally.

    md3 : active raid10 sdp3[15] sdb3[2] sdg3[12] sde3[8] sdn3[11] sdl3[7] sdm3[9] sdf3[10] sdi3[1] sdk3[5](F) sdc3[4] sdd3[6] sdh3[14] sdo3[13] sda3[0] sdj3[3]
          7844052992 blocks super 1.2 128K chunks 2 near-copies [16/15] [UUUUU_UUUUUUUUUU]
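For monitoring, the same information can be pulled out of an mdstat entry programmatically. The following is only a minimal sketch: the function name `summarize_md_array` and the returned dictionary keys are made up for illustration, and the parser assumes the entry layout shown above.

```python
import re

# The md3 entry from the question, used here as sample input.
SAMPLE = (
    "md3 : active raid10 sdp3[15] sdb3[2] sdg3[12] sde3[8] sdn3[11] sdl3[7] "
    "sdm3[9] sdf3[10] sdi3[1] sdk3[5](F) sdc3[4] sdd3[6] sdh3[14] sdo3[13] "
    "sda3[0] sdj3[3] 7844052992 blocks super 1.2 128K chunks 2 near-copies "
    "[16/15] [UUUUU_UUUUUUUUUU]"
)


def summarize_md_array(entry):
    """Summarize one /proc/mdstat array entry (hypothetical helper)."""
    name = entry.split(":", 1)[0].strip()
    # Members look like "sdk3[5]", optionally followed by flags such as "(F)".
    members = re.findall(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)", entry)
    failed = sorted(dev for dev, _num, flags in members if "(F)" in flags)
    # "[16/15]" means 16 configured slots, of which 15 are working.
    total, working = map(int, re.search(r"\[(\d+)/(\d+)\]", entry).groups())
    # The status map has one character per slot: "U" is up, "_" is missing.
    status = re.search(r"\[([U_]+)\]", entry).group(1)
    missing_positions = [i for i, ch in enumerate(status) if ch == "_"]
    return {
        "name": name,
        "failed": failed,
        "total": total,
        "working": working,
        "degraded": working < total,
        "missing_positions": missing_positions,
    }


if __name__ == "__main__":
    print(summarize_md_array(SAMPLE))
```

For the entry above, this reports `sdk3` as the only failed member and the array as degraded with 15 of 16 slots working.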

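Since no 1 TB SSD was available, the failed SSD had to be replaced temporarily with a larger one. We did that, the rebuild started, and everything was fine. Today the "correct" SSD arrived, so the data center technician simply pulled the tray containing the temporary SSD, and within seconds the system became unresponsive. The host itself, which runs on a separate RAID array, was fine, but the VMs could not perform any I/O. The load rose to > 800.

I was able to run mdadm --detail /dev/md3, which showed a degraded (but active/clean) array, so from that point of view the system was perfectly fine. I tried to remove the faulty/missing drive from the array, which of course failed ("No such device"), and suddenly even mdadm --detail /dev/md3 no longer produced any output. It just hung, and I had to kill the SSH session to get out of it.

After that I decided to force a reboot, because I had no idea how to get this problematic drive out of the array, and everything came back up fine. Of course the RAID is still degraded and needs to resync, but apart from that: no problems.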

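I'm pretty sure I should have removed the drive via mdadm before the tray was pulled out of the rack, although I can't explain this mdraid behavior. As I see it, we merely "simulated" an ordinary disk failure. So does anyone know what caused this problem, and how can I make sure that the next ordinary disk failure doesn't cause the same issue?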
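The orderly removal mentioned above would typically mean marking each member partition as failed, removing it from its array, and only then detaching the disk from the SCSI layer. The sketch below only builds the command strings and executes nothing; the helper name `removal_plan` is made up for illustration, and the array-to-partition mapping is taken from the logs in this question. Whether doing this first would have avoided the hang described here is not known.

```python
import shlex


def removal_plan(disk, members):
    """Build (but do not run) commands for an orderly disk removal.

    disk    -- whole-disk name, e.g. "sdk" (assumed from the logs)
    members -- mapping of md array device to member partition
    """
    commands = []
    for array, partition in members.items():
        # Mark the member as faulty so md stops sending I/O to it.
        commands.append(shlex.join(["mdadm", array, "--fail", partition]))
        # Detach the now-faulty member from the array.
        commands.append(shlex.join(["mdadm", array, "--remove", partition]))
    # Ask the SCSI layer to drop the disk before it is physically pulled.
    commands.append(f"echo 1 > /sys/block/{disk}/device/delete")
    return commands


if __name__ == "__main__":
    plan = removal_plan("sdk", {"/dev/md2": "/dev/sdk2", "/dev/md3": "/dev/sdk3"})
    for command in plan:
        print(command)
```

Printing the plan instead of running it keeps the sketch safe to try on any machine; the commands would have to be run as root on the affected host.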

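The kernel logged some messages, and what I find interesting is that the new device showed up as sdq, while the pulled device had been sdk. So I assume sdk was never properly kicked out of the array. I did not see this behavior last week when the initial SSD failure happened; back then, the replacement drive also came up as sdk.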

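The log also shows 7 minutes between the failure of the old SSD and the insertion of the new one, so I don't think we hit an issue like the one described at https://superuser.com/questions/942886/fail-device-in-md-raid-when-at-stops-respond. The VMs also went down immediately, not 7 minutes later. So, any thoughts on this? It would be much appreciated :)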

    11:45:36 kvm7 kernel: sd 1:0:8:0: device_blocked, handle(0x000b)
    11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069640
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069648
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069656
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069664
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069672
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069680
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069688
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069696
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069704
    11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069712
    11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] CDB: Read(10) 28 00 20 af f7 08 00 00 08 00
    11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 548402952
    11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
    11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
    11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
    11:45:37 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
    11:45:37 kvm7 kernel: md: md2 still in use.
    11:45:37 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
    11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133264
    11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
    11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronizing SCSI cache
    11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    11:45:37 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
    11:45:37 kvm7 kernel: md: unbind<sdk2>
    11:45:37 kvm7 kernel: md: export_rdev(sdk2)
    11:48:00 kvm7 kernel: INFO: task md3_raid10:1293 blocked for more than 120 seconds.
    11:48:00 kvm7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    11:48:00 kvm7 kernel: md3_raid10 D ffff883f26e55c00 0 1293 2 0x00000000
    11:48:00 kvm7 kernel: ffff887f24bd7c58 0000000000000046 ffff887f212eb980 ffff887f24bd7fd8
    11:48:00 kvm7 kernel: ffff887f24bd7fd8 ffff887f24bd7fd8 ffff887f212eb980 ffff887f23514400
    11:48:00 kvm7 kernel: ffff887f235144dc 0000000000000001 ffff887f23514500 ffff8807fa4c4300
    11:48:00 kvm7 kernel: Call Trace:
    11:48:00 kvm7 kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
    11:48:00 kvm7 kernel: [<ffffffffa0104ef7>] freeze_array+0xb7/0x180 [raid10]
    11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
    11:48:00 kvm7 kernel: [<ffffffffa010880d>] handle_read_error+0x2bd/0x360 [raid10]
    11:48:00 kvm7 kernel: [<ffffffff812c7412>] ? generic_make_request+0xe2/0x130
    11:48:00 kvm7 kernel: [<ffffffffa0108a1d>] raid10d+0x16d/0x1440 [raid10]
    11:48:00 kvm7 kernel: [<ffffffff814bb785>] md_thread+0x155/0x1a0
    11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
    11:48:00 kvm7 kernel: [<ffffffff814bb630>] ? md_safemode_timeout+0x50/0x50
    11:48:00 kvm7 kernel: [<ffffffff810a5b8f>] kthread+0xcf/0xe0
    11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
    11:48:00 kvm7 kernel: [<ffffffff81646a98>] ret_from_fork+0x58/0x90
    11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
    11:48:00 kvm7 kernel: INFO: task qemu-kvm:26929 blocked for more than 120 seconds.
    [several messages for stuck qemu-kvm processes]
    11:52:42 kvm7 kernel: scsi 1:0:9:0: Direct-Access ATA KINGSTON SKC400S 001A PQ: 0 ANSI: 6
    11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: handle(0x000b), sas_addr(0x4433221102000000), phy(2), device_name(0x4d6b497569a68ba2)
    11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: enclosure_logical_id(0x500304801c14a001), slot(2)
    11:52:42 kvm7 kernel: scsi 1:0:9:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
    11:52:42 kvm7 kernel: scsi 1:0:9:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
    11:52:42 kvm7 kernel: sd 1:0:9:0: Attached scsi generic sg10 type 0
    11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] 2000409264 512-byte logical blocks: (1.02 TB/953 GiB)
    11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write Protect is off
    11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write cache: enabled, read cache: enabled, supports DPO and FUA
    11:52:42 kvm7 kernel: sdq: unknown partition table
    11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Attached SCSI disk

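From the kernel stack trace it appears that the md driver tried to freeze the array (freeze_array+0xb7/0x180 [raid10]) in order to cleanly remove the failed member, but this operation never completed. This is confirmed by the missing md: unbind<sdk3> line.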
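This check can be automated when going through longer logs. The sketch below collects every member that md reported as disabled and subtracts those that were later unbound. The function name `members_never_unbound` is made up for illustration, the sample lines are a shortened subset of the second log excerpt, and the message formats are assumed to match the ones shown in this question.

```python
import re

# Shortened subset of the second log excerpt from the question.
SAMPLE_LOG = [
    "11:45:37 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.",
    "11:45:37 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.",
    "11:45:37 kvm7 kernel: md: unbind<sdk2>",
    "11:45:37 kvm7 kernel: md: export_rdev(sdk2)",
]

DISABLED_RE = re.compile(r"Disk failure on (\w+), disabling device")
UNBOUND_RE = re.compile(r"md: unbind<(\w+)>")


def members_never_unbound(log_lines):
    """Return members that were disabled but never unbound (hypothetical helper)."""
    disabled, unbound = set(), set()
    for line in log_lines:
        match = DISABLED_RE.search(line)
        if match:
            disabled.add(match.group(1))
        match = UNBOUND_RE.search(line)
        if match:
            unbound.add(match.group(1))
    return sorted(disabled - unbound)


if __name__ == "__main__":
    print(members_never_unbound(SAMPLE_LOG))
```

On the sample lines this leaves only sdk3, matching the observation that md2 released its member while md3 did not.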

To me this looks like a deadlock/livelock problem, so the root cause is probably a software bug. You really should file a report on the Linux RAID mailing list.