How do I troubleshoot my RAID arrays?

I checked my RAID arrays this morning, and this is what I got:

    $ cat /proc/mdstat
    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
    md1 : active raid1 sdc7[0]
          238340224 blocks [2/1] [U_]

    md0 : active raid1 sdc6[0]
          244139648 blocks [2/1] [U_]

    md127 : active raid1 sdc3[0]
          390628416 blocks [2/1] [U_]

    unused devices: <none>
    $

I believe this means that one of the disks in my arrays has died. Is that right?

How do I troubleshoot this properly? My /etc/mdadm/mdadm.conf looks like this:

    $ cat /etc/mdadm/mdadm.conf
    # mdadm.conf
    #
    # Please refer to mdadm.conf(5) for information about this file.
    #

    # by default (built-in), scan all partitions (/proc/partitions) and all
    # containers for MD superblocks. alternatively, specify devices to scan, using
    # wildcards if desired.
    #DEVICE partitions containers

    # auto-create devices with Debian standard permissions
    CREATE owner=root group=disk mode=0660 auto=yes

    # automatically tag new arrays as belonging to the local system
    HOMEHOST <system>

    # instruct the monitoring daemon where to send mail alerts
    MAILADDR root

    # definitions of existing MD arrays
    ARRAY /dev/md127 UUID=124cd4a5:2965955f:cd707cc0:bc3f8165
    ARRAY /dev/md0 UUID=91e560f1:4e51d8eb:cd707cc0:bc3f8165
    ARRAY /dev/md1 UUID=0abe503f:401d8d09:cd707cc0:bc3f8165

How do I find out which physical drive is broken and needs to be replaced?

Thanks

EDIT1

    # mdadm --detail /dev/md0
    /dev/md0:
            Version : 0.90
      Creation Time : Tue Sep  1 19:15:33 2009
         Raid Level : raid1
         Array Size : 244139648 (232.83 GiB 250.00 GB)
      Used Dev Size : 244139648 (232.83 GiB 250.00 GB)
       Raid Devices : 2
      Total Devices : 1
    Preferred Minor : 0
        Persistence : Superblock is persistent

        Update Time : Mon Sep 21 07:11:24 2015
              State : clean, degraded
     Active Devices : 1
    Working Devices : 1
     Failed Devices : 0
      Spare Devices : 0

               UUID : 91e560f1:4e51d8eb:cd707cc0:bc3f8165
             Events : 0.76017

        Number   Major   Minor   RaidDevice State
           0       8       38        0      active sync   /dev/sdc6
           1       0        0        1      removed
    root@regDesktopHome:~#

Why does it say Failed Devices : 0?

EDIT2
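Opening GParted, I can see both RAID drives, /dev/sdb and /dev/sdc. However, mdadm thinks /dev/sdb has been removed for some reason, which is strange. I tried to mount a partition on /dev/sdb and got the following: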

    $ sudo mount /dev/sdb7 test
    [sudo] password for ron:
    mount: unknown filesystem type 'linux_raid_member'

This all looks fine. How do I recover my RAID arrays?

EDIT3

I ran smartctl -a /dev/sdc and smartctl -a /dev/sdb, and I also ran badblocks /dev/sdc and badblocks /dev/sdb. sdc appears to be 100% clean, while sdb returned some bad blocks:

    # badblocks /dev/sdb
    16130668
    16130669
    16130670
    16130671

Could this be the cause of the errors I am seeing? Is there any way to fix or ignore these bad blocks, or should I replace the drive?

EDIT4

    # smartctl --all /dev/sdb
    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-62-generic] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.12
    Device Model:     ST31000528AS
    Serial Number:    6VP0308B
    LU WWN Device Id: 5 000c50 013d3ae45
    Firmware Version: CC34
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 2.6, 3.0 Gb/s
    Local Time is:    Sat Sep 26 11:35:02 2015 PDT

    ==> WARNING: A firmware update for this drive may be available,
    see the following Seagate web pages:
    http://knowledge.seagate.com/articles/en_US/FAQ/207931en
    http://knowledge.seagate.com/articles/en_US/FAQ/213891en

    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (  600) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 195) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x103f) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       78420742
      3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1240
      5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       60
      7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       199357441
      9 Power_On_Hours          0x0032   052   052   000    Old_age   Always       -       42401
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   099   037   020    Old_age   Always       -       1240
    183 Runtime_Bad_Block       0x0000   098   098   000    Old_age   Offline      -       2
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   094   094   000    Old_age   Always       -       6
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
    189 High_Fly_Writes         0x003a   050   050   000    Old_age   Always       -       50
    190 Airflow_Temperature_Cel 0x0022   062   046   045    Old_age   Always       -       38 (Min/Max 30/38)
    194 Temperature_Celsius     0x0022   038   054   000    Old_age   Always       -       38 (0 17 0 0 0)
    195 Hardware_ECC_Recovered  0x001a   030   012   000    Old_age   Always       -       78420742
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       73332271657814
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2822963046
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2361465529

    SMART Error Log Version: 1
    ATA Error Count: 6 (device log contains only the most recent five errors)
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 6 occurred at disk power-on lifetime: 42372 hours (1765 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 d9 44 ec 01  Error: UNC at LBA = 0x01ec44d9 = 32261337

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 d8 44 ec 41 00      09:26:28.967  READ FPDMA QUEUED
      27 00 00 00 00 00 e0 00      09:26:28.941  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 00      09:26:28.940  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 00      09:26:28.928  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 00      09:26:28.901  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 5 occurred at disk power-on lifetime: 42372 hours (1765 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 d9 44 ec 01  Error: UNC at LBA = 0x01ec44d9 = 32261337

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 d8 44 ec 41 00      09:26:26.095  READ FPDMA QUEUED
      27 00 00 00 00 00 e0 00      09:26:26.069  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 00      09:26:26.068  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 00      09:26:26.055  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 00      09:26:26.029  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 4 occurred at disk power-on lifetime: 42372 hours (1765 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 d9 44 ec 01  Error: UNC at LBA = 0x01ec44d9 = 32261337

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 d8 44 ec 41 00      09:26:23.222  READ FPDMA QUEUED
      27 00 00 00 00 00 e0 00      09:26:23.195  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 00      09:26:23.194  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 00      09:26:23.182  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 00      09:26:23.137  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 3 occurred at disk power-on lifetime: 42372 hours (1765 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 d9 44 ec 01  Error: UNC at LBA = 0x01ec44d9 = 32261337

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 d8 44 ec 41 00      09:26:20.351  READ FPDMA QUEUED
      60 00 80 e8 44 ec 41 00      09:26:20.350  READ FPDMA QUEUED
      27 00 00 00 00 00 e0 00      09:26:20.324  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 00      09:26:20.323  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 00      09:26:20.311  SET FEATURES [Set transfer mode]

    Error 2 occurred at disk power-on lifetime: 42372 hours (1765 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 d9 44 ec 01  Error: UNC at LBA = 0x01ec44d9 = 32261337

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 80 e8 44 ec 41 00      09:26:17.478  READ FPDMA QUEUED
      60 00 40 a8 44 ec 41 00      09:26:17.478  READ FPDMA QUEUED
      60 00 20 88 44 ec 41 00      09:26:17.476  READ FPDMA QUEUED
      60 00 08 80 44 ec 41 00      09:26:17.453  READ FPDMA QUEUED
      27 00 00 00 00 00 e0 00      09:26:17.427  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    #

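EDIT5

I realized that after unplugging /dev/sdb, the former /dev/sdc is now /dev/sdb. I confirmed this with smartctl -a /dev/sdb: the serial number changed once I booted with the bad disk unplugged. I am out of luck, the drive is no longer under warranty, so I will get myself a new replacement drive.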

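Seeing that the output of cat /proc/mdstat does not show the broken drive (it would be marked with an F), you have evidently rebooted the server since the arrays became degraded.

You can use mdadm --detail /dev/md0 to get more information. That will probably tell you which drive should have been in there.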
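
To spot degraded arrays at a glance, the status field at the end of each "blocks" line in /proc/mdstat can be scanned: an underscore, as in [U_], marks a missing member. The following is only a minimal sketch that assumes the usual mdstat layout shown in the question; the function name degraded_arrays is made up for illustration.

```shell
#!/bin/sh
# Print the name of every md array whose member-status field (e.g. [U_])
# contains an underscore. Reads /proc/mdstat unless a file is given.
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid1 sdc6[0]"
        $1 ~ /^md/ && $2 == ":" { name = $1 }
        # The status field is the last field of the "blocks" line
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }
    ' "${1:-/proc/mdstat}"
}

# Only touch the live file when it exists on this machine.
if [ -r /proc/mdstat ]; then
    degraded_arrays
fi
```

On the output from the question this would list md1, md0 and md127; mdadm --detail on each of them then gives the full picture.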

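In reply to your edits:

I would analyze /dev/sdb first. Check it with smartctl -a, paying particular attention to the reallocated sector count and the error log. Run a self-test with smartctl -t long /dev/sdb. Use badblocks, and so on.

Then:

If you replace /dev/sdb, copy the partition table over from /dev/sdc. If the disks are not GPT, you can use sfdisk -d /dev/sdc | sfdisk /dev/sdb. If they are GPT, you can use gdisk to save the partition table to a file and then load it on the new disk; the option is tucked away under the advanced functions.
In general, something to keep in mind: if your (new) drive has 4k sectors, make sure the partitions are 4k-aligned.
If you re-add the existing /dev/sdb instead, you may need to run mdadm --zero-superblock on all of its existing partitions.
Then you can run mdadm --manage /dev/md0 --add /dev/sdb6, and likewise for md1 with sdb7.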
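
The steps above could be collected into a small script that only prints the commands, so they can be reviewed before anything is run as root. This is a sketch under assumptions: the array and partition pairs (md127 with partition 3, md0 with 6, md1 with 7) are read off the /proc/mdstat output in the question, the disks are assumed to use an MBR partition table (hence sfdisk), and the function name plan_rebuild is made up. The device names are placeholders that must be checked on the running system.

```shell
#!/bin/sh
# Print (do NOT execute) the commands for rebuilding the mirrors onto a
# replacement disk. $1 = healthy disk, $2 = replacement disk.
plan_rebuild() {
    good=$1
    new=$2
    # Copy the MBR partition table from the healthy disk to the new one.
    echo "sfdisk -d $good | sfdisk $new"
    # array:partition pairs as seen in the question's /proc/mdstat output
    for pair in md127:3 md0:6 md1:7; do
        md=${pair%%:*}      # text before the colon, e.g. md0
        part=${pair##*:}    # text after the colon, e.g. 6
        echo "mdadm --manage /dev/$md --add $new$part"
    done
}

# Example: healthy disk /dev/sdc, replacement /dev/sdb (assumed names).
plan_rebuild /dev/sdc /dev/sdb
```

Printing rather than executing is deliberate here: device names can shift between boots, so each line should be compared against the current smartctl serial numbers before it is run.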

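Needless to say, some of these commands will wipe data if you mix up the drives. So make sure you know which disk is sdc and which is sdb.

Edit: about the bad blocks: if any software-level tool sees bad blocks, the drive is broken. Normally a disk hides them by reallocating sectors on write. Google "hard disk sector reallocation". Your smartctl -a output should show reallocated sectors for sdb. So yes, your sdb has been kicked out of the arrays, and you need to replace it.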
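
As a cross-check on the numbers in the question: badblocks counts in 1024-byte blocks by default, while the SMART error log reports LBAs, which are 512-byte sectors on this drive according to its smartctl output. The sketch below does the conversion; the helper name block_to_lba_range is made up for illustration, and the two sizes are assumptions that can be overridden.

```shell
#!/bin/sh
# Convert a badblocks block number into the first and last LBA it covers.
# $1 = block number, $2 = badblocks block size in bytes (default 1024),
# $3 = logical sector size in bytes (default 512).
block_to_lba_range() {
    block=$1
    block_size=${2:-1024}
    sector_size=${3:-512}
    per_block=$((block_size / sector_size))   # sectors per badblocks block
    first=$((block * per_block))
    last=$((first + per_block - 1))
    echo "$first $last"
}

block_to_lba_range 16130668   # prints: 32261336 32261337
```

The first bad block therefore covers LBA 32261337, the address of the UNC errors in the SMART log, so both tools appear to be pointing at the same spot on the disk.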

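Edit: about the smartctl -a output. Two things matter most:

It shows 60 reallocated sectors. Even though the normalized value is still 99, and the disk only counts as officially "bad" once that value drops to the threshold of 36, you should not trust a disk that has started reallocating sectors. It is the raw value that matters, especially once it starts changing. You can even configure smartd to monitor it.
The error log shows entries at 42372 hours. You can tell they are recent from attribute 9 (in your case), Power_On_Hours, which reads 42401. Some harmless things can cause SMART error log entries, such as issuing a wrong ATA command, but in this case, since your array has become degraded, they are probably related.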
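
To keep an eye on a raw value such as the reallocated sector count, it can be pulled out of the attribute table by its ID. This is a minimal sketch that assumes the ten-column table layout shown in the question; the function name smart_raw is made up.

```shell
#!/bin/sh
# Print the RAW_VALUE column for one SMART attribute ID.
# Reads an attribute table (as printed by smartctl -A) on stdin.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Demonstration on the attribute line from the question.
printf '%s\n' \
    '  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       60' |
    smart_raw 5   # prints: 60
```

On a live system the input would come from something like smartctl -A /dev/sdb | smart_raw 5, run as root, and the number can be logged over time to see whether it is growing.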

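As for identifying which disk is which inside your system: running dmesg | grep -i sdb, for example, will help. You probably have three disks in your system, and sdb is the disk on the second SATA controller, which may be labeled 1 or 2 depending on whether the numbering is zero-based or one-based.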
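
Another way to match a device name to a physical drive is the serial number printed on the drive label: on udev-based systems the links in /dev/disk/by-id usually end in that serial. A sketch, assuming that naming scheme; the function name dev_for_serial is made up, and the serial used below is the one reported by smartctl in the question.

```shell
#!/bin/sh
# Print the block device that a by-id link ending in the given serial
# number points to. $1 = serial, $2 = directory (default /dev/disk/by-id).
dev_for_serial() {
    serial=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*"$serial"; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$link" ] || continue
        readlink -f "$link"
    done
}

# Serial taken from the smartctl output of the failing drive.
dev_for_serial 6VP0308B
```

Because the pattern has to end in the serial, partition links such as those ending in -part6 are left out and only the whole-disk device is printed.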

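Since you probably boot from sda, you can simply replace sdb and carry out the steps listed above. If it is your boot drive that breaks, you had better hope that you have:

GRUB installed on the other disk as well.
A server that can actually boot from another disk.

I once had a Dell server that refused to boot from sdb while a blank sda was present. That took some persuading and improvisation.

Sometimes you need to translate names like ata1.01 into real device names. A failing disk, for example, will produce kernel errors saying "ATA exception on ata1.01" or words to that effect. Read this answer. (I configured our central logging system to alert me on these kernel errors, because they are a reliable indicator of pending disk failure.)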
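
One way to do that translation on reasonably recent kernels is to look at where each /sys/block/sdX link resolves: the path normally contains the ataN port. A sketch under that assumption (older kernels may lay out sysfs differently); the function name ata_port_of is made up, and the path in the comment is a hypothetical example.

```shell
#!/bin/sh
# Print the first ataN component of a sysfs device path, if there is one.
# Example input (hypothetical):
#   /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb
ata_port_of() {
    printf '%s\n' "$1" | grep -o 'ata[0-9][0-9]*' | head -n 1
}

# List every sdX disk together with the ATA port it hangs off.
for dev in /sys/block/sd*; do
    [ -e "$dev" ] || continue
    echo "${dev##*/} -> $(ata_port_of "$(readlink -f "$dev")")"
done
```

A kernel message about ata2 would then correspond to whichever disk is listed next to ata2.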