Linux software RAID: is the disk in trouble?

I have a setup with three identical hard drives, recognized as sdb, sdd and sde. Across these three disks I have one RAID0 array (md2) and two RAID5 arrays (md0 and md1). All of my RAID partitions appear to be working, and have done so ever since I created them. But on the console I see messages about md0 and md1 saying "2 of 3 devices in use", which sounds like a problem to me.

    $ cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [raid0] [linear] [multipath] [raid1] [raid10]
    md2 : active raid0 sdb3[0] sdd3[1] sde3[2]
          24574464 blocks super 1.2 512k chunks

    md1 : active raid5 sdd2[1] sde2[3]
          5823403008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]

    md0 : active raid5 sdd1[1] sde1[3]
          20462592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]

    unused devices: <none>

I have no experience with mdadm, but to me this looks like the md0 and md1 arrays are missing the sdb disk, while md2 does not seem to be missing anything. So has the sdb disk failed, or is this just some configuration problem? What further diagnostics should I run to figure this out?
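For reference, the details in the edits below were gathered with commands along these lines (run as root; only a sketch, the device names are from my setup):

    dmesg | grep -i raid          # kernel messages about the arrays
    mdadm --detail /dev/md0       # status of an assembled array
    mdadm --examine /dev/sdb1     # superblock of an individual member
    smartctl -d ata -a /dev/sdb   # SMART health of the physical disk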

EDIT:

    # mdadm --examine /dev/sdb2
    /dev/sdb2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 94d56562:90a999e8:601741c0:55d8c83f
               Name : jostein1:1  (local to host jostein1)
      Creation Time : Sat Aug 18 13:00:00 2012
         Raid Level : raid5
       Raid Devices : 3

     Avail Dev Size : 5823404032 (2776.82 GiB 2981.58 GB)
         Array Size : 5823403008 (5553.63 GiB 5963.16 GB)
      Used Dev Size : 5823403008 (2776.82 GiB 2981.58 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
       Unused Space : before=262064 sectors, after=1024 sectors
              State : active
        Device UUID : cee60351:c3a525ce:a449b326:6cb5970d

        Update Time : Tue May 24 21:43:20 2016
           Checksum : 4afdc54a - correct
             Events : 7400

             Layout : left-symmetric
         Chunk Size : 512K

        Device Role : Active device 0
        Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

    # mdadm --examine /dev/sde2
    /dev/sde2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 94d56562:90a999e8:601741c0:55d8c83f
               Name : jostein1:1  (local to host jostein1)
      Creation Time : Sat Aug 18 13:00:00 2012
         Raid Level : raid5
       Raid Devices : 3

     Avail Dev Size : 5823404032 (2776.82 GiB 2981.58 GB)
         Array Size : 5823403008 (5553.63 GiB 5963.16 GB)
      Used Dev Size : 5823403008 (2776.82 GiB 2981.58 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
       Unused Space : before=262064 sectors, after=1024 sectors
              State : clean
        Device UUID : 9c5abb6d:8f1eecbd:4b0f5459:c0424d26

        Update Time : Tue Oct 11 21:17:10 2016
           Checksum : a3992056 - correct
             Events : 896128

             Layout : left-symmetric
         Chunk Size : 512K

        Device Role : Active device 2
        Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)

So: running --examine on sdb shows it as active, while the same command on sdd and sde shows it as missing.
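The relevant fields can be pulled out side by side with something like this (a sketch of the comparison, same idea as the event-count check in EDIT2 below):

    mdadm --examine /dev/sd[bde]2 | egrep 'Update Time|Events|Array State|/dev/sd'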

    # mdadm --detail --verbose /dev/md1
    /dev/md1:
            Version : 1.2
      Creation Time : Sat Aug 18 13:00:00 2012
         Raid Level : raid5
         Array Size : 5823403008 (5553.63 GiB 5963.16 GB)
      Used Dev Size : 2911701504 (2776.82 GiB 2981.58 GB)
       Raid Devices : 3
      Total Devices : 2
        Persistence : Superblock is persistent

        Update Time : Tue Oct 11 22:03:50 2016
              State : clean, degraded
     Active Devices : 2
    Working Devices : 2
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 512K

               Name : jostein1:1  (local to host jostein1)
               UUID : 94d56562:90a999e8:601741c0:55d8c83f
             Events : 897492

        Number   Major   Minor   RaidDevice State
           0       0        0        0      removed
           1       8       50        1      active sync   /dev/sdd2
           3       8       66        2      active sync   /dev/sde2

EDIT2:

The event count of the device that is no longer part of the array differs wildly from that of the other devices:

    # mdadm --examine /dev/sd[bde]1 | egrep 'Event|/dev/sd'
    /dev/sdb1:
             Events : 603
    /dev/sdd1:
             Events : 374272
    /dev/sde1:
             Events : 374272

Smartmontools output for the disk that is no longer part of the arrays:

    # smartctl -d ata -a /dev/sdb
    smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.0-36-generic] (local build)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
    Device Model:     WDC WD30EZRX-00MMMB0
    Serial Number:    WD-WCAWZ2185619
    LU WWN Device Id: 5 0014ee 25c58f89e
    Firmware Version: 80.00A80
    User Capacity:    3,000,592,982,016 bytes [3.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS (minor revision not indicated)
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Wed Oct 12 18:54:30 2016 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (51480) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 494) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x3035) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   147   144   021    Pre-fail  Always       -       9641
      4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1398
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7788
     10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1145
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       45
    193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       309782
    194 Temperature_Celsius     0x0022   124   103   000    Old_age   Always       -       28
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
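Since the log shows that no self-tests have ever been run on this disk, I could start one with something like the following (a sketch; per the log above a long test on this drive takes roughly 494 minutes):

    smartctl -t short /dev/sdb      # quick test, about 2 minutes
    smartctl -t long /dev/sdb       # full surface scan
    smartctl -l selftest /dev/sdb   # check the results once the test has finished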

Your mdstat file says it all.

[3/2] means that of the 3 devices defined for the array, only 2 are currently in use. The [_UU] says the same thing: the underscore marks the missing device, and each U marks a member that is up.
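For contrast, a fully populated 3-device array would report [3/3] [UUU], roughly like this (illustrative only, not output from your machine):

    md1 : active raid5 sdb2[0] sdd2[1] sde2[3]
          5823403008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]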

For more detailed information on the raid devices (before going down to the physical disks), you would run, as root:

    mdadm --detail --verbose /dev/md0
    mdadm --detail --verbose /dev/md1
    mdadm --detail --verbose /dev/md2

On my system (which uses raid6) I simulated a failure; this is an example of the output:

    /dev/md0:
            Version : 1.2
      Creation Time : Thu Sep 29 09:51:41 2016
         Raid Level : raid6
         Array Size : 16764928 (15.99 GiB 17.17 GB)
      Used Dev Size : 8382464 (7.99 GiB 8.58 GB)
       Raid Devices : 4
      Total Devices : 5
        Persistence : Superblock is persistent

        Update Time : Thu Oct 11 13:06:50 2016
              State : clean                 <<== CLEAN!
     Active Devices : 4
    Working Devices : 4
     Failed Devices : 1
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 512K

               Name : ubuntu:0  (local to host ubuntu)
               UUID : 3837ba75:eaecb6be:8ceb4539:e5d69538
             Events : 43

        Number   Major   Minor   RaidDevice State
           4       8       65        0      active sync   /dev/sde1   <<== NEW ENTRY
           1       8       17        1      active sync   /dev/sdb1
           2       8       33        2      active sync   /dev/sdc1
           3       8       49        3      active sync   /dev/sdd1

           0       8        1        -      faulty   /dev/sda1        <<== SW-REPLACED
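For reference, that kind of failure can be simulated entirely in software; the commands look roughly like this (a sketch using the device names from the output above):

    mdadm /dev/md0 --fail /dev/sda1      # mark the member as faulty
    mdadm /dev/md0 --remove /dev/sda1    # detach it from the array
    mdadm /dev/md0 --add /dev/sde1       # add a replacement, which then resyncs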

md0 and md1 are the raid5 arrays that are degraded because their partitions on /dev/sdb have failed or been marked as faulty. Run mdadm --examine on the individual member partitions (for example mdadm --examine /dev/sdb2) for more detail.

If everything looks fine on /dev/sdb, simply re-add the partitions to their arrays. You can get the correct partition numbers from /etc/mdadm.conf or from the --examine output of the array members.

    mdadm /dev/md1 --re-add /dev/sdb[?]
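A sketch of the whole sequence, assuming the md1 member on sdb is sdb2 and the md0 member is sdb1, as the --examine output in the question suggests (note that --re-add is only accepted when the array can catch the stale member up, for example via a write-intent bitmap; otherwise fall back to --add, which triggers a full rebuild):

    # confirm which partition belongs to which array
    mdadm --examine /dev/sdb1 /dev/sdb2 | egrep 'Name|Array UUID|Device Role'

    # try to re-attach the stale members
    mdadm /dev/md0 --re-add /dev/sdb1
    mdadm /dev/md1 --re-add /dev/sdb2

    # if --re-add is refused, add them as fresh members instead (full resync)
    mdadm /dev/md0 --add /dev/sdb1
    mdadm /dev/md1 --add /dev/sdb2

    # watch the rebuild progress
    cat /proc/mdstat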