SMART short offline test never finishes on any drive of a RAID1

I have a strange situation here. After three years of use in a low-traffic server, one of the two Samsung hard drives in a RAID1 failed yesterday:

    Personalities : [raid1]
    md0 : active raid1 sdb1[2](F) sda1[0]
          732572608 blocks [2/1] [U_]
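For monitoring, the (F) flag can be picked out of /proc/mdstat programmatically. This is only a minimal Python sketch, assuming the usual mdstat layout shown above; `failed_members` and `SAMPLE` are names made up for this example:

```python
import re

# Matches one array member such as "sdb1[2](F)" or "sda1[0]".
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](\([A-Z]\))?")


def failed_members(mdstat_text):
    """Return {array: [member, ...]} for members flagged (F) in mdstat text."""
    failed = {}
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\w+)\s*:\s*(.*)", line)
        if not header:
            continue  # skip "Personalities" and the indented detail lines
        array, rest = header.groups()
        bad = [name for name, _slot, flag in MEMBER_RE.findall(rest) if flag == "(F)"]
        if bad:
            failed[array] = bad
    return failed


# The mdstat text from this post.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2](F) sda1[0]
      732572608 blocks [2/1] [U_]
"""

print(failed_members(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat instead of the sample string.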

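Since smartd had not reported anything, I checked the SMART attributes. The only suspicious reading that differs on sdb (the failed drive) compared to sda is: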

    sda: 191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
    sdb: 191 G-Sense_Error_Rate      0x0022   098   098   000    Old_age   Always       -       27650
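A comparison like this can be done mechanically rather than by eye. The following Python sketch assumes the ten-column attribute table that `smartctl -A` prints for ATA drives; `parse_attributes` and `differing` are hypothetical helper names, and the sample rows are taken from this post:

```python
def parse_attributes(text):
    """Map attribute ID -> (name, raw value) from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has a hex FLAG in column 3.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            # The raw value may contain spaces, e.g. "40 (Min/Max 20/48)".
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def differing(a, b):
    """Return {id: (name, raw_a, raw_b)} for attributes whose raw values differ."""
    return {
        i: (a[i][0], a[i][1], b[i][1])
        for i in sorted(a.keys() & b.keys())
        if a[i][1] != b[i][1]
    }


SDA = """\
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
"""
SDB = """\
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0022   098   098   000    Old_age   Always       -       27650
"""

print(differing(parse_attributes(SDA), parse_attributes(SDB)))
```

Run over the full tables, this would also list counters that differ for harmless reasons, so the result still needs a human reading.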

G-sense errors in a server rack? Maybe the sensor has failed?

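But there is another odd reading on both drives: the most recent short offline tests ended as "Interrupted (host reset)", and if I start a new test with smartctl --test=short /dev/sda, the selective self-test log shows: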

     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Self_test_in_progress [90% left] (0-65535)

However, this short test never finishes. Even hours later the status is unchanged, on both drives. Could this be a firmware bug in the drives? Or a failing controller?
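For reference, the number smartctl prints next to "Self-test execution status" (249 in the dumps further down) is a single byte: as far as I understand the ATA SMART data layout, the upper four bits hold the state and the lower four bits hold the remaining work in tenths. A small Python sketch of that decoding; the function name and the wording of the state labels are my own:

```python
# State codes from the upper nibble; only the ones relevant here are named.
SELF_TEST_STATES = {
    0x0: "completed without error (or never run)",
    0x1: "aborted by host",
    0x2: "interrupted by host reset",
    0xF: "in progress",
}


def decode_selftest_status(status):
    """Split the status byte into (state description, percent remaining)."""
    state = status >> 4                 # upper nibble: state code
    remaining = (status & 0x0F) * 10    # lower nibble: tenths remaining
    return SELF_TEST_STATES.get(state, f"other (code {state})"), remaining


print(decode_selftest_status(249))
```

249 is 0xF9, which decodes to "in progress" with 90% remaining, matching the text smartctl prints. A test that stays at 0xF9 for hours has therefore not advanced at all.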

Below are the full dumps for both drives, each taken with a short self-test running:

/dev/sda:

    smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.12.13-gentoo] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     SAMSUNG SpinPoint F3
    Device Model:     SAMSUNG HD754JJ
    Serial Number:    S281J9CZ500175
    LU WWN Device Id: 5 0024e9 2026e8417
    Firmware Version: 1AJ10001
    User Capacity:    750,156,374,016 bytes [750 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 6
    SATA Version is:  SATA 2.6, 3.0 Gb/s
    Local Time is:    Fri Mar 21 09:04:35 2014 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      ( 249) Self-test routine in progress...
                                            90% of test remaining.
    Total time to complete Offline
    data collection:                ( 6540) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 109) minutes.
    SCT capabilities:              (0x003f) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       59
      2 Throughput_Performance  0x0026   055   052   000    Old_age   Always       -       6038
      3 Spin_Up_Time            0x0023   072   071   025    Pre-fail  Always       -       8729
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
      5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       23571
     10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
    191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0002   060   053   000    Old_age   Always       -       40 (Min/Max 20/48)
    195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
    196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       102119
    223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
    225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Aborted by host               90%     23571         -
    # 2  Short offline       Aborted by host               90%     23571         -
    # 3  Short offline       Aborted by host               90%     23571         -
    # 4  Short offline       Aborted by host               90%     23571         -
    # 5  Short offline       Interrupted (host reset)      90%     23571         -
    # 6  Short offline       Interrupted (host reset)      90%     23571         -
    # 7  Short offline       Interrupted (host reset)      90%     23571         -
    # 8  Short offline       Interrupted (host reset)      90%     23571         -
    # 9  Extended offline    Interrupted (host reset)      90%     23571         -
    #10  Short offline       Interrupted (host reset)      90%     23571         -
    #11  Short offline       Interrupted (host reset)      90%     23571         -
    #12  Short offline       Completed without error       00%     23559         -
    #13  Short offline       Completed without error       00%     23535         -
    #14  Short offline       Completed without error       00%     23511         -
    #15  Extended offline    Completed without error       00%     23495         -
    #16  Short offline       Completed without error       00%     23487         -
    #17  Short offline       Completed without error       00%     23463         -
    #18  Short offline       Completed without error       00%     23439         -
    #19  Short offline       Completed without error       00%     23415         -
    #20  Short offline       Completed without error       00%     23391         -
    #21  Short offline       Completed without error       00%     23367         -

    SMART Selective self-test log data structure revision number 0
    Note: revision number not 1 implies that no selective self-test has ever been run
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Self_test_in_progress [90% left] (0-65535)
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdb:

    smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.12.13-gentoo] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     SAMSUNG SpinPoint F3
    Device Model:     SAMSUNG HD754JJ
    Serial Number:    S281J9CZ500174
    LU WWN Device Id: 5 0024e9 2026e840e
    Firmware Version: 1AJ10001
    User Capacity:    750,156,374,016 bytes [750 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 6
    SATA Version is:  SATA 2.6, 3.0 Gb/s
    Local Time is:    Fri Mar 21 09:05:10 2014 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      ( 249) Self-test routine in progress...
                                            90% of test remaining.
    Total time to complete Offline
    data collection:                ( 6960) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 116) minutes.
    SCT capabilities:              (0x003f) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       69
      2 Throughput_Performance  0x0026   055   054   000    Old_age   Always       -       6442
      3 Spin_Up_Time            0x0023   071   071   025    Pre-fail  Always       -       8885
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
      5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       23571
     10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
    191 G-Sense_Error_Rate      0x0022   098   098   000    Old_age   Always       -       27650
    192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0002   064   054   000    Old_age   Always       -       35 (Min/Max 20/46)
    195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
    196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       71575
    223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
    225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       21

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Aborted by host               90%     23571         -
    # 2  Short offline       Aborted by host               90%     23571         -
    # 3  Short offline       Aborted by host               90%     23571         -
    # 4  Short offline       Aborted by host               90%     23571         -
    # 5  Extended offline    Interrupted (host reset)      90%     23571         -
    # 6  Short offline       Interrupted (host reset)      90%     23571         -
    # 7  Short offline       Interrupted (host reset)      90%     23571         -
    # 8  Short offline       Interrupted (host reset)      90%     23571         -
    # 9  Short offline       Interrupted (host reset)      90%     23571         -
    #10  Short offline       Interrupted (host reset)      90%     23571         -
    #11  Short offline       Interrupted (host reset)      90%     23571         -
    #12  Short offline       Interrupted (host reset)      90%     23571         -
    #13  Short offline       Completed without error       00%     23558         -
    #14  Short offline       Completed without error       00%     23534         -
    #15  Short offline       Completed without error       00%     23510         -
    #16  Short offline       Completed without error       00%     23486         -
    #17  Extended offline    Completed without error       00%     23471         -
    #18  Short offline       Completed without error       00%     23462         -
    #19  Short offline       Completed without error       00%     23438         -
    #20  Short offline       Completed without error       00%     23414         -
    #21  Short offline       Completed without error       00%     23390         -

    SMART Selective self-test log data structure revision number 0
    Note: revision number not 1 implies that no selective self-test has ever been run
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Self_test_in_progress [90% left] (0-65535)
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
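The self-test logs in the two dumps above tell the same story on both drives: tests completed until roughly hour 23559, and every attempt since then was aborted or interrupted at 90% remaining. To tally such a log automatically, something like the following Python sketch might do. It assumes the row layout smartctl prints for ATA self-test logs; `summarize` and `LOG` are names invented here, and the rows are a few copied from the sda log:

```python
import re
from collections import Counter

# "# 5  Short offline  Interrupted (host reset)  90%  23571  -"
ROW_RE = re.compile(
    r"^#\s*(\d+)\s+(\w+ (?:offline|captive))\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def summarize(log_text):
    """Return (status counts, power-on hours of the latest completed test)."""
    counts = Counter()
    last_ok = None
    for line in log_text.splitlines():
        row = ROW_RE.match(line.strip())
        if not row:
            continue  # header or unrelated line
        _num, _kind, status, _remaining, hours, _lba = row.groups()
        counts[status] += 1
        if status == "Completed without error":
            last_ok = max(last_ok or 0, int(hours))
    return counts, last_ok


LOG = """\
# 4  Short offline       Aborted by host               90%     23571         -
# 5  Short offline       Interrupted (host reset)      90%     23571         -
# 9  Extended offline    Interrupted (host reset)      90%     23571         -
#12  Short offline       Completed without error       00%     23559         -
#13  Short offline       Completed without error       00%     23535         -
"""

counts, last_ok = summarize(LOG)
print(dict(counts), last_ok)
```

Comparing `last_ok` with the current Power_On_Hours value (23571 here) shows how long ago a test last ran to completion.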

Thanks for any hints!

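A controller fault seems possible here. Before you spend too much time on it, I suggest reseating all the cables. Connections can work loose over time and cause problems.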

Also, you mention smartd... did you disable it when you tried to run the self-tests? It may be interfering with the manual tests.
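One quick way to rule that out is to check whether a smartd process exists before starting a manual test. A rough Python sketch, assuming `pgrep` is available on the box; the function name and the injectable `run` parameter are my own choices for illustration:

```python
import subprocess


def smartd_running(run=subprocess.run):
    """Return True if a process named exactly "smartd" exists.

    `pgrep -x` exits with status 0 when at least one process matches.
    `run` is injectable so the check can be exercised without pgrep.
    """
    result = run(["pgrep", "-x", "smartd"], capture_output=True, text=True)
    return result.returncode == 0
```

If `smartd_running()` returns True, stop the daemon through your init system, start the short test again, and see whether it still hangs at 90%.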

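Does dmesg say anything about why sdb was considered failed? Both drives seem to report their health as PASSED, and I don't believe mdadm actually uses any SMART data to determine drive health.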