PERC MegaRAID SMART status mismatch with smartctl: looking for clues about what is wrong with the HDD

MegaCli on a Dell R720xd with a PERC H710P and five 4 TB SATA drives in RAID5 reports a strange SMART error.

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL

reports a nonzero "Last Predictive Failure Event Seq Number" for one drive:

    Slot Number: 4
    ...
    Last Predictive Failure Event Seq Number: 7309
    ...
    Inquiry Data: PK2361PAGAZU8WHitachi HUS724040ALE640 MJAOA3B0
    ...
    Drive has flagged a SMART alert : Yes
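
With several drives behind the controller, it can help to pull the flagged slots out of the `-PDList` output automatically. The following is a minimal Python sketch, assuming the output is line-oriented `Key: Value` text with the field names quoted in this post; the function name `flagged_drives` and the `SAMPLE` text are illustrative, and the exact spelling of the alert line may differ between MegaCli versions, so the sketch matches only on its prefix.

```python
def flagged_drives(pdlist_text):
    """Return (slot, last_predictive_failure_seq) for drives reporting a SMART alert."""
    flagged = []
    slot = seq = None
    for line in pdlist_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: Value" line (blank line, "...", banner)
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot, seq = int(value), None  # a new drive record starts here
        elif key == "Last Predictive Failure Event Seq Number":
            seq = int(value)
        elif key.startswith("Drive has flagged") and value == "Yes":
            flagged.append((slot, seq))
    return flagged


# Sample assembled from the two drive records quoted in this post.
SAMPLE = """\
Slot Number: 0
Last Predictive Failure Event Seq Number: 0
Inquiry Data: PK2331PAG7EENTHitachi HUS724040ALE640 MJAOA3B0
Drive has flagged a SMART alert : No
Slot Number: 4
Last Predictive Failure Event Seq Number: 7309
Inquiry Data: PK2361PAGAZU8WHitachi HUS724040ALE640 MJAOA3B0
Drive has flagged a SMART alert : Yes
"""

if __name__ == "__main__":
    print(flagged_drives(SAMPLE))
```

In practice the text would come from running the `MegaCli64 -PDList -aALL` command above and passing its standard output to the function.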

But smartctl gives no clue at all about what is wrong:

    # smartctl -a -d sat+megaraid,4 /dev/sda
    smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.19.1.el6.x86_64] (local build)
    ...
    Serial Number: PK2361PAGAZU8W   # Note same serial, no mistake
    ...
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    ...
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    Warning: This result is based on an Attribute check.
    ...
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
      3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       426
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       7
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       37
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4912
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       182
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       182
    194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 19/40)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

    SMART Error Log Version: 1
    No Errors Logged

Nothing to frown at in the above.

A short self-test did not reveal anything; I have now started a long test:

    Serial Number: PK2331PAG7EENT
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      4911         -

Meanwhile, another disk in the same array has 39 reallocated sectors, and the PERC does not flag it as about to fail. Its smartctl output:

    Serial Number: PK2331PAG7EENT
    ...
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
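
One reason a drive with 39 reallocated sectors can still read as healthy is that an attribute-based check conventionally compares the normalized VALUE column with THRESH, not the raw count: here VALUE is 100 against a threshold of 005. The Python sketch below illustrates that comparison on the row quoted above. It assumes the usual ten-column smartctl attribute layout; the helper names and the choice of "interesting" attribute IDs are illustrative, and nothing here is known about how the PERC firmware itself decides.

```python
def parse_attributes(smartctl_text):
    """Parse smartctl attribute rows into dicts (assumes the 10-column table layout)."""
    rows = []
    for line in smartctl_text.splitlines():
        f = line.split()
        # Attribute rows start with a numeric ID and carry a hex FLAG in column 3.
        if len(f) >= 10 and f[0].isdigit() and f[2].startswith("0x"):
            rows.append({
                "id": int(f[0]),
                "name": f[1],
                "value": int(f[3]),
                "worst": int(f[4]),
                "thresh": int(f[5]),
                "type": f[6],
                "raw": int(f[9]),
            })
    return rows


def failing(rows):
    """Attributes whose normalized value has dropped to or below a nonzero threshold."""
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


def nonzero_raw(rows, ids=(5, 196, 197, 198)):
    """Raw counters worth a look even when no threshold has been crossed."""
    return [(r["name"], r["raw"]) for r in rows if r["id"] in ids and r["raw"] > 0]


# Row copied from the smartctl output for the disk with serial PK2331PAG7EENT.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
"""

if __name__ == "__main__":
    rows = parse_attributes(SAMPLE)
    print("failing:", failing(rows))
    print("nonzero raw counters:", nonzero_raw(rows))
```

On this row the threshold check finds nothing failing, while the raw-counter check still surfaces the 39 reallocated sectors, which matches what smartctl shows for this disk.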

And the MegaCli64 output for the same disk with 39 reallocated sectors:

    Slot Number: 0
    Last Predictive Failure Event Seq Number: 0
    Inquiry Data: PK2331PAG7EENTHitachi HUS724040ALE640 MJAOA3B0
    ...
    Drive has flagged a SMART alert : No

The MegaRAID Storage Manager report is not very enlightening either:

    ID = 113
    SEQUENCE NUMBER = 7310
    TIME = 11-07-2013 20:58:01
    LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = -:-:4Hardware impending failure general hard drive failure, CDB = 0x03 0x00 0x00 0x00 0x40 0x00 , Sense = 0xf0 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x5d 0x10 0x00 0x00 0x00 0x00

    ID = 96
    SEQUENCE NUMBER = 7309
    TIME = 11-07-2013 20:58:01
    LOCALIZED MESSAGE = Controller ID: 0 PD Predictive failure: -:-:4
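
The sense bytes in event 7310 can be decoded by hand, which shows where the alert text comes from. In SCSI fixed-format sense data, byte 0 holds the response code (0x70 or 0x71, plus a "valid" bit in 0x80), the low nibble of byte 2 is the sense key, and bytes 12 and 13 are the additional sense code (ASC) and its qualifier (ASCQ). The Python sketch below applies that layout to the bytes from the log; the function name and the one-entry lookup table are illustrative, and the table contains only the code that appears in this log.

```python
def decode_fixed_sense(sense):
    """Decode the main fields of SCSI fixed-format sense data."""
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "valid": bool(sense[0] & 0x80),
        "response_code": response_code,
        "sense_key": sense[2] & 0x0F,
        "asc": sense[12],
        "ascq": sense[13],
    }


# Only the ASC/ASCQ pair seen in this log; the text matches the controller message.
ASC_ASCQ_TEXT = {
    (0x5D, 0x10): "HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE",
}

# Sense bytes copied from event 7310.
SENSE = bytes.fromhex("f0 00 00 00 00 00 00 0a 00 00 00 00 5d 10 00 00 00 00")

if __name__ == "__main__":
    fields = decode_fixed_sense(SENSE)
    print(fields)
    print(ASC_ASCQ_TEXT.get((fields["asc"], fields["ascq"]), "unknown ASC/ASCQ"))
```

The CDB in the same event starts with opcode 0x03, which is REQUEST SENSE, and the decoded pair is ASC 0x5D, ASCQ 0x10. That suggests the controller is relaying a failure prediction that arrived through sense data, not deriving one from the attribute table that smartctl prints, which may explain why the two views disagree.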

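So the disk appears to be healthy. Any ideas on how to reset the SMART alert? I don't think the SMART statistics are enough to justify a warranty claim.

PS: We have pulled drive #4 and plugged it into slot #5. It shows as healthy, showed up as "Foreign" as expected, and is now assigned as a global hot spare. We put a new drive in slot #4 and the RAID rebuilt the volume. Dell support suggested using omconfig to pull more detailed controller logs.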