Strange SMART error from MegaCli on a Dell R720xd with a PERC H710P and five 4 TB SATA drives in RAID 5
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
reports a "Last Predictive Failure Event Seq Number" and a SMART alert for one of the drives:
Slot Number: 4
...
Last Predictive Failure Event Seq Number: 7309
...
Inquiry Data: PK2361PAGAZU8WHitachi HUS724040ALE640 MJAOA3B0
...
Drive has flagged a SMART alert : Yes
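With more than a handful of drives it is easier to filter the `-PDList` output than to read it. Below is a minimal sketch in Python, assuming each drive's record can be recognised by a `Slot Number:` line as in the excerpt above (real output has further fields around it, which this sketch simply attaches to the most recent slot). The function names `parse_pdlist` and `flagged_slots` are made up for illustration.

```python
def parse_pdlist(text):
    """Split MegaCli -PDList text into one dict per drive.

    Assumption: a new drive record starts at each "Slot Number:" line.
    Fields printed before the first "Slot Number:" line are ignored.
    """
    drives = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue  # skips blank lines and "..." elisions
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            current = {}
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives


def flagged_slots(drives):
    """Return slot numbers of drives whose SMART alert field reads 'Yes'."""
    return [
        int(d["Slot Number"])
        for d in drives
        if d.get("Drive has flagged a SMART alert", "").lower() == "yes"
    ]


# Sample text built from the fields quoted in this post.
SAMPLE = """\
Slot Number: 4
Last Predictive Failure Event Seq Number: 7309
Drive has flagged a SMART alert : Yes
Slot Number: 0
Last Predictive Failure Event Seq Number: 0
Drive has flagged a SMART alert : No
"""

if __name__ == "__main__":
    print(flagged_slots(parse_pdlist(SAMPLE)))
```

In practice the text would come from running the MegaCli64 command above and capturing its standard output; the sample string only stands in for that.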
But smartctl gives no clue at all as to what is wrong:
# smartctl -a -d sat+megaraid,4 /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.19.1.el6.x86_64] (local build)
...
Serial Number:    PK2361PAGAZU8W   # Note same serial, no mistake
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       426
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       37
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4912
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       182
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       182
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 19/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
Nothing in the above gives any cause for concern.
A short self-test revealed nothing either, and I have now started a long test:
Serial Number:    PK2331PAG7EENT
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4911         -
Meanwhile, another disk in the same array has 39 reallocated sectors, yet the PERC does not flag it as about to fail. Its smartctl output is:
Serial Number:    PK2331PAG7EENT
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
And the MegaCli64 output for that same disk with 39 reallocated sectors:
Slot Number: 0
Last Predictive Failure Event Seq Number: 0
Inquiry Data: PK2331PAG7EENTHitachi HUS724040ALE640 MJAOA3B0
...
Drive has flagged a SMART alert : No
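One possible reading of why this disk is not flagged: smartctl says its overall result "is based on an Attribute check", and that check compares the normalized VALUE against THRESH, not the RAW_VALUE. The disk with 39 reallocated sectors still has VALUE 100 against THRESH 005. Below is a rough Python sketch of that comparison, assuming the usual rule from the smartctl documentation that an attribute counts as failed when VALUE is less than or equal to THRESH. The helper names are illustrative, and whether the PERC applies the same rule is not something I know.

```python
def parse_attribute_row(row):
    """Parse one row of smartctl's attribute table into a dict.

    Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
    RAW_VALUE may contain spaces, so the split stops after nine separators.
    """
    fields = row.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "raw": fields[9],
    }


def is_failing(attr):
    """Assumed rule: failed when the normalized value is <= the threshold."""
    return attr["value"] <= attr["thresh"]


# Row copied from the smartctl output of the disk in slot 0.
ROW = "5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 39"

if __name__ == "__main__":
    attr = parse_attribute_row(ROW)
    print(attr["name"], "raw:", attr["raw"], "failing:", is_failing(attr))
```

Under this rule the raw count of 39 does not matter until the drive itself lowers the normalized value to 5 or below.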
The MegaRAID Storage Manager report is not very illuminating either:
ID = 113
SEQUENCE NUMBER = 7310
TIME = 11-07-2013 20:58:01
LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = -:-:4Hardware impending failure general hard drive failure, CDB = 0x03 0x00 0x00 0x00 0x40 0x00 , Sense = 0xf0 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x5d 0x10 0x00 0x00 0x00 0x00

ID = 96
SEQUENCE NUMBER = 7309
TIME = 11-07-2013 20:58:01
LOCALIZED MESSAGE = Controller ID: 0 PD Predictive failure: -:-:4
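The sense bytes in event 7310 can be decoded by hand, which at least shows where the controller's message comes from. Below is a small Python sketch, assuming the 18 bytes are SCSI fixed-format sense data (response code 0x70 with the valid bit set, giving 0xf0). The additional sense code and qualifier sit at byte offsets 12 and 13; here they are 0x5d and 0x10, which the T10 code list names "hardware impending failure general hard drive failure", the same text that appears in the log. The CDB opcode 0x03 is REQUEST SENSE. The function name and the one-entry lookup table are mine, for illustration only.

```python
# Sense bytes copied from event 7310 above.
SENSE_TEXT = (
    "0xf0 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 "
    "0x00 0x00 0x00 0x5d 0x10 0x00 0x00 0x00 0x00"
)

# Only the one ASC/ASCQ pair seen in this post; not a complete table.
KNOWN_CODES = {
    (0x5D, 0x10): "HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE",
}


def decode_fixed_sense(sense):
    """Pull the main fields out of SCSI fixed-format sense data."""
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    if len(sense) < 14:
        raise ValueError("sense data too short to hold ASC/ASCQ")
    return {
        "valid": bool(sense[0] & 0x80),
        "response_code": response_code,
        "sense_key": sense[2] & 0x0F,
        "additional_length": sense[7],
        "asc": sense[12],
        "ascq": sense[13],
    }


if __name__ == "__main__":
    sense = bytes(int(tok, 16) for tok in SENSE_TEXT.split())
    fields = decode_fixed_sense(sense)
    label = KNOWN_CODES.get((fields["asc"], fields["ascq"]), "unknown")
    print(fields, label)
```

So the alert appears to originate from the drive reporting an informational exception to the controller, not from any of the attribute values smartctl prints.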
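So the disk appears to be healthy. Any ideas on how to reset the SMART alert? I don't think the SMART statistics are enough to make a warranty claim.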
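PS: We have pulled drive #4 and plugged it in as #5; it shows as healthy, came up as "foreign" as expected, and is now assigned as a global hot spare. We put a new drive in slot #4 and the RAID is rebuilding the volume. Dell support suggested using omconfig to get more detailed controller logs.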