I am having a hard time tracking down a problem for a friend of mine. He runs ZFS on Linux on a Debian distribution. We are getting these entries in dmesg:
[273044.834151] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834157] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834161] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834164] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834168] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834171] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834175] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834178] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834182] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.834185] mpt2sas0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
[273044.841140] sd 0:0:1:0: [sdb] Device not ready
[273044.841146] sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[273044.841151] sd 0:0:1:0: [sdb] Sense Key : Not Ready [current]
[273044.841155] sd 0:0:1:0: [sdb] Add. Sense: Logical unit not ready, cause not reportable
[273044.841162] sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 b4 0c c3 e0 00 01 00 00
[273044.841171] end_request: I/O error, dev sdb, sector 3020735456
[273044.841530] sd 0:0:1:0: [sdb] Device not ready
[273044.841532] sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[273044.841535] sd 0:0:1:0: [sdb] Sense Key : Not Ready [current]
[273044.841538] sd 0:0:1:0: [sdb] Add. Sense: Logical unit not ready, cause not reportable
[273044.841543] sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 b4 0c c1 e0 00 01 00 00
[273044.841550] end_request: I/O error, dev sdb, sector 3020734944
--- snip ---
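For anyone reading along, the failed writes in the log can be tied to the reported sectors by decoding the CDB bytes. The sketch below assumes the standard WRITE(10) layout (opcode 0x2a, a big-endian logical block address in bytes 2 to 5, a transfer length in bytes 7 and 8); the function name `decode_write10` is made up for illustration.

```python
def decode_write10(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        # Bytes 2..5: logical block address, big-endian
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of blocks to transfer, big-endian
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


if __name__ == "__main__":
    # The two CDBs from the dmesg excerpt above
    for cdb in ("2a 00 b4 0c c3 e0 00 01 00 00",
                "2a 00 b4 0c c1 e0 00 01 00 00"):
        print(decode_write10(cdb))
```

The decoded addresses match the sectors in the `end_request` lines (3020735456 and 3020734944), so both errors are 256-block writes that the drive refused while it was not ready.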
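We have done a full scrub, and it turned up no additional errors. We also ran a SMART long test, which passed as well. There are currently no pending sectors and no reallocated sectors. What else can we try in order to debug this problem?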
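The "no pending, no reallocated sectors" check can be made repeatable with a small script. This is a minimal sketch that assumes the ten-column attribute layout printed by `smartctl -A`; the function name `nonzero_watched` and the choice of watched attributes are illustrative, not a standard.

```python
# Attributes whose raw value should normally stay at zero on a healthy disk.
# This selection is an assumption made for the example.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def nonzero_watched(smart_text: str, watched=WATCHED) -> dict:
    """Return {attribute_name: raw_value} for watched attributes that are nonzero."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


if __name__ == "__main__":
    sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""
    print(nonzero_watched(sample) or "all watched counters are zero")
```

Feeding it the saved output of `smartctl -A` for each pool member gives a quick way to see whether any counter moves between runs.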
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   174   021    Pre-fail  Always       -       5150
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5065
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       47
194 Temperature_Celsius     0x0022   121   102   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

We are running another scrub now. I am using an IBM ServeRAID M1015 flashed to IT mode, on a Supermicro X9SCM-F motherboard with WD20EARX Green disks.

  pool: hulk
 state: ONLINE
  scan: scrub in progress since Sun May 4 14:26:11 2014
    33.2G scanned out of 10.2T at 254M/s, 11h38m to go
    0 repaired, 0.32% done
config:
    NAME          STATE     READ WRITE CKSUM
    hulk          ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        hulk1     ONLINE       0     0     0
        hulk2     ONLINE       0     0     0
        hulk3     ONLINE       0     0     0
        hulk4     ONLINE       0     0     0
        hulk5     ONLINE       0     0     0
        hulk6     ONLINE       0     0     0

errors: No known data errors
The log_info value 0x31110d00 decodes to:
Value      0x31110D00
Type       0x30000000   SAS
Origin     0x01000000   PL
Code       0x00110000   PL_LOGINFO_CODE_RESET   See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code   0x00000D00   PL_LOGINFO_SUB_CODE_SATA_LINK_DOWN
This boils down to the SATA device having been reset, either by the SAS HBA or by the OS on its own.
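The decode above is plain bit masking, which is handy if other log_info values show up in dmesg. The sketch below infers the field masks from the table (type in the top nibble, origin in the next, one byte of code, two bytes of sub-code) and only knows the four names that appear there. The helper name `decode_log_info` is hypothetical, and any other value is reported as "unknown" rather than guessed.

```python
# Field masks inferred from the decode table above (an assumption of this sketch)
FIELD_MASKS = (
    ("type", 0xF0000000),
    ("origin", 0x0F000000),
    ("code", 0x00FF0000),
    ("sub_code", 0x0000FFFF),
)

# Only the names that appear in the table above; everything else is "unknown"
KNOWN_NAMES = {
    ("type", 0x30000000): "SAS",
    ("origin", 0x01000000): "PL",
    ("code", 0x00110000): "PL_LOGINFO_CODE_RESET",
    ("sub_code", 0x00000D00): "PL_LOGINFO_SUB_CODE_SATA_LINK_DOWN",
}


def decode_log_info(value: int) -> dict:
    """Split a 32-bit log_info value into (masked_value, name) per field."""
    decoded = {}
    for field, mask in FIELD_MASKS:
        part = value & mask
        decoded[field] = (part, KNOWN_NAMES.get((field, part), "unknown"))
    return decoded


if __name__ == "__main__":
    for field, (part, name) in decode_log_info(0x31110D00).items():
        print(f"{field:<9} 0x{part:08X}  {name}")
```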
You can use mptevents to get full information about the SAS HBA events (it appears to be an LSI SAS card; if it is a MegaRAID, this will not work).
You can enable SCSI logging with echo 0x010401cd > /proc/sys/dev/scsi/logging_level
If neither of those shows an error, then it is likely an internal drive assert, but those are very rare.
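In addition, I would suggest looking at the SAS phys to see whether any of them report errors in the invalid_dword_count file. You can find them under the /sys/class/sas_phy directory.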
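Checking every phy by hand gets tedious, so here is a minimal sketch that walks the directory and prints the counter for each phy. It assumes the counter is exposed as `invalid_dword_count`, the attribute name used by the Linux SAS transport class; the function name `phy_error_counts` and its parameters are made up for the example.

```python
from pathlib import Path


def phy_error_counts(root="/sys/class/sas_phy", counter="invalid_dword_count"):
    """Return {phy_name: count} for every phy under root that exposes the counter."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no SAS transport class on this machine
    for phy in sorted(base.iterdir()):
        try:
            counts[phy.name] = int((phy / counter).read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable for this phy
    return counts


if __name__ == "__main__":
    for phy, count in phy_error_counts().items():
        marker = "  <-- nonzero" if count else ""
        print(f"{phy}: {count}{marker}")
```

A counter that keeps climbing on one phy while the others stay at zero would point at that link (cable, port or drive) rather than at the controller as a whole.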
Oh, Supermicro... 🙂
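But really, just wait it out. Get a spare disk, and maybe even configure it as a hot spare. This is what RAID protection is for. Your error appears to be local to one drive/port/SATA connection. That rules out the backplane (there may not even be one) and the controller as the cause. If the drive is going bad, let it. Replace it as needed.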