SATA hard disk errors

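I have a server with WDC WD3202ABYS disks, hosting 100 virtual hosts. The server has been running for about 5 years, and in that time I have had 4 disk replacements, all for the same reason: SATA errors. The latest one: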

    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x5
    ata2.00: cmd 35/00:60:57:7b:b6/00:01:06:00:00/e0 tag 0 dma 180224 out
             res 51/10:60:57:7b:b6/10:01:06:00:00/e0 Emask 0x81 (invalid argument)
    ata2.00: status: { DRDY ERR }
    ata2.00: error: { IDNF }
    ata2.00: configured for UDMA/133
    sd 1:0:0:0: SCSI error: return code = 0x08000002
    sdb: Current [descriptor]: sense key: Aborted Command
        Add. Sense: Recorded entity not found
    Descriptor sense data with sense descriptors (in hex):
            72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
            06 b6 7b 57
    end_request: I/O error, dev sdb, sector 112622423
    Buffer I/O error on device dm-8, logical block 14077747
    lost page write due to I/O error on dm-8
    Buffer I/O error on device dm-8, logical block 14077748
    lost page write due to I/O error on dm-8
    Buffer I/O error on device dm-8, logical block 14077749
    lost page write due to I/O error on dm-8
    Buffer I/O error on device dm-8, logical block 14077750
    lost page write due to I/O error on dm-8
    Buffer I/O error on device dm-8, logical block 14077751
    lost page write due to I/O error on dm-8
    Buffer I/O error on device dm-8, logical block 14077756
    lost page write due to I/O error on dm-8
    ata2: EH complete
    SCSI device sdb: 625142448 512-byte hdwr sectors (320073 MB)
    sdb: Write Protect is off
    sdb: Mode Sense: 00 3a 00 00
    SCSI device sdb: drive cache: write back
    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x5
    ata2.00: cmd 35/00:90:17:30:b7/00:02:08:00:00/e0 tag 0 dma 335872 out
             res 51/10:90:17:30:b7/10:02:08:00:00/e0 Emask 0x81 (invalid argument)
    ata2.00: status: { DRDY ERR }
    ata2.00: error: { IDNF }
    ata2.00: configured for UDMA/133
    sd 1:0:0:0: SCSI error: return code = 0x08000002
    sdb: Current [descriptor]: sense key: Aborted Command
        Add. Sense: Recorded entity not found
    Descriptor sense data with sense descriptors (in hex):
            72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
            08 b7 30 17
    end_request: I/O error, dev sdb, sector 146223127
    printk: 34 messages suppressed.
    Buffer I/O error on device dm-8, logical block 18277835
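For anyone trying to read these kernel messages, here is a minimal Python sketch that decodes the `cmd` taskfile dump libata prints into an opcode, a starting LBA and a transfer size. The function name `decode_taskfile` and the small opcode table are illustrative and not part of any tool. The 512-byte sector size is taken from the "512-byte hdwr sectors" line in the log, and treating every opcode outside the table as a 28-bit LBA command is a simplifying assumption.

```python
import re

# libata prints the taskfile as:
#   cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
TASKFILE_RE = re.compile(
    r"cmd ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)

# Only the two opcodes that appear in the logs of this thread.
COMMAND_NAMES = {0x35: "WRITE DMA EXT", 0xC8: "READ DMA"}
LBA48_COMMANDS = {0x35}  # assumption: anything else is decoded as LBA28

SECTOR_SIZE = 512  # from "512-byte hdwr sectors" in the log


def decode_taskfile(line):
    """Decode one libata 'cmd ...' line into opcode, LBA and transfer size."""
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd taskfile found in line")
    command = int(m.group(1), 16)
    _feat, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    _hfeat, hnsect, hlbal, hlbam, hlbah = (int(b, 16) for b in m.group(3).split(":"))
    device = int(m.group(4), 16)

    if command in LBA48_COMMANDS:
        # 48-bit LBA: the "hob" (high order byte) registers carry bits 24..47.
        lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
        sectors = ((hnsect << 8) | nsect) or 65536  # a count of 0 means 65536
    else:
        # 28-bit LBA: bits 24..27 live in the low nibble of the device register.
        lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
        sectors = nsect or 256  # a count of 0 means 256

    return {
        "command": COMMAND_NAMES.get(command, hex(command)),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
    }


if __name__ == "__main__":
    line = "ata2.00: cmd 35/00:60:57:7b:b6/00:01:06:00:00/e0 tag 0 dma 180224 out"
    print(decode_taskfile(line))
```

For the first failing command this gives a WRITE DMA EXT of 352 sectors (180224 bytes, the same figure as the `dma 180224` field) starting at LBA 112622423, which is the sector reported by `end_request`. In other words, the drive rejected a write to a specific address; the request itself was well-formed.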

This looks like some kind of software error…

But a short time after that (perhaps when I started fsck) the following errors appeared:

    EXT3-fs error (device dm-8): ext3_put_super: Couldn't clean up the journal
    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x4
    ata2.00: cmd c8/00:00:8f:0d:84/00:00:00:00:00/e1 tag 0 dma 131072 in
             res 51/01:00:a8:0d:84/10:02:08:00:00/e1 Emask 0x1 (device error)
    ata2.00: status: { DRDY ERR }
    ata2.00: configured for UDMA/133
    ata2: EH complete
    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x4
    ata2.00: cmd c8/00:00:8f:0d:84/00:00:00:00:00/e1 tag 0 dma 131072 in
             res 51/40:00:a8:0d:84/10:02:08:00:00/e1 Emask 0x9 (media error)
    ata2.00: status: { DRDY ERR }
    ata2.00: error: { UNC }
    ata2.00: configured for UDMA/133
    ata2: EH complete
    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x4
    ata2.00: cmd c8/00:00:8f:0d:84/00:00:00:00:00/e1 tag 0 dma 131072 in
             res 51/40:00:a8:0d:84/10:02:08:00:00/e1 Emask 0x9 (media error)
    ata2.00: status: { DRDY ERR }
    ata2.00: error: { UNC }
    ata2.00: configured for UDMA/133
    ata2: EH complete
    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x4
    ata2.00: cmd c8/00:00:8f:0d:84/00:00:00:00:00/e1 tag 0 dma 131072 in
             res 51/40:00:a8:0d:84/10:02:08:00:00/e1 Emask 0x9 (media error)

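Is it possible that this error is also a "software" one? I mean, this disk has only 9000 hours on it, there is no unusual load on it, and its temperature is 29 °C. Do I need to replace the disk, or is checking the disk enough?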

How can I find out the cause?

Here is the error from SMART:

    Error 36 occurred at disk power-on lifetime: 9160 hours (381 days + 16 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 22 09 80 e3  Error: UNC at LBA = 0x03800922 = 58722594

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 08 1f 09 80 03 0a  47d+13:38:13.534  READ DMA
      ec 00 00 00 00 00 00 0a  47d+13:38:13.530  IDENTIFY DEVICE
      ef 03 46 00 00 00 00 0a  47d+13:38:13.528  SET FEATURES [Set transfer mode]
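As a cross-check on how that SMART error entry is derived, the register row can be decoded by hand. The sketch below is only illustrative: `decode_error_row` is a made-up helper name, the bit table lists just a few well-known ATA error-register flags, and it assumes a 28-bit LBA command such as the READ DMA shown in the log.

```python
# A few ATA error-register bits (not an exhaustive table).
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
]


def decode_error_row(row):
    """Decode an 'ER ST SC SN CL CH DH' row from the SMART error log."""
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split())
    flags = [name for bit, name in ERROR_BITS if er & bit]
    # LBA28: SN, CL and CH hold bits 0..23, the low nibble of DH holds bits 24..27.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {"flags": flags, "status": st, "sector_count": sc, "lba": lba}


if __name__ == "__main__":
    info = decode_error_row("40 51 00 22 09 80 e3")
    print(info["flags"], hex(info["lba"]), info["lba"])
```

For the row above this yields the UNC flag and LBA 0x3800922 (58722594), matching what smartctl printed. The error register is set by the drive itself, so a UNC here means the drive could not read that sector from the platter.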

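OK. Is the following scenario possible: 1. The disk ran for 9000 hours without an fsck. 2. Some errors built up. 3. Errors like the following appeared in dmesg: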

    ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata2.00: BMDMA stat 0x5
    ata2.00: cmd 35/00:60:57:7b:b6/00:01:06:00:00/e0 tag 0 dma 180224 out
             res 51/10:60:57:7b:b6/10:01:06:00:00/e0 Emask 0x81 (invalid argument)
    ata2.00: status: { DRDY ERR }
    ata2.00: error: { IDNF }
    ata2.00: configured for UDMA/133
    sd 1:0:0:0: SCSI error: return code = 0x08000002
    sdb: Current [descriptor]: sense key: Aborted Command
        Add. Sense: Recorded entity not found

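that is, errors such as inode errors and so on…
Then I tried to unmount this partition, and the errors came from the hard disk, as if it could not find such an inode, and so on?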

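If so, I don't understand. Do I need to replace the disks every year just to prevent this error? Has anyone else had the same problem? It isn't just one disk…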

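In my experience, the errors you are seeing are real hardware errors being reflected in software. The "lost page write due to I/O error" message is one I have seen on bad hard drives, and those drives behaved the way you describe when you tried to fsck. This is almost certainly a genuine hardware failure.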

You should check the output of smartctl to see what it says the problem might be.

 smartctl --attributes /dev/sdb

It will give you output similar to this:

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   212   186   021    Pre-fail  Always       -       4358
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       97
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
      9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       25420
     10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
     11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86
    194 Temperature_Celsius     0x0022   104   001   000    Old_age   Always       -       46
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

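The output can be cryptic, but the attribute I would watch closely is Reallocated_Sector_Ct, since it tells you how many bad sectors the drive itself already knows about. The command "smartctl -a" will give you even more data. On a drive that has been running for a while, I have seen a "SMART Error Log" section at the bottom of that output with several entries in it.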
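If you want to check this on several disks without reading the table by eye, a small parser is enough. The following is a rough Python sketch that works on the text printed by `smartctl --attributes`; the function names are made up for illustration, and watching Current_Pending_Sector and Offline_Uncorrectable in addition to Reallocated_Sector_Ct is my own assumption about what is worth flagging.

```python
# Attributes whose raw value should normally stay at 0 on a healthy disk.
# Reallocated_Sector_Ct comes from the answer above; the other two are an
# assumption added for illustration.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text):
    """Parse the attribute table printed by `smartctl --attributes`."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9],  # first token of the raw value
            }
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is greater than zero."""
    found = {}
    for name in WATCHED:
        raw = attrs.get(name, {}).get("raw", "")
        if raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


if __name__ == "__main__":
    # Two rows copied from the sample output above.
    sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""
    print(suspicious(parse_attributes(sample)))  # empty dict for this healthy sample
```

On a real system the text would come from running smartctl yourself and feeding its output to `parse_attributes`. An empty result is not proof of health, since a drive can return read errors before it has reallocated anything.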

You have an uncorrectable read error.

 Error: UNC at LBA = 0x03800922 = 58722594

The data in that block is now lost.

You should:

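Use a mirror in the first place. Enterprise disks are really meant to sit behind a mirror; they would rather return a read error than keep trying hard to recover the data.
Restore the lost data from backup.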

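You have NO EXCUSE for not using RAID (especially if you are hosting websites for clients!). The operating system isn't that big, so you don't need a dedicated disk for it on a two-disk system.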

Are you using a RAID controller? If so, what kind of controller is it?

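One thing (which is both frustrating and enlightening) is that hard disk manufacturers are introducing more and more segmentation into the SATA market. There are now drives for "small business / RAID use" and drives for "single / desktop use". SAS seems to be getting pushed toward the "high-end enterprise" market.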

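Your model number is a drive from WD's RE3 series, which is designed for RAID setups. I have been told this means (among other things) that the drive "gives up" sooner (i.e. within 3-4 seconds) when trying to recover from an error, instead of retrying over and over for much longer. The error is reported to the RAID controller early, so the controller can use another drive to recover the data. Conversely, if the drive kept trying for longer, the RAID controller would kick the drive out of the array for not responding.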
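On drives that expose this early "give up" timeout through SCT Error Recovery Control, smartctl can query and set it with its `-l scterc` option, which takes the read and write limits in tenths of a second. Whether this particular drive supports it is not something this thread establishes, so treat the following Python sketch as an assumption-laden illustration: the helper names are hypothetical, the 7-second value is only an example, and the code just builds the argument lists without running anything.

```python
def scterc_query_command(device):
    """Arguments for asking a drive for its current SCT ERC timeouts."""
    return ["smartctl", "-l", "scterc", device]


def scterc_set_command(device, read_seconds, write_seconds):
    """Arguments for setting the SCT ERC read/write timeouts.

    smartctl expects the limits in units of 100 ms, so seconds are
    converted to tenths of a second here.
    """
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


if __name__ == "__main__":
    # /dev/sdb is the device name used elsewhere in this thread;
    # 7.0 seconds is an illustrative value, not a recommendation.
    print(" ".join(scterc_query_command("/dev/sdb")))
    print(" ".join(scterc_set_command("/dev/sdb", 7.0, 7.0)))
```

The point of keeping the limit well below the RAID layer's own timeout is the one made above: the drive reports the error quickly and the array recovers from the other disk, instead of the whole drive being dropped for not responding.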

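Failures should still be rare, not a yearly event. Maybe some other aspect of your setup is to blame? (I once fought a frustrating battle with a SATA cable; it is now mounted on my door as a warning to other cables…)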

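I have had very bad experiences with Western Digital drives. More than half of my drives had to be replaced under warranty, either because they failed completely or because of bad sectors.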

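After buying WD hard drives for about 8 years, I no longer want to spend money on them. I don't know which WD drives I can trust; my experience so far says "none of them".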

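You have replaced the original drive four times. Did you buy all five drives at the same time? Did you buy a new replacement each time one failed? Did you return the drives under warranty for an exchange? How did you obtain the five drives, and what models were they? In my experience, WD drives tend to be bad in batches and fail at around the same time.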