Confirming that a disk is bad when it passes all diagnostics

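I have a system with a possibly bad disk, but the disk passes every diagnostic I have thrown at it, so I have been unable to confirm that it is bad. What options do I have?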

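I could just replace the disk, but because this situation closely resembles another, more serious one (long story), I want to make an actual diagnosis rather than swap hardware at random.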

The problem and its history:

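I had a Debian Linux PC (500 MHz P3) acting as router, Nagios and Munin host.
It crashed about every two weeks. No logs or dmesg output could be obtained (it is an old Compaq that boots only when configured as keyboardless, and attaching a keyboard later, once it has booted, is not possible).
At the time I simply replaced the PC with another Compaq (P4 2.4 GHz), because I assumed the hardware was at fault. However, it still crashes every few weeks.
The difference is that on this machine I can still get in over SSH. It gives all kinds of errors on hda.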

I want to confirm that the disk is bad, but I have not managed to do so:

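The SMART error log shows no errors. Normally, when a disk starts acting up, the overall SMART health check still passes, but a read error does get logged in the error log.
The SMART self-test (smartctl -t long /dev/sda) completes without errors.
The reallocated sector count (a telltale parameter) has always been 31, even years ago when this disk was still in use in my desktop. The number has never changed.
dd if=/dev/sda of=/dev/null bs=4096 passes with flying colors.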
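
A minimal sketch of the dd read pass above, wrapped so that the outcome is explicit. The helper name read_scan is hypothetical, and the sketch assumes a POSIX shell and root privileges when pointed at a raw device. It relies on dd stopping at the first read error and exiting non-zero when conv=noerror is not given.

```shell
#!/bin/sh
# Read every block of a device (or file) once and report whether dd
# completed cleanly. read_scan is a hypothetical helper name.
read_scan() {
    target=$1
    # Without conv=noerror, dd aborts on the first read error and
    # returns a non-zero exit status, so the status is all we need.
    if dd if="$target" of=/dev/null bs=4096 2>/dev/null; then
        echo "read pass OK: $target"
    else
        echo "read pass FAILED: $target"
        return 1
    fi
}
```

It would be called as read_scan /dev/sda. A clean pass only shows that every block was readable during that one run; it says nothing about intermittent faults that appear weeks apart.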

What else can I do to assess the health of the drive?

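Again, this is not about getting this router fully functional again. It is a disk forensics question: I happen to have another server that may have the same problem, and knowing the answer could help me a great deal.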

For the record, here are the logs and so on.

This is the smartctl -a output:

    smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
    Device Model:     ST3120026A
    Serial Number:    5JT1CLQM
    Firmware Version: 3.06
    User Capacity:    120,034,123,776 bytes
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   6
    ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
    Local Time is:    Mon Jul  1 21:18:33 2013 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (  24) The self-test routine was aborted by
                                            the host.
    Total time to complete Offline
    data collection:                 ( 430) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            No General Purpose Logging support.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        (  85) minutes.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   050   046   006    Pre-fail  Always       -       47766662
      3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       10
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       31
      7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       820305
      9 Power_On_Hours          0x0032   048   048   000    Old_age   Always       -       46373
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       605
    194 Temperature_Celsius     0x0022   036   065   000    Old_age   Always       -       36
    195 Hardware_ECC_Recovered  0x001a   050   046   000    Old_age   Always       -       47766662
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   196   000    Old_age   Always       -       6
    200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
    202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Aborted by host               80%     46361         -
    # 2  Extended offline    Completed without error       00%     46358         -
    # 3  Short offline       Completed without error       00%     12046         -
    # 4  Extended offline    Completed without error       00%     10472         -
    # 5  Short offline       Completed without error       00%     10471         -
    # 6  Short offline       Completed without error       00%     10471         -
    # 7  Short offline       Completed without error       00%      6770         -
    # 8  Extended offline    Aborted by host               90%      5958         -
    # 9  Extended offline    Aborted by host               90%      5951         -
    #10  Short offline       Completed without error       00%      5024         -
    #11  Extended offline    Aborted by host               80%      5024         -
    #12  Short offline       Completed without error       00%      3697         -
    #13  Short offline       Completed without error       00%       237         -
    #14  Short offline       Completed without error       00%       145         -
    #15  Short offline       Completed without error       00%        69         -
    #16  Extended offline    Completed without error       00%        68         -
    #17  Short offline       Completed without error       00%        66         -
    #18  Short offline       Completed without error       00%        49         -
    #19  Short offline       Completed without error       00%        29         -
    #20  Short offline       Completed without error       00%        29         -

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

These are the dmesg errors when it crashes (repeated many times for different sectors):

    [1755091.211136] sd 0:0:0:0: [sda] Unhandled error code
    [1755091.211144] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    [1755091.211151] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 08 fe ad 38 00 00 08 00
    [1755091.211166] end_request: I/O error, dev sda, sector 150908216
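
As a cross-check on the log above, the sector in the end_request line can be recovered from the CDB bytes. The sketch below assumes the standard SCSI Read(10) layout (opcode 0x28 in byte 0, a big-endian LBA in bytes 2-5, the transfer length in bytes 7-8). The function names are hypothetical.

```shell
#!/bin/sh
# Decode a SCSI Read(10) CDB given as space-separated hex bytes,
# in the form the kernel log prints them. Function names are hypothetical.

# Print the logical block address (bytes 2-5, big-endian).
read10_lba() {
    set -- $1    # split the byte string into positional parameters
    [ "$1" = "28" ] || { echo "not a Read(10) CDB" >&2; return 1; }
    echo $(( 0x$3$4$5$6 ))
}

# Print the transfer length in blocks (bytes 7-8, big-endian).
read10_blocks() {
    set -- $1
    [ "$1" = "28" ] || { echo "not a Read(10) CDB" >&2; return 1; }
    echo $(( 0x$8$9 ))
}

read10_lba    "28 00 08 fe ad 38 00 00 08 00"   # prints 150908216
read10_blocks "28 00 08 fe ad 38 00 00 08 00"   # prints 8
```

The decoded LBA matches the sector reported in the I/O error, and 8 blocks of 512 bytes is one 4096-byte read. That only confirms which request failed; it does not by itself say whether the drive, the cable or the controller was responsible.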

You can't, not reliably.

Or rather, you have already exhausted the options available to you.

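As Google's research found, failing disks do not necessarily show abnormal SMART values (the other way round is more reliable, though: a disk that does show them can be expected to fail).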

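Do not forget that even though much of computing is standardized, there are real hardware and software bugs, error margins that add up, and so on. The real world is imperfect, and it is not unheard of for a particular hard disk not to get along with a particular controller, or vice versa. Sometimes it is a firmware problem; sometimes an entirely different system component is misbehaving, for example a substandard PSU that produces spikes under particular load conditions. Even temperature variations, age... the list can be extended almost arbitrarily.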

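So the standard procedure here would be to put the disk into an entirely different system configuration and rerun the tests. But since you have already done a complete change of system, you have correctly concluded that the disk must be at fault. (Unless you have not told us everything and something did stay the same, in which case that conclusion would not hold; the cable or HBA comes to mind.)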

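Edit: I just realized there is one more option. You can look up whether a newer firmware version exists for this drive than the one it currently runs. If so, check the changelog for issues that might match your situation.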

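In summary, to establish complete confidence that the drive is misbehaving (in this particular case), you would need to send it back to the manufacturer.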

I think this is a bad controller. There are a few more things you can do to check the disk as well as the controller...

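Run "badblocks" on the drive. This is similar to the "dd" you ran.
Put another drive with good SMART status into the computer. If that disk gives you similar behavior, you know it is the hardware, not the disk, that is causing the problems. In that case I would suspect the controller.

You mentioned that you changed systems and it still gave you problems. Even so, I still think there must be a common component causing the instability. You could also look at: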

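Bad cables (was the cable moved to the second machine along with the drive?)
Improper configuration on the system (did you set the system up again for the different hardware?)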
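
One cheap way to look into the cable question is to watch SMART attribute 199 (UDMA_CRC_Error_Count), which counts CRC errors on the link between drive and controller. The sketch below assumes the column layout of the smartctl attribute table shown in the question, with the raw value in the tenth column. The helper name smart_raw is hypothetical.

```shell
#!/bin/sh
# Print the RAW_VALUE column of one SMART attribute, reading the text
# output of `smartctl -A <device>` on standard input.
# smart_raw is a hypothetical helper name.
smart_raw() {
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    awk -v name="$1" '$2 == name { print $10 }'
}
```

It would be used as smartctl -A /dev/sda | smart_raw UDMA_CRC_Error_Count, once now and once after the next crash. The question's output shows a raw value of 6; if that number climbs between runs, the cable or connector becomes a stronger suspect than the platters.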