
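Input/output error on a clean ext3 partition: how to check what is wrong with a data block

I am having a problem with an ext3 partition on a CentOS 5 server (kernel version 2.6.18-164.15.1.el5) that uses an HP RAID controller: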

    hpacucli ctrl all show detail

    Smart Array P410 in Slot 1
       Bus Interface: PCI
       ...

The HP tools do not report any problems.

This is an ordinary ext3 partition with the block size set to 2K, and it is clean. fsck output:

    fsck 1.39 (29-May-2006)
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information

The file's inode also looks fine:

      File: `name.xxx'
      Size: 3126962     Blocks: 6124       IO Block: 4096   regular file
    Device: 6851h/26705d    Inode: 64579729    Links: 1
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2014-07-28 09:02:59.000000000 -0400
    Modify: 2014-07-28 09:02:59.000000000 -0400
    Change: 2014-07-28 09:02:59.000000000 -0400

One of the operations I cannot perform is copying the file:

    > cp /long_path/name.xxx .
    cp: reading `/long_path/name.xxx': Input/output error

To find out where the problem is, I ran dd to copy the file:

    > dd if=/long_path/name.xxx bs=2048 of=test
    dd: reading `/long_path/name.xxx': Input/output error
    222+0 records in
    222+0 records out
    454656 bytes (455 kB) copied, 0.042867 seconds, 10.6 MB/s

So I guess the problem is in block 223 of the file.

debugfs should help locate that block on disk:

    debugfs -R "stat name.xxx" /dev/sdf
    debugfs 1.39 (29-May-2006)
    Inode: 64579729   Type: regular    Mode:  0644   Flags: 0x0   Generation: 2900468317
    User:     0   Group:     0   Size: 3126962
    File ACL: 0    Directory ACL: 0
    Links: 1   Blockcount: 6124
    Fragment:  Address: 0    Number: 0    Size: 0
    ctime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
    atime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
    mtime: 0x53d64a03 -- Mon Jul 28 09:02:59 2014
    BLOCKS:
    (0):130402311, (1-4):130402844-130402847, (5-6):130484033-130484034,
    (7):130484036, (8-10):130484049-130484051, (11):130484055,
    (IND):130761221, (12-13):130761222-130761223, (14):130763791,
    (15):130763942, (16):130765268, (17-23):130838937-130838943,
    (24-46):130853946-130853968, (47-48):130855126-130855127,
    (49):130855215, (50-53):130856428-130856431,
    (54-104):130856533-130856583, (105-341):130856748-130856984,
    ... [MORE BLOCKS] ....
    TOTAL: 1531

So I guess the problematic data is in block 130856866.
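The mapping from a file block to a file system block can be checked with a little arithmetic on the extent that debugfs printed, `(105-341):130856748-130856984`. The sketch below assumes the logical block numbers in that listing are zero-based and that the extent is contiguous; `logical_to_physical` is a hypothetical helper, not part of e2fsprogs. dd read 222 complete 2K blocks (indices 0 to 221), so the first unread index is 222, while the guess above corresponds to index 223. Buffered reads are typically done in 4096-byte pages that cover two 2K blocks, so either of the two may be the failing one.

```shell
#!/bin/sh
# Hypothetical helper: map a zero-based logical file block to a physical
# file system block, using one contiguous extent from "debugfs stat".
logical_to_physical() {
    logical=$1      # block index within the file
    ext_first=$2    # first logical block of the extent, e.g. 105
    ext_last=$3     # last logical block of the extent, e.g. 341
    phys_first=$4   # physical block that backs ext_first, e.g. 130856748
    if [ "$logical" -lt "$ext_first" ] || [ "$logical" -gt "$ext_last" ]; then
        echo "block $logical is outside extent $ext_first-$ext_last" >&2
        return 1
    fi
    echo $(( phys_first + logical - ext_first ))
}

# First block dd could not read, counting from zero.
logical_to_physical 222 105 341 130856748
# The block guessed in the question.
logical_to_physical 223 105 341 130856748
```

Both candidates are worth keeping in mind when comparing against kernel or SMART output.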

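How can I get more information about that block? I ran badblocks and have a list of bad blocks. My guess is that I have to multiply the block number by 2 (the file system block size is 2K, while badblocks uses 1K by default). Also, badblocks checks a disk, not a partition, so maybe I should add some offset (there is only one partition on this disk, so probably not).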
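As a rough sketch of that unit conversion: one 2K file system block spans two of the 1K blocks that badblocks reports by default, so block N corresponds to 1K blocks 2N and 2N+1, counted from the start of the same device. The helper name `fs_block_to_1k_range` is made up for illustration, and the sketch assumes badblocks was run against the same device that holds the file system, with no extra offset.

```shell
#!/bin/sh
# Hypothetical helper: print the first and last 1K block (badblocks default
# unit) covered by one file system block.
fs_block_to_1k_range() {
    fs_block=$1
    fs_block_size=${2:-2048}              # bytes per file system block
    per_block=$(( fs_block_size / 1024 )) # 1K blocks per file system block
    first=$(( fs_block * per_block ))
    echo "$first $(( first + per_block - 1 ))"
}

fs_block_to_1k_range 130856866
```

Alternatively, running badblocks with `-b 2048` makes it report in 2K units, so its numbers can be compared with file system block numbers directly.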

    > fdisk -l /dev/sdf

    Disk /dev/sdf: 2000.3 GB, 2000365379584 bytes
    255 heads, 63 sectors/track, 243197 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

               Device Boot      Start         End      Blocks   Id  System
    /dev/cciss/c0d5p1   *           1      243197  1953479871   83  Linux

I also thought about using smartd. What should I look for?

    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:          0     1457         0  2887405961        0      65948.712          18
    write:         0        0         0           0        0      15056.493           0
    verify:        0        1         0   361901613        0       3591.720           0

    Non-medium error count:      226

    SMART Self-test log
    Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                              number   (hours)
    # 1  Background long   Failed in segment -->       -    34479         16845361 [0x3 0x11 0x0]
    # 2  Background short  Completed                   -       44                - [-   -    -]
    # 3  Background short  Completed                   -       39                - [-   -    -]
    # 4  Background long   Completed                   -        6                - [-   -    -]

    Long (extended) Self Test duration: 18500 seconds [308.3 minutes]

    Background scan results log
      Status: scan is active
        Accumulated power on time, hours:minutes 34541:56 [2072516 minutes]
        Number of background scans performed: 1139,  scan progress: 38.18%
        Number of background medium scans performed: 1139

       #  when        lba(hex)          [sk,asc,ascq]  reassign_status
       1  19215:06    0000000000014c61  [3,11,0]       Recovered via rewrite in-place
       2  19215:07    0000000000014c66  [3,11,0]       Recovered via rewrite in-place
       3  19413:28    0000000001010a31  [3,11,0]       Require Write or Reassign Blocks command
       4  19943:24    000000000001ea99  [3,11,0]       Recovered via rewrite in-place
       5  20152:23    00000000000232b8  [3,11,0]       Recovered via rewrite in-place
       6  31229:34    810000004087f984  [3,11,0]       Require Write or Reassign Blocks command
       7  33021:51    810000004087ba85  [3,11,0]       Require Write or Reassign Blocks command
       8  33021:51    000000004087ba9f  [3,11,0]       Require Write or Reassign Blocks command
       9  33021:52    000000004087bad6  [3,11,0]       Require Write or Reassign Blocks command
      10  33029:43    000000004087baa5  [3,11,0]       Require Write or Reassign Blocks command
      11  33055:27    000000004087bac3  [3,11,0]       Require Write or Reassign Blocks command
      12  33244:40    810000004087f9d6  [3,11,0]       Require Write or Reassign Blocks command
      13  33431:58    990000004087f105  [0,0,0]        Reassignment by disk failed
      14  33480:17    00000000463d7713  [3,11,0]       Require Write or Reassign Blocks command
      15  33480:19    00000000463d7723  [3,11,0]       Require Write or Reassign Blocks command
      16  33480:20    00000000463d7725  [3,11,0]       Require Write or Reassign Blocks command
      17  33480:28    81000000463d774e  [3,11,0]       Require Write or Reassign Blocks command
      18  33686:17    8100000044e50edc  [3,11,0]       Require Write or Reassign Blocks command
      19  34154:17    81000000432bef27  [3,11,0]       Require Write or Reassign Blocks command
      20  34463:43    810000001f32decd  [3,11,0]       Require Write or Reassign Blocks command
      21  34463:43    0000000030080000  [3,11,0]       Require Write or Reassign Blocks command

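How should I relate the smartctl output above (or any other output from smartd) to my original problem?

Also, shouldn't this be handled by the drive's own software?

BTW, I found the following link helpful for understanding the `debugfs -R` output. Maybe it will be useful to others.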

UPDATE

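After further research, I found that operations involving the problematic inode (such as the cp command above) trigger the following line in the kernel log: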

 kernel: cciss: cmd ffff810037e00000 has CHECK CONDITION sense key = 0x3

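The "sense key" is a status code and part of the SCSI standard (listed here and described in more detail here). Sense key 0x3 means MEDIUM ERROR.

So, to figure this out, I did the following.

Take your block number, multiply it by four, then add one: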

 (130856866 * 4) + 1 = 523427465

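This is the sector that produces the I/O error. The block size is 2K and sectors are 512 bytes, hence the factor of four. The extra one accounts for the partition's starting-sector offset.

To correlate this with SMART, we need to convert the value we now have to hex.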

    $ printf "0x%x\n" 523427465
    0x1f32de89
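The two steps above can be folded into one small helper. This is only a sketch under the answer's own assumptions (four 512-byte sectors per 2K block, and a starting offset of one sector); `fs_block_to_sector` is a hypothetical name, and both parameters should be adjusted if the partition layout differs.

```shell
#!/bin/sh
# Hypothetical helper: convert a file system block number to the disk
# sector where that block starts.
fs_block_to_sector() {
    fs_block=$1
    sectors_per_block=${2:-4}   # 2048-byte block / 512-byte sector
    start_offset=${3:-1}        # assumed starting sector of the partition
    echo $(( fs_block * sectors_per_block + start_offset ))
}

sector=$(fs_block_to_sector 130856866)
# Print the sector in decimal and in hex, ready to compare with smartctl.
printf '%d = 0x%x\n' "$sector" "$sector"
```

The block occupies that sector and the three that follow it, so any of the four could be the one that is actually unreadable.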

Now, when you correlate this with what SMART shows, one suspiciously close line stands out.

 20 34463:43 810000001f32decd [3,11,0] Require Write or Reassign Blocks command

How far off is it?

    $ bc -l
    bc 1.06.95
    Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
    This is free software with ABSOLUTELY NO WARRANTY.
    For details type `warranty'.
    obase=16
    ibase=16
    1F32DECD-1F32DE89
    44

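That is 0x44 = 68 sectors, so the two locations are only between 32768 and 34816 bytes apart, but we cannot tell which of the four sectors that make up the block is damaged.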
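The same distance can be computed without an interactive bc session. The sketch below follows the comparison made above and looks only at the low 32 bits (last eight hex digits) of each LBA, on the assumption that the leading byte in entries such as `810000001f32decd` is not part of the address; `low32` and `sector_distance` are hypothetical helper names.

```shell
#!/bin/sh
# Keep only the last eight hex digits of an LBA string.
low32() {
    printf '%s' "$1" | tail -c 8
}

# Hypothetical helper: absolute distance between two hex LBAs, in
# 512-byte sectors and in bytes.
sector_distance() {
    a=$(( 0x$(low32 "$1") ))
    b=$(( 0x$(low32 "$2") ))
    d=$(( a - b ))
    if [ "$d" -lt 0 ]; then
        d=$(( -d ))
    fi
    echo "$d sectors, $(( d * 512 )) bytes"
}

sector_distance 810000001f32decd 1f32de89
```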

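If I had to hazard a guess, I would say that a number of blocks around the same address are likely to report I/O errors (assuming the RAID stripe size is 32k or so).

Also, if the RAID fetches the stripe from another disk, a read may not cause a problem. Writes have to propagate to all disks in a RAID1 setup either way, so you may see failed writes but successful reads. If we assume the RAID card's stripe size is 32k, we can also assume that the damaged block and the block reported by SMART were both damaged by whatever happened to that platter. It is just that the SMART test read the first 32K from the good disk and the next 32K from the bad disk.

Modern hard disks keep "reserved sectors" and remap damaged sectors like these to one of those spare locations. Seeing that you are getting this error now, along with the "Reassignment by disk failed" message from SMART, I would say that one disk has run out of them.

As for doing something about it, that is a bit tricky. LBA addressing is an abstraction over the real disks underneath. You need to determine which disk is causing the problem, mark it as failed in the RAID array, and replace it.

In any case, you have a bad disk and you should plan to replace it as soon as possible.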

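That's a lot of process... but a few things jump out at me.

Your kernel version is 2.6.18-164.15.1.el5, which puts you at the EL5.4 level, from around March 2010.

I had persistent stability and corruption problems with ext3 file systems on EL5. Things were not fully resolved until mid-2012. In my worst case, I was working with a cloud infrastructure company that never updated kernels from the base release, so I started seeing these problems across thousands of EL5 servers.

Is there any chance you could update your OS/kernel/e2fsprogs, run fsck, and retry?

Also, if the kernel is from around 2010, your system's BIOS and Smart Array P410 firmware are probably out of date as well. What model of server is this?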

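Edit:

Your cciss CHECK_CONDITION error is a giveaway. At this point there is no need to deal with SMART at all. Run the HP Array Diagnostics Utility, and it will pull the error information into a report. Either way, I really hope this isn't a RAID5 array.

Can you post the output of hpacucli ctrl all show config detail?

The kernel log shows which blocks actually failed. You can read it from /var/log (probably /var/log/kernel.log) or from the output of the dmesg command.

Note: what you need is not the disk sector number but the partition- and file-system-specific block number. Kernels since around 2.4.x have reported these in dmesg.

Passing the -L flag to e2fsck sets the file system's bad block list to this list (any existing entries are cleared first; lowercase -l adds to the existing list instead). So the right procedure is as follows:

First, collect the list of bad blocks from dmesg.

Second, put them into a simple text file, like so: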

    cat >badblockfile.txt
    34252345
    3452345
    23452345

(Ctrl+D)

Third, run e2fsck with that file:

    e2fsck -f -y -C0 /dev/diskname -L badblockfile.txt
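When the list is longer than a handful of entries, it may be easier to build the file non-interactively. The sketch below is one possible way to do that: it keeps only purely numeric lines, sorts them, and drops duplicates. `clean_block_list` is a hypothetical helper, and the block numbers are the placeholder values from the example above, not real bad blocks.

```shell
#!/bin/sh
# Hypothetical helper: read candidate block numbers on stdin, one per line,
# and print only the purely numeric ones, sorted and without duplicates.
clean_block_list() {
    grep -E '^[0-9]+$' | sort -n -u
}

# Placeholder values; a duplicate is included to show that it is dropped.
printf '%s\n' 34252345 3452345 23452345 3452345 | clean_block_list
```

Redirecting the output to `badblockfile.txt` gives a file in the form e2fsck expects, with block numbers counted in units of the file system block size.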

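If you cannot interpret the dmesg output, post the relevant part as a comment or as an addition to your question.

Extension

Your file system has 2K blocks and starts at the first sector of the hard disk (which has 512-byte sectors). So the formula relating the file system block (which can be given to e2fsck) to the disk sector (in the dmesg output) is quite simple: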

 filesystem_block=(sector_no-1)/4

If the messages do not contain file-system-level block numbers, you can use this formula instead.
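A minimal sketch of that formula applied to a whole list of sectors: it reads sector numbers on stdin and prints the distinct file system blocks they fall into, using integer division. `sectors_to_fs_blocks` is a hypothetical name, the offset of one sector and the factor of four are the assumptions stated above, and the sample input reuses sector numbers that appear earlier in this thread (523427533 is 0x1f32decd).

```shell
#!/bin/sh
# Hypothetical helper: map 512-byte disk sectors to 2K file system blocks
# with filesystem_block = (sector_no - 1) / 4, rounding down.
sectors_to_fs_blocks() {
    while read -r sector; do
        echo $(( (sector - 1) / 4 ))
    done | sort -n -u
}

# Two sectors inside the same block, plus the LBA reported by SMART.
printf '%s\n' 523427465 523427466 523427533 | sectors_to_fs_blocks
```

Several sectors can land in the same block, which is why the output is deduplicated before it is handed to e2fsck.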

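Extra hint

One more hint: e2fsck has a -c flag. It invokes the badblocks tool before the check and marks any newly found bad blocks as bad. In my experience it is not very good, and in most cases it does not find all the bad blocks. In your place, I would run it in an infinite loop over a weekend (or at least overnight):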

 while true; do e2fsck -f -y -C0 -c /dev/sdf;done