Strange, recurring excessive I/O wait

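I am well aware that I/O wait has been discussed many times on this site, but all the other threads seem to cover constant I/O latency, whereas the I/O problem we need to solve on our server occurs at irregular (short) intervals, yet is ever-present, with massive spikes of up to 20k ms await and service times of 2 seconds. The affected disk is /dev/sdb (Seagate Barracuda, details below).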

A typical iostat -x output sometimes looks like this. It is an extreme sample, but by no means a rare one:

 iostat (Oct 6, 2013)
    tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz    await   svctm  %util
   0.00     0.00     0.00     0.00     0.00     0.00    0.00   0.00
   0.00     0.00     0.00     0.00     0.00     0.00    0.00   0.00
  16.00     0.00   156.00     9.75    21.89   288.12   36.00  57.60
   5.50     0.00    44.00     8.00    48.79  2194.18  181.82 100.00
   2.00     0.00    16.00     8.00    46.49  3397.00  500.00 100.00
   4.50     0.00    40.00     8.89    43.73  5581.78  222.22 100.00
  14.50     0.00   148.00    10.21    13.76  5909.24   68.97 100.00
   1.50     0.00    12.00     8.00     8.57  7150.67  666.67 100.00
   0.50     0.00     4.00     8.00     6.31 10168.00 2000.00 100.00
   2.00     0.00    16.00     8.00     5.27 11001.00  500.00 100.00
   0.50     0.00     4.00     8.00     2.96 17080.00 2000.00 100.00
  34.00     0.00  1324.00     9.88     1.32   137.84    4.45  59.60
   0.00     0.00     0.00     0.00     0.00     0.00    0.00   0.00
  22.00    44.00   204.00    11.27     0.01     0.27    0.27   0.60

Let me give you some more information about the hardware. It is a Dell 1950 III box with Debian as the OS, where uname -a reports the following:

 Linux xx 2.6.32-5-amd64 #1 SMP Fri Feb 15 15:39:52 UTC 2013 x86_64 GNU/Linux

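The machine is a dedicated server that hosts an online game, with no databases or I/O-heavy applications running. The core application consumes about 0.8 of the 8 GB of RAM, and the average CPU load is relatively low. The game itself, however, reacts rather sensitively to I/O latency, so our players experience massive in-game lag, which we would like to address as soon as possible.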

 iostat:
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.77    0.01    1.05    1.59    0.00   95.58
 Device:     tps  Blk_read/s  Blk_wrtn/s    Blk_read    Blk_wrtn
 sdb       13.16       25.42      135.12   504701011  2682640656
 sda        1.52        0.74       20.63    14644533   409684488

Uptime is:

 19:26:26 up 229 days, 17:26, 4 users, load average: 0.36, 0.37, 0.32

Hard disk controller:

 01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

Hard disks:

 Array 1, RAID-1, 2x Seagate Cheetah 15K.5 73 GB SAS
 Array 2, RAID-1, 2x Seagate ST3500620SS Barracuda ES.2 500GB 16MB 7200RPM SAS

Partition information from df:

 Filesystem  1K-blocks      Used  Available  Use%  Mounted on
 /dev/sdb1   480191156  30715200  425083668    7%  /home
 /dev/sda2     7692908    437436    6864692    6%  /
 /dev/sda5    15377820   1398916   13197748   10%  /usr
 /dev/sda6    39159724  19158340   18012140   52%  /var

Some more data samples generated with iostat -dx sdb 1 (Oct 11, 2013):

 Device: rrqm/s wrqm/s   r/s    w/s rsec/s  wsec/s avgrq-sz avgqu-sz   await   svctm  %util
 sdb       0.00  15.00  0.00  70.00   0.00  656.00     9.37     4.50    1.83    4.80  33.60
 sdb       0.00   0.00  0.00   2.00   0.00   16.00     8.00    12.00  836.00  500.00 100.00
 sdb       0.00   0.00  0.00   3.00   0.00   32.00    10.67     9.96 1990.67  333.33 100.00
 sdb       0.00   0.00  0.00   4.00   0.00   40.00    10.00     6.96 3075.00  250.00 100.00
 sdb       0.00   0.00  0.00   0.00   0.00    0.00     0.00     4.00    0.00    0.00 100.00
 sdb       0.00   0.00  0.00   2.00   0.00   16.00     8.00     2.62 4648.00  500.00 100.00
 sdb       0.00   0.00  0.00   0.00   0.00    0.00     0.00     2.00    0.00    0.00 100.00
 sdb       0.00   0.00  0.00   1.00   0.00   16.00    16.00     1.69 7024.00 1000.00 100.00
 sdb       0.00  74.00  0.00 124.00   0.00 1584.00    12.77     1.09   67.94    6.94  86.00
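
To pick the spikes out of a longer capture without reading every row, the samples can be filtered by await. The following is only a rough sketch in Python: it assumes the eleven-column iostat -dx layout shown above (rrqm/s through %util), and the function names and the 500 ms threshold are arbitrary choices for illustration, not part of sysstat.

```python
# Sketch: flag iostat -dx samples whose await exceeds a threshold.
# Assumed column order (after the device name), as in the output above.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]

# The nine sdb samples from Oct 11, 2013, copied from the question.
SAMPLE = """\
sdb 0.00 15.00 0.00 70.00 0.00 656.00 9.37 4.50 1.83 4.80 33.60
sdb 0.00 0.00 0.00 2.00 0.00 16.00 8.00 12.00 836.00 500.00 100.00
sdb 0.00 0.00 0.00 3.00 0.00 32.00 10.67 9.96 1990.67 333.33 100.00
sdb 0.00 0.00 0.00 4.00 0.00 40.00 10.00 6.96 3075.00 250.00 100.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 100.00
sdb 0.00 0.00 0.00 2.00 0.00 16.00 8.00 2.62 4648.00 500.00 100.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.00
sdb 0.00 0.00 0.00 1.00 0.00 16.00 16.00 1.69 7024.00 1000.00 100.00
sdb 0.00 74.00 0.00 124.00 0.00 1584.00 12.77 1.09 67.94 6.94 86.00
"""


def parse_rows(text, device="sdb"):
    """Turn iostat -dx lines for one device into dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and other devices.
        if len(fields) != len(COLUMNS) + 1 or fields[0] != device:
            continue
        rows.append(dict(zip(COLUMNS, map(float, fields[1:]))))
    return rows


def spikes(rows, await_ms=500.0):
    """Return the samples whose await (in ms) is above the threshold."""
    return [row for row in rows if row["await"] > await_ms]


if __name__ == "__main__":
    for row in spikes(parse_rows(SAMPLE)):
        print("await %.2f ms, svctm %.2f ms, util %.1f%%"
              % (row["await"], row["svctm"], row["%util"]))
```

In practice the text would come from a saved capture of iostat -dx sdb 1 rather than the embedded sample.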

Characteristic charts generated with rrdtool can be found here:

iostat plot 1, 24 min interval: http://imageshack.us/photo/my-images/600/yqm3.png/

iostat plot 2, 120 min interval: http://imageshack.us/photo/my-images/407/griw.png/

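As we have a rather large cache of 5.5 GB, we thought it might be a good idea to test whether the I/O wait spikes were caused by cache miss events. So we did a sync and then flushed the cache and buffers: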

 echo 3 > /proc/sys/vm/drop_caches

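Directly afterwards, the I/O wait and service times virtually went through the roof, and everything on the machine felt like slow motion. Over the next few hours the latency recovered and everything was as before: small to medium lags at short, unpredictable intervals.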
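
When repeating an experiment like this, it may help to record how much memory is actually held in buffers and page cache before and after the flush, so the recovery can be followed over time. A minimal sketch in Python, assuming a Linux /proc/meminfo with the usual "Name: value kB" lines; the function name is made up, and the sample figures in the comment are invented for illustration, not taken from this server.

```python
# Sketch: sum the Buffers and Cached fields of /proc/meminfo (values in kB).
def cache_kib(meminfo_text):
    """Return Buffers + Cached in kB from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # The first field after the colon is the numeric value.
            values[name.strip()] = int(fields[0])
    return values.get("Buffers", 0) + values.get("Cached", 0)


# Invented sample resembling a box with roughly 5.5 GB of cache.
EXAMPLE = """\
MemTotal:        8000000 kB
MemFree:          500000 kB
Buffers:          300000 kB
Cached:          5200000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            print("buffers + cache: %d kB" % cache_kib(fh.read()))
    except OSError:
        # Not a Linux system; fall back to the invented sample.
        print("buffers + cache (sample): %d kB" % cache_kib(EXAMPLE))
```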

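Now my question is: does anybody have any idea what might cause this annoying behaviour? Is it the first sign of the disk array or the RAID controller dying, or something that can easily be mended by rebooting? (At the moment we are very reluctant to do that, because we are afraid the disks might not come back up again.)

Any help is greatly appreciated.

Thanks in advance, Chris.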

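Edited to add: we do see one or two processes go to "D" state in top, one of which, rather frequently, seems to be logging-related. If I am not mistaken, however, this does not indicate the processes causing the latency but rather those affected by it. Correct me if I am wrong. Does the information about processes in uninterruptible sleep help us in any way to address the problem?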
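
For anyone who wants to capture which processes sit in "D" state during a spike instead of watching top, here is a rough Python sketch. It assumes a Linux /proc filesystem where /proc/<pid>/stat has the form "pid (comm) state ..."; the function names are made up for illustration. Processes in uninterruptible sleep are typically the ones waiting on I/O, so the list shows who is blocked, not necessarily who caused the stall.

```python
import glob


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_processes():
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Running this once a second during a lag spike and logging the output with a timestamp would give a record to compare against the iostat samples.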

@Andy Shinn requested smartctl data, so here it is:

smartctl -a -d megaraid,2 /dev/sdb yields:

 smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
 Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
 Device: SEAGATE ST3500620SS Version: MS05
 Serial number:
 Device type: disk
 Transport protocol: SAS
 Local Time is: Mon Oct 14 20:37:13 2013 CEST
 Device supports SMART and is Enabled
 Temperature Warning Disabled or Not Supported
 SMART Health Status: OK
 Current Drive Temperature: 20 C
 Drive Trip Temperature: 68 C
 Elements in grown defect list: 0
 Vendor (Seagate) cache information
   Blocks sent to initiator = 1236631092
   Blocks received from initiator = 1097862364
   Blocks read from cache and sent to initiator = 1383620256
   Number of read and write commands whose size <= segment size = 531295338
   Number of read and write commands whose size > segment size = 51986460
 Vendor (Seagate/Hitachi) factory information
   number of hours powered up = 36556.93
   number of minutes until next internal SMART test = 32
 Error counter log:
           Errors Corrected by        Total    Correction     Gigabytes      Total
               ECC         rereads/   errors    algorithm     processed   uncorrected
           fast | delayed  rewrites  corrected  invocations  [10^9 bytes]    errors
 read:    509271032    47       0   509271079    509271079     20981.423        0
 write:           0     0       0           0            0      5022.039        0
 verify: 1870931090   196       0  1870931286   1870931286    100558.708        0
 Non-medium error count: 0
 SMART Self-test log
 Num  Test              Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
      Description                  number   (hours)
 # 1  Background short  Completed       16     36538              - [-   -    -]
 # 2  Background short  Completed       16     36514              - [-   -    -]
 # 3  Background short  Completed       16     36490              - [-   -    -]
 # 4  Background short  Completed       16     36466              - [-   -    -]
 # 5  Background short  Completed       16     36442              - [-   -    -]
 # 6  Background long   Completed       16     36420              - [-   -    -]
 # 7  Background short  Completed       16     36394              - [-   -    -]
 # 8  Background short  Completed       16     36370              - [-   -    -]
 # 9  Background long   Completed       16     36364              - [-   -    -]
 #10  Background short  Completed       16     36361              - [-   -    -]
 #11  Background long   Completed       16         2              - [-   -    -]
 #12  Background short  Completed       16         0              - [-   -    -]
 Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

smartctl -a -d megaraid,3 /dev/sdb yields:

 smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
 Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
 Device: SEAGATE ST3500620SS Version: MS05
 Serial number:
 Device type: disk
 Transport protocol: SAS
 Local Time is: Mon Oct 14 20:37:26 2013 CEST
 Device supports SMART and is Enabled
 Temperature Warning Disabled or Not Supported
 SMART Health Status: OK
 Current Drive Temperature: 19 C
 Drive Trip Temperature: 68 C
 Elements in grown defect list: 0
 Vendor (Seagate) cache information
   Blocks sent to initiator = 288745640
   Blocks received from initiator = 1097848399
   Blocks read from cache and sent to initiator = 1304149705
   Number of read and write commands whose size <= segment size = 527414694
   Number of read and write commands whose size > segment size = 51986460
 Vendor (Seagate/Hitachi) factory information
   number of hours powered up = 36596.83
   number of minutes until next internal SMART test = 28
 Error counter log:
           Errors Corrected by        Total    Correction     Gigabytes      Total
               ECC         rereads/   errors    algorithm     processed   uncorrected
           fast | delayed  rewrites  corrected  invocations  [10^9 bytes]    errors
 read:    610862490    44       0   610862534    610862534     20470.133        0
 write:           0     0       0           0            0      5022.480        0
 verify: 2861227413   203       0  2861227616   2861227616    100872.443        0
 Non-medium error count: 1
 SMART Self-test log
 Num  Test              Status     segment  LifeTime  LBA_first_err [SK ASC ASQ]
      Description                  number   (hours)
 # 1  Background short  Completed       16     36580              - [-   -    -]
 # 2  Background short  Completed       16     36556              - [-   -    -]
 # 3  Background short  Completed       16     36532              - [-   -    -]
 # 4  Background short  Completed       16     36508              - [-   -    -]
 # 5  Background short  Completed       16     36484              - [-   -    -]
 # 6  Background long   Completed       16     36462              - [-   -    -]
 # 7  Background short  Completed       16     36436              - [-   -    -]
 # 8  Background short  Completed       16     36412              - [-   -    -]
 # 9  Background long   Completed       16     36404              - [-   -    -]
 #10  Background short  Completed       16     36401              - [-   -    -]
 #11  Background long   Completed       16         2              - [-   -    -]
 #12  Background short  Completed       16         0              - [-   -    -]
 Long (extended) Self Test duration: 6798 seconds [113.3 minutes]

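(I assume your disks are attached directly to the server and not over NFS, for example.)

What matters is that your svctm (in the iostat output) is extremely high, which suggests a hardware problem with the RAID or the disks. Svctm for a normal disk should be around 4 (ms). It may be lower, but not much higher.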

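Unfortunately, the smartctl output is not informative in your case. It shows corrected errors, but that may be normal. The long test seems to have completed OK, but that is inconclusive as well. The ST3500620SS appears to be a good old server/RAID-type disk, which should respond quickly to read errors (unlike desktop/non-RAID disks), so this may be a more complicated hardware issue than just bad sectors. Try to find something unusual (such as high error rates) in the RAID statistics: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS

My suggestion is that replacing the disks should be the next step.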

Update:

Svctm is the more important factor, since the high utilization is just a consequence of svctm being abnormally high.
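
The link between the two columns can be checked against the numbers in the question. As a rough sketch, and assuming iostat's %util is the share of each second the device spent busy, utilization should be about (r/s + w/s) x svctm / 10 when svctm is in milliseconds. The helper name below is made up; the tuples are rows from the iostat -dx sdb 1 output posted above.

```python
# (r/s, w/s, svctm in ms, reported %util), copied from the question's samples.
SAMPLES = [
    (0.00, 70.00, 4.80, 33.60),
    (0.00, 2.00, 500.00, 100.00),
    (0.00, 3.00, 333.33, 100.00),
    (0.00, 4.00, 250.00, 100.00),
    (0.00, 1.00, 1000.00, 100.00),
    (0.00, 124.00, 6.94, 86.00),
]


def estimated_util(reads_per_s, writes_per_s, svctm_ms):
    """Estimate %util from request rate and average service time."""
    busy_ms_per_second = (reads_per_s + writes_per_s) * svctm_ms
    return busy_ms_per_second / 10.0  # 1000 ms busy per second is 100 %


if __name__ == "__main__":
    for r, w, svctm, reported in SAMPLES:
        print("estimated %6.2f%%  reported %6.2f%%"
              % (estimated_util(r, w, svctm), reported))
```

With only two writes per second, a 500 ms service time is already enough to keep the device busy for the whole second, which is why %util sits at 100 during the spikes even though almost nothing is being written.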

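I saw a similar problem when desktop disks were installed in a Promise RAID. Desktop disks are designed to try to repair read errors through many long retries, which adds latency (these read errors may be due to other factors, such as vibration, which is much stronger in a server room than on a desktop). In contrast, disks designed for use in RAID simply report any error to the RAID controller quickly, and the controller can correct it using the RAID redundancy. In addition, server disks can be designed to be more mechanically resistant to constant strong vibration. There is a common myth that server disks are the same as desktop ones, just more expensive. That is wrong: they really are different.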

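Q: Well, what I wanted to ask: if it is a hardware problem, don't you think it should be visible continuously rather than disappearing for some time? Do you happen to have an explanation for this effect?

A:

The problem may always be there, but it only becomes noticeable under high load.
The vibration level may differ at different times (for example, depending on what nearby servers are doing). If your problem is that the disks are affected by vibration, it can certainly disappear and reappear. I saw similar behaviour when I had my "desktop disks" problem. (Of course, your disks are server disks recommended for RAID, so it is not exactly the same problem, but it may be similar.)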

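I had a very similar problem, with an IBM ServeRaid M5110 (rebranded LSI 9265-8i) and CentOS 6.x.

The first VD was a RAID0 of 4 IBM-branded Hitachi drives.

Then we bought Samsung PM853T SSDs, installed them in another 4 drive bays, and created another RAID0. When we switched our workload from the platters to the SSDs, I/O would spike every hour and all operations would come to a halt. The load would rise from the normal ~2 to over 80. After a few dozen seconds everything would calm down and the applications would continue working.

This never happened on the platters.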

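So my first impression was some kind of incompatibility between LSI and Samsung. A couple of days later, after a lot of head scratching and debugging, I found that MegaCli64 was the culprit. We run it via Zabbix to monitor the drives' health, but when scanning the controller, MegaCli would stall waiting on the SSDs, a dozen seconds or more for each SSD, almost two minutes in total. That would drop all I/O to zero and make iowait and load skyrocket.

The solution was to find a MegaCli version that did not cause the problem. We downloaded the version from the IBM site.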
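
One way to check whether a monitoring command is stalling like this is to time it outside the monitoring system and compare the timestamps with the I/O spikes. A minimal Python sketch follows; the helper name and the default timeout are arbitrary, and the exact MegaCli arguments used by the Zabbix check are not given above, so the command is left for the reader to supply.

```python
import subprocess
import time


def timed_run(cmd, timeout_s=120.0):
    """Run cmd and return (elapsed_seconds, returncode).

    returncode is None if the command did not finish within timeout_s.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout_s)
        returncode = proc.returncode
    except subprocess.TimeoutExpired:
        returncode = None  # subprocess.run has already killed the child
    return time.monotonic() - start, returncode
```

Calling timed_run with the same argument list the monitoring agent uses, and logging the elapsed time next to the output of iostat, should show whether the stalls line up with the check interval.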