高负载平均，高等待，dmesg raid错误信息（debian nfs服务器）

Debian 6在HP raid（2 CPU）上运行raid（2 * 1.5T RAID1 + 2 * 2T RAID1join了RAID0以生成3.5T），主要运行nfs＆imapd（加上samba for windows share＆local www预览网页）。与本地ubuntu桌面客户端挂载$ HOME，通过nfs / smb访问imap＆odd文件（例如video）的笔记本电脑; 通过家庭路由器/交换机连接100baseT或wifi

uname -a

Linux prole 2.6.32-5-686 #1 SMP Wed Jan 11 12:29:30 UTC 2012 i686 GNU/Linux

安装程序已经运行了好几个月，但是间歇性地变得很慢（桌面从服务器挂载$ HOME，或者在笔记本电脑上播放video的用户体验），现在一直如此糟糕，以至于我不得不深入研究以找出问题所在（！）

服务器在低负载下似乎可以，例如（笔记本电脑）客户端（在本地磁盘上有$ HOME），连接到服务器的imapd和nfs，安装RAID以访问1个文件：顶部显示负载约0.1或更less，0等待

但是当（桌面）客户端安装$ HOME并启动用户KDE会话（所有访问服务器）时，顶部显示例如

 top - 13:41:17 up 3:43, 3 users, load average: 9.29, 9.55, 8.27 Tasks: 158 total, 1 running, 157 sleeping, 0 stopped, 0 zombie Cpu(s): 0.4%us, 0.4%sy, 0.0%ni, 49.0%id, 49.7%wa, 0.0%hi, 0.5%si, 0.0%st Mem: 903856k total, 851784k used, 52072k free, 171152k buffers Swap: 0k total, 0k used, 0k free, 476896k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3935 root 20 0 2456 1088 784 R 2 0.1 0:00.02 top 1 root 20 0 2028 680 584 S 0 0.1 0:01.14 init 2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd 3 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0 4 root 20 0 0 0 0 S 0 0.0 0:00.12 ksoftirqd/0 5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0 6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1 7 root 20 0 0 0 0 S 0 0.0 0:00.16 ksoftirqd/1 8 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1 9 root 20 0 0 0 0 S 0 0.0 0:00.42 events/0 10 root 20 0 0 0 0 S 0 0.0 0:02.26 events/1 11 root 20 0 0 0 0 S 0 0.0 0:00.00 cpuset 12 root 20 0 0 0 0 S 0 0.0 0:00.00 khelper 13 root 20 0 0 0 0 S 0 0.0 0:00.00 netns 14 root 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr 15 root 20 0 0 0 0 S 0 0.0 0:00.00 pm 16 root 20 0 0 0 0 S 0 0.0 0:00.02 sync_supers 17 root 20 0 0 0 0 S 0 0.0 0:00.02 bdi-default 18 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/0 19 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/1 20 root 20 0 0 0 0 S 0 0.0 0:00.02 kblockd/0 21 root 20 0 0 0 0 S 0 0.0 0:00.08 kblockd/1 22 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpid 23 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpi_notify 24 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpi_hotplug 25 root 20 0 0 0 0 S 0 0.0 0:00.00 kseriod 28 root 20 0 0 0 0 S 0 0.0 0:04.19 kondemand/0 29 root 20 0 0 0 0 S 0 0.0 0:02.93 kondemand/1 30 root 20 0 0 0 0 S 0 0.0 0:00.00 khungtaskd 31 root 20 0 0 0 0 S 0 0.0 0:00.18 kswapd0 32 root 25 5 0 0 0 S 0 0.0 0:00.00 ksmd 33 root 20 0 0 0 0 S 0 0.0 0:00.00 aio/0 34 root 20 0 0 0 0 S 0 0.0 0:00.00 aio/1 35 root 20 0 0 0 0 S 0 0.0 0:00.00 crypto/0 36 root 20 0 0 0 0 S 0 0.0 0:00.00 crypto/1 203 root 20 0 0 0 0 S 0 0.0 0:00.00 ksuspend_usbd 204 root 20 0 0 0 0 S 0 0.0 0:00.00 khubd 205 root 20 0 0 0 0 S 0 0.0 0:00.00 ata/0 206 root 20 0 0 0 0 S 0 0.0 0:00.00 ata/1 207 root 20 0 0 0 0 S 0 0.0 0:00.14 ata_aux 208 root 20 0 0 0 0 S 0 0.0 0:00.01 scsi_eh_0

dmesg表示存在磁盘问题：

 .............. (previous episode) [13276.966004] raid1:md0: read error corrected (8 sectors at 489900360 on sdc7) [13276.966043] raid1: sdb7: redirecting sector 489898312 to another mirror [13279.569186] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13279.569211] ata4.00: irq_stat 0x40000008 [13279.569230] ata4.00: failed command: READ FPDMA QUEUED [13279.569257] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13279.569262] res 41/40:00:05:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13279.569306] ata4.00: status: { DRDY ERR } [13279.569321] ata4.00: error: { UNC } [13279.575362] ata4.00: configured for UDMA/133 [13279.575388] ata4: EH complete [13283.169224] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13283.169246] ata4.00: irq_stat 0x40000008 [13283.169263] ata4.00: failed command: READ FPDMA QUEUED [13283.169289] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13283.169294] res 41/40:00:07:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13283.169331] ata4.00: status: { DRDY ERR } [13283.169345] ata4.00: error: { UNC } [13283.176071] ata4.00: configured for UDMA/133 [13283.176104] ata4: EH complete [13286.224814] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13286.224837] ata4.00: irq_stat 0x40000008 [13286.224853] ata4.00: failed command: READ FPDMA QUEUED [13286.224879] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13286.224884] res 41/40:00:06:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13286.224922] ata4.00: status: { DRDY ERR } [13286.224935] ata4.00: error: { UNC } [13286.231277] ata4.00: configured for UDMA/133 [13286.231303] ata4: EH complete [13288.802623] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13288.802646] ata4.00: irq_stat 0x40000008 [13288.802662] ata4.00: failed command: READ FPDMA QUEUED [13288.802688] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13288.802693] res 41/40:00:05:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13288.802731] ata4.00: status: { DRDY ERR } [13288.802745] ata4.00: error: { UNC } [13288.808901] ata4.00: configured for UDMA/133 [13288.808927] ata4: EH complete [13291.380430] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13291.380453] ata4.00: irq_stat 0x40000008 [13291.380470] ata4.00: failed command: READ FPDMA QUEUED [13291.380496] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13291.380501] res 41/40:00:05:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13291.380577] ata4.00: status: { DRDY ERR } [13291.380594] ata4.00: error: { UNC } [13291.386517] ata4.00: configured for UDMA/133 [13291.386543] ata4: EH complete [13294.347147] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 [13294.347169] ata4.00: irq_stat 0x40000008 [13294.347186] ata4.00: failed command: READ FPDMA QUEUED [13294.347211] ata4.00: cmd 60/08:00:00:6a:05/00:00:23:00:00/40 tag 0 ncq 4096 in [13294.347217] res 41/40:00:06:6a:05/00:00:23:00:00/40 Emask 0x409 (media error) <F> [13294.347254] ata4.00: status: { DRDY ERR } [13294.347268] ata4.00: error: { UNC } [13294.353556] ata4.00: configured for UDMA/133 [13294.353583] sd 3:0:0:0: [sdc] Unhandled sense code [13294.353590] sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [13294.353599] sd 3:0:0:0: [sdc] Sense Key : Medium Error [current] [descriptor] [13294.353610] Descriptor sense data with sense descriptors (in hex): [13294.353616] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 [13294.353635] 23 05 6a 06 [13294.353644] sd 3:0:0:0: [sdc] Add. Sense: Unrecovered read error - auto reallocate failed [13294.353657] sd 3:0:0:0: [sdc] CDB: Read(10): 28 00 23 05 6a 00 00 00 08 00 [13294.353675] end_request: I/O error, dev sdc, sector 587557382 [13294.353726] ata4: EH complete [13294.366953] raid1:md0: read error corrected (8 sectors at 489900544 on sdc7) [13294.366992] raid1: sdc7: redirecting sector 489898496 to another mirror

而且他们发生的频率很高，我猜这可能是导致性能问题的原因之一（？）

＃dmesg | grep镜像

 [12433.561822] raid1: sdc7: redirecting sector 489900464 to another mirror [12449.428933] raid1: sdb7: redirecting sector 489900504 to another mirror [12464.807016] raid1: sdb7: redirecting sector 489900512 to another mirror [12480.196222] raid1: sdb7: redirecting sector 489900520 to another mirror [12495.585413] raid1: sdb7: redirecting sector 489900528 to another mirror [12510.974424] raid1: sdb7: redirecting sector 489900536 to another mirror [12526.374933] raid1: sdb7: redirecting sector 489900544 to another mirror [12542.619938] raid1: sdc7: redirecting sector 489900608 to another mirror [12559.431328] raid1: sdc7: redirecting sector 489900616 to another mirror [12576.553866] raid1: sdc7: redirecting sector 489900624 to another mirror [12592.065265] raid1: sdc7: redirecting sector 489900632 to another mirror [12607.621121] raid1: sdc7: redirecting sector 489900640 to another mirror [12623.165856] raid1: sdc7: redirecting sector 489900648 to another mirror [12638.699474] raid1: sdc7: redirecting sector 489900656 to another mirror [12655.610881] raid1: sdc7: redirecting sector 489900664 to another mirror [12672.255617] raid1: sdc7: redirecting sector 489900672 to another mirror [12672.288746] raid1: sdc7: redirecting sector 489900680 to another mirror [12672.332376] raid1: sdc7: redirecting sector 489900688 to another mirror [12672.362935] raid1: sdc7: redirecting sector 489900696 to another mirror [12674.201177] raid1: sdc7: redirecting sector 489900704 to another mirror [12698.045050] raid1: sdc7: redirecting sector 489900712 to another mirror [12698.089309] raid1: sdc7: redirecting sector 489900720 to another mirror [12698.111999] raid1: sdc7: redirecting sector 489900728 to another mirror [12698.134006] raid1: sdc7: redirecting sector 489900736 to another mirror [12719.034376] raid1: sdc7: redirecting sector 489900744 to another mirror [12734.545775] raid1: sdc7: redirecting sector 489900752 to another mirror [12734.590014] raid1: sdc7: redirecting sector 489900760 to another mirror [12734.624050] raid1: sdc7: redirecting sector 489900768 to another mirror [12734.647308] raid1: sdc7: redirecting sector 489900776 to another mirror [12734.664657] raid1: sdc7: redirecting sector 489900784 to another mirror [12734.710642] raid1: sdc7: redirecting sector 489900792 to another mirror [12734.721919] raid1: sdc7: redirecting sector 489900800 to another mirror [12734.744732] raid1: sdc7: redirecting sector 489900808 to another mirror [12734.779330] raid1: sdc7: redirecting sector 489900816 to another mirror [12782.604564] raid1: sdb7: redirecting sector 1242934216 to another mirror [12798.264153] raid1: sdc7: redirecting sector 1242935080 to another mirror [13245.832193] raid1: sdb7: redirecting sector 489898296 to another mirror [13261.376929] raid1: sdb7: redirecting sector 489898304 to another mirror [13276.966043] raid1: sdb7: redirecting sector 489898312 to another mirror [13294.366992] raid1: sdc7: redirecting sector 489898496 to another mirror

尽pipearrays仍在所有磁盘上运行，但它们还没有放弃：

＃cat / proc / mdstat

 Personalities : [raid1] [raid0] md10 : active raid0 md0[0] md1[1] 3368770048 blocks super 1.2 512k chunks md1 : active raid1 sde2[2] sdd2[1] 1464087824 blocks super 1.2 [2/2] [UU] md0 : active raid1 sdb7[0] sdc7[2] 1904684920 blocks super 1.2 [2/2] [UU] unused devices: <none>

所以我想我有一些想法是什么问题，但我不是一个Linux系统pipe理员专家的想象力的最遥远的延伸，真的很感谢一些线索检查在这里与我的诊断，我需要做什么：

显然我需要为sdc找另一个驱动器。（我猜如果价格合适，我可以买一个更大的驱动器：我想有一天我需要增加arrays的大小，这将是一个较小的驱动器来取代较大的一个）
然后使用mdadm使现有的sdc失效，将其删除并装入新的驱动器
使用与旧的驱动器相同的磁盘arrays大小的磁盘驱动器
使用mdadm将新驱动器添加到arrays中

那声音OK？

通常当你看到磁盘出现错误时，磁盘往往暂停一会儿，尝试纠正错误本身，而Linux RAID将容许一些磁盘等待，直到它标记为坏。这个磁盘暂停可能是导致缓慢下降（特别是在你看到的错误率）。

您的计划更换驱动器是正确的。不过，我不build议用replaceRAID分区的概念来获得更大的驱动程序，然后再使用其他的东西。更接近原始磁盘（大小和速度）以保持arrays一致性更为明智。也就是说，在理论上，你可能会变大，并为数组replace分割确切的replace大小，然后再分割另一个分割（甚至是数组的另一个成员）。

一个可能有助于debugging的侧面说明：我喜欢使用一个工具作为顶层的替代品，它被称为atop（ http://www.atoptool.nl/ ），它可以让你更多地看到每个磁盘和使用磁盘I的进程/ O，瓶颈在哪里（您可能会注意到，等待I / O是针对特定磁盘的问题）。

如何减lessTIME_WAIT中的套接字数量？