Ubuntu server performance degrades over time

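I have a custom-built Ubuntu 11.04 server with a 6-disk software RAID 10 as its main drive. On it I mainly run PostgreSQL and a few other tools that stream data in from the network. Often, after a few hours of normal uptime, the server starts lagging on all kinds of processes. For example, after logging in it can take 10-15 seconds to get a shell prompt. top can take 5-10 seconds to appear. ls can take a second or two.

When I look at top, there is almost no CPU usage. The PostgreSQL server uses a fair amount of memory, but not enough to spill into swap.

I don't know where to go from here, other than suspecting the RAID 10 (I have only ever run software RAID 1 before).

Edit: output from top: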

    top - 11:56:03 up 1:46, 3 users, load average: 0.89, 0.73, 0.72
    Tasks: 119 total, 1 running, 118 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 93.5%id, 6.2%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 16325596k total, 3478248k used, 12847348k free, 20880k buffers
    Swap: 19534176k total, 0k used, 19534176k free, 3041992k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     1747 woodsp    20   0  109m  10m 4888 S    1  0.1   0:42.70 python
      357 root      20   0     0    0    0 S    0  0.0   0:00.40 jbd2/sda3-8
        1 root      20   0 24324 2284 1344 S    0  0.0   0:00.84 init
        2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
        3 root      20   0     0    0    0 S    0  0.0   0:00.24 ksoftirqd/0
        6 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
        7 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/0
        8 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
       10 root      20   0     0    0    0 S    0  0.0   0:00.02 ksoftirqd/1
       12 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/1
       13 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/2
       14 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/2:0
       15 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/2
       16 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/2
       17 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/3
       18 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/3:0
       19 root      20   0     0    0    0 S    0  0.0   0:00.02 ksoftirqd/3
       20 root      RT   0     0    0    0 S    0  0.0   0:00.01 watchdog/3
       21 root       0 -20     0    0    0 S    0  0.0   0:00.00 cpuset
       22 root       0 -20     0    0    0 S    0  0.0   0:00.00 khelper
       23 root      20   0     0    0    0 S    0  0.0   0:00.00 kdevtmpfs
       24 root       0 -20     0    0    0 S    0  0.0   0:00.00 netns
       26 root      20   0     0    0    0 S    0  0.0   0:00.00 sync_supers

df -h

    rpsharp@ncp-skookum:~$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3       1.8T  549G  1.2T  32% /
    udev            7.8G  4.0K  7.8G   1% /dev
    tmpfs           3.2G  492K  3.2G   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none            7.8G     0  7.8G   0% /run/shm
    /dev/sda2       952M  128K  952M   1% /boot/efi
    /dev/md0        5.5T  562G  4.7T  11% /usr/local

free -m

    psharp@ncp-skookum:~$ free -m
                 total       used       free     shared    buffers     cached
    Mem:         15942       3409      12533          0         20       2983
    -/+ buffers/cache:        405      15537
    Swap:        19076          0      19076

tail -50 /var/log/syslog

    Jul 3 06:31:32 ncp-skookum rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="1070" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
    Jul 3 06:39:01 ncp-skookum CRON[14211]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
    Jul 3 06:40:01 ncp-skookum CRON[14223]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
    Jul 3 07:00:01 ncp-skookum CRON[14328]: (woodsp) CMD (/home/woodsp/bin/mail_tweetupdate # email an update)
    Jul 3 07:00:01 ncp-skookum CRON[14327]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
    Jul 3 07:00:28 ncp-skookum sendmail[14356]: q63E0SoZ014356: from=woodsp, size=2328, class=0, nrcpts=2, msgid=<[email protected]>, relay=woodsp@localhost
    Jul 3 07:00:29 ncp-skookum sm-mta[14357]: q63E0Si6014357: from=<[email protected]>, size=2569, class=0, nrcpts=2, msgid=<[email protected]>, proto=ESMTP, daemon=MTA-v4, relay=localhost [127.0.0.1]
    Jul 3 07:00:29 ncp-skookum sendmail[14356]: q63E0SoZ014356: to=Spencer Wood <[email protected]>,Martin Lacayo <[email protected]>, ctladdr=woodsp (1004/1005), delay=00:00:01, xdelay=00:00:01, mailer=relay, pri=62328, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (q63E0Si6014357 Message accepted for delivery)
    Jul 3 07:00:29 ncp-skookum sm-mta[14359]: STARTTLS=client, relay=mx3.stanford.edu., version=TLSv1/SSLv3, verify=FAIL, cipher=DHE-RSA-AES256-SHA, bits=256/256
    Jul 3 07:00:29 ncp-skookum sm-mta[14359]: q63E0Si6014357: to=<[email protected]>,<[email protected]>, ctladdr=<[email protected]> (1004/1005), delay=00:00:01, xdelay=00:00:00, mailer=esmtp, pri=152569, relay=mx3.stanford.edu. [171.67.219.73], dsn=2.0.0, stat=Sent (Ok: queued as 8F3505802AC)
    Jul 3 07:09:08 ncp-skookum CRON[14396]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
    Jul 3 07:17:01 ncp-skookum CRON[14438]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    Jul 3 07:20:01 ncp-skookum CRON[14453]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
    Jul 3 07:39:01 ncp-skookum CRON[14551]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
    Jul 3 07:40:01 ncp-skookum CRON[14562]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
    Jul 3 08:00:01 ncp-skookum CRON[14668]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
    Jul 3 08:09:01 ncp-skookum CRON[14724]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
    Jul 3 08:17:01 ncp-skookum CRON[14766]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    Jul 3 08:20:01 ncp-skookum CRON[14781]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)
    Jul 3 08:39:01 ncp-skookum CRON[14881]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
    Jul 3 08:40:01 ncp-skookum CRON[14892]: (smmsp) CMD (test -x /etc/init.d/sendmail && /usr/share/sendmail/sendmail cron-msp)

Output of hdparm -t /dev/sd{a,b,c,d,e,f}. Does this look suspicious?

    /dev/sda:
     Timing buffered disk reads: 2 MB in 4.84 seconds = 423.39 kB/sec
    /dev/sdb:
     Timing buffered disk reads: 420 MB in 3.01 seconds = 139.74 MB/sec
    /dev/sdc:
     Timing buffered disk reads: 390 MB in 3.00 seconds = 129.87 MB/sec
    /dev/sdd:
     Timing buffered disk reads: 416 MB in 3.00 seconds = 138.51 MB/sec
    /dev/sde:
     Timing buffered disk reads: 422 MB in 3.00 seconds = 140.50 MB/sec
    /dev/sdf:
     Timing buffered disk reads: 416 MB in 3.01 seconds = 138.26 MB/sec

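In practice you are using both a physical storage controller, which exposes the disks to the Linux kernel (whether or not you use its built-in RAID function), and software RAID on top of it. You cannot rule out that your storage controller is poorly supported. Use the output of hdparm -t /dev/sd{a,b,c,d,e,f} to diagnose the problem (the command takes a while to run).

Since you are seeing extreme slowness on /dev/sda, I suspect a failing disk or a failing controller. Double-check that your storage controller is well supported, and try to replace /dev/sda as soon as possible.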
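If you want to repeat this check later without eyeballing the numbers, something like the following rough sketch could flag the outlier. It parses hdparm -t output and reports any device reading at less than half the median rate. The 0.5 threshold and the function names are my own assumptions, and the sample text is simply the figures posted in the question.

```python
import re
from statistics import median

# Matches "/dev/sdX: ... Timing buffered disk reads: ... = <rate> <unit>/sec",
# whether hdparm's two lines per device are kept or flattened onto one line.
HDPARM_RE = re.compile(
    r"(/dev/\w+):\s+Timing buffered disk reads:.*?=\s*([\d.]+)\s*(kB|MB|GB)/sec"
)
# Approximate conversion factors to MB/sec.
TO_MB = {"kB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def read_rates(hdparm_output):
    """Return {device: read rate in MB/sec} parsed from `hdparm -t` output."""
    return {
        dev: float(value) * TO_MB[unit]
        for dev, value, unit in HDPARM_RE.findall(hdparm_output)
    }


def slow_disks(rates, fraction=0.5):
    """Devices slower than `fraction` of the median rate (threshold is an assumption)."""
    cutoff = median(rates.values()) * fraction
    return sorted(dev for dev, rate in rates.items() if rate < cutoff)


# Figures copied from the question.
SAMPLE = """
/dev/sda: Timing buffered disk reads: 2 MB in 4.84 seconds = 423.39 kB/sec
/dev/sdb: Timing buffered disk reads: 420 MB in 3.01 seconds = 139.74 MB/sec
/dev/sdc: Timing buffered disk reads: 390 MB in 3.00 seconds = 129.87 MB/sec
/dev/sdd: Timing buffered disk reads: 416 MB in 3.00 seconds = 138.51 MB/sec
/dev/sde: Timing buffered disk reads: 422 MB in 3.00 seconds = 140.50 MB/sec
/dev/sdf: Timing buffered disk reads: 416 MB in 3.01 seconds = 138.26 MB/sec
"""

print(slow_disks(read_rates(SAMPLE)))
```

A single slow reading can also come from the disk being busy at the time (sda holds / here), so it is worth rerunning hdparm a few times before drawing conclusions.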

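I have an idea. The hdparm output you posted shows that the sda drive is very slow. That could be because:

a) your / and (part of) your RAID 10 are on the same disk, or...

b) there is a problem with some driver.

If you have upgraded the kernel, try the default one shipped with Ubuntu.

As @Oneiroi pointed out, you should try iotop while running programs in the background. You could run a single ls on the mounted RAID, and then run two ls at the same time on the RAID. If things slow down, that may be the cause.

Try using grep to search /var/log/dmesg and syslog for words such as hdd, kernel, raid or postgresql.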
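For what it's worth, here is a minimal sketch of that log search in Python, roughly equivalent to a case-insensitive grep. The keyword list comes from the suggestion above; the file paths are the usual Ubuntu locations, and the function names are just illustrative.

```python
import re

KEYWORDS = ("hdd", "kernel", "raid", "postgresql")


def matching_lines(lines, keywords=KEYWORDS):
    """Return the lines that contain any keyword, ignoring case."""
    pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan(paths=("/var/log/dmesg", "/var/log/syslog")):
    """Print matching lines from each log file, skipping unreadable ones."""
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                for line in matching_lines(handle):
                    print(f"{path}: {line}")
        except OSError as exc:
            print(f"skipping {path}: {exc}")
```

Calling scan() as root (or as a user in the adm group) should be enough. "kernel" will match a lot of lines in syslog, so you may want to narrow the keywords to things like ata, md or I/O error once you know what you are looking for.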

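Also, I would try marking sda as failed and removing it from the RAID. Then try hdparm again. If that fixes it, the problem is the sda disk.

Another possible scenario is that the problem is PostgreSQL. If possible, start the server without PostgreSQL and see whether the problem persists. If it does, shut down any other services you may be running. You could also try shutting down everything except PostgreSQL. If you can do that, you may find out what is causing the problem:

a) PostgreSQL

b) other services

c) handling large amounts of data

d) the system itself.

Based on those tests, you can say which kind of problem you have (a, b, c or d) and get better help.

Also, if @SilverbackNet gets a chance, he could tell us about his server; then we could see what the two servers have in common and work out a solution.

PS: Sorry for my bad English. Please edit and correct the mistakes; there must be many.

PS2: I hope this is helpful, but it is just a bunch of theories that I think might help :)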

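Two other possibilities:

Too much logging is going on, or log files are not being cleaned up properly. If they grow very large, it takes time to load and save them during normal operation.
A network connection or SSH problem. I had similar symptoms with Ubuntu 11: when I SSHed into the machine, the connection seemed to hang or respond very slowly, even for short sessions. With a monitor attached directly, however, it seemed as fast as ever. With Ubuntu 12 server, the problem went away.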

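As I was typing this up, it occurred to me that one possibility has been missed entirely.

Maybe something is triggering a huge number of context switches and/or interrupts? That would probably show up in top as a lot of system%, but either way, take a look at vmstat 1 and watch the in and cs columns. Then paste the results into your question.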
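If vmstat is not handy, the same two figures can be derived from /proc/stat, whose intr and ctxt lines hold the running totals of interrupts and context switches since boot. Below is a rough sketch that samples the file twice and prints per-second rates; the function names and the one-second interval are my own choices.

```python
import time


def read_counters(stat_text):
    """Return (interrupts, context_switches) totals from /proc/stat text."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # The first number on the "intr" line is the total across all IRQs.
        if fields and fields[0] in ("intr", "ctxt"):
            counters[fields[0]] = int(fields[1])
    return counters["intr"], counters["ctxt"]


def rates(before, after, seconds):
    """Per-second change between two (interrupts, context_switches) samples."""
    return tuple((b - a) / seconds for a, b in zip(before, after))


def sample(interval=1.0, path="/proc/stat"):
    """Sample the counters twice, `interval` seconds apart, and return the rates."""
    with open(path) as handle:
        first = read_counters(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        second = read_counters(handle.read())
    return rates(first, second, interval)
```

Running `print(sample())` in a loop while the machine is sluggish, and again while it is behaving, should show whether either rate jumps.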

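It looks like you have already checked the obvious things; this one is a puzzler.

You're not using LVM anywhere? (I don't see anything here that looks like an LVM device.) Snapshots kill performance on LVM.

Check that irqbalance is running and that it is distributing interrupts sensibly across the physical cores (not the hyperthreaded ones).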
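One quick way to see how interrupts are spread is to total the per-CPU columns of /proc/interrupts. The sketch below does only that; it is an illustration with a made-up function name, and it does not know which logical CPUs are hyperthread siblings, so you would still need to map those yourself.

```python
def per_cpu_totals(interrupts_text):
    """Sum the per-CPU counts in /proc/interrupts text, returning {cpu: total}."""
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()[1:]  # drop the "NN:" label
        for cpu, value in zip(cpus, fields):
            if not value.isdigit():
                break  # reached the controller/description columns
            totals[cpu] += int(value)
    return totals


def show(path="/proc/interrupts"):
    """Print the total interrupt count handled by each CPU."""
    with open(path) as handle:
        for cpu, total in per_cpu_totals(handle.read()).items():
            print(f"{cpu}: {total}")
```

If one CPU has handled nearly everything, irqbalance is either not running or not doing much. The per-IRQ rows in the file itself will show which device is responsible.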

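Which I/O scheduler are you using? Assuming you don't have a hardware RAID controller with lots of memory, deadline is probably the most sensible choice, but if you already have deadline configured, try CFQ.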
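The active scheduler for each disk is the bracketed entry in /sys/block/<device>/queue/scheduler, for example `noop deadline [cfq]`. Here is a small sketch that reads it for the six disks in the question; the device names are taken from the hdparm output above and the helper names are mine.

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) scheduler from a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", text)
    # Some devices list a single scheduler with no brackets; return it as-is.
    return match.group(1) if match else text.strip()


def schedulers(devices=("sda", "sdb", "sdc", "sdd", "sde", "sdf")):
    """Map each device that exists to its active I/O scheduler."""
    result = {}
    for dev in devices:
        path = Path("/sys/block") / dev / "queue" / "scheduler"
        if path.exists():
            result[dev] = active_scheduler(path.read_text())
    return result
```

`print(schedulers())` shows the current setting. Writing a scheduler name into the same sysfs file as root switches it until the next reboot, which makes it easy to compare deadline against CFQ.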

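What are the filesystem and mount options? How have you configured the disks (what does hdparm tell you for the IDE/SATA disks? Check the acoustic settings, DMA, read-ahead and caching)?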