Unable to reboot: hangs while syncing file systems

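I have a problem with a Linux server. About once a week the running MySQL instance hangs, and there is no way to stop it completely. If I kill it, it stays in a zombie state and init does not reap its PID.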

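The server is used for staging deployments and some internal tools, so it is not under heavy load. The only process that constantly uses the disk is MySQL, which is why I think it is the only process affected by this problem.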

I searched the system logs for errors, and the only thing I found is this error (repeated several times) in the dmesg output:

    [706560.640085] INFO: task mysqld:31965 blocked for more than 120 seconds.
    [706560.640198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [706560.640312] mysqld D ffff88032fd93f40 0 31965 1 0x00000000
    [706560.640317]  ffff880242a27d18 0000000000000086 ffff88031a50dd00 ffff880242a27fd8
    [706560.640321]  ffff880242a27fd8 ffff880242a27fd8 ffff88031e549740 ffff88031a50dd00
    [706560.640325]  ffff88031a50dd00 ffff88032fd947f8 0000000000000002 ffffffff8112f250
    [706560.640328] Call Trace:
    [706560.640338]  [<ffffffff8112f250>] ? __lock_page+0x70/0x70
    [706560.640344]  [<ffffffff816cb1b9>] schedule+0x29/0x70
    [706560.640347]  [<ffffffff816cb28f>] io_schedule+0x8f/0xd0
    [706560.640350]  [<ffffffff8112f25e>] sleep_on_page+0xe/0x20
    [706560.640353]  [<ffffffff816c9900>] __wait_on_bit+0x60/0x90
    [706560.640356]  [<ffffffff8112f390>] wait_on_page_bit+0x80/0x90
    [706560.640360]  [<ffffffff8107dce0>] ? autoremove_wake_function+0x40/0x40
    [706560.640363]  [<ffffffff8112f891>] filemap_fdatawait_range+0x101/0x190
    [706560.640366]  [<ffffffff81130975>] filemap_write_and_wait_range+0x65/0x70
    [706560.640371]  [<ffffffff8122e441>] ext4_sync_file+0x71/0x320
    [706560.640376]  [<ffffffff811c3e6d>] do_fsync+0x5d/0x90
    [706560.640379]  [<ffffffff811c40d0>] sys_fsync+0x10/0x20
    [706560.640383]  [<ffffffff816d495d>] system_call_fastpath+0x1a/0x1f

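When this happens, the only way to get everything running again is a full reboot. To do that, after manually stopping all running processes, I have to use this command: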

    echo b > /proc/sysrq-trigger

Otherwise the normal reboot process hangs forever. I traced the reboot scripts and found that the reboot hangs on the sync call in /etc/init.d/sendsigs (I'm on Ubuntu):

    # Flush the kernel I/O buffer before we start to kill
    # processes, to make sure the IO of already stopped services to
    # not slow down the remaining processes to a point where they
    # are accidentily killed with SIGKILL because they did not
    # manage to shut down in time.
    sync

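I'm almost sure this is a hardware problem (the RAID controller???), partly because I have two other machines with the same hardware and software configuration and they are not affected, but I can't find any hint in syslog or dmesg. I also installed the smartmontools and mcelog packages, but neither of them reports any problem.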

What can I do to track down the cause of this problem?

It happened again today. This is the state of the system after triggering a reboot:

    init─┬─console-kit-dae───64*[{console-kit-dae}]
         ├─dbus-daemon
         ├─mcelog
         ├─mysqld───{mysqld}
         ├─newrelic-daemon───newrelic-daemon───11*[{newrelic-daemon}]
         ├─ntpd
         ├─polkitd───{polkitd}
         ├─python3
         ├─rpc.idmapd
         ├─rpc.statd
         ├─rpcbind
         ├─sh───rc───S20sendsigs───sync
         ├─smartd
         ├─snmpd
         ├─sshd───sshd───zsh───sudo───zsh───pstree
         └─sshd───sshd───zsh───sudo───zsh

Here is the state of the sync process:

    # ps aux | grep sync
    root 3637 0.1 0.0 4352 372 ? D 05:53 0:00 sync

That is, uninterruptible sleep…

Hardware specs as reported by `lshw`:

I think the RAID controller is a fake RAID. I don't usually deal with hardware (and, for the record, I have no physical access to the machine).

    description: Computer
    product: X7DBP ()
    vendor: Supermicro
    version: 0123456789
    serial: 0123456789
    width: 64 bits
    capabilities: smbios-2.4 dmi-2.4 vsyscall32
    configuration: administrator_password=disabled boot=normal frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=53D19F64-D663-A017-8922-0030487C1FEE
      *-core
           description: Motherboard
           product: X7DBP
           vendor: Supermicro
           physical id: 0
           version: PCB Version
           serial: 0123456789
         *-firmware
              description: BIOS
              vendor: Phoenix Technologies LTD
              physical id: 0
              version: 6.00
              date: 05/29/2007
              size: 106KiB
              capacity: 960KiB
              capabilities: pci pnp upgrade shadowing escd cdboot bootselect edd int13floppy2880 acpi usb ls120boot zipboot biosbootspecification
         *-storage
              description: RAID bus controller
              product: 631xESB/632xESB SATA RAID Controller
              vendor: Intel Corporation
              physical id: 1f.2
              bus info: pci@0000:00:1f.2
              version: 09
              width: 32 bits
              clock: 66MHz
              capabilities: storage pm bus_master cap_list
              configuration: driver=ahci latency=0
              resources: irq:19 ioport:18a0(size=8) ioport:1874(size=4) ioport:1878(size=8) ioport:1870(size=4) ioport:1880(size=32) memory:d8500400-d85007ff

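Your process is in state D, which technically means uninterruptible sleep. But, as I always say, D means disk. A process in this state is waiting for a disk I/O operation to complete.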
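To see at a glance which processes are stuck in that state, something like the following may help. It is only a sketch: the function names `filter_d_state` and `list_d_state` are made up for illustration, and it assumes the procps version of `ps`, where `stat` is the state column and `wchan` is the kernel function the process is sleeping in.

```shell
# Keep the header line plus every row whose STAT column (2nd field) starts with "D".
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical wrapper: list PID, state, kernel wait channel and command name
# for all processes, then keep only the ones in uninterruptible sleep.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running `list_d_state` while the machine is hung should show both `sync` and `mysqld` if they are blocked on the same thing. On some kernels the `wchan` column may only show `-`.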

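We can see from your call trace that mysqld itself is trying to sync (it is inside fsync) and has been stuck for more than 120 seconds waiting for the sync to complete.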
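If you want to check whether other tasks have hit the same hung-task warning, a rough way is to pull the task name, PID and timeout out of the kernel log. This is a sketch only: `hung_tasks` is a name I made up, and the pattern assumes the message wording shown in your dmesg output.

```shell
# Read kernel log text on stdin and print "<task> <pid> <seconds>" for every
# "blocked for more than N seconds" warning found.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

Typical use would be `dmesg | hung_tasks | sort | uniq -c`. For a task that is still hung, reading `/proc/<pid>/stack` as root (where the kernel provides it) shows its current kernel stack, which you can compare with the trace above.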

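This indicates a problem with your storage subsystem. You should look at your hard drives and disk controller (if it's local disk) or at your network and SAN (if it's remote storage).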
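One way to narrow down which device is involved is to look at the in-flight request counter in `/proc/diskstats` while the hang is in progress. The sketch below assumes the standard diskstats layout (field 3 is the device name, field 12 is the number of I/Os currently in progress). The function names are hypothetical.

```shell
# Read /proc/diskstats-formatted text on stdin and print
# "<device> <in-flight requests>" for every device with pending I/O.
inflight_io() {
    awk '$12 > 0 { print $3, $12 }'
}

# Hypothetical wrapper reading the live kernel counters.
show_inflight_io() {
    inflight_io < /proc/diskstats
}
```

If `show_inflight_io` keeps reporting the same non-zero count for one device while `sync` sits in state D, that disk (or the controller port it hangs off) is the first thing I would look at, for example with `smartctl -a` on that device.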
