How can I diagnose a "frozen" Linux software RAID device?

I have a server running Linux 3.2.12 (32-bit, i686) with 13 drives: 1 boot drive and 3 RAID5 arrays of 4 drives each.

/proc/mdstat shows:

    Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
          5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
    md1 : active raid5 sdk1[3] sdj1[2] sdi1[1] sdh1[0]
          4395407808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
    md3 : active raid5 sdl1[0] sdm1[1] sdf1[3] sde1[2]
          5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
    unused devices: <none>

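My problem is that, for the second time in three days, one of the RAID devices is causing every process that tries to read from it to lock up. No signal can kill those processes, and I have to reboot to get things running again. After the reboot, however, the drives look fine, the RAID status looks good, and the kernel log contains no useful error messages apart from the reports of hung tasks.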

I ran smartctl on all the drives and they seem fine.

What else can I check to try to diagnose this?

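Kernel log excerpts that look interesting are below. Note, though, that the "sending ioctl to a partition" message has always been there, and search results suggest it is a harmless warning.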

Every 900 seconds:

    ...
    Aug 20 18:34:01 [kernel] [ 931.249505] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:01 [kernel] [ 1831.302297] scsi_verify_blk_ioctl: 2 callbacks suppressed
    Aug 20 18:49:01 [kernel] [ 1831.302300] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:01 [kernel] [ 1831.302302] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:01 [kernel] [ 1831.302774] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:01 [kernel] [ 1831.302776] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:02 [kernel] [ 1831.333538] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:02 [kernel] [ 1831.333540] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:02 [kernel] [ 1831.358068] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:02 [kernel] [ 1831.358071] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:02 [kernel] [ 1831.414331] mdadm: sending ioctl 1261 to a partition!
    Aug 20 18:49:02 [kernel] [ 1831.414334] mdadm: sending ioctl 1261 to a partition!
    Aug 20 19:04:01 [kernel] [ 2731.070794] scsi_verify_blk_ioctl: 2 callbacks suppressed
    ...

Around the time the problem appeared:

    Aug 21 13:38:32 [kernel] [69601.312055] INFO: task kjournald:26008 blocked for more than 600 seconds.
    Aug 21 13:38:32 [kernel] [69601.312057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Aug 21 13:38:32 [kernel] [69601.312059] kjournald D 00000000 0 26008 2 0x00000000
    Aug 21 13:38:32 [kernel] [69601.312063] eb5ccc80 00000046 00000000 00000000 00000000 e81e0070 e81e020c f6205900
    Aug 21 13:38:32 [kernel] [69601.312068] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    Aug 21 13:38:32 [kernel] [69601.312072] 00000000 00000000 00000000 00000000 00000000 00000001 c0b66230 e81e0280
    Aug 21 13:38:32 [kernel] [69601.312077] Call Trace:
    Aug 21 13:38:32 [kernel] [69601.312083] [<c013cbe5>] ? prepare_to_wait+0x15/0x55
    Aug 21 13:38:32 [kernel] [69601.312088] [<c0217df5>] ? journal_commit_transaction+0xdb/0xca6
    Aug 21 13:38:32 [kernel] [69601.312090] [<c013ca68>] ? wake_up_bit+0x16/0x16
    Aug 21 13:38:32 [kernel] [69601.312093] [<c0132c3d>] ? lock_timer_base+0x19/0x35
    Aug 21 13:38:32 [kernel] [69601.312095] [<c0132cb8>] ? try_to_del_timer_sync+0x5f/0x65
    Aug 21 13:38:32 [kernel] [69601.312098] [<c021ade6>] ? kjournald+0xa6/0x1a2
    Aug 21 13:38:32 [kernel] [69601.312101] [<c013ca68>] ? wake_up_bit+0x16/0x16
    Aug 21 13:38:32 [kernel] [69601.312103] [<c021ad40>] ? journal_grab_journal_head+0x31/0x31
    Aug 21 13:38:32 [kernel] [69601.312106] [<c013c778>] ? kthread+0x65/0x6a
    Aug 21 13:38:32 [kernel] [69601.312108] [<c013c713>] ? kthread_stop+0x47/0x47
    Aug 21 13:38:32 [kernel] [69601.312111] [<c0830b36>] ? kernel_thread_helper+0x6/0xd

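First, upgrade your kernel. That particular kernel contains a bug that causes various ioctls to print these warnings (and possibly to fail) in some mdraid and LVM configurations.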
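One rough way to tell whether a newer kernel has made the warning go away is to count how often it appears in the kernel log before and after the upgrade. The helper below is only a sketch: the function name `count_ioctl_warnings` is made up for illustration, and the pattern is based solely on the message text quoted in the question.

```shell
# Hypothetical helper: count kernel-log lines carrying the
# "sending ioctl ... to a partition" warning.
# Reads log text on stdin and prints the number of matching lines.
count_ioctl_warnings() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status with "|| :".
    grep -c 'sending ioctl [0-9a-fx]* to a partition' || :
}

# Example use (run as root after booting the new kernel):
#   uname -r
#   dmesg | count_ioctl_warnings
```

A count of 0 after the system has been up for more than one 900-second cycle would suggest the warning is gone; it does not by itself prove that the hangs are fixed.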

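If a fixed kernel does not solve the problem, run an extended self-test on all the drives. Note that each drive's self-test can take several hours and slightly degrades performance while it runs, so it should be run when the system is less active. For example, to schedule the self-tests to start at 11 pm: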

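    # The delimiter is quoted so that $drive is expanded when the job
    # runs, not when it is queued.
    at 11 pm <<'JOB'
    for drive in /dev/sd?
    do
        smartctl -t long "$drive" || :
    done
    JOB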

Later the next day, check the test results:

    for drive in /dev/sd?
    do
        echo "Test results for drive $drive"
        smartctl -l selftest "$drive" || :
    done

If the kernel update did not solve the problem, you will probably find a drive that failed its self-test.
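With twelve array drives, scanning the self-test logs by eye is tedious, so a small filter can help. The following is a sketch under assumptions: `failed_selftests` is a hypothetical name, and it assumes the usual ATA self-test log layout in which each entry starts with `#` and a number, and a clean run reads "Completed without error". Aborted or interrupted tests are printed too, since they are not clean results either.

```shell
# Hypothetical helper: print self-test log entries that did not
# complete cleanly. Reads "smartctl -l selftest" output on stdin.
failed_selftests() {
    # Log entries start with "#" followed by the entry number.
    # Skip clean results and tests that are still running.
    awk '/^# *[0-9]+ / && !/Completed without error/ && !/in progress/ { print }'
}

# Example use (run as root):
#   for drive in /dev/sd?
#   do
#       echo "== $drive"
#       smartctl -l selftest "$drive" | failed_selftests
#   done
```

A drive that prints nothing under its heading has no unclean entries in its log; it is still worth confirming that the log actually contains the extended test you scheduled.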

If you cannot find a drive that failed its self-test, examine the drive attributes:

    for drive in /dev/sd?
    do
        echo "Attributes for drive $drive"
        smartctl -A "$drive" || :
    done

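Note that some of these attributes can indicate a problem even when they are not flagged as failing, so have an expert look them over, or add them to your question.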
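As a first pass before asking someone to review the full output, the attributes can be narrowed to a few that are commonly watched. This is a sketch, not a verdict on drive health: the function name `suspect_attributes` is hypothetical, the list of attribute names is my assumption (names vary by vendor), and it assumes the standard `smartctl -A` table layout with the name in column 2 and the raw value in column 10.

```shell
# Hypothetical helper: print commonly watched SMART attributes whose
# raw value is non-zero. Reads "smartctl -A" output on stdin.
# The attribute list is an assumption and is not exhaustive.
suspect_attributes() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ &&
        ($10 + 0) > 0 { print $2, $10 }
    '
}

# Example use (run as root):
#   for drive in /dev/sd?
#   do
#       echo "== $drive"
#       smartctl -A "$drive" | suspect_attributes
#   done
```

An empty result does not clear a drive, and a non-zero raw value does not condemn one, so the full attribute table is still the thing to show an expert.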