XServe RAID problems with Ubuntu 10.04

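I have an Apple XServe RAID connected via Fibre Channel to a Dell PowerEdge R610. The server is mainly used to host Subversion repositories and to store disk images. Over the past 6 months or so we have had a few problems with this setup. It seems fine when the load is minimal, but a few days ago, while copying some large disk images, it threw a bunch of errors and the filesystem was remounted read-only.

The actual error messages start with a bunch of task aborts: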

    May 17 15:20:09 sub0 kernel: [4661904.506886] mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)
    May 17 15:20:09 sub0 kernel: [4661904.506890] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 2c ea 00 04 00 00
    May 17 15:20:09 sub0 kernel: [4661904.507219] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88011d2aea00)
    ...
    May 17 15:21:42 sub0 kernel: [4661997.476282] mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)
    May 17 15:21:42 sub0 kernel: [4661997.476284] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
    May 17 15:21:42 sub0 kernel: [4661997.494532] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88011e632c00)
    May 17 15:21:42 sub0 kernel: [4661997.494589] mptscsih: ioc1: attempting bus reset! (sc=ffff88011e632c00)
    May 17 15:21:42 sub0 kernel: [4661997.494592] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
    May 17 15:21:42 sub0 kernel: [4661997.495403] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff88011e632c00)
    May 17 15:21:52 sub0 kernel: [4662007.498403] mptscsih: ioc1: attempting host reset! (sc=ffff88011e632c00)
    May 17 15:21:52 sub0 kernel: [4662007.498411] mptbase: ioc1: Initiating recovery
    May 17 15:22:02 sub0 kernel: [4662016.680666] mptscsih: ioc1: host reset: SUCCESS (sc=ffff88011e632c00)
    May 17 15:22:12 sub0 kernel: [4662026.686900] sd 2:0:0:0: Device offlined - not ready after error recovery
    ...
    May 17 15:22:12 sub0 kernel: [4662026.687032] sd 2:0:0:0: [sdb] Unhandled error code
    May 17 15:22:12 sub0 kernel: [4662026.687034] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
    May 17 15:22:12 sub0 kernel: [4662026.687037] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
    May 17 15:22:12 sub0 kernel: [4662026.720494] lost page write due to I/O error on sdb1
    ...
    May 17 15:22:12 sub0 kernel: [4662027.117326] sd 2:0:0:0: [sdb] Unhandled error code
    May 17 15:22:12 sub0 kernel: [4662027.117328] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    May 17 15:22:12 sub0 kernel: [4662027.117331] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 2c ea
    May 17 15:22:12 sub0 kernel: [4662027.117339] 00 04 00 00
    May 17 15:22:12 sub0 kernel: [4662027.122264] sd 2:0:0:0: [sdb] Unhandled error code
    May 17 15:22:12 sub0 kernel: [4662027.122266] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    May 17 15:22:12 sub0 kernel: [4662027.122268] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 30 ea 00 04 00 00
    May 17 15:22:12 sub0 kernel: [4662027.125053] sd 2:0:0:0: [sdb] Unhandled error code
    May 17 15:22:12 sub0 kernel: [4662027.125055] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    May 17 15:22:12 sub0 kernel: [4662027.125058] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 18 52 00 04 00 00
    May 17 15:22:12 sub0 kernel: [4662027.127869] sd 2:0:0:0: [sdb] Unhandled error code
    May 17 15:22:12 sub0 kernel: [4662027.127871] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
    May 17 15:22:12 sub0 kernel: [4662027.127874] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 10 62 00 03 e8 00
    ...
    May 17 15:22:12 sub0 kernel: [4662027.130737] sd 2:0:0:0: [sdb] Unhandled error code
    May 17 15:22:12 sub0 kernel: [4662027.405150] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
    May 17 15:22:12 sub0 kernel: [4662027.405152] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 34 ea 00 04 00 00
    May 17 15:22:12 sub0 kernel: [4662027.410575] JBD: Detected IO errors while flushing file data on sdb1
    May 17 15:22:13 sub0 kernel: [4662028.182860] JBD: Detected IO errors while flushing file data on sdb1
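As a reading aid, the CDB bytes in these log lines can be decoded by hand. The sketch below assumes the standard SCSI WRITE(10) layout (opcode 0x2a in byte 0, a 32-bit big-endian logical block address in bytes 2-5, a 16-bit big-endian transfer length in bytes 7-8). The function name `decode_write10` is made up for illustration and is not part of any tool.

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks the command asked to write.
    """
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    # First CDB from the log excerpt.
    lba, blocks = decode_write10("2a 00 a8 17 2c ea 00 04 00 00")
    print(f"LBA={lba:#x} blocks={blocks}")
```

Most of the failing writes carry a transfer length of 0x0400, i.e. 1024 blocks each. If the device uses 512-byte logical blocks (an assumption; the log does not say), that is 512 KiB per command, which fits the large sequential copy that triggered the errors.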

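At this point the array gets remounted read-only. I'm at a loss as to what the problem might be (I'm relatively new to dealing with this type of Fibre Channel / RAID array).

System information (let me know if I can provide anything else that might be useful):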

    sysadmin@sub0:~$ lspci (snipped to the relevant stuff I presume)
    03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
    05:00.0 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
    05:00.1 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
    sysadmin@sub0:~$ cat /proc/mpt/summary
    ioc0: LSIFC949E, FwRev=01031700h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:1B:89:14, IRQ=40
    ioc1: LSISAS1068E B3, FwRev=00192f00h, Ports=1, MaxQ=266, IRQ=16
    ioc2: LSIFC949E, FwRev=01031700h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:1B:89:15, IRQ=50
    sysadmin@sub0:~$ cat /proc/mpt/version
    mptlinux-3.04.12
    Fusion MPT base driver
    Fusion MPT FC host driver
    Fusion MPT SAS host driver
    sysadmin@sub0:~$ cat /etc/issue
    Ubuntu 10.04.2 LTS \n \l

Full /var/log/messages: https://gist.github.com/96df4b5b9ac7ec46f74c#file_messages

Full /var/log/kern.log: https://gist.github.com/96df4b5b9ac7ec46f74c#file_kern.log

Thanks for taking the time to read this and for any help you can offer.

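It would help to know more about how the RAID is actually configured, such as volumes, sizes, RAID level, stripe and chunk sizes, and so on, and whether multipathing is in use.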

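You are seeing an error-handling escalation because the aborted commands are not being handled to the satisfaction of the low-level driver and the SCSI midlayer, which is why the recovery severity keeps climbing (task abort, then target reset, bus reset, host reset, and finally the device being taken offline). Working out how it got there in the first place will require a lot of analysis, such as recording a blktrace. With this very limited information, all I can recommend is to try an LTS backport kernel (e.g. Oneiric) to get a newer driver and then attempt to reproduce the problem; the mptsas driver you are using is very old. If you look hard enough, you can also find a DKMS package to update the driver.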
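To see the escalation at a glance in a long kern.log, something like the following could summarize how far recovery went for each command. This is only a sketch: the regular expressions are fitted to the message format in the excerpt from the question, and the names `LEVELS`, `summarize`, and `SAMPLE` are made up for illustration.

```python
import re

# Recovery steps in the order they appear in the log, mildest first.
LEVELS = ["task abort", "target reset", "bus reset", "host reset"]

# e.g. "mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)"
ATTEMPT = re.compile(
    r"mptscsih: (?P<ioc>ioc\d+): attempting (?P<step>[a-z ]+)! "
    r"\(sc=(?P<sc>[0-9a-f]+)\)"
)
# e.g. "sd 2:0:0:0: Device offlined - not ready after error recovery"
OFFLINED = re.compile(r"sd (?P<dev>[\d:]+): Device offlined")


def summarize(lines):
    """Return ({(ioc, sc): highest step attempted}, [offlined devices])."""
    highest = {}
    offlined = []
    for line in lines:
        m = ATTEMPT.search(line)
        if m and m.group("step") in LEVELS:
            key = (m.group("ioc"), m.group("sc"))
            step = m.group("step")
            prev = highest.get(key)
            # Keep only the most severe step seen for this command.
            if prev is None or LEVELS.index(step) > LEVELS.index(prev):
                highest[key] = step
            continue
        m = OFFLINED.search(line)
        if m:
            offlined.append(m.group("dev"))
    return highest, offlined


# A few lines copied from the excerpt in the question.
SAMPLE = [
    "May 17 15:20:09 sub0 kernel: [4661904.506886] mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)",
    "May 17 15:21:42 sub0 kernel: [4661997.476282] mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)",
    "May 17 15:21:42 sub0 kernel: [4661997.494589] mptscsih: ioc1: attempting bus reset! (sc=ffff88011e632c00)",
    "May 17 15:21:52 sub0 kernel: [4662007.498403] mptscsih: ioc1: attempting host reset! (sc=ffff88011e632c00)",
    "May 17 15:22:12 sub0 kernel: [4662026.686900] sd 2:0:0:0: Device offlined - not ready after error recovery",
]

if __name__ == "__main__":
    highest, offlined = summarize(SAMPLE)
    for (ioc, sc), step in highest.items():
        print(f"{ioc} sc={sc}: escalated to {step}")
    print("offlined:", ", ".join(offlined))
```

Run against the full log (for example `summarize(open("/var/log/kern.log"))`), this shows which commands only needed an abort and which went all the way to a host reset, which is a useful starting point before setting up a blktrace.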

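If you still have the problem after that, you will have to weigh your own ability to dig in and resolve it against seeking additional support from the OS vendor. These are exactly the kinds of problems that support contracts exist to solve. Whichever path you take, be prepared to spend weeks, not days, getting to the root cause. Good luck.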