我们最近升级了一个NFS服务器(称为“ian”),在升级之后,NFS被破坏,而之前没有。 该服务器现在运行Debian“lenny”,内核版本为2.6.26-2; 这是一个真正古老的戴尔PowerEdge 2650.以下是我们在票据系统中logging的问题的详细信息:
以下是一些与此相关的日志消息:
kern.log:
Oct 7 10:37:50 ian kernel: [60831.093676] statd: server localhost not responding, timed out Oct 7 10:37:50 ian kernel: [60831.093739] lockd: cannot monitor jade-30.example.com Oct 7 10:38:20 ian kernel: [60871.021326] statd: server localhost not responding, timed out Oct 7 10:38:20 ian kernel: [60871.021403] lockd: cannot monitor jade-30.example.com Oct 7 10:38:50 ian kernel: [60913.065830] statd: server localhost not responding, timed out Oct 7 10:38:50 ian kernel: [60913.065892] lockd: cannot monitor jade-30.example.com Oct 7 10:39:20 ian kernel: [60956.671592] statd: server localhost not responding, timed out Oct 7 10:39:20 ian kernel: [60956.671592] lockd: cannot monitor jade-30.example.com Oct 7 10:39:50 ian kernel: [61000.167660] statd: server localhost not responding, timed out Oct 7 10:39:50 ian kernel: [61000.167738] lockd: cannot unmonitor jade-43.example.com Oct 7 10:40:20 ian kernel: [61046.297937] statd: server localhost not responding, timed out Oct 7 10:40:20 ian kernel: [61046.298000] lockd: cannot unmonitor jade-48.example.com Oct 7 10:40:50 ian kernel: [61097.378121] statd: server localhost not responding, timed out Oct 7 10:40:50 ian kernel: [61097.378190] lockd: cannot unmonitor jade-56.example.com Oct 7 10:41:20 ian kernel: [61140.563304] rpcbind: server jade-42.example.com not responding, timed out Oct 7 10:41:20 ian kernel: [61140.566912] statd: server localhost not responding, timed out Oct 7 10:41:20 ian kernel: [61140.566912] lockd: cannot monitor jade-30.example.com Oct 7 10:41:20 ian kernel: [61140.563304] rpcbind: server jade-27.example.com not responding, timed out Oct 7 10:41:50 ian kernel: [61185.324223] rpcbind: server jade-30.example.com not responding, timed out Oct 7 10:41:50 ian kernel: [61185.324223] statd: server localhost not responding, timed out Oct 7 10:41:50 ian kernel: [61185.324223] lockd: cannot monitor jade-30.example.com [etc.]
daemon.log中:
Oct 7 10:59:33 ian rpc.statd[3991]: Can't callback ian (100021,3), giving up. [and many more]
(nb 100021是locking的RPC服务。)
不久之后,这是在kern.log:
Oct 7 11:02:20 ian kernel: [62963.093217] ------------[ cut here ]------------ Oct 7 11:02:20 ian kernel: [62963.093217] kernel BUG at kernel/workqueue.c:273! Oct 7 11:02:20 ian kernel: [62963.093217] invalid opcode: 0000 [#1] <5>rpcbind: server jade-52.example.com not responding, timed out Oct 7 11:02:20 ian kernel: [62963.093217] SMP Oct 7 11:02:20 ian kernel: [62963.093217] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler ppdev parport_pc lp parport ipv6 nfsd auth_rpcgss exportfs nfs lockd nfs_acl sunrpc dm_snapshot dm_mirror dm_log dm_mod snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr shpchp pci_hotplug i2c_piix4 i2c_core button sworks_agp agpgart evdev dcdbas ext3 jbd mbcache sg sd_mod ide_cd_mod cdrom ata_generic libata dock ohci_hcd floppy aacraid usbcore tg3 scsi_mod e1000 serverworks ide_pci_generic ide_core thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] Oct 7 11:02:20 ian kernel: [62963.093217] Oct 7 11:02:20 ian kernel: [62963.093217] Pid: 3999, comm: rpciod/0 Not tainted (2.6.26-2-686 #1) Oct 7 11:02:20 ian kernel: [62963.093217] EIP: 0060:[<c012f0a0>] EFLAGS: 00010203 CPU: 0 Oct 7 11:02:20 ian kernel: [62963.093217] EIP is at run_workqueue+0x66/0xf2 Oct 7 11:02:20 ian kernel: [62963.093217] EAX: 00000000 EBX: f7de6f80 ECX: dd54f2dc EDX: dd54f2dc Oct 7 11:02:20 ian kernel: [62963.093217] ESI: dd54f2d8 EDI: f8b72afd EBP: 00000000 ESP: dec0bfac Oct 7 11:02:20 ian kernel: [62963.093217] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Oct 7 11:02:20 ian kernel: [62963.093217] Process rpciod/0 (pid: 3999, ti=dec0a000 task=f75595e0 task.ti=dec0a000) Oct 7 11:02:20 ian kernel: [62963.093217] Stack: f7de6f80 c012f789 f7de6f8c c012f83c 00000000 f75595e0 c0131a44 dec0bfc8 Oct 7 11:02:20 ian kernel: [62963.093217] dec0bfc8 f7de6f80 00000000 c0131983 c013194b 00000000 c01044f7 dd5bde04 Oct 7 11:02:20 ian kernel: [62963.093217] 00000000 00000000 00000000 00000000 00000000 Oct 7 11:02:20 ian kernel: [62963.093217] Call Trace: Oct 7 11:02:20 ian kernel: [62963.093217] [<c012f789>] worker_thread+0x0/0xbd Oct 7 11:02:20 ian kernel: [62963.093217] [<c012f83c>] worker_thread+0xb3/0xbd Oct 7 11:02:20 ian kernel: [62963.093217] [<c0131a44>] autoremove_wake_function+0x0/0x2d Oct 7 11:02:20 ian kernel: [62963.093217] [<c0131983>] kthread+0x38/0x5d Oct 7 11:02:20 ian kernel: [62963.093217] [<c013194b>] kthread+0x0/0x5d Oct 7 11:02:20 ian kernel: [62963.093217] [<c01044f7>] kernel_thread_helper+0x7/0x10 Oct 7 11:02:20 ian kernel: [62963.093217] ======================= Oct 7 11:02:20 ian kernel: [62963.093217] Code: 7e 0c 89 73 18 8b 11 8b 41 04 89 42 04 89 10 89 49 04 89 09 f0 fe 03 fb 90 8d b4 26 00 00 00 00 90 8b 41 fc 83 e0 fc 39 d8 74 04 <0f> 0b eb fe f0 0f ba 71 fc 00 89 f0 ff d7 89 e0 25 00 e0 ff ff Oct 7 11:02:20 ian kernel: [62963.093217] EIP: [<c012f0a0>] run_workqueue+0x66/0xf2 SS:ESP 0068:dec0bfac Oct 7 11:02:20 ian kernel: [62963.097562] ---[ end trace 661db88ca1ec8ada ]---
考虑到这个问题似乎在这个内核错误被logging之前就已经开始了,它可能不是问题的原因,或者实际上是相关的。 这个内核错误只出现了一次,即使多次重启之后问题再次出现。
虽然这个问题昨天才开始(可能是前一天晚上伊恩的操作系统升级造成的),但直到今天才引起我们的注意。
其他一些debuggingtesting:
$ rpcinfo -p localhost program vers proto port 100000 2 tcp 111 portmapper 100000 2 udp 111 portmapper 100021 1 udp 58030 nlockmgr 100021 3 udp 58030 nlockmgr 100021 4 udp 58030 nlockmgr 100021 1 tcp 52602 nlockmgr 100021 3 tcp 52602 nlockmgr 100021 4 tcp 52602 nlockmgr 100024 1 udp 55130 status 100024 1 tcp 36082 status 100003 2 udp 2049 nfs 100003 3 udp 2049 nfs 100003 4 udp 2049 nfs 100003 2 tcp 2049 nfs 100003 3 tcp 2049 nfs 100003 4 tcp 2049 nfs 100005 1 udp 60186 mountd 100005 1 tcp 59282 mountd 100005 2 udp 60186 mountd 100005 2 tcp 59282 mountd 100005 3 udp 60186 mountd 100005 3 tcp 59282 mountd $ rpcinfo -u localhost nlockmgr rpcinfo: RPC: Timed out program 100021 version 0 is not available
我们发现在重启ian之后(我们做了几次)NFS确实有一段时间了。 当它工作时,上面的命令给出了这个输出:
$ rpcinfo -u localhost nlockmgr program 100021 version 1 ready and waiting rpcinfo: RPC: Program/version mismatch; low version = 1, high version = 4 program 100021 version 2 is not available program 100021 version 3 ready and waiting program 100021 version 4 ready and waiting
我们重新启动到以前的内核(从2007-04-07 vmlinuz-2.6.20.6),我们仍然有问题。
我们启用了lockd / nlockmgr / nlm的debugging消息,如下所示:#echo 2047> / proc / sys / sunrpc / nlm_debug
这些消息并不明显有帮助。 这是从什么时候NFS工作好吗:
Oct 8 15:45:12 ian kernel: [ 222.037132] lockd: request from 192.168.246.17, port=896 Oct 8 15:45:12 ian kernel: [ 222.037151] lockd: LOCK_MSG called Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_lookup_host(192.168.240.55->192.168.246.17, p=17, v=4, my role=server, name=jade-45.example.com) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: get host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: LOCK called Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_lookup_host(192.168.240.55->192.168.246.17, p=17, v=4, my role=server, name=jade-45.example.com) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: get host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nsm_monitor(jade-45.example.com) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_lookup_file (01060001 fc22606e bd478947 777f518d 58ff17ac 028b086c 999bb476 00000000) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: creating file for (01060001 fc22606e bd478947 777f518d 58ff17ac 028b086c 999bb476 00000000) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: found file f5c21080 (count 0) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlmsvc_lock(sda8/42666092, ty=1, pi=1886, 0-9223372036854775807, bl=1) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_lookup_host(192.168.240.55->192.168.246.17, p=17, v=4, my role=server, name=jade-45.example.com) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: get host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlmsvc_lookup_block f=f5c21080 pd=1886 0-9223372036854775807 ty=1 Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: get host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: created block f6b882c0... Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: vfs_lock_file returned 0 Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: freeing block f6b882c0... Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: release host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_release_file(f5c21080, ct = 2) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: release host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlmsvc_lock returned 0 Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: LOCK status 0 Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: release host jade-45.example.com Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_release_file(f5c21080, ct = 1) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: call procedure 12 on jade-45.example.com (async) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: nlm_bind_host(192.168.240.55->192.168.246.17) Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: 20145 callback returned 0 Oct 8 15:45:12 ian kernel: [ 222.037171] lockd: release host jade-45.example.com
locking后的同样的事情没有回应:
Oct 8 15:48:09 ian kernel: [ 461.693824] lockd: request from 192.168.246.96, port=940 Oct 8 15:48:09 ian kernel: [ 461.693901] lockd: LOCK_MSG called Oct 8 15:48:09 ian kernel: [ 461.693974] lockd: nlm_lookup_host(192.168.240.55->192.168.246.96, p=17, v=4, my role=server, name=librarybrowser-06.local) Oct 8 15:48:09 ian kernel: [ 461.694072] lockd: get host librarybrowser-06.local Oct 8 15:48:09 ian kernel: [ 461.694142] lockd: LOCK called Oct 8 15:48:09 ian kernel: [ 461.694212] lockd: nlm_lookup_host(192.168.240.55->192.168.246.96, p=17, v=4, my role=server, name=librarybrowser-06.local) Oct 8 15:48:09 ian kernel: [ 461.694308] lockd: get host librarybrowser-06.local Oct 8 15:48:09 ian kernel: [ 461.694377] lockd: nsm_monitor(librarybrowser-06.local) Oct 8 15:48:09 ian kernel: [ 461.694448] lockd: nlm_lookup_file (01060001 fc22606e bd478947 777f518d 58ff17ac 027c0411 999bb54b 00000000) Oct 8 15:48:09 ian kernel: [ 461.694543] lockd: creating file for (01060001 fc22606e bd478947 777f518d 58ff17ac 027c0411 999bb54b 00000000) Oct 8 15:48:09 ian kernel: [ 461.694644] lockd: open failed (error 117440512) Oct 8 15:48:09 ian kernel: [ 461.694713] lockd: release host librarybrowser-06.local Oct 8 15:48:09 ian kernel: [ 461.694783] lockd: call procedure 12 on librarybrowser-06.local (async) Oct 8 15:48:09 ian kernel: [ 461.694857] lockd: nlm_bind_host(192.168.240.55->192.168.246.96)
请注意“打开失败(错误117440512)”行; 我们也得到了“打开失败(错误16777216)”
看起来这种故障的触发可能与负载有关。 在重新启动NFS服务之后,NFS在再次失败之前继续工作了很长一段时间(但是只有很less的客户端连接,据我们所知)。
此服务器上没有运行防火墙; /etc/hosts.allow和/etc/hosts.deny都是空的(注释除外)。
我们目前处于这个问题的死胡同。 我们没有简单的方法回到升级前的状态。 但是,我们正在加快将这项服务迁移到新服务器的计划,但是由于这个服务器的运行时间和内核与现在有问题的服务器运行的操作系统和内核差不多,所以最好能够更好地了解一下是什么导致了这个问题,所以我们可以有一些信心,我们不会在新的服务器上遇到他们。
任何想法,或其他事情调查或尝试的提示?
谢谢,
蒂姆。
我在这里find了答案: http : //sophiedogg.com/?p=141
/var/lib/nfs/statd/sm – 包含statd监视列表的目录 /var/lib/nfs/statd/sm.bak – 包含statd通知列表的目录 在删除这些文件之前,应该停止rpcbind,statd和lockd服务。 以下是在基于RPM的发行版上运行以解决此问题的命令列表。
service rpcbind stop service nfslock stop rm -rf /var/lib/nfs/statd/sm/* rm -rf /var/lib/nfs/statd/sm.bak/* service rpcbind start service nfslock start
运行这些命令后,最好重新启动NFS服务器。
在Debian和衍生产品中,命令是:
service nfs-kernel-server stop service rpcbind stop service nfs-common stop service rpcbind start service nfs-common start service nfs-kernel-server start
我也在service nfs-common restart上为service nfs-common restart 。
我唯一能告诉你的是,我们有一些运行lenny的NFS服务器,没有任何问题。 硬件是旧的(如果我记得正确,我现在不工作),但硬盘速度很快,15千转/分钟,所以这基本上是重要的。
debuggingnfs问题可能会令人沮丧。 我发现这可以帮助您从日志中获取更多信息:
* RPC debugging: echo 2048 > /proc/sys/sunrpc/rpc_debug grep . /proc/net/rpc/*/content ls -l /proc/fs/nfsd cat /proc/fs/nfs/exports * NFS debugging: # turn on linux nfs debug echo 1 > /proc/sys/sunrpc/nfs_debug # turn off linux nfs debug echo 0 > /proc/sys/sunrpc/nfs_debug