The strangest Docker error I have ever seen

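I run docker-mailserver under Docker on one of my servers. After migrating some services from a legacy Debian Jessie server to an Ubuntu 16.04 LTS server, a very strange problem appeared. Server parameters: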

Legacy server:

    someuser@legacyserver:~$ uname -r
    3.16.0-4-amd64
    someuser@legacyserver:~$ dpkg -l | grep systemd
    ...215-17+deb8u7...
    someuser@legacyserver:~$ cat /proc/cmdline
    root=ZFS=rpool/ROOT/debian-1 ro boot=zfs quiet

New server:

    someuser@newserver:~$ uname -r
    4.4.0-21-generic
    someuser@newserver:~$ dpkg -l | grep systemd
    ...229-4ubuntu4...
    someuser@newserver:~$ cat /proc/cmdline
    root=ZFS=rpool/ROOT/debian-1 apparmor=0 ro

I run docker-mailserver in Docker inside a systemd-nspawn Debian Jessie container. The first problem I ran into was that the cgroups are mounted read-only under the newer systemd, which this solved:

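    # Each `mount` line looks like: <source> on <target> type <fstype> (<options>)
    # Skip the first cgroup line, then unmount and mount every remaining cgroup read-write.
    mount | grep cgroup | tail -n +2 | while read line
    do
        # Field 3 is the mount point.
        umount -l $(echo $line | cut -f3 -d" ")
        # Field 5 is the fs type; field 6 is "(options)", so strip the parentheses and turn ro into rw.
        mount -t $(echo $line | cut -f5 -d" ") -o $(echo $line | cut -f6 -d" " | rev | cut -c2- | rev | cut -c2- | sed -e 's/ro,/rw,/g') $(echo $line | cut -f1 -d" ") $(echo $line | cut -f3 -d" ")
    done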

It simply unmounts all the cgroups and mounts them again read-write (-o remount cannot be used).
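To see what that loop does before letting it touch live mounts, the same field extraction can be wrapped in a function that only prints the commands. This is a minimal dry-run sketch, not the script from the post: the function name `cgroup_remount_cmds` and the sample mount line are hypothetical, and it assumes `mount` prints lines in the usual `<source> on <target> type <fstype> (<options>)` shape.

```shell
#!/bin/sh
# Dry run: print the umount/mount commands for one line of `mount` output
# instead of executing them. Assumed line shape:
#   <source> on <target> type <fstype> (<options>)
cgroup_remount_cmds() {
    line=$1
    src=$(echo "$line" | cut -f1 -d" ")      # field 1: mount source
    target=$(echo "$line" | cut -f3 -d" ")   # field 3: mount point
    fstype=$(echo "$line" | cut -f5 -d" ")   # field 5: filesystem type
    # Field 6 is "(options)": drop the parentheses, then turn "ro," into "rw,".
    opts=$(echo "$line" | cut -f6 -d" " | sed -e 's/^(//' -e 's/)$//' -e 's/ro,/rw,/g')
    echo "umount -l $target"
    echo "mount -t $fstype -o $opts $src $target"
}

# Hypothetical sample line, used only to illustrate the transformation.
cgroup_remount_cmds "cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)"
```

For the sample line this prints `umount -l /sys/fs/cgroup/cpuset` followed by `mount -t cgroup -o rw,nosuid,nodev,noexec,relatime,cpuset cgroup /sys/fs/cgroup/cpuset`. Like the original `sed` expression, it rewrites every `ro,` in the option string, so an option that merely ends in `ro` would also be changed.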

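But now the real problem. First I log in to the systemd-nspawn container, and from there I move on into the Docker container. When I then, for example, reload Postfix (or do anything else)... both containers (the nested Docker one and the systemd-nspawn one) exit as quietly as a mouse... like this: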

    someuser@newserver:~# rsh somesystemdcontainer
    Last login: Sun Jun 25 15:27:24 CEST 2017 from host0 on pts/0
    Linux somesystemdcontainer 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 x86_64
    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.
    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    root@somesystemdcontainer:~# rsh mail #this is the docker container
    Last login: Sun Jun 25 13:28:18 UTC 2017 from 172.18.0.1 on pts/0
    Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-21-generic x86_64)
     * Documentation: https://help.ubuntu.com/
    root@mail:~# service postfix reload
     * Reloading Postfix configuration...
       ...done.
    root@mail:~# rlogin: connection closed.
    root@newserver:~#

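Nothing in dmesg, nothing in the kernel log, nothing anywhere. As you can see in the cmdline, disabling AppArmor in both the kernel and userspace does not help... After stopping the systemd-nspawn container: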

    jun 25 15:32:26 newserver kernel: INFO: task sh:10962 blocked for more than 120 seconds.
    jun 25 15:32:26 newserver kernel: Tainted: PO 4.4.0-21-generic #37-Ubuntu
    jun 25 15:32:26 newserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    jun 25 15:32:26 newserver kernel: sh D ffff88009ebb3c88 0 10962 9487 0x00000102
    jun 25 15:32:26 newserver kernel: ffff88009ebb3c88 0000000000000000 ffff88040dab3700 ffff8800c9450dc0
    jun 25 15:32:26 newserver kernel: ffff88009ebb4000 ffff8800c08008b0 0000000000000001 ffff8800c9450dc0
    jun 25 15:32:26 newserver kernel: ffff8800c2fe87e8 ffff88009ebb3ca0 ffffffff818203f5 ffff8800c9450dc0
    jun 25 15:32:26 newserver kernel: Call Trace:
    jun 25 15:32:26 newserver kernel: [<ffffffff818203f5>] schedule+0x35/0x80
    jun 25 15:32:26 newserver kernel: [<ffffffff8111fd4f>] zap_pid_ns_processes+0x13f/0x1a0
    jun 25 15:32:26 newserver kernel: [<ffffffff8108432b>] do_exit+0xa6b/0xae0
    jun 25 15:32:26 newserver kernel: [<ffffffff8122383f>] ? dput+0x2f/0x220
    jun 25 15:32:26 newserver kernel: [<ffffffff81084423>] do_group_exit+0x43/0xb0
    jun 25 15:32:26 newserver kernel: [<ffffffff810904d2>] get_signal+0x292/0x600
    jun 25 15:32:26 newserver kernel: [<ffffffff8102e517>] do_signal+0x37/0x6f0
    jun 25 15:32:26 newserver kernel: [<ffffffff8181fd36>] ? __schedule+0x386/0xa10
    jun 25 15:32:26 newserver kernel: [<ffffffff81083526>] ? do_wait+0x116/0x240
    jun 25 15:32:26 newserver kernel: [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
    jun 25 15:32:26 newserver kernel: [<ffffffff81003c5e>] syscall_return_slowpath+0x4e/0x60
    jun 25 15:32:26 newserver kernel: [<ffffffff81824650>] int_ret_from_sys_call+0x25/0x8f
    jun 25 15:32:53 newserver systemd[1]: [email protected]: State 'stop-sigterm' timed out. Killing.
    jun 25 15:32:53 newserver systemd-nspawn[9483]: somesystemdcontainer login:
    jun 25 15:32:53 newserver systemd[1]: [email protected]: Main process exited, code=killed, status=9/KILL
    jun 25 15:32:53 newserver systemd[1]: Stopped Container somesystemdcontainer.
    jun 25 15:32:53 newserver systemd[1]: [email protected]: Unit entered failed state.
    jun 25 15:32:53 newserver systemd[1]: [email protected]: Failed with result 'signal'.
    jun 25 15:32:53 newserver systemd[1]: Stopped Container somesystemdcontainer.
    jun 25 15:32:53 newserver systemd-machined[2890]: Machine somesystemdcontainer terminated.

10962 is... the bash in the DOCKER container, which according to pstree has "jumped out of its namespace".
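One way to check the namespace claim without relying on pstree is to compare the PID-namespace links under /proc. This is only a diagnostic sketch under stated assumptions: the helper names `pid_ns_of` and `same_pid_ns` are hypothetical, it has to run on the host, and reading another process's /proc/PID/ns entries usually needs root.

```shell
#!/bin/sh
# Print the PID-namespace identifier of a process, e.g. "pid:[4026531836]".
pid_ns_of() {
    readlink "/proc/$1/ns/pid"
}

# Succeed only if both PIDs resolve to the same PID namespace.
same_pid_ns() {
    a=$(pid_ns_of "$1") && b=$(pid_ns_of "$2") && [ -n "$a" ] && [ "$a" = "$b" ]
}
```

On the host, `same_pid_ns 1 10962 && echo "same PID namespace as host init"` would show whether the stuck process shares a PID namespace with PID 1 of the host. Comparing it against the PID of the nspawn container's init instead would show whether it still belongs to that container.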

What should I do now?