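We run a service that takes screenshots of URLs and uploads them to our S3 bucket. It is similar to manet, but it is our own custom-coded nodejs application. We don't store the screenshots on the local disk. We store them temporarily to resize them and then delete them. The temporary image folder is always empty.
The problem: free disk space keeps shrinking until the server is rebooted. For example, right now df -h shows: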
ubuntu@ip-10-0-1-94:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
However, du -sh / shows:
root@ip-10-0-1-94:~# du -sh /
du: cannot access '/proc/14440': No such file or directory
du: cannot access '/proc/14520/task/14520/fd/4': No such file or directory
du: cannot access '/proc/14520/task/14520/fdinfo/4': No such file or directory
du: cannot access '/proc/14520/fd/4': No such file or directory
du: cannot access '/proc/14520/fdinfo/4': No such file or directory
du: cannot access '/proc/14521': No such file or directory
7.0G    /
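If I run du on every folder in the root filesystem, the total comes to 7 GB, not 74. If I reboot the server, usage is back at 7 GB once it comes up, but it climbs to 70+ GB again within 10-12 hours.
We use mongodb for storage, so I assumed it might be the cause. However, I removed the smallfiles configuration option I had set earlier, and nothing changed.
The lsof output is attached here, and the ps aux output here.
Here is the mount output: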
ubuntu@ip-10-0-1-94:~$ mount
/dev/xvda1 on / type ext4 (rw,discard)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
Restarting any of the running services, such as mongodb or supervisor, changes nothing. Here is an example:
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
root@ip-10-0-1-94:~# service mongod restart
mongod stop/waiting
mongod start/running, process 31590
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
Or supervisor, which controls the node processes (the workers and the app):
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
root@ip-10-0-1-94:~# service supervisor restart
Restarting supervisor: supervisord.
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
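UPDATE: It turned out that this was because gearman was logging tons of
accept(Too many open files) -> libgearman-server/gearmand.cc:851
messages. Even though the log file has been deleted, the gearman process still holds it open, so the space is not freed. Here is the proof: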
root@ip-10-0-1-94:~# sudo lsof -s | awk '$5 == "REG"' | sort -n -r -k 7,7 | head -n 1
gearmand 4221 gearman 3w REG 202,1 31748949650 143608 /var/log/gearman-job-server/gearman.log.1 (deleted)
(Thanks to @Andrew Henle)
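The effect is easy to reproduce in a scratch directory. The following is a minimal sketch, assuming Linux with /proc mounted; the file name demo.log and the background writer are stand-ins for illustration, not the real gearmand. It also shows one way the space can be released without restarting the process, by truncating the deleted file through /proc/<pid>/fd/<n>:

```shell
#!/bin/sh
# Demo: a deleted file stays allocated while a process holds it open,
# and truncating it through /proc releases the space.
tmpdir=$(mktemp -d)
log="$tmpdir/demo.log"          # hypothetical log file for the demo

# Background writer standing in for a daemon that keeps its log open
# on file descriptor 3 (append mode).
( exec 3>>"$log"; echo "first line" >&3; exec sleep 30 ) &
writer=$!
sleep 1                         # give the writer time to create the file

rm "$log"                       # the name is gone, the inode is not
target=$(readlink "/proc/$writer/fd/3")       # path now ends in " (deleted)"
size_before=$(wc -c < "/proc/$writer/fd/3")

: > "/proc/$writer/fd/3"        # truncate the still-open inode in place
size_after=$(wc -c < "/proc/$writer/fd/3")

echo "$target: $size_before bytes -> $size_after bytes"

kill "$writer"
wait "$writer" 2>/dev/null
rm -rf "$tmpdir"
```

Going by the lsof line above (PID 4221, descriptor 3w), the equivalent on the real server would presumably be `: > /proc/4221/fd/3` run as root. This throws away the log contents, and if the process did not open the file in append mode, its next write lands at the old offset and the file becomes sparse rather than small.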
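The next question is: why is gearman writing this to the log? As described here, this happens when too many connections to gearman are in the TIME_WAIT state, but these are not in TIME_WAIT; they are in ESTABLISHED. Here they are.
If I run strace -p 4221, all I see is this: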
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169649, 568914324}) = 0
gettimeofday({1446109467, 793708}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33010), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 874
write(17, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169659, 749954206}) = 0
gettimeofday({1446109477, 974726}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33060), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 875
write(32, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169659, 754505349}) = 0
gettimeofday({1446109477, 979307}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33062), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 876
write(27, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169664, 300399805}) = 0
gettimeofday({1446109482, 525209}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33134), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 877
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169666, 161035104}) = 0
gettimeofday({1446109484, 385826}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33165), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 878
write(17, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169668, 308112847}) = 0
gettimeofday({1446109486, 532900}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33186), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 879
write(32, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169671, 251265264}) = 0
gettimeofday({1446109489, 476077}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33218), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 880
write(27, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169672, 320483648}) = 0
gettimeofday({1446109490, 545274}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33232), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 881
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169676, 186686282}) = 0
gettimeofday({1446109494, 411486}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33303), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 882
write(17, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169684, 699748557}) = 0
gettimeofday({1446109502, 924549}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33320), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 883
write(32, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169687, 906830251}) = 0
gettimeofday({1446109506, 131601}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33348), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 884
write(27, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169701, 112588731}) = 0
gettimeofday({1446109519, 337387}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33386), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 885
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169707, 686312787}) = 0
gettimeofday({1446109525, 911113}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33420), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 886
write(17, "\3", 1) = 1
Every block like
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169707, 686312787}) = 0
gettimeofday({1446109525, 911113}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33420), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 886
write(17, "\3", 1)
is added every 3-5 seconds. Nothing else happens for minutes at a time.
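Since the logged error is "Too many open files", it may help to compare how many descriptors the process holds with its limit. The following is a rough sketch, assuming Linux /proc; the function name fd_usage is made up for illustration, and reading another user's /proc/<pid>/fd normally requires root:

```shell
#!/bin/sh
# Print how many file descriptors a process has open next to its soft
# "Max open files" limit, both read from /proc (Linux only).
fd_usage() {
    pid=$1
    # One entry per open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # Line format: "Max open files  <soft>  <hard>  files"
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}

# Default to the current shell; pass a PID (for example 4221) to inspect
# another process.
fd_usage "${1:-$$}"
```

In the trace above, the values returned by accept4 climb from 874 to 886 without any lower number being reused. Because the kernel hands out the lowest free descriptor, that pattern suggests accepted connections are accumulating rather than being closed, which would be consistent with the ESTABLISHED sockets.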
Whatever process created this file is your culprit:
gearmand 811 gearman 3w REG 202,1 71016771760 143618 /var/log/gearman-job-server/gearman.log.1 (deleted)
Given that it is named gearman.log.1, I suspect that whatever is doing the log rotation isn't doing it correctly.
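If logrotate is what rotates this log (an assumption; the post does not say which tool is used), one possible fix is the copytruncate directive, which copies the log and then truncates the original in place, so a daemon that never reopens its log keeps writing to a file that still has a name. A sketch that writes such a stanza to a scratch file for review; the schedule and retention values are arbitrary placeholders:

```shell
#!/bin/sh
# Write an example logrotate stanza to a file for review.
# The output path defaults to a temp file; it is NOT installed anywhere.
conf=${1:-$(mktemp)}

cat > "$conf" <<'EOF'
/var/log/gearman-job-server/gearman.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}
EOF

echo "wrote $conf"
# Dry run without touching any logs, if logrotate is installed:
#   logrotate -d "$conf"
```

copytruncate can lose the lines written between the copy and the truncate. The alternative is to keep rename-based rotation and have a postrotate script tell the daemon to reopen its log, provided the daemon supports that.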
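When you see a serious mismatch between df and du, the cause is usually a deleted file that some process still has open. On Linux, lsof | grep deleted finds them nicely.
Simply searching the posted lsof output for "deleted" turns up several other *.1 log files that appear to have the same incorrect rotation problem.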
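Where lsof is not installed, the same information can be read straight from /proc. This is a minimal sketch, assuming Linux and GNU stat; the function name list_deleted_open_files is invented for illustration. With lsof available, `lsof +L1` (open files whose link count is below 1) should give a similar list.

```shell
#!/bin/sh
# List files that are deleted but still held open, largest first.
# Output columns: size in bytes, PID, path as reported by the kernel.
list_deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        # Descriptors can vanish while we scan; skip unreadable ones.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            /memfd:*) ;;                      # anonymous memory, not disk
            /*' (deleted)')
                pid=${fd#/proc/}
                pid=${pid%%/*}
                size=$(stat -L -c %s "$fd" 2>/dev/null) || continue
                printf '%s\t%s\t%s\n' "$size" "$pid" "$target"
                ;;
        esac
    done | sort -rn
}

# Run as root to see every process; show the 20 largest entries.
list_deleted_open_files | head -n 20
```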
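One more piece of information, regarding CentOS. There, the process is started with "systemctl", so you have to edit its unit file, /usr/lib/systemd/system/processName.service, and add the following line to the [Service] section: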
LimitNOFILE=50000
Then reload the systemd configuration (the new limit takes effect the next time the service starts):
systemctl daemon-reload
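Files under /usr/lib/systemd/system can be overwritten by package updates, so a drop-in under /etc/systemd/system/<unit>.d/ is a common alternative to editing the unit file itself. A sketch, assuming the same placeholder unit name processName.service; the helper write_nofile_dropin and the file name limits.conf are made up for illustration:

```shell
#!/bin/sh
# Write a systemd drop-in that sets LimitNOFILE for one unit.
# Usage: write_nofile_dropin UNIT LIMIT [BASE_DIR]
# BASE_DIR defaults to /etc/systemd/system; override it to try this out
# in a scratch directory first.
write_nofile_dropin() {
    unit=$1
    limit=$2
    base=${3:-/etc/systemd/system}
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf"
}

# Example, as root (unit name is a placeholder):
#   write_nofile_dropin processName.service 50000
#   systemctl daemon-reload
#   systemctl restart processName.service
#   systemctl show -p LimitNOFILE processName.service
```

Raising the limit only postpones the error if the daemon keeps accumulating connections, so it is worth checking the descriptor count again after the restart.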