systemd using 4 GB of RAM after 18 days of uptime

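I have a web server running CentOS 7, and the systemd process is using nearly 4 GB of memory after only a few weeks of uptime. Memory usage is increasing steadily, by about 200 MB per day. Related processes such as systemd-logind and dbus-daemon are also using a lot of CPU much of the time. My other server, which runs CentOS 6 and uses "init" instead of systemd, shows no such resource usage.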

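In the example below, systemd, systemd-logind, systemd-journal and dbus-daemon together use 10.7% of a quad-core CPU while nothing is running apart from the normal web services, and systemd alone holds 19% of the system's 16 GB of RAM. This is not normal behavior, and after searching around I have not found anyone else with this problem. What could be causing this resource hogging? Any suggestions would be appreciated.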

Output from top during an idle period (apart from the web services):

    top - 08:51:31 up 16 days, 13:43,  2 users,  load average: 1.84, 1.39, 1.07
    Tasks: 297 total,   2 running, 295 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  5.6 us,  3.6 sy,  0.0 ni, 90.6 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem : 16212992 total,  2466564 free,  4275764 used,  9470664 buff/cache
    KiB Swap:  4194300 total,  4070740 free,   123560 used. 10707392 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      743 dbus      20   0   27104   1856   1152 S   3.3  0.0 304:27.19 dbus-daemon
        1 root      20   0 3247784 2.920g   1800 S   3.0 18.9 287:41.35 systemd
      737 root      20   0   27416   2524   1304 S   2.7  0.0 225:32.66 systemd-logind
      736 root      20   0  434760   3756   3076 S   2.0  0.0 172:26.53 NetworkManager
      548 root      20   0   82276  34652  34516 S   1.7  0.2 160:20.16 systemd-journal
      770 polkitd   20   0  522920   2956   2248 S   1.7  0.0 120:06.11 polkitd
      716 root      16  -4  116744   1368   1312 S   1.3  0.0  93:26.54 auditd
     3778 nginx     20   0  446488  14688   6564 S   1.3  0.1   2:18.80 php-fpm
     3847 nginx     20   0  446316  14588   6548 S   1.3  0.1   2:19.29 php-fpm
     7000 nginx     20   0  446132  14400   6544 S   1.3  0.1   1:22.77 php-fpm
    14862 nginx     20   0  446304  14600   6580 S   1.3  0.1   1:32.25 php-fpm
    30333 nginx     20   0  446292  14468   6528 S   1.3  0.1   1:40.78 php-fpm
      740 root      20   0  784980  20112  19696 S   1.0  0.1  76:12.69 rsyslogd
     3521 nginx     20   0  446188  14848   6748 S   1.0  0.1   2:20.00 php-fpm
     3687 nginx     20   0  446036  14688   6764 S   1.0  0.1   2:20.45 php-fpm
     3689 nginx     20   0  446408  14604   6552 S   1.0  0.1   2:19.75 php-fpm
     3774 nginx     20   0  446288  14568   6552 S   1.0  0.1   2:19.68 php-fpm
     3836 nginx     20   0  447416  15572   6564 S   1.0  0.1   2:21.06 php-fpm
     4861 nginx     20   0  446260  14576   6540 S   1.0  0.1   2:18.94 php-fpm
     4862 nginx     20   0  446508  15084   6764 S   1.0  0.1   2:20.71 php-fpm
    13538 nginx     20   0  447204  15452   6572 S   1.0  0.1   1:32.33 php-fpm
    15530 nginx     20   0  446292  14520   6528 S   1.0  0.1   1:32.55 php-fpm
    28468 nginx     20   0  446356  14672   6568 S   1.0  0.1   1:42.21 php-fpm
    29564 nginx     20   0  446292  14536   6548 S   1.0  0.1   1:41.11 php-fpm
    30851 nginx     20   0  445956  14568   6748 S   1.0  0.1   1:49.66 php-fpm

Edit 2-14-16

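I may have found something relevant in the output of "sudo journalctl" (see below). SSH connections from one of my other production servers produce several log lines every second. These are rsync processes transferring files from the remote server to the server in question. That would explain the CPU usage of systemd, systemd-logind, NetworkManager and systemd-journal.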

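However, it does not explain the memory leak, which is the bigger problem. Since I wrote this post a few days ago, systemd has grown from 18.9% to 21.4% of system memory.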
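To put numbers on that growth, the resident memory of PID 1 could be sampled periodically. This is a minimal sketch; the function names and the log path /var/tmp/systemd-rss.log are hypothetical choices for illustration, and it assumes the standard VmRSS field in /proc/<pid>/status.

```shell
#!/bin/sh
# Print the resident set size (VmRSS, in kB) from a /proc/<pid>/status-style file.
# With no argument it reads PID 1, i.e. systemd.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "${1:-/proc/1/status}"
}

# Append one timestamped sample to a log file.
# $1 = status file (optional), $2 = log file (optional, hypothetical default path).
log_rss() {
    printf '%s %s\n' "$(date +%F_%T)" "$(rss_kb "$1")" >> "${2:-/var/tmp/systemd-rss.log}"
}
```

Calling log_rss from an hourly cron job would give a series that shows whether the growth is linear or tied to specific events.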

The log below has been modified to replace the server's real domain name and IP address.

    Feb 14 10:02:13 hostname.domain.com systemd-logind[737]: New session 6467482 of user tropicg9.
    Feb 14 10:02:13 hostname.domain.com systemd[1]: Started Session 6467482 of user tropicg9.
    Feb 14 10:02:13 hostname.domain.com systemd[1]: Starting Session 6467482 of user tropicg9.
    Feb 14 10:02:13 hostname.domain.com sshd[9665]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:13 hostname.domain.com sshd[9667]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:13 hostname.domain.com sshd[9665]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:13 hostname.domain.com systemd-logind[737]: Removed session 6467482.
    Feb 14 10:02:14 hostname.domain.com sshd[9728]: Accepted publickey for tropicg9 from 1.2.3.4 port 45289 ssh2: RSA 0b:
    Feb 14 10:02:14 hostname.domain.com systemd-logind[737]: New session 6467483 of user tropicg9.
    Feb 14 10:02:14 hostname.domain.com systemd[1]: Started Session 6467483 of user tropicg9.
    Feb 14 10:02:14 hostname.domain.com systemd[1]: Starting Session 6467483 of user tropicg9.
    Feb 14 10:02:14 hostname.domain.com sshd[9728]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:14 hostname.domain.com sshd[9735]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:14 hostname.domain.com sshd[9728]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:14 hostname.domain.com systemd-logind[737]: Removed session 6467483.
    Feb 14 10:02:15 hostname.domain.com sshd[9876]: Accepted publickey for tropicg9 from 1.2.3.4 port 45290 ssh2: RSA 0b:
    Feb 14 10:02:15 hostname.domain.com systemd-logind[737]: New session 6467484 of user tropicg9.
    Feb 14 10:02:15 hostname.domain.com systemd[1]: Started Session 6467484 of user tropicg9.
    Feb 14 10:02:15 hostname.domain.com systemd[1]: Starting Session 6467484 of user tropicg9.
    Feb 14 10:02:15 hostname.domain.com sshd[9876]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:15 hostname.domain.com sshd[9883]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:15 hostname.domain.com sshd[9876]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:15 hostname.domain.com systemd-logind[737]: Removed session 6467484.
    Feb 14 10:02:20 hostname.domain.com sshd[10333]: Accepted publickey for tropicg9 from 1.2.3.4 port 45291 ssh2: RSA 0b
    Feb 14 10:02:20 hostname.domain.com systemd-logind[737]: New session 6467485 of user tropicg9.
    Feb 14 10:02:20 hostname.domain.com systemd[1]: Started Session 6467485 of user tropicg9.
    Feb 14 10:02:20 hostname.domain.com systemd[1]: Starting Session 6467485 of user tropicg9.
    Feb 14 10:02:20 hostname.domain.com sshd[10333]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:20 hostname.domain.com sshd[10342]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:20 hostname.domain.com sshd[10333]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:20 hostname.domain.com systemd-logind[737]: Removed session 6467485.
    Feb 14 10:02:21 hostname.domain.com sshd[10450]: Accepted publickey for tropicg9 from 1.2.3.4 port 45292 ssh2: RSA 0b
    Feb 14 10:02:21 hostname.domain.com systemd-logind[737]: New session 6467486 of user tropicg9.
    Feb 14 10:02:21 hostname.domain.com systemd[1]: Started Session 6467486 of user tropicg9.
    Feb 14 10:02:21 hostname.domain.com systemd[1]: Starting Session 6467486 of user tropicg9.
    Feb 14 10:02:21 hostname.domain.com sshd[10450]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:21 hostname.domain.com sshd[10457]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:21 hostname.domain.com sshd[10450]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:21 hostname.domain.com systemd-logind[737]: Removed session 6467486.
    Feb 14 10:02:22 hostname.domain.com sshd[10473]: Accepted publickey for tropicg9 from 1.2.3.4 port 45293 ssh2: RSA 0b
    Feb 14 10:02:22 hostname.domain.com systemd-logind[737]: New session 6467487 of user tropicg9.
    Feb 14 10:02:22 hostname.domain.com systemd[1]: Started Session 6467487 of user tropicg9.
    Feb 14 10:02:22 hostname.domain.com systemd[1]: Starting Session 6467487 of user tropicg9.
    Feb 14 10:02:22 hostname.domain.com sshd[10473]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:22 hostname.domain.com sshd[10475]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:22 hostname.domain.com sshd[10473]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:22 hostname.domain.com systemd-logind[737]: Removed session 6467487.
    Feb 14 10:02:23 hostname.domain.com sshd[10484]: Accepted publickey for tropicg9 from 1.2.3.4 port 45294 ssh2: RSA 0b
    Feb 14 10:02:23 hostname.domain.com systemd-logind[737]: New session 6467488 of user tropicg9.
    Feb 14 10:02:23 hostname.domain.com systemd[1]: Started Session 6467488 of user tropicg9.
    Feb 14 10:02:23 hostname.domain.com systemd[1]: Starting Session 6467488 of user tropicg9.
    Feb 14 10:02:23 hostname.domain.com sshd[10484]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:23 hostname.domain.com sshd[10486]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:23 hostname.domain.com sshd[10484]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:23 hostname.domain.com systemd-logind[737]: Removed session 6467488.
    Feb 14 10:02:39 hostname.domain.com sshd[10654]: Accepted publickey for tropicg9 from 1.2.3.4 port 45295 ssh2: RSA 0b
    Feb 14 10:02:39 hostname.domain.com systemd[1]: Started Session 6467489 of user tropicg9.
    Feb 14 10:02:39 hostname.domain.com systemd-logind[737]: New session 6467489 of user tropicg9.
    Feb 14 10:02:39 hostname.domain.com systemd[1]: Starting Session 6467489 of user tropicg9.
    Feb 14 10:02:39 hostname.domain.com sshd[10654]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
    Feb 14 10:02:39 hostname.domain.com sshd[10656]: Received disconnect from 1.2.3.4: 11: disconnected by user
    Feb 14 10:02:39 hostname.domain.com sshd[10654]: pam_unix(sshd:session): session closed for user tropicg9
    Feb 14 10:02:39 hostname.domain.com systemd-logind[737]: Removed session 6467489.
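To see how often sessions are being created, the journal can be summarized per minute. The following is a rough sketch that assumes the default syslog-style journalctl output shown in the log above; the function name is made up for illustration.

```shell
#!/bin/sh
# Count systemd-logind "New session" lines per minute.
# Reads syslog-style journal lines on stdin and prints "Mon DD HH:MM count".
sessions_per_minute() {
    awk '/systemd-logind\[[0-9]+\]: New session/ {
             minute = $1 " " $2 " " substr($3, 1, 5)   # e.g. "Feb 14 10:02"
             count[minute]++
         }
         END { for (m in count) print m, count[m] }' | sort
}
```

It could be fed with something like `journalctl -u systemd-logind --since today | sessions_per_minute`. A consistently high count would confirm that the rsync jobs open a new SSH session for every transfer.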

Update 2-16-16

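Here is the output of systemd-cgtop, which shows resource usage by active control group (scroll right). It attributes all significant resource usage to the root path "/". That does not narrow things down, but perhaps the information is helpful.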

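There are only 86 scope files and related directories under /run/systemd/system/, the oldest 6 days old. There is a known problem where these files are orphaned during SSH connections, leaving thousands of entries and causing high CPU load, but that is not happening here.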

    Path                                                 Tasks   %CPU   Memory  Input/s Output/s
    /                                                      296   30.5    11.3G   657.8K   893.0K
    /system.slice/NetworkManager.service                     1      -        -        -        -
    /system.slice/auditd.service                             1      -        -        -        -
    /system.slice/crond.service                              1      -        -        -        -
    /system.slice/dbus.service                               1      -        -        -        -
    /system.slice/irqbalance.service                         1      -        -        -        -
    /system.slice/lvm2-lvmetad.service                       1      -        -        -        -
    /system.slice/mariadb.service                            2      -        -        -        -
    /system.slice/nginx.service                             10      -        -        -        -
    /system.slice/php-fpm.service                          101      -        -        -        -
    /system.slice/polkit.service                             1      -        -        -        -
    /system.slice/postfix.service                            3      -        -        -        -
    /system.slice/rsyslog.service                            1      -        -        -        -
    /system.slice/smartd.service                             1      -        -        -        -
    /system.slice/sshd.service                               2      -        -        -        -
    /system.slice/system-getty.slice/[email protected]       1      -        -        -        -
    /system.slice/systemd-journald.service                   1      -        -        -        -
    /system.slice/systemd-logind.service                     1      -        -        -        -
    /system.slice/systemd-udevd.service                      1      -        -        -        -
    /system.slice/tuned.service                              1      -        -        -        -
    /system.slice/wpa_supplicant.service                     1      -        -        -        -
    /user.slice/user-1000.slice/session-7170741.scope        4      -        -        -        -
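For reference, the scope-file count mentioned above can be checked with a one-liner along these lines. This is only a sketch: the session-*.scope naming pattern is an assumption based on the session-7170741.scope entry in the cgtop output, and the function name is hypothetical.

```shell
#!/bin/sh
# Count session scope files and their drop-in directories in a systemd runtime directory.
# With no argument it inspects /run/systemd/system.
count_scopes() {
    find "${1:-/run/systemd/system}" -maxdepth 1 -name 'session-*.scope*' | wc -l
}
```

A count in the thousands would point to orphaned scopes; a count near the number of live sessions would rule that cause out.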

Temporarily freeing systemd's memory

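It appears that running systemctl daemon-reexec frees all of the memory allocated to the PID 1 process. The leak continues afterwards, however. One workaround is a daily cron job that clears the memory this way, but that does not fix the leak itself. I have filed a bug with Red Hat, since this is the stable systemd release for CentOS 7.x. Hopefully the leak can be found and plugged.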
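A sketch of such a cron workaround follows. It goes slightly beyond a plain daily re-exec by only acting when PID 1 is above a size limit; the function name, the limit argument, and the REEXEC_CMD override (handy for a dry run) are assumptions added for illustration.

```shell
#!/bin/sh
# Re-execute systemd only when PID 1's resident memory exceeds a limit.
# $1 = limit in kB, $2 = status file to read (optional, defaults to PID 1).
# REEXEC_CMD may be set to something harmless (e.g. "echo reexec") for a dry run.
reexec_if_large() {
    limit_kb=$1
    rss_kb=$(awk '/^VmRSS:/ { print $2 }' "${2:-/proc/1/status}")
    if [ "${rss_kb:-0}" -gt "$limit_kb" ]; then
        ${REEXEC_CMD:-systemctl daemon-reexec}
    fi
}
```

Saved as a script in /etc/cron.daily/ and ended with a call such as `reexec_if_large 1048576` (1 GiB), it would keep the process bounded until a fixed package is available.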

Check a trace of the systemd process for mmap/munmap calls. It should reveal the problem:

    yum install strace
    strace -ff -p 1

This is a quick and dirty way to diagnose a memory leak. The strace of a healthy systemd process should look similar to this:

    recvmsg(23, {msg_name(0)=NULL, msg_iov(1)=[{"WATCHDOG=1", 4096}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 10
    open("/proc/620/cgroup", O_RDONLY|O_CLOEXEC) = 20
    fstat(20, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcfd734e000
    read(20, "10:cpuset:/\n9:perf_event:/\n8:hug"..., 1024) = 164
    close(20) = 0
    munmap(0x7fcfd734e000, 4096) = 0

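It allocates memory, does some work, then frees the memory.
Check the trace of systemd's system calls; you should find a sequence that it cannot complete, so the memory it allocated is never freed.
My guess is that an incorrectly mounted pseudo-filesystem or SELinux is the culprit, so that systemd cannot complete its calls.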
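To avoid reading the trace by eye, the mmap and munmap lines can be paired up with a small filter. This is a heuristic sketch, not a proper leak detector: the function name is hypothetical, it matches regions by start address only, and it ignores calls that strace splits into "unfinished"/"resumed" lines.

```shell
#!/bin/sh
# Print addresses returned by mmap that have no matching munmap in a strace log.
# Arguments: one or more strace output files (or stdin if none are given).
unmatched_mmaps() {
    awk '
        /mmap\(/ && / = 0x/ { live[$NF] = 1 }     # last field is the returned address
        /munmap\(0x/ {
            addr = $0
            sub(/.*munmap\(/, "", addr)            # strip everything up to the argument list
            sub(/,.*/, "", addr)                   # keep only the first argument (the address)
            delete live[addr]
        }
        END { for (a in live) print a }
    ' "$@"
}
```

A trace limited to the relevant calls could be captured with `strace -f -e trace=mmap,munmap -o /tmp/systemd.trace -p 1` (the output path is arbitrary) and then passed to `unmatched_mmaps /tmp/systemd.trace`. Note that memory obtained through brk or kept in the allocator's own pools will not show up this way.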