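The server's uptime is less than 10 minutes, yet top shows an extremely high CPU time for all processes [1] (some above a million hours). This is a 24-core machine. The system eventually crashed within 10-15 minutes. After a power cycle it went back to normal.
I'm leaning toward faulty hardware that somehow got initialized correctly by the power cycle.
Any ideas what could have gone wrong?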
[1]
  PID USER PR NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
   13 root 20  0     0    0    0 S  100.0  0.0  30019,26 ksoftirqd/2
   33 root 20  0     0    0    0 S  100.0  0.0  40025,54 ksoftirqd/7
   53 root 20  0     0    0    0 S  100.0  0.0  65042,06 ksoftirqd/12
 2842 root 20  0 14.0g 362m  11m S 5500.0  0.3  8206270h java
12830 root 20  0  104m 2400 1532 S  100.0  0.0  5139288h bash
 2541 root 39 19     0    0    0 S    1.0  0.0 300194:24 kipmi0
14937 root 20  0 13516 1640  956 R    0.7  0.0   0:00.12 top
  160 root 20  0     0    0    0 S    0.3  0.0  20012,57 kblockd/6
    1 root 20  0 21444 1548 1240 S    0.0  0.0  4270563h init
    2 root 20  0     0    0    0 S    0.0  0.0 785508,31 kthreadd
    3 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/0
    4 root 20  0     0    0    0 S    0.0  0.0 10237405h ksoftirqd/0
    5 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/0
    6 root RT  0     0    0    0 S    0.0  0.0   0:00.00 watchdog/0
    7 root RT  0     0    0    0 R    0.0  0.0 300194:20 migration/1
    8 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/1
    9 root 20  0     0    0    0 S    0.0  0.0  30019,26 ksoftirqd/1
   10 root RT  0     0    0    0 R    0.0  0.0 300194:20 watchdog/1
   11 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/2
   12 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/2
   14 root RT  0     0    0    0 S    0.0  0.0 300194:20 watchdog/2
   15 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/3
   16 root RT  0     0    0    0 S    0.0  0.0   0:00.00 migration/3
   17 root 20  0     0    0    0 S    0.0  0.0 900583:01 ksoftirqd/3
   18 root RT  0     0    0    0 S    0.0  0.0 300194:20 watchdog/
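
For scale, here is a rough sanity check of why these numbers cannot be real accounting. It is only a sketch: it assumes the `h` suffix in the TIME+ column means hours (consistent with the "million hours" reading above), and the function names `max_cpu_hours` and `parse_hours_field` are made up for illustration. With 24 cores and about 10 minutes of uptime, the whole machine cannot have accumulated more than a few core-hours in total.

```python
def max_cpu_hours(uptime_minutes: float, cores: int) -> float:
    """Upper bound on total CPU time (hours) accumulated since boot:
    every core busy for the entire uptime."""
    return uptime_minutes * cores / 60.0


def parse_hours_field(field: str) -> float:
    """Parse a TIME+ value printed with an 'h' suffix, e.g. '8206270h'.
    Assumption: the suffix denotes hours."""
    if not field.endswith("h"):
        raise ValueError(f"not an hours-suffixed TIME+ value: {field!r}")
    return float(field[:-1])


# Figures taken from the post: roughly 10 minutes of uptime, 24 cores.
UPTIME_MINUTES = 10
CORES = 24

# TIME+ values copied from the top output above (only the 'h' rows).
reported = {
    "java": "8206270h",
    "bash": "5139288h",
    "init": "4270563h",
    "ksoftirqd/0": "10237405h",
}

if __name__ == "__main__":
    bound = max_cpu_hours(UPTIME_MINUTES, CORES)
    print(f"physically possible maximum: {bound} core-hours")
    for name, field in reported.items():
        hours = parse_hours_field(field)
        print(f"{name}: {hours:.0f} h reported, {hours / bound:.0f}x the bound")
```

Every `h` row exceeds the bound by a factor in the millions, which suggests corrupted time accounting (clock or counter values) rather than real CPU usage.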