Server: PowerEdge R620
OS: RHEL 6.4
Kernel: 2.6.32-358.18.1.el6.x86_64
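I'm getting application alerts in a production environment. Critical CPU-hungry processes are being starved of resources, which is causing a processing backlog. The problem occurs on all 12th-generation Dell servers (R620s) in a recently deployed cluster. As far as I can tell, the incidents line up with peak CPU utilization and are accompanied by a flood of "power limit notification" spam in dmesg. An excerpt from one of these events: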
Nov 7 10:15:15 someserver [.crit] CPU12: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU0: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU6: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU14: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU18: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU2: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU4: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU16: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU0: Package power limit notification (total events = 11)
Nov 7 10:15:15 someserver [.crit] CPU6: Package power limit notification (total events = 13)
Nov 7 10:15:15 someserver [.crit] CPU14: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU18: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU20: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU8: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU2: Package power limit notification (total events = 12)
Nov 7 10:15:15 someserver [.crit] CPU10: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU22: Core power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU4: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU16: Package power limit notification (total events = 13)
Nov 7 10:15:15 someserver [.crit] CPU20: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU8: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU10: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU22: Package power limit notification (total events = 14)
Nov 7 10:15:15 someserver [.crit] CPU15: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU3: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU1: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU5: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU17: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU13: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU15: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU3: Package power limit notification (total events = 374)
Nov 7 10:15:15 someserver [.crit] CPU1: Package power limit notification (total events = 376)
Nov 7 10:15:15 someserver [.crit] CPU5: Package power limit notification (total events = 376)
Nov 7 10:15:15 someserver [.crit] CPU7: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU19: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU17: Package power limit notification (total events = 377)
Nov 7 10:15:15 someserver [.crit] CPU9: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU21: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU23: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU11: Core power limit notification (total events = 369)
Nov 7 10:15:15 someserver [.crit] CPU13: Package power limit notification (total events = 376)
Nov 7 10:15:15 someserver [.crit] CPU7: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU19: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU9: Package power limit notification (total events = 374)
Nov 7 10:15:15 someserver [.crit] CPU21: Package power limit notification (total events = 375)
Nov 7 10:15:15 someserver [.crit] CPU23: Package power limit notification (total events = 374)
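For anyone wanting to quantify this kind of spam, here is a rough sketch that counts Core and Package notifications in a syslog-style file. The function name summarize_pln is made up for illustration, and it assumes the lines look like the excerpt above (the word "Core" or "Package" as a separate field before "power limit notification").

```shell
#!/bin/sh
# Hypothetical helper: count "Core" and "Package" power limit notifications.
# Assumes log lines shaped like the excerpt in this post.
# Reads the files given as arguments, or stdin when none are given.
summarize_pln() {
    awk '/power limit notification/ {
        # Find the field that says which kind of notification this is.
        for (i = 1; i <= NF; i++) {
            if ($i == "Core" || $i == "Package") {
                count[$i]++
                break
            }
        }
    }
    END {
        printf "Core=%d Package=%d\n", count["Core"], count["Package"]
    }' "$@"
}
```

It could be run as `summarize_pln /var/log/messages`, or fed from a pipe such as `dmesg | summarize_pln`, to compare counts across hosts or time windows.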
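A little Google-fu suggests this is usually associated with the CPU running hot or with voltage regulation kicking in. I don't think that is what's happening here. The temperature sensors on every server in the cluster read normal, Power Cap Policy is disabled in the iDRAC, and the System Profile is set to "Performance" on all of these servers: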
# omreport chassis biossetup | grep -A10 'System Profile'
System Profile Settings
------------------------------------------
System Profile : Performance
CPU Power Management : Maximum Performance
Memory Frequency : Maximum Performance
Turbo Boost : Enabled
C1E : Disabled
C States : Disabled
Monitor/Mwait : Enabled
Memory Patrol Scrub : Standard
Memory Refresh Rate : 1x
Memory Operating Voltage : Auto
Collaborative CPU Performance Control : Disabled
Everything I can find online just has me going in circles. What on earth is going on?
It's not the voltage regulation itself that causes the performance problem, but the debugging kernel interrupts that it triggers.
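Despite some misinformation on Red Hat's part, all of the linked pages refer to the same phenomenon. The voltage regulation happens with or without the Performance profile, probably because Turbo Boost is enabled. Whatever the reason, these voltage fluctuations interact poorly with the Power Limit Notification (PLN) kernel interrupts that are enabled by default in kernel 2.6.32-358.18.1.el6.x86_64.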
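As a quick way to see whether a given host advertises the feature at all, something like the sketch below may help. It assumes the kernel reports PLN support as a `pln` entry on the flags line of /proc/cpuinfo, which I believe is the case on kernels that know about the feature; the function name has_pln_flag is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first "flags" line of a cpuinfo-style
# file contains the word "pln" (assumed name of the PLN feature flag).
# Defaults to /proc/cpuinfo when no file is given.
has_pln_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -m1: only the first CPU's flags line; -w: match "pln" as a whole word.
    grep -m1 '^flags' "$cpuinfo" | grep -qw 'pln'
}
```

For example, `has_pln_flag && echo "PLN exposed"` on an affected server. If the feature is masked at boot, the flag would be expected to disappear from that line.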
Confirmed workarounds:
Adding the following kernel parameter to grub.conf will disable PLNs: clearcpuid=229
Flaky workarounds:
Bad workarounds: