"Power limit notification" occurring on 12G Dell servers with RHEL6

Server: PowerEdge R620
OS: RHEL 6.4
Kernel: 2.6.32-358.18.1.el6.x86_64

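I'm encountering application alerts in my production environment. Critical CPU-hungry processes are being starved of resources, causing a processing backlog. The problem is happening on all of the 12th-generation Dell servers (R620) in a recently deployed cluster. As near as I can tell, instances of this match up with peak CPU utilization, accompanied by massive amounts of "power limit notification" spam in dmesg. An excerpt from one of these incidents: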

    Nov 7 10:15:15 someserver [.crit] CPU12: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU0: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU6: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU14: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU18: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU2: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU4: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU16: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU0: Package power limit notification (total events = 11)
    Nov 7 10:15:15 someserver [.crit] CPU6: Package power limit notification (total events = 13)
    Nov 7 10:15:15 someserver [.crit] CPU14: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU18: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU20: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU8: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU2: Package power limit notification (total events = 12)
    Nov 7 10:15:15 someserver [.crit] CPU10: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU22: Core power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU4: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU16: Package power limit notification (total events = 13)
    Nov 7 10:15:15 someserver [.crit] CPU20: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU8: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU10: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU22: Package power limit notification (total events = 14)
    Nov 7 10:15:15 someserver [.crit] CPU15: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU3: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU1: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU5: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU17: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU13: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU15: Package power limit notification (total events = 375)
    Nov 7 10:15:15 someserver [.crit] CPU3: Package power limit notification (total events = 374)
    Nov 7 10:15:15 someserver [.crit] CPU1: Package power limit notification (total events = 376)
    Nov 7 10:15:15 someserver [.crit] CPU5: Package power limit notification (total events = 376)
    Nov 7 10:15:15 someserver [.crit] CPU7: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU19: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU17: Package power limit notification (total events = 377)
    Nov 7 10:15:15 someserver [.crit] CPU9: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU21: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU23: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU11: Core power limit notification (total events = 369)
    Nov 7 10:15:15 someserver [.crit] CPU13: Package power limit notification (total events = 376)
    Nov 7 10:15:15 someserver [.crit] CPU7: Package power limit notification (total events = 375)
    Nov 7 10:15:15 someserver [.crit] CPU19: Package power limit notification (total events = 375)
    Nov 7 10:15:15 someserver [.crit] CPU9: Package power limit notification (total events = 374)
    Nov 7 10:15:15 someserver [.crit] CPU21: Package power limit notification (total events = 375)
    Nov 7 10:15:15 someserver [.crit] CPU23: Package power limit notification (total events = 374)
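
To gauge how much of this spam a given incident produced, the messages can be tallied by kind. This is only a rough sketch: it assumes the messages are written one per line to a syslog-style file, and the function name `pln_tally` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count "power limit notification" lines by kind
# (Core vs. Package) in the syslog-style file given as $1.
# Assumes one message per line, in the format shown in the excerpt.
pln_tally() {
    awk '
        /Core power limit notification/    { core++ }
        /Package power limit notification/ { pkg++ }
        END { printf "core=%d package=%d\n", core, pkg }
    ' "$1"
}
```

It could be run as `pln_tally /var/log/messages`, where the log path is an assumption about where syslog puts kernel messages on the host.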

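A bit of Google-fu reveals that this is typically associated with the CPU running hot, or with voltage regulation kicking in. I don't think that's what is happening, though. Temperature sensors on all servers in the cluster read normal, "Power Cap Policy" is disabled in the iDRAC, and the "System Profile" is set to "Performance" on all of these servers: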

    # omreport chassis biossetup | grep -A10 'System Profile'
    System Profile Settings
    ------------------------------------------
    System Profile                         : Performance
    CPU Power Management                   : Maximum Performance
    Memory Frequency                       : Maximum Performance
    Turbo Boost                            : Enabled
    C1E                                    : Disabled
    C States                               : Disabled
    Monitor/Mwait                          : Enabled
    Memory Patrol Scrub                    : Standard
    Memory Refresh Rate                    : 1x
    Memory Operating Voltage               : Auto
    Collaborative CPU Performance Control  : Disabled

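  • A Dell mailing list post describes the symptoms almost perfectly. Dell suggested that the author try the Performance profile, but that didn't help. He ended up applying some settings from Dell's guide for configuring a server for low-latency environments, and one of those settings (or a combination of them) seems to have fixed the problem.
  • Kernel.org bug #36182 notes that power-limit interrupt debugging is enabled by default, which causes performance degradation in scenarios where CPU voltage regulation kicks in.
  • An RHN knowledge base article (RHN login required) mentions a problem affecting PE R620 and R720 servers that are not running the Performance profile, and recommends a kernel update released two weeks ago. Except that we are running the Performance profile…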

Everything I can find online has me running in circles here. What on earth is going on?

    It's not the voltage regulation itself that causes the performance problem, but the debugging kernel interrupts that it triggers.

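    Despite some misinformation on Redhat's side, all of the linked pages refer to the same phenomenon. The voltage regulation happens with or without the Performance profile, probably because Turbo Boost is enabled. Whatever the cause, these voltage fluctuations interact poorly with the power-limit kernel interrupts that are enabled by default in kernel 2.6.32-358.18.1.el6.x86_64.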

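    Confirmed workarounds:

    • Upgrading to the most recently released Redhat kernel (2.6.32-358.23.2.el6) disables this debugging and eliminates the performance problem.
    • Adding the following kernel parameter to grub.conf disables the power limit notifications (PLNs): clearcpuid=229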
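
After adding the parameter and rebooting, a quick sanity check can confirm that it took effect. The sketch below rests on two assumptions: that the running kernel exposes the feature as a `pln` flag in /proc/cpuinfo, and that the parameter is visible in /proc/cmdline. The function name `pln_status` and its optional file arguments are hypothetical, added only so the check can be tried against sample files.

```shell
#!/bin/sh
# Hypothetical check: was clearcpuid=229 passed at boot, and is the
# "pln" CPU flag (assumed name) still advertised?
# $1 and $2 optionally override the files that are inspected.
pln_status() {
    cmdline=${1:-/proc/cmdline}
    cpuinfo=${2:-/proc/cpuinfo}

    if grep -qw 'clearcpuid=229' "$cmdline"; then
        echo "clearcpuid=229: present on kernel command line"
    else
        echo "clearcpuid=229: missing from kernel command line"
    fi

    # Only the first "flags" line is inspected; all CPUs are assumed alike.
    if grep -m1 '^flags' "$cpuinfo" | grep -qw 'pln'; then
        echo "pln flag: still set"
    else
        echo "pln flag: cleared"
    fi
}
```

Calling `pln_status` with no arguments inspects the live system. If the parameter is present but the flag is still reported as set, the flag name or bit number may differ on that kernel.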

    Flaky workarounds:

    • Setting the System Profile to "Performance". On its own this was not enough to disable PLNs on our servers. Your mileage may vary.

    Bad workarounds:

    • Blacklisting ACPI-related modules. I've seen this suggested in a few forum threads. It's ill-advised, so don't.