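We have an AMD-based HP ML115 G5 server. After the power button is pressed, it shuts itself down after 10-15 seconds, before the single BIOS POST beep (during the fan test, I think).
We need some help with remote (200 km) hardware troubleshooting. Our hardware specs are as follows: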

    root@linux:~/# dmidecode -t1
    # dmidecode 2.12
    SMBIOS 2.5 present.

    System Information
        Manufacturer: HP
        Product Name: ProLiant ML115 G5
        Serial Number: CZC94743QJ
        SKU Number: 470064-894

    root@linux:~/# head -n 30 dmidecode.txt
    # dmidecode 2.12
    Handle 0x0000, DMI type 0, 24 bytes
    BIOS Information
        Vendor: HP
        Version: O18
        Release Date: 07/06/2009

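At the moment it is running stably. I managed to power it on like this:
If we put it back in its standard position, it won't power on, just as I described at the beginning. Fully repeatable.
The voltage/temperature/fan readings look fine to me: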

    root@linux:~/# ipmitool sdr
    POST Error       | Not Readable      | ns
    Memory ECC       | Not Readable      | ns
    ACPI State       | 0x01              | ok
    PCI Reset        | 0x00              | ok
    CPU Fan          | 1048.88 RPM       | ok
    Rear Fan         | 2107.04 RPM       | ok
    CPU Diode        | 26.50 degrees C   | ok
    Front Ambient    | 19 degrees C      | ok
    System 12V       | 11.93 Volts       | ok
    System 5V        | 5.12 Volts        | ok
    System AUX 5V    | 4.98 Volts        | ok
    System 3.3V      | 3.39 Volts        | ok
    System AUX 3.3V  | 3.33 Volts        | ok
    CPU Vcore        | 1.07 Volts        | ok
    CPU 12V          | 11.82 Volts       | ok
    HT 1.2V          | 1.20 Volts        | ok
    Mem Vcore        | 1.81 Volts        | ok
    MEM VTT          | 0.90 Volts        | ok
    MCP55 1.5V       | 1.50 Volts        | ok
    MCP55 1.4V       | 1.40 Volts        | ok
    Therm-Trip       | 0x00              | ok
    CPU Prochot      | 0x00              | ok
    System Reset     | 0x00              | ok
    NMI              | 0x00              | ok
    PCI Error        | Not Readable      | ns
    CPU Socket       | 0x01              | ok
    LO100 Present    | 0x00              | ok
    Watchdog         | Not Readable      | ns

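For anyone checking a listing like this remotely, here is a minimal Python sketch that pulls out the sensors whose status column is anything other than `ok`. It assumes the three-column, pipe-separated layout of the `ipmitool sdr` output above. The helper names `parse_sdr` and `not_ok` are made up for this illustration, and `SAMPLE` is only a handful of rows copied from the listing.

```python
def parse_sdr(text):
    """Split pipe-separated `ipmitool sdr` lines into (name, reading, status)."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only well-formed three-column rows with a sensor name.
        if len(parts) == 3 and parts[0]:
            rows.append(tuple(parts))
    return rows


def not_ok(rows):
    """Return the rows whose status column is not 'ok'."""
    return [row for row in rows if row[2] != "ok"]


# A few rows copied from the output above (hypothetical excerpt, not the full list).
SAMPLE = """\
POST Error       | Not Readable      | ns
CPU Fan          | 1048.88 RPM       | ok
Rear Fan         | 2107.04 RPM       | ok
CPU Diode        | 26.50 degrees C   | ok
System 12V       | 11.93 Volts       | ok
Watchdog         | Not Readable      | ns
"""

if __name__ == "__main__":
    for name, reading, status in not_ok(parse_sdr(SAMPLE)):
        print(f"{name}: {reading} ({status})")
```

On the full listing, the only rows this would flag are the `Not Readable` ones (POST Error, Memory ECC, PCI Error, Watchdog). Every fan, temperature and voltage sensor reports `ok`.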
IPMI events:

      18 | 03/18/2015 | 09:29:46 | Temperature #0x20 | Upper Non-critical going high | Asserted
      30 | 03/18/2015 | 09:30:08 | Temperature #0x20 | Upper Critical going high | Asserted
      48 | 03/18/2015 | 10:38:59 | Temperature #0x20 | Upper Non-critical going high | Asserted
      60 | 03/18/2015 | 10:39:20 | Temperature #0x20 | Upper Critical going high | Asserted
      78 | 03/18/2015 | 10:45:26 | Temperature #0x20 | Upper Non-critical going high | Asserted
      90 | 03/18/2015 | 10:45:30 | Temperature #0x20 | Upper Non-critical going high | Deasserted
      a8 | 03/18/2015 | 10:45:56 | Temperature #0x20 | Upper Non-critical going high | Asserted
      c0 | 03/18/2015 | 10:46:12 | Temperature #0x20 | Upper Critical going high | Asserted
      d8 | 03/18/2015 | 10:48:42 | Temperature #0x20 | Upper Non-critical going high | Asserted
      f0 | 03/18/2015 | 10:48:46 | Temperature #0x20 | Upper Non-critical going high | Deasserted
     108 | 03/18/2015 | 10:49:04 | Temperature #0x20 | Upper Non-critical going high | Asserted
     120 | 03/18/2015 | 10:49:18 | Temperature #0x20 | Upper Critical going high | Asserted
     138 | 03/18/2015 | 10:50:24 | Temperature #0x20 | Upper Non-critical going high | Asserted
     150 | 03/18/2015 | 10:50:25 | Temperature #0x20 | Upper Critical going high | Asserted
     168 | 03/18/2015 | 10:57:53 | Temperature #0x20 | Upper Non-critical going high | Asserted
     180 | 03/18/2015 | 10:57:57 | Temperature #0x20 | Upper Non-critical going high | Deasserted
     198 | 03/18/2015 | 10:58:24 | Temperature #0x20 | Upper Non-critical going high | Asserted
     1b0 | 03/18/2015 | 10:58:41 | Temperature #0x20 | Upper Critical going high | Asserted
     1c8 | 03/18/2015 | 11:14:23 | Temperature #0x20 | Upper Non-critical going high | Asserted
     1e0 | 03/18/2015 | 11:15:06 | Temperature #0x20 | Upper Non-critical going high | Deasserted
     1f8 | 03/18/2015 | 11:16:33 | Temperature #0x20 | Upper Non-critical going high | Asserted
     210 | 03/18/2015 | 11:16:33 | Temperature #0x20 | Upper Critical going high | Asserted
     228 | 03/18/2015 | 11:49:12 | Temperature #0x20 | Upper Non-critical going high | Asserted
     240 | 03/18/2015 | 11:49:18 | Temperature #0x20 | Upper Non-critical going high | Deasserted
     258 | 03/18/2015 | 11:55:45 | Temperature #0x20 | Upper Non-critical going high | Asserted
     270 | 03/18/2015 | 11:55:46 | Temperature #0x20 | Upper Non-critical going high | Deasserted
     288 | 03/18/2015 | 11:56:32 | Temperature #0x20 | Upper Non-critical going high | Asserted
     2a0 | 03/18/2015 | 11:57:06 | Temperature #0x20 | Upper Critical going high | Asserted
     2b8 | 03/18/2015 | 12:00:11 | Temperature #0x20 | Upper Non-critical going high | Asserted
     2d0 | 03/18/2015 | 12:00:14 | Temperature #0x20 | Upper Non-critical going high | Deasserted
     2e8 | 03/18/2015 | 12:00:59 | Temperature #0x20 | Upper Non-critical going high | Asserted
     300 | 03/18/2015 | 12:01:34 | Temperature #0x20 | Upper Critical going high | Asserted
     318 | 07/06/2009 | 00:00:22 | Fan #0x42 | Upper Critical going high | Asserted
     330 | 11/13/2016 | 13:25:47 | Fan #0x41 | Upper Critical going high | Asserted
     348 | 11/13/2016 | 13:33:00 | Fan #0x41 | Upper Critical going high | Asserted
     360 | 11/13/2016 | 13:33:47 | Fan #0x41 | Upper Critical going high | Asserted
     378 | 11/13/2016 | 13:44:58 | Fan #0x41 | Upper Critical going high | Asserted
     390 | 11/13/2016 | 13:45:48 | Fan #0x41 | Upper Critical going high | Asserted
     3a8 | 11/13/2016 | 13:47:45 | Fan #0x41 | Upper Critical going high | Asserted
     3c0 | 12/01/2016 | 17:00:29 | Fan #0x41 | Upper Critical going high | Asserted
     3d8 | 12/01/2016 | 17:01:53 | Fan #0x41 | Upper Critical going high | Asserted
     3f0 | 12/01/2016 | 17:04:02 | Fan #0x41 | Upper Critical going high | Asserted
     408 | 12/01/2016 | 17:31:34 | Fan #0x41 | Upper Critical going high | Asserted
     420 | 12/01/2016 | 17:43:42 | Fan #0x41 | Upper Critical going high | Asserted

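The log is easier to read when grouped by sensor and date. The following is a rough Python sketch, assuming the six-column, pipe-separated layout shown in the event list above. The names `parse_sel` and `summarize` are invented for this example, and `SAMPLE` holds only a few entries copied from the log.

```python
from collections import Counter


def parse_sel(text):
    """Parse pipe-separated event-log lines into dictionaries."""
    events = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        rec_id, date, time, sensor, event, state = parts
        events.append({
            "id": rec_id, "date": date, "time": time,
            "sensor": sensor, "event": event, "state": state,
        })
    return events


def summarize(events):
    """Count asserted events per (sensor, date) pair."""
    return Counter(
        (e["sensor"], e["date"]) for e in events if e["state"] == "Asserted"
    )


# A few entries copied from the log above (hypothetical excerpt, not the full log).
SAMPLE = """\
2d0 | 03/18/2015 | 12:00:14 | Temperature #0x20 | Upper Non-critical going high | Deasserted
300 | 03/18/2015 | 12:01:34 | Temperature #0x20 | Upper Critical going high | Asserted
318 | 07/06/2009 | 00:00:22 | Fan #0x42 | Upper Critical going high | Asserted
330 | 11/13/2016 | 13:25:47 | Fan #0x41 | Upper Critical going high | Asserted
348 | 11/13/2016 | 13:33:00 | Fan #0x41 | Upper Critical going high | Asserted
3c0 | 12/01/2016 | 17:00:29 | Fan #0x41 | Upper Critical going high | Asserted
"""

if __name__ == "__main__":
    for (sensor, date), count in sorted(summarize(parse_sel(SAMPLE)).items()):
        print(f"{sensor} on {date}: {count} asserted")
```

Grouped this way, the full log splits into two clusters: every Temperature #0x20 event is dated 03/18/2015, while the Fan #0x41 events fall on 11/13/2016 and 12/01/2016, the days the shutdowns were observed.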
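It first happened to me on 11/13/2016. I thought it might be the hardware watchdog, so we disabled it in the BIOS.
The server has 2x1TB and 2x3TB disks and no optical drive. The power supply is a 365 W non-hot-plug, non-redundant unit.
For now we are recommending that the box be replaced, but I can't explain why this is happening (I think it is some kind of mechanical motherboard fault). I'd like to know whether you have any other ideas.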
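**Update:** Mr. Chopper3 asked what I meant by "but CPU one is not standard". The original heatsink was damaged like this:
Time and a poor choice of material: plastic is not meant to last under constant stress. I have never seen a plastic heatsink mount in any other box…
The server has always been kept in reasonable conditions: it never overheated, was not exposed to direct sunlight, and nobody touched it while it was running.
That was about a year and a half ago. We could not find an original HP part on the market, so we replaced it with one three times as large, because AM2 sockets were not that popular at the time. I no longer remember whether it has two signal wires plus VCC and GND (4 wires), like the stock one above. It may have only three: VCC, GND and the rotation signal (3 wires). Since then we have had several power outages, and this never happened.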
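My vote is for a fault on the motherboard, such as a failed solder joint or a marginal component. I have run into similar failures, where pushing on the motherboard a certain way would let the server boot, but as soon as I released the pressure the server would shut down with a fan failure or hang with an ECC error.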
You may have a fan failure, with the server configured to halt on a critical fan failure.