How to determine the cause of random reboots on a Fedora Linux server

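I have an F23 Linux machine running as a development server, and over the past few weeks I have logged in several times to find that it had been reset. On one occasion it rebooted in front of me: it appeared to reset to the BIOS and then power on again.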

This seems to happen every 2 or 3 days. The server logs show only normal operation (cron and so on) right up to the reset and reboot:

https://paste.fedoraproject.org/518600/33737531/

    Jan 01 20:01:02 pc03.config run-parts[19540]: (/etc/cron.hourly) starting mcelog.cron
    Jan 01 20:01:02 pc03.config run-parts[19544]: (/etc/cron.hourly) finished mcelog.cron
    Jan 01 20:09:10 pc03.config puppet-agent[19565]: Applied catalog in 0.03 seconds
    -- Reboot --
    Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
    Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
    Jan 01 20:17:57 pc03.config kernel: Linux version 4.8.13-100.fc23.x86_64 ([email protected]) (gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC) ) #1 SMP Fri Dec 9 14:51:40 UTC 2016
    Jan 01 20:17:57 pc03.config kernel: Command line: BOOT_IMAGE=/vmlinuz-4.8.13-100.fc23.x86_64 root=/dev/mapper/fedora_pc03-root ro rd.lvm.lv=fedora_pc03/root rd.lvm.lv=fedora_pc03/swap rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off LANG=en_GB.UTF-8
    Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
    Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'

However, there do seem to be a lot of messages like this in the journal:

    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: It has been corrected by h/w and requires no further action
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: event severity: corrected
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: Error 0, type: corrected
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: fru_text: CorrectedErr
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: section_type: PCIe error
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: port_type: 0, PCIe end point
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: version: 0.0
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: command: 0xffff, status: 0xffff
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: device_id: 0000:80:02.3
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: slot: 0
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: secondary_bus: 0x00
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
    Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: class_code: ffffff
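To get a feel for how often these records appear and which device they name, something like the following could tally them. This is only a rough sketch: the function name `summarize_hardware_errors` is made up for illustration, and it assumes every line of a record carries the same `{N}[Hardware Error]:` prefix as in the excerpt above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   ... kernel: {680}[Hardware Error]: device_id: 0000:80:02.3
RECORD = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)")


def summarize_hardware_errors(lines):
    """Count hardware error records per (severity, device_id) pair.

    Hypothetical helper. Lines are grouped by the {N} record number,
    which is assumed to be unique within the text passed in (it may
    repeat across boots, so feed in one boot at a time).
    """
    records = {}
    for line in lines:
        match = RECORD.search(line)
        if match is None:
            continue  # not a hardware error line (cron, puppet, ...)
        record_id, text = match.group(1), match.group(2).strip()
        fields = records.setdefault(record_id, {})
        key, sep, value = text.partition(":")
        key = key.strip()
        # Keep the first value seen for a key; the later
        # "vendor_id: ..., device_id: ..." line is stored under vendor_id.
        if sep and key not in fields:
            fields[key] = value.strip()

    counts = Counter()
    for fields in records.values():
        severity = fields.get("event severity", "unknown")
        device = fields.get("device_id", "unknown")
        counts[(severity, device)] += 1
    return counts
```

The input could be the saved output of `journalctl -k` for a single boot, read with `open(path).read().splitlines()`; the result maps each (severity, device) pair to the number of records seen.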

I checked the BIOS SMBIOS event log, and it only shows reboot code 0x17 after the machine resets; it has not registered any memory resets as I expected.

Unfortunately, the machine does not support IPMI, because the board is a Supermicro X9DAi.

I don't know how to interpret the error codes in that hardware error message, but 0000:80:02 seems to correspond to:

    [root@pc03 ~]# lspci -s 0000:80:02
    80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
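For reference, the address in the error record (0000:80:02.3) and the one lspci prints (80:02.0) differ only in the final function digit. A small sketch of how such an address splits into its parts, assuming the usual `domain:bus:device.function` hexadecimal notation; `PciAddress` and `parse_pci_address` are names invented for this example:

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    """Hypothetical container for the parts of a PCI address."""
    domain: int
    bus: int
    device: int
    function: int


# Optional 4-digit domain, then bus:device.function, all hexadecimal.
PCI_ADDR = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(text):
    """Split '0000:80:02.3' or '80:02.0' into numeric fields."""
    match = PCI_ADDR.match(text.strip())
    if match is None:
        raise ValueError("not a PCI address: %r" % text)
    domain, bus, device, function = match.groups()
    # A missing domain is treated as domain 0 (an assumption of this sketch).
    return PciAddress(
        int(domain or "0", 16), int(bus, 16), int(device, 16), int(function, 16)
    )


logged = parse_pci_address("0000:80:02.3")  # from the error record
listed = parse_pci_address("80:02.0")       # from the lspci output
same_device = logged[:3] == listed[:3]      # domain, bus and device agree
same_function = logged.function == listed.function
```

Here `same_device` comes out true and `same_function` false, so the record points at function 3 of the same bus and device that lspci lists as function 0.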

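I am currently monitoring temperatures and CPU on the server, so the next time it crashes I will have a good idea of the sensor state. Are there any other steps I can take to determine the root cause of this crash?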