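EDAC memory errors after upgrading a SuperMicro server running CentOS 7. Are these errors caused by the motherboard, the operating system, or a damaged memory module?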

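We have a server built on a SuperMicro MBD-X9DRD-EF motherboard. For a year it ran well under CentOS 7 with one CPU (Intel Xeon X6 E5-2620v2) and 128 GB (8×16 GB) of LVDDR memory (1600MHz Crucial ECC Reg RTL (PC3-12800)). Last month we upgraded the server by adding a second CPU and an additional 128 GB of memory, identical to the existing modules. However, after 3-4 days of intensive use, we started to receive errors like this, and they occur frequently: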

    [root@GBserver log]# dmesg
    [614781.869098] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
    [614781.869104] EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010090
    [614781.869106] EDAC sbridge MC1: TSC 0
    [614781.869108] EDAC sbridge MC1: ADDR 38126a6c40
    [614781.869110] EDAC sbridge MC1: MISC 14066ca86
    [614781.869112] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1473082855 SOCKET 1 APIC 20
    [614782.595676] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x38126a6 offset:0xc40 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:1)
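For reference, the corrected-error (CE) line in the dmesg output above already names the affected DIMM label and the failing address. The following is a minimal Python sketch, not part of any EDAC tooling, that pulls those fields out of the line. The function name `decode_ce_line` and the 4 KiB page size are assumptions made for illustration; the sample line is copied from the log above. Mapping the label to a physical slot on the board is board-specific and is not attempted here.

```python
import re

# Sample line copied from the dmesg output in this post.
LINE = ("EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 "
        "(channel:0 slot:0 page:0x38126a6 offset:0xc40 grain:32 syndrome:0x0 - "
        "area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:1)")

PAGE_SHIFT = 12  # assumption: 4 KiB pages, the usual x86_64 default


def decode_ce_line(line):
    """Extract socket, channel, DIMM index and physical address from a CE line.

    Hypothetical helper for illustration; returns None if the line does not
    have the expected shape.
    """
    label = re.search(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)", line)
    page = re.search(r"page:0x([0-9a-fA-F]+)", line)
    offset = re.search(r"offset:0x([0-9a-fA-F]+)", line)
    if not (label and page and offset):
        return None
    socket, ha, channel, dimm = (int(x) for x in label.groups())
    # Recombine page number and in-page offset into one address.
    address = (int(page.group(1), 16) << PAGE_SHIFT) | int(offset.group(1), 16)
    return {
        "socket": socket,
        "ha": ha,
        "channel": channel,
        "dimm": dimm,
        "address": address,
    }


info = decode_ce_line(LINE)
print(info["socket"], info["channel"], info["dimm"], hex(info["address"]))
```

Under the 4 KiB page assumption, the recombined address agrees with the `ADDR 38126a6c40` value reported by the sbridge driver a few lines earlier, which suggests both messages describe the same event on socket 1, channel 0, DIMM 0.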

And the output of edac-util:

    [root@GBserver log]# edac-util -v
    mc0: 0 Uncorrected Errors with no DIMM info
    mc0: 0 Corrected Errors with no DIMM info
    mc0: csrow0: 0 Uncorrected Errors
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
    mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
    mc0: csrow1: 0 Uncorrected Errors
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
    mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
    mc1: 0 Uncorrected Errors with no DIMM info
    mc1: 0 Corrected Errors with no DIMM info
    mc1: csrow0: 0 Uncorrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
    mc1: csrow1: 0 Uncorrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
    mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
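With sixteen DIMM entries it is easy to overlook the single non-zero counter, so a small filter can help. The sketch below is an illustration only: it parses text in the shape that `edac-util -v` printed above and lists the DIMM labels whose corrected-error count is greater than zero. The function name `dimms_with_errors` and the shortened `SAMPLE` string are assumptions for the example; in practice the text would come from the real command output.

```python
import re

# One per-DIMM entry, e.g.
# "mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors"
ENTRY = re.compile(r"(mc\d+): (csrow\d+): (\S+): (\d+) Corrected Errors")

# Shortened excerpt of the edac-util -v output shown in this post.
SAMPLE = """\
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 296182 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
"""


def dimms_with_errors(report):
    """Return (mc, csrow, label, count) tuples with count > 0, worst first.

    Hypothetical helper for illustration; summary lines without a DIMM
    label do not match the pattern and are skipped.
    """
    hits = []
    for mc, csrow, label, count in ENTRY.findall(report):
        if int(count) > 0:
            hits.append((mc, csrow, label, int(count)))
    return sorted(hits, key=lambda hit: hit[3], reverse=True)


for mc, csrow, label, count in dimms_with_errors(SAMPLE):
    print(f"{mc} {csrow} {label}: {count} corrected errors")
```

Here every corrected error is attributed to one label, `CPU_SrcID#1_Ha#0_Chan#0_DIMM#0`, which is the first DIMM on channel 0 of the second CPU, one of the newly added modules if the second socket was empty before the upgrade.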

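Are these errors caused by a fault in the motherboard, the CPU, or the operating system, or have we damaged a memory chip? What should we do? How can we find the damaged memory module?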

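After 3 weeks, about 11M corrected errors had been logged. After looking at the BIOS log, I found that a memory module was bad. That answers my question.
Next, I will remove the damaged module and replace it with another one.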