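I recently bought a SuperMicro X10SLL-F motherboard with a built-in BMC (Aspeed AST2400 chip). I would like to use the built-in watchdog controller while the server is running Linux (Gentoo Hardened).
I enabled the watchdog function in the BIOS and moved the motherboard jumper from hardware reset to NMI (the watchdog timeout action; this is for testing, to avoid reboots). On the software side, I installed the watchdog daemon (sys-apps/watchdog) and added it to the default runlevel. It is configured to ping the watchdog device (/dev/watchdog, which exists) every 10 seconds. The watchdog timeout is set to 250 seconds.
The tools evidently see the watchdog hardware. ipmitool (with OpenIPMI enabled):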
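As a rough illustration of what such a daemon does, here is a minimal Python sketch of a keepalive loop. It assumes the standard Linux watchdog character-device behaviour (any write restarts the countdown, and a final `V` requests a "magic close" if the driver supports it and `nowayout` is not set). The function name `keep_alive` and its parameters are hypothetical; this is not how sys-apps/watchdog is implemented.

```python
import os
import time


def keep_alive(device="/dev/watchdog", interval=10.0, pings=3, sleep=time.sleep):
    """Ping a watchdog device `pings` times, then ask the driver to disarm it.

    `device`, `interval`, `pings` and `sleep` are illustrative parameters,
    not options of any real daemon.
    """
    fd = os.open(device, os.O_WRONLY)  # opening the device normally arms the timer
    try:
        for _ in range(pings):
            os.write(fd, b"\0")        # any write restarts the countdown
            sleep(interval)
        # "Magic close": drivers that support it stop the timer on close
        # only if the last byte written was 'V'.
        os.write(fd, b"V")
    finally:
        os.close(fd)
```

If the process dies between pings, nothing restarts the countdown and the configured timeout action fires, which is the behaviour the watchdog is meant to provide.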
    # ipmitool mc watchdog get
    Watchdog Timer Use: SMS/OS (0x44)
    Watchdog Timer Is: Started/Running
    Watchdog Timer Actions: Hard Reset (0x01)
    Pre-timeout interval: 0 seconds
    Timer Expiration Flags: 0x10
    Initial Countdown: 254 sec
    Present Countdown: 253 sec
FreeIPMI:
    # bmc-watchdog --get
    Timer Use: SMS/OS
    Timer: Running
    Logging: Enabled
    Timeout Action: Hard Reset
    Pre-Timeout Interrupt: None
    Pre-Timeout Interval: 0 seconds
    Timer Use BIOS FRB2 Flag: Clear
    Timer Use BIOS POST Flag: Clear
    Timer Use BIOS OS Load Flag: Clear
    Timer Use BIOS SMS/OS Flag: Set
    Timer Use BIOS OEM Flag: Clear
    Initial Countdown: 254 seconds
    Current Countdown: 253 seconds
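To check whether the daemon's pings actually reset the BMC timer, one could sample this output repeatedly and compare the countdown values. A minimal Python sketch, assuming only the `Key: value` layout shown above; `parse_watchdog_status` and `countdown_seconds` are hypothetical helper names, and the sample text is copied from the ipmitool output in this post.

```python
import re


def parse_watchdog_status(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def countdown_seconds(status, key="Present Countdown"):
    """Return the leading integer of a value such as '253 sec', or None."""
    match = re.match(r"\d+", status.get(key, ""))
    return int(match.group()) if match else None


SAMPLE = """\
# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 254 sec
Present Countdown: 253 sec
"""

status = parse_watchdog_status(SAMPLE)
print(countdown_seconds(status))
```

For bmc-watchdog output the key would be `Current Countdown` instead of `Present Countdown`. If two samples taken more than one ping interval apart show the countdown only ever decreasing, the pings are not reaching the timer.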
However, after a certain amount of time (the value the tools above report as the "current countdown"), the kernel logs:
    [ 294.107534] Uhhuh. NMI received for unknown reason 21 on CPU 0.
    [ 294.107998] Do you have a strange power saving mode enabled?
    [ 294.108437] Dazed and confused, but trying to continue
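The NMI is evidently caused by the watchdog timeout. A hard reset of the machine follows less than a minute later.
Where is the problem, and in which direction should I dig?
Edit: IPMI-related kernel messages: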
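For context on the "unknown reason 21" message: as far as I understand the x86 Linux NMI code, the number is the hexadecimal value read from the NMI status port, and the kernel only recognises two sources in it, PCI SERR (bit 0x80) and I/O check (bit 0x40). The Python sketch below decodes the value under that assumption; `classify_nmi_reason` is a hypothetical helper, not a kernel interface.

```python
# Bit values as I understand them from the x86 Linux NMI handling code;
# treat them as an assumption rather than a reference.
NMI_REASON_SERR = 0x80   # PCI system error
NMI_REASON_IOCHK = 0x40  # I/O channel check


def classify_nmi_reason(reason_hex):
    """Map the hex reason string from the log line to recognised NMI sources."""
    reason = int(reason_hex, 16)
    sources = []
    if reason & NMI_REASON_SERR:
        sources.append("PCI SERR")
    if reason & NMI_REASON_IOCHK:
        sources.append("I/O check")
    return sources or ["unknown"]


print(classify_nmi_reason("21"))
```

For the value 21 neither bit is set, so the result is "unknown". That would be consistent with an NMI raised by the BMC rather than by a PCI or I/O-check error, with no registered handler claiming it.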
    [ 0.353090] ipmi message handler version 39.2
    [ 0.353353] ipmi device interface
    [ 0.353623] IPMI System Interface driver.
    [ 0.353898] ipmi_si: probing via ACPI
    [ 0.354172] ipmi_si 00:08: [io 0x0ca2] regsize 1 spacing 1 irq 0
    [ 0.354444] ipmi_si: Adding ACPI-specified kcs state machine
    [ 0.354790] ipmi_si: probing via SMBIOS
    [ 0.355051] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
    [ 0.355317] ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
    [ 0.355836] ipmi_si: probing via SPMI
    [ 0.356095] ipmi_si: SPMI: io 0xca2 regsize 1 spacing 1 irq 0
    [ 0.356362] ipmi_si: Adding SPMI-specified kcs state machine duplicate interface
    [ 0.356906] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
    [ 0.390536] ipmi_si: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
    [ 0.418476] ipmi_si 00:08: Found new BMC (man_id: 0x002a7c, prod_id: 0x0801, dev_id: 0x20)
    [ 0.419004] ipmi_si 00:08: IPMI kcs interface initialized
    [ 0.419272] IPMI SSIF Interface driver
    [ 0.420350] IPMI Watchdog: driver initialized
    [ 0.420635] Copyright (C) 2004 MontaVista Software - IPMI Powerdown via sys_reboot.
    [ 0.421444] IPMI poweroff: ATCA Detect mfg 0x2A7C prod 0x801
    [ 0.421710] IPMI poweroff: Found a chassis style poweroff function
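Edit: I tried running bmc-watchdog with the configuration "-u 4 -p 2 -a 0 -F -P -L -O -i 300 -e 10". So only the SMS/OS timer use is in effect, the pre-timeout interrupt is set to NMI, and the timeout action is set to None: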
    # bmc-watchdog --get
    Timer Use: SMS/OS
    Timer: Running
    Logging: Enabled
    Timeout Action: None
    Pre-Timeout Interrupt: NMI / Diagnostic Interrupt
    Pre-Timeout Interval: 0 seconds
    Timer Use BIOS FRB2 Flag: Clear
    Timer Use BIOS POST Flag: Clear
    Timer Use BIOS OS Load Flag: Clear
    Timer Use BIOS SMS/OS Flag: Set
    Timer Use BIOS OEM Flag: Clear
    Initial Countdown: 300 seconds
    Current Countdown: 290 seconds
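However, this changed nothing at all.
Edit: Also, when I trigger the watchdog timer by echoing \0x00 to /dev/watchdog and then leave it alone, the system reboots correctly after the default 10-second timeout. So the watchdog itself works fine, but the system still reboots just 350 seconds after boot.
Edit: I checked the BMC System Event Log (SEL) and found this after the reboot: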
    Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
    Sensor #202 | Watchdog 2 | Assertion Event | Timer expired, status only ; Timer use at expiration = SMS/OS ; Interrupt type = none
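The interesting part is that this event is marked "status only". Even so, the system reboots. When I deliberately trigger a watchdog timeout, the log is different: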
    Sensor #202 | Watchdog 2 | Assertion Event | Timer interrupt ; Timer use at expiration = SMS/OS ; Interrupt type = none
    Sensor #202 | Watchdog 2 | Assertion Event | Hard Reset ; Timer use at expiration = SMS/OS ; Interrupt type = none
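Eventually I found a somewhat odd solution: simply leave the watchdog jumper (JWD1) open, selecting neither NMI nor hard reset. The watchdog stays enabled in the BIOS setup.
In this configuration the watchdog works as expected: with bmc-watchdog running, the system stayed up for 25 minutes, and it rebooted once the watchdog daemon was terminated.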