Nagios: check of service is orphaned?

Recently I noticed some warnings in nagios.log:

[1366060611] Warning: The check of service 'pt-deadlock-logger' on host 'xx' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

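The real problem is that after this warning appears, Nagios stops running any checks at all. As a workaround, I had to set up an event handler that restarts Nagios whenever the warning shows up: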

localhost.cfg

    define service{
        use                     logfile-service
        host_name               localhost
        service_description     nagios_orphaned
        check_command           check_nagios_orphaned
        event_handler           restart_nagios
        contact_groups          admin
    }

commands.cfg

    define command {
        command_name    check_nagios_orphaned
        command_line    sudo $USER2$/check_logfiles --tag=orphaned --logfile=/usr/local/nagios/var/nagios.log --warningpattern="looks like it was orphaned"
    }

    define command {
        command_name    restart_nagios
        command_line    $USER1$/eventhandlers/restart_nagios.sh $SERVICESTATE$
    }

restart_nagios.sh

    #!/bin/bash

    case "$1" in
        OK)
            ;;
        WARNING)
            /usr/bin/screen -S nagios -d -m sudo /etc/init.d/nagios restart
            ;;
        UNKNOWN)
            ;;
        CRITICAL)
            ;;
    esac

    exit 0

I have also tried upgrading Nagios to the latest version:

    # nagios -V

    Nagios Core 3.5.0
    Copyright (c) 2009-2011 Nagios Core Development Team and Community Contributors
    Copyright (c) 1999-2009 Ethan Galstad
    Last Modified: 03-15-2013
    License: GPL

but I still get this warning.

The first Google search result is: http://support.nagios.com/wiki/index.php/Nagios_XI:FAQs#Check_Services_Being_Orphaned

But I am sure that only one (parent) process is running:

    # ps -ef | grep '/usr/local/nagios/bin/nagio[s]'
    nagios    8956 15155  0 18:08 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
    nagios    8957 15155  0 18:08 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
    nagios   15155     1  5 14:09 ?        00:13:47 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
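In the listing above, only PID 15155 has PPID 1; the other two are its children. A minimal sketch for checking this automatically follows. The function name `count_nagios_parents` is hypothetical, and it assumes the usual `ps -ef` column order (PPID in column 3, command in column 8).

```shell
# Count Nagios daemon processes whose parent is init (PPID 1).
# Reads `ps -ef`-style lines on stdin; the binary path is the first argument.
count_nagios_parents() {
    awk -v bin="$1" '$3 == 1 && $8 == bin { n++ } END { print n + 0 }'
}

# Example invocation on the Nagios host (not run here):
#   ps -ef | count_nagios_parents /usr/local/nagios/bin/nagios
```

A result greater than 1 would point to the duplicate-daemon situation described in the FAQ linked above.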

Also, I don't see any "Resource temporarily unavailable" errors in the log file, so hitting a resource limit can be ruled out.

The embedded Perl interpreter has been disabled:

    enable_embedded_perl=0
    use_embedded_perl_implicitly=0

Are there any other possible causes?

PS: I'm running Nagios on Xen HVM:

    # virt-what
    xen
    xen-hvm

Update Tue Apr 16 22:07:09 ICT 2013

Searching the source directory for this warning, I found:

    # grep -lr 'looks like it was orphaned' nagios-3.5.0
    /nagios-3.5.0/base/checks.o
    /nagios-3.5.0/base/nagios
    /nagios-3.5.0/base/checks.c

This is the check_for_orphaned_services function:

    /* check for services that never returned from a check... */
    void check_for_orphaned_services(void) {
        service *temp_service = NULL;
        time_t current_time = 0L;
        time_t expected_time = 0L;

        log_debug_info(DEBUGL_FUNCTIONS, 0, "check_for_orphaned_services()\n");

        /* get the current time */
        time(&current_time);

        /* check all services... */
        for(temp_service = service_list; temp_service != NULL; temp_service = temp_service->next) {

            /* skip services that are not currently executing */
            if(temp_service->is_executing == FALSE)
                continue;

            /* determine the time at which the check results should have come in (allow 10 minutes slack time) */
            expected_time = (time_t)(temp_service->next_check + temp_service->latency + service_check_timeout + check_reaper_interval + 600);

            /* this service was supposed to have executed a while ago, but for some reason the results haven't come back in... */
            if(expected_time < current_time) {

                /* log a warning */
                logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: The check of service '%s' on host '%s' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...\n", temp_service->description, temp_service->host_name);

                log_debug_info(DEBUGL_CHECKS, 1, "Service '%s' on host '%s' was orphaned, so we're scheduling an immediate check...\n", temp_service->description, temp_service->host_name);

                /* decrement the number of running service checks */
                if(currently_running_service_checks > 0)
                    currently_running_service_checks--;

                /* disable the executing flag */
                temp_service->is_executing = FALSE;

                /* schedule an immediate check of the service */
                schedule_service_check(temp_service, current_time, CHECK_OPTION_ORPHAN_CHECK);
                }
            }

        return;
        }
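The deadline arithmetic in that function can be reproduced outside Nagios as a rough sketch. The helper names `orphan_deadline` and `is_orphaned` are hypothetical, all values are whole seconds (the latency is rounded, which is an assumption of this sketch), and the numbers in the example are illustrative rather than taken from my configuration.

```shell
# Deadline after which a still-executing check counts as orphaned.
# Args: next_check latency service_check_timeout check_reaper_interval
orphan_deadline() {
    echo $(( $1 + $2 + $3 + $4 + 600 ))   # 600 s = the 10 minutes of slack
}

# Succeeds (exit 0) only when the deadline is strictly before "now",
# mirroring the `expected_time < current_time` test in checks.c.
# Args: deadline current_time
is_orphaned() {
    [ "$1" -lt "$2" ]
}

# Example with illustrative values:
#   deadline=$(orphan_deadline 1000 0 60 10)    # 1670
#   is_orphaned "$deadline" 1671 && echo orphaned
```

Because the comparison uses the wall clock, anything that shifts `current_time` relative to `next_check` moves a running check closer to, or further from, this deadline.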

Update Thu Apr 18 22:32:19 ICT 2013

To confirm, I edited the source code to write the values of expected_time and current_time to the log file. This is what I got:

 [1366294608] expected_time: 'Thu Apr 18 21:16:36 2013 ', current_time: 'Thu Apr 18 21:16:48 2013 ' - Warning: The check of service 'Check_MK' on host 'xx' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service... 
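To make the gap between the two logged times explicit, here is a small sketch with hypothetical helper names. It compares the time-of-day parts only, so it assumes both timestamps fall on the same day, as they do here.

```shell
# Convert HH:MM:SS to seconds since midnight (awk reads "08" as decimal 8).
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds from the first time of day to the second (same day assumed).
seconds_between() {
    echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}
```

`seconds_between 21:16:36 21:16:48` prints `12`: the warning was logged 12 seconds after the computed deadline, which already includes the 10 minutes of slack.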

Re-reading the log file, I noticed an important message:

[1366218303] Warning: A system time change of 0d 0h 0m 1s (backwards in time) has been detected. Compensating...

It looks like Xen is the culprit.
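To see how often the clock jumps, the time-change warnings can be pulled out of the log. This is a minimal sketch: `list_time_changes` is a hypothetical helper, and the pattern assumes the message format shown in the line quoted above.

```shell
# Print "<epoch> <direction> <amount>" for every system time change
# warning found in the Nagios log file given as the first argument.
list_time_changes() {
    sed -n 's/^\[\([0-9]*\)] Warning: A system time change of \(.*\) (\(.*\) in time) has been detected.*/\1 \3 \2/p' "$1"
}

# Example invocation on the Nagios host (not run here):
#   list_time_changes /usr/local/nagios/var/nagios.log
```

For the message quoted above this prints `1366218303 backwards 0d 0h 0m 1s`. Comparing these timestamps with those of the orphaned-check warnings would help confirm or rule out the clock as the cause.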