Alerting a contact group after one hour in Nagios / OMD

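I am trying to find a solution for the scenario below.

I have a Nagios instance with a few hundred services (an OMD install with check_mk and other tasty stuff). The services are defined as different service types, and each type has its own contact group, whose members are alerted when a problem occurs.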

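It works well, but I would like to call a script if a service is still in a critical state after 1 hour and has not been acknowledged, commented on, etc.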

I did not find anything about this in the reference documentation.

Thanks in advance for your help.

A typical service type:

    define contact{
        contact_name    level1              ; Short name of user
        use             generic-contact     ; Inherit default values from
        alias           Gravity Level1      ; Full name of user
        email           [email protected]   ; email for alerting
    }

    define contactgroup{
        contactgroup_name   defcon3
        members             level1, level2
    }

    define service{
        name                            defcon3-service ; The 'name' of this service template
        active_checks_enabled           0               ; Active service checks are enabled
        passive_checks_enabled          1               ; Passive service checks are enabled/accepted
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3               ; Re-check the service up to 3 times in order to
        normal_check_interval           2               ; Check the service every 10 minutes under normal
        retry_check_interval            1               ; Re-check the service every two minutes until a
        notification_options            w,u,c,r         ; Send notifications about warning, unknown,
        notification_interval           60              ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
        contact_groups                  defcon3         ; default mail to monitoring -v-
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT AR
    }

    define service {
        use                     check_mk_passive_perf
        use                     defcon3-service
        host_name               gravity-mon
        service_description     CPU load
        contact_groups          +defcon3
        service_groups          +defcon3
        check_command           check_mk-cpu.loads
    }

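I hate to directly contradict another poster, but NAGIOS can do this: what you are looking for is called notification escalation in the documentation.

The doc says,

Notifications are escalated if and only if one or more escalation definitions match the current notification that is being sent out. If a host or service notification does not have any valid escalation definitions that apply to it, the contact group(s) specified in either the host group or service definition will be used for the notification.

So, if you have a service called HTTP on host webserver whose failure normally notifies the group sysadmins every 30 minutes (for example), and you want to bring in other contact groups if the alert is still unacknowledged and unresolved by the third reminder, you could try: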

    define serviceescalation{
        host_name               webserver
        service_description     HTTP
        first_notification      3
        last_notification       5
        contact_groups          nt-admins,managers
    }

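In your case, you don't want to notify a person but to call a script. To do that, define a new contact group with a single member whose service_notification_commands points at a command that runs (for example) /usr/local/bin/my-webserver-handling-script. Note that the directive takes the name of a command definition, not the script path itself; the path goes in that command's command_line.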
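As a rough sketch of that setup, something like the following might work. The names run-escalation-script, script-runner and script-runners are made up for illustration, the script path is just the example one from above, and the 24x7 time period is assumed to exist because the question's template already uses it. The macros passed to the script are standard Nagios macros, but which arguments the script actually needs is an assumption.

```
# Hypothetical command wrapping the script; the arguments are only a suggestion.
define command{
    command_name    run-escalation-script
    command_line    /usr/local/bin/my-webserver-handling-script "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$"
    }

# Pseudo-contact: "notifying" it runs the script instead of mailing a person.
define contact{
    contact_name                    script-runner
    alias                           Escalation script pseudo-contact
    host_notifications_enabled      0
    service_notifications_enabled   1
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       n       ; never act on host notifications
    service_notification_options    c       ; only react to CRITICAL, not to recoveries
    host_notification_commands      run-escalation-script
    service_notification_commands   run-escalation-script
    }

# Contact group with the pseudo-contact as its single member.
define contactgroup{
    contactgroup_name   script-runners
    members             script-runner
    }
```

Restricting service_notification_options to c should keep the script from also being run for warning, unknown or recovery notifications that happen to match the escalation.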

If you don't want the script to be invoked repeatedly, adjust first_notification and last_notification above so that this particular escalation applies only once.
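For the service in the question, a one-shot escalation could look roughly like the sketch below. It assumes a contact group called script-runners (a hypothetical name) whose only member runs the script as its notification command. With notification_interval 60 in the service template, notification 1 goes out when the problem becomes a hard state and notification 2 follows about an hour later if the problem is still present, so matching only notification 2 approximates "after one hour". Acknowledging the problem suppresses further notifications, so an acknowledged problem should not reach the script.

```
# Sketch only: script-runners is a hypothetical contact group.
define serviceescalation{
    host_name               gravity-mon
    service_description     CPU load
    first_notification      2       ; the first re-notification, roughly one hour in
    last_notification       2       ; same number, so the escalation matches once
    notification_interval   60      ; keep the service's hourly re-notification
    escalation_options      c       ; only escalate while the service is CRITICAL
    contact_groups          defcon3,script-runners
    }
```

defcon3 is listed alongside script-runners because, for a notification that matches an escalation, the escalation's contact groups are used instead of the service's own; leaving it out would mean the humans miss that reminder.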

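I would also caution you about doing this. Personally, I don't like the notification system also being the event-handling system; I think it should let a human know that something is wrong and let the human deal with it, and here is why: by definition, NAGIOS only alerts people when things are not normal. If you are going to handle the problem automatically, you need to be very sure that your approach is right. For example, if you intend this script to restart the web server, you had better be sure that your host dependencies are set up correctly, so that the failure of an intermediate router does not also cause your web server to be brutally and repeatedly restarted, leaving you with filesystem corruption to deal with once the router is fixed.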