How to react to daemon messages in syslog that are not critical errors?

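The following problem can be seen either as a CIFS/AD-related problem (the specific view) or as a question about service restarts, error handling and log parsing (the general view). I will cover both areas here, but I am happy to accept an answer to either one (just skip the part you are not interested in).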

Specific case: idmap does not periodically rescan for domain controllers

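In an Active Directory compatible with Windows Server 2008, there are usually multiple domain controllers for high availability. If all of these servers are unavailable at the same time, and an OmniOS (r151018) file server with the active kernel SMB/CIFS server (successfully joined to the domain and working as expected) boots, the following happens: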

The idmap service tries to reach a DC for 60 seconds, then gives up…

    root@omnios:/root# tail -n 20 /var/svc/log/system-idmap:default.log
    @ Tue Sep 6 10:19:42 2016
    Global Catalog servers not configured/discoverable
    Domain controller servers not configured/discoverable
    created thread ID 3 - 1 threads currently active
    [ Sep 6 10:19:42 Method "start" exited with status 0. ]
    @ Tue Sep 6 10:19:53 2016
    created thread ID 4 - 2 threads currently active
    getdcname wait begin
    @ Tue Sep 6 10:19:57 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    @ Tue Sep 6 10:20:08 2016
    getdcname timeout
    @ Tue Sep 6 10:20:12 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    @ Tue Sep 6 10:20:27 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    @ Tue Sep 6 10:20:42 2016
    DNS: _ldap._tcp.dc._msdcs.testdomain.intranet: Host name lookup failure
    Domain discovery took 60 sec. Check the DNS configuration.

…but it does not go so far as to fail:

    root@omnios:/root# svcs -xv idmap
    svc:/system/idmap:default (Native Identity Mapping Service)
     State: online since Tue Sep 6 10:19:42 2016
       See: man -M /usr/share/man -s 1M idmapd
       See: man -M /usr/share/man -s 1M idmap
       See: /var/svc/log/system-idmap:default.log
    Impact: None.

After that, smbd (correctly) complains in the system log every minute that it cannot find a DC:

    smbd[525]: [ID 510351 daemon.notice] smb_locate_dc status 0xc0000233
    smbd[525]: [ID 199031 daemon.notice] smbd_dc_update: testdomain.intranet: locate failed

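This persists even after the DCs are back online and reachable. It is fixed immediately by running svcadm restart idmap. Of course, since these outages can happen unplanned, this should not have to be done by hand.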

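How can I make the idmap restart happen automatically on these events? I have tried using SMF, but it seems to act only on crashed services, while idmap reports no problem (and smbd only logs notices). Another possibility would be to monitor the log file constantly and act on these events, but that seems inefficient to me. I have also tried reducing the config/rediscovery_interval value to 60 seconds, but it seems to be ignored (or does not apply here).
Alternatively, is there any other solution to this problem? Unfortunately, I have not found anything usable, apart from posts confirming that a full reboot fixes it (because idmap is restarted then, too).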
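To make the log-watching idea concrete, here is a minimal sketch of what I have in mind. It only uses the smbd message quoted above and svcadm restart idmap; the function names, the RESTART_CMD override and the choice of the last 20 lines are my own assumptions, as is /var/adm/messages being the file that receives daemon.notice.

```shell
#!/bin/sh
# Sketch only: decide from recent syslog lines whether idmap looks stuck.
# The pattern is taken from the smbd notice quoted above.
PATTERN='smbd_dc_update: .*: locate failed'

# Reads log lines on stdin. Succeeds (exit 0) if at least THRESHOLD lines
# match the "locate failed" notice. THRESHOLD is $1 and defaults to 1.
dc_locate_failing() {
    threshold=${1:-1}
    hits=$(grep -c "$PATTERN")
    [ "$hits" -ge "$threshold" ]
}

# Restarts idmap if the last 20 lines of the given log contain the notice.
# RESTART_CMD (hypothetical override) allows a dry run without svcadm.
check_and_restart() {
    logfile=$1
    if tail -n 20 "$logfile" | dc_locate_failing 1; then
        ${RESTART_CMD:-svcadm restart idmap}
    fi
}
```

A cron job could then call check_and_restart /var/adm/messages once a minute. As written, this would keep restarting idmap for as long as the DCs are really down, and until the old notices have scrolled out of the last 20 lines, which is part of why this approach feels clumsy to me.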

Edit: output of svccfg -s idmap listprop. The only thing I changed is config/rediscovery_interval (default 3600); the IDs were removed manually afterwards.

    config application
    config/id_cache_timeout count 86400
    config/list_size_limit count 0
    config/name_cache_timeout count 604800
    config/preferred_dc astring
    config/stability astring Unstable
    config/use_ads boolean true
    config/use_lsa boolean true
    config/value_authorization astring solaris.smf.value.idmap
    config/machine_uuid astring [...]
    config/machine_sid astring [...]
    config/rediscovery_interval count 60
    config/domain_name astring testdomain.intranet
    debug application
    debug/all integer 0
    debug/config integer 0
    debug/discovery integer 0
    debug/dns integer 0
    debug/ldap integer 0
    debug/mapping integer 0
    debug/stability astring Unstable
    debug/value_authorization astring solaris.smf.value.idmap
    rpcbind dependency
    rpcbind/entities fmri svc:/network/rpc/bind
    rpcbind/grouping astring require_all
    rpcbind/restart_on astring restart
    rpcbind/type astring service
    filesystem-minimal dependency
    filesystem-minimal/entities fmri svc:/system/filesystem/minimal
    filesystem-minimal/grouping astring require_all
    filesystem-minimal/restart_on astring error
    filesystem-minimal/type astring service
    manifestfiles framework
    manifestfiles/lib_svc_manifest_system_idmap_xml astring /lib/svc/manifest/system/idmap.xml
    general framework
    general/action_authorization astring solaris.smf.manage.idmap
    general/entity_stability astring Unstable
    general/single_instance boolean true
    general/value_authorization astring solaris.smf.manage.idmap
    start method
    start/exec astring /usr/lib/idmapd
    start/timeout_seconds count 60
    start/type astring method
    stop method
    stop/exec astring :kill
    stop/timeout_seconds count 60
    stop/type astring method
    refresh method
    refresh/exec astring ":kill -HUP"
    refresh/timeout_seconds count 60
    refresh/type astring method
    tm_common_name template
    tm_common_name/C ustring "Native Identity Mapping Service"
    tm_man_idmapd1M template
    tm_man_idmapd1M/manpath astring /usr/share/man
    tm_man_idmapd1M/section astring 1M
    tm_man_idmapd1M/title astring idmapd
    tm_man_idmap1M template
    tm_man_idmap1M/manpath astring /usr/share/man
    tm_man_idmap1M/section astring 1M
    tm_man_idmap1M/title astring idmap

General question: how to react efficiently to syslog messages when the process appears to be running normally?

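This problem can be generalized to the question of how to monitor log files on Solaris in the most efficient way. I searched and found several tools, such as swatch and logsurfer, or the plain old cron job that runs every minute and is hooked up to a simple script reading the dmesg output.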

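Is this the only viable approach, or are there better solutions?
- Is there a way to tell SMF that certain notices from certain processes should be acted upon, even if no critical condition has occurred?
- I stumbled upon the fault manager, FMA, but it seems to work only for critical conditions, not for mere notices (or any user-specifiable string). Is that correct?
If this is the only way, which tool would you recommend, and why?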
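For reference, the cron-based variant mentioned above does not have to rescan the whole log on every run. A rough sketch, assuming a small state file that remembers the byte offset of the previous run (the function name and the state file are hypothetical, not part of any existing tool):

```shell
#!/bin/sh
# Sketch only: print the bytes appended to a log since the previous call.
# $1 = log file, $2 = state file holding the offset reached last time.
new_log_data() {
    logfile=$1
    statefile=$2
    # Some wc implementations pad the number with blanks, so strip them.
    size=$(wc -c < "$logfile" | tr -d ' ')
    last=0
    [ -f "$statefile" ] && last=$(cat "$statefile")
    # A file smaller than the stored offset was rotated or truncated.
    [ "$size" -lt "$last" ] && last=0
    if [ "$size" -gt "$last" ]; then
        # Copy exactly the bytes between the old and the new offset.
        # bs=1 is slow, but the increments per run are expected to be small.
        dd if="$logfile" bs=1 skip="$last" count="$((size - last))" 2>/dev/null
    fi
    echo "$size" > "$statefile"
}
```

The output of new_log_data could be piped into grep for the message of interest, so that each line is examined only once. This still polls, which is exactly what I would like to avoid if SMF or FMA offer something better.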