Shortcomings of current status-monitoring and failure-prediction systems

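I have a question: which good solutions (software or hardware) have already been developed and deployed in enterprises for online failure prediction? Zabbix, Openstb, Cacti, and similar alternatives? Can you list more? Can you describe their advantages and disadvantages, especially with regard to failure prediction?

I would like to understand their shortcomings and make some improvements through a model or algorithm. If you are not familiar with the concept of online failure prediction, please see the explanation below. If you already know it, skip it.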

Online failure prediction -- an approach to evaluating whether a failure will occur in the near future, when it will occur, and in which component (software or hardware) it will occur. It is a short-term prediction based on tracking failures, reports of detected errors, symptoms of undetected errors, and fault auditing (actively searching for faults, for example checking for inode inconsistencies in Linux filesystems).

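A more detailed introduction and the related methods are presented in this paper: https://s3-us-west-2.amazonaws.com/mlsurveys/88.pdf

Thank you very much!

Comparison of monitoring systems: https://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems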

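I don't think any monitoring system offers failure prediction out of the box. The paper you provided is too academic. You can still build it on top of an existing monitoring system, which will supply the data/events/failures for your failure prediction algorithm.

Some monitoring systems have: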

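Metric prediction (trend prediction). This is not failure prediction. There is a good semi-academic paper about it for Zabbix – Zabbix forecasting.
Anomaly detection – this is not prediction, it is detection. The best-known OSS for anomaly detection is Skyline. RRD-based systems (Cacti) use the RRD Holt-Winters algorithm. Graphite also has some math functions that can be used for anomaly detection.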
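To make the difference between the two features concrete, here is a minimal Python sketch. It is not how Zabbix, Skyline, or RRD implement them: the trend part is a plain least-squares line and the anomaly part is a simple z-score check (not Holt-Winters). The function names, parameters, and sample values are all made up for illustration.

```python
from statistics import mean, stdev


def time_until_threshold(samples, threshold):
    """Trend prediction: fit a least-squares line to (timestamp, value) pairs
    and return the seconds from the last sample until the line reaches
    `threshold`. Returns None if the trend is flat, moves away from the
    threshold, or has already passed it."""
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean, v_mean = mean(ts), mean(vs)
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        return None  # fewer than two distinct timestamps
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / denom
    if slope == 0:
        return None
    intercept = v_mean - slope * t_mean
    fitted_now = slope * ts[-1] + intercept
    remaining = (threshold - fitted_now) / slope
    return remaining if remaining >= 0 else None


def is_anomaly(history, value, z_limit=3.0):
    """Anomaly detection: flag `value` if it lies more than `z_limit`
    sample standard deviations away from the mean of `history`."""
    if len(history) < 2:
        return False
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > z_limit


if __name__ == "__main__":
    # Hypothetical disk usage (%) sampled once a minute.
    disk = [(0, 50), (60, 60), (120, 70)]
    print(time_until_threshold(disk, 100))
    print(is_anomaly([10, 11, 9, 10, 10, 11, 9, 10], 30))
```

The first function answers "when will this metric hit a limit if the trend continues", the second answers "is the current value unusual". Neither says that a component is about to fail, which is why a separate failure prediction layer is still needed.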

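If you want to implement/improve failure detection, then make it generic:

Input layer – some kind of plugin concept, so users can use or write their own plugins that pull data from a specific monitoring system
Failure detection layer – there are many algorithms, so each algorithm should be configurable
Output layer – similar to the input layer, so events about predicted failures can be sent back to the monitoring system or to another alerting system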
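The three layers could be sketched roughly like this in Python. This is only an illustration of the plugin idea, not an existing project: every class and function name here (Sample, InputPlugin, Detector, OutputPlugin, run_pipeline, and the toy implementations) is hypothetical, and a real input plugin would call the API of the specific monitoring system instead of returning fixed data.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Sample:
    component: str
    metric: str
    timestamp: float
    value: float


@dataclass
class Prediction:
    component: str
    message: str


class InputPlugin(ABC):
    """Input layer: pulls data from one specific monitoring system."""
    @abstractmethod
    def fetch(self) -> list[Sample]: ...


class Detector(ABC):
    """Detection layer: one configurable algorithm."""
    @abstractmethod
    def evaluate(self, samples: list[Sample]) -> list[Prediction]: ...


class OutputPlugin(ABC):
    """Output layer: pushes predictions to a monitoring or alerting system."""
    @abstractmethod
    def emit(self, prediction: Prediction) -> None: ...


class StaticInput(InputPlugin):
    """Toy input that returns fixed samples instead of querying a real system."""
    def __init__(self, samples: list[Sample]):
        self.samples = samples

    def fetch(self) -> list[Sample]:
        return list(self.samples)


class ThresholdDetector(Detector):
    """Toy detector; `metric` and `limit` are its configuration."""
    def __init__(self, metric: str, limit: float):
        self.metric = metric
        self.limit = limit

    def evaluate(self, samples: list[Sample]) -> list[Prediction]:
        return [
            Prediction(s.component, f"{s.metric}={s.value} exceeds {self.limit}")
            for s in samples
            if s.metric == self.metric and s.value > self.limit
        ]


class ListOutput(OutputPlugin):
    """Toy output that collects predictions in memory."""
    def __init__(self):
        self.received: list[Prediction] = []

    def emit(self, prediction: Prediction) -> None:
        self.received.append(prediction)


def run_pipeline(inputs, detectors, outputs) -> list[Prediction]:
    # Gather samples from every input, run every detector, fan out the results.
    samples = [s for plugin in inputs for s in plugin.fetch()]
    predictions = [p for d in detectors for p in d.evaluate(samples)]
    for p in predictions:
        for out in outputs:
            out.emit(p)
    return predictions
```

With this shape, supporting another monitoring system means writing one more InputPlugin or OutputPlugin, and trying another algorithm means adding one more Detector, without touching the rest.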

Please make it user (not academic) friendly, and use GitHub. Ping me when you need testing. 🙂