ESXi v5.5 crashes randomly

Hardware: HP ProLiant ML350 G5, 22 GB RAM, 1 x Intel Xeon E5405 2.00 GHz CPU

OS: ESXi 5.5, just updated from 5.1 in an attempt to fix crashes that were happening under ESXi 5.1 on the same hardware.

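I'm trying to find the error behind our server crashes; we have now had two lockups in 24 hours. The internal fault LED on the front flashes red; otherwise only the "Processor 2" LED shows amber and the "Power" LED flashes green (items #5 and #6 on page 76 of the manual).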

The only errors I can find in the logs for the relevant time frame are the ones below. Are they the cause? Or is there anything else I can do to log or find the error?

From zcat syslog.6.gz | less

2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (eg provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (eg provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:53Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:57Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:01Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:04Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:15Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (eg provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (eg provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:23Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:27Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:31Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:46Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:48Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
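
To judge whether these sfcbd messages line up with the crashes or are just background noise, it can help to tally them per process and message. The following is only a rough sketch: it assumes each entry has the form `<timestamp>Z <process>[<pid>]: <message>` and sits on its own line, and the name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Assumed entry shape: "<ISO timestamp>Z <process>[<pid>]: <message>"
LINE_RE = re.compile(r"^(\S+Z) ([\w-]+)\[(\d+)\]: (.*)$")

def summarize(lines):
    """Count (process, message) pairs and remember when each was first seen."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a syslog entry
        ts, proc, _pid, msg = m.groups()
        key = (proc, msg)
        counts[key] += 1
        first_seen.setdefault(key, ts)
    return counts, first_seen

# Three entries copied from the log above, as sample input.
sample = [
    "2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files",
    "2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor",
    "2014-05-26T11:55:53Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor",
]

counts, first_seen = summarize(sample)
for (proc, msg), n in counts.most_common():
    print(f"{n:4d}  {first_seen[(proc, msg)]}  {proc}: {msg}")
```

Comparing the first-seen timestamps against the times of the lockups would show whether the "Too many open files" errors start before each crash or run continuously regardless.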

Update

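Setting up iLO 2 and getting access to its logs did show some progress: I found a lot of "Server power removed" messages. That made me suspect the power supply, and since I removed the UPS the server has been stable for 5 days.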

Informational iLO 2 05/29/2014 20:31 05/29/2014 20:31 1 Server power restored.
Informational iLO 2 05/29/2014 20:31 05/29/2014 20:31 1 Server power removed.
Informational iLO 2 05/29/2014 16:57 05/29/2014 16:57 1 Server power restored.
Informational iLO 2 05/29/2014 16:57 05/29/2014 16:57 1 Server power removed.
Informational iLO 2 05/29/2014 15:39 05/29/2014 15:39 1 Server power restored.
Informational iLO 2 05/29/2014 15:39 05/29/2014 15:39 1 Server power removed.
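
The spacing of these power events can be worked out mechanically. This is a minimal sketch, assuming each entry reads as severity, source, two timestamps, a count, and a description (that column reading is my assumption, not something the iLO export states); `parse_events` and `power_loss_times` are hypothetical helper names.

```python
import re
from datetime import datetime

# Assumed entry shape:
# "<severity> iLO 2 <date> <time> <date> <time> <count> Server power <state>."
ENTRY_RE = re.compile(
    r"(\w+) iLO 2 (\d{2}/\d{2}/\d{4} \d{2}:\d{2}) "
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}) (\d+) (Server power \w+\.)"
)

def parse_events(text):
    """Return (timestamp, description) tuples for power events, oldest first."""
    events = []
    for m in ENTRY_RE.finditer(text):
        ts = datetime.strptime(m.group(2), "%m/%d/%Y %H:%M")
        events.append((ts, m.group(5)))
    return sorted(events, key=lambda e: e[0])

def power_loss_times(events):
    """Distinct minutes at which a 'Server power removed.' entry was logged."""
    return sorted({ts for ts, desc in events if desc == "Server power removed."})

# Entries copied from the iLO 2 event log above.
log = (
    "Informational iLO 2 05/29/2014 20:31 05/29/2014 20:31 1 Server power restored. "
    "Informational iLO 2 05/29/2014 20:31 05/29/2014 20:31 1 Server power removed. "
    "Informational iLO 2 05/29/2014 16:57 05/29/2014 16:57 1 Server power restored. "
    "Informational iLO 2 05/29/2014 16:57 05/29/2014 16:57 1 Server power removed. "
    "Informational iLO 2 05/29/2014 15:39 05/29/2014 15:39 1 Server power restored. "
    "Informational iLO 2 05/29/2014 15:39 05/29/2014 15:39 1 Server power removed. "
)

losses = power_loss_times(parse_events(log))
# Minutes between consecutive power losses.
gaps = [int((b - a).total_seconds() // 60) for a, b in zip(losses, losses[1:])]
for ts in losses:
    print("power removed at", ts.strftime("%m/%d/%Y %H:%M"))
print("minutes between losses:", gaps)
```

Each loss here is restored within the same minute, which looks more like a brief power interruption or reset than a long outage.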

Update 2

It is still unstable: it has crashed 2 more times within 24 hours.

Same log entries:

Informational iLO 2 06/13/2014 05:21 06/13/2014 05:21 2 Server power removed.
Informational iLO 2 06/13/2014 05:21 06/13/2014 05:21 3 Server power restored.

The iLO interface stays up after this happens. The IML is empty and shows nothing.


Update 3

Status Summary
Server Name: esx01.xx.xx; ProLiant ML350 G5
UUID: 32393534-3937-5A43-4A38-353130393248
Server Serial Number / Product ID: CZJ851092H / 459279-425
System ROM: D21 11/02/2008; backup system ROM: 11/02/2008
System Health: Ok
Internal Health LED: Ok
Server Power: ON
UID Light: OFF
Last Used Remote Console: Remote Console
Latest IML Entry: IML Cleared (iLO 2 user:xxx)
iLO 2 Name: ILOCZJ851092H
License Type: iLO 2 Standard
iLO 2 Firmware Version: 1.61 08/31/2008
IP address: 192.168.2.2
Active Sessions: iLO 2 user:xxx
Latest iLO 2 Event Log Entry: Browser login: xxx - 172.20.1.105(DNS name not found).
iLO 2 Date/Time: 06/13/2014 23:22:52

You likely have a hardware problem. This is not a VMware ESXi issue.

  • Which build of ESXi are you running?
  • What firmware versions are on the server hardware/BIOS?
  • Does the other ESXi host you mentioned have the same hardware?

Your best option is to check the server's HP Integrated Management Log (IML). You can do this through the ILO 2 interface.

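  • Log in to the ILO and check the hardware System Status tab. This main summary screen may tell you what is wrong.
  • Also look at the IML option under the System Status tab. This will tell you why the server crashed.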

That's it. You may have a RAM, CPU or system board problem here.


Edit: Please update your host's firmware. Please! Don't become a statistic.

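The download for the current bootable firmware DVD for your system is located here. Please boot your system from it and let it update all components. Everything on this server appears to date back to 2008. With HP server hardware, that is a big problem.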
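
To put a number on how stale that firmware is, here is a small sketch that works out the age of each component from the dates in the status summary posted in the question. The helper name `age_in_years` is made up for illustration, and "today" is taken to be the iLO 2 Date/Time shown in that same summary.

```python
from datetime import datetime

def age_in_years(release, now):
    """Whole years elapsed between two MM/DD/YYYY dates."""
    fmt = "%m/%d/%Y"
    r = datetime.strptime(release, fmt)
    n = datetime.strptime(now, fmt)
    years = n.year - r.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (n.month, n.day) < (r.month, r.day):
        years -= 1
    return years

# Dates copied from the status summary in the question.
firmware = {
    "System ROM D21": "11/02/2008",
    "iLO 2 firmware 1.61": "08/31/2008",
}
today = "06/13/2014"  # iLO 2 Date/Time from the same summary

for name, released in firmware.items():
    print(f"{name}: released {released}, {age_in_years(released, today)} full years old")
```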