Detecting Kubernetes OOMKilled events in GKE logs

I want to set up detection of OOMKilled events, which look like this when inspecting a pod:

    Name:           pnovotnak-manhole-123456789-82l2h
    Namespace:      test
    Node:           test-cluster-cja8smaK-oQSR/10.xxx
    Start Time:     Fri, 03 Feb 2017 14:34:57 -0800
    Labels:         pod-template-hash=123456789
                    run=pnovotnak-manhole
    Status:         Running
    IP:             10.xxx
    Controllers:    ReplicaSet/pnovotnak-manhole-123456789
    Containers:
      pnovotnak-manhole:
        Container ID:   docker://...
        Image:          pnovotnak/it
        Image ID:       docker://sha256:...
        Port:
        Limits:
          cpu:      2
          memory:   3Gi
        Requests:
          cpu:      200m
          memory:   256Mi
        State:          Running
          Started:      Fri, 03 Feb 2017 14:41:12 -0800
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Fri, 03 Feb 2017 14:35:08 -0800
          Finished:     Fri, 03 Feb 2017 14:41:11 -0800
        Ready:          True
        Restart Count:  1
        Volume Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-tder (ro)
        Environment Variables:  <none>
    Conditions:
      Type          Status
      Initialized   True
      Ready         True
      PodScheduled  True
    Volumes:
      default-token-46euo:
        Type:       Secret (a volume populated by a Secret)
        SecretName: default-token-tder
    QoS Class:      Burstable
    Tolerations:    <none>
    Events:
      FirstSeen  LastSeen  Count  From                                  SubObjectPath                       Type    Reason     Message
      ---------  --------  -----  ----                                  -------------                       ------  ------     -------
      11m        11m       1      {default-scheduler }                                                      Normal  Scheduled  Successfully assigned pnovotnak-manhole-123456789-82l2h to test-cluster-cja8smaK-oQSR
      10m        10m       1      {kubelet test-cluster-cja8smaK-oQSR}  spec.containers{pnovotnak-manhole}  Normal  Created    Created container with docker id xxxxxxxxxxxx; Security:[seccomp=unconfined]
      10m        10m       1      {kubelet test-cluster-cja8smaK-oQSR}  spec.containers{pnovotnak-manhole}  Normal  Started    Started container with docker id xxxxxxxxxxxx
      11m        4m        2      {kubelet test-cluster-cja8smaK-oQSR}  spec.containers{pnovotnak-manhole}  Normal  Pulling    pulling image "pnovotnak/it"
      10m        4m        2      {kubelet test-cluster-cja8smaK-oQSR}  spec.containers{pnovotnak-manhole}  Normal  Pulled     Successfully pulled image "pnovotnak/it"
      4m         4m        1      {kubelet test-cluster-cja8smaK-oQSR}  spec.containers{pnovotnak-manhole}  Normal  Created    Created container with docker id yyyyyyyyyyyy; Security:[seccomp=unconfined]
      4m         4m        1      {kubelet test-cluster-cja8smaK-oQSR}  spec.containers{pnovotnak-manhole}  Normal  Started    Started container with docker id yyyyyyyyyyyy

All I get from the pod logs is:

    {
      textPayload: "shutting down, got signal: Terminated"
      insertId: "aaaaaaaaaaaaaaaa"
      resource: {
        type: "container"
        labels: {
          pod_id: "pnovotnak-manhole-123456789-82l2h"
          ...
        }
      }
      timestamp: "2017-02-03T22:34:48Z"
      severity: "ERROR"
      labels: {
        container.googleapis.com/container_name: "POD"
        ...
      }
      logName: "projects/cyrusmolcloud/logs/POD"
    }

And from the kubelet logs:

    {
      insertId: "bbbbbbbbbbbbbb"
      jsonPayload: {
        _BOOT_ID: "ffffffffffffffffffffffffffffffff"
        MESSAGE: "I0203 22:41:11.925928    1843 kubelet.go:1816] SyncLoop (PLEG): "pnovotnak-manhole-123456789-82l2h_test(a-uuid)", event: &pleg.PodLifecycleEvent{ID:"another-uuid", Type:"ContainerDied", Data:"..."}"
        ...

That doesn't seem like enough to uniquely identify this as an OOM event. Any other ideas?

Even though the OOMKilled event doesn't show up in the logs, if you can detect that a pod was killed, you can use kubectl get pod -o go-template=... <pod-id> to determine the reason. As an example taken straight from the documentation:

    [13:59:01] $ ./cluster/kubectl.sh get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}' simmemleak-60xbc
    Container Name: simmemleak
    LastState: map[terminated:map[exitCode:137 reason:OOM Killed startedAt:2015-07-07T20:58:43Z finishedAt:2015-07-07T20:58:43Z containerID:docker://0e4095bba1feccdfe7ef9fb6ebffe972b4b14285d5acdec6f0d3ae8a22fad8b2]]
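The same check is easier to do in code if you ask kubectl for JSON instead of a go-template. Here is a minimal Python sketch of the parsing step, assuming a pod object shaped like the output of `kubectl get pod <name> -o json` (field names follow the v1 Pod API):

```python
def oomkilled_containers(pod):
    """Return names of containers in `pod` whose last termination reason was OOMKilled."""
    killed = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated", {})
        if terminated.get("reason") == "OOMKilled":
            killed.append(status["name"])
    return killed


# Example with a synthetic pod object mirroring the describe output above:
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "pnovotnak-manhole",
                "restartCount": 1,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}
print(oomkilled_containers(pod))
```

You would feed this the parsed output of `kubectl get pod <pod-id> -o json` (e.g. via `json.loads`) whenever your kill detection fires.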

If you're doing this programmatically, a better choice than depending on kubectl output is the Kubernetes REST API's GET /api/v1/pods method. Ways of accessing the API are also covered in the documentation.
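A REST-API-based watcher can be sketched as a polling loop that tracks restart counts, so each OOM kill is reported only once. This is an illustrative sketch, not a production implementation: it assumes `kubectl proxy` is serving the API on 127.0.0.1:8001 (the default), and `poll_oom_kills` only relies on standard v1 Pod fields:

```python
import json
import time
import urllib.request

API = "http://127.0.0.1:8001"  # assumes a local `kubectl proxy` is running


def fetch_pods():
    """GET /api/v1/pods across all namespaces and return the pod list."""
    with urllib.request.urlopen(API + "/api/v1/pods") as resp:
        return json.load(resp)["items"]


def poll_oom_kills(seen, pods):
    """Return (namespace, pod, container) tuples newly terminated with OOMKilled.

    `seen` maps (namespace, pod, container) -> last observed restartCount and is
    updated in place, so repeated calls only report kills since the last poll.
    """
    new_kills = []
    for pod in pods:
        meta = pod["metadata"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            key = (meta["namespace"], meta["name"], cs["name"])
            count = cs.get("restartCount", 0)
            terminated = cs.get("lastState", {}).get("terminated", {})
            if count > seen.get(key, 0) and terminated.get("reason") == "OOMKilled":
                new_kills.append(key)
            seen[key] = count
    return new_kills


if __name__ == "__main__":
    seen = {}
    while True:
        for ns, pod_name, container in poll_oom_kills(seen, fetch_pods()):
            print(f"OOMKilled: {ns}/{pod_name} container {container}")
        time.sleep(30)
```

Keying on restartCount sidesteps the log problem entirely: the kubelet's "ContainerDied" line doesn't say why the container died, but the pod's lastState does, and it survives until the next restart.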