Auto-Healing Containers in Kubernetes
One of the most in-demand container orchestration features is monitoring of container health and the orchestrator's ability to deal with unhealthy containers according to the specified configuration. For example, the orchestrator can launch a new container as a replacement for a container that is not healthy. In Kubernetes, a popular open-source container management solution, the same technique applies to a pod, the smallest deployable unit, consisting of one or more containers that share the same IP address and port space.
What is Container Health?
So how can we determine which container is healthy and which is not? As we know, there is a single main process that is running in a container. Such a process can start other child processes within a container, if necessary. Every such process, including the main process, can have its own lifecycle – but if the main process stops, the container stops as well.
A container is healthy, by the most general definition, if its main process is running. If the container’s main process is terminated unexpectedly, then the container is considered unhealthy.
Note: You should monitor health only for containers that are running permanently, such as web servers or databases. If you are running a container that is expected to stop at some point, there is no reason to monitor its health. Instead, you should analyze its result, such as its exit code. You can use a Kubernetes job for pods that are expected to terminate on their own.
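As a sketch of the Job approach mentioned above, a run-to-completion workload can be wrapped in a Job manifest like the following (the `one-shot` name, `busybox` image, and echo command are hypothetical placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot                # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: worker
        image: busybox          # hypothetical image
        command: ["sh", "-c", "echo finished"]
      restartPolicy: OnFailure  # Jobs allow only OnFailure or Never
```

Instead of probing such a pod, Kubernetes tracks whether its containers exit successfully, which matches the exit-code analysis described above.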
In addition, you should take into consideration that the container keeps running and is considered healthy even if one or more child processes are terminated unexpectedly. Furthermore, the main process might be running but not working as expected, because, for example, it was not configured properly. For such cases, we need an application-specific way to determine the container’s health.
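One common application-specific approach is an exec-based liveness probe that runs a command inside the container; a minimal sketch, assuming a hypothetical `/tmp/healthy` file that the application maintains while it is working correctly:

```yaml
livenessProbe:
  exec:
    command:                   # succeeds if the command exits with code 0
    - cat
    - /tmp/healthy             # hypothetical health file written by the app
  initialDelaySeconds: 5
  periodSeconds: 10
```

If the application stops updating or deletes the file, the probe fails even though the main process is still running.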
Auto-Healing in Kubernetes
In Kubernetes, there are several key concepts that are related to containers’ health and auto-healing: the pod’s phase, probes, and the restart policy.
Each Kubernetes pod has a phase field, which provides a simple, high-level summary of where the pod is in its lifecycle. The pod can be in one of the following phases:
- Pending – The pod has been created, but one or more of its containers are not running yet. For example, it can take some time for a container image to be downloaded over the network.
- Running – All of the containers in the pod are running.
- Succeeded – All of the containers in the pod have terminated with a zero exit code.
- Failed – All of the containers in the pod have terminated, and at least one container failed (either exited with a non-zero exit code or was terminated by the system).
- Unknown – The state of the pod could not be obtained for some reason.
For each pod, Kubernetes can periodically execute liveness and readiness probes, if they are defined for the pod. A probe is an executable action that checks whether a specified condition is met. There are three types of actions:
- ExecAction – Executes a specified command in the container. The condition is considered successful if the command exits with zero as its exit code.
- TCPSocketAction – Performs a TCP check against the container’s IP address on a specified port. The condition is considered successful if the port is open.
- HTTPGetAction – Performs an HTTP GET request against the container’s IP address on a specified port and path. The condition is considered successful if the response has an HTTP status code greater than or equal to 200 and less than 400.
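For illustration, the two network-based actions look like the following probe fragments (a container spec would carry only one livenessProbe; port 8080 and the `/healthz` path are hypothetical):

```yaml
# TCPSocketAction: succeeds if a TCP connection to the port can be opened
livenessProbe:
  tcpSocket:
    port: 8080

# HTTPGetAction: succeeds if the response status is in the 200–399 range
livenessProbe:
  httpGet:
    path: /healthz             # hypothetical health endpoint
    port: 8080
```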
A liveness probe determines whether the container is running or not. If the liveness probe fails, then Kubernetes kills the container. A new container can be started instead, if the restart policy says so. Although there is no default liveness probe for a container, you do not necessarily need one, because Kubernetes will automatically perform the correct action in accordance with the pod’s restart policy.
A readiness probe determines whether the container is ready to service requests. If the readiness probe fails, the endpoints controller removes the pod’s IP address from the endpoints of all services for that pod. There is no default readiness probe for a container.
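A minimal sketch of a readiness probe, assuming a hypothetical `/ready` endpoint served on port 8080:

```yaml
readinessProbe:
  httpGet:
    path: /ready               # hypothetical readiness endpoint
    port: 8080
  initialDelaySeconds: 5       # wait 5s after start before the first check
  periodSeconds: 5             # check every 5 seconds
```

Note that a failing readiness probe only takes the pod out of service endpoints; unlike a liveness probe, it does not cause the container to be restarted.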
A pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always. The restart policy applies to all of the containers in the pod. Failed containers are restarted on the same node with a delay that grows exponentially up to 5 minutes.
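The interplay of a failing container and the restart policy can be sketched with a deliberately failing pod (the `fail-once` name and `busybox` image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fail-once              # hypothetical name
spec:
  restartPolicy: OnFailure     # restart containers only when they exit non-zero
  containers:
  - name: task
    image: busybox             # hypothetical image
    command: ["sh", "-c", "exit 1"]   # always fails, so it will be restarted
```

With `restartPolicy: Never` instead, the pod would simply move to the Failed phase after the first exit.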
Example 1: Define a liveness probe for a pod
Step 1. Define a new pod in the file echoserver-pod.yaml. Here we use an existing image echoserver, which is a simple HTTP server that responds with the HTTP headers it receives:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: echoserver
spec:
  containers:
  - image: gcr.io/google_containers/echoserver:1.4
    name: echoserver
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /
        port: 8080
      initialDelaySeconds: 15
      timeoutSeconds: 1
```
In this example, we defined a liveness probe for port 8080 and the root path. The liveness probe also has the following fields:
- initialDelaySeconds – The number of seconds after the container has started before liveness probes are initiated
- timeoutSeconds – The number of seconds after which the probe times out. The default value is 1 second; the minimum value is 1.
- periodSeconds – How often (in seconds) to perform the probe. The default value is 10 seconds; the minimum value is 1.
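Putting the three timing fields together, a probe definition using all of them might look like this sketch:

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 15      # wait 15s after container start before probing
  timeoutSeconds: 1            # fail the probe if no response within 1s
  periodSeconds: 10            # repeat the probe every 10 seconds
```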
Step 2. Create the echoserver pod:
```
$ kubectl create -f echoserver-pod.yaml
pod "echoserver" created
```
Step 3. Check that the pod is running and there are no restarts:
```
$ kubectl get pod echoserver
NAME         READY     STATUS    RESTARTS   AGE
echoserver   1/1       Running   0          15s
```
As you can see, the pod is running and there are no restarts (0).
Example 2: Define a failing liveness probe for a pod
Step 1. Edit the file echoserver-pod.yaml. Change the pod’s name to echoserver2, and change the port number (8080) in the liveness probe to another value, for example, 8081:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: echoserver2
spec:
  containers:
  - image: gcr.io/google_containers/echoserver:1.4
    name: echoserver
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /
        port: 8081
      initialDelaySeconds: 15
      timeoutSeconds: 1
```
Our echoserver pod will not respond on port 8081, so we defined a liveness probe that will fail.
Step 2. Create a new echoserver2 pod:
```
$ kubectl create -f echoserver-pod.yaml
pod "echoserver2" created
```
Step 3. Wait 1 minute and check the pod’s status:
```
$ kubectl get pod echoserver2
NAME          READY     STATUS    RESTARTS   AGE
echoserver2   1/1       Running   2          1m
```
As you can see, Kubernetes has restarted our pod several times. In the Events section of the kubectl describe output, you can see the following information:
```
Successfully assigned echoserver2 to …
Created container with docker id …
Started container with docker id …
Container "echoserver" is unhealthy, it will be killed and re-created.
```
Kubernetes provides key mechanisms that allow users to monitor containers’ health and restart them in case of failures: probes and the restart policy. Probes are executable actions, which check if the specified conditions are met. The pod’s restart policy specifies actions for failed containers. It is worth noting that files in a container are ephemeral, so when a container gets restarted, the changes to the files will be lost. If a container is not stateless, it is necessary to use volumes for persistent storage.
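As a sketch of the volume point above, an emptyDir volume is the simplest option for data that should survive container restarts within a pod (the pod name, image, and mount path are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-volume        # hypothetical name
spec:
  containers:
  - name: app
    image: myapp:1.0           # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /var/data     # files here survive container restarts
  volumes:
  - name: data
    emptyDir: {}               # lives as long as the pod exists
```

Note that an emptyDir volume is deleted together with the pod; for data that must outlive the pod itself, a persistent volume is required.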