Kubernetes tolerations working together with Docker UCP scheduler restrictions

In this blog post we’ll take a look at how the scheduler controls in Docker UCP interact with Kubernetes taints and tolerations. Both are used to control what workloads are allowed to run on manager and DTR (Docker Trusted Registry) nodes. Docker EE UCP manager nodes are also Kubernetes master nodes, and in production systems it is important to restrict what runs on the manager (master) and DTR nodes. We’ll walk through deploying a Kubernetes workload on every node in a Docker EE cluster.

The task at hand

You have a Docker EE 2.1 installation, and you’ve been successfully deploying basic Swarm and/or Kubernetes workloads. Now you need to deploy a workload that runs exactly one instance on every node in the cluster.

This is fairly common for performance monitoring or data collection agents, for instance Dynatrace agents or Elastic Filebeat and Metricbeat “data shippers”. A global Swarm service or a Kubernetes DaemonSet is a likely choice, freeing you from the drudgery of manually adding an instance to every new cluster node and potentially cleaning up after removing a node.
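
For comparison, the Swarm flavor of this pattern is a one-liner. A minimal sketch, where my-agent-image:latest is a placeholder for your agent image:

docker service create --name my-agent --mode global my-agent-image:latest

The --mode global flag tells Swarm to run exactly one task on every eligible node, including nodes that join the cluster later.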

This should be pretty simple

For one reason or another, you need to use a Kubernetes DaemonSet instead of a global Swarm service. Maybe that is what the vendor supports, or maybe you need container settings that aren’t supported by a Swarm service (e.g. privileged mode, joining the host network, etc.). This should not be a big deal; you create a manifest file for the DaemonSet and create the DaemonSet using kubectl.
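
For example, settings like these in a Kubernetes pod spec are among the ones mentioned above (a hypothetical agent container, shown only for illustration):

spec:
  hostNetwork: true
  containers:
  - name: my-agent
    image: my-agent-image:latest
    securityContext:
      privileged: true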

For the purposes of this blog post, let’s assume that you have your UCP scheduler set to not allow admin users or normal users to schedule work on manager or DTR nodes. This would be a pretty standard configuration in a production Docker EE cluster, except perhaps during certain activities like DTR upgrades or maintenance.

For our example, we’ll use a manifest file named nginx-ds.yaml, and we’ll create the DaemonSet in the namespace my-namespace:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: nginx-ds
  name: nginx-ds
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: nginx-ds
  template:
    metadata:
      labels:
        app: nginx-ds
    spec:
      containers:
      - image: nginx:1.14
        name: nginx

We’ll deploy the DaemonSet using the command kubectl apply -f nginx-ds.yaml. There are no errors when creating the DaemonSet, so we check whether all the pods are running with kubectl -n my-namespace get po -l app=nginx-ds -o wide. Hmmm, something is not right. There are no pods running or scheduled on the manager and DTR nodes. Oh right, we need to add tolerations so that workloads can run on the manager nodes, and in the case of Docker EE, the DTR nodes as well.

We check a manager node to find the taint we need to add a toleration for: kubectl describe node my-manager-node. Ignoring taints that are dynamically added due to load, network issues, etc., the only taint you will find in a default configuration is: Taints: com.docker.ucp.manager:NoSchedule
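
If you want to survey the taints on every node at once, kubectl’s custom-columns output is a handy way to do it:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints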

Add the toleration to the pod spec in the manifest file:

tolerations:
- key: "com.docker.ucp.manager"
  operator: "Exists"

Then apply the manifest file again. Hmmm, there are still no pods running on or scheduled for the manager and DTR nodes. We inspect the DaemonSet using kubectl -n my-namespace get ds nginx-ds -o yaml, and the toleration that we just added to the manifest file is not included in the output. We double-check our toleration syntax, double-check the taint key, and apply the manifest file again. Still no luck! What’s going on here?
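
A quicker check of just the stored tolerations, using a JSONPath query:

kubectl -n my-namespace get ds nginx-ds -o jsonpath='{.spec.template.spec.tolerations}'

If this prints nothing, the toleration we added did not survive.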

Docker EE UCP scheduler settings

What may not be immediately obvious is that with Kubernetes installed under UCP in Docker EE, an Admission Controller is (apparently) used to modify the Kubernetes objects that are created via the Kubernetes API so that they comply with the UCP scheduler settings. Let’s look at the Scheduling panel in the UCP GUI, under Admin Settings -> Scheduler:

We can see that UCP controls the ability to schedule Kubernetes workloads on UCP manager and DTR nodes. There is some coarse-grained control available:

  • Allow admins to schedule work on these nodes: A toleration will be applied to workloads created by an admin user.
  • Allow users and service accounts to schedule work on these nodes: A toleration will be applied to workloads created by normal users and service accounts.

While not called out in the UCP scheduler GUI, it is also good to know that regardless of what is configured in the scheduler GUI, if an admin user creates a workload in the kube-system namespace, a toleration will automatically be applied. You can try this out:

  • Delete the existing DaemonSet:
    • kubectl delete -f nginx-ds.yaml
  • Edit nginx-ds.yaml to change the namespace value to kube-system
  • Create the DaemonSet again:
    • kubectl apply -f nginx-ds.yaml
  • Check that a pod from the DaemonSet is running (or at least scheduled) on every node:
    • kubectl -n kube-system get po -l app=nginx-ds -o wide

See Restrict services to worker nodes for an explanation of this behavior from Docker’s docs.

One other thing to be aware of if you are using UCP v3.1.4 or later:
If you are going to deploy pods whose containers use restricted parameters, you must be a UCP admin. If you are using a service account to deploy the pods, the service account must be bound to the cluster-admin ClusterRole via a ClusterRoleBinding. Restricted parameters include:

  • Host Bind Mounts
  • Privileged Mode
  • Extra Capabilities
  • Host Networking
  • Host IPC
  • Host PID

See the UCP release notes for more details.

Digging into the details

Docker EE UCP uses a different taint than you may be used to seeing if you have been installing Kubernetes using kubeadm or a similar method. UCP uses the taint com.docker.ucp.manager:NoSchedule to keep normal workloads from running on manager nodes. You are likely used to seeing node-role.kubernetes.io/master:NoSchedule in other Kubernetes installations, but that taint is not applied to any nodes with Kubernetes in Docker EE. UCP also groups DTR nodes together with manager nodes for the purpose of blocking or allowing workloads.

In other Kubernetes installations, you may have used the somewhat brute-force approach of simply removing a taint from master nodes to allow user workloads to run there. This is not a good idea with Kubernetes under UCP, since it will break the scheduling controls built into UCP and perhaps cause other unexpected behavior. The same goes for adding other NoSchedule taints to the manager and DTR nodes in an attempt to externally control scheduling workloads on them.

When you create an object, UCP adds (or removes) the com.docker.ucp.manager:NoSchedule toleration to (or from) that object based on the UCP scheduler settings rather than removing taints from nodes. This has the effect of overriding any attempts to manually add that particular toleration. The same goes for “live” editing of objects such as Deployments, ReplicaSets, etc; that particular toleration will be overridden when you save your changes. However, you can successfully add tolerations for other taints.
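
You can watch this happen. As a sketch (with scheduling on manager and DTR nodes still disabled for your user), patch the live DaemonSet to re-add the toleration and then read it back:

kubectl -n my-namespace patch ds nginx-ds --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"com.docker.ucp.manager","operator":"Exists"}]}}}}'
kubectl -n my-namespace get ds nginx-ds -o jsonpath='{.spec.template.spec.tolerations}'

The patch is accepted, but per the behavior described above, the com.docker.ucp.manager toleration should not appear in the stored pod template.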

How to use this behavior effectively

Solution #1: Schedule work in the kube-system namespace as an admin user. This may not always be a good idea from a security perspective, and you may need to edit some vendor-supplied manifest files to make this work.

Solution #2: Permit admin users to schedule workloads on manager and DTR nodes just long enough to create the object (the DaemonSet in this example), then disable that setting again. UCP will add the toleration to the pod spec in the DaemonSet, so that even if a pod dies or is killed, the toleration will still be in the pod spec and a new pod can be scheduled on a manager or DTR node. However, if you need to update the DaemonSet (for instance, change the image from nginx:1.14 to nginx:1.15 in our example), you will need to allow admin users to schedule workloads on manager and DTR nodes again during the upgrade.
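
The image update itself can be a one-liner once the scheduler setting is temporarily enabled; nginx here is the container name from our pod spec:

kubectl -n my-namespace set image ds/nginx-ds nginx=nginx:1.15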

Earlier I mentioned that manager and DTR nodes are grouped together in terms of allowing normal workload scheduling. While at first this might seem to be an all-or-nothing approach, you can still fine-tune this behavior by adding labels to nodes and then adding Affinities or NodeSelectors to the objects you create. For instance, if you want a particular app to run on worker nodes and manager nodes but not on DTR nodes:

  • Add a label such as my-app=allowed to all nodes except the DTR nodes (see the labeling command after this list).
  • Then add an Affinity to the pod spec in the manifest file for your object:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: my-app
          operator: In
          values:
          - allowed
  • Change the UCP scheduler settings to allow admin users to schedule work on manager and DTR nodes, create your object, then revert the UCP scheduler settings.
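
The labels from the first step can be added with kubectl; my-worker-node is a hypothetical node name used for illustration:

kubectl label node my-worker-node my-app=allowed

Repeat for every node except the DTR nodes. Pods without the matching Affinity are unaffected by the label.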

Other considerations

UCP handling of scheduling on manager and DTR nodes may not be what DevOps teams expect when designing or implementing a pipeline. On the plus side, there will most likely not be many workloads that can justifiably run on manager and DTR nodes. Workloads that do run on those nodes are likely to be special apps that are arguably not the best use cases for pipeline deployment. But if you are using your pipeline to deploy workloads that must run on manager and/or DTR nodes, a couple of considerations:

  • If you want to use the Docker API to toggle the UCP scheduling constraints on and off, you will need to use an admin account for that particular task (see the sketch after this list).
  • If you are not using an admin account for the pipeline interaction with Kubernetes, you will need to temporarily allow users and service accounts to schedule work on manager and DTR nodes. This allows a brief window where other user activity could result in scheduling workloads on manager and DTR nodes.
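
As a starting point for the API route, UCP serves its configuration as TOML, and the scheduling settings live in a [scheduling_configuration] section. A minimal sketch for reading the current configuration (endpoint and section names are from the UCP configuration docs as I recall them, so verify against your UCP version; ucp.example.com is a placeholder):

# Obtain an auth token using admin credentials.
AUTHTOKEN=$(curl -sk -d '{"username":"admin","password":"<password>"}' \
  https://ucp.example.com/auth/login | jq -r .auth_token)

# Read the current UCP configuration; updating is a PUT of the
# modified TOML back to the same endpoint.
curl -sk -H "Authorization: Bearer $AUTHTOKEN" \
  https://ucp.example.com/api/ucp/config-toml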

Need help?

Do you like the convenience of letting Docker EE do all the heavy lifting for Kubernetes installation, configuration and user management, but you are a little overwhelmed by some of the details? Capstone IT is here to help! Don’t hesitate to contact us with any of your questions regarding Docker, Kubernetes and running containers in the Cloud.
