The Container Engine is a Commodity

I started my career 20 years ago when Java was the new kid on the block. I worked at IBM for many years on their JVM and I was very proud of the innovation and work that we did there.

During that time, a curious thing happened. The JVM became a commodity. People didn’t really care too much about who had the better garbage collector or the faster JIT, rather they cared about stability and function. As new versions of Java came out, a new byte code would sometimes sneak in or profiling might be improved a little, but the bulk of the innovation and function was in the class libraries – the rich APIs sitting on top of this commoditized JVM and the portable byte code spec. It’s easy to forget today just how much third party innovation grew up around it. IBM bet their entire enterprise software stack on this humble little interpreter.

I believe that the exact same thing is happening in the container space and the parallels with Java are striking.

What does a container engine actually have to do, fundamentally? It has to provide some networking and storage virtualization, a sandpit for running a process, various means of data ingress and egress into that process, a format for representing binary dependencies and a control plane for lifecycle management. Beyond that, yes you can have logging plugins and the ability to execute other processes and snapshotting, but none of that is essential.
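To make that concrete, here is a minimal sketch of that surface area as a Go interface. The names are mine and entirely hypothetical, not any real project's API – the point is just how small the essential contract is.

```go
// Hypothetical names throughout; the point is how small the essential
// contract of a container engine actually is.
package engine

import "io"

type ContainerEngine interface {
	// A format for representing binary dependencies.
	PullImage(ref string) (imageID string, err error)

	// A sandpit for running a process, with networking and storage
	// virtualization attached at creation time.
	Create(imageID string, networks, volumes, cmd []string) (containerID string, err error)

	// A control plane for lifecycle management.
	Start(containerID string) error
	Stop(containerID string) error
	Delete(containerID string) error

	// Data ingress and egress for the running process.
	Attach(containerID string) (stdin io.WriteCloser, stdout, stderr io.ReadCloser, err error)
}
```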

Nothing captured this requirement more clearly than the CRI-O announcement earlier this year. It was a conscious effort to put a stake in the ground and say, “here is a straw man for what we need a basic container runtime to do”. Docker themselves came out with containerd earlier this year as a way of attempting to map the multi-process daemon model onto the synchronous single-process runC as part of their OCI integration. The net effect is a secondary API boundary sitting below the Docker API that represents basic container compute primitives.

The VIC team did the exact same thing back in January when we were re-architecting the Bonneville codebase. We intentionally designed a container primitives layer consisting of 5 distinct services – persistence, network, compute, events and interaction. The network and storage services would mutate the underlying infrastructure, the compute service would create containers and the event and interaction services would provide optional value-add around the running containers. We did this with the intent of being able to layer multiple personalities on top – everything from a Trivial Container Engine that only uses the compute service to a Docker personality. Beyond that, the goal was to end up with low-level interfaces that could also abstract away the underlying implementation – it should be possible to deploy a container host inside of a container host and manage them using the exact same client. Cool, huh?!
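As a rough illustration of that split (the method names are hypothetical, not the actual VIC interfaces), the layering might look something like this:

```go
// A sketch of the five-service primitives layer. Personalities compose
// these services rather than talking to the infrastructure directly.
package primitives

import (
	"context"
	"io"
)

// Network and persistence mutate the underlying infrastructure.
type NetworkService interface{ CreateNetwork(ctx context.Context, name string) error }
type PersistenceService interface{ CreateVolume(ctx context.Context, name string, capacityMB int64) error }

// Compute creates containers.
type ComputeService interface {
	CreateContainer(ctx context.Context, image string, cmd []string) (id string, err error)
}

// Events and interaction are optional value-add around running containers.
type EventService interface{ Subscribe(ctx context.Context) (<-chan string, error) }
type InteractionService interface {
	Attach(ctx context.Context, id string) (io.ReadWriteCloser, error)
}

// A Trivial Container Engine needs only Compute; a Docker personality
// composes all five. Either could sit on top of vSphere or on top of
// another container host, as long as these interfaces hold.
type DockerPersonality struct {
	Network     NetworkService
	Persistence PersistenceService
	Compute     ComputeService
	Events      EventService
	Interaction InteractionService
}
```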

Almost a year on from that original white-boarding session, it’s worth reflecting on where we find ourselves. The Docker personality in VICe will never be fully API complete*. There are many reasons for this, but primarily it’s because that API is a level above what VICe is really aiming at. The Docker API now covers development, clustering, image resolution, HA, deployment and build. VICe is focused on the runtime – provisioning and deployment with clustering, scheduling and HA transparently integrated at a lower layer.

I’m often asked, “how is VICe ever going to keep up with the pace of innovation?”. Well, if the industry can standardize around a container primitives API with an agreement to use explicit rather than implicit interfaces, then I think it’s very possible. If the Container Engine is a commodity and we get that bit right, most of the innovation will happen in the layers above it and should be compatible with it. It may take a few go-arounds to find the right abstractions, but in my opinion, it’s a goal well worth pursuing.

[* This is a prediction, not a commitment to not doing it 😉 ]

Video: Downloading and Installing VICe 0.7.0

The first of many videos is now up. I’m going to try to keep them relatively short and informative. How to download, install and run VIC felt like a good place to start.

Please comment if there’s any particular focus you’d like to request from another video and I’ll add it to the list.


How is VICe Different?

Continuing from my earlier post attempting to define what VICe is, I want to ask, “how is it different?”. It’s a question I get asked a lot.

I’d like to start with a very abstract observation: The Onion Has Many Layers. I found myself using this in response to a heated discussion in our office recently about VICe and Photon Platform. It’s an attempt to draw attention away from what something is towards what it does. As an example, the notion of a Pod could theoretically be implemented in many ways. It is a well-defined unit of service that has shared storage, multiple isolated processes, a specific identity on a network etc. A Pod could be represented as a VM running Docker; a synthetic abstraction around a group of containers in a VM; a single container with multiple exec’d processes (if you can live without isolation); or even a group of VMs in a resource pool (if you could find a way to abstract the inter-process communications).
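To put that in code terms, here is a sketch of the idea – all of these types are hypothetical. The contract comes first; the runtime representation is swappable behind it.

```go
// A hypothetical sketch: pin down what a Pod does, then let the runtime
// representation vary behind the same contract.
package pod

import "context"

// Pod is a unit of service: shared storage, multiple isolated processes
// and a single identity on the network.
type Pod interface {
	Deploy(ctx context.Context) error
	NetworkIdentity() string
	Processes() []string
}

// Any of these could satisfy the contract; which one is appropriate is an
// implementation decision, not part of the definition.
type dockerInVM struct{}        // a VM running a Docker engine
type containerVMGroup struct{}  // a synthetic grouping of containerVMs
type sharedKernelExec struct{}  // one container, multiple exec'd processes
type resourcePoolOfVMs struct{} // a group of VMs in a resource pool
```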

The point is, it’s better to focus on the characteristics and requirements of a thing before you decide what the most appropriate runtime representation of it should be, if at all. AWS Lambda is a great example of this kind of thinking in that it entirely abstracts you from the details. What you care about is being able to define policies that express intent about the quality of service you want and then trust that those policies are translated into appropriate implementation detail.

And this brings us to containers and VMs. When we first pitched Bonneville, the question we were asked most often was “why just one container per VM, isn’t more than one better?”. I was unable to articulate a coherent answer the first time I responded to this, but it gets right to the crux of what VICe is (of course, the correct answer is “it depends” and then just walk away). It all boils down to infrastructure. If I deploy my Docker image as a VM, it relies entirely on vSphere for scheduling, networks, storage and control plane. Deploy it in a VM and it now depends on Linux for many of those same things. In simple terms, are you going to deploy it to a nested hypervisor or to your actual hypervisor? Do you want your container to connect directly to a virtual network via a vNIC or have to go through various layers of a guest OS first?

So who cares? What difference does it make?

Isolation

While acknowledging that no code is perfect and hypervisor vulnerabilities do exist, something like Dirty COW is as good an example as any of why folks deploying containers are forced to consider the meaning of an isolation domain. Running containers in VMs and treating the VM as an isolation domain seems to be an obvious answer for many.

It’s not just the potential for breaking out of a container though. There are perfectly legitimate ways for back doors to be opened up between a container and its host – privileged mode and host-mounted volumes are obvious examples. VICe offers you neither of these things. Every VICe container is “privileged” in that it has full access to its own OS, but containers are never given access to the control plane or a datastore.
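For illustration, these are the two settings in question as they appear in the Docker API’s HostConfig type – a minimal sketch, not VIC code. On a Linux container host they hand the container the host’s kernel privileges and filesystem; a containerVM has no shared host OS to expose in the first place.

```go
package main

import (
	"fmt"

	"github.com/docker/docker/api/types/container"
)

func main() {
	// Both of these are honoured by a Docker engine on a Linux host and
	// constitute deliberate back doors between container and host.
	hostCfg := container.HostConfig{
		Privileged: true,                    // full access to the host's devices and kernel capabilities
		Binds:      []string{"/:/hostroot"}, // host-mounted volume: the entire host filesystem
	}
	fmt.Printf("%+v\n", hostCfg)
}
```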

The challenge that arises with containers in VMs is that if a VM is both the isolation domain and a unit of tenancy, then capacity planning and packing without dynamically configurable resource limits is tricky.

Clustering and Scheduling

If IaaS in the form of Linux VMs is your starting point, then your clustering and scheduling has to be managed at a level above. You’re forced to deal with a node abstraction where an increase of capacity means adding nodes. Of course this is exactly what your vSphere admin has to deal with, except that they’re experts at it and their cluster runs a far wider variety of workloads, all of which can be live migrated seamlessly.

But even when you have these Linux nodes, you’re forced to consider how to pack them. As mentioned above, re-sizing a node is disruptive enough to existing workloads that tightly coupling nodes to applications or tenants is either wasteful or limiting. Yet if you don’t do this, you’re faced with a host of isolation concerns. Do I really want these two apps to potentially share a network / kernel? Sure I can express that through labels, but that adds its own complexity.

VICe doesn’t have this problem because of the strongly enforced isolation, the dynamically configurable resource limits and vSphere DRS.

The one notable advantage of a clustering and scheduling abstraction above the IaaS layer is that of portability. This is one reason why companies behind frameworks such as Swarm, Kubernetes, Mesos and others are innovating furiously in this space.

Patching and Downtime

The focus on rolling updates in cloud native frameworks stems from the problem that reconfiguring or patching nodes is disruptive. The 12-factor approach to application design clearly helps with this. Hot adding of memory or disk is possible, but vertical scaling is no longer sexy and it’s only configurable in one direction. The impact of OS patching was lessened to some extent by innovations at CoreOS, but patching the control plane in a node is particularly disruptive because it covers just about everything.

Rolling updates do have the advantage of being a portable abstraction and it’s fine if you’ve architected with that in mind, but it’s no substitute for live migration if you’re in the middle of a debugging session.

VICe is designed such that common maintenance tasks do not disrupt container uptime or accessibility. Changing the resource limits of a Virtual Container Host (VCH) is simply a mouse click. Patching an ESXi host causes a live migration of the container to another host. Upgrading a VCH means momentary downtime for the control plane, but no disruption to TTY sessions or stderr/stdout streaming from the container.

Does this mean that these containers are no longer cattle? Not at all. They’re just as easy to shoot in the head as any other container.

Multi-tenancy

Suppose I gave you 3 physical computers and asked you to create container hosts for 10 tenants with the following constraints:

  • 8 of the tenants’ resource limits should be dynamically configurable up to the entire capacity of the 3 computers.
  • 2 of the tenants must only be scheduled to 2 out of the 3 computers because of GPU, SSD or some other policy.
  • Powering down one of the computers should have no impact on the running workloads, provided that there’s enough compute capacity on the other two.

These requirements don’t seem all that unreasonable, yet this would be difficult with bare metal Linux and impossible with a tenant-per-VM model. For VICe they’re simple to satisfy, because they’re exactly the kind of requirements vSphere admins have been meeting for a long time now.
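As a purely hypothetical sketch, here is how those constraints might be expressed as per-tenant Virtual Container Host settings backed by vSphere resource pools – the type and field names are illustrative, not the actual VIC or vSphere API.

```go
package tenancy

// VCHSpec is an illustrative (not real) model of a per-tenant Virtual
// Container Host backed by a vSphere resource pool.
type VCHSpec struct {
	Tenant        string
	CPULimitMHz   int64    // dynamically reconfigurable up to cluster capacity
	MemoryLimitMB int64    // likewise, without disturbing running containerVMs
	HostAffinity  []string // e.g. restrict 2 of the 10 tenants to the GPU/SSD hosts
}

// The third constraint needs no per-tenant setting at all: putting a host
// into maintenance mode live-migrates its containerVMs elsewhere, capacity
// permitting, which is standard DRS behaviour.
```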

Everything Else

So we’ve not touched on auditing or backup or shared storage or monitoring or a host of other areas that IT people care about. The point is, whether you deploy a container in a VM or as a VM makes a world of difference to the way in which you have to manage it in production.

It’s largely a question of who manages that complexity. Is it you and your fiefdom of Linux VMs or is it your IT admin who already has this responsibility for every other kind of workload your company runs?

Then there are the big questions of clustering, scheduling, patching, packing, isolation and tenancy – should all of those exist at a layer above your workloads or below? That largely boils down to how much you care about portability, whether your apps have been designed for it and, again, whose responsibility it should be.


VIC in a Nutshell

Having set up the blog and created a ticking timebomb of IaaS, it’s time to crack on with a great first question.

What is vSphere Integrated Containers Engine (VICe) and for that matter, what is VIC, the product?

There are a few decent write-ups out there, but they mostly analyze what immediately presents itself rather than the intent or direction behind it. That’s what I’m going to attempt to cover succinctly here. The scope of this post is really just expanding on the summary in the About section. There will be plenty of scope to deep-dive in other posts about specifics.

In a Nutshell

The VIC product is a container engine (VICe), management portal (Admiral) and container registry (Harbor) that all currently support the Docker image format. VIC is entirely OSS, free to use and support is included in the vSphere Enterprise Plus license.

I don’t plan to cover anything more about Admiral or Harbor in this post. Admiral is a neat way to tie together image registries, container deployment endpoints and application templates into a UI. Harbor is a Docker image registry with additional capabilities such as RBAC, replication etc.

VIC engine (VICe) is a ground-up open-source rewrite of research Project Bonneville, the high-level premise of which is “native containers on vSphere”. So what does that actually mean in practice?

  • vSphere is the container host, not Linux
    • Containers are spun up as VMs, not in VMs
    • Every container is fully isolated from the host and from each other
    • Provides per-tenant dynamic resource limits within an ESXi cluster
  • vSphere is the infrastructure, not Linux
    • vSphere networks appear in the Docker client as container networks
    • Images, volumes and container state are provisioned directly to VMFS
    • Shifts complexity away from the user to the admin
  • vSphere is the control plane
    • Use the Docker client to directly control vSphere infrastructure
    • A container endpoint presents as a service abstraction, not as IaaS

VICe will be the fastest and easiest way to provision any Linux-based workload to vSphere, provided that workload can be serialized as a Docker image.

VICe tackles head-on the question of how to authenticate tenants to self-provision workloads directly into a vSphere environment without having to raise tickets for VMs. It provides a vSphere admin the capability to pre-authenticate access to a certain amount of compute, network and storage without having to create resource reservations.

It’s all centered around the notion of a Virtual Container Host (VCH). A VCH is a vApp with a small appliance running in it. It is a per-tenant container namespace with dynamic resource limits. The appliance is the control plane endpoint that the client connects to and is effectively a proxy to the vSphere control plane.

When a tenant wants to run a container from a Docker client, vSphere creates a VM within the resource pool, attaches the image and volumes as disks and boots the VM from an ISO containing a Linux kernel. The image cache and volumes are persisted on a vSphere datastore. The networks are vSphere distributed port groups. The only thing running in the containerVM is the container process and a PID 1 agent that extends the control plane into the containerVM.
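Here is a pseudocode sketch of that flow as the appliance might drive it – every function below is a hypothetical stub standing in for vSphere API calls, not the real VIC implementation.

```go
// A pseudocode sketch of the containerVM provisioning flow. All functions
// are hypothetical stubs, not the actual VIC code.
package vch

import "context"

type vm struct{ name string }

// Stubs standing in for the calls the appliance would make via the vSphere API.
func createVM(ctx context.Context, pool, name string) (*vm, error) { return &vm{name: name}, nil }
func attachDisks(ctx context.Context, v *vm, image string) error   { return nil }
func attachNetworks(ctx context.Context, v *vm) error              { return nil }
func bootFromISO(ctx context.Context, v *vm, cmd []string) error   { return nil }

// RunContainer mirrors the steps described above.
func RunContainer(ctx context.Context, pool, image, name string, cmd []string) error {
	// 1. Create a VM inside the tenant's resource pool (the VCH).
	v, err := createVM(ctx, pool, name)
	if err != nil {
		return err
	}
	// 2. Attach the image cache and volumes as disks on a vSphere datastore.
	if err := attachDisks(ctx, v, image); err != nil {
		return err
	}
	// 3. Connect vNICs to the distributed port groups used as container networks.
	if err := attachNetworks(ctx, v); err != nil {
		return err
	}
	// 4. Boot from an ISO containing a Linux kernel; the PID 1 agent then
	//    starts the container process and extends the control plane inward.
	return bootFromISO(ctx, v, cmd)
}
```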

It’s good for the admin because it allows them complete visibility, auditing and control over which workloads have access to what resources. It also allows them to extend their existing infrastructure to provide a flexible CaaS offering to their clients.

It’s good for the clients because they no longer have to deal with any of the complexities of running a container workload in production. There is no infrastructure to manage, it is simply a service.

In Summary

VICe is a VM provisioning mechanism that applies container characteristics to a VM in a way that is completely transparent to a client. How many of those characteristics you can take advantage of will depend on functional completeness of the version you’re using and a few fundamental limitations that will be a good topic for another blog.

What VICe is not is a container development or build environment. It is designed specifically to sit at the end of a CI pipeline, pull down images from a trusted registry and run them. This is an important distinction. Unfortunately it’s a distinction that currently can’t be separated out in the Docker client or API definition, so at least for now, VICe only supports a subset of the Docker API.