The Container Engine is a Commodity

I started my career 20 years ago when Java was the new kid on the block. I worked at IBM for many years on their JVM and I was very proud of the innovation and work that we did there.

During that time, a curious thing happened. The JVM became a commodity. People didn’t really care too much about who had the better garbage collector or the faster JIT; rather, they cared about stability and function. As new versions of Java came out, a new bytecode instruction would sometimes sneak in or profiling might be improved a little, but the bulk of the innovation and function was in the class libraries – the rich APIs sitting on top of this commoditized JVM and the portable bytecode spec. It’s easy to forget today just how much third-party innovation grew up around it. IBM bet their entire enterprise software stack on this humble little interpreter.

I believe that the exact same thing is happening in the container space and the parallels with Java are striking.

What does a container engine actually have to do, fundamentally? It has to provide some networking and storage virtualization, a sandpit for running a process, various means of data ingress and egress into that process, a format for representing binary dependencies, and a control plane for lifecycle management. Beyond that, yes, you can have logging plugins, the ability to execute other processes, snapshotting and so on, but none of that is essential.

Nothing captured this requirement more clearly than the CRI-O announcement earlier this year. It was a conscious effort to put a stake in the ground and say, “here is a straw man for what we need a basic container runtime to do”. Docker themselves came out with containerd earlier this year as a way of attempting to map the multi-process daemon model onto the synchronous single-process runC as part of their OCI integration. The net effect is a secondary API boundary sitting below the Docker API that represents basic container compute primitives.
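
To make “basic container runtime” concrete, here is a rough sketch of what running a process under runC alone involves – an OCI bundle is nothing more than a root filesystem plus a config, and the runtime supplies the lifecycle (the image and container name below are arbitrary examples):

    # borrow a root filesystem from any image - busybox will do
    mkdir -p bundle/rootfs
    docker export $(docker create busybox) | tar -C bundle/rootfs -xf -
    cd bundle
    runc spec            # generates a default OCI config.json
    sudo runc run demo   # create, start and attach to a container called "demo"

Everything else – builds, image distribution, orchestration, logging – lives above that line.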

The VIC team did the exact same thing back in January when we were re-architecting the Bonneville codebase. We intentionally designed a container primitives layer consisting of 5 distinct services – persistence, network, compute, events and interaction. The network and storage services would mutate the underlying infrastructure, the compute service would create containers, and the event and interaction services would provide optional value-add around the running containers. We did this with the intent of being able to layer multiple personalities on top – everything from a Trivial Container Engine that only uses the compute service to a Docker personality. Beyond that, the goal was to end up with low-level interfaces that could also abstract away the underlying implementation – it should be possible to deploy a container host inside a container host and manage both using the exact same client. Cool, huh?!

Almost a year on from that original white-boarding session, it’s worth reflecting on where we find ourselves. The Docker personality in VICe will never be fully API complete*. There are many reasons for this, but primarily it’s because that API is a level above what VICe is really aiming at. The Docker API now covers development, clustering, image resolution, HA, deployment and build. VICe is focused on the runtime – provisioning and deployment with clustering, scheduling and HA transparently integrated at a lower layer.

I’m often asked, “how is VICe ever going to keep up with the pace of innovation?”. Well, if the industry can standardize around a container primitives API with an agreement to use explicit rather than implicit interfaces, then I think it’s very possible. If the Container Engine is a commodity and we get that bit right, most of the innovation will happen in the layers above it and should remain compatible with it. It may take a few goes around to find the right abstractions, but in my opinion, it’s a goal well worth pursuing.

[* This is a prediction, not a commitment to not doing it 😉 ]

Video: Downloading and Installing VICe 0.7.0

The first of many videos is now up. I’m going to try to keep them relatively short and informative. How to download, install and run VIC felt like a good place to start.

Please comment if there’s any particular focus you’d like to request from another video and I’ll add it to the list.


How is VICe Different?

Following on from my earlier post attempting to define what VICe is, I want to ask, “how is it different?”. It’s a question I get asked a lot.

I’d like to start with a very abstract observation: The Onion Has Many Layers. I found myself using this in response to a heated discussion in our office recently about VICe and Photon Platform. It’s an attempt to draw attention away from what something is and towards what it does. As an example, the notion of a Pod could theoretically be implemented in many ways. It is a well-defined unit of service that has shared storage, multiple isolated processes, a specific identity on a network etc. A Pod could be represented as a VM running Docker; a synthetic abstraction around a group of containers in a VM; a single container with multiple exec’d processes (if you can live without isolation); or even a group of VMs in a resource pool (if you could find a way to abstract the inter-process communications).

The point is, it’s better to focus on the characteristics and requirements of a thing before you decide what the most appropriate runtime representation of it should be, if at all. AWS Lambda is a great example of this kind of thinking in that it entirely abstracts you from the details. What you care about is being able to define policies that express intent about the quality of service you want and then trust that those policies are translated into appropriate implementation detail.

And this brings us to containers and VMs. When we first pitched Bonneville, the question we were asked most often was “why just one container per VM, isn’t more than one better?”. I was unable to articulate a coherent answer the first time I responded to this, but it gets right to the crux of what VICe is (of course, the correct answer is “it depends” and then just walk away). It all boils down to infrastructure. If I deploy my Docker image as a VM, it relies entirely on vSphere for scheduling, networks, storage and control plane. Deploy it in a VM and it now depends on Linux for many of those same things. In simple terms, are you going to deploy it to a nested hypervisor or to your actual hypervisor? Do you want your container to connect directly to a virtual network via a vNIC, or have to go through various layers of a guest OS first?

So who cares? What difference does it make?

Isolation

While acknowledging that no code is perfect and hypervisor vulnerabilities do exist, something like Dirty COW is as good an example as any of why folks deploying containers are forced to consider the meaning of an isolation domain. Running containers in VMs and treating the VM as an isolation domain seems to be an obvious answer for many.

It’s not just the potential for breaking out of a container though. There are perfectly legitimate ways for back doors to be opened up between a container and its host – privileged mode and host-mounted volumes are obvious examples. VICe offers you neither of these things. Every VICe container is “privileged” in that it has full access to its own OS, but containers are never given access to the control plane or a datastore.
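
For context, these are the sorts of back doors being referred to – perfectly standard Docker options that VICe simply doesn’t offer:

    # full access to the host's devices and kernel capabilities
    docker run --privileged -it ubuntu bash
    # the host's entire filesystem, mounted into the container
    docker run -v /:/host -it ubuntu chroot /host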

The challenge that arises with containers in VMs is that if a VM is both the isolation domain and a unit of tenancy, then capacity planning and packing without dynamically configurable resource limits is tricky.

Clustering and Scheduling

If IaaS in the form of Linux VMs is your starting point, then your clustering and scheduling has to be managed at a level above. You’re forced to deal with a node abstraction where an increase of capacity means adding nodes. This is exactly what your vSphere admin has to deal with, except of course they’re experts at it and their cluster runs a far wider variety of workloads, all of which can be live migrated seamlessly.

But even when you have these Linux nodes, you’re forced to consider how to pack them. As mentioned above, re-sizing a node is disruptive enough to existing workloads that tightly coupling nodes to applications or tenants is either wasteful or limiting. Yet if you don’t do this, you’re faced with a host of isolation concerns. Do I really want these two apps to potentially share a network / kernel? Sure I can express that through labels, but that adds its own complexity.

VICe doesn’t have this problem because of the strongly enforced isolation, the dynamically configurable resource limits and vSphere DRS.

The one notable advantage of a clustering and scheduling abstraction above the IaaS layer is that of portability. This is one reason why companies behind frameworks such as Swarm, Kubernetes, Mesos and others are innovating furiously in this space.

Patching and Downtime

The focus on rolling updates in cloud native frameworks stems from the problem that reconfiguring or patching nodes is disruptive. The 12-factor approach to application design clearly helps with this. Hot adding of memory or disk is possible, but vertical scaling is no longer sexy and it’s only configurable in one direction. The impact of OS patching was lessened to some extent by innovations at CoreOS, but patching the control plane in a node is particularly disruptive because it covers just about everything.

Rolling updates do have the advantage of being a portable abstraction and it’s fine if you’ve architected with that in mind, but it’s no substitute for live migration if you’re in the middle of a debugging session.

VICe is designed such that common maintenance tasks do not disrupt container uptime or accessibility. Changing the resource limits of a Virtual Container Host (VCH) is simply a mouse click. Patching an ESXi host causes a live migration of the container to another host. Upgrading a VCH means momentary downtime for the control plane, but no disruption to TTY sessions or stderr/stdout streaming from the container.

Does this mean that these containers are no longer cattle? Not at all. They’re just as easy to shoot in the head as any other container.

Multi-tenancy

If I gave you 3 physical computers and asked you to create container hosts for 10 tenants with the following constraints:

  • 8 of the tenants’ resource limits should be dynamically configurable up to the entire capacity of the 3 computers
  • 2 of the tenants must only be scheduled to 2 out of 3 computers because of GPU or SSD or some policy.
  • Powering down one of the computers should have no impact on the running workloads, provided that there’s enough compute capacity on the other two.

These requirements don’t seem all that unreasonable, yet this would be difficult with bare metal Linux and impossible with a tenant-per-VM model. However, these requirements are simple for VICe to satisfy because these are the kind of requirements vSphere admins have had for a long time now.

Everything Else

So we’ve not touched on auditing or backup or shared storage or monitoring or a host of other areas that IT people care about. The point is, whether you deploy a container in a VM or as a VM makes a world of difference to the way in which you have to manage it in production.

It’s largely a question of who manages that complexity. Is it you and your fiefdom of Linux VMs or is it your IT admin who already has this responsibility for every other kind of workload your company runs?

Additionally, these big questions of clustering, scheduling, patching, packing, isolation and tenancy – should all of those things exist at a layer above your workloads or below? That question largely boils down to a balance of how much you care about portability, whether your apps have been designed for it and, again, whose responsibility it should be.


VIC in a Nutshell

Having set up the blog and created a ticking timebomb of IaaS, it’s time to crack on with a great first question.

What is vSphere Integrated Containers Engine (VICe) and for that matter, what is VIC, the product?

There are a few decent write-ups out there, but mostly they can only analyze what immediately presents itself, with less insight into the intent or direction. That’s what I am going to attempt to succinctly cover here. The scope of this post is really just an expansion of the summary in the About section. There will be plenty of scope to deep-dive into specifics in other posts.

In a Nutshell

The VIC product is a container engine (VICe), a management portal (Admiral) and a container registry (Harbor), all of which currently support the Docker image format. VIC is entirely OSS and free to use, and support is included in the vSphere Enterprise Plus license.

I don’t plan to cover anything more about Admiral or Harbor in this post. Admiral is a neat way to tie together image registries, container deployment endpoints and application templates into a UI. Harbor is a Docker image registry with additional capabilities such as RBAC, replication etc.

VIC engine (VICe) is a ground-up open-source rewrite of research Project Bonneville, the high-level premise of which is “native containers on vSphere”. So what does that actually mean in practice?

  • vSphere is the container host, not Linux
    • Containers are spun up as VMs not in VMs
    • Every container is fully isolated from the host and from each other
    • Provides per-tenant dynamic resource limits within an ESXi cluster
  • vSphere is the infrastructure, not Linux
    • vSphere networks appear in the Docker client as container networks
    • Images, volumes and container state are provisioned directly to VMFS
    • Shifts complexity away from the user to the admin
  • vSphere is the control plane
    • Use the Docker client to directly control vSphere infrastructure
    • A container endpoint presents as a service abstraction, not as IaaS

VICe will be the fastest and easiest way to provision any Linux-based workload to vSphere, provided that workload can be serialized as a Docker image.

VICe tackles head-on the question of how to authenticate tenants to self-provision workloads directly into a vSphere environment without having to raise tickets for VMs. It provides a vSphere admin the capability to pre-authenticate access to a certain amount of compute, network and storage without having to create resource reservations.

It’s all centered around the notion of a Virtual Container Host (VCH). A VCH is a vApp with a small appliance running in it. It is a per-tenant container namespace with dynamic resource limits. The appliance is the control plane endpoint that the client connects to and is effectively a proxy to the vSphere control plane.

When a tenant wants to run a container from a Docker client, vSphere creates a VM within the resource pool, attaches the image and volumes as disks and boots the VM from an ISO containing a Linux kernel. The image cache and volumes are persisted on a vSphere datastore. The networks are vSphere distributed port groups. The only thing running in the containerVM is the container process and a PID 1 agent that extends the control plane into the containerVM.
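
As a sketch of what this looks like from the tenant’s side – the endpoint address below is made up and the exact TLS options depend on how the VCH was deployed, but the client is just the stock Docker CLI pointed at the VCH rather than at a Linux Docker daemon:

    export DOCKER_HOST=tcp://vch-tenant-a.example.com:2376
    docker --tls info                    # reports the VCH, not a Linux host
    docker --tls run -d -p 80:80 nginx   # provisioned as a containerVM on vSphere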

It’s good for the admin because it allows them complete visibility, auditing and control over which workloads have access to what resources. It also allows them to extend their existing infrastructure to provide a flexible CaaS offering to their clients.

It’s good for the clients because they no longer have to deal with any of the complexities of running a container workload in production. There is no infrastructure to manage, it is simply a service.

In Summary

VICe is a VM provisioning mechanism that applies container characteristics to a VM in a way that is completely transparent to a client. How many of those characteristics you can take advantage of will depend on functional completeness of the version you’re using and a few fundamental limitations that will be a good topic for another blog.

What VICe is not is a container development or build environment. It is designed specifically to sit at the end of a CI pipeline, pull down images from a trusted registry and run them. This is an important distinction. Unfortunately it is a distinction that currently can’t be separated out in the Docker client or API definition, so at least for now, VICe only supports a subset of the Docker API.

Guesswork and Gambling

My first blog post is about running Docker in production. This blog in fact. Why pay for a WordPress hosting solution when I could make life so much harder for myself for your amusement?

I started with a GCE 2GB 1 vCPU instance. 2GB should be enough, right? Off the bat, I’m introduced to one of the most curious hangovers from our transition from Operating System to Cloud: the static reservation. Guesswork and gambling: guess how much you need and then pray you don’t need more (before I’ve even written my first post, I’m already fretting about the complexities of horizontal scaling).

The bottom line though is that my blog can afford to be down for short periods. The fact that many financial institutions still consider 24/7 online access a luxury… I mean really, how important is it for my vapid spewings to be highly available? No, what’s most important to me is the integrity of my data and being able to scale my compute, networking and storage bandwidth with minimal disruption in the unlikely event that this site becomes popular.

So I start by installing Docker into an Ubuntu 16.04 instance. I note that the install procedure has changed again. It’s improved – nice that it now uses apt. Once it’s installed and I’ve run hello-world, do I want to configure anything special? I decide not. This is not going to be the end of a CI pipeline, it’s going to run one thing for a long time. In fact, in this particular case, the utility of Docker is really the dependency management and provisioning aspect. The fact that I can grab a compose file and just run something.
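
For reference, the install went roughly like this – the repository URL and package name have moved around over the years, so treat it as a sketch and defer to Docker’s current docs:

    sudo apt-get update
    sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    sudo apt-get update && sudo apt-get install -y docker-ce
    sudo docker run hello-world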

Having found this article https://docs.docker.com/compose/wordpress/ on Docker Compose and WordPress, I decide to just go for it and install docker-compose. Naturally I used apt. Ha! Fool! The version apt gave me is incompatible with the yml file in the demo, so I screw around with curl to pull down a more recent version from GitHub. The second problem was that the GCE VM instance doesn’t include python-ipaddress, which is an essential dependency of docker-compose. Of course, this is why we go straight to apt for stuff.
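
Something like the following got me to a working docker-compose – the version number is just an example, pick whichever release understands the demo’s yml format:

    sudo apt-get remove -y docker-compose
    sudo curl -L "https://github.com/docker/compose/releases/download/1.9.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    docker-compose --version
    # the dependency that was missing from the stock GCE image
    sudo apt-get install -y python-ipaddress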

So having run the compose file, docker ps shows me that I have WordPress and MySQL containers running. OK. Some more faffing and I realise the reason my browser can’t see it is that the GCE firewall rules only allow port 80 for HTTP traffic. Simple to modify.
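
Something along these lines opens the published port – the rule name is arbitrary and the port depends on what the compose file maps WordPress to (8000 in the demo’s case):

    gcloud compute firewall-rules create allow-wordpress --allow tcp:8000

Alternatively, change the published port in the yml to 80:80 and the default HTTP rule covers it.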

OK, so now I can create some test content and presumably it’s been eaten by MySQL. Next step is to think about how to separate data and compute. Why would I want to do this? Because I want to be able to manage the lifecycle of one independently of the other – particularly if I want to scale one horizontally or vertically. Docker volumes alone are not going to help me here, because the only disk I can map in is the boot disk of the VM. I either need a separate disk for the MySQL data or a data container on a separate VM. Nice to see that GCE offers SQL instances with multi-region high availability and dynamically expandable disk, but this is my money and I don’t need it. Of course the beauty of Cloud is that I should be able to test and verify all of these things.

I create a 20GB SSD disk and add it to my running instance, where it appears as /dev/sdb. Partition and format the disk using fdisk and mkfs, update /etc/fstab and mount to a local directory. Reconfigure the compose yml file to point to my new mount and restart. All beautifully simple. I also choose to move the compose configuration to the new disk as I want that preserved too. What did strike me as curious is that disk throughput is proportional to size: if I need more bandwidth, I have to make the disk bigger.
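
The whole sequence, with the disk, instance and zone names being mine:

    # create and attach a 20GB SSD persistent disk
    gcloud compute disks create blog-data --size 20GB --type pd-ssd --zone us-central1-a
    gcloud compute instances attach-disk my-blog --disk blog-data --zone us-central1-a

    # inside the VM: partition, format and mount it on boot
    sudo fdisk /dev/sdb                 # one primary partition -> /dev/sdb1
    sudo mkfs.ext4 /dev/sdb1
    sudo mkdir -p /mnt/data
    echo '/dev/sdb1 /mnt/data ext4 defaults 0 2' | sudo tee -a /etc/fstab
    sudo mount /mnt/data

The compose change is then just a matter of pointing the db service’s volume at a directory on the new mount (something like /mnt/data/mysql:/var/lib/mysql) instead of the default named volume.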

So before I go all in on this, what do I want to test for?

  1. Expand compute capacity with a small outage
  2. Expand disk space and/or disk throughput with a small outage
  3. Snapshot the data disk and bring it back

Vertical scaling is going to be a lot easier than horizontal at this stage, which should be a simple power down, reconfigure, power up and compose up, right? Let’s double the capacity – 2 vCPU and 4GB (commands below). Pleasantly surprised that compose automatically brought my app back on reboot. The one thing I find curious though is that GCE gives me CPU, disk and network monitoring out of the box, but not memory. Seems like if I want host memory monitoring, I need to sign up for Stackdriver. Hmm. Docker stats is actually quite neat in giving me a high-level summary of my consumption from the guest perspective, but without graphing over time, it’s of limited use. Piping vmstat to a file just seems very Not Cloud.
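
For the record, the resize was just this – the custom machine type name is an assumption for a 2 vCPU / 4GB shape, so substitute whatever you’re after:

    gcloud compute instances stop my-blog --zone us-central1-a
    gcloud compute instances set-machine-type my-blog --machine-type custom-2-4096 --zone us-central1-a
    gcloud compute instances start my-blog --zone us-central1-a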

Expanding disk space isn’t as easy as creating a snapshot and creating a new disk from it, because the partition doesn’t grow with the disk. However, manually copying the data over from one disk to another works just fine and can happen out of band. Since throughput is tied to disk size, I’d have to do this same operation for either. It doesn’t seem like I can easily automate snapshots, so a cron job in the VM that drives the GCE APIs could be a fun little project.
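
A crontab entry along these lines would do it, assuming the Cloud SDK is installed in the VM and the instance’s service account is allowed to call the compute API (disk and zone names are mine again):

    # nightly snapshot of the data disk at 03:00 - note the escaped % signs, which crontab requires
    0 3 * * * gcloud compute disks snapshot blog-data --zone us-central1-a --snapshot-names blog-data-$(date +\%Y\%m\%d)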

So, here it is. Overall a confidence-inspiring and relatively simple exercise. You are my customer.