Continuing from my earlier post, attempting to define what VICe is, I want to continue by asking, “how is it different?”. It’s a question I get asked a lot.
I’d like to start with a very abstract observation: The Onion Has Many Layers. I found myself using this in response to a heated discussion in our office recently about VICe and Photon Platform. It’s an attempt to draw attention away from what something is towards more what it does. As an example, the notion of a Pod could theoretically be implemented in many ways. It is a well-defined unit of service that has shared storage, multiple isolated processes, a specific identity on a network etc. A Pod could be represented as a VM running Docker; a synthetic abstraction around a group of containers in a VM, a single container with multiple exec’d processes (if you can live without isolation) or even a group of VMs in a resource pool (if you could find a way to abstract the inter-process communications).
The point is, it’s better to focus on the characteristics and requirements of a thing before you decide what the most appropriate runtime representation of it should be, if at all. AWS Lambda is a great example of this kind of thinking in that it entirely abstracts you from the details. What you care about is being able to define policies that express intent about the quality of service you want and then trust that those policies are translated into appropriate implementation detail.
And this brings us to containers and VMs. When we first pitched Bonneville, the question we were asked most often is “why just one container per VM, isn’t more than one better?”. I was unable to articulate a coherent answer the first time I responded to this, but it gets right to the crux of what VICe is (of course, the correct answer is “it depends” and then just walk away). It all boils down to infrastructure. If I deploy my Docker image as a VM, it relies entirely on vSphere for scheduling, networks, storage and control plane. Deploy it in a VM and it’s now dependent on Linux for many of those same dependencies. In simple terms, are you going to deploy it to a nested hypervisor or to your actual hypervisor? Do you want your container to connect directly to a virtual network via a vNIC or have to go through various layers of a guest OS first?
So who cares? What difference does it make?
While acknowledging that no code is perfect and hypervisor vulnerabilities do exist, something like Dirty COW is as good an example of any of why folks deploying containers are forced to consider the meaning of an isolation domain. Running containers in VMs and treating the VM as an isolation domain seems to be an obvious answer for many.
It’s not just the potential for breaking out of a container though. There are perfectly legitimate ways for back doors to be opened up between a container and its host – privileged mode and host-mounted volumes are obvious examples. VICe offers you neither of these things. Every VICe container is “privileged” in that it has full access to its own OS, but containers are never given access to the control plane or a datastore.
The challenge that arises with containers in VMs is that if a VM is both the isolation domain and a unit of tenancy, then capacity planning and packing without dynamically configurable resource limits is tricky.
Clustering and Scheduling
If IaaS in the form of Linux VMs is your starting point, then your clustering and scheduling has to be managed at a level above. You’re forced to deal with a node abstraction where an increase of capacity means adding nodes. Of course this is exactly what your vSphere admin has to deal with, except of course they’re experts at it and their cluster runs a far wider variety of workloads, all of which can be live migrated seamlessly.
But even when you have these Linux nodes, you’re forced to consider how to pack them. As mentioned above, re-sizing a node is disruptive enough to existing workloads that tightly coupling nodes to applications or tenants is either wasteful or limiting. Yet if you don’t do this, you’re faced with a host of isolation concerns. Do I really want these two apps to potentially share a network / kernel? Sure I can express that through labels, but that adds its own complexity.
VICe doesn’t have this problem because of the strongly enforced isolation, the dynamically configurable resource limits and vSphere DRS.
The one notable advantage of a clustering and scheduling abstraction above the IaaS layer is that of portability. This is one reason why companies behind frameworks such as Swarm, Kubernetes, Mesos and others are innovating furiously in this space.
Patching and Downtime
The focus on rolling updates in cloud native frameworks stems from the problem that reconfiguring or patching nodes is disruptive. The 12-factor approach to application design clearly helps with this. Hot adding of memory or disk is possible, but vertical scaling is no longer sexy and it’s only configurable in one direction. The impact of OS patching was lessened to some extent by innovations at CoreOS, but patching the control plane in a node is particularly disruptive because it covers just about everything.
Rolling updates do have the advantage of being a portable abstraction and it’s fine if you’ve architected with that in mind, but it’s no substitute for live migration if you’re in the middle of a debugging session.
VICe is designed such that common maintenance tasks do not disrupt container uptime or accessibility. Changing the resource limits of a Virtual Container Host (VCH) is simply an mouse click. Patching an ESXi host causes a live migration of the container to another host. Upgrading of a VCH means momentary downtime for the control plane, but no disruption to TTY sessions or stderr/stdout streaming from the container.
Does this mean that these containers are no longer cattle? Not at all. They’re just as easy to shoot in the head as any other container.
If I gave you 3 physical computers and asked you to create container hosts for 10 tenants with the following constraints:
- 8 of the tenants resource limits should be dynamically configurable up to the entire capacity of the 3 computers
- 2 of the tenants must only be scheduled to 2 out of 3 computers because of GPU or SSD or some policy.
- Powering down one of the computers should have no impact on the running workloads, provided that there’s enough compute capacity on the other two.
These requirements don’t seem all that unreasonable, yet this would be difficult with bare metal Linux and impossible with a tenant-per-VM model. However, these requirements are simple for VICe to satisfy because these are the kind of requirements vSphere admins have had for a long time now.
So we’ve not touched on auditing or backup or shared storage or monitoring or a host of other areas that IT people care about. The point is, whether you deploy a container in a VM or as a VM makes a world of difference to the way in which you have to manage it in production.
It’s largely a question of who manages that complexity. Is it you and your fiefdom of Linux VMs or is it your IT admin who already has this responsibility for every other kind of workload your company runs?
Additionally these big questions of clustering, scheduling, patching, packing, isolation and tenancy – should all of those things exist at a layer above your workloads or below? That question largely boils down to a balance of how much you care about portability, whether your apps have been designed for it and again, who’s responsibility it should be.