* [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents
@ 2020-09-16 16:44 Andrea Bolognani
2020-09-16 16:45 ` [RFC DOCUMENT 01/12] kubevirt-and-kvm: Add Index page Andrea Bolognani
` (12 more replies)
0 siblings, 13 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:44 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, abologna, crobinso
Hello there!
Several weeks ago, a group of Red Hatters working on the
virtualization stack (primarily QEMU and libvirt) started a
conversation with developers from the KubeVirt project with the goal
of better understanding and documenting the interactions between the
two.
Specifically, we were interested in integration pain points, the
underlying ideas being that solutions can only be found once those
issues are understood, and that better communication would naturally
lead to improvements on both sides.
This series of documents was born out of that conversation. We're
sharing them with the QEMU and libvirt communities in the hope that
they can be a valuable resource for understanding how the projects
they're working on are consumed by higher-level tools, and what
challenges are encountered in the process.
Note that, while the documents describe a number of potential
directions for things like development of new components, that's all
just brainstorming that naturally occurred as we were learning new
things: the actual design process should, and will, happen on the
upstream lists.
Right now the documents live in their own little git repository[1],
but the expectation is that eventually they will find a suitable
long-term home. The most likely candidate right now is the main
KubeVirt repository, but if you have other locations in mind please
do speak up!
I'm also aware of the fact that this delivery mechanism is fairly
unconventional, but I thought it would be the best way to spark a
discussion around these topics with the QEMU and libvirt developers.
Last but not least, please keep in mind that the documents are a work
in progress, and polish has been applied to them unevenly: while the
information presented is, to the best of our knowledge, all accurate,
some parts are in a rougher state than others. Improvements will
hopefully come over time - and if you feel like helping out in making
that happen, it would certainly be appreciated!
Looking forward to your feedback :)
[1] https://gitlab.com/abologna/kubevirt-and-kvm
--
Andrea Bolognani / Red Hat / Virtualization
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 01/12] kubevirt-and-kvm: Add Index page
From: Andrea Bolognani @ 2020-09-16 16:45 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Index.md
# KubeVirt and the KVM user space
This is the entry point to a series of documents which, together,
detail the current status of KubeVirt and how it interacts with the
KVM user space.
The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt); to make the material more
approachable to them, comparisons with that stack are included and
little to no knowledge of KubeVirt or Kubernetes is assumed.
Each section contains a short summary as well as a link to a separate
document discussing the topic in more detail, with the intention that
readers will be able to quickly get a high-level understanding of the
various topics by reading this overview document and then dig further
into the specific areas they're interested in.
## Architecture
### Goals
* KubeVirt aims to feel completely native to Kubernetes users
* VMs should behave like containers whenever possible
* There should be no features that are limited to VMs when it would
make sense to implement them for containers too
* KubeVirt also aims to support all the workloads that traditional
virtualization can handle
* Windows support, device assignment etc. are all fair game
* When these two goals clash, integration with Kubernetes usually
wins
### Components
* KubeVirt is made up of various discrete components that interact
with Kubernetes and the KVM user space
* The overall design is somewhat similar to that of libvirt, except
with a much higher granularity and many of the tasks offloaded to
Kubernetes
* Some of the components run at the cluster level or host level
with very high privileges, others run at the pod level with
significantly reduced privileges
Additional information: [Components][]
### Runtime environment
* QEMU expects its environment to be set up in advance, something
that is typically taken care of by libvirt
* libvirtd, when not running in session mode, assumes that it has
root-level access to the system and can perform pretty much any
privileged operation
* In Kubernetes, the runtime environment is usually heavily locked
down and many privileged operations are not permitted
* Requiring additional permissions for VMs goes against the goal,
mentioned earlier, of having VMs behave the same as containers
whenever possible
## Specific areas
### Hotplug
* QEMU supports hotplug (and hot-unplug) of most devices, and its use
is extremely common
* Conversely, resources associated with containers such as storage
volumes, network interfaces and CPU shares are allocated upfront
and do not change throughout the life of the workload
* If the container needs more (or less) resources, the Kubernetes
approach is to destroy the existing one and schedule a new one to
take over its role
Additional information: [Hotplug][]
### Storage
* Handled through the same Kubernetes APIs used for containers
* QEMU / libvirt only see an image file and don't have direct
access to the underlying storage implementation
* This makes certain scenarios that are common in the
virtualization world very challenging: examples include hotplug
and full VM snapshots (storage plus memory)
* It might be possible to remove some of these limitations by
changing the way storage is exposed to QEMU, or even by taking
advantage of the storage technologies that QEMU already implements
and making them available to containers in addition to VMs
Additional information: [Storage][]
### Networking
* Application processes running in VMs are hidden behind a network
interface, instead of being visible as local sockets and processes
running in a separate user namespace
* Service meshes proxy and monitor applications by means of
socket redirection and classification on local ports and
process identifiers, so we need to aim for generic compatibility
* Existing solutions replicate a full TCP/IP stack to pretend that
applications running in a QEMU instance are local, leaving no
opportunity for zero-copy mechanisms or for avoiding context
switches
* Network connectivity is shared between control plane and workload
itself. Addressing and port mapping need particular attention
* Linux capabilities granted to the pod might be minimal, or none
at all. Live migration presents further challenges in terms of
network addressing and port mapping
Additional information: [Networking][]
### Live migration
* QEMU supports live migration between hosts, usually coordinated by
libvirt
* Kubernetes expects containers to be disposable, so the equivalent
of live migration would be to simply destroy the ones running on
the source node and schedule replacements on the destination node
* For KubeVirt, a hybrid approach is used: a new container is created
on the target node, then the VM is migrated from the old container,
running on the source node, to the newly-created one
Additional information: [Live migration][]
### CPU pinning
* CPU pinning is not handled by QEMU directly, but is instead
delegated to libvirt
* KubeVirt figures out which CPUs are assigned to the container after
it has been started by Kubernetes, and passes that information to
libvirt so that it can perform CPU pinning
Additional information: [CPU pinning][]
### NUMA pinning
* NUMA pinning is not handled by QEMU directly, but is instead
delegated to libvirt
* KubeVirt doesn't implement NUMA pinning at the moment
Additional information: [NUMA pinning][]
### Isolation
* For security reasons, it's a good idea to run each QEMU process in
an environment that is isolated from the host as well as other VMs
* This includes using a separate unprivileged user account, setting
up namespaces and cgroups, using SELinux...
* QEMU doesn't take care of this itself and delegates it to libvirt
* Most of these techniques serve as the base for containers, so
KubeVirt can rely on Kubernetes providing a similar level of
isolation without further intervention
Additional information: [Isolation][]
## Other tidbits
### Upgrades
* When libvirt is upgraded, running VMs keep using the old QEMU
binary: the new QEMU binary is used for newly-started VMs as well
as after VMs have been power cycled or migrated
* KubeVirt behaves similarly, with the old version of libvirt and
QEMU remaining in use for running VMs
Additional information: [Upgrades][]
### Backpropagation
* Applications using libvirt usually don't provide all information,
e.g. a full PCI topology, and let libvirt fill in the blanks
* This might require a second step where the additional information
is collected and stored along with the original one
* Backpropagation doesn't fit well in Kubernetes' declarative model,
so KubeVirt doesn't currently perform it
Additional information: [Backpropagation][]
## Contacts and credits
This information was collected and organized by many people at Red
Hat, some of whom have agreed to serve as points of contact for
follow-up discussions.
Additional information: [Contacts][]
[Backpropagation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
[CPU pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
[Components]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
[Contacts]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
[Hotplug]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
[Isolation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
[Live migration]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
[NUMA pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[Storage]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
[Upgrades]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md
* [RFC DOCUMENT 02/12] kubevirt-and-kvm: Add Components page
From: Andrea Bolognani @ 2020-09-16 16:46 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
# Components
This document describes the various components of the KubeVirt
architecture, how they fit together, and how they compare to the
traditional virtualization architecture (QEMU + libvirt).
## Traditional architecture
For the comparison to make sense, let's start by reviewing the
architecture used for traditional virtualization.
![libvirt architecture][Components-Libvirt]
(Image taken from the "[Look into libvirt][]" presentation by Osier
Yang, which is a bit old but still mostly accurate from a high-level
perspective.)
In particular, the `libvirtd` process runs with high privileges on
the host and is responsible for managing all VMs.
When asked to start a VM, the management process will
* Prepare the environment by performing a number of privileged
operations upfront
* Set up CGroups
* Set up kernel namespaces
* Apply SELinux labels
* Configure network devices
* Open host files
* ...
* Start a non-privileged QEMU process in that environment
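As a rough illustration (not actual libvirt code: the device name,
SELinux label, cgroup path and command line below are all invented),
the privileged preparation might look like this if done by hand:

```shell
# Create a dedicated cgroup for the VM and cap its CPU usage
mkdir /sys/fs/cgroup/machine.slice/guest1
echo "200000 100000" > /sys/fs/cgroup/machine.slice/guest1/cpu.max

# Create a tap device for the VM's network interface
ip tuntap add dev vnet0 mode tap

# Apply a per-VM SELinux (sVirt) label to the disk image
chcon system_u:object_r:svirt_image_t:s0:c10,c20 \
  /var/lib/libvirt/images/guest1.qcow2

# Finally start QEMU as an unprivileged user in the prepared
# environment (command line massively abbreviated)
qemu-system-x86_64 -runas qemu ...
```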
## Kubernetes
To understand how KubeVirt works, it's first necessary to have some
knowledge of Kubernetes.
In Kubernetes, every user workload runs inside [Pods][]. The pod is
the smallest unit of work that Kubernetes will schedule.
Some facts about pods:
* They consist of one or more containers
* The containers share a network namespace
* The containers have their own PID and mount namespace
* The containers have their own CGroups for CPU, memory, devices and
so forth. They are controlled by k8s and can’t be modified from
outside.
* Pods can be started with extended privileges (`CAP_SYS_NICE`,
`CAP_NET_RAW`, root user, ...)
* The app in the pod can drop privileges, but the pod itself cannot
drop them (`kubectl exec` gives you a shell with the full set of
privileges).
Creating pods with elevated privileges is generally frowned upon, and
depending on the policy decided by the cluster administrator it might
be outright impossible.
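For reference, privilege requests and restrictions are expressed per
container in the pod definition. A minimal sketch of a locked-down
pod that drops all capabilities (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder
    securityContext:
      # A privileged pod would instead request extra capabilities
      # here, e.g. capabilities: { add: ["NET_RAW"] }
      capabilities:
        drop: ["ALL"]
      runAsNonRoot: true
      allowPrivilegeEscalation: false
```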
## KubeVirt architecture
Let's now discuss how KubeVirt is structured.
![KubeVirt architecture][Components-Kubevirt]
The main components are:
* `virt-launcher`, a copy of which runs inside each pod alongside
QEMU and libvirt, is the unprivileged component responsible for
receiving commands from other KubeVirt components and reporting
back events such as VM crashes;
* `virt-handler` runs at the node level via a DaemonSet, and is the
privileged component which takes care of the VM setup by reaching
into the corresponding pod and modifying its namespaces;
* `virt-controller` runs at the cluster level and monitors the API
server so that it can react to user requests and VM events;
* `virt-api`, also running at the cluster level, exposes a few
additional APIs that only apply to VMs, such as the "console" and
"vnc" actions.
When a KubeVirt VM is started:
* We request a Pod with certain privileges and resources from
Kubernetes.
* The kubelet (the node daemon of Kubernetes) prepares the
environment with the help of a container runtime.
* A shim process (virt-launcher) acts as the main entry point in the
pod, and starts libvirt
* Once our node-daemon (virt-handler) can reach our shim process, it
does privileged setup from outside. It reaches into the namespaces
and modifies their content as needed. We mostly have to modify the
mount and network namespaces.
* Once the environment is prepared, virt-handler asks virt-launcher
to start a VM via its libvirt component.
More information can be found in the [KubeVirt architecture][] page.
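The request that kicks off this sequence is an ordinary Kubernetes
API object. A minimal `VirtualMachineInstance` manifest, modeled on
the upstream KubeVirt examples, looks roughly like this:

```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: example-vmi
spec:
  domain:
    resources:
      requests:
        memory: 64M
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/cirros-container-disk-demo
```

virt-controller reacts to the creation of this object by requesting a
pod, and the components described above take it from there.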
## Comparison
The two architectures are quite similar from the high-level point of
view: in both cases there are a number of privileged components which
take care of preparing an environment suitable for running an
unprivileged QEMU process in.
The difference, however, is that while libvirtd takes care of all
this setup itself, in the case of KubeVirt several smaller components
are involved: some of these components are privileged just as libvirtd
is, but others are not, and some of the tasks are not even performed
by KubeVirt itself but rather delegated to the existing Kubernetes
infrastructure.
## Use of libvirtd in KubeVirt
In the traditional virtualization scenario, `libvirtd` provides a
number of useful features on top of those available with plain QEMU,
including
* support for multiple clients connecting at the same time
* management of multiple VMs through a single entry point
* remote API access
KubeVirt interacts with libvirt under certain conditions that make
the features described above irrelevant:
* there's only one client talking to libvirt: `virt-handler`
* libvirt is only asked to manage a single VM
* client and libvirt run in the same pod, so there is no remote
libvirt access
[Components-Kubevirt]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Components-Kubevirt.png
[Components-Libvirt]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Components-Libvirt.png
[KubeVirt architecture]: https://github.com/kubevirt/kubevirt/blob/master/docs/architecture.md
[Look into libvirt]: https://www.slideshare.net/ben_duyujie/look-into-libvirt-osier-yang
[Pods]: https://kubernetes.io/docs/concepts/workloads/pods/
* [RFC DOCUMENT 03/12] kubevirt-and-kvm: Add Hotplug page
From: Andrea Bolognani @ 2020-09-16 16:47 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
# Hotplug
In Kubernetes, pods are defined to be immutable, so it's not possible
to perform hotplug of devices in the same way as with the traditional
virtualization stack.
This limitation is a result of KubeVirt's guiding principle of
integrating with Kubernetes as much as possible and making VMs appear
the same as containers from the user's point of view.
There have been several attempts at lifting this restriction in
Kubernetes over the years, but they have all been unsuccessful so
far.
## Existing hotplug support
When working with containers, changing the amount of resources
associated with a pod will result in it being destroyed and a new
pod with the updated resource allocation being created in its place.
This works fine for containers, which are designed to be clonable and
disposable, but when it comes to VMs they usually can't be destroyed
on a whim and running multiple instances in parallel is generally not
wise even when possible.
## Possible workarounds
Until a proper hotplug API makes its way into Kubernetes, one
possible way to implement hotplug could be to perform migration to a
container compliant with the new allocation request, and only then
perform the QEMU-level hotplug operation.
* [RFC DOCUMENT 04/12] kubevirt-and-kvm: Add Storage page
From: Andrea Bolognani @ 2020-09-16 16:48 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, alitke, rmohr, stefanha, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
# Storage
This document describes the known use-cases and architecture options
we have for Linux Virtualization storage in [KubeVirt][].
## Problem description
The main goal of KubeVirt is to leverage the storage subsystem of
Kubernetes (built around [CSI][] and [Persistent Volumes][] aka PVs)
in order to let both kinds of workloads (VMs and containers) consume
the same storage. As a consequence, KubeVirt is limited in its use of
the QEMU storage subsystem and its features. That means:
* Storage solutions should be implemented in k8s in a way that can be
consumed by both containers and VMs.
* VMs can only consume (and provide) storage features which are
available in the pod, through k8s APIs. For example, a VM will not
support disk snapshots if it’s attached to a storage provider that
doesn’t support it. Ditto for incremental backup, block jobs,
encryption, etc.
## Current situation
### Storage handled outside of QEMU
In this scenario, the VM pod uses a [Persistent Volume Claim
(PVC)][Persistent Volumes] to give QEMU access to a raw storage
device or fs mount, which is provided by a [CSI][] driver. QEMU
**doesn’t** handle any of the storage use-cases such as thin
provisioning, snapshots, change block tracking, block jobs, etc.
This is how things work today in KubeVirt.
![Storage handled outside of QEMU][Storage-Current]
Devices and interfaces:
* PVC: block or fs
* QEMU backend: raw device or raw image
* QEMU frontend: virtio-blk
* alternative: emulated device for wider compatibility and Windows
installations
* CDROM (sata)
* disk (sata)
Pros:
* Simplicity
* Sharing the same storage model with other pods/containers
Cons:
* Limited feature-set (fully off-loaded to the storage provider from
CSI).
* No VM snapshots (disk + memory)
* Limited opportunities for fine-tuning and optimizations for
high-performance.
* Hotplug is challenging, because the set of PVCs in a pod is
immutable.
Questions and comments:
* How to optimize this in QEMU?
* Can we bypass the block layer for this use-case? Like having SPDK
inside the VM pod?
* Rust-based storage daemon (e.g. [vhost_user_block][]) running
inside the VM pod alongside QEMU (bypassing the block layer)
* We should be able to achieve high-performance with local NVME
storage here, with multiple polling IOThreads and multi queue.
* See [this blog post][PVC resize blog] for information about the PVC
resize feature. To implement this for VMs we could have kubevirt
watch PVCs and respond to capacity changes with a corresponding
call to resize the image file (if applicable) and to notify qemu of
the enlarged device.
* Features such as incremental backup (CBT) and snapshots could be
implemented through a generic CSI backend... Device mapper?
Stratis? (See [Other Topics](#other-topics))
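As a command-line sketch of the resize idea above (the file path,
node name and sizes are invented for the example):

```shell
# Grow the image file backing the PV to match the new PVC capacity...
qemu-img resize /var/lib/storage/disk.img 20G

# ...then, for a running VM, notify QEMU of the new size via QMP
# (the node name "drive0" is made up):
#   { "execute": "block_resize",
#     "arguments": { "node-name": "drive0", "size": 21474836480 } }
```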
## Possible alternatives
### Storage device passthrough (highest performance)
Device passthrough via PCI VFIO, SCSI, or vDPA. No storage use-cases
and no CSI, as the device is passed directly to the guest.
![Storage device passthrough][Storage-Passthrough]
Devices and interfaces:
* N/A (hardware passthrough)
Pros:
* Highest possible performance (same as host)
Cons:
* No storage features anywhere outside of the guest.
* No live-migration for most cases.
### File-system passthrough (virtio-fs)
File mount volumes (directories, actually) can be exposed to QEMU via
[virtio-fs][] so that VMs have access to files and directories.
![File-system passthrough (virtio-fs)][Storage-Virtiofs]
Devices and interfaces:
* PVC: file-system
Pros:
* Simplicity from the user-perspective
* Flexibility
* Great for heterogeneous workloads that share data between
containers and VMs (e.g. OpenShift Pipelines)
Cons:
* Performance when compared to block device passthrough
Questions and comments:
* Feature is still quite new (The Windows driver is fresh out of the
oven)
### QEMU storage daemon in CSI for local storage
The qemu-storage-daemon is a user-space daemon that exposes QEMU’s
block layer to external users. It’s similar to [SPDK][], but includes
the implementation of QEMU block layer features such as snapshots and
bitmap tracking for incremental backup (CBT). It also allows a single
NVMe device to be split up, letting multiple QEMU VMs share one NVMe
disk.
In this architecture, the storage daemon runs as part of CSI (control
plane), with the data-plane being either a vhost-user-blk interface
for QEMU or a fs-mount export for containers.
![QEMU storage daemon in CSI for local storage][Storage-QSD]
Devices and interfaces:
* CSI:
* fs mount with a vhost-user-blk socket for QEMU to open
* (OR) fs mount via NBD or FUSE with the actual file-system
contents
* qemu-storage-daemon backend: NVMe local device w/ raw or qcow2
* alternative: any driver supported by QEMU, such as file-posix.
* QEMU frontend: virtio-blk
* alternative: any emulated device (CDROM, virtio-scsi, etc)
* In this case QEMU itself would be consuming vhost-user-blk and
emulating the device for the guest
Pros:
* The NVMe driver from the storage daemon can support partitioning
one NVMe device into multiple blk devices, each shared via a
vhost-user-blk connection.
* Rich feature set, exposing features already implemented in the QEMU
block layer to regular pods/containers:
* Snapshots and thin-provisioning (qcow2)
* Incremental Backup (CBT)
* Compatibility with use-cases from other projects (oVirt, OpenStack,
etc)
* Snapshots, thin-provisioning, CBT and block jobs via QEMU
Cons:
* Complexity due to cascading and splitting of components.
* Depends on the evolution of CSI APIs to provide the right
use-cases.
Questions and comments:
* Local restrictions: QEMU and qemu-storage-daemon should be running
on the same host (for vhost-user-blk shared memory to work).
* Need to cascade CSI providers for volume management (resize,
creation, etc)
* How to share a partitioned NVMe device (from one storage daemon)
with multiple pods?
* See also: [kubevirt/kubevirt#3208][] (similar idea for
vhost-user-net).
* We could do hotplugging under the hood with the storage daemon.
* To expose a new PV, a new qemu-storage-daemon pod can be created
with a corresponding PVC. Conversely, on unplug, the pod can be
deleted. Ideally, we might have a 1:1 relationship between PVs
and storage daemon pods (though 1:n for attaching multiple guests
to a single daemon).
* This requires that we can create a new unix socket connection
from new storage daemon pods to the VMs. The exact way to achieve
this is still to be figured out. According to Adam Litke, the
naive way would require elevated privileges for both pods.
* After having the socket (either the file or a file descriptor)
available in the VM pod, QEMU can connect to it.
* In order to avoid a mix of block devices having a PVC in the VM pod
and others where we just passed the unix socket, we can completely
avoid the PVC case for the VM pod:
* For exposing a PV to QEMU, we would always go through the storage
daemon (i.e. the PVC moves from the VM pod to the storage daemon
pod), so the VM pod always only gets a unix socket connection,
unifying the two cases.
* Using vhost-user-blk from the storage daemon pod performs the
same (or potentially better if this allows for polling that we
wouldn’t have done otherwise) as having a PVC directly in the VM
pod, so while it looks like an indirection, the actual I/O path
would be comparable.
* This architecture would also allow using the native
Gluster/Ceph/NBD/… block drivers in the QEMU process without
making them special (because they wouldn’t use a PVC either),
unifying even more cases.
* Kubernetes has fairly low per-node Pod limits by default so we
may need to be careful about 1:1 Pod/PVC mapping. We may want to
support aggregation of multiple storage connections into a single
q-s-d Pod.
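As a concrete illustration of the data plane discussed above, a
vhost-user-blk export from the storage daemon and the matching QEMU
consumer might be wired up along these lines (paths, node names and
sizes are invented; exact options depend on the QEMU version):

```shell
# Storage daemon side: open the image and export it over vhost-user-blk
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=/pv/disk.img \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --export type=vhost-user-blk,id=export0,node-name=disk0,addr.type=unix,addr.path=/var/run/vu-disk0.sock

# VM side: vhost-user requires shareable guest memory, then the socket
# is consumed by a vhost-user-blk-pci device
qemu-system-x86_64 \
  -object memory-backend-memfd,id=mem0,size=1G,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=char0,path=/var/run/vu-disk0.sock \
  -device vhost-user-blk-pci,chardev=char0 \
  ...
```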
## Other topics
### Device Mapper
Another possibility is to leverage the Linux device-mapper to
provide features such as snapshots and even incremental backup.
For example, [dm-era][] seems to provide the basic primitives for
bitmap tracking.
This could be part of scenario number 1, or cascaded with other PVs
somewhere else.
Is this already being used? For example, [cybozu-go/topolvm][] is a
CSI LVM Plugin for k8s.
### Stratis
[Stratis][] seems to be an interesting project to be leveraged in the
world of Kubernetes.
### vhost-user-blk in other CSI backends
Would it make sense for other CSI backends to implement support for
vhost-user-blk?
[CSI]: https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
[KubeVirt]: https://kubevirt.io/
[PVC resize blog]:
https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
[Persistent Volumes]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
[SPDK]: https://spdk.io/
[Storage-Current]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Current.png
[Storage-Passthrough]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Passthrough.png
[Storage-QSD]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-QSD.png
[Storage-Virtiofs]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Virtiofs.png
[Stratis]: https://stratis-storage.github.io/
[cybozu-go/topolvm]: https://github.com/cybozu-go/topolvm
[dm-era]: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/era.html
[kubevirt/kubevirt#3208]: https://github.com/kubevirt/kubevirt/pull/3208
[vhost_user_block]:
https://github.com/cloud-hypervisor/cloud-hypervisor/tree/master/vhost_user_block
[virtio-fs]: https://virtio-fs.gitlab.io/
* [RFC DOCUMENT 05/12] kubevirt-and-kvm: Add Networking page
From: Andrea Bolognani @ 2020-09-16 16:50 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: sbrivio, vromanso, alkaplan, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
# Networking
## Problem description
Service meshes (such as [Istio][], [Linkerd][]) typically expect
application processes to run on the same physical host, usually in a
separate user namespace. Network namespaces might be used too, for
additional isolation. Network traffic to and from local processes is
monitored and proxied by redirection and observation of local
sockets. `iptables` and `nftables` (collectively referred to as the
`netfilter` framework) are the typical Linux facilities providing
classification and redirection of packets.
![containers][Networking-Containers]
*Service meshes with containers. Typical ingress path:
**1.** NIC driver queues buffers for IP processing
**2.** `netfilter` rules installed by *service mesh* redirect packets
to proxy
**3.** IP receive path completes, L4 protocol handler invoked
**4.** TCP socket of proxy receives packets
**5.** proxy opens TCP socket towards application service
**6.** packets get TCP header, ready for classification
**7.** `netfilter` rules installed by service mesh forward request to
service
**8.** local IP routing queues packets for TCP protocol handler
**9.** application process receives packets and handles request.
Egress path is conceptually symmetrical.*
If we move application processes to VMs, sockets and processes are
not visible anymore. All the traffic is typically forwarded via
interfaces operating at data link level. Socket redirection and port
mapping to local processes don't work.
![and now?][Networking-Challenge]
*Application process moved to VM:
**8.** IP layer enqueues packets to L2 interface towards application
**9.** `tap` driver forwards L2 packets to guest
**10.** packets are received on `virtio-net` ring buffer
**11.** guest driver queues buffers for IP processing
**12.** IP receive path completes, L4 protocol handler invoked
**13.** TCP socket of application receives packets and handles request.
**:warning: Proxy challenge**: the service mesh can't forward packets
to local sockets via `netfilter` rules. *Add-on* NAT rules might
conflict, as service meshes expect full control of the ruleset.
Socket monitoring and PID/UID classification isn't possible.*
## Existing solutions
Existing solutions typically implement a full TCP/IP stack, replaying
traffic on sockets local to the Pod of the service mesh. This creates
the illusion of application processes running on the same host,
possibly separated by user namespaces.
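Outside KubeVirt, the same pattern shows up in plain QEMU user-mode networking (based on *libslirp*), where explicit port forwarding creates the replica described above; the command line below is a generic illustration, not KubeVirt's actual invocation:

```
qemu-system-x86_64 \
    -netdev user,id=net0,hostfwd=tcp::8080-:80 \
    -device virtio-net-pci,netdev=net0
```

Here host port 8080 is replayed onto guest port 80 by the userspace stack, with all the stack-traversal and addressing downsides listed below.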
![slirp][Networking-Slirp]
*Existing solutions introduce a third TCP/IP stack:
**8.** local IP routing queues packets for TCP protocol handler
**9.** userspace implementation of TCP/IP stack receives packets on
local socket, and
**10.** forwards L2 encapsulation to `tap` *QEMU* interface (socket
back-end).*
While being almost transparent to the service mesh infrastructure,
this kind of solution comes with a number of downsides:
* three different TCP/IP stacks (guest, adaptation and host) need to
be traversed for every service request. There is no opportunity to
implement zero-copy mechanisms, and the number of context switches
increases dramatically
* addressing needs to be coordinated to create the pretense of
consistent addresses and routes between guest and host
environments. This typically needs a NAT with masquerading, or some
form of packet bridging
* the traffic seen by the service mesh and observable externally is a
distant replica of the packets forwarded to and from the guest
environment:
* TCP congestion windows and network buffering mechanisms in
general operate differently from what would be naturally expected
by the application
* protocols carrying addressing information might pose additional
challenges, as the applications don't see the same set of
addresses and routes as they would if deployed with regular
containers
## Experiments
![experiments: thin layer][Networking-Experiments-Thin-Layer]
*How can we improve on the existing solutions while maintaining
drop-in compatibility? A thin layer implements a TCP adaptation
and IP services.*
These are some directions we have been exploring so far:
* a thinner layer between guest and host, that only implements what's
strictly needed to pretend processes are running locally. A further
TCP/IP stack is not necessarily needed. Some sort of TCP adaptation
is needed, however, if this layer (currently implemented as
userspace process) runs without the `CAP_NET_RAW` capability: we
can't create raw IP sockets on the Pod, and therefore need to map
packets at layer 2 to layer 4 sockets offered by the host kernel
* to avoid implementing an actual TCP/IP stack like the one
offered by *libslirp*, we can align TCP parameters advertised
towards the guest (MSS, congestion window) to the socket
parameters provided by the host kernel, probing them via the
`TCP_INFO` socket option (introduced in Linux 2.4).
Segmentation and reassembly are therefore not needed, making it
feasible to avoid dynamic memory allocation altogether, and
congestion control becomes implicitly equivalent as parameters
are mirrored between the two sides
* to reflect the actual receiving dynamic of the guest and support
retransmissions without a permanent userspace buffer, segments
are not dequeued (`MSG_PEEK`) until acknowledged by the receiver
(application)
* similarly, the implementation of the host-side sender adjusts MSS
(`TCP_MAXSEG` socket option, since Linux 2.6.28) and advertised
window (`TCP_WINDOW_CLAMP`, since Linux 2.4) to the parameters
observed from incoming packets
* this adaptation layer needs to maintain some of the TCP states,
but we can rely on the host kernel TCP implementation for the
different states of connections being closed
* no particular requirements are placed on the MTU of guest
interfaces: if fragments are received, payload from the single
fragmented packets can be reassembled by the host kernel as
needed, and out-of-order fragments can be safely discarded, as
there's no intermediate hop justifying the condition
* this layer would connect to `qemu` over a *UNIX domain socket*,
instead of a `tap` interface, so that the `CAP_NET_ADMIN`
capability doesn't need to be granted to any process on the Pod:
no further network interfaces are created on the host
* transparent, adaptive mapping of ports to the guest, to avoid the
need for explicit port forwarding
* security and maintainability goals: no dynamic memory allocation,
~2 000 *LoC* target, no external dependencies
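As a rough sketch of the `TCP_INFO` probing and parameter mirroring described above (not the actual implementation), the following shows how a userspace process can read the kernel's MSS values for a connection and clamp another socket to match; the field offsets follow `struct tcp_info` from `linux/tcp.h` and assume Linux:

```python
import socket
import struct

# TCP_INFO is socket option 11 on Linux; in struct tcp_info, eight
# one-byte fields are followed by __u32 tcpi_rto, tcpi_ato,
# tcpi_snd_mss and tcpi_rcv_mss (see linux/tcp.h).
TCP_INFO = 11
TCP_WINDOW_CLAMP = getattr(socket, "TCP_WINDOW_CLAMP", 10)

def probe_mss(sock):
    """Return (snd_mss, rcv_mss) as reported by the host kernel."""
    buf = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 64)
    _rto, _ato, snd_mss, rcv_mss = struct.unpack_from("=IIII", buf, 8)
    return snd_mss, rcv_mss

def mirror_params(sock, mss, window):
    """Align MSS and advertised window to the probed values."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, mss)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_WINDOW_CLAMP, window)

# Demo: a loopback connection stands in for the host-side socket.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

snd_mss, rcv_mss = probe_mss(cli)
mirror_params(conn, snd_mss, 65535)  # these values would face the guest

for s in (cli, conn, srv):
    s.close()
print("snd_mss:", snd_mss, "rcv_mss:", rcv_mss)
```

In the real adaptation layer the probed values would be advertised in the TCP options and window fields of segments sent towards the guest, rather than applied to a second local socket.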
![experiments: ebpf][Networking-Experiments-eBPF]
*Additionally, an `eBPF` fast path could be implemented
**6.** hooking at socket level, and
**7.** mapping IP and Ethernet addresses,
with the existing layer implementing connection tracking and slow
path*
If additional capabilities are granted, the data path can be
optimised in several ways:
* with `CAP_NET_RAW`:
* the adaptation layer can use raw IP sockets instead of L4 sockets,
implementing a pure connection tracking, without the need for any
TCP logic: the guest operating system implements the single TCP
stack needed with this variation
* zero-copy mechanisms could be implemented using `vhost-user` and
QEMU socket back-ends, instead of relying on a full-fledged layer 2
(Ethernet) interface
* with `CAP_BPF` and `CAP_NET_ADMIN`:
* context switching in packet forwarding could be avoided by using
the `sockmap` extension provided by `eBPF` and programming the `XDP`
data hooks for in-kernel data transfers
* using eBPF programs, we might want to switch (dynamically?) to
the `vhost-net` facility
* the userspace process would still need to take care of
establishing in-kernel flows, and providing IP and IPv6
services (ARP, DHCP, NDP) for addressing transparency and to
avoid the need for further capabilities (e.g.
`CAP_NET_BIND_SERVICE`), but the main, fast datapath would
reside entirely in the kernel
[Istio]: https://istio.io/
[Linkerd]: https://linkerd.io/
[Networking-Challenge]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Challenge.png
[Networking-Containers]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Containers.png
[Networking-Experiments-Thin-Layer]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-Thin-Layer.png
[Networking-Experiments-eBPF]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-eBPF.png
[Networking-Slirp]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Slirp.png
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 06/12] kubevirt-and-kvm: Add Live Migration page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (4 preceding siblings ...)
2020-09-16 16:50 ` [RFC DOCUMENT 05/12] kubevirt-and-kvm: Add Networking page Andrea Bolognani
@ 2020-09-16 16:51 ` Andrea Bolognani
2020-09-16 16:52 ` [RFC DOCUMENT 07/12] kubevirt-and-kvm: Add CPU Pinning page Andrea Bolognani
` (6 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:51 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
# Live Migration
There are two scenarios where live migration is triggered in
KubeVirt:
* As per user request, by posting a `VirtualMachineInstanceMigration`
to the cluster
* As per cluster request, for instance on a Node eviction (due to
lack of resources or maintenance of a given Node)
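For the first scenario, the user posts a manifest along these lines (resource names are illustrative):

```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: my-vmi
```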
In both situations, KubeVirt will use libvirt to handle logic and
coordination with QEMU while KubeVirt's components manage the
Kubernetes control plane and the cluster's limitations.
In short, KubeVirt:
* Checks whether the target host is capable of migrating the given VM
* Handles the single network namespace per Pod by proxying migration
data (more at [Networking][])
* Handles cluster resource usage (e.g. bandwidth usage)
* Handles cross-version migration
![Live migration between two nodes][Live-Migration-Flow]
## Limitations
Live migration is not possible if:
* The VM is configured with CPU passthrough
* The VM has a local or otherwise non-shared volume
* The Pod is using bridge binding for network access (right side of
the image below)
![Kubevirt's Pod][Live-Migration-Network]
## More on KubeVirt's Live migration
This [post on live migration][] explains how to enable live
migration for KubeVirt VMs and describes some of its caveats.
[Live-Migration-Flow]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Live-Migration-Flow.png
[Live-Migration-Network]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Live-Migration-Network.png
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[post on live migration]: https://kubevirt.io/2020/Live-migration.html
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 07/12] kubevirt-and-kvm: Add CPU Pinning page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (5 preceding siblings ...)
2020-09-16 16:51 ` [RFC DOCUMENT 06/12] kubevirt-and-kvm: Add Live Migration page Andrea Bolognani
@ 2020-09-16 16:52 ` Andrea Bolognani
2020-09-16 16:53 ` [RFC DOCUMENT 08/12] kubevirt-and-kvm: Add NUMA " Andrea Bolognani
` (5 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:52 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
# CPU pinning
As is the case for many of KubeVirt's features, CPU pinning is
partially achieved using standard Kubernetes components: this both
reduces the amount of new code that has to be written and guarantees
better integration with containers running side by side with the VMs.
## Kubernetes CPU Manager
The CPU Manager's static policy allocates exclusive CPUs to pod
containers that are in the Guaranteed QoS class and request an
integer number of CPUs. On a best-effort basis, the static policy
tries to allocate CPUs topologically in the following order:
* Allocate all the CPUs in the same processor socket if available and
the container requests at least an entire socket worth of CPUs.
* Allocate all the logical CPUs (hyperthreads) from the same physical
CPU core if available and the container requests an entire core
worth of CPUs.
* Allocate any available logical CPU, preferring to acquire CPUs from
the same socket.
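In practice, a container is only eligible for exclusive CPUs when its requests and limits are equal and specify an integer CPU count, e.g.:

```yaml
resources:
  requests:
    cpu: "4"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 4Gi
```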
## KubeVirt dedicated CPU placement
KubeVirt relies on the Kubernetes CPU Manager to allocate dedicated
CPUs to the `virt-launcher` container.
When `virt-launcher` starts, it reads
`/sys/fs/cgroup/cpuset/cpuset.cpus` and generates `<vcpupin>`
configuration for libvirt based on the information found within.
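The translation can be sketched as follows; this is an illustrative helper, not KubeVirt's actual code:

```python
# Expand a cgroup cpuset string such as "4-5,8" and pin vCPUs to the
# resulting host CPUs one-to-one, emitting libvirt <vcpupin> elements.
def parse_cpuset(cpuset):
    """Expand '4-5,8' into [4, 5, 8]."""
    cpus = []
    for part in cpuset.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def vcpupin_xml(cpuset, n_vcpus):
    cpus = parse_cpuset(cpuset)
    if n_vcpus > len(cpus):
        raise ValueError("not enough dedicated CPUs for the vCPUs")
    return "\n".join(
        f"  <vcpupin vcpu='{vcpu}' cpuset='{cpu}'/>"
        for vcpu, cpu in zip(range(n_vcpus), cpus)
    )

print("<cputune>\n" + vcpupin_xml("4-5,8", 2) + "\n</cputune>")
```

With the contents of `cpuset.cpus` being `4-5,8` and two vCPUs, this yields `<vcpupin>` elements pinning vCPU 0 to host CPU 4 and vCPU 1 to host CPU 5.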
However, affinity changes require `CAP_SYS_NICE`, so this additional
capability has to be granted to the VM pod.
Going forward, we would like to perform the affinity change in
`virt-handler` (the privileged component running at the node level),
which would allow the VM pod to work without additional capabilities.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 08/12] kubevirt-and-kvm: Add NUMA Pinning page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (6 preceding siblings ...)
2020-09-16 16:52 ` [RFC DOCUMENT 07/12] kubevirt-and-kvm: Add CPU Pinning page Andrea Bolognani
@ 2020-09-16 16:53 ` Andrea Bolognani
2020-09-16 16:54 ` [RFC DOCUMENT 09/12] kubevirt-and-kvm: Add Isolation page Andrea Bolognani
` (4 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:53 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
# NUMA pinning
KubeVirt doesn't currently implement NUMA pinning due to Kubernetes
limitations.
## Kubernetes Topology Manager
The Topology Manager allows aligning CPU and peripheral device
allocations by NUMA node. It still has many limitations:
* Not scheduler aware.
* Doesn’t allow memory alignment.
* etc...
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 09/12] kubevirt-and-kvm: Add Isolation page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (7 preceding siblings ...)
2020-09-16 16:53 ` [RFC DOCUMENT 08/12] kubevirt-and-kvm: Add NUMA " Andrea Bolognani
@ 2020-09-16 16:54 ` Andrea Bolognani
2020-09-16 16:55 ` [RFC DOCUMENT 10/12] kubevirt-and-kvm: Add Upgrades page Andrea Bolognani
` (3 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:54 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
# Isolation
How is the QEMU process isolated from the host and from other VMs?
## Traditional virtualization
cgroups
* managed by libvirt
SELinux
* libvirt is privileged and QEMU is protected by SELinux policies set
by libvirt (SVirt)
* QEMU runs with SELinux type `svirt_t`
## KubeVirt
cgroups
* Managed by kubelet
* No involvement from libvirt
* Memory limits
* When using hard limits, the entire VM can be killed by Kubernetes
* Memory consumption estimates are based on heuristics
SELinux
* KubeVirt is not using SVirt and there are no plans to do so
* At the moment, the custom [KubeVirt SELinux policy][] is used to
ensure libvirt has sufficient privilege to perform its own setup
steps
* The standard SELinux type used by containers is `container_t`
* KubeVirt would like to eventually use the same for VMs as well
Capabilities
* The default set of capabilities is fairly conservative
* Privileged operations should happen outside of the pod: in
KubeVirt's case, a good candidate is `virt-handler`, the
privileged component that runs at the node level
* Additional capabilities can be requested for a pod
* However, this is frowned upon and considered a liability from the
security point of view
* The cluster admin may even set a security policy that prevents
pods from using certain capabilities
* In such a scenario, KubeVirt workloads may be entirely unable
to run
## Specific examples
The following is a list of examples, either historical or current,
of scenarios where libvirt's approach to isolation clashed with
Kubernetes' and changes to one component or the other were necessary.
SELinux
* libvirt's use of hugetlbfs for hugepage configuration is
disallowed by `container_t`
* Possibly fixable by using memfd
* [libvirt memoryBacking docs][]
* [KubeVirt memfd issue][]
* Use of libvirt+QEMU multiqueue tap support is disallowed by
`container_t`
* And there’s no way to pass in this setup from outside the
existing stack
* [KubeVirt multiqueue workaround][] extending their SELinux policy to allow
`attach_queue`
* Passing precreated tap devices to libvirt triggers
relabelfrom+relabelto `tun_socket` SELinux access
* This may not be the virt stack's fault; it seems to happen
automatically when permissions aren't correct
Capabilities
* libvirt performs memory locking for VFIO devices unconditionally
* Previously KubeVirt had to grant `CAP_SYS_RESOURCE` to pods.
KubeVirt worked around it by duplicating libvirt's memory pinning
calculations so that the libvirt action would be a no-op, but that
is fragile and the issue may resurface if libvirt's calculation
logic changes.
* References: [libvir-list memlock thread][], [KubeVirt memlock
PR][], [libvirt qemuDomainGetMemLockLimitBytes][], [KubeVirt
VMI.getMemlockSize][]
* virtiofsd requires `CAP_SYS_ADMIN` capability to perform
`unshare(CLONE_NEWPID|CLONE_NEWNS)`
* This is required for certain use cases, like running overlayfs
in the VM on top of virtiofs, but is not a requirement for all
use cases.
* References: [KubeVirt virtiofs PR][], [RHEL virtiofs bug][]
* KubeVirt uses libvirt for CPU pinning, which requires the pod to
have `CAP_SYS_NICE`.
* Long term, KubeVirt would like to handle that pinning in their
privileged component virt-handler, so `CAP_SYS_NICE` can be
dropped.
* Side note: libvirt unconditionally requires `CAP_SYS_NICE` when
any other running VM is using CPU pinning; however, this sounds
like a plain old bug.
* References: [KubeVirt CPU pinning PR][], [KubeVirt CPU pinning
workaround PR][], [RHEL CPU pinning bug][]
* libvirt bridge usage used to require `CAP_NET_ADMIN`
* This is a historical example for reference. libvirt usage of a
bridge device always implied tap device creation, which required
`CAP_NET_ADMIN` privileges for the pod
* The fix was to teach libvirt to accept a precreated tap device
and skip some setup operations on it
* Example XML: `<interface type='ethernet'><target dev='mytap0'
managed='no'/></interface>`
* KubeVirt still hasn't fully managed to drop `CAP_NET_ADMIN`,
though
* References: [RHEL precreated TAP bug][], [libvirt precreated TAP
patches][], [KubeVirt precreated TAP PR][], [KubeVirt NET_ADMIN
PR][], [KubeVirt NET_ADMIN issue][]
[KubeVirt CPU pinning PR]: https://github.com/kubevirt/kubevirt/pull/1381
[KubeVirt CPU pinning workaround PR]: https://github.com/kubevirt/kubevirt/pull/1648
[KubeVirt NET_ADMIN PR]: https://github.com/kubevirt/kubevirt/pull/3290
[KubeVirt NET_ADMIN issue]: https://github.com/kubevirt/kubevirt/issues/3085
[KubeVirt SELinux policy]: https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-handler/virt_launcher.cil
[KubeVirt VMI.getMemlockSize]: https://github.com/kubevirt/kubevirt/blob/f5ffba5f84365155c81d0e2cda4aa709da062230/pkg/virt-handler/isolation/isolation.go#L206
[KubeVirt memfd issue]: https://github.com/kubevirt/kubevirt/issues/3781
[KubeVirt memlock PR]: https://github.com/kubevirt/kubevirt/pull/2584
[KubeVirt multiqueue workaround]: https://github.com/kubevirt/kubevirt/pull/2941/commits/bc55cb916003c54f6cbf329112a4e36d0d874836
[KubeVirt precreated TAP PR]: https://github.com/kubevirt/kubevirt/pull/2837
[KubeVirt virtiofs PR]: https://github.com/kubevirt/kubevirt/pull/3493
[RHEL CPU pinning bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1819801
[RHEL precreated TAP bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1723367
[RHEL virtiofs bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1854595
[libvir-list memlock thread]: https://www.redhat.com/archives/libvirt-users/2019-August/msg00046.html
[libvirt memoryBacking docs]: https://libvirt.org/formatdomain.html#elementsMemoryBacking
[libvirt precreated TAP patches]: https://www.redhat.com/archives/libvir-list/2019-August/msg01256.html
[libvirt qemuDomainGetMemLockLimitBytes]: https://gitlab.com/libvirt/libvirt/-/blob/84bb5fd1ab2bce88e508d416f4bcea520c803ea8/src/qemu/qemu_domain.c#L8712
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 10/12] kubevirt-and-kvm: Add Upgrades page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (8 preceding siblings ...)
2020-09-16 16:54 ` [RFC DOCUMENT 09/12] kubevirt-and-kvm: Add Isolation page Andrea Bolognani
@ 2020-09-16 16:55 ` Andrea Bolognani
2020-09-16 16:56 ` [RFC DOCUMENT 11/12] kubevirt-and-kvm: Add Backpropagation page Andrea Bolognani
` (2 subsequent siblings)
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:55 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md
# Upgrades
The KubeVirt installation and upgrade processes are entirely
controlled by an [operator][], which is a common pattern in the
Kubernetes world. The operator is a piece of software that runs in
the cluster and manages the lifecycle of other components, in this
case KubeVirt.
## The operator
What it does:
* Manages the whole KubeVirt installation
* Keeps the cluster actively in sync with the desired state
* Upgrades KubeVirt with zero downtime
## Installation
Install the operator:
```bash
$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-operator.yaml
$ kubectl get pods -n kubevirt
NAME                             READY   STATUS    RESTARTS   AGE
virt-operator-58cf9d6648-c7qph   1/1     Running   0          69s
virt-operator-58cf9d6648-pvzw2   1/1     Running   0          69s
```
Trigger the installation of KubeVirt:
```bash
$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-cr.yaml
$ kubectl get pods -n kubevirt
NAME                               READY   STATUS    RESTARTS   AGE
virt-api-8bdd88557-fllhr           1/1     Running   0          59s
virt-controller-55ccb8cdcb-5rtp6   1/1     Running   0          43s
virt-controller-55ccb8cdcb-v8kr9   1/1     Running   0          43s
virt-handler-67pns                 1/1     Running   0          43s
```
The process happens in two steps because the operator relies on the
KubeVirt [custom resource][] for information on the desired
installation, and will not do anything until that resource exists in
the cluster.
## Upgrade
The upgrading process is similar:
* Install the latest operator
* Reference the new version in the KubeVirt CustomResource to trigger
the actual upgrade
```bash
$ kubectl get kubevirt -n kubevirt kubevirt -o yaml
apiVersion: kubevirt.io/v1alpha3
kind: KubeVirt
metadata:
  name: kubevirt
spec:
  imageTag: v0.30
  certificateRotateStrategy: {}
  configuration: {}
  imagePullPolicy: IfNotPresent
```
Note the `imageTag` attribute: when present, the KubeVirt operator
will take steps to ensure that the version of KubeVirt that's
deployed on the cluster matches its value, which in our case will
trigger an upgrade.
The following chart explains the upgrade flow in more detail and
shows how the various components are affected:
![KubeVirt upgrade flow][Upgrades-Kubevirt]
KubeVirt is released as a complete suite: no individual
`virt-launcher` releases are planned. Everything is tested together,
everything is released together.
## QEMU and libvirt
The versions of QEMU and libvirt used for VMs are also tied to the
version of KubeVirt and are upgraded along with everything else.
* Migrations from old libvirt/QEMU to new libvirt/QEMU pairs are
possible
* As soon as the new `virt-handler` and the new controller are rolled
out, the cluster will only start VMIs with the new QEMU/libvirt
versions
## Version compatibility
The virt stack is updated along with KubeVirt, which mitigates
compatibility concerns. As a rule of thumb, versions of QEMU and
libvirt older than a year or so are not taken into consideration.
Currently, the ability to perform backward migration (e.g. from a
newer version of QEMU to an older one) is not necessary, but that
could very well change as KubeVirt becomes more widely used.
[Upgrades-Kubevirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Upgrades-Kubevirt.png
[custom resource]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
[operator]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 11/12] kubevirt-and-kvm: Add Backpropagation page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (9 preceding siblings ...)
2020-09-16 16:55 ` [RFC DOCUMENT 10/12] kubevirt-and-kvm: Add Upgrades page Andrea Bolognani
@ 2020-09-16 16:56 ` Andrea Bolognani
2020-09-16 16:57 ` [RFC DOCUMENT 12/12] kubevirt-and-kvm: Add Contacts page Andrea Bolognani
2020-09-22 9:29 ` [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Philippe Mathieu-Daudé
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:56 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
# Backpropagation
Whenever a partial VM configuration is submitted to libvirt, any missing
information is automatically filled in to obtain a configuration that's
complete enough to guarantee long-term guest ABI stability.
PCI addresses are perhaps the most prominent example of this: most management
applications don't include this information at all in the XML they submit to
libvirt, and rely on libvirt building a reasonable PCI topology to support the
requested devices.
For example, using a made-up YAML syntax for brevity, the input could look like
```yaml
devices:
  disks:
    - image: /path/to/image.qcow2
```
and the output could be augmented by libvirt to look like
```yaml
devices:
  controllers:
    - model: pcie-root-port
      address:
        type: pci
        domain: 0x0000
        bus: 0x00
        slot: 0x01
        function: 0x0
  disks:
    - image: /path/to/image.qcow2
      model: virtio-blk
      address:
        type: pci
        domain: 0x0000
        bus: 0x01
        slot: 0x00
        function: 0x0
```
This is where backpropagation comes in: the only version of the VM
configuration that is complete enough to guarantee a stable guest ABI is the
one that includes all information added by libvirt, so if the management
application wants to be able to make further changes to the VM it needs to
reflect the additional information back into its understanding of the VM
configuration somehow.
For applications like virsh and virt-manager, this is easy: they don't have
their own configuration format or even store the VM configuration, and
simply fetch it from libvirt and operate on it directly every single time.
oVirt, to the best of my knowledge, generates an initial VM configuration based
on the settings provided by the user, submits it to libvirt and then parses
back the augmented version, figuring out what information was added and
updating its database to match: if the VM configuration needs to be generated
again later, it will include all information present in the database, including
those that originated from libvirt rather than the user.
KubeVirt does not currently perform any backpropagation. There are two ways a
user can influence PCI address allocation:
* explicitly add a `pciAddress` attribute for the device, which will cause
KubeVirt to pass the corresponding address to libvirt, which in turn will
attempt to comply with the user's request;
* add the `kubevirt.io/placePCIDevicesOnRootComplex` annotation to the VM
configuration, which will cause KubeVirt to provide libvirt with a
fully-specified PCI topology where all devices live on the PCIe Root Bus.
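A sketch of the first option, with illustrative names (`pciAddress` is the actual field in the KubeVirt API):

```yaml
spec:
  domain:
    devices:
      disks:
        - name: rootdisk
          disk:
            bus: virtio
            pciAddress: "0000:01:00.0"
```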
In all cases but the one where KubeVirt defines the full PCI topology itself,
it's implicitly relying on libvirt always building the PCI topology in the
exact same way every single time in order to have a stable guest ABI. While
this works in practice, it's not something that libvirt actually guarantees:
once a VM has been defined, libvirt will never change its PCI topology, but
submitting the same partial VM configuration to different libvirt versions can
result in different PCI topologies.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC DOCUMENT 12/12] kubevirt-and-kvm: Add Contacts page
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (10 preceding siblings ...)
2020-09-16 16:56 ` [RFC DOCUMENT 11/12] kubevirt-and-kvm: Add Backpropagation page Andrea Bolognani
@ 2020-09-16 16:57 ` Andrea Bolognani
2020-09-22 9:29 ` [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Philippe Mathieu-Daudé
12 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-16 16:57 UTC (permalink / raw)
To: libvir-list, qemu-devel; +Cc: vromanso, rmohr, crobinso
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
# Contacts and credits
# Contacts
The following people have agreed to serve as points of contact for
follow-up discussion around the topics included in these documents.
## Overall
* Andrea Bolognani <<abologna@redhat.com>> (KVM user space)
* Cole Robinson <<crobinso@redhat.com>> (KVM user space)
* Roman Mohr <<rmohr@redhat.com>> (KubeVirt)
* Vladik Romanovsky <<vromanso@redhat.com>> (KubeVirt)
## Networking
* Alona Paz <<alkaplan@redhat.com>> (KubeVirt)
* Stefano Brivio <<sbrivio@redhat.com>> (KVM user space)
## Storage
* Adam Litke <<alitke@redhat.com>> (KubeVirt)
* Stefan Hajnoczi <<stefanha@redhat.com>> (KVM user space)
# Credits
In addition to those listed above, the following people have also
contributed to the documents or the discussion around them.
Ademar Reis, Adrian Moreno Zapata, Alice Frosi, Amnon Ilan, Ariel
Adam, Christophe de Dinechin, Dan Kenigsberg, David Gilbert, Eduardo
Habkost, Fabian Deutsch, Gerd Hoffmann, Jason Wang, John Snow, Kevin
Wolf, Marc-André Lureau, Michael Henriksen, Michael Tsirkin, Paolo
Bonzini, Peter Krempa, Petr Horacek, Richard Jones, Sergio Lopez,
Steve Gordon, Victor Toso, Vivek Goyal.
If your name should be in the list above but is not, please know that
was an honest mistake and not a way to downplay your contribution!
Get in touch and we'll get it sorted out :)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
` (11 preceding siblings ...)
2020-09-16 16:57 ` [RFC DOCUMENT 12/12] kubevirt-and-kvm: Add Contacts page Andrea Bolognani
@ 2020-09-22 9:29 ` Philippe Mathieu-Daudé
2020-09-24 18:31 ` Andrea Bolognani
12 siblings, 1 reply; 15+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-09-22 9:29 UTC (permalink / raw)
To: Andrea Bolognani, libvir-list, qemu-devel; +Cc: crobinso, rmohr, vromanso
Hi Andrea,
On 9/16/20 6:44 PM, Andrea Bolognani wrote:
> Hello there!
>
> Several weeks ago, a group of Red Hatters working on the
> virtualization stack (primarily QEMU and libvirt) started a
> conversation with developers from the KubeVirt project with the goal
> of better understanding and documenting the interactions between the
> two.
>
> Specifically, we were interested in integration pain points, with the
> underlying ideas being that only once those issues are understood it
> becomes possible to look for solutions, and that better communication
> would naturally lead to improvements on both sides.
>
> This series of documents was born out of that conversation. We're
> sharing them with the QEMU and libvirt communities in the hope that
> they can be a valuable resource for understanding how the projects
> they're working on are consumed by higher-level tools, and what
> challenges are encountered in the process.
>
> Note that, while the documents describe a number of potential
> directions for things like development of new components, that's all
> just brainstorming that naturally occurred as we were learning new
> things: the actual design process should, and will, happen on the
> upstream lists.
>
> Right now the documents live in their own little git repository[1],
> but the expectation is that eventually they will find a suitable
> long-term home. The most likely candidate right now is the main
> KubeVirt repository, but if you have other locations in mind please
> do speak up!
>
> I'm also aware of the fact that this delivery mechanism is fairly
> unconventional, but I thought it would be the best way to spark a
> discussion around these topics with the QEMU and libvirt developers.
>
> Last but not least, please keep in mind that the documents are a work
> in progress, and polish has been applied to them unevenly: while the
> information presented is, to the best of our knowledge, all accurate,
> some parts are in a rougher state than others. Improvements will
> hopefully come over time - and if you feel like helping out in making
> that happen, it would certainly be appreciated!
>
> Looking forward to your feedback :)
>
>
> [1] https://gitlab.com/abologna/kubevirt-and-kvm
Thanks a lot for this documentation, I could learn new things and
use cases outside my area of interest. It is useful as a developer
to better understand how the areas I'm working on are used. This
shortens a bit the gap between developers and users.
What would be more valuable than a developer review/feedback is
having feedback from users and technical writers.
Suggestion: also share it on qemu-discuss@nongnu.org which is
less technical (maybe simply repost the cover and link to the
Wiki).
--
What is not obvious in this cover (and the documents pasted on
the list) is there are schema pictures on the Wiki pages which
are not viewable and appreciable via an email post.
--
I had zero knowledge of Kubernetes, and I was confused by how
its concepts are used in the introduction...
From Index:
"The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt), and in order to make it
more approachable to them, comparisons are included and little to no
knowledge of KubeVirt or Kubernetes is assumed."
Then, in Architecture's {Goals and Components}, there is an
assumption that Kubernetes is known. Once you reach Components,
Kubernetes is explained briefly but sufficiently.
Then KubeVirt is very well explained.
--
Sometimes the "Other topics" category is confusing: it seems out
of the scope of "better understanding and documenting the
interactions between KubeVirt and KVM" and looks like leftover
notes. E.g.:
"Another possibility is to leverage device-mapper from Linux to
provide features such as snapshots and even Incremental Backup.
For example, dm-era seems to provide the basic primitives for
bitmap tracking.
This could be part of scenario number 1, or cascaded with other PVs
somewhere else.
Is this already being used? For example, cybozu-go/topolvm is a
CSI LVM Plugin for k8s."
"vhost-user-blk in other CSI backends
Would it make sense for other CSI backends to implement support for
vhost-user-blk?"
"The audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt)". Feeling part of that
audience, I have no clue how to answer these questions...
I'd prefer you to tell me :)
Maybe renaming the "Other topics" section would help.
"Unanswered questions", "Other possibilities to investigate",...
--
Very good contribution in documentation,
Thanks!
Phil.
* Re: [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents
2020-09-22 9:29 ` [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Philippe Mathieu-Daudé
@ 2020-09-24 18:31 ` Andrea Bolognani
0 siblings, 0 replies; 15+ messages in thread
From: Andrea Bolognani @ 2020-09-24 18:31 UTC (permalink / raw)
To: Philippe Mathieu-Daudé, libvir-list, qemu-devel
Cc: crobinso, rmohr, vromanso
On Tue, 2020-09-22 at 11:29 +0200, Philippe Mathieu-Daudé wrote:
> Hi Andrea,
Hi Philippe, and sorry for the delay in answering!
First of all, thanks for taking the time to go through the documents
and posting your thoughts. More comments below.
> Thanks a lot for this documentation; I learned new things and
> use cases outside my area of interest. It is useful, as a
> developer, to better understand how the areas I'm working on are
> used. This shortens a bit the gap between developers and users.
>
> What would be even more valuable than developer review/feedback
> is feedback from users and technical writers.
> Suggestion: also share it on qemu-discuss@nongnu.org which is
> less technical (maybe simply repost the cover and link to the
> Wiki).
More eyes would obviously be good, but note that these are really
intended to improve the interactions between QEMU/libvirt and
KubeVirt, so the audience is ultimately developers. Of course you
could say that KubeVirt developers *are* users when it comes to
QEMU/libvirt, and you wouldn't be wrong ;) Still, qemu-devel seems
like the proper venue.
> What is not obvious from this cover letter (and the documents
> pasted on the list) is that there are schema diagrams on the Wiki
> pages which cannot be viewed or appreciated via an email post.
You're right! I was pretty sure I had a line about that somewhere
in there, but I guess it got lost during editing. Hopefully the URL
at the very beginning of each document prompted people to browse
the HTML version.
> I had zero knowledge of Kubernetes, and I was confused by how
> its concepts are used in the introduction...
>
> From Index:
>
> "The intended audience is people who are familiar with the traditional
> virtualization stack (QEMU plus libvirt), and in order to make it
> more approachable to them, comparisons are included and little to no
> knowledge of KubeVirt or Kubernetes is assumed."
>
> Then, in Architecture's {Goals and Components}, there is an
> assumption that Kubernetes is known. Once you reach Components,
> Kubernetes is explained briefly but sufficiently.
>
> Then KubeVirt is very well explained.
I guess the sections in the Index you're referring to assume that you
know that Kubernetes is somehow connected to containers, and that
it's a clustered environment. Anything else I missed?
Perhaps we could move the contents of
https://gitlab.cee.redhat.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md#kubernetes
to a small document that's linked to near the very top. Would that
improve things, in your opinion?
> Sometimes the "Other topics" category is confusing: it seems out
> of the scope of "better understanding and documenting the
> interactions between KubeVirt and KVM" and looks like leftover
> notes.
That's probably because they absolutely are O:-)
> Maybe renaming the "Other topics" section would help.
> "Unanswered questions", "Other possibilities to investigate",...
This sounds sensible :)
Thanks again for your feedback!
--
Andrea Bolognani / Red Hat / Virtualization
Thread overview: 15+ messages
2020-09-16 16:44 [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Andrea Bolognani
2020-09-16 16:45 ` [RFC DOCUMENT 01/12] kubevirt-and-kvm: Add Index page Andrea Bolognani
2020-09-16 16:46 ` [RFC DOCUMENT 02/12] kubevirt-and-kvm: Add Components page Andrea Bolognani
2020-09-16 16:47 ` [RFC DOCUMENT 03/12] kubevirt-and-kvm: Add Hotplug page Andrea Bolognani
2020-09-16 16:48 ` [RFC DOCUMENT 04/12] kubevirt-and-kvm: Add Storage page Andrea Bolognani
2020-09-16 16:50 ` [RFC DOCUMENT 05/12] kubevirt-and-kvm: Add Networking page Andrea Bolognani
2020-09-16 16:51 ` [RFC DOCUMENT 06/12] kubevirt-and-kvm: Add Live Migration page Andrea Bolognani
2020-09-16 16:52 ` [RFC DOCUMENT 07/12] kubevirt-and-kvm: Add CPU Pinning page Andrea Bolognani
2020-09-16 16:53 ` [RFC DOCUMENT 08/12] kubevirt-and-kvm: Add NUMA " Andrea Bolognani
2020-09-16 16:54 ` [RFC DOCUMENT 09/12] kubevirt-and-kvm: Add Isolation page Andrea Bolognani
2020-09-16 16:55 ` [RFC DOCUMENT 10/12] kubevirt-and-kvm: Add Upgrades page Andrea Bolognani
2020-09-16 16:56 ` [RFC DOCUMENT 11/12] kubevirt-and-kvm: Add Backpropagation page Andrea Bolognani
2020-09-16 16:57 ` [RFC DOCUMENT 12/12] kubevirt-and-kvm: Add Contacts page Andrea Bolognani
2020-09-22 9:29 ` [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents Philippe Mathieu-Daudé
2020-09-24 18:31 ` Andrea Bolognani