Re: [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess

From: John G Johnson <john.g.johnson@oracle.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "\"Daniel P. Berrangé\"" <berrange@redhat.com>,
	"Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	sstabellini@kernel.org, "Jag Raman" <jag.raman@oracle.com>,
	konrad.wilk@oracle.com, "Stefan Hajnoczi" <stefanha@gmail.com>,
	qemu-devel@nongnu.org, ross.lagerwall@citrix.com,
	liran.alon@oracle.com, kanth.ghatraju@oracle.com
Subject: Re: [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess
Date: Thu, 7 Mar 2019 15:29:41 -0800	[thread overview]
Message-ID: <BDEBF2EE-DE0F-46CF-B60E-536B3DA9BF77@oracle.com> (raw)
In-Reply-To: <20190307192727.GG2915@stefanha-x1.localdomain>


> On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
>> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
>>> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
>>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
>>>> new file mode 100644
>>>> index 0000000..e29c6c8
>>>> --- /dev/null
>>>> +++ b/docs/devel/qemu-multiprocess.txt
>>> 
>>> Thanks for this document and the interesting work that you are doing.
>>> I'd like to discuss the security advantages gained by disaggregating
>>> QEMU in more detail.
>>> 
>>> The security model for VMs managed by libvirt (most production x86, ppc,
>>> s390 guests) is that the QEMU process is untrusted and only has access
>>> to resources belonging to the guest.  SELinux is used to restrict the
>>> process from accessing other files, processes, etc on the host.
>> 
>> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
>> can even do isolation with traditional DAC by putting each QEMU under
>> a distinct UID/GID and having libvirtd set ownership on resources each
>> VM is permitted to use.
>> 
>>> QEMU does not hold privileged resources that must be kept away from the
>>> guest.  An escaped guest can access its image file, tap file descriptor,
>>> etc but they are the same resources it could already access via device
>>> emulation.
>>> 
>>> Can you give specific examples of how disaggregation improves security?
> 
> Elena & collaborators: Dan has posted some ideas but please share yours
> so the security benefits of this patch series can be better understood.
> 

	Dan covered the main point.  The security regime we use (selinux)
constrains the actions of processes on objects, so having multiple processes
allows us to apply more fine-grained policies.


>> I guess one obvious answer is that the existing security mechanisms like
>> SELinux/ApArmor/DAC can be made to work in a more fine grained manner if
>> there are distinct processes. This would allow for a more useful seccomp
>> filter to better protect against secondary kernel exploits should QEMU
>> itself be exploited, if we can protect individual components.
> 
> Fine-grained sandboxing is possible in theory but tedious in practice.
> From what I can tell this patch series doesn't implement any sandboxing
> for child processes.
> 

	The policies aren’t in QEMU, but in the selinux config files.
They would say, for example, that when the QEMU process exec()s the
disk emulation process, the process security context type transitions
to a new type.  This type would have permission to access the VM image
objects, whereas the QEMU process type (and any other device emulation
process types) cannot access them.

	If you wanted to use DAC, you could do the something similar by
making the disk emulation executable setuid to a UID than can access
VM image files.

	In either case, the policies and permissions are set up before
libvirt even runs, so it doesn’t need to be aware of them.


> There must be a convenient way to get fine-grained sandboxing for
> disaggregated devices.  In other words, it shouldn't be left as an
> exercise to device process authors.
> 

	We can add some MAC or DAC suggestions in the documentation.


> How to do this in practice must be clear from the beginning if
> fine-grained sandboxing is the main selling point.
> 
> Some details to start the discussion:
> 
> * How will fine-grained SELinux/AppArmor/DAC policies be configured for
>   each process?  I guess this requires root, so does libvirt need to
>   know about each process?
> 

	The polices would apply to process security context types (or
UIDs in a DAC regime), so I would not expect libvirt to be aware of them.


> * We need to make sure that processes cannot send signals to each
>   other, ptrace, interfere in /proc/$PID, etc.  How will this be done?
> 

	Any process type restrictions would be enforced by selinux.


> * Were you planning to use any other sandboxing mechanisms
>   (namespaces?)?  How will they be set up if the device processed is
>   forked/executed by an unprivileged QEMU?
> 

	All of the QEMU-related process related to a single VM will run
in the same container, but the container is created, along with it selinux
policies, before libvirt is run.


>> Not everything is protected by MAC/DAC. For example network based disks
>> typically have a username + password for accessing the remote storage
>> server. Best practice would be a distinct username for every QEMU process
>> such that each can only access its own storage, but I don't know of any
>> app which does that. So ability to split off backends into separate
>> processes could limit exposure of information that is not otherwise
>> protected by current protection models.
> 
> If the disaggregated disk process with a global username + password is
> compromised then all your disk images are compromised.  So you still
> need to follow the best practice of per-VM credentials even with
> disaggregation, and if you do then disaggregation doesn't add anything!
> 

	You could put disk secrets in files that can only be read by the
disk emulation process type.  If you wanted even finer granularity, you
could use MCS to run each disk controller instance in a different security
context category, and make the secret files only readable by the corresponding
category.

	Another layer of security would be to have network security policies
that only allow the disk emulation processes to connect to the storage servers.

								JJ