[PATCH] docs/qemu-deprivilege: Revise and update with status and future plans

From: George Dunlap @ 2018-03-22 18:24 UTC
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, Ross Lagerwall, Julien Grall, Jan Beulich,
	Anthony Perard, Ian Jackson

docs/qemu-deprivilege.txt had some basic instructions for using
dm_restrict, but it was incomplete, misleading, and stale.

Update the docs in a number of ways.

Introduce a section mentioning minimum versions of Linux, Xen, and
qemu required (TBD).

Fix the discussion of qemu userid.  Mention xen-qemuuser-range-base,
and provide example shell code that actually has some hope of working
(instead of failing out after creating 900 userids).

Describe how to enable restrictions, as well as features which
probably don't or definitely don't work.

Introduce a "Technical Details" section which describes specifically
what restrictions are currently done, and also what restrictions we
are looking at doing in the future.

The idea here is that as we implement the various items for the
future, we move them from "Restrictions still to do" to "Restrictions
done".  This can also act as a design document -- a place for public
discussion of what can or should be done and how.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
Thank you to Ross Lagerwall, whose description of what XenServer is
doing formed much of the basis for the text here.
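
For anyone who wants to sanity-check the UID arithmetic in the adduser
loop without creating any users, here is a minimal dry-run sketch.  It
only prints the commands.  The function name `print_adduser_cmds` and
the `BASE_UID` variable are made up for illustration; 65536 is the
example base UID used in the doc.

```shell
#!/bin/sh
# Dry run only: print the adduser commands instead of executing them.
# BASE_UID mirrors the example in the doc; the helper name is hypothetical.
BASE_UID=65536

print_adduser_cmds() {
    # $1 = first domid, $2 = last domid (inclusive)
    pac_i=$1
    pac_last=$2
    while [ "$pac_i" -le "$pac_last" ]; do
        # Same mapping as the loop in the doc: domid N gets UID N-1+BASE_UID
        echo "adduser --no-create-home --system --uid $((pac_i - 1 + BASE_UID)) xen-qemuuser-domid$pac_i"
        pac_i=$((pac_i + 1))
    done
}

# Example: show the commands for the first three domids
print_adduser_cmds 1 3
```

With this mapping domid 1 lands on UID 65536 and domid 32751 on UID
98286, so the whole block sits above the 16-bit range.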

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien.grall@arm.com>
CC: Anthony Perard <anthony.perard@citrix.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
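
To make the "Technical details" section easier to review, here is a
rough sketch of how the QEMU arguments named in the doc would fit
together for one domain.  Note that -chroot and -sandbox are listed
under "Restrictions still to do", so this shows the intended end
state, not what libxl passes today.  The helper name
`qemu_restrict_args` is hypothetical, and the caller is assumed to
supply the uid and gid.

```shell
#!/bin/sh
# Hypothetical helper: print the restriction-related QEMU arguments from
# the doc for one domain.  It does not start QEMU.
qemu_restrict_args() {
    # $1 = domid, $2 = uid, $3 = gid (all chosen by the caller)
    printf '%s\n' "-xen-domid-restrict -runas $2:$3 -chroot /var/run/qemu/root-$1 -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny"
}

# Example: arguments for domid 5 running as uid/gid 65540
qemu_restrict_args 5 65540 65540
```

Keeping the uid and gid as parameters avoids baking any particular
domid-to-UID mapping into the sketch.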
---
 docs/misc/qemu-deprivilege.txt | 259 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 233 insertions(+), 26 deletions(-)

diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-deprivilege.txt
index 58b86a3908..9a5627350a 100644
--- a/docs/misc/qemu-deprivilege.txt
+++ b/docs/misc/qemu-deprivilege.txt
@@ -1,36 +1,243 @@
-For security reasons, libxl tries to pass a non-root username to QEMU as
-argument. During initialization QEMU calls setuid and setgid with the
-user ID and the group ID of the user passed as argument.
-Libxl looks for the following users in this order:
-
-1) a user named "xen-qemuuser-domid$domid",
-Where $domid is the domid of the domain being created.
-This requires the reservation of 65535 uids from xen-qemuuser-domid1
-to xen-qemuuser-domid65535. To use this mechanism, you might want to
-create a large number of users at installation time. For example:
-
-for ((i=1; i<65536; i++))
+# Introduction
+
+# Setup
+
+## Getting the right versions of software
+
+Linux 4.XX
+
+Xen 4.XX
+
+Qemu: Requires patches not yet in any release
+
+## Setting up a userid range
+
+For maximum security, libxl needs to run the devicemodel for each
+domain under a user id (UID) corresponding to its domain id.  There
+are 32752 possible domain IDs, and so libxl needs 32752 user ids set
+aside for it.
+
+The simplest and most effective way to do this is to allocate a
+contiguous block of UIDs, and create a single user named
+`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
+
+    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
+
+An alternate way is to create 32752 distinct users with the name
+`xen-qemuuser-domid$domid`, doing something like the following:
+
+for ((i=1; i<=32751; i++))
 do
-    adduser --no-create-home --system xen-qemuuser-domid$i
+    adduser --no-create-home --system --uid $(($i-1+65536)) xen-qemuuser-domid$i
 done
 
-You might want to consider passing --group to adduser to create a new
-group for each new user.
+FIXME: Test the above script to see if it works
+
+NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
+to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
+UIDs however.  UID 65535 is reserved for an invalid value, and 65534
+is normally allocated to "nobody".
+
+Another, less-secure way is to run all QEMUs as the same UID.  To do
+this, create a user named `xen-qemuuser-shared`; for example:
+
+    adduser --no-create-home --system xen-qemuuser-shared
+
+## Domain config changes
+
+The core domain config change is to add the following line to the
+domain configuration:
+
+    dm_restrict=1
+
+This will perform a number of restrictions, outlined below in the
+'Technical details' section.
+
+Remove non-functioning default features:
+
+    vga="none"
+
+Other features expected not to work include:
+* Inserting a new cdrom while the guest is running (xl cdrom-insert)
+* migration / save / restore
+* PCI passthrough
+
+# Technical details
+
+## Restrictions done
+
+### Having qemu switch user
+
+'''Description''': As mentioned above, having qemu switch to a non-root user, one per
+domain id.
+
+'''Implementation''': The toolstack adds the following to the qemu command-line:
+
+    -runas <uid>:<gid>
+
+'''Testing Status''': Not tested
+
+### Xen restrictions
+
+'''Description''': Close and restrict Xen-related file descriptors.
+Specifically, make sure that only one `privcmd` instance is open, and
+that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.
+
+XXX Also, make sure that only one `xenstore` fd remains open, and that
+it's restricted.
+
+'''Implementation''': Toolstack adds the following to the qemu command-line:
+
+    -xen-domid-restrict
+
+'''Testing status''': Not tested XXX
+
+## Restrictions still to do
+
+### Chroot
+
+'''Description''': Qemu runs in its own chroot, such that even if it
+could call an 'open' command of some sort, there would be nothing for
+it to see.
+
+'''Implementation''': The toolstack creates a directory such as:
+`/var/run/qemu/root-<domid>`
+
+Then add the following to the qemu command-line:
+
+    -chroot /var/run/qemu/root-<domid>
+
+### Namespaces for unused functionality
+
+'''Description''': Enter QEMU into its own mount & IPC namespaces.
+This means that even if other restrictions fail, the process won't be
+able to even name system mount points or existing non-file-based IPC
+descriptors to attempt to attack them.
+
+'''Implementation''':
+
+In theory this could be done in QEMU (similar to -sandbox, -runas,
+-chroot, and so on), but a patch doing this in QEMU was NAKed
+upstream. They preferred that this was done as a setup step by
+whatever executes QEMU; i.e., have the process which exec's QEMU first
+call:
+
+    unshare(CLONE_NEWNS | CLONE_NEWIPC)
+
+### seccomp filtering
+
+'''Description''': Turn on seccomp filtering to disable syscalls which
+QEMU doesn't need.
+
+'''Implementation''': Enable from the command-line:
+
+    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
+
+`elevateprivileges` is currently required to allow `-runas` to work.
+Removing this requirement would mean making sure that the uid change
+happened before the seccomp2 call, perhaps by changing the uid before
+executing QEMU.  (But this would then require other changes to create
+the QMP socket, VNC socket, and so on).
+
+### Basic RLIMITs
+
+'''Description''': A number of limits on the resources that a given
+process / userid is allowed to consume.  These can limit the ability
+of a compromised QEMU process to DoS domain 0 by exhausting various
+resources available to it.
+
+'''Implementation''':
+
+Limits that can be implemented immediately without much effort:
+ - RLIMIT_FSIZE (file size): 256KiB
+
+Probably not necessary but why not:
+ - RLIMIT_CORE: 0
+ - RLIMIT_MSGQUEUE: 0
+ - RLIMIT_LOCKS: 0 XXX Check
+ - RLIMIT_MEMLOCK: 0
+   mlock() is used only when both "realtime" and "mlock" are specified.
+
+### Further RLIMITs
+
+RLIMIT_AS limits the total amount of memory; but this includes the
+virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
+fiddles with this; it would be straightforward to make it *set* the
+rlimit to what it thinks a sensible limit is.
+
+Other things that would take some cleverness / changes to QEMU to
+utilize due to ordering constraints:
+ - RLIMIT_NPROC (after uid changes to a unique uid)
+ - RLIMIT_NOFILE (after all necessary files are opened)
+
+### libxl UID cleanup
+
+'''Description''': Domain IDs are reused, and thus restricted UIDs are
+reused.  If a compromised QEMU can fork (due to seccomp or
+RLIMIT_NPROC limits being ineffective for some reason), it may avoid
+being killed when its domain dies, then wait until the domain ID is
+reused again, at which point it will have control over the domain in
+question (which probably belongs to someone else).
+
+libxl should kill all UIDs associated with a domain both when the VM
+is destroyed, and before starting a VM with the same UID.
+
+'''Implementation''': Needs to be researched; it's difficult to do in
+a way that's not racy (e.g., we can't simply look at all processes,
+find the pids corresponding to uids, and then kill those, as a
+continually forking process could (potentially) elude this process).
+Rumor has it there's a "kill all processes with my UID" system call,
+or something of that nature.
+
+kill(-1,sig) sends a signal to "every process to which the calling
+process has permission to send a signal".  So in theory:
+  setuid(X)
+  kill(-1,SIGKILL)
+should do the trick.
+
+### Disks
+
+The chroot (and seccomp?) happens late enough that QEMU can
+initialize itself and open its disks. If you want to add a disk or
+insert a CD at run time, you can't pass a path because QEMU is
+chrooted. Instead use the add-fd QMP command and use
+/dev/fdset/<fdset-id> as the path.
+
+A further layer of restriction could be to set RLIMIT_NOFILE to '0',
+and hand all disks over QMP.
+
+### Migration
+
+When calling xen-save-devices-state, since QEMU is running in a chroot
+it is not useful to pass a filename (it doesn't even have write access
+inside the chroot). Instead, give it an open fd using the add-fd
+mechanism.
+
+### Network namespacing
+
+Enter QEMU into its own network namespace (in addition to mount & IPC
+namespaces).  Basically change the 'unshare' call to be as follows:
+
+    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
+
+### Network
 
+If QEMU runs in its own network namespace, it can't open the tap
+device itself because the interface, which lives outside that
+namespace, won't be visible to it. So instead, have the toolstack open
+the device and pass it as an fd on the command-line:
 
-2) a user named "xen-qemuuser-shared"
-As a fall back if both 1) fails, libxl will use a single user for
-all QEMU instances. The user is named xen-qemuuser-shared. This is
-less secure but still better than running QEMU as root. Using this is as
-simple as creating just one more user on your host:
+    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
 
-adduser --no-create-home --system xen-qemuuser-shared
+### VNC
 
+If QEMU runs in its own network namespace, it is not straightforward
+to listen on a TCP socket outside of its own network namespace. One
+option would be to use VNC over a UNIX socket:
 
-3) root
-As a last resort, libxl will start QEMU as root.
+    -vnc unix:/var/run/xen/vnc-<domid>
 
+However, this would break functionality in the general case; I think
+we need to have the toolstack open a socket and pass the fd to QEMU
+(which requires changes to QEMU).
 
-Please note that running QEMU as non-root causes several features like
-migration and PCI passthrough to not work properly and may prevent the guest
-from booting.
-- 
2.16.2

