All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
@ 2018-03-22 18:24 George Dunlap
  2018-03-23  9:41 ` Ross Lagerwall
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-22 18:24 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, Ross Lagerwall, Julien Grall, Jan Beulich,
	Anthony Perard, Ian Jackson

docs/qemu-deprivilege.txt had some basic instructions for using
dm_restrict, but it was incomplete, misleading, and stale.

Update the docs in a number of ways.

Introduce a section mentioning minimim versions of Linux, Xen, and
qemu required (TBD)

Fix the discussion of qemu userid.  Mention xen-qemuuser-range-base,
and provide example shell code that actually has some hope of working
(instead of failing out after creating 900 userids.

Describe how to enable restrictions, as well as features which
probably don't or definitely don't work.

Introduce a "Technical Details" section which describes specifically
what restrictions are currently done, and also what restrictions we
are looking at doing in the future.

The idea here is that as we implement the various items for the
future, we move them from "Restrictions still to do" to "Restrictions
done".  This can also act as a design document -- a place for public
discussion of what can or should be done and how.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
Thank you to Ross Lagerwall, whose description of what XenServer is
doing formed much of the basis for the text here.

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien.grall@arm.com>
CC: Anthony Perard <anthony.perard@citrix.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 docs/misc/qemu-deprivilege.txt | 259 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 233 insertions(+), 26 deletions(-)

diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-deprivilege.txt
index 58b86a3908..9a5627350a 100644
--- a/docs/misc/qemu-deprivilege.txt
+++ b/docs/misc/qemu-deprivilege.txt
@@ -1,36 +1,243 @@
-For security reasons, libxl tries to pass a non-root username to QEMU as
-argument. During initialization QEMU calls setuid and setgid with the
-user ID and the group ID of the user passed as argument.
-Libxl looks for the following users in this order:
-
-1) a user named "xen-qemuuser-domid$domid",
-Where $domid is the domid of the domain being created.
-This requires the reservation of 65535 uids from xen-qemuuser-domid1
-to xen-qemuuser-domid65535. To use this mechanism, you might want to
-create a large number of users at installation time. For example:
-
-for ((i=1; i<65536; i++))
+# Introduction
+
+# Setup
+
+## Getting the right versions of software
+
+Linux 4.XX
+
+Xen 4.XX
+
+Qemu: Requires patches not yet in any release
+
+## Setting up a userid range
+
+For maximum security, libxl needs to run the devicemodel for each
+domain under a user id (UID) corresponding to its domain id.  There
+are 32752 possible domain IDs, and so libxl needs 32752 user ids set
+aside for it.
+
+The simplest and most effective way to do this is to allocate a
+contiguous block of UIDs, and create a single user named
+`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
+
+    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
+
+An alternate way is to create 32752 distinct users with the name
+`xen-qemuuser-domid$domid`, doing something like the following:
+
+for ((i=1; i<=32751; i++))
 do
-    adduser --no-create-home --system xen-qemuuser-domid$i
+    adduser --no-create-home --system --uid $(($i-1+65536)) xen-qemuuser-domid$i
 done
 
-You might want to consider passing --group to adduser to create a new
-group for each new user.
+FIXME: Test the above script to see if it works
+
+NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
+to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
+UIDs however.  UID 65535 is reserved for an invalid value, and 65534
+is normally allocated to "nobody".
+
+Another, less-secure way is to run all QEMUs as the same UID.  To do
+this, create a user named `xen-qemuuser-shared`; for example:
+
+    adduser --no-create-home --system xen-qemuuser-shared
+
+## Domain config changes
+
+The core domain config change is to add the following line to the
+domain configuration:
+
+    dm_restrict=1
+
+This will perform a number of restrictions, outlined below in the
+'Technical details' section.
+
+Remove non-functioning default features:
+
+    vga="none"
+
+Other features expected not to work include:
+* Inserting a new cdrom while the guest is running (xl cdrom-insert)
+* migration / save / restore
+* PCI passthrough
+
+# Technical details
+
+## Restrictions done
+
+### Having qemu switch user
+
+'''Description''': As mentioned above, having qemu switch to a non-root user, one per
+domain id.
+
+'''Implementation''': The toolstack adds the following to the qemu command-line:
+
+    -runas <uid>:<gid>
+
+'''Testing Status''': Not tested
+
+### Xen restrictions
+
+'''Description''': Close and restrict Xen-related file descriptors.
+Specifically, make sure that only one `privcmd` instance is open, and
+that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.
+
+XXX Also, make sure that only one `xenstore` fd remains open, and that
+it's restricted.
+
+'''Implementation''': Toolstack adds the following to the qemu command-line:
+
+-xen-domid-restrict
+
+'''Testing status''': Not tested XXX
+
+## Restrictions still to do
+
+### Chroot
+
+'''Description''': Qemu runs in its own chroot, such that even if it
+could call an 'open' command of some sort, there would be nothing for
+it to see.
+
+'''Implementation''': The toolstack creates a directory such as:
+`/var/run/qemu/root-<domid>`
+
+Then add the following to the qemu command-line:
+
+    -chroot /var/run/qemu/root-<domid>
+
+### Namespaces for unused functionality
+
+'''Descripiton''': Enter QEMU into its own mount & IPC namespaces.
+This means that even if other restrictions fail, the process won't be
+able to even name system mount points or exsting non-file-based IPC
+descriptors to attempt to attack them.
+
+'''Implementation''':
+
+In theory this could be done in QEMU (similar to -sandbox, -runas,
+-chroot, and so on), but a patch doing this in QEMU was NAKed
+upstream. They preferred that this was done as a setup step by
+whatever executes QEMU; i.e., have the process which exec's QEMU first
+call:
+
+    unshare(CLONE_NEWNS | CLONE_NEWIPC)
+
+### seccomp filtering
+
+'''Description''': Turn on seccomp filtering to disable syscalls which
+QEMU doesn't need:
+
+'''Implementation''': Enable from the command-line:
+
+    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
+
+`elevateprivileges` is currently required to allow `-runas` to work.
+Removing this requirement would mean making sure that the uid change
+happened before the seccomp2 call, perhaps by changing the uid before
+executing QEMU.  (But this would then require other changes to create
+the QMP socket, VNC socket, and so on).
+
+### Basic RLIMITs
+
+'''Description''': A number of limits on the resources that a given
+process / userid is allowed to consume.  These can limit the ability
+of a compromised QEMU process to DoS domain 0 by exhausting various
+resources available to it.
+
+'''Implementaiton'''
+
+Limits that can be implemented immediately without much effort:
+ - RLIMIT_FSIZE (file size): 256KiB
+
+Probably not necessary but why not:
+ - RLIMIT_CORE: 0
+ - RLIMIT_MSGQUEUE: 0
+ - RLIMIT_LOCKS: 0 XXX Check
+ - RLIMIT_MEMLOCK: 0
+   mlock() is Used only when both "realtime" and "mlock" are specified.
+
+### Further RLIMITs
+
+RLIMIT_AS limits the total amount of memory; but this includes the
+virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
+fiddles with this; it would be straightforward to make it *set* the
+rlimit to what it thinks a sensible limit is.
+
+Other things that would take some cleverness / changes to QEMU to
+utilize due to ordering constrants:
+ - RLIMIT_NPROC (after uid changes to a unique uid)
+ - RLIMIT_NOFILES (after all necessary files are opened)
+
+### libxl UID cleanup
+
+'''Description''': Domain IDs are reused, and thus restricted UIDs are
+reused.  If a compromised QEMU can fork (due to seccomp or
+RLIMIT_NPROC limits being ineffective for some reason), it may avoid
+being killed when its domain dies, then wait until the domain ID is
+reused again, at which point it will have control over the domain in
+question (which probably belongs to someone else).
+
+libxl should kill all UIDs associated with a domain both when the VM
+is destroyed, and before starting a VM with the same UID.
+
+'''Implementation''': Needs to be researched; it's difficult to do in
+a way that's not racy (e.g., we can't simply look at all processes,
+find the pids corresponding to uids, and then kill those, as a
+continually forking process could (potentially) elude this process.
+Rumor has it there's a "kill all processes with my UID" system call,
+or something of that nature.
+
+kill(-1,sig) sends a signal to "every process to which the calling
+process has permission to send a signal".  So in theory:
+  setuid(X)
+  kill(-1,KILL)
+should do the trick.
+
+### Disks
+
+The chroot (and seccomp?) happens late enough such that QEMU can
+initialize itself and open its disks. If you want to add a disk at run
+time via or insert a CD, you can't pass a path because QEMU is
+chrooted. Instead use the add-fd QMP command and use
+/dev/fdset/<fdset-id> as the path.
+
+A further layer of restriction could be to set RLIMIT_NOFILES to '0',
+and hand all disks over QMP.
+
+## Migration
+
+When calling xen-save-devices-state, since QEMU is running in a chroot
+it is not useful to pass a filename (it doesn't even have write access
+inside the chroot). Instead, give it an open fd using the add-fd
+mechanism.
+
+### Network namespacing
+
+Enter QEMU into its own network namespace (in addition to mount & IPC
+namespaces).  Basically change the 'unshare' call to be as follows:
+
+    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
+
+### Network
 
+If QEMU runs in its own network namespace, it can't open the tap
+device itself because the interface won't be visible outside of its
+own namespace. So instead, have the toolstack open the device and pass
+it as an fd on the command-line:
 
-2) a user named "xen-qemuuser-shared"
-As a fall back if both 1) fails, libxl will use a single user for
-all QEMU instances. The user is named xen-qemuuser-shared. This is
-less secure but still better than running QEMU as root. Using this is as
-simple as creating just one more user on your host:
+    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
 
-adduser --no-create-home --system xen-qemuuser-shared
+### VNC
 
+If QEMU runs in its own network namespace, it is not straightforward
+to listen on a TCP socket outside of its own network namespace. One
+option would be to use VNC over a UNIX socket:
 
-3) root
-As a last resort, libxl will start QEMU as root.
+    -vnc unix:/var/run/xen/vnc-<domid>
 
+However, this would break functionality in the general case; I think
+we need to have the toolstack open a socket and pass the fd to QEMU
+(which requires changes to QEMU).
 
-Please note that running QEMU as non-root causes several features like
-migration and PCI passthrough to not work properly and may prevent the guest
-from booting.
-- 
2.16.2


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-22 18:24 [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
@ 2018-03-23  9:41 ` Ross Lagerwall
  2018-03-23 10:01   ` Roger Pau Monné
  2018-03-23 10:53   ` George Dunlap
  2018-03-23 12:13 ` Anthony PERARD
  2018-03-26 16:43 ` Ian Jackson
  2 siblings, 2 replies; 21+ messages in thread
From: Ross Lagerwall @ 2018-03-23  9:41 UTC (permalink / raw)
  To: George Dunlap, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, Ian Jackson

On 03/22/2018 06:24 PM, George Dunlap wrote:
snip
> -for ((i=1; i<65536; i++))
> +# Introduction
> +
> +# Setup
> +
> +## Getting the right versions of software
> +
> +Linux 4.XX

(For dom0 kernel...)

Requires 4.11 for the ability to restrict dmop calls.

> +
> +Xen 4.XX

Requires 4.11 to get required dmop calls to make VGA work.

> +
> +Qemu: Requires patches not yet in any release
> +
> +## Setting up a userid range
> +
> +For maximum security, libxl needs to run the devicemodel for each
> +domain under a user id (UID) corresponding to its domain id.  There
> +are 32752 possible domain IDs, and so libxl needs 32752 user ids set
> +aside for it.
> +
> +The simplest and most effective way to do this is to allocate a
> +contiguous block of UIDs, and create a single user named
> +`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
> +
> +    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
> +
> +An alternate way is to create 32752 distinct users with the name
> +`xen-qemuuser-domid$domid`, doing something like the following:
> +
> +for ((i=1; i<=32751; i++))
>   do
> -    adduser --no-create-home --system xen-qemuuser-domid$i
> +    adduser --no-create-home --system --uid $(($i-1+65536)) xen-qemuuser-domid$i
>   done
>   
> -You might want to consider passing --group to adduser to create a new
> -group for each new user.
> +FIXME: Test the above script to see if it works
> +
> +NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
> +to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
> +UIDs however.  UID 65535 is reserved for an invalid value, and 65534
> +is normally allocated to "nobody".
> +
> +Another, less-secure way is to run all QEMUs as the same UID.  To do
> +this, create a user named `xen-qemuuser-shared`; for example:
> +
> +    adduser --no-create-home --system xen-qemuuser-shared
> +
> +## Domain config changes
> +
> +The core domain config change is to add the following line to the
> +domain configuration:
> +
> +    dm_restrict=1
> +
> +This will perform a number of restrictions, outlined below in the
> +'Technical details' section.
> +
> +Remove non-functioning default features:
> +
> +    vga="none"

I'm not sure what this means?

> +
> +Other features expected not to work include:
> +* Inserting a new cdrom while the guest is running (xl cdrom-insert)
> +* migration / save / restore

The above two features could be made to work if the toolstack drives 
QEMU correctly.

> +* PCI passthrough

This one requires a fair amount of Xen & QEMU changes to have a chance 
of working.

> +
> +# Technical details
> +
> +## Restrictions done
> +
> +### Having qemu switch user
> +
> +'''Description''': As mentioned above, having qemu switch to a non-root user, one per
> +domain id.
> +
> +'''Implementation''': The toolstack adds the following to the qemu command-line:
> +
> +    -runas <uid>:<gid>
> +
> +'''Testing Status''': Not tested
> +
> +### Xen restrictions
> +
> +'''Description''': Close and restrict Xen-related file descriptors.
> +Specifically, make sure that only one `privcmd` instance is open, and
> +that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.

Just to clarify, we call IOCTL_PRIVCMD_RESTRICT on the `privcmd` fds and 
IOCTL_EVTCHN_RESTRICT_DOMID on the evtchn fds which remain open. There 
is no requirement to have only one instance of each.

> +
> +XXX Also, make sure that only one `xenstore` fd remains open, and that
> +it's restricted.

The current implementation closes _all_ xenstore fds and doesn't need to 
make use of xenstore after going into restricted mode.

> +
> +'''Implementation''': Toolstack adds the following to the qemu command-line:
> +
> +-xen-domid-restrict
> +
> +'''Testing status''': Not tested XXX
> +
> +## Restrictions still to do
> +
> +### Chroot
> +
> +'''Description''': Qemu runs in its own chroot, such that even if it
> +could call an 'open' command of some sort, there would be nothing for
> +it to see.
> +
> +'''Implementation''': The toolstack creates a directory such as:
> +`/var/run/qemu/root-<domid>`
> +
> +Then add the following to the qemu command-line:
> +
> +    -chroot /var/run/qemu/root-<domid>
> +
> +### Namespaces for unused functionality
> +
> +'''Descripiton''': Enter QEMU into its own mount & IPC namespaces.

Spelling: Descripiton

> +This means that even if other restrictions fail, the process won't be
> +able to even name system mount points or exsting non-file-based IPC
> +descriptors to attempt to attack them.
> +
> +'''Implementation''':
> +
> +In theory this could be done in QEMU (similar to -sandbox, -runas,
> +-chroot, and so on), but a patch doing this in QEMU was NAKed
> +upstream. They preferred that this was done as a setup step by
> +whatever executes QEMU; i.e., have the process which exec's QEMU first
> +call:
> +
> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> +
> +### seccomp filtering
> +
> +'''Description''': Turn on seccomp filtering to disable syscalls which
> +QEMU doesn't need:
> +
> +'''Implementation''': Enable from the command-line:
> +
> +    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
> +
> +`elevateprivileges` is currently required to allow `-runas` to work.
> +Removing this requirement would mean making sure that the uid change
> +happened before the seccomp2 call, perhaps by changing the uid before
> +executing QEMU.  (But this would then require other changes to create
> +the QMP socket, VNC socket, and so on).
> +
> +### Basic RLIMITs
> +
> +'''Description''': A number of limits on the resources that a given
> +process / userid is allowed to consume.  These can limit the ability
> +of a compromised QEMU process to DoS domain 0 by exhausting various
> +resources available to it.
> +
> +'''Implementaiton'''

Spelling: Implementaiton

> +
> +Limits that can be implemented immediately without much effort:
> + - RLIMIT_FSIZE (file size): 256KiB
> +
> +Probably not necessary but why not:
> + - RLIMIT_CORE: 0
> + - RLIMIT_MSGQUEUE: 0
> + - RLIMIT_LOCKS: 0 XXX Check
> + - RLIMIT_MEMLOCK: 0
> +   mlock() is Used only when both "realtime" and "mlock" are specified.
> +
> +### Further RLIMITs
> +
> +RLIMIT_AS limits the total amount of memory; but this includes the
> +virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
> +fiddles with this; it would be straightforward to make it *set* the
> +rlimit to what it thinks a sensible limit is.
> +
> +Other things that would take some cleverness / changes to QEMU to
> +utilize due to ordering constrants:
> + - RLIMIT_NPROC (after uid changes to a unique uid)
> + - RLIMIT_NOFILES (after all necessary files are opened)
> +
> +### libxl UID cleanup
> +
> +'''Description''': Domain IDs are reused, and thus restricted UIDs are
> +reused.  If a compromised QEMU can fork (due to seccomp or
> +RLIMIT_NPROC limits being ineffective for some reason), it may avoid
> +being killed when its domain dies, then wait until the domain ID is
> +reused again, at which point it will have control over the domain in
> +question (which probably belongs to someone else).
> +
> +libxl should kill all UIDs associated with a domain both when the VM
> +is destroyed, and before starting a VM with the same UID.
> +
> +'''Implementation''': Needs to be researched; it's difficult to do in
> +a way that's not racy (e.g., we can't simply look at all processes,
> +find the pids corresponding to uids, and then kill those, as a
> +continually forking process could (potentially) elude this process.
> +Rumor has it there's a "kill all processes with my UID" system call,
> +or something of that nature.
> +
> +kill(-1,sig) sends a signal to "every process to which the calling
> +process has permission to send a signal".  So in theory:
> +  setuid(X)
> +  kill(-1,KILL)
> +should do the trick.
> +
> +### Disks
> +
> +The chroot (and seccomp?) happens late enough such that QEMU can
> +initialize itself and open its disks. If you want to add a disk at run
> +time via or insert a CD, you can't pass a path because QEMU is
> +chrooted. Instead use the add-fd QMP command and use
> +/dev/fdset/<fdset-id> as the path.
> +
> +A further layer of restriction could be to set RLIMIT_NOFILES to '0',
> +and hand all disks over QMP.
> +
> +## Migration
> +
> +When calling xen-save-devices-state, since QEMU is running in a chroot
> +it is not useful to pass a filename (it doesn't even have write access
> +inside the chroot). Instead, give it an open fd using the add-fd
> +mechanism.
> +
> +### Network namespacing
> +
> +Enter QEMU into its own network namespace (in addition to mount & IPC
> +namespaces).  Basically change the 'unshare' call to be as follows:
> +
> +    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)

It might be clearer if this was merged with the other Namespacing 
section or at least put immediately afterwards.

> +
> +### Network
>   
> +If QEMU runs in its own network namespace, it can't open the tap
> +device itself because the interface won't be visible outside of its
> +own namespace. So instead, have the toolstack open the device and pass
> +it as an fd on the command-line:
>   
> -2) a user named "xen-qemuuser-shared"
> -As a fall back if both 1) fails, libxl will use a single user for
> -all QEMU instances. The user is named xen-qemuuser-shared. This is
> -less secure but still better than running QEMU as root. Using this is as
> -simple as creating just one more user on your host:
> +    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
>   
> -adduser --no-create-home --system xen-qemuuser-shared
> +### VNC
>   
> +If QEMU runs in its own network namespace, it is not straightforward
> +to listen on a TCP socket outside of its own network namespace. One
> +option would be to use VNC over a UNIX socket:
>   
> -3) root
> -As a last resort, libxl will start QEMU as root.
> +    -vnc unix:/var/run/xen/vnc-<domid>
>   
> +However, this would break functionality in the general case; I think
> +we need to have the toolstack open a socket and pass the fd to QEMU
> +(which requires changes to QEMU).
>   
> -Please note that running QEMU as non-root causes several features like
> -migration and PCI passthrough to not work properly and may prevent the guest
> -from booting.
> 

Although there are still a lot of todos, this looks generally good and 
is a big improvement on the previous document.

Cheers,
-- 
Ross Lagerwall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-23  9:41 ` Ross Lagerwall
@ 2018-03-23 10:01   ` Roger Pau Monné
  2018-03-23 10:53   ` George Dunlap
  1 sibling, 0 replies; 21+ messages in thread
From: Roger Pau Monné @ 2018-03-23 10:01 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, Julien Grall, Jan Beulich, Ian Jackson,
	Anthony Perard, xen-devel

On Fri, Mar 23, 2018 at 09:41:47AM +0000, Ross Lagerwall wrote:
> On 03/22/2018 06:24 PM, George Dunlap wrote:
> > +* PCI passthrough
> 
> This one requires a fair amount of Xen & QEMU changes to have a chance of
> working.

I'm not sure this will ever be feasible with the current approach
where QEMU is the one doing the passthrough mediation. You need QEMU
to be able to write to the PCI config space, at which point the depriv
thing becomes moot.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-23  9:41 ` Ross Lagerwall
  2018-03-23 10:01   ` Roger Pau Monné
@ 2018-03-23 10:53   ` George Dunlap
  2018-03-23 11:33     ` Ross Lagerwall
  1 sibling, 1 reply; 21+ messages in thread
From: George Dunlap @ 2018-03-23 10:53 UTC (permalink / raw)
  To: Ross Lagerwall, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, Ian Jackson

On 03/23/2018 09:41 AM, Ross Lagerwall wrote:
> On 03/22/2018 06:24 PM, George Dunlap wrote:
> snip
>> -for ((i=1; i<65536; i++))
>> +# Introduction
>> +
>> +# Setup
>> +
>> +## Getting the right versions of software
>> +
>> +Linux 4.XX
> 
> (For dom0 kernel...)
> 
> Requires 4.11 for the ability to restrict dmop calls.

Thanks, I'll update this section.

>> +
>> +Xen 4.XX
> 
> Requires 4.11 to get required dmop calls to make VGA work.

On reflection, there's probably not much point in including this: The
document will contain the state of functionality of the version of Xen
that contains it.

>> +## Domain config changes
>> +
>> +The core domain config change is to add the following line to the
>> +domain configuration:
>> +
>> +    dm_restrict=1
>> +
>> +This will perform a number of restrictions, outlined below in the
>> +'Technical details' section.
>> +
>> +Remove non-functioning default features:
>> +
>> +    vga="none"
> 
> I'm not sure what this means?

Well it's under "domain config changes"; if you add this to your xl
domain config, then QEMU will not provide any emulated VGA devices.

But it sounds like this issue has been fixed in 4.11 anyway, so perhaps
we can remove this section?

>> +Other features expected not to work include:
>> +* Inserting a new cdrom while the guest is running (xl cdrom-insert)
>> +* migration / save / restore
> 
> The above two features could be made to work if the toolstack drives
> QEMU correctly.

Yes; I'm trying to use this document for several purposes:

* "HOWTO" -- useful for people who want to experiment with the feature.
For this I want to include what works and what doesn't work

* Design doc -- a place to discuss / record what to do and how to do it
(a lot of these are independent, so the work could be shared or at least
passed around between people).

* Todo list -- a place to identify work that still needs to be done.

But I could try to make it clearer that these are "todo" items, and that
"PCI" is further down the importance list than the others.


>> +### Xen restrictions
>> +
>> +'''Description''': Close and restrict Xen-related file descriptors.
>> +Specifically, make sure that only one `privcmd` instance is open, and
>> +that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.
> 
> Just to clarify, we call IOCTL_PRIVCMD_RESTRICT on the `privcmd` fds and
> IOCTL_EVTCHN_RESTRICT_DOMID on the evtchn fds which remain open. There
> is no requirement to have only one instance of each.
> 
>> +
>> +XXX Also, make sure that only one `xenstore` fd remains open, and that
>> +it's restricted.
> 
> The current implementation closes _all_ xenstore fds and doesn't need to
> make use of xenstore after going into restricted mode.

Ack (and with spelling mistakes)


>> +### Network namespacing
>> +
>> +Enter QEMU into its own network namespace (in addition to mount & IPC
>> +namespaces).  Basically change the 'unshare' call to be as follows:
>> +
>> +    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
> 
> It might be clearer if this was merged with the other Namespacing
> section or at least put immediately afterwards.

Part of my goal here was to list things in a "low-hanging-fruit" order.
Since QEMU doesn't use mount or IPC namespaces, that can be done
immediately.  Using a new network namespace requires changing how
network devices are set up, and possibly making code changes to QEMU so
that it can still listen on network sockets; hence putting this lower
down the list (followed by the two things that would need to be fixed if
it were implemented).

>> +### Network
>>   +If QEMU runs in its own network namespace, it can't open the tap
>> +device itself because the interface won't be visible outside of its
>> +own namespace. So instead, have the toolstack open the device and pass
>> +it as an fd on the command-line:
>>   -2) a user named "xen-qemuuser-shared"
>> -As a fall back if both 1) fails, libxl will use a single user for
>> -all QEMU instances. The user is named xen-qemuuser-shared. This is
>> -less secure but still better than running QEMU as root. Using this is as
>> -simple as creating just one more user on your host:
>> +    -device rtl8139,netdev=tapnet0,mac=... -netdev
>> tap,id=tapnet0,fd=<tapfd>
>>   -adduser --no-create-home --system xen-qemuuser-shared
>> +### VNC
>>   +If QEMU runs in its own network namespace, it is not straightforward
>> +to listen on a TCP socket outside of its own network namespace. One
>> +option would be to use VNC over a UNIX socket:
>>   -3) root
>> -As a last resort, libxl will start QEMU as root.
>> +    -vnc unix:/var/run/xen/vnc-<domid>
>>   +However, this would break functionality in the general case; I think
>> +we need to have the toolstack open a socket and pass the fd to QEMU
>> +(which requires changes to QEMU).
>>   -Please note that running QEMU as non-root causes several features like
>> -migration and PCI passthrough to not work properly and may prevent
>> the guest
>> -from booting.
>>
> 
> Although there are still a lot of todos, this looks generally good and
> is a big improvement on the previous document.

Thanks.

I think v2 I'll also remove the HOWTO-ish stuff from xl.cfg, which is
inappropriate for a man page, and refer to this document instead.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-23 10:53   ` George Dunlap
@ 2018-03-23 11:33     ` Ross Lagerwall
  0 siblings, 0 replies; 21+ messages in thread
From: Ross Lagerwall @ 2018-03-23 11:33 UTC (permalink / raw)
  To: George Dunlap, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, Ian Jackson

On 03/23/2018 10:53 AM, George Dunlap wrote:
> On 03/23/2018 09:41 AM, Ross Lagerwall wrote:
>> On 03/22/2018 06:24 PM, George Dunlap wrote:
>> snip
>>> -for ((i=1; i<65536; i++))
>>> +# Introduction
>>> +
>>> +# Setup
>>> +
>>> +## Getting the right versions of software
>>> +
>>> +Linux 4.XX
>>
>> (For dom0 kernel...)
>>
>> Requires 4.11 for the ability to restrict dmop calls.
> 
> Thanks, I'll update this section.
> 
>>> +
>>> +Xen 4.XX
>>
>> Requires 4.11 to get required dmop calls to make VGA work.
> 
> On reflection, there's probably not much point in including this: The
> document will contain the state of functionality of the version of Xen
> that contains it.

Agreed.

> 
>>> +## Domain config changes
>>> +
>>> +The core domain config change is to add the following line to the
>>> +domain configuration:
>>> +
>>> +    dm_restrict=1
>>> +
>>> +This will perform a number of restrictions, outlined below in the
>>> +'Technical details' section.
>>> +
>>> +Remove non-functioning default features:
>>> +
>>> +    vga="none"
>>
>> I'm not sure what this means?
> 
> Well it's under "domain config changes"; if you add this to your xl
> domain config, then QEMU will not provide any emulated VGA devices.
> 
> But it sounds like this issue has been fixed in 4.11 anyway, so perhaps
> we can remove this section?

Yes, it can be removed.

> 
>>> +Other features expected not to work include:
>>> +* Inserting a new cdrom while the guest is running (xl cdrom-insert)
>>> +* migration / save / restore
>>
>> The above two features could be made to work if the toolstack drives
>> QEMU correctly.
> 
> Yes; I'm trying to use this document for several purposes:
> 
> * "HOWTO" -- useful for people who want to experiment with the feature.
> For this I want to include what works and what doesn't work
> 
> * Design doc -- a place to discuss / record what to do and how to do it
> (a lot of these are independent, so the work could be shared or at least
> passed around between people).
> 
> * Todo list -- a place to identify work that still needs to be done.
> 
> But I could try to make it clearer that these are "todo" items, and that
> "PCI" is further down the importance list than the others.
> 
> 
>>> +### Xen restrictions
>>> +
>>> +'''Description''': Close and restrict Xen-related file descriptors.
>>> +Specifically, make sure that only one `privcmd` instance is open, and
>>> +that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.
>>
>> Just to clarify, we call IOCTL_PRIVCMD_RESTRICT on the `privcmd` fds and
>> IOCTL_EVTCHN_RESTRICT_DOMID on the evtchn fds which remain open. There
>> is no requirement to have only one instance of each.
>>
>>> +
>>> +XXX Also, make sure that only one `xenstore` fd remains open, and that
>>> +it's restricted.
>>
>> The current implementation closes _all_ xenstore fds and doesn't need to
>> make use of xenstore after going into restricted mode.
> 
> Ack (and with spelling mistakes)
> 
> 
>>> +### Network namespacing
>>> +
>>> +Enter QEMU into its own network namespace (in addition to mount & IPC
>>> +namespaces).  Basically change the 'unshare' call to be as follows:
>>> +
>>> +    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
>>
>> It might be clearer if this was merged with the other Namespacing
>> section or at least put immediately afterwards.
> 
> Part of my goal here was to list things in a "low-hanging-fruit" order.
> Since QEMU doesn't use mount or IPC namespaces, that can be done
> immediately.  Using a new network namespace requires changing how
> network devices are set up, and possibly making code changes to QEMU so
> that it can still listen on network sockets; hence putting this lower
> down the list (followed by the two things that would need to be fixed if
> it were implemented).

OK, fair enough.

-- 
Ross Lagerwall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-22 18:24 [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
  2018-03-23  9:41 ` Ross Lagerwall
@ 2018-03-23 12:13 ` Anthony PERARD
  2018-03-26 16:43 ` Ian Jackson
  2 siblings, 0 replies; 21+ messages in thread
From: Anthony PERARD @ 2018-03-23 12:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Ian Jackson,
	xen-devel

On Thu, Mar 22, 2018 at 06:24:37PM +0000, George Dunlap wrote:
> +### Disks
> +
> +The chroot (and seccomp?) happens late enough such that QEMU can
> +initialize itself and open its disks. If you want to add a disk at run
> +time via or insert a CD, you can't pass a path because QEMU is
> +chrooted. Instead use the add-fd QMP command and use
> +/dev/fdset/<fdset-id> as the path.
> +
> +A further layer of restriction could be to set RLIMIT_NOFILES to '0',
> +and hand all disks over QMP.

The "add-fd" can work also on the command line. But I guess using only
QMP will be better from libxl point of view, only one code path to add
disks.

Also, with dm_restrict=1, another todo: qdisk backend doesn't work. We
probably needs to start a second QEMU process for pv backends.

-- 
Anthony PERARD

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-22 18:24 [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
  2018-03-23  9:41 ` Ross Lagerwall
  2018-03-23 12:13 ` Anthony PERARD
@ 2018-03-26 16:43 ` Ian Jackson
  2018-03-27 10:20   ` George Dunlap
  2018-03-27 10:21   ` George Dunlap
  2 siblings, 2 replies; 21+ messages in thread
From: Ian Jackson @ 2018-03-26 16:43 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

Thanks for this update!

George Dunlap writes ("[PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
...
> +# Technical details
> +
> +## Restrictions done

This makes this doc into a mixture of a design doc and a user doc, I
think.

It might be worth stating the design intent, which I think is this:

 * Even if there is a bug (for example in qemu) which permits a domain
   to compromise the device model, the compromised device model
   process is prevented from violating the system's overall security
   properties.  Ie, a guest cannot "escape" from the virtualisation by
   using a qemu bug.

This design intent is not yet achieved.  Right now an attacker is
impeded and their attack is complicated; in some circumstances the
will be limited to denial of service.

I'm not sure the individual restrictions need to be in a user-facing
doc.

Maybe the user-facing wording from your patch should be moved to
xl.cfg.doc.5 ?

> +'''Description''': Close and restrict Xen-related file descriptors.
> +Specifically, make sure that only one `privcmd` instance is open, and
> +that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.
> +
> +XXX Also, make sure that only one `xenstore` fd remains open, and that
> +it's restricted.

No.  Firstly, in each case, all relevant descriptors are restricted.
This is the purpose of the xentoolcore__restrict_* stuff.  Secondly,
xenstore *is* covered - but the xs fd is squashed so as to be totally
unuseable: xs.c uses xentoolcore__restrict_by_dup2_null.

> +### Namespaces for unused functionality
> +
> +'''Descripiton''': Enter QEMU into its own mount & IPC namespaces.
> +This means that even if other restrictions fail, the process won't be
> +able to even name system mount points or exsting non-file-based IPC
> +descriptors to attempt to attack them.
> +
> +'''Implementation''':
> +
> +In theory this could be done in QEMU (similar to -sandbox, -runas,
> +-chroot, and so on), but a patch doing this in QEMU was NAKed
> +upstream. They preferred that this was done as a setup step by
> +whatever executes QEMU; i.e., have the process which exec's QEMU first
> +call:
> +
> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)

This would mean we would have to pass qemu fds for both the network
tap devices and any vnc consoles.  That makes life considerably more
complicated.  I think we should perhaps revisit this upstream.

> +'''Implementation''': Enable from the command-line:
> +
> +    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
> +
> +`elevateprivileges` is currently required to allow `-runas` to work.
> +Removing this requirement would mean making sure that the uid change
> +happened before the seccomp2 call, perhaps by changing the uid before
> +executing QEMU.  (But this would then require other changes to create
> +the QMP socket, VNC socket, and so on).

See what I say above.

> +### Further RLIMITs
> +
> +RLIMIT_AS limits the total amount of memory; but this includes the
> +virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
> +fiddles with this; it would be straightforward to make it *set* the
> +rlimit to what it thinks a sensible limit is.
> +
> +Other things that would take some cleverness / changes to QEMU to
> +utilize due to ordering constrants:
> + - RLIMIT_NPROC (after uid changes to a unique uid)
> + - RLIMIT_NOFILES (after all necessary files are opened)

I think there is little difficulty with RLIMIT_NPROC since our qemu
does not fork.  I think we can set it to a value which is currently
violated for the current uid ?

> +### libxl UID cleanup
...
> +kill(-1,sig) sends a signal to "every process to which the calling
> +process has permission to send a signal".  So in theory:
> +  setuid(X)
> +  kill(-1,KILL)
> +should do the trick.

We need to check whether a malicious qemu process could kill this
one.

> +### Disks
> +
> +The chroot (and seccomp?) happens late enough such that QEMU can
> +initialize itself and open its disks. If you want to add a disk at run
> +time via or insert a CD, you can't pass a path because QEMU is
> +chrooted. Instead use the add-fd QMP command and use
> +/dev/fdset/<fdset-id> as the path.

I don't think we (Xen) really support hotplug of emulated disks right
now.  So it's just cd insert that's a problem.

> +### Network
>  
> +If QEMU runs in its own network namespace, it can't open the tap
> +device itself because the interface won't be visible outside of its
> +own namespace. So instead, have the toolstack open the device and pass
> +it as an fd on the command-line:

I think this could be solved by doing these things in a different
order.

Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-26 16:43 ` Ian Jackson
@ 2018-03-27 10:20   ` George Dunlap
  2018-03-27 11:24     ` George Dunlap
  2018-03-27 13:33     ` Ian Jackson
  2018-03-27 10:21   ` George Dunlap
  1 sibling, 2 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-27 10:20 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

On 03/26/2018 05:43 PM, Ian Jackson wrote:
> Thanks for this update!
> 
> George Dunlap writes ("[PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
> ...
>> +# Technical details
>> +
>> +## Restrictions done
> 
> This makes this doc into a mixture of a design doc and a user doc, I
> think.
> 
> It might be worth stating the design intent, which I think is this:
> 
>  * Even if there is a bug (for example in qemu) which permits a domain
>    to compromise the device model, the compromised device model
>    process is prevented from violating the system's overall security
>    properties.  Ie, a guest cannot "escape" from the virtualisation by
>    using a qemu bug.
> 
> This design intent is not yet achieved.  Right now an attacker is
> impeded and their attack is complicated; in some circumstances the
> will be limited to denial of service.
> 
> I'm not sure the individual restrictions need to be in a user-facing
> doc.
> 
> Maybe the user-facing wording from your patch should be moved to
> xl.cfg.doc.5 ?

Actually I think most of the user-facing stuff already in xl.cfg is
inappropriate for that man page.  It might make sense to have a separate
man page for it.

>> +'''Description''': Close and restrict Xen-related file descriptors.
>> +Specifically, make sure that only one `privcmd` instance is open, and
>> +that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.
>> +
>> +XXX Also, make sure that only one `xenstore` fd remains open, and that
>> +it's restricted.
> 
> No.  Firstly, in each case, all relevant descriptors are restricted.
> This is the purpose of the xentoolcore__restrict_* stuff.  Secondly,
> xenstore *is* covered - but the xs fd is squashed so as to be totally
> unuseable: xs.c uses xentoolcore__restrict_by_dup2_null.

Ross already gave me some corrections on this; here is what I have:

8<---
'''Description''': Close and restrict Xen-related file descriptors.
Specifically:
 * Close all xenstore-related file descriptors
 * Make sure that extraneous `privcmd` and `evtchn` instances are
closed
 * Make sure that all open instances of `privcmd` and `evtchn` file
descriptors have had IOCTL_PRIVCMD_RESTRICT and
IOCTL_EVTCHN_RESTRICT_DOMID ioctls called on them, respectively.
--->8

It sounds like the last may be inaccurate for libxl?

>> +### Namespaces for unused functionality
>> +
>> +'''Descripiton''': Enter QEMU into its own mount & IPC namespaces.
>> +This means that even if other restrictions fail, the process won't be
>> +able to even name system mount points or exsting non-file-based IPC
>> +descriptors to attempt to attack them.
>> +
>> +'''Implementation''':
>> +
>> +In theory this could be done in QEMU (similar to -sandbox, -runas,
>> +-chroot, and so on), but a patch doing this in QEMU was NAKed
>> +upstream. They preferred that this was done as a setup step by
>> +whatever executes QEMU; i.e., have the process which exec's QEMU first
>> +call:
>> +
>> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> 
> This would mean we would have to pass qemu fds for both the network
> tap devices and any vnc consoles.  That makes life considerably more
> complicated.  I think we should perhaps revisit this upstream.

You haven't read this very carefully (nor do you seem to have read the
discussion with Ross before responding).  I split the "Namespaces"
restriction Ross suggested internally to us into two parts:
 - Namespace restrictions for unused functionality
 - Network namespacing

This section covers the first one, which will have no impact because the
features restricted are not used.  It can be implemented immediately
with no architectural change to give additional security.

The second one does mean passing fds for network items.  Doing so in
general seems cleaner architecturally.

In fact, I was wondering if it might make sense to do *all* the
deprivileging in a separate process before starting qemu.  That would
make it simple, for example, to write a test program which would try to
break out of the "jail" we'd put it in.

>> +### Further RLIMITs
>> +
>> +RLIMIT_AS limits the total amount of memory; but this includes the
>> +virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
>> +fiddles with this; it would be straightforward to make it *set* the
>> +rlimit to what it thinks a sensible limit is.
>> +
>> +Other things that would take some cleverness / changes to QEMU to
>> +utilize due to ordering constrants:
>> + - RLIMIT_NPROC (after uid changes to a unique uid)
>> + - RLIMIT_NOFILES (after all necessary files are opened)
> 
> I think there is little difficulty with RLIMIT_NPROC since our qemu
> does not fork.  I think we can set it to a value which is currently
> violated for the current uid ?

Well AFAICT classic POSIX allows you to set rlimits on yourself, but not
on another process.  Since this is on the *user id* rather than the
*process*, I didn't think "setrlimit [as root] -> exec -> setuid" would
work correctly; I assumed you'd have to have "exec -> setuid ->
setrlimit", which would require further changes to QEMU.

I now realize that it might be that the limit will follow the current
uid of the process, in which case "setrlimit -> setuid" might have the
expected behavior.  But a quick Google search shows that the interaction
of RLIMIT_NPROC and setuid() is tricky[1][2], and may vary from
implementation to implementation; relying on the interaction to be
correct (and stay correct) seems somewhat risky (unless POSIX has
explicitly documented what should happen in that case, which again a
quick Google search hasn't turned up).

Linux does seem to have a "set rlimit on another process" system call
(prlimit).  But that would still require at least a little bit of care,
as then we'd need to set the limit after the setuid but before the guest
started running.  And in any case I couldn't (again in a quick search)
discover that FreeBSD has such a system call (and working correctly on
FreeBSD seems to be a design goal).

[1] https://lkml.org/lkml/2003/7/13/226
[2] https://lwn.net/Articles/451985/

>> +### libxl UID cleanup
> ...
>> +kill(-1,sig) sends a signal to "every process to which the calling
>> +process has permission to send a signal".  So in theory:
>> +  setuid(X)
>> +  kill(-1,KILL)
>> +should do the trick.
> 
> We need to check whether a malicious qemu process could kill this
> one.

Hmm, in theory it probably could.  If we do it twice (once at domain
destruction and again at domain creation) it would need to win at least
two races.  I didn't find a more secure way of doing this: Linux at
least doesn't seem to have a "kill all processes with userid X" system
call; the only way to target all processes with a userid was kill(-1),
which targets your own userid (and of course then leaves you open to
being killed).  There may be a way to prevent your own process from
being killed by the other process by playing with EUID / RUID and so on.

The only way to change your pid is to fork() or clone(), right?  If we
get RLIMIT_NPROC=1 working correctly, then this should be a "just in
case" backdrop, as QEMU shouldn't (in theory) be able to change its pid.

>> +### Disks
>> +
>> +The chroot (and seccomp?) happens late enough such that QEMU can
>> +initialize itself and open its disks. If you want to add a disk at run
>> +time via or insert a CD, you can't pass a path because QEMU is
>> +chrooted. Instead use the add-fd QMP command and use
>> +/dev/fdset/<fdset-id> as the path.
> 
> I don't think we (Xen) really support hotplug of emulated disks right
> now.  So it's just cd insert that's a problem.

I might edit it, but as written it's strictly correct: As it happens, we
don't hot-plug emulated devices; but if you wanted to, you'd need to
pass a file descriptor.

But there is a dangling phrase here ("...add a disk at run time via [?]
or insert a CD...").

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-26 16:43 ` Ian Jackson
  2018-03-27 10:20   ` George Dunlap
@ 2018-03-27 10:21   ` George Dunlap
  2018-03-28 12:28     ` Ross Lagerwall
  1 sibling, 1 reply; 21+ messages in thread
From: George Dunlap @ 2018-03-27 10:21 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

On 03/26/2018 05:43 PM, Ian Jackson wrote:
>> +### Network
>>  
>> +If QEMU runs in its own network namespace, it can't open the tap
>> +device itself because the interface won't be visible outside of its
>> +own namespace. So instead, have the toolstack open the device and pass
>> +it as an fd on the command-line:
> 
> I think this could be solved by doing these things in a different
> order.

Ross, do you have a reference for the qemu-devel discussion where they
rejected the patches to have QEMU restrict the namespaces?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 10:20   ` George Dunlap
@ 2018-03-27 11:24     ` George Dunlap
  2018-03-27 13:33     ` Ian Jackson
  1 sibling, 0 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-27 11:24 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

On 03/27/2018 11:20 AM, George Dunlap wrote:
>>> +### Further RLIMITs
>>> +
>>> +RLIMIT_AS limits the total amount of memory; but this includes the
>>> +virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
>>> +fiddles with this; it would be straightforward to make it *set* the
>>> +rlimit to what it thinks a sensible limit is.
>>> +
>>> +Other things that would take some cleverness / changes to QEMU to
>>> +utilize due to ordering constrants:
>>> + - RLIMIT_NPROC (after uid changes to a unique uid)
>>> + - RLIMIT_NOFILES (after all necessary files are opened)
>>
>> I think there is little difficulty with RLIMIT_NPROC since our qemu
>> does not fork.  I think we can set it to a value which is currently
>> violated for the current uid ?
> 
> Well AFAICT classic POSIX allows you to set rlimits on yourself, but not
> on another process.  Since this is on the *user id* rather than the
> *process*, I didn't think "setrlimit [as root] -> exec -> setuid" would
> work correctly; I assumed you'd have to have "exec -> setuid ->
> setrlimit", which would require further changes to QEMU.
> 
> I now realize that it might be that the limit will follow the current
> uid of the process, in which case "setrlimit -> setuid" might have the
> expected behavior.  But a quick Google search shows that the interaction
> of RLIMIT_NPROC and setuid() is tricky[1][2], and may vary from
> implementation to implementation; relying on the interaction to be
> correct (and stay correct) seems somewhat risky (unless POSIX has
> explicitly documented what should happen in that case, which again a
> quick Google search hasn't turned up).
> 
> Linux does seem to have a "set rlimit on another process" system call
> (prlimit).  But that would still require at least a little bit of care,
> as then we'd need to set the limit after the setuid but before the guest
> started running.  And in any case I couldn't (again in a quick search)
> discover that FreeBSD has such a system call (and working correctly on
> FreeBSD seems to be a design goal).

Actually, it looks like RLIMIT_NPROC isn't POSIX (at least it's not
listed in [1]), but it is supported by FreeBSD [2], it would seem.  The
lack of an explicit specification for NPROC / setuid interaction makes
it doubly risky to rely on.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/
[2]
https://www.freebsd.org/cgi/man.cgi?query=setrlimit&apropos=0&sektion=2&manpath=FreeBSD+12-current&arch=default&format=html

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 10:20   ` George Dunlap
  2018-03-27 11:24     ` George Dunlap
@ 2018-03-27 13:33     ` Ian Jackson
  2018-03-27 14:15       ` George Dunlap
  2018-03-28 12:47       ` Ross Lagerwall
  1 sibling, 2 replies; 21+ messages in thread
From: Ian Jackson @ 2018-03-27 13:33 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
> Actually I think most of the user-facing stuff already in xl.cfg is
> inappropriate for that man page.  It might make sense to have a separate
> man page for it.

I wouldn't object to that.

> On 03/26/2018 05:43 PM, Ian Jackson wrote:
> > No.  Firstly, in each case, all relevant descriptors are restricted.
> > This is the purpose of the xentoolcore__restrict_* stuff.  Secondly,
> > xenstore *is* covered - but the xs fd is squashed so as to be totally
> > unuseable: xs.c uses xentoolcore__restrict_by_dup2_null.
> 
> Ross already gave me some corrections on this; here is what I have:
> 
> 8<---
> '''Description''': Close and restrict Xen-related file descriptors.
> Specifically:
>  * Close all xenstore-related file descriptors
>  * Make sure that extraneous `privcmd` and `evtchn` instances are
> closed
>  * Make sure that all open instances of `privcmd` and `evtchn` file
> descriptors have had IOCTL_PRIVCMD_RESTRICT and
> IOCTL_EVTCHN_RESTRICT_DOMID ioctls called on them, respectively.
> --->8
> 
> It sounds like the last may be inaccurate for libxl?

I don't think anything closes any extraneous fds.  My approach in
the libxc layer was to register all fds and have the restrict call
iterate over all of them.  So, I guess, drop your 2nd bullet.

All of this is done by qemu calling xentoolcore_restrict_all, not by
libxl.

Maybe the "extraneous privcmd and evtchn fds" Ross means are ones
inherited by qemu from the toolstack.  Hrm.  Now I want to try to make
an argument that no such fds are leaked into qemu subprocesses by
libxl but I don't think it's trivial to do so.  It may not be true
even in the usual case but I also worry about races in higher layers
that call libxl in multiple threads with multiple ctx's concurrently
and which might therefore end up forking for a qemu while another
thread is opening privcmd.

Sorting this out needs to go on your todo list :-/.

> >> +### Namespaces for unused functionality
...
> >> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> > 
> > This would mean we would have to pass qemu fds for both the network
> > tap devices and any vnc consoles.  That makes life considerably more
> > complicated.  I think we should perhaps revisit this upstream.
> 
> You haven't read this very carefully (nor do you seem to have read the
> discussion with Ross before responding).  I split the "Namespaces"
> restriction Ross suggested internally to us into two parts:
>  - Namespace restrictions for unused functionality
>  - Network namespacing

Oh, sorry.  Yes.

> In fact, I was wondering if it might make sense to do *all* the
> deprivileging in a separate process before starting qemu.  That would
> make it simple, for example, to write a test program which would try to
> break out of the "jail" we'd put it in.

I think this would be a lot of trouble because we'd have to pass
pre-restricted privcmd fds into qemu and stuff them into libxc et al.
It would mean fixing a lot of redundant handle opening which is
currently harmless.

> >> +Other things that would take some cleverness / changes to QEMU to
> >> +utilize due to ordering constrants:
> >> + - RLIMIT_NPROC (after uid changes to a unique uid)
> >> + - RLIMIT_NOFILES (after all necessary files are opened)
> > 
> > I think there is little difficulty with RLIMIT_NPROC since our qemu
> > does not fork.  I think we can set it to a value which is currently
> > violated for the current uid ?
> 
> Well AFAICT classic POSIX allows you to set rlimits on yourself, but not
> on another process.  Since this is on the *user id* rather than the
> *process*, I didn't think "setrlimit [as root] -> exec -> setuid" would
> work correctly; I assumed you'd have to have "exec -> setuid ->
> setrlimit", which would require further changes to QEMU.

rlimits are a process property.

I'm quite alarmed by the patch by Neil Brown in your first link that
setuid should fail if it would cause RLIMIT_NPROC to be violated for
this process in the context of the new uid !  The difficulties
discussed on your 2nd link were immediately obvious to me on reading
the first one.

AFAICT from the article Linux now does not check RLIMIT_NPROC during
setuid, nor during execve.  I think that is correct behaviour.

> I now realize that it might be that the limit will follow the current
> uid of the process, in which case "setrlimit -> setuid" might have the
> expected behavior.  But a quick Google search shows that the interaction
> of RLIMIT_NPROC and setuid() is tricky[1][2], and may vary from
> implementation to implementation; relying on the interaction to be
> correct (and stay correct) seems somewhat risky (unless POSIX has
> explicitly documented what should happen in that case, which again a
> quick Google search hasn't turned up).

RLIMIT_NPROC is not in SuS:
 http://pubs.opengroup.org/onlinepubs/9699919799/functions/setrlimit.html

FreeBSD does not have any of this madness:
 https://www.unix.com/man-page/freebsd/2/setuid/

Anyway IMO we should set RLIMIT_NPROC after fork and before execve.
If this causes setuid to fail in qemu, qemu will crash.  But this
could only be the case if the new uid already has other processes,
which it shouldn't do.  (If it does, the new uid is probably
contaminated by a previous domain with the same domid.)

If the kernel is a version of Linux with the execve RLIMIT_NPROC
check, then execve will fail because there will be many more processes
than root's RLIMIT_NPROC.  I propose to treat that as a bug in Linux.

> Linux does seem to have a "set rlimit on another process" system call
> (prlimit).  But that would still require at least a little bit of care,
> as then we'd need to set the limit after the setuid but before the guest
> started running.  And in any case I couldn't (again in a quick search)
> discover that FreeBSD has such a system call (and working correctly on
> FreeBSD seems to be a design goal).

Yes.

> >> +### libxl UID cleanup
> > ...
> >> +kill(-1,sig) sends a signal to "every process to which the calling
> >> +process has permission to send a signal".  So in theory:
> >> +  setuid(X)
> >> +  kill(-1,KILL)
> >> +should do the trick.
> > 
> > We need to check whether a malicious qemu process could kill this
> > one.
> 
> Hmm, in theory it probably could.

I could find nothing in SuS explaining when process A may send signals
to process B.  So I resorted to the BSD manpages:

  https://www.unix.com/man-page/freebsd/2/kill/

  For a process to have permission to send a signal to a process
  designated by pid, the user must be the super-user, or the real or
  saved user ID of the receiving process must match the real or
  effective user ID of the sending process.

Also kill(-1,) does not signal the sender.

I did some web searches hoping to find someone had already solved
this.  The best discussion is here
  https://www.unix.com/unix-for-advanced-and-expert-users/168364-kill-all-process-uid.html
but the proposed answer doesn't work because the kill would kill lots
of root processes.

My analysis:

                                          euid   ruid   suid

  compared for receiving process                 ====   ====
  compared for sending process            ====   ====

  possible rogue processes                qid    qid    qid

  unrelated rootish processes             0      any    any
     or                                   any    0      any
     or                                   any    any    0

  killer (our pet issuing kill)
     so rogues can't kill killer                 !=qid  !=qid
     so it doesn't kill randomly          !=0    !=0
     therefore:                           qid    pet    0

We will have to reserve one uid `pet' specially for this nonsense.  We
call setresuid(2) to switch uids, and then kill(-1,9) and _exit(0).
The kill() call will kill other processes with uid `pet'.  So we take
a global lock while we do this.

`pet' could be a uid associated with a reserved domid, eg dom0.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 13:33     ` Ian Jackson
@ 2018-03-27 14:15       ` George Dunlap
  2018-03-27 14:24         ` George Dunlap
  2018-03-27 14:36         ` Ian Jackson
  2018-03-28 12:47       ` Ross Lagerwall
  1 sibling, 2 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-27 14:15 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

On 03/27/2018 02:33 PM, Ian Jackson wrote:
>>>> +### Namespaces for unused functionality
> ...
>>>> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
>>>
>>> This would mean we would have to pass qemu fds for both the network
>>> tap devices and any vnc consoles.  That makes life considerably more
>>> complicated.  I think we should perhaps revisit this upstream.
>>
>> You haven't read this very carefully (nor do you seem to have read the
>> discussion with Ross before responding).  I split the "Namespaces"
>> restriction Ross suggested internally to us into two parts:
>>  - Namespace restrictions for unused functionality
>>  - Network namespacing
> 
> Oh, sorry.  Yes.

Actually, sorry for being a bit 'pointy' here; upon reflection it
occured to me that you probably actually did read the whole thread, but
when you went back to respond you meant to respond to the other section,
but stopped scanning when you got a match with "namespace".

>> I now realize that it might be that the limit will follow the current
>> uid of the process, in which case "setrlimit -> setuid" might have the
>> expected behavior.  But a quick Google search shows that the interaction
>> of RLIMIT_NPROC and setuid() is tricky[1][2], and may vary from
>> implementation to implementation; relying on the interaction to be
>> correct (and stay correct) seems somewhat risky (unless POSIX has
>> explicitly documented what should happen in that case, which again a
>> quick Google search hasn't turned up).
> 
> RLIMIT_NPROC is not in SuS:
>  http://pubs.opengroup.org/onlinepubs/9699919799/functions/setrlimit.html
> 
> FreeBSD does not have any of this madness:
>  https://www.unix.com/man-page/freebsd/2/setuid/
> 
> Anyway IMO we should set RLIMIT_NPROC after fork and before execve.
> If this causes setuid to fail in qemu, qemu will crash.  But this
> could only be the case if the new uid already has other processes,
> which it shouldn't do.  (If it does, the new uid is probably
> contaminated by a previous domain with the same domid.)

I was more worried about the limit not having the expected effect after
the setuid().

>>>> +### libxl UID cleanup
>>> ...
>>>> +kill(-1,sig) sends a signal to "every process to which the calling
>>>> +process has permission to send a signal".  So in theory:
>>>> +  setuid(X)
>>>> +  kill(-1,KILL)
>>>> +should do the trick.
>>>
>>> We need to check whether a malicious qemu process could kill this
>>> one.
>>
>> Hmm, in theory it probably could.
> 
> I could find nothing in SuS explaining when process A may send signals
> to process B.  So I resorted to the BSD manpages:
> 
>   https://www.unix.com/man-page/freebsd/2/kill/
> 
>   For a process to have permission to send a signal to a process
>   designated by pid, the user must be the super-user, or the real or
>   saved user ID of the receiving process must match the real or
>   effective user ID of the sending process.

The text of both the FreeBSD and Linux man pages looks to be copied
verbatim from [1].

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html

> Also kill(-1,) does not signal the sender.

Linux is the same.

> My analysis:
> 
>                                           euid   ruid   suid
> 
>   compared for receiving process                 ====   ====
>   compared for sending process            ====   ====
> 
>   possible rogue processes                qid    qid    qid
> 
>   unrelated rootish processes             0      any    any
>      or                                   any    0      any
>      or                                   any    any    0
> 
>   killer (our pet issuing kill)
>      so rogues can't kill killer                 !=qid  !=qid
>      so it doesn't kill randomly          !=0    !=0
>      therefore:                           qid    pet    0
> 
> We will have to reserve one uid `pet' specially for this nonsense.  We
> call setresuid(2) to switch uids, and then kill(-1,9) and _exit(0).
> The kill() call will kill other processes with uid `pet'.  So we take
> a global lock while we do this.
> 
> `pet' could be a uid associated with a reserved domid, eg dom0.

Right -- it looks like that could work.  I hadn't initially noticed the
{RUID, SUID} => {RUID, EUID} distinction.

It's kind of hard to believe this is so difficult to pull off.

Should we post this to that forum, for the benefit of other people who
end up finding the same discussion? :-)

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 14:15       ` George Dunlap
@ 2018-03-27 14:24         ` George Dunlap
  2018-03-27 14:37           ` Ian Jackson
  2018-03-27 14:36         ` Ian Jackson
  1 sibling, 1 reply; 21+ messages in thread
From: George Dunlap @ 2018-03-27 14:24 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

On 03/27/2018 03:15 PM, George Dunlap wrote:
> On 03/27/2018 02:33 PM, Ian Jackson wrote:
>>>>> +### Namespaces for unused functionality
>> ...
>>>>> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
>>>>
>>>> This would mean we would have to pass qemu fds for both the network
>>>> tap devices and any vnc consoles.  That makes life considerably more
>>>> complicated.  I think we should perhaps revisit this upstream.
>>>
>>> You haven't read this very carefully (nor do you seem to have read the
>>> discussion with Ross before responding).  I split the "Namespaces"
>>> restriction Ross suggested internally to us into two parts:
>>>  - Namespace restrictions for unused functionality
>>>  - Network namespacing
>>
>> Oh, sorry.  Yes.
> 
> Actually, sorry for being a bit 'pointy' here; upon reflection it
> occured to me that you probably actually did read the whole thread, but
> when you went back to respond you meant to respond to the other section,
> but stopped scanning when you got a match with "namespace".
> 
>>> I now realize that it might be that the limit will follow the current
>>> uid of the process, in which case "setrlimit -> setuid" might have the
>>> expected behavior.  But a quick Google search shows that the interaction
>>> of RLIMIT_NPROC and setuid() is tricky[1][2], and may vary from
>>> implementation to implementation; relying on the interaction to be
>>> correct (and stay correct) seems somewhat risky (unless POSIX has
>>> explicitly documented what should happen in that case, which again a
>>> quick Google search hasn't turned up).
>>
>> RLIMIT_NPROC is not in SuS:
>>  http://pubs.opengroup.org/onlinepubs/9699919799/functions/setrlimit.html
>>
>> FreeBSD does not have any of this madness:
>>  https://www.unix.com/man-page/freebsd/2/setuid/
>>
>> Anyway IMO we should set RLIMIT_NPROC after fork and before execve.
>> If this causes setuid to fail in qemu, qemu will crash.  But this
>> could only be the case if the new uid already has other processes,
>> which it shouldn't do.  (If it does, the new uid is probably
>> contaminated by a previous domain with the same domid.)
> 
> I was more worried about the limit not having the expected effect after
> the setuid().
> 
>>>>> +### libxl UID cleanup
>>>> ...
>>>>> +kill(-1,sig) sends a signal to "every process to which the calling
>>>>> +process has permission to send a signal".  So in theory:
>>>>> +  setuid(X)
>>>>> +  kill(-1,KILL)
>>>>> +should do the trick.
>>>>
>>>> We need to check whether a malicious qemu process could kill this
>>>> one.
>>>
>>> Hmm, in theory it probably could.
>>
>> I could find nothing in SuS explaining when process A may send signals
>> to process B.  So I resorted to the BSD manpages:
>>
>>   https://www.unix.com/man-page/freebsd/2/kill/
>>
>>   For a process to have permission to send a signal to a process
>>   designated by pid, the user must be the super-user, or the real or
>>   saved user ID of the receiving process must match the real or
>>   effective user ID of the sending process.
> 
> The text of both the FreeBSD and Linux man pages looks to be copied
> verbatim from [1].
> 
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html
> 
>> Also kill(-1,) does not signal the sender.
> 
> Linux is the same.
> 
>> My analysis:
>>
>>                                           euid   ruid   suid
>>
>>   compared for receiving process                 ====   ====
>>   compared for sending process            ====   ====
>>
>>   possible rogue processes                qid    qid    qid
>>
>>   unrelated rootish processes             0      any    any
>>      or                                   any    0      any
>>      or                                   any    any    0
>>
>>   killer (our pet issuing kill)
>>      so rogues can't kill killer                 !=qid  !=qid
>>      so it doesn't kill randomly          !=0    !=0
>>      therefore:                           qid    pet    0
>>
>> We will have to reserve one uid `pet' specially for this nonsense.  We
>> call setresuid(2) to switch uids, and then kill(-1,9) and _exit(0).
>> The kill() call will kill other processes with uid `pet'.  So we take
>> a global lock while we do this.
>>
>> `pet' could be a uid associated with a reserved domid, eg dom0.

The alternate would be to have yet another UID range, to that we could
have a "target ID" (i.e., QEMU) and a "reaper ID" for each domain.  I
think that should mean any races should be benign.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 14:15       ` George Dunlap
  2018-03-27 14:24         ` George Dunlap
@ 2018-03-27 14:36         ` Ian Jackson
  2018-03-27 15:52           ` George Dunlap
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Jackson @ 2018-03-27 14:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
> On 03/27/2018 02:33 PM, Ian Jackson wrote:
> > Anyway IMO we should set RLIMIT_NPROC after fork and before execve.
> > If this causes setuid to fail in qemu, qemu will crash.  But this
> > could only be the case if the new uid already has other processes,
> > which it shouldn't do.  (If it does, the new uid is probably
> > contaminated by a previous domain with the same domid.)
> 
> I was more worried about the limit not having the expected effect after
> the setuid().

I think we can safely rule that out.

> > I could find nothing in SuS explaining when process A may send signals
> > to process B.  So I resorted to the BSD manpages:
> > 
> >   https://www.unix.com/man-page/freebsd/2/kill/
> > 
> >   For a process to have permission to send a signal to a process
> >   designated by pid, the user must be the super-user, or the real or
> >   saved user ID of the receiving process must match the real or
> >   effective user ID of the sending process.
> 
> The text of both the FreeBSD and Linux man pages looks to be copied
> verbatim from [1].
> 
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html

I don't know how I missed that.  It's the 2nd paragraph!

> > `pet' could be a uid associated with a reserved domid, eg dom0.
> 
> Right -- it looks like that could work.  I hadn't initially noticed the
> {RUID, SUID} => {RUID, EUID} distinction.
> 
> It's kind of hard to believe this is so difficult to pull off.

Yes!

> Should we post this to that forum, for the benefit of other people who
> end up finding the same discussion? :-)

Tempting.  Let's wait until we see if it works, first ...

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 14:24         ` George Dunlap
@ 2018-03-27 14:37           ` Ian Jackson
  2018-03-27 14:45             ` George Dunlap
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Jackson @ 2018-03-27 14:37 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
> The alternate would be to have yet another UID range, to that we could
> have a "target ID" (i.e., QEMU) and a "reaper ID" for each domain.  I
> think that should mean any races should be benign.

That would mean gobbling 2^17 uids rather than 2^16.  Doesn't seem
desirable to me, to add a miniscule amount of concurrency to a pretty
heavyweight operation.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 14:37           ` Ian Jackson
@ 2018-03-27 14:45             ` George Dunlap
  0 siblings, 0 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-27 14:45 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

On 03/27/2018 03:37 PM, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
>> The alternate would be to have yet another UID range, to that we could
>> have a "target ID" (i.e., QEMU) and a "reaper ID" for each domain.  I
>> think that should mean any races should be benign.
> 
> That would mean gobbling 2^17 uids rather than 2^16.  Doesn't seem
> desirable to me, to add a miniscule amount of concurrency to a pretty
> heavyweight operation.

I think all the OSes that can run as a dom0 have 32-bit UIDs anyway
(which is why after my patch the document suggests starting with UID
65536).  I was more worried about the complexity of implementation.

But I don't have terribly strong feelings about it.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 14:36         ` Ian Jackson
@ 2018-03-27 15:52           ` George Dunlap
  0 siblings, 0 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-27 15:52 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Ross Lagerwall, Julien Grall, Jan Beulich, Anthony Perard,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 2067 bytes --]

On 03/27/2018 03:36 PM, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
>> On 03/27/2018 02:33 PM, Ian Jackson wrote:
>>> Anyway IMO we should set RLIMIT_NPROC after fork and before execve.
>>> If this causes setuid to fail in qemu, qemu will crash.  But this
>>> could only be the case if the new uid already has other processes,
>>> which it shouldn't do.  (If it does, the new uid is probably
>>> contaminated by a previous domain with the same domid.)
>>
>> I was more worried about the limit not having the expected effect after
>> the setuid().
> 
> I think we can safely rule that out.
> 
>>> I could find nothing in SuS explaining when process A may send signals
>>> to process B.  So I resorted to the BSD manpages:
>>>
>>>   https://www.unix.com/man-page/freebsd/2/kill/
>>>
>>>   For a process to have permission to send a signal to a process
>>>   designated by pid, the user must be the super-user, or the real or
>>>   saved user ID of the receiving process must match the real or
>>>   effective user ID of the sending process.
>>
>> The text of both the FreeBSD and Linux man pages looks to be copied
>> verbatim from [1].
>>
>> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html
> 
> I don't know how I missed that.  It's the 2nd paragraph!
> 
>>> `pet' could be a uid associated with a reserved domid, eg dom0.
>>
>> Right -- it looks like that could work.  I hadn't initially noticed the
>> {RUID, SUID} => {RUID, EUID} distinction.
>>
>> It's kind of hard to believe this is so difficult to pull off.
> 
> Yes!
> 
>> Should we post this to that forum, for the benefit of other people who
>> end up finding the same discussion? :-)
> 
> Tempting.  Let's wait until we see if it works, first ...

Seems to work for me.  See the attached proof-of-concept programs.

If in the "reaper" program you change 'xuid' to DEFAULT_TARGET_UID
instead, then it does indeed die in usleep; with a separate 'real'
userid it survives and "runner" dies.

 -George

[-- Attachment #2: common.h --]
[-- Type: text/x-chdr, Size: 107 bytes --]

#ifndef COMMON_H

#define DEFAULT_TARGET_UID 32768
#define DEFAULT_REAPER_UID 32769

#endif /* COMMON_H */

[-- Attachment #3: reaper.c --]
[-- Type: text/x-csrc, Size: 952 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>

#include "common.h"

int main(int argc, char * argv) {
    uid_t euid, tuid = DEFAULT_TARGET_UID, xuid = DEFAULT_REAPER_UID;
    int rc;

    /* Check to make sure we have enough permissions */
    euid = geteuid();

    if ( euid != 0 ) {
        fprintf(stderr, "Must run as euid 0 to set uid\n");
        return -1;
    }

    /* 
     * Set euid to TARGET_UID so that we can kill the 'runner'; but
     * set ruid to REAPER_UID so that the runner can't kill us.
     */
    if ( setresuid(xuid, tuid, 0) ) {
        perror("Setting uid to target");
        return -1;
    }

    printf("Sleeping for 1 second to get the runner a chance to kill me...\n");
    usleep(1000000);
    
    rc = kill(-1, 9);
    if ( rc )
        perror("No processes killed");
    else
        printf("At least one process killed successfully!\n");

    return 0;
}

[-- Attachment #4: runner.c --]
[-- Type: text/x-csrc, Size: 563 bytes --]

#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>

#include "common.h"

int main(int argc, char * argv) {
    uid_t euid, tuid = DEFAULT_TARGET_UID;

    /* Check to make sure we have enough permissions */
    euid = geteuid();

    if ( euid != 0 ) {
        fprintf(stderr, "Must run as euid 0 to set uid\n");
        return -1;
    }

    if(setuid(tuid)) {
        perror("Setting uid to target");
        return -1;
    }

    while(1) {
        if(!fork())
            kill(-1, 9);
        else
            _exit(0);
    }
}

[-- Attachment #5: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 10:21   ` George Dunlap
@ 2018-03-28 12:28     ` Ross Lagerwall
  2018-03-28 13:26       ` George Dunlap
  0 siblings, 1 reply; 21+ messages in thread
From: Ross Lagerwall @ 2018-03-28 12:28 UTC (permalink / raw)
  To: George Dunlap, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, xen-devel

On 03/27/2018 11:21 AM, George Dunlap wrote:
> On 03/26/2018 05:43 PM, Ian Jackson wrote:
>>> +### Network
>>>   
>>> +If QEMU runs in its own network namespace, it can't open the tap
>>> +device itself because the interface won't be visible outside of its
>>> +own namespace. So instead, have the toolstack open the device and pass
>>> +it as an fd on the command-line:
>>
>> I think this could be solved by doing these things in a different
>> order.
> 
> Ross, do you have a reference for the qemu-devel discussion where they
> rejected the patches to have QEMU restrict the namespaces?
> 

The thread starts here:
https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04674.html

Regards,
-- 
Ross Lagerwall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-27 13:33     ` Ian Jackson
  2018-03-27 14:15       ` George Dunlap
@ 2018-03-28 12:47       ` Ross Lagerwall
  2018-03-28 13:44         ` George Dunlap
  1 sibling, 1 reply; 21+ messages in thread
From: Ross Lagerwall @ 2018-03-28 12:47 UTC (permalink / raw)
  To: Ian Jackson, George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, xen-devel

On 03/27/2018 02:33 PM, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
>> Actually I think most of the user-facing stuff already in xl.cfg is
>> inappropriate for that man page.  It might make sense to have a separate
>> man page for it.
> 
> I wouldn't object to that.
> 
>> On 03/26/2018 05:43 PM, Ian Jackson wrote:
>>> No.  Firstly, in each case, all relevant descriptors are restricted.
>>> This is the purpose of the xentoolcore__restrict_* stuff.  Secondly,
>>> xenstore *is* covered - but the xs fd is squashed so as to be totally
>>> unuseable: xs.c uses xentoolcore__restrict_by_dup2_null.
>>
>> Ross already gave me some corrections on this; here is what I have:
>>
>> 8<---
>> '''Description''': Close and restrict Xen-related file descriptors.
>> Specifically:
>>   * Close all xenstore-related file descriptors
>>   * Make sure that extraneous `privcmd` and `evtchn` instances are
>> closed
>>   * Make sure that all open instances of `privcmd` and `evtchn` file
>> descriptors have had IOCTL_PRIVCMD_RESTRICT and
>> IOCTL_EVTCHN_RESTRICT_DOMID ioctls called on them, respectively.
>> --->8
>>
>> It sounds like the last may be inaccurate for libxl?
> 
> I don't think anything closes any extraneous fds.  My approach in
> the libxc layer was to register all fds and have the restrict call
> iterate over all of them.  So, I guess, drop your 2nd bullet.
> 
> All of this is done by qemu calling xentoolcore_restrict_all, not by
> libxl.
> 
> Maybe the "extraneous privcmd and evtchn fds" Ross means are ones
> inherited by qemu from the toolstack.

IIRC I didn't mention anything about extraneous fds. I'm not sure where 
that part came from

Maybe this part shouldn't contain so much detail. Perhaps it could just 
refer to the documentation for xentoolcore_restrict_all?

Note that file descriptors are not just closed, they are replaced with 
/dev/null. Also note that any gnttab and gntalloc file descriptors (used 
by libxengnttab) are also replaced with /dev/null.

Regards,
-- 
Ross Lagerwall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-28 12:28     ` Ross Lagerwall
@ 2018-03-28 13:26       ` George Dunlap
  0 siblings, 0 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-28 13:26 UTC (permalink / raw)
  To: Ross Lagerwall, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, xen-devel

On 03/28/2018 01:28 PM, Ross Lagerwall wrote:
> On 03/27/2018 11:21 AM, George Dunlap wrote:
>> On 03/26/2018 05:43 PM, Ian Jackson wrote:
>>>> +### Network
>>>>   +If QEMU runs in its own network namespace, it can't open the tap
>>>> +device itself because the interface won't be visible outside of its
>>>> +own namespace. So instead, have the toolstack open the device and pass
>>>> +it as an fd on the command-line:
>>>
>>> I think this could be solved by doing these things in a different
>>> order.
>>
>> Ross, do you have a reference for the qemu-devel discussion where they
>> rejected the patches to have QEMU restrict the namespaces?
>>
> 
> The thread starts here:
> https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04674.html

Thanks.  The key objection seems to be that things on the command-line
will behave differently than subsequent commands sent over QMP.  But of
course, that's exactly the point. :-)  And as you pointed out in that
thread, it's no different than chroot.

It is suboptimal to have QMP commands return 'success' when they don't
actually have any effect; so a better interface would be to arrange that
all QMP commands use an unshared namespace fail.

That said, architecturally, doing as much of the restriction before
exec'ing QEMU seems a lot cleaner to me.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-03-28 12:47       ` Ross Lagerwall
@ 2018-03-28 13:44         ` George Dunlap
  0 siblings, 0 replies; 21+ messages in thread
From: George Dunlap @ 2018-03-28 13:44 UTC (permalink / raw)
  To: Ross Lagerwall, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	Julien Grall, Jan Beulich, Anthony Perard, xen-devel

On 03/28/2018 01:47 PM, Ross Lagerwall wrote:
> On 03/27/2018 02:33 PM, Ian Jackson wrote:
>> George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and
>> update with status and future plans"):
>>> Actually I think most of the user-facing stuff already in xl.cfg is
>>> inappropriate for that man page.  It might make sense to have a separate
>>> man page for it.
>>
>> I wouldn't object to that.
>>
>>> On 03/26/2018 05:43 PM, Ian Jackson wrote:
>>>> No.  Firstly, in each case, all relevant descriptors are restricted.
>>>> This is the purpose of the xentoolcore__restrict_* stuff.  Secondly,
>>>> xenstore *is* covered - but the xs fd is squashed so as to be totally
>>>> unuseable: xs.c uses xentoolcore__restrict_by_dup2_null.
>>>
>>> Ross already gave me some corrections on this; here is what I have:
>>>
>>> 8<---
>>> '''Description''': Close and restrict Xen-related file descriptors.
>>> Specifically:
>>>   * Close all xenstore-related file descriptors
>>>   * Make sure that extraneous `privcmd` and `evtchn` instances are
>>> closed
>>>   * Make sure that all open instances of `privcmd` and `evtchn` file
>>> descriptors have had IOCTL_PRIVCMD_RESTRICT and
>>> IOCTL_EVTCHN_RESTRICT_DOMID ioctls called on them, respectively.
>>> --->8
>>>
>>> It sounds like the last may be inaccurate for libxl?
>>
>> I don't think anything closes any extraneous fds.  My approach in
>> the libxc layer was to register all fds and have the restrict call
>> iterate over all of them.  So, I guess, drop your 2nd bullet.
>>
>> All of this is done by qemu calling xentoolcore_restrict_all, not by
>> libxl.
>>
>> Maybe the "extraneous privcmd and evtchn fds" Ross means are ones
>> inherited by qemu from the toolstack.
> 
> IIRC I didn't mention anything about extraneous fds. I'm not sure where
> that part came from

That was me regurgitating from memory something IanJ said, about QEMU
having loads of dangling fds lying around that he needed to figure out
how to check to make sure they were all closed.  Probably I misunderstood.

> Maybe this part shouldn't contain so much detail. Perhaps it could just
> refer to the documentation for xentoolcore_restrict_all?

Which documentation is this?  The description in
tools/libs/toolcore/include/xentoolcore.h describes the *intended
effect*, not the actual mechanism.  (It doesn't mention dup()'ing the
files to /dev/null, for instance.)

I do think it's important to have one place where someone can find all
the technical details of the restrictions made, so that they can verify
that restriction is sufficient for their needs (or determine that we've
missed something and let us know).

I'm happy to have a summary of what xentoolcore_restrict_all() does with
a reference to a more complete description elsewhere, if (or when) such
a description exists.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-03-28 13:44 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-22 18:24 [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
2018-03-23  9:41 ` Ross Lagerwall
2018-03-23 10:01   ` Roger Pau Monné
2018-03-23 10:53   ` George Dunlap
2018-03-23 11:33     ` Ross Lagerwall
2018-03-23 12:13 ` Anthony PERARD
2018-03-26 16:43 ` Ian Jackson
2018-03-27 10:20   ` George Dunlap
2018-03-27 11:24     ` George Dunlap
2018-03-27 13:33     ` Ian Jackson
2018-03-27 14:15       ` George Dunlap
2018-03-27 14:24         ` George Dunlap
2018-03-27 14:37           ` Ian Jackson
2018-03-27 14:45             ` George Dunlap
2018-03-27 14:36         ` Ian Jackson
2018-03-27 15:52           ` George Dunlap
2018-03-28 12:47       ` Ross Lagerwall
2018-03-28 13:44         ` George Dunlap
2018-03-27 10:21   ` George Dunlap
2018-03-28 12:28     ` Ross Lagerwall
2018-03-28 13:26       ` George Dunlap

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.