All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
@ 2018-09-21 17:04 George Dunlap
  2018-09-21 17:04 ` [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv George Dunlap
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-21 17:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Wilk, Andrew Cooper,
	Tim Deegan, George Dunlap, Ross Lagerwall, Julien Grall,
	Jan Beulich, Anthony Perard, Ian Jackson

docs/qemu-deprivilege.txt had some basic instructions for using
dm_restrict, but it was incomplete, misleading, and stale.

Update the docs in a number of ways.

First, separate user-facing documentation and technical description
into docs/features and docs/design, respectively.

In the feature doc:

* Introduce a section mentioning minimim versions of Linux, Xen, and
qemu required (TBD)

* Fix the discussion of qemu userid.  Mention xen-qemuuser-range-base,
and provide example shell code that actually has some hope of working
(instead of failing out after creating 900 userids).

* Describe how to enable restrictions, as well as features which
probably don't or definitely don't work.

In the design doc, introduce a "Technical Details" section which
describes specifically what restrictions are currently done, and also
what restrictions we are looking at doing in the future.

The idea here is that as we implement the various items for the
future, we move them from "Restrictions still to do" to "Restrictions
done".  This can also act as a design document -- a place for public
discussion of what can or should be done and how.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
Changes since v1:
- Break into two, and move into appropriate directories (rather than 'misc')
- Updated version requirements
- Distinguish between features which "don't yet work" and features which we never expect to work
- Update description of xen-restrict functionality
- Reorder and expand further restrictions
- Make it more clear which restrictions are available on Linux only
- Include detailed description of how to kill a process
- Add RLIMIT_NPROC as something we can do without further changes to qemu
- Document the need to check for the sandbox feature before using it

Thank you to Ross Lagerwall, whose description of what XenServer is
doing formed much of the basis for the text here.

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Tim Deegan <tim@xen.org>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien.grall@arm.com>
CC: Anthony Perard <anthony.perard@citrix.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 docs/designs/qemu-deprivilege.md      | 295 ++++++++++++++++++++++++++
 docs/features/qemu-deprivilege.pandoc |  91 ++++++++
 docs/misc/qemu-deprivilege.txt        |  36 ----
 3 files changed, 386 insertions(+), 36 deletions(-)
 create mode 100644 docs/designs/qemu-deprivilege.md
 create mode 100644 docs/features/qemu-deprivilege.pandoc
 delete mode 100644 docs/misc/qemu-deprivilege.txt

diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
new file mode 100644
index 0000000000..1e731c16aa
--- /dev/null
+++ b/docs/designs/qemu-deprivilege.md
@@ -0,0 +1,295 @@
+# Introduction
+
+The goal of deprilvileging qemu is this: Even if there is a bug (for
+example in qemu) which permits a domain to gain control of the device
+model, the compromised device model process is prevented from
+violating the system's overall security properties.  Ie, a guest
+cannot "escape" from the virtualisation by using a qemu bug.
+
+This document lists the various technical measures which we either
+have taken, or plan to take to effect this goal.  Some of them are
+required to be considered secure (that is, there are known attack
+vectors which they close); others are "just in case" (that is, there
+are no known attack vectors, but we perform the restrictions to reduce
+the possibility of unknown attack vectors).
+
+# Restrictions done
+
+The following restrictions are currently implemented.
+
+## Having qemu switch user
+
+'''Description''': As mentioned above, having QEMU switch to a
+non-root user, one per domain id.  Not being the root user limits what
+a compromised QEMU process can do to the system, and having one user
+per domain id limits what a comprimised QEMU process can do to the
+QEMU processes of other VMs.
+
+'''Implementation''': The toolstack adds the following to the qemu command-line:
+
+    -runas <uid>:<gid>
+
+'''How to test''':
+
+    grep /proc/<qpid>/status [UG]id
+
+'''Testing Status''': Not tested
+
+## Xen library / file-descriptor restrictions
+
+'''Description''': Close and restrict Xen-related file descriptors.
+Specifically:
+ * Close all xenstore-related file descriptors
+ * Make sure that extraneous `privcmd` and `evtchn` instances are
+closed
+ * Make sure that all open instances of `privcmd` and `evtchn` file
+descriptors have had `IOCTL_PRIVCMD_RESTRICT` and
+`IOCTL_EVTCHN_RESTRICT_DOMID` ioctls called on them, respectively.
+
+FIXME: Double-check the correctness of the above
+
+'''Implementation''': Toolstack adds the following to the qemu command-line:
+
+    -xen-domid-restrict
+
+'''How to test''':
+
+    tools/test/depriv/depriv-fd-checker.c
+
+'''Testing status''': Tested
+
+# Restrictions / improvements still to do
+
+This lists potential restrictions still to do.  It is meant to be
+listed in order of ease of implementation, with low-hanging fruit
+first.
+
+## Chroot
+
+'''Description''': Qemu runs in its own chroot, such that even if it
+could call an 'open' command of some sort, there would be nothing for
+it to see.
+
+'''Implementation''': The toolstack creates a directory in the libxl "run-dir"; e.g.
+`/var/run/xen/qemu-root-<domid>`
+
+Then adds the following to the qemu command-line:
+
+    -chroot /var/run/xen/qemu-root-<domid>
+	
+'''How to test''':  Check `/proc/<qpid>/root`
+	
+'''Tested''': Not tested
+
+## Namespaces for unused functionality (Linux only)
+
+'''Description''': Enter QEMU into its own mount & IPC namespaces.
+This means that even if other restrictions fail, the process won't be
+able to even name system mount points or exsting non-file-based IPC
+descriptors to attempt to attack them.
+
+'''Implementation''':
+
+In theory this could be done in QEMU (similar to -sandbox, -runas,
+-chroot, and so on), but a patch doing this in QEMU was NAKed
+upstream. They preferred that this was done as a setup step by
+whatever executes QEMU; i.e., have the process which exec's QEMU first
+call:
+
+    unshare(CLONE_NEWNS | CLONE_NEWIPC)
+	
+'''How to test''':  Check `/proc/<qpid>/ns/[ipc,mnt]`
+
+'''Tested''': Not tested
+
+### Basic RLIMITs
+
+'''Description''': A number of limits on the resources that a given
+process / userid is allowed to consume.  These can limit the ability
+of a compromised QEMU process to DoS domain 0 by exhausting various
+resources available to it.
+
+'''Implementation'''
+
+Limits that can be implemented immediately without much effort:
+ - RLIMIT_FSIZE` (file size) to 256KiB.
+ - RLIMIT_NPROC (after uid changes to a unique uid)
+
+Probably not necessary but why not:
+ - RLIMIT_CORE: 0
+ - RLIMIT_MSGQUEUE: 0
+ - RLIMIT_LOCKS: 0 XXX Check
+ - RLIMIT_MEMLOCK: 0
+   mlock() is Used only when both "realtime" and "mlock" are specified.
+   
+'''How to test''': Check `/proc/<qpid>/limits`
+
+'''Tested''': Not tested
+
+### Further RLIMITs
+
+RLIMIT_AS limits the total amount of memory; but this includes the
+virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
+fiddles with this; it would be straightforward to make it *set* the
+rlimit to what it thinks a sensible limit is.
+
+Other things that would take some cleverness / changes to QEMU to
+utilize due to ordering constrants:
+ - RLIMIT_NOFILES (after all necessary files are opened)
+
+### libxl UID cleanup
+
+'''Description''': Domain IDs are reused, and thus restricted UIDs are
+reused.  If a compromised QEMU can fork (due to seccomp or
+RLIMIT_NPROC limits being ineffective for some reason), it may avoid
+being killed when its domain dies, then wait until the domain ID is
+reused again, at which point it will have control over the domain in
+question (which probably belongs to someone else).
+
+libxl should kill all UIDs associated with a domain both when the VM
+is destroyed, and before starting a VM with the same UID.
+
+'''Implementation''': This is unnecessarily tricky.
+
+The kill() system call can have three kinds of targets:
+ - A single pid
+ - A process group
+ - "Every process except me to which I am allowed to send a signal" (-1)
+
+Targeting a single pid is racy and likely to be beaten by the
+following loop:
+
+    while(1) {
+        if(fork())
+	    _exit(0);
+    }	  
+
+That is, by the time you've read the process list and found the
+process id you want to kill, that process has exited and there is a
+new process whose pid you don't know about.
+
+Targeting a process group will be ineffective, as unprivileged
+processes are allowed to make their own process groups.
+
+kill(-1) can be used but must be done with care.  Consider the
+following code, for example:
+
+    setuid(target_uid);
+    kill(-1, 9);
+
+This looks like it will do the trick; but by setting all of the user
+ids (effective, real, and saved), it opens the 'killing' process up to
+being killed by the target process:
+
+    while(1) {
+        if(fork())
+            _exit(0);
+        else
+            kill(-1, 9);
+    }
+
+Fortunately there is an assymetry we can take advantage of.  From the
+POSIX spec:
+
+> For a process to have permission to send a signal to a process
+> designated by pid, unless the sending process has appropriate
+> privileges, the real or effective user ID of the sending process shall
+> match the real or saved set-user-ID of the receiving process.
+
+The solution is to allocate a second "reaper" uid that is only used to kill
+target processes.  We set the euid of the killing process to the `target_uid`,
+but the ruid of the killing process to `reaper_uid`, leaving the suid of the
+killing process as 0:
+
+    setresuid(reaper_uid, target_uid, 0);
+    kill(-1, 9);
+
+NOTE: We cannot use `setreuid(reaper_uid, target_uid)` here, as that
+will set *both* euid *and* suid to `target_uid`, making the killing
+process vulnerable to the target process again.
+
+Since this will kill all other `reaper_uid` processes as well, we must
+either allocate a separate `reaper_uid` per domain, or use locking to
+ensure that only one killing process is active at a time.
+
+## libxl: Treat QMP connection as untrusted
+
+'''Description''': Currently libxl talks with QEMU via QMP; but its
+interactions have not historically considered from a security point of
+view.  For example, qmp_synchronous_send() waits for a response from
+QEMU, which a compromised QEMU could simply not send (thus preventing
+the toolstack from making forward progress).
+
+'''Implementation''': Audit toolstack interactions with QEMU which
+happen after the guest has started running, and assume QEMU has been
+compromised.
+
+### seccomp filtering (Linux only)
+
+'''Description''': Turn on seccomp filtering to disable syscalls which
+QEMU doesn't need. 
+
+'''Implementation''': Enable from the command-line:
+
+    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
+
+`elevateprivileges` is currently required to allow `-runas` to work.
+Removing this requirement would mean making sure that the uid change
+happened before the seccomp2 call, perhaps by changing the uid before
+executing QEMU.  (But this would then require other changes to create
+the QMP socket, VNC socket, and so on).
+
+This feature is not on by default and may not be available in all
+environments.  We therefore need to either:
+ 1. Require that this feature be enabled to build qemu
+ 2. Check for `-sandbox` support at runtime before 
+
+### Disks
+
+The chroot (and seccomp?) happens late enough such that QEMU can
+initialize itself and open its disks. If you want to add a disk at run
+time via or insert a CD, you can't pass a path because QEMU is
+chrooted. Instead use the add-fd QMP command and use
+/dev/fdset/<fdset-id> as the path.
+
+A further layer of restriction could be to set RLIMIT_NOFILES to '0',
+and hand all disks over QMP.
+
+## Migration
+
+When calling xen-save-devices-state, since QEMU is running in a chroot
+it is not useful to pass a filename (it doesn't even have write access
+inside the chroot). Instead, give it an open fd using the add-fd
+mechanism.
+
+### Network namespacing (Linux only)
+
+Enter QEMU into its own network namespace (in addition to mount & IPC
+namespaces).  Basically change the 'unshare' call to be as follows:
+
+    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
+
+QEMU does actually use the network namespace by default, so adding
+this restriction requires additional changes, listed below.
+
+### Network
+
+If QEMU runs in its own network namespace, it can't open the tap
+device itself because the interface won't be visible outside of its
+own namespace. So instead, have the toolstack open the device and pass
+it as an fd on the command-line:
+
+    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
+
+### VNC
+
+If QEMU runs in its own network namespace, it is not straightforward
+to listen on a TCP socket outside of its own network namespace. One
+option would be to use VNC over a UNIX socket:
+
+    -vnc unix:/var/run/xen/vnc-<domid>
+
+However, this would break functionality in the general case; I think
+we need to have the toolstack open a socket and pass the fd to QEMU
+(which requires changes to QEMU).
+
diff --git a/docs/features/qemu-deprivilege.pandoc b/docs/features/qemu-deprivilege.pandoc
new file mode 100644
index 0000000000..b377076606
--- /dev/null
+++ b/docs/features/qemu-deprivilege.pandoc
@@ -0,0 +1,91 @@
+% QEMU Deprivileging / dm_restrict
+% Revision 1
+
+\clearpage
+
+# Basics
+
+---------------- ----------------------------------------------------
+         Status: **Tech Preview**
+
+Architecture(s): x86
+
+   Component(s): toolstack
+
+---------------- ----------------------------------------------------
+
+# Overview
+
+By default, the QEMU device model is run in domain 0.  If an attacker
+can gain control of a QEMU process, it could easily take control of a
+system.
+
+dm_restrict is a set of operations to restrict QEMU running in domain
+0.  It consists of two halves:
+
+ 1. Mechanisms to restrict QEMU to only being able to affect its own
+domain
+ 2. Mechanisms to restruct QEMU's ability to interact with domain 0.
+
+# User details
+
+## Getting the right versions of software
+
+Linux: 4.11+
+
+Qemu: 3.0+ (Or the version that comes with Xen 4.12+)
+
+## Setting up a userid range
+
+For maximum security, libxl needs to run the devicemodel for each
+domain under a user id (UID) corresponding to its domain id.  There
+are 32752 possible domain IDs, and so libxl needs 32752 user ids set
+aside for it.
+
+The simplest and most effective way to do this is to allocate a
+contiguous block of UIDs, and create a single user named
+`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
+
+    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
+
+NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
+to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
+UIDs however.  UID 65535 is reserved for an invalid value, and 65534
+is normally allocated to "nobody".
+
+Another, less-secure way is to run all QEMUs as the same UID.  To do
+this, create a user named `xen-qemuuser-shared`; for example:
+
+    adduser --no-create-home --system xen-qemuuser-shared
+
+## Domain config changes
+
+The core domain config change is to add the following line to the
+domain configuration:
+
+    dm_restrict=1
+
+This will perform a number of restrictions, outlined below in the
+'Technical details' section.
+
+# Technical details
+
+See docs/design/qemu-deprivilege.txt for technical details.
+
+# Limitations
+
+The following features still need to be implemented:
+ * Inserting a new cdrom while the guest is running (xl cdrom-insert)
+ * Migration / save / restore
+
+Additionally, getting PCI passthrough to work securely would require a
+significant rework of how passthrough works at the moment.  It may be
+implemented at some point but is not a near-term priority.
+
+# History
+
+------------------------------------------------------------------------
+Date       Revision Version  Notes
+---------- -------- -------- -------------------------------------------
+2018-09-14 1        Xen 4.12 Imported from docs/misc
+---------- -------- -------- -------------------------------------------
diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-deprivilege.txt
deleted file mode 100644
index 58b86a3908..0000000000
--- a/docs/misc/qemu-deprivilege.txt
+++ /dev/null
@@ -1,36 +0,0 @@
-For security reasons, libxl tries to pass a non-root username to QEMU as
-argument. During initialization QEMU calls setuid and setgid with the
-user ID and the group ID of the user passed as argument.
-Libxl looks for the following users in this order:
-
-1) a user named "xen-qemuuser-domid$domid",
-Where $domid is the domid of the domain being created.
-This requires the reservation of 65535 uids from xen-qemuuser-domid1
-to xen-qemuuser-domid65535. To use this mechanism, you might want to
-create a large number of users at installation time. For example:
-
-for ((i=1; i<65536; i++))
-do
-    adduser --no-create-home --system xen-qemuuser-domid$i
-done
-
-You might want to consider passing --group to adduser to create a new
-group for each new user.
-
-
-2) a user named "xen-qemuuser-shared"
-As a fall back if both 1) fails, libxl will use a single user for
-all QEMU instances. The user is named xen-qemuuser-shared. This is
-less secure but still better than running QEMU as root. Using this is as
-simple as creating just one more user on your host:
-
-adduser --no-create-home --system xen-qemuuser-shared
-
-
-3) root
-As a last resort, libxl will start QEMU as root.
-
-
-Please note that running QEMU as non-root causes several features like
-migration and PCI passthrough to not work properly and may prevent the guest
-from booting.
-- 
2.18.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
@ 2018-09-21 17:04 ` George Dunlap
  2018-09-24 10:57   ` Ian Jackson
  2018-09-24 11:14   ` Anthony PERARD
  2018-09-21 17:04 ` [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot George Dunlap
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-21 17:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Ross Lagerwall,
	Anthony Perard, Ian Jackson

Add a tool to check whether the various process-level deprivileging
operations have actually taken place on the process.

The tool takes a domname or domid, and returns success or failure.

To begin with, only test the process/group it setting, since this is
the only restriction currently implemented.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Anthony Perard <anthony.perard@citrix.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 tools/tests/depriv/depriv-process-checker.sh | 71 ++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100755 tools/tests/depriv/depriv-process-checker.sh

diff --git a/tools/tests/depriv/depriv-process-checker.sh b/tools/tests/depriv/depriv-process-checker.sh
new file mode 100755
index 0000000000..4aa58e7760
--- /dev/null
+++ b/tools/tests/depriv/depriv-process-checker.sh
@@ -0,0 +1,71 @@
+#!/bin/bash
+
+domain="$1"
+
+if [[ "$domain" =~ ^[0-9]+$ ]] ; then
+    domid="$domain"
+else
+    domid=$(xl domid "$domain")
+fi
+
+dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid 2>/dev/null)
+if [[ -z "$dmpid" ]] ; then
+    echo "xenstore-read failed"
+    exit 1
+fi
+
+failed="false"
+
+# TEST: Process / group id
+#
+# Read /proc/<qpid>/status, checking Uid and Gid lines
+#
+# Uid should be xen-qemuuser-range-base+$domid
+# Gid should be 65534 ("nobody")
+# FIXME: deal with other UID configurations?
+echo -n "Process UID: "
+tgt_uid=$(id -u xen-qemuuser-range-base)
+tgt_uid=$(( $tgt_uid + $domid ))
+
+# Example input:
+# Uid:	1193	1193	1193	1193
+input=$(grep Uid /proc/$dmpid/status)
+if [[ "$input" =~ ^Uid:[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)$ ]] ; then
+    result="PASSED"
+    for i in {1..4}; do
+	if [[ "${BASH_REMATCH[$i]}" != "$tgt_uid" ]] ; then
+	    result="FAILED"
+	    failed="true"
+	    break
+	fi
+    done
+else
+    result="FAILED"
+    failed="true"
+fi
+echo $result
+
+# Example input:
+# Gid:	10020	10020	10020	10020
+echo -n "Process GID: "
+input=$(grep Uid /proc/$dmpid/status)
+if [[ "$input" =~ ^Gid:[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)$ ]] ; then
+    result="PASSED"
+    for i in {1..4}; do
+	if [[ "${BASH_REMATCH[$i]}" != "$65534" ]] ; then
+	    result="FAILED"
+	    failed="true"
+	    break
+	fi
+    done
+else
+    result="FAILED"
+    failed="true"
+fi
+echo $result
+
+if $failed ; then
+    exit 1
+else
+    exit 0
+fi
-- 
2.18.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
  2018-09-21 17:04 ` [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv George Dunlap
@ 2018-09-21 17:04 ` George Dunlap
  2018-09-24  8:20   ` Paul Durrant
  2018-09-24 10:31   ` Ian Jackson
  2018-09-21 17:04 ` [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux George Dunlap
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-21 17:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu, George Dunlap

When dm_restrict is enabled, ask QEMU to chroot into an empty directory.

* Create /var/run/qemu/root-domid (deleting the old one if it's there)
* Pass the -chroot option to QEMU

Rather than running `rm -rf` on the directory before creating it
(since there is no library function to do this), simply rmdir the
directory, relying on the fact that the previous QEMU instance, if
properly restcirted, shouldn't have been able to write anything
anyway.

Also test that this worked in depriv-process-checker.sh.  This
requires XEN_RUN_DIR to be set in the environment so we know where to
look.

Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
Not sure the best way to get XEN_RUN_DIR; having configure process this
file seems like a bit overkill.

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Anthony Perard <anthony.perard@citrix.com>
---
 docs/designs/qemu-deprivilege.md             | 12 +++---
 tools/libxl/libxl_dm.c                       | 39 +++++++++++++++++++-
 tools/tests/depriv/depriv-process-checker.sh | 17 +++++++++
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
index 1e731c16aa..df5bb07d7c 100644
--- a/docs/designs/qemu-deprivilege.md
+++ b/docs/designs/qemu-deprivilege.md
@@ -58,12 +58,6 @@ FIXME: Double-check the correctness of the above
 
 '''Testing status''': Tested
 
-# Restrictions / improvements still to do
-
-This lists potential restrictions still to do.  It is meant to be
-listed in order of ease of implementation, with low-hanging fruit
-first.
-
 ## Chroot
 
 '''Description''': Qemu runs in its own chroot, such that even if it
@@ -81,6 +75,12 @@ Then adds the following to the qemu command-line:
 	
 '''Tested''': Not tested
 
+## Restrictions / improvements still to do
+
+This lists potential restrictions still to do.  It is meant to be
+listed in order of ease of implementation, with low-hanging fruit
+first.
+
 ## Namespaces for unused functionality (Linux only)
 
 '''Description''': Enter QEMU into its own mount & IPC namespaces.
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index abd31ee6f2..df6fb64bee 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -1410,9 +1410,46 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
         }
     }
 
-    if (libxl_defbool_val(b_info->dm_restrict))
+    if (libxl_defbool_val(b_info->dm_restrict)) {
+        char * chroot_dir = GCSPRINTF("%s/qemu-root-%d",
+                                      libxl__run_dir_path(), guest_domid);
+        
         flexarray_append(dm_args, "-xen-domid-restrict");
 
+        /* 
+         * Run QEMU in a chroot at /var/run/xen/root-%d
+         *
+         * There is no library function to do the equivalent of `rm
+         * -rf`.  However deprivileged QEMU in theory shouldn't be
+         * able to write any files anyway, as the chroot would be
+         * owned by root, but it would be running as an unprivileged
+         * process.  So in theory, old chroots should always be empty.
+         * 
+         * rmdir the directory before attempting to create
+         * it; if it returns anything other than ENOENT, fail domain
+         * creation.
+         */
+        if (rmdir(chroot_dir) != 0
+            && errno != ENOENT) {
+            LOGED(ERROR, guest_domid,
+                  "failed to remove existing chroot dir %s", chroot_dir);
+            return ERROR_FAIL;
+        }
+        
+        for (;;) {
+            if (!mkdir(chroot_dir, 0000))
+                break;
+            if (errno == EINTR) continue;
+            LOGED(ERROR, guest_domid,
+                  "failed to create chroot dir %s", chroot_dir);
+            return ERROR_FAIL;
+        }
+
+        // Add "-chroot [dir]" to command-line
+        flexarray_append(dm_args, "-chroot");
+        flexarray_append(dm_args, chroot_dir);
+    }
+
     if (state->saved_state) {
         /* This file descriptor is meant to be used by QEMU */
         *dm_state_fd = open(state->saved_state, O_RDONLY);
diff --git a/tools/tests/depriv/depriv-process-checker.sh b/tools/tests/depriv/depriv-process-checker.sh
index 4aa58e7760..cb240fd0ed 100755
--- a/tools/tests/depriv/depriv-process-checker.sh
+++ b/tools/tests/depriv/depriv-process-checker.sh
@@ -64,6 +64,23 @@ else
 fi
 echo $result
 
+# TEST: chroot
+#
+# Read /proc/<dmpid>/root to see if it's correct.
+echo -n "Chroot: "
+if [[ -n "$XEN_RUN_DIR" ]] ; then
+    tgt_chroot=$XEN_RUN_DIR/qemu-root-$domid
+    root=$(readlink /proc/$dmpid/root)
+    if [[ "$root" != "$tgt_chroot" ]] ; then
+	echo "FAILED"
+	failed="true"
+    else
+	echo "PASSED"
+    fi
+else
+    echo "SKIPPED (XEN_RUN_DIR undefined)"
+fi
+
 if $failed ; then
     exit 1
 else
-- 
2.18.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
  2018-09-21 17:04 ` [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv George Dunlap
  2018-09-21 17:04 ` [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot George Dunlap
@ 2018-09-21 17:04 ` George Dunlap
  2018-09-24  8:24   ` Paul Durrant
  2018-09-24 10:40   ` Ian Jackson
  2018-09-21 17:04 ` [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs George Dunlap
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-21 17:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu, George Dunlap

QEMU running under Xen doesn't need mount or IPC functionality.
Create and enter separate namespaces for each of these before
executing QEMU, so that in the event that other restrictions fail, the
process won't be able to even name system mount points or exsting
non-file-based IPC descriptors to attempt to attack them.

Unsharing is something a process can only do to itself (it would
seem); so add an os-specific "dm_preexec_restrict()" hook just before
we exec() the device model.

Also add checks to depriv-process-checker.sh to verify that dm is
running in a new namespace (or at least, a different one than the
caller).

Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Anthony Perard <anthony.perard@citrix.com>
---
 docs/designs/qemu-deprivilege.md             | 12 ++++++------
 tools/libxl/libxl_dm.c                       |  2 ++
 tools/libxl/libxl_freebsd.c                  |  5 +++++
 tools/libxl/libxl_internal.h                 |  7 +++++++
 tools/libxl/libxl_linux.c                    | 14 ++++++++++++++
 tools/libxl/libxl_netbsd.c                   |  5 +++++
 tools/tests/depriv/depriv-process-checker.sh | 18 ++++++++++++++++++
 7 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
index df5bb07d7c..58d5df6072 100644
--- a/docs/designs/qemu-deprivilege.md
+++ b/docs/designs/qemu-deprivilege.md
@@ -75,12 +75,6 @@ Then adds the following to the qemu command-line:
 	
 '''Tested''': Not tested
 
-## Restrictions / improvements still to do
-
-This lists potential restrictions still to do.  It is meant to be
-listed in order of ease of implementation, with low-hanging fruit
-first.
-
 ## Namespaces for unused functionality (Linux only)
 
 '''Description''': Enter QEMU into its own mount & IPC namespaces.
@@ -102,6 +96,12 @@ call:
 
 '''Tested''': Not tested
 
+# Restrictions / improvements still to do
+
+This lists potential restrictions still to do.  It is meant to be
+listed in order of ease of implementation, with low-hanging fruit
+first.
+
 ### Basic RLIMITs
 
 '''Description''': A number of limits on the resources that a given
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index df6fb64bee..6733514370 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -2391,6 +2391,8 @@ retry_transaction:
         goto out_close;
     if (!rc) { /* inner child */
         setsid();
+        if (libxl_defbool_val(b_info->dm_restrict))
+            libxl__local_dm_preexec_restrict(gc, logfile_w);
         libxl__exec(gc, null, logfile_w, logfile_w, dm, args, envs);
     }
 
diff --git a/tools/libxl/libxl_freebsd.c b/tools/libxl/libxl_freebsd.c
index 6442ccec72..aed2a2b115 100644
--- a/tools/libxl/libxl_freebsd.c
+++ b/tools/libxl/libxl_freebsd.c
@@ -245,3 +245,8 @@ int libxl__pci_topology_init(libxl__gc *gc,
 {
     return ERROR_NI;
 }
+
+void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
+{
+    return;
+}
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 802382c704..6e5ff977a6 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3782,6 +3782,13 @@ struct libxl__dm_spawn_state {
 
 _hidden void libxl__spawn_local_dm(libxl__egc *egc, libxl__dm_spawn_state*);
 
+/* 
+ * Called after forking but before executing the local devicemodel.
+ * Must call _exit(-1) on failure, printing an error message to
+ * stderrfd if available.
+ */
+_hidden void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd);
+
 /* Stubdom device models. */
 
 typedef struct {
diff --git a/tools/libxl/libxl_linux.c b/tools/libxl/libxl_linux.c
index 6ef0abc693..2eeac8df9a 100644
--- a/tools/libxl/libxl_linux.c
+++ b/tools/libxl/libxl_linux.c
@@ -307,6 +307,20 @@ int libxl__pci_topology_init(libxl__gc *gc,
     return err;
 }
 
+void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
+{
+    int rc;
+
+    /* Unshare mount and IPC namespaces.  These are unused by QEMU. */
+    rc = unshare(CLONE_NEWNS | CLONE_NEWIPC);
+    if (rc < 0) {
+        char *msg = GCSPRINTF("libxl: Mount and IPC namespace unfailed with error %d\n",
+                              errno);
+        write(stderrfd, msg, strlen(msg));
+        _exit(-1);
+    }
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_netbsd.c b/tools/libxl/libxl_netbsd.c
index 2edfb00641..dce3f1fdce 100644
--- a/tools/libxl/libxl_netbsd.c
+++ b/tools/libxl/libxl_netbsd.c
@@ -124,3 +124,8 @@ int libxl__pci_topology_init(libxl__gc *gc,
 {
     return ERROR_NI;
 }
+
+void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
+{
+    return;
+}
diff --git a/tools/tests/depriv/depriv-process-checker.sh b/tools/tests/depriv/depriv-process-checker.sh
index cb240fd0ed..7dc2573799 100755
--- a/tools/tests/depriv/depriv-process-checker.sh
+++ b/tools/tests/depriv/depriv-process-checker.sh
@@ -81,6 +81,24 @@ else
     echo "SKIPPED (XEN_RUN_DIR undefined)"
 fi
 
+# TEST: Namespace unsharing
+#
+# Read /proc/<dmpid>/ns/<namespace> and make sure it's not equal to
+# the current processes' value
+for nsname in ipc mnt; do
+    echo -n "Unshare namespace $nsname: "
+    dmns=$(readlink /proc/$dmpid/ns/$nsname)
+    myns=$(readlink /proc/self/ns/$nsname)
+
+    if [[ "$dmns" == "$myns" ]] ; then
+	echo "FAILED"
+	failed="true"
+    else
+	echo "PASSED"
+    fi
+done
+
+
 if $failed ; then
     exit 1
 else
-- 
2.18.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
                   ` (2 preceding siblings ...)
  2018-09-21 17:04 ` [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux George Dunlap
@ 2018-09-21 17:04 ` George Dunlap
  2018-09-24  8:34   ` Paul Durrant
  2018-09-24 10:48   ` Ian Jackson
  2018-09-21 17:04 ` [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing George Dunlap
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-21 17:04 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu, George Dunlap

Limit the ability of a potentially compromised QEMU to consume system
resources.  Key limits:
 - RLIMIT_FSIZE (file size): 256KiB
 - RLIMIT_NPROC (after uid changes to a unique uid)

Probably unnecessary limits but why not:
 - RLIMIT_CORE: 0
 - RLIMIT_MSGQUEUE: 0
 - RLIMIT_LOCKS: 0
 - RLIMIT_MEMLOCK: 0

NB that we do not yet set RLIMIT_AS (total virtual memory) or
RLIMIT_NOFILES (number of open files), since these require more care
and/or more coordination with QEMU to implement.

Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Anthony Perard <anthony.perard@citrix.com>
---
 docs/designs/qemu-deprivilege.md             | 12 ++---
 tools/libxl/libxl_linux.c                    | 53 ++++++++++++++++++++
 tools/tests/depriv/depriv-process-checker.sh | 38 ++++++++++++++
 3 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
index 58d5df6072..ca556005b5 100644
--- a/docs/designs/qemu-deprivilege.md
+++ b/docs/designs/qemu-deprivilege.md
@@ -96,12 +96,6 @@ call:
 
 '''Tested''': Not tested
 
-# Restrictions / improvements still to do
-
-This lists potential restrictions still to do.  It is meant to be
-listed in order of ease of implementation, with low-hanging fruit
-first.
-
 ### Basic RLIMITs
 
 '''Description''': A number of limits on the resources that a given
@@ -126,6 +120,12 @@ Probably not necessary but why not:
 
 '''Tested''': Not tested
 
+# Restrictions / improvements still to do
+
+This lists potential restrictions still to do.  It is meant to be
+listed in order of ease of implementation, with low-hanging fruit
+first.
+
 ### Further RLIMITs
 
 RLIMIT_AS limits the total amount of memory; but this includes the
diff --git a/tools/libxl/libxl_linux.c b/tools/libxl/libxl_linux.c
index 2eeac8df9a..a7ae5af30e 100644
--- a/tools/libxl/libxl_linux.c
+++ b/tools/libxl/libxl_linux.c
@@ -16,6 +16,7 @@
 #include "libxl_osdeps.h" /* must come before any other headers */
 
 #include "libxl_internal.h"
+#include <sys/resource.h>
  
 int libxl__try_phy_backend(mode_t st_mode)
 {
@@ -307,9 +308,44 @@ int libxl__pci_topology_init(libxl__gc *gc,
     return err;
 }
 
+static struct {
+    int resource;
+    rlim_t limit;
+} rlimits[] = {
+    {
+        .resource = RLIMIT_FSIZE,
+        /* Big enough for log files, not big enough for a DoS */
+        .limit = 256*1024,
+    },
+    {
+        .resource = RLIMIT_NPROC,
+        .limit = 0
+    },
+    {
+        .resource = RLIMIT_CORE,
+        .limit = 0
+    },
+    {
+        .resource = RLIMIT_MSGQUEUE,
+        .limit = 0
+    },
+    {
+        .resource = RLIMIT_LOCKS,
+        .limit = 0
+    },
+    {
+        .resource = RLIMIT_MEMLOCK,
+        .limit = 0
+    },
+    {
+        .resource = -1
+    }
+};
+
 void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
 {
     int rc;
+    unsigned i;
 
     /* Unshare mount and IPC namespaces.  These are unused by QEMU. */
     rc = unshare(CLONE_NEWNS | CLONE_NEWIPC);
@@ -319,6 +355,23 @@ void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
         write(stderrfd, msg, strlen(msg));
         _exit(-1);
     }
+
+    /* Set various "easy" rlimits */
+    for (i=0; rlimits[i].resource != -1; i++) {
+        struct rlimit rlim;
+
+        rlim.rlim_cur = rlim.rlim_max = rlimits[i].limit;
+        
+        rc = setrlimit(rlimits[i].resource, &rlim);
+        if (rc < 0) {
+            char *msg = GCSPRINTF("libxl: Setting rlimit %d to %lld failed with error %d\n",
+                                  rlimits[i].resource,
+                                  (unsigned long long)rlimits[i].limit, errno);
+            write(stderrfd, msg, strlen(msg));
+            _exit(-1);
+        }
+
+    }
 }
 
 /*
diff --git a/tools/tests/depriv/depriv-process-checker.sh b/tools/tests/depriv/depriv-process-checker.sh
index 7dc2573799..6a861bafa5 100755
--- a/tools/tests/depriv/depriv-process-checker.sh
+++ b/tools/tests/depriv/depriv-process-checker.sh
@@ -98,6 +98,44 @@ for nsname in ipc mnt; do
     fi
 done
 
+# TEST: RLIMITs
+#
+# Read /proc/<dmpid>/limits
+function check_rlimit() {
+    limit_name=$1
+    limit_string=$2
+    tgt=$3
+
+    echo -n "rlimit $limit_name: "
+    input=$(grep "^$limit_string" /proc/$dmpid/limits)
+    
+    if [[ -z "$input" ]] ; then
+	echo "Couldn't find limit $limit"
+	echo FAILED
+	failed="true"
+	return
+    fi
+    
+    if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then
+	if [[ "${BASH_REMATCH[1]}" != $tgt ||
+		  "${BASH_REMATCH[2]}" != $tgt ]] ; then
+	    echo "FAILED"
+	    failed="true"
+	else
+	    echo "PASSED"
+	fi
+    else
+	echo "Couldn't parse /proc/<dmpid>/limits"
+	echo "FAILED"
+	failed="true"
+    fi
+}
+check_rlimit FSIZE "Max file size" "262144"
+check_rlimit NPROC "Max processes" 0
+check_rlimit CORE "Max core file size" "0"
+check_rlimit MSGQUEUE "Max msgqueue size" 0
+check_rlimit LOCKS "Max file locks" 0
+check_rlimit MEMLOCK "Max locked memory" 0
 
 if $failed ; then
     exit 1
-- 
2.18.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
                   ` (3 preceding siblings ...)
  2018-09-21 17:04 ` [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs George Dunlap
@ 2018-09-21 17:04 ` George Dunlap
  2018-09-24 10:49   ` Ian Jackson
  2018-09-24 11:21   ` Ian Jackson
  2018-09-24  8:14 ` [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans Paul Durrant
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-21 17:04 UTC (permalink / raw)
  To: xen-devel
  Cc: Anthony Perard, Ian Jackson, Stefano Stabellini, Wei Liu, George Dunlap

QEMU has a `sandbox` feature, wherein it will use seccomp2 to restrict
what system calls it is able to make.

Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
This can't be checked in as-is, because `-sandbox` support may not have
been compiled in.  We therefore need to either:
 1. Require that this feature be enabled to build qemu
 2. Check for `-sandbox` support at runtime before

CC: Ian Jackson <ian.jackson@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Anthony Perard <anthony.perard@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
 tools/libxl/libxl_dm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 6733514370..b541c1a55e 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -1448,6 +1448,10 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
         // Add "-chroot [dir]" to command-line
         flexarray_append(dm_args, "-chroot");
         flexarray_append(dm_args, chroot_dir);
+
+        // Add sandboxing
+        flexarray_append(dm_args, "-sandbox");
+        flexarray_append(dm_args, "on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny");
     }
 
     if (state->saved_state) {
-- 
2.18.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
                   ` (4 preceding siblings ...)
  2018-09-21 17:04 ` [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing George Dunlap
@ 2018-09-24  8:14 ` Paul Durrant
  2018-09-24 10:23 ` Ian Jackson
  2018-09-25 11:20 ` Anthony PERARD
  7 siblings, 0 replies; 39+ messages in thread
From: Paul Durrant @ 2018-09-24  8:14 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Wilk, Andrew Cooper,
	Tim (Xen.org),
	George Dunlap, Ross Lagerwall, Julien Grall, Jan Beulich,
	Anthony Perard, Ian Jackson

> -----Original Message-----
[snip]
> diff --git a/docs/features/qemu-deprivilege.pandoc b/docs/features/qemu-
> deprivilege.pandoc
> new file mode 100644
> index 0000000000..b377076606
> --- /dev/null
> +++ b/docs/features/qemu-deprivilege.pandoc
> @@ -0,0 +1,91 @@
> +% QEMU Deprivileging / dm_restrict
> +% Revision 1
> +
> +\clearpage
> +
> +# Basics
> +
> +---------------- ----------------------------------------------------
> +         Status: **Tech Preview**
> +
> +Architecture(s): x86
> +
> +   Component(s): toolstack
> +
> +---------------- ----------------------------------------------------
> +
> +# Overview
> +
> +By default, the QEMU device model is run in domain 0.  If an attacker
> +can gain control of a QEMU process, it could easily take control of a
> +system.
> +
> +dm_restrict is a set of operations to restrict QEMU running in domain
> +0.  It consists of two halves:
> +
> + 1. Mechanisms to restrict QEMU to only being able to affect its own
> +domain
> + 2. Mechanisms to restruct QEMU's ability to interact with domain 0.
> +
> +# User details
> +
> +## Getting the right versions of software
> +
> +Linux: 4.11+
> +
> +Qemu: 3.0+ (Or the version that comes with Xen 4.12+)
> +
> +## Setting up a userid range
> +
> +For maximum security, libxl needs to run the devicemodel for each
> +domain under a user id (UID) corresponding to its domain id.  There
> +are 32752 possible domain IDs, and so libxl needs 32752 user ids set
> +aside for it.
> +
> +The simplest and most effective way to do this is to allocate a
> +contiguous block of UIDs, and create a single user named
> +`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
> +
> +    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
> +
> +NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
> +to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
> +UIDs however.  UID 65535 is reserved for an invalid value, and 65534
> +is normally allocated to "nobody".
> +
> +Another, less-secure way is to run all QEMUs as the same UID.  To do
> +this, create a user named `xen-qemuuser-shared`; for example:
> +
> +    adduser --no-create-home --system xen-qemuuser-shared
> +
> +## Domain config changes
> +
> +The core domain config change is to add the following line to the
> +domain configuration:
> +
> +    dm_restrict=1
> +
> +This will perform a number of restrictions, outlined below in the
> +'Technical details' section.
> +
> +# Technical details
> +
> +See docs/design/qemu-deprivilege.txt for technical details.

Nit... I guess you mean docs/design/qemu-deprivilege.md?

  Paul

> +
> +# Limitations
> +
> +The following features still need to be implemented:
> + * Inserting a new cdrom while the guest is running (xl cdrom-insert)
> + * Migration / save / restore
> +
> +Additionally, getting PCI passthrough to work securely would require a
> +significant rework of how passthrough works at the moment.  It may be
> +implemented at some point but is not a near-term priority.
> +
> +# History
> +
> +------------------------------------------------------------------------
> +Date       Revision Version  Notes
> +---------- -------- -------- -------------------------------------------
> +2018-09-14 1        Xen 4.12 Imported from docs/misc
> +---------- -------- -------- -------------------------------------------
> diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-
> deprivilege.txt
> deleted file mode 100644
> index 58b86a3908..0000000000
> --- a/docs/misc/qemu-deprivilege.txt
> +++ /dev/null
> @@ -1,36 +0,0 @@
> -For security reasons, libxl tries to pass a non-root username to QEMU as
> -argument. During initialization QEMU calls setuid and setgid with the
> -user ID and the group ID of the user passed as argument.
> -Libxl looks for the following users in this order:
> -
> -1) a user named "xen-qemuuser-domid$domid",
> -Where $domid is the domid of the domain being created.
> -This requires the reservation of 65535 uids from xen-qemuuser-domid1
> -to xen-qemuuser-domid65535. To use this mechanism, you might want to
> -create a large number of users at installation time. For example:
> -
> -for ((i=1; i<65536; i++))
> -do
> -    adduser --no-create-home --system xen-qemuuser-domid$i
> -done
> -
> -You might want to consider passing --group to adduser to create a new
> -group for each new user.
> -
> -
> -2) a user named "xen-qemuuser-shared"
> -As a fall back if both 1) fails, libxl will use a single user for
> -all QEMU instances. The user is named xen-qemuuser-shared. This is
> -less secure but still better than running QEMU as root. Using this is as
> -simple as creating just one more user on your host:
> -
> -adduser --no-create-home --system xen-qemuuser-shared
> -
> -
> -3) root
> -As a last resort, libxl will start QEMU as root.
> -
> -
> -Please note that running QEMU as non-root causes several features like
> -migration and PCI passthrough to not work properly and may prevent the
> guest
> -from booting.
> --
> 2.18.0
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
  2018-09-21 17:04 ` [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot George Dunlap
@ 2018-09-24  8:20   ` Paul Durrant
  2018-09-24  9:50     ` George Dunlap
  2018-09-24 10:31   ` Ian Jackson
  1 sibling, 1 reply; 39+ messages in thread
From: Paul Durrant @ 2018-09-24  8:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu, George Dunlap

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xenproject.org] On Behalf
> Of George Dunlap
> Sent: 21 September 2018 18:04
> To: xen-devel@lists.xenproject.org
> Cc: Anthony Perard <anthony.perard@citrix.com>; Ian Jackson
> <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>
> Subject: [Xen-devel] [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
> 
> When dm_restrict is enabled, ask QEMU to chroot into an empty directory.
> 
> * Create /var/run/qemu/root-domid (deleting the old one if it's there)
> * Pass the -chroot option to QEMU
> 
> Rather than running `rm -rf` on the directory before creating it
> (since there is no library function to do this), simply rmdir the
> directory, relying on the fact that the previous QEMU instance, if
> properly restcirted, shouldn't have been able to write anything

^ typo... 'restricted'

> anyway.
> 
> Also test that this worked in depriv-process-checker.sh.  This
> requires XEN_RUN_DIR to be set in the environment so we know where to
> look.
> 
> Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
> ---
> Not sure the best way to get XEN_RUN_DIR; having configure process this
> file seems like a bit overkill.
> 
> CC: Ian Jackson <ian.jackson@citrix.com>
> CC: Wei Liu <wei.liu2@citrix.com>
> CC: Anthony Perard <anthony.perard@citrix.com>
> ---
>  docs/designs/qemu-deprivilege.md             | 12 +++---
>  tools/libxl/libxl_dm.c                       | 39 +++++++++++++++++++-
>  tools/tests/depriv/depriv-process-checker.sh | 17 +++++++++
>  3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-
> deprivilege.md
> index 1e731c16aa..df5bb07d7c 100644
> --- a/docs/designs/qemu-deprivilege.md
> +++ b/docs/designs/qemu-deprivilege.md
> @@ -58,12 +58,6 @@ FIXME: Double-check the correctness of the above
> 
>  '''Testing status''': Tested
> 
> -# Restrictions / improvements still to do
> -
> -This lists potential restrictions still to do.  It is meant to be
> -listed in order of ease of implementation, with low-hanging fruit
> -first.
> -
>  ## Chroot
> 
>  '''Description''': Qemu runs in its own chroot, such that even if it
> @@ -81,6 +75,12 @@ Then adds the following to the qemu command-line:
> 
>  '''Tested''': Not tested

^ should this change to 'tested' now?

  Paul

> 
> +## Restrictions / improvements still to do
> +
> +This lists potential restrictions still to do.  It is meant to be
> +listed in order of ease of implementation, with low-hanging fruit
> +first.
> +
>  ## Namespaces for unused functionality (Linux only)
> 
>  '''Description''': Enter QEMU into its own mount & IPC namespaces.
> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
> index abd31ee6f2..df6fb64bee 100644
> --- a/tools/libxl/libxl_dm.c
> +++ b/tools/libxl/libxl_dm.c
> @@ -1410,9 +1410,46 @@ static int
> libxl__build_device_model_args_new(libxl__gc *gc,
>          }
>      }
> 
> -    if (libxl_defbool_val(b_info->dm_restrict))
> +    if (libxl_defbool_val(b_info->dm_restrict)) {
> +        char * chroot_dir = GCSPRINTF("%s/qemu-root-%d",
> +                                      libxl__run_dir_path(),
> guest_domid);
> +
>          flexarray_append(dm_args, "-xen-domid-restrict");
> 
> +        /*
> +         * Run QEMU in a chroot at /var/run/xen/root-%d
> +         *
> +         * There is no library function to do the equivalent of `rm
> +         * -rf`.  However deprivileged QEMU in theory shouldn't be
> +         * able to write any files anyway, as the chroot would be
> +         * owned by root, but it would be running as an unprivileged
> +         * process.  So in theory, old chroots should always be empty.
> +         *
> +         * rmdir the directory before attempting to create
> +         * it; if it returns anything other than ENOENT, fail domain
> +         * creation.
> +         */
> +        if (rmdir(chroot_dir) != 0
> +            && errno != ENOENT) {
> +            LOGED(ERROR, guest_domid,
> +                  "failed to remove existing chroot dir %s", chroot_dir);
> +            return ERROR_FAIL;
> +        }
> +
> +        for (;;) {
> +            if (!mkdir(chroot_dir, 0000))
> +                break;
> +            if (errno == EINTR) continue;
> +            LOGED(ERROR, guest_domid,
> +                  "failed to create chroot dir %s", chroot_dir);
> +            return ERROR_FAIL;
> +        }
> +
> +        // Add "-chroot [dir]" to command-line
> +        flexarray_append(dm_args, "-chroot");
> +        flexarray_append(dm_args, chroot_dir);
> +    }
> +
>      if (state->saved_state) {
>          /* This file descriptor is meant to be used by QEMU */
>          *dm_state_fd = open(state->saved_state, O_RDONLY);
> diff --git a/tools/tests/depriv/depriv-process-checker.sh
> b/tools/tests/depriv/depriv-process-checker.sh
> index 4aa58e7760..cb240fd0ed 100755
> --- a/tools/tests/depriv/depriv-process-checker.sh
> +++ b/tools/tests/depriv/depriv-process-checker.sh
> @@ -64,6 +64,23 @@ else
>  fi
>  echo $result
> 
> +# TEST: chroot
> +#
> +# Read /proc/<dmpid>/root to see if it's correct.
> +echo -n "Chroot: "
> +if [[ -n "$XEN_RUN_DIR" ]] ; then
> +    tgt_chroot=$XEN_RUN_DIR/qemu-root-$domid
> +    root=$(readlink /proc/$dmpid/root)
> +    if [[ "$root" != "$tgt_chroot" ]] ; then
> +	echo "FAILED"
> +	failed="true"
> +    else
> +	echo "PASSED"
> +    fi
> +else
> +    echo "SKIPPED (XEN_RUN_DIR undefined)"
> +fi
> +
>  if $failed ; then
>      exit 1
>  else
> --
> 2.18.0
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux
  2018-09-21 17:04 ` [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux George Dunlap
@ 2018-09-24  8:24   ` Paul Durrant
  2018-09-24 10:40   ` Ian Jackson
  1 sibling, 0 replies; 39+ messages in thread
From: Paul Durrant @ 2018-09-24  8:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu, George Dunlap

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xenproject.org] On Behalf
> Of George Dunlap
> Sent: 21 September 2018 18:04
> To: xen-devel@lists.xenproject.org
> Cc: Anthony Perard <anthony.perard@citrix.com>; Ian Jackson
> <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>
> Subject: [Xen-devel] [PATCH v2 4/6] tools/dm_restrict: Unshare mount and
> IPC namespaces on Linux
> 
> QEMU running under Xen doesn't need mount or IPC functionality.
> Create and enter separate namespaces for each of these before
> executing QEMU, so that in the event that other restrictions fail, the
> process won't be able to even name system mount points or exsting
> non-file-based IPC descriptors to attempt to attack them.
> 
> Unsharing is something a process can only do to itself (it would
> seem); so add an os-specific "dm_preexec_restrict()" hook just before
> we exec() the device model.
> 
> Also add checks to depriv-process-checker.sh to verify that dm is
> running in a new namespace (or at least, a different one than the
> caller).
> 
> Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
> ---
> 
> CC: Ian Jackson <ian.jackson@citrix.com>
> CC: Wei Liu <wei.liu2@citrix.com>
> CC: Anthony Perard <anthony.perard@citrix.com>
> ---
>  docs/designs/qemu-deprivilege.md             | 12 ++++++------
>  tools/libxl/libxl_dm.c                       |  2 ++
>  tools/libxl/libxl_freebsd.c                  |  5 +++++
>  tools/libxl/libxl_internal.h                 |  7 +++++++
>  tools/libxl/libxl_linux.c                    | 14 ++++++++++++++
>  tools/libxl/libxl_netbsd.c                   |  5 +++++
>  tools/tests/depriv/depriv-process-checker.sh | 18 ++++++++++++++++++
>  7 files changed, 57 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-
> deprivilege.md
> index df5bb07d7c..58d5df6072 100644
> --- a/docs/designs/qemu-deprivilege.md
> +++ b/docs/designs/qemu-deprivilege.md
> @@ -75,12 +75,6 @@ Then adds the following to the qemu command-line:
> 
>  '''Tested''': Not tested
> 
> -## Restrictions / improvements still to do
> -
> -This lists potential restrictions still to do.  It is meant to be
> -listed in order of ease of implementation, with low-hanging fruit
> -first.
> -
>  ## Namespaces for unused functionality (Linux only)
> 
>  '''Description''': Enter QEMU into its own mount & IPC namespaces.
> @@ -102,6 +96,12 @@ call:
> 
>  '''Tested''': Not tested

Again.. should this not become 'tested'?

> 
> +# Restrictions / improvements still to do
> +
> +This lists potential restrictions still to do.  It is meant to be
> +listed in order of ease of implementation, with low-hanging fruit
> +first.
> +
>  ### Basic RLIMITs
> 
>  '''Description''': A number of limits on the resources that a given
> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
> index df6fb64bee..6733514370 100644
> --- a/tools/libxl/libxl_dm.c
> +++ b/tools/libxl/libxl_dm.c
> @@ -2391,6 +2391,8 @@ retry_transaction:
>          goto out_close;
>      if (!rc) { /* inner child */
>          setsid();
> +        if (libxl_defbool_val(b_info->dm_restrict))
> +            libxl__local_dm_preexec_restrict(gc, logfile_w);
>          libxl__exec(gc, null, logfile_w, logfile_w, dm, args, envs);
>      }
> 
> diff --git a/tools/libxl/libxl_freebsd.c b/tools/libxl/libxl_freebsd.c
> index 6442ccec72..aed2a2b115 100644
> --- a/tools/libxl/libxl_freebsd.c
> +++ b/tools/libxl/libxl_freebsd.c
> @@ -245,3 +245,8 @@ int libxl__pci_topology_init(libxl__gc *gc,
>  {
>      return ERROR_NI;
>  }
> +
> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
> +{
> +    return;
> +}
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 802382c704..6e5ff977a6 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -3782,6 +3782,13 @@ struct libxl__dm_spawn_state {
> 
>  _hidden void libxl__spawn_local_dm(libxl__egc *egc,
> libxl__dm_spawn_state*);
> 
> +/*
> + * Called after forking but before executing the local devicemodel.
> + * Must call _exit(-1) on failure, printing an error message to
> + * stderrfd if available.
> + */
> +_hidden void libxl__local_dm_preexec_restrict(libxl__gc *gc, int
> stderrfd);
> +
>  /* Stubdom device models. */
> 
>  typedef struct {
> diff --git a/tools/libxl/libxl_linux.c b/tools/libxl/libxl_linux.c
> index 6ef0abc693..2eeac8df9a 100644
> --- a/tools/libxl/libxl_linux.c
> +++ b/tools/libxl/libxl_linux.c
> @@ -307,6 +307,20 @@ int libxl__pci_topology_init(libxl__gc *gc,
>      return err;
>  }
> 
> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
> +{
> +    int rc;
> +
> +    /* Unshare mount and IPC namespaces.  These are unused by QEMU. */
> +    rc = unshare(CLONE_NEWNS | CLONE_NEWIPC);
> +    if (rc < 0) {
> +        char *msg = GCSPRINTF("libxl: Mount and IPC namespace unfailed
> with error %d\n",
> +                              errno);
> +        write(stderrfd, msg, strlen(msg));
> +        _exit(-1);
> +    }
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxl/libxl_netbsd.c b/tools/libxl/libxl_netbsd.c
> index 2edfb00641..dce3f1fdce 100644
> --- a/tools/libxl/libxl_netbsd.c
> +++ b/tools/libxl/libxl_netbsd.c
> @@ -124,3 +124,8 @@ int libxl__pci_topology_init(libxl__gc *gc,
>  {
>      return ERROR_NI;
>  }
> +
> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
> +{
> +    return;
> +}
> diff --git a/tools/tests/depriv/depriv-process-checker.sh
> b/tools/tests/depriv/depriv-process-checker.sh
> index cb240fd0ed..7dc2573799 100755
> --- a/tools/tests/depriv/depriv-process-checker.sh
> +++ b/tools/tests/depriv/depriv-process-checker.sh
> @@ -81,6 +81,24 @@ else
>      echo "SKIPPED (XEN_RUN_DIR undefined)"
>  fi
> 
> +# TEST: Namespace unsharing
> +#
> +# Read /proc/<dmpid>/ns/<namespace> and make sure it's not equal to
> +# the current processes' value
> +for nsname in ipc mnt; do
> +    echo -n "Unshare namespace $nsname: "
> +    dmns=$(readlink /proc/$dmpid/ns/$nsname)
> +    myns=$(readlink /proc/self/ns/$nsname)
> +
> +    if [[ "$dmns" == "$myns" ]] ; then
> +	echo "FAILED"
> +	failed="true"
> +    else
> +	echo "PASSED"
> +    fi
> +done
> +
> +
>  if $failed ; then
>      exit 1
>  else
> --
> 2.18.0
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
  2018-09-21 17:04 ` [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs George Dunlap
@ 2018-09-24  8:34   ` Paul Durrant
       [not found]     ` <CAFLBxZanne777R0WfJLuGqTv3dx=A+W0Z5UsW9B0i4MFMWWKXw@mail.gmail.com>
  2018-09-24 10:48   ` Ian Jackson
  1 sibling, 1 reply; 39+ messages in thread
From: Paul Durrant @ 2018-09-24  8:34 UTC (permalink / raw)
  To: xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu, George Dunlap

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xenproject.org] On Behalf
> Of George Dunlap
> Sent: 21 September 2018 18:04
> To: xen-devel@lists.xenproject.org
> Cc: Anthony Perard <anthony.perard@citrix.com>; Ian Jackson
> <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> <George.Dunlap@citrix.com>
> Subject: [Xen-devel] [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
> 
> Limit the ability of a potentially compromised QEMU to consume system
> resources.  Key limits:
>  - RLIMIT_FSIZE (file size): 256KiB
>  - RLIMIT_NPROC (after uid changes to a unique uid)
> 
> Probably unnecessary limits but why not:
>  - RLIMIT_CORE: 0
>  - RLIMIT_MSGQUEUE: 0
>  - RLIMIT_LOCKS: 0
>  - RLIMIT_MEMLOCK: 0
> 
> NB that we do not yet set RLIMIT_AS (total virtual memory) or
> RLIMIT_NOFILES (number of open files), since these require more care
> and/or more coordination with QEMU to implement.
> 
> Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
> ---
> CC: Ian Jackson <ian.jackson@citrix.com>
> CC: Wei Liu <wei.liu2@citrix.com>
> CC: Anthony Perard <anthony.perard@citrix.com>
> ---
>  docs/designs/qemu-deprivilege.md             | 12 ++---
>  tools/libxl/libxl_linux.c                    | 53 ++++++++++++++++++++
>  tools/tests/depriv/depriv-process-checker.sh | 38 ++++++++++++++
>  3 files changed, 97 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-
> deprivilege.md
> index 58d5df6072..ca556005b5 100644
> --- a/docs/designs/qemu-deprivilege.md
> +++ b/docs/designs/qemu-deprivilege.md
> @@ -96,12 +96,6 @@ call:
> 
>  '''Tested''': Not tested
> 
> -# Restrictions / improvements still to do
> -
> -This lists potential restrictions still to do.  It is meant to be
> -listed in order of ease of implementation, with low-hanging fruit
> -first.
> -
>  ### Basic RLIMITs
> 
>  '''Description''': A number of limits on the resources that a given
> @@ -126,6 +120,12 @@ Probably not necessary but why not:
> 
>  '''Tested''': Not tested
> 
> +# Restrictions / improvements still to do
> +
> +This lists potential restrictions still to do.  It is meant to be
> +listed in order of ease of implementation, with low-hanging fruit
> +first.
> +
>  ### Further RLIMITs
> 
>  RLIMIT_AS limits the total amount of memory; but this includes the
> diff --git a/tools/libxl/libxl_linux.c b/tools/libxl/libxl_linux.c
> index 2eeac8df9a..a7ae5af30e 100644
> --- a/tools/libxl/libxl_linux.c
> +++ b/tools/libxl/libxl_linux.c
> @@ -16,6 +16,7 @@
>  #include "libxl_osdeps.h" /* must come before any other headers */
> 
>  #include "libxl_internal.h"
> +#include <sys/resource.h>
> 
>  int libxl__try_phy_backend(mode_t st_mode)
>  {
> @@ -307,9 +308,44 @@ int libxl__pci_topology_init(libxl__gc *gc,
>      return err;
>  }
> 
> +static struct {
> +    int resource;
> +    rlim_t limit;
> +} rlimits[] = {
> +    {
> +        .resource = RLIMIT_FSIZE,
> +        /* Big enough for log files, not big enough for a DoS */
> +        .limit = 256*1024,
> +    },
> +    {
> +        .resource = RLIMIT_NPROC,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_CORE,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_MSGQUEUE,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_LOCKS,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_MEMLOCK,
> +        .limit = 0
> +    },
> +    {
> +        .resource = -1

Is -1 guaranteed not to clash with any defined resource type?

> +    }
> +};
> +
>  void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
>  {
>      int rc;
> +    unsigned i;
> 
>      /* Unshare mount and IPC namespaces.  These are unused by QEMU. */
>      rc = unshare(CLONE_NEWNS | CLONE_NEWIPC);
> @@ -319,6 +355,23 @@ void libxl__local_dm_preexec_restrict(libxl__gc *gc,
> int stderrfd)
>          write(stderrfd, msg, strlen(msg));
>          _exit(-1);
>      }
> +
> +    /* Set various "easy" rlimits */
> +    for (i=0; rlimits[i].resource != -1; i++) {

Couldn't figure out whether this is a coding style violation or not but I think 'i=0' should be 'i = 0'

  Paul

> +        struct rlimit rlim;
> +
> +        rlim.rlim_cur = rlim.rlim_max = rlimits[i].limit;
> +
> +        rc = setrlimit(rlimits[i].resource, &rlim);
> +        if (rc < 0) {
> +            char *msg = GCSPRINTF("libxl: Setting rlimit %d to %lld
> failed with error %d\n",
> +                                  rlimits[i].resource,
> +                                  (unsigned long long)rlimits[i].limit,
> errno);
> +            write(stderrfd, msg, strlen(msg));
> +            _exit(-1);
> +        }
> +
> +    }
>  }
> 
>  /*
> diff --git a/tools/tests/depriv/depriv-process-checker.sh
> b/tools/tests/depriv/depriv-process-checker.sh
> index 7dc2573799..6a861bafa5 100755
> --- a/tools/tests/depriv/depriv-process-checker.sh
> +++ b/tools/tests/depriv/depriv-process-checker.sh
> @@ -98,6 +98,44 @@ for nsname in ipc mnt; do
>      fi
>  done
> 
> +# TEST: RLIMITs
> +#
> +# Read /proc/<dmpid>/limits
> +function check_rlimit() {
> +    limit_name=$1
> +    limit_string=$2
> +    tgt=$3
> +
> +    echo -n "rlimit $limit_name: "
> +    input=$(grep "^$limit_string" /proc/$dmpid/limits)
> +
> +    if [[ -z "$input" ]] ; then
> +	echo "Couldn't find limit $limit"
> +	echo FAILED
> +	failed="true"
> +	return
> +    fi
> +
> +    if [[ "$input" =~
> ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:spa
> ce:]]*[^[:space:]]+ ]] ; then
> +	if [[ "${BASH_REMATCH[1]}" != $tgt ||
> +		  "${BASH_REMATCH[2]}" != $tgt ]] ; then
> +	    echo "FAILED"
> +	    failed="true"
> +	else
> +	    echo "PASSED"
> +	fi
> +    else
> +	echo "Couldn't parse /proc/<dmpid>/limits"
> +	echo "FAILED"
> +	failed="true"
> +    fi
> +}
> +check_rlimit FSIZE "Max file size" "262144"
> +check_rlimit NPROC "Max processes" 0
> +check_rlimit CORE "Max core file size" "0"
> +check_rlimit MSGQUEUE "Max msgqueue size" 0
> +check_rlimit LOCKS "Max file locks" 0
> +check_rlimit MEMLOCK "Max locked memory" 0
> 
>  if $failed ; then
>      exit 1
> --
> 2.18.0
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
  2018-09-24  8:20   ` Paul Durrant
@ 2018-09-24  9:50     ` George Dunlap
  2018-09-24  9:52       ` Paul Durrant
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24  9:50 UTC (permalink / raw)
  To: Paul Durrant, xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu

On 09/24/2018 09:20 AM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Xen-devel [mailto:xen-devel-bounces@lists.xenproject.org] On Behalf
>> Of George Dunlap
>> Sent: 21 September 2018 18:04
>> To: xen-devel@lists.xenproject.org
>> Cc: Anthony Perard <anthony.perard@citrix.com>; Ian Jackson
>> <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
>> <George.Dunlap@citrix.com>
>> Subject: [Xen-devel] [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
>>
>> When dm_restrict is enabled, ask QEMU to chroot into an empty directory.
>>
>> * Create /var/run/qemu/root-domid (deleting the old one if it's there)
>> * Pass the -chroot option to QEMU
>>
>> Rather than running `rm -rf` on the directory before creating it
>> (since there is no library function to do this), simply rmdir the
>> directory, relying on the fact that the previous QEMU instance, if
>> properly restcirted, shouldn't have been able to write anything
> 
> ^ typo... 'restricted'

Oops -- fixed, thanks.

>> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-
>> deprivilege.md
>> index 1e731c16aa..df5bb07d7c 100644
>> --- a/docs/designs/qemu-deprivilege.md
>> +++ b/docs/designs/qemu-deprivilege.md
>> @@ -58,12 +58,6 @@ FIXME: Double-check the correctness of the above
>>
>>  '''Testing status''': Tested
>>
>> -# Restrictions / improvements still to do
>> -
>> -This lists potential restrictions still to do.  It is meant to be
>> -listed in order of ease of implementation, with low-hanging fruit
>> -first.
>> -
>>  ## Chroot
>>
>>  '''Description''': Qemu runs in its own chroot, such that even if it
>> @@ -81,6 +75,12 @@ Then adds the following to the qemu command-line:
>>
>>  '''Tested''': Not tested
> 
> ^ should this change to 'tested' now?

I sort of went back and forth here between whether this should mean 'a
test it available' (i.e., depriv-process-checker.sh checks it) and 'this
is actively being tested' (i.e., by osstest).  Here I ended up going
with the second option, but that makes a weird dependency between
xen.git and osstest.

One option, I suppose, would be to change this to "Test implemented" or
something.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
  2018-09-24  9:50     ` George Dunlap
@ 2018-09-24  9:52       ` Paul Durrant
  0 siblings, 0 replies; 39+ messages in thread
From: Paul Durrant @ 2018-09-24  9:52 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Anthony Perard, Ian Jackson, Wei Liu

> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: 24 September 2018 10:50
> To: Paul Durrant <Paul.Durrant@citrix.com>; xen-devel@lists.xenproject.org
> Cc: Anthony Perard <anthony.perard@citrix.com>; Ian Jackson
> <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>
> Subject: Re: [Xen-devel] [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to
> chroot
> 
> On 09/24/2018 09:20 AM, Paul Durrant wrote:
> >> -----Original Message-----
> >> From: Xen-devel [mailto:xen-devel-bounces@lists.xenproject.org] On
> Behalf
> >> Of George Dunlap
> >> Sent: 21 September 2018 18:04
> >> To: xen-devel@lists.xenproject.org
> >> Cc: Anthony Perard <anthony.perard@citrix.com>; Ian Jackson
> >> <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; George Dunlap
> >> <George.Dunlap@citrix.com>
> >> Subject: [Xen-devel] [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to
> chroot
> >>
> >> When dm_restrict is enabled, ask QEMU to chroot into an empty
> directory.
> >>
> >> * Create /var/run/qemu/root-domid (deleting the old one if it's there)
> >> * Pass the -chroot option to QEMU
> >>
> >> Rather than running `rm -rf` on the directory before creating it
> >> (since there is no library function to do this), simply rmdir the
> >> directory, relying on the fact that the previous QEMU instance, if
> >> properly restcirted, shouldn't have been able to write anything
> >
> > ^ typo... 'restricted'
> 
> Oops -- fixed, thanks.
> 
> >> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-
> >> deprivilege.md
> >> index 1e731c16aa..df5bb07d7c 100644
> >> --- a/docs/designs/qemu-deprivilege.md
> >> +++ b/docs/designs/qemu-deprivilege.md
> >> @@ -58,12 +58,6 @@ FIXME: Double-check the correctness of the above
> >>
> >>  '''Testing status''': Tested
> >>
> >> -# Restrictions / improvements still to do
> >> -
> >> -This lists potential restrictions still to do.  It is meant to be
> >> -listed in order of ease of implementation, with low-hanging fruit
> >> -first.
> >> -
> >>  ## Chroot
> >>
> >>  '''Description''': Qemu runs in its own chroot, such that even if it
> >> @@ -81,6 +75,12 @@ Then adds the following to the qemu command-line:
> >>
> >>  '''Tested''': Not tested
> >
> > ^ should this change to 'tested' now?
> 
> I sort of went back and forth here between whether this should mean 'a
> test it available' (i.e., depriv-process-checker.sh checks it) and 'this
> is actively being tested' (i.e., by osstest).  Here I ended up going
> with the second option, but that makes a weird dependency between
> xen.git and osstest.
> 
> One option, I suppose, would be to change this to "Test implemented" or
> something.
> 

Ok, 'Test implemented' sounds good.

  Paul

>  -George
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
                   ` (5 preceding siblings ...)
  2018-09-24  8:14 ` [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans Paul Durrant
@ 2018-09-24 10:23 ` Ian Jackson
  2018-09-24 13:08   ` George Dunlap
  2018-09-25 11:20 ` Anthony PERARD
  7 siblings, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 10:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Konrad Wilk, Andrew Cooper,
	Tim Deegan, Ross Lagerwall, Julien Grall, Jan Beulich,
	Anthony Perard, xen-devel

George Dunlap writes ("[PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans"):
> +## Xen library / file-descriptor restrictions
> +
> +'''Description''': Close and restrict Xen-related file descriptors.
> +Specifically:
> + * Close all xenstore-related file descriptors

This is correct.

> + * Make sure that extraneous `privcmd` and `evtchn` instances are
> +closed

No, *all* privcmd and evtchn instances are restricted, even
`extraneous' ones which have been leaked by qemu.  None are closed.

> +'''How to test''':
> +
> +    tools/test/depriv/depriv-fd-checker.c

You also need the tool `fishdescriptor' from src:chiark-utils to get
the descriptors out of qemu.  It is in chiark-utils-bin in Debian
buster and Debian stretch-backports.

> +## Namespaces for unused functionality (Linux only)
> +
> +'''Description''': Enter QEMU into its own mount & IPC namespaces.
> +This means that even if other restrictions fail, the process won't be
> +able to even name system mount points or exsting non-file-based IPC
> +descriptors to attempt to attack them.
> +
> +'''Implementation''':
> +
> +In theory this could be done in QEMU (similar to -sandbox, -runas,
> +-chroot, and so on), but a patch doing this in QEMU was NAKed
> +upstream. They preferred that this was done as a setup step by
> +whatever executes QEMU; i.e., have the process which exec's QEMU first
> +call:
> +
> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)

If you are recording this kind of information here: this will of
course not work, because qemu binds and opens things at startup that
would be broken by this.  Maybe you want to give a url to a mailing
list posting instead of this un-referenced hearsay.

> +### Network namespacing (Linux only)
> +
> +Enter QEMU into its own network namespace (in addition to mount & IPC
> +namespaces).  Basically change the 'unshare' call to be as follows:
> +
> +    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
> +
> +QEMU does actually use the network namespace by default, so adding
> +this restriction requires additional changes, listed below.

The CLONE_NEWIPC overlaps with the IPC unshare discussed above.

> +## Setting up a userid range

There was some discussion on a Debian list recently about some
container systems that encode a 16-bit within-container uid and a
16-bit container number into the 32-bit uid.  I guess we don't need to
explicitly worry about clashes between our usage and those ?

> +# Limitations
> +
> +The following features still need to be implemented:
> + * Inserting a new cdrom while the guest is running (xl cdrom-insert)
> + * Migration / save / restore
> +
> +Additionally, getting PCI passthrough to work securely would require a
> +significant rework of how passthrough works at the moment.  It may be
> +implemented at some point but is not a near-term priority.

The limitations section should also say something like this:

 The currently implemented restrictions are thought to be a useful
 security improvement.  However, the design and implementation is
 preliminary and there is work left to do.  Accordingly we do not
 promise that they are sufficient to stop a rogue domain which takes
 control of its qemu from escaping into the host, let alone stop it
 from denying service to the host.

 Therefore, bugs which affect the effectiveness of the qemu depriv
 mechanisms will be treated as plain bugs, not security bugs; they
 would not result in a Xen Project Security Advisory.  However, bugs
 where the security of a system with dm_restrict=1 is worse than
 before, will be treated as security bugs.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot
  2018-09-21 17:04 ` [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot George Dunlap
  2018-09-24  8:20   ` Paul Durrant
@ 2018-09-24 10:31   ` Ian Jackson
  1 sibling, 0 replies; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 10:31 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Wei Liu

George Dunlap writes ("[PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot"):
> When dm_restrict is enabled, ask QEMU to chroot into an empty directory.

This is great.  However,

1. Can you please arrange to install depriv-process-checker.sh ?
   That way osstest can run it.

2. depriv-process-checker.sh should be called depriv-process-checker
   so that we do not need to rename it if we change the implementation
   language.  (In general `foo.sh' `foo.pl' are a bad idea.)

3. If XEN_RUN_DIR is not set, that should be a test failure, not a
   silent "pass" although the test was not run, so I think it should
   exit nonzero.

4. Style:

> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
> index abd31ee6f2..df6fb64bee 100644
> -    if (libxl_defbool_val(b_info->dm_restrict))
> +    if (libxl_defbool_val(b_info->dm_restrict)) {
> +        char * chroot_dir = GCSPRINTF("%s/qemu-root-%d",
                 ^
Spurious space.

> +        if (rmdir(chroot_dir) != 0
> +            && errno != ENOENT) {

This error handling style is contrary to usual libxl practice.  More
idiomatic, and IMO more readable, would be something like this:

           r = rmdir(chroot_dir);
           if (r && errno != ENOENT) {

> +        for (;;) {
> +            if (!mkdir(chroot_dir, 0000))
> +                break;

Same here.

Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux
  2018-09-21 17:04 ` [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux George Dunlap
  2018-09-24  8:24   ` Paul Durrant
@ 2018-09-24 10:40   ` Ian Jackson
  2018-09-24 14:22     ` George Dunlap
  1 sibling, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 10:40 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Wei Liu

George Dunlap writes ("[PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux"):
> QEMU running under Xen doesn't need mount or IPC functionality.
> Create and enter separate namespaces for each of these before
> executing QEMU, so that in the event that other restrictions fail, the
> process won't be able to even name system mount points or exsting
> non-file-based IPC descriptors to attempt to attack them.

> Unsharing is something a process can only do to itself (it would
> seem); so add an os-specific "dm_preexec_restrict()" hook just before
> we exec() the device model.
...
> +
> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
> +{
> +    return;
> +}

This patch is the point of divergence between Linux dom0's and FreeBSD
dom0's.

IMO it's not good enough to simply talk about some restrictions as
being Linux-only.  Rather, the restrictions are supposed to work
together as a whole, and the whole syscall API available to qemu needs
to be considered.  That would need to be done separately for FreeBSD.

I think that this means we should explicitly write down that the qemu
depriv implementation is incomplete on FreeBSD.

> +/* 
> + * Called after forking but before executing the local devicemodel.
> + * Must call _exit(-1) on failure, printing an error message to
> + * stderrfd if available.

IMO it would be better if the single call site handled the errors.  I
think that can be done in a standard way.  So this function should
probably be expected to call a LOG macro and then return ERROR_FAIL,
on failure.

> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
> +{
> +    int rc;
> +
> +    /* Unshare mount and IPC namespaces.  These are unused by QEMU. */
> +    rc = unshare(CLONE_NEWNS | CLONE_NEWIPC);

Variable name should be r.  rc is used for libxl error codes.
See tools/libxl/CODING_STYLE.

> +    if (rc < 0) {
> +        char *msg = GCSPRINTF("libxl: Mount and IPC namespace unfailed with error %d\n",
> +                              errno);

Please do not print numeric errno values.  I think you can use LOG:

 |       .... The child may not
 |  * make any further calls to libxl infrastructure, except for memory
 |  * allocation and logging.    ...

(from libxl_internal.h's discussion of libxl__ev_child_fork)

> +# TEST: Namespace unsharing
> +#
> +# Read /proc/<dmpid>/ns/<namespace> and make sure it's not equal to
> +# the current processes' value
> +for nsname in ipc mnt; do
> +    echo -n "Unshare namespace $nsname: "
> +    dmns=$(readlink /proc/$dmpid/ns/$nsname)
> +    myns=$(readlink /proc/self/ns/$nsname)
> +
> +    if [[ "$dmns" == "$myns" ]] ; then
> +	echo "FAILED"
> +	failed="true"
> +    else
> +	echo "PASSED"
> +    fi
> +done

I wonder if maybe this output should be in subunit v1 format.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
  2018-09-21 17:04 ` [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs George Dunlap
  2018-09-24  8:34   ` Paul Durrant
@ 2018-09-24 10:48   ` Ian Jackson
  2018-09-24 14:41     ` George Dunlap
  1 sibling, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 10:48 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Wei Liu

George Dunlap writes ("[PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs"):
> Limit the ability of a potentially compromised QEMU to consume system
> resources.  Key limits:
...
> +static struct {
> +    int resource;
> +    rlim_t limit;
> +} rlimits[] = {
> +    {
> +        .resource = RLIMIT_FSIZE,
> +        /* Big enough for log files, not big enough for a DoS */
> +        .limit = 256*1024,
> +    },
> +    {
> +        .resource = RLIMIT_NPROC,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_CORE,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_MSGQUEUE,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_LOCKS,
> +        .limit = 0
> +    },
> +    {
> +        .resource = RLIMIT_MEMLOCK,
> +        .limit = 0

This table would be much nicer in table format, with each entry as a
row.  If the entries are too long to wrap then options include:
 * Use an intermediate struct so as to avoid having to use
    named initialisers
 * Introduce a macro to define each table row (I think
    this is better).

> +    /* Set various "easy" rlimits */
> +    for (i=0; rlimits[i].resource != -1; i++) {

It would be more idiomatic to walk a pointer along the array, rather
than writing rlimits[i] all the time.

> +        if (rc < 0) {
> +            char *msg = GCSPRINTF("libxl: Setting rlimit %d to %lld failed with error %d\n",
> +                                  rlimits[i].resource,

If you cared very much about the error handling, you could produce
names rather than numbers for the rlimit values by wrapping the struct
rlimit up in a struct of our own with a `const char *name' field, and
having a macro to generate the table entries.

But it's probably not worth it.

Again, use `r' for syscall returns.  And, again, don't print numerical
errno values.  use LOGE.

...
> +    if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then
> +	if [[ "${BASH_REMATCH[1]}" != $tgt ||

Cor.  I think I would have reached for Perl by now.  I guess tastes
differ.

I guess you have tested that this does all pass and fail
appropriately ?


About the output format from this checker script.  As it is, osstest
will have to grow an additional parser to parse this.  I don't mind
that if it's subunit v1 (or some other plsusible standard).

But if it's going to be bespoke it would be much more convenient if
the output was parseable by the same parser as the oiutput from
depriv-fd-checker.

Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-21 17:04 ` [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing George Dunlap
@ 2018-09-24 10:49   ` Ian Jackson
  2018-09-24 11:16     ` George Dunlap
  2018-09-24 11:21   ` Ian Jackson
  1 sibling, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 10:49 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

George Dunlap writes ("[PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
> QEMU has a `sandbox` feature, wherein it will use seccomp2 to restrict
> what system calls it is able to make.
...
> +        flexarray_append(dm_args, "on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny");

Why `elevateprivileges=allow' ?

In this syntax, what happens with unmentioned abilities ?

Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv
  2018-09-21 17:04 ` [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv George Dunlap
@ 2018-09-24 10:57   ` Ian Jackson
  2018-09-24 14:45     ` George Dunlap
  2018-09-24 11:14   ` Anthony PERARD
  1 sibling, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 10:57 UTC (permalink / raw)
  To: George Dunlap
  Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu, Ross Lagerwall

George Dunlap writes ("[PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv"):
> Add a tool to check whether the various process-level deprivileging
> operations have actually taken place on the process.
...
> +# Example input:
> +# Uid:	1193	1193	1193	1193
> +input=$(grep Uid /proc/$dmpid/status)

Are you sure this grep does not need to be more specific ?  What if a
new thing gets added, I don't know,
  Sponglefleep-Uid-Blarking: yes 42
?

> +if [[ "$input" =~ ^Uid:[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)$ ]] ; then

I think I made most of my comments about this script in my other
review comments.

But, specifically, here: if you are confident about the format of the
line in /proc/*/status, you could do
   fields=($input)
   for uid in ${fields[*]:1}; do
      compare uid with expected

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv
  2018-09-21 17:04 ` [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv George Dunlap
  2018-09-24 10:57   ` Ian Jackson
@ 2018-09-24 11:14   ` Anthony PERARD
  1 sibling, 0 replies; 39+ messages in thread
From: Anthony PERARD @ 2018-09-24 11:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ross Lagerwall, xen-devel, Stefano Stabellini, Wei Liu, Ian Jackson

On Fri, Sep 21, 2018 at 06:04:24PM +0100, George Dunlap wrote:
> diff --git a/tools/tests/depriv/depriv-process-checker.sh b/tools/tests/depriv/depriv-process-checker.sh
> --- /dev/null
> +++ b/tools/tests/depriv/depriv-process-checker.sh
[...]
> +# Example input:
> +# Uid:	1193	1193	1193	1193
> +input=$(grep Uid /proc/$dmpid/status)
> +if [[ "$input" =~ ^Uid:[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)$ ]] ; then

You should replace all the * by +. Surely you want at least one space
between two numbers.

> +    result="PASSED"
> +    for i in {1..4}; do
> +	if [[ "${BASH_REMATCH[$i]}" != "$tgt_uid" ]] ; then
> +	    result="FAILED"
> +	    failed="true"
> +	    break
> +	fi
> +    done
> +else
> +    result="FAILED"
> +    failed="true"
> +fi
> +echo $result
> +
> +# Example input:
> +# Gid:	10020	10020	10020	10020
> +echo -n "Process GID: "
> +input=$(grep Uid /proc/$dmpid/status)

Here grep for "Uid" but ...

> +if [[ "$input" =~ ^Gid:[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)$ ]] ; then

here is "Gid".

> +    result="PASSED"
> +    for i in {1..4}; do
> +	if [[ "${BASH_REMATCH[$i]}" != "$65534" ]] ; then

Is "$65534" supposed to be a variable? Or is "$" a extra character?


-- 
Anthony PERARD

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-24 10:49   ` Ian Jackson
@ 2018-09-24 11:16     ` George Dunlap
  2018-09-24 13:04       ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24 11:16 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

On 09/24/2018 11:49 AM, Ian Jackson wrote:
> George Dunlap writes ("[PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
>> QEMU has a `sandbox` feature, wherein it will use seccomp2 to restrict
>> what system calls it is able to make.
> ...
>> +        flexarray_append(dm_args, "on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny");
> 
> Why `elevateprivileges=allow' ?

From qemu-depriv.md:

`elevateprivileges` is currently required to allow `-runas` to work.
Removing this requirement would mean making sure that the uid change
happened before the seccomp2 call, perhaps by changing the uid before
executing QEMU.  (But this would then require other changes to create
the QMP socket, VNC socket, and so on).

Should I C&P this into a comment here?

> In this syntax, what happens with unmentioned abilities ?

Good question -- the -help doesn't seem to say.  Looking at the code
(qemu-seccomp.c:parse_sandbox()) for those who want to follow along at
home), it seems different options have different default values (which
are not mentioned) -- obsolete is default deny, but spawn,
elevateprivileges, and resourcsecontrol are default allow.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-21 17:04 ` [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing George Dunlap
  2018-09-24 10:49   ` Ian Jackson
@ 2018-09-24 11:21   ` Ian Jackson
  2018-09-24 14:28     ` George Dunlap
  1 sibling, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 11:21 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

Apropos of our conversation on IRC, I looked at the checker script in
detail.

> #!/bin/bash
> 
> domain="$1"

Just noticed this, but: OMG no `set -e'.
You probably want `set -o pipefail' too.

> dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid 2>/dev/null)
> if [[ -z "$dmpid" ]] ; then
>     echo "xenstore-read failed"
>     exit 1
> fi

Why do you throw away the stderr from xenstore-read ?

> failed="false"

Quotes are not needed here and seem un-idiomatic to me when the RHS is
a simple atom.

> # TEST: Process / group id
...
> # FIXME: deal with other UID configurations?

Since the test will fail if you don't do this, I think this is very
sub-critical and you could drop the fixme.

> # Example input:
> # Uid:	1193	1193	1193	1193
> input=$(grep Uid /proc/$dmpid/status)

I have commented on this grep, and the subsequent regexpery, already.

But also: you check the uid and the gid, but by duplicating the code.
Surely this could be a shell function.

> echo -n "Process UID: "

If this had been me, I would have written
   begin_test "Process UID"
and then

>     result="PASSED"

      test_passed

>     for i in {1..4}; do
> 	if [[ "${BASH_REMATCH[$i]}" != "$tgt_uid" ]] ; then
> 	    result="FAILED"
> 	    failed="true"

In particular, you open-code setting `failed' everywhere.  If you miss
one then that could hide a test failure.  So
            test_failed
But you want to print a reason so
            test_failed "got $uidorgid $actual_id wanted $tgt_id"

As a bonus, doing this now means you can fix the test output format to
be more parseable by changing the code in one place.

>     if [[ "$root" != "$tgt_chroot" ]] ; then
> 	echo "FAILED"

You could introduce
            test_condition 'root directory' "$root" != "$tgt_chroot"
which calls test_passed or test_failed as appropriate.

If you have it return the same exit status as the test, you can use it
for the uids where you would say
        test_condition "one $uidorgid" $actual_id = $tgt_id || break
and the rest of the time you would have to write
            test_condition 'root directory' "$root" != "$tgt_chroot" ||:

> function check_rlimit() {
>     limit_name=$1
>     limit_string=$2
>     tgt=$3
> 
>     echo -n "rlimit $limit_name: "
>     input=$(grep "^$limit_string" /proc/$dmpid/limits)
...
>     if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then

Because of the unfortunate format of /proc/PID/limits, you do can't
just do the
    fields=($input)
trick but
    fields=(${input#*  })
DTRT.

Thanks,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-24 11:16     ` George Dunlap
@ 2018-09-24 13:04       ` Ian Jackson
  2018-09-24 13:43         ` George Dunlap
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-24 13:04 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

George Dunlap writes ("Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
> From qemu-depriv.md:
> 
> `elevateprivileges` is currently required to allow `-runas` to work.
> Removing this requirement would mean making sure that the uid change
> happened before the seccomp2 call, perhaps by changing the uid before
> executing QEMU.  (But this would then require other changes to create
> the QMP socket, VNC socket, and so on).
> 
> Should I C&P this into a comment here?

Yes.

I think the conclusion I would draw from that comment is not that the
uid change should happen before exec'ing qemu, but that the seccomp
call in qemu is made too early.  But fine.

> > In this syntax, what happens with unmentioned abilities ?
> 
> Good question -- the -help doesn't seem to say.  Looking at the code
> (qemu-seccomp.c:parse_sandbox()) for those who want to follow along at
> home), it seems different options have different default values (which
> are not mentioned) -- obsolete is default deny, but spawn,
> elevateprivileges, and resourcsecontrol are default allow.

Erk.  I guess we could parse -help output :-/.

What about capabilities not known to the qemu source code ?

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-24 10:23 ` Ian Jackson
@ 2018-09-24 13:08   ` George Dunlap
  2018-09-25 10:44     ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24 13:08 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Konrad Wilk, Andrew Cooper,
	Tim Deegan, Ross Lagerwall, Julien Grall, Jan Beulich,
	Anthony Perard, xen-devel

On 09/24/2018 11:23 AM, Ian Jackson wrote:
> George Dunlap writes ("[PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans"):
>> +## Xen library / file-descriptor restrictions
>> +
>> +'''Description''': Close and restrict Xen-related file descriptors.
>> +Specifically:
>> + * Close all xenstore-related file descriptors
> 
> This is correct.
> 
>> + * Make sure that extraneous `privcmd` and `evtchn` instances are
>> +closed
> 
> No, *all* privcmd and evtchn instances are restricted, even
> `extraneous' ones which have been leaked by qemu.  None are closed.

Ack.

>> +'''How to test''':
>> +
>> +    tools/test/depriv/depriv-fd-checker.c
> 
> You also need the tool `fishdescriptor' from src:chiark-utils to get
> the descriptors out of qemu.  It is in chiark-utils-bin in Debian
> buster and Debian stretch-backports.

This was meant to be a somewhat technical description of the mechanism
of doing the testing (to be implemented by someone implementing the
feature), rather than a how-to for keen users / testers to actually run
the test.  What about:

"Use `fishdescriptor` from [reference], to pull a file descriptor from a
running QEMU, then check that it has the desired properties, and that
hypercalls which are meant to fail do fail.  (The latter is implemented
in `tools/test/depriv/depriv-fd-checker.c`)."

On a related note: Is there any reason not to  move
osstest-depriv-fd-collector into the tree, perhaps even merging it with
the functionality in depriv-process-checker?

>> +## Namespaces for unused functionality (Linux only)
>> +
>> +'''Description''': Enter QEMU into its own mount & IPC namespaces.
>> +This means that even if other restrictions fail, the process won't be
>> +able to even name system mount points or exsting non-file-based IPC
>> +descriptors to attempt to attack them.
>> +
>> +'''Implementation''':
>> +
>> +In theory this could be done in QEMU (similar to -sandbox, -runas,
>> +-chroot, and so on), but a patch doing this in QEMU was NAKed
>> +upstream. They preferred that this was done as a setup step by
>> +whatever executes QEMU; i.e., have the process which exec's QEMU first
>> +call:
>> +
>> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> 
> If you are recording this kind of information here: this will of
> course not work, because qemu binds and opens things at startup that
> would be broken by this.  Maybe you want to give a url to a mailing
> list posting instead of this un-referenced hearsay.
> 
>> +### Network namespacing (Linux only)
>> +
>> +Enter QEMU into its own network namespace (in addition to mount & IPC
>> +namespaces).  Basically change the 'unshare' call to be as follows:
>> +
>> +    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
>> +
>> +QEMU does actually use the network namespace by default, so adding
>> +this restriction requires additional changes, listed below.
> 
> The CLONE_NEWIPC overlaps with the IPC unshare discussed above.

This is the second time I've had to try to explain the difference
between the above two items; I'm not sure what's not clear about what
was written.

The title of the first says: "...for unused functionaltiy".  IPC
namespaces are for non-file-based IPCs (i.e., things which are not unix
sockets).  QEMU does not use this functionality, nor does it use mount
functionality.  The first restriction is in fact implemented in patch 4,
and I haven't had any issues with it.

I suppose I need more text like the above, explicitly defining what
those namespaces do, and that QEMU doesn't need them?

The second section explicitly mentions the fact that this will not work,
and references further changes which would be required for such a change
to be implemented (in the 'Network' and 'VNC' sections).

I can certainly provide a reference if you think that's important.

Any other suggestions for clarification, so that we don't have to repeat
this discussion again, would be welcome.

>> +## Setting up a userid range
> 
> There was some discussion on a Debian list recently about some
> container systems that encode a 16-bit within-container uid and a
> 16-bit container number into the 32-bit uid.  I guess we don't need to
> explicitly worry about clashes between our usage and those ?

Hmm, someone may run containers that use such things in dom0, at which
point we may have a namespace collision.

But really I think this is a distro problem to solve -- we don't specify
a >16-bit UID, we just give it as an example.  Debian could, for instance:
 - Not use the system which uses the 16/16 split
 - Enforce that Xen and the 16/16 split system are not installed at the
same time
 - Reserve 32k of UIDs in the 16-bit space somehow
 - Reserve one of the "container ID" entries for Xen, so that there's
never a collision

>> +# Limitations
>> +
>> +The following features still need to be implemented:
>> + * Inserting a new cdrom while the guest is running (xl cdrom-insert)
>> + * Migration / save / restore
>> +
>> +Additionally, getting PCI passthrough to work securely would require a
>> +significant rework of how passthrough works at the moment.  It may be
>> +implemented at some point but is not a near-term priority.
> 
> The limitations section should also say something like this:
> 
>  The currently implemented restrictions are thought to be a useful
>  security improvement.  However, the design and implementation is
>  preliminary and there is work left to do.  Accordingly we do not
>  promise that they are sufficient to stop a rogue domain which takes
>  control of its qemu from escaping into the host, let alone stop it
>  from denying service to the host.

Isn't this what "Tech preview" means?  Or do you mean we'll keep this
kind of warning in after we take it out of 'tech preview'?

>  Therefore, bugs which affect the effectiveness of the qemu depriv
>  mechanisms will be treated as plain bugs, not security bugs; they
>  would not result in a Xen Project Security Advisory.  However, bugs
>  where the security of a system with dm_restrict=1 is worse than
>  before, will be treated as security bugs.

This would be slightly different than 'tech preview'.

Once this goes to "supported", I agree that we shouldn't issue an XSA
for, say, a bug in Linux's implementation of RLIMIT_NPROC, or a bug in
Linux that allows QEMU, while running as an unprivileged process, to do
something it's not supposed to do (say, fill up our chroot, which is
owned by root).

But I do think we should issue an XSA if there is code in libxl which
claims to do something but failed.  For instance, if a change
accidentally disables the `-runas` option to QEMU when dm_restrict=1,
then that would merit an XSA in my opinion.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-24 13:04       ` Ian Jackson
@ 2018-09-24 13:43         ` George Dunlap
  2018-09-25 10:46           ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24 13:43 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

On 09/24/2018 02:04 PM, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
>> From qemu-depriv.md:
>>
>> `elevateprivileges` is currently required to allow `-runas` to work.
>> Removing this requirement would mean making sure that the uid change
>> happened before the seccomp2 call, perhaps by changing the uid before
>> executing QEMU.  (But this would then require other changes to create
>> the QMP socket, VNC socket, and so on).
>>
>> Should I C&P this into a comment here?
> 
> Yes.
> 
> I think the conclusion I would draw from that comment is not that the
> uid change should happen before exec'ing qemu, but that the seccomp
> call in qemu is made too early.  But fine.

Yeah, I was thinking that just after I sent this mail too; it would be
good to see if there was a reasoning behind that.

>>> In this syntax, what happens with unmentioned abilities ?
>>
>> Good question -- the -help doesn't seem to say.  Looking at the code
>> (qemu-seccomp.c:parse_sandbox()) for those who want to follow along at
>> home), it seems different options have different default values (which
>> are not mentioned) -- obsolete is default deny, but spawn,
>> elevateprivileges, and resourcsecontrol are default allow.
> 
> Erk.  I guess we could parse -help output :-/.
> 
> What about capabilities not known to the qemu source code ?

Hrm -- it looks like the sandboxing stuff is based on a blacklist,
rather than a whitelist.  Which may be inevitable, given that seccomp2
operates on system calls but qemu makes library calls (and thus doesn't
actually know which system calls are need and which are not -- see [1]).
 But it does rather undermine the usefulness of this as a security
feature -- there are literally hundreds of system calls available on
Linux, of which only 50 or so are listed here.

Luckily `-sandbox` was just one of the "sure why not" layers of extra
security, not something we rely on.

We could add a test to our testing script to parse `-help` output for
unknown-to-libxl options and throw an error, so that they get added in,
if we want.

 -George

[1] https://lwn.net/Articles/738694/

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux
  2018-09-24 10:40   ` Ian Jackson
@ 2018-09-24 14:22     ` George Dunlap
  2018-09-25 10:55       ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24 14:22 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Anthony Perard, xen-devel, Wei Liu

On 09/24/2018 11:40 AM, Ian Jackson wrote:
> George Dunlap writes ("[PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux"):
>> QEMU running under Xen doesn't need mount or IPC functionality.
>> Create and enter separate namespaces for each of these before
>> executing QEMU, so that in the event that other restrictions fail, the
>> process won't be able to even name system mount points or exsting
>> non-file-based IPC descriptors to attempt to attack them.
> 
>> Unsharing is something a process can only do to itself (it would
>> seem); so add an os-specific "dm_preexec_restrict()" hook just before
>> we exec() the device model.
> ...
>> +
>> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
>> +{
>> +    return;
>> +}
> 
> This patch is the point of divergence between Linux dom0's and FreeBSD
> dom0's.
> 
> IMO it's not good enough to simply talk about some restrictions as
> being Linux-only.  Rather, the restrictions are supposed to work
> together as a whole, and the whole syscall API available to qemu needs
> to be considered.  That would need to be done separately for FreeBSD.
> 
> I think that this means we should explicitly write down that the qemu
> depriv implementation is incomplete on FreeBSD.

I think theoretically, an unprivileged user locked in a chroot owned by
root, and with the xenstore fds properly restricted, should be enough to
prevent a guest from gaining control or reading data that it shouldn't
be able to read.  All of the other restrictions are to add extra layers,
in case that first layer should have a bug.

That is, our unprivileged process shouldn't be *able* to call mount;
but, *just in case*, we'll put it in a separate mount space, so that it
can't mount proc or sysfs or whatever.  Our unprivileged process
shouldn't be able to use non-file-based IPC to break into other
processes on the system; but just in case, we'll put it in its own IPC
namespace.  Our unpriveleged process shouldn't be *able* to cause any
mischief with the reboot or readdir system calls, but let's restrict
those with `-sandbox` anyway, just in case.

`-runas` and `-chroot` should work just fine on FreeBSD; and if the
restrict IOCTL is implemented, then that should be enough to say, "At a
basic level this works".  We probably do want someone familiar with
FreeBSD to go through and implement whatever can be implemented there as
far as additional restrictions.

> 
>> +/* 
>> + * Called after forking but before executing the local devicemodel.
>> + * Must call _exit(-1) on failure, printing an error message to
>> + * stderrfd if available.
> 
> IMO it would be better if the single call site handled the errors.  I
> think that can be done in a standard way.  So this function should
> probably be expected to call a LOG macro and then return ERROR_FAIL,
> on failure.

Ack

>> +void libxl__local_dm_preexec_restrict(libxl__gc *gc, int stderrfd)
>> +{
>> +    int rc;
>> +
>> +    /* Unshare mount and IPC namespaces.  These are unused by QEMU. */
>> +    rc = unshare(CLONE_NEWNS | CLONE_NEWIPC);
> 
> Variable name should be r.  rc is used for libxl error codes.
> See tools/libxl/CODING_STYLE.

Yes, sorry I forgot about this. Ack (to this and all similar reminders).

>> +# TEST: Namespace unsharing
>> +#
>> +# Read /proc/<dmpid>/ns/<namespace> and make sure it's not equal to
>> +# the current processes' value
>> +for nsname in ipc mnt; do
>> +    echo -n "Unshare namespace $nsname: "
>> +    dmns=$(readlink /proc/$dmpid/ns/$nsname)
>> +    myns=$(readlink /proc/self/ns/$nsname)
>> +
>> +    if [[ "$dmns" == "$myns" ]] ; then
>> +	echo "FAILED"
>> +	failed="true"
>> +    else
>> +	echo "PASSED"
>> +    fi
>> +done
> 
> I wonder if maybe this output should be in subunit v1 format.

We could do that.

It doesn't look like the "official" subunit code [1] accumulates the
failure and returns it as the status code of the script as a whole.  Is
that something that's valuable, do you think?  Or should we have `status
= OK` mean, "The script itself ran without errors, check the output for
any test failures"?

 -George

[1]
https://github.com/testing-cabal/subunit/blob/master/shell/share/subunit.sh

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-24 11:21   ` Ian Jackson
@ 2018-09-24 14:28     ` George Dunlap
  2018-09-25 11:02       ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24 14:28 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

On 09/24/2018 12:21 PM, Ian Jackson wrote:
> Apropos of our conversation on IRC, I looked at the checker script in
> detail.
> 
>> #!/bin/bash
>>
>> domain="$1"
> 
> Just noticed this, but: OMG no `set -e'.
> You probably want `set -o pipefail' too.

`set -e` never made any sense to me -- that's not the way I code in any
other language; why would scripting be any different?  What's the
advantage of doing that in the current script?

>> dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid 2>/dev/null)
>> if [[ -z "$dmpid" ]] ; then
>>     echo "xenstore-read failed"
>>     exit 1
>> fi
> 
> Why do you throw away the stderr from xenstore-read ?

That's left over from a previous version of the script, where I didn't
check to see whether $1 was numeric, but rather tried to interpret it as
numeric and if it failed, then ran `xl domid` on it.  I can take that out.

> 
>> failed="false"
> 
> Quotes are not needed here and seem un-idiomatic to me when the RHS is
> a simple atom.

Sure.

>> # TEST: Process / group id
> ...
>> # FIXME: deal with other UID configurations?
> 
> Since the test will fail if you don't do this, I think this is very
> sub-critical and you could drop the fixme.

This was really a question for the reviewer; probably it should have
been RFC rather than FIXME.

The question is: should the script handle the case where
`xen-qemuuser-shared` is defined instead of `xen-qemuuser-range-base`?

>> # Example input:
>> # Uid:	1193	1193	1193	1193
>> input=$(grep Uid /proc/$dmpid/status)
> 
> I have commented on this grep, and the subsequent regexpery, already.
> 
> But also: you check the uid and the gid, but by duplicating the code.
> Surely this could be a shell function.
> 
>> echo -n "Process UID: "
> 
> If this had been me, I would have written
>    begin_test "Process UID"
> and then
> 
>>     result="PASSED"
> 
>       test_passed
> 
>>     for i in {1..4}; do
>> 	if [[ "${BASH_REMATCH[$i]}" != "$tgt_uid" ]] ; then
>> 	    result="FAILED"
>> 	    failed="true"
> 
> In particular, you open-code setting `failed' everywhere.  If you miss
> one then that could hide a test failure.  So
>             test_failed
> But you want to print a reason so
>             test_failed "got $uidorgid $actual_id wanted $tgt_id"
> 
> As a bonus, doing this now means you can fix the test output format to
> be more parseable by changing the code in one place.
> 
>>     if [[ "$root" != "$tgt_chroot" ]] ; then
>> 	echo "FAILED"
> 
> You could introduce
>             test_condition 'root directory' "$root" != "$tgt_chroot"
> which calls test_passed or test_failed as appropriate.
> 
> If you have it return the same exit status as the test, you can use it
> for the uids where you would say
>         test_condition "one $uidorgid" $actual_id = $tgt_id || break
> and the rest of the time you would have to write
>             test_condition 'root directory' "$root" != "$tgt_chroot" ||:

I'll take a look at doing this if we decide to stick with bash.

>> function check_rlimit() {
>>     limit_name=$1
>>     limit_string=$2
>>     tgt=$3
>>
>>     echo -n "rlimit $limit_name: "
>>     input=$(grep "^$limit_string" /proc/$dmpid/limits)
> ...
>>     if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then
> 
> Because of the unfortunate format of /proc/PID/limits, you do can't
> just do the
>     fields=($input)
> trick but
>     fields=(${input#*  })

What will this do?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
  2018-09-24 10:48   ` Ian Jackson
@ 2018-09-24 14:41     ` George Dunlap
  2018-09-25 11:03       ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-24 14:41 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Anthony Perard, xen-devel, Wei Liu

On 09/24/2018 11:48 AM, Ian Jackson wrote:
>> +        if (rc < 0) {
>> +            char *msg = GCSPRINTF("libxl: Setting rlimit %d to %lld failed with error %d\n",
>> +                                  rlimits[i].resource,
> 
> If you cared very much about the error handling, you could produce
> names rather than numbers for the rlimit values by wrapping the struct
> rlimit up in a struct of our own with a `const char *name' field, and
> having a macro to generate the table entries.
> 
> But it's probably not worth it.

I think I considered this for a second or two before concluding the same
thing. :-)

> Again, use `r' for syscall returns.  And, again, don't print numerical
> errno values.  use LOGE.
> 
> ...
>> +    if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then
>> +	if [[ "${BASH_REMATCH[1]}" != $tgt ||
> 
> Cor.  I think I would have reached for Perl by now.  I guess tastes
> differ.

I didn't realize I was going to need regexp until most of the regexp
until the rest of the script was written; at that point,  generating
this rune was certainly a lot less work than re-writing the whole thing
from scratch.  :-)

> I guess you have tested that this does all pass and fail
> appropriately ?

I have run the script on a VM without dm_restrict, and all the tests
come  back FAILED; and run it on a vm with dm_restrict, and all the
tests come back PASSED.

On the other hand, Anthony seems to have found a bug in the part of the
script that checks gid; so it seems there's something fishy.  But it's
my practice to test both, yes.

When we add it to osstest, we probably want to do the same -- make sure
that all the tests fail when dm_restrict=0, as well as pass when
dm_restrict=1.

> About the output format from this checker script.  As it is, osstest
> will have to grow an additional parser to parse this.  I don't mind
> that if it's subunit v1 (or some other plsusible standard).
> 
> But if it's going to be bespoke it would be much more convenient if
> the output was parseable by the same parser as the oiutput from
> depriv-fd-checker.

Does osstest need to actually parse the output at all?  Couldn't it just
capture the output verbatim in a log file, and use the exit code as a
pass/fail (or grep for FAILED)?

I'm happy to use a more standard formatting if you think that would be
useful though.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv
  2018-09-24 10:57   ` Ian Jackson
@ 2018-09-24 14:45     ` George Dunlap
  0 siblings, 0 replies; 39+ messages in thread
From: George Dunlap @ 2018-09-24 14:45 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu, Ross Lagerwall

On 09/24/2018 11:57 AM, Ian Jackson wrote:
> George Dunlap writes ("[PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv"):
>> Add a tool to check whether the various process-level deprivileging
>> operations have actually taken place on the process.
> ...
>> +# Example input:
>> +# Uid:	1193	1193	1193	1193
>> +input=$(grep Uid /proc/$dmpid/status)
> 
> Are you sure this grep does not need to be more specific ?  What if a
> new thing gets added, I don't know,
>   Sponglefleep-Uid-Blarking: yes 42

Yes, it probably should be more specific.

>> +if [[ "$input" =~ ^Uid:[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)[[:space:]]*([0-9]+)$ ]] ; then
> 
> I think I made most of my comments about this script in my other
> review comments.
> 
> But, specifically, here: if you are confident about the format of the
> line in /proc/*/status, you could do
>    fields=($input)
>    for uid in ${fields[*]:1}; do
>       compare uid with expected

I had something like that originally, but wasn't confident about it;
when I ended up having to write the regexp for /proc/*/limits, I wrote
one for here too to be a bit more safe.

Linus considers almost any user-space reliance on kernel behavior as
binding, so I suspect ($input) is probably safe enough if we want to
remove at least one nasty Bash regexp.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-24 13:08   ` George Dunlap
@ 2018-09-25 10:44     ` Ian Jackson
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Jackson @ 2018-09-25 10:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Konrad Wilk, Andrew Cooper,
	Tim Deegan, Ross Lagerwall, Julien Grall, Jan Beulich,
	Anthony Perard, xen-devel

George Dunlap writes ("Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans"):
> On 09/24/2018 11:23 AM, Ian Jackson wrote:
> > You also need the tool `fishdescriptor' from src:chiark-utils to get
> > the descriptors out of qemu.  It is in chiark-utils-bin in Debian
> > buster and Debian stretch-backports.
> 
> This was meant to be a somewhat technical description of the mechanism
> of doing the testing (to be implemented by someone implementing the
> feature), rather than a how-to for keen users / testers to actually run
> the test.  What about:

Right.

> "Use `fishdescriptor` from [reference], to pull a file descriptor from a
> running QEMU, then check that it has the desired properties, and that
> hypercalls which are meant to fail do fail.  (The latter is implemented
> in `tools/test/depriv/depriv-fd-checker.c`)."

SGTM.  `[reference]' is `(in Debian this can be found in the binary
package chiark-scripts)'.

> On a related note: Is there any reason not to  move
> osstest-depriv-fd-collector into the tree, perhaps even merging it with
> the functionality in depriv-process-checker?

I would be entirely fine with that.  Its osstest-specific knowledge is
fairly limited.

> >> +## Namespaces for unused functionality (Linux only)
...
> >> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
...
> >> +    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)
...
> > The CLONE_NEWIPC overlaps with the IPC unshare discussed above.
> 
> This is the second time I've had to try to explain the difference
> between the above two items; I'm not sure what's not clear about what
> was written.

Your two unshare calls have overlapping flags.  That means they have
overlapping effects.  I think I saw that and then conflated the two
completely.  Sorry about that - but really, it doesn't make sense to
list two unshare calls with overlapping flags as if they were
separate.

> > If you are recording this kind of information here: this will of
> > course not work, because qemu binds and opens things at startup that
> > would be broken by this.  Maybe you want to give a url to a mailing
> > list posting instead of this un-referenced hearsay.
>
> The title of the first says: "...for unused functionaltiy".  IPC
> namespaces are for non-file-based IPCs (i.e., things which are not unix
> sockets).  QEMU does not use this functionality, nor does it use mount
> functionality.  The first restriction is in fact implemented in patch 4,
> and I haven't had any issues with it.

I think I was thinkng of CLONE_NEWNET when I wrote the words above.
Sorry for the confusion.

> >> +## Setting up a userid range
> > 
> > There was some discussion on a Debian list recently about some
> > container systems that encode a 16-bit within-container uid and a
> > 16-bit container number into the 32-bit uid.  I guess we don't need to
> > explicitly worry about clashes between our usage and those ?
> 
> Hmm, someone may run containers that use such things in dom0, at which
> point we may have a namespace collision.
> 
> But really I think this is a distro problem to solve -- we don't specify
> a >16-bit UID, we just give it as an example.  Debian could, for instance:
>  - Not use the system which uses the 16/16 split
>  - Enforce that Xen and the 16/16 split system are not installed at the
> same time
>  - Reserve 32k of UIDs in the 16-bit space somehow
>  - Reserve one of the "container ID" entries for Xen, so that there's
> never a collision

In practice even in Debian things are not this organised.  Currently
Debian's policy docs simply say that uids >2^16 are "reserved".  Work
will need to be done there to formalise this.

> > The limitations section should also say something like this:
> > 
> >  The currently implemented restrictions are thought to be a useful
> >  security improvement.  However, the design and implementation is
> >  preliminary and there is work left to do.  Accordingly we do not
> >  promise that they are sufficient to stop a rogue domain which takes
> >  control of its qemu from escaping into the host, let alone stop it
> >  from denying service to the host.
> 
> Isn't this what "Tech preview" means?  Or do you mean we'll keep this
> kind of warning in after we take it out of 'tech preview'?
> 
> >  Therefore, bugs which affect the effectiveness of the qemu depriv
> >  mechanisms will be treated as plain bugs, not security bugs; they
> >  would not result in a Xen Project Security Advisory.  However, bugs
> >  where the security of a system with dm_restrict=1 is worse than
> >  before, will be treated as security bugs.
> 
> This would be slightly different than 'tech preview'.

Both of these paragraphs are to be taken together.  And yes, it is
somewhat different than pure "tech preview", which is "if you use this
feature, all bets are off".  I am trying to write a sensible promise
which means that someone who turns on dm_restrict is not *weakening*
their security posture.

> Once this goes to "supported", I agree that we shouldn't issue an XSA
> for, say, a bug in Linux's implementation of RLIMIT_NPROC, or a bug in
> Linux that allows QEMU, while running as an unprivileged process, to do
> something it's not supposed to do (say, fill up our chroot, which is
> owned by root).
> 
> But I do think we should issue an XSA if there is code in libxl which
> claims to do something but failed.  For instance, if a change
> accidentally disables the `-runas` option to QEMU when dm_restrict=1,
> then that would merit an XSA in my opinion.

Yes.  But right now, I just want to say that we will issue an advisory
of dm_restrict is _worse_.  For example if using dm_restrict somehow
allowed a guest to write to read-only disks, or something.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-24 13:43         ` George Dunlap
@ 2018-09-25 10:46           ` Ian Jackson
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Jackson @ 2018-09-25 10:46 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

George Dunlap writes ("Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
> On 09/24/2018 02:04 PM, Ian Jackson wrote:
> > What about capabilities not known to the qemu source code ?
> 
> Hrm -- it looks like the sandboxing stuff is based on a blacklist,
> rather than a whitelist.  Which may be inevitable, given that seccomp2
> operates on system calls but qemu makes library calls (and thus doesn't
> actually know which system calls are need and which are not -- see [1]).
>  But it does rather undermine the usefulness of this as a security
> feature -- there are literally hundreds of system calls available on
> Linux, of which only 50 or so are listed here.

How annoying.

> Luckily `-sandbox` was just one of the "sure why not" layers of extra
> security, not something we rely on.

Right.

> We could add a test to our testing script to parse `-help` output for
> unknown-to-libxl options and throw an error, so that they get added in,
> if we want.

That sounds like a good idea.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux
  2018-09-24 14:22     ` George Dunlap
@ 2018-09-25 10:55       ` Ian Jackson
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Jackson @ 2018-09-25 10:55 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Wei Liu

George Dunlap writes ("Re: [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux"):
> On 09/24/2018 11:40 AM, Ian Jackson wrote:
> > I think that this means we should explicitly write down that the qemu
> > depriv implementation is incomplete on FreeBSD.
> 
> I think theoretically, an unprivileged user locked in a chroot owned by
> root, and with the xenstore fds properly restricted, should be enough to
> prevent a guest from gaining control or reading data that it shouldn't
> be able to read.  All of the other restrictions are to add extra layers,
> in case that first layer should have a bug.

I think this is not true.  The first thing I can think of is that such
a process can probably make network connections in an inappropriate
context.  The second thing is that it can leave a copy of itself
behind to interfere with the next domain (eg with ptrace) (which is a
thing we already know about).

Stepping back, my point is that the review that our set of depriv
calls etc. is sufficient, before we declare this feature supported,
ought to be kernel-specific.  So depending on which things are
implemented at which time, the feature might be (security-) supported
on FreeBSD but not Linux, or vice versa.

And this should be reflected in the paperwork somehow so that we don't
inadvertently conduct the review for one kernel and then just declare
the feature `supported'.  (This is less likely now that I have drawn
our attention to the issue by writing about it so much...)

> > I wonder if maybe this output should be in subunit v1 format.
> 
> We could do that.
> 
> It doesn't look like the "official" subunit code [1] accumulates the
> failure and returns it as the status code of the script as a whole.  Is
> that something that's valuable, do you think?  Or should we have `status
> = OK` mean, "The script itself ran without errors, check the output for
> any test failures"?

I think the way you did it is better at least in the default case,
although an actual subunit v1 consumer will want the behaviour of
subunit.sh because then they know that a zero exit status means that
all intended tests have been reported.

I guess this is one reason why you might write your own little
wrappers.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-24 14:28     ` George Dunlap
@ 2018-09-25 11:02       ` Ian Jackson
  2018-09-25 14:24         ` George Dunlap
  0 siblings, 1 reply; 39+ messages in thread
From: Ian Jackson @ 2018-09-25 11:02 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

George Dunlap writes ("Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
> On 09/24/2018 12:21 PM, Ian Jackson wrote:
> > Just noticed this, but: OMG no `set -e'.
> > You probably want `set -o pipefail' too.
> 
> `set -e` never made any sense to me -- that's not the way I code in any
> other language; why would scripting be any different?  What's the
> advantage of doing that in the current script?

?

Many modern languages throw exceptions.  set -e is a bit like that.
If you don't say set -e then you need to wrap every everything in your
entire script with an error check.

For example, you write

> >> dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid 2>/dev/null)
> >> if [[ -z "$dmpid" ]] ; then
> >>     echo "xenstore-read failed"
> >>     exit 1
> >> fi
> > 
> > Why do you throw away the stderr from xenstore-read ?
> 
> That's left over from a previous version of the script, where I didn't
> check to see whether $1 was numeric, but rather tried to interpret it as
> numeric and if it failed, then ran `xl domid` on it.  I can take that out.

but with set -e you can write only

    dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid)

and the subsequent if is not needed.


> The question is: should the script handle the case where
> `xen-qemuuser-shared` is defined instead of `xen-qemuuser-range-base`?

I think it is fine if it doesn't.  If someone wants that feature it is
easy to add it.

> >> function check_rlimit() {
> >>     limit_name=$1
> >>     limit_string=$2
> >>     tgt=$3
> >>
> >>     echo -n "rlimit $limit_name: "
> >>     input=$(grep "^$limit_string" /proc/$dmpid/limits)
> > ...
> >>     if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then
> > 
> > Because of the unfortunate format of /proc/PID/limits, you do can't
> > just do the
> >     fields=($input)
> > trick but
> >     fields=(${input#*  })
> 
> What will this do?

The expression
  ${input#*  }
is the contents of input with the shortest prefix matching `*  '
stripped off the front.  That will eat all the words at the start,
which are separated by one space each, and find the first pair of
adjacent spaces.  So if input is
  'Max processes             63603                63603                processes 
then "${input#*  }" is
  '           63603                63603                processes 

Expanding it without the surrounding " " causes it to be word-split in
the usual way, producing
  63603 63603 processes
and then this is assigned to a new bash array variable `fields'
which can then be indexed to find the values.

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
  2018-09-24 14:41     ` George Dunlap
@ 2018-09-25 11:03       ` Ian Jackson
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Jackson @ 2018-09-25 11:03 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Wei Liu

George Dunlap writes ("Re: [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs"):
> Does osstest need to actually parse the output at all?  Couldn't it just
> capture the output verbatim in a log file, and use the exit code as a
> pass/fail (or grep for FAILED)?

Yes.  Each of these tests should be its own substep.  That way we can
unconditionally run the test and later add individual restriction
features.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
                   ` (6 preceding siblings ...)
  2018-09-24 10:23 ` Ian Jackson
@ 2018-09-25 11:20 ` Anthony PERARD
  2018-09-28 11:37   ` George Dunlap
  7 siblings, 1 reply; 39+ messages in thread
From: Anthony PERARD @ 2018-09-25 11:20 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Konrad Wilk, Andrew Cooper,
	Tim Deegan, Ross Lagerwall, Julien Grall, Jan Beulich,
	Ian Jackson, xen-devel

On Fri, Sep 21, 2018 at 06:04:23PM +0100, George Dunlap wrote:
> +## Migration
> +
> +When calling xen-save-devices-state, since QEMU is running in a chroot
> +it is not useful to pass a filename (it doesn't even have write access
> +inside the chroot). Instead, give it an open fd using the add-fd
> +mechanism.

That describe an issue only on save. The restore part of a migration
also have an issue:

On restore, QEMU signal to libxl that it is ready only after priviledge
restriction have been put in place (and after or when it receive the
migration data). But xenstore isn't available anymore, so restore fails
from libxl point-of-view.


Or maybe you describe here an issue that would arise when an hypothetic
chroot is apply. But the migration issue describe here still apply
without chroot and with only -runas, the path libxl give to QEMU is
accessible by root only.


> diff --git a/docs/features/qemu-deprivilege.pandoc b/docs/features/qemu-deprivilege.pandoc
> new file mode 100644
> index 0000000000..b377076606
> --- /dev/null
> +++ b/docs/features/qemu-deprivilege.pandoc
[...]
> +# Technical details
> +
> +See docs/design/qemu-deprivilege.txt for technical details.

Right now, this doc have the ".md" suffix, not ".txt", I don't know if
that matter.


-- 
Anthony PERARD

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-25 11:02       ` Ian Jackson
@ 2018-09-25 14:24         ` George Dunlap
  2018-09-26 10:57           ` Ian Jackson
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-25 14:24 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

On 09/25/2018 12:02 PM, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
>> On 09/24/2018 12:21 PM, Ian Jackson wrote:
>>> Just noticed this, but: OMG no `set -e'.
>>> You probably want `set -o pipefail' too.
>>
>> `set -e` never made any sense to me -- that's not the way I code in any
>> other language; why would scripting be any different?  What's the
>> advantage of doing that in the current script?
> 
> ?
> 
> Many modern languages throw exceptions.  set -e is a bit like that.
> If you don't say set -e then you need to wrap every everything in your
> entire script with an error check.

I'm not sure you can call such languages "modern" anymore. :-)  Neither
golang nor rust have exceptions, for example, by deliberate choice. I've
read golang's rationale for why not, and I tend to agree with them.
People complain about the fact that you always have to check errors in
golang, but it generally prods you to think about what would actually
happen if something went wrong at each point and do something sensible.
Rust seems to be following a similar example.

I suppose the difference between bash and golang / rust is that the
compiler prods you to actually do the checking, whereas in bash nothing
prods you to do the check; so `set -e` is actually somewhat safer: if
you forget to check a critical success / failure, exiting is probably
going to cause less havoc than blindly continuing.

> For example, you write
> 
>>>> dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid 2>/dev/null)
>>>> if [[ -z "$dmpid" ]] ; then
>>>>     echo "xenstore-read failed"
>>>>     exit 1
>>>> fi
>>>
>>> Why do you throw away the stderr from xenstore-read ?
>>
>> That's left over from a previous version of the script, where I didn't
>> check to see whether $1 was numeric, but rather tried to interpret it as
>> numeric and if it failed, then ran `xl domid` on it.  I can take that out.
> 
> but with set -e you can write only
> 
>     dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid)
> 
> and the subsequent if is not needed.

I think you misunderstood what I meant.  The original code looked
something like this:

# See if $domain is a domid
dmpid=$(xenstore-read /local/domain/$domain/... >2 /dev/null )
if [[ -z "$dmpid" ]] ; then
  # That didn't work; maybe it was a domain name instead
  domid=$(xl domid $domain)
  dmpid=$(xenstore-read /local/domain/$domid/...)

  # ... more error handling
fi

Without the pipe to /dev/null, if a user uses a domain name (i.e.,
'c6-01'), then the script will succeed, but there will be a spurious
error message when the first xenstore-read fails.

With `set -e`, I'd not only have to pipe to null, I'd also have to add
in "|| true" or something to keep the shell from exiting on an error.

In the current code, printing the error message is obviously better than
throwing it away.  But I still feel better checking the result and
giving a sensible follow-up error message than having a failure silently
exit, because I prefer to explicitly think about what happens when
things fail (and remind the people reading the code to do so as well).

>> The question is: should the script handle the case where
>> `xen-qemuuser-shared` is defined instead of `xen-qemuuser-range-base`?
> 
> I think it is fine if it doesn't.  If someone wants that feature it is
> easy to add it.

Ack

>>>> function check_rlimit() {
>>>>     limit_name=$1
>>>>     limit_string=$2
>>>>     tgt=$3
>>>>
>>>>     echo -n "rlimit $limit_name: "
>>>>     input=$(grep "^$limit_string" /proc/$dmpid/limits)
>>> ...
>>>>     if [[ "$input" =~ ^$limit_string[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)[[:space:]]*[^[:space:]]+ ]] ; then
>>>
>>> Because of the unfortunate format of /proc/PID/limits, you do can't
>>> just do the
>>>     fields=($input)
>>> trick but
>>>     fields=(${input#*  })
>>
>> What will this do?
> 
> The expression
>   ${input#*  }
> is the contents of input with the shortest prefix matching `*  '
> stripped off the front.  That will eat all the words at the start,
> which are separated by one space each, and find the first pair of
> adjacent spaces.  So if input is
>   'Max processes             63603                63603                processes 
> then "${input#*  }" is
>   '           63603                63603                processes 
> 
> Expanding it without the surrounding " " causes it to be word-split in
> the usual way, producing
>   63603 63603 processes
> and then this is assigned to a new bash array variable `fields'
> which can then be indexed to find the values.

That's an idea.  I'm not sure I like it much better than using a regexp
(now that the runes have been written), but I'll think about it.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing
  2018-09-25 14:24         ` George Dunlap
@ 2018-09-26 10:57           ` Ian Jackson
  0 siblings, 0 replies; 39+ messages in thread
From: Ian Jackson @ 2018-09-26 10:57 UTC (permalink / raw)
  To: George Dunlap; +Cc: Anthony Perard, xen-devel, Stefano Stabellini, Wei Liu

George Dunlap writes ("Re: [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing"):
> On 09/25/2018 12:02 PM, Ian Jackson wrote:
> > If you don't say set -e then you need to wrap every everything in your
> > entire script with an error check.

(I have deleted the whole comparative programming languages
discussion.)

> > For example, you write
> > 
> >>>> dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid 2>/dev/null)
> >>>> if [[ -z "$dmpid" ]] ; then
> >>>>     echo "xenstore-read failed"
> >>>>     exit 1
> >>>> fi
...
> > but with set -e you can write only
> > 
> >     dmpid=$(xenstore-read /local/domain/$domid/image/device-model-pid)
> > 
> > and the subsequent if is not needed.
> 
> I think you misunderstood what I meant.  The original code looked
> something like this:

Yes.  You agreed to the removal of the 2>/dev/null.  My point here is
that you if you use set -e you can get rid of the entire error
clause.

> In the current code, printing the error message is obviously better than
> throwing it away.  But I still feel better checking the result and
> giving a sensible follow-up error message than having a failure silently
> exit, because I prefer to explicitly think about what happens when
> things fail (and remind the people reading the code to do so as well).

This approach is not very usual in shell scripting (at least, in
circles whose behaviour should be emulated)[1], and it leads to bugs.
Your script as originally presented is riddled with unchecked calls to
subprograms.

If you absolutely want to handle a particular error case specifically
then IMO it is much better to make a specific exception to that
particular case, by using either a set +e / set -e pair, or a
construct like if or ||.

> > The expression
> >   ${input#*  }
...
> That's an idea.  I'm not sure I like it much better than using a regexp
> (now that the runes have been written), but I'll think about it.

Fair enough.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-25 11:20 ` Anthony PERARD
@ 2018-09-28 11:37   ` George Dunlap
  2018-09-28 12:59     ` Anthony PERARD
  0 siblings, 1 reply; 39+ messages in thread
From: George Dunlap @ 2018-09-28 11:37 UTC (permalink / raw)
  To: Anthony Perard
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	Andrew Cooper, Tim Deegan, Ross Lagerwall, Julien Grall,
	Jan Beulich, xen-devel, Ian Jackson

On Tue, Sep 25, 2018 at 12:20 PM Anthony PERARD
<anthony.perard@citrix.com> wrote:
>
> On Fri, Sep 21, 2018 at 06:04:23PM +0100, George Dunlap wrote:
> > +## Migration
> > +
> > +When calling xen-save-devices-state, since QEMU is running in a chroot
> > +it is not useful to pass a filename (it doesn't even have write access
> > +inside the chroot). Instead, give it an open fd using the add-fd
> > +mechanism.
>
> That describe an issue only on save. The restore part of a migration
> also have an issue:
>
> On restore, QEMU signal to libxl that it is ready only after priviledge
> restriction have been put in place (and after or when it receive the
> migration data). But xenstore isn't available anymore, so restore fails
> from libxl point-of-view.
>
>
> Or maybe you describe here an issue that would arise when an hypothetic
> chroot is apply. But the migration issue describe here still apply
> without chroot and with only -runas, the path libxl give to QEMU is
> accessible by root only.

Do our other restrictions -- chroot, deprivilege, &c -- all work on
restore at the moment?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans
  2018-09-28 11:37   ` George Dunlap
@ 2018-09-28 12:59     ` Anthony PERARD
  0 siblings, 0 replies; 39+ messages in thread
From: Anthony PERARD @ 2018-09-28 12:59 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	Andrew Cooper, Tim Deegan, Ross Lagerwall, Julien Grall,
	Jan Beulich, xen-devel, Ian Jackson

On Fri, Sep 28, 2018 at 12:37:22PM +0100, George Dunlap wrote:
> On Tue, Sep 25, 2018 at 12:20 PM Anthony PERARD
> <anthony.perard@citrix.com> wrote:
> >
> > On Fri, Sep 21, 2018 at 06:04:23PM +0100, George Dunlap wrote:
> > > +## Migration
> > > +
> > > +When calling xen-save-devices-state, since QEMU is running in a chroot
> > > +it is not useful to pass a filename (it doesn't even have write access
> > > +inside the chroot). Instead, give it an open fd using the add-fd
> > > +mechanism.
> >
> > That describe an issue only on save. The restore part of a migration
> > also have an issue:
> >
> > On restore, QEMU signal to libxl that it is ready only after priviledge
> > restriction have been put in place (and after or when it receive the
> > migration data). But xenstore isn't available anymore, so restore fails
> > from libxl point-of-view.
> >
> >
> > Or maybe you describe here an issue that would arise when an hypothetic
> > chroot is apply. But the migration issue describe here still apply
> > without chroot and with only -runas, the path libxl give to QEMU is
> > accessible by root only.
> 
> Do our other restrictions -- chroot, deprivilege, &c -- all work on
> restore at the moment?

chroot isn't an issue for restore. But running QEMU as non-root user is
currently an issue for restore.

I think the only issue with restore that doesn't also exist when
creating a guest is restricted access to xenstore.

-- 
Anthony PERARD

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs
       [not found]     ` <CAFLBxZanne777R0WfJLuGqTv3dx=A+W0Z5UsW9B0i4MFMWWKXw@mail.gmail.com>
@ 2018-10-05 15:18       ` George Dunlap
  0 siblings, 0 replies; 39+ messages in thread
From: George Dunlap @ 2018-10-05 15:18 UTC (permalink / raw)
  To: Paul Durrant; +Cc: Anthony Perard, xen-devel, Wei Liu, Ian Jackson

[resending]
On Fri, Oct 5, 2018 at 4:17 PM George Dunlap <george.dunlap@citrix.com> wrote:
>
> On Mon, Sep 24, 2018 at 9:35 AM Paul Durrant <Paul.Durrant@citrix.com> wrote:
> > > +    {
> > > +        .resource = -1
> >
> > Is -1 guaranteed not to clash with any defined resource type?
>
> Hmm... well at the  moment /usr/include/bits/resource.h seems to have
> them defined as non-negative integers (i.e., starting at 0).  But
> there's nothing I can see which prevents a new one being defined as -1
> if someone in the Linux community decides to.  The POSIX merely
> defines this as a int, and I couldn't find anything which says it had
> to be non-negative in the spec.
>
> Linux seems to define RLIMIT_NLIMITS as the total number of limits so
> far defined; I'll use that instead.  (That also sort of implies they
> expect you to use that as an array sizer, and the various limits as
> indexes into the array, and thus be non-negative; but whatever.)
>
> > > +
> > > +    /* Set various "easy" rlimits */
> > > +    for (i=0; rlimits[i].resource != -1; i++) {
> >
> > Couldn't figure out whether this is a coding style violation or not but I think 'i=0' should be 'i = 0'
>
> Ack.
>
>  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2018-10-05 15:18 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-21 17:04 [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans George Dunlap
2018-09-21 17:04 ` [PATCH v2 2/6] test/depriv: Add a tool to check process-level depriv George Dunlap
2018-09-24 10:57   ` Ian Jackson
2018-09-24 14:45     ` George Dunlap
2018-09-24 11:14   ` Anthony PERARD
2018-09-21 17:04 ` [PATCH v2 3/6] tools/dm_restrict: Ask QEMU to chroot George Dunlap
2018-09-24  8:20   ` Paul Durrant
2018-09-24  9:50     ` George Dunlap
2018-09-24  9:52       ` Paul Durrant
2018-09-24 10:31   ` Ian Jackson
2018-09-21 17:04 ` [PATCH v2 4/6] tools/dm_restrict: Unshare mount and IPC namespaces on Linux George Dunlap
2018-09-24  8:24   ` Paul Durrant
2018-09-24 10:40   ` Ian Jackson
2018-09-24 14:22     ` George Dunlap
2018-09-25 10:55       ` Ian Jackson
2018-09-21 17:04 ` [PATCH v2 5/6] tools/dm_depriv: Add first cut RLIMITs George Dunlap
2018-09-24  8:34   ` Paul Durrant
     [not found]     ` <CAFLBxZanne777R0WfJLuGqTv3dx=A+W0Z5UsW9B0i4MFMWWKXw@mail.gmail.com>
2018-10-05 15:18       ` George Dunlap
2018-09-24 10:48   ` Ian Jackson
2018-09-24 14:41     ` George Dunlap
2018-09-25 11:03       ` Ian Jackson
2018-09-21 17:04 ` [PATCH v2 6/6] RFC: tools/dm_restrict: Enable QEMU sandboxing George Dunlap
2018-09-24 10:49   ` Ian Jackson
2018-09-24 11:16     ` George Dunlap
2018-09-24 13:04       ` Ian Jackson
2018-09-24 13:43         ` George Dunlap
2018-09-25 10:46           ` Ian Jackson
2018-09-24 11:21   ` Ian Jackson
2018-09-24 14:28     ` George Dunlap
2018-09-25 11:02       ` Ian Jackson
2018-09-25 14:24         ` George Dunlap
2018-09-26 10:57           ` Ian Jackson
2018-09-24  8:14 ` [PATCH v2 1/6] docs/qemu-deprivilege: Revise and update with status and future plans Paul Durrant
2018-09-24 10:23 ` Ian Jackson
2018-09-24 13:08   ` George Dunlap
2018-09-25 10:44     ` Ian Jackson
2018-09-25 11:20 ` Anthony PERARD
2018-09-28 11:37   ` George Dunlap
2018-09-28 12:59     ` Anthony PERARD

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.