Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans

From: Ian Jackson <ian.jackson@citrix.com>
To: George Dunlap <george.dunlap@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wei.liu2@citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Tim Deegan <tim@xen.org>,
	Ross Lagerwall <ross.lagerwall@citrix.com>,
	Julien Grall <julien.grall@arm.com>,
	Jan Beulich <jbeulich@suse.com>,
	Anthony Perard <anthony.perard@citrix.com>,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
Date: Tue, 27 Mar 2018 14:33:11 +0100	[thread overview]
Message-ID: <23226.18455.602635.161530@mariner.uk.xensource.com> (raw)
In-Reply-To: <87f60317-7913-a3ad-58eb-59928b8d7d61@citrix.com>

George Dunlap writes ("Re: [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans"):
> Actually I think most of the user-facing stuff already in xl.cfg is
> inappropriate for that man page.  It might make sense to have a separate
> man page for it.

I wouldn't object to that.

> On 03/26/2018 05:43 PM, Ian Jackson wrote:
> > No.  Firstly, in each case, all relevant descriptors are restricted.
> > This is the purpose of the xentoolcore__restrict_* stuff.  Secondly,
> > xenstore *is* covered - but the xs fd is squashed so as to be totally
> > unuseable: xs.c uses xentoolcore__restrict_by_dup2_null.
> 
> Ross already gave me some corrections on this; here is what I have:
> 
> 8<---
> '''Description''': Close and restrict Xen-related file descriptors.
> Specifically:
>  * Close all xenstore-related file descriptors
>  * Make sure that extraneous `privcmd` and `evtchn` instances are
> closed
>  * Make sure that all open instances of `privcmd` and `evtchn` file
> descriptors have had IOCTL_PRIVCMD_RESTRICT and
> IOCTL_EVTCHN_RESTRICT_DOMID ioctls called on them, respectively.
> --->8
> 
> It sounds like the last may be inaccurate for libxl?

I don't think anything closes any extraneous fds.  My approach in
the libxc layer was to register all fds and have the restrict call
iterate over all of them.  So, I guess, drop your 2nd bullet.

All of this is done by qemu calling xentoolcore_restrict_all, not by
libxl.

Maybe the "extraneous privcmd and evtchn fds" Ross means are ones
inherited by qemu from the toolstack.  Hrm.  Now I want to try to make
an argument that no such fds are leaked into qemu subprocesses by
libxl but I don't think it's trivial to do so.  It may not be true
even in the usual case but I also worry about races in higher layers
that call libxl in multiple threads with multiple ctx's concurrently
and which might therefore end up forking for a qemu while another
thread is opening privcmd.

Sorting this out needs to go on your todo list :-/.

> >> +### Namespaces for unused functionality
...
> >> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> > 
> > This would mean we would have to pass qemu fds for both the network
> > tap devices and any vnc consoles.  That makes life considerably more
> > complicated.  I think we should perhaps revisit this upstream.
> 
> You haven't read this very carefully (nor do you seem to have read the
> discussion with Ross before responding).  I split the "Namespaces"
> restriction Ross suggested internally to us into two parts:
>  - Namespace restrictions for unused functionality
>  - Network namespacing

Oh, sorry.  Yes.

> In fact, I was wondering if it might make sense to do *all* the
> deprivileging in a separate process before starting qemu.  That would
> make it simple, for example, to write a test program which would try to
> break out of the "jail" we'd put it in.

I think this would be a lot of trouble because we'd have to pass
pre-restricted privcmd fds into qemu and stuff them into libxc et al.
It would mean fixing a lot of redundant handle opening which is
currently harmless.

> >> +Other things that would take some cleverness / changes to QEMU to
> >> +utilize due to ordering constrants:
> >> + - RLIMIT_NPROC (after uid changes to a unique uid)
> >> + - RLIMIT_NOFILES (after all necessary files are opened)
> > 
> > I think there is little difficulty with RLIMIT_NPROC since our qemu
> > does not fork.  I think we can set it to a value which is currently
> > violated for the current uid ?
> 
> Well AFAICT classic POSIX allows you to set rlimits on yourself, but not
> on another process.  Since this is on the *user id* rather than the
> *process*, I didn't think "setrlimit [as root] -> exec -> setuid" would
> work correctly; I assumed you'd have to have "exec -> setuid ->
> setrlimit", which would require further changes to QEMU.

rlimits are a process property.

I'm quite alarmed by the patch by Neil Brown in your first link that
setuid should fail if it would cause RLIMIT_NPROC to be violated for
this process in the context of the new uid !  The difficulties
discussed on your 2nd link were immediately obvious to me on reading
the first one.

AFAICT from the article Linux now does not check RLIMIT_NPROC during
setuid, nor during execve.  I think that is correct behaviour.

> I now realize that it might be that the limit will follow the current
> uid of the process, in which case "setrlimit -> setuid" might have the
> expected behavior.  But a quick Google search shows that the interaction
> of RLIMIT_NPROC and setuid() is tricky[1][2], and may vary from
> implementation to implementation; relying on the interaction to be
> correct (and stay correct) seems somewhat risky (unless POSIX has
> explicitly documented what should happen in that case, which again a
> quick Google search hasn't turned up).

RLIMIT_NPROC is not in SuS:
 http://pubs.opengroup.org/onlinepubs/9699919799/functions/setrlimit.html

FreeBSD does not have any of this madness:
 https://www.unix.com/man-page/freebsd/2/setuid/

Anyway IMO we should set RLIMIT_NPROC after fork and before execve.
If this causes setuid to fail in qemu, qemu will crash.  But this
could only be the case if the new uid already has other processes,
which it shouldn't do.  (If it does, the new uid is probably
contaminated by a previous domain with the same domid.)

If the kernel is a version of Linux with the execve RLIMIT_NPROC
check, then execve will fail because there will be many more processes
than root's RLIMIT_NPROC.  I propose to treat that as a bug in Linux.

> Linux does seem to have a "set rlimit on another process" system call
> (prlimit).  But that would still require at least a little bit of care,
> as then we'd need to set the limit after the setuid but before the guest
> started running.  And in any case I couldn't (again in a quick search)
> discover that FreeBSD has such a system call (and working correctly on
> FreeBSD seems to be a design goal).

Yes.

> >> +### libxl UID cleanup
> > ...
> >> +kill(-1,sig) sends a signal to "every process to which the calling
> >> +process has permission to send a signal".  So in theory:
> >> +  setuid(X)
> >> +  kill(-1,KILL)
> >> +should do the trick.
> > 
> > We need to check whether a malicious qemu process could kill this
> > one.
> 
> Hmm, in theory it probably could.

I could find nothing in SuS explaining when process A may send signals
to process B.  So I resorted to the BSD manpages:

  https://www.unix.com/man-page/freebsd/2/kill/

  For a process to have permission to send a signal to a process
  designated by pid, the user must be the super-user, or the real or
  saved user ID of the receiving process must match the real or
  effective user ID of the sending process.

Also kill(-1,) does not signal the sender.

I did some web searches hoping to find someone had already solved
this.  The best discussion is here
  https://www.unix.com/unix-for-advanced-and-expert-users/168364-kill-all-process-uid.html
but the proposed answer doesn't work because the kill would kill lots
of root processes.

My analysis:

                                          euid   ruid   suid

  compared for receiving process                 ====   ====
  compared for sending process            ====   ====

  possible rogue processes                qid    qid    qid

  unrelated rootish processes             0      any    any
     or                                   any    0      any
     or                                   any    any    0

  killer (our pet issuing kill)
     so rogues can't kill killer                 !=qid  !=qid
     so it doesn't kill randomly          !=0    !=0
     therefore:                           qid    pet    0

We will have to reserve one uid `pet' specially for this nonsense.  We
call setresuid(2) to switch uids, and then kill(-1,9) and _exit(0).
The kill() call will kill other processes with uid `pet'.  So we take
a global lock while we do this.

`pet' could be a uid associated with a reserved domid, eg dom0.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel