QEMU-Devel Archive on lore.kernel.org
From: Andy Lutomirski <luto@kernel.org>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>,
	KVM list <kvm@vger.kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	ghammer@redhat.com, "Weiss, Radu" <raduweis@amazon.com>,
	Qemu Developers <qemu-devel@nongnu.org>,
	"open list:VIRTIO GPU DRIVER"
	<virtualization@lists.linux-foundation.org>,
	Pavel Machek <pavel@ucw.cz>, Jonathan Corbet <corbet@lwn.net>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Eric Biggers <ebiggers@kernel.org>,
	"Singh, Balbir" <sblbir@amazon.com>,
	bonzini@gnu.org, "Graf (AWS), Alexander" <graf@amazon.de>,
	Michal Hocko <mhocko@kernel.org>, Jann Horn <jannh@google.com>,
	oridgar@gmail.com, "Catangiu, Adrian Costin" <acatan@amazon.com>,
	Andy Lutomirski <luto@kernel.org>,
	Colm MacCarthaigh <colmmacc@amazon.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>, Willy Tarreau <w@1wt.eu>,
	"Woodhouse, David" <dwmw@amazon.co.uk>
Subject: Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
Date: Sun, 18 Oct 2020 09:14:00 -0700
Message-ID: <CALCETrUeRAhmEFR6EFXz8HzDYd2doZ2TMyZmu1pU_-yAPA6KDw@mail.gmail.com>
In-Reply-To: <20201018115524-mutt-send-email-mst@kernel.org>

On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > > 4c. The guest kernel maintains an array of physical addresses that are
> > > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > > location through whatever protocol, and before resuming a
> > > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > > memzeroing this memory. The huge pro here would be that this
> > > > eliminates all races, and reduces complexity quite a bit, because the
> > > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > > with this, and it can even optimize things like on-disk memory
> > > > snapshots to simply not write out those pages to disk.
> > > >
> > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > > of kernel complexity.
> > >
> > > Clearly this has a chance to break applications, right?
> > > If there's an app that uses this as a non-system-calls way
> > > to find out whether there was a fork, it will break
> > > when wipe triggers without a fork ...
> > > For example, imagine:
> > >
> > > MADV_WIPEONFORK
> > > copy secret data to MADV_DONTFORK
> > > fork
> > >
> > >
> > > used to work, with this change it gets 0s instead of the secret data.
> > >
> > >
> > > I am also not sure it's wise to expose each guest process
> > > to the hypervisor like this. E.g. each process needs a
> > > guest physical address of its own then. This is a finite resource.
> > >
> > >
> > > The mmap interface proposed here is somewhat baroque, but it is
> > > certainly simple to implement ...
> >
> > Wipe of fork/vmgenid/whatever could end up being much more problematic
> > than it naively appears -- it could be wiped in the middle of a read.
> > Either the API needs to handle this cleanly, or we need something more
> > aggressive like signal-on-fork.
> >
> > --Andy
>
>
> Right, it's not on fork, it's actually when process is snapshotted.
>
> If we assume it's CRIU we care about, then I
> wonder what's wrong with something like
> MADV_CHANGEONPTRACE_SEIZE
> and basically say it's X bytes which change the value...

I feel like we may be approaching this from the wrong end.  Rather
than saying "what data structure can the kernel expose that might
plausibly be useful", how about we try identifying some specific
userspace needs and see what a good solution could look like?  I can
identify two major cryptographic use cases:

1. A userspace RNG.  The API exposed by the userspace end is a
function that generates random numbers.  The userspace code in turn
wants to know some things from the kernel: it wants some
best-quality-available random seed data from the kernel (and possibly
an indication of how good it is) as well as an indication of whether
the userspace memory may have been cloned or rolled back, or, failing
that, an indication of whether a reseed is needed.  Userspace could
implement a wide variety of algorithms on top depending on its goals
and compliance requirements, but the end goal is for the userspace
part to be very, very fast.

2. A userspace crypto stack that wants to avoid shooting itself in the
foot due to inadvertently doing the same thing twice.  For example, an
AES-GCM stack does not want to reuse an IV, *especially* if there is
even the slightest chance that it might reuse the IV for different
data.  This use case doesn't necessarily involve random numbers, but,
if anything, it needs to be even faster than #1.
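
To make use case #1 concrete, here is a minimal sketch in C of the
fast path such a userspace RNG might have.  Everything named here is
hypothetical: fake_kernel_generation stands in for some
kernel-provided indication that memory may have been cloned or rolled
back, fake_kernel_seed() stands in for fetching seed data from the
kernel, and the xorshift step is only a placeholder, not a
cryptographic generator.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a kernel-exported word that changes whenever
 * this memory may have been cloned or rolled back.  No such interface is
 * defined in this thread; it is an assumption of the sketch. */
static volatile uint64_t fake_kernel_generation = 1;

/* Hypothetical stand-in for fetching best-available seed data from the
 * kernel.  It returns a deterministic sequence so the sketch is testable. */
static uint64_t fake_kernel_seed(void)
{
    static uint64_t next = 0x9e3779b97f4a7c15ULL;
    next += 0x9e3779b97f4a7c15ULL;
    return next;
}

struct user_rng {
    uint64_t state;     /* placeholder state, NOT a real CSPRNG */
    uint64_t seen_gen;  /* generation observed at the last reseed */
    unsigned reseeds;   /* how many times we have reseeded */
};

static void user_rng_reseed(struct user_rng *r)
{
    r->seen_gen = fake_kernel_generation;
    r->state = fake_kernel_seed();
    r->reseeds++;
}

static void user_rng_init(struct user_rng *r)
{
    memset(r, 0, sizeof(*r));
    user_rng_reseed(r);
}

/* Fast path: one load and one compare before producing output. */
static uint64_t user_rng_next(struct user_rng *r)
{
    if (r->seen_gen != fake_kernel_generation)
        user_rng_reseed(r);

    /* xorshift64 step, used only as a placeholder output function */
    r->state ^= r->state << 13;
    r->state ^= r->state >> 7;
    r->state ^= r->state << 17;
    return r->state;
}
```

The common case costs one load and one compare, which is the property
that matters for speed.  The sketch does not handle the generation
changing between the check and the use of the output; that is the
race raised earlier in the thread.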

The threats here are not really the same.  For #1, a userspace RNG
should be able to recover from a scenario in which an adversary clones
the entire process *and gets to own the clone*.  For example, in
Android, an adversary can often gain complete control of a fork of the
zygote -- this shouldn't adversely affect the security properties of
other forks.  Similarly, a server farm could operate by having one
booted server that is cloned to create more workers.  Those clones
could be provisioned with secrets and permissions post-clone, and an
attacker gaining control of a fresh clone could be considered
acceptable.  For #2, in contrast, if an adversary gains control of a
clone of an AES-GCM session, they learn the key outright -- the
relevant attack scenario is that the adversary gets to interact with
two clones without compromising either clone per se.
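
For the #2 threat (two clones of one AES-GCM session), a sketch of
what the guard could look like follows.  It assumes the same kind of
hypothetical generation word as an indication of cloning, and it
picks one conservative policy, refusing to hand out further IVs until
the caller rekeys.  That policy and all names below are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a kernel-provided "may have been cloned"
 * indication; not an existing interface. */
static volatile uint64_t fake_generation = 1;

struct gcm_session {
    uint64_t seen_gen;   /* generation observed when the key was set up */
    uint64_t counter;    /* next IV counter value under the current key */
    int needs_rekey;     /* sticky: set once a clone may have happened */
};

static void gcm_session_init(struct gcm_session *s)
{
    s->seen_gen = fake_generation;
    s->counter = 0;
    s->needs_rekey = 0;
}

/* Writes a 96-bit IV (4 zero bytes + 64-bit big-endian counter).
 * Returns 0 on success, -1 if the session must be rekeyed first. */
static int gcm_next_nonce(struct gcm_session *s, uint8_t nonce[12])
{
    if (s->seen_gen != fake_generation)
        s->needs_rekey = 1;   /* a clone may be using the same counter */
    if (s->needs_rekey || s->counter == UINT64_MAX)
        return -1;

    memset(nonce, 0, 4);
    uint64_t c = s->counter++;
    for (int i = 0; i < 8; i++)
        nonce[4 + i] = (uint8_t)(c >> (56 - 8 * i));
    return 0;
}
```

Both clones see the changed generation and both stop, so neither one
emits an IV the other might also emit under the same key.  As with
the RNG case, a check that passes just before the clone happens is
not covered by this sketch.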

It's worth noting that, in both cases, there could possibly be more
than one instance of an RNG or an AES-GCM session in the same process.
This means that using signals is awkward but not necessarily
impossible.  (This is an area in which Linux, and POSIX in general, is
much weaker than Windows.)
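
To illustrate the awkwardness: a signal disposition is process-wide,
so a single handler has to serve every RNG and every session in the
process.  One way to bridge that, sketched below, is for the handler
to bump a shared epoch that each instance compares against on its own.
SIGUSR1 is only a stand-in here for a hypothetical clone
notification, and the names are made up for the example.

```c
#include <assert.h>
#include <signal.h>

/* Process-wide epoch.  Only the signal handler writes it. */
static volatile sig_atomic_t clone_epoch = 0;

/* The handler does nothing except bump the epoch, which keeps it
 * async-signal-safe. */
static void on_clone_signal(int signo)
{
    (void)signo;
    clone_epoch++;
}

/* SIGUSR1 is a placeholder for a hypothetical notification signal.
 * Returns 0 on success, -1 on failure. */
static int install_clone_handler(void)
{
    return signal(SIGUSR1, on_clone_signal) == SIG_ERR ? -1 : 0;
}

/* Each RNG or crypto session embeds one of these; no registration with
 * the handler is needed. */
struct epoch_watcher {
    sig_atomic_t seen;
};

static void watcher_init(struct epoch_watcher *w)
{
    w->seen = clone_epoch;
}

/* Returns 1 once per instance after the epoch has moved, else 0. */
static int watcher_check_and_ack(struct epoch_watcher *w)
{
    sig_atomic_t now = clone_epoch;

    if (w->seen == now)
        return 0;
    w->seen = now;
    return 1;
}
```

This still requires one component to own the signal for the whole
process, and any library that installs its own handler for the same
signal silently breaks the others, which is the awkward part.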


