All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, linux-hyperv@vger.kernel.org,
	linux-crypto@vger.kernel.org, graf@amazon.com,
	mikelley@microsoft.com, gregkh@linuxfoundation.org,
	adrian@parity.io, lersek@redhat.com, berrange@redhat.com,
	linux@dominikbrodowski.net, jannh@google.com, rafael@kernel.org,
	len.brown@intel.com, pavel@ucw.cz, linux-pm@vger.kernel.org,
	colmmacc@amazon.com, tytso@mit.edu, arnd@arndb.de
Subject: Re: propagating vmgenid outward and upward
Date: Tue, 1 Mar 2022 11:21:38 -0500	[thread overview]
Message-ID: <20220301111459-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <Yh4+9+UpanJWAIyZ@zx2c4.com>

On Tue, Mar 01, 2022 at 04:42:47PM +0100, Jason A. Donenfeld wrote:
> Hey folks,
> 
> Having finally wrapped up development of the initial vmgenid driver, I
> thought I'd pull together some thoughts on vmgenid, notification, and
> propagating, from disjointed conversations I've had with a few of you
> over the last several weeks.
> 
> The basic problem is: VMs can be cloned, forked, rewound, or
> snapshotted, and when this happens, a) the RNG needs to reseed itself,
> and b) cryptographic algorithms that are not reuse resistant need to
> reinitialize in one way or another. For 5.18, we're handling (a) via the
> new vmgenid driver, which implements a spec from Microsoft, whereby the
> driver receives ACPI notifications when a 16 byte unique value changes.
> 
> The vmgenid driver basically works, though it is racy, because that ACPI
> notification can arrive after the system is already running again. This
> race is even worse on Windows, where they kick the notification into a
> worker thread, which then publishes it upward elsewhere to another async
> mechanism, and eventually it hits the RNG and various userspace apps.
> On Linux it's not that bad -- we reseed immediately upon receiving the
> notification -- but it still inherits this same "push"-model deficiency,
> which a "pull"-model would not have.
> 
> If we had a "pull" model, rather than just expose a 16-byte unique
> identifier, the vmgenid virtual hardware would _also_ expose a
> word-sized generation counter, which would be incremented every time the
> unique ID changed. Then, every time we would touch the RNG, we'd simply
> do an inexpensive check of this memremap()'d integer, and reinitialize
> with the unique ID if the integer changed. In this way, the race would
> be entirely eliminated. We would then be able to propagate this outwards
> to other drivers, by just exporting an extern symbol, in the manner of
> `jiffies`, and propagate it upwards to userspace, by putting it in the
> vDSO, in the manner of gettimeofday. And like that, there'd be no
> terrible async thing and things would work pretty easily.

I am not sure what the difference is though. So we have a 16 byte unique
value and you would prefer a dword counter. How is the former not a
superset of the later?  I'm not sure how safe it is to expose it to
userspace specifically, but rest of text talks about exposing it to a
kernel driver so maybe not an issue? So what makes interrupt driven
required, and why not just remap and read existing vmgenid in the pull
manner?  What did I miss?


> But that's not what we have, because Microsoft didn't collaborate with
> anybody on this, and now it's implemented in several hypervisors. Given
> that I'm already spending considerable time working on the RNG, entirely
> without funding, somehow I'm not super motivated to lead a
> cross-industry political effort to change Microsoft's vmgenid spec.
> Maybe somebody else has an appetite for this, but either way, those
> changes would be several years off at best.
> 
> So given we have a "push"-model mechanism, there are two problems to
> tackle, perhaps in the same way, perhaps in a different way:
> 
> A) Outwards propagation toward other kernel drivers: in this case, I
>    have in mind WireGuard, naturally, which very much needs to clear its
>    existing sessions when VMs are forked.
> 
> B) Upwards propagation to userspace: in this case, we handle the
>    concerns of the Amazon engineers on this thread who broached this
>    topic a few years ago, in which s2n, their TLS library, wants to
>    reinitialize its userspace RNG (a silly thing, but I digress) and
>    probably clear session keys too, for the same good reason as
>    WireGuard.
> 
> For (A), at least wearing my WireGuard-maintainer hat, there is an easy
> way and there is a "race-free" way. I use scare quotes there because
> we're still in a "push"-model, which means it's still racy no matter
> what.
> 
> The faux "race-free" way involves having `extern u32 rng_vm_generation;`
> or similar in random.h, and then everything that generates a session key
> would snapshot this value, and every time a session key is used, a
> comparison would be made. This works, but given that we're going to be
> racy no matter what, I think I'd prefer avoiding the extra code in the
> hot path and extra per-session storage. It seems like that'd involve a
> lot of fiddly engineering for no real world benefit.
> 
> The easy way, and the way that I think I prefer, would be to just have a
> sync notifier_block for this, just like we have with
> register_pm_notifier(). From my perspective, it'd be simplest to just
> piggy back on the already existing PM notifier with an extra event,
> PM_POST_VMFORK, which would join the existing set of 7, following
> PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
> don't want to play ball, we could always come up with our own
> notifier_block. But I don't see the need. Plus, WireGuard *already*
> uses the PM notifier for clearing keys, so code-wise for my use case,
> that'd amount adding another case for PM_POST_VMFORK, in addition to the
> currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
> which all would be treated the same way. Ezpz. So if that sounds like an
> interesting thing to the PM people, I think I'd like to propose a patch
> for that, possibly even for 5.18, given that it'd be very straight-
> forward.
> 
> For (B), it's a little bit trickier. But I think our options follow the
> same rubric. We can expose a generation counter in the vDSO, with
> semantics akin to the extern integer I described above. Or we could
> expose that counter in a file that userspace could poll() on and receive
> notifications that way. Or perhaps a third way. I'm all ears here.
> Alex's team from Amazon last year proposed something similar to the vDSO
> idea, except using mmap on a sysfs file, though from what I can tell,
> that wound up being kind of complicated. Due to the fact that we're
> _already_ racy, I think I'm most inclined at this point toward the
> poll() approach for the same reasons as I prefer a notifier_block. But
> on userspace I could be convinced otherwise, and I'd be interested in
> totally different ideas here too.
> 
> Another thing I should note is that, while I'm not currently leaning
> toward it, the vDSO approach also ties into interesting discussions
> about userspace RNGs (generally a silly idea), and their need for things
> like fork detection and also learning when the kernel RNG was last
> reseeded. So cracking open the vDSO book might invite all sorts of other
> interesting questions and discussions, which may be productive or may be
> a humongous distraction. (Also, again, I'm not super enthusiastic about
> userspace RNGs.)
> 
> Also, there is an interesting question to decide with regards to
> userspace, which is whether the vmgenid driver should expose its unique
> ID to userspace, as Alex requested on an earlier thread. I am actually
> sort of opposed to this. That unique ID may or may not be secret and
> entropic; if it isn't, the crypto is designed to not be impacted
> negatively, but if it is, we should keep it secret. So, rather, I think
> the correct flow is that userspace simply calls getrandom() upon
> learning that the VM forked, which is guaranteed to have been
> reinitialized already by add_vmfork_randomness(), and that will
> guarantee a value that is unique to the VM, without having to actually
> expose that value.
> 
> So, anyway, this is more or less where my thinking on this matter is.
> Would be happy to hear some fresh ideas here too.
> 
> Regards,
> Jason


WARNING: multiple messages have this Message-ID (diff)
From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: len.brown@intel.com, linux-hyperv@vger.kernel.org,
	colmmacc@amazon.com, berrange@redhat.com, adrian@parity.io,
	kvm@vger.kernel.org, jannh@google.com,
	gregkh@linuxfoundation.org, linux-pm@vger.kernel.org,
	rafael@kernel.org, linux-kernel@vger.kernel.org,
	linux@dominikbrodowski.net, qemu-devel@nongnu.org,
	graf@amazon.com, linux-crypto@vger.kernel.org, pavel@ucw.cz,
	tytso@mit.edu, mikelley@microsoft.com, lersek@redhat.com,
	arnd@arndb.de
Subject: Re: propagating vmgenid outward and upward
Date: Tue, 1 Mar 2022 11:21:38 -0500	[thread overview]
Message-ID: <20220301111459-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <Yh4+9+UpanJWAIyZ@zx2c4.com>

On Tue, Mar 01, 2022 at 04:42:47PM +0100, Jason A. Donenfeld wrote:
> Hey folks,
> 
> Having finally wrapped up development of the initial vmgenid driver, I
> thought I'd pull together some thoughts on vmgenid, notification, and
> propagating, from disjointed conversations I've had with a few of you
> over the last several weeks.
> 
> The basic problem is: VMs can be cloned, forked, rewound, or
> snapshotted, and when this happens, a) the RNG needs to reseed itself,
> and b) cryptographic algorithms that are not reuse resistant need to
> reinitialize in one way or another. For 5.18, we're handling (a) via the
> new vmgenid driver, which implements a spec from Microsoft, whereby the
> driver receives ACPI notifications when a 16 byte unique value changes.
> 
> The vmgenid driver basically works, though it is racy, because that ACPI
> notification can arrive after the system is already running again. This
> race is even worse on Windows, where they kick the notification into a
> worker thread, which then publishes it upward elsewhere to another async
> mechanism, and eventually it hits the RNG and various userspace apps.
> On Linux it's not that bad -- we reseed immediately upon receiving the
> notification -- but it still inherits this same "push"-model deficiency,
> which a "pull"-model would not have.
> 
> If we had a "pull" model, rather than just expose a 16-byte unique
> identifier, the vmgenid virtual hardware would _also_ expose a
> word-sized generation counter, which would be incremented every time the
> unique ID changed. Then, every time we would touch the RNG, we'd simply
> do an inexpensive check of this memremap()'d integer, and reinitialize
> with the unique ID if the integer changed. In this way, the race would
> be entirely eliminated. We would then be able to propagate this outwards
> to other drivers, by just exporting an extern symbol, in the manner of
> `jiffies`, and propagate it upwards to userspace, by putting it in the
> vDSO, in the manner of gettimeofday. And like that, there'd be no
> terrible async thing and things would work pretty easily.

I am not sure what the difference is though. So we have a 16 byte unique
value and you would prefer a dword counter. How is the former not a
superset of the later?  I'm not sure how safe it is to expose it to
userspace specifically, but rest of text talks about exposing it to a
kernel driver so maybe not an issue? So what makes interrupt driven
required, and why not just remap and read existing vmgenid in the pull
manner?  What did I miss?


> But that's not what we have, because Microsoft didn't collaborate with
> anybody on this, and now it's implemented in several hypervisors. Given
> that I'm already spending considerable time working on the RNG, entirely
> without funding, somehow I'm not super motivated to lead a
> cross-industry political effort to change Microsoft's vmgenid spec.
> Maybe somebody else has an appetite for this, but either way, those
> changes would be several years off at best.
> 
> So given we have a "push"-model mechanism, there are two problems to
> tackle, perhaps in the same way, perhaps in a different way:
> 
> A) Outwards propagation toward other kernel drivers: in this case, I
>    have in mind WireGuard, naturally, which very much needs to clear its
>    existing sessions when VMs are forked.
> 
> B) Upwards propagation to userspace: in this case, we handle the
>    concerns of the Amazon engineers on this thread who broached this
>    topic a few years ago, in which s2n, their TLS library, wants to
>    reinitialize its userspace RNG (a silly thing, but I digress) and
>    probably clear session keys too, for the same good reason as
>    WireGuard.
> 
> For (A), at least wearing my WireGuard-maintainer hat, there is an easy
> way and there is a "race-free" way. I use scare quotes there because
> we're still in a "push"-model, which means it's still racy no matter
> what.
> 
> The faux "race-free" way involves having `extern u32 rng_vm_generation;`
> or similar in random.h, and then everything that generates a session key
> would snapshot this value, and every time a session key is used, a
> comparison would be made. This works, but given that we're going to be
> racy no matter what, I think I'd prefer avoiding the extra code in the
> hot path and extra per-session storage. It seems like that'd involve a
> lot of fiddly engineering for no real world benefit.
> 
> The easy way, and the way that I think I prefer, would be to just have a
> sync notifier_block for this, just like we have with
> register_pm_notifier(). From my perspective, it'd be simplest to just
> piggy back on the already existing PM notifier with an extra event,
> PM_POST_VMFORK, which would join the existing set of 7, following
> PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
> don't want to play ball, we could always come up with our own
> notifier_block. But I don't see the need. Plus, WireGuard *already*
> uses the PM notifier for clearing keys, so code-wise for my use case,
> that'd amount adding another case for PM_POST_VMFORK, in addition to the
> currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
> which all would be treated the same way. Ezpz. So if that sounds like an
> interesting thing to the PM people, I think I'd like to propose a patch
> for that, possibly even for 5.18, given that it'd be very straight-
> forward.
> 
> For (B), it's a little bit trickier. But I think our options follow the
> same rubric. We can expose a generation counter in the vDSO, with
> semantics akin to the extern integer I described above. Or we could
> expose that counter in a file that userspace could poll() on and receive
> notifications that way. Or perhaps a third way. I'm all ears here.
> Alex's team from Amazon last year proposed something similar to the vDSO
> idea, except using mmap on a sysfs file, though from what I can tell,
> that wound up being kind of complicated. Due to the fact that we're
> _already_ racy, I think I'm most inclined at this point toward the
> poll() approach for the same reasons as I prefer a notifier_block. But
> on userspace I could be convinced otherwise, and I'd be interested in
> totally different ideas here too.
> 
> Another thing I should note is that, while I'm not currently leaning
> toward it, the vDSO approach also ties into interesting discussions
> about userspace RNGs (generally a silly idea), and their need for things
> like fork detection and also learning when the kernel RNG was last
> reseeded. So cracking open the vDSO book might invite all sorts of other
> interesting questions and discussions, which may be productive or may be
> a humongous distraction. (Also, again, I'm not super enthusiastic about
> userspace RNGs.)
> 
> Also, there is an interesting question to decide with regards to
> userspace, which is whether the vmgenid driver should expose its unique
> ID to userspace, as Alex requested on an earlier thread. I am actually
> sort of opposed to this. That unique ID may or may not be secret and
> entropic; if it isn't, the crypto is designed to not be impacted
> negatively, but if it is, we should keep it secret. So, rather, I think
> the correct flow is that userspace simply calls getrandom() upon
> learning that the VM forked, which is guaranteed to have been
> reinitialized already by add_vmfork_randomness(), and that will
> guarantee a value that is unique to the VM, without having to actually
> expose that value.
> 
> So, anyway, this is more or less where my thinking on this matter is.
> Would be happy to hear some fresh ideas here too.
> 
> Regards,
> Jason



  parent reply	other threads:[~2022-03-01 16:21 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
2022-03-01 16:15 ` Laszlo Ersek
2022-03-01 16:28   ` Jason A. Donenfeld
2022-03-01 16:28     ` Jason A. Donenfeld
2022-03-01 17:17     ` Michael S. Tsirkin
2022-03-01 17:17       ` Michael S. Tsirkin
2022-03-01 18:37       ` Jason A. Donenfeld
2022-03-01 18:37         ` Jason A. Donenfeld
2022-03-02  7:42         ` Michael S. Tsirkin
2022-03-02  7:42           ` Michael S. Tsirkin
2022-03-02  7:48           ` Michael S. Tsirkin
2022-03-02  7:48             ` Michael S. Tsirkin
2022-03-02  8:30         ` Michael S. Tsirkin
2022-03-02  8:30           ` Michael S. Tsirkin
2022-03-02 11:26           ` Jason A. Donenfeld
2022-03-02 11:26             ` Jason A. Donenfeld
2022-03-02 12:58             ` Michael S. Tsirkin
2022-03-02 12:58               ` Michael S. Tsirkin
2022-03-02 13:55               ` Jason A. Donenfeld
2022-03-02 13:55                 ` Jason A. Donenfeld
2022-03-02 14:46                 ` Michael S. Tsirkin
2022-03-02 14:46                   ` Michael S. Tsirkin
2022-03-02 15:14                   ` Jason A. Donenfeld
2022-03-02 15:14                     ` Jason A. Donenfeld
2022-03-02 15:20                     ` Michael S. Tsirkin
2022-03-02 15:20                       ` Michael S. Tsirkin
2022-03-02 15:36                       ` Jason A. Donenfeld
2022-03-02 15:36                         ` Jason A. Donenfeld
2022-03-02 16:22                         ` Michael S. Tsirkin
2022-03-02 16:22                           ` Michael S. Tsirkin
2022-03-02 16:32                           ` Jason A. Donenfeld
2022-03-02 16:32                             ` Jason A. Donenfeld
2022-03-02 17:27                             ` Michael S. Tsirkin
2022-03-02 17:27                               ` Michael S. Tsirkin
2022-03-03 13:07                             ` Michael S. Tsirkin
2022-03-03 13:07                               ` Michael S. Tsirkin
2022-03-02 16:29                         ` Michael S. Tsirkin
2022-03-02 16:29                           ` Michael S. Tsirkin
2022-03-01 16:21 ` Michael S. Tsirkin [this message]
2022-03-01 16:21   ` Michael S. Tsirkin
2022-03-01 16:35   ` Jason A. Donenfeld
2022-03-01 16:35     ` Jason A. Donenfeld
2022-03-01 18:01 ` Greg KH
2022-03-01 18:01   ` Greg KH
2022-03-01 18:24   ` Jason A. Donenfeld
2022-03-01 18:24     ` Jason A. Donenfeld
2022-03-01 19:41     ` Greg KH
2022-03-01 19:41       ` Greg KH
2022-03-01 23:12       ` Jason A. Donenfeld
2022-03-01 23:12         ` Jason A. Donenfeld
2022-03-02 14:35 ` Jason A. Donenfeld
2022-03-09 10:10 ` Alexander Graf
2022-03-09 22:02   ` Jason A. Donenfeld
2022-03-09 22:02     ` Jason A. Donenfeld
2022-03-10 11:18     ` Alexander Graf
2022-03-20 22:53       ` Michael S. Tsirkin
2022-03-20 22:53         ` Michael S. Tsirkin
2022-04-19 15:12       ` Jason A. Donenfeld
2022-04-19 15:12         ` Jason A. Donenfeld
2022-04-19 16:43         ` Michael S. Tsirkin
2022-04-19 16:43           ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220301111459-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=Jason@zx2c4.com \
    --cc=adrian@parity.io \
    --cc=arnd@arndb.de \
    --cc=berrange@redhat.com \
    --cc=colmmacc@amazon.com \
    --cc=graf@amazon.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jannh@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=len.brown@intel.com \
    --cc=lersek@redhat.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=linux@dominikbrodowski.net \
    --cc=mikelley@microsoft.com \
    --cc=pavel@ucw.cz \
    --cc=qemu-devel@nongnu.org \
    --cc=rafael@kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.