* propagating vmgenid outward and upward
@ 2022-03-01 15:42 Jason A. Donenfeld
  2022-03-01 16:15 ` Laszlo Ersek
                   ` (4 more replies)
  0 siblings, 5 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-01 15:42 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-hyperv, linux-crypto, graf,
	mikelley, gregkh, adrian, lersek, berrange, linux, jannh, mst,
	rafael, len.brown, pavel, linux-pm, colmmacc, tytso, arnd

Hey folks,

Having finally wrapped up development of the initial vmgenid driver, I
thought I'd pull together some thoughts on vmgenid, notification, and
propagating, from disjointed conversations I've had with a few of you
over the last several weeks.

The basic problem is: VMs can be cloned, forked, rewound, or
snapshotted, and when this happens, a) the RNG needs to reseed itself,
and b) cryptographic algorithms that are not reuse resistant need to
reinitialize in one way or another. For 5.18, we're handling (a) via the
new vmgenid driver, which implements a spec from Microsoft, whereby the
driver receives ACPI notifications when a 16-byte unique value changes.

The vmgenid driver basically works, though it is racy, because that ACPI
notification can arrive after the system is already running again. This
race is even worse on Windows, where they kick the notification into a
worker thread, which then publishes it upward elsewhere to another async
mechanism, and eventually it hits the RNG and various userspace apps.
On Linux it's not that bad -- we reseed immediately upon receiving the
notification -- but it still inherits this same "push"-model deficiency,
which a "pull"-model would not have.

If we had a "pull" model, rather than just expose a 16-byte unique
identifier, the vmgenid virtual hardware would _also_ expose a
word-sized generation counter, which would be incremented every time the
unique ID changed. Then, every time we would touch the RNG, we'd simply
do an inexpensive check of this memremap()'d integer, and reinitialize
with the unique ID if the integer changed. In this way, the race would
be entirely eliminated. We would then be able to propagate this outwards
to other drivers, by just exporting an extern symbol, in the manner of
`jiffies`, and propagate it upwards to userspace, by putting it in the
vDSO, in the manner of gettimeofday. And like that, there'd be no
terrible async thing and things would work pretty easily.
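
To illustrate, here's a rough sketch of what that hot-path check could
look like -- with made-up names, since none of this exists today:

        /* Hypothetical: word-sized counter memremap()'d from the
         * vmgenid virtual hardware at probe time. */
        extern u32 *vmgenid_counter;

        static u32 rng_seen_generation;

        static inline void rng_check_vmfork(void)
        {
                u32 gen = READ_ONCE(*vmgenid_counter);

                if (unlikely(gen != rng_seen_generation)) {
                        rng_seen_generation = gen;
                        /* hypothetical: re-read the 16-byte unique ID
                         * and mix it in, a la add_vmfork_randomness() */
                        reseed_from_vmgenid();
                }
        }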

But that's not what we have, because Microsoft didn't collaborate with
anybody on this, and now it's implemented in several hypervisors. Given
that I'm already spending considerable time working on the RNG, entirely
without funding, somehow I'm not super motivated to lead a
cross-industry political effort to change Microsoft's vmgenid spec.
Maybe somebody else has an appetite for this, but either way, those
changes would be several years off at best.

So given we have a "push"-model mechanism, there are two problems to
tackle, perhaps in the same way, perhaps in a different way:

A) Outwards propagation toward other kernel drivers: in this case, I
   have in mind WireGuard, naturally, which very much needs to clear its
   existing sessions when VMs are forked.

B) Upwards propagation to userspace: in this case, we handle the
   concerns of the Amazon engineers on this thread who broached this
   topic a few years ago, in which s2n, their TLS library, wants to
   reinitialize its userspace RNG (a silly thing, but I digress) and
   probably clear session keys too, for the same good reason as
   WireGuard.

For (A), at least wearing my WireGuard-maintainer hat, there is an easy
way and there is a "race-free" way. I use scare quotes there because
we're still in a "push"-model, which means it's still racy no matter
what.

The faux "race-free" way involves having `extern u32 rng_vm_generation;`
or similar in random.h, and then everything that generates a session key
would snapshot this value, and every time a session key is used, a
comparison would be made. This works, but given that we're going to be
racy no matter what, I think I'd prefer avoiding the extra code in the
hot path and extra per-session storage. It seems like that'd involve a
lot of fiddly engineering for no real world benefit.
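
(For concreteness, that approach would amount to something like this,
with a hypothetical field bolted onto each session:)

        struct session {
                /* ... existing keys, counters, and so on ... */
                u32 vm_generation; /* snapshot taken at creation */
        };

        /* at session creation: */
        session->vm_generation = READ_ONCE(rng_vm_generation);

        /* in the hot path, on every use of the session key: */
        if (READ_ONCE(rng_vm_generation) != session->vm_generation)
                goto drop_session; /* force a new handshake */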

The easy way, and the way that I think I prefer, would be to just have a
sync notifier_block for this, just like we have with
register_pm_notifier(). From my perspective, it'd be simplest to just
piggyback on the already existing PM notifier with an extra event,
PM_POST_VMFORK, which would join the existing set of 7, following
PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
don't want to play ball, we could always come up with our own
notifier_block. But I don't see the need. Plus, WireGuard *already*
uses the PM notifier for clearing keys, so code-wise for my use case,
that'd amount to adding another case for PM_POST_VMFORK, in addition to the
currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
which all would be treated the same way. Ezpz. So if that sounds like an
interesting thing to the PM people, I think I'd like to propose a patch
for that, possibly even for 5.18, given that it'd be very
straightforward.
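
For reference, this is roughly how the hook is already wired up in
WireGuard's device.c today, which is why the change would be so small:

        static struct notifier_block pm_notifier = {
                .notifier_call = wg_pm_notification
        };

        /* at module init: */
        ret = register_pm_notifier(&pm_notifier);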

For (B), it's a little bit trickier. But I think our options follow the
same rubric. We can expose a generation counter in the vDSO, with
semantics akin to the extern integer I described above. Or we could
expose that counter in a file that userspace could poll() on and receive
notifications that way. Or perhaps a third way. I'm all ears here.
Alex's team from Amazon last year proposed something similar to the vDSO
idea, except using mmap on a sysfs file, though from what I can tell,
that wound up being kind of complicated. Due to the fact that we're
_already_ racy, I think I'm most inclined at this point toward the
poll() approach for the same reasons as I prefer a notifier_block. But
on userspace I could be convinced otherwise, and I'd be interested in
totally different ideas here too.
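
(To make the poll() option concrete, userspace would do roughly the
following -- the sysfs path here is invented for illustration:)

        #include <fcntl.h>
        #include <poll.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[32];
                int fd = open("/sys/class/misc/vmgenid/generation", O_RDONLY);

                read(fd, buf, sizeof(buf)); /* initial read arms the notification */
                for (;;) {
                        struct pollfd pfd = { .fd = fd, .events = POLLPRI };

                        poll(&pfd, 1, -1); /* blocks until the VM forks */
                        lseek(fd, 0, SEEK_SET);
                        read(fd, buf, sizeof(buf)); /* new generation value */
                        /* ... wipe session keys, reseed userspace RNG ... */
                }
        }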

Another thing I should note is that, while I'm not currently leaning
toward it, the vDSO approach also ties into interesting discussions
about userspace RNGs (generally a silly idea), and their need for things
like fork detection and also learning when the kernel RNG was last
reseeded. So cracking open the vDSO book might invite all sorts of other
interesting questions and discussions, which may be productive or may be
a humongous distraction. (Also, again, I'm not super enthusiastic about
userspace RNGs.)

Also, there is an interesting question to decide with regards to
userspace, which is whether the vmgenid driver should expose its unique
ID to userspace, as Alex requested on an earlier thread. I am actually
sort of opposed to this. That unique ID may or may not be secret and
entropic; if it isn't, the crypto is designed to not be impacted
negatively, but if it is, we should keep it secret. So, rather, I think
the correct flow is that userspace simply calls getrandom() upon
learning that the VM forked, which is guaranteed to have been
reinitialized already by add_vmfork_randomness(), and that will
guarantee a value that is unique to the VM, without having to actually
expose that value.
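
(In code, the entire userspace flow upon a fork notification would be
no more than this sketch:)

        #include <stdlib.h>
        #include <sys/random.h>

        /* upon learning that the VM forked: */
        unsigned char key[32];

        if (getrandom(key, sizeof(key), 0) != sizeof(key))
                abort();
        /* key is now guaranteed to be unique to this VM instance */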

So, anyway, this is more or less where my thinking on this matter is.
Would be happy to hear some fresh ideas here too.

Regards,
Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
@ 2022-03-01 16:15 ` Laszlo Ersek
  2022-03-01 16:28     ` Jason A. Donenfeld
  2022-03-01 16:21   ` Michael S. Tsirkin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 61+ messages in thread
From: Laszlo Ersek @ 2022-03-01 16:15 UTC (permalink / raw)
  To: Jason A. Donenfeld, linux-kernel, kvm, qemu-devel, linux-hyperv,
	linux-crypto, graf, mikelley, gregkh, adrian, berrange, linux,
	jannh, mst, rafael, len.brown, pavel, linux-pm, colmmacc, tytso,
	arnd

On 03/01/22 16:42, Jason A. Donenfeld wrote:
> Hey folks,
> 
> Having finally wrapped up development of the initial vmgenid driver, I
> thought I'd pull together some thoughts on vmgenid, notification, and
> propagating, from disjointed conversations I've had with a few of you
> over the last several weeks.
> 
> The basic problem is: VMs can be cloned, forked, rewound, or
> snapshotted, and when this happens, a) the RNG needs to reseed itself,
> and b) cryptographic algorithms that are not reuse resistant need to
> reinitialize in one way or another. For 5.18, we're handling (a) via the
> new vmgenid driver, which implements a spec from Microsoft, whereby the
> driver receives ACPI notifications when a 16-byte unique value changes.
> 
> The vmgenid driver basically works, though it is racy, because that ACPI
> notification can arrive after the system is already running again. This
> race is even worse on Windows, where they kick the notification into a
> worker thread, which then publishes it upward elsewhere to another async
> mechanism, and eventually it hits the RNG and various userspace apps.
> On Linux it's not that bad -- we reseed immediately upon receiving the
> notification -- but it still inherits this same "push"-model deficiency,
> which a "pull"-model would not have.
> 
> If we had a "pull" model, rather than just expose a 16-byte unique
> identifier, the vmgenid virtual hardware would _also_ expose a
> word-sized generation counter, which would be incremented every time the
> unique ID changed. Then, every time we would touch the RNG, we'd simply
> do an inexpensive check of this memremap()'d integer, and reinitialize
> with the unique ID if the integer changed.

Does the vmgenid spec (as-is) preclude the use of the 16-byte identifier
like this?

After all, once you locate the identifier via the ADDR object, you could
perhaps consult it every time you were about to touch the RNG. Perhaps
with the help of 16-byte atomic (?) operations, you could maintain a
copy of the identifier elsewhere, and detect (atomically, upon each RNG
access) whether the identifier has changed, without waiting for the Notify.

We'd require (or assume) the hypervisors to modify the identifier in
guest RAM before resuming execution of the guest. And we'd have to delay
RNG accesses in the guest kernel until after the vmgenid device -- or
its absence -- were discovered.
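
Something like the following is what I have in mind -- hand-waving away
the atomicity of the 16-byte comparison, which is admittedly the hard
part:

        static u8 vmgenid_cached[16];
        static void *vmgenid_addr; /* located via the ACPI ADDR object */

        /* before each RNG access: */
        if (memcmp(vmgenid_cached, vmgenid_addr, 16) != 0) {
                memcpy(vmgenid_cached, vmgenid_addr, 16);
                reseed_rng(); /* hypothetical reseed hook */
        }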

Just a random guess.

Thanks
Laszlo

> In this way, the race would
> be entirely eliminated. We would then be able to propagate this outwards
> to other drivers, by just exporting an extern symbol, in the manner of
> `jiffies`, and propagate it upwards to userspace, by putting it in the
> vDSO, in the manner of gettimeofday. And like that, there'd be no
> terrible async thing and things would work pretty easily.
> 
> But that's not what we have, because Microsoft didn't collaborate with
> anybody on this, and now it's implemented in several hypervisors. Given
> that I'm already spending considerable time working on the RNG, entirely
> without funding, somehow I'm not super motivated to lead a
> cross-industry political effort to change Microsoft's vmgenid spec.
> Maybe somebody else has an appetite for this, but either way, those
> changes would be several years off at best.
> 
> So given we have a "push"-model mechanism, there are two problems to
> tackle, perhaps in the same way, perhaps in a different way:
> 
> A) Outwards propagation toward other kernel drivers: in this case, I
>    have in mind WireGuard, naturally, which very much needs to clear its
>    existing sessions when VMs are forked.
> 
> B) Upwards propagation to userspace: in this case, we handle the
>    concerns of the Amazon engineers on this thread who broached this
>    topic a few years ago, in which s2n, their TLS library, wants to
>    reinitialize its userspace RNG (a silly thing, but I digress) and
>    probably clear session keys too, for the same good reason as
>    WireGuard.
> 
> For (A), at least wearing my WireGuard-maintainer hat, there is an easy
> way and there is a "race-free" way. I use scare quotes there because
> we're still in a "push"-model, which means it's still racy no matter
> what.
> 
> The faux "race-free" way involves having `extern u32 rng_vm_generation;`
> or similar in random.h, and then everything that generates a session key
> would snapshot this value, and every time a session key is used, a
> comparison would be made. This works, but given that we're going to be
> racy no matter what, I think I'd prefer avoiding the extra code in the
> hot path and extra per-session storage. It seems like that'd involve a
> lot of fiddly engineering for no real world benefit.
> 
> The easy way, and the way that I think I prefer, would be to just have a
> sync notifier_block for this, just like we have with
> register_pm_notifier(). From my perspective, it'd be simplest to just
> piggyback on the already existing PM notifier with an extra event,
> PM_POST_VMFORK, which would join the existing set of 7, following
> PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
> don't want to play ball, we could always come up with our own
> notifier_block. But I don't see the need. Plus, WireGuard *already*
> uses the PM notifier for clearing keys, so code-wise for my use case,
> that'd amount to adding another case for PM_POST_VMFORK, in addition to the
> currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
> which all would be treated the same way. Ezpz. So if that sounds like an
> interesting thing to the PM people, I think I'd like to propose a patch
> for that, possibly even for 5.18, given that it'd be very
> straightforward.
> 
> For (B), it's a little bit trickier. But I think our options follow the
> same rubric. We can expose a generation counter in the vDSO, with
> semantics akin to the extern integer I described above. Or we could
> expose that counter in a file that userspace could poll() on and receive
> notifications that way. Or perhaps a third way. I'm all ears here.
> Alex's team from Amazon last year proposed something similar to the vDSO
> idea, except using mmap on a sysfs file, though from what I can tell,
> that wound up being kind of complicated. Due to the fact that we're
> _already_ racy, I think I'm most inclined at this point toward the
> poll() approach for the same reasons as I prefer a notifier_block. But
> on userspace I could be convinced otherwise, and I'd be interested in
> totally different ideas here too.
> 
> Another thing I should note is that, while I'm not currently leaning
> toward it, the vDSO approach also ties into interesting discussions
> about userspace RNGs (generally a silly idea), and their need for things
> like fork detection and also learning when the kernel RNG was last
> reseeded. So cracking open the vDSO book might invite all sorts of other
> interesting questions and discussions, which may be productive or may be
> a humongous distraction. (Also, again, I'm not super enthusiastic about
> userspace RNGs.)
> 
> Also, there is an interesting question to decide with regards to
> userspace, which is whether the vmgenid driver should expose its unique
> ID to userspace, as Alex requested on an earlier thread. I am actually
> sort of opposed to this. That unique ID may or may not be secret and
> entropic; if it isn't, the crypto is designed to not be impacted
> negatively, but if it is, we should keep it secret. So, rather, I think
> the correct flow is that userspace simply calls getrandom() upon
> learning that the VM forked, which is guaranteed to have been
> reinitialized already by add_vmfork_randomness(), and that will
> guarantee a value that is unique to the VM, without having to actually
> expose that value.
> 
> So, anyway, this is more or less where my thinking on this matter is.
> Would be happy to hear some fresh ideas here too.
> 
> Regards,
> Jason
> 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
@ 2022-03-01 16:21   ` Michael S. Tsirkin
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-01 16:21 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-kernel, kvm, qemu-devel, linux-hyperv, linux-crypto, graf,
	mikelley, gregkh, adrian, lersek, berrange, linux, jannh, rafael,
	len.brown, pavel, linux-pm, colmmacc, tytso, arnd

On Tue, Mar 01, 2022 at 04:42:47PM +0100, Jason A. Donenfeld wrote:
> Hey folks,
> 
> Having finally wrapped up development of the initial vmgenid driver, I
> thought I'd pull together some thoughts on vmgenid, notification, and
> propagating, from disjointed conversations I've had with a few of you
> over the last several weeks.
> 
> The basic problem is: VMs can be cloned, forked, rewound, or
> snapshotted, and when this happens, a) the RNG needs to reseed itself,
> and b) cryptographic algorithms that are not reuse resistant need to
> reinitialize in one way or another. For 5.18, we're handling (a) via the
> new vmgenid driver, which implements a spec from Microsoft, whereby the
> driver receives ACPI notifications when a 16-byte unique value changes.
> 
> The vmgenid driver basically works, though it is racy, because that ACPI
> notification can arrive after the system is already running again. This
> race is even worse on Windows, where they kick the notification into a
> worker thread, which then publishes it upward elsewhere to another async
> mechanism, and eventually it hits the RNG and various userspace apps.
> On Linux it's not that bad -- we reseed immediately upon receiving the
> notification -- but it still inherits this same "push"-model deficiency,
> which a "pull"-model would not have.
> 
> If we had a "pull" model, rather than just expose a 16-byte unique
> identifier, the vmgenid virtual hardware would _also_ expose a
> word-sized generation counter, which would be incremented every time the
> unique ID changed. Then, every time we would touch the RNG, we'd simply
> do an inexpensive check of this memremap()'d integer, and reinitialize
> with the unique ID if the integer changed. In this way, the race would
> be entirely eliminated. We would then be able to propagate this outwards
> to other drivers, by just exporting an extern symbol, in the manner of
> `jiffies`, and propagate it upwards to userspace, by putting it in the
> vDSO, in the manner of gettimeofday. And like that, there'd be no
> terrible async thing and things would work pretty easily.

I am not sure what the difference is, though. So we have a 16-byte
unique value and you would prefer a dword counter. How is the former
not a superset of the latter? I'm not sure how safe it is to expose it
to userspace specifically, but the rest of the text talks about
exposing it to a kernel driver, so maybe not an issue? So what makes
the interrupt-driven approach required, and why not just remap and
read the existing vmgenid in the pull manner? What did I miss?


> But that's not what we have, because Microsoft didn't collaborate with
> anybody on this, and now it's implemented in several hypervisors. Given
> that I'm already spending considerable time working on the RNG, entirely
> without funding, somehow I'm not super motivated to lead a
> cross-industry political effort to change Microsoft's vmgenid spec.
> Maybe somebody else has an appetite for this, but either way, those
> changes would be several years off at best.
> 
> So given we have a "push"-model mechanism, there are two problems to
> tackle, perhaps in the same way, perhaps in a different way:
> 
> A) Outwards propagation toward other kernel drivers: in this case, I
>    have in mind WireGuard, naturally, which very much needs to clear its
>    existing sessions when VMs are forked.
> 
> B) Upwards propagation to userspace: in this case, we handle the
>    concerns of the Amazon engineers on this thread who broached this
>    topic a few years ago, in which s2n, their TLS library, wants to
>    reinitialize its userspace RNG (a silly thing, but I digress) and
>    probably clear session keys too, for the same good reason as
>    WireGuard.
> 
> For (A), at least wearing my WireGuard-maintainer hat, there is an easy
> way and there is a "race-free" way. I use scare quotes there because
> we're still in a "push"-model, which means it's still racy no matter
> what.
> 
> The faux "race-free" way involves having `extern u32 rng_vm_generation;`
> or similar in random.h, and then everything that generates a session key
> would snapshot this value, and every time a session key is used, a
> comparison would be made. This works, but given that we're going to be
> racy no matter what, I think I'd prefer avoiding the extra code in the
> hot path and extra per-session storage. It seems like that'd involve a
> lot of fiddly engineering for no real world benefit.
> 
> The easy way, and the way that I think I prefer, would be to just have a
> sync notifier_block for this, just like we have with
> register_pm_notifier(). From my perspective, it'd be simplest to just
> piggyback on the already existing PM notifier with an extra event,
> PM_POST_VMFORK, which would join the existing set of 7, following
> PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
> don't want to play ball, we could always come up with our own
> notifier_block. But I don't see the need. Plus, WireGuard *already*
> uses the PM notifier for clearing keys, so code-wise for my use case,
> that'd amount to adding another case for PM_POST_VMFORK, in addition to the
> currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
> which all would be treated the same way. Ezpz. So if that sounds like an
> interesting thing to the PM people, I think I'd like to propose a patch
> for that, possibly even for 5.18, given that it'd be very
> straightforward.
> 
> For (B), it's a little bit trickier. But I think our options follow the
> same rubric. We can expose a generation counter in the vDSO, with
> semantics akin to the extern integer I described above. Or we could
> expose that counter in a file that userspace could poll() on and receive
> notifications that way. Or perhaps a third way. I'm all ears here.
> Alex's team from Amazon last year proposed something similar to the vDSO
> idea, except using mmap on a sysfs file, though from what I can tell,
> that wound up being kind of complicated. Due to the fact that we're
> _already_ racy, I think I'm most inclined at this point toward the
> poll() approach for the same reasons as I prefer a notifier_block. But
> on userspace I could be convinced otherwise, and I'd be interested in
> totally different ideas here too.
> 
> Another thing I should note is that, while I'm not currently leaning
> toward it, the vDSO approach also ties into interesting discussions
> about userspace RNGs (generally a silly idea), and their need for things
> like fork detection and also learning when the kernel RNG was last
> reseeded. So cracking open the vDSO book might invite all sorts of other
> interesting questions and discussions, which may be productive or may be
> a humongous distraction. (Also, again, I'm not super enthusiastic about
> userspace RNGs.)
> 
> Also, there is an interesting question to decide with regards to
> userspace, which is whether the vmgenid driver should expose its unique
> ID to userspace, as Alex requested on an earlier thread. I am actually
> sort of opposed to this. That unique ID may or may not be secret and
> entropic; if it isn't, the crypto is designed to not be impacted
> negatively, but if it is, we should keep it secret. So, rather, I think
> the correct flow is that userspace simply calls getrandom() upon
> learning that the VM forked, which is guaranteed to have been
> reinitialized already by add_vmfork_randomness(), and that will
> guarantee a value that is unique to the VM, without having to actually
> expose that value.
> 
> So, anyway, this is more or less where my thinking on this matter is.
> Would be happy to hear some fresh ideas here too.
> 
> Regards,
> Jason


^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 16:15 ` Laszlo Ersek
@ 2022-03-01 16:28     ` Jason A. Donenfeld
  0 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-01 16:28 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: linux-kernel, kvm, qemu-devel, linux-hyperv, linux-crypto, graf,
	mikelley, gregkh, adrian, berrange, linux, jannh, mst, rafael,
	len.brown, pavel, linux-pm, colmmacc, tytso, arnd

Hi Laszlo,

On Tue, Mar 01, 2022 at 05:15:21PM +0100, Laszlo Ersek wrote:
> > If we had a "pull" model, rather than just expose a 16-byte unique
> > identifier, the vmgenid virtual hardware would _also_ expose a
> > word-sized generation counter, which would be incremented every time the
> > unique ID changed. Then, every time we would touch the RNG, we'd simply
> > do an inexpensive check of this memremap()'d integer, and reinitialize
> > with the unique ID if the integer changed.
> 
> Does the vmgenid spec (as-is) preclude the use of the 16-byte identifier
> like this?
> 
> After all, once you locate the identifier via the ADDR object, you could
> perhaps consult it every time you were about to touch the RNG.

No, you could in fact do this, and there'd be nothing wrong with that
from a spec perspective. You could even vDSO it all the way through
onward to userspace. However, doing a 16-byte atomic memcmp on
each-and-every packet is really a non-starter. For that kind of "check
it in the hot path" thing to be viable, you really want it to be a
counter that is word-sized. The "pull"-model involves pulling on every
single packet in order to be better than the "push"-model. Anyway, even
with a word-sized counter, it's unclear whether the costs of checking on
every packet would be worth it to everyone, but at least it's more
tenable than a 16-byte whammy.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 16:21   ` Michael S. Tsirkin
@ 2022-03-01 16:35     ` Jason A. Donenfeld
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-01 16:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, qemu-devel, linux-hyperv, linux-crypto, graf,
	mikelley, gregkh, adrian, lersek, berrange, linux, jannh, rafael,
	len.brown, pavel, linux-pm, colmmacc, tytso, arnd

Hi Michael,

On Tue, Mar 01, 2022 at 11:21:38AM -0500, Michael S. Tsirkin wrote:
> > If we had a "pull" model, rather than just expose a 16-byte unique
> > identifier, the vmgenid virtual hardware would _also_ expose a
> > word-sized generation counter, which would be incremented every time the
> > unique ID changed. Then, every time we would touch the RNG, we'd simply
> > do an inexpensive check of this memremap()'d integer, and reinitialize
> > with the unique ID if the integer changed. In this way, the race would
> > be entirely eliminated. We would then be able to propagate this outwards
> > to other drivers, by just exporting an extern symbol, in the manner of
> > `jiffies`, and propagate it upwards to userspace, by putting it in the
> > vDSO, in the manner of gettimeofday. And like that, there'd be no
> > terrible async thing and things would work pretty easily.
> 
> I am not sure what the difference is though. So we have a 16 byte unique
> value and you would prefer a dword counter. How is the former not a
> superset of the later?  

Laszlo just asked the same question, which I answered here:
<https://lore.kernel.org/lkml/Yh5JwK6toc%2FzBNL7@zx2c4.com/>. You have
to read the full 16 bytes. You can't safely just read the first 4 or 8
or something, because it's a "unique ID" rather than a counter. That
seems like a needlessly expensive thing to do on each-and-every packet.

> I'm not sure how safe it is to expose it to
> userspace specifically, but rest of text talks about exposing it to a
> kernel driver so maybe not an issue? So what makes interrupt driven
> required, and why not just remap and read existing vmgenid in the pull
> manner?  What did I miss?

I don't really understand your question, but guessing your meaning: I'm
not talking about exposing the actual 16-byte value to any other
drivers, but just notifying them that their sessions should be dropped.
If it's easier to think about this in code, grep for wg_pm_notification(),
and consider that it'd be changing this code:

        if (action != PM_HIBERNATION_PREPARE && action != PM_SUSPEND_PREPARE)
                return 0;

into:

        if (action != PM_HIBERNATION_PREPARE && action != PM_SUSPEND_PREPARE &&
            action != PM_POST_VMFORK)
                return 0;

But perhaps I misunderstood this part of your question?

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 16:28     ` Jason A. Donenfeld
@ 2022-03-01 17:17       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-01 17:17 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, linux-kernel, kvm, qemu-devel, linux-hyperv,
	linux-crypto, graf, mikelley, gregkh, adrian, berrange, linux,
	jannh, rafael, len.brown, pavel, linux-pm, colmmacc, tytso, arnd

On Tue, Mar 01, 2022 at 05:28:48PM +0100, Jason A. Donenfeld wrote:
> Hi Laszlo,
> 
> On Tue, Mar 01, 2022 at 05:15:21PM +0100, Laszlo Ersek wrote:
> > > If we had a "pull" model, rather than just expose a 16-byte unique
> > > identifier, the vmgenid virtual hardware would _also_ expose a
> > > word-sized generation counter, which would be incremented every time the
> > > unique ID changed. Then, every time we would touch the RNG, we'd simply
> > > do an inexpensive check of this memremap()'d integer, and reinitialize
> > > with the unique ID if the integer changed.
> > 
> > Does the vmgenid spec (as-is) preclude the use of the 16-byte identifier
> > like this?
> > 
> > After all, once you locate the identifier via the ADDR object, you could
> > perhaps consult it every time you were about to touch the RNG.
> 
> No, you could in fact do this, and there'd be nothing wrong with that
> from a spec perspective. You could even vDSO it all the way through
> onward to userspace. However, doing a 16-byte atomic memcmp on
> each-and-every packet is really a non-starter. For that kind of "check
> it in the hot path" thing to be viable, you really want it to be a
> counter that is word-sized. The "pull"-model involves pulling on every
> single packet in order to be better than the "push"-model. Anyway, even
> with a word-sized counter, it's unclear whether the costs of checking on
> every packet would be worth it to everyone, but at least it's more
> tenable than a 16-byte whammy.
> 
> Jason

Hmm okay, so it's a performance optimization... some batching then? Do
you really need to worry about every packet? Is every 64 packets not
enough? Packets are, after all, queued at NICs etc., and a VM fork can
happen after they leave WireGuard ...
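
Something like this, just to illustrate the batching idea (hypothetical
fields and all):

        /* only pay for the generation check on every 64th packet */
        if ((++peer->tx_packets & 63) == 0 &&
            READ_ONCE(*vmgenid_counter) != peer->seen_generation)
                wg_drop_sessions(peer); /* hypothetical */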

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
@ 2022-03-01 18:01   ` Greg KH
  2022-03-01 16:21   ` Michael S. Tsirkin
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Greg KH @ 2022-03-01 18:01 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-kernel, kvm, qemu-devel, linux-hyperv, linux-crypto, graf,
	mikelley, adrian, lersek, berrange, linux, jannh, mst, rafael,
	len.brown, pavel, linux-pm, colmmacc, tytso, arnd

On Tue, Mar 01, 2022 at 04:42:47PM +0100, Jason A. Donenfeld wrote:
> The easy way, and the way that I think I prefer, would be to just have a
> sync notifier_block for this, just like we have with
> register_pm_notifier(). From my perspective, it'd be simplest to just
> piggyback on the already existing PM notifier with an extra event,
> PM_POST_VMFORK, which would join the existing set of 7, following
> PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
> don't want to play ball, we could always come up with our own
> notifier_block. But I don't see the need. Plus, WireGuard *already*
> uses the PM notifier for clearing keys, so code-wise for my use case,
> that'd amount to adding another case for PM_POST_VMFORK, in addition to the
> currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
> which all would be treated the same way. Ezpz. So if that sounds like an
> interesting thing to the PM people, I think I'd like to propose a patch
> for that, possibly even for 5.18, given that it'd be very
> straightforward.

A notifier block like this makes sense, but why tie onto the PM_ stuff?
This isn't a power management issue; it's a system-wide change that I
am sure others will want to know about, and one that doesn't reflect
any power changes.

As much as I hate adding new notifiers in the kernel, that might be all
you need here.
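
Something mirroring the pm notifier API is probably all it would take
-- a sketch, with made-up names:

        static BLOCKING_NOTIFIER_HEAD(vmfork_chain);

        int register_vmfork_notifier(struct notifier_block *nb)
        {
                return blocking_notifier_chain_register(&vmfork_chain, nb);
        }
        EXPORT_SYMBOL_GPL(register_vmfork_notifier);

        /* and the vmgenid driver, on a generation change, would call: */
        blocking_notifier_call_chain(&vmfork_chain, 0, NULL);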

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 18:01   ` Greg KH
@ 2022-03-01 18:24     ` Jason A. Donenfeld
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-01 18:24 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Michael S. Tsirkin,
	Rafael J. Wysocki, Brown, Len, Pavel Machek, Linux PM,
	Colm MacCarthaigh, Theodore Ts'o, Arnd Bergmann

Hi Greg,

On Tue, Mar 1, 2022 at 7:01 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> A notifier block like this makes sense, but why tie onto the PM_ stuff?
> This isn't a power management issue; it's a system-wide change that I
> am sure others will want to know about, and one that doesn't reflect
> any power changes.
>
> As much as I hate adding new notifiers in the kernel, that might be all
> you need here.

You might indeed be right. I guess I was thinking that "resuming from
suspend" and "resuming from a VM fork" are kind of the same thing.
There _is_ a certain kind of similarity between the two. I was hoping
if the similarity was a strong enough one, maybe it'd make sense to do
them together rather than adding another notifier. But I suppose you
disagree, and it sounds like Rafael might too --
<https://lore.kernel.org/lkml/CAJZ5v0g+GihH_b9YvwuHzdrUVNGXOeabOznDC1vK6qLi8gtSTQ@mail.gmail.com/>.
Code-wise for me with WireGuard it's of course appealing to treat them
the same, since it's like a one line change, but if I need to add a
new notifier call there, it's not the end of the world.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 17:17       ` Michael S. Tsirkin
@ 2022-03-01 18:37         ` Jason A. Donenfeld
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-01 18:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

Hi Michael,

On Tue, Mar 1, 2022 at 6:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> Hmm okay, so it's a performance optimization... some batching then? Do
> you really need to worry about every packet? Every 64 packets not
> enough?  Packets are after all queued at NICs etc, and VM fork can
> happen after they leave wireguard ...

Unfortunately, yes, this is an "every packet" sort of thing -- if the
race is to be avoided in a meaningful way. It's really extra bad:
ChaCha20 and AES-CTR work by xoring a secret stream of bytes with
plaintext to produce a ciphertext. If you use that same secret stream
and xor it with a second plaintext and transmit that too, an attacker
can combine the two different ciphertexts to learn things about the
original plaintext.
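
(Here's a toy demonstration, with a made-up keystream standing in for
real ChaCha20 output:)

        #include <stdio.h>

        int main(void)
        {
                /* the same keystream erroneously used twice */
                unsigned char ks[8] = { 0x13, 0x37, 0xde, 0xad,
                                        0xbe, 0xef, 0x42, 0x99 };
                const unsigned char *p1 = (const unsigned char *)"SECRET_A";
                const unsigned char *p2 = (const unsigned char *)"SECRET_B";
                unsigned char c1[8], c2[8];
                int i;

                for (i = 0; i < 8; ++i) {
                        c1[i] = p1[i] ^ ks[i];
                        c2[i] = p2[i] ^ ks[i];
                }
                /* xoring the two ciphertexts cancels the keystream,
                 * leaving p1 ^ p2 for the attacker */
                for (i = 0; i < 8; ++i)
                        printf("%02x ", c1[i] ^ c2[i]);
                printf("\n");
                return 0;
        }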

But, anyway, it seems like the race is here to stay given what we have
_currently_ available with the virtual hardware. That's why I'm
focused on trying to get something going that's the least bad with
what we've currently got, which is racy by design. How vitally
important is it to have something that doesn't race in the far future?
I don't know, really. It seems plausible that the ACPI notifier
triggers so early that nothing else really even has a chance, so the
race concern is purely theoretical. But I haven't tried to measure
that so I'm not sure.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-01 18:24     ` Jason A. Donenfeld
@ 2022-03-01 19:41       ` Greg KH
  0 siblings, 0 replies; 61+ messages in thread
From: Greg KH @ 2022-03-01 19:41 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Michael S. Tsirkin,
	Rafael J. Wysocki, Brown, Len, Pavel Machek, Linux PM,
	Colm MacCarthaigh, Theodore Ts'o, Arnd Bergmann

On Tue, Mar 01, 2022 at 07:24:11PM +0100, Jason A. Donenfeld wrote:
> Hi Greg,
> 
> On Tue, Mar 1, 2022 at 7:01 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > A notifier block like this makes sense, but why tie onto the PM_ stuff?
> > This isn't power management issues, it's a system-wide change that I am
> > sure others will want to know about that doesn't reflect any power
> > changes.
> >
> > As much as I hate adding new notifiers in the kernel, that might be all
> > you need here.
> 
> You might indeed be right. I guess I was thinking that "resuming from
> suspend" and "resuming from a VM fork" are kind of the same thing.
> There _is_ a certain kind of similarity between the two. I was hoping
> if the similarity was a strong enough one, maybe it'd make sense to do
> them together rather than adding another notifier. But I suppose you
> disagree, and it sounds like Rafael might too --
> <https://lore.kernel.org/lkml/CAJZ5v0g+GihH_b9YvwuHzdrUVNGXOeabOznDC1vK6qLi8gtSTQ@mail.gmail.com/>.

Hey, nice, we agree!  :)

> Code-wise for me with WireGuard it's of course appealing to treat them
> the same, since it's like a one line change, but if I need to add a
> new notifier call there, it's not the end of the world.

I know there are other places in the kernel that would like to be
notified when they have been moved to another machine so that they can
do things like determine if the CPU functionality has changed (or not),
and perhaps do other types of device reconfiguration.  Right now the
kernel does not have any way of knowing this, so it makes sense that if
the platform (i.e. ACPI here) has a way of creating such an event, it
should, and then we can start tying in other subsystems to use it
as needed.
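
To give that a rough shape, a consumer of such a chain might look like
the following sketch; the chain and its registration function are
assumed names here, not an existing kernel API:

#include <linux/module.h>
#include <linux/notifier.h>

/* Sketch only: register_vmfork_notifier() is an assumed name for the
 * hypothetical new chain, not a function the kernel has today. */
static int vmfork_notify(struct notifier_block *nb, unsigned long action,
			 void *data)
{
	/* e.g. wipe session keys, reprobe CPU features, reconfigure
	 * devices as needed. */
	return NOTIFY_DONE;
}

static struct notifier_block vmfork_nb = {
	.notifier_call = vmfork_notify,
};

static int __init vmfork_consumer_init(void)
{
	return register_vmfork_notifier(&vmfork_nb);	/* assumed API */
}
module_init(vmfork_consumer_init);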

thanks,

greg k-h


* Re: propagating vmgenid outward and upward
  2022-03-01 19:41       ` Greg KH
@ 2022-03-01 23:12         ` Jason A. Donenfeld
  0 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-01 23:12 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Michael S. Tsirkin,
	Rafael J. Wysocki, Brown, Len, Pavel Machek, Linux PM,
	Colm MacCarthaigh, Theodore Ts'o, Arnd Bergmann

On Tue, Mar 1, 2022 at 8:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 01, 2022 at 07:24:11PM +0100, Jason A. Donenfeld wrote:
> > Hi Greg,
> >
> > On Tue, Mar 1, 2022 at 7:01 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > A notifier block like this makes sense, but why tie onto the PM_ stuff?
> > > This isn't power management issues, it's a system-wide change that I am
> > > sure others will want to know about that doesn't reflect any power
> > > changes.
> > >
> > > As much as I hate adding new notifiers in the kernel, that might be all
> > > you need here.
> >
> > You might indeed be right. I guess I was thinking that "resuming from
> > suspend" and "resuming from a VM fork" are kind of the same thing.
> > There _is_ a certain kind of similarity between the two. I was hoping
> > if the similarity was a strong enough one, maybe it'd make sense to do
> > them together rather than adding another notifier. But I suppose you
> > disagree, and it sounds like Rafael might too --
> > <https://lore.kernel.org/lkml/CAJZ5v0g+GihH_b9YvwuHzdrUVNGXOeabOznDC1vK6qLi8gtSTQ@mail.gmail.com/>.
>
> Hey, nice, we agree!  :)

It is now done and posted here:
https://lore.kernel.org/lkml/20220301231038.530897-1-Jason@zx2c4.com/

Jason


* Re: propagating vmgenid outward and upward
  2022-03-01 18:37         ` Jason A. Donenfeld
@ 2022-03-02  7:42           ` Michael S. Tsirkin
  0 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02  7:42 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Tue, Mar 01, 2022 at 07:37:06PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Tue, Mar 1, 2022 at 6:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > Hmm okay, so it's a performance optimization... some batching then? Do
> > you really need to worry about every packet? Every 64 packets not
> > enough?  Packets are after all queued at NICs etc, and VM fork can
> > happen after they leave wireguard ...
> 
> Unfortunately, yes, this is an "every packet" sort of thing -- if the
> race is to be avoided in a meaningful way. It's really extra bad:
> ChaCha20 and AES-CTR work by xoring a secret stream of bytes with
> plaintext to produce a ciphertext. If you use that same secret stream
> and xor it with a second plaintext and transmit that too, an attacker
> can combine the two different ciphertexts to learn things about the
> original plaintext.

So what about the point about packets queued then? You don't fish
packets out of qdisc queues, do you?

> But, anyway, it seems like the race is here to stay given what we have
> _currently_ available with the virtual hardware. That's why I'm
> focused on trying to get something going that's the least bad with
> what we've currently got, which is racy by design. How vitally
> important is it to have something that doesn't race in the far future?
> I don't know, really. It seems plausible that that ACPI notifier
> triggers so early that nothing else really even has a chance, so the
> race concern is purely theoretical. But I haven't tried to measure
> that so I'm not sure.
> 
> Jason



* Re: propagating vmgenid outward and upward
  2022-03-02  7:42           ` Michael S. Tsirkin
@ 2022-03-02  7:48             ` Michael S. Tsirkin
  0 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02  7:48 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 02:42:37AM -0500, Michael S. Tsirkin wrote:
> On Tue, Mar 01, 2022 at 07:37:06PM +0100, Jason A. Donenfeld wrote:
> > Hi Michael,
> > 
> > On Tue, Mar 1, 2022 at 6:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > Hmm okay, so it's a performance optimization... some batching then? Do
> > > you really need to worry about every packet? Every 64 packets not
> > > enough?  Packets are after all queued at NICs etc, and VM fork can
> > > happen after they leave wireguard ...
> > 
> > Unfortunately, yes, this is an "every packet" sort of thing -- if the
> > race is to be avoided in a meaningful way. It's really extra bad:
> > ChaCha20 and AES-CTR work by xoring a secret stream of bytes with
> > plaintext to produce a ciphertext. If you use that same secret stream
> > and xor it with a second plaintext and transmit that too, an attacker
> > can combine the two different ciphertexts to learn things about the
> > original plaintext.
> 
> So what about the point about packets queued then? You don't fish
> packets out of qdisc queues, do you?

Oh, please ignore that; I think I got it. Resending the same packet is
not a problem, producing a new one is.

> > But, anyway, it seems like the race is here to stay given what we have
> > _currently_ available with the virtual hardware. That's why I'm
> > focused on trying to get something going that's the least bad with
> > what we've currently got, which is racy by design. How vitally
> > important is it to have something that doesn't race in the far future?
> > I don't know, really. It seems plausible that that ACPI notifier
> > triggers so early that nothing else really even has a chance, so the
> > race concern is purely theoretical. But I haven't tried to measure
> > that so I'm not sure.
> > 
> > Jason


So how about measuring the performance impact of reading the 16 byte
vmgenid then? This could be a performance option, too - some people
might want extra security, some might not care.  And I feel if linux
DTRT and reads the 16 bytes then hypervisor vendors will be motivated to
improve and add a 4 byte unique one. As long as linux is interrupt
driven there's no motivation for change.

-- 
MST



* Re: propagating vmgenid outward and upward
  2022-03-01 18:37         ` Jason A. Donenfeld
@ 2022-03-02  8:30           ` Michael S. Tsirkin
  0 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02  8:30 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Tue, Mar 01, 2022 at 07:37:06PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Tue, Mar 1, 2022 at 6:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > Hmm okay, so it's a performance optimization... some batching then? Do
> > you really need to worry about every packet? Every 64 packets not
> > enough?  Packets are after all queued at NICs etc, and VM fork can
> > happen after they leave wireguard ...
> 
> Unfortunately, yes, this is an "every packet" sort of thing -- if the
> race is to be avoided in a meaningful way. It's really extra bad:
> ChaCha20 and AES-CTR work by xoring a secret stream of bytes with
> plaintext to produce a ciphertext. If you use that same secret stream
> and xor it with a second plaintext and transmit that too, an attacker
> can combine the two different ciphertexts to learn things about the
> original plaintext.
> 
> But, anyway, it seems like the race is here to stay given what we have
> _currently_ available with the virtual hardware. That's why I'm
> focused on trying to get something going that's the least bad with
> what we've currently got, which is racy by design. How vitally
> important is it to have something that doesn't race in the far future?
> I don't know, really. It seems plausible that that ACPI notifier
> triggers so early that nothing else really even has a chance, so the
> race concern is purely theoretical. But I haven't tried to measure
> that so I'm not sure.
> 
> Jason


I got curious, and wrote a dumb benchmark:


#include <stdio.h>

/* 16 bytes, i.e. the size of the vmgenid unique ID. */
struct lng {
	unsigned long long l1;
	unsigned long long l2;
};

/* One machine word, i.e. the proposed generation counter. */
struct shrt {
	unsigned long s;
};

struct lng l = { 1, 2 };
struct shrt s = { 3 };

/* Word-sized check: a single load and compare per call. */
static void test1(volatile struct shrt *sp)
{
	if (sp->s != s.s) {
		printf("short mismatch!\n");
		s.s = sp->s;
	}
}

/* 16-byte check: two loads and compares per call. */
static void test2(volatile struct lng *lp)
{
	if (lp->l1 != l.l1 || lp->l2 != l.l2) {
		printf("long mismatch!\n");
		l.l1 = lp->l1;
		l.l2 = lp->l2;
	}
}

int main(int argc, char **argv)
{
	/* volatile forces the loads, so the loops aren't optimized away. */
	volatile struct shrt sv = { 4 };
	volatile struct lng lv = { 5, 6 };

	if (argc > 1) {
		printf("test 1\n");
		for (int i = 0; i < 10000000; ++i)
			test1(&sv);
	} else {
		printf("test 2\n");
		for (int i = 0; i < 10000000; ++i)
			test2(&lv);
	}
	return 0;
}


Results (built with -O2, nothing fancy):

[mst@tuck ~]$ perf stat -r 1000 ./a.out 1 > /dev/null

 Performance counter stats for './a.out 1' (1000 runs):

              5.12 msec task-clock:u              #    0.945 CPUs utilized            ( +-  0.07% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
                52      page-faults:u             #   10.016 K/sec                    ( +-  0.07% )
        20,190,800      cycles:u                  #    3.889 GHz                      ( +-  0.01% )
        50,147,371      instructions:u            #    2.48  insn per cycle           ( +-  0.00% )
        20,032,224      branches:u                #    3.858 G/sec                    ( +-  0.00% )
             1,604      branch-misses:u           #    0.01% of all branches          ( +-  0.26% )

        0.00541882 +- 0.00000847 seconds time elapsed  ( +-  0.16% )

[mst@tuck ~]$ perf stat -r 1000 ./a.out > /dev/null

 Performance counter stats for './a.out' (1000 runs):

              7.75 msec task-clock:u              #    0.947 CPUs utilized            ( +-  0.12% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
                52      page-faults:u             #    6.539 K/sec                    ( +-  0.07% )
        30,205,916      cycles:u                  #    3.798 GHz                      ( +-  0.01% )
        80,147,373      instructions:u            #    2.65  insn per cycle           ( +-  0.00% )
        30,032,227      branches:u                #    3.776 G/sec                    ( +-  0.00% )
             1,621      branch-misses:u           #    0.01% of all branches          ( +-  0.23% )

        0.00817982 +- 0.00000965 seconds time elapsed  ( +-  0.12% )


So yes, the overhead is higher by 50%, which seems like a lot, but it
starts from a very small number, so I don't see why it's a show stopper;
it's not a factor of 10 such that we should sacrifice safety by default.
Maybe a kernel flag that removes the read, replacing it with an
interrupt, will do.

In other words, premature optimization is the root of all evil.

-- 
MST



* Re: propagating vmgenid outward and upward
  2022-03-02  8:30           ` Michael S. Tsirkin
@ 2022-03-02 11:26             ` Jason A. Donenfeld
  0 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-02 11:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

Hey Michael,

Thanks for the benchmark.

On Wed, Mar 2, 2022 at 9:30 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> So yes, the overhead is higher by 50%, which seems like a lot, but it
> starts from a very small number, so I don't see why it's a show stopper;
> it's not a factor of 10 such that we should sacrifice safety by default.
> Maybe a kernel flag that removes the read, replacing it with an
> interrupt, will do.
>
> In other words, premature optimization is the root of all evil.

Unfortunately I don't think it's as simple as that for several reasons.

First, I'm pretty confident a beefy Intel machine can mostly hide
non-dependent comparisons in the memory access and have the problem
mostly go away. But this is much less the case on, say, an in-order
MIPS32r2, which isn't just "some crappy ISA I'm using for the sake of
argument," but actually the platform on which a lot of networking and
WireGuard stuff runs, so I do care about it. There, we have 4
reads/comparisons which can't pipeline nearly as well.

There's also the atomicity aspect, which I think makes your benchmark
not quite accurate. Those 16 bytes could change between the first and
second word (or between the Nth and N+1th word for N<=3 on 32-bit).
What if in that case the word you read second doesn't change, but the
word you read first did? So then you find yourself having to do a
hi-lo-hi dance. And then consider the 32-bit case, where that's even
more annoying. This is just one of those things that comes up when you
compare the semantics of a "large unique ID" and "word-sized counter",
as general topics. (My suggestion is that vmgenid provide both.)
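
To sketch the difference concretely -- illustrative userspace C, where
the page layout and helpers are hypothetical, and the word-sized counter
is the proposed addition rather than anything the current spec provides:

#include <stdatomic.h>
#include <string.h>

struct vmgenid_page {
	_Atomic unsigned long counter;     /* proposed: word-sized, atomic */
	volatile unsigned char uuid[16];   /* what the spec gives us today */
};

/* Word-sized counter: one atomic load and one compare. */
static int counter_changed(struct vmgenid_page *p, unsigned long *last)
{
	unsigned long now = atomic_load_explicit(&p->counter,
						 memory_order_acquire);
	if (now == *last)
		return 0;
	*last = now;
	return 1;
}

/* 16-byte ID: no single load is atomic, so read twice and retry until
 * both passes agree, to avoid acting on a torn value. Even this
 * hi-lo-hi dance can't distinguish an A->B->A change the way a
 * monotonic counter trivially does. */
static int uuid_changed(struct vmgenid_page *p, unsigned char last[16])
{
	unsigned char a[16], b[16];

	do {
		for (int i = 0; i < 16; ++i)
			a[i] = p->uuid[i];
		for (int i = 0; i < 16; ++i)
			b[i] = p->uuid[i];
	} while (memcmp(a, b, 16));

	if (!memcmp(a, last, 16))
		return 0;
	memcpy(last, a, 16);
	return 1;
}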

Finally, there's also a slight storage aspect, where adding 16 bytes to a
per-key struct is a little bit heavier than adding 4 bytes and might
bust a cache line without sufficient care, care which always has some
cost in one way or another.

So I just don't know if it's realistic to impose a 16-byte per-packet
comparison all the time like that. I'm familiar with WireGuard
obviously, but there's also cifs and maybe even wifi and bluetooth,
and who knows what else, to care about too. Then there's the userspace
discussion. I can't imagine a 16-byte hotpath comparison being
accepted as implementable.

> And I feel if linux
> DTRT and reads the 16 bytes then hypervisor vendors will be motivated to
> improve and add a 4 byte unique one. As long as linux is interrupt
> driven there's no motivation for change.

I reeeeeally don't want to get pulled into the politics of this on the
hypervisor side. I assume an improved thing would begin with QEMU and
Firecracker or something collaborating because they're both open
source and Amazon people seem interested. And then pressure builds for
Microsoft and VMware to do it on their side. And then we get this all
nicely implemented in the kernel. In the meantime, though, I'm not
going to refuse to address the problem entirely just because the
virtual hardware is less than perfect; I'd rather make the most with
what we've got while still being somewhat reasonable from an
implementation perspective.

Jason


* Re: propagating vmgenid outward and upward
  2022-03-02 11:26             ` Jason A. Donenfeld
@ 2022-03-02 12:58               ` Michael S. Tsirkin
  0 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02 12:58 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 12:26:27PM +0100, Jason A. Donenfeld wrote:
> Hey Michael,
> 
> Thanks for the benchmark.
> 
> On Wed, Mar 2, 2022 at 9:30 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > So yes, the overhead is higher by 50%, which seems like a lot, but it
> > starts from a very small number, so I don't see why it's a show stopper;
> > it's not a factor of 10 such that we should sacrifice safety by default.
> > Maybe a kernel flag that removes the read, replacing it with an
> > interrupt, will do.
> >
> > In other words, premature optimization is the root of all evil.
> 
> Unfortunately I don't think it's as simple as that for several reasons.
> 
> First, I'm pretty confident a beefy Intel machine can mostly hide
> non-dependent comparisons in the memory access and have the problem
> mostly go away. But this is much less the case on, say, an in-order
> MIPS32r2, which isn't just "some crappy ISA I'm using for the sake of
> argument," but actually the platform on which a lot of networking and
> WireGuard stuff runs, so I do care about it. There, we have 4
> reads/comparisons which can't pipeline nearly as well.

Sure. Want to try running some benchmarks on that platform?
Presumably you have access to such a box, right?


> There's also the atomicity aspect, which I think makes your benchmark
> not quite accurate. Those 16 bytes could change between the first and
> second word (or between the Nth and N+1th word for N<=3 on 32-bit).
> What if in that case the word you read second doesn't change, but the
> word you read first did? So then you find yourself having to do a
> hi-lo-hi dance.
> And then consider the 32-bit case, where that's even
> more annoying. This is just one of those things that comes up when you
> compare the semantics of a "large unique ID" and "word-sized counter",
> as general topics. (My suggestion is that vmgenid provide both.)

I don't see how this matters for any applications at all. Feel free to
present a case that would be race-free with a word but not a 16-byte
value; I could not imagine one. It's human to err, of course.

>
> Finally, there's also a slight storage aspect, where adding 16 bytes to a
> per-key struct is a little bit heavier than adding 4 bytes and might
> bust a cache line without sufficient care, care which always has some
> cost in one way or another.
> 
> So I just don't know if it's realistic to impose a 16-byte per-packet
> comparison all the time like that. I'm familiar with WireGuard
> obviously, but there's also cifs and maybe even wifi and bluetooth,
> and who knows what else, to care about too. Then there's the userspace
> discussion. I can't imagine a 16-byte hotpath comparison being
> accepted as implementable.

I think this hinges on benchmarking results. Want to start with
my silly benchmark at least? If you can't measure an order of
magnitude gain then I think any effect on wireguard will be in the
noise.


> > And I feel if linux
> > DTRT and reads the 16 bytes then hypervisor vendors will be motivated to
> > improve and add a 4 byte unique one. As long as linux is interrupt
> > driven there's no motivation for change.
> 
> I reeeeeally don't want to get pulled into the politics of this on the
> hypervisor side. I assume an improved thing would begin with QEMU and
> Firecracker or something collaborating because they're both open
> source and Amazon people seem interested.

I think it would begin with a benchmark showing there's even any
measurable performance to be gained by switching the semantics.

> And then pressure builds for
> Microsoft and VMware to do it on their side. And then we get this all
> nicely implemented in the kernel. In the meantime, though, I'm not
> going to refuse to address the problem entirely just because the
> virtual hardware is less than perfect; I'd rather make the most with
> what we've got while still being somewhat reasonable from an
> implementation perspective.
> 
> Jason

Right, but given you are trading security off for performance, it matters
a lot what the performance gain is.

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 12:58               ` Michael S. Tsirkin
@ 2022-03-02 13:55                 ` Jason A. Donenfeld
  0 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-02 13:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

Hi Michael,

On Wed, Mar 02, 2022 at 07:58:33AM -0500, Michael S. Tsirkin wrote:
> > There's also the atomicity aspect, which I think makes your benchmark
> > not quite accurate. Those 16 bytes could change between the first and
> > second word (or between the Nth and N+1th word for N<=3 on 32-bit).
> > What if in that case the word you read second doesn't change, but the
> > word you read first did? So then you find yourself having to do a
> > hi-lo-hi dance.
> > And then consider the 32-bit case, where that's even
> > more annoying. This is just one of those things that comes up when you
> > compare the semantics of a "large unique ID" and "word-sized counter",
> > as general topics. (My suggestion is that vmgenid provide both.)
> 
> I don't see how this matters for any applications at all. Feel free to
> present a case that would be race-free with a word but not a 16-byte
> value; I could not imagine one. It's human to err, of course.

Word-size reads happen all at once on systems that Linux supports,
whereas this is not the case for 16 bytes (with a few niche exceptions
like cmpxchg16b and such). If you read the counter atomically, you can
check to see whether it's changed just after encrypting but before
transmitting and not transmit if it has changed, and voila, no race.
With 16 bytes, synchronization of that read is pretty tricky (though
maybe not altogether impossible), because, as I mentioned, the first
word might have changed by the time you read a matching second word. I'm
sure you're familiar with the use of seqlocks in the kernel for solving
a somewhat related problem.
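
As a sketch of that transmit path -- hypothetical names throughout; this
shows the pattern, not actual WireGuard code:

#include <stdatomic.h>

extern _Atomic unsigned long vm_generation;  /* the memremap()'d counter */

struct session {
	unsigned long generation;  /* vm_generation when keys were made */
	/* ... keys, nonce counter, peer, ... */
};

struct packet;  /* opaque for this sketch */
void encrypt_packet(struct session *s, struct packet *pkt);
int send_packet(struct packet *pkt);

static int xmit(struct session *s, struct packet *pkt)
{
	encrypt_packet(s, pkt);  /* plaintext and nonce now committed */

	/* If the VM forked since this session was keyed, a clone may be
	 * about to reuse our (key, nonce) pair: drop instead of sending. */
	if (atomic_load_explicit(&vm_generation, memory_order_acquire) !=
	    s->generation)
		return -1;

	/* A change after this point can only cause a retransmit of the
	 * very same ciphertext, which is an ordinary, harmless replay. */
	return send_packet(pkt);
}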

Jason


* Re: propagating vmgenid outward and upward
  2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
                   ` (2 preceding siblings ...)
  2022-03-01 18:01   ` Greg KH
@ 2022-03-02 14:35 ` Jason A. Donenfeld
  2022-03-09 10:10 ` Alexander Graf
  4 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-02 14:35 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-hyperv, linux-crypto, graf,
	mikelley, gregkh, adrian, lersek, berrange, linux, jannh, mst,
	rafael, len.brown, pavel, linux-pm, colmmacc, tytso, arnd

Hey again,

On Tue, Mar 01, 2022 at 04:42:47PM +0100, Jason A. Donenfeld wrote:
> For (B), it's a little bit trickier. But I think our options follow the
> same rubric. We can expose a generation counter in the vDSO, with
> semantics akin to the extern integer I described above. Or we could
> expose that counter in a file that userspace could poll() on and receive
> notifications that way. Or perhaps a third way. I'm all ears here.
> Alex's team from Amazon last year proposed something similar to the vDSO
> idea, except using mmap on a sysfs file, though from what I can tell,
> that wound up being kind of complicated. Due to the fact that we're
> _already_ racy, I think I'm most inclined at this point toward the
> poll() approach for the same reasons as I prefer a notifier_block. But
> on userspace I could be convinced otherwise, and I'd be interested in
> totally different ideas here too.

I implemented the poll() case here in 15 lines of code and found it
remarkably simple to do:

https://lore.kernel.org/lkml/20220302143331.654426-1-Jason@zx2c4.com/

This is just a PoC/RFC for the sake of having something tangible to look
at for this thread. It is notable to me, though, that implementing this
was so minimal.
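
For a sense of the userspace side, a consumer could be as simple as this
sketch; the file path and the poll() semantics here are assumptions for
illustration, not necessarily what the RFC above implements:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	struct pollfd pfd;

	/* Hypothetical path; the RFC defines the real interface. */
	pfd.fd = open("/sys/kernel/vmgenid/generation", O_RDONLY);
	if (pfd.fd < 0) {
		perror("open");
		return 1;
	}
	pfd.events = POLLPRI | POLLERR;  /* sysfs-style notify, assumed */

	for (;;) {
		if (poll(&pfd, 1, -1) < 0) {
			perror("poll");
			return 1;
		}
		/* The VM was forked/restored: reseed the userspace RNG,
		 * drop session keys, etc., then rearm by re-reading. */
		lseek(pfd.fd, 0, SEEK_SET);
		if (read(pfd.fd, buf, sizeof(buf)) < 0)
			perror("read");
		fprintf(stderr, "vmgenid changed; reinitializing\n");
	}
}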

Regards,
Jason


* Re: propagating vmgenid outward and upward
  2022-03-02 13:55                 ` Jason A. Donenfeld
@ 2022-03-02 14:46                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02 14:46 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 02:55:29PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 02, 2022 at 07:58:33AM -0500, Michael S. Tsirkin wrote:
> > > There's also the atomicity aspect, which I think makes your benchmark
> > > not quite accurate. Those 16 bytes could change between the first and
> > > second word (or between the Nth and N+1th word for N<=3 on 32-bit).
> > > What if in that case the word you read second doesn't change, but the
> > > word you read first did? So then you find yourself having to do a
> > > hi-lo-hi dance.
> > > And then consider the 32-bit case, where that's even
> > > more annoying. This is just one of those things that comes up when you
> > > compare the semantics of a "large unique ID" and "word-sized counter",
> > > as general topics. (My suggestion is that vmgenid provide both.)
> > 
> > I don't see how this matters for any applications at all. Feel free to
> > present a case that would be race-free with a word but not a 16-byte
> > value; I could not imagine one. It's human to err, of course.
> 
> Word-size reads happen all at once on systems that Linux supports,
> whereas this is not the case for 16 bytes (with a few niche exceptions
> like cmpxchg16b and such). If you read the counter atomically, you can
> check to see whether it's changed just after encrypting but before
> transmitting and not transmit if it has changed, and voila, no race.
> With 16 bytes, synchronization of that read is pretty tricky (though
> maybe not altogether impossible), because, as I mentioned, the first
> word might have changed by the time you read a matching second word. I'm
> sure you're familiar with the use of seqlocks in the kernel for solving
> a somewhat related problem.
> 
> Jason

I just don't see how "value changed while it was read" is so different
from "value changed one clock after it was read".  Since we don't detect
the latter I don't see why we should worry about the former.  What I
don't have here is a picture of how code reading the value would look.
It might help to write some pseudocode to show that, but I'd say it
makes more sense to just code up the read, even if only so the overhead
of the current interface can be roughly measured.

-- 
MST



* Re: propagating vmgenid outward and upward
  2022-03-02 14:46                   ` Michael S. Tsirkin
@ 2022-03-02 15:14                     ` Jason A. Donenfeld
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-02 15:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

Hi Michael,

On Wed, Mar 2, 2022 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> I just don't see how "value changed while it was read" is so different
> from "value changed one clock after it was read".  Since we don't detect
> the latter I don't see why we should worry about the former.

The "barrier" is at the point where the plaintext has been chosen AND
the nonce for a given keypair has been selected. So, if you have
plaintext in a buffer, and a key in a buffer, and the nonce for that
encryption in a buffer, and then after those are all selected, you
check to see if the vmgenid has changed since the birth of that key,
then you're all set. If it changes _after_ that point of check (your
"one clock after"), it doesn't matter: you'll just be
double-transmitting the same ciphertext, which is something that flaky
wifi sometimes does _anyway_ (and attackers can do intentionally), so
network protocols already are resilient to replay. This is the same
case you asked about earlier, and then answered yourself, when you
were wondering about reaching down into qdiscs.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 15:14                     ` Jason A. Donenfeld
@ 2022-03-02 15:20                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02 15:20 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 04:14:56PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 2, 2022 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > I just don't see how "value changed while it was read" is so different
> > from "value changed one clock after it was read".  Since we don't detect
> > the latter I don't see why we should worry about the former.
> 
> The "barrier" is at the point where the plaintext has been chosen AND
> the nonce for a given keypair has been selected. So, if you have
> plaintext in a buffer, and a key in a buffer, and the nonce for that
> encryption in a buffer, and then after those are all selected, you
> check to see if the vmgenid has changed since the birth of that key,
> then you're all set. If it changes _after_ that point of check (your
> "one clock after"), it doesn't matter: you'll just be
> double-transmitting the same ciphertext, which is something that flaky
> wifi sometimes does _anyway_ (and attackers can do intentionally), so
> network protocols already are resilient to replay. This is the same
> case you asked about earlier, and then answered yourself, when you
> were wondering about reaching down into qdiscs.
> 
> Jason

So writing some code:

1:
	put plaintext in a buffer
	put a key in a buffer
	put the nonce for that encryption in a buffer

	if vm gen id != stored vm gen id
		stored vm gen id = vm gen id
		goto 1

I think this is race free, but I don't see why it matters whether we
read the gen id atomically or not.
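
Rendered as C, just to make it concrete - a minimal sketch, with
made-up names and a hypothetical read_vmgenid() accessor:

	__uint128_t gen;
retry:
	fill_plaintext(buf);		/* the three steps above */
	fill_key(buf);
	fill_nonce(buf);

	gen = read_vmgenid();		/* plain (non-atomic) 16-byte read */
	if (gen != stored_gen) {	/* world changed underneath us */
		stored_gen = gen;
		goto retry;
	}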

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 15:20                       ` Michael S. Tsirkin
@ 2022-03-02 15:36                         ` Jason A. Donenfeld
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-02 15:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

Hi Michael,

On Wed, Mar 02, 2022 at 10:20:25AM -0500, Michael S. Tsirkin wrote:
> So writing some code:
> 
> 1:
> 	put plaintext in a buffer
> 	put a key in a buffer
> 	put the nonce for that encryption in a buffer
> 
> 	if vm gen id != stored vm gen id
> 		stored vm gen id = vm gen id
> 		goto 1
> 
> I think this is race free, but I don't see why it matters whether we
> read the gen id atomically or not.

Because that 16 byte read of vmgenid is not atomic. Let's say you read
the first 8 bytes, and then the VM is forked. In the forked VM, the next
8 bytes are the same as last time, but the first 8 bytes, which you
already read, have changed. In that case, your != becomes a ==, and the
test fails.

This is one of those fundamental things of "unique ID" vs "generation
counter word".

Anyway, per your request in your last email, I wrote some code for this,
which may or may not be totally broken, and only works on 64-bit x86,
which is really the best possible case in terms of performance. And even
so, it's not great.

Jason

--------8<------------------------

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index 720952b92e78..250b8973007d 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -106,6 +106,7 @@ static struct noise_keypair *keypair_create(struct wg_peer *peer)
 	keypair->entry.type = INDEX_HASHTABLE_KEYPAIR;
 	keypair->entry.peer = peer;
 	kref_init(&keypair->refcount);
+	keypair->vmgenid = vmgenid_read_atomic();
 	return keypair;
 }

diff --git a/drivers/net/wireguard/noise.h b/drivers/net/wireguard/noise.h
index c527253dba80..0add240a14a0 100644
--- a/drivers/net/wireguard/noise.h
+++ b/drivers/net/wireguard/noise.h
@@ -27,10 +27,13 @@ struct noise_symmetric_key {
 	bool is_valid;
 };

+extern __uint128_t vmgenid_read_atomic(void);
+
 struct noise_keypair {
 	struct index_hashtable_entry entry;
 	struct noise_symmetric_key sending;
 	atomic64_t sending_counter;
+	__uint128_t vmgenid;
 	struct noise_symmetric_key receiving;
 	struct noise_replay_counter receiving_counter;
 	__le32 remote_index;
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index 5368f7c35b4b..40d016be59e3 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -381,6 +381,9 @@ void wg_packet_send_staged_packets(struct wg_peer *peer)
 			goto out_invalid;
 	}

+	if (keypair->vmgenid != vmgenid_read_atomic())
+		goto out_invalid;
+
 	packets.prev->next = NULL;
 	wg_peer_get(keypair->entry.peer);
 	PACKET_CB(packets.next)->keypair = keypair;
diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
index 0ae1a39f2e28..c122fae1d494 100644
--- a/drivers/virt/vmgenid.c
+++ b/drivers/virt/vmgenid.c
@@ -21,6 +21,21 @@ struct vmgenid_state {
 	u8 this_id[VMGENID_SIZE];
 };

+static __uint128_t *val;
+
+__uint128_t vmgenid_read_atomic(void)
+{
+	__uint128_t ret = 0;
+	if (!val)
+		return 0;
+	asm volatile("lock cmpxchg16b %1"
+		     : "+A"(ret), "+m"(*val)
+		     : "b"(0), "c"(0)
+		     : "cc");
+	return ret;
+}
+EXPORT_SYMBOL(vmgenid_read_atomic);
+
 static int vmgenid_add(struct acpi_device *device)
 {
 	struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER };
@@ -50,6 +65,7 @@ static int vmgenid_add(struct acpi_device *device)
 	phys_addr = (obj->package.elements[0].integer.value << 0) |
 		    (obj->package.elements[1].integer.value << 32);
 	state->next_id = devm_memremap(&device->dev, phys_addr, VMGENID_SIZE, MEMREMAP_WB);
+	val = IS_ERR(state->next_id) ? NULL : (__uint128_t *)state->next_id;
 	if (IS_ERR(state->next_id)) {
 		ret = PTR_ERR(state->next_id);
 		goto out;


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 15:36                         ` Jason A. Donenfeld
@ 2022-03-02 16:22                           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02 16:22 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 04:36:49PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 02, 2022 at 10:20:25AM -0500, Michael S. Tsirkin wrote:
> > So writing some code:
> > 
> > 1:
> > 	put plaintext in a buffer
> > 	put a key in a buffer
> > 	put the nonce for that encryption in a buffer
> > 
> > 	if vm gen id != stored vm gen id
> > 		stored vm gen id = vm gen id
> > 		goto 1
> > 
> > I think this is race free, but I don't see why it matters whether we
> > read the gen id atomically or not.
> 
> Because that 16 byte read of vmgenid is not atomic. Let's say you read
> the first 8 bytes, and then the VM is forked.

But at this point, when the VM was forked, the plaintext, key, and
nonce are all in the buffer, and you previously indicated that a fork
at this point is harmless.  You wrote "If it changes _after_ that point
of check ... it doesn't matter:"

> In the forked VM, the next
> 8 bytes are the same as last time, but the first 8 bytes, which you
> already read, have changed. In that case, your != becomes a ==, and the
> test fails.

Yes, I'm aware of what an atomic read is.  If the read is not atomic,
part of the value can change ;)

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 15:36                         ` Jason A. Donenfeld
@ 2022-03-02 16:29                           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02 16:29 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 04:36:49PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 02, 2022 at 10:20:25AM -0500, Michael S. Tsirkin wrote:
> > So writing some code:
> > 
> > 1:
> > 	put plaintext in a buffer
> > 	put a key in a buffer
> > 	put the nonce for that encryption in a buffer
> > 
> > 	if vm gen id != stored vm gen id
> > 		stored vm gen id = vm gen id
> > 		goto 1
> > 
> > I think this is race free, but I don't see why it matters whether we
> > read the gen id atomically or not.
> 
> Because that 16 byte read of vmgenid is not atomic. Let's say you read
> the first 8 bytes, and then the VM is forked. In the forked VM, the next
> 8 bytes are the same as last time, but the first 8 bytes, which you
> already read, have changed. In that case, your != becomes a ==, and the
> test fails.
> 
> This is one of those fundamental things of "unique ID" vs "generation
> counter word".
> 
> Anyway, per your request in your last email, I wrote some code for this,
> which may or may not be totally broken, and only works on 64-bit x86,
> which is really the best possible case in terms of performance. And even
> so, it's not great.
> 
> Jason
> 
> --------8<------------------------
> 
> diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
> index 720952b92e78..250b8973007d 100644
> --- a/drivers/net/wireguard/noise.c
> +++ b/drivers/net/wireguard/noise.c
> @@ -106,6 +106,7 @@ static struct noise_keypair *keypair_create(struct wg_peer *peer)
>  	keypair->entry.type = INDEX_HASHTABLE_KEYPAIR;
>  	keypair->entry.peer = peer;
>  	kref_init(&keypair->refcount);
> +	keypair->vmgenid = vmgenid_read_atomic();
>  	return keypair;
>  }
> 
> diff --git a/drivers/net/wireguard/noise.h b/drivers/net/wireguard/noise.h
> index c527253dba80..0add240a14a0 100644
> --- a/drivers/net/wireguard/noise.h
> +++ b/drivers/net/wireguard/noise.h
> @@ -27,10 +27,13 @@ struct noise_symmetric_key {
>  	bool is_valid;
>  };
> 
> +extern __uint128_t vmgenid_read_atomic(void);
> +
>  struct noise_keypair {
>  	struct index_hashtable_entry entry;
>  	struct noise_symmetric_key sending;
>  	atomic64_t sending_counter;
> +	__uint128_t vmgenid;
>  	struct noise_symmetric_key receiving;
>  	struct noise_replay_counter receiving_counter;
>  	__le32 remote_index;
> diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
> index 5368f7c35b4b..40d016be59e3 100644
> --- a/drivers/net/wireguard/send.c
> +++ b/drivers/net/wireguard/send.c
> @@ -381,6 +381,9 @@ void wg_packet_send_staged_packets(struct wg_peer *peer)
>  			goto out_invalid;
>  	}
> 
> +	if (keypair->vmgenid != vmgenid_read_atomic())
> +		goto out_invalid;
> +
>  	packets.prev->next = NULL;
>  	wg_peer_get(keypair->entry.peer);
>  	PACKET_CB(packets.next)->keypair = keypair;

I don't think we care about an atomic read here.  All data is in the
buffer by this point; if it did not fork before that, then we are OK,
even if it forks during the read.

We probably do need a memory barrier to make sure all writes complete
before the read of vmgenid; I'm not sure which kind - I think the
hypervisor can be trusted to do a full CPU barrier on fork, so probably
just a compiler barrier.
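
Something like this sketch, say (fill_packet() standing in for the
real packet setup; barrier() being the kernel's compiler-only barrier):

	fill_packet(skb);	/* all writes that must precede the check */
	barrier();		/* don't let the compiler hoist the read */
	if (keypair->vmgenid != vmgenid_read_atomic())
		goto out_invalid;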

> diff --git a/drivers/virt/vmgenid.c b/drivers/virt/vmgenid.c
> index 0ae1a39f2e28..c122fae1d494 100644
> --- a/drivers/virt/vmgenid.c
> +++ b/drivers/virt/vmgenid.c
> @@ -21,6 +21,21 @@ struct vmgenid_state {
>  	u8 this_id[VMGENID_SIZE];
>  };
> 
> +static __uint128_t *val;
> +
> +__uint128_t vmgenid_read_atomic(void)
> +{
> +	__uint128_t ret = 0;
> +	if (!val)
> +		return 0;
> +	asm volatile("lock cmpxchg16b %1"
> +		     : "+A"(ret), "+m"(*val)
> +		     : "b"(0), "c"(0)
> +		     : "cc");
> +	return ret;
> +}
> +EXPORT_SYMBOL(vmgenid_read_atomic);
> +
>  static int vmgenid_add(struct acpi_device *device)
>  {
>  	struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER };
> @@ -50,6 +65,7 @@ static int vmgenid_add(struct acpi_device *device)
>  	phys_addr = (obj->package.elements[0].integer.value << 0) |
>  		    (obj->package.elements[1].integer.value << 32);
>  	state->next_id = devm_memremap(&device->dev, phys_addr, VMGENID_SIZE, MEMREMAP_WB);
> +	val = IS_ERR(state->next_id) ? NULL : (__uint128_t *)state->next_id;
>  	if (IS_ERR(state->next_id)) {
>  		ret = PTR_ERR(state->next_id);
>  		goto out;


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 16:22                           ` Michael S. Tsirkin
@ 2022-03-02 16:32                             ` Jason A. Donenfeld
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-02 16:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

Hi Michael,

On Wed, Mar 02, 2022 at 11:22:46AM -0500, Michael S. Tsirkin wrote:
> > Because that 16 byte read of vmgenid is not atomic. Let's say you read
> > the first 8 bytes, and then the VM is forked.
> 
> But at this point when VM was forked plaintext key and nonce are all in
> buffer, and you previously indicated a fork at this point is harmless.
> You wrote "If it changes _after_ that point of check ... it doesn't
> matter:"

Ahhh, fair point. I think you're right.

Alright, so all we're talking about here is an ordinary 16-byte read,
and 16 bytes of storage per keypair, and a 16-byte comparison.

Still seems much worse than just having a single word...

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 16:32                             ` Jason A. Donenfeld
@ 2022-03-02 17:27                               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-02 17:27 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 05:32:07PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 02, 2022 at 11:22:46AM -0500, Michael S. Tsirkin wrote:
> > > Because that 16 byte read of vmgenid is not atomic. Let's say you read
> > > the first 8 bytes, and then the VM is forked.
> > 
> > But at this point when VM was forked plaintext key and nonce are all in
> > buffer, and you previously indicated a fork at this point is harmless.
> > You wrote "If it changes _after_ that point of check ... it doesn't
> > matter:"
> 
> Ahhh, fair point. I think you're right.
> 
> Alright, so all we're talking about here is an ordinary 16-byte read,
> and 16 bytes of storage per keypair, and a 16-byte comparison.
> 
> Still seems much worse than just having a single word...
> 
> Jason

And it is: I saw 30% higher overhead; it is, however, 30% of a very
low number ;)

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-02 16:32                             ` Jason A. Donenfeld
@ 2022-03-03 13:07                               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-03 13:07 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Laszlo Ersek, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Alexander Graf, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Wed, Mar 02, 2022 at 05:32:07PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 02, 2022 at 11:22:46AM -0500, Michael S. Tsirkin wrote:
> > > Because that 16 byte read of vmgenid is not atomic. Let's say you read
> > > the first 8 bytes, and then the VM is forked.
> > 
> > But at this point when VM was forked plaintext key and nonce are all in
> > buffer, and you previously indicated a fork at this point is harmless.
> > You wrote "If it changes _after_ that point of check ... it doesn't
> > matter:"
> 
> Ahhh, fair point. I think you're right.
> 
> Alright, so all we're talking about here is an ordinary 16-byte read,
> and 16 bytes of storage per keypair, and a 16-byte comparison.
> 
> Still seems much worse than just having a single word...
> 
> Jason

Oh I forgot about __int128.



#include <stdio.h>
#include <assert.h>
#include <limits.h>
#include <string.h>

/* 16-byte value, standing in for the vmgenid unique ID */
struct lng {
	__int128 l;
};

/* word-sized value, standing in for a generation counter */
struct shrt {
	unsigned long s;
};

struct lng l = { 1 };
struct shrt s = { 3 };

/* word-sized compare: a single load and compare */
static void test1(volatile struct shrt *sp)
{
	if (sp->s != s.s) {
		printf("short mismatch!\n");
		s.s = sp->s;
	}
}

/* 16-byte compare: two loads and two compares without SSE */
static void test2(volatile struct lng *lp)
{
	if (lp->l != l.l) {
		printf("long mismatch!\n");
		l.l = lp->l;
	}
}

int main(int argc, char **argv)
{
	volatile struct shrt sv = { 4 };
	volatile struct lng lv = { 5 };

	if (argc > 1) {
		printf("test 1\n");
		for (int i = 0; i < 100000000; ++i) 
			test1(&sv);
	} else {
		printf("test 2\n");
		for (int i = 0; i < 100000000; ++i)
			test2(&lv);
	}
	return 0;
}


With that, the compiler has an easier time producing optimal code, so
the difference is smaller.  Note: compiled with

gcc -O2 -mno-sse -mno-sse2 -ggdb bench3.c

since with SSE there's no difference at all.


[mst@tuck ~]$ perf stat -r 100 ./a.out 1 > /dev/null 


 Performance counter stats for './a.out 1' (100 runs):

             94.55 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.09% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
                52      page-faults:u             #  548.914 /sec                     ( +-  0.21% )
       400,459,851      cycles:u                  #    4.227 GHz                      ( +-  0.03% )
       500,147,935      instructions:u            #    1.25  insn per cycle           ( +-  0.00% )
       200,032,462      branches:u                #    2.112 G/sec                    ( +-  0.00% )
             1,810      branch-misses:u           #    0.00% of all branches          ( +-  0.73% )

         0.0949732 +- 0.0000875 seconds time elapsed  ( +-  0.09% )

[mst@tuck ~]$ 
[mst@tuck ~]$ perf stat -r 100 ./a.out > /dev/null 

 Performance counter stats for './a.out' (100 runs):

            110.19 msec task-clock:u              #    1.136 CPUs utilized            ( +-  0.18% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
                52      page-faults:u             #  537.743 /sec                     ( +-  0.22% )
       428,518,442      cycles:u                  #    4.431 GHz                      ( +-  0.07% )
       900,147,986      instructions:u            #    2.24  insn per cycle           ( +-  0.00% )
       200,032,505      branches:u                #    2.069 G/sec                    ( +-  0.00% )
             2,139      branch-misses:u           #    0.00% of all branches          ( +-  0.77% )

          0.096956 +- 0.000203 seconds time elapsed  ( +-  0.21% )

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
                   ` (3 preceding siblings ...)
  2022-03-02 14:35 ` Jason A. Donenfeld
@ 2022-03-09 10:10 ` Alexander Graf
  2022-03-09 22:02     ` Jason A. Donenfeld
  4 siblings, 1 reply; 61+ messages in thread
From: Alexander Graf @ 2022-03-09 10:10 UTC (permalink / raw)
  To: Jason A. Donenfeld, linux-kernel, kvm, qemu-devel, linux-hyperv,
	linux-crypto, mikelley, gregkh, adrian, lersek, berrange, linux,
	jannh, mst, rafael, len.brown, pavel, linux-pm, colmmacc, tytso,
	arnd


On 01.03.22 16:42, Jason A. Donenfeld wrote:
>
> Hey folks,
>
> Having finally wrapped up development of the initial vmgenid driver, I
> thought I'd pull together some thoughts on vmgenid, notification, and
> propagating, from disjointed conversations I've had with a few of you
> over the last several weeks.
>
> The basic problem is: VMs can be cloned, forked, rewound, or
> snapshotted, and when this happens, a) the RNG needs to reseed itself,
> and b) cryptographic algorithms that are not reuse resistant need to
> reinitialize in one way or another. For 5.18, we're handling (a) via the
> new vmgenid driver, which implements a spec from Microsoft, whereby the
> driver receives ACPI notifications when a 16 byte unique value changes.
>
> The vmgenid driver basically works, though it is racy, because that ACPI
> notification can arrive after the system is already running again. This


I believe enough people already pointed out that this assumption is
incorrect. The thing that is racy about VMGenID is the interrupt-based
notification. The actual identifier is updated before the VM resumes
from its clone operation, so if you match on that, you will know whether
you are in a new or an old world. And that is enough to create
transactions: save the identifier before a "crypto transaction",
validate it before you finish; if they don't match, abort, reseed, and
replay.
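
In rough pseudocode (read_vmgenid() and the other names are made up):

retry:
	before = read_vmgenid();	/* snapshot at transaction start */
	run_crypto_transaction();	/* pick nonce, encrypt, ... */
	if (read_vmgenid() != before) {
		reseed();		/* new world detected */
		goto retry;		/* abort and replay */
	}
	transmit();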


> race is even worse on Windows, where they kick the notification into a
> worker thread, which then publishes it upward elsewhere to another async
> mechanism, and eventually it hits the RNG and various userspace apps.
> On Linux it's not that bad -- we reseed immediately upon receiving the
> notification -- but it still inherits this same "push"-model deficiency,
> which a "pull"-model would not have.
>
> If we had a "pull" model, rather than just expose a 16-byte unique
> identifier, the vmgenid virtual hardware would _also_ expose a
> word-sized generation counter, which would be incremented every time the
> unique ID changed. Then, every time we would touch the RNG, we'd simply
> do an inexpensive check of this memremap()'d integer, and reinitialize
> with the unique ID if the integer changed. In this way, the race would
> be entirely eliminated. We would then be able to propagate this outwards
> to other drivers, by just exporting an extern symbol, in the manner of
> `jiffies`, and propagate it upwards to userspace, by putting it in the
> vDSO, in the manner of gettimeofday. And like that, there'd be no
> terrible async thing and things would work pretty easily.
>
> But that's not what we have, because Microsoft didn't collaborate with
> anybody on this, and now it's implemented in several hypervisors. Given
> that I'm already spending considerable time working on the RNG, entirely
> without funding, somehow I'm not super motivated to lead a
> cross-industry political effort to change Microsoft's vmgenid spec.
> Maybe somebody else has an appetite for this, but either way, those
> changes would be several years off at best.
>
> So given we have a "push"-model mechanism, there are two problems to
> tackle, perhaps in the same way, perhaps in a different way:
>
> A) Outwards propagation toward other kernel drivers: in this case, I
>     have in mind WireGuard, naturally, which very much needs to clear its
>     existing sessions when VMs are forked.
>
> B) Upwards propagation to userspace: in this case, we handle the
>     concerns of the Amazon engineers on this thread who broached this
>     topic a few years ago, in which s2n, their TLS library, wants to
>     reinitialize its userspace RNG (a silly thing, but I digress) and
>     probably clear session keys too, for the same good reason as
>     WireGuard.
>
> For (A), at least wearing my WireGuard-maintainer hat, there is an easy
> way and there is a "race-free" way. I use scare quotes there because
> we're still in a "push"-model, which means it's still racy no matter
> what.
>
> The faux "race-free" way involves having `extern u32 rng_vm_generation;`
> or similar in random.h, and then everything that generates a session key
> would snapshot this value, and every time a session key is used, a
> comparison would be made. This works, but given that we're going to be
> racy no matter what, I think I'd prefer avoiding the extra code in the
> hot path and extra per-session storage. It seems like that'd involve a
> lot of fiddly engineering for no real world benefit.
>
> The easy way, and the way that I think I prefer, would be to just have a
> sync notifier_block for this, just like we have with
> register_pm_notifier(). From my perspective, it'd be simplest to just
> piggy back on the already existing PM notifier with an extra event,
> PM_POST_VMFORK, which would join the existing set of 7, following
> PM_POST_RESTORE. I think that'd be coherent. However, if the PM people
> don't want to play ball, we could always come up with our own
> notifier_block. But I don't see the need. Plus, WireGuard *already*
> uses the PM notifier for clearing keys, so code-wise for my use case,
> that'd amount adding another case for PM_POST_VMFORK, in addition to the
> currently existing PM_HIBERNATION_PREPARE and PM_SUSPEND_PREPARE cases,
> which all would be treated the same way. Ezpz. So if that sounds like an
> interesting thing to the PM people, I think I'd like to propose a patch
> for that, possibly even for 5.18, given that it'd be very straight-
> forward.
>
> For (B), it's a little bit trickier. But I think our options follow the
> same rubric. We can expose a generation counter in the vDSO, with
> semantics akin to the extern integer I described above. Or we could
> expose that counter in a file that userspace could poll() on and receive
> notifications that way. Or perhaps a third way. I'm all ears here.
> Alex's team from Amazon last year proposed something similar to the vDSO
> idea, except using mmap on a sysfs file, though from what I can tell,
> that wound up being kind of complicated. Due to the fact that we're
> _already_ racy, I think I'm most inclined at this point toward the
> poll() approach for the same reasons as I prefer a notifier_block. But
> on userspace I could be convinced otherwise, and I'd be interested in
> totally different ideas here too.
>
> Another thing I should note is that, while I'm not currently leaning
> toward it, the vDSO approach also ties into interesting discussions
> about userspace RNGs (generally a silly idea), and their need for things
> like fork detection and also learning when the kernel RNG was last
> reseeded. So cracking open the vDSO book might invite all sorts of other
> interesting questions and discussions, which may be productive or may be
> a humongous distraction. (Also, again, I'm not super enthusiastic about
> userspace RNGs.)
>
> Also, there is an interesting question to decide with regards to
> userspace, which is whether the vmgenid driver should expose its unique
> ID to userspace, as Alex requested on an earlier thread. I am actually
> sort of opposed to this. That unique ID may or may not be secret and
> entropic; if it isn't, the crypto is designed to not be impacted
> negatively, but if it is, we should keep it secret. So, rather, I think
> the correct flow is that userspace simply calls getrandom() upon
> learning that the VM forked, which is guaranteed to have been
> reinitialized already by add_vmfork_randomness(), and that will
> guarantee a value that is unique to the VM, without having to actually
> expose that value.


If you follow the logic at the beginning of the mail, you can create
something race-free if you consume the hardware VMGenID counter. You
cannot make it race-free if you rely on the interrupt mechanism.

So following that train of thought, if you expose the hardware VMGenID 
to user space, you could allow user space to act race free based on 
VMGenID. That means consumers of user space RNGs could validate whether 
the ID is identical between the beginning of the crypto operation and 
the end.

That said, there are two pieces to the puzzle of user space
notification: polling and event-based. The part above solves the
polling use cases - user space libraries that just want to know whether
they are now in a new world.

However, there are more complicated cases as well. What do you do with 
Samba for example? It needs to generate a new SID after the clone. 
That's a super heavy operation. Do you want to have smbd constantly poll 
on the VMGenID just to see whether it needs to kick off some 
administrative actions?

For the event-based approach, we're in the same boat as "S3 resume" -
we need a global notification mechanism that the state of the system
changed, so that we can act accordingly. That's where the systemd
proposal[1] comes in: create inhibitors and scriptlets that get spawned
when we want to suspend and then resume-cloned later. I'm personally
even OK if we just limit that whole use case to cloning while you're in
S3 only.

In that case, all we would need from the kernel is an easily readable
GenID that changes before systemd runs again after suspend: systemd
wakes up after resume, checks whether the GenID changed, and if so,
invokes the unquiescing target after the resume one.

For this particular use case, we're not in the fast path, so we could
make reading the GenID a syscall that checks against VMGenID. But that
won't cut it for the polling use case.

I'm also not a super big fan of putting all that logic into systemd. It 
means applications need to create their own notification mechanisms to 
pass that cloning notification into actual processes. Don't we have any 
mechanism that applications and libraries could use to natively get an 
event when the GenID changes?


Alex


[1] https://github.com/systemd/systemd/issues/20222


>
> So, anyway, this is more or less where my thinking on this matter is.
> Would be happy to hear some fresh ideas here too.
>
> Regards,
> Jason






^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-09 10:10 ` Alexander Graf
@ 2022-03-09 22:02     ` Jason A. Donenfeld
  0 siblings, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-03-09 22:02 UTC (permalink / raw)
  To: Alexander Graf
  Cc: LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Michael S. Tsirkin,
	Rafael J. Wysocki, Brown, Len, Pavel Machek, Linux PM,
	Colm MacCarthaigh, Theodore Ts'o, Arnd Bergmann

Hi Alex,

On Wed, Mar 9, 2022 at 3:10 AM Alexander Graf <graf@amazon.com> wrote:
> > The vmgenid driver basically works, though it is racy, because that ACPI
> > notification can arrive after the system is already running again. This
>
>
> I believe enough people already pointed out that this assumption is
> incorrect. The thing that is racy about VMGenID is the interrupt based
> notification.

I'm having a hard time figuring out what's different between your
statement and mine. I said that the race is due to the notification.
You said that the race is due to the notification. What subtle thing
am I missing here that would lead you to say that my assumption is
incorrect? Or did you just misread?

> The actual identifier is updated before the VM resumes
> from its clone operation, so if you match on that you will know whether
> you are in a new or old world. And that is enough to create
> transactions: Save the identifier before a "crypto transaction",
> validate before you finish, if they don't match, abort, reseed and replay.

Right. But more than just transactions, it's useful for preventing key
reuse vulnerabilities, in which case, you store the current identifier
just before an ephemeral key is generated, and then subsequently check
to see that the identifier hasn't changed before transmitting anything
related to that key.

> If you follow the logic at the beginning of the mail, you can create
> something race-free if you consume the hardware VMGenID counter. You
> cannot make it race-free if you rely on the interrupt mechanism.

Yes, as mentioned and discussed in depth before. However, your use of
the word "counter" is problematic. Vmgenid is not a counter. It's a
unique identifier. That means you can't compare it with a single word
comparison but have to compare all of the 16 bytes. That seems
potentially expensive. It's for that reason that I suggested
augmenting the vmgenid spec with an additional word-sized _counter_
that could be mapped into the kernels and into userspaces.
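
To illustrate what that would buy us - a sketch only, since no such 
counter exists in the spec today (kernel context, so u32, READ_ONCE() 
and unlikely() are available):

    /* Hypothetical: 'vmgen_counter' points at a memremap()'d
     * word-sized counter that the hypervisor bumps on every clone. */
    extern const u32 *vmgen_counter;
    static u32 last_vmgen;

    static inline void check_vmgen(void)
    {
            u32 now = READ_ONCE(*vmgen_counter);

            if (unlikely(now != last_vmgen)) {
                    last_vmgen = now;
                    reseed(); /* stand-in for add_vmfork_randomness() */
            }
    }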

> So following that train of thought, if you expose the hardware VMGenID
> to user space, you could allow user space to act race-free based on
> VMGenID. That means consumers of user space RNGs could validate whether
> the ID is identical between the beginning of the crypto operation and
> the end.

Right.

> However, there are more complicated cases as well. What do you do with
> Samba for example? It needs to generate a new SID after the clone.
> That's a super heavy operation. Do you want to have smbd constantly poll
> on the VMGenID just to see whether it needs to kick off some
> administrative actions?

Were it a single word-sized integer, mapped into memory, that wouldn't
be much of a problem at all. It could constantly read this before and
after every operation. The problem is that it's 16 bytes and
understandably applications don't want to deal with that clunkiness.

> In that case, all we would need from the kernel is an easily readable
> GenID that changes

Actually, no, you need even less than that. All that's required is a
sysfs/procfs file that can be poll()'d on. It doesn't need to have any
content. When poll() returns readable, the VM has been forked. Then
userspace RNGs and other things like that can call getrandom() to
receive a fresh value to mix into whatever their operation is. Since
all we're talking about here is _event notification_, all we need is
that event, which is what poll() provides.
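
So the entire consumer side could be as small as this sketch - the 
file path, the revents bits, and reinit_rng() are all illustrative, 
not what any final interface necessarily defines:

    #include <fcntl.h>
    #include <poll.h>
    #include <sys/random.h>

    int fd = open("/proc/sys/kernel/random/vm_fork", O_RDONLY);
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLPRI };

    for (;;) {
            unsigned char seed[32];

            poll(&pfd, 1, -1);                /* blocks until VM fork */
            getrandom(seed, sizeof(seed), 0); /* fresh post-fork value */
            reinit_rng(seed, sizeof(seed));   /* application-specific */
    }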

> I'm also not a super big fan of putting all that logic into systemd. It
> means applications need to create their own notification mechanisms to
> pass that cloning notification into actual processes. Don't we have any
> mechanism that applications and libraries could use to natively get an
> event when the GenID changes?

Yes. poll() can do this. For the purposes of discussion, I've posted
an implementation of this idea here:
https://lore.kernel.org/lkml/20220309215907.77526-1-Jason@zx2c4.com/

What I'm sort of leaning toward is doing something like that patch,
and then later if vmgenid ever grows an additional word-sized counter,
moving to explore the race-free approach. Given the amount of
programming required to actually implement the race-free approach
(transactions and careful study of each case), the poll() file
approach might be a medium-grade compromise for the time being.
Evidently that's what Microsoft decided too.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-09 22:02     ` Jason A. Donenfeld
  (?)
@ 2022-03-10 11:18     ` Alexander Graf
  2022-03-20 22:53         ` Michael S. Tsirkin
  2022-04-19 15:12         ` Jason A. Donenfeld
  -1 siblings, 2 replies; 61+ messages in thread
From: Alexander Graf @ 2022-03-10 11:18 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Michael S. Tsirkin,
	Rafael J. Wysocki, Brown, Len, Pavel Machek, Linux PM,
	Colm MacCarthaigh, Theodore Ts'o, Arnd Bergmann


On 09.03.22 23:02, Jason A. Donenfeld wrote:
> Hi Alex,
>
> On Wed, Mar 9, 2022 at 3:10 AM Alexander Graf <graf@amazon.com> wrote:
>>> The vmgenid driver basically works, though it is racy, because that ACPI
>>> notification can arrive after the system is already running again. This
>>
>> I believe enough people already pointed out that this assumption is
>> incorrect. The thing that is racy about VMGenID is the interrupt based
>> notification.
> I'm having a hard time figuring out what's different between your
> statement and mine. I said that the race is due to the notification.
> You said that the race is due to the notification. What subtle thing
> am I missing here that would lead you to say that my assumption is
> incorrect? Or did you just misread?


The subtle difference is that you don't need to rely on the notification 
to learn about the world switch. If you instead read VMGenID explicitly 
without waiting for the notification, you're guaranteed to always know 
whether you were cloned. That means the actual VMGenID interface is not 
always racy. Just the notification part is.

So you *can* build a race-free VMGenID mechanism. You just can't build a 
race-free *event-based* VMGenID mechanism if you want to allow cloning 
at arbitrary points in time.


>
>> The actual identifier is updated before the VM resumes
>> from its clone operation, so if you match on that you will know whether
>> you are in a new or old world. And that is enough to create
>> transactions: Save the identifier before a "crypto transaction",
>> validate before you finish, if they don't match, abort, reseed and replay.
> Right. But more than just transactions, it's useful for preventing key
> reuse vulnerabilities, in which case, you store the current identifier
> just before an ephemeral key is generated, and then subsequently check
> to see that the identifier hasn't changed before transmitting anything
> related to that key.
>
>> If you follow the logic at the beginning of the mail, you can create
>> something race-free if you consume the hardware VMGenID counter. You
>> cannot make it race-free if you rely on the interrupt mechanism.
> Yes, as mentioned and discussed in depth before. However, your use of
> the word "counter" is problematic. Vmgenid is not a counter. It's a
> unique identifier. That means you can't compare it with a single word
> comparison but have to compare all of the 16 bytes. That seems
> potentially expensive. It's for that reason that I suggested
> augmenting the vmgenid spec with an additional word-sized _counter_
> that could be mapped into the kernels and into userspaces.


I think Michael's benchmarks show quite well that it's not all that 
expensive. We're talking two 64-bit compares within the same cache line. 
You already realized that we don't need them to be atomic - just 
properly barriered.
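
Something like this hypothetical helper over the mapped ID - the two 
loads don't need to be atomic as a pair, because a torn read can only 
happen if the ID changed, in which case it won't match anyway:

    /* Kernel-context sketch: 'map' is the memremap()'d 16-byte
     * VMGenID viewed as two words; 'saved' is the copy taken at the
     * start of the transaction. */
    static bool vmgenid_changed(const u64 *map, const u64 *saved)
    {
            u64 lo = READ_ONCE(map[0]);
            u64 hi = READ_ONCE(map[1]);

            return lo != saved[0] || hi != saved[1];
    }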

So what if we created a vsyscall that takes a buffer of "up to 16 
bytes". If we later realize that an additional page with a 4 byte 
counter is a viable performance optimization, we can work with MS to add 
that to the spec. But the user space interface would stay identical.


>
>> So following that train of thought, if you expose the hardware VMGenID
>> to user space, you could allow user space to act race-free based on
>> VMGenID. That means consumers of user space RNGs could validate whether
>> the ID is identical between the beginning of the crypto operation and
>> the end.
> Right.
>
>> However, there are more complicated cases as well. What do you do with
>> Samba for example? It needs to generate a new SID after the clone.
>> That's a super heavy operation. Do you want to have smbd constantly poll
>> on the VMGenID just to see whether it needs to kick off some
>> administrative actions?
> Were it a single word-sized integer, mapped into memory, that wouldn't
> be much of a problem at all. It could constantly read this before and
> after every operation. The problem is that it's 16 bytes and
> understandably applications don't want to deal with that clunkiness.


I don't think applications should be in the business of mapping 
arbitrary locations of the vdso space to match on them. We need to build 
an interface that is fast and flexible enough so they can just say "here 
is a buffer, tell me if the ID changed". That's practically all you need 
- on init you run that once. Later on, you invoke it every time between 
finishing a crypto operation and putting it on the wire.
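
As a sketch - the symbol and its semantics are entirely made up, 
nothing like it exists in the vDSO today:

    /* Hypothetical vDSO call: copies the current ID into 'buf' (up to
     * 'len' bytes) and returns whether it differed from what the
     * buffer already contained. */
    bool __vdso_vmgenid_check(void *buf, size_t len);

    unsigned char id[16];

    __vdso_vmgenid_check(id, sizeof(id)); /* init: prime the buffer */

    /* ... later, once per operation: */
    do_crypto_operation();
    if (__vdso_vmgenid_check(id, sizeof(id)))
            abort_and_rekey(); /* world changed; don't put it on the wire */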


>
>> In that case, all we would need from the kernel is an easily readable
>> GenID that changes
> Actually, no, you need even less than that. All that's required is a
> sysfs/procfs file that can be poll()'d on. It doesn't need to have any
> content. When poll() returns readable, the VM has been forked. Then
> userspace RNGs and other things like that can call getrandom() to
> receive a fresh value to mix into whatever their operation is. Since
> all we're talking about here is _event notification_, all we need is
> that event, which is what poll() provides.
>
>> I'm also not a super big fan of putting all that logic into systemd. It
>> means applications need to create their own notification mechanisms to
>> pass that cloning notification into actual processes. Don't we have any
>> mechanism that applications and libraries could use to natively get an
>> event when the GenID changes?
> Yes. poll() can do this. For the purposes of discussion, I've posted
> an implementation of this idea here:
> https://lore.kernel.org/lkml/20220309215907.77526-1-Jason@zx2c4.com/
>
> What I'm sort of leaning toward is doing something like that patch,
> and then later if vmgenid ever grows an additional word-sized counter,
> moving to explore the race-free approach. Given the amount of
> programming required to actually implement the race-free approach
> (transactions and careful study of each case), the poll() file
> approach might be a medium-grade compromise for the time being.
> Evidently that's what Microsoft decided too.


I agree on the slightly racy compromise and that it's a step in the 
right direction. Doing this is a no-brainer IMHO and I like the 
proc-based poll approach.

I have an additional problem you might have an idea for with the poll 
based path. In addition to the clone notification, I'd need to know at 
which point everyone who was listening to a clone notification is 
finished acting upon it. If I spawn a tiny VM to do "work", I want to know 
when it's safe to hand requests into it. How do I find out when that 
point in time is?

As far as the race-free approach goes, I wouldn't get hung up on a 
4-byte vs a 16-byte UUID to match against. Outside of FUD that this 
might potentially have a performance impact (4-byte reads will have 
impact 
too!), there's nothing that would keep us from implementing that 
interface in addition to the poll today.

I'm happy to see all of these things evolve incrementally though. We can 
start with the poll interface and then later implement a vsyscall that 
allows transactions in hot paths.


Alex







^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: propagating vmgenid outward and upward
  2022-03-10 11:18     ` Alexander Graf
@ 2022-03-20 22:53         ` Michael S. Tsirkin
  2022-04-19 15:12         ` Jason A. Donenfeld
  1 sibling, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-03-20 22:53 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Jason A. Donenfeld, LKML, KVM list, QEMU Developers,
	linux-hyperv, Linux Crypto Mailing List, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Thu, Mar 10, 2022 at 12:18:04PM +0100, Alexander Graf wrote:
> I agree on the slightly racy compromise

Thought hard about this, I think I agree, and I guess as a minimum we
can start with at least the ACPI+RNG patch, right? That will already
address WireGuard ...

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-03-10 11:18     ` Alexander Graf
@ 2022-04-19 15:12         ` Jason A. Donenfeld
  2022-04-19 15:12         ` Jason A. Donenfeld
  1 sibling, 0 replies; 61+ messages in thread
From: Jason A. Donenfeld @ 2022-04-19 15:12 UTC (permalink / raw)
  To: Alexander Graf
  Cc: LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Michael S. Tsirkin,
	Rafael J. Wysocki, Brown, Len, Pavel Machek, Linux PM,
	Colm MacCarthaigh, Theodore Ts'o, Arnd Bergmann

Hey Alex,

On Thu, Mar 10, 2022 at 12:18 PM Alexander Graf <graf@amazon.com> wrote:
> I agree on the slightly racy compromise and that it's a step in the
> right direction. Doing this is a no-brainer IMHO and I like the
> proc-based poll approach.

Alright. I'm going to email a more serious patch for that in the next
few hours and you can have a look. Let's do that for 5.19.

> I have an additional problem you might have an idea for with the poll
> based path. In addition to the clone notification, I'd need to know at
> which point everyone who was listening to a clone notification is
> finished acting upon it. If I spawn a tiny VM to do "work", I want to know
> when it's safe to hand requests into it. How do I find out when that
> point in time is?

Seems tricky to solve. Even a count of current waiters and a
generation number won't be sufficient, since it wouldn't take into
account users who haven't _yet_ gotten to waiting. But maybe it's not
the right problem to solve? Or somehow not necessary? For example, if
the problem is a bit more constrained a solution becomes easier: you
have a fixed/known set of readers that you know about, and you
guarantee that they're all waiting before the fork. Then after the
fork, they all do something to alert you in their poll()er, and you
count up how many alerts you get until it matches the number of
expected waiters. Would that work? It seems like anything more general
than that is just butting heads with the racy compromise we're already
making.
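
As a sketch of that counting scheme - NREADERS and the alert pipe are 
assumptions about how the manager and its workers would be wired up:

    #include <unistd.h>

    #define NREADERS 4 /* fixed, known set of listeners */

    /* Each worker, from its poll() handler, reseeds and then writes
     * one byte to the shared pipe; the manager waits for all of them. */
    static void wait_until_unquiesced(int alert_rfd)
    {
            char c;
            int acked = 0;

            while (acked < NREADERS)
                    if (read(alert_rfd, &c, 1) == 1)
                            acked++;

            /* Every known listener has acted on the fork; it's now
             * safe to hand requests into the VM. */
    }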

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread


* Re: propagating vmgenid outward and upward
  2022-04-19 15:12         ` Jason A. Donenfeld
@ 2022-04-19 16:43           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 61+ messages in thread
From: Michael S. Tsirkin @ 2022-04-19 16:43 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Alexander Graf, LKML, KVM list, QEMU Developers, linux-hyperv,
	Linux Crypto Mailing List, Michael Kelley (LINUX),
	Greg Kroah-Hartman, adrian, Laszlo Ersek, Daniel P. Berrangé,
	Dominik Brodowski, Jann Horn, Rafael J. Wysocki, Brown, Len,
	Pavel Machek, Linux PM, Colm MacCarthaigh, Theodore Ts'o,
	Arnd Bergmann

On Tue, Apr 19, 2022 at 05:12:36PM +0200, Jason A. Donenfeld wrote:
> Hey Alex,
> 
> On Thu, Mar 10, 2022 at 12:18 PM Alexander Graf <graf@amazon.com> wrote:
> > I agree on the slightly racy compromise and that it's a step in the
> > right direction. Doing this is a no-brainer IMHO and I like the
> > proc-based poll approach.
> 
> Alright. I'm going to email a more serious patch for that in the next
> few hours and you can have a look. Let's do that for 5.19.
> 
> > I have an additional problem you might have an idea for with the poll
> > based path. In addition to the clone notification, I'd need to know at
> > which point everyone who was listening to a clone notification is
> > finished acting upon it. If I spawn a tiny VM to do "work", I want to know
> > when it's safe to hand requests into it. How do I find out when that
> > point in time is?
> 
> Seems tricky to solve. Even a count of current waiters and a
> generation number won't be sufficient, since it wouldn't take into
> account users who haven't _yet_ gotten to waiting. But maybe it's not
> the right problem to solve? Or somehow not necessary? For example, if
> the problem is a bit more constrained a solution becomes easier: you
> have a fixed/known set of readers that you know about, and you
> guarantee that they're all waiting before the fork. Then after the
> fork, they all do something to alert you in their poll()er, and you
> count up how many alerts you get until it matches the number of
> expected waiters. Would that work? It seems like anything more general
> than that is just butting heads with the racy compromise we're already
> making.
> 
> Jason

I have some ideas here ... but can you explain the use-case a bit more?

-- 
MST


^ permalink raw reply	[flat|nested] 61+ messages in thread


end of thread

Thread overview: 61+ messages
2022-03-01 15:42 propagating vmgenid outward and upward Jason A. Donenfeld
2022-03-01 16:15 ` Laszlo Ersek
2022-03-01 16:28   ` Jason A. Donenfeld
2022-03-01 16:28     ` Jason A. Donenfeld
2022-03-01 17:17     ` Michael S. Tsirkin
2022-03-01 17:17       ` Michael S. Tsirkin
2022-03-01 18:37       ` Jason A. Donenfeld
2022-03-01 18:37         ` Jason A. Donenfeld
2022-03-02  7:42         ` Michael S. Tsirkin
2022-03-02  7:42           ` Michael S. Tsirkin
2022-03-02  7:48           ` Michael S. Tsirkin
2022-03-02  7:48             ` Michael S. Tsirkin
2022-03-02  8:30         ` Michael S. Tsirkin
2022-03-02  8:30           ` Michael S. Tsirkin
2022-03-02 11:26           ` Jason A. Donenfeld
2022-03-02 11:26             ` Jason A. Donenfeld
2022-03-02 12:58             ` Michael S. Tsirkin
2022-03-02 12:58               ` Michael S. Tsirkin
2022-03-02 13:55               ` Jason A. Donenfeld
2022-03-02 13:55                 ` Jason A. Donenfeld
2022-03-02 14:46                 ` Michael S. Tsirkin
2022-03-02 14:46                   ` Michael S. Tsirkin
2022-03-02 15:14                   ` Jason A. Donenfeld
2022-03-02 15:14                     ` Jason A. Donenfeld
2022-03-02 15:20                     ` Michael S. Tsirkin
2022-03-02 15:20                       ` Michael S. Tsirkin
2022-03-02 15:36                       ` Jason A. Donenfeld
2022-03-02 15:36                         ` Jason A. Donenfeld
2022-03-02 16:22                         ` Michael S. Tsirkin
2022-03-02 16:22                           ` Michael S. Tsirkin
2022-03-02 16:32                           ` Jason A. Donenfeld
2022-03-02 16:32                             ` Jason A. Donenfeld
2022-03-02 17:27                             ` Michael S. Tsirkin
2022-03-02 17:27                               ` Michael S. Tsirkin
2022-03-03 13:07                             ` Michael S. Tsirkin
2022-03-03 13:07                               ` Michael S. Tsirkin
2022-03-02 16:29                         ` Michael S. Tsirkin
2022-03-02 16:29                           ` Michael S. Tsirkin
2022-03-01 16:21 ` Michael S. Tsirkin
2022-03-01 16:21   ` Michael S. Tsirkin
2022-03-01 16:35   ` Jason A. Donenfeld
2022-03-01 16:35     ` Jason A. Donenfeld
2022-03-01 18:01 ` Greg KH
2022-03-01 18:01   ` Greg KH
2022-03-01 18:24   ` Jason A. Donenfeld
2022-03-01 18:24     ` Jason A. Donenfeld
2022-03-01 19:41     ` Greg KH
2022-03-01 19:41       ` Greg KH
2022-03-01 23:12       ` Jason A. Donenfeld
2022-03-01 23:12         ` Jason A. Donenfeld
2022-03-02 14:35 ` Jason A. Donenfeld
2022-03-09 10:10 ` Alexander Graf
2022-03-09 22:02   ` Jason A. Donenfeld
2022-03-09 22:02     ` Jason A. Donenfeld
2022-03-10 11:18     ` Alexander Graf
2022-03-20 22:53       ` Michael S. Tsirkin
2022-03-20 22:53         ` Michael S. Tsirkin
2022-04-19 15:12       ` Jason A. Donenfeld
2022-04-19 15:12         ` Jason A. Donenfeld
2022-04-19 16:43         ` Michael S. Tsirkin
2022-04-19 16:43           ` Michael S. Tsirkin
