* Timer Signals vs KVM
@ 2024-03-27 13:46 Julian Stecklina
  2024-04-01 22:22 ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: Julian Stecklina @ 2024-03-27 13:46 UTC (permalink / raw)
  To: kvm; +Cc: Thomas Prescher

Hey everyone,

we are developing the KVM backend for VirtualBox [0] and wanted to reach out
regarding some weird behavior.

We are using `timer_create` to deliver timer events to vCPU threads as signals.
We mask the signal using pthread_sigmask in the host vCPU thread and unmask it
for guest execution using KVM_SET_SIGNAL_MASK.
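
Roughly, the setup looks like this (a simplified sketch rather than our actual
code; SIG_VCPU_TIMER, vcpu_fd and the omitted error handling are placeholders):

#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <time.h>
#include <linux/kvm.h>

#define SIG_VCPU_TIMER (SIGRTMIN + 0)   /* illustrative signal choice */

static timer_t vcpu_timer_setup(int vcpu_fd)
{
    /* Keep the timer signal blocked while the vCPU thread runs host code. */
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIG_VCPU_TIMER);
    pthread_sigmask(SIG_BLOCK, &block, NULL);

    /* POSIX timer that raises SIG_VCPU_TIMER on expiry.  (A per-vCPU setup
       would direct the signal at the vCPU thread, e.g. via SIGEV_THREAD_ID;
       SIGEV_SIGNAL keeps the sketch short.) */
    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIG_VCPU_TIMER;
    timer_t timer;
    timer_create(CLOCK_MONOTONIC, &sev, &timer);

    /* Let KVM unblock the signal only while the guest executes, so an
       expiring timer kicks the vCPU out of KVM_RUN with -EINTR.  The sigset
       passed to KVM_SET_SIGNAL_MASK is the set of signals *blocked* during
       KVM_RUN; the kernel expects its own sigset size (8 bytes on x86-64). */
    sigset_t in_guest;
    pthread_sigmask(SIG_SETMASK, NULL, &in_guest);   /* current mask...    */
    sigdelset(&in_guest, SIG_VCPU_TIMER);            /* ...minus the timer */

    struct kvm_signal_mask *mask = malloc(sizeof(*mask) + sizeof(in_guest));
    mask->len = 8;
    memcpy(mask->sigset, &in_guest, sizeof(in_guest));
    ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, mask);
    free(mask);

    return timer;
}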

This method of handling timers works well and gives us very low latency as
opposed to using a separate thread that handles timers. As far as we can tell,
neither Qemu nor other VMMs use such a setup. We see two issues:

When we enable nested virtualization, we see what looks like corruption in the
nested guest. The guest trips over exceptions that shouldn't be there. We are
currently debugging this to find out details, but the setup is pretty painful
and it will take a bit. If we disable the timer signals, this issue goes away
(at the cost of broken VBox timers obviously...).  This is weird and has left us
wondering whether there might be something broken with signals in this
scenario, especially since none of the other VMMs uses this method.

The other issue is that we have a somewhat sad interaction with split-lock
detection, which I blogged about some time ago [1]. Long story short: When
you program timers <10ms into the future, you run the risk of making no further
progress when the guest triggers the split-lock punishment [2]. See the blog post
for details. I was wondering whether there is a better solution here than
disabling the split-lock detection or whether our approach here is fundamentally
broken.
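
To make the failure mode concrete, the vCPU loop is roughly the following
(again a sketch, not our actual code; handle_timers() and handle_exit() are
placeholders):

#include <errno.h>
/* plus the includes from the sketch above */

void handle_timers(void);               /* placeholder: VBox timer expiry    */
void handle_exit(struct kvm_run *run);  /* placeholder: normal exit handling */

static void vcpu_loop(int vcpu_fd, struct kvm_run *run, timer_t vcpu_timer)
{
    for (;;) {
        /* One-shot timer a short interval into the future, e.g. 1ms. */
        struct itimerspec its = {
            .it_value = { .tv_sec = 0, .tv_nsec = 1 * 1000 * 1000 },
        };
        timer_settime(vcpu_timer, 0, &its, NULL);

        int ret = ioctl(vcpu_fd, KVM_RUN, 0);
        if (ret < 0 && errno == EINTR) {
            /* The timer signal interrupted guest execution (or guest entry).
             *
             * The livelock: if the guest instruction triggers the split-lock
             * punishment, the kernel delays this thread for roughly the 10ms
             * mentioned above before re-entering the guest to complete the
             * instruction.  With a shorter timer the signal is pending again
             * by then, KVM_RUN returns -EINTR without the instruction ever
             * completing, and we spin here forever. */
            handle_timers();
            continue;
        }
        /* (real error handling elided) */
        handle_exit(run);
    }
}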

Looking forward to your thoughts. :)

Thanks!
Julian

[0] https://github.com/cyberus-technology/virtualbox-kvm
[1] https://x86.lol/generic/2023/11/07/split-lock.html
[2]
https://elixir.bootlin.com/linux/v6.9-rc1/source/arch/x86/kernel/cpu/intel.c#L1137


* Re: Timer Signals vs KVM
  2024-03-27 13:46 Timer Signals vs KVM Julian Stecklina
@ 2024-04-01 22:22 ` Sean Christopherson
  2024-04-16 12:44   ` Julian Stecklina
  0 siblings, 1 reply; 4+ messages in thread
From: Sean Christopherson @ 2024-04-01 22:22 UTC (permalink / raw)
  To: Julian Stecklina; +Cc: kvm, Thomas Prescher

On Wed, Mar 27, 2024, Julian Stecklina wrote:
> Hey everyone,
> 
> we are developing the KVM backend for VirtualBox [0] and wanted to reach out
> regarding some weird behavior.
> 
> We are using `timer_create` to deliver timer events to vCPU threads as signals.
> We mask the signal using pthread_sigmask in the host vCPU thread and unmask it
> for guest execution using KVM_SET_SIGNAL_MASK.

What exactly do you mean by "timer events"?  From the split-lock blog post, it
does NOT seem like you're emulating guest timer events.  Specifically, this

  Consider that we want to run a KVM vCPU on Linux, but we want it to
  unconditionally exit after 1ms regardless of what the guest does.

sounds like you're doing vCPU scheduling in userspace.  But the above

  as opposed to using a separate thread that handles timers

doesn't really mesh with that.

> This method of handling timers works well and gives us very low latency as
> opposed to using a separate thread that handles timers. As far as we can tell,
> neither Qemu nor other VMMs use such a setup. We see two issues:
> 
> When we enable nested virtualization, we see what looks like corruption in the
> nested guest. The guest trips over exceptions that shouldn't be there. We are
> currently debugging this to find out details, but the setup is pretty painful
> and it will take a bit. If we disable the timer signals, this issue goes away
> (at the cost of broken VBox timers obviously...).  This is weird and has left us
> wondering whether there might be something broken with signals in this
> scenario, especially since none of the other VMMs uses this method.

It's certainly possible there's a kernel bug, but it's probably more likely a
problem in your userspace.  QEMU (and other VMMs) do use signals to interrupt
vCPUs, e.g. to take control for live migration.  That's obviously different than
what you're doing, and will have orders of magnitude lower volume of signals in
nested guests, but the effective coverage isn't "zero".

> The other issue is that we have a somewhat sad interaction with split-lock

LOL, I think the "sad" part is redundant.  I've yet to have any interaction with
split-lock detection that wasn't sad. :-)

> detection, which I blogged about some time ago [1]. Long story short: When
> you program timers <10ms into the future, you run the risk of making no further
> progress when the guest triggers the split-lock punishment [2]. See the blog post
> for details. I was wondering whether there is a better solution here than
> disabling the split-lock detection or whether our approach here is fundamentally
> broken.

I'm pretty sure disabling split-lock is just whacking one mole; there will be many
more lurking.  AIUI, timer_create() provides a per-process timer, i.e. a timer
which counts even if a task (i.e. a vCPU) is scheduled out.  The split-lock issue
is the most blatant problem because it's (a) 100% deterministic and (b) tied to
guest code.  But any other paths that might_sleep() are going to be problematic,
albeit far less likely to completely block forward progress.

I don't really see a sane way around that, short of actually having a userspace
component that knows how long a task/vCPU has actually run.
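
E.g., purely to illustrate that kind of timekeeping (not as a drop-in
replacement, since guest timer emulation generally wants wall-clock time), a
POSIX timer can be driven by the thread's CPU clock, which only advances while
the thread is actually running:

/* build: cc demo.c -lrt   (older glibc needs -lrt for timer_create) */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t fired;

static void on_timer(int sig) { (void)sig; fired = 1; }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_timer;
    sigaction(SIGALRM, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGALRM;

    timer_t t;
    timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &t);  /* thread CPU time, not wall time */

    struct itimerspec its = { .it_value = { .tv_nsec = 100 * 1000 * 1000 } };  /* 100ms */
    timer_settime(t, 0, &its, NULL);

    sleep(1);                 /* thread consumes no CPU time: timer stays silent */
    printf("after sleep: fired=%d\n", fired);

    while (!fired)            /* burn CPU until 100ms of runtime accrue */
        ;
    printf("after spin:  fired=%d\n", fired);
    return 0;
}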


* Re: Timer Signals vs KVM
  2024-04-01 22:22 ` Sean Christopherson
@ 2024-04-16 12:44   ` Julian Stecklina
  2024-04-16 12:53     ` Julian Stecklina
  0 siblings, 1 reply; 4+ messages in thread
From: Julian Stecklina @ 2024-04-16 12:44 UTC (permalink / raw)
  To: seanjc; +Cc: kvm, Thomas Prescher

On Mon, 2024-04-01 at 15:22 -0700, Sean Christopherson wrote:
> On Wed, Mar 27, 2024, Julian Stecklina wrote:
> 
> > 
> > When we enable nested virtualization, we see what looks like corruption in
> > the nested guest. The guest trips over exceptions that shouldn't be there.
> > We are currently debugging this to find out details, but the setup is
> > pretty painful and it will take a bit. If we disable the timer signals,
> > this issue goes away (at the cost of broken VBox timers obviously...).
> > This is weird and has left us wondering whether there might be something
> > broken with signals in this scenario, especially since none of the other
> > VMMs uses this method.
> 
> It's certainly possible there's a kernel bug, but it's probably more likely a
> problem in your userspace.  QEMU (and other VMMs) do use signals to interrupt
> vCPUs, e.g. to take control for live migration.  That's obviously different
> than what you're doing, and will have orders of magnitude lower volume of
> signals in nested guests, but the effective coverage isn't "zero".

After some weeks of bug hunting, my colleague Thomas has found the issue and we
posted a patch:

https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#t

Given the complexity of the nesting code, we're not entirely sure whether this
is the best way of fixing this, though.

But with this patch we can run uXen (as used by HP Sure Click aka Bromium)
inside of VirtualBox. It also fixes the other nesting problems we saw with
VBox/KVM!

The reason why this triggers in VirtualBox and not in Qemu is that there are
cases where VirtualBox marks CR4 dirty even though it hasn't changed.

Thanks,

Julian


* Re: Timer Signals vs KVM
  2024-04-16 12:44   ` Julian Stecklina
@ 2024-04-16 12:53     ` Julian Stecklina
  0 siblings, 0 replies; 4+ messages in thread
From: Julian Stecklina @ 2024-04-16 12:53 UTC (permalink / raw)
  To: seanjc; +Cc: kvm, Thomas Prescher

On Tue, 2024-04-16 at 14:44 +0200, Julian Stecklina wrote:
> On Mon, 2024-04-01 at 15:22 -0700, Sean Christopherson wrote:
> > On Wed, Mar 27, 2024, Julian Stecklina wrote:
> > 
> > > 
> > > When we enable nested virtualization, we see what looks like corruption
> > > in the nested guest. The guest trips over exceptions that shouldn't be
> > > there. We are currently debugging this to find out details, but the setup
> > > is pretty painful and it will take a bit. If we disable the timer signals,
> > > this issue goes away (at the cost of broken VBox timers obviously...).
> > > This is weird and has left us wondering whether there might be something
> > > broken with signals in this scenario, especially since none of the other
> > > VMMs uses this method.
> > 
> > It's certainly possible there's a kernel bug, but it's probably more likely
> > a problem in your userspace.  QEMU (and other VMMs) do use signals to
> > interrupt vCPUs, e.g. to take control for live migration.  That's obviously
> > different than what you're doing, and will have orders of magnitude lower
> > volume of signals in nested guests, but the effective coverage isn't "zero".
> 
> After some weeks of bug hunting, my colleague Thomas has found the issue and
> we posted a patch:
> 
> https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#t

It's this patch specifically:
https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m2eebd2ab30a86622aea3732112150851ac0768fe

Thanks,
Julian

