From: Mark Rutland <mark.rutland@arm.com>
To: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: maz <maz@kernel.org>, Will Deacon <will@kernel.org>,
	paulmck <paulmck@kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	rcu <rcu@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
	frederic <frederic@kernel.org>,
	kvmarm@lists.cs.columbia.edu,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: Possible nohz-full/RCU issue in arm64 KVM
Date: Fri, 17 Dec 2021 14:38:14 +0000
Message-ID: <Ybyg1r/Q6EfeuXGV@FVFF77S0Q05N>
In-Reply-To: <70f112072d9496d21901946ea82832d3ed3a8cb2.camel@redhat.com>

On Fri, Dec 17, 2021 at 03:15:29PM +0100, Nicolas Saenz Julienne wrote:
> On Fri, 2021-12-17 at 13:21 +0000, Mark Rutland wrote:
> > On Fri, Dec 17, 2021 at 12:51:57PM +0100, Nicolas Saenz Julienne wrote:
> > > Hi All,
> > 
> > Hi,
> > 
> > > arm64's guest entry code does the following:
> > > 
> > > int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > > {
> > > 	[...]
> > > 
> > > 	guest_enter_irqoff();
> > > 
> > > 	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> > > 
> > > 	[...]
> > > 
> > > 	local_irq_enable();
> > > 
> > > 	/*
> > > 	 * We do local_irq_enable() before calling guest_exit() so
> > > 	 * that if a timer interrupt hits while running the guest we
> > > 	 * account that tick as being spent in the guest.  We enable
> > > 	 * preemption after calling guest_exit() so that if we get
> > > 	 * preempted we make sure ticks after that is not counted as
> > > 	 * guest time.
> > > 	 */
> > > 	guest_exit();
> > > 	[...]
> > > }
> > > 
> > > 
> > > On a nohz-full CPU, guest_{enter,exit}() delimit an RCU extended quiescent
> > > state (EQS). Any interrupt happening between local_irq_enable() and
> > > guest_exit() should disable that EQS. Now, AFAICT all el0 interrupt handlers
> > > do the right thing if triggered in this context, but el1's won't. Is it
> > > possible to hit an el1 handler (for example __el1_irq()) there?
> > 
> > I think you're right that the EL1 handlers can trigger here and won't exit the
> > EQS.
> > 
> > I'm not immediately sure what we *should* do here. What does x86 do for an IRQ
> > taken from a guest mode? I couldn't spot any handling of that case, but I'm not
> > familiar enough with the x86 exception model to know if I'm looking in the
> > right place.
> 
> Well x86 has its own private KVM guest context exit function
> 'kvm_guest_exit_irqoff()', which allows it to do the right thing (simplifying
> things):
> 
> 	local_irq_disable();
> 	kvm_guest_enter_irqoff() // Inform CT, enter EQS
> 	__vmx_kvm_run()
> 	kvm_guest_exit_irqoff() // Inform CT, exit EQS, task still marked with PF_VCPU
> 
> 	/*
> 	 * Consume any pending interrupts, including the possible source of
> 	 * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> 	 * An instruction is required after local_irq_enable() to fully unblock
> 	 * interrupts on processors that implement an interrupt shadow, the
> 	 * stat.exits increment will do nicely.
> 	 */
> 	local_irq_enable();
> 	++vcpu->stat.exits;
> 	local_irq_disable();
> 
> 	/*
> 	 * Wait until after servicing IRQs to account guest time so that any
> 	 * ticks that occurred while running the guest are properly accounted
> 	 * to the guest.  Waiting until IRQs are enabled degrades the accuracy
> 	 * of accounting via context tracking, but the loss of accuracy is
> 	 * acceptable for all known use cases.
> 	 */
> 	vtime_account_guest_exit(); // current->flags &= ~PF_VCPU

I see.

The abstraction's really messy here on x86, and the enter/exit sides aren't
clearly balanced.

For example kvm_guest_enter_irqoff() calls guest_enter_irqoff() which calls
vtime_account_guest_enter(), but kvm_guest_exit_irqoff() doesn't call
guest_exit_irqoff() and the call to vtime_account_guest_exit() is open-coded
elsewhere. Also, guest_enter_irqoff() conditionally calls
rcu_virt_note_context_switch(), but I can't immediately spot anything on the
exit side that corresponds with that, which looks suspicious.

> So I guess we should convert to x86's scheme, and maybe create another generic
> guest_{enter,exit}() flavor for virtualization schemes that run with interrupts
> disabled.

I think we might need to do some preparatory refactoring here so that this is
all clearly balanced even on x86, e.g. splitting the enter/exit steps into
multiple phases.
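
To make the ordering concrete, here is a rough userspace toy model of what
splitting into phases could look like. It is only a sketch under stated
assumptions: struct cpu_model and every model_*() / run_*() name below are
hypothetical and are not kernel APIs. They stand in for the timing (PF_VCPU)
half and the context-tracking/RCU half of guest entry/exit, and a single
pending IRQ stands in for a tick that hit while the guest was running.

```c
/* Userspace toy model only; every name here is hypothetical, not a kernel API. */
#include <assert.h>
#include <stdbool.h>

struct cpu_model {
	bool irqs_enabled;
	bool irq_pending;
	bool rcu_watching;	/* false while in an EQS */
	bool vcpu_flag;		/* stands in for PF_VCPU */
	int irqs_in_eqs;	/* IRQs handled while RCU was not watching */
	int guest_ticks;	/* IRQs (ticks) accounted to the guest */
};

/* Enable IRQs and handle a pending one, recording the state it ran in. */
static void model_irq_enable(struct cpu_model *c)
{
	c->irqs_enabled = true;
	if (!c->irq_pending)
		return;
	c->irq_pending = false;
	if (!c->rcu_watching)
		c->irqs_in_eqs++;
	if (c->vcpu_flag)
		c->guest_ticks++;
}

static void model_irq_disable(struct cpu_model *c)
{
	c->irqs_enabled = false;
}

/* Each enter phase has exactly one matching exit phase. */
static void model_timing_enter(struct cpu_model *c)
{
	assert(!c->vcpu_flag);
	c->vcpu_flag = true;
}

static void model_timing_exit(struct cpu_model *c)
{
	assert(c->vcpu_flag);
	c->vcpu_flag = false;
}

static void model_state_enter(struct cpu_model *c)
{
	assert(c->rcu_watching);
	c->rcu_watching = false;	/* enter EQS */
}

static void model_state_exit(struct cpu_model *c)
{
	assert(!c->rcu_watching);
	c->rcu_watching = true;		/* exit EQS */
}

/* From the host's point of view IRQs are masked; a tick is left pending. */
static void model_run_guest(struct cpu_model *c)
{
	assert(!c->irqs_enabled && !c->rcu_watching);
	c->irq_pending = true;
}

/* Ordering as in the quoted arm64 code: IRQs enabled before leaving the EQS. */
struct cpu_model run_unsplit_sequence(void)
{
	struct cpu_model c = { .rcu_watching = true };

	model_timing_enter(&c);
	model_state_enter(&c);
	model_run_guest(&c);
	model_irq_enable(&c);	/* IRQ handled while RCU is not watching */
	model_state_exit(&c);
	model_timing_exit(&c);
	return c;
}

/* Phased ordering: leave the EQS, take pending IRQs, then stop guest timing. */
struct cpu_model run_split_sequence(void)
{
	struct cpu_model c = { .rcu_watching = true };

	model_timing_enter(&c);
	model_state_enter(&c);
	model_run_guest(&c);
	model_state_exit(&c);
	model_irq_enable(&c);	/* IRQ handled with RCU watching, still guest time */
	model_irq_disable(&c);
	model_timing_exit(&c);
	return c;
}
```

In this model both sequences account the tick to the guest, but only the
unsplit one handles the IRQ while RCU is not watching. The assertions in the
enter/exit helpers are there to show what "clearly balanced" would mean: each
phase can check that its counterpart ran.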

> > Note that the EL0 handlers *cannot* trigger for an exception taken from a
> > guest. We use separate vectors while running a guest (for both VHE and nVHE
> > modes), and from the main kernel's PoV we return from kvm_call_hyp_ret(). We
> > can only take an IRQ from EL1 *after* that returns.
> > 
> > We *might* need to audit the KVM vector handlers to make sure they're not
> > dependent on RCU protection (I assume they're not, but it's possible something
> > has leaked into the VHE code).
> 
> IIUC in the window between local_irq_enable() and guest_exit() any driver
> interrupt might trigger, isn't it?

Yes, via the EL1 interrupt vectors, which I assume we'll fix in one go above.

Here I was trying to point out that there's another potential issue if we
do anything in the context of the KVM exception vectors, as those can run C
code in a shallow exception context, and can either return back into the guest
OR return to the caller of kvm_call_hyp_ret(__kvm_vcpu_run, vcpu).

So even if we fix kvm_arch_vcpu_ioctl_run() we might need to also rework
handlers that run in that shallow exception context, if they rely on RCU for
something.

Thanks,
Mark.
