* [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
@ 2017-09-29 11:01 Boqun Feng
  2017-09-29 14:53 ` Paolo Bonzini
  2017-10-01  1:31 ` [PATCH v2] " Boqun Feng
  0 siblings, 2 replies; 14+ messages in thread
From: Boqun Feng @ 2017-09-29 11:01 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Boqun Feng, Paul E. McKenney, Peter Zijlstra, Paolo Bonzini,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

Sasha Levin reported a WARNING:

| WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
| rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
| WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
| rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
...
| CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
| 1.10.1-1ubuntu1 04/01/2014
| Call Trace:
...
| RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
| RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
| RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
| RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
| RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
| RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
| R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
| R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
| __schedule+0x201/0x2240 kernel/sched/core.c:3292
| schedule+0x113/0x460 kernel/sched/core.c:3421
| kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
| do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
| async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
| RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
| RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
| RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
| RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
| RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
| R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
| R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
| vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
| sprintf+0xbe/0xf0 lib/vsprintf.c:2386
| proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
| get_link fs/namei.c:1047 [inline]
| link_path_walk+0x1041/0x1490 fs/namei.c:2127
...

This happened when we hit a page fault in an RCU read-side critical
section and then tried to reschedule in kvm_async_pf_task_wait(). The
reschedule hit the WARN in rcu_preempt_note_context_switch(), which
treated it as a sleep in an RCU read-side critical section, something
that is not allowed (even in preemptible RCU).
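
For illustration, the problematic guest-side pattern is roughly the
following (just a sketch, not the code from the report; gp, p and
do_something() are made-up names):

	rcu_read_lock();
	p = rcu_dereference(gp);
	do_something(p->field);	/* if the host has paged this memory out,
				 * an async PF ("page not present") is
				 * injected right here */
	/*
	 * do_async_page_fault() -> kvm_async_pf_task_wait() -> schedule(),
	 * i.e. a sleep inside an RCU read-side critical section -> WARN
	 */
	rcu_read_unlock();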

To cure this, make kvm_async_pf_task_wait() go to the halt path if the
PF happens in an RCU read-side critical section.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/x86/kernel/kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index aa60a08b65b1..e675704fa6f7 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -140,7 +140,8 @@ void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
-	n.halted = is_idle_task(current) || preempt_count() > 1;
+	n.halted = is_idle_task(current) || preempt_count() > 1 ||
+		   rcu_preempt_depth();
 	init_swait_queue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	raw_spin_unlock(&b->lock);
-- 
2.14.1

* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-29 11:01 [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections Boqun Feng
@ 2017-09-29 14:53 ` Paolo Bonzini
  2017-09-29 16:43   ` Paul E. McKenney
  2017-10-01  1:31 ` [PATCH v2] " Boqun Feng
  1 sibling, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2017-09-29 14:53 UTC (permalink / raw)
  To: Boqun Feng, linux-kernel, kvm
  Cc: Paul E. McKenney, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On 29/09/2017 13:01, Boqun Feng wrote:
> Sasha Levin reported a WARNING:
> 
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> ...
> | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> | 1.10.1-1ubuntu1 04/01/2014
> | Call Trace:
> ...
> | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> | schedule+0x113/0x460 kernel/sched/core.c:3421
> | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> | get_link fs/namei.c:1047 [inline]
> | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> ...
> 
> And this happened when we hit a page fault in an RCU read-side critical
> section and then we tried to reschedule in kvm_async_pf_task_wait(),
> this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> and be treated as a sleep in RCU read-side critical section, which is
> not allowed(even in preemptible RCU).

Just a small fix to the commit message:

This happened when the host hit a page fault, and delivered it as an
async page fault, while the guest was in an RCU read-side critical
section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
but rcu_preempt_note_context_switch() would treat the reschedule as a
sleep in an RCU read-side critical section, which is not allowed (even in
preemptible RCU).  Thus the WARN.

Queued with that change, thanks.

Paolo

> To cure this, make kvm_async_pf_task_wait() go to the halt path if the
> PF happens in a RCU read-side critical section.
> 
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  arch/x86/kernel/kvm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index aa60a08b65b1..e675704fa6f7 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -140,7 +140,8 @@ void kvm_async_pf_task_wait(u32 token)
>  
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) || preempt_count() > 1;
> +	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> +		   rcu_preempt_depth();
>  	init_swait_queue_head(&n.wq);
>  	hlist_add_head(&n.link, &b->list);
>  	raw_spin_unlock(&b->lock);
> 

* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-29 14:53 ` Paolo Bonzini
@ 2017-09-29 16:43   ` Paul E. McKenney
  2017-09-29 23:41     ` Boqun Feng
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2017-09-29 16:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Boqun Feng, linux-kernel, kvm, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Fri, Sep 29, 2017 at 04:53:57PM +0200, Paolo Bonzini wrote:
> On 29/09/2017 13:01, Boqun Feng wrote:
> > Sasha Levin reported a WARNING:
> > 
> > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > ...
> > | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > | 1.10.1-1ubuntu1 04/01/2014
> > | Call Trace:
> > ...
> > | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> > | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> > | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> > | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> > | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> > | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> > | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> > | schedule+0x113/0x460 kernel/sched/core.c:3421
> > | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> > | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> > | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> > | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> > | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> > | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> > | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> > | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> > | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> > | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> > | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> > | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> > | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> > | get_link fs/namei.c:1047 [inline]
> > | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> > ...
> > 
> > And this happened when we hit a page fault in an RCU read-side critical
> > section and then we tried to reschedule in kvm_async_pf_task_wait(),
> > this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> > and be treated as a sleep in RCU read-side critical section, which is
> > not allowed(even in preemptible RCU).
> 
> Just a small fix to the commit message:
> 
> This happened when the host hit a page fault, and delivered it as in an
> async page fault, while the guest was in an RCU read-side critical
> section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> but rcu_preempt_note_context_switch() would treat the reschedule as a
> sleep in RCU read-side critical section, which is not allowed (even in
> preemptible RCU).  Thus the WARN.
> 
> Queued with that change, thanks.

Not to be repetitive, but if the schedule() is on the guest, this change
really does silently break up an RCU read-side critical section on
guests built with PREEMPT=n.  (Yes, they were already being broken,
but it would be good to avoid this breakage in PREEMPT=n as well as
in PREEMPT=y.)

							Thanx, Paul

> Paolo
> 
> > To cure this, make kvm_async_pf_task_wait() go to the halt path if the
> > PF happens in a RCU read-side critical section.
> > 
> > Reported-by: Sasha Levin <levinsasha928@gmail.com>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> >  arch/x86/kernel/kvm.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index aa60a08b65b1..e675704fa6f7 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -140,7 +140,8 @@ void kvm_async_pf_task_wait(u32 token)
> >  
> >  	n.token = token;
> >  	n.cpu = smp_processor_id();
> > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > +	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> > +		   rcu_preempt_depth();
> >  	init_swait_queue_head(&n.wq);
> >  	hlist_add_head(&n.link, &b->list);
> >  	raw_spin_unlock(&b->lock);
> > 
> 

* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-29 16:43   ` Paul E. McKenney
@ 2017-09-29 23:41     ` Boqun Feng
  2017-09-30 17:15       ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Boqun Feng @ 2017-09-29 23:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Paolo Bonzini, linux-kernel, kvm, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
> On Fri, Sep 29, 2017 at 04:53:57PM +0200, Paolo Bonzini wrote:
> > On 29/09/2017 13:01, Boqun Feng wrote:
> > > Sasha Levin reported a WARNING:
> > > 
> > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > ...
> > > | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > | 1.10.1-1ubuntu1 04/01/2014
> > > | Call Trace:
> > > ...
> > > | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> > > | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> > > | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> > > | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> > > | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> > > | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> > > | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> > > | schedule+0x113/0x460 kernel/sched/core.c:3421
> > > | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> > > | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> > > | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> > > | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> > > | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> > > | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> > > | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> > > | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> > > | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> > > | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> > > | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> > > | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> > > | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> > > | get_link fs/namei.c:1047 [inline]
> > > | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> > > ...
> > > 
> > > And this happened when we hit a page fault in an RCU read-side critical
> > > section and then we tried to reschedule in kvm_async_pf_task_wait(),
> > > this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> > > and be treated as a sleep in RCU read-side critical section, which is
> > > not allowed(even in preemptible RCU).
> > 
> > Just a small fix to the commit message:
> > 
> > This happened when the host hit a page fault, and delivered it as in an
> > async page fault, while the guest was in an RCU read-side critical
> > section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> > but rcu_preempt_note_context_switch() would treat the reschedule as a
> > sleep in RCU read-side critical section, which is not allowed (even in
> > preemptible RCU).  Thus the WARN.
> > 
> > Queued with that change, thanks.
> 
> Not to be repetitive, but if the schedule() is on the guest, this change
> really does silently break up an RCU read-side critical section on
> guests built with PREEMPT=n.  (Yes, they were already being broken,
> but it would be good to avoid this breakage in PREEMPT=n as well as
> in PREEMPT=y.)
> 

Then how about adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reasons we
choose the halt path? Like:

	n.halted = is_idle_task(current) || preempt_count() > 1 ||
		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();


But I think async PF could also happen while a user program is running?
Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
like:

	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));

and the halt condition becomes:

	n.halted = is_idle_task(current) || preempt_count() > 1 ||
		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();

Thoughts?


A side thing: this being already broken for PREEMPT=n means we may be
failing to detect it in rcutorture? Then should we add a config with
KVM_GUEST=y and run some memory-consuming things (e.g. stress --vm) in
the rcutorture kvm script simultaneously? Paolo, do you have any test
workload that could trigger async PF quickly?

Regards,
Boqun

> 							Thanx, Paul
> 
> > Paolo
> > 
> > > To cure this, make kvm_async_pf_task_wait() go to the halt path if the
> > > PF happens in a RCU read-side critical section.
> > > 
> > > Reported-by: Sasha Levin <levinsasha928@gmail.com>
> > > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > > ---
> > >  arch/x86/kernel/kvm.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index aa60a08b65b1..e675704fa6f7 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > > @@ -140,7 +140,8 @@ void kvm_async_pf_task_wait(u32 token)
> > >  
> > >  	n.token = token;
> > >  	n.cpu = smp_processor_id();
> > > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > > +	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> > > +		   rcu_preempt_depth();
> > >  	init_swait_queue_head(&n.wq);
> > >  	hlist_add_head(&n.link, &b->list);
> > >  	raw_spin_unlock(&b->lock);
> > > 
> > 
> 


* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-29 23:41     ` Boqun Feng
@ 2017-09-30 17:15       ` Paul E. McKenney
  2017-10-01  1:20         ` Boqun Feng
  2017-10-02 12:45         ` Paolo Bonzini
  0 siblings, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2017-09-30 17:15 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paolo Bonzini, linux-kernel, kvm, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Sat, Sep 30, 2017 at 07:41:56AM +0800, Boqun Feng wrote:
> On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
> > On Fri, Sep 29, 2017 at 04:53:57PM +0200, Paolo Bonzini wrote:
> > > On 29/09/2017 13:01, Boqun Feng wrote:
> > > > Sasha Levin reported a WARNING:
> > > > 
> > > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > > | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > > | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > > ...
> > > > | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> > > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > > | 1.10.1-1ubuntu1 04/01/2014
> > > > | Call Trace:
> > > > ...
> > > > | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > > | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > > | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> > > > | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> > > > | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> > > > | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> > > > | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> > > > | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> > > > | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> > > > | schedule+0x113/0x460 kernel/sched/core.c:3421
> > > > | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> > > > | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> > > > | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> > > > | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> > > > | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> > > > | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> > > > | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> > > > | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> > > > | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> > > > | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> > > > | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> > > > | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> > > > | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> > > > | get_link fs/namei.c:1047 [inline]
> > > > | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> > > > ...
> > > > 
> > > > And this happened when we hit a page fault in an RCU read-side critical
> > > > section and then we tried to reschedule in kvm_async_pf_task_wait(),
> > > > this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> > > > and be treated as a sleep in RCU read-side critical section, which is
> > > > not allowed(even in preemptible RCU).
> > > 
> > > Just a small fix to the commit message:
> > > 
> > > This happened when the host hit a page fault, and delivered it as in an
> > > async page fault, while the guest was in an RCU read-side critical
> > > section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> > > but rcu_preempt_note_context_switch() would treat the reschedule as a
> > > sleep in RCU read-side critical section, which is not allowed (even in
> > > preemptible RCU).  Thus the WARN.
> > > 
> > > Queued with that change, thanks.
> > 
> > Not to be repetitive, but if the schedule() is on the guest, this change
> > really does silently break up an RCU read-side critical section on
> > guests built with PREEMPT=n.  (Yes, they were already being broken,
> > but it would be good to avoid this breakage in PREEMPT=n as well as
> > in PREEMPT=y.)
> > 
> 
> Then probably adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reason we
> choose the halt path? Like:
> 
> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> 		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();
> 
> 
> But I think async PF could also happen while a user program is running?
> Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
> like:
> 
> 	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> 
> and the halt condition becomes:
> 
> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> 		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();
> 
> Thoughts?

This looks to me like it would cover it.  If we are !PREEMPT and the
interrupt came from the kernel, we halt, which would prevent the sleep.

I take it that we get unhalted when the host gets things patched up?

> A side thing is being broken already for PREEMPT=n means we maybe fail
> to detect this in rcutorture? Then should we add a config with
> KVM_GUEST=y and try to run some memory consuming things(e.g. stress
> --vm) in the rcutorture kvm script simultaneously? Paolo, do you have
> any test workload that could trigger async PF quickly?

I do not believe that I have seen this in rcutorture, but I always run in
a guest OS on a large-memory system (well, by my old-fashioned standards,
anyway) that would be quite unlikely to evict a guest OS's pages.  Plus
I tend to run on shared systems, and deliberately running them out of
memory would not be particularly friendly to others using those systems.

I -do- run background scripts that are intended to force the host OS to
preempt the guest OSes frequently, but I don't believe that this would
cause that bug.

But it seems like it would make more sense to add this sort of thing to
whatever KVM tests there are for host-side eviction of guest pages.

							Thanx, Paul

* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-30 17:15       ` Paul E. McKenney
@ 2017-10-01  1:20         ` Boqun Feng
  2017-10-02 12:45         ` Paolo Bonzini
  1 sibling, 0 replies; 14+ messages in thread
From: Boqun Feng @ 2017-10-01  1:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Paolo Bonzini, linux-kernel, kvm, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Sat, Sep 30, 2017 at 05:15:15PM +0000, Paul E. McKenney wrote:
> On Sat, Sep 30, 2017 at 07:41:56AM +0800, Boqun Feng wrote:
> > On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
> > > On Fri, Sep 29, 2017 at 04:53:57PM +0200, Paolo Bonzini wrote:
> > > > On 29/09/2017 13:01, Boqun Feng wrote:
> > > > > Sasha Levin reported a WARNING:
> > > > > 
> > > > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > > > | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > > > | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > > > ...
> > > > > | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> > > > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > > > | 1.10.1-1ubuntu1 04/01/2014
> > > > > | Call Trace:
> > > > > ...
> > > > > | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > > > | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > > > | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> > > > > | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> > > > > | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> > > > > | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> > > > > | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> > > > > | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> > > > > | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> > > > > | schedule+0x113/0x460 kernel/sched/core.c:3421
> > > > > | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> > > > > | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> > > > > | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> > > > > | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> > > > > | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> > > > > | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> > > > > | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> > > > > | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> > > > > | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> > > > > | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> > > > > | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> > > > > | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> > > > > | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> > > > > | get_link fs/namei.c:1047 [inline]
> > > > > | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> > > > > ...
> > > > > 
> > > > > And this happened when we hit a page fault in an RCU read-side critical
> > > > > section and then we tried to reschedule in kvm_async_pf_task_wait(),
> > > > > this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> > > > > and be treated as a sleep in RCU read-side critical section, which is
> > > > > not allowed(even in preemptible RCU).
> > > > 
> > > > Just a small fix to the commit message:
> > > > 
> > > > This happened when the host hit a page fault, and delivered it as in an
> > > > async page fault, while the guest was in an RCU read-side critical
> > > > section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> > > > but rcu_preempt_note_context_switch() would treat the reschedule as a
> > > > sleep in RCU read-side critical section, which is not allowed (even in
> > > > preemptible RCU).  Thus the WARN.
> > > > 
> > > > Queued with that change, thanks.
> > > 
> > > Not to be repetitive, but if the schedule() is on the guest, this change
> > > really does silently break up an RCU read-side critical section on
> > > guests built with PREEMPT=n.  (Yes, they were already being broken,
> > > but it would be good to avoid this breakage in PREEMPT=n as well as
> > > in PREEMPT=y.)
> > > 
> > 
> > Then probably adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reason we
> > choose the halt path? Like:
> > 
> > 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> > 		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();
> > 
> > 
> > But I think async PF could also happen while a user program is running?
> > Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
> > like:
> > 
> > 	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> > 
> > and the halt condition becomes:
> > 
> > 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> > 		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();
> > 
> > Thoughts?
> 
> This looks to me like it would cover it.  If !PREEMPT interrupt from
> kernel, we halt, which would prevent the sleep.
> 
> I take it that we get unhalted when the host gets things patched up?
> 

That's my understanding. Going to send out a v2 and see if I'm missing
something.

> > A side thing is being broken already for PREEMPT=n means we maybe fail
> > to detect this in rcutorture? Then should we add a config with
> > KVM_GUEST=y and try to run some memory consuming things(e.g. stress
> > --vm) in the rcutorture kvm script simultaneously? Paolo, do you have
> > any test workload that could trigger async PF quickly?
> 
> I do not believe that have seen this in rcutorture, but I always run in
> a guest OS on a large-memory system (well, by my old-fashioned standards,

;-)

> anyway) that would be quite unlikely to evict a guest OS's pages.  Plus
> I tend to run on shared systems, and deliberately running them out of
> memory would not be particularly friendly to others using those systems.
> 
> I -do- run background scripts that are intended to force the host OS to
> preempt the guest OSes frequently, but I don't believe that this would
> cause that bug.
> 
> But it seems like it would make more sense to add this sort of thing to
> whatever KVM tests there are for host-side eviction of guest pages.
> 

Great! I will see what I can do, but it's better to get some hints from
KVM folks on how to trigger async PF more frequently.

Regards,
Boqun

> 							Thanx, Paul
> 


* [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-29 11:01 [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections Boqun Feng
  2017-09-29 14:53 ` Paolo Bonzini
@ 2017-10-01  1:31 ` Boqun Feng
  2017-10-02 13:41   ` Paolo Bonzini
  1 sibling, 1 reply; 14+ messages in thread
From: Boqun Feng @ 2017-10-01  1:31 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Boqun Feng, Paul E. McKenney, Peter Zijlstra, Wanpeng Li,
	Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

Sasha Levin reported a WARNING:

| WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
| rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
| WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
| rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
...
| CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
| 1.10.1-1ubuntu1 04/01/2014
| Call Trace:
...
| RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
| RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
| RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
| RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
| RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
| RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
| R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
| R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
| __schedule+0x201/0x2240 kernel/sched/core.c:3292
| schedule+0x113/0x460 kernel/sched/core.c:3421
| kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
| do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
| async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
| RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
| RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
| RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
| RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
| RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
| R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
| R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
| vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
| sprintf+0xbe/0xf0 lib/vsprintf.c:2386
| proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
| get_link fs/namei.c:1047 [inline]
| link_path_walk+0x1041/0x1490 fs/namei.c:2127
...

This happened when the host hit a page fault, and delivered it as an
async page fault, while the guest was in an RCU read-side critical
section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
but rcu_preempt_note_context_switch() would treat the reschedule as a
sleep in an RCU read-side critical section, which is not allowed (even in
preemptible RCU).  Thus the WARN.

To cure this, we need to make kvm_async_pf_task_wait() go to the halt
path (instead of the schedule path) if the PF happens in an RCU read-side
critical section.

In a PREEMPT=y kernel this is simple, as we record the current RCU
read-side critical section nesting level in rcu_preempt_depth(). But in a
PREEMPT=n kernel rcu_read_lock/unlock() may be no-ops, so we don't know
whether we are in an RCU read-side critical section or not. We resolve
this by always choosing the halt path in a PREEMPT=n kernel unless the
guest gets the async PF while running in user mode.
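
For reference, a rough sketch of why a PREEMPT=n kernel carries no such
information (modelled on the non-preemptible rcu_read_lock()
implementation; the exact upstream code may differ):

	static inline void __rcu_read_lock(void)
	{
		/*
		 * With CONFIG_PREEMPT_COUNT=n this compiles down to
		 * nothing, so there is no nesting counter like
		 * rcu_preempt_depth() that the async PF path could test.
		 */
		if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
			preempt_disable();
	}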

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
[The explanation for async PF is contributed by Paolo Bonzini]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
v1 --> v2:

*	Add a more accurate explanation of async PF from Paolo in the
	commit message.

*	Extend kvm_async_pf_task_wait() to take a second parameter
	@user indicating whether the fault happened while a user program
	was running in the guest.

Wanpeng, the callsite of kvm_async_pf_task_wait() in
kvm_handle_page_fault() is for the nested scenario, right? I take it we
should handle it as if the fault happens when the L1 guest is running in
kernel mode, so @user should be 0, right?

 arch/x86/include/asm/kvm_para.h | 4 ++--
 arch/x86/kernel/kvm.c           | 9 ++++++---
 arch/x86/kvm/mmu.c              | 2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index bc62e7cbf1b1..0a5ae6bb128b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 void __init kvm_guest_init(void);
-void kvm_async_pf_task_wait(u32 token);
+void kvm_async_pf_task_wait(u32 token, int user);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
@@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
 
 #else /* CONFIG_KVM_GUEST */
 #define kvm_guest_init() do {} while (0)
-#define kvm_async_pf_task_wait(T) do {} while(0)
+#define kvm_async_pf_task_wait(T, U) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
 
 static inline bool kvm_para_available(void)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index aa60a08b65b1..916f519e54c9 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
 	return NULL;
 }
 
-void kvm_async_pf_task_wait(u32 token)
+void kvm_async_pf_task_wait(u32 token, int user)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
@@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
-	n.halted = is_idle_task(current) || preempt_count() > 1;
+	n.halted = is_idle_task(current) ||
+		   preempt_count() > 1 ||
+		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
+		   rcu_preempt_depth();
 	init_swait_queue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	raw_spin_unlock(&b->lock);
@@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
 		prev_state = exception_enter();
-		kvm_async_pf_task_wait((u32)read_cr2());
+		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
 		exception_exit(prev_state);
 		break;
 	case KVM_PV_REASON_PAGE_READY:
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eca30c1eb1d9..106d4a029a8a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		vcpu->arch.apf.host_apf_reason = 0;
 		local_irq_disable();
-		kvm_async_pf_task_wait(fault_address);
+		kvm_async_pf_task_wait(fault_address, 0);
 		local_irq_enable();
 		break;
 	case KVM_PV_REASON_PAGE_READY:
-- 
2.14.1

* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-09-30 17:15       ` Paul E. McKenney
  2017-10-01  1:20         ` Boqun Feng
@ 2017-10-02 12:45         ` Paolo Bonzini
  2017-10-02 13:53           ` Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2017-10-02 12:45 UTC (permalink / raw)
  To: paulmck, Boqun Feng
  Cc: linux-kernel, kvm, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On 30/09/2017 19:15, Paul E. McKenney wrote:
> On Sat, Sep 30, 2017 at 07:41:56AM +0800, Boqun Feng wrote:
>> On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
>>> Not to be repetitive, but if the schedule() is on the guest, this change
>>> really does silently break up an RCU read-side critical section on
>>> guests built with PREEMPT=n.  (Yes, they were already being broken,
>>> but it would be good to avoid this breakage in PREEMPT=n as well as
>>> in PREEMPT=y.)

Yes, you're right.  It's pretty surprising that it's never been reported.

>> Then probably adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reason we
>> choose the halt path? Like:
>>
>> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
>> 		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();
>>
>>
>> But I think async PF could also happen while a user program is running?
>> Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
>> like:
>>
>> 	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
>>
>> and the halt condition becomes:
>>
>> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
>> 		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();
>>
>> Thoughts?
> 
> This looks to me like it would cover it.  If !PREEMPT interrupt from
> kernel, we halt, which would prevent the sleep.
> 
> I take it that we get unhalted when the host gets things patched up?

Yes.  You get another page fault (this time it's a "page ready" page
fault rather than a "page not present" one), which has the side
effect of ending the halt.
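
Roughly, the guest-side dispatch looks like this (a simplified sketch of
do_async_page_fault() in arch/x86/kernel/kvm.c; the regular #PF path and
exception_enter()/exception_exit() are omitted):

	switch (kvm_read_and_reset_pf_reason()) {
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* page is swapped out by the host: wait (schedule or halt) */
		kvm_async_pf_task_wait((u32)read_cr2());
		break;
	case KVM_PV_REASON_PAGE_READY:
		/* host has the page again: wake the waiter / end the halt */
		kvm_async_pf_task_wake((u32)read_cr2());
		break;
	}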

Paolo

>> A side thing is being broken already for PREEMPT=n means we maybe fail
>> to detect this in rcutorture? Then should we add a config with
>> KVM_GUEST=y and try to run some memory consuming things(e.g. stress
>> --vm) in the rcutorture kvm script simultaneously? Paolo, do you have
>> any test workload that could trigger async PF quickly?
> 
> I do not believe that have seen this in rcutorture, but I always run in
> a guest OS on a large-memory system (well, by my old-fashioned standards,
> anyway) that would be quite unlikely to evict a guest OS's pages.  Plus
> I tend to run on shared systems, and deliberately running them out of
> memory would not be particularly friendly to others using those systems.
> 
> I -do- run background scripts that are intended to force the host OS to
> preempt the guest OSes frequently, but I don't believe that this would
> cause that bug.
> 
> But it seems like it would make more sense to add this sort of thing to
> whatever KVM tests there are for host-side eviction of guest pages.

* Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-10-01  1:31 ` [PATCH v2] " Boqun Feng
@ 2017-10-02 13:41   ` Paolo Bonzini
  2017-10-02 14:43     ` Boqun Feng
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2017-10-02 13:41 UTC (permalink / raw)
  To: Boqun Feng, linux-kernel, kvm
  Cc: Paul E. McKenney, Peter Zijlstra, Wanpeng Li,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On 01/10/2017 03:31, Boqun Feng wrote:
> Sasha Levin reported a WARNING:
> 
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> ...
> | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> | 1.10.1-1ubuntu1 04/01/2014
> | Call Trace:
> ...
> | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> | schedule+0x113/0x460 kernel/sched/core.c:3421
> | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> | get_link fs/namei.c:1047 [inline]
> | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> ...
> 
> This happened when the host hit a page fault, and delivered it as in an
> async page fault, while the guest was in an RCU read-side critical
> section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> but rcu_preempt_note_context_switch() would treat the reschedule as a
> sleep in RCU read-side critical section, which is not allowed (even in
> preemptible RCU).  Thus the WARN.
> 
> To cure this, we need to make kvm_async_pf_task_wait() go to the halt
> path(instead of the schedule path) if the PF happens in a RCU read-side
> critical section.
> 
> In PREEMPT=y kernel, this is simple, as we record current RCU read-side
> critical section nested level in rcu_preempt_depth(). But for PREEMPT=n
> kernel rcu_read_lock/unlock() may be no-ops, so we don't whether we are
> in a RCU read-side critical section or not. We resolve this by always
> choosing the halt path in PREEMPT=n kernel unless the guest gets the
> async PF while running in user mode.
> 
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> [The explanation for async PF is contributed by Paolo Bonzini]
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> v1 --> v2:
> 
> *	Add more accurate explanation of async PF from Paolo in the
> 	commit message.
> 
> *	Extend the kvm_async_pf_task_wait() to have a second parameter
> 	@user to indicate whether the fault happens while a user program
> 	running the guest.
> 
> Wanpeng, the callsite of kvm_async_pf_task_wait() in
> kvm_handle_page_fault() is for nested scenario, right? I take it we
> should handle it as if the fault happens when l1 guest is running in
> kernel mode, so @user should be 0, right?

In that case we can schedule, actually.  The guest will let another
process run.

In fact we could schedule safely most of the time in the
!user_mode(regs) case; it's just that with PREEMPT=n there's no
knowledge of whether we can do so.  This explains why we have never seen
the bug before.

I had already applied v1, can you rebase and resend please?  Thanks,

Paolo

>  arch/x86/include/asm/kvm_para.h | 4 ++--
>  arch/x86/kernel/kvm.c           | 9 ++++++---
>  arch/x86/kvm/mmu.c              | 2 +-
>  3 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bc62e7cbf1b1..0a5ae6bb128b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  void __init kvm_guest_init(void);
> -void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wait(u32 token, int user);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
>  
>  #else /* CONFIG_KVM_GUEST */
>  #define kvm_guest_init() do {} while (0)
> -#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wait(T, U) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index aa60a08b65b1..916f519e54c9 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
>  	return NULL;
>  }
>  
> -void kvm_async_pf_task_wait(u32 token)
> +void kvm_async_pf_task_wait(u32 token, int user)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
>  
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) || preempt_count() > 1;
> +	n.halted = is_idle_task(current) ||
> +		   preempt_count() > 1 ||
> +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> +		   rcu_preempt_depth();
>  	init_swait_queue_head(&n.wq);
>  	hlist_add_head(&n.link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
>  		prev_state = exception_enter();
> -		kvm_async_pf_task_wait((u32)read_cr2());
> +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
>  		exception_exit(prev_state);
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eca30c1eb1d9..106d4a029a8a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address);
> +		kvm_async_pf_task_wait(fault_address, 0);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> 

* Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-10-02 12:45         ` Paolo Bonzini
@ 2017-10-02 13:53           ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2017-10-02 13:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Boqun Feng, linux-kernel, kvm, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 02, 2017 at 02:45:34PM +0200, Paolo Bonzini wrote:
> On 30/09/2017 19:15, Paul E. McKenney wrote:
> > On Sat, Sep 30, 2017 at 07:41:56AM +0800, Boqun Feng wrote:
> >> On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
> >>> Not to be repetitive, but if the schedule() is on the guest, this change
> >>> really does silently break up an RCU read-side critical section on
> >>> guests built with PREEMPT=n.  (Yes, they were already being broken,
> >>> but it would be good to avoid this breakage in PREEMPT=n as well as
> >>> in PREEMPT=y.)
> 
> Yes, you're right.  It's pretty surprising that it's never been reported.

It would look like random memory corruption in the guest, so it might
well have been encountered.  Though you have to get a page fault in
just the wrong place and an update has to happen just at that time, so
perhaps low probability.

Still, good to fix.

> >> Then probably adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reason we
> >> choose the halt path? Like:
> >>
> >> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> >> 		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();
> >>
> >>
> >> But I think async PF could also happen while a user program is running?
> >> Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
> >> like:
> >>
> >> 	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> >>
> >> and the halt condition becomes:
> >>
> >> 	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> >> 		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();
> >>
> >> Thoughts?
> > 
> > This looks to me like it would cover it.  If !PREEMPT interrupt from
> > kernel, we halt, which would prevent the sleep.
> > 
> > I take it that we get unhalted when the host gets things patched up?
> 
> Yes.  You get another page fault (this time it's a "page ready" page
> fault rather than a "page not present" one), which has the side
> effecting of ending the halt.

Got it, thank you!

							Thanx, Paul

> Paolo
> 
> >> A side thing is being broken already for PREEMPT=n means we maybe fail
> >> to detect this in rcutorture? Then should we add a config with
> >> KVM_GUEST=y and try to run some memory consuming things(e.g. stress
> >> --vm) in the rcutorture kvm script simultaneously? Paolo, do you have
> >> any test workload that could trigger async PF quickly?
> > 
> > I do not believe that have seen this in rcutorture, but I always run in
> > a guest OS on a large-memory system (well, by my old-fashioned standards,
> > anyway) that would be quite unlikely to evict a guest OS's pages.  Plus
> > I tend to run on shared systems, and deliberately running them out of
> > memory would not be particularly friendly to others using those systems.
> > 
> > I -do- run background scripts that are intended to force the host OS to
> > preempt the guest OSes frequently, but I don't believe that this would
> > cause that bug.
> > 
> > But it seems like it would make more sense to add this sort of thing to
> > whatever KVM tests there are for host-side eviction of guest pages.
> 

* Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-10-02 13:41   ` Paolo Bonzini
@ 2017-10-02 14:43     ` Boqun Feng
  2017-10-02 15:34       ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Boqun Feng @ 2017-10-02 14:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Paul E. McKenney, Peter Zijlstra, Wanpeng Li,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 02, 2017 at 01:41:03PM +0000, Paolo Bonzini wrote:
[...]
> > 
> > Wanpeng, the callsite of kvm_async_pf_task_wait() in
> > kvm_handle_page_fault() is for nested scenario, right? I take it we
> > should handle it as if the fault happens when l1 guest is running in
> > kernel mode, so @user should be 0, right?
> 
> In that case we can schedule, actually.  The guest will let another
> process run.
> 
> In fact we could schedule safely most of the time in the
> !user_mode(regs) case, it's just that with PREEMPT=n there's no
> knowledge of whether we can do so.  This explains why we have never seen
> the bug before.
> 

Thanks, looks like I confused myself a little bit here. You are right.
So in a PREEMPT=n kernel, the only case where we cannot schedule is when
the async PF *interrupts* the *kernel*, while in kvm_handle_page_fault()
we didn't actually interrupt the kernel, so it's fine.

> I had already applied v1, can you rebase and resend please?  Thanks,
> 

Sure, I'm going to rename that parameter to "interrupt_kernel" (probably
a bad name), indicating whether the async PF interrupts the kernel.
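
Something along these lines, i.e. just a sketch of the rename (the final
name and the exact call sites are for v3 to decide):

	void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);

	/* guest #PF path: we interrupted the kernel iff the fault did not
	 * come from user mode */
	kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));

	/* kvm_handle_page_fault(): no guest kernel context was
	 * interrupted, so scheduling is fine */
	kvm_async_pf_task_wait(fault_address, 0);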

But it's a little bit late today, will do that tomorrow.

Regards,
Boqun


> Paolo
> 
> >  arch/x86/include/asm/kvm_para.h | 4 ++--
> >  arch/x86/kernel/kvm.c           | 9 ++++++---
> >  arch/x86/kvm/mmu.c              | 2 +-
> >  3 files changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index bc62e7cbf1b1..0a5ae6bb128b 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> >  bool kvm_para_available(void);
> >  unsigned int kvm_arch_para_features(void);
> >  void __init kvm_guest_init(void);
> > -void kvm_async_pf_task_wait(u32 token);
> > +void kvm_async_pf_task_wait(u32 token, int user);
> >  void kvm_async_pf_task_wake(u32 token);
> >  u32 kvm_read_and_reset_pf_reason(void);
> >  extern void kvm_disable_steal_time(void);
> > @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
> >  
> >  #else /* CONFIG_KVM_GUEST */
> >  #define kvm_guest_init() do {} while (0)
> > -#define kvm_async_pf_task_wait(T) do {} while(0)
> > +#define kvm_async_pf_task_wait(T, U) do {} while(0)
> >  #define kvm_async_pf_task_wake(T) do {} while(0)
> >  
> >  static inline bool kvm_para_available(void)
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index aa60a08b65b1..916f519e54c9 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> >  	return NULL;
> >  }
> >  
> > -void kvm_async_pf_task_wait(u32 token)
> > +void kvm_async_pf_task_wait(u32 token, int user)
> >  {
> >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
> >  
> >  	n.token = token;
> >  	n.cpu = smp_processor_id();
> > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > +	n.halted = is_idle_task(current) ||
> > +		   preempt_count() > 1 ||
> > +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> > +		   rcu_preempt_depth();
> >  	init_swait_queue_head(&n.wq);
> >  	hlist_add_head(&n.link, &b->list);
> >  	raw_spin_unlock(&b->lock);
> > @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >  		/* page is swapped out by the host. */
> >  		prev_state = exception_enter();
> > -		kvm_async_pf_task_wait((u32)read_cr2());
> > +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> >  		exception_exit(prev_state);
> >  		break;
> >  	case KVM_PV_REASON_PAGE_READY:
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index eca30c1eb1d9..106d4a029a8a 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >  		vcpu->arch.apf.host_apf_reason = 0;
> >  		local_irq_disable();
> > -		kvm_async_pf_task_wait(fault_address);
> > +		kvm_async_pf_task_wait(fault_address, 0);
> >  		local_irq_enable();
> >  		break;
> >  	case KVM_PV_REASON_PAGE_READY:
> > 
> 


* Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-10-02 14:43     ` Boqun Feng
@ 2017-10-02 15:34       ` Paul E. McKenney
  2017-10-02 21:11         ` Paolo Bonzini
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2017-10-02 15:34 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paolo Bonzini, linux-kernel, kvm, Peter Zijlstra, Wanpeng Li,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 02, 2017 at 10:43:00PM +0800, Boqun Feng wrote:
> On Mon, Oct 02, 2017 at 01:41:03PM +0000, Paolo Bonzini wrote:
> [...]
> > > 
> > > Wanpeng, the callsite of kvm_async_pf_task_wait() in
> > > kvm_handle_page_fault() is for the nested scenario, right? I take it we
> > > should handle it as if the fault happened while the L1 guest was running
> > > in kernel mode, so @user should be 0, right?
> > 
> > In that case we can schedule, actually.  The guest will let another
> > process run.
> > 
> > In fact we could schedule safely most of the time in the
> > !user_mode(regs) case, it's just that with PREEMPT=n there's no
> > knowledge of whether we can do so.  This explains why we have never seen
> > the bug before.
> 
> Thanks, it looks like I confused myself a bit here. You are right: in a
> PREEMPT=n kernel, the only case where we cannot schedule is when the
> async PF *interrupts* the *kernel*, and kvm_handle_page_fault() does not
> actually interrupt the kernel, so scheduling there is fine.

Actually, we should be able to do a little bit better than that.
If PREEMPT=n but PREEMPT_COUNT=y, then preempt_count() will know
about RCU read-side critical sections via preempt_disable().

So maybe something like this?

	n.halted = is_idle_task(current) ||
		   preempt_count() > 1 ||
		   (!IS_ENABLED(CONFIG_PREEMPT) && !IS_ENABLED(CONFIG_PREEMPT_COUNT) && !user) ||
		   rcu_preempt_depth();
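
For reference, with CONFIG_PREEMPT_RCU=n, rcu_read_lock() boils down to
roughly this (quoting rcupdate.h from memory, so treat it as a sketch):

	static inline void __rcu_read_lock(void)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
			preempt_disable();
	}

so preempt_count() sees such a reader, and the rcu_preempt_depth() term is
what catches readers on PREEMPT_RCU (i.e. PREEMPT=y) kernels.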

							Thanx, Paul

> > I had already applied v1, can you rebase and resend please?  Thanks,
> > 
> 
> Sure, I'm going to rename that parameter to "interrupt_kernel" (probably
> a bad name), indicating whether the async PF interrupts the kernel.
> 
> But it's a little bit late today, will do that tomorrow.
> 
> Regards,
> Boqun
> 
> 
> > Paolo
> > 
> > > [...]
> > 


* Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-10-02 15:34       ` Paul E. McKenney
@ 2017-10-02 21:11         ` Paolo Bonzini
  2017-10-02 21:41           ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2017-10-02 21:11 UTC (permalink / raw)
  To: paulmck
  Cc: Boqun Feng, linux-kernel, kvm, Peter Zijlstra, Wanpeng Li,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


----- Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> Actually, we should be able to do a little bit better than that.
> If PREEMPT=n but PREEMPT_COUNT=y, then preempt_count() will know
> about RCU read-side critical sections via preempt_disable().
> 
> So maybe something like this?
> 
> 	n.halted = is_idle_task(current) ||
> 		   preempt_count() > 1 ||
> 		   (!IS_ENABLED(CONFIG_PREEMPT) && !IS_ENABLED(CONFIG_PREEMPT_COUNT) && !user) ||
> 		   rcu_preempt_depth();

Or even

 is_idle_task(current) ||
 (IS_ENABLED(CONFIG_PREEMPT_COUNT)
  ? preempt_count() > 1 || rcu_preempt_depth()
  : !user)

since PREEMPT selects PREEMPT_COUNT.
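
In context that would read something like (untested, just sketching):

	n.halted = is_idle_task(current) ||
		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
		    ? preempt_count() > 1 || rcu_preempt_depth()
		    : !user);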

Paolo

> 							Thanx, Paul
> 
> > > I had already applied v1, can you rebase and resend please?  Thanks,
> > > 
> > 
> > Sure, I'm going to rename that parameter to "interrupt_kernel" (probably
> > a bad name), indicating whether the async PF interrupts the kernel.
> > 
> > But it's a little bit late today, will do that tomorrow.
> > 
> > Regards,
> > Boqun
> > 
> > 
> > > Paolo
> > > 
> > > > [...]
> > > 
> 
> 


* Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
  2017-10-02 21:11         ` Paolo Bonzini
@ 2017-10-02 21:41           ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2017-10-02 21:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Boqun Feng, linux-kernel, kvm, Peter Zijlstra, Wanpeng Li,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 02, 2017 at 05:11:07PM -0400, Paolo Bonzini wrote:
> 
> ----- Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > Actually, we should be able to do a little bit better than that.
> > If PREEMPT=n but PREEMPT_COUNT=y, then preempt_count() will know
> > about RCU read-side critical sections via preempt_disable().
> > 
> > So maybe something like this?
> > 
> > 	n.halted = is_idle_task(current) ||
> > 		   preempt_count() > 1 ||
> > 		   (!IS_ENABLED(CONFIG_PREEMPT) && !IS_ENABLED(CONFIG_PREEMPT_COUNT) && !user) ||
> > 		   rcu_preempt_depth();
> 
> Or even
> 
>  is_idle_task(current) ||
>  (IS_ENABLED(CONFIG_PREEMPT_COUNT)
>   ? preempt_count() > 1 || rcu_preempt_depth()
>   : !user)
> 
> since PREEMPT selects PREEMPT_COUNT.

Makes sense to me!

							Thanx, Paul

> Paolo
> 
> > 							Thanx, Paul
> > 
> > > > I had already applied v1, can you rebase and resend please?  Thanks,
> > > > 
> > > 
> > > Sure, I'm going to rename that parameter to "interrupt_kernel" (probably
> > > a bad name), indicating whether the async PF interrupts the kernel.
> > > 
> > > But it's a little bit late today, will do that tomorrow.
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > 
> > > > Paolo
> > > > 
> > > > > [...]
> > > > 
> > 
> > 
> 


end of thread, other threads:[~2017-10-02 21:41 UTC | newest]

Thread overview: 14+ messages
2017-09-29 11:01 [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections Boqun Feng
2017-09-29 14:53 ` Paolo Bonzini
2017-09-29 16:43   ` Paul E. McKenney
2017-09-29 23:41     ` Boqun Feng
2017-09-30 17:15       ` Paul E. McKenney
2017-10-01  1:20         ` Boqun Feng
2017-10-02 12:45         ` Paolo Bonzini
2017-10-02 13:53           ` Paul E. McKenney
2017-10-01  1:31 ` [PATCH v2] " Boqun Feng
2017-10-02 13:41   ` Paolo Bonzini
2017-10-02 14:43     ` Boqun Feng
2017-10-02 15:34       ` Paul E. McKenney
2017-10-02 21:11         ` Paolo Bonzini
2017-10-02 21:41           ` Paul E. McKenney
