Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections

From: Paolo Bonzini <pbonzini@redhat.com>
To: Boqun Feng <boqun.feng@gmail.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Wanpeng Li" <wanpeng.li@hotmail.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org
Subject: Re: [PATCH v2] kvm/x86: Handle async PF in RCU read-side critical sections
Date: Mon, 2 Oct 2017 15:41:03 +0200
Message-ID: <6c0ee091-4dec-745c-7ffa-add189f249fb@redhat.com>
In-Reply-To: <20171001013140.21325-1-boqun.feng@gmail.com>

On 01/10/2017 03:31, Boqun Feng wrote:
> Sasha Levin reported a WARNING:
> 
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> ...
> | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> | 1.10.1-1ubuntu1 04/01/2014
> | Call Trace:
> ...
> | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> | schedule+0x113/0x460 kernel/sched/core.c:3421
> | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> | get_link fs/namei.c:1047 [inline]
> | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> ...
> 
> This happened when the host hit a page fault and delivered it as an
> async page fault while the guest was in an RCU read-side critical
> section.  The guest then tried to reschedule in kvm_async_pf_task_wait(),
> but rcu_preempt_note_context_switch() treats the reschedule as a
> sleep in an RCU read-side critical section, which is not allowed (even in
> preemptible RCU).  Thus the WARN.
> 
> To cure this, we need to make kvm_async_pf_task_wait() go to the halt
> path (instead of the schedule path) if the PF happens in an RCU read-side
> critical section.
> 
> In a PREEMPT=y kernel, this is simple, as the current RCU read-side
> critical section nesting level is available via rcu_preempt_depth(). But
> in a PREEMPT=n kernel rcu_read_lock/unlock() may be no-ops, so we don't
> know whether we are in an RCU read-side critical section or not. We
> resolve this by always choosing the halt path in a PREEMPT=n kernel
> unless the guest gets the async PF while running in user mode.
> 
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> [The explanation for async PF is contributed by Paolo Bonzini]
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> v1 --> v2:
> 
> *	Add more accurate explanation of async PF from Paolo in the
> 	commit message.
> 
> *	Extend kvm_async_pf_task_wait() to have a second parameter
> 	@user to indicate whether the fault happens while a user program
> 	is running in the guest.
> 
> Wanpeng, the call site of kvm_async_pf_task_wait() in
> kvm_handle_page_fault() is for the nested scenario, right? I take it we
> should handle it as if the fault happens when the L1 guest is running in
> kernel mode, so @user should be 0, right?

In that case we can schedule, actually.  The guest will let another
process run.

In fact we could schedule safely most of the time in the
!user_mode(regs) case; it's just that with PREEMPT=n we have no way of
knowing whether we can do so.  This explains why we have never seen
the bug before.
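
For reference, the halt-versus-schedule decision in the quoted v2 hunk can be
modelled in plain userspace C.  This is only a sketch of the condition under
discussion, not kernel code: struct apf_ctx and apf_must_halt() are
hypothetical names, and the fields merely stand in for the kernel helpers
named in the comments.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel state the v2 patch consults. */
struct apf_ctx {
	bool idle_task;       /* models is_idle_task(current) */
	int  preempt_count;   /* models preempt_count() */
	int  rcu_depth;       /* models rcu_preempt_depth(); 0 without preemptible RCU */
	bool preempt_enabled; /* models IS_ENABLED(CONFIG_PREEMPT) */
	bool user;            /* fault arrived while the guest ran in user mode */
};

/*
 * Mirrors the n.halted expression from the quoted hunk: halt instead of
 * scheduling when any condition says that sleeping might be unsafe.
 */
static bool apf_must_halt(const struct apf_ctx *c)
{
	return c->idle_task ||
	       c->preempt_count > 1 ||
	       (!c->preempt_enabled && !c->user) ||
	       c->rcu_depth > 0;
}
```

With PREEMPT=n the model halts for every kernel-mode fault, including the
many cases where scheduling would have been safe, which is the loss of
information described above.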

I had already applied v1, can you rebase and resend please?  Thanks,

Paolo

>  arch/x86/include/asm/kvm_para.h | 4 ++--
>  arch/x86/kernel/kvm.c           | 9 ++++++---
>  arch/x86/kvm/mmu.c              | 2 +-
>  3 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bc62e7cbf1b1..0a5ae6bb128b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  void __init kvm_guest_init(void);
> -void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wait(u32 token, int user);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
>  
>  #else /* CONFIG_KVM_GUEST */
>  #define kvm_guest_init() do {} while (0)
> -#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wait(T, U) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index aa60a08b65b1..916f519e54c9 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
>  	return NULL;
>  }
>  
> -void kvm_async_pf_task_wait(u32 token)
> +void kvm_async_pf_task_wait(u32 token, int user)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
>  
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) || preempt_count() > 1;
> +	n.halted = is_idle_task(current) ||
> +		   preempt_count() > 1 ||
> +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> +		   rcu_preempt_depth();
>  	init_swait_queue_head(&n.wq);
>  	hlist_add_head(&n.link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
>  		prev_state = exception_enter();
> -		kvm_async_pf_task_wait((u32)read_cr2());
> +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
>  		exception_exit(prev_state);
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eca30c1eb1d9..106d4a029a8a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address);
> +		kvm_async_pf_task_wait(fault_address, 0);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
>