kvm.vger.kernel.org archive mirror
* [patch 0/2] x86/kvm: Sanitize async page fault
@ 2020-03-06 23:42 Thomas Gleixner
  2020-03-06 23:42 ` [patch 1/2] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
  2020-03-06 23:42 ` [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
  0 siblings, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-06 23:42 UTC (permalink / raw)
  To: LKML; +Cc: x86, Paolo Bonzini, KVM, Paul E. McKenney

Hi!

This emerged from the ongoing efforts to consolidate the entry code maze on
x86 and to address the tracing and instrumentation induced violations of
RCU constraints.

While working on this I stumbled over the KVM async page fault handler and
kvm_async_pf_task_wait() in particular. It took me a while to realize that
the randomly sprinkled around rcu_irq_enter()/exit() invocations are just
cargo cult programming. Several patches "fixed" RCU splats by curing the
symptoms without realizing that the code is fundamentally flawed from a
design perspective.

Aside from that, Andy noticed that the way the async page fault handler is
implemented can be improved, which in turn allows further simplification
with respect to CR2 consistency and the general exception entry handling
on x86.

I'm sending this out as a standalone series against mainline, but I'd
prefer to take this along with the rest of the entry code rework.

If you look at the diffstat, don't be afraid: most of the added lines
are comments which were painfully absent in the original code.

Thanks,

	tglx
---
 entry/entry_32.S       |    8 --
 entry/entry_64.S       |    4 -
 include/asm/kvm_para.h |   21 ++++-
 include/asm/x86_init.h |    2 
 kernel/kvm.c           |  193 +++++++++++++++++++++++++++++++++----------------
 kernel/traps.c         |    2 
 kernel/x86_init.c      |    1 
 kvm/mmu/mmu.c          |    2 
 mm/fault.c             |   19 ++++
 9 files changed, 171 insertions(+), 81 deletions(-)



* [patch 1/2] x86/kvm: Handle async page faults directly through do_page_fault()
  2020-03-06 23:42 [patch 0/2] x86/kvm: Sanitize async page fault Thomas Gleixner
@ 2020-03-06 23:42 ` Thomas Gleixner
  2020-03-06 23:42 ` [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
  1 sibling, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-06 23:42 UTC (permalink / raw)
  To: LKML; +Cc: x86, Paolo Bonzini, KVM, Paul E. McKenney, Andy Lutomirski

From: Andy Lutomirski <luto@kernel.org>

KVM overloads #PF to indicate two types of not-actually-page-fault
events.  Right now, the KVM guest code intercepts them by modifying
the IDT and hooking the #PF vector.  This makes the already fragile
fault code even harder to understand, and it also pollutes call
traces with async_page_fault and do_async_page_fault for normal page
faults.
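
For reference, the guest tells these events apart from a real #PF via a
per-CPU reason word shared with the host. A minimal sketch, condensed from
kvm_read_and_reset_pf_reason() and the apf_reason per-CPU data in
arch/x86/kernel/kvm.c (simplified; the real variable is declared with
DEFINE_PER_CPU_DECRYPTED and registered via MSR_KVM_ASYNC_PF_EN):

	u32 kvm_read_and_reset_pf_reason(void)
	{
		u32 reason = 0;

		/* Only valid if async PF delivery was enabled for this CPU */
		if (__this_cpu_read(apf_reason.enabled)) {
			reason = __this_cpu_read(apf_reason.reason);
			__this_cpu_write(apf_reason.reason, 0);
		}

		/*
		 * 0                                -> genuine #PF
		 * KVM_PV_REASON_PAGE_NOT_PRESENT   -> wait for the host
		 * KVM_PV_REASON_PAGE_READY         -> wake up the waiter
		 */
		return reason;
	}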

Clean it up by moving the logic into do_page_fault() using a static
branch.  This gets rid of the platform trap_init override mechanism
completely.

[ tglx: Fixed up 32bit, removed error code from the async functions and
  	massaged coding style ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/entry_32.S       |    8 --------
 arch/x86/entry/entry_64.S       |    4 ----
 arch/x86/include/asm/kvm_para.h |   19 +++++++++++++++++--
 arch/x86/include/asm/x86_init.h |    2 --
 arch/x86/kernel/kvm.c           |   39 +++++++++++++++++++++------------------
 arch/x86/kernel/traps.c         |    2 --
 arch/x86/kernel/x86_init.c      |    1 -
 arch/x86/mm/fault.c             |   19 +++++++++++++++++++
 8 files changed, 57 insertions(+), 37 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1698,14 +1698,6 @@ SYM_CODE_START(general_protection)
 	jmp	common_exception
 SYM_CODE_END(general_protection)
 
-#ifdef CONFIG_KVM_GUEST
-SYM_CODE_START(async_page_fault)
-	ASM_CLAC
-	pushl	$do_async_page_fault
-	jmp	common_exception_read_cr2
-SYM_CODE_END(async_page_fault)
-#endif
-
 SYM_CODE_START(rewind_stack_do_exit)
 	/* Prevent any naive code from trying to unwind to our caller. */
 	xorl	%ebp, %ebp
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1202,10 +1202,6 @@ idtentry xendebug		do_debug		has_error_c
 idtentry general_protection	do_general_protection	has_error_code=1
 idtentry page_fault		do_page_fault		has_error_code=1	read_cr2=1
 
-#ifdef CONFIG_KVM_GUEST
-idtentry async_page_fault	do_async_page_fault	has_error_code=1	read_cr2=1
-#endif
-
 #ifdef CONFIG_X86_MCE
 idtentry machine_check		do_mce			has_error_code=0	paranoid=1
 #endif
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -91,8 +91,18 @@ unsigned int kvm_arch_para_hints(void);
 void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
-extern void kvm_disable_steal_time(void);
-void do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address);
+void kvm_disable_steal_time(void);
+bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
+
+DECLARE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+
+static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
+{
+	if (static_branch_unlikely(&kvm_async_pf_enabled))
+		return __kvm_handle_async_pf(regs, token);
+	else
+		return false;
+}
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 void __init kvm_spinlock_init(void);
@@ -130,6 +140,11 @@ static inline void kvm_disable_steal_tim
 {
 	return;
 }
+
+static inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
+{
+	return false;
+}
 #endif
 
 #endif /* _ASM_X86_KVM_PARA_H */
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -50,14 +50,12 @@ struct x86_init_resources {
  * @pre_vector_init:		init code to run before interrupt vectors
  *				are set up.
  * @intr_init:			interrupt init code
- * @trap_init:			platform specific trap setup
  * @intr_mode_select:		interrupt delivery mode selection
  * @intr_mode_init:		interrupt delivery mode setup
  */
 struct x86_init_irqs {
 	void (*pre_vector_init)(void);
 	void (*intr_init)(void);
-	void (*trap_init)(void);
 	void (*intr_mode_select)(void);
 	void (*intr_mode_init)(void);
 };
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -35,6 +35,8 @@
 #include <asm/tlb.h>
 #include <asm/cpuidle_haltpoll.h>
 
+DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
+
 static int kvmapf = 1;
 
 static int __init parse_no_kvmapf(char *arg)
@@ -242,25 +244,27 @@ u32 kvm_read_and_reset_pf_reason(void)
 EXPORT_SYMBOL_GPL(kvm_read_and_reset_pf_reason);
 NOKPROBE_SYMBOL(kvm_read_and_reset_pf_reason);
 
-dotraplinkage void
-do_async_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 {
+	/*
+	 * If we get a page fault right here, the pf_reason seems likely
+	 * to be clobbered.  Bummer.
+	 */
 	switch (kvm_read_and_reset_pf_reason()) {
 	default:
-		do_page_fault(regs, error_code, address);
-		break;
+		return false;
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
-		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
-		break;
+		kvm_async_pf_task_wait(token, !user_mode(regs));
+		return true;
 	case KVM_PV_REASON_PAGE_READY:
 		rcu_irq_enter();
-		kvm_async_pf_task_wake((u32)address);
+		kvm_async_pf_task_wake(token);
 		rcu_irq_exit();
-		break;
+		return true;
 	}
 }
-NOKPROBE_SYMBOL(do_async_page_fault);
+NOKPROBE_SYMBOL(__kvm_handle_async_pf);
 
 static void __init paravirt_ops_setup(void)
 {
@@ -306,7 +310,11 @@ static notrace void kvm_guest_apic_eoi_w
 static void kvm_guest_cpu_init(void)
 {
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-		u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
+		u64 pa;
+
+		WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
+
+		pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
 
 #ifdef CONFIG_PREEMPTION
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -592,12 +600,6 @@ static int kvm_cpu_down_prepare(unsigned
 }
 #endif
 
-static void __init kvm_apf_trap_init(void)
-{
-	update_intr_gate(X86_TRAP_PF, async_page_fault);
-}
-
-
 static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
@@ -632,8 +634,6 @@ static void __init kvm_guest_init(void)
 	register_reboot_notifier(&kvm_pv_reboot_nb);
 	for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
 		raw_spin_lock_init(&async_pf_sleepers[i].lock);
-	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
-		x86_init.irqs.trap_init = kvm_apf_trap_init;
 
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
@@ -670,6 +670,9 @@ static void __init kvm_guest_init(void)
 	 * overcommitted.
 	 */
 	hardlockup_detector_disable();
+
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf)
+		static_branch_enable(&kvm_async_pf_enabled);
 }
 
 static noinline uint32_t __kvm_cpuid_base(void)
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -936,7 +936,5 @@ void __init trap_init(void)
 
 	idt_setup_ist_traps();
 
-	x86_init.irqs.trap_init();
-
 	idt_setup_debugidt_traps();
 }
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -79,7 +79,6 @@ struct x86_init_ops x86_init __initdata
 	.irqs = {
 		.pre_vector_init	= init_ISA_irqs,
 		.intr_init		= native_init_IRQ,
-		.trap_init		= x86_init_noop,
 		.intr_mode_select	= apic_intr_mode_select,
 		.intr_mode_init		= apic_intr_mode_init
 	},
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -30,6 +30,7 @@
 #include <asm/desc.h>			/* store_idt(), ...		*/
 #include <asm/cpu_entry_area.h>		/* exception stack		*/
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
+#include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1505,6 +1506,24 @@ do_page_fault(struct pt_regs *regs, unsi
 		unsigned long address)
 {
 	prefetchw(&current->mm->mmap_sem);
+	/*
+	 * KVM has two types of events that are, logically, interrupts, but
+	 * are unfortunately delivered using the #PF vector.  These events are
+	 * "you just accessed valid memory, but the host doesn't have it right
+	 * now, so I'll put you to sleep if you continue" and "that memory
+	 * you tried to access earlier is available now."
+	 *
+	 * We are relying on the interrupted context being sane (valid RSP,
+	 * relevant locks not held, etc.), which is fine as long as the
+	 * interrupted context had IF=1.  We are also relying on the KVM
+	 * async pf type field and CR2 being read consistently instead of
+	 * getting values from real and async page faults mixed up.
+	 *
+	 * Fingers crossed.
+	 */
+	if (kvm_handle_async_pf(regs, (u32)address))
+		return;
+
 	trace_page_fault_entries(regs, hw_error_code, address);
 
 	if (unlikely(kmmio_fault(regs, address)))



* [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-06 23:42 [patch 0/2] x86/kvm: Sanitize async page fault Thomas Gleixner
  2020-03-06 23:42 ` [patch 1/2] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
@ 2020-03-06 23:42 ` Thomas Gleixner
  2020-03-07  0:22   ` Paul E. McKenney
  2020-03-07  2:18   ` Andy Lutomirski
  1 sibling, 2 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-06 23:42 UTC (permalink / raw)
  To: LKML; +Cc: x86, Paolo Bonzini, KVM, Paul E. McKenney

While working on the entry consolidation I stumbled over the KVM async page
fault handler and kvm_async_pf_task_wait() in particular. It took me a
while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
invocations are just cargo cult programming. Several patches "fixed" RCU
splats by curing the symptoms without noticing that the code is flawed 
from a design perspective.

The main problem is that this async injection is not based on a proper
handshake mechanism and only respects the minimal requirement, i.e. the
guest is not in a state where it has interrupts disabled.

Aside from that, the actual code is a convoluted one-size-fits-all swiss army
knife. It is invoked from different places with different RCU constraints:

  1) Host side:

     vcpu_enter_guest()
       kvm_x86_ops->handle_exit()
         kvm_handle_page_fault()
           kvm_async_pf_task_wait()

     The invocation happens from fully preemptible context.

  2) Guest side:

     The async page fault interrupted:

         a) user space

	 b) preemptible kernel code which is not in a RCU read side
	    critical section

	 c) non-preemptible kernel code or a RCU read side critical section,
	    or kernel code with CONFIG_PREEMPTION=n, which makes it
	    unnecessary to differentiate between #2b and #2c.

RCU is watching in #1, #2a and #2b because:

  #1  The vCPU exited and current is definitely not the idle task

  #2a The #PF entry code on the guest went through enter_from_user_mode()
      which reactivates RCU

  #2b There is no preemptible, interrupts enabled code in the kernel
      which can run with RCU looking away. (The idle task is always
      non-preemptible).

I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
voodoo at all.

In #2c RCU might not be watching, but as that state cannot schedule
anyway there is nothing to worry about beyond invoking rcu_irq_enter()
before running that code. This can be optimized, but that will be done
as an extra step in the course of the entry code consolidation work.

So the proper solution for this is to:

  - Split kvm_async_pf_task_wait() into schedule and halt based waiting
    interfaces which share the enqueueing code.

  - Add comments (a condensed form of this changelog) to spare others the
    wasted time and pain of reverse engineering all of this with the help of
    incomprehensible changelogs and code history.

  - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault()
    and for user mode and schedulable kernel side async page faults
    (#1, #2a, #2b).

  - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
    case (#2c).

    For this case also remove the rcu_irq_exit()/enter() pair around the
    halt as it is just a pointless exercise:

       - vCPUs can VMEXIT at any random point and can be scheduled out
         for an arbitrary amount of time by the host, and this is not any
         different except that the exit is triggered voluntarily via halt.

       - The interrupted context could have RCU watching already. So the
	 rcu_irq_exit() before the halt is not gaining anything aside from
	 confusing the reader. Claiming that this might prevent RCU stalls
	 is just an illusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/kvm_para.h |    2 
 arch/x86/kernel/kvm.c           |  156 ++++++++++++++++++++++++++++------------
 arch/x86/kvm/mmu/mmu.c          |    2 
 3 files changed, 115 insertions(+), 45 deletions(-)

--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsign
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 unsigned int kvm_arch_para_hints(void);
-void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
+void kvm_async_pf_task_wait_schedule(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 void kvm_disable_steal_time(void);
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,7 +75,7 @@ struct kvm_task_sleep_node {
 	struct swait_queue_head wq;
 	u32 token;
 	int cpu;
-	bool halted;
+	bool use_halt;
 };
 
 static struct kvm_task_sleep_head {
@@ -98,75 +98,145 @@ static struct kvm_task_sleep_node *_find
 	return NULL;
 }
 
-/*
- * @interrupt_kernel: Is this called from a routine which interrupts the kernel
- * 		      (other than user space)?
- */
-void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
+static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
+				    struct kvm_task_sleep_node *n)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
-	struct kvm_task_sleep_node n, *e;
-	DECLARE_SWAITQUEUE(wait);
-
-	rcu_irq_enter();
+	struct kvm_task_sleep_node *e;
 
 	raw_spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
 	if (e) {
 		/* dummy entry exist -> wake up was delivered ahead of PF */
 		hlist_del(&e->link);
-		kfree(e);
 		raw_spin_unlock(&b->lock);
+		kfree(e);
+		return false;
+	}
 
-		rcu_irq_exit();
+	n->token = token;
+	n->cpu = smp_processor_id();
+	n->use_halt = use_halt;
+	init_swait_queue_head(&n->wq);
+	hlist_add_head(&n->link, &b->list);
+	raw_spin_unlock(&b->lock);
+	return true;
+}
+
+/*
+ * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
+ * @token:	Token to identify the sleep node entry
+ *
+ * Invoked from the async pagefault handling code or from the VM exit page
+ * fault handler. In both cases RCU is watching.
+ */
+void kvm_async_pf_task_wait_schedule(u32 token)
+{
+	struct kvm_task_sleep_node n;
+	DECLARE_SWAITQUEUE(wait);
+
+	lockdep_assert_irqs_disabled();
+
+	if (!kvm_async_pf_queue_task(token, false, &n))
 		return;
+
+	for (;;) {
+		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+		if (hlist_unhashed(&n.link))
+			break;
+
+		local_irq_enable();
+		schedule();
+		local_irq_disable();
 	}
+	finish_swait(&n.wq, &wait);
+}
+EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
 
-	n.token = token;
-	n.cpu = smp_processor_id();
-	n.halted = is_idle_task(current) ||
-		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
-		    ? preempt_count() > 1 || rcu_preempt_depth()
-		    : interrupt_kernel);
-	init_swait_queue_head(&n.wq);
-	hlist_add_head(&n.link, &b->list);
-	raw_spin_unlock(&b->lock);
+/*
+ * Invoked from the async page fault handler.
+ */
+static void kvm_async_pf_task_wait_halt(u32 token)
+{
+	struct kvm_task_sleep_node n;
+
+	if (!kvm_async_pf_queue_task(token, true, &n))
+		return;
 
 	for (;;) {
-		if (!n.halted)
-			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
 		if (hlist_unhashed(&n.link))
 			break;
+		/*
+		 * No point in doing anything about RCU here. Any RCU read
+		 * side critical section or RCU watching section can be
+		 * interrupted by VMEXITs and the host is free to keep the
+		 * vCPU scheduled out as long as it sees fit. This is not
+		 * any different just because of the halt induced voluntary
+		 * VMEXIT.
+		 *
+		 * Also the async page fault could have interrupted any RCU
+		 * watching context, so invoking rcu_irq_exit()/enter()
+		 * around this is not gaining anything.
+		 */
+		native_safe_halt();
+		local_irq_disable();
+	}
+}
 
-		rcu_irq_exit();
+/* Invoked from the async page fault handler */
+static void kvm_async_pf_task_wait(u32 token, bool usermode)
+{
+	bool can_schedule;
 
-		if (!n.halted) {
-			local_irq_enable();
-			schedule();
-			local_irq_disable();
-		} else {
-			/*
-			 * We cannot reschedule. So halt.
-			 */
-			native_safe_halt();
-			local_irq_disable();
-		}
+	/*
+	 * No need to check whether interrupts were disabled because the
+	 * host will (hopefully) only inject an async page fault into
+	 * interrupt enabled regions.
+	 *
+	 * If CONFIG_PREEMPTION is enabled then check whether the code
+	 * which triggered the page fault is preemptible. This covers user
+	 * mode as well because preempt_count() is obviously 0 there.
+	 *
+	 * The check for rcu_preempt_depth() is also required because
+	 * voluntary scheduling inside a rcu read locked section is not
+	 * allowed.
+	 *
+	 * The idle task is already covered by this because idle always
+	 * has a preempt count > 0.
+	 *
+	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
+	 * coming from user mode as there is no indication whether the
+	 * context which triggered the page fault could schedule or not.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPTION))
+		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
+	else
+		can_schedule = usermode;
 
+	/*
+	 * If the kernel context is allowed to schedule then RCU is
+	 * watching because no preemptible code in the kernel is inside RCU
+	 * idle state. So it can be treated like user mode. User mode is
+	 * safe because the #PF entry invoked enter_from_user_mode().
+	 *
+	 * For the non schedulable case invoke rcu_irq_enter() for
+	 * now. This will be moved out to the pagefault entry code later
+	 * and only invoked when really needed.
+	 */
+	if (can_schedule) {
+		kvm_async_pf_task_wait_schedule(token);
+	} else {
 		rcu_irq_enter();
+		kvm_async_pf_task_wait_halt(token);
+		rcu_irq_exit();
 	}
-	if (!n.halted)
-		finish_swait(&n.wq, &wait);
-
-	rcu_irq_exit();
-	return;
 }
-EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
 
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
 	hlist_del_init(&n->link);
-	if (n->halted)
+	if (n->use_halt)
 		smp_send_reschedule(n->cpu);
 	else if (swq_has_sleeper(&n->wq))
 		swake_up_one(&n->wq);
@@ -255,7 +325,7 @@ bool __kvm_handle_async_pf(struct pt_reg
 		return false;
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
-		kvm_async_pf_task_wait(token, !user_mode(regs));
+		kvm_async_pf_task_wait(token, user_mode(regs));
 		return true;
 	case KVM_PV_REASON_PAGE_READY:
 		rcu_irq_enter();
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4205,7 +4205,7 @@ int kvm_handle_page_fault(struct kvm_vcp
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		vcpu->arch.apf.host_apf_reason = 0;
 		local_irq_disable();
-		kvm_async_pf_task_wait(fault_address, 0);
+		kvm_async_pf_task_wait_schedule(fault_address);
 		local_irq_enable();
 		break;
 	case KVM_PV_REASON_PAGE_READY:



* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-06 23:42 ` [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
@ 2020-03-07  0:22   ` Paul E. McKenney
  2020-03-07  1:02     ` Thomas Gleixner
  2020-03-07  2:18   ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2020-03-07  0:22 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Paolo Bonzini, KVM

On Sat, Mar 07, 2020 at 12:42:06AM +0100, Thomas Gleixner wrote:
> While working on the entry consolidation I stumbled over the KVM async page
> fault handler and kvm_async_pf_task_wait() in particular. It took me a
> while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
> invocations are just cargo cult programming. Several patches "fixed" RCU
> splats by curing the symptoms without noticing that the code is flawed 
> from a design perspective.
> 
> The main problem is that this async injection is not based on a proper
> handshake mechanism and only respects the minimal requirement, i.e. the
> guest is not in a state where it has interrupts disabled.
> 
> Aside of that the actual code is a convoluted one fits it all swiss army
> knife. It is invoked from different places with different RCU constraints:
> 
>   1) Host side:
> 
>      vcpu_enter_guest()
>        kvm_x86_ops->handle_exit()
>          kvm_handle_page_fault()
>            kvm_async_pf_task_wait()
> 
>      The invocation happens from fully preemptible context.
> 
>   2) Guest side:
> 
>      The async page fault interrupted:
> 
>          a) user space
> 
> 	 b) preemptible kernel code which is not in a RCU read side
> 	    critical section
> 
>      	 c) non-preemtible kernel code or a RCU read side critical section
> 	    or kernel code with CONFIG_PREEMPTION=n which allows not to
> 	    differentiate between #2b and #2c.
> 
> RCU is watching for:
> 
>   #1  The vCPU exited and current is definitely not the idle task
> 
>   #2a The #PF entry code on the guest went through enter_from_user_mode()
>       which reactivates RCU
> 
>   #2b There is no preemptible, interrupts enabled code in the kernel
>       which can run with RCU looking away. (The idle task is always
>       non preemptible).
> 
> I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU
> voodoo at all.
> 
> In #2c RCU is eventually not watching, but as that state cannot schedule
> anyway there is no point to worry about it so it has to invoke
> rcu_irq_enter() before running that code. This can be optimized, but this
> will be done as an extra step in course of the entry code consolidation
> work.

In other words, any needed rcu_irq_enter() and rcu_irq_exit() are added
in one of the entry-code consolidation patches, and patch below depends
on that patch, correct?

							Thanx, Paul

> So the proper solution for this is to:
> 
>   - Split kvm_async_pf_task_wait() into schedule and halt based waiting
>     interfaces which share the enqueueing code.
> 
>   - Add comments (condensed form of this changelog) to spare others the
>     time waste and pain of reverse engineering all of this with the help of
>     uncomprehensible changelogs and code history.
> 
>   - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(),
>     user mode and schedulable kernel side async page faults (#1, #2a, #2b)
> 
>   - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel
>     case (#2c).
> 
>     For this case also remove the rcu_irq_exit()/enter() pair around the
>     halt as it is just a pointless exercise:
> 
>        - vCPUs can VMEXIT at any any random point and can be scheduled out
>          for an arbitrary amount of time by the host and this is not any
>          different except that it voluntary triggers the exit via halt.
> 
>        - The interrupted context could have RCU watching already. So the
> 	 rcu_irq_exit() before the halt is not gaining anything aside of
> 	 confusing the reader. Claiming that this might prevent RCU stalls
> 	 is just an illusion.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/kvm_para.h |    2 
>  arch/x86/kernel/kvm.c           |  156 ++++++++++++++++++++++++++++------------
>  arch/x86/kvm/mmu/mmu.c          |    2 
>  3 files changed, 115 insertions(+), 45 deletions(-)
> 
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsign
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  unsigned int kvm_arch_para_hints(void);
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
> +void kvm_async_pf_task_wait_schedule(u32 token);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  void kvm_disable_steal_time(void);
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -75,7 +75,7 @@ struct kvm_task_sleep_node {
>  	struct swait_queue_head wq;
>  	u32 token;
>  	int cpu;
> -	bool halted;
> +	bool use_halt;
>  };
>  
>  static struct kvm_task_sleep_head {
> @@ -98,75 +98,145 @@ static struct kvm_task_sleep_node *_find
>  	return NULL;
>  }
>  
> -/*
> - * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> - * 		      (other than user space)?
> - */
> -void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
> +static bool kvm_async_pf_queue_task(u32 token, bool use_halt,
> +				    struct kvm_task_sleep_node *n)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> -	struct kvm_task_sleep_node n, *e;
> -	DECLARE_SWAITQUEUE(wait);
> -
> -	rcu_irq_enter();
> +	struct kvm_task_sleep_node *e;
>  
>  	raw_spin_lock(&b->lock);
>  	e = _find_apf_task(b, token);
>  	if (e) {
>  		/* dummy entry exist -> wake up was delivered ahead of PF */
>  		hlist_del(&e->link);
> -		kfree(e);
>  		raw_spin_unlock(&b->lock);
> +		kfree(e);
> +		return false;
> +	}
>  
> -		rcu_irq_exit();
> +	n->token = token;
> +	n->cpu = smp_processor_id();
> +	n->use_halt = use_halt;
> +	init_swait_queue_head(&n->wq);
> +	hlist_add_head(&n->link, &b->list);
> +	raw_spin_unlock(&b->lock);
> +	return true;
> +}
> +
> +/*
> + * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
> + * @token:	Token to identify the sleep node entry
> + *
> + * Invoked from the async pagefault handling code or from the VM exit page
> + * fault handler. In both cases RCU is watching.
> + */
> +void kvm_async_pf_task_wait_schedule(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +	DECLARE_SWAITQUEUE(wait);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (!kvm_async_pf_queue_task(token, false, &n))
>  		return;
> +
> +	for (;;) {
> +		prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +		if (hlist_unhashed(&n.link))
> +			break;
> +
> +		local_irq_enable();
> +		schedule();
> +		local_irq_disable();
>  	}
> +	finish_swait(&n.wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>  
> -	n.token = token;
> -	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) ||
> -		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> -		    ? preempt_count() > 1 || rcu_preempt_depth()
> -		    : interrupt_kernel);
> -	init_swait_queue_head(&n.wq);
> -	hlist_add_head(&n.link, &b->list);
> -	raw_spin_unlock(&b->lock);
> +/*
> + * Invoked from the async page fault handler.
> + */
> +static void kvm_async_pf_task_wait_halt(u32 token)
> +{
> +	struct kvm_task_sleep_node n;
> +
> +	if (!kvm_async_pf_queue_task(token, true, &n))
> +		return;
>  
>  	for (;;) {
> -		if (!n.halted)
> -			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>  		if (hlist_unhashed(&n.link))
>  			break;
> +		/*
> +		 * No point in doing anything about RCU here. Any RCU read
> +		 * side critical section or RCU watching section can be
> +		 * interrupted by VMEXITs and the host is free to keep the
> +		 * vCPU scheduled out as long as it sees fit. This is not
> +		 * any different just because of the halt induced voluntary
> +		 * VMEXIT.
> +		 *
> +		 * Also the async page fault could have interrupted any RCU
> +		 * watching context, so invoking rcu_irq_exit()/enter()
> +		 * around this is not gaining anything.
> +		 */
> +		native_safe_halt();
> +		local_irq_disable();
> +	}
> +}
>  
> -		rcu_irq_exit();
> +/* Invoked from the async page fault handler */
> +static void kvm_async_pf_task_wait(u32 token, bool usermode)
> +{
> +	bool can_schedule;
>  
> -		if (!n.halted) {
> -			local_irq_enable();
> -			schedule();
> -			local_irq_disable();
> -		} else {
> -			/*
> -			 * We cannot reschedule. So halt.
> -			 */
> -			native_safe_halt();
> -			local_irq_disable();
> -		}
> +	/*
> +	 * No need to check whether interrupts were disabled because the
> +	 * host will (hopefully) only inject an async page fault into
> +	 * interrupt enabled regions.
> +	 *
> +	 * If CONFIG_PREEMPTION is enabled then check whether the code
> +	 * which triggered the page fault is preemptible. This covers user
> +	 * mode as well because preempt_count() is obviously 0 there.
> +	 *
> +	 * The check for rcu_preempt_depth() is also required because
> +	 * voluntary scheduling inside a rcu read locked section is not
> +	 * allowed.
> +	 *
> +	 * The idle task is already covered by this because idle always
> +	 * has a preempt count > 0.
> +	 *
> +	 * If CONFIG_PREEMPTION is disabled only allow scheduling when
> +	 * coming from user mode as there is no indication whether the
> +	 * context which triggered the page fault could schedule or not.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPTION))
> +		can_schedule = preempt_count() + rcu_preempt_depth() == 0;
> +	else
> +		can_schedule = usermode;
>  
> +	/*
> +	 * If the kernel context is allowed to schedule then RCU is
> +	 * watching because no preemptible code in the kernel is inside RCU
> +	 * idle state. So it can be treated like user mode. User mode is
> +	 * safe because the #PF entry invoked enter_from_user_mode().
> +	 *
> +	 * For the non schedulable case invoke rcu_irq_enter() for
> +	 * now. This will be moved out to the pagefault entry code later
> +	 * and only invoked when really needed.
> +	 */
> +	if (can_schedule) {
> +		kvm_async_pf_task_wait_schedule(token);
> +	} else {
>  		rcu_irq_enter();
> +		kvm_async_pf_task_wait_halt(token);
> +		rcu_irq_exit();
>  	}
> -	if (!n.halted)
> -		finish_swait(&n.wq, &wait);
> -
> -	rcu_irq_exit();
> -	return;
>  }
> -EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
>  
>  static void apf_task_wake_one(struct kvm_task_sleep_node *n)
>  {
>  	hlist_del_init(&n->link);
> -	if (n->halted)
> +	if (n->use_halt)
>  		smp_send_reschedule(n->cpu);
>  	else if (swq_has_sleeper(&n->wq))
>  		swake_up_one(&n->wq);
> @@ -255,7 +325,7 @@ bool __kvm_handle_async_pf(struct pt_reg
>  		return false;
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
> -		kvm_async_pf_task_wait(token, !user_mode(regs));
> +		kvm_async_pf_task_wait(token, user_mode(regs));
>  		return true;
>  	case KVM_PV_REASON_PAGE_READY:
>  		rcu_irq_enter();
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4205,7 +4205,7 @@ int kvm_handle_page_fault(struct kvm_vcp
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address, 0);
> +		kvm_async_pf_task_wait_schedule(fault_address);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> 


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07  0:22   ` Paul E. McKenney
@ 2020-03-07  1:02     ` Thomas Gleixner
  2020-03-07  3:48       ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-07  1:02 UTC (permalink / raw)
  To: paulmck; +Cc: LKML, x86, Paolo Bonzini, KVM

"Paul E. McKenney" <paulmck@kernel.org> writes:
>> In #2c RCU is eventually not watching, but as that state cannot schedule
>> anyway there is no point to worry about it so it has to invoke
>> rcu_irq_enter() before running that code. This can be optimized, but this
>> will be done as an extra step in course of the entry code consolidation
>> work.
>
> In other words, any needed rcu_irq_enter() and rcu_irq_exit() are added
> in one of the entry-code consolidation patches, and patch below depends
> on that patch, correct?

No. The patch itself is already correct when applied to mainline. It has
no dependencies.

It invokes rcu_irq_enter()/exit() for the case (#2c) where it is
relevant. All other cases are already RCU safe today.

The fact that the invocation is misplaced is a different story and yes,
that is part of the entry code cleanup along with some optimizations
which are possible once the entry voodoo is out of ASM and adjustable
for a particular entry point in C.

Thanks,

        tglx


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-06 23:42 ` [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
  2020-03-07  0:22   ` Paul E. McKenney
@ 2020-03-07  2:18   ` Andy Lutomirski
  2020-03-07 10:01     ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2020-03-07  2:18 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> While working on the entry consolidation I stumbled over the KVM async page
> fault handler and kvm_async_pf_task_wait() in particular. It took me a
> while to realize that the randomly sprinkled around rcu_irq_enter()/exit()
> invocations are just cargo cult programming. Several patches "fixed" RCU
> splats by curing the symptoms without noticing that the code is flawed
> from a design perspective.
>
> The main problem is that this async injection is not based on a proper
> handshake mechanism and only respects the minimal requirement, i.e. the
> guest is not in a state where it has interrupts disabled.
>
> Aside of that the actual code is a convoluted one fits it all swiss army
> knife. It is invoked from different places with different RCU constraints:
>
> 1) Host side:
>
>   vcpu_enter_guest()
>     kvm_x86_ops->handle_exit()
>       kvm_handle_page_fault()
>         kvm_async_pf_task_wait()
>
>   The invocation happens from fully preemptible context.

I’m a bit baffled as to why the host uses this code at all instead of
just sleeping the old fashioned way when the guest takes a (host) page
fault.  Oh well.

> +
> +/*
> + * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
> + * @token:    Token to identify the sleep node entry
> + *
> + * Invoked from the async pagefault handling code or from the VM exit page
> + * fault handler. In both cases RCU is watching.
> + */
> +void kvm_async_pf_task_wait_schedule(u32 token)
> +{
> +    struct kvm_task_sleep_node n;
> +    DECLARE_SWAITQUEUE(wait);
> +
> +    lockdep_assert_irqs_disabled();
> +
> +    if (!kvm_async_pf_queue_task(token, false, &n))
>      return;
> +
> +    for (;;) {
> +        prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> +        if (hlist_unhashed(&n.link))
> +            break;
> +
> +        local_irq_enable();
> +        schedule();
> +        local_irq_disable();
>  }
> +    finish_swait(&n.wq, &wait);
> +}
> +EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait_schedule);
>
> -    n.token = token;
> -    n.cpu = smp_processor_id();
> -    n.halted = is_idle_task(current) ||
> -           (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> -            ? preempt_count() > 1 || rcu_preempt_depth()
> -            : interrupt_kernel);
> -    init_swait_queue_head(&n.wq);
> -    hlist_add_head(&n.link, &b->list);
> -    raw_spin_unlock(&b->lock);
> +/*
> + * Invoked from the async page fault handler.
> + */
> +static void kvm_async_pf_task_wait_halt(u32 token)
> +{
> +    struct kvm_task_sleep_node n;
> +
> +    if (!kvm_async_pf_queue_task(token, true, &n))
> +        return;
>
>  for (;;) {
> -        if (!n.halted)
> -            prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>      if (hlist_unhashed(&n.link))
>          break;
> +        /*
> +         * No point in doing anything about RCU here. Any RCU read
> +         * side critical section or RCU watching section can be
> +         * interrupted by VMEXITs and the host is free to keep the
> +         * vCPU scheduled out as long as it sees fit. This is not
> +         * any different just because of the halt induced voluntary
> +         * VMEXIT.
> +         *
> +         * Also the async page fault could have interrupted any RCU
> +         * watching context, so invoking rcu_irq_exit()/enter()
> +         * around this is not gaining anything.
> +         */
> +        native_safe_halt();
> +        local_irq_disable();

What’s the local_irq_disable() here for? I would believe a
lockdep_assert_irqs_disabled() somewhere in here would make sense.
(Yes, I see you copied this from the old code. It’s still nonsense.)

I also find it truly bizarre that hlt actually works in this context.
Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
pending async pf is satisfied?  This would make sense if the wake
event were an interrupt, but it’s not according to Paolo.

All this being said, the only remotely sane case is when regs->flags
has IF==1. Perhaps this code should actually do:

WARN_ON(!(regs->flags & X86_EFLAGS_IF));

while (the page isn’t ready) {
 local_irq_enable();
 native_safe_halt();
 local_irq_disable();
}

with some provision to survive the case where the warning fires so we
at least get logs.

In any event, I just sent a patch to disable async page faults that
happen in kernel mode.  Perhaps this patch should actually just have
some warnings to sanity check the async page faults and not even try
to handle all these nasty sub-cases.

Paolo, is there any way to test these async page faults?


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07  1:02     ` Thomas Gleixner
@ 2020-03-07  3:48       ` Paul E. McKenney
  0 siblings, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2020-03-07  3:48 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, x86, Paolo Bonzini, KVM

On Sat, Mar 07, 2020 at 02:02:25AM +0100, Thomas Gleixner wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> In #2c RCU is eventually not watching, but as that state cannot schedule
> >> anyway there is no point to worry about it so it has to invoke
> >> rcu_irq_enter() before running that code. This can be optimized, but this
> >> will be done as an extra step in course of the entry code consolidation
> >> work.
> >
> > In other words, any needed rcu_irq_enter() and rcu_irq_exit() are added
> > in one of the entry-code consolidation patches, and patch below depends
> > on that patch, correct?
> 
> No. The patch itself is already correct when applied to mainline. It has
> no dependencies.
> 
> It invokes rcu_irq_enter()/exit() for the case (#2c) where it is
> relevant. All other case are already RCU safe today.
> 
> The fact that the invocation is misplaced is a different story and yes,
> that is part of the entry code cleanup along with some optimization
> which are possible once the entry voodoo is out of ASM and adjustable
> for a particular entry point in C.

The weekend clearly did not come a moment too soon for me, did it?  :-/

							Thanx, Paul


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07  2:18   ` Andy Lutomirski
@ 2020-03-07 10:01     ` Thomas Gleixner
  2020-03-07 15:10       ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-07 10:01 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

Andy Lutomirski <luto@kernel.org> writes:
>> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Aside of that the actual code is a convoluted one fits it all swiss army
>> knife. It is invoked from different places with different RCU constraints:
>>
>> 1) Host side:
>>
>>   vcpu_enter_guest()
>>     kvm_x86_ops->handle_exit()
>>       kvm_handle_page_fault()
>>         kvm_async_pf_task_wait()
>>
>>   The invocation happens from fully preemptible context.
>
> I’m a bit baffled as to why the host uses this code at all instead of
> just sleeping the old fashioned way when the guest takes a (host) page
> fault.  Oh well.

If I can trust the crystal ball which I used to decode this maze then it
actually makes sense.

Async faults are faults which cannot be handled by the guest, i.e. the
host either pulled a page away under the guest or did not populate it in
the first place.

So the reasoning is that if this happens the guest might be in a
situation where it can schedule other tasks instead of being stopped
completely by the host until the page arrives.

Now you could argue that this mostly makes sense for CPL 3 faults, but
there is definitely code in the kernel where it makes sense as well,
e.g. exec. But of course as this is designed without a proper handshake
there is no way for the hypervisor to figure out whether it makes sense
or not.

If the async fault cannot be delivered to the guest (async PF disabled,
async PF only enabled for CPL 0, IF == 0) then the host utilizes the
same data structure and wait mechanism. That really makes sense.

The part which does not make sense in the current implementation is the
kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
schedulable and non schedulable wait mechanisms would have avoided all
the RCU duct tape nonsense and also spared me the brain damage caused by
reverse engineering this completely undocumented mess.

>> +static void kvm_async_pf_task_wait_halt(u32 token)
>> +{
>> +    struct kvm_task_sleep_node n;
>> +
>> +    if (!kvm_async_pf_queue_task(token, true, &n))
>> +        return;
>>
>>  for (;;) {
>> -        if (!n.halted)
>> -            prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>>      if (hlist_unhashed(&n.link))
>>          break;
>> +        /*
>> +         * No point in doing anything about RCU here. Any RCU read
>> +         * side critical section or RCU watching section can be
>> +         * interrupted by VMEXITs and the host is free to keep the
>> +         * vCPU scheduled out as long as it sees fit. This is not
>> +         * any different just because of the halt induced voluntary
>> +         * VMEXIT.
>> +         *
>> +         * Also the async page fault could have interrupted any RCU
>> +         * watching context, so invoking rcu_irq_exit()/enter()
>> +         * around this is not gaining anything.
>> +         */
>> +        native_safe_halt();
>> +        local_irq_disable();
>
> What’s the local_irq_disable() here for? I would believe a
> lockdep_assert_irqs_disabled() somewhere in here would make sense.
> (Yes, I see you copied this from the old code. It’s still nonsense.)

native_safe_halt() does:

         STI
         HLT

So the irq disable is required as the loop should exit with interrupts
disabled.
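
For reference, native_safe_halt() is essentially this (a sketch of the
helper from arch/x86/include/asm/irqflags.h; comment added for
illustration):

	static inline void native_safe_halt(void)
	{
		/*
		 * Enable interrupts and halt in one go. The CPU comes back
		 * from HLT with IF == 1, which is why the wait loop does
		 * local_irq_disable() right after this returns.
		 */
		asm volatile("sti; hlt" : : : "memory");
	}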

> I also find it truly bizarre that hlt actually works in this context.
> Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> pending async pf is satisfied?  This would make sense if the wake
> event were an interrupt, but it’s not according to Paolo.

See above. safe halt enables interrupts, which means IF == 1 and the
host sanity check for IF == 1 is satisfied.

In fact, if e.g. some regular interrupt arrives before the page becomes
available and the guest entered the halt loop because the fault happened
inside a RCU read side critical section with preemption enabled, then
the interrupt might wake up another task, set need resched and this
other task can run. At some point the halt waiting task gets back into
the loop and either finds the page ready or goes back into halt.

If the fault hit a preempt disabled region then the interrupt and any
resulting soft interrupts can still be handled.

Both scenarios are correct when you actually manage to mentally
disconnect regular #PF and async #PF completely.

Of course the overloading of regular #PF, the utter lack of
documentation and the crappy and duct taped implementation makes this
really a mind boggling exercise.

> All this being said, the only remotely sane case is when regs->flags
> has IF==1. Perhaps this code should actually do:
>
> WARN_ON(!(regs->flags & X86_EFLAGS_IF));

Yes, that wants to be somewhere early and also cover the async wake
case. Neither wake nor wait can be injected when IF == 0.

> while (the page isn’t ready) {
>  local_irq_enable();
>  native_safe_halt();
>  local_irq_disable();
> }
>
> with some provision to survive the case where the warning fires so we
> at least get logs.

I don't think that any attempt to survive an async #PF injection into an
interrupt disabled region makes sense aside from looking smart and being
incomprehensible voodoo.

If this ever happens then the host side is completely buggered and all
we can do is warn and pray or warn and die hard.

My personal preference is to warn and die hard.

> In any event, I just sent a patch to disable async page faults that
> happen in kernel mode.

I don't think it's justified. The host side really makes sure that the
guest does have IF == 1 before injecting anything which is not an NMI. If
the guest enables interrupts at the wrong place then this is really not
the host's problem.

Having a warning in that async pf entry for the IF == 0 injection case
is good enough IMO.
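
Illustratively, such a check in __kvm_handle_async_pf() could look like
this (just a sketch, not part of this series; whether to try to survive
or to die hard after the warning is a separate discussion):

	/* The host should never inject async #PF events with IF == 0 */
	WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF));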

Thanks,

        tglx


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 10:01     ` Thomas Gleixner
@ 2020-03-07 15:10       ` Andy Lutomirski
  2020-03-07 15:51         ` Andy Lutomirski
  2020-03-07 15:52         ` Thomas Gleixner
  0 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2020-03-07 15:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> >> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> Aside of that the actual code is a convoluted one fits it all swiss army
> >> knife. It is invoked from different places with different RCU constraints:
> >>
> >> 1) Host side:
> >>
> >>   vcpu_enter_guest()
> >>     kvm_x86_ops->handle_exit()
> >>       kvm_handle_page_fault()
> >>         kvm_async_pf_task_wait()
> >>
> >>   The invocation happens from fully preemptible context.
> >
> > I’m a bit baffled as to why the host uses this code at all instead of
> > just sleeping the old fashioned way when the guest takes a (host) page
> > fault.  Oh well.
>
> If I can trust the crystal ball which I used to decode this maze then it
> actually makes sense.
>
> Aysnc faults are faults which cannot be handled by the guest, i.e. the
> host either pulled a page away under the guest or did not populate it in
> the first place.
>
> So the reasoning is that if this happens the guest might be in a
> situation where it can schedule other tasks instead of being stopped
> completely by the host until the page arrives.
>
> Now you could argue that this mostly makes sense for CPL 0 faults, but
> there is definitely code in the kernel where it makes sense as well,
> e.g. exec. But of course as this is designed without a proper handshake
> there is no way for the hypervisor to figure out whether it makes sense
> or not.
>
> If the async fault cannot be delivered to the guest (async PF disabled,
> async PF only enabled for CPL 0, IF == 0) then the host utilizes the
> same data structure and wait mechanism. That really makes sense.
>
> The part which does not make sense in the current implementation is the
> kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
> schedulable and non schedulable wait mechanisms would have avoided all
> the RCU duct tape nonsense and also spared me the brain damage caused by
> reverse engineering this completely undocumented mess.
>
> >> +static void kvm_async_pf_task_wait_halt(u32 token)
> >> +{
> >> +    struct kvm_task_sleep_node n;
> >> +
> >> +    if (!kvm_async_pf_queue_task(token, true, &n))
> >> +        return;
> >>
> >>  for (;;) {
> >> -        if (!n.halted)
> >> -            prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> >>      if (hlist_unhashed(&n.link))
> >>          break;
> >> +        /*
> >> +         * No point in doing anything about RCU here. Any RCU read
> >> +         * side critical section or RCU watching section can be
> >> +         * interrupted by VMEXITs and the host is free to keep the
> >> +         * vCPU scheduled out as long as it sees fit. This is not
> >> +         * any different just because of the halt induced voluntary
> >> +         * VMEXIT.
> >> +         *
> >> +         * Also the async page fault could have interrupted any RCU
> >> +         * watching context, so invoking rcu_irq_exit()/enter()
> >> +         * around this is not gaining anything.
> >> +         */
> >> +        native_safe_halt();
> >> +        local_irq_disable();
> >
> > What’s the local_irq_disable() here for? I would believe a
> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
> > (Yes, I see you copied this from the old code. It’s still nonsense.)
>
> native_safe_halt() does:
>
>          STI
>          HLT
>
> So the irq disable is required as the loop should exit with interrupts
> disabled.

Oops, should have looked at what native_safe_halt() does.

>
> > I also find it truly bizarre that hlt actually works in this context.
> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> > pending async pf is satisfied?  This would make sense if the wake
> > event were an interrupt, but it’s not according to Paolo.
>
> See above. safe halt enables interrupts, which means IF == 1 and the
> host sanity check for IF == 1 is satisfied.
>
> In fact, if e.g. some regular interrupt arrives before the page becomes
> available and the guest entered the halt loop because the fault happened
> inside a RCU read side critical section with preemption enabled, then
> the interrupt might wake up another task, set need resched and this
> other task can run.

Now I'm confused again.  Your patch is very careful not to schedule if
we're in an RCU read-side critical section, but the regular preemption
code (preempt_schedule_irq, etc) seems to be willing to schedule
inside an RCU read-side critical section.  Why is the latter okay but
not the async pf case?

Ignoring that, this still seems racy:

STI
nested #PF telling us to wake up
#PF returns
HLT

doesn't this result in putting the CPU asleep for no good reason until
the next interrupt hits?


> > All this being said, the only remotely sane case is when regs->flags
> > has IF==1. Perhaps this code should actually do:
> >
> > WARN_ON(!(regs->flags & X86_EFLAGS_IF));
>
> Yes, that want's to be somewhere early and also cover the async wake
> case. Neither wake nor wait can be injected when IF == 0.

Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0.


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 15:10       ` Andy Lutomirski
@ 2020-03-07 15:51         ` Andy Lutomirski
  2020-03-07 19:18           ` Thomas Gleixner
  2020-03-07 15:52         ` Thomas Gleixner
  1 sibling, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2020-03-07 15:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

On Sat, Mar 7, 2020 at 7:10 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Andy Lutomirski <luto@kernel.org> writes:

> Now I'm confused again.  Your patch is very careful not to schedule if
> we're in an RCU read-side critical section, but the regular preemption
> code (preempt_schedule_irq, etc) seems to be willing to schedule
> inside an RCU read-side critical section.  Why is the latter okay but
> not the async pf case?

I read more docs.  I guess the relevant situation is
CONFIG_PREEMPT_RCU, in which case it is legal to preempt an RCU
read-side critical section and obviously legal to put the whole CPU to
sleep, but it's illegal to explicitly block in an RCU read-side
critical section.  So I have a question for Paul: is it, in fact,
entirely illegal to block or merely illegal to block for an
excessively long time, e.g. waiting for user space or network traffic?
 In this situation, we cannot make progress until the host says we
can, so we are, in effect, blocking until the host tells us to stop
blocking.  Regardless, I agree that turning IRQs on is reasonable, and
allowing those IRQs to preempt us is reasonable.

As it stands in your patch, the situation is rather odd: we'll run
another task if that task *preempts* us (e.g. we block long enough to
run out of our time slice), but we won't run another task if we aren't
preempted.  This seems bizarre.

>
> Ignoring that, this still seems racy:
>
> STI
> nested #PF telling us to wake up
> #PF returns
> HLT
>
> doesn't this result in putting the CPU asleep for no good reason until
> the next interrupt hits?

I think this issue still stands and is actually a fairly easy race to hit.

STI
IRQ happens and we get preempted
another task runs and gets the #PF "async pf wakeup" event
reschedule, back to original task
HLT

The only particularly unusual thing here is that an IRQ (timer or
otherwise) needs to be queued up between when the #PF "async pf sleep"
event happens and STI so that it gets delivered before HLT.

ISTM the way to fully address this is to make the logic something like:

if (preemptible) {
  actually go to sleep.  do not HLT.  Do this even in an RCU read-side
critical section.
} else {
  /* ok, we have to wait, but it's still legal to handle IRQs. */
  if (choice A) {
    keep IRQs off.  Spin until we wake up.
  } else {
    while (still need to sleep) {
      HLT (with IRQs off!)
      local_irq_enable();
      /* if an interrupt was queued, handle it. */
      local_irq_disable();
    }
  }
}


* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 15:10       ` Andy Lutomirski
  2020-03-07 15:51         ` Andy Lutomirski
@ 2020-03-07 15:52         ` Thomas Gleixner
  2020-03-07 16:06           ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-07 15:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

Andy Lutomirski <luto@kernel.org> writes:
> On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > What’s the local_irq_disable() here for? I would believe a
>> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
>> > (Yes, I see you copied this from the old code. It’s still nonsense.)
>>
>> native_safe_halt() does:
>>
>>          STI
>>          HLT
>>
>> So the irq disable is required as the loop should exit with interrupts
>> disabled.
>
> Oops, should have looked at what native_safe_halt() does.
>
>>
>> > I also find it truly bizarre that hlt actually works in this context.
>> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
>> > pending async pf is satisfied?  This would make sense if the wake
>> > event were an interrupt, but it’s not according to Paolo.
>>
>> See above. safe halt enables interrupts, which means IF == 1 and the
>> host sanity check for IF == 1 is satisfied.
>>
>> In fact, if e.g. some regular interrupt arrives before the page becomes
>> available and the guest entered the halt loop because the fault happened
>> inside a RCU read side critical section with preemption enabled, then
>> the interrupt might wake up another task, set need resched and this
>> other task can run.
>
> Now I'm confused again.  Your patch is very careful not to schedule if
> we're in an RCU read-side critical section, but the regular preemption
> code (preempt_schedule_irq, etc) seems to be willing to schedule
> inside an RCU read-side critical section.  Why is the latter okay but
> not the async pf case?

Preemption is fine, but a voluntary schedule is not. A voluntary schedule
might end up in idle if this is the last runnable task.

> Ignoring that, this still seems racy:
>
> STI
> nested #PF telling us to wake up
> #PF returns
> HLT

You will say Ooops, should have looked .... when I tell you that the
above cannot happen. From the SDM:

  If IF = 0, maskable hardware interrupts remain inhibited on the
  instruction boundary following an execution of STI.

Otherwise safe_halt would not work at all :)
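
For reference, this is roughly what native_safe_halt() in
arch/x86/include/asm/irqflags.h boils down to (minimal sketch; the real
helper carries a bit more around it):

	static inline void native_safe_halt(void)
	{
		/*
		 * The one-instruction interrupt shadow after STI guarantees
		 * that HLT executes before any maskable interrupt can be
		 * delivered, so a wakeup cannot be lost between the two.
		 */
		asm volatile("sti; hlt" : : : "memory");
	}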

> doesn't this result in putting the CPU asleep for no good reason until
> the next interrupt hits?

No :)

>
>> > All this being said, the only remotely sane case is when regs->flags
>> > has IF==1. Perhaps this code should actually do:
>> >
>> > WARN_ON(!(regs->flags & X86_EFLAGS_IF));
>>
>> Yes, that wants to be somewhere early and also cover the async wake
>> case. Neither wake nor wait can be injected when IF == 0.
>
> Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0.

WHAT? That's fundamentally broken. Can you point me to the code in
question?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 15:52         ` Thomas Gleixner
@ 2020-03-07 16:06           ` Andy Lutomirski
  2020-03-07 20:08             ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2020-03-07 16:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

On Sat, Mar 7, 2020 at 7:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > What’s the local_irq_disable() here for? I would believe a
> >> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
> >> > (Yes, I see you copied this from the old code. It’s still nonsense.)
> >>
> >> native_safe_halt() does:
> >>
> >>          STI
> >>          HLT
> >>
> >> So the irq disable is required as the loop should exit with interrupts
> >> disabled.
> >
> > Oops, should have looked at what native_safe_halt() does.
> >
> >>
> >> > I also find it truly bizarre that hlt actually works in this context.
> >> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> >> > pending async pf is satisfied?  This would make sense if the wake
> >> > event were an interrupt, but it’s not according to Paolo.
> >>
> >> See above. safe halt enables interrupts, which means IF == 1 and the
> >> host sanity check for IF == 1 is satisfied.
> >>
> >> In fact, if e.g. some regular interrupt arrives before the page becomes
> >> available and the guest entered the halt loop because the fault happened
> >> inside a RCU read side critical section with preemption enabled, then
> >> the interrupt might wake up another task, set need resched and this
> >> other task can run.
> >
> > Now I'm confused again.  Your patch is very careful not to schedule if
> > we're in an RCU read-side critical section, but the regular preemption
> > code (preempt_schedule_irq, etc) seems to be willing to schedule
> > inside an RCU read-side critical section.  Why is the latter okay but
> > not the async pf case?
>
> Preemption is fine, but a voluntary schedule is not. A voluntary schedule
> might end up in idle if this is the last runnable task.

I guess something horrible happens if we try to go idle while in an
RCU read-side critical section.  Like perhaps the idle time gets counted
as a quiescent state, ending the grace period under us.  Hmm.
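
To make the worry concrete, this is the shape of reader I have in mind
(illustrative only; foo, gp and reader() are made up and not from the patch):

	#include <linux/rcupdate.h>

	struct foo { int val; };
	static struct foo __rcu *gp;	/* some RCU-protected pointer */

	static int reader(void)
	{
		struct foo *p;
		int val = 0;

		rcu_read_lock();
		p = rcu_dereference(gp);
		/*
		 * If the async #PF "page not present" event hits here and the
		 * handler ends up idling the CPU as the last runnable task,
		 * the question is whether RCU treats that as a quiescent
		 * state and lets the grace period end while 'p' is in use.
		 */
		if (p)
			val = p->val;
		rcu_read_unlock();
		return val;
	}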

>
> > Ignoring that, this still seems racy:
> >
> > STI
> > nested #PF telling us to wake up
> > #PF returns
> > HLT
>
> You will say Ooops, should have looked .... when I tell you that the
> above cannot happen. From the SDM:
>
>   If IF = 0, maskable hardware interrupts remain inhibited on the
>   instruction boundary following an execution of STI.
>
> Otherwise safe_halt would not work at all :)

Ooops, should have looked. :)

> > Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0.
>
> WHAT? That's fundamentally broken. Can you point me to the code in
> question?

I think Paolo said so in a different thread, but I can't remember
where.  Let me see if I can find it:

kvm_pv_enable_async_pf()
  kvm_clear_async_pf_completion_queue()

but that doesn't actually seem to send #PF.  So maybe I'm wrong.

I will admit that, even after reading the host code a few times, I'm
also not convinced that wakeups don't get swallowed on occasion if
they would have been delivered at times when it's illegal.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 15:51         ` Andy Lutomirski
@ 2020-03-07 19:18           ` Thomas Gleixner
  2020-03-07 19:30             ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-07 19:18 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski
  Cc: LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

Andy Lutomirski <luto@kernel.org> writes:
> On Sat, Mar 7, 2020 at 7:10 AM Andy Lutomirski <luto@kernel.org> wrote:
>> On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >
>> > Andy Lutomirski <luto@kernel.org> writes:
>
>> Now I'm confused again.  Your patch is very careful not to schedule if
>> we're in an RCU read-side critical section, but the regular preemption
>> code (preempt_schedule_irq, etc) seems to be willing to schedule
>> inside an RCU read-side critical section.  Why is the latter okay but
>> not the async pf case?
>
> I read more docs.  I guess the relevant situation is
> CONFIG_PREEMPT_RCU, in which case it is legal to preempt an RCU
> read-side critical section and obviously legal to put the whole CPU to
> sleep, but it's illegal to explicitly block in an RCU read-side
> critical section.  So I have a question for Paul: is it, in fact,
> entirely illegal to block or merely illegal to block for an
> excessively long time, e.g. waiting for user space or network traffic?

Two issues here:

    - excessive blocking time

    - entering idle with an RCU read side critical section blocking

>  In this situation, we cannot make progress until the host says we
> can, so we are, in effect, blocking until the host tells us to stop
> blocking.  Regardless, I agree that turning IRQs on is reasonable, and
> allowing those IRQs to preempt us is reasonable.
>
> As it stands in your patch, the situation is rather odd: we'll run
> another task if that task *preempts* us (e.g. we block long enough to
> run out of our time slice), but we won't run another task if we aren't
> preempted.  This seems bizarre.

Yes, it looks odd. We could do:

	preempt_disable();
	while (!page_arrived()) {
		if (preempt_count() == 1 && this_cpu_runnable_tasks() > 1) {
			set_need_resched();
			schedule_preempt_disabled();
		} else {
			native_safe_halt();
			local_irq_disable();
		}
	}
	preempt_enable();

Don't know if it's worth the trouble. But that's not the problem :)

> I think this issue still stands and is actually a fairly easy race to hit.
>
> STI
> IRQ happens and we get preempted
> another task runs and gets the #PF "async pf wakeup" event
> reschedule, back to original task
> HLT

See the other mail about STI :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 19:18           ` Thomas Gleixner
@ 2020-03-07 19:30             ` Andy Lutomirski
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2020-03-07 19:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

On Sat, Mar 7, 2020 at 11:18 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Sat, Mar 7, 2020 at 7:10 AM Andy Lutomirski <luto@kernel.org> wrote:
> >> On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> >
> >> > Andy Lutomirski <luto@kernel.org> writes:
> >
> >> Now I'm confused again.  Your patch is very careful not to schedule if
> >> we're in an RCU read-side critical section, but the regular preemption
> >> code (preempt_schedule_irq, etc) seems to be willing to schedule
> >> inside an RCU read-side critical section.  Why is the latter okay but
> >> not the async pf case?
> >
> > I read more docs.  I guess the relevant situation is
> > CONFIG_PREEMPT_RCU, in which case it is legal to preempt an RCU
> > read-side critical section and obviously legal to put the whole CPU to
> > sleep, but it's illegal to explicitly block in an RCU read-side
> > critical section.  So I have a question for Paul: is it, in fact,
> > entirely illegal to block or merely illegal to block for an
> > excessively long time, e.g. waiting for user space or network traffic?
>
> Two issues here:
>
>     - excessive blocking time

We can't do anything about this.  We are blocked until the host says
otherwise, and the critical section cannot end until the host lets it
end.

>
>     - entering idle with an RCU read side critical section blocking

We could surely make this work.  I'm not at all convinced it's
worthwhile, though.

>
> >  In this situation, we cannot make progress until the host says we
> > can, so we are, in effect, blocking until the host tells us to stop
> > blocking.  Regardless, I agree that turning IRQs on is reasonable, and
> > allowing those IRQs to preempt us is reasonable.
> >
> > As it stands in your patch, the situation is rather odd: we'll run
> > another task if that task *preempts* us (e.g. we block long enough to
> > run out of our time slice), but we won't run another task if we aren't
> > preempted.  This seems bizarre.
>
> Yes, it looks odd. We could do:
>
>         preempt_disable();
>         while (!page_arrived()) {
>                 if (preempt_count() == 1 && this_cpu_runnable_tasks() > 1) {
>                         set_need_resched();
>                         schedule_preempt_disabled();

The downside here is that the scheduler may immediately reschedule us,
thus accomplishing nothing whatsoever.

>                 } else {
>                         native_safe_halt();
>                         local_irq_disable();
>                 }
>         }
>         preempt_enable();
>
> Don't know if it's worth the trouble. But that's not the problem :)

I suspect that we should either declare it entirely not worth the
trouble and do it like in your patch or we should teach preempt-rcu to
handle the special case of going idle while in a read-side critical
section.  For all I know, the latter is trivial, but it could easily
be a total disaster.  Paul?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()
  2020-03-07 16:06           ` Andy Lutomirski
@ 2020-03-07 20:08             ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2020-03-07 20:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, LKML, X86 ML, Paolo Bonzini, KVM, Paul E. McKenney

Andy Lutomirski <luto@kernel.org> writes:
> On Sat, Mar 7, 2020 at 7:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> WHAT? That's fundamentally broken. Can you point me to the code in
>> question?
>
> I think Paolo said so in a different thread, but I can't remember
> where.  Let me see if I can find it:
>
> kvm_pv_enable_async_pf()
>   kvm_clear_async_pf_completion_queue()
>
> but that doesn't actually seem to send #PF.  So maybe I'm wrong.
>
> I will admit that, even after reading the host code a few times, I'm
> also not convinced that wakeups don't get swallowed on occasion if
> they would have been delivered at times when it's illegal.

I gave up trying to decode it. I'll wait for the kvm wizards to answer
all our nasty questions :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-03-07 20:09 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-06 23:42 [patch 0/2] x86/kvm: Sanitize async page fault Thomas Gleixner
2020-03-06 23:42 ` [patch 1/2] x86/kvm: Handle async page faults directly through do_page_fault() Thomas Gleixner
2020-03-06 23:42 ` [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait() Thomas Gleixner
2020-03-07  0:22   ` Paul E. McKenney
2020-03-07  1:02     ` Thomas Gleixner
2020-03-07  3:48       ` Paul E. McKenney
2020-03-07  2:18   ` Andy Lutomirski
2020-03-07 10:01     ` Thomas Gleixner
2020-03-07 15:10       ` Andy Lutomirski
2020-03-07 15:51         ` Andy Lutomirski
2020-03-07 19:18           ` Thomas Gleixner
2020-03-07 19:30             ` Andy Lutomirski
2020-03-07 15:52         ` Thomas Gleixner
2020-03-07 16:06           ` Andy Lutomirski
2020-03-07 20:08             ` Thomas Gleixner
