linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] powerpc/xive: Fix race condition leading to host crashes and hangs
@ 2019-08-13  9:58 Paul Mackerras
  2019-08-13 10:01 ` [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device Paul Mackerras
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Paul Mackerras @ 2019-08-13  9:58 UTC (permalink / raw)
  To: linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

This series fixes a race condition that has been observed in testing
on POWER9 machines running KVM guests.  An interrupt being freed by
free_irq() can have an instance present in a XIVE interrupt queue,
which can then be presented to the generic interrupt code after the
data structures for it have been freed, leading to a variety of
crashes and hangs.

This series is based on current upstream kernel source plus Cédric Le
Goater's patch "KVM: PPC: Book3S HV: XIVE: Free escalation interrupts
before disabling the VP", which is a pre-requisite for this series.
As it touches both KVM and generic PPC code, this series will probably
go in via Michael Ellerman's powerpc tree.

V2 of this patch series adds a patch fixing a bug noticed by Cédric,
and also fixes a bug in patch 1/2 of the v1 series.

Paul.

 arch/powerpc/include/asm/xive.h         |  8 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 38 +++++++++-----
 arch/powerpc/kvm/book3s_xive.c          | 42 +++++++++++++++-
 arch/powerpc/kvm/book3s_xive.h          |  2 +
 arch/powerpc/kvm/book3s_xive_native.c   |  6 +++
 arch/powerpc/sysdev/xive/common.c       | 87 ++++++++++++++++++++++++---------
 6 files changed, 146 insertions(+), 37 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device
  2019-08-13  9:58 [PATCH v2 0/3] powerpc/xive: Fix race condition leading to host crashes and hangs Paul Mackerras
@ 2019-08-13 10:01 ` Paul Mackerras
  2019-08-13 12:18   ` Cédric Le Goater
  2019-08-22 10:46   ` Michael Ellerman
  2019-08-13 10:03 ` [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts Paul Mackerras
  2019-08-13 10:06 ` [PATCH v2 3/3] powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race Paul Mackerras
  2 siblings, 2 replies; 10+ messages in thread
From: Paul Mackerras @ 2019-08-13 10:01 UTC (permalink / raw)
  To: linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

At present, when running a guest on POWER9 using HV KVM but not using
an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
is run with the kernel_irqchip=off option, the guest entry code goes
ahead and tries to load the guest context into the XIVE hardware, even
though no context has been set up.

To fix this, we check that the "CAM word" is non-zero before pushing
it to the hardware.  The CAM word is initialized to a non-zero value
in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.

Cc: stable@vger.kernel.org # v4.11+
Reported-by: Cédric Le Goater <clg@kaod.org>
Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  2 ++
 arch/powerpc/kvm/book3s_xive.c          | 11 ++++++++++-
 arch/powerpc/kvm/book3s_xive_native.c   |  3 +++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 2e7e788..07181d0 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -942,6 +942,8 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
 	ld	r11, VCPU_XIVE_SAVED_STATE(r4)
 	li	r9, TM_QW1_OS
 	lwz	r8, VCPU_XIVE_CAM_WORD(r4)
+	cmpwi	r8, 0
+	beq	no_xive
 	li	r7, TM_QW1_OS + TM_WORD2
 	mfmsr	r0
 	andi.	r0, r0, MSR_DR		/* in real mode? */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 09f838a..586867e 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -67,8 +67,14 @@ void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu)
 	void __iomem *tima = local_paca->kvm_hstate.xive_tima_virt;
 	u64 pq;
 
-	if (!tima)
+	/*
+	 * Nothing to do if the platform doesn't have a XIVE
+	 * or this vCPU doesn't have its own XIVE context
+	 * (e.g. because it's not using an in-kernel interrupt controller).
+	 */
+	if (!tima || !vcpu->arch.xive_cam_word)
 		return;
+
 	eieio();
 	__raw_writeq(vcpu->arch.xive_saved_state.w01, tima + TM_QW1_OS);
 	__raw_writel(vcpu->arch.xive_cam_word, tima + TM_QW1_OS + TM_WORD2);
@@ -1146,6 +1152,9 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	/* Disable the VP */
 	xive_native_disable_vp(xc->vp_id);
 
+	/* Clear the cam word so guest entry won't try to push context */
+	vcpu->arch.xive_cam_word = 0;
+
 	/* Free the queues */
 	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
 		struct xive_q *q = &xc->queues[i];
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 368427f..11b91b4 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -81,6 +81,9 @@ void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	/* Disable the VP */
 	xive_native_disable_vp(xc->vp_id);
 
+	/* Clear the cam word so guest entry won't try to push context */
+	vcpu->arch.xive_cam_word = 0;
+
 	/* Free the queues */
 	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
 		kvmppc_xive_native_cleanup_queue(vcpu, i);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts
  2019-08-13  9:58 [PATCH v2 0/3] powerpc/xive: Fix race condition leading to host crashes and hangs Paul Mackerras
  2019-08-13 10:01 ` [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device Paul Mackerras
@ 2019-08-13 10:03 ` Paul Mackerras
  2019-08-14  4:46   ` Jordan Niethe
  2019-08-22 10:46   ` Michael Ellerman
  2019-08-13 10:06 ` [PATCH v2 3/3] powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race Paul Mackerras
  2 siblings, 2 replies; 10+ messages in thread
From: Paul Mackerras @ 2019-08-13 10:03 UTC (permalink / raw)
  To: linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

Escalation interrupts are interrupts sent to the host by the XIVE
hardware when it has an interrupt to deliver to a guest VCPU but that
VCPU is not running anywhere in the system.  Hence we disable the
escalation interrupt for the VCPU being run when we enter the guest
and re-enable it when the guest does an H_CEDE hypercall indicating
it is idle.

It is possible that an escalation interrupt gets generated just as we
are entering the guest.  In that case the escalation interrupt may be
using a queue entry in one of the interrupt queues, and that queue
entry may not have been processed when the guest exits with an H_CEDE.
The existing entry code detects this situation and does not clear the
vcpu->arch.xive_esc_on flag as an indication that there is a pending
queue entry (if the queue entry gets processed, xive_esc_irq() will
clear the flag).  There is a comment in the code saying that if the
flag is still set on H_CEDE, we have to abort the cede rather than
re-enabling the escalation interrupt, lest we end up with two
occurrences of the escalation interrupt in the interrupt queue.

However, the exit code doesn't do that; it aborts the cede in the sense
that vcpu->arch.ceded gets cleared, but it still enables the escalation
interrupt by setting the source's PQ bits to 00.  Instead we need to
set the PQ bits to 10, indicating that an interrupt has been triggered.
We also need to avoid setting vcpu->arch.xive_esc_on in this case
(i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
xive_esc_irq() will run at some point and clear it, and if we race with
that we may end up with an incorrect result (i.e. xive_esc_on set when
the escalation interrupt has just been handled).

It is extremely unlikely that having two queue entries would cause
observable problems; theoretically it could cause queue overflow, but
the CPU would have to have thousands of interrupts targetted to it for
that to be possible.  However, this fix will also make it possible to
determine accurately whether there is an unhandled escalation
interrupt in the queue, which will be needed by the following patch.

Cc: stable@vger.kernel.org # v4.16+
Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
v2: don't set xive_esc_on if we're not using a XIVE escalation
interrupt.

 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 36 +++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 337e644..2e7e788 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2831,29 +2831,39 @@ kvm_cede_prodded:
 kvm_cede_exit:
 	ld	r9, HSTATE_KVM_VCPU(r13)
 #ifdef CONFIG_KVM_XICS
-	/* Abort if we still have a pending escalation */
+	/* are we using XIVE with single escalation? */
+	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
+	cmpdi	r10, 0
+	beq	3f
+	li	r6, XIVE_ESB_SET_PQ_00
+	/*
+	 * If we still have a pending escalation, abort the cede,
+	 * and we must set PQ to 10 rather than 00 so that we don't
+	 * potentially end up with two entries for the escalation
+	 * interrupt in the XIVE interrupt queue.  In that case
+	 * we also don't want to set xive_esc_on to 1 here in
+	 * case we race with xive_esc_irq().
+	 */
 	lbz	r5, VCPU_XIVE_ESC_ON(r9)
 	cmpwi	r5, 0
-	beq	1f
+	beq	4f
 	li	r0, 0
 	stb	r0, VCPU_CEDED(r9)
-1:	/* Enable XIVE escalation */
-	li	r5, XIVE_ESB_SET_PQ_00
+	li	r6, XIVE_ESB_SET_PQ_10
+	b	5f
+4:	li	r0, 1
+	stb	r0, VCPU_XIVE_ESC_ON(r9)
+	/* make sure store to xive_esc_on is seen before xive_esc_irq runs */
+	sync
+5:	/* Enable XIVE escalation */
 	mfmsr	r0
 	andi.	r0, r0, MSR_DR		/* in real mode? */
 	beq	1f
-	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
-	cmpdi	r10, 0
-	beq	3f
-	ldx	r0, r10, r5
+	ldx	r0, r10, r6
 	b	2f
 1:	ld	r10, VCPU_XIVE_ESC_RADDR(r9)
-	cmpdi	r10, 0
-	beq	3f
-	ldcix	r0, r10, r5
+	ldcix	r0, r10, r6
 2:	sync
-	li	r0, 1
-	stb	r0, VCPU_XIVE_ESC_ON(r9)
 #endif /* CONFIG_KVM_XICS */
 3:	b	guest_exit_cont
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 3/3] powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race
  2019-08-13  9:58 [PATCH v2 0/3] powerpc/xive: Fix race condition leading to host crashes and hangs Paul Mackerras
  2019-08-13 10:01 ` [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device Paul Mackerras
  2019-08-13 10:03 ` [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts Paul Mackerras
@ 2019-08-13 10:06 ` Paul Mackerras
  2019-08-22 10:46   ` Michael Ellerman
  2 siblings, 1 reply; 10+ messages in thread
From: Paul Mackerras @ 2019-08-13 10:06 UTC (permalink / raw)
  To: linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

Testing has revealed the existence of a race condition where a XIVE
interrupt being shut down can be in one of the XIVE interrupt queues
(of which there are up to 8 per CPU, one for each priority) at the
point where free_irq() is called.  If this happens, can return an
interrupt number which has been shut down.  This can lead to various
symptoms:

- irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
  function gets called, resulting in the CPU's elevated interrupt
  priority (numerically lowered CPPR) never gets reset.  That then
  means that the CPU stops processing interrupts, causing device
  timeouts and other errors in various device drivers.

- The irq descriptor or related data structures can be in the process
  of being freed as the interrupt code is using them.  This typically
  leads to crashes due to bad pointer dereferences.

This race is basically what commit 62e0468650c3 ("genirq: Add optional
hardware synchronization for shutdown", 2019-06-28) is intended to
fix, given a get_irqchip_state() method for the interrupt controller
being used.  It works by polling the interrupt controller when an
interrupt is being freed until the controller says it is not pending.

With XIVE, the PQ bits of the interrupt source indicate the state of
the interrupt source, and in particular the P bit goes from 0 to 1 at
the point where the hardware writes an entry into the interrupt queue
that this interrupt is directed towards.  Normally, the code will then
process the interrupt and do an end-of-interrupt (EOI) operation which
will reset PQ to 00 (assuming another interrupt hasn't been generated
in the meantime).  However, there are situations where the code resets
P even though a queue entry exists (for example, by setting PQ to 01,
which disables the interrupt source), and also situations where the
code leaves P at 1 after removing the queue entry (for example, this
is done for escalation interrupts so they cannot fire again until
they are explicitly re-enabled).

The code already has a 'saved_p' flag for the interrupt source which
indicates that a queue entry exists, although it isn't maintained
consistently.  This patch adds a 'stale_p' flag to indicate that
P has been left at 1 after processing a queue entry, and adds code
to set and clear saved_p and stale_p as necessary to maintain a
consistent indication of whether a queue entry may or may not exist.

With this, we can implement xive_get_irqchip_state() by looking at
stale_p, saved_p and the ESB PQ bits for the interrupt.

There is some additional code to handle escalation interrupts
properly; because they are enabled and disabled in KVM assembly code,
which does not have access to the xive_irq_data struct for the
escalation interrupt.  Hence, stale_p may be incorrect when the
escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
with some careful attention to barriers in order to ensure the correct
result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().

Finally, this adds code to make noise on the console (pr_crit and
WARN_ON(1)) if we find an interrupt queue entry for an interrupt
which does not have a descriptor.  While this won't catch the race
reliably, if it does get triggered it will be an indication that
the race is occurring and needs to be debugged.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
v2: call xive_cleanup_single_escalation
from kvmppc_xive_native_cleanup_vcpu() too.

 arch/powerpc/include/asm/xive.h       |  8 ++++
 arch/powerpc/kvm/book3s_xive.c        | 31 +++++++++++++
 arch/powerpc/kvm/book3s_xive.h        |  2 +
 arch/powerpc/kvm/book3s_xive_native.c |  3 ++
 arch/powerpc/sysdev/xive/common.c     | 87 ++++++++++++++++++++++++++---------
 5 files changed, 108 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index e401698..efb0e59 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -46,7 +46,15 @@ struct xive_irq_data {
 
 	/* Setup/used by frontend */
 	int target;
+	/*
+	 * saved_p means that there is a queue entry for this interrupt
+	 * in some CPU's queue (not including guest vcpu queues), even
+	 * if P is not set in the source ESB.
+	 * stale_p means that there is no queue entry for this interrupt
+	 * in some CPU's queue, even if P is set in the source ESB.
+	 */
 	bool saved_p;
+	bool stale_p;
 };
 #define XIVE_IRQ_FLAG_STORE_EOI	0x01
 #define XIVE_IRQ_FLAG_LSI	0x02
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 586867e..591bfb4 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -166,6 +166,9 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
 	 */
 	vcpu->arch.xive_esc_on = false;
 
+	/* This orders xive_esc_on = false vs. subsequent stale_p = true */
+	smp_wmb();	/* goes with smp_mb() in cleanup_single_escalation */
+
 	return IRQ_HANDLED;
 }
 
@@ -1119,6 +1122,31 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 	vcpu->arch.xive_esc_raddr = 0;
 }
 
+/*
+ * In single escalation mode, the escalation interrupt is marked so
+ * that EOI doesn't re-enable it, but just sets the stale_p flag to
+ * indicate that the P bit has already been dealt with.  However, the
+ * assembly code that enters the guest sets PQ to 00 without clearing
+ * stale_p (because it has no easy way to address it).  Hence we have
+ * to adjust stale_p before shutting down the interrupt.
+ */
+void xive_cleanup_single_escalation(struct kvm_vcpu *vcpu,
+				    struct kvmppc_xive_vcpu *xc, int irq)
+{
+	struct irq_data *d = irq_get_irq_data(irq);
+	struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
+
+	/*
+	 * This slightly odd sequence gives the right result
+	 * (i.e. stale_p set if xive_esc_on is false) even if
+	 * we race with xive_esc_irq() and xive_irq_eoi().
+	 */
+	xd->stale_p = false;
+	smp_mb();		/* paired with smb_wmb in xive_esc_irq */
+	if (!vcpu->arch.xive_esc_on)
+		xd->stale_p = true;
+}
+
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
@@ -1143,6 +1171,9 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	/* Free escalations */
 	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
 		if (xc->esc_virq[i]) {
+			if (xc->xive->single_escalation)
+				xive_cleanup_single_escalation(vcpu, xc,
+							xc->esc_virq[i]);
 			free_irq(xc->esc_virq[i], vcpu);
 			irq_dispose_mapping(xc->esc_virq[i]);
 			kfree(xc->esc_virq_names[i]);
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 50494d0..955b820 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -282,6 +282,8 @@ int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
 int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
 				  bool single_escalation);
 struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type);
+void xive_cleanup_single_escalation(struct kvm_vcpu *vcpu,
+				    struct kvmppc_xive_vcpu *xc, int irq);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 11b91b4..f0cab43 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -71,6 +71,9 @@ void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
 		/* Free the escalation irq */
 		if (xc->esc_virq[i]) {
+			if (xc->xive->single_escalation)
+				xive_cleanup_single_escalation(vcpu, xc,
+							xc->esc_virq[i]);
 			free_irq(xc->esc_virq[i], vcpu);
 			irq_dispose_mapping(xc->esc_virq[i]);
 			kfree(xc->esc_virq_names[i]);
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 1cdb395..be86fce 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -135,7 +135,7 @@ static u32 xive_read_eq(struct xive_q *q, bool just_peek)
 static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 {
 	u32 irq = 0;
-	u8 prio;
+	u8 prio = 0;
 
 	/* Find highest pending priority */
 	while (xc->pending_prio != 0) {
@@ -148,8 +148,19 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 		irq = xive_read_eq(&xc->queue[prio], just_peek);
 
 		/* Found something ? That's it */
-		if (irq)
-			break;
+		if (irq) {
+			if (just_peek || irq_to_desc(irq))
+				break;
+			/*
+			 * We should never get here; if we do then we must
+			 * have failed to synchronize the interrupt properly
+			 * when shutting it down.
+			 */
+			pr_crit("xive: got interrupt %d without descriptor, dropping\n",
+				irq);
+			WARN_ON(1);
+			continue;
+		}
 
 		/* Clear pending bits */
 		xc->pending_prio &= ~(1 << prio);
@@ -307,6 +318,7 @@ static void xive_do_queue_eoi(struct xive_cpu *xc)
  */
 static void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 {
+	xd->stale_p = false;
 	/* If the XIVE supports the new "store EOI facility, use it */
 	if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
 		xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
@@ -350,7 +362,7 @@ static void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 	}
 }
 
-/* irq_chip eoi callback */
+/* irq_chip eoi callback, called with irq descriptor lock held */
 static void xive_irq_eoi(struct irq_data *d)
 {
 	struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
@@ -366,6 +378,8 @@ static void xive_irq_eoi(struct irq_data *d)
 	if (!irqd_irq_disabled(d) && !irqd_is_forwarded_to_vcpu(d) &&
 	    !(xd->flags & XIVE_IRQ_NO_EOI))
 		xive_do_source_eoi(irqd_to_hwirq(d), xd);
+	else
+		xd->stale_p = true;
 
 	/*
 	 * Clear saved_p to indicate that it's no longer occupying
@@ -397,11 +411,16 @@ static void xive_do_source_set_mask(struct xive_irq_data *xd,
 	 */
 	if (mask) {
 		val = xive_esb_read(xd, XIVE_ESB_SET_PQ_01);
-		xd->saved_p = !!(val & XIVE_ESB_VAL_P);
-	} else if (xd->saved_p)
+		if (!xd->stale_p && !!(val & XIVE_ESB_VAL_P))
+			xd->saved_p = true;
+		xd->stale_p = false;
+	} else if (xd->saved_p) {
 		xive_esb_read(xd, XIVE_ESB_SET_PQ_10);
-	else
+		xd->saved_p = false;
+	} else {
 		xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
+		xd->stale_p = false;
+	}
 }
 
 /*
@@ -541,6 +560,8 @@ static unsigned int xive_irq_startup(struct irq_data *d)
 	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 	int target, rc;
 
+	xd->saved_p = false;
+	xd->stale_p = false;
 	pr_devel("xive_irq_startup: irq %d [0x%x] data @%p\n",
 		 d->irq, hw_irq, d);
 
@@ -587,6 +608,7 @@ static unsigned int xive_irq_startup(struct irq_data *d)
 	return 0;
 }
 
+/* called with irq descriptor lock held */
 static void xive_irq_shutdown(struct irq_data *d)
 {
 	struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
@@ -602,16 +624,6 @@ static void xive_irq_shutdown(struct irq_data *d)
 	xive_do_source_set_mask(xd, true);
 
 	/*
-	 * The above may have set saved_p. We clear it otherwise it
-	 * will prevent re-enabling later on. It is ok to forget the
-	 * fact that the interrupt might be in a queue because we are
-	 * accounting that already in xive_dec_target_count() and will
-	 * be re-routing it to a new queue with proper accounting when
-	 * it's started up again
-	 */
-	xd->saved_p = false;
-
-	/*
 	 * Mask the interrupt in HW in the IVT/EAS and set the number
 	 * to be the "bad" IRQ number
 	 */
@@ -797,6 +809,10 @@ static int xive_irq_retrigger(struct irq_data *d)
 	return 1;
 }
 
+/*
+ * Caller holds the irq descriptor lock, so this won't be called
+ * concurrently with xive_get_irqchip_state on the same interrupt.
+ */
 static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 {
 	struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
@@ -820,6 +836,10 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 
 		/* Set it to PQ=10 state to prevent further sends */
 		pq = xive_esb_read(xd, XIVE_ESB_SET_PQ_10);
+		if (!xd->stale_p) {
+			xd->saved_p = !!(pq & XIVE_ESB_VAL_P);
+			xd->stale_p = !xd->saved_p;
+		}
 
 		/* No target ? nothing to do */
 		if (xd->target == XIVE_INVALID_TARGET) {
@@ -827,7 +847,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 			 * An untargetted interrupt should have been
 			 * also masked at the source
 			 */
-			WARN_ON(pq & 2);
+			WARN_ON(xd->saved_p);
 
 			return 0;
 		}
@@ -847,9 +867,8 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 		 * This saved_p is cleared by the host EOI, when we know
 		 * for sure the queue slot is no longer in use.
 		 */
-		if (pq & 2) {
-			pq = xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
-			xd->saved_p = true;
+		if (xd->saved_p) {
+			xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
 
 			/*
 			 * Sync the XIVE source HW to ensure the interrupt
@@ -862,8 +881,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 			 */
 			if (xive_ops->sync_source)
 				xive_ops->sync_source(hw_irq);
-		} else
-			xd->saved_p = false;
+		}
 	} else {
 		irqd_clr_forwarded_to_vcpu(d);
 
@@ -914,6 +932,23 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 	return 0;
 }
 
+/* Called with irq descriptor lock held. */
+static int xive_get_irqchip_state(struct irq_data *data,
+				  enum irqchip_irq_state which, bool *state)
+{
+	struct xive_irq_data *xd = irq_data_get_irq_handler_data(data);
+
+	switch (which) {
+	case IRQCHIP_STATE_ACTIVE:
+		*state = !xd->stale_p &&
+			 (xd->saved_p ||
+			  !!(xive_esb_read(xd, XIVE_ESB_GET) & XIVE_ESB_VAL_P));
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static struct irq_chip xive_irq_chip = {
 	.name = "XIVE-IRQ",
 	.irq_startup = xive_irq_startup,
@@ -925,6 +960,7 @@ static struct irq_chip xive_irq_chip = {
 	.irq_set_type = xive_irq_set_type,
 	.irq_retrigger = xive_irq_retrigger,
 	.irq_set_vcpu_affinity = xive_irq_set_vcpu_affinity,
+	.irq_get_irqchip_state = xive_get_irqchip_state,
 };
 
 bool is_xive_irq(struct irq_chip *chip)
@@ -1338,6 +1374,11 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc)
 		xd = irq_desc_get_handler_data(desc);
 
 		/*
+		 * Clear saved_p to indicate that it's no longer pending
+		 */
+		xd->saved_p = false;
+
+		/*
 		 * For LSIs, we EOI, this will cause a resend if it's
 		 * still asserted. Otherwise do an MSI retrigger.
 		 */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device
  2019-08-13 10:01 ` [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device Paul Mackerras
@ 2019-08-13 12:18   ` Cédric Le Goater
  2019-08-22 10:46   ` Michael Ellerman
  1 sibling, 0 replies; 10+ messages in thread
From: Cédric Le Goater @ 2019-08-13 12:18 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

On 13/08/2019 12:01, Paul Mackerras wrote:
> At present, when running a guest on POWER9 using HV KVM but not using
> an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
> is run with the kernel_irqchip=off option, the guest entry code goes
> ahead and tries to load the guest context into the XIVE hardware, even
> though no context has been set up.
> 
> To fix this, we check that the "CAM word" is non-zero before pushing
> it to the hardware.  The CAM word is initialized to a non-zero value
> in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
> and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.

If a "CAM word" is defined, it means the vCPU (VP) was enabled at the
XIVE HW level. So this is the criteria to consider that a vCPU needs
to update (push) its XIVE thread interrupt context when scheduled
to run.


Reviewed-by: Cédric Le Goater <clg@kaod.org>

Thanks,

C.

> 
> Cc: stable@vger.kernel.org # v4.11+
> Reported-by: Cédric Le Goater <clg@kaod.org>
> Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |  2 ++
>  arch/powerpc/kvm/book3s_xive.c          | 11 ++++++++++-
>  arch/powerpc/kvm/book3s_xive_native.c   |  3 +++
>  3 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 2e7e788..07181d0 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -942,6 +942,8 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
>  	ld	r11, VCPU_XIVE_SAVED_STATE(r4)
>  	li	r9, TM_QW1_OS
>  	lwz	r8, VCPU_XIVE_CAM_WORD(r4)
> +	cmpwi	r8, 0
> +	beq	no_xive
>  	li	r7, TM_QW1_OS + TM_WORD2
>  	mfmsr	r0
>  	andi.	r0, r0, MSR_DR		/* in real mode? */
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 09f838a..586867e 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -67,8 +67,14 @@ void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu)
>  	void __iomem *tima = local_paca->kvm_hstate.xive_tima_virt;
>  	u64 pq;
>  
> -	if (!tima)
> +	/*
> +	 * Nothing to do if the platform doesn't have a XIVE
> +	 * or this vCPU doesn't have its own XIVE context
> +	 * (e.g. because it's not using an in-kernel interrupt controller).
> +	 */
> +	if (!tima || !vcpu->arch.xive_cam_word)
>  		return;
> +
>  	eieio();
>  	__raw_writeq(vcpu->arch.xive_saved_state.w01, tima + TM_QW1_OS);
>  	__raw_writel(vcpu->arch.xive_cam_word, tima + TM_QW1_OS + TM_WORD2);
> @@ -1146,6 +1152,9 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
>  	/* Disable the VP */
>  	xive_native_disable_vp(xc->vp_id);
>  
> +	/* Clear the cam word so guest entry won't try to push context */
> +	vcpu->arch.xive_cam_word = 0;
> +
>  	/* Free the queues */
>  	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
>  		struct xive_q *q = &xc->queues[i];
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 368427f..11b91b4 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -81,6 +81,9 @@ void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
>  	/* Disable the VP */
>  	xive_native_disable_vp(xc->vp_id);
>  
> +	/* Clear the cam word so guest entry won't try to push context */
> +	vcpu->arch.xive_cam_word = 0;
> +
>  	/* Free the queues */
>  	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
>  		kvmppc_xive_native_cleanup_queue(vcpu, i);
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts
  2019-08-13 10:03 ` [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts Paul Mackerras
@ 2019-08-14  4:46   ` Jordan Niethe
  2019-08-14  6:05     ` Paul Mackerras
  2019-08-22 10:46   ` Michael Ellerman
  1 sibling, 1 reply; 10+ messages in thread
From: Jordan Niethe @ 2019-08-14  4:46 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

On Tue, 2019-08-13 at 20:03 +1000, Paul Mackerras wrote:
> Escalation interrupts are interrupts sent to the host by the XIVE
> hardware when it has an interrupt to deliver to a guest VCPU but that
> VCPU is not running anywhere in the system.  Hence we disable the
> escalation interrupt for the VCPU being run when we enter the guest
> and re-enable it when the guest does an H_CEDE hypercall indicating
> it is idle.
> 
> It is possible that an escalation interrupt gets generated just as we
> are entering the guest.  In that case the escalation interrupt may be
> using a queue entry in one of the interrupt queues, and that queue
> entry may not have been processed when the guest exits with an
> H_CEDE.
> The existing entry code detects this situation and does not clear the
> vcpu->arch.xive_esc_on flag as an indication that there is a pending
> queue entry (if the queue entry gets processed, xive_esc_irq() will
> clear the flag).  There is a comment in the code saying that if the
> flag is still set on H_CEDE, we have to abort the cede rather than
> re-enabling the escalation interrupt, lest we end up with two
> occurrences of the escalation interrupt in the interrupt queue.
> 
> However, the exit code doesn't do that; it aborts the cede in the
> sense
> that vcpu->arch.ceded gets cleared, but it still enables the
> escalation
> interrupt by setting the source's PQ bits to 00.  Instead we need to
> set the PQ bits to 10, indicating that an interrupt has been
> triggered.
> We also need to avoid setting vcpu->arch.xive_esc_on in this case
> (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
> xive_esc_irq() will run at some point and clear it, and if we race
> with
> that we may end up with an incorrect result (i.e. xive_esc_on set
> when
> the escalation interrupt has just been handled).
> 
> It is extremely unlikely that having two queue entries would cause
> observable problems; theoretically it could cause queue overflow, but
> the CPU would have to have thousands of interrupts targetted to it
> for
> that to be possible.  However, this fix will also make it possible to
> determine accurately whether there is an unhandled escalation
> interrupt in the queue, which will be needed by the following patch.
> 
> Cc: stable@vger.kernel.org # v4.16+
> Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation
> interrupt masked unless ceded")
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> ---
> v2: don't set xive_esc_on if we're not using a XIVE escalation
> interrupt.
> 
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 36 +++++++++++++++++++++
> ------------
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 337e644..2e7e788 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -2831,29 +2831,39 @@ kvm_cede_prodded:
>  kvm_cede_exit:
>  	ld	r9, HSTATE_KVM_VCPU(r13)
>  #ifdef CONFIG_KVM_XICS
> -	/* Abort if we still have a pending escalation */
> +	/* are we using XIVE with single escalation? */
> +	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> +	cmpdi	r10, 0
> +	beq	3f
> +	li	r6, XIVE_ESB_SET_PQ_00
Would it make sense to put the above instruction down into the 4: label
instead? If we do not branch to 4, r6 is overwriten anyway. 
I think that would save a load when we do not branch to 4. Also it
would mean that you could use r5 everywhere instead of changing it to
r6? 
> +	/*
> +	 * If we still have a pending escalation, abort the cede,
> +	 * and we must set PQ to 10 rather than 00 so that we don't
> +	 * potentially end up with two entries for the escalation
> +	 * interrupt in the XIVE interrupt queue.  In that case
> +	 * we also don't want to set xive_esc_on to 1 here in
> +	 * case we race with xive_esc_irq().
> +	 */
>  	lbz	r5, VCPU_XIVE_ESC_ON(r9)
>  	cmpwi	r5, 0
> -	beq	1f
> +	beq	4f
>  	li	r0, 0
>  	stb	r0, VCPU_CEDED(r9)
> -1:	/* Enable XIVE escalation */
> -	li	r5, XIVE_ESB_SET_PQ_00
> +	li	r6, XIVE_ESB_SET_PQ_10
> +	b	5f
> +4:	li	r0, 1
> +	stb	r0, VCPU_XIVE_ESC_ON(r9)
> +	/* make sure store to xive_esc_on is seen before xive_esc_irq
> runs */
> +	sync
> +5:	/* Enable XIVE escalation */
>  	mfmsr	r0
>  	andi.	r0, r0, MSR_DR		/* in real mode? */
>  	beq	1f
> -	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> -	cmpdi	r10, 0
> -	beq	3f
> -	ldx	r0, r10, r5
> +	ldx	r0, r10, r6
>  	b	2f
>  1:	ld	r10, VCPU_XIVE_ESC_RADDR(r9)
> -	cmpdi	r10, 0
> -	beq	3f
> -	ldcix	r0, r10, r5
> +	ldcix	r0, r10, r6
>  2:	sync
> -	li	r0, 1
> -	stb	r0, VCPU_XIVE_ESC_ON(r9)
>  #endif /* CONFIG_KVM_XICS */
>  3:	b	guest_exit_cont
>  


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts
  2019-08-14  4:46   ` Jordan Niethe
@ 2019-08-14  6:05     ` Paul Mackerras
  0 siblings, 0 replies; 10+ messages in thread
From: Paul Mackerras @ 2019-08-14  6:05 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, kvm-ppc, kvm, David Gibson

On Wed, Aug 14, 2019 at 02:46:38PM +1000, Jordan Niethe wrote:
> On Tue, 2019-08-13 at 20:03 +1000, Paul Mackerras wrote:

[snip]
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index 337e644..2e7e788 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -2831,29 +2831,39 @@ kvm_cede_prodded:
> >  kvm_cede_exit:
> >  	ld	r9, HSTATE_KVM_VCPU(r13)
> >  #ifdef CONFIG_KVM_XICS
> > -	/* Abort if we still have a pending escalation */
> > +	/* are we using XIVE with single escalation? */
> > +	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> > +	cmpdi	r10, 0
> > +	beq	3f
> > +	li	r6, XIVE_ESB_SET_PQ_00
> Would it make sense to put the above instruction down into the 4: label
> instead? If we do not branch to 4, r6 is overwriten anyway. 

Right.

> I think that would save a load when we do not branch to 4. Also it

Well, li is a load immediate rather than a load ("load" would normally
imply a load from memory).  Load-immediate instructions are
essentially free since they can easily be executed in parallel with
other instructions and execute in a single cycle.

> would mean that you could use r5 everywhere instead of changing it to
> r6? 

Yes.  If I have to respin the patch for other reasons then I will
rearrange things as you suggest.  I don't think it's worth respinning
just for this change -- it won't reduce the total number of
instructions, and I strongly doubt there would be any measurable
performance difference.

> > +	/*
> > +	 * If we still have a pending escalation, abort the cede,
> > +	 * and we must set PQ to 10 rather than 00 so that we don't
> > +	 * potentially end up with two entries for the escalation
> > +	 * interrupt in the XIVE interrupt queue.  In that case
> > +	 * we also don't want to set xive_esc_on to 1 here in
> > +	 * case we race with xive_esc_irq().
> > +	 */
> >  	lbz	r5, VCPU_XIVE_ESC_ON(r9)
> >  	cmpwi	r5, 0
> > -	beq	1f
> > +	beq	4f
> >  	li	r0, 0
> >  	stb	r0, VCPU_CEDED(r9)
> > -1:	/* Enable XIVE escalation */
> > -	li	r5, XIVE_ESB_SET_PQ_00
> > +	li	r6, XIVE_ESB_SET_PQ_10
> > +	b	5f
> > +4:	li	r0, 1
> > +	stb	r0, VCPU_XIVE_ESC_ON(r9)
> > +	/* make sure store to xive_esc_on is seen before xive_esc_irq
> > runs */
> > +	sync
> > +5:	/* Enable XIVE escalation */
> >  	mfmsr	r0
> >  	andi.	r0, r0, MSR_DR		/* in real mode? */
> >  	beq	1f
> > -	ld	r10, VCPU_XIVE_ESC_VADDR(r9)
> > -	cmpdi	r10, 0
> > -	beq	3f
> > -	ldx	r0, r10, r5
> > +	ldx	r0, r10, r6
> >  	b	2f
> >  1:	ld	r10, VCPU_XIVE_ESC_RADDR(r9)
> > -	cmpdi	r10, 0
> > -	beq	3f
> > -	ldcix	r0, r10, r5
> > +	ldcix	r0, r10, r6
> >  2:	sync
> > -	li	r0, 1
> > -	stb	r0, VCPU_XIVE_ESC_ON(r9)
> >  #endif /* CONFIG_KVM_XICS */
> >  3:	b	guest_exit_cont
> >  

Paul.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts
  2019-08-13 10:03 ` [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts Paul Mackerras
  2019-08-14  4:46   ` Jordan Niethe
@ 2019-08-22 10:46   ` Michael Ellerman
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2019-08-22 10:46 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

On Tue, 2019-08-13 at 10:03:49 UTC, Paul Mackerras wrote:
> Escalation interrupts are interrupts sent to the host by the XIVE
> hardware when it has an interrupt to deliver to a guest VCPU but that
> VCPU is not running anywhere in the system.  Hence we disable the
> escalation interrupt for the VCPU being run when we enter the guest
> and re-enable it when the guest does an H_CEDE hypercall indicating
> it is idle.
> 
> It is possible that an escalation interrupt gets generated just as we
> are entering the guest.  In that case the escalation interrupt may be
> using a queue entry in one of the interrupt queues, and that queue
> entry may not have been processed when the guest exits with an H_CEDE.
> The existing entry code detects this situation and does not clear the
> vcpu->arch.xive_esc_on flag as an indication that there is a pending
> queue entry (if the queue entry gets processed, xive_esc_irq() will
> clear the flag).  There is a comment in the code saying that if the
> flag is still set on H_CEDE, we have to abort the cede rather than
> re-enabling the escalation interrupt, lest we end up with two
> occurrences of the escalation interrupt in the interrupt queue.
> 
> However, the exit code doesn't do that; it aborts the cede in the sense
> that vcpu->arch.ceded gets cleared, but it still enables the escalation
> interrupt by setting the source's PQ bits to 00.  Instead we need to
> set the PQ bits to 10, indicating that an interrupt has been triggered.
> We also need to avoid setting vcpu->arch.xive_esc_on in this case
> (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
> xive_esc_irq() will run at some point and clear it, and if we race with
> that we may end up with an incorrect result (i.e. xive_esc_on set when
> the escalation interrupt has just been handled).
> 
> It is extremely unlikely that having two queue entries would cause
> observable problems; theoretically it could cause queue overflow, but
> the CPU would have to have thousands of interrupts targetted to it for
> that to be possible.  However, this fix will also make it possible to
> determine accurately whether there is an unhandled escalation
> interrupt in the queue, which will be needed by the following patch.
> 
> Cc: stable@vger.kernel.org # v4.16+
> Fixes: 9b9b13a6d153 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Applied to powerpc topic/ppc-kvm, thanks.

https://git.kernel.org/powerpc/c/959c5d5134786b4988b6fdd08e444aa67d1667ed

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device
  2019-08-13 10:01 ` [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device Paul Mackerras
  2019-08-13 12:18   ` Cédric Le Goater
@ 2019-08-22 10:46   ` Michael Ellerman
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2019-08-22 10:46 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]

On Tue, 2019-08-13 at 10:01:00 UTC, Paul Mackerras wrote:
> At present, when running a guest on POWER9 using HV KVM but not using
> an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
> is run with the kernel_irqchip=off option, the guest entry code goes
> ahead and tries to load the guest context into the XIVE hardware, even
> though no context has been set up.
> 
> To fix this, we check that the "CAM word" is non-zero before pushing
> it to the hardware.  The CAM word is initialized to a non-zero value
> in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
> and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.
> 
> Cc: stable@vger.kernel.org # v4.11+
> Reported-by: Cédric Le Goater <clg@kaod.org>
> Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> Reviewed-by: Cédric Le Goater <clg@kaod.org>

Applied to powerpc topic/ppc-kvm, thanks.

https://git.kernel.org/powerpc/c/8d4ba9c931bc384bcc6889a43915aaaf19d3e499

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 3/3] powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race
  2019-08-13 10:06 ` [PATCH v2 3/3] powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race Paul Mackerras
@ 2019-08-22 10:46   ` Michael Ellerman
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2019-08-22 10:46 UTC (permalink / raw)
  To: Paul Mackerras, linuxppc-dev, kvm; +Cc: kvm-ppc, David Gibson

On Tue, 2019-08-13 at 10:06:48 UTC, Paul Mackerras wrote:
> Testing has revealed the existence of a race condition where a XIVE
> interrupt being shut down can be in one of the XIVE interrupt queues
> (of which there are up to 8 per CPU, one for each priority) at the
> point where free_irq() is called.  If this happens, can return an
> interrupt number which has been shut down.  This can lead to various
> symptoms:
> 
> - irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
>   function gets called, resulting in the CPU's elevated interrupt
>   priority (numerically lowered CPPR) never gets reset.  That then
>   means that the CPU stops processing interrupts, causing device
>   timeouts and other errors in various device drivers.
> 
> - The irq descriptor or related data structures can be in the process
>   of being freed as the interrupt code is using them.  This typically
>   leads to crashes due to bad pointer dereferences.
> 
> This race is basically what commit 62e0468650c3 ("genirq: Add optional
> hardware synchronization for shutdown", 2019-06-28) is intended to
> fix, given a get_irqchip_state() method for the interrupt controller
> being used.  It works by polling the interrupt controller when an
> interrupt is being freed until the controller says it is not pending.
> 
> With XIVE, the PQ bits of the interrupt source indicate the state of
> the interrupt source, and in particular the P bit goes from 0 to 1 at
> the point where the hardware writes an entry into the interrupt queue
> that this interrupt is directed towards.  Normally, the code will then
> process the interrupt and do an end-of-interrupt (EOI) operation which
> will reset PQ to 00 (assuming another interrupt hasn't been generated
> in the meantime).  However, there are situations where the code resets
> P even though a queue entry exists (for example, by setting PQ to 01,
> which disables the interrupt source), and also situations where the
> code leaves P at 1 after removing the queue entry (for example, this
> is done for escalation interrupts so they cannot fire again until
> they are explicitly re-enabled).
> 
> The code already has a 'saved_p' flag for the interrupt source which
> indicates that a queue entry exists, although it isn't maintained
> consistently.  This patch adds a 'stale_p' flag to indicate that
> P has been left at 1 after processing a queue entry, and adds code
> to set and clear saved_p and stale_p as necessary to maintain a
> consistent indication of whether a queue entry may or may not exist.
> 
> With this, we can implement xive_get_irqchip_state() by looking at
> stale_p, saved_p and the ESB PQ bits for the interrupt.
> 
> There is some additional code to handle escalation interrupts
> properly; because they are enabled and disabled in KVM assembly code,
> which does not have access to the xive_irq_data struct for the
> escalation interrupt.  Hence, stale_p may be incorrect when the
> escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
> Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
> with some careful attention to barriers in order to ensure the correct
> result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().
> 
> Finally, this adds code to make noise on the console (pr_crit and
> WARN_ON(1)) if we find an interrupt queue entry for an interrupt
> which does not have a descriptor.  While this won't catch the race
> reliably, if it does get triggered it will be an indication that
> the race is occurring and needs to be debugged.
> 
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Applied to powerpc topic/ppc-kvm, thanks.

https://git.kernel.org/powerpc/c/da15c03b047dca891d37b9f4ef9ca14d84a6484f

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-08-22 10:54 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-13  9:58 [PATCH v2 0/3] powerpc/xive: Fix race condition leading to host crashes and hangs Paul Mackerras
2019-08-13 10:01 ` [PATCH v2 2/3] KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device Paul Mackerras
2019-08-13 12:18   ` Cédric Le Goater
2019-08-22 10:46   ` Michael Ellerman
2019-08-13 10:03 ` [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts Paul Mackerras
2019-08-14  4:46   ` Jordan Niethe
2019-08-14  6:05     ` Paul Mackerras
2019-08-22 10:46   ` Michael Ellerman
2019-08-13 10:06 ` [PATCH v2 3/3] powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race Paul Mackerras
2019-08-22 10:46   ` Michael Ellerman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).