linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: x86@kernel.org, Nadav Amit <namit@vmware.com>,
	Ricardo Neri <ricardo.neri-calderon@linux.intel.com>,
	Stephane Eranian <eranian@google.com>,
	Feng Tang <feng.tang@intel.com>
Subject: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
Date: Thu, 04 Jul 2019 17:51:49 +0200	[thread overview]
Message-ID: <20190704155608.636478018@linutronix.de> (raw)
In-Reply-To: 20190704155145.617706117@linutronix.de

In course of developing shorthand based IPI support issues with the
function which tries to clear eventually pending ISR bits in the local APIC
were observed.

  1) O-day testing triggered the WARN_ON() in apic_pending_intr_clear().

     This warning is emitted when the function fails to clear pending ISR
     bits or observes pending IRR bits which are not delivered to the CPU
     after the stale ISR bit(s) are ACK'ed.

     Unfortunately the function only emits a WARN_ON() and fails to dump
     the IRR/ISR content. That's useless for debugging.

     Feng added spot on debug printk's which revealed that the stale IRR
     bit belonged to the APIC timer interrupt vector, but adding ad hoc
     debug code does not help with sporadic failures in the field.

     Rework the loop so the full IRR/ISR contents are saved and on failure
     dumped.

  2) The loop termination logic is interesting at best.

     If the machine has no TSC or cpu_khz is not known yet it tries 1
     million times to ack stale IRR/ISR bits. What?

     With TSC it uses the TSC to calculate the loop termination. It takes a
     timestamp at entry and terminates the loop when:

     	  (rdtsc() - start_timestamp) >= (cpu_hkz << 10)

     That's roughly one second.

     Both methods are problematic. The APIC has 256 vectors, which means
     that in theory max. 256 IRR/ISR bits can be set. In practice this is
     impossible as the first 32 vectors are reserved and not affected and
     the chance that more than a few bits are set is close to zero.

     With the pure loop based approach the 1 million retries are complete
     overkill.

     With TSC this can terminate too early in a guest which is running on a
     heavily loaded host even with only a couple of IRR/ISR bits set. The
     reason is that after acknowledging the highest priority ISR bit,
     pending IRRs must get serviced first before the next round of
     acknowledge can take place as the APIC (real and virtualized) does not
     honour EOI without a preceeding interrupt on the CPU. And every APIC
     read/write takes a VMEXIT if the APIC is virtualized. While trying to
     reproduce the issue 0-day reported it was observed that the guest was
     scheduled out long enough under heavy load that it terminated after 8
     iterations.

     Make the loop terminate after 512 iterations. That's plenty enough
     in any case and does not take endless time to complete.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/apic.c |  111 +++++++++++++++++++++++++-------------------
 1 file changed, 65 insertions(+), 46 deletions(-)

--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1424,54 +1424,72 @@ static void lapic_setup_esr(void)
 			oldvalue, value);
 }
 
+#define APIC_IR_REGS		APIC_ISR_NR
+#define APIC_IR_BITS		(APIC_IR_REGS * 32)
+#define APIC_IR_MAPSIZE		(APIC_IR_BITS / BITS_PER_LONG)
+
+union apic_ir {
+	unsigned long	map[APIC_IR_MAPSIZE];
+	u32		regs[APIC_IR_REGS];
+};
+
+static bool apic_check_and_ack(union apic_ir *irr, union apic_ir *isr)
+{
+	int i, bit;
+
+	/* Read the IRRs */
+	for (i = 0; i < APIC_IR_REGS; i++)
+		irr->regs[i] = apic_read(APIC_IRR + i * 0x10);
+
+	/* Read the ISRs */
+	for (i = 0; i < APIC_IR_REGS; i++)
+		isr->regs[i] = apic_read(APIC_ISR + i * 0x10);
+
+	/*
+	 * If the ISR map is not empty. ACK the APIC and run another round
+	 * to verify whether a pending IRR has been unblocked and turned
+	 * into a ISR.
+	 */
+	if (!bitmap_empty(isr->map, APIC_IR_BITS)) {
+		/*
+		 * There can be multiple ISR bits set when a high priority
+		 * interrupt preempted a lower priority one. Issue an ACK
+		 * per set bit.
+		 */
+		for_each_set_bit(bit, isr->map, APIC_IR_BITS)
+			ack_APIC_irq();
+		return true;
+	}
+
+	return !bitmap_empty(irr->map, APIC_IR_BITS);
+}
+
+/*
+ * After a crash, we no longer service the interrupts and a pending
+ * interrupt from previous kernel might still have ISR bit set.
+ *
+ * Most probably by now the CPU has serviced that pending interrupt and it
+ * might not have done the ack_APIC_irq() because it thought, interrupt
+ * came from i8259 as ExtInt. LAPIC did not get EOI so it does not clear
+ * the ISR bit and cpu thinks it has already serivced the interrupt. Hence
+ * a vector might get locked. It was noticed for timer irq (vector
+ * 0x31). Issue an extra EOI to clear ISR.
+ *
+ * If there are pending IRR bits they turn into ISR bits after a higher
+ * priority ISR bit has been acked.
+ */
 static void apic_pending_intr_clear(void)
 {
-	long long max_loops = cpu_khz ? cpu_khz : 1000000;
-	unsigned long long tsc = 0, ntsc;
-	unsigned int queued;
-	unsigned long value;
-	int i, j, acked = 0;
-
-	if (boot_cpu_has(X86_FEATURE_TSC))
-		tsc = rdtsc();
-	/*
-	 * After a crash, we no longer service the interrupts and a pending
-	 * interrupt from previous kernel might still have ISR bit set.
-	 *
-	 * Most probably by now CPU has serviced that pending interrupt and
-	 * it might not have done the ack_APIC_irq() because it thought,
-	 * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
-	 * does not clear the ISR bit and cpu thinks it has already serivced
-	 * the interrupt. Hence a vector might get locked. It was noticed
-	 * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
-	 */
-	do {
-		queued = 0;
-		for (i = APIC_ISR_NR - 1; i >= 0; i--)
-			queued |= apic_read(APIC_IRR + i*0x10);
-
-		for (i = APIC_ISR_NR - 1; i >= 0; i--) {
-			value = apic_read(APIC_ISR + i*0x10);
-			for_each_set_bit(j, &value, 32) {
-				ack_APIC_irq();
-				acked++;
-			}
-		}
-		if (acked > 256) {
-			pr_err("LAPIC pending interrupts after %d EOI\n", acked);
-			break;
-		}
-		if (queued) {
-			if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
-				ntsc = rdtsc();
-				max_loops = (long long)cpu_khz << 10;
-				max_loops -= ntsc - tsc;
-			} else {
-				max_loops--;
-			}
-		}
-	} while (queued && max_loops > 0);
-	WARN_ON(max_loops <= 0);
+	union apic_ir irr, isr;
+	unsigned int i;
+
+	/* 512 loops are way oversized and give the APIC a chance to obey. */
+	for (i = 0; i < 512; i++) {
+		if (!apic_check_and_ack(&irr, &isr))
+			return;
+	}
+	/* Dump the IRR/ISR content if that failed */
+	pr_warn("APIC: Stale IRR: %256pb ISR: %256pb\n", irr.map, isr.map);
 }
 
 /**
@@ -1544,6 +1562,7 @@ static void setup_local_APIC(void)
 	value &= ~APIC_TPRI_MASK;
 	apic_write(APIC_TASKPRI, value);
 
+	/* Clear eventually stale ISR/IRR bits */
 	apic_pending_intr_clear();
 
 	/*



  parent reply	other threads:[~2019-07-04 16:34 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-04 15:51 [patch V2 00/25] x86/apic: Support for IPI shorthands Thomas Gleixner
2019-07-04 15:51 ` [patch V2 01/25] x86/kgbd: Use NMI_VECTOR not APIC_DM_NMI Thomas Gleixner
2019-07-05 21:43   ` Thomas Gleixner
2019-07-04 15:51 ` [patch V2 02/25] x86/apic: Invoke perf_events_lapic_init() after enabling APIC Thomas Gleixner
2019-07-04 15:51 ` [patch V2 03/25] x86/apic: Soft disable APIC before initializing it Thomas Gleixner
2019-07-04 15:51 ` Thomas Gleixner [this message]
2019-07-05 15:47   ` [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust Andrew Cooper
2019-07-05 19:06     ` Andy Lutomirski
2019-07-05 20:17       ` Andrew Cooper
2019-07-05 20:36       ` Thomas Gleixner
2019-07-05 20:39         ` Andy Lutomirski
2019-07-07  8:27           ` Thomas Gleixner
2019-07-05 19:19     ` Nadav Amit
2019-07-05 20:47       ` Andrew Cooper
2019-07-05 20:25     ` Thomas Gleixner
2019-07-05 20:37       ` Andy Lutomirski
2019-07-05 20:49       ` Paolo Bonzini
2019-07-05 21:16         ` Andrew Cooper
2019-07-07  8:37         ` Thomas Gleixner
2019-07-09 14:43           ` Thomas Gleixner
2019-07-04 15:51 ` [patch V2 05/25] x86/apic: Move IPI inlines into ipi.c Thomas Gleixner
2019-07-04 15:51 ` [patch V2 06/25] x86/apic: Cleanup the include maze Thomas Gleixner
2019-07-04 15:51 ` [patch V2 07/25] x86/apic: Move ipi header into apic directory Thomas Gleixner
2019-07-04 15:51 ` [patch V2 08/25] x86/apic: Move apic_flat_64 " Thomas Gleixner
2019-07-04 15:51 ` [patch V2 09/25] x86/apic: Consolidate the apic local headers Thomas Gleixner
2019-07-04 15:51 ` [patch V2 10/25] x86/apic/uv: Make x2apic_extra_bits static Thomas Gleixner
2019-07-04 15:51 ` [patch V2 11/25] smp/hotplug: Track booted once CPUs in a cpumask Thomas Gleixner
2019-07-04 15:51 ` [patch V2 12/25] x86/cpu: Move arch_smt_update() to a neutral place Thomas Gleixner
2019-07-04 15:51 ` [patch V2 13/25] x86/hotplug: Silence APIC and NMI when CPU is dead Thomas Gleixner
2019-07-04 15:51 ` [patch V2 14/25] x86/apic: Remove dest argument from __default_send_IPI_shortcut() Thomas Gleixner
2019-07-04 15:52 ` [patch V2 15/25] x86/apic: Add NMI_VECTOR wait to IPI shorthand Thomas Gleixner
2019-07-04 15:52 ` [patch V2 16/25] x86/apic: Move no_ipi_broadcast() out of 32bit Thomas Gleixner
2019-07-04 15:52 ` [patch V2 17/25] x86/apic: Add static key to Control IPI shorthands Thomas Gleixner
2019-07-04 15:52 ` [patch V2 18/25] x86/apic: Provide and use helper for send_IPI_allbutself() Thomas Gleixner
2019-07-04 15:52 ` [patch V2 19/25] cpumask: Implement cpumask_or_equal() Thomas Gleixner
2019-07-04 15:52 ` [patch V2 20/25] x86/smp: Move smp_function_call implementations into IPI code Thomas Gleixner
2019-07-04 15:52 ` [patch V2 21/25] x86/smp: Enhance native_send_call_func_ipi() Thomas Gleixner
2019-07-05  1:26   ` Nadav Amit
2019-07-04 15:52 ` [patch V2 22/25] x86/apic: Remove the shorthand decision logic Thomas Gleixner
2019-07-04 15:52 ` [patch V2 23/25] x86/apic: Share common IPI helpers Thomas Gleixner
2019-07-04 15:52 ` [patch V2 24/25] x86/apic/flat64: Remove the IPI shorthand decision logic Thomas Gleixner
2019-07-04 15:52 ` [patch V2 25/25] x86/apic/x2apic: Implement IPI shorthands support Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190704155608.636478018@linutronix.de \
    --to=tglx@linutronix.de \
    --cc=eranian@google.com \
    --cc=feng.tang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=namit@vmware.com \
    --cc=ricardo.neri-calderon@linux.intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).