* Re: [debug PATCHes] Re: smp_call_function_single lockups
@ 2015-04-01  2:00 Daniel J Blueman
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel J Blueman @ 2015-04-01  2:00 UTC
  To: Chris J Arges
  Cc: Linux Kernel, x86, Linus Torvalds, Rafael David Tinoco,
	Peter Anvin, Jiang Liu, Peter Zijlstra, Jens Axboe,
	Frederic Weisbecker, Gema Gomez

On Wednesday, April 1, 2015 at 6:40:06 AM UTC+8, Chris J Arges wrote:
> On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > > bit clear" seems to be a real issue.
> >
> > It's interesting in particular when it happens with an edge-triggered
> > interrupt source: it's much harder to miss level triggered IRQs, which
> > stay around until actively handled. Edge triggered irqs are more
> > fragile to loss of event processing.
> >
> > > > Anyway, maybe this sheds some more light on this issue. I can
> > > > reproduce this at will, so let me know of other experiments to do.
> >
> > Btw., could you please describe (again) what your current best method
> > for reproduction is? It's been a long discussion ...
> >
>
> Ingo,
>
> To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
> (although I've heard of others who have reproduced it on other machines):
>
> 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
> 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
> 3) Add the following to the L1 cmdline:
> nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
> 4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
>
> Sometimes this is sufficient to reproduce the issue; I've observed that running
> KSM in the L1 VM can aggravate the issue (KSM calls native_flush_tlb_others).
> If this doesn't reproduce, then you can do the following:
> 5) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset) between
> L1 vCPUs until the hang occurs.
>
> I attempted to write a module that used smp_call_function_single calls to
> trigger IPIs but have been unable to create a simpler reproducer.

A non-intrusive way of generating a lot of IPIs is calling
stop_machine() via something like:

while :; do
    # adding an MTRR entry goes through stop_machine(), IPIing all CPUs
    echo "base=0x20000000000 size=0x8000000 type=write-back" >/proc/mtrr
    # drop the entry again (assumes the new entry landed in slot 4)
    echo "disable=4" >| /proc/mtrr
done

Of course, ensure the base is above DRAM and any 64-bit MMIO so there are no
side effects, and check that the new range really ends up in entry 4. Onlining
and offlining cores in parallel will also generate IPIs; a rough sketch of that
is below.
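
A minimal sketch of the hotplug variant (untested here; it assumes cpu1 exists
and is hotpluggable, any non-boot CPU will do):

while :; do
    # each offline/online cycle forces cross-CPU synchronization and IPIs
    echo 0 > /sys/devices/system/cpu/cpu1/online
    echo 1 > /sys/devices/system/cpu/cpu1/online
done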

Dan
-- 
Daniel J Blueman


* Re: [debug PATCHes] Re: smp_call_function_single lockups
  2015-04-01 14:10                         ` Chris J Arges
@ 2015-04-01 14:55                           ` Ingo Molnar
  0 siblings, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2015-04-01 14:55 UTC
  To: Chris J Arges
  Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers


* Chris J Arges <chris.j.arges@canonical.com> wrote:

> > diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> > index 6cedd7914581..79d6de6fdf0a 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
> >  			cfg->move_in_progress =
> >  			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
> >  			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
> > +			if (cfg->move_in_progress)
> > +				dump_stack();
> >  			break;
> >  		}
> >  
> > 
> 
> Should I add the dump_stack() to the 'new-domain move' area at line
> ~184, since that's the path I seem to be triggering?

Yeah, please do, that's what I wanted to mark.

My guess is that it will show a stack dump to within set_affinity().

Thanks,

	Ingo


* Re: [debug PATCHes] Re: smp_call_function_single lockups
  2015-04-01 12:39                       ` Ingo Molnar
@ 2015-04-01 14:10                         ` Chris J Arges
  2015-04-01 14:55                           ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Chris J Arges @ 2015-04-01 14:10 UTC
  To: Ingo Molnar
  Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers

On Wed, Apr 01, 2015 at 02:39:13PM +0200, Ingo Molnar wrote:
> 
> * Chris J Arges <chris.j.arges@canonical.com> wrote:
> 
> > This was tested only on the L1, so I can put this on the L0 host and run
> > this as well. The results:
> > 
> > [  124.897002] apic: vector c1, new-domain move in progress                    
> > [  124.954827] apic: vector d1, sent cleanup vector, move completed            
> > [  163.477270] apic: vector d1, new-domain move in progress                    
> > [  164.041938] apic: vector e1, sent cleanup vector, move completed            
> > [  213.466971] apic: vector e1, new-domain move in progress                    
> > [  213.775639] apic: vector 22, sent cleanup vector, move completed            
> > [  365.996747] apic: vector 22, new-domain move in progress                    
> > [  366.011136] apic: vector 42, sent cleanup vector, move completed            
> > [  393.836032] apic: vector 42, new-domain move in progress                    
> > [  393.837727] apic: vector 52, sent cleanup vector, move completed            
> > [  454.977514] apic: vector 52, new-domain move in progress                    
> > [  454.978880] apic: vector 62, sent cleanup vector, move completed            
> > [  467.055730] apic: vector 62, new-domain move in progress                    
> > [  467.058129] apic: vector 72, sent cleanup vector, move completed            
> > [  545.280125] apic: vector 72, new-domain move in progress                    
> > [  545.282801] apic: vector 82, sent cleanup vector, move completed            
> > [  567.631652] apic: vector 82, new-domain move in progress                    
> > [  567.632207] apic: vector 92, sent cleanup vector, move completed            
> > [  628.940638] apic: vector 92, new-domain move in progress                    
> > [  628.965274] apic: vector a2, sent cleanup vector, move completed            
> > [  635.187433] apic: vector a2, new-domain move in progress                    
> > [  635.191643] apic: vector b2, sent cleanup vector, move completed            
> > [  673.548020] apic: vector b2, new-domain move in progress                    
> > [  673.553843] apic: vector c2, sent cleanup vector, move completed            
> > [  688.221906] apic: vector c2, new-domain move in progress                    
> > [  688.229487] apic: vector d2, sent cleanup vector, move completed            
> > [  723.818916] apic: vector d2, new-domain move in progress                    
> > [  723.828970] apic: vector e2, sent cleanup vector, move completed            
> > [  733.485435] apic: vector e2, new-domain move in progress                    
> > [  733.615007] apic: vector 23, sent cleanup vector, move completed            
> > [  824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26] 
> 
> Are these all the messages? Looks like Linus's warnings went away, or 
> did you filter them out?
>
Ingo,

The patch I ran with included both patches, but I had to modify your patch
to accommodate Linus' vector tracking; I've inlined the complete patch below.

The above kernel log was all kernel messages (no grepping) leading up to the
soft lockup on L1.
 
> But ... the affinity setting message does not appear to trigger, and 
> that's the only real race I can see in the code. Also, the frequency 
> of these messages appears to be low, while the race window is narrow. 
> So I'm not sure the problem is related to the irq-move mechanism.
> 
> One thing that appears to be weird: why is there irq-movement activity 
> to begin with? Is something changing irq-affinities?
>

irqbalance is running on all VMs/systems (L0, L1, L2) by default. I'll disable
it and re-test.
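
Concretely, something like this on each of L0/L1/L2 (just a sketch; the exact
service tooling depends on the init system in each guest):

# stop the daemon that keeps rewriting IRQ affinities
service irqbalance stop        # or: systemctl stop irqbalance
# sanity-check that the affinity masks stay put afterwards
grep . /proc/irq/*/smp_affinity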
 
> Could you put a dump_stack() into the call? Something like the patch 
> below, in addition to all patches so far. (if it conflicts with the 
> previous debugging patches then just add the code manually to after 
> the debug printout.)
> 
> Thanks,
> 
> 	Ingo
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..79d6de6fdf0a 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>  			cfg->move_in_progress =
>  			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
>  			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
> +			if (cfg->move_in_progress)
> +				dump_stack();
>  			break;
>  		}
>  
> 

Should I add the dump_stack() to the 'new-domain move' area at line ~184, since
that's the path I seem to be triggering?

Thanks,
--chris

--

For completeness, here is the patch that I ran with previously:

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index efc3b22..88b3933 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -395,7 +395,7 @@ static inline void apic_write(u32 reg, u32 val)
	apic->write(reg, val);
 }
 
-static inline void apic_eoi(void)
+static inline void apic_eoi(int vector)
 {
	apic->eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
@@ -426,7 +426,7 @@ extern void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v));
 
 static inline u32 apic_read(u32 reg) { return 0; }
 static inline void apic_write(u32 reg, u32 val) { }
-static inline void apic_eoi(void) { }
+static inline void apic_eoi(int vector) { }
 static inline u64 apic_icr_read(void) { return 0; }
 static inline void apic_icr_write(u32 low, u32 high) { }
 static inline void apic_wait_icr_idle(void) { }
@@ -435,13 +435,26 @@ static inline void apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v)) {}
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
-static inline void ack_APIC_irq(void)
+#include <asm/irq_vectors.h>
+typedef int vector_irq_t[NR_VECTORS];
+DECLARE_PER_CPU(vector_irq_t, vector_irq);
+
+static inline void ack_APIC_irq(int vector)
 {
+	/* Is the ISR bit actually set for this vector? */
+	if (vector >= 0) {
+		unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
+		int irq;
+		v >>= vector & 0x1f;
+		irq = __this_cpu_read(vector_irq[vector]);
+
+		WARN(!(v & 1), "ack_APIC_irq: vector = %0x, irq = %0x\n", vector, irq);
+	}
	/*
	 * ack_APIC_irq() actually gets compiled as a single instruction
	 * ... yummie.
	 */
-	apic_eoi();
+	apic_eoi(vector);
 }
 
 static inline unsigned default_get_apic_id(unsigned long x)
@@ -639,9 +652,9 @@ static inline void entering_irq(void)
	exit_idle();
 }
 
-static inline void entering_ack_irq(void)
+static inline void entering_ack_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
	entering_irq();
 }
 
@@ -650,11 +663,11 @@ static inline void exiting_irq(void)
	irq_exit();
 }
 
-static inline void exiting_ack_irq(void)
+static inline void exiting_ack_irq(int vector)
 {
	irq_exit();
	/* Ack only at the end to avoid potential reentry */
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 extern void ioapic_zap_locks(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..20ef123 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -910,7 +910,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
	 * Besides, if we don't timer interrupts ignore the global
	 * interrupt lock, which is the WrongThing (tm) to do.
	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
	local_apic_timer_interrupt();
	exiting_irq();
 
@@ -929,7 +929,7 @@ __visible void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
	 * Besides, if we don't timer interrupts ignore the global
	 * interrupt lock, which is the WrongThing (tm) to do.
	 */
-	entering_ack_irq();
+	entering_ack_irq(LOCAL_TIMER_VECTOR);
	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
	local_apic_timer_interrupt();
	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
@@ -1342,7 +1342,7 @@ void setup_local_APIC(void)
			value = apic_read(APIC_ISR + i*0x10);
			for (j = 31; j >= 0; j--) {
				if (value & (1<<j)) {
-					ack_APIC_irq();
+					ack_APIC_irq(-1);
					acked++;
				}
			}
@@ -1868,7 +1868,7 @@ static inline void __smp_spurious_interrupt(u8 vector)
	 */
	v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
	if (v & (1 << (vector & 0x1f)))
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
	inc_irq_stat(irq_spurious_count);
 
@@ -1917,7 +1917,7 @@ static inline void __smp_error_interrupt(struct pt_regs *regs)
	if (lapic_get_maxlvt() > 3)	/* Due to the Pentium erratum 3AP. */
		apic_write(APIC_ESR, 0);
	v = apic_read(APIC_ESR);
-	ack_APIC_irq();
+	ack_APIC_irq(ERROR_APIC_VECTOR);
	atomic_inc(&irq_err_count);
 
	apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x",
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f4dc246..b74db5a 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1993,7 +1993,7 @@ static void ack_ioapic_level(struct irq_data *data)
	 * We must acknowledge the irq before we move it or the acknowledge will
	 * not propagate properly.
	 */
-	ack_APIC_irq();
+	ack_APIC_irq(i);
 
	/*
	 * Tail end of clearing remote IRR bit (either by delivering the EOI
@@ -2067,7 +2067,8 @@ static void unmask_lapic_irq(struct irq_data *data)
 
 static void ack_lapic_irq(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static struct irq_chip lapic_chip __read_mostly = {
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd79..a8efbcb 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
			cfg->move_in_progress =
			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
			break;
		}
@@ -178,6 +180,8 @@ next:
			cpumask_copy(cfg->old_domain, cfg->domain);
			cfg->move_in_progress =
			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector);
		}
		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
			per_cpu(vector_irq, new_cpu)[vector] = irq;
@@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg)
		}
	}
	cfg->move_in_progress = 0;
+	pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector);
	raw_spin_unlock_irqrestore(&vector_lock, flags);
 }
 
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
-	irq_complete_move(irqd_cfg(data));
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
+
+	irq_complete_move(cfg);
	irq_move_irq(data);
-	ack_APIC_irq();
 }
 
 /*
@@ -391,13 +398,14 @@ void send_cleanup_vector(struct irq_cfg *cfg)
		free_cpumask_var(cleanup_mask);
	}
	cfg->move_in_progress = 0;
+	pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector);
 }
 
 asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 {
	unsigned vector, me;
 
-	ack_APIC_irq();
+	ack_APIC_irq(IRQ_MOVE_CLEANUP_VECTOR);
	irq_enter();
	exit_idle();
 
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1..89bac1f 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -433,7 +433,7 @@ asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
 {
	entering_irq();
	__smp_thermal_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
@@ -442,7 +442,7 @@ asmlinkage __visible void smp_trace_thermal_interrupt(struct pt_regs *regs)
	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
	__smp_thermal_interrupt();
	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THERMAL_APIC_VECTOR);
 }
 
 /* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980..03068d1 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -28,7 +28,7 @@ asmlinkage __visible void smp_threshold_interrupt(void)
 {
	entering_irq();
	__smp_threshold_interrupt();
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
 
 asmlinkage __visible void smp_trace_threshold_interrupt(void)
@@ -37,5 +37,5 @@ asmlinkage __visible void smp_trace_threshold_interrupt(void)
	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
	__smp_threshold_interrupt();
	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
-	exiting_ack_irq();
+	exiting_ack_irq(THRESHOLD_APIC_VECTOR);
 }
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 67b1cbe..fc2b9a0 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -45,7 +45,8 @@ void ack_bad_irq(unsigned int irq)
	 * completely.
	 * But only ack when the APIC is enabled -AK
	 */
-	ack_APIC_irq();
+	/* This doesn't know the vector. It needs the irq descriptor */
+	ack_APIC_irq(-1);
 }
 
 #define irq_stats(x)		(&per_cpu(irq_stat, x))
@@ -198,7 +199,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
	irq = __this_cpu_read(vector_irq[vector]);
 
	if (!handle_irq(irq, regs)) {
-		ack_APIC_irq();
+		ack_APIC_irq(vector);
 
		if (irq != VECTOR_RETRIGGERED) {
			pr_emerg_ratelimited("%s: %d.%d No irq handler for vector (irq %d)\n",
@@ -230,7 +231,7 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 {
	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
	__smp_x86_platform_ipi();
	exiting_irq();
	set_irq_regs(old_regs);
@@ -244,7 +245,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 {
	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	ack_APIC_irq();
+	ack_APIC_irq(POSTED_INTR_VECTOR);
 
	irq_enter();
 
@@ -262,7 +263,7 @@ __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 {
	struct pt_regs *old_regs = set_irq_regs(regs);
 
-	entering_ack_irq();
+	entering_ack_irq(X86_PLATFORM_IPI_VECTOR);
	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
	__smp_x86_platform_ipi();
	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 15d741d..dfe9263 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -10,10 +10,10 @@
 #include <asm/apic.h>
 #include <asm/trace/irq_vectors.h>
 
-static inline void irq_work_entering_irq(void)
+static inline void irq_work_entering_irq(int vector)
 {
	irq_enter();
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
 }
 
 static inline void __smp_irq_work_interrupt(void)
@@ -24,14 +24,14 @@ static inline void __smp_irq_work_interrupt(void)
 
 __visible void smp_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
	__smp_irq_work_interrupt();
	exiting_irq();
 }
 
 __visible void smp_trace_irq_work_interrupt(struct pt_regs *regs)
 {
-	irq_work_entering_irq();
+	irq_work_entering_irq(IRQ_WORK_VECTOR);
	trace_irq_work_entry(IRQ_WORK_VECTOR);
	__smp_irq_work_interrupt();
	trace_irq_work_exit(IRQ_WORK_VECTOR);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bd..44b74b2 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -170,7 +170,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 
 asmlinkage __visible void smp_reboot_interrupt(void)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(REBOOT_VECTOR);
	irq_enter();
	stop_this_cpu(NULL);
	irq_exit();
@@ -258,16 +258,16 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(RESCHEDULE_VECTOR);
	__smp_reschedule_interrupt();
	/*
	 * KVM uses this interrupt to force a cpu out of guest mode
	 */
 }
 
-static inline void smp_entering_irq(void)
+static inline void smp_entering_irq(int vector)
 {
-	ack_APIC_irq();
+	ack_APIC_irq(vector);
	irq_enter();
 }
 
@@ -279,7 +279,7 @@ __visible void smp_trace_reschedule_interrupt(struct pt_regs *regs)
	 * scheduler_ipi(). This is OK, since those functions are allowed
	 * to nest.
	 */
-	smp_entering_irq();
+	smp_entering_irq(RESCHEDULE_VECTOR);
	trace_reschedule_entry(RESCHEDULE_VECTOR);
	__smp_reschedule_interrupt();
	trace_reschedule_exit(RESCHEDULE_VECTOR);
@@ -297,14 +297,14 @@ static inline void __smp_call_function_interrupt(void)
 
 __visible void smp_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
	__smp_call_function_interrupt();
	exiting_irq();
 }
 
 __visible void smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_VECTOR);
	trace_call_function_entry(CALL_FUNCTION_VECTOR);
	__smp_call_function_interrupt();
	trace_call_function_exit(CALL_FUNCTION_VECTOR);
@@ -319,14 +319,14 @@ static inline void __smp_call_function_single_interrupt(void)
 
 __visible void smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
	__smp_call_function_single_interrupt();
	exiting_irq();
 }
 
 __visible void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
-	smp_entering_irq();
+	smp_entering_irq(CALL_FUNCTION_SINGLE_VECTOR);
	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
	__smp_call_function_single_interrupt();
	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 390079e..896c4b3 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -334,12 +334,14 @@ void panic_if_irq_remap(const char *msg)
 
 static void ir_ack_apic_edge(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
 }
 
 static void ir_ack_apic_level(struct irq_data *data)
 {
-	ack_APIC_irq();
+	struct irq_cfg *cfg = irqd_cfg(data);
+	ack_APIC_irq(cfg->vector);
	eoi_ioapic_irq(data->irq, irqd_cfg(data));
 }
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 886d09e..fbdb5bd 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -244,8 +244,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
	desc->affinity_hint = m;
	irq_put_desc_unlock(desc, flags);
	/* set the initial affinity to prevent every interrupt being on CPU0 */
-	if (m)
+	if (m) {
+		pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq);
		__irq_set_affinity(irq, m, false);
+	}
	return 0;
 }
 EXPORT_SYMBOL_GPL(irq_set_affinity_hint);




* Re: [debug PATCHes] Re: smp_call_function_single lockups
  2015-03-31 22:38                     ` Chris J Arges
@ 2015-04-01 12:39                       ` Ingo Molnar
  2015-04-01 14:10                         ` Chris J Arges
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2015-04-01 12:39 UTC
  To: Chris J Arges
  Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers


* Chris J Arges <chris.j.arges@canonical.com> wrote:

> This was tested only on the L1, so I can put this on the L0 host and run
> this as well. The results:
> 
> [  124.897002] apic: vector c1, new-domain move in progress                    
> [  124.954827] apic: vector d1, sent cleanup vector, move completed            
> [  163.477270] apic: vector d1, new-domain move in progress                    
> [  164.041938] apic: vector e1, sent cleanup vector, move completed            
> [  213.466971] apic: vector e1, new-domain move in progress                    
> [  213.775639] apic: vector 22, sent cleanup vector, move completed            
> [  365.996747] apic: vector 22, new-domain move in progress                    
> [  366.011136] apic: vector 42, sent cleanup vector, move completed            
> [  393.836032] apic: vector 42, new-domain move in progress                    
> [  393.837727] apic: vector 52, sent cleanup vector, move completed            
> [  454.977514] apic: vector 52, new-domain move in progress                    
> [  454.978880] apic: vector 62, sent cleanup vector, move completed            
> [  467.055730] apic: vector 62, new-domain move in progress                    
> [  467.058129] apic: vector 72, sent cleanup vector, move completed            
> [  545.280125] apic: vector 72, new-domain move in progress                    
> [  545.282801] apic: vector 82, sent cleanup vector, move completed            
> [  567.631652] apic: vector 82, new-domain move in progress                    
> [  567.632207] apic: vector 92, sent cleanup vector, move completed            
> [  628.940638] apic: vector 92, new-domain move in progress                    
> [  628.965274] apic: vector a2, sent cleanup vector, move completed            
> [  635.187433] apic: vector a2, new-domain move in progress                    
> [  635.191643] apic: vector b2, sent cleanup vector, move completed            
> [  673.548020] apic: vector b2, new-domain move in progress                    
> [  673.553843] apic: vector c2, sent cleanup vector, move completed            
> [  688.221906] apic: vector c2, new-domain move in progress                    
> [  688.229487] apic: vector d2, sent cleanup vector, move completed            
> [  723.818916] apic: vector d2, new-domain move in progress                    
> [  723.828970] apic: vector e2, sent cleanup vector, move completed            
> [  733.485435] apic: vector e2, new-domain move in progress                    
> [  733.615007] apic: vector 23, sent cleanup vector, move completed            
> [  824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26] 

Are these all the messages? Looks like Linus's warnings went away, or 
did you filter them out?

But ... the affinity setting message does not appear to trigger, and 
that's the only real race I can see in the code. Also, the frequency 
of these messages appears to be low, while the race window is narrow. 
So I'm not sure the problem is related to the irq-move mechanism.

One thing that appears to be weird: why is there irq-movement activity 
to begin with? Is something changing irq-affinities?

Could you put a dump_stack() into the call? Something like the patch 
below, in addition to all patches so far. (if it conflicts with the 
previous debugging patches then just add the code manually to after 
the debug printout.)

Thanks,

	Ingo

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..79d6de6fdf0a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -144,6 +144,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
 			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
+			if (cfg->move_in_progress)
+				dump_stack();
 			break;
 		}
 


* Re: [debug PATCHes] Re: smp_call_function_single lockups
  2015-03-31 10:56                   ` [debug PATCHes] " Ingo Molnar
@ 2015-03-31 22:38                     ` Chris J Arges
  2015-04-01 12:39                       ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Chris J Arges @ 2015-03-31 22:38 UTC
  To: Ingo Molnar
  Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers

On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
> > bit clear" seems to be a real issue.
> 
> It's interesting in particular when it happens with an edge-triggered 
> interrupt source: it's much harder to miss level triggered IRQs, which 
> stay around until actively handled. Edge triggered irqs are more 
> fragile to loss of event processing.
> 
> > > Anyway, maybe this sheds some more light on this issue. I can 
> > > reproduce this at will, so let me know of other experiments to do.
> 
> Btw., could you please describe (again) what your current best method 
> for reproduction is? It's been a long discussion ...
> 

Ingo,

To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
(although I've heard of others who have reproduced it on other machines):

1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Add the following to the L1 cmdline:
nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue; I've observed that running
KSM in the L1 VM can aggravate the issue (KSM calls native_flush_tlb_others).
If this doesn't reproduce, then you can do the following:
5) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset) between
L1 vCPUs until the hang occurs (a rough helper sketch is below).

I attempted to write a module that used smp_call_function_single calls to
trigger IPIs but have been unable to create a simpler reproducer.
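
For the migration in step 5, a rough helper sketch (not a polished script; the
domain name 'l2-guest' and the two L1 vCPU numbers are placeholders for
whatever the actual setup uses):

# make sure KSM is scanning in L1, since it exercises native_flush_tlb_others()
echo 1 > /sys/kernel/mm/ksm/run

# bounce the single L2 vCPU between the two L1 vCPUs until the hang shows up
while :; do
    virsh vcpupin l2-guest 0 $((RANDOM % 2)) --live
    sleep $((RANDOM % 5 + 1))
done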

> > Somebody else who knows the apic needs to also take a look, but I'd 
> > love to hear what the actual hardware interrupt is (from that 
> > "vector_irq[vector]" thing above).
> > 
> > I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
> > are 0xf0-0xff), so it sounds like an external one. But that then 
> > requires the whole mapping table thing.
> > 
> > Ingo/Peter/Jiang - is there anything else useful we could print out? 
> > I worry about the irq movement code. Can we add printk's to when an 
> > irq is chasing from one CPU to another and doing that 
> > "move_in_progress" thing? I've always been scared of that code.
> 
> 1)
> 
> So the irq_cfg::move_in_progress field originates from:
> 
>   610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")
> 
> and I agree that it looks fragile, so it would be nice to see how much 
> (if at all?) it gets used, by sticking a few pr_info()s in it.
> 
> Debug patch for the vector movement state machine attached below. 
> Untested.
> 
> 2)
> 
> But ... having taken a quick look at the vector movement handling, I 
> am scared more than ever, and I cannot convince myself that it's race 
> free. It's possibly harmless, but still, famous last words ...
> 
> For example, most ->move_in_progress transitions happen with the 
> vector_lock held - with the exception of send_cleanup_vector(): there 
> we are only holding the desc->lock, but that's not enough! For 
> example, this codepath:
> 
> static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
> {
>         unsigned me;
> 
>         if (likely(!cfg->move_in_progress))
>                 return;
> 
>         me = smp_processor_id();
> 
>         if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
>                 send_cleanup_vector(cfg);
> 
> ...
> 
> void send_cleanup_vector(struct irq_cfg *cfg)
> {
> ...
> 
>         cfg->move_in_progress = 0;
> }
> 
> is blatantly racy, unless I missed something obvious?
> 
> 2b)
> 
> Initially I thought this only affects slowpaths like driver unregister 
> or CPU hotplug, but I think this would be relevant for the fastpath of 
> edge triggered irqs as well:
> 
> void apic_ack_edge(struct irq_data *data)
> {
>         irq_complete_move(irqd_cfg(data));
>         irq_move_irq(data);
>         ack_APIC_irq();
> }
> 
> and irq_complete_move() checks and clears cfg->move_in_progress as 
> listed above.
> 
> So in most scenarios this is probably harmless, because it can in the 
> worst case result in the missing of a ->move_in_progress flag setting.
> 
> But it does not look harmless in the apic_set_affinity() code path: 
> there we call into assign_irq_vector() without the desc->lock held, 
> due to this gem in the IRQ core:
> 
> int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
> {
>         unsigned long flags;
>         struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);
> 
>         if (!desc)
>                 return -EINVAL;
>         desc->affinity_hint = m;
>         irq_put_desc_unlock(desc, flags);
>         /* set the initial affinity to prevent every interrupt being on CPU0 */
>         if (m)
>                 __irq_set_affinity(irq, m, false);
>         return 0;
> }
> 
> argh!
> 
> Now if this ever crosses paths with an assign_irq_vector() call on 
> another CPU, then I really cannot see what would save us from deep 
> trouble: we use cfg->vector while that is possibly being modified, 
> right?
> 
> So I'd suggest to put a printk into irq_set_affinity_hint() as well, 
> to make sure it's not used during this test. In particular:
> 
> drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, NULL);
> drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, mask);
> 
> might be triggering it in the virtualization code...
> 
> The second patch below adds the relevant pr_info(). (untested)
> 
> Now this too might be unrelated, because the affinity hint was added 
> ages ago, in:
> 
> e7a297b0d7d6 ("genirq: Add CPU mask affinity hint")
> 
> and the potentially racy nature of calling into set_affinity without 
> the desc lock held was probably not realized back then.
> 
> VirtIO's use of the affinity-hint was added a while ago as well, four 
> years ago, in:
> 
>   75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue")
> 
> 2c)
> 
> The other thing that worries me here is that we apparently send an 
> IRQ-move IPI while having an un-acked local APIC (!). I have vague 
> memories that this was problematic and fragile before.
> 
> To ease my worries in this area I've attached a third patch below that 
> changes the order around.
> 
> This too is ancient behavior - but might be more prominent on certain 
> hardware (and virtualization) variants, so it has a chance to be 
> relevant.
> 
> 3)
> 
> Furthermore, I also noticed that the whole vector_lock()/unlock() 
> business is actively misleading IMHO: the vector_lock looks local to 
> vector.c, while we have these two calls that essentially export it ...
> 
> Some of that damage was inflicted in this ancient commit:
> 
>   d388e5fdc461 ("x86: Restore proper vector locking during cpu hotplug")
> 
> the proper approach would be to protect against hotplug events in 
> assign_irq_vector(), not the other way around!
> 
> Probably not a functional problem, yet somewhat disgusting.
> 
> -------------------------------
> 
> So ... all in all, these various problems might be related to the 
> regression or not, but they sure need addressing.
> 
> Thanks,
> 
> 	Ingo
> 

I've merged the below patch with Linus' patch here:
https://lkml.org/lkml/2015/3/31/955

This was tested only on the L1, so I can put this on the L0 host and run
this as well. The results:

[  124.897002] apic: vector c1, new-domain move in progress                    
[  124.954827] apic: vector d1, sent cleanup vector, move completed            
[  163.477270] apic: vector d1, new-domain move in progress                    
[  164.041938] apic: vector e1, sent cleanup vector, move completed            
[  213.466971] apic: vector e1, new-domain move in progress                    
[  213.775639] apic: vector 22, sent cleanup vector, move completed            
[  365.996747] apic: vector 22, new-domain move in progress                    
[  366.011136] apic: vector 42, sent cleanup vector, move completed            
[  393.836032] apic: vector 42, new-domain move in progress                    
[  393.837727] apic: vector 52, sent cleanup vector, move completed            
[  454.977514] apic: vector 52, new-domain move in progress                    
[  454.978880] apic: vector 62, sent cleanup vector, move completed            
[  467.055730] apic: vector 62, new-domain move in progress                    
[  467.058129] apic: vector 72, sent cleanup vector, move completed            
[  545.280125] apic: vector 72, new-domain move in progress                    
[  545.282801] apic: vector 82, sent cleanup vector, move completed            
[  567.631652] apic: vector 82, new-domain move in progress                    
[  567.632207] apic: vector 92, sent cleanup vector, move completed            
[  628.940638] apic: vector 92, new-domain move in progress                    
[  628.965274] apic: vector a2, sent cleanup vector, move completed            
[  635.187433] apic: vector a2, new-domain move in progress                    
[  635.191643] apic: vector b2, sent cleanup vector, move completed            
[  673.548020] apic: vector b2, new-domain move in progress                    
[  673.553843] apic: vector c2, sent cleanup vector, move completed            
[  688.221906] apic: vector c2, new-domain move in progress                    
[  688.229487] apic: vector d2, sent cleanup vector, move completed            
[  723.818916] apic: vector d2, new-domain move in progress                    
[  723.828970] apic: vector e2, sent cleanup vector, move completed            
[  733.485435] apic: vector e2, new-domain move in progress                    
[  733.615007] apic: vector 23, sent cleanup vector, move completed            
[  824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26] 

Thanks,
--chris j arges

> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..1dd1de9dd690 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>  			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
>  			cfg->move_in_progress =
>  			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +			if (cfg->move_in_progress)
> +				pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
>  			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
>  			break;
>  		}
> @@ -178,6 +180,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>  			cpumask_copy(cfg->old_domain, cfg->domain);
>  			cfg->move_in_progress =
>  			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +			if (cfg->move_in_progress)
> +				pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector);
>  		}
>  		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
>  			per_cpu(vector_irq, new_cpu)[vector] = irq;
> @@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg)
>  		}
>  	}
>  	cfg->move_in_progress = 0;
> +	pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector);
>  	raw_spin_unlock_irqrestore(&vector_lock, flags);
>  }
>  
> @@ -391,6 +396,7 @@ void send_cleanup_vector(struct irq_cfg *cfg)
>  		free_cpumask_var(cleanup_mask);
>  	}
>  	cfg->move_in_progress = 0;
> +	pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector);
>  }
>  
>  asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index c0a1100d911f..18300a7b8ed2 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -250,8 +250,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
>  	desc->affinity_hint = m;
>  	irq_put_desc_unlock(desc, flags);
>  	/* set the initial affinity to prevent every interrupt being on CPU0 */
> -	if (m)
> +	if (m) {
> +		pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq);
>  		__irq_set_affinity(irq, m, false);
> +	}
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..833a981c5420 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
>  
>  void apic_ack_edge(struct irq_data *data)
>  {
> +	ack_APIC_irq();
> +
> +	/* Might generate IPIs, so do this after having ACKed the APIC: */
>  	irq_complete_move(irqd_cfg(data));
>  	irq_move_irq(data);
> -	ack_APIC_irq();
>  }
>  
>  /*
> 


* [debug PATCHes] Re: smp_call_function_single lockups
  2015-03-31  4:28                 ` Linus Torvalds
@ 2015-03-31 10:56                   ` Ingo Molnar
  2015-03-31 22:38                     ` Chris J Arges
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2015-03-31 10:56 UTC
  To: Linus Torvalds
  Cc: Chris J Arges, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Ok, interesting. So the whole "we try to do an APIC ACK with the ISR 
> bit clear" seems to be a real issue.

It's interesting in particular when it happens with an edge-triggered 
interrupt source: it's much harder to miss level triggered IRQs, which 
stay around until actively handled. Edge triggered irqs are more 
fragile to loss of event processing.

> > Anyway, maybe this sheds some more light on this issue. I can 
> > reproduce this at will, so let me know of other experiments to do.

Btw., could you please describe (again) what your current best method 
for reproduction is? It's been a long discussion ...

> Somebody else who knows the apic needs to also take a look, but I'd 
> love to hear what the actual hardware interrupt is (from that 
> "vector_irq[vector]" thing above).
> 
> I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they 
> are 0xf0-0xff), so it sounds like an external one. But that then 
> requires the whole mapping table thing.
> 
> Ingo/Peter/Jiang - is there anything else useful we could print out? 
> I worry about the irq movement code. Can we add printk's to when an 
> irq is chasing from one CPU to another and doing that 
> "move_in_progress" thing? I've always been scared of that code.

1)

So the irq_cfg::move_in_progress field originates from:

  610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")

and I agree that it looks fragile, so it would be nice to see how much 
(if at all?) it gets used, by sticking a few pr_info()s in it.

Debug patch for the vector movement state machine attached below. 
Untested.

2)

But ... having taken a quick look at the vector movement handling, I 
am scared more than ever, and I cannot convince myself that it's race 
free. It's possibly harmless, but still, famous last words ...

For example, most ->move_in_progress transitions happen with the 
vector_lock held - with the exception of send_cleanup_vector(): there 
we are only holding the desc->lock, but that's not enough! For 
example, this codepath:

static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
{
        unsigned me;

        if (likely(!cfg->move_in_progress))
                return;

        me = smp_processor_id();

        if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
                send_cleanup_vector(cfg);

...

void send_cleanup_vector(struct irq_cfg *cfg)
{
...

        cfg->move_in_progress = 0;
}

is blatantly racy, unless I missed something obvious?

2b)

Initially I thought this only affects slowpaths like driver unregister 
or CPU hotplug, but I think this would be relevant for the fastpath of 
edge triggered irqs as well:

void apic_ack_edge(struct irq_data *data)
{
        irq_complete_move(irqd_cfg(data));
        irq_move_irq(data);
        ack_APIC_irq();
}

and irq_complete_move() checks and clears cfg->move_in_progress as 
listed above.

So in most scenarios this is probably harmless, because it can in the 
worst case result in the missing of a ->move_in_progress flag setting.

But it does not look harmless in the apic_set_affinity() code path: 
there we call into assign_irq_vector() without the desc->lock held, 
due to this gem in the IRQ core:

int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
{
        unsigned long flags;
        struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);

        if (!desc)
                return -EINVAL;
        desc->affinity_hint = m;
        irq_put_desc_unlock(desc, flags);
        /* set the initial affinity to prevent every interrupt being on CPU0 */
        if (m)
                __irq_set_affinity(irq, m, false);
        return 0;
}

argh!

Now if this ever crosses paths with an assign_irq_vector() call on 
another CPU, then I really cannot see what would save us from deep 
trouble: we use cfg->vector while that is possibly being modified, 
right?

So I'd suggest to put a printk into irq_set_affinity_hint() as well, 
to make sure it's not used during this test. In particular:

drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, NULL);
drivers/virtio/virtio_pci_common.c:                     irq_set_affinity_hint(irq, mask);

might be triggering it in the virtualization code...

The second patch below adds the relevant pr_info(). (untested)

Now this too might be unrelated, because the affinity hint was added 
ages ago, in:

e7a297b0d7d6 ("genirq: Add CPU mask affinity hint")

and the potentially racy nature of calling into set_affinity without 
the desc lock held was probably not realized back then.

VirtIO's use of the affinity-hint was added a while ago as well, four 
years ago, in:

  75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue")

2c)

The other thing that worries me here is that we apparently send an 
IRQ-move IPI while having an un-acked local APIC (!). I have vague 
memories that this was problematic and fragile before.

To ease my worries in this area I've attached a third patch below that 
changes the order around.

This too is ancient behavior - but might be more prominent on certain 
hardware (and virtualization) variants, so it has a chance to be 
relevant.

3)

Furthermore, I also noticed that the whole vector_lock()/unlock() 
business is actively misleading IMHO: the vector_lock looks local to 
vector.c, while we have these two calls that essentially export it ...

Some of that damage was inflicted in this ancient commit:

  d388e5fdc461 ("x86: Restore proper vector locking during cpu hotplug")

the proper approach would be to protect against hotplug events in 
assign_irq_vector(), not the other way around!

Probably not a functional problem, yet somewhat disgusting.

-------------------------------

So ... all in all, these various problems might be related to the 
regression or not, but they sure need addressing.

Thanks,

	Ingo

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..1dd1de9dd690 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
 			cpumask_and(cfg->domain, cfg->domain, tmp_mask);
 			break;
 		}
@@ -178,6 +180,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 			cpumask_copy(cfg->old_domain, cfg->domain);
 			cfg->move_in_progress =
 			   cpumask_intersects(cfg->old_domain, cpu_online_mask);
+			if (cfg->move_in_progress)
+				pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector);
 		}
 		for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq;
@@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg)
 		}
 	}
 	cfg->move_in_progress = 0;
+	pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector);
 	raw_spin_unlock_irqrestore(&vector_lock, flags);
 }
 
@@ -391,6 +396,7 @@ void send_cleanup_vector(struct irq_cfg *cfg)
 		free_cpumask_var(cleanup_mask);
 	}
 	cfg->move_in_progress = 0;
+	pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector);
 }
 
 asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index c0a1100d911f..18300a7b8ed2 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -250,8 +250,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
 	desc->affinity_hint = m;
 	irq_put_desc_unlock(desc, flags);
 	/* set the initial affinity to prevent every interrupt being on CPU0 */
-	if (m)
+	if (m) {
+		pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq);
 		__irq_set_affinity(irq, m, false);
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(irq_set_affinity_hint);

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..833a981c5420 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
 
 void apic_ack_edge(struct irq_data *data)
 {
+	ack_APIC_irq();
+
+	/* Might generate IPIs, so do this after having ACKed the APIC: */
 	irq_complete_move(irqd_cfg(data));
 	irq_move_irq(data);
-	ack_APIC_irq();
 }
 
 /*


Thread overview:
2015-04-01  2:00 [debug PATCHes] Re: smp_call_function_single lockups Daniel J Blueman
2015-02-19 15:42 Rafael David Tinoco
2015-02-19 16:26 ` Linus Torvalds
2015-02-19 16:32   ` Rafael David Tinoco
2015-02-19 16:59     ` Linus Torvalds
2015-02-19 17:39       ` Linus Torvalds
2015-02-19 20:29         ` Linus Torvalds
2015-02-19 21:59           ` Linus Torvalds
2015-02-19 22:45             ` Linus Torvalds
2015-03-31  3:15               ` Chris J Arges
2015-03-31  4:28                 ` Linus Torvalds
2015-03-31 10:56                   ` [debug PATCHes] " Ingo Molnar
2015-03-31 22:38                     ` Chris J Arges
2015-04-01 12:39                       ` Ingo Molnar
2015-04-01 14:10                         ` Chris J Arges
2015-04-01 14:55                           ` Ingo Molnar
