All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 00/14] x86/irq: Plug various vector cleanup races
@ 2015-12-31 16:30 Thomas Gleixner
  2015-12-31 16:30 ` [patch 01/14] x86/irq: Fix a race in x86_vector_free_irqs() Thomas Gleixner
                   ` (14 more replies)
  0 siblings, 15 replies; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

Joe reported a nasty race in the vector cleanup code which results in stale
irq descriptors in the vector arrays and a potential use after free.

This series addresses this issue, another race which was found and fixed by
Jiang related to the same area, plus a cpu hotplug issue which is not related
to this.

This lot is intended for stable, but not yet marked so. The diffstat below is
rather large, but this is mostly due to extensive commentry which I added in
the process.

This is also available via git from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/apic

Thanks,

	tglx
---
 include/asm/irq.h    |    5 -
 kernel/apic/vector.c |  221 +++++++++++++++++++++++++++++++++------------------
 kernel/irq.c         |   11 ++
 3 files changed, 158 insertions(+), 79 deletions(-)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 01/14] x86/irq: Fix a race in x86_vector_free_irqs()
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2015-12-31 16:30 ` [patch 02/14] x86/irq: Validate that irq descriptor is still active Thomas Gleixner
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq-fix-a-race-window-in-x86_vector_free_irqs.patch --]
[-- Type: text/plain, Size: 2980 bytes --]

There's a race condition between

x86_vector_free_irqs()
{
	free_apic_chip_data(irq_data->chip_data);
	xxxxx	//irq_data->chip_data has been freed, but the pointer
		//hasn't been reset yet
	irq_domain_reset_irq_data(irq_data);
}

and 

smp_irq_move_cleanup_interrupt()
{
	raw_spin_lock(&vector_lock);
	data = apic_chip_data(irq_desc_get_irq_data(desc));
	access data->xxxx	// may access freed memory
	raw_spin_unlock(&desc->lock);
}

which may cause smp_irq_move_cleanup_interrupt() to access freed memory.

Call irq_domain_reset_irq_data(), which clears the pointer with vector lock
held.

[ tglx: Free memory outside of lock held region. ]

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Link: http://lkml.kernel.org/r/1450880014-11741-3-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -224,10 +224,8 @@ static int assign_irq_vector_policy(int
 static void clear_irq_vector(int irq, struct apic_chip_data *data)
 {
 	struct irq_desc *desc;
-	unsigned long flags;
 	int cpu, vector;
 
-	raw_spin_lock_irqsave(&vector_lock, flags);
 	BUG_ON(!data->cfg.vector);
 
 	vector = data->cfg.vector;
@@ -237,10 +235,8 @@ static void clear_irq_vector(int irq, st
 	data->cfg.vector = 0;
 	cpumask_clear(data->domain);
 
-	if (likely(!data->move_in_progress)) {
-		raw_spin_unlock_irqrestore(&vector_lock, flags);
+	if (likely(!data->move_in_progress))
 		return;
-	}
 
 	desc = irq_to_desc(irq);
 	for_each_cpu_and(cpu, data->old_domain, cpu_online_mask) {
@@ -253,7 +249,6 @@ static void clear_irq_vector(int irq, st
 		}
 	}
 	data->move_in_progress = 0;
-	raw_spin_unlock_irqrestore(&vector_lock, flags);
 }
 
 void init_irq_alloc_info(struct irq_alloc_info *info,
@@ -274,19 +269,24 @@ void copy_irq_alloc_info(struct irq_allo
 static void x86_vector_free_irqs(struct irq_domain *domain,
 				 unsigned int virq, unsigned int nr_irqs)
 {
+	struct apic_chip_data *apic_data;
 	struct irq_data *irq_data;
+	unsigned long flags;
 	int i;
 
 	for (i = 0; i < nr_irqs; i++) {
 		irq_data = irq_domain_get_irq_data(x86_vector_domain, virq + i);
 		if (irq_data && irq_data->chip_data) {
+			raw_spin_lock_irqsave(&vector_lock, flags);
 			clear_irq_vector(virq + i, irq_data->chip_data);
-			free_apic_chip_data(irq_data->chip_data);
+			apic_data = irq_data->chip_data;
+			irq_domain_reset_irq_data(irq_data);
+			raw_spin_unlock_irqrestore(&vector_lock, flags);
+			free_apic_chip_data(apic_data);
 #ifdef	CONFIG_X86_IO_APIC
 			if (virq + i < nr_legacy_irqs())
 				legacy_irq_data[virq + i] = NULL;
 #endif
-			irq_domain_reset_irq_data(irq_data);
 		}
 	}
 }



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 02/14] x86/irq: Validate that irq descriptor is still active
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
  2015-12-31 16:30 ` [patch 01/14] x86/irq: Fix a race in x86_vector_free_irqs() Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:16   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 04/14] x86/irq: Reorganize the return path in assign_irq_vector Thomas Gleixner
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Validate-that-irq-descriptor-is-still-active.patch --]
[-- Type: text/plain, Size: 773 bytes --]

In fixup_irqs() we unconditionally dereference the irq chip of an irq
descriptor. The descriptor might still be valid, but already cleaned up,
i.e. the chip removed. Add a check for this condition.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/irq.c |    9 +++++++++
 1 file changed, 9 insertions(+)

--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -470,6 +470,15 @@ void fixup_irqs(void)
 		}
 
 		chip = irq_data_get_irq_chip(data);
+		/*
+		 * The interrupt descriptor might have been cleaned up
+		 * already, but it is not yet removed from the radix tree
+		 */
+		if (!chip) {
+			raw_spin_unlock(&desc->lock);
+			continue;
+		}
+
 		if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
 			chip->irq_mask(data);
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 04/14] x86/irq: Reorganize the return path in assign_irq_vector
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
  2015-12-31 16:30 ` [patch 01/14] x86/irq: Fix a race in x86_vector_free_irqs() Thomas Gleixner
  2015-12-31 16:30 ` [patch 02/14] x86/irq: Validate that irq descriptor is still active Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:17   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 03/14] x86/irq: Do not use apic_chip_data.old_domain as temporary buffer Thomas Gleixner
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Reorganize-return-path-in-assign_irq_vector.patch --]
[-- Type: text/plain, Size: 2208 bytes --]

Use an explicit goto for the cases where we have success in the search/update
and return -ENOSPC if the search loop ends due to no space.

Preparatory patch for fixes. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -116,13 +116,12 @@ static int __assign_irq_vector(int irq,
 	 */
 	static int current_vector = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
 	static int current_offset = VECTOR_OFFSET_START % 16;
-	int cpu, err;
+	int cpu;
 
 	if (d->move_in_progress)
 		return -EBUSY;
 
 	/* Only try and allocate irqs on cpus that are present */
-	err = -ENOSPC;
 	cpumask_clear(d->old_domain);
 	cpumask_clear(searched_cpumask);
 	cpu = cpumask_first_and(mask, cpu_online_mask);
@@ -132,9 +131,8 @@ static int __assign_irq_vector(int irq,
 		apic->vector_allocation_domain(cpu, vector_cpumask, mask);
 
 		if (cpumask_subset(vector_cpumask, d->domain)) {
-			err = 0;
 			if (cpumask_equal(vector_cpumask, d->domain))
-				break;
+				goto success;
 			/*
 			 * New cpumask using the vector is a proper subset of
 			 * the current in use mask. So cleanup the vector
@@ -145,7 +143,7 @@ static int __assign_irq_vector(int irq,
 			d->move_in_progress =
 			   cpumask_intersects(d->old_domain, cpu_online_mask);
 			cpumask_and(d->domain, d->domain, vector_cpumask);
-			break;
+			goto success;
 		}
 
 		vector = current_vector;
@@ -185,17 +183,13 @@ static int __assign_irq_vector(int irq,
 			per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
 		d->cfg.vector = vector;
 		cpumask_copy(d->domain, vector_cpumask);
-		err = 0;
-		break;
+		goto success;
 	}
+	return -ENOSPC;
 
-	if (!err) {
-		/* cache destination APIC IDs into cfg->dest_apicid */
-		err = apic->cpu_mask_to_apicid_and(mask, d->domain,
-						   &d->cfg.dest_apicid);
-	}
-
-	return err;
+success:
+	/* cache destination APIC IDs into cfg->dest_apicid */
+	return apic->cpu_mask_to_apicid_and(mask, d->domain, &d->cfg.dest_apicid);
 }
 
 static int assign_irq_vector(int irq, struct apic_chip_data *data,



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 03/14] x86/irq: Do not use apic_chip_data.old_domain as temporary buffer
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (2 preceding siblings ...)
  2015-12-31 16:30 ` [patch 04/14] x86/irq: Reorganize the return path in assign_irq_vector Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2015-12-31 16:30 ` [patch 05/14] x86/irq: Reorganize the search in assign_irq_vector Thomas Gleixner
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq-do-not-reuse-struct-apic_chip_data-old_domain-as-temporary-buffer.patch --]
[-- Type: text/plain, Size: 2246 bytes --]

Function __assign_irq_vector() makes use of apic_chip_data.old_domain as a
temporary buffer, which is in the way of using apic_chip_data.old_domain for
synchronizing the vector cleanup with the vector assignement code.

Use a proper temporary cpumask for this.

[ tglx: Renamed the mask to searched_cpumask for clarity ]

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Link: http://lkml.kernel.org/r/1450880014-11741-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -30,7 +30,7 @@ struct apic_chip_data {
 
 struct irq_domain *x86_vector_domain;
 static DEFINE_RAW_SPINLOCK(vector_lock);
-static cpumask_var_t vector_cpumask;
+static cpumask_var_t vector_cpumask, searched_cpumask;
 static struct irq_chip lapic_controller;
 #ifdef	CONFIG_X86_IO_APIC
 static struct apic_chip_data *legacy_irq_data[NR_IRQS_LEGACY];
@@ -124,6 +124,7 @@ static int __assign_irq_vector(int irq,
 	/* Only try and allocate irqs on cpus that are present */
 	err = -ENOSPC;
 	cpumask_clear(d->old_domain);
+	cpumask_clear(searched_cpumask);
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, vector, offset;
@@ -157,9 +158,9 @@ static int __assign_irq_vector(int irq,
 		}
 
 		if (unlikely(current_vector == vector)) {
-			cpumask_or(d->old_domain, d->old_domain,
+			cpumask_or(searched_cpumask, searched_cpumask,
 				   vector_cpumask);
-			cpumask_andnot(vector_cpumask, mask, d->old_domain);
+			cpumask_andnot(vector_cpumask, mask, searched_cpumask);
 			cpu = cpumask_first_and(vector_cpumask,
 						cpu_online_mask);
 			continue;
@@ -404,6 +405,7 @@ int __init arch_early_irq_init(void)
 	arch_init_htirq_domain(x86_vector_domain);
 
 	BUG_ON(!alloc_cpumask_var(&vector_cpumask, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&searched_cpumask, GFP_KERNEL));
 
 	return arch_early_ioapic_init();
 }



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 05/14] x86/irq: Reorganize the search in assign_irq_vector
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (3 preceding siblings ...)
  2015-12-31 16:30 ` [patch 03/14] x86/irq: Do not use apic_chip_data.old_domain as temporary buffer Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:17   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 06/14] x86/irq: Check vector allocation early Thomas Gleixner
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Reorganize-the-search-in-assign_irq_vector.patch --]
[-- Type: text/plain, Size: 1716 bytes --]

Split out the code which advances the target cpu for the search so we can
reuse it for the next patch which adds an early validation check for the
vectormask which we get from the apic.

Add comments while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -155,14 +155,9 @@ static int __assign_irq_vector(int irq,
 			vector = FIRST_EXTERNAL_VECTOR + offset;
 		}
 
-		if (unlikely(current_vector == vector)) {
-			cpumask_or(searched_cpumask, searched_cpumask,
-				   vector_cpumask);
-			cpumask_andnot(vector_cpumask, mask, searched_cpumask);
-			cpu = cpumask_first_and(vector_cpumask,
-						cpu_online_mask);
-			continue;
-		}
+		/* If the search wrapped around, try the next cpu */
+		if (unlikely(current_vector == vector))
+			goto next_cpu;
 
 		if (test_bit(vector, used_vectors))
 			goto next;
@@ -184,6 +179,19 @@ static int __assign_irq_vector(int irq,
 		d->cfg.vector = vector;
 		cpumask_copy(d->domain, vector_cpumask);
 		goto success;
+
+next_cpu:
+		/*
+		 * We exclude the current @vector_cpumask from the requested
+		 * @mask and try again with the next online cpu in the
+		 * result. We cannot modify @mask, so we use @vector_cpumask
+		 * as a temporary buffer here as it will be reassigned when
+		 * calling apic->vector_allocation_domain() above.
+		 */
+		cpumask_or(searched_cpumask, searched_cpumask, vector_cpumask);
+		cpumask_andnot(vector_cpumask, mask, searched_cpumask);
+		cpu = cpumask_first_and(vector_cpumask, cpu_online_mask);
+		continue;
 	}
 	return -ENOSPC;
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 06/14] x86/irq: Check vector allocation early
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (4 preceding siblings ...)
  2015-12-31 16:30 ` [patch 05/14] x86/irq: Reorganize the search in assign_irq_vector Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:17   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 08/14] x86/irq: Get rid of code duplication Thomas Gleixner
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Check-vector-allocation-early.patch --]
[-- Type: text/plain, Size: 4099 bytes --]

__assign_irq_vector() uses the vector_cpumask which is assigned by
apic->vector_allocation_domain() without doing basic sanity checks. That can
result in a situation where the final assignement of a newly found vector
fails in apic->cpu_mask_to_apicid_and(). So we have to do rollbacks for no
reason.

apic->cpu_mask_to_apicid_and() only fails if 

  vector_cpumask & requested_cpumask & cpu_online_mask 

is empty.

Check for this condition right away and if the result is empty try immediately
the next possible cpu in the requested mask. So in case of a failure the old
setting is unchanged and we can remove the rollback code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -30,7 +30,7 @@ struct apic_chip_data {
 
 struct irq_domain *x86_vector_domain;
 static DEFINE_RAW_SPINLOCK(vector_lock);
-static cpumask_var_t vector_cpumask, searched_cpumask;
+static cpumask_var_t vector_cpumask, vector_searchmask, searched_cpumask;
 static struct irq_chip lapic_controller;
 #ifdef	CONFIG_X86_IO_APIC
 static struct apic_chip_data *legacy_irq_data[NR_IRQS_LEGACY];
@@ -128,8 +128,20 @@ static int __assign_irq_vector(int irq,
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, vector, offset;
 
+		/* Get the possible target cpus for @mask/@cpu from the apic */
 		apic->vector_allocation_domain(cpu, vector_cpumask, mask);
 
+		/*
+		 * Clear the offline cpus from @vector_cpumask for searching
+		 * and verify whether the result overlaps with @mask. If true,
+		 * then the call to apic->cpu_mask_to_apicid_and() will
+		 * succeed as well. If not, no point in trying to find a
+		 * vector in this mask.
+		 */
+		cpumask_and(vector_searchmask, vector_cpumask, cpu_online_mask);
+		if (!cpumask_intersects(vector_searchmask, mask))
+			goto next_cpu;
+
 		if (cpumask_subset(vector_cpumask, d->domain)) {
 			if (cpumask_equal(vector_cpumask, d->domain))
 				goto success;
@@ -162,7 +174,7 @@ static int __assign_irq_vector(int irq,
 		if (test_bit(vector, used_vectors))
 			goto next;
 
-		for_each_cpu_and(new_cpu, vector_cpumask, cpu_online_mask) {
+		for_each_cpu(new_cpu, vector_searchmask) {
 			if (!IS_ERR_OR_NULL(per_cpu(vector_irq, new_cpu)[vector]))
 				goto next;
 		}
@@ -174,7 +186,7 @@ static int __assign_irq_vector(int irq,
 			d->move_in_progress =
 			   cpumask_intersects(d->old_domain, cpu_online_mask);
 		}
-		for_each_cpu_and(new_cpu, vector_cpumask, cpu_online_mask)
+		for_each_cpu(new_cpu, vector_searchmask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
 		d->cfg.vector = vector;
 		cpumask_copy(d->domain, vector_cpumask);
@@ -196,8 +208,14 @@ static int __assign_irq_vector(int irq,
 	return -ENOSPC;
 
 success:
-	/* cache destination APIC IDs into cfg->dest_apicid */
-	return apic->cpu_mask_to_apicid_and(mask, d->domain, &d->cfg.dest_apicid);
+	/*
+	 * Cache destination APIC IDs into cfg->dest_apicid. This cannot fail
+	 * as we already established, that mask & d->domain & cpu_online_mask
+	 * is not empty.
+	 */
+	BUG_ON(apic->cpu_mask_to_apicid_and(mask, d->domain,
+					    &d->cfg.dest_apicid));
+	return 0;
 }
 
 static int assign_irq_vector(int irq, struct apic_chip_data *data,
@@ -407,6 +425,7 @@ int __init arch_early_irq_init(void)
 	arch_init_htirq_domain(x86_vector_domain);
 
 	BUG_ON(!alloc_cpumask_var(&vector_cpumask, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&vector_searchmask, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&searched_cpumask, GFP_KERNEL));
 
 	return arch_early_ioapic_init();
@@ -496,14 +515,7 @@ static int apic_set_affinity(struct irq_
 		return -EINVAL;
 
 	err = assign_irq_vector(irq, data, dest);
-	if (err) {
-		if (assign_irq_vector(irq, data,
-				      irq_data_get_affinity_mask(irq_data)))
-			pr_err("Failed to recover vector for irq %d\n", irq);
-		return err;
-	}
-
-	return IRQ_SET_MASK_OK;
+	return err ? err : IRQ_SET_MASK_OK;
 }
 
 static struct irq_chip lapic_controller = {



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 07/14] x86/irq: Copy vectormask instead of an AND operation
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (6 preceding siblings ...)
  2015-12-31 16:30 ` [patch 08/14] x86/irq: Get rid of code duplication Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:18   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 09/14] x86/irq: Remove offline cpus from vector cleanup Thomas Gleixner
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Copy-vectormask-instead-of-an-AND-operation.patch --]
[-- Type: text/plain, Size: 799 bytes --]

In the case that the new vector mask is a subset of the existing mask there is
no point to do a AND operation of currentmask & newmask. The result is
newmask. So we can simply copy the new mask to the current mask and be done
with it. Preparatory patch for further consolidation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -154,7 +154,7 @@ static int __assign_irq_vector(int irq,
 				       vector_cpumask);
 			d->move_in_progress =
 			   cpumask_intersects(d->old_domain, cpu_online_mask);
-			cpumask_and(d->domain, d->domain, vector_cpumask);
+			cpumask_copy(d->domain, vector_cpumask);
 			goto success;
 		}
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 08/14] x86/irq: Get rid of code duplication
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (5 preceding siblings ...)
  2015-12-31 16:30 ` [patch 06/14] x86/irq: Check vector allocation early Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:18   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 07/14] x86/irq: Copy vectormask instead of an AND operation Thomas Gleixner
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Get-rid-of-code-duplication.patch --]
[-- Type: text/plain, Size: 2767 bytes --]

Reusing an existing vector and assigning a new vector has duplicated
code. Consolidate it.

This is also a preparatory patch for finally plugging the cleanup race.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -116,7 +116,7 @@ static int __assign_irq_vector(int irq,
 	 */
 	static int current_vector = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
 	static int current_offset = VECTOR_OFFSET_START % 16;
-	int cpu;
+	int cpu, vector;
 
 	if (d->move_in_progress)
 		return -EBUSY;
@@ -126,7 +126,7 @@ static int __assign_irq_vector(int irq,
 	cpumask_clear(searched_cpumask);
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	while (cpu < nr_cpu_ids) {
-		int new_cpu, vector, offset;
+		int new_cpu, offset;
 
 		/* Get the possible target cpus for @mask/@cpu from the apic */
 		apic->vector_allocation_domain(cpu, vector_cpumask, mask);
@@ -146,16 +146,12 @@ static int __assign_irq_vector(int irq,
 			if (cpumask_equal(vector_cpumask, d->domain))
 				goto success;
 			/*
-			 * New cpumask using the vector is a proper subset of
-			 * the current in use mask. So cleanup the vector
-			 * allocation for the members that are not used anymore.
+			 * Mark the cpus which are not longer in the mask for
+			 * cleanup.
 			 */
-			cpumask_andnot(d->old_domain, d->domain,
-				       vector_cpumask);
-			d->move_in_progress =
-			   cpumask_intersects(d->old_domain, cpu_online_mask);
-			cpumask_copy(d->domain, vector_cpumask);
-			goto success;
+			cpumask_andnot(d->old_domain, d->domain, vector_cpumask);
+			vector = d->cfg.vector;
+			goto update;
 		}
 
 		vector = current_vector;
@@ -181,16 +177,12 @@ static int __assign_irq_vector(int irq,
 		/* Found one! */
 		current_vector = vector;
 		current_offset = offset;
-		if (d->cfg.vector) {
+		/* Schedule the old vector for cleanup on all cpus */
+		if (d->cfg.vector)
 			cpumask_copy(d->old_domain, d->domain);
-			d->move_in_progress =
-			   cpumask_intersects(d->old_domain, cpu_online_mask);
-		}
 		for_each_cpu(new_cpu, vector_searchmask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
-		d->cfg.vector = vector;
-		cpumask_copy(d->domain, vector_cpumask);
-		goto success;
+		goto update;
 
 next_cpu:
 		/*
@@ -207,6 +199,11 @@ static int __assign_irq_vector(int irq,
 	}
 	return -ENOSPC;
 
+update:
+	/* Cleanup required ? */
+	d->move_in_progress = cpumask_intersects(d->old_domain, cpu_online_mask);
+	d->cfg.vector = vector;
+	cpumask_copy(d->domain, vector_cpumask);
 success:
 	/*
 	 * Cache destination APIC IDs into cfg->dest_apicid. This cannot fail



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 09/14] x86/irq: Remove offline cpus from vector cleanup
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (7 preceding siblings ...)
  2015-12-31 16:30 ` [patch 07/14] x86/irq: Copy vectormask instead of an AND operation Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:18   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 10/14] x86/irq: Clear move_in_progress before sending cleanup IPI Thomas Gleixner
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Remove-offline-cpus-from-vector-cleanup.patch --]
[-- Type: text/plain, Size: 799 bytes --]

No point of keeping offline cpus in the cleanup mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -200,8 +200,12 @@ static int __assign_irq_vector(int irq,
 	return -ENOSPC;
 
 update:
-	/* Cleanup required ? */
-	d->move_in_progress = cpumask_intersects(d->old_domain, cpu_online_mask);
+	/*
+	 * Exclude offline cpus from the cleanup mask and set the
+	 * move_in_progress flag when the result is not empty.
+	 */
+	cpumask_and(d->old_domain, d->old_domain, cpu_online_mask);
+	d->move_in_progress = !cpumask_empty(d->old_domain);
 	d->cfg.vector = vector;
 	cpumask_copy(d->domain, vector_cpumask);
 success:



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 10/14] x86/irq: Clear move_in_progress before sending cleanup IPI
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (8 preceding siblings ...)
  2015-12-31 16:30 ` [patch 09/14] x86/irq: Remove offline cpus from vector cleanup Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:19   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 12/14] x86/irq: Remove outgoing CPU from vector cleanup mask Thomas Gleixner
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Protect-send_cleanup_vector-with-vector_lock.patch --]
[-- Type: text/plain, Size: 1989 bytes --]

send_cleanup_vector() fiddles with the old_domain mask unprotected because it
relies on the protection by the move_in_progress flag. But this is fatal, as
the flag is reset after the IPI has been sent. So a cpu which receives the IPI
can still see the flag set and therefor ignores the cleanup request. If no
other cleanup request happens then the vector stays stale on that cpu and in
case of an irq removal the vector still persists. That can lead to use after
free when the next cleanup IPI happens.

Protect the code with vector_lock and clear move_in_progress before sending
the IPI.

This does not plug the race which Joe reported because:

CPU0                          CPU1                      CPU2
lock_vector()
data->move_in_progress=0
sendIPI()                       
unlock_vector()
                              set_affinity()
                              assign_irq_vector()
                              lock_vector()             handle_IPI
                              move_in_progress = 1      lock_vector()
                              unlock_vector()
                                                        move_in_progress == 1

The full fix comes with a later patch.

Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -530,6 +530,8 @@ static void __send_cleanup_vector(struct
 {
 	cpumask_var_t cleanup_mask;
 
+	raw_spin_lock(&vector_lock);
+	data->move_in_progress = 0;
 	if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
 		unsigned int i;
 
@@ -541,7 +543,7 @@ static void __send_cleanup_vector(struct
 		apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
 		free_cpumask_var(cleanup_mask);
 	}
-	data->move_in_progress = 0;
+	raw_spin_unlock(&vector_lock);
 }
 
 void send_cleanup_vector(struct irq_cfg *cfg)



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 11/14] x86/irq: Remove the cpumask allocation from send_cleanup_vector()
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (10 preceding siblings ...)
  2015-12-31 16:30 ` [patch 12/14] x86/irq: Remove outgoing CPU from vector cleanup mask Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:19   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 13/14] x86/irq: Call irq_force_move_complete with irq descriptor Thomas Gleixner
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Remove-the-cpumask-allocation-from-send_cleanup_vector.patch --]
[-- Type: text/plain, Size: 1311 bytes --]

There is no need to allocate a new cpumask for sending the cleanup vector. The
old_domain mask is now protected by the vector_lock, so we can safely remove
the offline cpus from it and send the IPI with the resulting mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -528,21 +528,11 @@ static struct irq_chip lapic_controller
 #ifdef CONFIG_SMP
 static void __send_cleanup_vector(struct apic_chip_data *data)
 {
-	cpumask_var_t cleanup_mask;
-
 	raw_spin_lock(&vector_lock);
+	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
 	data->move_in_progress = 0;
-	if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
-		unsigned int i;
-
-		for_each_cpu_and(i, data->old_domain, cpu_online_mask)
-			apic->send_IPI_mask(cpumask_of(i),
-					    IRQ_MOVE_CLEANUP_VECTOR);
-	} else {
-		cpumask_and(cleanup_mask, data->old_domain, cpu_online_mask);
-		apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
-		free_cpumask_var(cleanup_mask);
-	}
+	if (!cpumask_empty(data->old_domain))
+		apic->send_IPI_mask(data->old_domain, IRQ_MOVE_CLEANUP_VECTOR);
 	raw_spin_unlock(&vector_lock);
 }
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 12/14] x86/irq: Remove outgoing CPU from vector cleanup mask
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (9 preceding siblings ...)
  2015-12-31 16:30 ` [patch 10/14] x86/irq: Clear move_in_progress before sending cleanup IPI Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:19   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 11/14] x86/irq: Remove the cpumask allocation from send_cleanup_vector() Thomas Gleixner
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Remove-outgoing-CPU-from-vector-cleanup-mask.patch --]
[-- Type: text/plain, Size: 1089 bytes --]

We want to synchronize new vector assignments with a pending cleanup. Remove a
dying cpu from a pending cleanup mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -631,9 +631,23 @@ void irq_complete_move(struct irq_cfg *c
 void irq_force_complete_move(int irq)
 {
 	struct irq_cfg *cfg = irq_cfg(irq);
+	struct apic_chip_data *data;
 
-	if (cfg)
-		__irq_complete_move(cfg, cfg->vector);
+	if (!cfg)
+		return;
+
+	__irq_complete_move(cfg, cfg->vector);
+
+	/*
+	 * Remove this cpu from the cleanup mask. The IPI might have been sent
+	 * just before the cpu was removed from the offline mask, but has not
+	 * been processed because the CPU has interrupts disabled and is on
+	 * the way out.
+	 */
+	raw_spin_lock(&vector_lock);
+	data = container_of(cfg, struct apic_chip_data, cfg);
+	cpumask_clear_cpu(smp_processor_id(), data->old_domain);
+	raw_spin_unlock(&vector_lock);
 }
 #endif
 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 13/14] x86/irq: Call irq_force_move_complete with irq descriptor
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (11 preceding siblings ...)
  2015-12-31 16:30 ` [patch 11/14] x86/irq: Remove the cpumask allocation from send_cleanup_vector() Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:20   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2015-12-31 16:30 ` [patch 14/14] x86/irq: Plug vector cleanup race Thomas Gleixner
  2016-01-04 15:35 ` [patch 00/14] x86/irq: Plug various vector cleanup races Joe Lawrence
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Call-irq_force_move_complete-with-irq-descriptor.patch --]
[-- Type: text/plain, Size: 2469 bytes --]

First of all there is no point in looking up the irq descriptor again, but we
also need the descriptor for the final cleanup race fix in the next
patch. Make that change seperate. No functional difference.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/irq.h    |    5 +++--
 arch/x86/kernel/apic/vector.c |   11 +++++++----
 arch/x86/kernel/irq.c         |    2 +-
 3 files changed, 11 insertions(+), 7 deletions(-)

--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -23,11 +23,13 @@ extern void irq_ctx_init(int cpu);
 
 #define __ARCH_HAS_DO_SOFTIRQ
 
+struct irq_desc;
+
 #ifdef CONFIG_HOTPLUG_CPU
 #include <linux/cpumask.h>
 extern int check_irq_vectors_for_cpu_disable(void);
 extern void fixup_irqs(void);
-extern void irq_force_complete_move(int);
+extern void irq_force_complete_move(struct irq_desc *desc);
 #endif
 
 #ifdef CONFIG_HAVE_KVM
@@ -37,7 +39,6 @@ extern void kvm_set_posted_intr_wakeup_h
 extern void (*x86_platform_ipi_callback)(void);
 extern void native_init_IRQ(void);
 
-struct irq_desc;
 extern bool handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
 extern __visible unsigned int do_IRQ(struct pt_regs *regs);
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -628,10 +628,14 @@ void irq_complete_move(struct irq_cfg *c
 	__irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
 }
 
-void irq_force_complete_move(int irq)
+/*
+ * Called with @desc->lock held and interrupts disabled.
+ */
+void irq_force_complete_move(struct irq_desc *desc)
 {
-	struct irq_cfg *cfg = irq_cfg(irq);
-	struct apic_chip_data *data;
+	struct irq_data *irqdata = irq_desc_get_irq_data(desc);
+	struct apic_chip_data *data = apic_chip_data(irqdata);
+	struct irq_cfg *cfg = data ? &data->cfg : NULL;
 
 	if (!cfg)
 		return;
@@ -645,7 +649,6 @@ void irq_force_complete_move(int irq)
 	 * the way out.
 	 */
 	raw_spin_lock(&vector_lock);
-	data = container_of(cfg, struct apic_chip_data, cfg);
 	cpumask_clear_cpu(smp_processor_id(), data->old_domain);
 	raw_spin_unlock(&vector_lock);
 }
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -462,7 +462,7 @@ void fixup_irqs(void)
 		 * non intr-remapping case, we can't wait till this interrupt
 		 * arrives at this cpu before completing the irq move.
 		 */
-		irq_force_complete_move(irq);
+		irq_force_complete_move(desc);
 
 		if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
 			break_affinity = 1;



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch 14/14] x86/irq: Plug vector cleanup race
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (12 preceding siblings ...)
  2015-12-31 16:30 ` [patch 13/14] x86/irq: Call irq_force_move_complete with irq descriptor Thomas Gleixner
@ 2015-12-31 16:30 ` Thomas Gleixner
  2016-01-16 21:20   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
  2016-01-04 15:35 ` [patch 00/14] x86/irq: Plug various vector cleanup races Joe Lawrence
  14 siblings, 1 reply; 38+ messages in thread
From: Thomas Gleixner @ 2015-12-31 16:30 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Joe Lawrence,
	Jeremiah Mahler, Borislav Petkov, andy.shevchenko, Guenter Roeck

[-- Attachment #1: x86-irq--Plug-vector-race.patch --]
[-- Type: text/plain, Size: 5140 bytes --]

We still can end up with a stale vector due to the following:

CPU0                          CPU1                      CPU2
lock_vector()
data->move_in_progress=0
sendIPI()                       
unlock_vector()
                              set_affinity()
                              assign_irq_vector()
                              lock_vector()             handle_IPI
                              move_in_progress = 1      lock_vector()
                              unlock_vector()
                                                        move_in_progress == 1

So we need to serialize the vector assignment against a pending cleanup. The
solution is rather simple now. We not only check for the move_in_progress flag
in assign_irq_vector(), we also check whether there is still a cleanup pending
in the old_domain cpumask. If so, we return -EBUSY to the caller and let him
deal with it. Though we have to be careful in the cpu unplug case. If the
cleanout has not yet completed then the following setaffinity() call would
return -EBUSY. Add code which prevents this.

Full context is here: http://lkml.kernel.org/r/5653B688.4050809@stratus.com

Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c |   63 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -118,7 +118,12 @@ static int __assign_irq_vector(int irq,
 	static int current_offset = VECTOR_OFFSET_START % 16;
 	int cpu, vector;
 
-	if (d->move_in_progress)
+	/*
+	 * If there is still a move in progress or the previous move has not
+	 * been cleaned up completely, tell the caller to come back later.
+	 */
+	if (d->move_in_progress ||
+	    cpumask_intersects(d->old_domain, cpu_online_mask))
 		return -EBUSY;
 
 	/* Only try and allocate irqs on cpus that are present */
@@ -257,7 +262,12 @@ static void clear_irq_vector(int irq, st
 	data->cfg.vector = 0;
 	cpumask_clear(data->domain);
 
-	if (likely(!data->move_in_progress))
+	/*
+	 * If move is in progress or the old_domain mask is not empty,
+	 * i.e. the cleanup IPI has not been processed yet, we need to remove
+	 * the old references to desc from all cpus vector tables.
+	 */
+	if (!data->move_in_progress && cpumask_empty(data->old_domain))
 		return;
 
 	desc = irq_to_desc(irq);
@@ -577,12 +587,25 @@ asmlinkage __visible void smp_irq_move_c
 			goto unlock;
 
 		/*
-		 * Check if the irq migration is in progress. If so, we
-		 * haven't received the cleanup request yet for this irq.
+		 * Nothing to cleanup if irq migration is in progress
+		 * or this cpu is not set in the cleanup mask.
 		 */
-		if (data->move_in_progress)
+		if (data->move_in_progress ||
+		    !cpumask_test_cpu(me, data->old_domain))
 			goto unlock;
 
+		/*
+		 * We have two cases to handle here:
+		 * 1) vector is unchanged but the target mask got reduced
+		 * 2) vector and the target mask has changed
+		 *
+		 * #1 is obvious, but in #2 we have two vectors with the same
+		 * irq descriptor: the old and the new vector. So we need to
+		 * make sure that we only cleanup the old vector. The new
+		 * vector has the current @vector number in the config and
+		 * this cpu is part of the target mask. We better leave that
+		 * one alone.
+		 */
 		if (vector == data->cfg.vector &&
 		    cpumask_test_cpu(me, data->domain))
 			goto unlock;
@@ -600,6 +623,7 @@ asmlinkage __visible void smp_irq_move_c
 			goto unlock;
 		}
 		__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+		cpumask_clear_cpu(me, data->old_domain);
 unlock:
 		raw_spin_unlock(&desc->lock);
 	}
@@ -643,13 +667,32 @@ void irq_force_complete_move(struct irq_
 	__irq_complete_move(cfg, cfg->vector);
 
 	/*
-	 * Remove this cpu from the cleanup mask. The IPI might have been sent
-	 * just before the cpu was removed from the offline mask, but has not
-	 * been processed because the CPU has interrupts disabled and is on
-	 * the way out.
+	 * This is tricky. If the cleanup of @data->old_domain has not been
+	 * done yet, then the following setaffinity call will fail with
+	 * -EBUSY. This can leave the interrupt in a stale state.
+	 *
+	 * The cleanup cannot make progress because we hold @desc->lock. So in
+	 * case @data->old_domain is not yet cleaned up, we need to drop the
+	 * lock and acquire it again. @desc cannot go away, because the
+	 * hotplug code holds the sparse irq lock.
 	 */
 	raw_spin_lock(&vector_lock);
-	cpumask_clear_cpu(smp_processor_id(), data->old_domain);
+	/* Clean out all offline cpus (including ourself) first. */
+	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
+	while (!cpumask_empty(data->old_domain)) {
+		raw_spin_unlock(&vector_lock);
+		raw_spin_unlock(&desc->lock);
+		cpu_relax();
+		raw_spin_lock(&desc->lock);
+		/*
+		 * Reevaluate apic_chip_data. It might have been cleared after
+		 * we dropped @desc->lock.
+		 */
+		data = apic_chip_data(irqdata);
+		if (!data)
+			return;
+		raw_spin_lock(&vector_lock);
+	}
 	raw_spin_unlock(&vector_lock);
 }
 #endif



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
                   ` (13 preceding siblings ...)
  2015-12-31 16:30 ` [patch 14/14] x86/irq: Plug vector cleanup race Thomas Gleixner
@ 2016-01-04 15:35 ` Joe Lawrence
  2016-01-14  8:24   ` Thomas Gleixner
  14 siblings, 1 reply; 38+ messages in thread
From: Joe Lawrence @ 2016-01-04 15:35 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Ingo Molnar, Peter Anvin, Jiang Liu, Jeremiah Mahler,
	Borislav Petkov, andy.shevchenko, Guenter Roeck

On 12/31/2015 11:30 AM, Thomas Gleixner wrote:
> Joe reported a nasty race in the vector cleanup code which results in stale
> irq descriptors in the vector arrays and a potential use after free.
> 
> This series addresses this issue, another race which was found and fixed by
> Jiang related to the same area, plus a cpu hotplug issue which is not related
> to this.
> 
> This lot is intended for stable, but not yet marked so. The diffstat below is
> rather large, but this is mostly due to extensive commentry which I added in
> the process.

Hi Thomas,

No issues running the same PCI device removal and stress tests against
the patchset.

Thanks,

Tested-by: Joe Lawrence <joe.lawrence@stratus.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-04 15:35 ` [patch 00/14] x86/irq: Plug various vector cleanup races Joe Lawrence
@ 2016-01-14  8:24   ` Thomas Gleixner
  2016-01-14 10:33     ` Borislav Petkov
  2016-01-16 21:15     ` [tip:x86/urgent] x86/irq: Call chip-> irq_set_affinity in proper context tip-bot for Thomas Gleixner
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Gleixner @ 2016-01-14  8:24 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: LKML, Ingo Molnar, Peter Anvin, Jiang Liu, Jeremiah Mahler,
	Borislav Petkov, andy.shevchenko, Guenter Roeck

On Mon, 4 Jan 2016, Joe Lawrence wrote:
> No issues running the same PCI device removal and stress tests against
> the patchset.

Thanks for testing!

Though there is yet another long standing bug in that area. Fix below.

Thanks,

	tglx

8<--------------------

Subject: x86/irq: Call chip->irq_set_affinity in proper context
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 14 Jan 2016 08:43:38 +0100

setup_ioapic_dest() calls irqchip->irq_set_affinity() completely
unprotected. That's wrong in several aspects:

 - it triggers a lockdep splat because vector lock is taken with interrupts
   enabled.

 - it opens a race window where irq_set_affinity() can be interrupted and the
   irq chip left in unconsistent state.

The proper calling convention is irq descriptor lock held and interrupts
disabled.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/apic/io_apic.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2521,6 +2521,7 @@ void __init setup_ioapic_dest(void)
 {
 	int pin, ioapic, irq, irq_entry;
 	const struct cpumask *mask;
+	struct irq_desc *desc;
 	struct irq_data *idata;
 	struct irq_chip *chip;
 
@@ -2536,7 +2537,9 @@ void __init setup_ioapic_dest(void)
 		if (irq < 0 || !mp_init_irq_at_boot(ioapic, irq))
 			continue;
 
-		idata = irq_get_irq_data(irq);
+		desc = irq_to_desc(irq);
+		raw_spin_lock_irq(d&desc->lock);
+		idata = irq_desc_get_irq_data(desc);
 
 		/*
 		 * Honour affinities which have been set in early boot
@@ -2550,6 +2553,7 @@ void __init setup_ioapic_dest(void)
 		/* Might be lapic_chip for irq 0 */
 		if (chip->irq_set_affinity)
 			chip->irq_set_affinity(idata, mask, false);
+		raw_spin_unlock_irq(d&desc->lock);
 	}
 }
 #endif

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-14  8:24   ` Thomas Gleixner
@ 2016-01-14 10:33     ` Borislav Petkov
  2016-01-16 21:37       ` Joe Lawrence
  2016-01-16 21:15     ` [tip:x86/urgent] x86/irq: Call chip-> irq_set_affinity in proper context tip-bot for Thomas Gleixner
  1 sibling, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2016-01-14 10:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joe Lawrence, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On Thu, Jan 14, 2016 at 09:24:35AM +0100, Thomas Gleixner wrote:
> On Mon, 4 Jan 2016, Joe Lawrence wrote:
> > No issues running the same PCI device removal and stress tests against
> > the patchset.
> 
> Thanks for testing!
> 
> Though there is yet another long standing bug in that area. Fix below.
> 
> Thanks,
> 
> 	tglx
> 
> 8<--------------------
> 
> Subject: x86/irq: Call chip->irq_set_affinity in proper context
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 14 Jan 2016 08:43:38 +0100
> 
> setup_ioapic_dest() calls irqchip->irq_set_affinity() completely
> unprotected. That's wrong in several aspects:
> 
>  - it triggers a lockdep splat because vector lock is taken with interrupts
>    enabled.
> 
>  - it opens a race window where irq_set_affinity() can be interrupted and the
>    irq chip left in unconsistent state.
> 
> The proper calling convention is irq descriptor lock held and interrupts
> disabled.
> 
> Reported-by: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  arch/x86/kernel/apic/io_apic.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2521,6 +2521,7 @@ void __init setup_ioapic_dest(void)
>  {
>  	int pin, ioapic, irq, irq_entry;
>  	const struct cpumask *mask;
> +	struct irq_desc *desc;
>  	struct irq_data *idata;
>  	struct irq_chip *chip;
>  
> @@ -2536,7 +2537,9 @@ void __init setup_ioapic_dest(void)
>  		if (irq < 0 || !mp_init_irq_at_boot(ioapic, irq))
>  			continue;
>  
> -		idata = irq_get_irq_data(irq);
> +		desc = irq_to_desc(irq);
> +		raw_spin_lock_irq(d&desc->lock);

s/d//

> +		idata = irq_desc_get_irq_data(desc);
>  
>  		/*
>  		 * Honour affinities which have been set in early boot
> @@ -2550,6 +2553,7 @@ void __init setup_ioapic_dest(void)
>  		/* Might be lapic_chip for irq 0 */
>  		if (chip->irq_set_affinity)
>  			chip->irq_set_affinity(idata, mask, false);
> +		raw_spin_unlock_irq(d&desc->lock);

s/d//

With those micro-changes:

Tested-by: Borislav Petkov <bp@suse.de>

:-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Call chip-> irq_set_affinity in proper context
  2016-01-14  8:24   ` Thomas Gleixner
  2016-01-14 10:33     ` Borislav Petkov
@ 2016-01-16 21:15     ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, joe.lawrence, mingo, jmmahler, tglx, linux, bp,
	linux-kernel, jiang.liu

Commit-ID:  e23b257c293ce4bcc8cabb2aa3097b6ed8a8261a
Gitweb:     http://git.kernel.org/tip/e23b257c293ce4bcc8cabb2aa3097b6ed8a8261a
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 14 Jan 2016 08:43:38 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:43:58 +0100

x86/irq: Call chip->irq_set_affinity in proper context

setup_ioapic_dest() calls irqchip->irq_set_affinity() completely
unprotected. That's wrong in several aspects:

 - it opens a race window where irq_set_affinity() can be interrupted and the
   irq chip left in unconsistent state.

 - it triggers a lockdep splat when we fix the vector race for 4.3+ because
   vector lock is taken with interrupts enabled.

The proper calling convention is irq descriptor lock held and interrupts
disabled.

Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1601140919420.3575@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/io_apic.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index f253218..fdb0fbf 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2521,6 +2521,7 @@ void __init setup_ioapic_dest(void)
 {
 	int pin, ioapic, irq, irq_entry;
 	const struct cpumask *mask;
+	struct irq_desc *desc;
 	struct irq_data *idata;
 	struct irq_chip *chip;
 
@@ -2536,7 +2537,9 @@ void __init setup_ioapic_dest(void)
 		if (irq < 0 || !mp_init_irq_at_boot(ioapic, irq))
 			continue;
 
-		idata = irq_get_irq_data(irq);
+		desc = irq_to_desc(irq);
+		raw_spin_lock_irq(&desc->lock);
+		idata = irq_desc_get_irq_data(desc);
 
 		/*
 		 * Honour affinities which have been set in early boot
@@ -2550,6 +2553,7 @@ void __init setup_ioapic_dest(void)
 		/* Might be lapic_chip for irq 0 */
 		if (chip->irq_set_affinity)
 			chip->irq_set_affinity(idata, mask, false);
+		raw_spin_unlock_irq(&desc->lock);
 	}
 }
 #endif

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Validate that irq descriptor is still active
  2015-12-31 16:30 ` [patch 02/14] x86/irq: Validate that irq descriptor is still active Thomas Gleixner
@ 2016-01-16 21:16   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jmmahler, joe.lawrence, tglx, jiang.liu, linux-kernel, linux, bp,
	mingo, hpa

Commit-ID:  36f34c8c63da3e272fd66f91089228c22d2b6e8b
Gitweb:     http://git.kernel.org/tip/36f34c8c63da3e272fd66f91089228c22d2b6e8b
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:45 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:43:59 +0100

x86/irq: Validate that irq descriptor is still active

In fixup_irqs() we unconditionally dereference the irq chip of an irq
descriptor. The descriptor might still be valid, but already cleaned up,
i.e. the chip removed. Add a check for this condition.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.236423282@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/irq.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index f8062aa..c0b58dd 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -470,6 +470,15 @@ void fixup_irqs(void)
 		}
 
 		chip = irq_data_get_irq_chip(data);
+		/*
+		 * The interrupt descriptor might have been cleaned up
+		 * already, but it is not yet removed from the radix tree
+		 */
+		if (!chip) {
+			raw_spin_unlock(&desc->lock);
+			continue;
+		}
+
 		if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
 			chip->irq_mask(data);
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Reorganize the return path in assign_irq_vector
  2015-12-31 16:30 ` [patch 04/14] x86/irq: Reorganize the return path in assign_irq_vector Thomas Gleixner
@ 2016-01-16 21:17   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, joe.lawrence, tglx, jmmahler, bp, jiang.liu,
	hpa, linux

Commit-ID:  433cbd57d190a1cdd02f243df41c3d7f55ec4b94
Gitweb:     http://git.kernel.org/tip/433cbd57d190a1cdd02f243df41c3d7f55ec4b94
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:46 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:43:59 +0100

x86/irq: Reorganize the return path in assign_irq_vector

Use an explicit goto for the cases where we have success in the search/update
and return -ENOSPC if the search loop ends due to no space.

Preparatory patch for fixes. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.403491024@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 19082cf..613b1cd 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -118,13 +118,12 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	 */
 	static int current_vector = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
 	static int current_offset = VECTOR_OFFSET_START % 16;
-	int cpu, err;
+	int cpu;
 
 	if (d->move_in_progress)
 		return -EBUSY;
 
 	/* Only try and allocate irqs on cpus that are present */
-	err = -ENOSPC;
 	cpumask_clear(d->old_domain);
 	cpumask_clear(searched_cpumask);
 	cpu = cpumask_first_and(mask, cpu_online_mask);
@@ -134,9 +133,8 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 		apic->vector_allocation_domain(cpu, vector_cpumask, mask);
 
 		if (cpumask_subset(vector_cpumask, d->domain)) {
-			err = 0;
 			if (cpumask_equal(vector_cpumask, d->domain))
-				break;
+				goto success;
 			/*
 			 * New cpumask using the vector is a proper subset of
 			 * the current in use mask. So cleanup the vector
@@ -147,7 +145,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 			d->move_in_progress =
 			   cpumask_intersects(d->old_domain, cpu_online_mask);
 			cpumask_and(d->domain, d->domain, vector_cpumask);
-			break;
+			goto success;
 		}
 
 		vector = current_vector;
@@ -187,17 +185,13 @@ next:
 			per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
 		d->cfg.vector = vector;
 		cpumask_copy(d->domain, vector_cpumask);
-		err = 0;
-		break;
+		goto success;
 	}
+	return -ENOSPC;
 
-	if (!err) {
-		/* cache destination APIC IDs into cfg->dest_apicid */
-		err = apic->cpu_mask_to_apicid_and(mask, d->domain,
-						   &d->cfg.dest_apicid);
-	}
-
-	return err;
+success:
+	/* cache destination APIC IDs into cfg->dest_apicid */
+	return apic->cpu_mask_to_apicid_and(mask, d->domain, &d->cfg.dest_apicid);
 }
 
 static int assign_irq_vector(int irq, struct apic_chip_data *data,

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Reorganize the search in assign_irq_vector
  2015-12-31 16:30 ` [patch 05/14] x86/irq: Reorganize the search in assign_irq_vector Thomas Gleixner
@ 2016-01-16 21:17   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, jmmahler, linux, jiang.liu, linux-kernel, tglx, hpa,
	joe.lawrence, mingo

Commit-ID:  95ffeb4b5baca266e1d0d2bc90f1513e6f419cdd
Gitweb:     http://git.kernel.org/tip/95ffeb4b5baca266e1d0d2bc90f1513e6f419cdd
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:47 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:00 +0100

x86/irq: Reorganize the search in assign_irq_vector

Split out the code which advances the target cpu for the search so we can
reuse it for the next patch which adds an early validation check for the
vectormask which we get from the apic.

Add comments while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.484562040@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 613b1cd..cef3195 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -157,14 +157,9 @@ next:
 			vector = FIRST_EXTERNAL_VECTOR + offset;
 		}
 
-		if (unlikely(current_vector == vector)) {
-			cpumask_or(searched_cpumask, searched_cpumask,
-				   vector_cpumask);
-			cpumask_andnot(vector_cpumask, mask, searched_cpumask);
-			cpu = cpumask_first_and(vector_cpumask,
-						cpu_online_mask);
-			continue;
-		}
+		/* If the search wrapped around, try the next cpu */
+		if (unlikely(current_vector == vector))
+			goto next_cpu;
 
 		if (test_bit(vector, used_vectors))
 			goto next;
@@ -186,6 +181,19 @@ next:
 		d->cfg.vector = vector;
 		cpumask_copy(d->domain, vector_cpumask);
 		goto success;
+
+next_cpu:
+		/*
+		 * We exclude the current @vector_cpumask from the requested
+		 * @mask and try again with the next online cpu in the
+		 * result. We cannot modify @mask, so we use @vector_cpumask
+		 * as a temporary buffer here as it will be reassigned when
+		 * calling apic->vector_allocation_domain() above.
+		 */
+		cpumask_or(searched_cpumask, searched_cpumask, vector_cpumask);
+		cpumask_andnot(vector_cpumask, mask, searched_cpumask);
+		cpu = cpumask_first_and(vector_cpumask, cpu_online_mask);
+		continue;
 	}
 	return -ENOSPC;
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Check vector allocation early
  2015-12-31 16:30 ` [patch 06/14] x86/irq: Check vector allocation early Thomas Gleixner
@ 2016-01-16 21:17   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jiang.liu, tglx, bp, mingo, hpa, linux-kernel, joe.lawrence,
	linux, jmmahler

Commit-ID:  3716fd27a604d61a91cda47083504971486b80f1
Gitweb:     http://git.kernel.org/tip/3716fd27a604d61a91cda47083504971486b80f1
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:48 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:00 +0100

x86/irq: Check vector allocation early

__assign_irq_vector() uses the vector_cpumask which is assigned by
apic->vector_allocation_domain() without doing basic sanity checks. That can
result in a situation where the final assignement of a newly found vector
fails in apic->cpu_mask_to_apicid_and(). So we have to do rollbacks for no
reason.

apic->cpu_mask_to_apicid_and() only fails if 

  vector_cpumask & requested_cpumask & cpu_online_mask 

is empty.

Check for this condition right away and if the result is empty try immediately
the next possible cpu in the requested mask. So in case of a failure the old
setting is unchanged and we can remove the rollback code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.561877324@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index cef3195..940e18d 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -31,7 +31,7 @@ struct apic_chip_data {
 struct irq_domain *x86_vector_domain;
 EXPORT_SYMBOL_GPL(x86_vector_domain);
 static DEFINE_RAW_SPINLOCK(vector_lock);
-static cpumask_var_t vector_cpumask, searched_cpumask;
+static cpumask_var_t vector_cpumask, vector_searchmask, searched_cpumask;
 static struct irq_chip lapic_controller;
 #ifdef	CONFIG_X86_IO_APIC
 static struct apic_chip_data *legacy_irq_data[NR_IRQS_LEGACY];
@@ -130,8 +130,20 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	while (cpu < nr_cpu_ids) {
 		int new_cpu, vector, offset;
 
+		/* Get the possible target cpus for @mask/@cpu from the apic */
 		apic->vector_allocation_domain(cpu, vector_cpumask, mask);
 
+		/*
+		 * Clear the offline cpus from @vector_cpumask for searching
+		 * and verify whether the result overlaps with @mask. If true,
+		 * then the call to apic->cpu_mask_to_apicid_and() will
+		 * succeed as well. If not, no point in trying to find a
+		 * vector in this mask.
+		 */
+		cpumask_and(vector_searchmask, vector_cpumask, cpu_online_mask);
+		if (!cpumask_intersects(vector_searchmask, mask))
+			goto next_cpu;
+
 		if (cpumask_subset(vector_cpumask, d->domain)) {
 			if (cpumask_equal(vector_cpumask, d->domain))
 				goto success;
@@ -164,7 +176,7 @@ next:
 		if (test_bit(vector, used_vectors))
 			goto next;
 
-		for_each_cpu_and(new_cpu, vector_cpumask, cpu_online_mask) {
+		for_each_cpu(new_cpu, vector_searchmask) {
 			if (!IS_ERR_OR_NULL(per_cpu(vector_irq, new_cpu)[vector]))
 				goto next;
 		}
@@ -176,7 +188,7 @@ next:
 			d->move_in_progress =
 			   cpumask_intersects(d->old_domain, cpu_online_mask);
 		}
-		for_each_cpu_and(new_cpu, vector_cpumask, cpu_online_mask)
+		for_each_cpu(new_cpu, vector_searchmask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
 		d->cfg.vector = vector;
 		cpumask_copy(d->domain, vector_cpumask);
@@ -198,8 +210,14 @@ next_cpu:
 	return -ENOSPC;
 
 success:
-	/* cache destination APIC IDs into cfg->dest_apicid */
-	return apic->cpu_mask_to_apicid_and(mask, d->domain, &d->cfg.dest_apicid);
+	/*
+	 * Cache destination APIC IDs into cfg->dest_apicid. This cannot fail
+	 * as we already established, that mask & d->domain & cpu_online_mask
+	 * is not empty.
+	 */
+	BUG_ON(apic->cpu_mask_to_apicid_and(mask, d->domain,
+					    &d->cfg.dest_apicid));
+	return 0;
 }
 
 static int assign_irq_vector(int irq, struct apic_chip_data *data,
@@ -409,6 +427,7 @@ int __init arch_early_irq_init(void)
 	arch_init_htirq_domain(x86_vector_domain);
 
 	BUG_ON(!alloc_cpumask_var(&vector_cpumask, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&vector_searchmask, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&searched_cpumask, GFP_KERNEL));
 
 	return arch_early_ioapic_init();
@@ -498,14 +517,7 @@ static int apic_set_affinity(struct irq_data *irq_data,
 		return -EINVAL;
 
 	err = assign_irq_vector(irq, data, dest);
-	if (err) {
-		if (assign_irq_vector(irq, data,
-				      irq_data_get_affinity_mask(irq_data)))
-			pr_err("Failed to recover vector for irq %d\n", irq);
-		return err;
-	}
-
-	return IRQ_SET_MASK_OK;
+	return err ? err : IRQ_SET_MASK_OK;
 }
 
 static struct irq_chip lapic_controller = {

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Copy vectormask instead of an AND operation
  2015-12-31 16:30 ` [patch 07/14] x86/irq: Copy vectormask instead of an AND operation Thomas Gleixner
@ 2016-01-16 21:18   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux, tglx, jiang.liu, linux-kernel, joe.lawrence, bp, mingo,
	jmmahler, hpa

Commit-ID:  9ac15b7a8af4cf3337a101498c0ed690d23ade75
Gitweb:     http://git.kernel.org/tip/9ac15b7a8af4cf3337a101498c0ed690d23ade75
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:49 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:00 +0100

x86/irq: Copy vectormask instead of an AND operation

In the case that the new vector mask is a subset of the existing mask there is
no point to do a AND operation of currentmask & newmask. The result is
newmask. So we can simply copy the new mask to the current mask and be done
with it. Preparatory patch for further consolidation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.640253454@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 940e18d..1bd29c6 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -156,7 +156,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 				       vector_cpumask);
 			d->move_in_progress =
 			   cpumask_intersects(d->old_domain, cpu_online_mask);
-			cpumask_and(d->domain, d->domain, vector_cpumask);
+			cpumask_copy(d->domain, vector_cpumask);
 			goto success;
 		}
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Get rid of code duplication
  2015-12-31 16:30 ` [patch 08/14] x86/irq: Get rid of code duplication Thomas Gleixner
@ 2016-01-16 21:18   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, joe.lawrence, bp, hpa, linux, mingo, jiang.liu,
	tglx, jmmahler

Commit-ID:  ab25ac02148b600e645f77cfb8b8ea415ed75bb4
Gitweb:     http://git.kernel.org/tip/ab25ac02148b600e645f77cfb8b8ea415ed75bb4
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:49 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:00 +0100

x86/irq: Get rid of code duplication

Reusing an existing vector and assigning a new vector has duplicated
code. Consolidate it.

This is also a preparatory patch for finally plugging the cleanup race.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.721599216@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 1bd29c6..fccfa3f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -118,7 +118,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	 */
 	static int current_vector = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
 	static int current_offset = VECTOR_OFFSET_START % 16;
-	int cpu;
+	int cpu, vector;
 
 	if (d->move_in_progress)
 		return -EBUSY;
@@ -128,7 +128,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	cpumask_clear(searched_cpumask);
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	while (cpu < nr_cpu_ids) {
-		int new_cpu, vector, offset;
+		int new_cpu, offset;
 
 		/* Get the possible target cpus for @mask/@cpu from the apic */
 		apic->vector_allocation_domain(cpu, vector_cpumask, mask);
@@ -148,16 +148,12 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 			if (cpumask_equal(vector_cpumask, d->domain))
 				goto success;
 			/*
-			 * New cpumask using the vector is a proper subset of
-			 * the current in use mask. So cleanup the vector
-			 * allocation for the members that are not used anymore.
+			 * Mark the cpus which are not longer in the mask for
+			 * cleanup.
 			 */
-			cpumask_andnot(d->old_domain, d->domain,
-				       vector_cpumask);
-			d->move_in_progress =
-			   cpumask_intersects(d->old_domain, cpu_online_mask);
-			cpumask_copy(d->domain, vector_cpumask);
-			goto success;
+			cpumask_andnot(d->old_domain, d->domain, vector_cpumask);
+			vector = d->cfg.vector;
+			goto update;
 		}
 
 		vector = current_vector;
@@ -183,16 +179,12 @@ next:
 		/* Found one! */
 		current_vector = vector;
 		current_offset = offset;
-		if (d->cfg.vector) {
+		/* Schedule the old vector for cleanup on all cpus */
+		if (d->cfg.vector)
 			cpumask_copy(d->old_domain, d->domain);
-			d->move_in_progress =
-			   cpumask_intersects(d->old_domain, cpu_online_mask);
-		}
 		for_each_cpu(new_cpu, vector_searchmask)
 			per_cpu(vector_irq, new_cpu)[vector] = irq_to_desc(irq);
-		d->cfg.vector = vector;
-		cpumask_copy(d->domain, vector_cpumask);
-		goto success;
+		goto update;
 
 next_cpu:
 		/*
@@ -209,6 +201,11 @@ next_cpu:
 	}
 	return -ENOSPC;
 
+update:
+	/* Cleanup required ? */
+	d->move_in_progress = cpumask_intersects(d->old_domain, cpu_online_mask);
+	d->cfg.vector = vector;
+	cpumask_copy(d->domain, vector_cpumask);
 success:
 	/*
 	 * Cache destination APIC IDs into cfg->dest_apicid. This cannot fail

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Remove offline cpus from vector cleanup
  2015-12-31 16:30 ` [patch 09/14] x86/irq: Remove offline cpus from vector cleanup Thomas Gleixner
@ 2016-01-16 21:18   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, linux, bp, jiang.liu, mingo, joe.lawrence, hpa,
	tglx, jmmahler

Commit-ID:  847667ef10356b824a11c853fc8a8b1b437b6a8d
Gitweb:     http://git.kernel.org/tip/847667ef10356b824a11c853fc8a8b1b437b6a8d
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:50 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:01 +0100

x86/irq: Remove offline cpus from vector cleanup

No point of keeping offline cpus in the cleanup mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.808642683@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index fccfa3f..68d18b3 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -202,8 +202,12 @@ next_cpu:
 	return -ENOSPC;
 
 update:
-	/* Cleanup required ? */
-	d->move_in_progress = cpumask_intersects(d->old_domain, cpu_online_mask);
+	/*
+	 * Exclude offline cpus from the cleanup mask and set the
+	 * move_in_progress flag when the result is not empty.
+	 */
+	cpumask_and(d->old_domain, d->old_domain, cpu_online_mask);
+	d->move_in_progress = !cpumask_empty(d->old_domain);
 	d->cfg.vector = vector;
 	cpumask_copy(d->domain, vector_cpumask);
 success:

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Clear move_in_progress before sending cleanup IPI
  2015-12-31 16:30 ` [patch 10/14] x86/irq: Clear move_in_progress before sending cleanup IPI Thomas Gleixner
@ 2016-01-16 21:19   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, joe.lawrence, mingo, hpa, jmmahler, jiang.liu, tglx, linux,
	linux-kernel

Commit-ID:  c1684f5035b60e9f98566493e869496fb5de1d89
Gitweb:     http://git.kernel.org/tip/c1684f5035b60e9f98566493e869496fb5de1d89
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:51 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:01 +0100

x86/irq: Clear move_in_progress before sending cleanup IPI

send_cleanup_vector() fiddles with the old_domain mask unprotected because it
relies on the protection by the move_in_progress flag. But this is fatal, as
the flag is reset after the IPI has been sent. So a cpu which receives the IPI
can still see the flag set and therefor ignores the cleanup request. If no
other cleanup request happens then the vector stays stale on that cpu and in
case of an irq removal the vector still persists. That can lead to use after
free when the next cleanup IPI happens.

Protect the code with vector_lock and clear move_in_progress before sending
the IPI.

This does not plug the race which Joe reported because:

CPU0                          CPU1                      CPU2
lock_vector()
data->move_in_progress=0
sendIPI()                       
unlock_vector()
                              set_affinity()
                              assign_irq_vector()
                              lock_vector()             handle_IPI
                              move_in_progress = 1      lock_vector()
                              unlock_vector()
                                                        move_in_progress == 1

The full fix comes with a later patch.

Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.892412198@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 68d18b3..ed62f9c 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -532,6 +532,8 @@ static void __send_cleanup_vector(struct apic_chip_data *data)
 {
 	cpumask_var_t cleanup_mask;
 
+	raw_spin_lock(&vector_lock);
+	data->move_in_progress = 0;
 	if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
 		unsigned int i;
 
@@ -543,7 +545,7 @@ static void __send_cleanup_vector(struct apic_chip_data *data)
 		apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
 		free_cpumask_var(cleanup_mask);
 	}
-	data->move_in_progress = 0;
+	raw_spin_unlock(&vector_lock);
 }
 
 void send_cleanup_vector(struct irq_cfg *cfg)

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Remove the cpumask allocation from send_cleanup_vector()
  2015-12-31 16:30 ` [patch 11/14] x86/irq: Remove the cpumask allocation from send_cleanup_vector() Thomas Gleixner
@ 2016-01-16 21:19   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jiang.liu, jmmahler, mingo, tglx, linux, hpa, linux-kernel,
	joe.lawrence, bp

Commit-ID:  5da0c1217f05d2ccc9a8ed6e6e5c23a8a1d24dd6
Gitweb:     http://git.kernel.org/tip/5da0c1217f05d2ccc9a8ed6e6e5c23a8a1d24dd6
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:52 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:01 +0100

x86/irq: Remove the cpumask allocation from send_cleanup_vector()

There is no need to allocate a new cpumask for sending the cleanup vector. The
old_domain mask is now protected by the vector_lock, so we can safely remove
the offline cpus from it and send the IPI with the resulting mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.967993932@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index ed62f9c..91dc274 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -530,21 +530,11 @@ static struct irq_chip lapic_controller = {
 #ifdef CONFIG_SMP
 static void __send_cleanup_vector(struct apic_chip_data *data)
 {
-	cpumask_var_t cleanup_mask;
-
 	raw_spin_lock(&vector_lock);
+	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
 	data->move_in_progress = 0;
-	if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
-		unsigned int i;
-
-		for_each_cpu_and(i, data->old_domain, cpu_online_mask)
-			apic->send_IPI_mask(cpumask_of(i),
-					    IRQ_MOVE_CLEANUP_VECTOR);
-	} else {
-		cpumask_and(cleanup_mask, data->old_domain, cpu_online_mask);
-		apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
-		free_cpumask_var(cleanup_mask);
-	}
+	if (!cpumask_empty(data->old_domain))
+		apic->send_IPI_mask(data->old_domain, IRQ_MOVE_CLEANUP_VECTOR);
 	raw_spin_unlock(&vector_lock);
 }
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Remove outgoing CPU from vector cleanup mask
  2015-12-31 16:30 ` [patch 12/14] x86/irq: Remove outgoing CPU from vector cleanup mask Thomas Gleixner
@ 2016-01-16 21:19   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jiang.liu, linux-kernel, hpa, tglx, linux, bp, jmmahler,
	joe.lawrence, mingo

Commit-ID:  56d7d2f4bbd00fb198b7907cb3ab657d06115a42
Gitweb:     http://git.kernel.org/tip/56d7d2f4bbd00fb198b7907cb3ab657d06115a42
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:52 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:01 +0100

x86/irq: Remove outgoing CPU from vector cleanup mask

We want to synchronize new vector assignments with a pending cleanup. Remove a
dying cpu from a pending cleanup mask.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160107.045961667@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 91dc274..a7fa11e 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -633,9 +633,23 @@ void irq_complete_move(struct irq_cfg *cfg)
 void irq_force_complete_move(int irq)
 {
 	struct irq_cfg *cfg = irq_cfg(irq);
+	struct apic_chip_data *data;
+
+	if (!cfg)
+		return;
 
-	if (cfg)
-		__irq_complete_move(cfg, cfg->vector);
+	__irq_complete_move(cfg, cfg->vector);
+
+	/*
+	 * Remove this cpu from the cleanup mask. The IPI might have been sent
+	 * just before the cpu was removed from the offline mask, but has not
+	 * been processed because the CPU has interrupts disabled and is on
+	 * the way out.
+	 */
+	raw_spin_lock(&vector_lock);
+	data = container_of(cfg, struct apic_chip_data, cfg);
+	cpumask_clear_cpu(smp_processor_id(), data->old_domain);
+	raw_spin_unlock(&vector_lock);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Call irq_force_move_complete with irq descriptor
  2015-12-31 16:30 ` [patch 13/14] x86/irq: Call irq_force_move_complete with irq descriptor Thomas Gleixner
@ 2016-01-16 21:20   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, jiang.liu, joe.lawrence, bp, linux-kernel, linux,
	jmmahler, hpa

Commit-ID:  90a2282e23f0522e4b3f797ad447c5e91bf7fe32
Gitweb:     http://git.kernel.org/tip/90a2282e23f0522e4b3f797ad447c5e91bf7fe32
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:53 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:01 +0100

x86/irq: Call irq_force_move_complete with irq descriptor

First of all there is no point in looking up the irq descriptor again, but we
also need the descriptor for the final cleanup race fix in the next
patch. Make that change seperate. No functional difference.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160107.125211743@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/irq.h    |  5 +++--
 arch/x86/kernel/apic/vector.c | 11 +++++++----
 arch/x86/kernel/irq.c         |  2 +-
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 881b476..e7de5c9 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -23,11 +23,13 @@ extern void irq_ctx_init(int cpu);
 
 #define __ARCH_HAS_DO_SOFTIRQ
 
+struct irq_desc;
+
 #ifdef CONFIG_HOTPLUG_CPU
 #include <linux/cpumask.h>
 extern int check_irq_vectors_for_cpu_disable(void);
 extern void fixup_irqs(void);
-extern void irq_force_complete_move(int);
+extern void irq_force_complete_move(struct irq_desc *desc);
 #endif
 
 #ifdef CONFIG_HAVE_KVM
@@ -37,7 +39,6 @@ extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
 extern void (*x86_platform_ipi_callback)(void);
 extern void native_init_IRQ(void);
 
-struct irq_desc;
 extern bool handle_irq(struct irq_desc *desc, struct pt_regs *regs);
 
 extern __visible unsigned int do_IRQ(struct pt_regs *regs);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a7fa11e..5f78835 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -630,10 +630,14 @@ void irq_complete_move(struct irq_cfg *cfg)
 	__irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
 }
 
-void irq_force_complete_move(int irq)
+/*
+ * Called with @desc->lock held and interrupts disabled.
+ */
+void irq_force_complete_move(struct irq_desc *desc)
 {
-	struct irq_cfg *cfg = irq_cfg(irq);
-	struct apic_chip_data *data;
+	struct irq_data *irqdata = irq_desc_get_irq_data(desc);
+	struct apic_chip_data *data = apic_chip_data(irqdata);
+	struct irq_cfg *cfg = data ? &data->cfg : NULL;
 
 	if (!cfg)
 		return;
@@ -647,7 +651,6 @@ void irq_force_complete_move(int irq)
 	 * the way out.
 	 */
 	raw_spin_lock(&vector_lock);
-	data = container_of(cfg, struct apic_chip_data, cfg);
 	cpumask_clear_cpu(smp_processor_id(), data->old_domain);
 	raw_spin_unlock(&vector_lock);
 }
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c0b58dd..61521dc 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -462,7 +462,7 @@ void fixup_irqs(void)
 		 * non intr-remapping case, we can't wait till this interrupt
 		 * arrives at this cpu before completing the irq move.
 		 */
-		irq_force_complete_move(irq);
+		irq_force_complete_move(desc);
 
 		if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
 			break_affinity = 1;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [tip:x86/urgent] x86/irq: Plug vector cleanup race
  2015-12-31 16:30 ` [patch 14/14] x86/irq: Plug vector cleanup race Thomas Gleixner
@ 2016-01-16 21:20   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 38+ messages in thread
From: tip-bot for Thomas Gleixner @ 2016-01-16 21:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux, mingo, jmmahler, joe.lawrence, bp, linux-kernel, hpa,
	jiang.liu, tglx

Commit-ID:  98229aa36caa9c769b13565523de9b813013c703
Gitweb:     http://git.kernel.org/tip/98229aa36caa9c769b13565523de9b813013c703
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 31 Dec 2015 16:30:54 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Jan 2016 13:44:02 +0100

x86/irq: Plug vector cleanup race

We still can end up with a stale vector due to the following:

CPU0                          CPU1                      CPU2
lock_vector()
data->move_in_progress=0
sendIPI()                       
unlock_vector()
                              set_affinity()
                              assign_irq_vector()
                              lock_vector()             handle_IPI
                              move_in_progress = 1      lock_vector()
                              unlock_vector()
                                                        move_in_progress == 1

So we need to serialize the vector assignment against a pending cleanup. The
solution is rather simple now. We not only check for the move_in_progress flag
in assign_irq_vector(), we also check whether there is still a cleanup pending
in the old_domain cpumask. If so, we return -EBUSY to the caller and let him
deal with it. Though we have to be careful in the cpu unplug case. If the
cleanout has not yet completed then the following setaffinity() call would
return -EBUSY. Add code which prevents this.

Full context is here: http://lkml.kernel.org/r/5653B688.4050809@stratus.com

Reported-and-tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160107.207265407@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/vector.c | 63 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 5f78835..3b670df 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -120,7 +120,12 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
 	static int current_offset = VECTOR_OFFSET_START % 16;
 	int cpu, vector;
 
-	if (d->move_in_progress)
+	/*
+	 * If there is still a move in progress or the previous move has not
+	 * been cleaned up completely, tell the caller to come back later.
+	 */
+	if (d->move_in_progress ||
+	    cpumask_intersects(d->old_domain, cpu_online_mask))
 		return -EBUSY;
 
 	/* Only try and allocate irqs on cpus that are present */
@@ -259,7 +264,12 @@ static void clear_irq_vector(int irq, struct apic_chip_data *data)
 	data->cfg.vector = 0;
 	cpumask_clear(data->domain);
 
-	if (likely(!data->move_in_progress))
+	/*
+	 * If move is in progress or the old_domain mask is not empty,
+	 * i.e. the cleanup IPI has not been processed yet, we need to remove
+	 * the old references to desc from all cpus vector tables.
+	 */
+	if (!data->move_in_progress && cpumask_empty(data->old_domain))
 		return;
 
 	desc = irq_to_desc(irq);
@@ -579,12 +589,25 @@ asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 			goto unlock;
 
 		/*
-		 * Check if the irq migration is in progress. If so, we
-		 * haven't received the cleanup request yet for this irq.
+		 * Nothing to cleanup if irq migration is in progress
+		 * or this cpu is not set in the cleanup mask.
 		 */
-		if (data->move_in_progress)
+		if (data->move_in_progress ||
+		    !cpumask_test_cpu(me, data->old_domain))
 			goto unlock;
 
+		/*
+		 * We have two cases to handle here:
+		 * 1) vector is unchanged but the target mask got reduced
+		 * 2) vector and the target mask has changed
+		 *
+		 * #1 is obvious, but in #2 we have two vectors with the same
+		 * irq descriptor: the old and the new vector. So we need to
+		 * make sure that we only cleanup the old vector. The new
+		 * vector has the current @vector number in the config and
+		 * this cpu is part of the target mask. We better leave that
+		 * one alone.
+		 */
 		if (vector == data->cfg.vector &&
 		    cpumask_test_cpu(me, data->domain))
 			goto unlock;
@@ -602,6 +625,7 @@ asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
 			goto unlock;
 		}
 		__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+		cpumask_clear_cpu(me, data->old_domain);
 unlock:
 		raw_spin_unlock(&desc->lock);
 	}
@@ -645,13 +669,32 @@ void irq_force_complete_move(struct irq_desc *desc)
 	__irq_complete_move(cfg, cfg->vector);
 
 	/*
-	 * Remove this cpu from the cleanup mask. The IPI might have been sent
-	 * just before the cpu was removed from the offline mask, but has not
-	 * been processed because the CPU has interrupts disabled and is on
-	 * the way out.
+	 * This is tricky. If the cleanup of @data->old_domain has not been
+	 * done yet, then the following setaffinity call will fail with
+	 * -EBUSY. This can leave the interrupt in a stale state.
+	 *
+	 * The cleanup cannot make progress because we hold @desc->lock. So in
+	 * case @data->old_domain is not yet cleaned up, we need to drop the
+	 * lock and acquire it again. @desc cannot go away, because the
+	 * hotplug code holds the sparse irq lock.
 	 */
 	raw_spin_lock(&vector_lock);
-	cpumask_clear_cpu(smp_processor_id(), data->old_domain);
+	/* Clean out all offline cpus (including ourself) first. */
+	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
+	while (!cpumask_empty(data->old_domain)) {
+		raw_spin_unlock(&vector_lock);
+		raw_spin_unlock(&desc->lock);
+		cpu_relax();
+		raw_spin_lock(&desc->lock);
+		/*
+		 * Reevaluate apic_chip_data. It might have been cleared after
+		 * we dropped @desc->lock.
+		 */
+		data = apic_chip_data(irqdata);
+		if (!data)
+			return;
+		raw_spin_lock(&vector_lock);
+	}
 	raw_spin_unlock(&vector_lock);
 }
 #endif

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-14 10:33     ` Borislav Petkov
@ 2016-01-16 21:37       ` Joe Lawrence
  2016-01-18 15:00         ` Joe Lawrence
  0 siblings, 1 reply; 38+ messages in thread
From: Joe Lawrence @ 2016-01-16 21:37 UTC (permalink / raw)
  To: Borislav Petkov, Thomas Gleixner
  Cc: LKML, Ingo Molnar, Peter Anvin, Jiang Liu, Jeremiah Mahler,
	andy.shevchenko, Guenter Roeck

On 01/14/2016 05:33 AM, Borislav Petkov wrote:
> On Thu, Jan 14, 2016 at 09:24:35AM +0100, Thomas Gleixner wrote:
>> On Mon, 4 Jan 2016, Joe Lawrence wrote:
>>> No issues running the same PCI device removal and stress tests against
>>> the patchset.
>>
>> Thanks for testing!
>>
>> Though there is yet another long standing bug in that area. Fix below.
>>
>> Thanks,
>>
>> 	tglx
>>
>> 8<--------------------
>>
[ ... snip ... ]
>
> s/d//
>
> With those micro-changes:
>
> Tested-by: Borislav Petkov <bp@suse.de>
>
> :-)

Tests still running ok here (with same micro-change as Borislav).

-- Joe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-16 21:37       ` Joe Lawrence
@ 2016-01-18 15:00         ` Joe Lawrence
  2016-01-18 15:43           ` Borislav Petkov
  2016-01-20  3:57           ` Joe Lawrence
  0 siblings, 2 replies; 38+ messages in thread
From: Joe Lawrence @ 2016-01-18 15:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On 01/16/2016 04:37 PM, Joe Lawrence wrote:
> On 01/14/2016 05:33 AM, Borislav Petkov wrote:
>> On Thu, Jan 14, 2016 at 09:24:35AM +0100, Thomas Gleixner wrote:
>>> On Mon, 4 Jan 2016, Joe Lawrence wrote:
>>>> No issues running the same PCI device removal and stress tests against
>>>> the patchset.
>>>
>>> Thanks for testing!
>>>
>>> Though there is yet another long standing bug in that area. Fix below.
>>>
>>> Thanks,
>>>
>>>     tglx
>>>
>>> 8<--------------------
>>>
> [ ... snip ... ]
>>
>> s/d//
>>
>> With those micro-changes:
>>
>> Tested-by: Borislav Petkov <bp@suse.de>
>>
>> :-)
> 
> Tests still running ok here (with same micro-change as Borislav).

Hi Thomas,

When logging in this morning and looking at the box running the 14
patches + additional patch, I see it hit a hung task timeout in xhci USB
code about 39 hours in.  Stack trace below (looks to be waiting on a
completion that never comes).

I didn't see this when running only the *initial* 14 patches.  Of
course, before these irq cleanup fixes my tests never ran this long :)
So it may or may not be related to the patchset, I'm still poking around
the generated vmcore.  Let me know if there is anything you might be
interested in looking at from the wreckage.

-- Joe



INFO: task kworker/0:1:1506 blocked for more than 120 seconds.
      Tainted: P           OE   4.3.0sra12+ #50
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:1     D 0000000000000000     0  1506      2 0x00000080
Workqueue: usb_hub_wq hub_event
 ffff8801e46dba58 0000000000000046 ffff8810375dac00 ffff881038430000
 ffff8801e46dc000 ffff88025ac20440 ffff88025ac20438 ffff881038430000
 0000000000000000 ffff8801e46dba70 ffffffff81659893 7fffffffffffffff
Call Trace:
 [<ffffffff81659893>] schedule+0x33/0x80
 [<ffffffff8165c530>] schedule_timeout+0x200/0x2a0
 [<ffffffff810e2761>] ? internal_add_timer+0x71/0xb0
 [<ffffffff810e4994>] ? mod_timer+0x114/0x210
 [<ffffffff8165a371>] wait_for_completion+0xf1/0x130
 [<ffffffff810a70d0>] ? wake_up_q+0x70/0x70
 [<ffffffff814b14a1>] xhci_discover_or_reset_device+0x1e1/0x540
 [<ffffffff814723b8>] hub_port_reset+0x3c8/0x590
 [<ffffffff81472aa5>] hub_port_init+0x525/0xb00
 [<ffffffff81476068>] hub_port_connect+0x328/0x940
 [<ffffffff81476cbc>] hub_event+0x63c/0xb00
 [<ffffffff810947dc>] process_one_work+0x14c/0x3c0
 [<ffffffff81095044>] worker_thread+0x114/0x470
 [<ffffffff8165925f>] ? __schedule+0x2af/0x8b0
 [<ffffffff81094f30>] ? rescuer_thread+0x310/0x310
 [<ffffffff8109ab88>] kthread+0xd8/0xf0
 [<ffffffff8109aab0>] ? kthread_park+0x60/0x60
 [<ffffffff8165d75f>] ret_from_fork+0x3f/0x70
 [<ffffffff8109aab0>] ? kthread_park+0x60/0x60

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-18 15:00         ` Joe Lawrence
@ 2016-01-18 15:43           ` Borislav Petkov
  2016-01-18 16:38             ` Joe Lawrence
  2016-01-20  3:57           ` Joe Lawrence
  1 sibling, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2016-01-18 15:43 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On Mon, Jan 18, 2016 at 10:00:49AM -0500, Joe Lawrence wrote:
> INFO: task kworker/0:1:1506 blocked for more than 120 seconds.
>       Tainted: P           OE   4.3.0sra12+ #50

P as in Proprietary?

Can you trigger without that module?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-18 15:43           ` Borislav Petkov
@ 2016-01-18 16:38             ` Joe Lawrence
  0 siblings, 0 replies; 38+ messages in thread
From: Joe Lawrence @ 2016-01-18 16:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On 01/18/2016 10:43 AM, Borislav Petkov wrote:
> On Mon, Jan 18, 2016 at 10:00:49AM -0500, Joe Lawrence wrote:
>> INFO: task kworker/0:1:1506 blocked for more than 120 seconds.
>>       Tainted: P           OE   4.3.0sra12+ #50
> 
> P as in Proprietary?
> 
> Can you trigger without that module?
> 

Hi Boris,

Unfortunately, yes to proprietary.

I had a Stratus proprietary driver in place that enables a test to
hotplug add/remove entire PCI subtrees at random points of
initialization.  That driver allows us to insert various hardware faults
to facilitate testing the Stratus ftServer platform and OS device
drivers against PCI master abort detection and cleanup.

I can try w/o the driver, though the test will be drastically different
-- polite sysfs PCI removal of a single device (in this case, the xHCI
host controller).  An interesting data point to gather nonetheless.

-- Joe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-18 15:00         ` Joe Lawrence
  2016-01-18 15:43           ` Borislav Petkov
@ 2016-01-20  3:57           ` Joe Lawrence
  2016-01-20  8:26             ` Borislav Petkov
  1 sibling, 1 reply; 38+ messages in thread
From: Joe Lawrence @ 2016-01-20  3:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On 01/18/2016 10:00 AM, Joe Lawrence wrote:
[... snip ... ] 
> Hi Thomas,
> 
> When logging in this morning and looking at the box running the 14
> patches + additional patch, I see it hit a hung task timeout in xhci USB
> code about 39 hours in.  Stack trace below (looks to be waiting on a
> completion that never comes).
> 
> I didn't see this when running only the *initial* 14 patches.  Of
> course, before these irq cleanup fixes my tests never ran this long :)
> So it may or may not be related to the patchset, I'm still poking around
> the generated vmcore.  Let me know if there is anything you might be
> interested in looking at from the wreckage.
> 
> -- Joe
> 
> 
> 
> INFO: task kworker/0:1:1506 blocked for more than 120 seconds.
>        Tainted: P           OE   4.3.0sra12+ #50
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/0:1     D 0000000000000000     0  1506      2 0x00000080
> Workqueue: usb_hub_wq hub_event
>   ffff8801e46dba58 0000000000000046 ffff8810375dac00 ffff881038430000
>   ffff8801e46dc000 ffff88025ac20440 ffff88025ac20438 ffff881038430000
>   0000000000000000 ffff8801e46dba70 ffffffff81659893 7fffffffffffffff
> Call Trace:
>   [<ffffffff81659893>] schedule+0x33/0x80
>   [<ffffffff8165c530>] schedule_timeout+0x200/0x2a0
>   [<ffffffff810e2761>] ? internal_add_timer+0x71/0xb0
>   [<ffffffff810e4994>] ? mod_timer+0x114/0x210
>   [<ffffffff8165a371>] wait_for_completion+0xf1/0x130
>   [<ffffffff810a70d0>] ? wake_up_q+0x70/0x70
>   [<ffffffff814b14a1>] xhci_discover_or_reset_device+0x1e1/0x540
>   [<ffffffff814723b8>] hub_port_reset+0x3c8/0x590
>   [<ffffffff81472aa5>] hub_port_init+0x525/0xb00
>   [<ffffffff81476068>] hub_port_connect+0x328/0x940
>   [<ffffffff81476cbc>] hub_event+0x63c/0xb00
>   [<ffffffff810947dc>] process_one_work+0x14c/0x3c0
>   [<ffffffff81095044>] worker_thread+0x114/0x470
>   [<ffffffff8165925f>] ? __schedule+0x2af/0x8b0
>   [<ffffffff81094f30>] ? rescuer_thread+0x310/0x310
>   [<ffffffff8109ab88>] kthread+0xd8/0xf0
>   [<ffffffff8109aab0>] ? kthread_park+0x60/0x60
>   [<ffffffff8165d75f>] ret_from_fork+0x3f/0x70
>   [<ffffffff8109aab0>] ? kthread_park+0x60/0x60

Hi Thomas / Boris,

In an effort to exonerate the patchset, I instrumented xHCI to monitor
complementary wait_for_completion / complete calls in that driver,
hoping that an early exit in its probe might be simply skipping the
complete call ... but I ended up collecting two new crashes in
get_next_timer_interrupt:

(Again with proprietary and out-of-tree drivers loaded.)

general protection fault: 0000 [#1] SMP
Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun
matroxfb(OE) ccmod(POE) ftmod(OE) videosw(OE) [ ... snip ... ]
CPU: 10 PID: 0 Comm: swapper/10 Tainted: P           OE   4.3.0sra13+ #53
Hardware name: Stratus ftServer 6800/G7LYY, BIOS BIOS Version 8.1:61 09/10/2015
task: ffff881038e35800 ti: ffff881038e3c000 task.ti: ffff881038e3c000
RIP: 0010:[<ffffffff810e4c55>]  [<ffffffff810e4c55>] get_next_timer_interrupt+0x1a5/0x240
RSP: 0018:ffff881038e3fde0  EFLAGS: 00010002
RAX: ffff88103fa8e8b8 RBX: 000013629b0c5740 RCX: 000000014140a6d6
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000001 RDI: 00000000010140a7
RBP: ffff881038e3fe30 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000027
R10: 0000000000000027 R11: ffff881038e3fde8 R12: 000000010140a6d6
R13: ffff88103fa8e080 R14: ffff881038e3fe00 R15: 0000000000000040
FS:  0000000000000000(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcbf695a000 CR3: 0000002030582000 CR4: 00000000001406e0
Stack:
 ffff88103fa8e8b8 ffff88103fa8eab8 ffff88103fa8ecb8 ffff88103fa8eeb8
 c5067d2c236293b0 ffff88103fa8f500 ffff88103fa8d600 000013629b1632f3
 000000010140a6d6 000013629b0c5740 ffff881038e3fe88 ffffffff810f421f
Call Trace:
 [<ffffffff810f421f>] tick_nohz_stop_sched_tick+0x1bf/0x2b0
 [<ffffffff810f43af>] __tick_nohz_idle_enter+0x9f/0x150
 [<ffffffff810f47dc>] tick_nohz_idle_enter+0x3c/0x70
 [<ffffffff810bf00c>] cpu_startup_entry+0x9c/0x330
 [<ffffffff8104e700>] start_secondary+0x160/0x1a0
Code: 95 38 0e 00 00 48 89 55 c8 48 8d 55 b0 4c 8d 5a 08 4c 8d 72 20 41 89 fa 41 83 e2 3f 45 89 d1 49 63 d1 48 8b 14 d0 48 85 d2 74 1e <f6> 42 2a 10 75 10 4c 8b 42 10 be 01 00 00 00 49 39 c8 49 0f 48
RIP  [<ffffffff810e4c55>] get_next_timer_interrupt+0x1a5/0x240
 RSP <ffff881038e3fde0>

crash> dis -l get_next_timer_interrupt+0x1a5
/root/linus/kernel/time/timer.c: 1289
  <get_next_timer_interrupt+0x1a5>:    testb  $0x10,0x2a(%rdx)

RDX: 6b6b6b6b6b6b6b6b

1246 static unsigned long __next_timer_interrupt(struct tvec_base *base)
...
1251         struct timer_list *nte;
...
1277         /* Check tv2-tv5. */
1278         varray[0] = &base->tv2;
1279         varray[1] = &base->tv3;
1280         varray[2] = &base->tv4;
1281         varray[3] = &base->tv5;
...
1283         for (array = 0; array < 4; array++) {
1284                 struct tvec *varp = varray[array];
1285
1286                 index = slot = timer_jiffies & TVN_MASK;
1287                 do {
1288                         hlist_for_each_entry(nte, varp->vec + slot, entry) {
1289                                 if (nte->flags & TIMER_DEFERRABLE)

So the nte pointer contains the slub_debug poison pattern.

>From the disassembly of __next_timer_interrupt, it looks like r13 is
used to store "base".

R13: ffff88103fa8e080

crash> struct tvec_base ffff88103fa8e080
struct tvec_base {
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x1
          }
        }
      }
    }
  },
  running_timer = 0x0,
  timer_jiffies = 0x10140a6d7,
  next_timer = 0x10140a6d6,
  active_timers = 0x7,
  all_timers = 0x8,
  cpu = 0xa,
  migration_enabled = 0x1,
  nohz_active = 0x1,
  ...
  tv1 = {
    vec = {{
      ...
        first = 0xffff88203800dff8
      ...
  tv2 = {
    vec = {{
      ...
        first = 0xffff88100aa9a550     << index 39 points to slub poison
      ...
        first = 0xffff88103fa90ea0
      ...
  tv3 = {
    vec = {{
      ...
        first = 0xffff8807452b3928
      ...
        first = 0xffff88103fa8d380
      ...
  tv4 = {
    vec = {{
      ...
  tv5 = {
    vec = {{
      ...
        first = 0xffff88100c8955e8

crash> struct hlist_node 0xffff88203800dff8
struct hlist_node {
  next = 0x0,
  pprev = 0xffff88103fa8e790
}
crash> struct hlist_node 0xffff88100aa9a550
struct hlist_node {
  next = 0x6b6b6b6b6b6b6b6b,           << uhoh!
  pprev = 0x6b6b6b6b6b6b6b6b           <<
}
crash> struct hlist_node 0xffff88103fa90ea0
struct hlist_node {
  next = 0x0,
  pprev = 0xffff88103fa8e9f8
}
crash> struct hlist_node 0xffff8807452b3928
struct hlist_node {
  next = 0xffff88100aa9da68,
  pprev = 0xffff88103fa8eae0
}
crash> struct hlist_node 0xffff88103fa8d380
struct hlist_node {
  next = 0x0,
  pprev = 0xffff88103fa8eb58
}
crash> struct hlist_node 0xffff88100c8955e8
struct hlist_node {
  next = 0xffff881021e47598,
  pprev = 0xffff88103fa8eec0
}

crash utility confirms it in its "timer" display:

crash> timer
TVEC_BASES[9]: ffff88103fa4e080
  JIFFIES
4315982678
  EXPIRES      TIMER_LIST         FUNCTION
4315982681  ffff88203800ddb0  ffffffff8150e7a0 <intel_pstate_timer_func>
4315982973  ffff88103fa50ea0  ffffffff81092d90 <delayed_work_timer_fn>
4316267973  ffff88103fa4d380  ffffffff81041930 <mce_timer_fn>

timer: invalid list entry: 6b6b6b6b6b6b6b6b
timer: ignoring faulty timer list at index 39 of timer array

timer: invalid list entry: 6b6b6b6b6b6b6b6b
timer: ignoring faulty timer list at index 39 of timer array
TVEC_BASES[10]: ffff88103fa8e080
      JIFFIES
         4315982678
      EXPIRES           TIMER_LIST         FUNCTION
         4315981531  ffff88203800dff8  ffffffff8150e7a0 <intel_pstate_timer_func>
         4316034039  ffff88100aa9da68  ffffffff81092d90 <delayed_work_timer_fn>
         4316034111  ffff8807452b3928  ffffffff81092d90 <delayed_work_timer_fn>
         4316267970  ffff88103fa8d380  ffffffff81041930 <mce_timer_fn>
         4401397760  ffff88100c8955e8  ffffffff8160bbe0 <ipv6_regen_rndid>
         4401397760  ffff881021e47598  ffffffff8160bbe0 <ipv6_regen_rndid>
7740398493674204011  ffff88100aa9a550  6b6b6b6b6b6b6b6b

crash> struct timer_list ffff88100aa9a550
struct timer_list {
  entry = {
    next = 0x6b6b6b6b6b6b6b6b,
    pprev = 0x6b6b6b6b6b6b6b6b
  },
  expires = 0x6b6b6b6b6b6b6b6b,
  function = 0x6b6b6b6b6b6b6b6b,
  data = 0x6b6b6b6b6b6b6b6b,
  flags = 0x6b6b6b6b,
  slack = 0x6b6b6b6b,
  start_pid = 0x6b6b6b6b,
  start_site = 0x6b6b6b6b6b6b6b6b,
  start_comm = "kkkkkkkkkkkkkkkk"
}

A second crash of the same signature occurred a few hours later.


Unfortunately I only have a single box to run these tests in what
amounts to an after-hours effort.  I started testing back in 4.3 but
avoided moving forward to avoid the 4.4 development cycle (and
incidental issues that it might have muddled the waters).  That said,
what would be the best way to proceed?

  Change device removal tests to avoid proprietary drivers.

  What about the other out-of-tree device drivers (mpt3sas, ixgbe,
  etc.)?  These are open source, but contain much Stratus
  device removal paranoia that upstream hasn't adopted.

  Rebase evil(TM) proprietary/out-of-tree drivers against 4.4 or
  4.5rcX, apply this patchset and any other required device removal
  fixups.

If proprietary/out-of-tree drivers are a debugging deal breaker, I
understand.  The platform offers a unique hotplug testbed, so I try to
contribute testing and bug reports where I feel they apply equally
to untainted upstream.

Regards,

-- Joe

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-20  3:57           ` Joe Lawrence
@ 2016-01-20  8:26             ` Borislav Petkov
  2016-01-22 15:28               ` Joe Lawrence
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2016-01-20  8:26 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On Tue, Jan 19, 2016 at 10:57:35PM -0500, Joe Lawrence wrote:
> Unfortunately I only have a single box to run these tests in what
> amounts to an after-hours effort.  I started testing back in 4.3 but
> avoided moving forward to avoid the 4.4 development cycle (and
> incidental issues that it might have muddled the waters).  That said,
> what would be the best way to proceed?
> 
>   Change device removal tests to avoid proprietary drivers.

Right, I'd tend to suggest that. tglx says you could also try enabling
CONFIG_DEBUG_OBJECTS and CONFIG_DEBUG_OBJECTS_TIMERS and collect full
dmesg from when the corruption happens. That might give us some more
insights as to what happens.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch 00/14] x86/irq: Plug various vector cleanup races
  2016-01-20  8:26             ` Borislav Petkov
@ 2016-01-22 15:28               ` Joe Lawrence
  0 siblings, 0 replies; 38+ messages in thread
From: Joe Lawrence @ 2016-01-22 15:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Peter Anvin, Jiang Liu,
	Jeremiah Mahler, andy.shevchenko, Guenter Roeck

On 01/20/2016 03:26 AM, Borislav Petkov wrote:
> ... tglx says you could also try enabling
> CONFIG_DEBUG_OBJECTS and CONFIG_DEBUG_OBJECTS_TIMERS and collect full
> dmesg from when the corruption happens. That might give us some more
> insights as to what happens.

Hi Boris,

Excellent idea.  With CONFIG_DEBUG_OBJECTS_FREE=y  I've hit this twice:

ODEBUG: free active (active state 0) object type: timer_list hint:
xhci_stop_endpoint_command_watchdog+0x0/0x2b0

So I'll try to hunt that down on the xHCI side and report over to
linux-usb.  Sorry for the unrelated noise on this patch -- my guess is
that timing changed just enough for the xHCI driver to hit this. :)

-- Joe

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-01-22 15:28 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-31 16:30 [patch 00/14] x86/irq: Plug various vector cleanup races Thomas Gleixner
2015-12-31 16:30 ` [patch 01/14] x86/irq: Fix a race in x86_vector_free_irqs() Thomas Gleixner
2015-12-31 16:30 ` [patch 02/14] x86/irq: Validate that irq descriptor is still active Thomas Gleixner
2016-01-16 21:16   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 04/14] x86/irq: Reorganize the return path in assign_irq_vector Thomas Gleixner
2016-01-16 21:17   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 03/14] x86/irq: Do not use apic_chip_data.old_domain as temporary buffer Thomas Gleixner
2015-12-31 16:30 ` [patch 05/14] x86/irq: Reorganize the search in assign_irq_vector Thomas Gleixner
2016-01-16 21:17   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 06/14] x86/irq: Check vector allocation early Thomas Gleixner
2016-01-16 21:17   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 08/14] x86/irq: Get rid of code duplication Thomas Gleixner
2016-01-16 21:18   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 07/14] x86/irq: Copy vectormask instead of an AND operation Thomas Gleixner
2016-01-16 21:18   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 09/14] x86/irq: Remove offline cpus from vector cleanup Thomas Gleixner
2016-01-16 21:18   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 10/14] x86/irq: Clear move_in_progress before sending cleanup IPI Thomas Gleixner
2016-01-16 21:19   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 12/14] x86/irq: Remove outgoing CPU from vector cleanup mask Thomas Gleixner
2016-01-16 21:19   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 11/14] x86/irq: Remove the cpumask allocation from send_cleanup_vector() Thomas Gleixner
2016-01-16 21:19   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 13/14] x86/irq: Call irq_force_move_complete with irq descriptor Thomas Gleixner
2016-01-16 21:20   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2015-12-31 16:30 ` [patch 14/14] x86/irq: Plug vector cleanup race Thomas Gleixner
2016-01-16 21:20   ` [tip:x86/urgent] " tip-bot for Thomas Gleixner
2016-01-04 15:35 ` [patch 00/14] x86/irq: Plug various vector cleanup races Joe Lawrence
2016-01-14  8:24   ` Thomas Gleixner
2016-01-14 10:33     ` Borislav Petkov
2016-01-16 21:37       ` Joe Lawrence
2016-01-18 15:00         ` Joe Lawrence
2016-01-18 15:43           ` Borislav Petkov
2016-01-18 16:38             ` Joe Lawrence
2016-01-20  3:57           ` Joe Lawrence
2016-01-20  8:26             ` Borislav Petkov
2016-01-22 15:28               ` Joe Lawrence
2016-01-16 21:15     ` [tip:x86/urgent] x86/irq: Call chip-> irq_set_affinity in proper context tip-bot for Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.