* [PATCH] smp_call_function_many SMP race
@ 2010-03-23 11:15 Anton Blanchard
  2010-03-23 12:26 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Anton Blanchard @ 2010-03-23 11:15 UTC (permalink / raw)
  To: Xiao Guangrong, Ingo Molnar, Jens Axboe, Nick Piggin,
	Peter Zijlstra, Rusty Russell, Andrew Morton, Linus Torvalds,
	paulmck, Milton Miller, Nick Piggin
  Cc: linux-kernel


I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

                if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                        continue;

                data->csd.func(data->csd.info);

                refs = atomic_dec_return(&data->refs);
                WARN_ON(refs < 0);      <-------------------------

We atomically tested and cleared our bit in the cpumask, and yet the number
of cpus left (ie refs) was 0. How can this be?

It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
locking from smp_call_function_many and in doing so creates a rather
complicated race.

The problem comes about because:

- The smp_call_function_many interrupt handler walks call_function.queue
  without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
  smp_call_function_many.

Imagine a scenario where CPU A does two smp_call_functions back to back, and
CPU B does an smp_call_function in between. We concentrate on how CPU C handles
the calls:


CPU A                  CPU B                  CPU C

smp_call_function
                                              smp_call_function_interrupt
                                                walks call_function.queue
                                                sees CPU A on list

                         smp_call_function

                                              smp_call_function_interrupt
                                                walks call_function.queue
                                                sees (stale) CPU A on list
smp_call_function
  reuses percpu *data
  set data->cpumask
                                                sees and clears bit in cpumask!
                                                sees data->refs is 0!

  set data->refs (too late!)


The important thing to note is that since the interrupt handler walks a potentially
stale call_function.queue without any locking, another cpu can view the
percpu *data structure at any time, even when the owner is in the process
of initialising it.
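
To make the window concrete, the setup side looks roughly like this
(condensed from the code the diff below touches, with the csd locking
elided):

        data = &__get_cpu_var(cfd_data);        /* percpu structure, reused */

        data->csd.func = func;
        data->csd.info = info;
        cpumask_and(data->cpumask, mask, cpu_online_mask);
        cpumask_clear_cpu(this_cpu, data->cpumask);

        /* nothing orders the cpumask stores against the refs store, so a
         * cpu walking a stale list entry can see the new cpumask bit while
         * ->refs is still 0 from the previous round */
        atomic_set(&data->refs, cpumask_weight(data->cpumask));

        raw_spin_lock_irqsave(&call_function.lock, flags);
        list_add_rcu(&data->csd.list, &call_function.queue);
        raw_spin_unlock_irqrestore(&call_function.lock, flags);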

The following test case hits the WARN_ON 100% of the time on my PowerPC box
(having 128 threads does help :)


#include <linux/module.h>
#include <linux/init.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");


I tried to fix it by ordering the read and the write of ->cpumask and ->refs.
In doing so I missed a critical case, but Paul McKenney was able to spot
my bug thankfully :) To ensure we aren't viewing previous iterations, the
interrupt handler needs to read ->refs then ->cpumask then ->refs _again_.

Thanks to Milton Miller and Paul McKenney for helping to debug this issue.

---

My head hurts. This needs some serious analysis before we can be sure it
fixes all the races. With all these memory barriers, maybe the previous
spinlocks weren't so bad after all :)


Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c	2010-03-23 05:09:08.000000000 -0500
+++ linux-2.6/kernel/smp.c	2010-03-23 06:12:40.000000000 -0500
@@ -193,6 +193,31 @@ void generic_smp_call_function_interrupt
 	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
 		int refs;
 
+		/*
+		 * Since we walk the list without any locks, we might
+		 * see an entry that was completed, removed from the
+		 * list and is in the process of being reused.
+		 *
+		 * Just checking data->refs then data->cpumask is not good
+		 * enough because we could see a non zero data->refs from a
+		 * previous iteration. We need to check data->refs, then
+		 * data->cpumask then data->refs again. Talk about
+		 * complicated!
+		 */
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
+		smp_rmb();
+
+		if (!cpumask_test_cpu(cpu, data->cpumask))
+			continue;
+
+		smp_rmb();
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
 		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
 			continue;
 
@@ -446,6 +471,14 @@ void smp_call_function_many(const struct
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
 	cpumask_clear_cpu(this_cpu, data->cpumask);
+
+	/*
+	 * To ensure the interrupt handler gets an up to date view
+	 * we order the cpumask and refs writes and order the
+	 * read of them in the interrupt handler.
+	 */
+	smp_wmb();
+
 	atomic_set(&data->refs, cpumask_weight(data->cpumask));
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);


* Re: [PATCH] smp_call_function_many SMP race
  2010-03-23 11:15 [PATCH] smp_call_function_many SMP race Anton Blanchard
@ 2010-03-23 12:26 ` Peter Zijlstra
  2010-03-23 15:33   ` Paul E. McKenney
  2010-03-23 21:31   ` Anton Blanchard
  2010-03-23 16:41 ` Paul E. McKenney
  2010-05-03 14:24 ` Peter Zijlstra
  2 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2010-03-23 12:26 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Xiao Guangrong, Ingo Molnar, Jens Axboe, Nick Piggin,
	Rusty Russell, Andrew Morton, Linus Torvalds, paulmck,
	Milton Miller, Nick Piggin, linux-kernel

On Tue, 2010-03-23 at 22:15 +1100, Anton Blanchard wrote:
> 
> It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
> cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
> locking from smp_call_function_many and in doing so creates a rather
> complicated race. 

A rather simple question, since my brain isn't quite ready to process the
content here..

Isn't reverting that one patch a simpler solution than adding all that
extra logic? If not, then the above statement seems false and we had a
bug even with that preempt_enable/disable() pair.

Just wondering.. :-)


* Re: [PATCH] smp_call_function_many SMP race
  2010-03-23 12:26 ` Peter Zijlstra
@ 2010-03-23 15:33   ` Paul E. McKenney
  2010-03-23 15:49     ` Peter Zijlstra
  2010-03-23 21:31   ` Anton Blanchard
  1 sibling, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2010-03-23 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anton Blanchard, Xiao Guangrong, Ingo Molnar, Jens Axboe,
	Nick Piggin, Rusty Russell, Andrew Morton, Linus Torvalds,
	Milton Miller, Nick Piggin, linux-kernel

On Tue, Mar 23, 2010 at 01:26:43PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-03-23 at 22:15 +1100, Anton Blanchard wrote:
> > 
> > It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
> > cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
> > locking from smp_call_function_many and in doing so creates a rather
> > complicated race. 
> 
> A rather simple question since my brain isn't quite ready processing the
> content here..
> 
> Isn't reverting that one patch a simpler solution than adding all that
> extra logic? If not, then the above statement seems false and we had a
> bug even with that preempt_enable/disable() pair.
> 
> Just wondering.. :-)

If I understand correctly, if you want to fix it by reverting patches,
you have to revert back to simple locking (up to and including
54fdade1c3332391948ec43530c02c4794a38172).  And I believe that the poor
performance of simple locking was the whole reason for the series of patches.

							Thanx, Paul


* Re: [PATCH] smp_call_function_many SMP race
  2010-03-23 15:33   ` Paul E. McKenney
@ 2010-03-23 15:49     ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2010-03-23 15:49 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Xiao Guangrong, Ingo Molnar, Jens Axboe,
	Nick Piggin, Rusty Russell, Andrew Morton, Linus Torvalds,
	Milton Miller, Nick Piggin, linux-kernel

On Tue, 2010-03-23 at 08:33 -0700, Paul E. McKenney wrote:
> On Tue, Mar 23, 2010 at 01:26:43PM +0100, Peter Zijlstra wrote:
> > On Tue, 2010-03-23 at 22:15 +1100, Anton Blanchard wrote:
> > > 
> > > It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
> > > cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
> > > locking from smp_call_function_many and in doing so creates a rather
> > > complicated race. 
> > 
> > A rather simple question since my brain isn't quite ready processing the
> > content here..
> > 
> > Isn't reverting that one patch a simpler solution than adding all that
> > extra logic? If not, then the above statement seems false and we had a
> > bug even with that preempt_enable/disable() pair.
> > 
> > Just wondering.. :-)
> 
> If I understand correctly, if you want to fix it by reverting patches,
> you have to revert back to simple locking (up to and including
> 54fdade1c3332391948ec43530c02c4794a38172).  And I believe that the poor
> performance of simple locking was whole reason for the series of patches.

Right, then c0f68c2 did not in fact cause this bug..


* Re: [PATCH] smp_call_function_many SMP race
  2010-03-23 11:15 [PATCH] smp_call_function_many SMP race Anton Blanchard
  2010-03-23 12:26 ` Peter Zijlstra
@ 2010-03-23 16:41 ` Paul E. McKenney
  2010-05-03 14:24 ` Peter Zijlstra
  2 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2010-03-23 16:41 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Xiao Guangrong, Ingo Molnar, Jens Axboe, Nick Piggin,
	Peter Zijlstra, Rusty Russell, Andrew Morton, Linus Torvalds,
	Milton Miller, Nick Piggin, linux-kernel

On Tue, Mar 23, 2010 at 10:15:56PM +1100, Anton Blanchard wrote:
> 
> I noticed a failure where we hit the following WARN_ON in
> generic_smp_call_function_interrupt:
> 
>                 if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
>                         continue;
> 
>                 data->csd.func(data->csd.info);
> 
>                 refs = atomic_dec_return(&data->refs);
>                 WARN_ON(refs < 0);      <-------------------------
> 
> We atomically tested and cleared our bit in the cpumask, and yet the number
> of cpus left (ie refs) was 0. How can this be?
> 
> It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
> cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
> locking from smp_call_function_many and in doing so creates a rather
> complicated race.
> 
> The problem comes about because:
> 
> - The smp_call_function_many interrupt handler walks call_function.queue
>   without any locking.
> - We reuse a percpu data structure in smp_call_function_many.
> - We do not wait for any RCU grace period before starting the next
>   smp_call_function_many.
> 
> Imagine a scenario where CPU A does two smp_call_functions back to back, and
> CPU B does an smp_call_function in between. We concentrate on how CPU C handles
> the calls:
> 
> 
> CPU A                  CPU B                  CPU C
> 
> smp_call_function
>                                               smp_call_function_interrupt
>                                                 walks call_function.queue
>                                                 sees CPU A on list
> 
>                          smp_call_function
> 
>                                               smp_call_function_interrupt
>                                                 walks call_function.queue
>                                                 sees (stale) CPU A on list
> smp_call_function
>   reuses percpu *data
>   set data->cpumask
>                                                 sees and clears bit in cpumask!
>                                                 sees data->refs is 0!
> 
>   set data->refs (too late!)
> 
> 
> The important thing to note is since the interrupt handler walks a potentially
> stale call_function.queue without any locking, then another cpu can view the
> percpu *data structure at any time, even when the owner is in the process
> of initialising it.
> 
> The following test case hits the WARN_ON 100% of the time on my PowerPC box
> (having 128 threads does help :)
> 
> 
> #include <linux/module.h>
> #include <linux/init.h>
> 
> #define ITERATIONS 100
> 
> static void do_nothing_ipi(void *dummy)
> {
> }
> 
> static void do_ipis(struct work_struct *dummy)
> {
> 	int i;
> 
> 	for (i = 0; i < ITERATIONS; i++)
> 		smp_call_function(do_nothing_ipi, NULL, 1);
> 
> 	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
> }
> 
> static struct work_struct work[NR_CPUS];
> 
> static int __init testcase_init(void)
> {
> 	int cpu;
> 
> 	for_each_online_cpu(cpu) {
> 		INIT_WORK(&work[cpu], do_ipis);
> 		schedule_work_on(cpu, &work[cpu]);
> 	}
> 
> 	return 0;
> }
> 
> static void __exit testcase_exit(void)
> {
> }
> 
> module_init(testcase_init)
> module_exit(testcase_exit)
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Anton Blanchard");
> 
> 
> I tried to fix it by ordering the read and the write of ->cpumask and ->refs.
> In doing so I missed a critical case but Paul McKenney was able to spot
> my bug thankfully :) To ensure we arent viewing previous iterations the
> interrupt handler needs to read ->refs then ->cpumask then ->refs _again_.
> 
> Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
> 
> ---
> 
> My head hurts. This needs some serious analysis before we can be sure it
> fixes all the races. With all these memory barriers, maybe the previous
> spinlocks weren't so bad after all :)

;-)

Does this patch appear to have fixed things, or do you still have a
failure rate?  In other words, should I be working on a proof of
(in)correctness, or should I be looking for further bugs?

							Thanx, Paul

> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c	2010-03-23 05:09:08.000000000 -0500
> +++ linux-2.6/kernel/smp.c	2010-03-23 06:12:40.000000000 -0500
> @@ -193,6 +193,31 @@ void generic_smp_call_function_interrupt
>  	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
>  		int refs;
> 
> +		/*
> +		 * Since we walk the list without any locks, we might
> +		 * see an entry that was completed, removed from the
> +		 * list and is in the process of being reused.
> +		 *
> +		 * Just checking data->refs then data->cpumask is not good
> +		 * enough because we could see a non zero data->refs from a
> +		 * previous iteration. We need to check data->refs, then
> +		 * data->cpumask then data->refs again. Talk about
> +		 * complicated!
> +		 */
> +
> +		if (atomic_read(&data->refs) == 0)
> +			continue;
> +
> +		smp_rmb();
> +
> +		if (!cpumask_test_cpu(cpu, data->cpumask))
> +			continue;
> +
> +		smp_rmb();
> +
> +		if (atomic_read(&data->refs) == 0)
> +			continue;
> +
>  		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
>  			continue;
> 
> @@ -446,6 +471,14 @@ void smp_call_function_many(const struct
>  	data->csd.info = info;
>  	cpumask_and(data->cpumask, mask, cpu_online_mask);
>  	cpumask_clear_cpu(this_cpu, data->cpumask);
> +
> +	/*
> +	 * To ensure the interrupt handler gets an up to date view
> +	 * we order the cpumask and refs writes and order the
> +	 * read of them in the interrupt handler.
> +	 */
> +	smp_wmb();
> +
>  	atomic_set(&data->refs, cpumask_weight(data->cpumask));
> 
>  	raw_spin_lock_irqsave(&call_function.lock, flags);


* Re: [PATCH] smp_call_function_many SMP race
  2010-03-23 12:26 ` Peter Zijlstra
  2010-03-23 15:33   ` Paul E. McKenney
@ 2010-03-23 21:31   ` Anton Blanchard
  1 sibling, 0 replies; 10+ messages in thread
From: Anton Blanchard @ 2010-03-23 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Xiao Guangrong, Ingo Molnar, Jens Axboe, Nick Piggin,
	Rusty Russell, Andrew Morton, Linus Torvalds, paulmck,
	Milton Miller, Nick Piggin, linux-kernel

 
Hi Peter,

> A rather simple question since my brain isn't quite ready processing the
> content here..

After working my way through that bug my brain wasn't in good shape
either because:

> Isn't reverting that one patch a simpler solution than adding all that
> extra logic? If not, then the above statement seems false and we had a
> bug even with that preempt_enable/disable() pair.

I screwed up the explanation :( The commit that actually causes it is
54fdade1c3332391948ec43530c02c4794a38172 (generic-ipi: make struct
call_function_data lockless), and backing it out fixes the issue.

Anton


* Re: [PATCH] smp_call_function_many SMP race
  2010-03-23 11:15 [PATCH] smp_call_function_many SMP race Anton Blanchard
  2010-03-23 12:26 ` Peter Zijlstra
  2010-03-23 16:41 ` Paul E. McKenney
@ 2010-05-03 14:24 ` Peter Zijlstra
  2 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2010-05-03 14:24 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Xiao Guangrong, Ingo Molnar, Jens Axboe, Nick Piggin,
	Rusty Russell, Andrew Morton, Linus Torvalds, paulmck,
	Milton Miller, Nick Piggin, linux-kernel

On Tue, 2010-03-23 at 22:15 +1100, Anton Blanchard wrote:
> 
> My head hurts. This needs some serious analysis before we can be sure it
> fixes all the races. With all these memory barriers, maybe the previous
> spinlocks weren't so bad after all :)
> 
> 
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c 2010-03-23 05:09:08.000000000 -0500
> +++ linux-2.6/kernel/smp.c      2010-03-23 06:12:40.000000000 -0500
> @@ -193,6 +193,31 @@ void generic_smp_call_function_interrupt
>         list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
>                 int refs;
>  
> +               /*
> +                * Since we walk the list without any locks, we might
> +                * see an entry that was completed, removed from the
> +                * list and is in the process of being reused.
> +                *
> +                * Just checking data->refs then data->cpumask is not good
> +                * enough because we could see a non zero data->refs from a
> +                * previous iteration. We need to check data->refs, then
> +                * data->cpumask then data->refs again. Talk about
> +                * complicated!
> +                */

But the atomic_dec_return() implies a mb, which comes before the
list_del_rcu(); also, the next enqueue will have a wmb in
list_add_rcu(), so it seems to me that if we issue an rmb it would be
impossible to see a non-zero ref from the previous enlisting.

> +               if (atomic_read(&data->refs) == 0)
> +                       continue;
> +
> +               smp_rmb();
> +
> +               if (!cpumask_test_cpu(cpu, data->cpumask))
> +                       continue;
> +
> +               smp_rmb();
> +
> +               if (atomic_read(&data->refs) == 0)
> +                       continue;
> +
>                 if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
>                         continue;
>  
> @@ -446,6 +471,14 @@ void smp_call_function_many(const struct
>         data->csd.info = info;
>         cpumask_and(data->cpumask, mask, cpu_online_mask);
>         cpumask_clear_cpu(this_cpu, data->cpumask);
> +
> +       /*
> +        * To ensure the interrupt handler gets an up to date view
> +        * we order the cpumask and refs writes and order the
> +        * read of them in the interrupt handler.
> +        */
> +       smp_wmb();
> +
>         atomic_set(&data->refs, cpumask_weight(data->cpumask));

We could make this an actual atomic instruction of course..
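
e.g. something like (just a sketch; a value-returning atomic would also
imply full barriers around the store):

        (void)atomic_xchg(&data->refs, cpumask_weight(data->cpumask));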

>         raw_spin_lock_irqsave(&call_function.lock, flags);
> 
> 




* RE: [PATCH] smp_call_function_many SMP race
  2011-01-17 18:17 ` Peter Zijlstra
@ 2011-01-18 21:05   ` Milton Miller
  0 siblings, 0 replies; 10+ messages in thread
From: Milton Miller @ 2011-01-18 21:05 UTC (permalink / raw)
  To: Peter Zijlstra, Anton Blanchard, Paul E. McKenney
  Cc: xiaoguangrong, mingo, jaxboe, npiggin, rusty, akpm, torvalds,
	miltonm, benh, linux-kernel

On Mon, 17 Jan 2011 around 19:17:33 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-12 at 15:07 +1100, Anton Blanchard wrote:
> 
> > I managed to forget all about this bug, probably because of how much it
> > makes my brain hurt.
> 
> Agreed.
> 
> 
> > I tried to fix it by ordering the read and the write of ->cpumask and
> > ->refs. In doing so I missed a critical case but Paul McKenney was able
> > to spot my bug thankfully :) To ensure we arent viewing previous
> > iterations the interrupt handler needs to read ->refs then ->cpumask
> > then ->refs _again_.
> 
> > ---
> > 
> > Index: linux-2.6/kernel/smp.c
> > =====================================================================
> > --- linux-2.6.orig/kernel/smp.c 2010-12-22 17:19:11.262835785 +1100
> > +++ linux-2.6/kernel/smp.c      2011-01-12 15:03:08.793324402 +1100
> > @@ -194,6 +194,31 @@ void generic_smp_call_function_interrupt
> >         list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
> >                 int refs;
> >  
> > +               /*
> > +                * Since we walk the list without any locks, we might
> > +                * see an entry that was completed, removed from the
> > +                * list and is in the process of being reused.
> > +                *
> > +                * Just checking data->refs then data->cpumask is not good
> > +                * enough because we could see a non zero data->refs from a
> > +                * previous iteration. We need to check data->refs, then
> > +                * data->cpumask then data->refs again. Talk about
> > +                * complicated!
> > +                */
> > +
> > +               if (atomic_read(&data->refs) == 0)
> > +                       continue;
> > +
> 
> So here we might see the old ref
> 
> > +               smp_rmb();
> > +
> > +               if (!cpumask_test_cpu(cpu, data->cpumask))
> > +                       continue;
> 
> Here we might see the new cpumask
> 
> > +               smp_rmb();
> > +
> > +               if (atomic_read(&data->refs) == 0)
> > +                       continue;
> > +
> 
> But then still see a 0 ref, at which point we skip this entry and rely
> on the fact that arch_send_call_function_ipi_mask() will simply latch
> our IPI line and cause a back-to-back IPI such that we can process the
> data on the second go-round?
> 
> >                 if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
> >                         continue;
> 
> And finally, once we observe a valid ->refs, do we test the ->cpumask
> again so we cross with the store order (->cpumask first, then ->refs).
> 
> > @@ -458,6 +483,14 @@ void smp_call_function_many(const struct
> >         data->csd.info = info;
> >         cpumask_and(data->cpumask, mask, cpu_online_mask);
> >         cpumask_clear_cpu(this_cpu, data->cpumask);
> > +
> > +       /*
> > +        * To ensure the interrupt handler gets an up to date view
> > +        * we order the cpumask and refs writes and order the
> > +        * read of them in the interrupt handler.
> > +        */
> > +       smp_wmb();
> > +
> >         atomic_set(&data->refs, cpumask_weight(data->cpumask));
> >  
> >         raw_spin_lock_irqsave(&call_function.lock, flags); 
> 
> Read side:			Write side:
> 
> list_for_each_rcu()
>   !->refs, continue		  ->cpumask = 
> rmb				wmb
>   !->cpumask, continue		  ->refs = 
> rmb				wmb
>   !->refs, continue		  list_add_rcu()
> mb
>   !->cpumask, continue
> 
> 
> 
> Wouldn't something like:
> 
> list_for_each_rcu()
>   !->cpumask, continue		  ->refs =
> rmb				wmb
>   !->refs, continue		  ->cpumask =
> mb				wmb
>   !->cpumask, continue		  list_add_rcu()
> 

Yes, I believe it does.   Paul found the race case after I went home,
and I found the resulting patch with the extra calls posted the next
morning.   When I tried to raise the issue, Paul said he wanted me to
try it on hardware before he did the analysis again.   I finally got a
machine to do that yesterday afternoon.


> 
> Suffice? There we can observe the old ->cpumask, new ->refs and new
> ->cpumask in crossed order, so we filter out the old, and cross the new,
> and have one rmb and conditional less.
> 
> Or am I totally missing something here,.. like said, this stuff hurts
> brains.
> 
> 

In fact, if we assert that the called function is not allowed to enable
interrupts, then we can consolidate both writes to be after the function
is executed instead of doing one atomic read/write before (for the
cpumask bit) and a second one after (for the ref count).   I ran the 
timings on Anton's test case on a 4-node 256 thread power7 box.
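
Roughly, the consolidated handler body would look like this (a sketch of
the idea only, not the patches that follow; the write-side ordering of
->cpumask against ->refs still matters as discussed above):

        if (!cpumask_test_cpu(cpu, data->cpumask))
                continue;

        smp_rmb();

        if (atomic_read(&data->refs) == 0)
                continue;

        data->csd.func(data->csd.info);

        /* both writes happen after the function has run, and rely on the
         * function leaving interrupts disabled */
        if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                continue;

        refs = atomic_dec_return(&data->refs);
        WARN_ON(refs < 0);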

What follows is a two patch set.   I am keeping the second
separate in case some weird function is enabling interrupts in the
middle of this.

milton


* RE: [PATCH] smp_call_function_many SMP race
  2011-01-12  4:07 Anton Blanchard
@ 2011-01-17 18:17 ` Peter Zijlstra
  2011-01-18 21:05   ` Milton Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2011-01-17 18:17 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: xiaoguangrong, mingo, jaxboe, npiggin, rusty, akpm, torvalds,
	paulmck, miltonm, benh, linux-kernel

On Wed, 2011-01-12 at 15:07 +1100, Anton Blanchard wrote:

> I managed to forget all about this bug, probably because of how much it
> makes my brain hurt.

Agreed.


> I tried to fix it by ordering the read and the write of ->cpumask and
> ->refs. In doing so I missed a critical case but Paul McKenney was able
> to spot my bug thankfully :) To ensure we arent viewing previous
> iterations the interrupt handler needs to read ->refs then ->cpumask
> then ->refs _again_.

> ---
> 
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c 2010-12-22 17:19:11.262835785 +1100
> +++ linux-2.6/kernel/smp.c      2011-01-12 15:03:08.793324402 +1100
> @@ -194,6 +194,31 @@ void generic_smp_call_function_interrupt
>         list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
>                 int refs;
>  
> +               /*
> +                * Since we walk the list without any locks, we might
> +                * see an entry that was completed, removed from the
> +                * list and is in the process of being reused.
> +                *
> +                * Just checking data->refs then data->cpumask is not good
> +                * enough because we could see a non zero data->refs from a
> +                * previous iteration. We need to check data->refs, then
> +                * data->cpumask then data->refs again. Talk about
> +                * complicated!
> +                */
> +
> +               if (atomic_read(&data->refs) == 0)
> +                       continue;
> +

So here we might see the old ref

> +               smp_rmb();
> +
> +               if (!cpumask_test_cpu(cpu, data->cpumask))
> +                       continue;

Here we might see the new cpumask

> +               smp_rmb();
> +
> +               if (atomic_read(&data->refs) == 0)
> +                       continue;
> +

But then still see a 0 ref, at which point we skip this entry and rely
on the fact that arch_send_call_function_ipi_mask() will simply latch
our IPI line and cause a back-to-back IPI such that we can process the
data on the second go-round?

>                 if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
>                         continue;

And finally, once we observe a valid ->refs, do we test the ->cpumask
again so we cross with the store order (->cpumask first, then ->refs).

> @@ -458,6 +483,14 @@ void smp_call_function_many(const struct
>         data->csd.info = info;
>         cpumask_and(data->cpumask, mask, cpu_online_mask);
>         cpumask_clear_cpu(this_cpu, data->cpumask);
> +
> +       /*
> +        * To ensure the interrupt handler gets an up to date view
> +        * we order the cpumask and refs writes and order the
> +        * read of them in the interrupt handler.
> +        */
> +       smp_wmb();
> +
>         atomic_set(&data->refs, cpumask_weight(data->cpumask));
>  
>         raw_spin_lock_irqsave(&call_function.lock, flags); 

Read side:			Write side:

list_for_each_rcu()
  !->refs, continue		  ->cpumask = 
rmb				wmb
  !->cpumask, continue		  ->refs = 
rmb				wmb
  !->refs, continue		  list_add_rcu()
mb
  !->cpumask, continue



Wouldn't something like:

list_for_each_rcu()
  !->cpumask, continue		  ->refs =
rmb				wmb
  !->refs, continue		  ->cpumask =
mb				wmb
  !->cpumask, continue		  list_add_rcu()


Suffice? There we can observe the old ->cpumask, new ->refs and new
->cpumask in crossed order, so we filter out the old, and cross the new,
and have one rmb and conditional less.
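
In code, that read side would be something like (a sketch of the diagram
above, not a tested patch; the write side stores ->refs, wmb, then
->cpumask before the list_add_rcu()):

        if (!cpumask_test_cpu(cpu, data->cpumask))
                continue;

        smp_rmb();

        if (atomic_read(&data->refs) == 0)
                continue;

        /* full barrier so the final cpumask check crosses with the store
         * order (->refs first, then ->cpumask) */
        smp_mb();

        if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                continue;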

Or am I totally missing something here,.. like said, this stuff hurts
brains.


* RE: [PATCH] smp_call_function_many SMP race
@ 2011-01-12  4:07 Anton Blanchard
  2011-01-17 18:17 ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Anton Blanchard @ 2011-01-12  4:07 UTC (permalink / raw)
  To: xiaoguangrong, mingo, jaxboe, npiggin, peterz, rusty, akpm,
	torvalds, paulmck, miltonm, benh
  Cc: linux-kernel


Hi,

I managed to forget all about this bug, probably because of how much it
makes my brain hurt.

The issue is not that we use RCU, but that we use RCU on a static data
structure that gets reused without waiting for an RCU grace period.
Another way to solve this bug would be to dynamically allocate the
structure, assuming we are OK with the overhead.
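
For reference, that alternative would be something along these lines (a
hypothetical sketch only, assuming an rcu_head is added to struct
call_function_data and with the cpumask allocation elided):

        static void cfd_free_rcu(struct rcu_head *head)
        {
                kfree(container_of(head, struct call_function_data, rcu));
        }

        /* smp_call_function_many(): allocate a fresh structure per call
         * instead of reusing the percpu one (failure handling elided) */
        data = kmalloc(sizeof(*data), GFP_ATOMIC);

        /* interrupt handler, once refs drops to zero: unlink the entry and
         * only free it after a grace period, so a cpu walking a stale list
         * can never see it being reinitialised */
        raw_spin_lock(&call_function.lock);
        list_del_rcu(&data->csd.list);
        raw_spin_unlock(&call_function.lock);
        call_rcu(&data->rcu, cfd_free_rcu);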

Anton


From: Anton Blanchard <anton@samba.org>

I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

                if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                        continue;

                data->csd.func(data->csd.info);

                refs = atomic_dec_return(&data->refs);
                WARN_ON(refs < 0);      <-------------------------

We atomically tested and cleared our bit in the cpumask, and yet the
number of cpus left (ie refs) was 0. How can this be?

It turns out commit 54fdade1c3332391948ec43530c02c4794a38172
(generic-ipi: make struct call_function_data lockless)
is at fault. It removes locking from smp_call_function_many and in
doing so creates a rather complicated race.

The problem comes about because:

- The smp_call_function_many interrupt handler walks call_function.queue
  without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
  smp_call_function_many.

Imagine a scenario where CPU A does two smp_call_functions back to
back, and CPU B does an smp_call_function in between. We concentrate on
how CPU C handles the calls:


CPU A                  CPU B                  CPU C

smp_call_function
                                              smp_call_function_interrupt
                                                walks call_function.queue
                                                sees CPU A on list

                         smp_call_function

                                              smp_call_function_interrupt
                                                walks call_function.queue
                                                sees (stale) CPU A on list
smp_call_function
  reuses percpu *data
  set data->cpumask
                                                sees and clears bit in cpumask!
                                                sees data->refs is 0!

  set data->refs (too late!)


The important thing to note is that since the interrupt handler walks a
potentially stale call_function.queue without any locking, another
cpu can view the percpu *data structure at any time, even when the
owner is in the process of initialising it.

The following test case hits the WARN_ON 100% of the time on my PowerPC
box (having 128 threads does help :)


#include <linux/module.h>
#include <linux/init.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");


I tried to fix it by ordering the read and the write of ->cpumask and
->refs. In doing so I missed a critical case, but Paul McKenney was able
to spot my bug thankfully :) To ensure we aren't viewing previous
iterations the interrupt handler needs to read ->refs then ->cpumask
then ->refs _again_.

Thanks to Milton Miller and Paul McKenney for helping to debug this
issue.

---

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c	2010-12-22 17:19:11.262835785 +1100
+++ linux-2.6/kernel/smp.c	2011-01-12 15:03:08.793324402 +1100
@@ -194,6 +194,31 @@ void generic_smp_call_function_interrupt
 	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
 		int refs;
 
+		/*
+		 * Since we walk the list without any locks, we might
+		 * see an entry that was completed, removed from the
+		 * list and is in the process of being reused.
+		 *
+		 * Just checking data->refs then data->cpumask is not good
+		 * enough because we could see a non zero data->refs from a
+		 * previous iteration. We need to check data->refs, then
+		 * data->cpumask then data->refs again. Talk about
+		 * complicated!
+		 */
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
+		smp_rmb();
+
+		if (!cpumask_test_cpu(cpu, data->cpumask))
+			continue;
+
+		smp_rmb();
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
 		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
 			continue;
 
@@ -458,6 +483,14 @@ void smp_call_function_many(const struct
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
 	cpumask_clear_cpu(this_cpu, data->cpumask);
+
+	/*
+	 * To ensure the interrupt handler gets an up to date view
+	 * we order the cpumask and refs writes and order the
+	 * read of them in the interrupt handler.
+	 */
+	smp_wmb();
+
 	atomic_set(&data->refs, cpumask_weight(data->cpumask));
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);

