* [PATCH] percpu-rwsem: use barrier in unlock path
       [not found] <Pine.LNX.4.64.1210151716310.10685@file.rdu.redhat.com>
@ 2012-10-16 23:30 ` Mikulas Patocka
  2012-10-17  2:23   ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-16 23:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, linux-kernel

Hi

When I read my patch 62ac665ff9fc07497ca524bd20d6a96893d11071 again, it 
turns out that it is missing a barrier in percpu_up_write. Please apply 
this.

Mikulas

---

percpu-rwsem: use barrier in unlock path

Fix missing barrier in patch 62ac665ff9fc07497ca524bd20d6a96893d11071.

The lock is considered unlocked when p->locked is set to false.
Use a barrier to prevent reordering of operations around p->locked.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-17 01:17:55.000000000 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-17 01:21:46.000000000 +0200
@@ -66,6 +66,15 @@ static inline void percpu_down_write(str
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	/*
+	 * The lock is considered unlocked when p->locked is set to false.
+	 * Use a barrier to prevent reordering of operations around p->locked.
+	 */
+#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
+	barrier();
+#else
+	smp_mb();
+#endif
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }
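
For reference, a minimal usage sketch of the primitive being patched, assuming
the 3.6-era API (percpu_down_read/percpu_up_read on the read side,
percpu_down_write/percpu_up_write on the write side). This is a sketch for
illustration only, not part of the patch:

#include <linux/percpu-rwsem.h>

static struct percpu_rw_semaphore sem;	/* assumed initialized elsewhere */

static void reader(void)
{
	percpu_down_read(&sem);		/* fast path: per-cpu counter, no mutex */
	/* ... read the data protected by sem ... */
	percpu_up_read(&sem);
}

static void writer(void)
{
	percpu_down_write(&sem);	/* waits for all readers to drain */
	/* ... modify the data protected by sem ... */
	percpu_up_write(&sem);		/* ends with the "p->locked = false" store fixed above */
}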



* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-16 23:30 ` [PATCH] percpu-rwsem: use barrier in unlock path Mikulas Patocka
@ 2012-10-17  2:23   ` Linus Torvalds
  2012-10-17  5:58     ` Lai Jiangshan
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Linus Torvalds @ 2012-10-17  2:23 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Jens Axboe, linux-kernel, linux-arch

[ Architecture people, note the potential new SMP barrier! ]

On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> +       /*
> +        * The lock is considered unlocked when p->locked is set to false.
> +        * Use barrier prevent reordering of operations around p->locked.
> +        */
> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> +       barrier();
> +#else
> +       smp_mb();
> +#endif
>         p->locked = false;

Ugh. The #if is too ugly to live.

This is a classic case of "people who write their own serialization
primitives invariably get them wrong". And this fix is just horrible,
and code like this should not be allowed.

I suspect we could make the above x86-specific optimization be valid
by introducing a new barrier, called "smp_release_before_store()" or
something like that, which on x86 just happens to be a no-op (but
still a compiler barrier). That's because earlier reads will not pass
later stores, and stores are viewed in program order, so all x86
stores have "release consistency" modulo the theoretical PPro bugs
(that I don't think we actually ever saw in practice).

But it is possible that that is true on other architectures too, so
your hand-crafted x86-specific #if is not just ugly, it's liable to
get worse.

The alternative is to just say that we should use "smp_mb()"
unconditionally, depending on how critical this code actually ends up
being.

Added linux-arch in case architecture maintainers have comments on the
"smp_release_before_store()" thing. It would be kind of similar to the
odd "smp_read_barrier_depends()", except it would normally be a full
memory barrier, except on architectures where a weaker barrier might
be sufficient.

I suspect there may be many architectures where a "smp_wmb()" is
sufficient for this case, for the simple reason that no sane
microarchitecture would *ever* move earlier reads down past a later
write, so release consistency really only needs the local writes to be
ordered, not the full memory ordering.

Arch people?

The more optimal solution may be to mark the store *itself* to be
"store with release consistency", which on x86 would be a regular
store (with the compiler barrier), but on other architectures may be a
special memory operation. On architectures with
release/acquire-consistency, there's no separate barrier before the
store, the store instruction itself is done with special semantics. So
maybe the right thing to do is

   #define smp_release_consistency_store(val, ptr) ...

where on x86, the implementation would be a simple

   do { barrier(); *(ptr)=(val); } while (0)

but on other architectures it might be an inline asm with the required
magic store-with-release instruction.
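
A rough sketch of what such a macro could look like with a safe generic
default (the name and the fallback pattern are assumptions for illustration,
not an existing kernel interface; only architectures with a cheaper release
store would override it):

#ifndef smp_release_consistency_store
/* Safe generic default: full barrier, then the store. */
#define smp_release_consistency_store(val, ptr)		\
do {								\
	smp_mb();						\
	ACCESS_ONCE(*(ptr)) = (val);				\
} while (0)
#endif

/*
 * x86 (without the PPro/OOSTORE quirks) could then override this with the
 * cheap variant described above:
 *	do { barrier(); *(ptr) = (val); } while (0)
 */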

How important is this code sequence? Is the "percpu_up_write()"
function really so critical that we can't have an extra memory
barrier? Or do people perhaps see *other* places where
release-consistency-stores might be worth doing?

But in no event do I want to see that butt-ugly #if statement.

                Linus


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17  2:23   ` Linus Torvalds
@ 2012-10-17  5:58     ` Lai Jiangshan
  2012-10-17 15:07       ` Mikulas Patocka
  2012-10-17  9:56     ` Alan Cox
  2012-10-18 16:00     ` Mikulas Patocka
  2 siblings, 1 reply; 14+ messages in thread
From: Lai Jiangshan @ 2012-10-17  5:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-arch,
	Paul E. McKenney, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Eric Dumazet

On 10/17/2012 10:23 AM, Linus Torvalds wrote:
> [ Architecture people, note the potential new SMP barrier! ]
> 
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> +       /*
>> +        * The lock is considered unlocked when p->locked is set to false.
>> +        * Use barrier prevent reordering of operations around p->locked.
>> +        */
>> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
>> +       barrier();
>> +#else
>> +       smp_mb();
>> +#endif
>>         p->locked = false;
> 
> Ugh. The #if is too ugly to live.

Even if the previous patch is applied, percpu_down_read() still
needs an mb() to pair with it.

> 
> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.

One of the biggest problems with 62ac665ff9fc07497ca524bd20d6a96893d11071 is that
it was merged without an Acked-by or Reviewed-by from Paul, Peter or someone else
who is an expert on synchronization/arch memory models.

I suggest any new synchronization primitive should stay in -tip for 2 or more cycles
before being merged to mainline.

Thanks,
Lai


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17  2:23   ` Linus Torvalds
  2012-10-17  5:58     ` Lai Jiangshan
@ 2012-10-17  9:56     ` Alan Cox
  2012-10-18 16:00     ` Mikulas Patocka
  2 siblings, 0 replies; 14+ messages in thread
From: Alan Cox @ 2012-10-17  9:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-arch

> a later stores, and stores are viewed in program order, so all x86
> stores have "release consistency" modulo the theoretical PPro bugs
> (that I don't think we actually ever saw in practice).

We did see them for real. The PPro is on its way out however, so I hardly
think we care about optimising for that case.

Alan


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17  5:58     ` Lai Jiangshan
@ 2012-10-17 15:07       ` Mikulas Patocka
  2012-10-17 20:28         ` Steven Rostedt
  0 siblings, 1 reply; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-17 15:07 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Linus Torvalds, Jens Axboe, linux-kernel, linux-arch,
	Paul E. McKenney, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner, Eric Dumazet

Hi


On Wed, 17 Oct 2012, Lai Jiangshan wrote:

> On 10/17/2012 10:23 AM, Linus Torvalds wrote:
> > [ Architecture people, note the potential new SMP barrier! ]
> > 
> > On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> +       /*
> >> +        * The lock is considered unlocked when p->locked is set to false.
> >> +        * Use barrier prevent reordering of operations around p->locked.
> >> +        */
> >> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> >> +       barrier();
> >> +#else
> >> +       smp_mb();
> >> +#endif
> >>         p->locked = false;
> > 
> > Ugh. The #if is too ugly to live.
> 
> Even the previous patch is applied, percpu_down_read() still
> needs mb() to pair with it.

percpu_down_read uses rcu_read_lock, which should guarantee that memory
accesses don't escape in front of an rcu-protected section.

If rcu_read_unlock has only an unlock barrier and not a full barrier, 
memory accesses could be moved in front of rcu_read_unlock and reordered 
with this_cpu_inc(*p->counters), but it doesn't matter because 
percpu_down_write does synchronize_rcu(), so it never sees these accesses 
halfway through.
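
(For reference, the 3.6-era write-side path that this argument relies on looks
roughly like the following; reconstructed for illustration, so details may
differ slightly from the actual tree:)

static inline void percpu_down_write(struct percpu_rw_semaphore *p)
{
	mutex_lock(&p->mtx);
	p->locked = true;
	synchronize_rcu();	/* every reader now either sees locked == true
				 * or has already done this_cpu_inc() */
	while (__percpu_count(p->counters))
		msleep(1);	/* wait for the existing readers to drain */
	smp_rmb();	/* paired with smp_mb() in percpu_sem_up_read() */
}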

> > This is a classic case of "people who write their own serialization
> > primitives invariably get them wrong". And this fix is just horrible,
> > and code like this should not be allowed.
> 
> One of the most major problems of 62ac665ff9fc07497ca524bd20d6a96893d11071 is that
> it is merged without Ackeds or Revieweds from Paul or Peter or someone else
> who are expert at synchronization/arch memory models.
> 
> I suggest any new synchronization should stay in -tip for 2 or more cycles
> before merged to mainline.

But the bug that this synchronization is fixing is quite serious (it
causes random crashes when the block size is being changed; the crash happens
regularly at multiple important business sites), so it must be fixed soon
rather than waiting half a year.

> Thanks,
> Lai

Mikulas


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17 15:07       ` Mikulas Patocka
@ 2012-10-17 20:28         ` Steven Rostedt
  2012-10-18  2:18           ` Lai Jiangshan
  2012-10-18 16:05           ` Mikulas Patocka
  0 siblings, 2 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-10-17 20:28 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Lai Jiangshan, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet

On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> > 
> > Even the previous patch is applied, percpu_down_read() still
> > needs mb() to pair with it.
> 
> percpu_down_read uses rcu_read_lock which should guarantee that memory 
> accesses don't escape in front of a rcu-protected section.

You do realize that rcu_read_lock() does nothing more than a barrier(),
right?

Paul worked really hard to get rcu_read_locks() to not call HW barriers.

> 
> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
> memory accesses could be moved in front of rcu_read_unlock and reordered 
> with this_cpu_inc(*p->counters), but it doesn't matter because 
> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
> halfway through.

Looking at the patch, you are correct. The read side doesn't need the
memory barrier, as the worst thing that will happen is that it sees
locked == false and will just grab the mutex unnecessarily.

> > 
> > I suggest any new synchronization should stay in -tip for 2 or more cycles
> > before merged to mainline.
> 
> But the bug that this synchronization is fixing is quite serious (it 
> causes random crashes when block size is being changed, the crash happens 
> regularly at multiple important business sites) so it must be fixed soon 
> and not wait half a year.

I don't think Lai was suggesting we wait on this fix, but rather that we
totally rip out the percpu_rwsems, work on them some more, and then
re-introduce them in half a year.

-- Steve



* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17 20:28         ` Steven Rostedt
@ 2012-10-18  2:18           ` Lai Jiangshan
  2012-10-18  4:13             ` Steven Rostedt
                               ` (2 more replies)
  2012-10-18 16:05           ` Mikulas Patocka
  1 sibling, 3 replies; 14+ messages in thread
From: Lai Jiangshan @ 2012-10-18  2:18 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Steven Rostedt, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet

On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
>>>
>>> Even the previous patch is applied, percpu_down_read() still
>>> needs mb() to pair with it.
>>
>> percpu_down_read uses rcu_read_lock which should guarantee that memory 
>> accesses don't escape in front of a rcu-protected section.
> 
> You do realize that rcu_read_lock() does nothing more that a barrier(),
> right?
> 
> Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> 
>>
>> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
>> memory accesses could be moved in front of rcu_read_unlock and reordered 
>> with this_cpu_inc(*p->counters), but it doesn't matter because 
>> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
>> halfway through.
> 
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier as the worse thing that will happen is that it sees the
> locked = false, and will just grab the mutex unnecessarily.

---------------------
A memory barrier can be added iff these two things are known:
	1) which reordering it prevents (between what and what).
	2) which corresponding mb() it pairs with.

You tried to add an mb() in percpu_up_write(). OK, I know it prevents reordering
between the writes to the protected data and the statement "p->locked = false",
but I can't find the corresponding mb() that it pairs with.

percpu_down_read()					writes to the data
	The cpu cache/prefetch the data			writes to the data
	which is chaos					writes to the data
							percpu_up_write()
								mb()
								p->locked = false;
	unlikely(p->locked)
		the cpu sees p->locked == false,
		doesn't discard the cached/prefetched data
	this_cpu_inc(*p->counters);
	the code of read-access to the data
	****and we use the chaos data*****

So you need to add an mb() after "unlikely(p->locked)".

-------------------------

The RCU you use doesn't protect any data. It protects the code of the fast path:
	unlikely(p->locked);
	this_cpu_inc(*p->counters);

and synchronize_rcu() ensures all previous fast paths have fully finished
"this_cpu_inc(*p->counters);".

It doesn't protect other code/data; if you want to protect other code or other
data, please add more synchronization or mb()s.

---------------

I really hate synchronization that protects code instead of data,
but sometimes I also have to do it.

---------------

A very rough draft example of paired mb()s is here:


diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index cf80f7e..84a93c0 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
 	struct mutex mtx;
 };
 
+#if 1
+#define light_mb() barrier()
+#define heavy_mb() synchronize_sched()
+#else
+#define light_mb() smp_mb()
+#define heavy_mb() smp_mb();
+#endif
+
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
 	rcu_read_lock();
@@ -24,22 +32,12 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 	}
 	this_cpu_inc(*p->counters);
 	rcu_read_unlock();
+	light_mb(); /* A, between read of p->locked and read of data, paired with D */
 }
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	/*
-	 * On X86, write operation in this_cpu_dec serves as a memory unlock
-	 * barrier (i.e. memory accesses may be moved before the write, but
-	 * no memory accesses are moved past the write).
-	 * On other architectures this may not be the case, so we need smp_mb()
-	 * there.
-	 */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-	barrier();
-#else
-	smp_mb();
-#endif
+	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
 	this_cpu_dec(*p->counters);
 }
 
@@ -61,11 +59,12 @@ static inline void percpu_down_write(struct percpu_rw_semaphore *p)
 	synchronize_rcu();
 	while (__percpu_count(p->counters))
 		msleep(1);
-	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
+	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
 }
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-18  2:18           ` Lai Jiangshan
@ 2012-10-18  4:13             ` Steven Rostedt
  2012-10-18 16:17               ` Mikulas Patocka
  2012-10-18 15:32             ` Mikulas Patocka
  2012-10-18 19:56             ` Mikulas Patocka
  2 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2012-10-18  4:13 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Mikulas Patocka, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet

On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote:
> > 
> > Looking at the patch, you are correct. The read side doesn't need the
> > memory barrier as the worse thing that will happen is that it sees the
> > locked = false, and will just grab the mutex unnecessarily.
> 
> ---------------------
> A memory barrier can be added iff these two things are known:
> 	1) it disables the disordering between what and what.
> 	2) what is the corresponding mb() that it pairs with.
> 

OK, I was just looking at the protection and actions of the locked flag,
but I see what you are saying with the data itself.

> You tried to add a mb() in percpu_up_write(), OK, I know it disables the disordering
> between the writes to the protected data and the statement "p->locked = false",
> But I can't find out the corresponding mb() that it pairs with.
> 
> percpu_down_read()					writes to the data
> 	The cpu cache/prefetch the data			writes to the data
> 	which is chaos					writes to the data
> 							percpu_up_write()
> 								mb()
> 								p->locked = false;
> 	unlikely(p->locked)
> 		the cpu see p->lock = false,
> 		don't discard the cached/prefetch data
> 	this_cpu_inc(*p->counters);
> 	the code of read-access to the data
> 	****and we use the chaos data*****
> 
> So you need to add a mb() after "unlikely(p->locked)".

Does it need a full mb() or could it be just an rmb()? I wouldn't think
the down_read would need to protect against stores, would it? The
memory barrier should probably go in front of the unlikely() too. The
write to p->counters is handled by the synchronized sched, and adding an
rmb() in front of the unlikely check would keep prefetched data from
passing this barrier.

This is a perfect example of why this primitive should be vetted outside of
mainline before it gets merged.

-- Steve




* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-18  2:18           ` Lai Jiangshan
  2012-10-18  4:13             ` Steven Rostedt
@ 2012-10-18 15:32             ` Mikulas Patocka
  2012-10-18 19:56             ` Mikulas Patocka
  2 siblings, 0 replies; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-18 15:32 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Steven Rostedt, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet



On Thu, 18 Oct 2012, Lai Jiangshan wrote:

> On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> > On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> >>>
> >>> Even the previous patch is applied, percpu_down_read() still
> >>> needs mb() to pair with it.
> >>
> >> percpu_down_read uses rcu_read_lock which should guarantee that memory 
> >> accesses don't escape in front of a rcu-protected section.
> > 
> > You do realize that rcu_read_lock() does nothing more that a barrier(),
> > right?
> > 
> > Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> > 
> >>
> >> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
> >> memory accesses could be moved in front of rcu_read_unlock and reordered 
> >> with this_cpu_inc(*p->counters), but it doesn't matter because 
> >> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
> >> halfway through.
> > 
> > Looking at the patch, you are correct. The read side doesn't need the
> > memory barrier as the worse thing that will happen is that it sees the
> > locked = false, and will just grab the mutex unnecessarily.
> 
> ---------------------
> A memory barrier can be added iff these two things are known:
> 	1) it disables the disordering between what and what.
> 	2) what is the corresponding mb() that it pairs with.
> 
> You tried to add a mb() in percpu_up_write(), OK, I know it disables the disordering
> between the writes to the protected data and the statement "p->locked = false",
> But I can't find out the corresponding mb() that it pairs with.

Or alternatively, instead of a barrier, we can do this.

Mikulas

---

percpu-rwsem: use barrier in unlock path

The lock is considered unlocked when p->locked is set to false.
Use a barrier to prevent reordering of operations around p->locked.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-17 20:48:40.000000000 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-18 17:19:24.000000000 +0200
@@ -66,6 +66,12 @@ static inline void percpu_down_write(str
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	/*
+	 * Make sure that other processes that are in an rcu section and that
+	 * may have seen partially modified state exit the rcu section and
+	 * try to grab the mutex.
+	 */
+	synchronize_rcu();
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17  2:23   ` Linus Torvalds
  2012-10-17  5:58     ` Lai Jiangshan
  2012-10-17  9:56     ` Alan Cox
@ 2012-10-18 16:00     ` Mikulas Patocka
  2012-10-19 18:48       ` Linus Torvalds
  2 siblings, 1 reply; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-18 16:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, linux-kernel, linux-arch, Lai Jiangshan,
	Steven Rostedt, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Eric Dumazet



On Tue, 16 Oct 2012, Linus Torvalds wrote:

> [ Architecture people, note the potential new SMP barrier! ]
> 
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > +       /*
> > +        * The lock is considered unlocked when p->locked is set to false.
> > +        * Use barrier prevent reordering of operations around p->locked.
> > +        */
> > +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> > +       barrier();
> > +#else
> > +       smp_mb();
> > +#endif
> >         p->locked = false;
> 
> Ugh. The #if is too ugly to live.

One instance of this is already present in the code.

I suggest that you drop this patch and use the synchronize_rcu() patch that I
have just posted.

But another instance of this "#if defined(CONFIG_X86) && ..." is already 
there in percpu_up_read.


What is the procedure for making changes that require support from all
architectures? It is trivial to make a patch that moves this into
arch-specific includes; the problem is that the patch breaks all the other
architectures - I wrote support for x86, sparc, parisc and alpha (I can test
those) but not the others.

Are you going to apply this patch, break the majority of architectures and
wait until the architecture maintainers fix it? Or is there some arch-specific
git tree where the patch should be applied and where the maintainers fix
things?

> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.
> 
> I suspect we could make the above x86-specific optimization be valid
> by introducing a new barrier, called "smp_release_before_store()" or
> something like that, which on x86 just happens to be a no-op (but
> still a compiler barrier). That's because earlier reads will not pass
> a later stores, and stores are viewed in program order, so all x86
> stores have "release consistency" modulo the theoretical PPro bugs
> (that I don't think we actually ever saw in practice).
> 
> But it is possible that that is true on other architectures too, so
> your hand-crafted x86-specific #if is not just ugly, it's liable to
> get worse.
> 
> The alternative is to just say that we should use "smp_mb()"
> unconditionally, depending on how critical this code actually ends up
> being.
> 
> Added linux-arch in case architecture-maintainers have comments on
> "smp_release_before_store()" thing. It would be kind of similar to the
> odd "smp_read_barrier_depends()", except it would normally be a full
> memory barrier, except on architectures where a weaker barrier might
> be sufficient.
> 
> I suspect there may be many architectures where a "smp_wmb()" is
> sufficient for this case, for the simple reason that no sane
> microarchitecture would *ever* move earlier reads down past a later
> write,

Alpha can move reads down past writes (if the read is not in cache and the 
write hits the cache, it may make sense to do this reordering).

> so release consistency really only needs the local writes to be
> ordered, not the full memory ordering.
> 
> Arch people?
> 
> The more optimal solution may be to mark the store *itself* to be
> "store with release consistency", which on x86 would be a regular
> store (with the compiler barrier), but on other architectures may be a
> special memory operation. On architectures with
> release/aqcuire-consistency, there's not a separate barrier before the
> store, the store instruction itself is done with special semantics. So
> maybe the right thing to do is
> 
>    #define smp_release_consistency_store(val, ptr) ...
> 
> where on x86, the implementation would be a simple
> 
>    do { barrier(); *(ptr)=(val); } while (0)

smp_release_consistency_store doesn't work. In
include/linux/percpu-rwsem.h it is required that
"this_cpu_dec(*p->counters);" works as an unlock barrier. So we could do
one of the following:

"smp_mb(); this_cpu_dec(*p->counters);" - generates pointless barrier on x86

"#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && 
!defined(CONFIG_X86_OOSTORE))
        barrier();
#else
        smp_mb();
#endif
        this_cpu_dec(*p->counters);
" - removes the barrier on the most common architecture (X86), but the 
code looks dirty

"smp_release_before_store();this_cpu_dec(*p->counters);" - the code is 
clean, but the downside is that it breaks architectures that don't have 
smp_release_before_store().

> but on other architectures it might be a inline asm with the required
> magic store-with-release instruction.
> 
> How important is this code sequence? Is the "percpu_up_write()"
> function really so critical that we can't have an extra memory
> barrier?

percpu_up_write() is not critical. But percpu_up_read() is critical and it 
should be as fast as possible.

Mikulas

> Or do people perhaps see *other* places where
> release-consistency-stores might be worth doing?
> 
> But in no event do I want to see that butt-ugly #if statement.
> 
>                 Linus
> 

---
Introduce smp_release_before_store()

smp_release_before_store() with the following store operation works as
an unlock barrier - that is, memory accesses may be moved before the
unlock barrier, but no memory accesses are moved past the unlock
barrier.

On x86 writes are strongly ordered (unless CONFIG_X86_PPRO_FENCE or
CONFIG_X86_OOSTORE is selected), so smp_release_before_store() defaults
to a compiler barrier (and no barrier instruction). On other
architectures it may need more heavyweight barriers.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/memory-barriers.txt   |    5 +++++
 arch/alpha/include/asm/barrier.h    |    2 ++
 arch/parisc/include/asm/barrier.h   |    1 +
 arch/sparc/include/asm/barrier_32.h |    1 +
 arch/sparc/include/asm/barrier_64.h |    2 ++
 arch/x86/include/asm/barrier.h      |    6 ++++++
 include/linux/percpu-rwsem.h        |   13 +------------
 7 files changed, 18 insertions(+), 12 deletions(-)

Index: linux-3.6.2-fast/arch/alpha/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/alpha/include/asm/barrier.h	2012-10-18 17:45:12.000000000 +0200
+++ linux-3.6.2-fast/arch/alpha/include/asm/barrier.h	2012-10-18 17:46:24.000000000 +0200
@@ -29,6 +29,8 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends()	do { } while (0)
 #endif
 
+#define smp_release_before_store()	smp_mb()
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
Index: linux-3.6.2-fast/arch/parisc/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/parisc/include/asm/barrier.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/parisc/include/asm/barrier.h	2012-10-18 17:46:24.000000000 +0200
@@ -29,6 +29,7 @@
 #define smp_wmb()	mb()
 #define smp_read_barrier_depends()	do { } while(0)
 #define read_barrier_depends()		do { } while(0)
+#define smp_release_before_store()	mb ()
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
Index: linux-3.6.2-fast/arch/sparc/include/asm/barrier_32.h
===================================================================
--- linux-3.6.2-fast.orig/arch/sparc/include/asm/barrier_32.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/sparc/include/asm/barrier_32.h	2012-10-18 17:56:48.000000000 +0200
@@ -11,5 +11,6 @@
 #define smp_rmb()	__asm__ __volatile__("":::"memory")
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #define smp_read_barrier_depends()	do { } while(0)
+#define smp_release_before_store()	smp_mb()
 
 #endif /* !(__SPARC_BARRIER_H) */
Index: linux-3.6.2-fast/arch/sparc/include/asm/barrier_64.h
===================================================================
--- linux-3.6.2-fast.orig/arch/sparc/include/asm/barrier_64.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/sparc/include/asm/barrier_64.h	2012-10-18 17:46:24.000000000 +0200
@@ -53,4 +53,6 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_release_before_store()	__asm__ __volatile__("":::"memory")
+
 #endif /* !(__SPARC64_BARRIER_H) */
Index: linux-3.6.2-fast/arch/x86/include/asm/barrier.h
===================================================================
--- linux-3.6.2-fast.orig/arch/x86/include/asm/barrier.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/arch/x86/include/asm/barrier.h	2012-10-18 17:46:24.000000000 +0200
@@ -100,6 +100,12 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#ifdef CONFIG_X86_PPRO_FENCE
+#define smp_release_before_store()	smp_mb()
+#else
+#define smp_release_before_store()	smp_wmb()
+#endif
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-18 17:46:24.000000000 +0200
@@ -28,18 +28,7 @@ static inline void percpu_down_read(stru
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	/*
-	 * On X86, write operation in this_cpu_dec serves as a memory unlock
-	 * barrier (i.e. memory accesses may be moved before the write, but
-	 * no memory accesses are moved past the write).
-	 * On other architectures this may not be the case, so we need smp_mb()
-	 * there.
-	 */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-	barrier();
-#else
-	smp_mb();
-#endif
+	smp_release_before_store();
 	this_cpu_dec(*p->counters);
 }
 
Index: linux-3.6.2-fast/Documentation/memory-barriers.txt
===================================================================
--- linux-3.6.2-fast.orig/Documentation/memory-barriers.txt	2012-10-18 17:45:13.000000000 +0200
+++ linux-3.6.2-fast/Documentation/memory-barriers.txt	2012-10-18 17:46:24.000000000 +0200
@@ -1056,11 +1056,16 @@ The Linux kernel has eight basic CPU mem
 	WRITE		wmb()			smp_wmb()
 	READ		rmb()			smp_rmb()
 	DATA DEPENDENCY	read_barrier_depends()	smp_read_barrier_depends()
+	UNLOCK BARRIER	-			smp_release_before_store()
 
 
 All memory barriers except the data dependency barriers imply a compiler
 barrier. Data dependencies do not impose any additional compiler ordering.
 
+smp_release_before_store() together with the following store operation implies
+an unlock barrier. That is - memory accesses may be moved before the unlock
+barrier, but no memory accesses are moved past the unlock barrier.
+
 Aside: In the case of data dependencies, the compiler would be expected to
 issue the loads in the correct order (eg. `a[b]` would have to load the value
 of b before loading a[b]), however there is no guarantee in the C specification


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-17 20:28         ` Steven Rostedt
  2012-10-18  2:18           ` Lai Jiangshan
@ 2012-10-18 16:05           ` Mikulas Patocka
  1 sibling, 0 replies; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-18 16:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lai Jiangshan, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet



On Wed, 17 Oct 2012, Steven Rostedt wrote:

> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> > > 
> > > Even the previous patch is applied, percpu_down_read() still
> > > needs mb() to pair with it.
> > 
> > percpu_down_read uses rcu_read_lock which should guarantee that memory 
> > accesses don't escape in front of a rcu-protected section.
> 
> You do realize that rcu_read_lock() does nothing more that a barrier(),
> right?
> 
> Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> 
> > 
> > If rcu_read_unlock has only an unlock barrier and not a full barrier, 
> > memory accesses could be moved in front of rcu_read_unlock and reordered 
> > with this_cpu_inc(*p->counters), but it doesn't matter because 
> > percpu_down_write does synchronize_rcu(), so it never sees these accesses 
> > halfway through.
> 
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier as the worse thing that will happen is that it sees the
> locked = false, and will just grab the mutex unnecessarily.

It wasn't correct.

CPU 1 is holding the write lock.

CPU 2 could get to percpu_down_read past rcu_read_lock and prefetch some 
data that are accessed after percpu_down_read.

CPU 1 goes into percpu_up_write(), executes the barrier, does p->locked = false and
mutex_unlock(&p->mtx);

CPU 2 now sees p->locked == false, so it goes through percpu_down_read. It
exits percpu_down_read and uses the invalid prefetched data.

It could be fixed by using synchronize_rcu() in percpu_up_write; I sent a
patch for that.

Mikulas


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-18  4:13             ` Steven Rostedt
@ 2012-10-18 16:17               ` Mikulas Patocka
  0 siblings, 0 replies; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-18 16:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lai Jiangshan, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet



On Thu, 18 Oct 2012, Steven Rostedt wrote:

> On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote:
> > > 
> > > Looking at the patch, you are correct. The read side doesn't need the
> > > memory barrier as the worse thing that will happen is that it sees the
> > > locked = false, and will just grab the mutex unnecessarily.
> > 
> > ---------------------
> > A memory barrier can be added iff these two things are known:
> > 	1) it disables the disordering between what and what.
> > 	2) what is the corresponding mb() that it pairs with.
> > 
> 
> OK, I was just looking at the protection and actions of the locked flag,
> but I see what you are saying with the data itself.
> 
> > You tried to add a mb() in percpu_up_write(), OK, I know it disables the disordering
> > between the writes to the protected data and the statement "p->locked = false",
> > But I can't find out the corresponding mb() that it pairs with.
> > 
> > percpu_down_read()					writes to the data
> > 	The cpu cache/prefetch the data			writes to the data
> > 	which is chaos					writes to the data
> > 							percpu_up_write()
> > 								mb()
> > 								p->locked = false;
> > 	unlikely(p->locked)
> > 		the cpu see p->lock = false,
> > 		don't discard the cached/prefetch data
> > 	this_cpu_inc(*p->counters);
> > 	the code of read-access to the data
> > 	****and we use the chaos data*****
> > 
> > So you need to add a mb() after "unlikely(p->locked)".
> 
> Does it need a full mb() or could it be just a rmb()? The down_read I
> wouldn't think would need to protect against stores, would it? The
> memory barrier should probably go in front of the unlikely() too. The
> write to p->counters is handled by the synchronized sched, and adding a
> rmb() in front of the unlikely check would keep prefetched data from
> passing this barrier.
> 
> This is a perfect example why this primitive should be vetted outside of
> mainline before it gets merged.
> 
> -- Steve

If we do synchronize_rcu() in percpu_up_write, we don't need a barrier in 
percpu_down_read(). So I would do that.

Mikulas


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-18  2:18           ` Lai Jiangshan
  2012-10-18  4:13             ` Steven Rostedt
  2012-10-18 15:32             ` Mikulas Patocka
@ 2012-10-18 19:56             ` Mikulas Patocka
  2 siblings, 0 replies; 14+ messages in thread
From: Mikulas Patocka @ 2012-10-18 19:56 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Steven Rostedt, Linus Torvalds, Jens Axboe, linux-kernel,
	linux-arch, Paul E. McKenney, Peter Zijlstra, Thomas Gleixner,
	Eric Dumazet

This patch looks sensible.

I'd apply either this or my previous patch that adds synchronize_rcu() to 
percpu_up_write.

This patch avoids the memory barrier on non-x86 cpus in percpu_up_read, so 
it is faster than the previous approach.

Mikulas


On Thu, 18 Oct 2012, Lai Jiangshan wrote:

> ---------------
> 
> a very draft example of paired-mb()s is here:
> 
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index cf80f7e..84a93c0 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
>  	struct mutex mtx;
>  };
>  
> +#if 1
> +#define light_mb() barrier()
> +#define heavy_mb() synchronize_sched()
> +#else
> +#define light_mb() smp_mb()
> +#define heavy_mb() smp_mb();
> +#endif
> +
>  static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  {
>  	rcu_read_lock();
> @@ -24,22 +32,12 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  	}
>  	this_cpu_inc(*p->counters);
>  	rcu_read_unlock();
> +	light_mb(); /* A, between read of p->locked and read of data, paired with D */
>  }
>  
>  static inline void percpu_up_read(struct percpu_rw_semaphore *p)
>  {
> -	/*
> -	 * On X86, write operation in this_cpu_dec serves as a memory unlock
> -	 * barrier (i.e. memory accesses may be moved before the write, but
> -	 * no memory accesses are moved past the write).
> -	 * On other architectures this may not be the case, so we need smp_mb()
> -	 * there.
> -	 */
> -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> -	barrier();
> -#else
> -	smp_mb();
> -#endif
> +	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
>  	this_cpu_dec(*p->counters);
>  }
>  
> @@ -61,11 +59,12 @@ static inline void percpu_down_write(struct percpu_rw_semaphore *p)
>  	synchronize_rcu();
>  	while (__percpu_count(p->counters))
>  		msleep(1);
> -	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
> +	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
>  }
>  
>  static inline void percpu_up_write(struct percpu_rw_semaphore *p)
>  {
> +	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
>  	p->locked = false;
>  	mutex_unlock(&p->mtx);
>  }
> 


* Re: [PATCH] percpu-rwsem: use barrier in unlock path
  2012-10-18 16:00     ` Mikulas Patocka
@ 2012-10-19 18:48       ` Linus Torvalds
  0 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2012-10-19 18:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, linux-kernel, linux-arch, Lai Jiangshan,
	Steven Rostedt, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Eric Dumazet

On Thu, Oct 18, 2012 at 9:00 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> What is the procedure for making changes that require support of
> architectures? It is trivial to make a patch that moves this into
> arch-specific includes, the problem is that the patch break all the
> architectures - I wrote support for x86, sparc, parisc, alpha (I can test
> those) but not the others.

We'd need to add it to everybody.

It shouldn't need per-architecture testing - since "smp_mb()" is
always safe. So we could just make all architectures default to that,
and then for x86 (and potentially others that have cheaper models for
release-consistency) just do the optimized one.

We *could* also simply do something like

   #ifndef smp_release_before_store
     #define smp_release_before_store() smp_mb()
   #endif

and basically make the rule be that only architectures that have a
cheaper one need to define it at all. That may be the correct
short-term fix, since there initially would be only a single user.

                  Linus


end of thread

Thread overview: 14+ messages
     [not found] <Pine.LNX.4.64.1210151716310.10685@file.rdu.redhat.com>
2012-10-16 23:30 ` [PATCH] percpu-rwsem: use barrier in unlock path Mikulas Patocka
2012-10-17  2:23   ` Linus Torvalds
2012-10-17  5:58     ` Lai Jiangshan
2012-10-17 15:07       ` Mikulas Patocka
2012-10-17 20:28         ` Steven Rostedt
2012-10-18  2:18           ` Lai Jiangshan
2012-10-18  4:13             ` Steven Rostedt
2012-10-18 16:17               ` Mikulas Patocka
2012-10-18 15:32             ` Mikulas Patocka
2012-10-18 19:56             ` Mikulas Patocka
2012-10-18 16:05           ` Mikulas Patocka
2012-10-17  9:56     ` Alan Cox
2012-10-18 16:00     ` Mikulas Patocka
2012-10-19 18:48       ` Linus Torvalds
