* [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
@ 2011-12-20 22:18 Tejun Heo
  2011-12-21 14:55 ` Oleg Nesterov
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Tejun Heo @ 2011-12-20 22:18 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton, Linus Torvalds; +Cc: linux-kernel

mempool_alloc/free() use undocumented smp_mb()'s.  While the code is
not necessarily incorrect, the usage is not minimal and is misleading.

The lockless part is in mempool_free().  It wants to determine whether
the item being freed needs to be returned to the pool or backing
allocator without grabbing pool->lock.  Two things need to be
guaranteed for correct operation.

1. pool->curr_nr + #allocated should never dip below pool->min_nr.
2. Waiters shouldn't be left dangling.

For #1, the only necessary condition is that curr_nr visible at free
is from after the allocation of the element being freed (details in
the comment).  For most cases, this is true without any barrier but
there can be fringe cases where the allocated pointer is passed to the
freeing task without going through memory barriers.  To cover this
case, wmb is necessary before returning from allocation and rmb is
necessary before reading curr_nr.  IOW,

	ALLOCATING TASK			FREEING TASK

	update pool state after alloc;
	wmb();
	pass pointer to freeing task;
					read pointer;
					rmb();
					read pool state to free;

The wmb can be the implied one from unlocking pool->lock and smp_mb()
in mempool_free() can be replaced with smp_rmb().
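
Or, as a more concrete sketch of the same thing (shared_ptr below just
stands in for whatever channel passes the pointer; in the actual code
the barriers live inside mempool_alloc()/mempool_free() themselves):

	/* ALLOCATING TASK */
	element = remove_element(pool);		/* pool->curr_nr-- */
	smp_wmb();				/* or the implied one */
	shared_ptr = element;			/* publish to the freeing task */

	/* FREEING TASK */
	while (!(element = shared_ptr))		/* no lock, no barrier */
		cpu_relax();
	smp_rmb();
	if (pool->curr_nr < pool->min_nr)	/* sees a post-allocation value */
		/* return element to the pool */;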

For #2, the waiter needs to add itself to waitqueue and then check the
wait condition and the waker needs to update the wait condition and
then wake up.  Because waitqueue operations always go through full
spinlock synchronization, there is no need for extra memory barriers.
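
For reference, the waker side in mempool_free() already updates the
wait condition under pool->lock and wakes up afterwards; it looks
roughly like the following (unchanged by this patch).

	if (pool->curr_nr < pool->min_nr) {
		spin_lock_irqsave(&pool->lock, flags);
		if (pool->curr_nr < pool->min_nr) {
			add_element(pool, element);	/* the wait condition */
			spin_unlock_irqrestore(&pool->lock, flags);
			wake_up(&pool->wait);
			return;
		}
		spin_unlock_irqrestore(&pool->lock, flags);
	}
	pool->free(element, pool->pool_data);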

Furthermore, mempool_alloc() is already holding pool->lock when it
decides that it needs to wait.  There is no reason to do unlock - add
waitqueue - test condition again.  It can simply add itself to
waitqueue while holding pool->lock and then unlock and sleep.

This patch replaces smp_mb() in mempool_free() with smp_rmb(), drops
the superfluous smp_mb() from mempool_alloc() and extends pool->lock
over waitqueue addition.  More importantly, it explains what the
memory barriers do and how the lockless testing is correct.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
I was trying to make mempool usable for percpu and the undocumented
smp_mb()'s confused me.  I tried to chop them down to the minimum and
explain what's going on.  The previous code wasn't incorrect so this
isn't a fix per se, but it's always good to be explicit about barriers
and lockless tricks.

There probably is a better way to word why the lockless testing is
correct.  It's a lot clearer to me when drawn as a stepping graph on a
piece of paper.  If somebody can think of better wording, please don't
be shy.

Oleg, I *think* this is correct but it would be great if you can go
over it.

Thank you.

 mm/mempool.c |   60 ++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 13 deletions(-)

Index: work/mm/mempool.c
===================================================================
--- work.orig/mm/mempool.c
+++ work/mm/mempool.c
@@ -223,29 +223,31 @@ repeat_alloc:
 	spin_lock_irqsave(&pool->lock, flags);
 	if (likely(pool->curr_nr)) {
 		element = remove_element(pool);
+		/* implied wmb in unlock is paired with rmb in mempool_free() */
 		spin_unlock_irqrestore(&pool->lock, flags);
 		return element;
 	}
-	spin_unlock_irqrestore(&pool->lock, flags);
 
 	/* We must not sleep in the GFP_ATOMIC case */
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!(gfp_mask & __GFP_WAIT)) {
+		spin_unlock_irqrestore(&pool->lock, flags);
 		return NULL;
+	}
 
-	/* Now start performing page reclaim */
+	/* Let's wait for someone else to return an element to @pool */
 	gfp_temp = gfp_mask;
 	init_wait(&wait);
 	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
-	smp_mb();
-	if (!pool->curr_nr) {
-		/*
-		 * FIXME: this should be io_schedule().  The timeout is there
-		 * as a workaround for some DM problems in 2.6.18.
-		 */
-		io_schedule_timeout(5*HZ);
-	}
-	finish_wait(&pool->wait, &wait);
 
+	spin_unlock_irqrestore(&pool->lock, flags);
+
+	/*
+	 * FIXME: this should be io_schedule().  The timeout is there as a
+	 * workaround for some DM problems in 2.6.18.
+	 */
+	io_schedule_timeout(5*HZ);
+
+	finish_wait(&pool->wait, &wait);
 	goto repeat_alloc;
 }
 EXPORT_SYMBOL(mempool_alloc);
@@ -265,7 +267,39 @@ void mempool_free(void *element, mempool
 	if (unlikely(element == NULL))
 		return;
 
-	smp_mb();
+	/*
+	 * Paired with implied wmb of @pool->lock in mempool_alloc().  The
+	 * preceding read is for @element and the following @pool->curr_nr.
+	 * This ensures that the visible value of @pool->curr_nr is from
+	 * after the allocation of @element.  This is necessary for fringe
+	 * cases where @element was passed to this task without going
+	 * through barriers.
+	 *
+	 * For example, assume @p is %NULL at the beginning and one task
+	 * performs "p = mempool_alloc(...);" while another task is doing
+	 * "while (!p) cpu_relax(); mempool_free(p, ...);".  This function
+	 * may end up using curr_nr value which is from before allocation
+	 * of @p without the following rmb.
+	 */
+	smp_rmb();
+
+	/*
+	 * For correctness, we need a test which is guaranteed to trigger
+	 * if curr_nr + #allocated == min_nr.  Testing curr_nr < min_nr
+	 * without locking achieves that and refilling as soon as possible
+	 * is desirable.
+	 *
+	 * Because curr_nr visible here is always a value after the
+	 * allocation of @element, any task which decremented curr_nr below
+	 * min_nr is guaranteed to see curr_nr < min_nr unless curr_nr gets
+	 * incremented to min_nr afterwards.  If curr_nr gets incremented
+	 * to min_nr after the allocation of @element, the elements
+	 * allocated after that are subject to the same guarantee.
+	 *
+	 * Waiters happen iff curr_nr is 0 and the above guarantee also
+	 * ensures that there will be frees which return elements to the
+	 * pool waking up the waiters.
+	 */
 	if (pool->curr_nr < pool->min_nr) {
 		spin_lock_irqsave(&pool->lock, flags);
 		if (pool->curr_nr < pool->min_nr) {

* Re: [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
  2011-12-20 22:18 [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage Tejun Heo
@ 2011-12-21 14:55 ` Oleg Nesterov
  2011-12-21 14:57   ` Oleg Nesterov
  2011-12-21 16:37   ` Tejun Heo
  2011-12-21 15:12 ` mempool && io_schedule_timeout() Oleg Nesterov
  2011-12-21 19:08 ` [PATCH for-3.3 UPDATED] mempool: fix and document synchronization and memory barrier usage Tejun Heo
  2 siblings, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2011-12-21 14:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Andrew Morton, Linus Torvalds, linux-kernel

On 12/20, Tejun Heo wrote:
>
> For #1, The only necessary condition is that curr_nr visible at free
> is from after the allocation of the element being freed (details in
> the comment).  For most cases, this is true without any barrier but
> there can be fringe cases where the allocated pointer is passed to the
> freeing task without going through memory barriers.  To cover this
> case, wmb is necessary before returning from allocation and rmb is
> necessary before reading curr_nr.  IOW,
>
> 	ALLOCATING TASK			FREEING TASK
>
> 	update pool state after alloc;
> 	wmb();
> 	pass pointer to freeing task;
> 					read pointer;
> 					rmb();
> 					read pool state to free;
>
> The wmb can be the implied one from unlocking pool->lock and smp_mb()
> in mempool_free() can be replaced with smp_rmb().

OK... I do not know if this is the bug or not, but I agree this race
is possible.

> Furthermore, mempool_alloc() is already holding pool->lock when it
> decides that it needs to wait.  There is no reason to do unlock - add
> waitqueue - test condition again.  It can simply add itself to
> waitqueue while holding pool->lock and then unlock and sleep.

Confused. I agree, we can hold pool->lock until schedule(). But, at
the same time, why should we hold it?

Or I missed the reason why we must not unlock before prepare_to_wait?

> --- work.orig/mm/mempool.c
> +++ work/mm/mempool.c
> @@ -223,29 +223,31 @@ repeat_alloc:
>  	spin_lock_irqsave(&pool->lock, flags);
>  	if (likely(pool->curr_nr)) {
>  		element = remove_element(pool);
> +		/* implied wmb in unlock is paired with rmb in mempool_free() */
>  		spin_unlock_irqrestore(&pool->lock, flags);

Not really afaics. unlock() doesn't imply wmb().

So, if some thread does PTR = mempool_alloc(), the "final" store to PTR
can leak into the critical section, and it can be reordered inside this
section with --curr_nr.

See the fat comment about set_current_state() in prepare_to_wait().
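
IOW, nothing prevents the CPU from effectively doing (pseudo-code):

	spin_lock(&pool->lock);
	element = remove_element(pool);		// --curr_nr
	PTR = element;				// the caller's store, pulled up
						// into the critical section and
						// possibly visible to others
						// before the curr_nr update
	spin_unlock(&pool->lock);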

> @@ -265,7 +267,39 @@ void mempool_free(void *element, mempool
>  	if (unlikely(element == NULL))
>  		return;
>  
> -	smp_mb();
> +	/*
> +	 * Paired with implied wmb of @pool->lock in mempool_alloc().  The
> +	 * preceding read is for @element and the following @pool->curr_nr.
> +	 * This ensures that the visible value of @pool->curr_nr is from
> +	 * after the allocation of @element.  This is necessary for fringe
> +	 * cases where @element was passed to this task without going
> +	 * through barriers.
> +	 *
> +	 * For example, assume @p is %NULL at the beginning and one task
> +	 * performs "p = mempool_alloc(...);" while another task is doing
> +	 * "while (!p) cpu_relax(); mempool_free(p, ...);".  This function
> +	 * may end up using curr_nr value which is from before allocation
> +	 * of @p without the following rmb.
> +	 */
> +	smp_rmb();
> +
> +	/*
> +	 * For correctness, we need a test which is guaranteed to trigger
> +	 * if curr_nr + #allocated == min_nr.  Testing curr_nr < min_nr
> +	 * without locking achieves that and refilling as soon as possible
> +	 * is desirable.
> +	 *
> +	 * Because curr_nr visible here is always a value after the
> +	 * allocation of @element, any task which decremented curr_nr below
> +	 * min_nr is guaranteed to see curr_nr < min_nr unless curr_nr gets
> +	 * incremented to min_nr afterwards.  If curr_nr gets incremented
> +	 * to min_nr after the allocation of @element, the elements
> +	 * allocated after that are subject to the same guarantee.
> +	 *
> +	 * Waiters happen iff curr_nr is 0 and the above guarantee also
> +	 * ensures that there will be frees which return elements to the
> +	 * pool waking up the waiters.
> +	 */
>  	if (pool->curr_nr < pool->min_nr) {
>  		spin_lock_irqsave(&pool->lock, flags);
>  		if (pool->curr_nr < pool->min_nr) {

Just to check my understanding...

IOW. Whatever we do, we can race with other threads and miss the
"curr_nr < min_nr" condition. But this is fine, unless this element
(which we are going to free) was the reason for that condition.
Otherwise we can rely on another mempool_free(), so the waiters in
mempool_alloc() can never hang forever.

Yes, I think this is right.

Oleg.


* Re: [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
  2011-12-21 14:55 ` Oleg Nesterov
@ 2011-12-21 14:57   ` Oleg Nesterov
  2011-12-21 16:37   ` Tejun Heo
  1 sibling, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2011-12-21 14:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Andrew Morton, Linus Torvalds, linux-kernel

On 12/21, Oleg Nesterov wrote:
>
> On 12/20, Tejun Heo wrote:
> >
> > Furthermore, mempool_alloc() is already holding pool->lock when it
> > decides that it needs to wait.  There is no reason to do unlock - add
> > waitqueue - test condition again.  It can simply add itself to
> > waitqueue while holding pool->lock and then unlock and sleep.
>
> Confused. I agree, we can hold pool->lock until schedule(). But, at
> the same time, why should we hold it?

Ah, I see.

> Or I missed the reason why we must not unlock before prepare_to_wait?

I didn't notice that this removes another "if (!pool->curr_nr)" check.

Oleg.


* mempool && io_schedule_timeout()
  2011-12-20 22:18 [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage Tejun Heo
  2011-12-21 14:55 ` Oleg Nesterov
@ 2011-12-21 15:12 ` Oleg Nesterov
  2011-12-21 15:34   ` Tejun Heo
  2011-12-21 19:08 ` [PATCH for-3.3 UPDATED] mempool: fix and document synchronization and memory barrier usage Tejun Heo
  2 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2011-12-21 15:12 UTC (permalink / raw)
  To: Tejun Heo, Pavel Mironchik, Alasdair G Kergon
  Cc: Andrew Morton, Linus Torvalds, linux-kernel

On 12/20, Tejun Heo wrote:
>
> -	if (!pool->curr_nr) {
> -		/*
> -		 * FIXME: this should be io_schedule().  The timeout is there
> -		 * as a workaround for some DM problems in 2.6.18.
> -		 */
> -		io_schedule_timeout(5*HZ);
> -	}
> -	finish_wait(&pool->wait, &wait);
>
> +	spin_unlock_irqrestore(&pool->lock, flags);
> +
> +	/*
> +	 * FIXME: this should be io_schedule().  The timeout is there as a
> +	 * workaround for some DM problems in 2.6.18.
> +	 */
> +	io_schedule_timeout(5*HZ);

Just curious... This was added by 0b1d647a in 2006.

Perhaps the problem was already fixed?

Oleg.


* Re: mempool && io_schedule_timeout()
  2011-12-21 15:12 ` mempool && io_schedule_timeout() Oleg Nesterov
@ 2011-12-21 15:34   ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2011-12-21 15:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Mironchik, Alasdair G Kergon, Andrew Morton,
	Linus Torvalds, linux-kernel, dm-devel

Hello,

On Wed, Dec 21, 2011 at 7:12 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>> +     /*
>> +      * FIXME: this should be io_schedule().  The timeout is there as a
>> +      * workaround for some DM problems in 2.6.18.
>> +      */
>> +     io_schedule_timeout(5*HZ);
>
> Just curious... This was added by 0b1d647a in 2006.
>
> Perhaps the problem was already fixed?

Yeah, I was curious about that too.  It seemed the problem was dm
sharing the same pool from multiple devices (which defeats the purpose
of using mempool BTW).  cc'ing dm devel.  Alasdair, is the problem
fixed now?

Thanks.

-- 
tejun

* Re: [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
  2011-12-21 14:55 ` Oleg Nesterov
  2011-12-21 14:57   ` Oleg Nesterov
@ 2011-12-21 16:37   ` Tejun Heo
  2011-12-21 16:44     ` Tejun Heo
  2011-12-21 17:40     ` Oleg Nesterov
  1 sibling, 2 replies; 10+ messages in thread
From: Tejun Heo @ 2011-12-21 16:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, Paul E. McKenney,
	David Howells

Hello, Oleg.

On Wed, Dec 21, 2011 at 03:55:56PM +0100, Oleg Nesterov wrote:
> > The wmb can be the implied one from unlocking pool->lock and smp_mb()
> > in mempool_free() can be replaced with smp_rmb().
> 
> OK... I do not know if this is the bug or not, but I agree this race
> is possible.

I'm not entirely sure whether this is a condition which we need to
guarantee but, if a scenario like the above ever happens, it's gonna
be a hellish one to track down.  It is *highly* unlikely to ever cause
any real problem; it requires a combination of unlikely situations to
lead to an actual breakage.  But, if that ever happens, it's gonna be
hellish to track down, to the level that we'll have to put it aside as
a freak occurrence and blame solar activity or the alignment of stars.
As nothing really prevents the described pattern, I think it's better
to be safe.

IMHO fringe corner cases like this are the biggest reason we need to
be extra reserved when playing with lockless tricks and fully document
when we end up using one.

> > Furthermore, mempool_alloc() is already holding pool->lock when it
> > decides that it needs to wait.  There is no reason to do unlock - add
> > waitqueue - test condition again.  It can simply add itself to
> > waitqueue while holding pool->lock and then unlock and sleep.
> 
> Confused. I agree, we can hold pool->lock until schedule(). But, at
> the same time, why should we hold it?
> 
> Or I missed the reason why we must not unlock before prepare_to_wait?

Yeah, as you replied in another message, this is to avoid an
ordering-dependent curr_nr test.  Extending lock coverage here doesn't
cost anything and it's always better to be straightforward.

> > --- work.orig/mm/mempool.c
> > +++ work/mm/mempool.c
> > @@ -223,29 +223,31 @@ repeat_alloc:
> >  	spin_lock_irqsave(&pool->lock, flags);
> >  	if (likely(pool->curr_nr)) {
> >  		element = remove_element(pool);
> > +		/* implied wmb in unlock is paired with rmb in mempool_free() */
> >  		spin_unlock_irqrestore(&pool->lock, flags);
> 
> Not really afaics. unlock() doesn't imply wmb().
> 
> So, if some thread does PTR = mempool_alloc(), the "final" store to PTR
> can leak into the critical section, and it can be reordered inside this
> section with --curr_nr.
> 
> See the fat comment about set_current_state() in prepare_to_wait().

Hmmm.... I'm quoting part of it here.  (cc'ing David and Paul, hi guys!)

/*
 * Note: we use "set_current_state()" _after_ the wait-queue add,
 * because we need a memory barrier there on SMP, so that any
 * wake-function that tests for the wait-queue being active
 * will be guaranteed to see waitqueue addition _or_ subsequent
 * tests in this thread will see the wakeup having taken place.
 *
 * The spin_unlock() itself is semi-permeable and only protects
 * one way (it only protects stuff inside the critical region and
 * stops them from bleeding out - it would still allow subsequent
 * loads to move into the critical region).
 */
	   
The first paragraph is saying that a full barrier (for both stores
and loads) is necessary at that point, and the second paragraph is a
bit confusing but the last sentence seems to say that only loads after
the unlock can creep above the unlock, which essentially means a write
barrier.

I think clearer explanation is in memory-barrier.txt.

 (6) UNLOCK operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the UNLOCK operation will appear to happen before
     the UNLOCK operation with respect to the other components of the system.

     Memory operations that occur after an UNLOCK operation may appear to
     happen before it completes.

     LOCK and UNLOCK operations are guaranteed to appear with respect to each
     other strictly in the order specified.

     The use of LOCK and UNLOCK operations generally precludes the need for
     other sorts of memory barrier (but note the exceptions mentioned in the
     subsection "MMIO write barrier").
						   
So, in a nutshell, what I think it's saying w.r.t. stores is that
stores before the unlock may not be deferred across the unlock but
stores after the unlock may creep upwards past it, so it acts as a
one-way write barrier.  And, yeah, that would be the minimal
requirement for proper lock operation.  Curious whether any processor
implements such a fine-grained barrier tho, maybe alpha does that too?
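
IOW, something like

	spin_lock(&lock);
	A = 1;
	spin_unlock(&lock);
	B = 2;

can be seen by another CPU with the store to B ordered before the
unlock (and thus unordered against the store to A), while the store
to A can never appear after the unlock.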

Paul, David, am I understanding it correctly?

Anyways, yeah, you're right.  We need a smp_wmb() before returning but
I think the comment on top of prepare_to_wait() is misleading.  Care
to update it?

> IOW. Whatever we do, we can race with other threads and miss "curr_nr < min_nr"
> condition. But this is fine, unless this element (which we are going to free)
> was the reason for this condition. Otherwise we can rely on another mempool_free(),
> the waiters in mempool_alloc() can never hang forever.

Yes, that's another way to put it.  It's much easier to understand if
you draw a graph with time on x axis and curr_nr on y and mark which
curr_nr values are guaranteed to be visible to whom.  If at any point
of time, a freeing task sees curr_nr == min_nr, it's guaranteed that
either it's staying that way or, if not, someone else will see the
newly decremented value.
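
For a concrete instance, say min_nr == 2 and a freeing task sees
curr_nr == 2, so it skips the pool.  That's still safe: if some other
allocation drops curr_nr to 1, the free of *that* element is
guaranteed to see a value from after its own allocation, i.e. 1 or a
later value (and a later value means the pool got refilled anyway), so
the curr_nr < min_nr test triggers and the pool gets an element back.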

> Yes, I think this is right.

Great, thanks.  I'll wait a bit for further comments and repost w/
smp_wmb() added.

Thanks.

-- 
tejun

* Re: [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
  2011-12-21 16:37   ` Tejun Heo
@ 2011-12-21 16:44     ` Tejun Heo
  2011-12-21 17:40     ` Oleg Nesterov
  1 sibling, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2011-12-21 16:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, Paul E. McKenney,
	David Howells

Ooh, missed something.

On Wed, Dec 21, 2011 at 08:37:15AM -0800, Tejun Heo wrote:
> Yes, that's another way to put it.  It's much easier to understand if
> you draw a graph with time on x axis and curr_nr on y and mark which
> curr_nr values are guaranteed to be visible to whom.  If at any point
> of time, a freeing task sees curr_nr == min_nr, it's guaranteed that
> either it's staying that way or, if not, someone else will see the
> newly decremented value.

And there will be enough "someone elses" to restore curr_nr == min_nr.

Thanks.

-- 
tejun

* Re: [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
  2011-12-21 16:37   ` Tejun Heo
  2011-12-21 16:44     ` Tejun Heo
@ 2011-12-21 17:40     ` Oleg Nesterov
  2011-12-21 18:52       ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2011-12-21 17:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, Paul E. McKenney,
	David Howells

Hi,

On 12/21, Tejun Heo wrote:
>
> Hello, Oleg.
>
> On Wed, Dec 21, 2011 at 03:55:56PM +0100, Oleg Nesterov wrote:
> > > The wmb can be the implied one from unlocking pool->lock and smp_mb()
> > > in mempool_free() can be replaced with smp_rmb().
> >
> > OK... I do not know if this is the bug or not, but I agree this race
> > is possible.
>
> I'm not entirely sure whether this is a condition which we need to
> guarantee but, if the scenario like the above ever happens, it's gonna
> be a hellish one to track down.

Yes, yes, I agree. In fact I was going to remove this sentence after
I actually read the patch, but forgot.

And in any case, it is very nice to document the barriers.

> > Not really afaics. unlock() doesn't imply wmb().
> >
> > So, if some thread does PTR = mempool_alloc(), the "final" store to PTR
> > can leak into the critical section, and it can be reordered inside this
> > section with --curr_nr.
> >
> > See the fat comment about set_current_state() in prepare_to_wait().
>
> Hmmm.... I'm quoting it part here.  (cc'ing David and Paul, hi guys!)
>
> /*
>  * Note: we use "set_current_state()" _after_ the wait-queue add,
>  * because we need a memory barrier there on SMP, so that any
>  * wake-function that tests for the wait-queue being active
>  * will be guaranteed to see waitqueue addition _or_ subsequent
>  * tests in this thread will see the wakeup having taken place.
>  *
>  * The spin_unlock() itself is semi-permeable and only protects
>  * one way (it only protects stuff inside the critical region and
>  * stops them from bleeding out - it would still allow subsequent
>  * loads to move into the critical region).
>  */
>
> The first paragraph is saying that at that point full barrier (for
> both stores and loads) is necessary at that point and the second
> paragraph is a bit confusing but the last sentence seems to say that
> only loads after the unlock can creep above unlock,

Probably, this is because the comment tries to explain the possible
reordering with the subsequent "if (condition)" check, so it only
mentions loads.

> I think clearer explanation is in memory-barrier.txt.
>
>  (6) UNLOCK operations.
>
>      This also acts as a one-way permeable barrier.  It guarantees that all
>      memory operations before the UNLOCK operation will appear to happen before
>      the UNLOCK operation with respect to the other components of the system.
>
>      Memory operations that occur after an UNLOCK operation may appear to
>      happen before it completes.
>
>      LOCK and UNLOCK operations are guaranteed to appear with respect to each
>      other strictly in the order specified.
>
>      The use of LOCK and UNLOCK operations generally precludes the need for
>      other sorts of memory barrier (but note the exceptions mentioned in the
>      subsection "MMIO write barrier").
>
> So, in a nutshell, I think it's saying w.r.t. stores is that stores
> before unlock may not be deferred across the unlock but stores after
> unlock may creep upwards past unlock, so it acts as one-way write
> barrier.  And, yeah, that would be the minimal requirement for proper
> lock operation.  Curious whether any processor implements such fine
> grained barrier tho, maybe alpha does that too?
>
> Paul, David, am I understanding it correctly?
>
> Anyways, yeah, you're right.  We need a smp_wmb() before returning but
> I think the comment on top of prepare_to_wait() is misleading.

Hmm. I am not sure I understand... Although almost everything written
in English looks misleading to me ;)

I think that "Memory operations that occur after" above means both
loads and stores, so the comment is probably fine. set_current_state()
inserts the barrier between store and load:


	__add_wait_queue();

	mb();			// set_current_state()

	unlock();		// can't serialize with LOAD(condition) below

	if (!condition)
		schedule();

(let's forget about task->state to simplify)

> Great, thanks.  I'll wait a bit for futher comments and repost w/
> smp_wmb() added.

Well. This is almost off-topic, but perhaps we can add
smp_mb__after_unlock() ? We already have smp_mb__after_lock.
Afaics prepare_to_wait() could use it.

I am not talking about performance issues, I just think the code
will be more understandable.

Oleg.


* Re: [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage
  2011-12-21 17:40     ` Oleg Nesterov
@ 2011-12-21 18:52       ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2011-12-21 18:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Linus Torvalds, linux-kernel, Paul E. McKenney,
	David Howells

Hello,

On Wed, Dec 21, 2011 at 06:40:58PM +0100, Oleg Nesterov wrote:
> > The first paragraph is saying that at that point full barrier (for
> > both stores and loads) is necessary at that point and the second
> > paragraph is a bit confusing but the last sentence seems to say that
> > only loads after the unlock can creep above unlock,
> 
> Probably, this is because the comment tries to explain the possible
> reordering with the subsequent "if (condition)" check, so it only
> mentions loads.

Ah, I see.

> > Anyways, yeah, you're right.  We need a smp_wmb() before returning but
> > I think the comment on top of prepare_to_wait() is misleading.
> 
> Hmm. I am not sure I understand... Although almost everything written
> in English looks misleading to me ;)

Amen. :) I missed the context there, so please forget about it.

> > Great, thanks.  I'll wait a bit for further comments and repost w/
> > smp_wmb() added.
> 
> Well. This is almost off-topic, but perhaps we can add
> smp_mb__after_unlock() ? We already have smp_mb__after_lock.
> Afaics prepare_to_wait() could use it.
> 
> I am not talking about performance issues, I just think the code
> will be more understandable.

Hmmm... maybe.  I really don't know.

Thanks.

-- 
tejun

* [PATCH for-3.3 UPDATED] mempool: fix and document synchronization and memory barrier usage
  2011-12-20 22:18 [PATCH for-3.3] mempool: clean up and document synchronization and memory barrier usage Tejun Heo
  2011-12-21 14:55 ` Oleg Nesterov
  2011-12-21 15:12 ` mempool && io_schedule_timeout() Oleg Nesterov
@ 2011-12-21 19:08 ` Tejun Heo
  2 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2011-12-21 19:08 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton, Linus Torvalds; +Cc: linux-kernel

mempool_alloc/free() use undocumented smp_mb()'s.  The code is
slightly broken and misleading.

The lockless part is in mempool_free().  It wants to determine whether
the item being freed needs to be returned to the pool or backing
allocator without grabbing pool->lock.  Two things need to be
guaranteed for correct operation.

1. pool->curr_nr + #allocated should never dip below pool->min_nr.
2. Waiters shouldn't be left dangling.

For #1, the only necessary condition is that curr_nr visible at free
is from after the allocation of the element being freed (details in
the comment).  For most cases, this is true without any barrier but
there can be fringe cases where the allocated pointer is passed to the
freeing task without going through memory barriers.  To cover this
case, wmb is necessary before returning from allocation and rmb is
necessary before reading curr_nr.  IOW,

	ALLOCATING TASK			FREEING TASK

	update pool state after alloc;
	wmb();
	pass pointer to freeing task;
					read pointer;
					rmb();
					read pool state to free;

The current code doesn't have wmb after pool update during allocation
and may theoretically, on machines where unlock doesn't behave as full
wmb, lead to pool depletion and deadlock.  smp_wmb() needs to be added
after successful allocation from reserved elements and smp_mb() in
mempool_free() can be replaced with smp_rmb().

For #2, the waiter needs to add itself to waitqueue and then check the
wait condition and the waker needs to update the wait condition and
then wake up.  Because waitqueue operations always go through full
spinlock synchronization, there is no need for extra memory barriers.

Furthermore, mempool_alloc() is already holding pool->lock when it
decides that it needs to wait.  There is no reason to do unlock - add
waitqueue - test condition again.  It can simply add itself to
waitqueue while holding pool->lock and then unlock and sleep.

This patch adds smp_wmb() after successful allocation from the
reserved pool, replaces smp_mb() in mempool_free() with smp_rmb() and
extends pool->lock over waitqueue addition.  More importantly, it
explains what the memory barriers do and how the lockless testing is
correct.

-v2: Oleg pointed out that unlock doesn't imply wmb.  Added explicit
     smp_wmb() after successful allocation from reserved pool and
     updated comments accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
---
Oleg, can you please ack?  With the addition of explicit smp_wmb(),
this patch becomes a fix, but I'm not sure about sending this to
-stable.  The chance of actual breakage is somewhere between nil and
extremely unlikely, while the chance of this change introducing a new
bug is much higher.
Maybe after a release cycle or two.

Andrew, if people agree with the patch, it would be great if you can
route this through -mm.

Thanks a lot.

 mm/mempool.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 48 insertions(+), 13 deletions(-)

Index: work/mm/mempool.c
===================================================================
--- work.orig/mm/mempool.c
+++ work/mm/mempool.c
@@ -224,28 +224,31 @@ repeat_alloc:
 	if (likely(pool->curr_nr)) {
 		element = remove_element(pool);
 		spin_unlock_irqrestore(&pool->lock, flags);
+		/* paired with rmb in mempool_free(), read comment there */
+		smp_wmb();
 		return element;
 	}
-	spin_unlock_irqrestore(&pool->lock, flags);
 
 	/* We must not sleep in the GFP_ATOMIC case */
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!(gfp_mask & __GFP_WAIT)) {
+		spin_unlock_irqrestore(&pool->lock, flags);
 		return NULL;
+	}
 
-	/* Now start performing page reclaim */
+	/* Let's wait for someone else to return an element to @pool */
 	gfp_temp = gfp_mask;
 	init_wait(&wait);
 	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
-	smp_mb();
-	if (!pool->curr_nr) {
-		/*
-		 * FIXME: this should be io_schedule().  The timeout is there
-		 * as a workaround for some DM problems in 2.6.18.
-		 */
-		io_schedule_timeout(5*HZ);
-	}
-	finish_wait(&pool->wait, &wait);
 
+	spin_unlock_irqrestore(&pool->lock, flags);
+
+	/*
+	 * FIXME: this should be io_schedule().  The timeout is there as a
+	 * workaround for some DM problems in 2.6.18.
+	 */
+	io_schedule_timeout(5*HZ);
+
+	finish_wait(&pool->wait, &wait);
 	goto repeat_alloc;
 }
 EXPORT_SYMBOL(mempool_alloc);
@@ -265,7 +268,39 @@ void mempool_free(void *element, mempool
 	if (unlikely(element == NULL))
 		return;
 
-	smp_mb();
+	/*
+	 * Paired with the wmb in mempool_alloc().  The preceding read is
+	 * for @element and the following @pool->curr_nr.  This ensures
+	 * that the visible value of @pool->curr_nr is from after the
+	 * allocation of @element.  This is necessary for fringe cases
+	 * where @element was passed to this task without going through
+	 * barriers.
+	 *
+	 * For example, assume @p is %NULL at the beginning and one task
+	 * performs "p = mempool_alloc(...);" while another task is doing
+	 * "while (!p) cpu_relax(); mempool_free(p, ...);".  This function
+	 * may end up using curr_nr value which is from before allocation
+	 * of @p without the following rmb.
+	 */
+	smp_rmb();
+
+	/*
+	 * For correctness, we need a test which is guaranteed to trigger
+	 * if curr_nr + #allocated == min_nr.  Testing curr_nr < min_nr
+	 * without locking achieves that and refilling as soon as possible
+	 * is desirable.
+	 *
+	 * Because curr_nr visible here is always a value after the
+	 * allocation of @element, any task which decremented curr_nr below
+	 * min_nr is guaranteed to see curr_nr < min_nr unless curr_nr gets
+	 * incremented to min_nr afterwards.  If curr_nr gets incremented
+	 * to min_nr after the allocation of @element, the elements
+	 * allocated after that are subject to the same guarantee.
+	 *
+	 * Waiters happen iff curr_nr is 0 and the above guarantee also
+	 * ensures that there will be frees which return elements to the
+	 * pool waking up the waiters.
+	 */
 	if (pool->curr_nr < pool->min_nr) {
 		spin_lock_irqsave(&pool->lock, flags);
 		if (pool->curr_nr < pool->min_nr) {
