All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
@ 2010-08-13  7:15 Salman Qazi
  2010-08-13  7:20 ` Salman Qazi
  0 siblings, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-08-13  7:15 UTC (permalink / raw)
  To: npiggin, paulmck, akpm, linux-kernel, a.p.zijlstra; +Cc: adurbin

This matters for the lockless page cache, in particular find_get_pages implementation.

In the following case, we get can get a deadlock:

    0.  The radix tree contains two items, one has the index 0.
    1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
    2.  The reader acquires slot(s) for item(s) including the index 0 item.
    3.  The non-zero index item is deleted, and as a consequence the other item
        is moved to the root of the tree.  The place where it used to be is
        queued for deletion after the readers finish.
    4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
    5.  The reader looks at it again, hoping that the item will either be freed
        or the ref count will increase.  This never happens, as the slot it
        is looking at will never be updated.  Also, this slot can never be reclaimed
        because the reader is holding rcu_read_lock and is in an infinite loop.

This can be reproduced with reliably by running dbench followed by compilebench under
autotest.  I have not been able to construct a small targeted repro case.

There is also a similar potential issue with insertion.  Storing the first
element in the root and then moving it to a new slot on insertion of a
second element would potentially lead to a similar problem.

Both of these issues have been fixed in this change.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 lib/radix-tree.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e907858..035d5aa 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -252,11 +252,6 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 	while (index > radix_tree_maxindex(height))
 		height++;
 
-	if (root->rnode == NULL) {
-		root->height = height;
-		goto out;
-	}
-
 	do {
 		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
@@ -278,7 +273,7 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 		rcu_assign_pointer(root->rnode, node);
 		root->height = newheight;
 	} while (height > root->height);
-out:
+
 	return 0;
 }
 
@@ -1154,7 +1149,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
 static inline void radix_tree_shrink(struct radix_tree_root *root)
 {
 	/* try to shrink tree height */
-	while (root->height > 0) {
+	while (root->height > 1) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
 


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-13  7:15 [PATCH] Fixed a mismatch between the users of radix_tree and the implementation Salman Qazi
@ 2010-08-13  7:20 ` Salman Qazi
  2010-08-16 11:25   ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-08-13  7:20 UTC (permalink / raw)
  To: paulmck, nickpiggin, akpm, linux-kernel, a.p.zijlstra; +Cc: adurbin

This matters for the lockless page cache, in particular find_get_pages implementation.

In the following case, we get can get a deadlock:

    0.  The radix tree contains two items, one has the index 0.
    1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
    2.  The reader acquires slot(s) for item(s) including the index 0 item.
    3.  The non-zero index item is deleted, and as a consequence the other item
        is moved to the root of the tree.  The place where it used to be is
        queued for deletion after the readers finish.
    4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
    5.  The reader looks at it again, hoping that the item will either be freed
        or the ref count will increase.  This never happens, as the slot it
        is looking at will never be updated.  Also, this slot can never be reclaimed
        because the reader is holding rcu_read_lock and is in an infinite loop.

This can be reproduced with reliably by running dbench followed by compilebench under
autotest.  I have not been able to construct a small targeted repro case.

There is also a similar potential issue with insertion.  Storing the first
element in the root and then moving it to a new slot on insertion of a
second element would potentially lead to a similar problem.

Both of these issues have been fixed in this change.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 lib/radix-tree.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e907858..035d5aa 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -252,11 +252,6 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 	while (index > radix_tree_maxindex(height))
 		height++;
 
-	if (root->rnode == NULL) {
-		root->height = height;
-		goto out;
-	}
-
 	do {
 		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
@@ -278,7 +273,7 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 		rcu_assign_pointer(root->rnode, node);
 		root->height = newheight;
 	} while (height > root->height);
-out:
+
 	return 0;
 }
 
@@ -1154,7 +1149,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
 static inline void radix_tree_shrink(struct radix_tree_root *root)
 {
 	/* try to shrink tree height */
-	while (root->height > 0) {
+	while (root->height > 1) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
 


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-13  7:20 ` Salman Qazi
@ 2010-08-16 11:25   ` Peter Zijlstra
  2010-08-16 18:30     ` Salman Qazi
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-16 11:25 UTC (permalink / raw)
  To: Salman Qazi; +Cc: paulmck, nickpiggin, akpm, linux-kernel, adurbin

On Fri, 2010-08-13 at 00:20 -0700, Salman Qazi wrote:
> This matters for the lockless page cache, in particular find_get_pages implementation.
> 
> In the following case, we get can get a deadlock:
> 
>     0.  The radix tree contains two items, one has the index 0.
>     1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
>     2.  The reader acquires slot(s) for item(s) including the index 0 item.
>     3.  The non-zero index item is deleted, and as a consequence the other item
>         is moved to the root of the tree.  The place where it used to be is
>         queued for deletion after the readers finish.
>     4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
>     5.  The reader looks at it again, hoping that the item will either be freed
>         or the ref count will increase.  This never happens, as the slot it
>         is looking at will never be updated.  Also, this slot can never be reclaimed
>         because the reader is holding rcu_read_lock and is in an infinite loop.
> 
> This can be reproduced with reliably by running dbench followed by compilebench under
> autotest.  I have not been able to construct a small targeted repro case.
> 
> There is also a similar potential issue with insertion.  Storing the first
> element in the root and then moving it to a new slot on insertion of a
> second element would potentially lead to a similar problem.
> 
> Both of these issues have been fixed in this change.

Your changelog fails to mention how you fixed it.. 

It also reminds me how much I hate that height hack, I really should get
path-compression working some day...

http://programming.kicks-ass.net/kernel-patches/concurrent-pagecache/23-rc1-rt/radix-tree-path-compression.patch



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-16 11:25   ` Peter Zijlstra
@ 2010-08-16 18:30     ` Salman Qazi
  2010-08-16 19:33       ` Peter Zijlstra
  2010-08-17 15:23       ` Nick Piggin
  0 siblings, 2 replies; 17+ messages in thread
From: Salman Qazi @ 2010-08-16 18:30 UTC (permalink / raw)
  To: paulmck, akpm, linux-kernel, a.p.zijlstra; +Cc: adurbin

Done.

---

This matters for the lockless page cache, in particular find_get_pages implementation.

In the following case, we get can get a deadlock:

    0.  The radix tree contains two items, one has the index 0.
    1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
    2.  The reader acquires slot(s) for item(s) including the index 0 item.
    3.  The non-zero index item is deleted, and as a consequence the other item
        is moved to the root of the tree.  The place where it used to be is
        queued for deletion after the readers finish.
    4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
    5.  The reader looks at it again, hoping that the item will either be freed
        or the ref count will increase.  This never happens, as the slot it
        is looking at will never be updated.  Also, this slot can never be reclaimed
        because the reader is holding rcu_read_lock and is in an infinite loop.

This can be reproduced with reliably by running dbench followed by compilebench under
autotest.  I have not been able to construct a small targeted repro case.

There is also a similar potential issue with insertion.  Storing the first
element in the root and then moving it to a new slot on insertion of a
second element would potentially lead to a similar problem.

Both of these issues have been fixed in this change.  For the delete case,
we no longer shrink the tree back to being just the root containing the
only remaining object.  For the insert case, we no longer store the
first object in the root, rather allocating a node structure for it.  The
reason that this works is that deleting (or inserting) intermediate nodes
does not make a difference to a reader holding a slot.  The problem only
occurs when a leaf node is inserted or deleted.  By making these changes
we avoid insertion and deletion of new leaf nodes for values already
in the tree.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 lib/radix-tree.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e907858..035d5aa 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -252,11 +252,6 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 	while (index > radix_tree_maxindex(height))
 		height++;
 
-	if (root->rnode == NULL) {
-		root->height = height;
-		goto out;
-	}
-
 	do {
 		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
@@ -278,7 +273,7 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 		rcu_assign_pointer(root->rnode, node);
 		root->height = newheight;
 	} while (height > root->height);
-out:
+
 	return 0;
 }
 
@@ -1154,7 +1149,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
 static inline void radix_tree_shrink(struct radix_tree_root *root)
 {
 	/* try to shrink tree height */
-	while (root->height > 0) {
+	while (root->height > 1) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
 


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-16 18:30     ` Salman Qazi
@ 2010-08-16 19:33       ` Peter Zijlstra
       [not found]         ` <AANLkTindpUo5tPiXVp6wXjOAqczrJvYegrOMBqSjr4_t@mail.gmail.com>
  2010-08-17 15:23       ` Nick Piggin
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-16 19:33 UTC (permalink / raw)
  To: Salman Qazi; +Cc: paulmck, akpm, linux-kernel, adurbin

On Mon, 2010-08-16 at 11:30 -0700, Salman Qazi wrote:
> For the delete case,
> we no longer shrink the tree back to being just the root containing the
> only remaining object.  For the insert case, we no longer store the
> first object in the root, rather allocating a node structure for it.  The
> reason that this works is that deleting (or inserting) intermediate nodes
> does not make a difference to a reader holding a slot.  

Ah, I through that was what it did. So you basically increase the memory
footprint for tiny files.. have you done any measurements on that?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
       [not found]         ` <AANLkTindpUo5tPiXVp6wXjOAqczrJvYegrOMBqSjr4_t@mail.gmail.com>
@ 2010-08-16 21:05           ` Salman Qazi
  2010-08-16 21:06           ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Salman Qazi @ 2010-08-16 21:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: paulmck, akpm, linux-kernel, adurbin

On Mon, Aug 16, 2010 at 1:59 PM, Salman Qazi <sqazi@google.com> wrote:
>
>
> On Mon, Aug 16, 2010 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Mon, 2010-08-16 at 11:30 -0700, Salman Qazi wrote:
>> > For the delete case,
>> > we no longer shrink the tree back to being just the root containing the
>> > only remaining object.  For the insert case, we no longer store the
>> > first object in the root, rather allocating a node structure for it.  The
>> > reason that this works is that deleting (or inserting) intermediate nodes
>> > does not make a difference to a reader holding a slot.
>>
>> Ah, I through that was what it did. So you basically increase the memory
>> footprint for tiny files.. have you done any measurements on that?
>
>


You raise a valid concern.  I haven't.  What would you recommend as a
benchmark/metric to measure this?
(had to resend this. mail client had decided to switch to HTML mode,
which LKML didn't like)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
       [not found]         ` <AANLkTindpUo5tPiXVp6wXjOAqczrJvYegrOMBqSjr4_t@mail.gmail.com>
  2010-08-16 21:05           ` Salman Qazi
@ 2010-08-16 21:06           ` Peter Zijlstra
  2010-08-17  4:35             ` Salman Qazi
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-16 21:06 UTC (permalink / raw)
  To: Salman Qazi; +Cc: paulmck, akpm, linux-kernel, adurbin

(html damaged email alert)

On Mon, 2010-08-16 at 13:59 -0700, Salman Qazi wrote:
> On Mon, Aug 16, 2010 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>         On Mon, 2010-08-16 at 11:30 -0700, Salman Qazi wrote:
>         > For the delete case,
>         > we no longer shrink the tree back to being just the root containing the
>         > only remaining object.  For the insert case, we no longer store the
>         > first object in the root, rather allocating a node structure for it.  The
>         > reason that this works is that deleting (or inserting) intermediate nodes
>         > does not make a difference to a reader holding a slot.
>         
>         
>         Ah, I through that was what it did. So you basically increase the memory
>         footprint for tiny files.. have you done any measurements on that?
> 

> You raise a valid concern.  I haven't.  What would you recommend as a
> benchmark/metric to measure this?

One thing you could try is something like the below on a freshly booted
machine, once without and once with the patch:

 cd /usr/src/linux-2.6
 echo 1 > /proc/sys/vm/drop_caches
 grep radix /proc/slabinfo
 make bzImage
 echo 1 > /proc/sys/vm/drop_caches
 grep radix /proc/slabinfo




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-16 21:06           ` Peter Zijlstra
@ 2010-08-17  4:35             ` Salman Qazi
  2010-08-17  4:45               ` Salman Qazi
  0 siblings, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-08-17  4:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: paulmck, akpm, linux-kernel, adurbin

On Mon, Aug 16, 2010 at 2:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> (html damaged email alert)
>
> On Mon, 2010-08-16 at 13:59 -0700, Salman Qazi wrote:
>> On Mon, Aug 16, 2010 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>         On Mon, 2010-08-16 at 11:30 -0700, Salman Qazi wrote:
>>         > For the delete case,
>>         > we no longer shrink the tree back to being just the root containing the
>>         > only remaining object.  For the insert case, we no longer store the
>>         > first object in the root, rather allocating a node structure for it.  The
>>         > reason that this works is that deleting (or inserting) intermediate nodes
>>         > does not make a difference to a reader holding a slot.
>>
>>
>>         Ah, I through that was what it did. So you basically increase the memory
>>         footprint for tiny files.. have you done any measurements on that?
>>
>
>> You raise a valid concern.  I haven't.  What would you recommend as a
>> benchmark/metric to measure this?
>
> One thing you could try is something like the below on a freshly booted
> machine, once without and once with the patch:
>
>  cd /usr/src/linux-2.6
>  echo 1 > /proc/sys/vm/drop_caches
>  grep radix /proc/slabinfo
>  make bzImage
>  echo 1 > /proc/sys/vm/drop_caches
>  grep radix /proc/slabinfo
>
>
>
>

Here's what I see:

Without the patch:

Before:
radix_tree_node      468   1400    568   28    4 : tunables    0    0
  0 : slabdata     50     50      0

After:
radix_tree_node     1886   3192    568   28    4 : tunables    0    0
  0 : slabdata    114    114      0

With the patch:

Before:

radix_tree_node      495   1176    568   28    4 : tunables    0    0
  0 : slabdata     42     42      0

After:

radix_tree_node     3173   7336    568   28    4 : tunables    0    0
  0 : slabdata    262    262      0


So, not particularly good news :(.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-17  4:35             ` Salman Qazi
@ 2010-08-17  4:45               ` Salman Qazi
  2010-08-17  8:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-08-17  4:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: paulmck, akpm, linux-kernel, adurbin

On Mon, Aug 16, 2010 at 9:35 PM, Salman Qazi <sqazi@google.com> wrote:
> On Mon, Aug 16, 2010 at 2:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> (html damaged email alert)
>>
>> On Mon, 2010-08-16 at 13:59 -0700, Salman Qazi wrote:
>>> On Mon, Aug 16, 2010 at 12:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>         On Mon, 2010-08-16 at 11:30 -0700, Salman Qazi wrote:
>>>         > For the delete case,
>>>         > we no longer shrink the tree back to being just the root containing the
>>>         > only remaining object.  For the insert case, we no longer store the
>>>         > first object in the root, rather allocating a node structure for it.  The
>>>         > reason that this works is that deleting (or inserting) intermediate nodes
>>>         > does not make a difference to a reader holding a slot.
>>>
>>>
>>>         Ah, I through that was what it did. So you basically increase the memory
>>>         footprint for tiny files.. have you done any measurements on that?
>>>
>>
>>> You raise a valid concern.  I haven't.  What would you recommend as a
>>> benchmark/metric to measure this?
>>
>> One thing you could try is something like the below on a freshly booted
>> machine, once without and once with the patch:
>>
>>  cd /usr/src/linux-2.6
>>  echo 1 > /proc/sys/vm/drop_caches
>>  grep radix /proc/slabinfo
>>  make bzImage
>>  echo 1 > /proc/sys/vm/drop_caches
>>  grep radix /proc/slabinfo
>>
>>
>>
>>
>
> Here's what I see:
>
> Without the patch:
>
> Before:
> radix_tree_node      468   1400    568   28    4 : tunables    0    0
>  0 : slabdata     50     50      0
>
> After:
> radix_tree_node     1886   3192    568   28    4 : tunables    0    0
>  0 : slabdata    114    114      0
>
> With the patch:
>
> Before:
>
> radix_tree_node      495   1176    568   28    4 : tunables    0    0
>  0 : slabdata     42     42      0
>
> After:
>
> radix_tree_node     3173   7336    568   28    4 : tunables    0    0
>  0 : slabdata    262    262      0
>
>
> So, not particularly good news :(.
>

But considering that the kernel locks up, and we are still talking
about < 5MB after a kernel compile, should we really be all that
concerned?  If so, what are the alternatives that should be considered
for fixing this lock up?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-17  4:45               ` Salman Qazi
@ 2010-08-17  8:42                 ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-17  8:42 UTC (permalink / raw)
  To: Salman Qazi; +Cc: paulmck, akpm, linux-kernel, adurbin, Nick Piggin

On Mon, 2010-08-16 at 21:45 -0700, Salman Qazi wrote:
> But considering that the kernel locks up, and we are still talking
> about < 5MB after a kernel compile, should we really be all that
> concerned?  If so, what are the alternatives that should be considered
> for fixing this lock up? 

No bright ideas atm, but yeah, that indirect pointer vs rcu slot lookup
is funky. Lets go with your patch for now.

Maybe Nick has a bright idea?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-16 18:30     ` Salman Qazi
  2010-08-16 19:33       ` Peter Zijlstra
@ 2010-08-17 15:23       ` Nick Piggin
  2010-08-17 15:48         ` Peter Zijlstra
  2010-08-17 21:28         ` Salman Qazi
  1 sibling, 2 replies; 17+ messages in thread
From: Nick Piggin @ 2010-08-17 15:23 UTC (permalink / raw)
  To: Salman Qazi; +Cc: paulmck, akpm, linux-kernel, a.p.zijlstra, adurbin

On Mon, Aug 16, 2010 at 11:30:07AM -0700, Salman Qazi wrote:
> Done.
> 
> ---
> 
> This matters for the lockless page cache, in particular find_get_pages implementation.
> 
> In the following case, we get can get a deadlock:
> 
>     0.  The radix tree contains two items, one has the index 0.
>     1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
>     2.  The reader acquires slot(s) for item(s) including the index 0 item.
>     3.  The non-zero index item is deleted, and as a consequence the other item
>         is moved to the root of the tree.  The place where it used to be is
>         queued for deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot,
          it remains in the rcu-delayed indirect node.
>     4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
>     5.  The reader looks at it again, hoping that the item will either be freed
>         or the ref count will increase.  This never happens, as the slot it
>         is looking at will never be updated.  Also, this slot can never be reclaimed
>         because the reader is holding rcu_read_lock and is in an infinite loop.

Good catch. Increasing memory footprint really sucks actually. 5MB is a
lot of memory, and it also means another pointer dereference on small
files.

I actually do go to a lot of trouble already to have direct pointers.
Unfortunately this is another little complexity, but I really think it's
worthwhile.

How about something like this?
--
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h	2010-08-18 00:01:19.000000000 +1000
+++ linux-2.6/include/linux/radix-tree.h	2010-08-18 00:56:32.000000000 +1000
@@ -34,19 +34,12 @@
  * needed for RCU lookups (because root->height is unreliable). The only
  * time callers need worry about this is when doing a lookup_slot under
  * RCU.
+ *
+ * Indirect pointer in fact is also used to tag the last pointer of a node
+ * when it is shrunk, before we rcu free the node. See shrink code for
+ * details.
  */
 #define RADIX_TREE_INDIRECT_PTR	1
-#define RADIX_TREE_RETRY ((void *)-1UL)
-
-static inline void *radix_tree_ptr_to_indirect(void *ptr)
-{
-	return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
-}
-
-static inline void *radix_tree_indirect_to_ptr(void *ptr)
-{
-	return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
-}
 
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
@@ -138,16 +131,29 @@ do {									\
  *		removed.
  *
  * For use with radix_tree_lookup_slot().  Caller must hold tree at least read
- * locked across slot lookup and dereference.  More likely, will be used with
- * radix_tree_replace_slot(), as well, so caller will hold tree write locked.
+ * locked across slot lookup and dereference. Not required if write lock is
+ * held (ie. items cannot be concurrently inserted).
+ *
+ * radix_tree_deref_retry must be used to confirm validity of the pointer if
+ * only the read lock is held.
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = rcu_dereference(*pslot);
-	if (unlikely(radix_tree_is_indirect_ptr(ret)))
-		ret = RADIX_TREE_RETRY;
-	return ret;
+	return rcu_dereference(*pslot);
 }
+
+/**
+ * radix_tree_deref_retry	- check radix_tree_deref_slot
+ * @arg:	pointer returned by radix_tree_deref_slot
+ * Returns:	0 if retry is not required, otherwise retry is required
+ *
+ * radix_tree_deref_retry must be used with radix_tree_deref_slot.
+ */
+static inline int radix_tree_deref_retry(void *arg)
+{
+	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
+}
+
 /**
  * radix_tree_replace_slot	- replace item in a slot
  * @pslot:	pointer to slot, returned by radix_tree_lookup_slot
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2010-08-18 00:01:19.000000000 +1000
+++ linux-2.6/lib/radix-tree.c	2010-08-18 00:47:55.000000000 +1000
@@ -82,6 +82,16 @@ struct radix_tree_preload {
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
+static inline void *ptr_to_indirect(void *ptr)
+{
+	return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
+}
+
+static inline void *indirect_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
+}
+
 static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
 {
 	return root->gfp_mask & __GFP_BITS_MASK;
@@ -263,7 +273,7 @@ static int radix_tree_extend(struct radi
 			return -ENOMEM;
 
 		/* Increase the height.  */
-		node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
+		node->slots[0] = indirect_to_ptr(root->rnode);
 
 		/* Propagate the aggregated tag info into the new root */
 		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -274,7 +284,7 @@ static int radix_tree_extend(struct radi
 		newheight = root->height+1;
 		node->height = newheight;
 		node->count = 1;
-		node = radix_tree_ptr_to_indirect(node);
+		node = ptr_to_indirect(node);
 		rcu_assign_pointer(root->rnode, node);
 		root->height = newheight;
 	} while (height > root->height);
@@ -307,7 +317,7 @@ int radix_tree_insert(struct radix_tree_
 			return error;
 	}
 
-	slot = radix_tree_indirect_to_ptr(root->rnode);
+	slot = indirect_to_ptr(root->rnode);
 
 	height = root->height;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
@@ -323,8 +333,7 @@ int radix_tree_insert(struct radix_tree_
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
 			} else
-				rcu_assign_pointer(root->rnode,
-					radix_tree_ptr_to_indirect(slot));
+				rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
 		}
 
 		/* Go a level down */
@@ -372,7 +381,7 @@ static void *radix_tree_lookup_element(s
 			return NULL;
 		return is_slot ? (void *)&root->rnode : node;
 	}
-	node = radix_tree_indirect_to_ptr(node);
+	node = indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -391,7 +400,7 @@ static void *radix_tree_lookup_element(s
 		height--;
 	} while (height > 0);
 
-	return is_slot ? (void *)slot:node;
+	return is_slot ? (void *)slot : indirect_to_ptr(node);
 }
 
 /**
@@ -453,7 +462,7 @@ void *radix_tree_tag_set(struct radix_tr
 	height = root->height;
 	BUG_ON(index > radix_tree_maxindex(height));
 
-	slot = radix_tree_indirect_to_ptr(root->rnode);
+	slot = indirect_to_ptr(root->rnode);
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 
 	while (height > 0) {
@@ -507,7 +516,7 @@ void *radix_tree_tag_clear(struct radix_
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
-	slot = radix_tree_indirect_to_ptr(root->rnode);
+	slot = indirect_to_ptr(root->rnode);
 
 	while (height > 0) {
 		int offset;
@@ -577,7 +586,7 @@ int radix_tree_tag_get(struct radix_tree
 
 	if (!radix_tree_is_indirect_ptr(node))
 		return (index == 0);
-	node = radix_tree_indirect_to_ptr(node);
+	node = indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -651,7 +660,7 @@ unsigned long radix_tree_range_tag_if_ta
 	}
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	slot = radix_tree_indirect_to_ptr(root->rnode);
+	slot = indirect_to_ptr(root->rnode);
 
 	for (;;) {
 		int offset;
@@ -861,7 +870,7 @@ radix_tree_gang_lookup(struct radix_tree
 		results[0] = node;
 		return 1;
 	}
-	node = radix_tree_indirect_to_ptr(node);
+	node = indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -880,7 +889,8 @@ radix_tree_gang_lookup(struct radix_tree
 			slot = *(((void ***)results)[ret + i]);
 			if (!slot)
 				continue;
-			results[ret + nr_found] = rcu_dereference_raw(slot);
+			results[ret + nr_found] =
+				indirect_to_ptr(rcu_dereference_raw(slot));
 			nr_found++;
 		}
 		ret += nr_found;
@@ -929,7 +939,7 @@ radix_tree_gang_lookup_slot(struct radix
 		results[0] = (void **)&root->rnode;
 		return 1;
 	}
-	node = radix_tree_indirect_to_ptr(node);
+	node = indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -1054,7 +1064,7 @@ radix_tree_gang_lookup_tag(struct radix_
 		results[0] = node;
 		return 1;
 	}
-	node = radix_tree_indirect_to_ptr(node);
+	node = indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -1073,7 +1083,8 @@ radix_tree_gang_lookup_tag(struct radix_
 			slot = *(((void ***)results)[ret + i]);
 			if (!slot)
 				continue;
-			results[ret + nr_found] = rcu_dereference_raw(slot);
+			results[ret + nr_found] =
+				indirect_to_ptr(rcu_dereference_raw(slot));
 			nr_found++;
 		}
 		ret += nr_found;
@@ -1123,7 +1134,7 @@ radix_tree_gang_lookup_tag_slot(struct r
 		results[0] = (void **)&root->rnode;
 		return 1;
 	}
-	node = radix_tree_indirect_to_ptr(node);
+	node = indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -1159,7 +1170,7 @@ static inline void radix_tree_shrink(str
 		void *newptr;
 
 		BUG_ON(!radix_tree_is_indirect_ptr(to_free));
-		to_free = radix_tree_indirect_to_ptr(to_free);
+		to_free = indirect_to_ptr(to_free);
 
 		/*
 		 * The candidate node has more than one child, or its child
@@ -1172,16 +1183,39 @@ static inline void radix_tree_shrink(str
 
 		/*
 		 * We don't need rcu_assign_pointer(), since we are simply
-		 * moving the node from one part of the tree to another. If
-		 * it was safe to dereference the old pointer to it
+		 * moving the node from one part of the tree to another: if it
+		 * was safe to dereference the old pointer to it
 		 * (to_free->slots[0]), it will be safe to dereference the new
-		 * one (root->rnode).
+		 * one (root->rnode) as far as dependent read barriers go.
 		 */
 		newptr = to_free->slots[0];
 		if (root->height > 1)
-			newptr = radix_tree_ptr_to_indirect(newptr);
+			newptr = ptr_to_indirect(newptr);
 		root->rnode = newptr;
 		root->height--;
+
+		/*
+		 * We have a dilemma here. The node's slot[0] must not be
+		 * NULLed in case there are concurrent lookups expecting to
+		 * find the item. However if this was a bottom-level node,
+		 * then it may be subject to the slot pointer being visible
+		 * to callers dereferencing it. If item corresponding to
+		 * slot[0] is subsequently deleted, these callers would expect
+		 * their slot to become empty sooner or later.
+		 *
+		 * For example, lockless pagecache will look up a slot, deref
+		 * the page pointer, and if the page is 0 refcount it means it
+		 * was concurrently deleted from pagecache so try the deref
+		 * again. Fortunately there is already a requirement for logic
+		 * to retry the entire slot lookup -- the indirect pointer
+		 * problem (replacing direct root node with an indirect pointer
+		 * also results in a stale slot). So tag the slot as indirect
+		 * to force callers to retry.
+		 */
+		if (root->height == 0)
+			*((unsigned long *)&to_free->slots[0]) |=
+						RADIX_TREE_INDIRECT_PTR;
+
 		radix_tree_node_free(to_free);
 	}
 }
@@ -1218,7 +1252,7 @@ void *radix_tree_delete(struct radix_tre
 		root->rnode = NULL;
 		goto out;
 	}
-	slot = radix_tree_indirect_to_ptr(slot);
+	slot = indirect_to_ptr(slot);
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
@@ -1260,8 +1294,7 @@ void *radix_tree_delete(struct radix_tre
 			radix_tree_node_free(to_free);
 
 		if (pathp->node->count) {
-			if (pathp->node ==
-					radix_tree_indirect_to_ptr(root->rnode))
+			if (pathp->node == indirect_to_ptr(root->rnode))
 				radix_tree_shrink(root);
 			goto out;
 		}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2010-08-18 00:14:14.000000000 +1000
+++ linux-2.6/mm/filemap.c	2010-08-18 01:07:20.000000000 +1000
@@ -631,7 +631,9 @@ repeat:
 	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
 	if (pagep) {
 		page = radix_tree_deref_slot(pagep);
-		if (unlikely(!page || page == RADIX_TREE_RETRY))
+		if (unlikely(!page))
+			goto out;
+		if (radix_tree_deref_retry(page))
 			goto repeat;
 
 		if (!page_cache_get_speculative(page))
@@ -647,6 +649,7 @@ repeat:
 			goto repeat;
 		}
 	}
+out:
 	rcu_read_unlock();
 
 	return page;
@@ -764,12 +767,11 @@ repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
 		if (unlikely(!page))
 			continue;
-		/*
-		 * this can only trigger if nr_found == 1, making livelock
-		 * a non issue.
-		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (radix_tree_deref_retry(page)) {
+			if (ret)
+				start = pages[ret-1]->index;
 			goto restart;
+		}
 
 		if (!page_cache_get_speculative(page))
 			goto repeat;
@@ -817,11 +819,7 @@ repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
 		if (unlikely(!page))
 			continue;
-		/*
-		 * this can only trigger if nr_found == 1, making livelock
-		 * a non issue.
-		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (radix_tree_deref_retry(page))
 			goto restart;
 
 		if (page->mapping == NULL || page->index != index)
@@ -874,11 +872,7 @@ repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
 		if (unlikely(!page))
 			continue;
-		/*
-		 * this can only trigger if nr_found == 1, making livelock
-		 * a non issue.
-		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (radix_tree_deref_retry(page))
 			goto restart;
 
 		if (!page_cache_get_speculative(page))

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-17 15:23       ` Nick Piggin
@ 2010-08-17 15:48         ` Peter Zijlstra
  2010-08-17 19:49           ` Salman Qazi
  2010-08-17 21:28         ` Salman Qazi
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-17 15:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Salman Qazi, paulmck, akpm, linux-kernel, adurbin

On Wed, 2010-08-18 at 01:23 +1000, Nick Piggin wrote:
> How about something like this?

Ah, very nice indeed. Looks like it will indeed close the hole.

Thanks Nick!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-17 15:48         ` Peter Zijlstra
@ 2010-08-17 19:49           ` Salman Qazi
  0 siblings, 0 replies; 17+ messages in thread
From: Salman Qazi @ 2010-08-17 19:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Nick Piggin, paulmck, akpm, linux-kernel, adurbin

On Tue, Aug 17, 2010 at 8:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-08-18 at 01:23 +1000, Nick Piggin wrote:
>> How about something like this?
>
> Ah, very nice indeed. Looks like it will indeed close the hole.
>
> Thanks Nick!
>

I like Nick's patch in principle.  I will try it out to verify that it
fixes the problem (although, reading it, I am fairly certain that it
does).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-17 15:23       ` Nick Piggin
  2010-08-17 15:48         ` Peter Zijlstra
@ 2010-08-17 21:28         ` Salman Qazi
  2010-08-30 23:27           ` Salman Qazi
  1 sibling, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-08-17 21:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: paulmck, akpm, linux-kernel, a.p.zijlstra, adurbin

On Tue, Aug 17, 2010 at 8:23 AM, Nick Piggin <npiggin@kernel.dk> wrote:
> On Mon, Aug 16, 2010 at 11:30:07AM -0700, Salman Qazi wrote:
>> Done.
>>
>> ---
>>
>> This matters for the lockless page cache, in particular find_get_pages implementation.
>>
>> In the following case, we get can get a deadlock:
>>
>>     0.  The radix tree contains two items, one has the index 0.
>>     1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
>>     2.  The reader acquires slot(s) for item(s) including the index 0 item.
>>     3.  The non-zero index item is deleted, and as a consequence the other item
>>         is moved to the root of the tree.  The place where it used to be is
>>         queued for deletion after the readers finish.
>      3b. The zero item is deleted, removing it from the direct slot,
>          it remains in the rcu-delayed indirect node.
>>     4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
>>     5.  The reader looks at it again, hoping that the item will either be freed
>>         or the ref count will increase.  This never happens, as the slot it
>>         is looking at will never be updated.  Also, this slot can never be reclaimed
>>         because the reader is holding rcu_read_lock and is in an infinite loop.
>
> Good catch. Increasing memory footprint really sucks actually. 5MB is a
> lot of memory, and it also means another pointer dereference on small
> files.
>
> I actually do go to a lot of trouble already to have direct pointers.
> Unfortunately this is another little complexity, but I really think it's
> worthwhile.
>
> How about something like this?
> --
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h   2010-08-18 00:01:19.000000000 +1000
> +++ linux-2.6/include/linux/radix-tree.h        2010-08-18 00:56:32.000000000 +1000
> @@ -34,19 +34,12 @@
>  * needed for RCU lookups (because root->height is unreliable). The only
>  * time callers need worry about this is when doing a lookup_slot under
>  * RCU.
> + *
> + * Indirect pointer in fact is also used to tag the last pointer of a node
> + * when it is shrunk, before we rcu free the node. See shrink code for
> + * details.
>  */
>  #define RADIX_TREE_INDIRECT_PTR        1
> -#define RADIX_TREE_RETRY ((void *)-1UL)
> -
> -static inline void *radix_tree_ptr_to_indirect(void *ptr)
> -{
> -       return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
> -}
> -
> -static inline void *radix_tree_indirect_to_ptr(void *ptr)
> -{
> -       return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
> -}
>
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
> @@ -138,16 +131,29 @@ do {                                                                      \
>  *             removed.
>  *
>  * For use with radix_tree_lookup_slot().  Caller must hold tree at least read
> - * locked across slot lookup and dereference.  More likely, will be used with
> - * radix_tree_replace_slot(), as well, so caller will hold tree write locked.
> + * locked across slot lookup and dereference. Not required if write lock is
> + * held (ie. items cannot be concurrently inserted).
> + *
> + * radix_tree_deref_retry must be used to confirm validity of the pointer if
> + * only the read lock is held.
>  */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -       void *ret = rcu_dereference(*pslot);
> -       if (unlikely(radix_tree_is_indirect_ptr(ret)))
> -               ret = RADIX_TREE_RETRY;
> -       return ret;
> +       return rcu_dereference(*pslot);
>  }
> +
> +/**
> + * radix_tree_deref_retry      - check radix_tree_deref_slot
> + * @arg:       pointer returned by radix_tree_deref_slot
> + * Returns:    0 if retry is not required, otherwise retry is required
> + *
> + * radix_tree_deref_retry must be used with radix_tree_deref_slot.
> + */
> +static inline int radix_tree_deref_retry(void *arg)
> +{
> +       return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
> +}
> +
>  /**
>  * radix_tree_replace_slot     - replace item in a slot
>  * @pslot:     pointer to slot, returned by radix_tree_lookup_slot
> Index: linux-2.6/lib/radix-tree.c
> ===================================================================
> --- linux-2.6.orig/lib/radix-tree.c     2010-08-18 00:01:19.000000000 +1000
> +++ linux-2.6/lib/radix-tree.c  2010-08-18 00:47:55.000000000 +1000
> @@ -82,6 +82,16 @@ struct radix_tree_preload {
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>
> +static inline void *ptr_to_indirect(void *ptr)
> +{
> +       return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
> +}
> +
> +static inline void *indirect_to_ptr(void *ptr)
> +{
> +       return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
> +}
> +
>  static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
>  {
>        return root->gfp_mask & __GFP_BITS_MASK;
> @@ -263,7 +273,7 @@ static int radix_tree_extend(struct radi
>                        return -ENOMEM;
>
>                /* Increase the height.  */
> -               node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
> +               node->slots[0] = indirect_to_ptr(root->rnode);
>
>                /* Propagate the aggregated tag info into the new root */
>                for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> @@ -274,7 +284,7 @@ static int radix_tree_extend(struct radi
>                newheight = root->height+1;
>                node->height = newheight;
>                node->count = 1;
> -               node = radix_tree_ptr_to_indirect(node);
> +               node = ptr_to_indirect(node);
>                rcu_assign_pointer(root->rnode, node);
>                root->height = newheight;
>        } while (height > root->height);
> @@ -307,7 +317,7 @@ int radix_tree_insert(struct radix_tree_
>                        return error;
>        }
>
> -       slot = radix_tree_indirect_to_ptr(root->rnode);
> +       slot = indirect_to_ptr(root->rnode);
>
>        height = root->height;
>        shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> @@ -323,8 +333,7 @@ int radix_tree_insert(struct radix_tree_
>                                rcu_assign_pointer(node->slots[offset], slot);
>                                node->count++;
>                        } else
> -                               rcu_assign_pointer(root->rnode,
> -                                       radix_tree_ptr_to_indirect(slot));
> +                               rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
>                }
>
>                /* Go a level down */
> @@ -372,7 +381,7 @@ static void *radix_tree_lookup_element(s
>                        return NULL;
>                return is_slot ? (void *)&root->rnode : node;
>        }
> -       node = radix_tree_indirect_to_ptr(node);
> +       node = indirect_to_ptr(node);
>
>        height = node->height;
>        if (index > radix_tree_maxindex(height))
> @@ -391,7 +400,7 @@ static void *radix_tree_lookup_element(s
>                height--;
>        } while (height > 0);
>
> -       return is_slot ? (void *)slot:node;
> +       return is_slot ? (void *)slot : indirect_to_ptr(node);
>  }
>
>  /**
> @@ -453,7 +462,7 @@ void *radix_tree_tag_set(struct radix_tr
>        height = root->height;
>        BUG_ON(index > radix_tree_maxindex(height));
>
> -       slot = radix_tree_indirect_to_ptr(root->rnode);
> +       slot = indirect_to_ptr(root->rnode);
>        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>
>        while (height > 0) {
> @@ -507,7 +516,7 @@ void *radix_tree_tag_clear(struct radix_
>
>        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>        pathp->node = NULL;
> -       slot = radix_tree_indirect_to_ptr(root->rnode);
> +       slot = indirect_to_ptr(root->rnode);
>
>        while (height > 0) {
>                int offset;
> @@ -577,7 +586,7 @@ int radix_tree_tag_get(struct radix_tree
>
>        if (!radix_tree_is_indirect_ptr(node))
>                return (index == 0);
> -       node = radix_tree_indirect_to_ptr(node);
> +       node = indirect_to_ptr(node);
>
>        height = node->height;
>        if (index > radix_tree_maxindex(height))
> @@ -651,7 +660,7 @@ unsigned long radix_tree_range_tag_if_ta
>        }
>
>        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> -       slot = radix_tree_indirect_to_ptr(root->rnode);
> +       slot = indirect_to_ptr(root->rnode);
>
>        for (;;) {
>                int offset;
> @@ -861,7 +870,7 @@ radix_tree_gang_lookup(struct radix_tree
>                results[0] = node;
>                return 1;
>        }
> -       node = radix_tree_indirect_to_ptr(node);
> +       node = indirect_to_ptr(node);
>
>        max_index = radix_tree_maxindex(node->height);
>
> @@ -880,7 +889,8 @@ radix_tree_gang_lookup(struct radix_tree
>                        slot = *(((void ***)results)[ret + i]);
>                        if (!slot)
>                                continue;
> -                       results[ret + nr_found] = rcu_dereference_raw(slot);
> +                       results[ret + nr_found] =
> +                               indirect_to_ptr(rcu_dereference_raw(slot));
>                        nr_found++;
>                }
>                ret += nr_found;
> @@ -929,7 +939,7 @@ radix_tree_gang_lookup_slot(struct radix
>                results[0] = (void **)&root->rnode;
>                return 1;
>        }
> -       node = radix_tree_indirect_to_ptr(node);
> +       node = indirect_to_ptr(node);
>
>        max_index = radix_tree_maxindex(node->height);
>
> @@ -1054,7 +1064,7 @@ radix_tree_gang_lookup_tag(struct radix_
>                results[0] = node;
>                return 1;
>        }
> -       node = radix_tree_indirect_to_ptr(node);
> +       node = indirect_to_ptr(node);
>
>        max_index = radix_tree_maxindex(node->height);
>
> @@ -1073,7 +1083,8 @@ radix_tree_gang_lookup_tag(struct radix_
>                        slot = *(((void ***)results)[ret + i]);
>                        if (!slot)
>                                continue;
> -                       results[ret + nr_found] = rcu_dereference_raw(slot);
> +                       results[ret + nr_found] =
> +                               indirect_to_ptr(rcu_dereference_raw(slot));
>                        nr_found++;
>                }
>                ret += nr_found;
> @@ -1123,7 +1134,7 @@ radix_tree_gang_lookup_tag_slot(struct r
>                results[0] = (void **)&root->rnode;
>                return 1;
>        }
> -       node = radix_tree_indirect_to_ptr(node);
> +       node = indirect_to_ptr(node);
>
>        max_index = radix_tree_maxindex(node->height);
>
> @@ -1159,7 +1170,7 @@ static inline void radix_tree_shrink(str
>                void *newptr;
>
>                BUG_ON(!radix_tree_is_indirect_ptr(to_free));
> -               to_free = radix_tree_indirect_to_ptr(to_free);
> +               to_free = indirect_to_ptr(to_free);
>
>                /*
>                 * The candidate node has more than one child, or its child
> @@ -1172,16 +1183,39 @@ static inline void radix_tree_shrink(str
>
>                /*
>                 * We don't need rcu_assign_pointer(), since we are simply
> -                * moving the node from one part of the tree to another. If
> -                * it was safe to dereference the old pointer to it
> +                * moving the node from one part of the tree to another: if it
> +                * was safe to dereference the old pointer to it
>                 * (to_free->slots[0]), it will be safe to dereference the new
> -                * one (root->rnode).
> +                * one (root->rnode) as far as dependent read barriers go.
>                 */
>                newptr = to_free->slots[0];
>                if (root->height > 1)
> -                       newptr = radix_tree_ptr_to_indirect(newptr);
> +                       newptr = ptr_to_indirect(newptr);
>                root->rnode = newptr;
>                root->height--;
> +
> +               /*
> +                * We have a dilemma here. The node's slot[0] must not be
> +                * NULLed in case there are concurrent lookups expecting to
> +                * find the item. However if this was a bottom-level node,
> +                * then it may be subject to the slot pointer being visible
> +                * to callers dereferencing it. If item corresponding to
> +                * slot[0] is subsequently deleted, these callers would expect
> +                * their slot to become empty sooner or later.
> +                *
> +                * For example, lockless pagecache will look up a slot, deref
> +                * the page pointer, and if the page is 0 refcount it means it
> +                * was concurrently deleted from pagecache so try the deref
> +                * again. Fortunately there is already a requirement for logic
> +                * to retry the entire slot lookup -- the indirect pointer
> +                * problem (replacing direct root node with an indirect pointer
> +                * also results in a stale slot). So tag the slot as indirect
> +                * to force callers to retry.
> +                */
> +               if (root->height == 0)
> +                       *((unsigned long *)&to_free->slots[0]) |=
> +                                               RADIX_TREE_INDIRECT_PTR;
> +
>                radix_tree_node_free(to_free);
>        }
>  }
> @@ -1218,7 +1252,7 @@ void *radix_tree_delete(struct radix_tre
>                root->rnode = NULL;
>                goto out;
>        }
> -       slot = radix_tree_indirect_to_ptr(slot);
> +       slot = indirect_to_ptr(slot);
>
>        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>        pathp->node = NULL;
> @@ -1260,8 +1294,7 @@ void *radix_tree_delete(struct radix_tre
>                        radix_tree_node_free(to_free);
>
>                if (pathp->node->count) {
> -                       if (pathp->node ==
> -                                       radix_tree_indirect_to_ptr(root->rnode))
> +                       if (pathp->node == indirect_to_ptr(root->rnode))
>                                radix_tree_shrink(root);
>                        goto out;
>                }
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c 2010-08-18 00:14:14.000000000 +1000
> +++ linux-2.6/mm/filemap.c      2010-08-18 01:07:20.000000000 +1000
> @@ -631,7 +631,9 @@ repeat:
>        pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
>        if (pagep) {
>                page = radix_tree_deref_slot(pagep);
> -               if (unlikely(!page || page == RADIX_TREE_RETRY))
> +               if (unlikely(!page))
> +                       goto out;
> +               if (radix_tree_deref_retry(page))
>                        goto repeat;
>
>                if (!page_cache_get_speculative(page))
> @@ -647,6 +649,7 @@ repeat:
>                        goto repeat;
>                }
>        }
> +out:
>        rcu_read_unlock();
>
>        return page;
> @@ -764,12 +767,11 @@ repeat:
>                page = radix_tree_deref_slot((void **)pages[i]);
>                if (unlikely(!page))
>                        continue;
> -               /*
> -                * this can only trigger if nr_found == 1, making livelock
> -                * a non issue.
> -                */
> -               if (unlikely(page == RADIX_TREE_RETRY))
> +               if (radix_tree_deref_retry(page)) {
> +                       if (ret)
> +                               start = pages[ret-1]->index;

What does the line above do?


>                        goto restart;
> +               }
>
>                if (!page_cache_get_speculative(page))
>                        goto repeat;
> @@ -817,11 +819,7 @@ repeat:
>                page = radix_tree_deref_slot((void **)pages[i]);
>                if (unlikely(!page))
>                        continue;
> -               /*
> -                * this can only trigger if nr_found == 1, making livelock
> -                * a non issue.
> -                */
> -               if (unlikely(page == RADIX_TREE_RETRY))
> +               if (radix_tree_deref_retry(page))
>                        goto restart;
>
>                if (page->mapping == NULL || page->index != index)
> @@ -874,11 +872,7 @@ repeat:
>                page = radix_tree_deref_slot((void **)pages[i]);
>                if (unlikely(!page))
>                        continue;
> -               /*
> -                * this can only trigger if nr_found == 1, making livelock
> -                * a non issue.
> -                */
> -               if (unlikely(page == RADIX_TREE_RETRY))
> +               if (radix_tree_deref_retry(page))
>                        goto restart;
>
>                if (!page_cache_get_speculative(page))
>

I just verified that your patch fixes the issue.  Thanks for doing this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-17 21:28         ` Salman Qazi
@ 2010-08-30 23:27           ` Salman Qazi
  2010-11-10 19:16             ` Salman Qazi
  0 siblings, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-08-30 23:27 UTC (permalink / raw)
  To: Nick Piggin; +Cc: paulmck, akpm, linux-kernel, a.p.zijlstra, adurbin

On Tue, Aug 17, 2010 at 2:28 PM, Salman Qazi <sqazi@google.com> wrote:
>
> On Tue, Aug 17, 2010 at 8:23 AM, Nick Piggin <npiggin@kernel.dk> wrote:
> > On Mon, Aug 16, 2010 at 11:30:07AM -0700, Salman Qazi wrote:
> >> Done.
> >>
> >> ---
> >>
> >> This matters for the lockless page cache, in particular find_get_pages implementation.
> >>
> >> In the following case, we get can get a deadlock:
> >>
> >>     0.  The radix tree contains two items, one has the index 0.
> >>     1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
> >>     2.  The reader acquires slot(s) for item(s) including the index 0 item.
> >>     3.  The non-zero index item is deleted, and as a consequence the other item
> >>         is moved to the root of the tree.  The place where it used to be is
> >>         queued for deletion after the readers finish.
> >      3b. The zero item is deleted, removing it from the direct slot,
> >          it remains in the rcu-delayed indirect node.
> >>     4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
> >>     5.  The reader looks at it again, hoping that the item will either be freed
> >>         or the ref count will increase.  This never happens, as the slot it
> >>         is looking at will never be updated.  Also, this slot can never be reclaimed
> >>         because the reader is holding rcu_read_lock and is in an infinite loop.
> >
> > Good catch. Increasing memory footprint really sucks actually. 5MB is a
> > lot of memory, and it also means another pointer dereference on small
> > files.
> >
> > I actually do go to a lot of trouble already to have direct pointers.
> > Unfortunately this is another little complexity, but I really think it's
> > worthwhile.
> >
> > How about something like this?
> > --
> > Index: linux-2.6/include/linux/radix-tree.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/radix-tree.h   2010-08-18 00:01:19.000000000 +1000
> > +++ linux-2.6/include/linux/radix-tree.h        2010-08-18 00:56:32.000000000 +1000
> > @@ -34,19 +34,12 @@
> >  * needed for RCU lookups (because root->height is unreliable). The only
> >  * time callers need worry about this is when doing a lookup_slot under
> >  * RCU.
> > + *
> > + * Indirect pointer in fact is also used to tag the last pointer of a node
> > + * when it is shrunk, before we rcu free the node. See shrink code for
> > + * details.
> >  */
> >  #define RADIX_TREE_INDIRECT_PTR        1
> > -#define RADIX_TREE_RETRY ((void *)-1UL)
> > -
> > -static inline void *radix_tree_ptr_to_indirect(void *ptr)
> > -{
> > -       return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
> > -}
> > -
> > -static inline void *radix_tree_indirect_to_ptr(void *ptr)
> > -{
> > -       return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
> > -}
> >
> >  static inline int radix_tree_is_indirect_ptr(void *ptr)
> >  {
> > @@ -138,16 +131,29 @@ do {                                                                      \
> >  *             removed.
> >  *
> >  * For use with radix_tree_lookup_slot().  Caller must hold tree at least read
> > - * locked across slot lookup and dereference.  More likely, will be used with
> > - * radix_tree_replace_slot(), as well, so caller will hold tree write locked.
> > + * locked across slot lookup and dereference. Not required if write lock is
> > + * held (ie. items cannot be concurrently inserted).
> > + *
> > + * radix_tree_deref_retry must be used to confirm validity of the pointer if
> > + * only the read lock is held.
> >  */
> >  static inline void *radix_tree_deref_slot(void **pslot)
> >  {
> > -       void *ret = rcu_dereference(*pslot);
> > -       if (unlikely(radix_tree_is_indirect_ptr(ret)))
> > -               ret = RADIX_TREE_RETRY;
> > -       return ret;
> > +       return rcu_dereference(*pslot);
> >  }
> > +
> > +/**
> > + * radix_tree_deref_retry      - check radix_tree_deref_slot
> > + * @arg:       pointer returned by radix_tree_deref_slot
> > + * Returns:    0 if retry is not required, otherwise retry is required
> > + *
> > + * radix_tree_deref_retry must be used with radix_tree_deref_slot.
> > + */
> > +static inline int radix_tree_deref_retry(void *arg)
> > +{
> > +       return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
> > +}
> > +
> >  /**
> >  * radix_tree_replace_slot     - replace item in a slot
> >  * @pslot:     pointer to slot, returned by radix_tree_lookup_slot
> > Index: linux-2.6/lib/radix-tree.c
> > ===================================================================
> > --- linux-2.6.orig/lib/radix-tree.c     2010-08-18 00:01:19.000000000 +1000
> > +++ linux-2.6/lib/radix-tree.c  2010-08-18 00:47:55.000000000 +1000
> > @@ -82,6 +82,16 @@ struct radix_tree_preload {
> >  };
> >  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
> >
> > +static inline void *ptr_to_indirect(void *ptr)
> > +{
> > +       return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
> > +}
> > +
> > +static inline void *indirect_to_ptr(void *ptr)
> > +{
> > +       return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
> > +}
> > +
> >  static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
> >  {
> >        return root->gfp_mask & __GFP_BITS_MASK;
> > @@ -263,7 +273,7 @@ static int radix_tree_extend(struct radi
> >                        return -ENOMEM;
> >
> >                /* Increase the height.  */
> > -               node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
> > +               node->slots[0] = indirect_to_ptr(root->rnode);
> >
> >                /* Propagate the aggregated tag info into the new root */
> >                for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > @@ -274,7 +284,7 @@ static int radix_tree_extend(struct radi
> >                newheight = root->height+1;
> >                node->height = newheight;
> >                node->count = 1;
> > -               node = radix_tree_ptr_to_indirect(node);
> > +               node = ptr_to_indirect(node);
> >                rcu_assign_pointer(root->rnode, node);
> >                root->height = newheight;
> >        } while (height > root->height);
> > @@ -307,7 +317,7 @@ int radix_tree_insert(struct radix_tree_
> >                        return error;
> >        }
> >
> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
> > +       slot = indirect_to_ptr(root->rnode);
> >
> >        height = root->height;
> >        shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> > @@ -323,8 +333,7 @@ int radix_tree_insert(struct radix_tree_
> >                                rcu_assign_pointer(node->slots[offset], slot);
> >                                node->count++;
> >                        } else
> > -                               rcu_assign_pointer(root->rnode,
> > -                                       radix_tree_ptr_to_indirect(slot));
> > +                               rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
> >                }
> >
> >                /* Go a level down */
> > @@ -372,7 +381,7 @@ static void *radix_tree_lookup_element(s
> >                        return NULL;
> >                return is_slot ? (void *)&root->rnode : node;
> >        }
> > -       node = radix_tree_indirect_to_ptr(node);
> > +       node = indirect_to_ptr(node);
> >
> >        height = node->height;
> >        if (index > radix_tree_maxindex(height))
> > @@ -391,7 +400,7 @@ static void *radix_tree_lookup_element(s
> >                height--;
> >        } while (height > 0);
> >
> > -       return is_slot ? (void *)slot:node;
> > +       return is_slot ? (void *)slot : indirect_to_ptr(node);
> >  }
> >
> >  /**
> > @@ -453,7 +462,7 @@ void *radix_tree_tag_set(struct radix_tr
> >        height = root->height;
> >        BUG_ON(index > radix_tree_maxindex(height));
> >
> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
> > +       slot = indirect_to_ptr(root->rnode);
> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> >
> >        while (height > 0) {
> > @@ -507,7 +516,7 @@ void *radix_tree_tag_clear(struct radix_
> >
> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> >        pathp->node = NULL;
> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
> > +       slot = indirect_to_ptr(root->rnode);
> >
> >        while (height > 0) {
> >                int offset;
> > @@ -577,7 +586,7 @@ int radix_tree_tag_get(struct radix_tree
> >
> >        if (!radix_tree_is_indirect_ptr(node))
> >                return (index == 0);
> > -       node = radix_tree_indirect_to_ptr(node);
> > +       node = indirect_to_ptr(node);
> >
> >        height = node->height;
> >        if (index > radix_tree_maxindex(height))
> > @@ -651,7 +660,7 @@ unsigned long radix_tree_range_tag_if_ta
> >        }
> >
> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
> > +       slot = indirect_to_ptr(root->rnode);
> >
> >        for (;;) {
> >                int offset;
> > @@ -861,7 +870,7 @@ radix_tree_gang_lookup(struct radix_tree
> >                results[0] = node;
> >                return 1;
> >        }
> > -       node = radix_tree_indirect_to_ptr(node);
> > +       node = indirect_to_ptr(node);
> >
> >        max_index = radix_tree_maxindex(node->height);
> >
> > @@ -880,7 +889,8 @@ radix_tree_gang_lookup(struct radix_tree
> >                        slot = *(((void ***)results)[ret + i]);
> >                        if (!slot)
> >                                continue;
> > -                       results[ret + nr_found] = rcu_dereference_raw(slot);
> > +                       results[ret + nr_found] =
> > +                               indirect_to_ptr(rcu_dereference_raw(slot));
> >                        nr_found++;
> >                }
> >                ret += nr_found;
> > @@ -929,7 +939,7 @@ radix_tree_gang_lookup_slot(struct radix
> >                results[0] = (void **)&root->rnode;
> >                return 1;
> >        }
> > -       node = radix_tree_indirect_to_ptr(node);
> > +       node = indirect_to_ptr(node);
> >
> >        max_index = radix_tree_maxindex(node->height);
> >
> > @@ -1054,7 +1064,7 @@ radix_tree_gang_lookup_tag(struct radix_
> >                results[0] = node;
> >                return 1;
> >        }
> > -       node = radix_tree_indirect_to_ptr(node);
> > +       node = indirect_to_ptr(node);
> >
> >        max_index = radix_tree_maxindex(node->height);
> >
> > @@ -1073,7 +1083,8 @@ radix_tree_gang_lookup_tag(struct radix_
> >                        slot = *(((void ***)results)[ret + i]);
> >                        if (!slot)
> >                                continue;
> > -                       results[ret + nr_found] = rcu_dereference_raw(slot);
> > +                       results[ret + nr_found] =
> > +                               indirect_to_ptr(rcu_dereference_raw(slot));
> >                        nr_found++;
> >                }
> >                ret += nr_found;
> > @@ -1123,7 +1134,7 @@ radix_tree_gang_lookup_tag_slot(struct r
> >                results[0] = (void **)&root->rnode;
> >                return 1;
> >        }
> > -       node = radix_tree_indirect_to_ptr(node);
> > +       node = indirect_to_ptr(node);
> >
> >        max_index = radix_tree_maxindex(node->height);
> >
> > @@ -1159,7 +1170,7 @@ static inline void radix_tree_shrink(str
> >                void *newptr;
> >
> >                BUG_ON(!radix_tree_is_indirect_ptr(to_free));
> > -               to_free = radix_tree_indirect_to_ptr(to_free);
> > +               to_free = indirect_to_ptr(to_free);
> >
> >                /*
> >                 * The candidate node has more than one child, or its child
> > @@ -1172,16 +1183,39 @@ static inline void radix_tree_shrink(str
> >
> >                /*
> >                 * We don't need rcu_assign_pointer(), since we are simply
> > -                * moving the node from one part of the tree to another. If
> > -                * it was safe to dereference the old pointer to it
> > +                * moving the node from one part of the tree to another: if it
> > +                * was safe to dereference the old pointer to it
> >                 * (to_free->slots[0]), it will be safe to dereference the new
> > -                * one (root->rnode).
> > +                * one (root->rnode) as far as dependent read barriers go.
> >                 */
> >                newptr = to_free->slots[0];
> >                if (root->height > 1)
> > -                       newptr = radix_tree_ptr_to_indirect(newptr);
> > +                       newptr = ptr_to_indirect(newptr);
> >                root->rnode = newptr;
> >                root->height--;
> > +
> > +               /*
> > +                * We have a dilemma here. The node's slot[0] must not be
> > +                * NULLed in case there are concurrent lookups expecting to
> > +                * find the item. However if this was a bottom-level node,
> > +                * then it may be subject to the slot pointer being visible
> > +                * to callers dereferencing it. If item corresponding to
> > +                * slot[0] is subsequently deleted, these callers would expect
> > +                * their slot to become empty sooner or later.
> > +                *
> > +                * For example, lockless pagecache will look up a slot, deref
> > +                * the page pointer, and if the page is 0 refcount it means it
> > +                * was concurrently deleted from pagecache so try the deref
> > +                * again. Fortunately there is already a requirement for logic
> > +                * to retry the entire slot lookup -- the indirect pointer
> > +                * problem (replacing direct root node with an indirect pointer
> > +                * also results in a stale slot). So tag the slot as indirect
> > +                * to force callers to retry.
> > +                */
> > +               if (root->height == 0)
> > +                       *((unsigned long *)&to_free->slots[0]) |=
> > +                                               RADIX_TREE_INDIRECT_PTR;
> > +
> >                radix_tree_node_free(to_free);
> >        }
> >  }
> > @@ -1218,7 +1252,7 @@ void *radix_tree_delete(struct radix_tre
> >                root->rnode = NULL;
> >                goto out;
> >        }
> > -       slot = radix_tree_indirect_to_ptr(slot);
> > +       slot = indirect_to_ptr(slot);
> >
> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> >        pathp->node = NULL;
> > @@ -1260,8 +1294,7 @@ void *radix_tree_delete(struct radix_tre
> >                        radix_tree_node_free(to_free);
> >
> >                if (pathp->node->count) {
> > -                       if (pathp->node ==
> > -                                       radix_tree_indirect_to_ptr(root->rnode))
> > +                       if (pathp->node == indirect_to_ptr(root->rnode))
> >                                radix_tree_shrink(root);
> >                        goto out;
> >                }
> > Index: linux-2.6/mm/filemap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/filemap.c 2010-08-18 00:14:14.000000000 +1000
> > +++ linux-2.6/mm/filemap.c      2010-08-18 01:07:20.000000000 +1000
> > @@ -631,7 +631,9 @@ repeat:
> >        pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
> >        if (pagep) {
> >                page = radix_tree_deref_slot(pagep);
> > -               if (unlikely(!page || page == RADIX_TREE_RETRY))
> > +               if (unlikely(!page))
> > +                       goto out;
> > +               if (radix_tree_deref_retry(page))
> >                        goto repeat;
> >
> >                if (!page_cache_get_speculative(page))
> > @@ -647,6 +649,7 @@ repeat:
> >                        goto repeat;
> >                }
> >        }
> > +out:
> >        rcu_read_unlock();
> >
> >        return page;
> > @@ -764,12 +767,11 @@ repeat:
> >                page = radix_tree_deref_slot((void **)pages[i]);
> >                if (unlikely(!page))
> >                        continue;
> > -               /*
> > -                * this can only trigger if nr_found == 1, making livelock
> > -                * a non issue.
> > -                */
> > -               if (unlikely(page == RADIX_TREE_RETRY))
> > +               if (radix_tree_deref_retry(page)) {
> > +                       if (ret)
> > +                               start = pages[ret-1]->index;
>
> What does the line above do?
>
>
> >                        goto restart;
> > +               }
> >
> >                if (!page_cache_get_speculative(page))
> >                        goto repeat;
> > @@ -817,11 +819,7 @@ repeat:
> >                page = radix_tree_deref_slot((void **)pages[i]);
> >                if (unlikely(!page))
> >                        continue;
> > -               /*
> > -                * this can only trigger if nr_found == 1, making livelock
> > -                * a non issue.
> > -                */
> > -               if (unlikely(page == RADIX_TREE_RETRY))
> > +               if (radix_tree_deref_retry(page))
> >                        goto restart;
> >
> >                if (page->mapping == NULL || page->index != index)
> > @@ -874,11 +872,7 @@ repeat:
> >                page = radix_tree_deref_slot((void **)pages[i]);
> >                if (unlikely(!page))
> >                        continue;
> > -               /*
> > -                * this can only trigger if nr_found == 1, making livelock
> > -                * a non issue.
> > -                */
> > -               if (unlikely(page == RADIX_TREE_RETRY))
> > +               if (radix_tree_deref_retry(page))
> >                        goto restart;
> >
> >                if (!page_cache_get_speculative(page))
> >
>
> I just verified that your patch fixes the issue.  Thanks for doing this.

Are we going to proceed with this?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-08-30 23:27           ` Salman Qazi
@ 2010-11-10 19:16             ` Salman Qazi
  2010-11-11  0:29               ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Salman Qazi @ 2010-11-10 19:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: paulmck, akpm, linux-kernel, a.p.zijlstra, adurbin

On Mon, Aug 30, 2010 at 4:27 PM, Salman Qazi <sqazi@google.com> wrote:
> On Tue, Aug 17, 2010 at 2:28 PM, Salman Qazi <sqazi@google.com> wrote:
>>
>> On Tue, Aug 17, 2010 at 8:23 AM, Nick Piggin <npiggin@kernel.dk> wrote:
>> > On Mon, Aug 16, 2010 at 11:30:07AM -0700, Salman Qazi wrote:
>> >> Done.
>> >>
>> >> ---
>> >>
>> >> This matters for the lockless page cache, in particular find_get_pages implementation.
>> >>
>> >> In the following case, we get can get a deadlock:
>> >>
>> >>     0.  The radix tree contains two items, one has the index 0.
>> >>     1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
>> >>     2.  The reader acquires slot(s) for item(s) including the index 0 item.
>> >>     3.  The non-zero index item is deleted, and as a consequence the other item
>> >>         is moved to the root of the tree.  The place where it used to be is
>> >>         queued for deletion after the readers finish.
>> >      3b. The zero item is deleted, removing it from the direct slot,
>> >          it remains in the rcu-delayed indirect node.
>> >>     4.  The reader looks at the index 0 slot, and finds that the page has 0 ref count
>> >>     5.  The reader looks at it again, hoping that the item will either be freed
>> >>         or the ref count will increase.  This never happens, as the slot it
>> >>         is looking at will never be updated.  Also, this slot can never be reclaimed
>> >>         because the reader is holding rcu_read_lock and is in an infinite loop.
>> >
>> > Good catch. Increasing memory footprint really sucks actually. 5MB is a
>> > lot of memory, and it also means another pointer dereference on small
>> > files.
>> >
>> > I actually do go to a lot of trouble already to have direct pointers.
>> > Unfortunately this is another little complexity, but I really think it's
>> > worthwhile.
>> >
>> > How about something like this?
>> > --
>> > Index: linux-2.6/include/linux/radix-tree.h
>> > ===================================================================
>> > --- linux-2.6.orig/include/linux/radix-tree.h   2010-08-18 00:01:19.000000000 +1000
>> > +++ linux-2.6/include/linux/radix-tree.h        2010-08-18 00:56:32.000000000 +1000
>> > @@ -34,19 +34,12 @@
>> >  * needed for RCU lookups (because root->height is unreliable). The only
>> >  * time callers need worry about this is when doing a lookup_slot under
>> >  * RCU.
>> > + *
>> > + * Indirect pointer in fact is also used to tag the last pointer of a node
>> > + * when it is shrunk, before we rcu free the node. See shrink code for
>> > + * details.
>> >  */
>> >  #define RADIX_TREE_INDIRECT_PTR        1
>> > -#define RADIX_TREE_RETRY ((void *)-1UL)
>> > -
>> > -static inline void *radix_tree_ptr_to_indirect(void *ptr)
>> > -{
>> > -       return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
>> > -}
>> > -
>> > -static inline void *radix_tree_indirect_to_ptr(void *ptr)
>> > -{
>> > -       return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
>> > -}
>> >
>> >  static inline int radix_tree_is_indirect_ptr(void *ptr)
>> >  {
>> > @@ -138,16 +131,29 @@ do {                                                                      \
>> >  *             removed.
>> >  *
>> >  * For use with radix_tree_lookup_slot().  Caller must hold tree at least read
>> > - * locked across slot lookup and dereference.  More likely, will be used with
>> > - * radix_tree_replace_slot(), as well, so caller will hold tree write locked.
>> > + * locked across slot lookup and dereference. Not required if write lock is
>> > + * held (ie. items cannot be concurrently inserted).
>> > + *
>> > + * radix_tree_deref_retry must be used to confirm validity of the pointer if
>> > + * only the read lock is held.
>> >  */
>> >  static inline void *radix_tree_deref_slot(void **pslot)
>> >  {
>> > -       void *ret = rcu_dereference(*pslot);
>> > -       if (unlikely(radix_tree_is_indirect_ptr(ret)))
>> > -               ret = RADIX_TREE_RETRY;
>> > -       return ret;
>> > +       return rcu_dereference(*pslot);
>> >  }
>> > +
>> > +/**
>> > + * radix_tree_deref_retry      - check radix_tree_deref_slot
>> > + * @arg:       pointer returned by radix_tree_deref_slot
>> > + * Returns:    0 if retry is not required, otherwise retry is required
>> > + *
>> > + * radix_tree_deref_retry must be used with radix_tree_deref_slot.
>> > + */
>> > +static inline int radix_tree_deref_retry(void *arg)
>> > +{
>> > +       return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
>> > +}
>> > +
>> >  /**
>> >  * radix_tree_replace_slot     - replace item in a slot
>> >  * @pslot:     pointer to slot, returned by radix_tree_lookup_slot
>> > Index: linux-2.6/lib/radix-tree.c
>> > ===================================================================
>> > --- linux-2.6.orig/lib/radix-tree.c     2010-08-18 00:01:19.000000000 +1000
>> > +++ linux-2.6/lib/radix-tree.c  2010-08-18 00:47:55.000000000 +1000
>> > @@ -82,6 +82,16 @@ struct radix_tree_preload {
>> >  };
>> >  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>> >
>> > +static inline void *ptr_to_indirect(void *ptr)
>> > +{
>> > +       return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
>> > +}
>> > +
>> > +static inline void *indirect_to_ptr(void *ptr)
>> > +{
>> > +       return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
>> > +}
>> > +
>> >  static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
>> >  {
>> >        return root->gfp_mask & __GFP_BITS_MASK;
>> > @@ -263,7 +273,7 @@ static int radix_tree_extend(struct radi
>> >                        return -ENOMEM;
>> >
>> >                /* Increase the height.  */
>> > -               node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
>> > +               node->slots[0] = indirect_to_ptr(root->rnode);
>> >
>> >                /* Propagate the aggregated tag info into the new root */
>> >                for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
>> > @@ -274,7 +284,7 @@ static int radix_tree_extend(struct radi
>> >                newheight = root->height+1;
>> >                node->height = newheight;
>> >                node->count = 1;
>> > -               node = radix_tree_ptr_to_indirect(node);
>> > +               node = ptr_to_indirect(node);
>> >                rcu_assign_pointer(root->rnode, node);
>> >                root->height = newheight;
>> >        } while (height > root->height);
>> > @@ -307,7 +317,7 @@ int radix_tree_insert(struct radix_tree_
>> >                        return error;
>> >        }
>> >
>> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
>> > +       slot = indirect_to_ptr(root->rnode);
>> >
>> >        height = root->height;
>> >        shift = (height-1) * RADIX_TREE_MAP_SHIFT;
>> > @@ -323,8 +333,7 @@ int radix_tree_insert(struct radix_tree_
>> >                                rcu_assign_pointer(node->slots[offset], slot);
>> >                                node->count++;
>> >                        } else
>> > -                               rcu_assign_pointer(root->rnode,
>> > -                                       radix_tree_ptr_to_indirect(slot));
>> > +                               rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
>> >                }
>> >
>> >                /* Go a level down */
>> > @@ -372,7 +381,7 @@ static void *radix_tree_lookup_element(s
>> >                        return NULL;
>> >                return is_slot ? (void *)&root->rnode : node;
>> >        }
>> > -       node = radix_tree_indirect_to_ptr(node);
>> > +       node = indirect_to_ptr(node);
>> >
>> >        height = node->height;
>> >        if (index > radix_tree_maxindex(height))
>> > @@ -391,7 +400,7 @@ static void *radix_tree_lookup_element(s
>> >                height--;
>> >        } while (height > 0);
>> >
>> > -       return is_slot ? (void *)slot:node;
>> > +       return is_slot ? (void *)slot : indirect_to_ptr(node);
>> >  }
>> >
>> >  /**
>> > @@ -453,7 +462,7 @@ void *radix_tree_tag_set(struct radix_tr
>> >        height = root->height;
>> >        BUG_ON(index > radix_tree_maxindex(height));
>> >
>> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
>> > +       slot = indirect_to_ptr(root->rnode);
>> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>> >
>> >        while (height > 0) {
>> > @@ -507,7 +516,7 @@ void *radix_tree_tag_clear(struct radix_
>> >
>> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>> >        pathp->node = NULL;
>> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
>> > +       slot = indirect_to_ptr(root->rnode);
>> >
>> >        while (height > 0) {
>> >                int offset;
>> > @@ -577,7 +586,7 @@ int radix_tree_tag_get(struct radix_tree
>> >
>> >        if (!radix_tree_is_indirect_ptr(node))
>> >                return (index == 0);
>> > -       node = radix_tree_indirect_to_ptr(node);
>> > +       node = indirect_to_ptr(node);
>> >
>> >        height = node->height;
>> >        if (index > radix_tree_maxindex(height))
>> > @@ -651,7 +660,7 @@ unsigned long radix_tree_range_tag_if_ta
>> >        }
>> >
>> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>> > -       slot = radix_tree_indirect_to_ptr(root->rnode);
>> > +       slot = indirect_to_ptr(root->rnode);
>> >
>> >        for (;;) {
>> >                int offset;
>> > @@ -861,7 +870,7 @@ radix_tree_gang_lookup(struct radix_tree
>> >                results[0] = node;
>> >                return 1;
>> >        }
>> > -       node = radix_tree_indirect_to_ptr(node);
>> > +       node = indirect_to_ptr(node);
>> >
>> >        max_index = radix_tree_maxindex(node->height);
>> >
>> > @@ -880,7 +889,8 @@ radix_tree_gang_lookup(struct radix_tree
>> >                        slot = *(((void ***)results)[ret + i]);
>> >                        if (!slot)
>> >                                continue;
>> > -                       results[ret + nr_found] = rcu_dereference_raw(slot);
>> > +                       results[ret + nr_found] =
>> > +                               indirect_to_ptr(rcu_dereference_raw(slot));
>> >                        nr_found++;
>> >                }
>> >                ret += nr_found;
>> > @@ -929,7 +939,7 @@ radix_tree_gang_lookup_slot(struct radix
>> >                results[0] = (void **)&root->rnode;
>> >                return 1;
>> >        }
>> > -       node = radix_tree_indirect_to_ptr(node);
>> > +       node = indirect_to_ptr(node);
>> >
>> >        max_index = radix_tree_maxindex(node->height);
>> >
>> > @@ -1054,7 +1064,7 @@ radix_tree_gang_lookup_tag(struct radix_
>> >                results[0] = node;
>> >                return 1;
>> >        }
>> > -       node = radix_tree_indirect_to_ptr(node);
>> > +       node = indirect_to_ptr(node);
>> >
>> >        max_index = radix_tree_maxindex(node->height);
>> >
>> > @@ -1073,7 +1083,8 @@ radix_tree_gang_lookup_tag(struct radix_
>> >                        slot = *(((void ***)results)[ret + i]);
>> >                        if (!slot)
>> >                                continue;
>> > -                       results[ret + nr_found] = rcu_dereference_raw(slot);
>> > +                       results[ret + nr_found] =
>> > +                               indirect_to_ptr(rcu_dereference_raw(slot));
>> >                        nr_found++;
>> >                }
>> >                ret += nr_found;
>> > @@ -1123,7 +1134,7 @@ radix_tree_gang_lookup_tag_slot(struct r
>> >                results[0] = (void **)&root->rnode;
>> >                return 1;
>> >        }
>> > -       node = radix_tree_indirect_to_ptr(node);
>> > +       node = indirect_to_ptr(node);
>> >
>> >        max_index = radix_tree_maxindex(node->height);
>> >
>> > @@ -1159,7 +1170,7 @@ static inline void radix_tree_shrink(str
>> >                void *newptr;
>> >
>> >                BUG_ON(!radix_tree_is_indirect_ptr(to_free));
>> > -               to_free = radix_tree_indirect_to_ptr(to_free);
>> > +               to_free = indirect_to_ptr(to_free);
>> >
>> >                /*
>> >                 * The candidate node has more than one child, or its child
>> > @@ -1172,16 +1183,39 @@ static inline void radix_tree_shrink(str
>> >
>> >                /*
>> >                 * We don't need rcu_assign_pointer(), since we are simply
>> > -                * moving the node from one part of the tree to another. If
>> > -                * it was safe to dereference the old pointer to it
>> > +                * moving the node from one part of the tree to another: if it
>> > +                * was safe to dereference the old pointer to it
>> >                 * (to_free->slots[0]), it will be safe to dereference the new
>> > -                * one (root->rnode).
>> > +                * one (root->rnode) as far as dependent read barriers go.
>> >                 */
>> >                newptr = to_free->slots[0];
>> >                if (root->height > 1)
>> > -                       newptr = radix_tree_ptr_to_indirect(newptr);
>> > +                       newptr = ptr_to_indirect(newptr);
>> >                root->rnode = newptr;
>> >                root->height--;
>> > +
>> > +               /*
>> > +                * We have a dilemma here. The node's slot[0] must not be
>> > +                * NULLed in case there are concurrent lookups expecting to
>> > +                * find the item. However if this was a bottom-level node,
>> > +                * then it may be subject to the slot pointer being visible
>> > +                * to callers dereferencing it. If item corresponding to
>> > +                * slot[0] is subsequently deleted, these callers would expect
>> > +                * their slot to become empty sooner or later.
>> > +                *
>> > +                * For example, lockless pagecache will look up a slot, deref
>> > +                * the page pointer, and if the page is 0 refcount it means it
>> > +                * was concurrently deleted from pagecache so try the deref
>> > +                * again. Fortunately there is already a requirement for logic
>> > +                * to retry the entire slot lookup -- the indirect pointer
>> > +                * problem (replacing direct root node with an indirect pointer
>> > +                * also results in a stale slot). So tag the slot as indirect
>> > +                * to force callers to retry.
>> > +                */
>> > +               if (root->height == 0)
>> > +                       *((unsigned long *)&to_free->slots[0]) |=
>> > +                                               RADIX_TREE_INDIRECT_PTR;
>> > +
>> >                radix_tree_node_free(to_free);
>> >        }
>> >  }
>> > @@ -1218,7 +1252,7 @@ void *radix_tree_delete(struct radix_tre
>> >                root->rnode = NULL;
>> >                goto out;
>> >        }
>> > -       slot = radix_tree_indirect_to_ptr(slot);
>> > +       slot = indirect_to_ptr(slot);
>> >
>> >        shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>> >        pathp->node = NULL;
>> > @@ -1260,8 +1294,7 @@ void *radix_tree_delete(struct radix_tre
>> >                        radix_tree_node_free(to_free);
>> >
>> >                if (pathp->node->count) {
>> > -                       if (pathp->node ==
>> > -                                       radix_tree_indirect_to_ptr(root->rnode))
>> > +                       if (pathp->node == indirect_to_ptr(root->rnode))
>> >                                radix_tree_shrink(root);
>> >                        goto out;
>> >                }
>> > Index: linux-2.6/mm/filemap.c
>> > ===================================================================
>> > --- linux-2.6.orig/mm/filemap.c 2010-08-18 00:14:14.000000000 +1000
>> > +++ linux-2.6/mm/filemap.c      2010-08-18 01:07:20.000000000 +1000
>> > @@ -631,7 +631,9 @@ repeat:
>> >        pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
>> >        if (pagep) {
>> >                page = radix_tree_deref_slot(pagep);
>> > -               if (unlikely(!page || page == RADIX_TREE_RETRY))
>> > +               if (unlikely(!page))
>> > +                       goto out;
>> > +               if (radix_tree_deref_retry(page))
>> >                        goto repeat;
>> >
>> >                if (!page_cache_get_speculative(page))
>> > @@ -647,6 +649,7 @@ repeat:
>> >                        goto repeat;
>> >                }
>> >        }
>> > +out:
>> >        rcu_read_unlock();
>> >
>> >        return page;
>> > @@ -764,12 +767,11 @@ repeat:
>> >                page = radix_tree_deref_slot((void **)pages[i]);
>> >                if (unlikely(!page))
>> >                        continue;
>> > -               /*
>> > -                * this can only trigger if nr_found == 1, making livelock
>> > -                * a non issue.
>> > -                */
>> > -               if (unlikely(page == RADIX_TREE_RETRY))
>> > +               if (radix_tree_deref_retry(page)) {
>> > +                       if (ret)
>> > +                               start = pages[ret-1]->index;
>>
>> What does the line above do?
>>
>>
>> >                        goto restart;
>> > +               }
>> >
>> >                if (!page_cache_get_speculative(page))
>> >                        goto repeat;
>> > @@ -817,11 +819,7 @@ repeat:
>> >                page = radix_tree_deref_slot((void **)pages[i]);
>> >                if (unlikely(!page))
>> >                        continue;
>> > -               /*
>> > -                * this can only trigger if nr_found == 1, making livelock
>> > -                * a non issue.
>> > -                */
>> > -               if (unlikely(page == RADIX_TREE_RETRY))
>> > +               if (radix_tree_deref_retry(page))
>> >                        goto restart;
>> >
>> >                if (page->mapping == NULL || page->index != index)
>> > @@ -874,11 +872,7 @@ repeat:
>> >                page = radix_tree_deref_slot((void **)pages[i]);
>> >                if (unlikely(!page))
>> >                        continue;
>> > -               /*
>> > -                * this can only trigger if nr_found == 1, making livelock
>> > -                * a non issue.
>> > -                */
>> > -               if (unlikely(page == RADIX_TREE_RETRY))
>> > +               if (radix_tree_deref_retry(page))
>> >                        goto restart;
>> >
>> >                if (!page_cache_get_speculative(page))
>> >
>>
>> I just verified that your patch fixes the issue.  Thanks for doing this.
>
> Are we going to proceed with this?
>


This bug is still unresolved.  Can we please follow this patch through
to conclusion?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Fixed a mismatch between the users of radix_tree and the implementation.
  2010-11-10 19:16             ` Salman Qazi
@ 2010-11-11  0:29               ` Nick Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2010-11-11  0:29 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Nick Piggin, paulmck, akpm, linux-kernel, a.p.zijlstra, adurbin

On Wed, Nov 10, 2010 at 11:16:21AM -0800, Salman Qazi wrote:
> On Mon, Aug 30, 2010 at 4:27 PM, Salman Qazi <sqazi@google.com> wrote:
> > On Tue, Aug 17, 2010 at 2:28 PM, Salman Qazi <sqazi@google.com> wrote:
> >> > -               if (unlikely(page == RADIX_TREE_RETRY))
> >> > +               if (radix_tree_deref_retry(page))
> >> >                        goto restart;
> >> >
> >> >                if (!page_cache_get_speculative(page))
> >> >
> >>
> >> I just verified that your patch fixes the issue.  Thanks for doing this.
> >
> > Are we going to proceed with this?
> >
> 
> 
> This bug is still unresolved.  Can we please follow this patch through
> to conclusion?

Yep, I'm about 30 minutes ahead of you (if you ignore the couple of
months I am behind).

I forgot to add you to cc, sorry -- patch just went to lkml and Linus.

Thanks for following up.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2010-11-11  0:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-13  7:15 [PATCH] Fixed a mismatch between the users of radix_tree and the implementation Salman Qazi
2010-08-13  7:20 ` Salman Qazi
2010-08-16 11:25   ` Peter Zijlstra
2010-08-16 18:30     ` Salman Qazi
2010-08-16 19:33       ` Peter Zijlstra
     [not found]         ` <AANLkTindpUo5tPiXVp6wXjOAqczrJvYegrOMBqSjr4_t@mail.gmail.com>
2010-08-16 21:05           ` Salman Qazi
2010-08-16 21:06           ` Peter Zijlstra
2010-08-17  4:35             ` Salman Qazi
2010-08-17  4:45               ` Salman Qazi
2010-08-17  8:42                 ` Peter Zijlstra
2010-08-17 15:23       ` Nick Piggin
2010-08-17 15:48         ` Peter Zijlstra
2010-08-17 19:49           ` Salman Qazi
2010-08-17 21:28         ` Salman Qazi
2010-08-30 23:27           ` Salman Qazi
2010-11-10 19:16             ` Salman Qazi
2010-11-11  0:29               ` Nick Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.