* [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
@ 2010-08-18 13:56 Dave Chinner
  2010-08-18 17:37 ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-18 13:56 UTC
  To: linux-kernel; +Cc: linux-fsdevel, npiggin, a.p.zijlstra, jack

Folks,

I'm seeing a livelock with the new writeback sync livelock avoidance
code. The problem is that the radix tree lookup via
pagevec_lookup_tag()->find_get_pages_tag() is getting stuck in
radix_tree_gang_lookup_tag_slot() and never exiting.

The reproducer I'm running is xfstests 013 on 2.6.35-rc1 with some
pending XFS changes available here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git for-oss

It's 100% reproducible, and a regression against 2.6.35 patched with exactly
the same extra XFS commits as the above branch.

I tried applying Nick's recent indirect pointer fixup patch for the
radix tree, but that didn't fix the problem. I applied the patch
below on top of that to detect when __lookup_tag is not making
progress and the livelock has gone away. Someone who knows how
the radix tree code is supposed to work might be able to pinpoint
the problem exactly from this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

---
 lib/radix-tree.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9eeb9f3..5d2872c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1077,6 +1077,11 @@ radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
 			break;
 		slots_found = __lookup_tag(node, (void ***)results + ret,
 				cur_index, max_items - ret, &next_index, tag);
+
+		/* livelock avoidance */
+		if (slots_found == 0 && cur_index == next_index)
+			break;
+
 		nr_found = 0;
 		for (i = 0; i < slots_found; i++) {
 			struct radix_tree_node *slot;
@@ -1147,6 +1152,9 @@ radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 			break;
 		slots_found = __lookup_tag(node, results + ret,
 				cur_index, max_items - ret, &next_index, tag);
+		/* livelock avoidance */
+		if (slots_found == 0 && cur_index == next_index)
+			break;
 		ret += slots_found;
 		if (next_index == 0)
 			break;


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-18 13:56 [bug] radix_tree_gang_lookup_tag_slot() looping endlessly Dave Chinner
@ 2010-08-18 17:37 ` Jan Kara
  2010-08-18 23:29   ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2010-08-18 17:37 UTC
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra, jack

  Hi,

On Wed 18-08-10 23:56:51, Dave Chinner wrote:
> I'm seeing a livelock with the new writeback sync livelock avoidance
> code. The problem is that the radix tree lookup via
> pagevec_lookup_tag()->find_get_pages_tag() is getting stuck in
> radix_tree_gang_lookup_tag_slot() and never exiting.
  Is this pagevec_lookup_tag() from write_cache_pages() which was called
for fsync() or so? 

> The reproducer I'm running is xfstests 013 on 2.6.35-rc1 with some
> pending XFS changes available here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git for-oss
> 
> It's 100% reproducible, and a regression against 2.6.35 patched with exactly
> the same extra XFS commits as the above branch.
  Hmm, what HW config do you have? I didn't hit the livelock and I've been
running xfstests several times with the livelock avoidance patch. Hmm,
looking at the code maybe what you describe could happen if we remove the
page from page cache but leave a dangling tag in the radix tree... But
remove_from_page_cache() is called with tree_lock held and it removes all
tags from the index we just remove so it shouldn't really happen. Could
you dump more info about the inode this happens on? Like the i_size, the
index we stall at... Thanks.

> I tried applying Nick's recent indirect pointer fixup patch for the
> radix tree, but that didn't fix the problem. I applied the patch
> below on top of that to detect when __lookup_tag is not making
> progress and the livelock has gone away. Someone who knows how
> the radix tree code is supposed to work might be able to pinpoint
> the problem exactly from this.

								Honza
> ---
>  lib/radix-tree.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 9eeb9f3..5d2872c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1077,6 +1077,11 @@ radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
>  			break;
>  		slots_found = __lookup_tag(node, (void ***)results + ret,
>  				cur_index, max_items - ret, &next_index, tag);
> +
> +		/* livelock avoidance */
> +		if (slots_found == 0 && cur_index == next_index)
> +			break;
> +
>  		nr_found = 0;
>  		for (i = 0; i < slots_found; i++) {
>  			struct radix_tree_node *slot;
> @@ -1147,6 +1152,9 @@ radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
>  			break;
>  		slots_found = __lookup_tag(node, results + ret,
>  				cur_index, max_items - ret, &next_index, tag);
> +		/* livelock avoidance */
> +		if (slots_found == 0 && cur_index == next_index)
> +			break;
>  		ret += slots_found;
>  		if (next_index == 0)
>  			break;
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-18 17:37 ` Jan Kara
@ 2010-08-18 23:29   ` Dave Chinner
  2010-08-19  7:25     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-18 23:29 UTC
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra

On Wed, Aug 18, 2010 at 07:37:09PM +0200, Jan Kara wrote:
>   Hi,
> 
> On Wed 18-08-10 23:56:51, Dave Chinner wrote:
> > I'm seeing a livelock with the new writeback sync livelock avoidance
> > code. The problem is that the radix tree lookup via
> > pagevec_lookup_tag()->find_get_pages_tag() is getting stuck in
> > radix_tree_gang_lookup_tag_slot() and never exiting.
>   Is this pagevec_lookup_tag() from write_cache_pages() which was called
> for fsync() or so? 

Called from a direct IO doing a cache flush-invalidate call
across the range the direct IO spans.

fsstress      R  running task        0  2514   2513 0x00000008
 ffff88007da5fa98 ffffffff8110c0d5 ffff88007da5fc28 ffff880078f0c418
 ffff88007da5fbc8 ffffffff8110ae7b ffff88007da5fb08 0000000000000297
 ffffffffffffffff 0000000100000000 ffff88007da5fb20 00000002810d79ae
Call Trace:
 [<ffffffff8110c0d5>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8110ae7b>] write_cache_pages+0x10b/0x490
 [<ffffffff81109d30>] ? __writepage+0x0/0x50
 [<ffffffff813fc1fe>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8110c7dc>] ? release_pages+0x20c/0x270
 [<ffffffff813fc2a4>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813f0ca2>] ? radix_tree_gang_lookup_slot+0x72/0xb0
 [<ffffffff8110b227>] generic_writepages+0x27/0x30
 [<ffffffff8130fc5d>] xfs_vm_writepages+0x5d/0x80
 [<ffffffff8110b254>] do_writepages+0x24/0x40
 [<ffffffff8110237b>] __filemap_fdatawrite_range+0x5b/0x60
 [<ffffffff811023da>] filemap_write_and_wait_range+0x5a/0x80
 [<ffffffff81103117>] generic_file_aio_read+0x417/0x6d0
 [<ffffffff81315f7c>] xfs_file_aio_read+0x15c/0x310
 [<ffffffff811456da>] do_sync_read+0xda/0x120
 [<ffffffff813c36ff>] ? security_file_permission+0x6f/0x80
 [<ffffffff81145a25>] vfs_read+0xc5/0x180
 [<ffffffff81146151>] sys_read+0x51/0x80
 [<ffffffff81036032>] system_call_fastpath+0x16/0x1b

From the writeback tracing, it is stuck with this writeback control:

fsstress-2514  [001] 950360.214327: wbc_writepage: bdi 253:0: towrt=9223372036854775807 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 more=0 older=0x0 start=0x79000 end=0x7fffffffffffffff
fsstress-2514  [001] 950360.214348: wbc_writepage: bdi 253:0: towrt=9223372036854775806 skip=0 mode=1 kupd=0 bgrd=0 reclm=0 cyclic=0 more=0 older=0x0 start=0x79000 end=0x7fffffffffffffff


> > The reproducer I'm running is xfstests 013 on 2.6.35-rc1 with some
> > pending XFS changes available here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git for-oss
> > 
> > It's 100% reproducible, and a regression against 2.6.35 patched with exactly
> > the same extra XFS commits as the above branch.
>   Hmm, what HW config do you have?

It's a VM started with:

$ cat /vm-images/vm-2/run-vm-2.sh 
#!/bin/sh
sudo /usr/bin/kvm \
        -kvm-shadow-memory 16 \
        -no-fd-bootchk \
        -localtime \
        -boot c \
        -serial pty \
        -nographic \
        -alt-grab \
        -smp 2 -m 2048 \
        -hda /vm-images/vm-2/root.img \
        -drive file=/vm-images/vm-2/vm-2-test.img,if=virtio,cache=none \
        -drive file=/vm-images/vm-2/vm-2-scratch.img,if=virtio,cache=none \
        -net nic,vlan=0,macaddr=00:e4:b6:63:63:6e,model=virtio \
        -net tap,vlan=0,script=/vm-images/qemu-ifup,downscript=no \
        -kernel /vm-images/vm-2/vmlinuz \
        -append "console=ttyS0,115200 root=/dev/sda1"


> I didn't hit the livelock and I've been
> running xfstests several times with the livelock avoidance patch.

Christoph hasn't seen it either.

> Hmm,
> looking at the code maybe what you describe could happen if we remove the
> page from page cache but leave a dangling tag in the radix tree... But
> remove_from_page_cache() is called with tree_lock held and it removes all
> tags from the index we just remove so it shouldn't really happen.

This might be a stupid question, but here goes anyway. I know the
slot contents are protected on lookup by rcu_read_lock() and
rcu_dereference_raw(), but what protects the tags on read? AFAICT,
they are being looked up without any locking, memory barriers, etc
w.r.t. deletion. i.e. I cannot see how a tag lookup is prevented
from racing with the propagation of a tag removal back up the tree
(which is done under the tree lock). What am I missing?

> Could
> you dump more info about the inode this happens on? Like the i_size, the
> index we stall at... Thanks.

From the writeback tracing I know that the index is different for
every stall, and given that it is fsstress producing the hang I'd
guess the inode is different every time, too. I'll try to get more
data on this later today.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-18 23:29   ` Dave Chinner
@ 2010-08-19  7:25     ` Dave Chinner
  2010-08-19 13:25       ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-19  7:25 UTC
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra

On Thu, Aug 19, 2010 at 09:29:17AM +1000, Dave Chinner wrote:
> On Wed, Aug 18, 2010 at 07:37:09PM +0200, Jan Kara wrote:
> >   Hi,
> > 
> > On Wed 18-08-10 23:56:51, Dave Chinner wrote:
> > > I'm seeing a livelock with the new writeback sync livelock avoidance
> > > code. The problem is that the radix tree lookup via
> > > pagevec_lookup_tag()->find_get_pages_tag() is getting stuck in
> > > radix_tree_gang_lookup_tag_slot() and never exiting.

[snip]

> 
> > Hmm,
> > looking at the code maybe what you describe could happen if we remove the
> > page from page cache but leave a dangling tag in the radix tree... But
> > remove_from_page_cache() is called with tree_lock held and it removes all
> > tags from the index we just remove so it shouldn't really happen.
> 
> This might be a stupid question, but here goes anyway. I know the
> slot contents are protected on lookup by rcu_read_lock() and
> rcu_dereference_raw(), but what protects the tags on read? AFAICT,
> they are being looked up without any locking, memory barriers, etc
> w.r.t. deletion. i.e. I cannot see how a tag lookup is prevented
> from racing with the propagation of a tag removal back up the tree
> (which is done under the tree lock). What am I missing?

Definitely looks like corrupted tags:

[   97.301618] lookup ino 9283137, size 2106992, mapping pages 146, root 0xffff880073d83e20, index 497, nr_pages 14, tag 1
[   97.301711] lookup ino 9283137, size 2106992, mapping pages 9, root 0xffff880073d83e20, index 75, nr_pages 14, tag 2
[   97.301713] livelock @ root 0xffff880073d83e20, index 256, first 75
[   97.301715] height 2
[   97.301716] shift 6
[   97.301717] tag_get 0xffff8800769f5b40, 4
[   97.301718] height 1
[   97.301719] shift 0
[   97.301720] no more slots 4
[   97.301721] livelock @ root 0xffff880073d83e20, index 256, first 75

The slot (#4) has the tag set, but the actual slot is empty and so
the lookup aborts without changing the index, and as such we have an
endless loop. In this case, it appears to have occurred directly
after the mapping was almost entirely invalidated....
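For anyone following along, here is a minimal userspace sketch of that failure mode. It is a toy two-level tree, not the kernel code: toy_lookup_tag() and toy_gang_lookup() are hypothetical stand-ins for __lookup_tag() and the gang lookup loop, and the 1000-iteration cap exists only so the demonstration terminates. The guard argument models the "livelock avoidance" check from the debug patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SHIFT 6
#define TOY_SIZE  (1UL << TOY_SHIFT)   /* 64 slots per node */
#define TOY_MAX   (TOY_SIZE * TOY_SIZE)

struct toy_leaf {
	uint64_t tags;                     /* one tag bit per page */
};

struct toy_root {
	uint64_t tags;                     /* one tag bit per child */
	struct toy_leaf *slots[TOY_SIZE];
};

/* Returns items found (at most one); *next_index is where to resume. */
static unsigned int toy_lookup_tag(const struct toy_root *root,
				   unsigned long index,
				   unsigned long *next_index)
{
	unsigned long i, j;
	const struct toy_leaf *leaf;

	assert(index < TOY_MAX);

	/* Skip untagged children, rounding index up to each child's start. */
	for (i = index >> TOY_SHIFT; i < TOY_SIZE; i++, index = i << TOY_SHIFT)
		if (root->tags & (UINT64_C(1) << i))
			break;
	if (i == TOY_SIZE) {
		*next_index = 0;           /* scanned everything */
		return 0;
	}

	leaf = root->slots[i];
	if (leaf == NULL) {
		/* Stale parent tag: give up without moving past this child. */
		*next_index = index;
		return 0;
	}

	for (j = index & (TOY_SIZE - 1); j < TOY_SIZE; j++) {
		if (leaf->tags & (UINT64_C(1) << j)) {
			index = (i << TOY_SHIFT) + j + 1;
			*next_index = index < TOY_MAX ? index : 0;
			return 1;
		}
	}
	index = (i + 1) << TOY_SHIFT;
	*next_index = index < TOY_MAX ? index : 0;
	return 0;
}

/* Returns 0 on normal termination, -1 if the iteration cap was hit. */
static int toy_gang_lookup(const struct toy_root *root, unsigned long first,
			   unsigned int max_items, int guard,
			   unsigned int *ret)
{
	unsigned long cur_index = first;
	unsigned int iter;

	*ret = 0;
	for (iter = 0; iter < 1000; iter++) {
		unsigned long next_index;
		unsigned int slots_found;

		if (*ret >= max_items)
			return 0;
		slots_found = toy_lookup_tag(root, cur_index, &next_index);
		/* Same test as the debug patch: no items and no progress. */
		if (guard && slots_found == 0 && cur_index == next_index)
			return 0;
		*ret += slots_found;
		if (next_index == 0)
			return 0;
		cur_index = next_index;
	}
	return -1;                         /* would have spun forever */
}
```

With the tag bit for child 4 set and that child pointer NULL, a lookup starting at index 75 advances to 256 once and then stays there, which matches the "index 256, first 75" lines in the trace.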

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-19  7:25     ` Dave Chinner
@ 2010-08-19 13:25       ` Dave Chinner
  2010-08-19 15:58         ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-19 13:25 UTC
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra

On Thu, Aug 19, 2010 at 05:25:20PM +1000, Dave Chinner wrote:
> On Thu, Aug 19, 2010 at 09:29:17AM +1000, Dave Chinner wrote:
> > On Wed, Aug 18, 2010 at 07:37:09PM +0200, Jan Kara wrote:
> > >   Hi,
> > > 
> > > On Wed 18-08-10 23:56:51, Dave Chinner wrote:
> > > > I'm seeing a livelock with the new writeback sync livelock avoidance
> > > > code. The problem is that the radix tree lookup via
> > > > pagevec_lookup_tag()->find_get_pages_tag() is getting stuck in
> > > > radix_tree_gang_lookup_tag_slot() and never exiting.
> 
> [snip]
> 
> > 
> > > Hmm,
> > > looking at the code maybe what you describe could happen if we remove the
> > > page from page cache but leave a dangling tag in the radix tree... But
> > > remove_from_page_cache() is called with tree_lock held and it removes all
> > > tags from the index we just remove so it shouldn't really happen.
> > 
> > This might be a stupid question, but here goes anyway. I know the
> > slot contents are protected on lookup by rcu_read_lock() and
> > rcu_dereference_raw(), but what protects the tags on read? AFAICT,
> > they are being looked up without any locking, memory barriers, etc
> > w.r.t. deletion. i.e. I cannot see how a tag lookup is prevented
> > from racing with the propagation of a tag removal back up the tree
> > (which is done under the tree lock). What am I missing?
> 
> Definitely looks like corrupted tags:
> 
> [   97.301618] lookup ino 9283137, size 2106992, mapping pages 146, root 0xffff880073d83e20, index 497, nr_pages 14, tag 1
> [   97.301711] lookup ino 9283137, size 2106992, mapping pages 9, root 0xffff880073d83e20, index 75, nr_pages 14, tag 2
> [   97.301713] livelock @ root 0xffff880073d83e20, index 256, first 75
> [   97.301715] height 2
> [   97.301716] shift 6
> [   97.301717] tag_get 0xffff8800769f5b40, 4
> [   97.301718] height 1
> [   97.301719] shift 0
> [   97.301720] no more slots 4
> [   97.301721] livelock @ root 0xffff880073d83e20, index 256, first 75
> 
> The slot (#4) has the tag set, but the actual slot is empty and so
> the lookup aborts without changing the index, and as such we have an
> endless loop. In this case, it appears to have occurred directly
> after the mapping was almost entirely invalidated....

And it looks like the corrupted tags are coming through
radix_tree_set_tag_if_tagged:

[   29.533595] tag @ root 0xffff880078088d60, pages 261 (466 -> 472), nr 0
[   29.534410] settag root @ 0xffff880078088d60, index 466, offset 7, height 2, shift 6
[   29.535331] slot[settag] 0x80 iftag 0x88
[   29.535805] leveldown root @ 0xffff880078088d60, index 466, offset 7, height 1, shift 0
[   29.536842] tag @ root 0xffff880078088d60, pages 261 (473 -> 472), nr 0
                                                         ^^^^^^^^^^   ^^^^

Here we've tried to set the tags on the index 466 -> 472, but we have scanned
to index 472 and not set any tags on pages. *However*, because
radix_tree_set_tag_if_tagged() does a top-down traversal it has set the
tag on the parent node before checking if any of the child nodes can
have the tag set.

Hence when radix_tree_gang_lookup_tag_slot() comes along:

[   29.543718] lookup ino 4452983, size 1453202, mapping pages 256, root 0xffff880078088d60, index 502, nr_pages 14, tag 2
[   29.545015] livelock @ root 0xffff880078088d60, index 502, first 502
[   29.545785] height 2
[   29.546117] shift 6
[   29.546381] tag_get 0xffff880078040d70, 7
                                           ^
The parent node has the tag set for slot 7

[   29.546862] slot[tag] 0x80
[   29.547192] height 1
[   29.547461] shift 0
[   29.547721] no more slots 7

but slot 7 has no children. Because the children didn't have the
TO_WRITE tag set, the tags on the parent node never got removed.
Hence we get a livelock because whenever this bad tag is encountered
we abort without increasing the start index, so we re-enter and
traverse exactly the same path again....

[   29.548090] livelock @ root 0xffff880078088d60, index 502, first 502

It looks to me like radix_tree_set_tag_if_tagged() is fundamentally
broken.  All the tag set/clear code stores the tree path in a cursor
and uses that to propagate the tags if and only if the full path
from root to leaf is resolved. radix_tree_set_tag_if_tagged() sets
tags on intermediate nodes before it has resolved the full path and
hence can set tags when it should not. The "should not" cases occur
when we have to tag sub-ranges or the scan aborts because it's
reached the number to tag in a batch.
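To make the sub-range case concrete, here is a small userspace model of the top-down ordering. This is an illustration only: toy_tag_top_down() and the toy_* structures are made up for this mail and are not the kernel implementation. The point is just that the parent gets its TOWRITE bit before we know whether any page under it falls inside the requested range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SHIFT  6
#define TOY_SIZE   64UL
#define TOY_MAX    (TOY_SIZE * TOY_SIZE)
#define TOY_BIT(n) (UINT64_C(1) << (n))

enum { TOY_DIRTY, TOY_TOWRITE, TOY_NR_TAGS };

struct toy_leaf {
	uint64_t tags[TOY_NR_TAGS];        /* one bit per page, per tag */
};

struct toy_root {
	uint64_t tags[TOY_NR_TAGS];        /* one bit per child, per tag */
	struct toy_leaf *slots[TOY_SIZE];
};

/*
 * Tag every DIRTY page in [first, last] as TOWRITE, walking top-down.
 * Returns the number of pages tagged.
 */
static unsigned int toy_tag_top_down(struct toy_root *root,
				     unsigned long first, unsigned long last)
{
	unsigned int tagged = 0;
	unsigned long i, j;

	assert(first <= last && last < TOY_MAX);

	for (i = first >> TOY_SHIFT; i <= last >> TOY_SHIFT; i++) {
		struct toy_leaf *leaf = root->slots[i];

		if (!(root->tags[TOY_DIRTY] & TOY_BIT(i)) || leaf == NULL)
			continue;

		/*
		 * The problematic step: the parent is tagged on the way
		 * down, before any page in the range is known to qualify.
		 */
		root->tags[TOY_TOWRITE] |= TOY_BIT(i);

		for (j = 0; j < TOY_SIZE; j++) {
			unsigned long index = (i << TOY_SHIFT) | j;

			if (index < first || index > last)
				continue;
			if (leaf->tags[TOY_DIRTY] & TOY_BIT(j)) {
				leaf->tags[TOY_TOWRITE] |= TOY_BIT(j);
				tagged++;
			}
		}
	}
	return tagged;
}
```

If the only dirty page under child 7 is at index 478 and the requested range is 466 -> 472, this returns 0 with no leaf tagged, yet bit 7 of the root's TOWRITE tags stays set. That is the dangling tag a later tagged lookup trips over.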

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-19 13:25       ` Dave Chinner
@ 2010-08-19 15:58         ` Jan Kara
  2010-08-19 22:25           ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2010-08-19 15:58 UTC
  To: Dave Chinner; +Cc: Jan Kara, linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra

  Hi Dave,

On Thu 19-08-10 23:25:52, Dave Chinner wrote:
> It looks to me like radix_tree_set_tag_if_tagged() is fundamentally
> broken.  All the tag set/clear code stores the tree path in a cursor
> and uses that to propagate the tags if and only if the full path
> from root to leaf is resolved. radix_tree_set_tag_if_tagged() sets
> tags on intermediate nodes before it has resolved the full path and
> hence can set tags when it should not. The "should not" cases occur
> when we have to tag sub-ranges or the scan aborts because it's
> reached the number to tag in a batch.
  Thanks for debugging this! You are right that the code can leave dangling
tag when we end the scan at the end of given range but the first tagged
leaf is after the end of the given range (there shouldn't be a problem with
the batches because there we can exit only just after we tag a leaf so that
should be OK).
  There are two possibilities how to fix the bug:
a) Always tag bottom up - i.e., when we see leaf that should be tagged, go
up and tag the parent as well if it is not already tagged.
b) When we exit the search and we didn't set any leaf tag since last
time we went down, we walk up the tree and do an equivalent of
radix_tree_clear_tag().
  I'll probably go for a) since it looks more robust but b) would be
probably faster.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-19 15:58         ` Jan Kara
@ 2010-08-19 22:25           ` Dave Chinner
  2010-08-20  2:04             ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-08-19 22:25 UTC
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra

On Thu, Aug 19, 2010 at 05:58:39PM +0200, Jan Kara wrote:
>   Hi Dave,
> 
> On Thu 19-08-10 23:25:52, Dave Chinner wrote:
> > It looks to me like radix_tree_set_tag_if_tagged() is fundamentally
> > broken.  All the tag set/clear code stores the tree path in a cursor
> > and uses that to propagate the tags if and only if the full path
> > from root to leaf is resolved. radix_tree_set_tag_if_tagged() sets
> > tags on intermediate nodes before it has resolved the full path and
> > hence can set tags when it should not. The "should not" cases occur
> > when we have to tag sub-ranges or the scan aborts because it's
> > reached the number to tag in a batch.
>   Thanks for debugging this! You are right that the code can leave dangling
> tag when we end the scan at the end of given range but the first tagged
> leaf is after the end of the given range (there shouldn't be a problem with
> the batches because there we can exit only just after we tag a leaf so that
> should be OK).
>   There are two possibilities how to fix the bug:
> a) Always tag bottom up - i.e., when we see leaf that should be tagged, go
> up and tag the parent as well if it is not already tagged.
> b) When we exit the search and we didn't set any leaf tag since last
> time we went down, we walk up the tree and do an equivalent of
> radix_tree_clear_tag().
>   I'll probably go for a) since it looks more robust but b) would be
> probably faster.

I think that when it comes to data integrity, more robust should
win over speed every time. I think it can be done quite easily,
though, having slept on it - we have the current path in the
open_slots[] array, so we could just walk that when we set a leaf
tag. That should be easy to optimise as well - just keep track of
how high up the path we have set the tag and only walk that far
when setting the tags. That way we don't continually set the tag on
the root higher level slots. That shouldn't be any slower than the
current code...
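A rough sketch of what I mean, as a userspace toy rather than a patch. Everything here is hypothetical: the tree is modelled as one bitmap per level (4 slots per node, height 3, so 64 indices) instead of real nodes and an open_slots[] array, and "how high up we have set the tag" is approximated by stopping at the first ancestor that already has the tag. That shortcut is only valid as long as tags are set exclusively through toy_set_path().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SHIFT   2                  /* 4 slots per node, for brevity */
#define TOY_HEIGHT  3                  /* 4^3 = 64 indices */
#define TOY_MAX     (1UL << (TOY_SHIFT * TOY_HEIGHT))

struct toy_tags {
	/* bits[l] holds one bit per slot at level l; level 0 is the pages. */
	uint64_t bits[TOY_HEIGHT];
};

static int toy_test(const struct toy_tags *t, unsigned int level,
		    unsigned long index)
{
	return (int)((t->bits[level] >> (index >> (level * TOY_SHIFT))) & 1);
}

/*
 * Tag the page at index, then walk up towards the root. Stop at the
 * first slot that is already tagged: everything above it must be
 * tagged too. Returns how many bits were newly set.
 */
static unsigned int toy_set_path(struct toy_tags *t, unsigned long index)
{
	unsigned int level, newly_set = 0;

	assert(index < TOY_MAX);
	for (level = 0; level < TOY_HEIGHT; level++) {
		uint64_t bit = UINT64_C(1) << (index >> (level * TOY_SHIFT));

		if (t->bits[level] & bit)
			break;
		t->bits[level] |= bit;
		newly_set++;
	}
	return newly_set;
}

/* Bottom-up: a parent is only ever tagged after a leaf under it was. */
static unsigned int toy_tag_if_tagged(const struct toy_tags *iftag,
				      struct toy_tags *settag,
				      unsigned long first, unsigned long last)
{
	unsigned long index;
	unsigned int tagged = 0;

	assert(first <= last && last < TOY_MAX);
	for (index = first; index <= last; index++) {
		if (!toy_test(iftag, 0, index))
			continue;
		toy_set_path(settag, index);
		tagged++;
	}
	return tagged;
}
```

With pages 10 and 30 dirty, tagging the range 16 -> 29 touches nothing at any level even though page 30 shares ancestors with that range, and tagging a second page under an already-tagged parent sets only the leaf bit. A real implementation would also use the iftag bits on the upper levels to skip whole subtrees instead of scanning index by index.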

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [bug] radix_tree_gang_lookup_tag_slot() looping endlessly
  2010-08-19 22:25           ` Dave Chinner
@ 2010-08-20  2:04             ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2010-08-20  2:04 UTC
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, npiggin, a.p.zijlstra

On Fri, Aug 20, 2010 at 08:25:59AM +1000, Dave Chinner wrote:
> On Thu, Aug 19, 2010 at 05:58:39PM +0200, Jan Kara wrote:
> >   Hi Dave,
> > 
> > On Thu 19-08-10 23:25:52, Dave Chinner wrote:
> > > It looks to me like radix_tree_set_tag_if_tagged() is fundamentally
> > > broken.  All the tag set/clear code stores the tree path in a cursor
> > > and uses that to propagate the tags if and only if the full path
> > > from root to leaf is resolved. radix_tree_set_tag_if_tagged() sets
> > > tags on intermediate nodes before it has resolved the full path and
> > > hence can set tags when it should not. The "should not" cases occur
> > > when we have to tag sub-ranges or the scan aborts because it's
> > > reached the number to tag in a batch.
> >   Thanks for debugging this! You are right that the code can leave dangling
> > tag when we end the scan at the end of given range but the first tagged
> > leaf is after the end of the given range (there shouldn't be a problem with
> > the batches because there we can exit only just after we tag a leaf so that
> > should be OK).
> >   There are two possibilities how to fix the bug:
> > a) Always tag bottom up - i.e., when we see leaf that should be tagged, go
> > up and tag the parent as well if it is not already tagged.
> > b) When we exit the search and we didn't set any leaf tag since last
> > time we went down, we walk up the tree and do an equivalent of
> > radix_tree_clear_tag().
> >   I'll probably go for a) since it looks more robust but b) would be
> > probably faster.
> 
> I think that when it comes to data integrity, more robust should
> win over speed every time. I think it can be done quite easily,
> though, having slept on it - we have the current path in the
> open_slots[] array, so we could just walk that when we set a leaf
> tag. That should be easy to optimise as well - just keep track of
> how high up the path we have set the tag and only walk that far
> when setting the tags. That way we don't continually set the tag on
> the root higher level slots. That shouldn't be any slower than the
> current code...

Fixing this indicates that there is a second bug also corrupting the
PAGECACHE_TAG_TOWRITE tags - it takes quite a bit longer to hit, but
when it fails it is generally because the bit at slot offset zero in
a high-up intermediate node is incorrectly set. It appears that none
of the code is actually setting it, so it's been quite difficult to
track down.

Eventually I noticed through code inspection that
radix_tree_node_rcu_free() clears the tags at offset zero because
the radix_tree_shrink() implementation can potentially leave the
first slot non-null. The addition of the third tag did not add
this clearing of the tag in the zero slot.  Adding this:

 	 */
 	tag_clear(node, 0, 0);
 	tag_clear(node, 1, 0);
+	tag_clear(node, 2, 0);
 	node->slots[0] = NULL;
 	node->count = 0;
 
To radix_tree_node_rcu_free() appears to fix the problem. Whoever
failed to comment the definition of the number of tags the radix tree
supports left a really nasty landmine that Jan stepped on. Cleaning
up the mess hasn't been pretty, either.
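One way to defuse that kind of landmine is to loop over the tag count instead of listing the tags by hand, so a newly added tag is cleared automatically. The following is a userspace sketch under assumptions, not a proposed patch: toy_node, toy_tag_clear() and toy_node_reset() are invented names, and TOY_MAX_TAGS = 3 assumes the three page cache tags (dirty, writeback, towrite).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TAGS 3                 /* assumed: dirty, writeback, towrite */
#define TOY_MAP_SIZE 64

struct toy_node {
	unsigned int count;
	void *slots[TOY_MAP_SIZE];
	unsigned long long tags[TOY_MAX_TAGS];   /* one bit per slot */
};

static void toy_tag_clear(struct toy_node *node, unsigned int tag,
			  unsigned int offset)
{
	assert(tag < TOY_MAX_TAGS && offset < TOY_MAP_SIZE);
	node->tags[tag] &= ~(1ULL << offset);
}

/*
 * Reset the state that a shrink operation may have left behind in
 * slot 0 before the node is reused. Looping to TOY_MAX_TAGS means
 * nobody has to remember to add a line here when a tag is added.
 */
static void toy_node_reset(struct toy_node *node)
{
	unsigned int tag;

	for (tag = 0; tag < TOY_MAX_TAGS; tag++)
		toy_tag_clear(node, tag, 0);
	node->slots[0] = NULL;
	node->count = 0;
}
```

Only slot 0 is touched on purpose: the other slots and their tag bits are expected to be clean already by the time the node is freed.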

So, after a couple of days of debugging I finally have test
013 passing without failing. Now to clean up the mess I have and
test some proper patches....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

