All of lore.kernel.org
 help / color / mirror / Atom feed
* BUG: soft lockup - is this XFS problem?
@ 2008-12-19  6:59 Roman Kononov
  2008-12-23 17:12   ` Christoph Hellwig
  0 siblings, 1 reply; 74+ messages in thread
From: Roman Kononov @ 2008-12-19  6:59 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: log.txt --]
[-- Type: text/plain, Size: 4162 bytes --]

Dec 19 00:34:18 10.10.10.100 kernel: BUG: soft lockup - CPU#0 stuck for 61s! [postmaster:23237]
Dec 19 00:34:18 10.10.10.100 kernel: Modules linked in: xd1000
Dec 19 00:34:18 10.10.10.100 kernel: CPU 0:
Dec 19 00:34:18 10.10.10.100 kernel: Modules linked in: xd1000
Dec 19 00:34:18 10.10.10.100 kernel: Pid: 23237, comm: postmaster Not tainted 2.6.27.9 #1
Dec 19 00:34:18 10.10.10.100 kernel: RIP: 0010:[<ffffffff8026c872>]  [<ffffffff8026c872>] find_get_pages+0x72/0x120
Dec 19 00:34:18 10.10.10.100 kernel: RSP: 0018:ffff88012e9f3498  EFLAGS: 00000297
Dec 19 00:34:18 10.10.10.100 kernel: RAX: ffff8800a4d752a0 RBX: 000000000000000c RCX: 0000000000000003
Dec 19 00:34:18 10.10.10.100 kernel: RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffe200004ab780
Dec 19 00:34:18 10.10.10.100 kernel: RBP: ffff88023f6b5028 R08: ffffe200004ab280 R09: 000000000000000d
Dec 19 00:34:18 10.10.10.100 kernel: R10: 0000000000000021 R11: 00000000000aef22 R12: ffffffff80273e3c
Dec 19 00:34:18 10.10.10.100 kernel: R13: ffffe20001208608 R14: 0100000000000286 R15: ffff88023f6b5028
Dec 19 00:34:18 10.10.10.100 kernel: FS:  00007fd397fb5700(0000) GS:ffffffff806d7540(0000) knlGS:0000000000000000
Dec 19 00:34:18 10.10.10.100 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 19 00:34:18 10.10.10.100 kernel: CR2: 00002aaaaba00000 CR3: 000000017911c000 CR4: 00000000000006e0
Dec 19 00:34:18 10.10.10.100 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 19 00:34:18 10.10.10.100 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 19 00:34:18 10.10.10.100 kernel:
Dec 19 00:34:18 10.10.10.100 kernel: Call Trace:
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8026c842>] ? find_get_pages+0x42/0x120
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff80276107>] ? pagevec_lookup+0x17/0x20
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff803a6701>] ? xfs_cluster_write+0x91/0x160
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff803a6e73>] ? xfs_page_state_convert+0x523/0x6c0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff803a7301>] ? xfs_vm_writepage+0x71/0x120
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff80278092>] ? shrink_page_list+0x592/0x700
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff802784b7>] ? shrink_zone+0x2b7/0xc70
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff802798c4>] ? try_to_free_pages+0x244/0x3b0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff80277920>] ? isolate_pages_global+0x0/0x40
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8027b2d3>] ? congestion_wait+0x83/0xa0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8024f5f0>] ? autoremove_wake_function+0x0/0x30
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff80273668>] ? __alloc_pages_internal+0x218/0x4e0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8026d08f>] ? __grab_cache_page+0x6f/0xc0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff802c69ad>] ? block_write_begin+0x7d/0xe0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff803a71e2>] ? xfs_vm_write_begin+0x22/0x30
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff803a5e10>] ? xfs_get_blocks+0x0/0x10
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8026df5b>] ? generic_file_buffered_write+0x1cb/0x790
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8059145f>] ? _spin_lock_irqsave+0x1f/0x50
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff803ae63c>] ? xfs_write+0x65c/0x950
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff80591681>] ? _spin_unlock_irq+0x11/0x40
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8029c0cb>] ? do_sync_write+0xdb/0x120
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8025c169>] ? do_futex+0x109/0x9f0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8024f5f0>] ? autoremove_wake_function+0x0/0x30
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff802315f0>] ? wake_up_new_task+0xc0/0x100
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8029caeb>] ? vfs_write+0xcb/0x170
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8029cc93>] ? sys_write+0x53/0xa0
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8020c44b>] ? system_call_fastpath+0x16/0x1b
Dec 19 00:34:18 10.10.10.100 kernel:  [<ffffffff8020c44b>] ? system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2008-12-19  6:59 BUG: soft lockup - is this XFS problem? Roman Kononov
@ 2008-12-23 17:12   ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2008-12-23 17:12 UTC (permalink / raw)
  To: Roman Kononov; +Cc: npiggin, linux-kernel, xfs


Nick, I've seen various reports like this by Roman.  It seems to be
caused by an interaction of the lockless pagecache with the xfs
I/O code.  Any idea what might be wrong here:

BUG: soft lockup - CPU#0 stuck for 61s! [postmaster:23237]
Modules linked in: xd1000
CPU 0:
Modules linked in: xd1000
Pid: 23237, comm: postmaster Not tainted 2.6.27.9 #1
RIP: 0010:[<ffffffff8026c872>]  [<ffffffff8026c872>] find_get_pages+0x72/0x120
RSP: 0018:ffff88012e9f3498  EFLAGS: 00000297
RAX: ffff8800a4d752a0 RBX: 000000000000000c RCX: 0000000000000003
RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffe200004ab780
RBP: ffff88023f6b5028 R08: ffffe200004ab280 R09: 000000000000000d
R10: 0000000000000021 R11: 00000000000aef22 R12: ffffffff80273e3c
R13: ffffe20001208608 R14: 0100000000000286 R15: ffff88023f6b5028
FS:  00007fd397fb5700(0000) GS:ffffffff806d7540(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaba00000 CR3: 000000017911c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 kernel:  [<ffffffff8026c842>] ? find_get_pages+0x42/0x120
 [<ffffffff80276107>] ? pagevec_lookup+0x17/0x20
 [<ffffffff803a6701>] ? xfs_cluster_write+0x91/0x160
 [<ffffffff803a6e73>] ? xfs_page_state_convert+0x523/0x6c0
 [<ffffffff803a7301>] ? xfs_vm_writepage+0x71/0x120
 [<ffffffff80278092>] ? shrink_page_list+0x592/0x700
 [<ffffffff802784b7>] ? shrink_zone+0x2b7/0xc70
 [<ffffffff802798c4>] ? try_to_free_pages+0x244/0x3b0
 [<ffffffff80277920>] ? isolate_pages_global+0x0/0x40
 [<ffffffff8027b2d3>] ? congestion_wait+0x83/0xa0
 [<ffffffff8024f5f0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff80273668>] ? __alloc_pages_internal+0x218/0x4e0
 [<ffffffff8026d08f>] ? __grab_cache_page+0x6f/0xc0
 [<ffffffff802c69ad>] ? block_write_begin+0x7d/0xe0
 [<ffffffff803a71e2>] ? xfs_vm_write_begin+0x22/0x30
 [<ffffffff803a5e10>] ? xfs_get_blocks+0x0/0x10
 [<ffffffff8026df5b>] ? generic_file_buffered_write+0x1cb/0x790
 [<ffffffff8059145f>] ? _spin_lock_irqsave+0x1f/0x50
 [<ffffffff803ae63c>] ? xfs_write+0x65c/0x950
 [<ffffffff80591681>] ? _spin_unlock_irq+0x11/0x40
 [<ffffffff8029c0cb>] ? do_sync_write+0xdb/0x120
 [<ffffffff8025c169>] ? do_futex+0x109/0x9f0
 [<ffffffff8024f5f0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff802315f0>] ? wake_up_new_task+0xc0/0x100
 [<ffffffff8029caeb>] ? vfs_write+0xcb/0x170
 [<ffffffff8029cc93>] ? sys_write+0x53/0xa0
 [<ffffffff8020c44b>] ? system_call_fastpath+0x16/0x1b
 [<ffffffff8020c44b>] ? system_call_fastpath+0x16/0x1b


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2008-12-23 17:12   ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2008-12-23 17:12 UTC (permalink / raw)
  To: Roman Kononov; +Cc: npiggin, linux-kernel, xfs


Nick, I've seen various reports like this by Roman.  It seems to be
caused by an interaction of the lockless pagecache with the xfs
I/O code.  Any idea what might be wrong here:

BUG: soft lockup - CPU#0 stuck for 61s! [postmaster:23237]
Modules linked in: xd1000
CPU 0:
Modules linked in: xd1000
Pid: 23237, comm: postmaster Not tainted 2.6.27.9 #1
RIP: 0010:[<ffffffff8026c872>]  [<ffffffff8026c872>] find_get_pages+0x72/0x120
RSP: 0018:ffff88012e9f3498  EFLAGS: 00000297
RAX: ffff8800a4d752a0 RBX: 000000000000000c RCX: 0000000000000003
RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffe200004ab780
RBP: ffff88023f6b5028 R08: ffffe200004ab280 R09: 000000000000000d
R10: 0000000000000021 R11: 00000000000aef22 R12: ffffffff80273e3c
R13: ffffe20001208608 R14: 0100000000000286 R15: ffff88023f6b5028
FS:  00007fd397fb5700(0000) GS:ffffffff806d7540(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaba00000 CR3: 000000017911c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 kernel:  [<ffffffff8026c842>] ? find_get_pages+0x42/0x120
 [<ffffffff80276107>] ? pagevec_lookup+0x17/0x20
 [<ffffffff803a6701>] ? xfs_cluster_write+0x91/0x160
 [<ffffffff803a6e73>] ? xfs_page_state_convert+0x523/0x6c0
 [<ffffffff803a7301>] ? xfs_vm_writepage+0x71/0x120
 [<ffffffff80278092>] ? shrink_page_list+0x592/0x700
 [<ffffffff802784b7>] ? shrink_zone+0x2b7/0xc70
 [<ffffffff802798c4>] ? try_to_free_pages+0x244/0x3b0
 [<ffffffff80277920>] ? isolate_pages_global+0x0/0x40
 [<ffffffff8027b2d3>] ? congestion_wait+0x83/0xa0
 [<ffffffff8024f5f0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff80273668>] ? __alloc_pages_internal+0x218/0x4e0
 [<ffffffff8026d08f>] ? __grab_cache_page+0x6f/0xc0
 [<ffffffff802c69ad>] ? block_write_begin+0x7d/0xe0
 [<ffffffff803a71e2>] ? xfs_vm_write_begin+0x22/0x30
 [<ffffffff803a5e10>] ? xfs_get_blocks+0x0/0x10
 [<ffffffff8026df5b>] ? generic_file_buffered_write+0x1cb/0x790
 [<ffffffff8059145f>] ? _spin_lock_irqsave+0x1f/0x50
 [<ffffffff803ae63c>] ? xfs_write+0x65c/0x950
 [<ffffffff80591681>] ? _spin_unlock_irq+0x11/0x40
 [<ffffffff8029c0cb>] ? do_sync_write+0xdb/0x120
 [<ffffffff8025c169>] ? do_futex+0x109/0x9f0
 [<ffffffff8024f5f0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff802315f0>] ? wake_up_new_task+0xc0/0x100
 [<ffffffff8029caeb>] ? vfs_write+0xcb/0x170
 [<ffffffff8029cc93>] ? sys_write+0x53/0xa0
 [<ffffffff8020c44b>] ? system_call_fastpath+0x16/0x1b
 [<ffffffff8020c44b>] ? system_call_fastpath+0x16/0x1b

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2008-12-23 17:12   ` Christoph Hellwig
@ 2008-12-30  4:23     ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2008-12-30  4:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Roman Kononov, linux-kernel, xfs

On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> 
> Nick, I've seen various reports like this by Roman.  It seems to be
> caused by an interaction of the lockless pagecache with the xfs
> I/O code.  Any idea what might be wrong here:

Hmm, it could get into a loop here if there is a page in the pagecache
with a zero refcount, which might be a problem with XFS... other looping
conditions might indicate a problem iwth lockless pagecache or radix
tree. It would be very helpful to know what condition it is looping on...


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2008-12-30  4:23     ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2008-12-30  4:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, Roman Kononov, xfs

On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> 
> Nick, I've seen various reports like this by Roman.  It seems to be
> caused by an interaction of the lockless pagecache with the xfs
> I/O code.  Any idea what might be wrong here:

Hmm, it could get into a loop here if there is a page in the pagecache
with a zero refcount, which might be a problem with XFS... other looping
conditions might indicate a problem iwth lockless pagecache or radix
tree. It would be very helpful to know what condition it is looping on...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2008-12-30  4:23     ` Nick Piggin
@ 2009-01-03 21:44       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2009-01-03 21:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Peter Klotz, Roman Kononov, linux-kernel, xfs

On Tue, Dec 30, 2008 at 05:23:33AM +0100, Nick Piggin wrote:
> On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> > 
> > Nick, I've seen various reports like this by Roman.  It seems to be
> > caused by an interaction of the lockless pagecache with the xfs
> > I/O code.  Any idea what might be wrong here:
> 
> Hmm, it could get into a loop here if there is a page in the pagecache
> with a zero refcount, which might be a problem with XFS... other looping
> conditions might indicate a problem iwth lockless pagecache or radix
> tree. It would be very helpful to know what condition it is looping on...

See http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2009-01-03 21:44       ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2009-01-03 21:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Peter Klotz, linux-kernel, Roman Kononov, xfs

On Tue, Dec 30, 2008 at 05:23:33AM +0100, Nick Piggin wrote:
> On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> > 
> > Nick, I've seen various reports like this by Roman.  It seems to be
> > caused by an interaction of the lockless pagecache with the xfs
> > I/O code.  Any idea what might be wrong here:
> 
> Hmm, it could get into a loop here if there is a page in the pagecache
> with a zero refcount, which might be a problem with XFS... other looping
> conditions might indicate a problem iwth lockless pagecache or radix
> tree. It would be very helpful to know what condition it is looping on...

See http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2009-01-03 21:44       ` Christoph Hellwig
@ 2009-01-05  1:48         ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05  1:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Peter Klotz, Roman Kononov, linux-kernel, xfs

On Sat, Jan 03, 2009 at 04:44:43PM -0500, Christoph Hellwig wrote:
> On Tue, Dec 30, 2008 at 05:23:33AM +0100, Nick Piggin wrote:
> > On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> > > 
> > > Nick, I've seen various reports like this by Roman.  It seems to be
> > > caused by an interaction of the lockless pagecache with the xfs
> > > I/O code.  Any idea what might be wrong here:
> > 
> > Hmm, it could get into a loop here if there is a page in the pagecache
> > with a zero refcount, which might be a problem with XFS... other looping
> > conditions might indicate a problem iwth lockless pagecache or radix
> > tree. It would be very helpful to know what condition it is looping on...
> 
> See http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

OK.. Hmm, well here is a modification to your patch which might help further.
I'll see if I can reproduce it here meanwhile.

---
 mm/filemap.c |   29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -770,11 +770,13 @@ EXPORT_SYMBOL(find_or_create_page);
  * find_get_pages() returns the number of pages which were found.
  */
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
-			    unsigned int nr_pages, struct page **pages)
+			    unsigned int nr_pages,
+			    struct page **pages)
 {
 	unsigned int i;
 	unsigned int ret;
 	unsigned int nr_found;
+	int locked = 0;
 
 	rcu_read_lock();
 restart:
@@ -785,27 +787,46 @@ restart:
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
-		if (unlikely(!page))
+		if (unlikely(!page)) {
+			if (printk_ratelimit())
+				printk(KERN_INFO "unable to deref page\n");
 			continue;
+		}
+
 		/*
 		 * this can only trigger if nr_found == 1, making livelock
 		 * a non issue.
 		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (unlikely(page == RADIX_TREE_RETRY)) {
+			printk(KERN_INFO "got RADIX_TREE_RETRY\n");
 			goto restart;
+		}
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* If the page is in the radix-tree, and the radix-tree
+			 * is locked, the page must have a non-zero refcount */
+			BUG_ON(locked);
+			printk(KERN_INFO "page_cache_get failed\n");
+			spin_lock_irq(&mapping->tree_lock);
+			locked = 1;
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
+			BUG_ON(locked);
+			printk(KERN_INFO "page moved\n");
 			page_cache_release(page);
+			spin_lock_irq(&mapping->tree_lock);
+			locked = 1;
 			goto repeat;
 		}
 
 		pages[ret] = page;
 		ret++;
 	}
+	if (locked)
+		spin_unlock_irq(&mapping->tree_lock);
 	rcu_read_unlock();
 	return ret;
 }

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2009-01-05  1:48         ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05  1:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Peter Klotz, linux-kernel, Roman Kononov, xfs

On Sat, Jan 03, 2009 at 04:44:43PM -0500, Christoph Hellwig wrote:
> On Tue, Dec 30, 2008 at 05:23:33AM +0100, Nick Piggin wrote:
> > On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> > > 
> > > Nick, I've seen various reports like this by Roman.  It seems to be
> > > caused by an interaction of the lockless pagecache with the xfs
> > > I/O code.  Any idea what might be wrong here:
> > 
> > Hmm, it could get into a loop here if there is a page in the pagecache
> > with a zero refcount, which might be a problem with XFS... other looping
> > conditions might indicate a problem iwth lockless pagecache or radix
> > tree. It would be very helpful to know what condition it is looping on...
> 
> See http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

OK.. Hmm, well here is a modification to your patch which might help further.
I'll see if I can reproduce it here meanwhile.

---
 mm/filemap.c |   29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -770,11 +770,13 @@ EXPORT_SYMBOL(find_or_create_page);
  * find_get_pages() returns the number of pages which were found.
  */
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
-			    unsigned int nr_pages, struct page **pages)
+			    unsigned int nr_pages,
+			    struct page **pages)
 {
 	unsigned int i;
 	unsigned int ret;
 	unsigned int nr_found;
+	int locked = 0;
 
 	rcu_read_lock();
 restart:
@@ -785,27 +787,46 @@ restart:
 		struct page *page;
 repeat:
 		page = radix_tree_deref_slot((void **)pages[i]);
-		if (unlikely(!page))
+		if (unlikely(!page)) {
+			if (printk_ratelimit())
+				printk(KERN_INFO "unable to deref page\n");
 			continue;
+		}
+
 		/*
 		 * this can only trigger if nr_found == 1, making livelock
 		 * a non issue.
 		 */
-		if (unlikely(page == RADIX_TREE_RETRY))
+		if (unlikely(page == RADIX_TREE_RETRY)) {
+			printk(KERN_INFO "got RADIX_TREE_RETRY\n");
 			goto restart;
+		}
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* If the page is in the radix-tree, and the radix-tree
+			 * is locked, the page must have a non-zero refcount */
+			BUG_ON(locked);
+			printk(KERN_INFO "page_cache_get failed\n");
+			spin_lock_irq(&mapping->tree_lock);
+			locked = 1;
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
+			BUG_ON(locked);
+			printk(KERN_INFO "page moved\n");
 			page_cache_release(page);
+			spin_lock_irq(&mapping->tree_lock);
+			locked = 1;
 			goto repeat;
 		}
 
 		pages[ret] = page;
 		ret++;
 	}
+	if (locked)
+		spin_unlock_irq(&mapping->tree_lock);
 	rcu_read_unlock();
 	return ret;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2009-01-05  1:48         ` Nick Piggin
@ 2009-01-05  4:19           ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05  4:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Peter Klotz, Roman Kononov, linux-kernel, xfs

On Mon, Jan 05, 2009 at 02:48:21AM +0100, Nick Piggin wrote:
> On Sat, Jan 03, 2009 at 04:44:43PM -0500, Christoph Hellwig wrote:
> > On Tue, Dec 30, 2008 at 05:23:33AM +0100, Nick Piggin wrote:
> > > On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> > > > 
> > > > Nick, I've seen various reports like this by Roman.  It seems to be
> > > > caused by an interaction of the lockless pagecache with the xfs
> > > > I/O code.  Any idea what might be wrong here:
> > > 
> > > Hmm, it could get into a loop here if there is a page in the pagecache
> > > with a zero refcount, which might be a problem with XFS... other looping
> > > conditions might indicate a problem iwth lockless pagecache or radix
> > > tree. It would be very helpful to know what condition it is looping on...
> > 
> > See http://oss.sgi.com/bugzilla/show_bug.cgi?id=805
> 
> OK.. Hmm, well here is a modification to your patch which might help further.
> I'll see if I can reproduce it here meanwhile.

I have reproduced it. It seems like it might be a livelock condition
because the system ended up recovering after I terminated the dd (and
did so before I collected any real info, oops, hopefully I can
reproduce it again).

This would fit with the problem going away when the debugging patch
was applied. Timing changes...

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2009-01-05  4:19           ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05  4:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Peter Klotz, linux-kernel, Roman Kononov, xfs

On Mon, Jan 05, 2009 at 02:48:21AM +0100, Nick Piggin wrote:
> On Sat, Jan 03, 2009 at 04:44:43PM -0500, Christoph Hellwig wrote:
> > On Tue, Dec 30, 2008 at 05:23:33AM +0100, Nick Piggin wrote:
> > > On Tue, Dec 23, 2008 at 12:12:59PM -0500, Christoph Hellwig wrote:
> > > > 
> > > > Nick, I've seen various reports like this by Roman.  It seems to be
> > > > caused by an interaction of the lockless pagecache with the xfs
> > > > I/O code.  Any idea what might be wrong here:
> > > 
> > > Hmm, it could get into a loop here if there is a page in the pagecache
> > > with a zero refcount, which might be a problem with XFS... other looping
> > > conditions might indicate a problem iwth lockless pagecache or radix
> > > tree. It would be very helpful to know what condition it is looping on...
> > 
> > See http://oss.sgi.com/bugzilla/show_bug.cgi?id=805
> 
> OK.. Hmm, well here is a modification to your patch which might help further.
> I'll see if I can reproduce it here meanwhile.

I have reproduced it. It seems like it might be a livelock condition
because the system ended up recovering after I terminated the dd (and
did so before I collected any real info, oops, hopefully I can
reproduce it again).

This would fit with the problem going away when the debugging patch
was applied. Timing changes...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2009-01-05  4:19           ` Nick Piggin
@ 2009-01-05  6:48             ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05  6:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Peter Klotz, Roman Kononov, linux-kernel, xfs

On Mon, Jan 05, 2009 at 05:19:59AM +0100, Nick Piggin wrote:
> On Mon, Jan 05, 2009 at 02:48:21AM +0100, Nick Piggin wrote:
> > 
> > OK.. Hmm, well here is a modification to your patch which might help further.
> > I'll see if I can reproduce it here meanwhile.
> 
> I have reproduced it. It seems like it might be a livelock condition
> because the system ended up recovering after I terminated the dd (and
> did so before I collected any real info, oops, hopefully I can
> reproduce it again).
> 
> This would fit with the problem going away when the debugging patch
> was applied. Timing changes...

No, I was wrong. The problem goes away with the patch applied because
a function call == a compiler barrier. The problem randomly recovered
for me because of something more subtle.

I believe this patch should solve it. Please test and confirm before
I send it upstream.

---

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be detected if the page has
been reallocated. However it can result in a lockup. 

Add a the required compiler barrier and comment to fix this.

Assembly snippet from find_get_pages, before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)


---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
+++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
@@ -794,8 +794,19 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/*
+			 * A failed page_cache_get_speculative operation does
+			 * not imply any barriers (Documentation/atomic_ops.txt),
+			 * and as such, we must force the compiler to deref the
+			 * radix-tree slot again rather than using the cached
+			 * value (because we need to give up if the page has been
+			 * removed from the radix-tree, rather than looping until
+			 * it gets reused for something else).
+			 */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -850,8 +861,11 @@ repeat:
 		if (page->mapping == NULL || page->index != index)
 			break;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -904,8 +918,11 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2009-01-05  6:48             ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05  6:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Peter Klotz, linux-kernel, Roman Kononov, xfs

On Mon, Jan 05, 2009 at 05:19:59AM +0100, Nick Piggin wrote:
> On Mon, Jan 05, 2009 at 02:48:21AM +0100, Nick Piggin wrote:
> > 
> > OK.. Hmm, well here is a modification to your patch which might help further.
> > I'll see if I can reproduce it here meanwhile.
> 
> I have reproduced it. It seems like it might be a livelock condition
> because the system ended up recovering after I terminated the dd (and
> did so before I collected any real info, oops, hopefully I can
> reproduce it again).
> 
> This would fit with the problem going away when the debugging patch
> was applied. Timing changes...

No, I was wrong. The problem goes away with the patch applied because
a function call == a compiler barrier. The problem randomly recovered
for me because of something more subtle.

I believe this patch should solve it. Please test and confirm before
I send it upstream.

---

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be detected if the page has
been reallocated. However it can result in a lockup. 

Add a the required compiler barrier and comment to fix this.

Assembly snippet from find_get_pages, before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)


---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
+++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
@@ -794,8 +794,19 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/*
+			 * A failed page_cache_get_speculative operation does
+			 * not imply any barriers (Documentation/atomic_ops.txt),
+			 * and as such, we must force the compiler to deref the
+			 * radix-tree slot again rather than using the cached
+			 * value (because we need to give up if the page has been
+			 * removed from the radix-tree, rather than looping until
+			 * it gets reused for something else).
+			 */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -850,8 +861,11 @@ repeat:
 		if (page->mapping == NULL || page->index != index)
 			break;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -904,8 +918,11 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2009-01-05  6:48             ` Nick Piggin
@ 2009-01-05 14:25               ` Roman Kononov
  -1 siblings, 0 replies; 74+ messages in thread
From: Roman Kononov @ 2009-01-05 14:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Peter Klotz, Roman Kononov, linux-kernel, xfs

On 2009-01-05 00:48 Nick Piggin said the following:
> I believe this patch should solve it. Please test and confirm before
> I send it upstream.

3 systems with 2.6.27.10 have worked overnight with dd running 
continuously. They all failed within 20 minutes without the patch.

Thank you,

Roman

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2009-01-05 14:25               ` Roman Kononov
  0 siblings, 0 replies; 74+ messages in thread
From: Roman Kononov @ 2009-01-05 14:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Peter Klotz, linux-kernel, Roman Kononov, xfs

On 2009-01-05 00:48 Nick Piggin said the following:
> I believe this patch should solve it. Please test and confirm before
> I send it upstream.

3 systems with 2.6.27.10 have worked overnight with dd running 
continuously. They all failed within 20 minutes without the patch.

Thank you,

Roman

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2009-01-05  6:48             ` Nick Piggin
@ 2009-01-05 16:21               ` Peter Klotz
  -1 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2009-01-05 16:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, Roman Kononov, linux-kernel, xfs

Nick Piggin wrote:
> I believe this patch should solve it. Please test and confirm before
> I send it upstream.


My test successfully completed two times, writing a 900GB file in each run.
I used a patched 2.6.27.10 x86_64 kernel.

On an unpatched system this test usually fails before reaching 100GB.

Thank you for fixing this issue that quick.

Regards, Peter.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2009-01-05 16:21               ` Peter Klotz
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2009-01-05 16:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-kernel, Roman Kononov, xfs

Nick Piggin wrote:
> I believe this patch should solve it. Please test and confirm before
> I send it upstream.


My test successfully completed two times, writing a 900GB file in each run.
I used a patched 2.6.27.10 x86_64 kernel.

On an unpatched system this test usually fails before reaching 100GB.

Thank you for fixing this issue that quick.

Regards, Peter.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 16:21               ` Peter Klotz
  (?)
@ 2009-01-05 16:41                 ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05 16:41 UTC (permalink / raw)
  To: Peter Klotz, Linus Torvalds, stable, Linux Memory Management List
  Cc: Christoph Hellwig, Roman Kononov, linux-kernel, xfs, Andrew Morton

Hi,

This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
--

Peter Klotz and Roman Kononov both reported a bug where in XFS workloads where
they were seeing softlockups in find_get_pages
(http://oss.sgi.com/bugzilla/show_bug.cgi?id=805).

Basically it would go into an "infinite" loop, although it would sometimes be
able to break out of the loop depending on the phase of the moon.

This turns out to be a bug in the lockless pagecache patch. There is a missing
compiler barrier in the "increment reference count unless it was zero" failure
case of the lockless pagecache protocol in the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be properly handled if the
page does get reallocated. However it can result in a lockup. 

Add a the required compiler barrier and comment to fix this.

Assembly snippet from find_get_pages, before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

The problem was noticed and resolved on 2.6.27 stable kernels, and also applies
upstream (where I was able to reproduce it and verify the fix).

Reported-by: Peter Klotz <peter.klotz@aon.at>
Reported-by: Roman Kononov <kononov@ftml.net>
Tested-by: Peter Klotz <peter.klotz@aon.at>
Tested-by: Roman Kononov <kononov@ftml.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
+++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
@@ -794,8 +794,19 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/*
+			 * A failed page_cache_get_speculative operation does
+			 * not imply any barriers (Documentation/atomic_ops.txt),
+			 * and as such, we must force the compiler to deref the
+			 * radix-tree slot again rather than using the cached
+			 * value (because we need to give up if the page has been
+			 * removed from the radix-tree, rather than looping until
+			 * it gets reused for something else).
+			 */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -850,8 +861,11 @@ repeat:
 		if (page->mapping == NULL || page->index != index)
 			break;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -904,8 +918,11 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 16:41                 ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05 16:41 UTC (permalink / raw)
  To: Peter Klotz, Linus Torvalds, stable, Linux Memory Management List
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, Roman Kononov, xfs

Hi,

This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
--

Peter Klotz and Roman Kononov both reported a bug where in XFS workloads where
they were seeing softlockups in find_get_pages
(http://oss.sgi.com/bugzilla/show_bug.cgi?id=805).

Basically it would go into an "infinite" loop, although it would sometimes be
able to break out of the loop depending on the phase of the moon.

This turns out to be a bug in the lockless pagecache patch. There is a missing
compiler barrier in the "increment reference count unless it was zero" failure
case of the lockless pagecache protocol in the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be properly handled if the
page does get reallocated. However it can result in a lockup. 

Add a the required compiler barrier and comment to fix this.

Assembly snippet from find_get_pages, before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

The problem was noticed and resolved on 2.6.27 stable kernels, and also applies
upstream (where I was able to reproduce it and verify the fix).

Reported-by: Peter Klotz <peter.klotz@aon.at>
Reported-by: Roman Kononov <kononov@ftml.net>
Tested-by: Peter Klotz <peter.klotz@aon.at>
Tested-by: Roman Kononov <kononov@ftml.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
+++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
@@ -794,8 +794,19 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/*
+			 * A failed page_cache_get_speculative operation does
+			 * not imply any barriers (Documentation/atomic_ops.txt),
+			 * and as such, we must force the compiler to deref the
+			 * radix-tree slot again rather than using the cached
+			 * value (because we need to give up if the page has been
+			 * removed from the radix-tree, rather than looping until
+			 * it gets reused for something else).
+			 */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -850,8 +861,11 @@ repeat:
 		if (page->mapping == NULL || page->index != index)
 			break;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -904,8 +918,11 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 16:41                 ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05 16:41 UTC (permalink / raw)
  To: Peter Klotz, Linus Torvalds, stable, Linux Memory Management List
  Cc: Christoph Hellwig, Roman Kononov, linux-kernel, xfs, Andrew Morton

Hi,

This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
--

Peter Klotz and Roman Kononov both reported a bug where in XFS workloads where
they were seeing softlockups in find_get_pages
(http://oss.sgi.com/bugzilla/show_bug.cgi?id=805).

Basically it would go into an "infinite" loop, although it would sometimes be
able to break out of the loop depending on the phase of the moon.

This turns out to be a bug in the lockless pagecache patch. There is a missing
compiler barrier in the "increment reference count unless it was zero" failure
case of the lockless pagecache protocol in the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be properly handled if the
page does get reallocated. However it can result in a lockup. 

Add a the required compiler barrier and comment to fix this.

Assembly snippet from find_get_pages, before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

The problem was noticed and resolved on 2.6.27 stable kernels, and also applies
upstream (where I was able to reproduce it and verify the fix).

Reported-by: Peter Klotz <peter.klotz@aon.at>
Reported-by: Roman Kononov <kononov@ftml.net>
Tested-by: Peter Klotz <peter.klotz@aon.at>
Tested-by: Roman Kononov <kononov@ftml.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
+++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
@@ -794,8 +794,19 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/*
+			 * A failed page_cache_get_speculative operation does
+			 * not imply any barriers (Documentation/atomic_ops.txt),
+			 * and as such, we must force the compiler to deref the
+			 * radix-tree slot again rather than using the cached
+			 * value (because we need to give up if the page has been
+			 * removed from the radix-tree, rather than looping until
+			 * it gets reused for something else).
+			 */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -850,8 +861,11 @@ repeat:
 		if (page->mapping == NULL || page->index != index)
 			break;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -904,8 +918,11 @@ repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 16:41                 ` Nick Piggin
  (?)
@ 2009-01-05 17:30                   ` Linus Torvalds
  -1 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 17:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Nick Piggin wrote:
> 
> This patch should be applied to 2.6.29 and 27/28 stable kernels, please.

No. I think this patch is utter crap. But please feel free to educate me 
on why that is not the case.

Here's my explanation:

Not only is it ugly (which is already sufficient ground to suspect it is 
wrong or could at least be done better), but reading the comment, it makes 
no sense at all. You only put the barrier in the "goto repeat" case, but 
the thing is, if you worry about radix tree slot not being reloaded in the 
repeat case, then you damn well should worry about it not being reloaded 
in the non-repeat case too!

The code is immediately followed by a test to see that the page is still 
the same in the slot, ie this:

                /*
                 * Has the page moved?
                 * This is part of the lockless pagecache protocol. See
                 * include/linux/pagemap.h for details.
                 */
                if (unlikely(page != *pagep)) {

and if you need a barrier for the repeat case, you need one for this case 
too.

In other words, it looks like you fixed the symptom, but not the real 
cause! That's now how we work in the kernel.

The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
doesn't require any barriers that I can see - I think the proper fix is 
this simple one-liner.

If you use RCU to protect a data structure, then any data loaded from that 
data structure that can change due to RCU should be loaded with 
"rcu_dereference()". 

Now, I can't test this, because it makes absolutely no difference for me 
(the diff isn't empty, but the asm changes seem to be all due to just gcc 
variable numbering changing). I can't seem to see the buggy code. Maybe it 
needs a specific compiler version, or some specific config option to 
trigger?

So because I can't see the issue, I also obviously can't verify that it's 
the only possible case. Maybe there is some other memory access that 
should also be done with the proper rcu accessors?

Of course, it's also possible that we should just put a barrier in 
page_cache_get_speculative(). That doesn't seem to make a whole lot of 
conceptual sense, though (the same way that your barrier() didn't make any 
sense - I don't see that the barrier has absolutely _anything_ to do with 
whether the speculative getting of the page fails or not!)

In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
patches. I'm not saying mine is very deep either, but I think it's at 
least scrathing the surface of the real problem rather than just trying to 
cover it up.

		Linus

---
 include/linux/radix-tree.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a916c66..355f6e8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -136,7 +136,7 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = *pslot;
+	void *ret = rcu_dereference(*pslot);
 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
 		ret = RADIX_TREE_RETRY;
 	return ret;

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 17:30                   ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 17:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable



On Mon, 5 Jan 2009, Nick Piggin wrote:
> 
> This patch should be applied to 2.6.29 and 27/28 stable kernels, please.

No. I think this patch is utter crap. But please feel free to educate me 
on why that is not the case.

Here's my explanation:

Not only is it ugly (which is already sufficient ground to suspect it is 
wrong or could at least be done better), but reading the comment, it makes 
no sense at all. You only put the barrier in the "goto repeat" case, but 
the thing is, if you worry about radix tree slot not being reloaded in the 
repeat case, then you damn well should worry about it not being reloaded 
in the non-repeat case too!

The code is immediately followed by a test to see that the page is still 
the same in the slot, ie this:

                /*
                 * Has the page moved?
                 * This is part of the lockless pagecache protocol. See
                 * include/linux/pagemap.h for details.
                 */
                if (unlikely(page != *pagep)) {

and if you need a barrier for the repeat case, you need one for this case 
too.

In other words, it looks like you fixed the symptom, but not the real 
cause! That's now how we work in the kernel.

The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
doesn't require any barriers that I can see - I think the proper fix is 
this simple one-liner.

If you use RCU to protect a data structure, then any data loaded from that 
data structure that can change due to RCU should be loaded with 
"rcu_dereference()". 

Now, I can't test this, because it makes absolutely no difference for me 
(the diff isn't empty, but the asm changes seem to be all due to just gcc 
variable numbering changing). I can't seem to see the buggy code. Maybe it 
needs a specific compiler version, or some specific config option to 
trigger?

So because I can't see the issue, I also obviously can't verify that it's 
the only possible case. Maybe there is some other memory access that 
should also be done with the proper rcu accessors?

Of course, it's also possible that we should just put a barrier in 
page_cache_get_speculative(). That doesn't seem to make a whole lot of 
conceptual sense, though (the same way that your barrier() didn't make any 
sense - I don't see that the barrier has absolutely _anything_ to do with 
whether the speculative getting of the page fails or not!)

In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
patches. I'm not saying mine is very deep either, but I think it's at 
least scrathing the surface of the real problem rather than just trying to 
cover it up.

		Linus

---
 include/linux/radix-tree.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a916c66..355f6e8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -136,7 +136,7 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = *pslot;
+	void *ret = rcu_dereference(*pslot);
 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
 		ret = RADIX_TREE_RETRY;
 	return ret;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 17:30                   ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 17:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Nick Piggin wrote:
> 
> This patch should be applied to 2.6.29 and 27/28 stable kernels, please.

No. I think this patch is utter crap. But please feel free to educate me 
on why that is not the case.

Here's my explanation:

Not only is it ugly (which is already sufficient ground to suspect it is 
wrong or could at least be done better), but reading the comment, it makes 
no sense at all. You only put the barrier in the "goto repeat" case, but 
the thing is, if you worry about radix tree slot not being reloaded in the 
repeat case, then you damn well should worry about it not being reloaded 
in the non-repeat case too!

The code is immediately followed by a test to see that the page is still 
the same in the slot, ie this:

                /*
                 * Has the page moved?
                 * This is part of the lockless pagecache protocol. See
                 * include/linux/pagemap.h for details.
                 */
                if (unlikely(page != *pagep)) {

and if you need a barrier for the repeat case, you need one for this case 
too.

In other words, it looks like you fixed the symptom, but not the real 
cause! That's now how we work in the kernel.

The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
doesn't require any barriers that I can see - I think the proper fix is 
this simple one-liner.

If you use RCU to protect a data structure, then any data loaded from that 
data structure that can change due to RCU should be loaded with 
"rcu_dereference()". 

Now, I can't test this, because it makes absolutely no difference for me 
(the diff isn't empty, but the asm changes seem to be all due to just gcc 
variable numbering changing). I can't seem to see the buggy code. Maybe it 
needs a specific compiler version, or some specific config option to 
trigger?

So because I can't see the issue, I also obviously can't verify that it's 
the only possible case. Maybe there is some other memory access that 
should also be done with the proper rcu accessors?

Of course, it's also possible that we should just put a barrier in 
page_cache_get_speculative(). That doesn't seem to make a whole lot of 
conceptual sense, though (the same way that your barrier() didn't make any 
sense - I don't see that the barrier has absolutely _anything_ to do with 
whether the speculative getting of the page fails or not!)

In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
patches. I'm not saying mine is very deep either, but I think it's at 
least scrathing the surface of the real problem rather than just trying to 
cover it up.

		Linus

---
 include/linux/radix-tree.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a916c66..355f6e8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -136,7 +136,7 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = *pslot;
+	void *ret = rcu_dereference(*pslot);
 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
 		ret = RADIX_TREE_RETRY;
 	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 17:30                   ` Linus Torvalds
  (?)
@ 2009-01-05 18:00                     ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > 
> > This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
> 
> No. I think this patch is utter crap. But please feel free to educate me 
> on why that is not the case.
> 
> Here's my explanation:
> 
> Not only is it ugly (which is already sufficient ground to suspect it is 
> wrong or could at least be done better), but reading the comment, it makes 
> no sense at all. You only put the barrier in the "goto repeat" case, but 
> the thing is, if you worry about radix tree slot not being reloaded in the 
> repeat case, then you damn well should worry about it not being reloaded 
> in the non-repeat case too!

In which case atomic_inc_unless is defined to provide a barrier.

 
> The code is immediately followed by a test to see that the page is still 
> the same in the slot, ie this:
> 
>                 /*
>                  * Has the page moved?
>                  * This is part of the lockless pagecache protocol. See
>                  * include/linux/pagemap.h for details.
>                  */
>                 if (unlikely(page != *pagep)) {
> 
> and if you need a barrier for the repeat case, you need one for this case 
> too.
> 
> In other words, it looks like you fixed the symptom, but not the real 
> cause! That's now how we work in the kernel.
> 
> The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
> of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
> doesn't require any barriers that I can see - I think the proper fix is 
> this simple one-liner.
> 
> If you use RCU to protect a data structure, then any data loaded from that 
> data structure that can change due to RCU should be loaded with 
> "rcu_dereference()". 

It doesn't need that because the last level pointers in the radix
tree are not necessarily under RCU, but whatever synchronisation
the caller uses (in this case, speculative page references, which
should not require smp_read_barrier_depends, AFAIKS). Putting an
rcu_dereference there might work, but I think it misses a subtlety
of this code.


> Now, I can't test this, because it makes absolutely no difference for me 
> (the diff isn't empty, but the asm changes seem to be all due to just gcc 
> variable numbering changing). I can't seem to see the buggy code. Maybe it 
> needs a specific compiler version, or some specific config option to 
> trigger?
> 
> So because I can't see the issue, I also obviously can't verify that it's 
> the only possible case. Maybe there is some other memory access that 
> should also be done with the proper rcu accessors?
> 
> Of course, it's also possible that we should just put a barrier in 
> page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> conceptual sense, though (the same way that your barrier() didn't make any 
> sense - I don't see that the barrier has absolutely _anything_ to do with 
> whether the speculative getting of the page fails or not!)

When that fails, the caller can (almost) assume the pointer has changed.
So it has to load the new pointer to continue. The object pointed to is
not protected with RCU, nor is there a requirement to see a specific
load execution ordering. 

>
> In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
> patches. I'm not saying mine is very deep either, but I think it's at 
> least scrathing the surface of the real problem rather than just trying to 
> cover it up.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 18:00                     ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable

On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > 
> > This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
> 
> No. I think this patch is utter crap. But please feel free to educate me 
> on why that is not the case.
> 
> Here's my explanation:
> 
> Not only is it ugly (which is already sufficient ground to suspect it is 
> wrong or could at least be done better), but reading the comment, it makes 
> no sense at all. You only put the barrier in the "goto repeat" case, but 
> the thing is, if you worry about radix tree slot not being reloaded in the 
> repeat case, then you damn well should worry about it not being reloaded 
> in the non-repeat case too!

In which case atomic_inc_unless is defined to provide a barrier.

 
> The code is immediately followed by a test to see that the page is still 
> the same in the slot, ie this:
> 
>                 /*
>                  * Has the page moved?
>                  * This is part of the lockless pagecache protocol. See
>                  * include/linux/pagemap.h for details.
>                  */
>                 if (unlikely(page != *pagep)) {
> 
> and if you need a barrier for the repeat case, you need one for this case 
> too.
> 
> In other words, it looks like you fixed the symptom, but not the real 
> cause! That's now how we work in the kernel.
> 
> The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
> of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
> doesn't require any barriers that I can see - I think the proper fix is 
> this simple one-liner.
> 
> If you use RCU to protect a data structure, then any data loaded from that 
> data structure that can change due to RCU should be loaded with 
> "rcu_dereference()". 

It doesn't need that because the last level pointers in the radix
tree are not necessarily under RCU, but whatever synchronisation
the caller uses (in this case, speculative page references, which
should not require smp_read_barrier_depends, AFAIKS). Putting an
rcu_dereference there might work, but I think it misses a subtlety
of this code.


> Now, I can't test this, because it makes absolutely no difference for me 
> (the diff isn't empty, but the asm changes seem to be all due to just gcc 
> variable numbering changing). I can't seem to see the buggy code. Maybe it 
> needs a specific compiler version, or some specific config option to 
> trigger?
> 
> So because I can't see the issue, I also obviously can't verify that it's 
> the only possible case. Maybe there is some other memory access that 
> should also be done with the proper rcu accessors?
> 
> Of course, it's also possible that we should just put a barrier in 
> page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> conceptual sense, though (the same way that your barrier() didn't make any 
> sense - I don't see that the barrier has absolutely _anything_ to do with 
> whether the speculative getting of the page fails or not!)

When that fails, the caller can (almost) assume the pointer has changed.
So it has to load the new pointer to continue. The object pointed to is
not protected with RCU, nor is there a requirement to see a specific
load execution ordering. 

>
> In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
> patches. I'm not saying mine is very deep either, but I think it's at 
> least scrathing the surface of the real problem rather than just trying to 
> cover it up.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 18:00                     ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-05 18:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > 
> > This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
> 
> No. I think this patch is utter crap. But please feel free to educate me 
> on why that is not the case.
> 
> Here's my explanation:
> 
> Not only is it ugly (which is already sufficient ground to suspect it is 
> wrong or could at least be done better), but reading the comment, it makes 
> no sense at all. You only put the barrier in the "goto repeat" case, but 
> the thing is, if you worry about radix tree slot not being reloaded in the 
> repeat case, then you damn well should worry about it not being reloaded 
> in the non-repeat case too!

In which case atomic_inc_unless is defined to provide a barrier.

 
> The code is immediately followed by a test to see that the page is still 
> the same in the slot, ie this:
> 
>                 /*
>                  * Has the page moved?
>                  * This is part of the lockless pagecache protocol. See
>                  * include/linux/pagemap.h for details.
>                  */
>                 if (unlikely(page != *pagep)) {
> 
> and if you need a barrier for the repeat case, you need one for this case 
> too.
> 
> In other words, it looks like you fixed the symptom, but not the real 
> cause! That's now how we work in the kernel.
> 
> The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
> of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
> doesn't require any barriers that I can see - I think the proper fix is 
> this simple one-liner.
> 
> If you use RCU to protect a data structure, then any data loaded from that 
> data structure that can change due to RCU should be loaded with 
> "rcu_dereference()". 

It doesn't need that because the last level pointers in the radix
tree are not necessarily under RCU, but whatever synchronisation
the caller uses (in this case, speculative page references, which
should not require smp_read_barrier_depends, AFAIKS). Putting an
rcu_dereference there might work, but I think it misses a subtlety
of this code.


> Now, I can't test this, because it makes absolutely no difference for me 
> (the diff isn't empty, but the asm changes seem to be all due to just gcc 
> variable numbering changing). I can't seem to see the buggy code. Maybe it 
> needs a specific compiler version, or some specific config option to 
> trigger?
> 
> So because I can't see the issue, I also obviously can't verify that it's 
> the only possible case. Maybe there is some other memory access that 
> should also be done with the proper rcu accessors?
> 
> Of course, it's also possible that we should just put a barrier in 
> page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> conceptual sense, though (the same way that your barrier() didn't make any 
> sense - I don't see that the barrier has absolutely _anything_ to do with 
> whether the speculative getting of the page fails or not!)

When that fails, the caller can (almost) assume the pointer has changed.
So it has to load the new pointer to continue. The object pointed to is
not protected with RCU, nor is there a requirement to see a specific
load execution ordering. 

>
> In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
> patches. I'm not saying mine is very deep either, but I think it's at 
> least scrathing the surface of the real problem rather than just trying to 
> cover it up.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 18:00                     ` Nick Piggin
  (?)
@ 2009-01-05 18:44                       ` Linus Torvalds
  -1 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 18:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Nick Piggin wrote:

> On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > 
> > Not only is it ugly (which is already sufficient ground to suspect it is 
> > wrong or could at least be done better), but reading the comment, it makes 
> > no sense at all. You only put the barrier in the "goto repeat" case, but 
> > the thing is, if you worry about radix tree slot not being reloaded in the 
> > repeat case, then you damn well should worry about it not being reloaded 
> > in the non-repeat case too!
> 
> In which case atomic_inc_unless is defined to provide a barrier.

Hmm. Ok, granted.

> > If you use RCU to protect a data structure, then any data loaded from that 
> > data structure that can change due to RCU should be loaded with 
> > "rcu_dereference()". 
> 
> It doesn't need that because the last level pointers in the radix
> tree are not necessarily under RCU, but whatever synchronisation
> the caller uses (in this case, speculative page references, which
> should not require smp_read_barrier_depends, AFAIKS).

rcu_dereference() does more than that smp_read_barrier_depends() (which is 
a no-op on all sane architectures). 

The important part of rcu_dereference() is the ACCESS_ONCE() part. That's 
the one that guarantees the access to happen - exactly once.

> Putting an rcu_dereference there might work, but I think it misses a 
> subtlety of this code.

No, _you_ miss the subtlety of something that can change under you.

Look at radix_tree_deref_slot(), and realize that without the 
rcu_dereference(), the compiler would actually be allowed to think that it 
can re-load anything from *pslot several times. So without my one-liner 
patch, the compiler can actually do this:

	register = load_from_memory(pslot)
	if (radix_tree_is_indirect_ptr(register))
		goto fail:
	return load_from_memory(pslot);

   fail:
	return RADIX_TREE_RETRY;

see? Imagine if you are low on registers (x86, anyone?) and look at that 
radix_tree_is_indirect_ptr() test: it does a logical "and" which can be 
done with a memory instruction on x86. So the compiler could _literally_ 
compile this as

	testb $1,(%eax)		; %eax is "pslot"
	jne indirect_pointer
	movl (%eax),%eax	; now we load it for real

rather than

	movl (%eax),%eax
	testl $1,%eax
	jne indirect_pointer

because the first version actually keeps more registers live for the 
indirect case. In fact, the compiler might be delaying that "movl" until 
much later (depending on barriers and needs). And notice how that "now we 
load it for real" may be getting a new value - including a possible 
indirect pointer value, even though we tested that it wasn't an indirect 
pointer!

And THIS is why code that depends on RCU needs to use "rcu_dereference()". 
Because otherwise you may be testing one thing, and then later using some 
_other_ value than the one you tested. You must guarantee that you really 
just load it once, and that the compiler doesn't decide that it can load 
it multiple times, and test the multiple (possibly different) values using 
different logic.

> > Of course, it's also possible that we should just put a barrier in 
> > page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> > conceptual sense, though (the same way that your barrier() didn't make any 
> > sense - I don't see that the barrier has absolutely _anything_ to do with 
> > whether the speculative getting of the page fails or not!)
> 
> When that fails, the caller can (almost) assume the pointer has changed.

Not relevant.

Yes, when it fails, the caller can obviously assume that the pointer has 
almost certainly changed, but that's neither here nor there - if the 
page_cache_get_speculative() fails, you mustn't use that pointer *whether* 
it has changed or not. So there's no point in even testing, and the code 
obviously doesn't.

> So it has to load the new pointer to continue. The object pointed to is
> not protected with RCU, nor is there a requirement to see a specific
> load execution ordering. 

Either the value can change, or it can not. It's that simple.

If it cannot change, then we can load it just once, or we can load it 
multiple times, and it won't matter. Barriers won't do anything but screw 
up the code.

If it can change from under us, you need to use rcu_dereference(), or 
open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
like the test for the RADIX_TREE_INDIRECT_PTR bit.

And that was the fundamental problem.

And once you fix that fundamental problem, your barrier no longer makes 
any sense, because the barrier HAS NOTHING TO DO WITH WHETHER 
page_cache_get_speculative() fails or not!

			Linus

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 18:44                       ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 18:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable



On Mon, 5 Jan 2009, Nick Piggin wrote:

> On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > 
> > Not only is it ugly (which is already sufficient ground to suspect it is 
> > wrong or could at least be done better), but reading the comment, it makes 
> > no sense at all. You only put the barrier in the "goto repeat" case, but 
> > the thing is, if you worry about radix tree slot not being reloaded in the 
> > repeat case, then you damn well should worry about it not being reloaded 
> > in the non-repeat case too!
> 
> In which case atomic_inc_unless is defined to provide a barrier.

Hmm. Ok, granted.

> > If you use RCU to protect a data structure, then any data loaded from that 
> > data structure that can change due to RCU should be loaded with 
> > "rcu_dereference()". 
> 
> It doesn't need that because the last level pointers in the radix
> tree are not necessarily under RCU, but whatever synchronisation
> the caller uses (in this case, speculative page references, which
> should not require smp_read_barrier_depends, AFAIKS).

rcu_dereference() does more than that smp_read_barrier_depends() (which is 
a no-op on all sane architectures). 

The important part of rcu_dereference() is the ACCESS_ONCE() part. That's 
the one that guarantees the access to happen - exactly once.

> Putting an rcu_dereference there might work, but I think it misses a 
> subtlety of this code.

No, _you_ miss the subtlety of something that can change under you.

Look at radix_tree_deref_slot(), and realize that without the 
rcu_dereference(), the compiler would actually be allowed to think that it 
can re-load anything from *pslot several times. So without my one-liner 
patch, the compiler can actually do this:

	register = load_from_memory(pslot)
	if (radix_tree_is_indirect_ptr(register))
		goto fail:
	return load_from_memory(pslot);

   fail:
	return RADIX_TREE_RETRY;

see? Imagine if you are low on registers (x86, anyone?) and look at that 
radix_tree_is_indirect_ptr() test: it does a logical "and" which can be 
done with a memory instruction on x86. So the compiler could _literally_ 
compile this as

	testb $1,(%eax)		; %eax is "pslot"
	jne indirect_pointer
	movl (%eax),%eax	; now we load it for real

rather than

	movl (%eax),%eax
	testl $1,%eax
	jne indirect_pointer

because the first version actually keeps more registers live for the 
indirect case. In fact, the compiler might be delaying that "movl" until 
much later (depending on barriers and needs). And notice how that "now we 
load it for real" may be getting a new value - including a possible 
indirect pointer value, even though we tested that it wasn't an indirect 
pointer!

And THIS is why code that depends on RCU needs to use "rcu_dereference()". 
Because otherwise you may be testing one thing, and then later using some 
_other_ value than the one you tested. You must guarantee that you really 
just load it once, and that the compiler doesn't decide that it can load 
it multiple times, and test the multiple (possibly different) values using 
different logic.

> > Of course, it's also possible that we should just put a barrier in 
> > page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> > conceptual sense, though (the same way that your barrier() didn't make any 
> > sense - I don't see that the barrier has absolutely _anything_ to do with 
> > whether the speculative getting of the page fails or not!)
> 
> When that fails, the caller can (almost) assume the pointer has changed.

Not relevant.

Yes, when it fails, the caller can obviously assume that the pointer has 
almost certainly changed, but that's neither here nor there - if the 
page_cache_get_speculative() fails, you mustn't use that pointer *whether* 
it has changed or not. So there's no point in even testing, and the code 
obviously doesn't.

> So it has to load the new pointer to continue. The object pointed to is
> not protected with RCU, nor is there a requirement to see a specific
> load execution ordering. 

Either the value can change, or it can not. It's that simple.

If it cannot change, then we can load it just once, or we can load it 
multiple times, and it won't matter. Barriers won't do anything but screw 
up the code.

If it can change from under us, you need to use rcu_dereference(), or 
open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
like the test for the RADIX_TREE_INDIRECT_PTR bit.

And that was the fundamental problem.

And once you fix that fundamental problem, your barrier no longer makes 
any sense, because the barrier HAS NOTHING TO DO WITH WHETHER 
page_cache_get_speculative() fails or not!

			Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 18:44                       ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 18:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Nick Piggin wrote:

> On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > 
> > Not only is it ugly (which is already sufficient ground to suspect it is 
> > wrong or could at least be done better), but reading the comment, it makes 
> > no sense at all. You only put the barrier in the "goto repeat" case, but 
> > the thing is, if you worry about radix tree slot not being reloaded in the 
> > repeat case, then you damn well should worry about it not being reloaded 
> > in the non-repeat case too!
> 
> In which case atomic_inc_unless is defined to provide a barrier.

Hmm. Ok, granted.

> > If you use RCU to protect a data structure, then any data loaded from that 
> > data structure that can change due to RCU should be loaded with 
> > "rcu_dereference()". 
> 
> It doesn't need that because the last level pointers in the radix
> tree are not necessarily under RCU, but whatever synchronisation
> the caller uses (in this case, speculative page references, which
> should not require smp_read_barrier_depends, AFAIKS).

rcu_dereference() does more than that smp_read_barrier_depends() (which is 
a no-op on all sane architectures). 

The important part of rcu_dereference() is the ACCESS_ONCE() part. That's 
the one that guarantees the access to happen - exactly once.

> Putting an rcu_dereference there might work, but I think it misses a 
> subtlety of this code.

No, _you_ miss the subtlety of something that can change under you.

Look at radix_tree_deref_slot(), and realize that without the 
rcu_dereference(), the compiler would actually be allowed to think that it 
can re-load anything from *pslot several times. So without my one-liner 
patch, the compiler can actually do this:

	register = load_from_memory(pslot)
	if (radix_tree_is_indirect_ptr(register))
		goto fail:
	return load_from_memory(pslot);

   fail:
	return RADIX_TREE_RETRY;

see? Imagine if you are low on registers (x86, anyone?) and look at that 
radix_tree_is_indirect_ptr() test: it does a logical "and" which can be 
done with a memory instruction on x86. So the compiler could _literally_ 
compile this as

	testb $1,(%eax)		; %eax is "pslot"
	jne indirect_pointer
	movl (%eax),%eax	; now we load it for real

rather than

	movl (%eax),%eax
	testl $1,%eax
	jne indirect_pointer

because the first version actually keeps more registers live for the 
indirect case. In fact, the compiler might be delaying that "movl" until 
much later (depending on barriers and needs). And notice how that "now we 
load it for real" may be getting a new value - including a possible 
indirect pointer value, even though we tested that it wasn't an indirect 
pointer!

And THIS is why code that depends on RCU needs to use "rcu_dereference()". 
Because otherwise you may be testing one thing, and then later using some 
_other_ value than the one you tested. You must guarantee that you really 
just load it once, and that the compiler doesn't decide that it can load 
it multiple times, and test the multiple (possibly different) values using 
different logic.

> > Of course, it's also possible that we should just put a barrier in 
> > page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> > conceptual sense, though (the same way that your barrier() didn't make any 
> > sense - I don't see that the barrier has absolutely _anything_ to do with 
> > whether the speculative getting of the page fails or not!)
> 
> When that fails, the caller can (almost) assume the pointer has changed.

Not relevant.

Yes, when it fails, the caller can obviously assume that the pointer has 
almost certainly changed, but that's neither here nor there - if the 
page_cache_get_speculative() fails, you mustn't use that pointer *whether* 
it has changed or not. So there's no point in even testing, and the code 
obviously doesn't.

> So it has to load the new pointer to continue. The object pointed to is
> not protected with RCU, nor is there a requirement to see a specific
> load execution ordering. 

Either the value can change, or it can not. It's that simple.

If it cannot change, then we can load it just once, or we can load it 
multiple times, and it won't matter. Barriers won't do anything but screw 
up the code.

If it can change from under us, you need to use rcu_dereference(), or 
open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
like the test for the RADIX_TREE_INDIRECT_PTR bit.

And that was the fundamental problem.

And once you fix that fundamental problem, your barrier no longer makes 
any sense, because the barrier HAS NOTHING TO DO WITH WHETHER 
page_cache_get_speculative() fails or not!

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 18:44                       ` Linus Torvalds
  (?)
@ 2009-01-05 19:39                         ` Linus Torvalds
  -1 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 19:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Linus Torvalds wrote:
> 
> Either the value can change, or it can not. It's that simple.
> 
> If it cannot change, then we can load it just once, or we can load it 
> multiple times, and it won't matter. Barriers won't do anything but screw 
> up the code.
> 
> If it can change from under us, you need to use rcu_dereference(), or 
> open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
> of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
> like the test for the RADIX_TREE_INDIRECT_PTR bit.
> 
> And that was the fundamental problem.

Btw, this is the real issue with anything that does "locking vs 
optimistic" accesses.

If you use locking, then by definition (if you did things right), the 
values you are working with do not change. As a result, it doesn't matter 
if the compiler re-orders accesses, splits them up, or coalesces them. 
It's why normal code should never need barriers, because it doesn't matter 
whether some access gets optimized away or gets done multiple times.

But whenever you use an optimistic algorithm, and the data may change 
under you, you need to use barriers or other things to limit the things 
the CPU and/or compiler does.

And yes, "rcu_dereference()" is one such thing - it's not a barrier in the 
sense that it doesn't necessarily affect ordering of accesses to other 
variables around it (although the read_barrier_depends() obviously _is_ a 
very special kind of ordering wrt the pointer itself on alpha). But it 
does make sure that the compiler at least does not coalesce - or split - 
that _one_ particular access.

It's true that it has "rcu" in its name, and it's also true that that may 
be a bit misleading in that it's very much useful not just for rcu, but 
for _any_ algorithm that depends on rcu-like behavior - ie optimistic 
accesses to data that may change underneath it. RCU is just the most 
commonly used (and perhaps best codified) variant of that kind of code.

			Linus

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 19:39                         ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 19:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable



On Mon, 5 Jan 2009, Linus Torvalds wrote:
> 
> Either the value can change, or it can not. It's that simple.
> 
> If it cannot change, then we can load it just once, or we can load it 
> multiple times, and it won't matter. Barriers won't do anything but screw 
> up the code.
> 
> If it can change from under us, you need to use rcu_dereference(), or 
> open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
> of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
> like the test for the RADIX_TREE_INDIRECT_PTR bit.
> 
> And that was the fundamental problem.

Btw, this is the real issue with anything that does "locking vs 
optimistic" accesses.

If you use locking, then by definition (if you did things right), the 
values you are working with do not change. As a result, it doesn't matter 
if the compiler re-orders accesses, splits them up, or coalesces them. 
It's why normal code should never need barriers, because it doesn't matter 
whether some access gets optimized away or gets done multiple times.

But whenever you use an optimistic algorithm, and the data may change 
under you, you need to use barriers or other things to limit the things 
the CPU and/or compiler does.

And yes, "rcu_dereference()" is one such thing - it's not a barrier in the 
sense that it doesn't necessarily affect ordering of accesses to other 
variables around it (although the read_barrier_depends() obviously _is_ a 
very special kind of ordering wrt the pointer itself on alpha). But it 
does make sure that the compiler at least does not coalesce - or split - 
that _one_ particular access.

It's true that it has "rcu" in its name, and it's also true that that may 
be a bit misleading in that it's very much useful not just for rcu, but 
for _any_ algorithm that depends on rcu-like behavior - ie optimistic 
accesses to data that may change underneath it. RCU is just the most 
commonly used (and perhaps best codified) variant of that kind of code.

			Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 19:39                         ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 19:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Linus Torvalds wrote:
> 
> Either the value can change, or it can not. It's that simple.
> 
> If it cannot change, then we can load it just once, or we can load it 
> multiple times, and it won't matter. Barriers won't do anything but screw 
> up the code.
> 
> If it can change from under us, you need to use rcu_dereference(), or 
> open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
> of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
> like the test for the RADIX_TREE_INDIRECT_PTR bit.
> 
> And that was the fundamental problem.

Btw, this is the real issue with anything that does "locking vs 
optimistic" accesses.

If you use locking, then by definition (if you did things right), the 
values you are working with do not change. As a result, it doesn't matter 
if the compiler re-orders accesses, splits them up, or coalesces them. 
It's why normal code should never need barriers, because it doesn't matter 
whether some access gets optimized away or gets done multiple times.

But whenever you use an optimistic algorithm, and the data may change 
under you, you need to use barriers or other things to limit the things 
the CPU and/or compiler does.

And yes, "rcu_dereference()" is one such thing - it's not a barrier in the 
sense that it doesn't necessarily affect ordering of accesses to other 
variables around it (although the read_barrier_depends() obviously _is_ a 
very special kind of ordering wrt the pointer itself on alpha). But it 
does make sure that the compiler at least does not coalesce - or split - 
that _one_ particular access.

It's true that it has "rcu" in its name, and it's also true that that may 
be a bit misleading in that it's very much useful not just for rcu, but 
for _any_ algorithm that depends on rcu-like behavior - ie optimistic 
accesses to data that may change underneath it. RCU is just the most 
commonly used (and perhaps best codified) variant of that kind of code.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 18:44                       ` Linus Torvalds
  (?)
@ 2009-01-05 20:12                         ` Paul E. McKenney
  -1 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > Putting an rcu_dereference there might work, but I think it misses a 
> > subtlety of this code.
> 
> No, _you_ miss the subtlety of something that can change under you.
> 
> Look at radix_tree_deref_slot(), and realize that without the 
> rcu_dereference(), the compiler would actually be allowed to think that it 
> can re-load anything from *pslot several times. So without my one-liner 
> patch, the compiler can actually do this:
> 
> 	register = load_from_memory(pslot)
> 	if (radix_tree_is_indirect_ptr(register))
> 		goto fail:
> 	return load_from_memory(pslot);
> 
>    fail:
> 	return RADIX_TREE_RETRY;

My guess is that Nick believes that the value in *pslot cannot change
in such as way as to cause radix_tree_is_indirect_ptr()'s return value
to change within a given RCU grace period, and that Linus disagrees.

Whatever the answer, I would argue for -at- -least- a comment explaining
why it is safe.  I am not seeing the objection to rcu_dereference(), but
I must confess that it has been awhile since I have looked closely at
the radix_tree code.  :-/

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 20:12                         ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable

On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > Putting an rcu_dereference there might work, but I think it misses a 
> > subtlety of this code.
> 
> No, _you_ miss the subtlety of something that can change under you.
> 
> Look at radix_tree_deref_slot(), and realize that without the 
> rcu_dereference(), the compiler would actually be allowed to think that it 
> can re-load anything from *pslot several times. So without my one-liner 
> patch, the compiler can actually do this:
> 
> 	register = load_from_memory(pslot)
> 	if (radix_tree_is_indirect_ptr(register))
> 		goto fail:
> 	return load_from_memory(pslot);
> 
>    fail:
> 	return RADIX_TREE_RETRY;

My guess is that Nick believes that the value in *pslot cannot change
in such as way as to cause radix_tree_is_indirect_ptr()'s return value
to change within a given RCU grace period, and that Linus disagrees.

Whatever the answer, I would argue for -at- -least- a comment explaining
why it is safe.  I am not seeing the objection to rcu_dereference(), but
I must confess that it has been awhile since I have looked closely at
the radix_tree code.  :-/

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 20:12                         ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > Putting an rcu_dereference there might work, but I think it misses a 
> > subtlety of this code.
> 
> No, _you_ miss the subtlety of something that can change under you.
> 
> Look at radix_tree_deref_slot(), and realize that without the 
> rcu_dereference(), the compiler would actually be allowed to think that it 
> can re-load anything from *pslot several times. So without my one-liner 
> patch, the compiler can actually do this:
> 
> 	register = load_from_memory(pslot)
> 	if (radix_tree_is_indirect_ptr(register))
> 		goto fail:
> 	return load_from_memory(pslot);
> 
>    fail:
> 	return RADIX_TREE_RETRY;

My guess is that Nick believes that the value in *pslot cannot change
in such as way as to cause radix_tree_is_indirect_ptr()'s return value
to change within a given RCU grace period, and that Linus disagrees.

Whatever the answer, I would argue for -at- -least- a comment explaining
why it is safe.  I am not seeing the objection to rcu_dereference(), but
I must confess that it has been awhile since I have looked closely at
the radix_tree code.  :-/

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 20:12                         ` Paul E. McKenney
  (?)
@ 2009-01-05 20:39                           ` Linus Torvalds
  -1 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 20:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> 
> My guess is that Nick believes that the value in *pslot cannot change
> in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> to change within a given RCU grace period, and that Linus disagrees.

Oh, it's entirely possible that there are some lifetime rules or others 
that make it impossible for things to go from "not indirect" -> 
"indirect". So if that was Nick's point, then I'm not "disagreeing" per 
se.

What I'm disagreeing about is that Nick apparently thinks that this is all 
subtle code, and as a result we should add barriers in some very 
non-obvious places.

While _I_ think that the problem isn't properly solved by barriers, but by 
just making the code less subtle. If the barrier only exists because of 
the reload issue, then the obvious solution - to me - is to just use what 
is already the proper accessor function that forces a nice reload. That 
way the compiler is forced to create code that does what the source 
clearly means it to do, regardless of any barriers at all.

Barriers in general should be the _last_ thing added. And if they are 
added, they should be added as deeply in the call-chain as possible, so 
that we don't need to add them in multiple call-sites. Again, using the 
rcu_dereference() approach seems to solve that issue too - rather than add 
three barriers in three different places, we just add the proper 
dereference in _one_ place.

> Whatever the answer, I would argue for -at- -least- a comment explaining
> why it is safe.  I am not seeing the objection to rcu_dereference(), but
> I must confess that it has been awhile since I have looked closely at
> the radix_tree code.  :-/

And I'm actually suprised that gcc can generate the problematic code in 
the first place. I'd expect that a "atomic_add_unless()" would always be 
at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
barrier.

But because we inline it, and because we allow gcc to see that it doesn't 
do anything if it gets just the right value from memory, I guess gcc ends 
up able to change the "for()" loop so that the first iteration can exit 
specially, and then for that case (and no other case) it can cache 
variables over the whole atomic_add_unless().

Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
says that the failure case doesn't contain any barriers is really _meant_ 
to be about the architecture-specific CPU barriers, not so much about 
something as simple as a compiler re-ordering. 

So while I think that we should use rcu_dereference() (regardless of any 
other issues), I _also_ think that part of the problem really is the 
excessive subtlety in the whole code, and the (obviously very surprising) 
fact that gcc could end up caching an unrelated memory load across that 
whole atomic op.

Maybe we should make atomics always imply a compiler barrier, even when 
they do not imply a memory barrier. The one exception would be the 
(special) case of "atomic_read()/atomic_set()", which don't really do any 
kind of complex operation at all, and where we really do want the compiler 
to be able to coalesce multiple atomic_reads() to a single one.

In contrast, there's no sense in allowing the compiler to coalesce a 
"atomic_add_unless()" with anything else. Making it a compiler barrier 
(possibly by uninlining it, or just adding a barrier to it) would also 
have avoided the whole subtle case - which is always a good thing.

				Linus

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 20:39                           ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 20:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nick Piggin, linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable



On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> 
> My guess is that Nick believes that the value in *pslot cannot change
> in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> to change within a given RCU grace period, and that Linus disagrees.

Oh, it's entirely possible that there are some lifetime rules or others 
that make it impossible for things to go from "not indirect" -> 
"indirect". So if that was Nick's point, then I'm not "disagreeing" per 
se.

What I'm disagreeing about is that Nick apparently thinks that this is all 
subtle code, and as a result we should add barriers in some very 
non-obvious places.

While _I_ think that the problem isn't properly solved by barriers, but by 
just making the code less subtle. If the barrier only exists because of 
the reload issue, then the obvious solution - to me - is to just use what 
is already the proper accessor function that forces a nice reload. That 
way the compiler is forced to create code that does what the source 
clearly means it to do, regardless of any barriers at all.

Barriers in general should be the _last_ thing added. And if they are 
added, they should be added as deeply in the call-chain as possible, so 
that we don't need to add them in multiple call-sites. Again, using the 
rcu_dereference() approach seems to solve that issue too - rather than add 
three barriers in three different places, we just add the proper 
dereference in _one_ place.

> Whatever the answer, I would argue for -at- -least- a comment explaining
> why it is safe.  I am not seeing the objection to rcu_dereference(), but
> I must confess that it has been awhile since I have looked closely at
> the radix_tree code.  :-/

And I'm actually suprised that gcc can generate the problematic code in 
the first place. I'd expect that a "atomic_add_unless()" would always be 
at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
barrier.

But because we inline it, and because we allow gcc to see that it doesn't 
do anything if it gets just the right value from memory, I guess gcc ends 
up able to change the "for()" loop so that the first iteration can exit 
specially, and then for that case (and no other case) it can cache 
variables over the whole atomic_add_unless().

Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
says that the failure case doesn't contain any barriers is really _meant_ 
to be about the architecture-specific CPU barriers, not so much about 
something as simple as a compiler re-ordering. 

So while I think that we should use rcu_dereference() (regardless of any 
other issues), I _also_ think that part of the problem really is the 
excessive subtlety in the whole code, and the (obviously very surprising) 
fact that gcc could end up caching an unrelated memory load across that 
whole atomic op.

Maybe we should make atomics always imply a compiler barrier, even when 
they do not imply a memory barrier. The one exception would be the 
(special) case of "atomic_read()/atomic_set()", which don't really do any 
kind of complex operation at all, and where we really do want the compiler 
to be able to coalesce multiple atomic_reads() to a single one.

In contrast, there's no sense in allowing the compiler to coalesce a 
"atomic_add_unless()" with anything else. Making it a compiler barrier 
(possibly by uninlining it, or just adding a barrier to it) would also 
have avoided the whole subtle case - which is always a good thing.

				Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 20:39                           ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-05 20:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton



On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> 
> My guess is that Nick believes that the value in *pslot cannot change
> in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> to change within a given RCU grace period, and that Linus disagrees.

Oh, it's entirely possible that there are some lifetime rules or others 
that make it impossible for things to go from "not indirect" -> 
"indirect". So if that was Nick's point, then I'm not "disagreeing" per 
se.

What I'm disagreeing about is that Nick apparently thinks that this is all 
subtle code, and as a result we should add barriers in some very 
non-obvious places.

While _I_ think that the problem isn't properly solved by barriers, but by 
just making the code less subtle. If the barrier only exists because of 
the reload issue, then the obvious solution - to me - is to just use what 
is already the proper accessor function that forces a nice reload. That 
way the compiler is forced to create code that does what the source 
clearly means it to do, regardless of any barriers at all.

Barriers in general should be the _last_ thing added. And if they are 
added, they should be added as deeply in the call-chain as possible, so 
that we don't need to add them in multiple call-sites. Again, using the 
rcu_dereference() approach seems to solve that issue too - rather than add 
three barriers in three different places, we just add the proper 
dereference in _one_ place.

> Whatever the answer, I would argue for -at- -least- a comment explaining
> why it is safe.  I am not seeing the objection to rcu_dereference(), but
> I must confess that it has been awhile since I have looked closely at
> the radix_tree code.  :-/

And I'm actually suprised that gcc can generate the problematic code in 
the first place. I'd expect that a "atomic_add_unless()" would always be 
at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
barrier.

But because we inline it, and because we allow gcc to see that it doesn't 
do anything if it gets just the right value from memory, I guess gcc ends 
up able to change the "for()" loop so that the first iteration can exit 
specially, and then for that case (and no other case) it can cache 
variables over the whole atomic_add_unless().

Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
says that the failure case doesn't contain any barriers is really _meant_ 
to be about the architecture-specific CPU barriers, not so much about 
something as simple as a compiler re-ordering. 

So while I think that we should use rcu_dereference() (regardless of any 
other issues), I _also_ think that part of the problem really is the 
excessive subtlety in the whole code, and the (obviously very surprising) 
fact that gcc could end up caching an unrelated memory load across that 
whole atomic op.

Maybe we should make atomics always imply a compiler barrier, even when 
they do not imply a memory barrier. The one exception would be the 
(special) case of "atomic_read()/atomic_set()", which don't really do any 
kind of complex operation at all, and where we really do want the compiler 
to be able to coalesce multiple atomic_reads() to a single one.

In contrast, there's no sense in allowing the compiler to coalesce a 
"atomic_add_unless()" with anything else. Making it a compiler barrier 
(possibly by uninlining it, or just adding a barrier to it) would also 
have avoided the whole subtle case - which is always a good thing.

				Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re:
  2009-01-05 20:12                         ` Paul E. McKenney
  (?)
@ 2009-01-05 21:04                           ` Peter Zijlstra
  -1 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2009-01-05 21:04 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Nick Piggin, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Mon, 2009-01-05 at 12:12 -0800, Paul E. McKenney wrote:
> On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> > On Mon, 5 Jan 2009, Nick Piggin wrote:
> > > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > > Putting an rcu_dereference there might work, but I think it misses a 
> > > subtlety of this code.
> > 
> > No, _you_ miss the subtlety of something that can change under you.
> > 
> > Look at radix_tree_deref_slot(), and realize that without the 
> > rcu_dereference(), the compiler would actually be allowed to think that it 
> > can re-load anything from *pslot several times. So without my one-liner 
> > patch, the compiler can actually do this:
> > 
> > 	register = load_from_memory(pslot)
> > 	if (radix_tree_is_indirect_ptr(register))
> > 		goto fail:
> > 	return load_from_memory(pslot);
> > 
> >    fail:
> > 	return RADIX_TREE_RETRY;
> 
> My guess is that Nick believes that the value in *pslot cannot change
> in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> to change within a given RCU grace period, and that Linus disagrees.

Nick's belief would indeed be true IFF all modifying ops including all
uses of radix_tree_replace_slot() are serialized wrt. each other.

However, since radix_tree_deref_slot() is the counterpart of
radix_tree_replace_slot(), one would indeed expect rcu_dereference()
therein, much like Linus suggests.

While what Nick says is true, the lifetime management of the data
objects is arranged externally from the radix tree -- I still think we
need the rcu_dereference() even for that argument, as we want to support
RCU lifetime management as well.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re:
@ 2009-01-05 21:04                           ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2009-01-05 21:04 UTC (permalink / raw)
  To: paulmck
  Cc: Nick Piggin, linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz,
	Linus Torvalds, stable

On Mon, 2009-01-05 at 12:12 -0800, Paul E. McKenney wrote:
> On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> > On Mon, 5 Jan 2009, Nick Piggin wrote:
> > > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > > Putting an rcu_dereference there might work, but I think it misses a 
> > > subtlety of this code.
> > 
> > No, _you_ miss the subtlety of something that can change under you.
> > 
> > Look at radix_tree_deref_slot(), and realize that without the 
> > rcu_dereference(), the compiler would actually be allowed to think that it 
> > can re-load anything from *pslot several times. So without my one-liner 
> > patch, the compiler can actually do this:
> > 
> > 	register = load_from_memory(pslot)
> > 	if (radix_tree_is_indirect_ptr(register))
> > 		goto fail:
> > 	return load_from_memory(pslot);
> > 
> >    fail:
> > 	return RADIX_TREE_RETRY;
> 
> My guess is that Nick believes that the value in *pslot cannot change
> in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> to change within a given RCU grace period, and that Linus disagrees.

Nick's belief would indeed be true IFF all modifying ops including all
uses of radix_tree_replace_slot() are serialized wrt. each other.

However, since radix_tree_deref_slot() is the counterpart of
radix_tree_replace_slot(), one would indeed expect rcu_dereference()
therein, much like Linus suggests.

While what Nick says is true, the lifetime management of the data
objects is arranged externally from the radix tree -- I still think we
need the rcu_dereference() even for that argument, as we want to support
RCU lifetime management as well.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re:
@ 2009-01-05 21:04                           ` Peter Zijlstra
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Zijlstra @ 2009-01-05 21:04 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Nick Piggin, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Mon, 2009-01-05 at 12:12 -0800, Paul E. McKenney wrote:
> On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> > On Mon, 5 Jan 2009, Nick Piggin wrote:
> > > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > > Putting an rcu_dereference there might work, but I think it misses a 
> > > subtlety of this code.
> > 
> > No, _you_ miss the subtlety of something that can change under you.
> > 
> > Look at radix_tree_deref_slot(), and realize that without the 
> > rcu_dereference(), the compiler would actually be allowed to think that it 
> > can re-load anything from *pslot several times. So without my one-liner 
> > patch, the compiler can actually do this:
> > 
> > 	register = load_from_memory(pslot)
> > 	if (radix_tree_is_indirect_ptr(register))
> > 		goto fail:
> > 	return load_from_memory(pslot);
> > 
> >    fail:
> > 	return RADIX_TREE_RETRY;
> 
> My guess is that Nick believes that the value in *pslot cannot change
> in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> to change within a given RCU grace period, and that Linus disagrees.

Nick's belief would indeed be true IFF all modifying ops including all
uses of radix_tree_replace_slot() are serialized wrt. each other.

However, since radix_tree_deref_slot() is the counterpart of
radix_tree_replace_slot(), one would indeed expect rcu_dereference()
therein, much like Linus suggests.

While what Nick says is true, the lifetime management of the data
objects is arranged externally from the radix tree -- I still think we
need the rcu_dereference() even for that argument, as we want to support
RCU lifetime management as well.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 20:39                           ` Linus Torvalds
  (?)
@ 2009-01-05 21:57                             ` Paul E. McKenney
  -1 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > 
> > My guess is that Nick believes that the value in *pslot cannot change
> > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > to change within a given RCU grace period, and that Linus disagrees.
> 
> Oh, it's entirely possible that there are some lifetime rules or others 
> that make it impossible for things to go from "not indirect" -> 
> "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> se.
> 
> What I'm disagreeing about is that Nick apparently thinks that this is all 
> subtle code, and as a result we should add barriers in some very 
> non-obvious places.
> 
> While _I_ think that the problem isn't properly solved by barriers, but by 
> just making the code less subtle. If the barrier only exists because of 
> the reload issue, then the obvious solution - to me - is to just use what 
> is already the proper accessor function that forces a nice reload. That 
> way the compiler is forced to create code that does what the source 
> clearly means it to do, regardless of any barriers at all.
> 
> Barriers in general should be the _last_ thing added. And if they are 
> added, they should be added as deeply in the call-chain as possible, so 
> that we don't need to add them in multiple call-sites. Again, using the 
> rcu_dereference() approach seems to solve that issue too - rather than add 
> three barriers in three different places, we just add the proper 
> dereference in _one_ place.

I don't have any argument with this line of reasoning, and am myself a bit
puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
Then again, I don't claim to fully understand what he is trying to do.

> > Whatever the answer, I would argue for -at- -least- a comment explaining
> > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > I must confess that it has been awhile since I have looked closely at
> > the radix_tree code.  :-/
> 
> And I'm actually suprised that gcc can generate the problematic code in 
> the first place. I'd expect that a "atomic_add_unless()" would always be 
> at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> barrier.
> 
> But because we inline it, and because we allow gcc to see that it doesn't 
> do anything if it gets just the right value from memory, I guess gcc ends 
> up able to change the "for()" loop so that the first iteration can exit 
> specially, and then for that case (and no other case) it can cache 
> variables over the whole atomic_add_unless().
> 
> Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> says that the failure case doesn't contain any barriers is really _meant_ 
> to be about the architecture-specific CPU barriers, not so much about 
> something as simple as a compiler re-ordering. 
> 
> So while I think that we should use rcu_dereference() (regardless of any 
> other issues), I _also_ think that part of the problem really is the 
> excessive subtlety in the whole code, and the (obviously very surprising) 
> fact that gcc could end up caching an unrelated memory load across that 
> whole atomic op.
> 
> Maybe we should make atomics always imply a compiler barrier, even when 
> they do not imply a memory barrier. The one exception would be the 
> (special) case of "atomic_read()/atomic_set()", which don't really do any 
> kind of complex operation at all, and where we really do want the compiler 
> to be able to coalesce multiple atomic_reads() to a single one.
> 
> In contrast, there's no sense in allowing the compiler to coalesce a 
> "atomic_add_unless()" with anything else. Making it a compiler barrier 
> (possibly by uninlining it, or just adding a barrier to it) would also 
> have avoided the whole subtle case - which is always a good thing.

That makes a lot of sense to me!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 21:57                             ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable

On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > 
> > My guess is that Nick believes that the value in *pslot cannot change
> > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > to change within a given RCU grace period, and that Linus disagrees.
> 
> Oh, it's entirely possible that there are some lifetime rules or others 
> that make it impossible for things to go from "not indirect" -> 
> "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> se.
> 
> What I'm disagreeing about is that Nick apparently thinks that this is all 
> subtle code, and as a result we should add barriers in some very 
> non-obvious places.
> 
> While _I_ think that the problem isn't properly solved by barriers, but by 
> just making the code less subtle. If the barrier only exists because of 
> the reload issue, then the obvious solution - to me - is to just use what 
> is already the proper accessor function that forces a nice reload. That 
> way the compiler is forced to create code that does what the source 
> clearly means it to do, regardless of any barriers at all.
> 
> Barriers in general should be the _last_ thing added. And if they are 
> added, they should be added as deeply in the call-chain as possible, so 
> that we don't need to add them in multiple call-sites. Again, using the 
> rcu_dereference() approach seems to solve that issue too - rather than add 
> three barriers in three different places, we just add the proper 
> dereference in _one_ place.

I don't have any argument with this line of reasoning, and am myself a bit
puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
Then again, I don't claim to fully understand what he is trying to do.

> > Whatever the answer, I would argue for -at- -least- a comment explaining
> > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > I must confess that it has been awhile since I have looked closely at
> > the radix_tree code.  :-/
> 
> And I'm actually suprised that gcc can generate the problematic code in 
> the first place. I'd expect that a "atomic_add_unless()" would always be 
> at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> barrier.
> 
> But because we inline it, and because we allow gcc to see that it doesn't 
> do anything if it gets just the right value from memory, I guess gcc ends 
> up able to change the "for()" loop so that the first iteration can exit 
> specially, and then for that case (and no other case) it can cache 
> variables over the whole atomic_add_unless().
> 
> Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> says that the failure case doesn't contain any barriers is really _meant_ 
> to be about the architecture-specific CPU barriers, not so much about 
> something as simple as a compiler re-ordering. 
> 
> So while I think that we should use rcu_dereference() (regardless of any 
> other issues), I _also_ think that part of the problem really is the 
> excessive subtlety in the whole code, and the (obviously very surprising) 
> fact that gcc could end up caching an unrelated memory load across that 
> whole atomic op.
> 
> Maybe we should make atomics always imply a compiler barrier, even when 
> they do not imply a memory barrier. The one exception would be the 
> (special) case of "atomic_read()/atomic_set()", which don't really do any 
> kind of complex operation at all, and where we really do want the compiler 
> to be able to coalesce multiple atomic_reads() to a single one.
> 
> In contrast, there's no sense in allowing the compiler to coalesce a 
> "atomic_add_unless()" with anything else. Making it a compiler barrier 
> (possibly by uninlining it, or just adding a barrier to it) would also 
> have avoided the whole subtle case - which is always a good thing.

That makes a lot of sense to me!

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-05 21:57                             ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > 
> > My guess is that Nick believes that the value in *pslot cannot change
> > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > to change within a given RCU grace period, and that Linus disagrees.
> 
> Oh, it's entirely possible that there are some lifetime rules or others 
> that make it impossible for things to go from "not indirect" -> 
> "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> se.
> 
> What I'm disagreeing about is that Nick apparently thinks that this is all 
> subtle code, and as a result we should add barriers in some very 
> non-obvious places.
> 
> While _I_ think that the problem isn't properly solved by barriers, but by 
> just making the code less subtle. If the barrier only exists because of 
> the reload issue, then the obvious solution - to me - is to just use what 
> is already the proper accessor function that forces a nice reload. That 
> way the compiler is forced to create code that does what the source 
> clearly means it to do, regardless of any barriers at all.
> 
> Barriers in general should be the _last_ thing added. And if they are 
> added, they should be added as deeply in the call-chain as possible, so 
> that we don't need to add them in multiple call-sites. Again, using the 
> rcu_dereference() approach seems to solve that issue too - rather than add 
> three barriers in three different places, we just add the proper 
> dereference in _one_ place.

I don't have any argument with this line of reasoning, and am myself a bit
puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
Then again, I don't claim to fully understand what he is trying to do.

> > Whatever the answer, I would argue for -at- -least- a comment explaining
> > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > I must confess that it has been awhile since I have looked closely at
> > the radix_tree code.  :-/
> 
> And I'm actually suprised that gcc can generate the problematic code in 
> the first place. I'd expect that a "atomic_add_unless()" would always be 
> at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> barrier.
> 
> But because we inline it, and because we allow gcc to see that it doesn't 
> do anything if it gets just the right value from memory, I guess gcc ends 
> up able to change the "for()" loop so that the first iteration can exit 
> specially, and then for that case (and no other case) it can cache 
> variables over the whole atomic_add_unless().
> 
> Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> says that the failure case doesn't contain any barriers is really _meant_ 
> to be about the architecture-specific CPU barriers, not so much about 
> something as simple as a compiler re-ordering. 
> 
> So while I think that we should use rcu_dereference() (regardless of any 
> other issues), I _also_ think that part of the problem really is the 
> excessive subtlety in the whole code, and the (obviously very surprising) 
> fact that gcc could end up caching an unrelated memory load across that 
> whole atomic op.
> 
> Maybe we should make atomics always imply a compiler barrier, even when 
> they do not imply a memory barrier. The one exception would be the 
> (special) case of "atomic_read()/atomic_set()", which don't really do any 
> kind of complex operation at all, and where we really do want the compiler 
> to be able to coalesce multiple atomic_reads() to a single one.
> 
> In contrast, there's no sense in allowing the compiler to coalesce a 
> "atomic_add_unless()" with anything else. Making it a compiler barrier 
> (possibly by uninlining it, or just adding a barrier to it) would also 
> have avoided the whole subtle case - which is always a good thing.

That makes a lot of sense to me!

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re:
  2009-01-05 21:04                           ` Peter Zijlstra
  (?)
@ 2009-01-05 21:58                             ` Paul E. McKenney
  -1 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Mon, Jan 05, 2009 at 10:04:51PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-05 at 12:12 -0800, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> > > On Mon, 5 Jan 2009, Nick Piggin wrote:
> > > > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > > > Putting an rcu_dereference there might work, but I think it misses a 
> > > > subtlety of this code.
> > > 
> > > No, _you_ miss the subtlety of something that can change under you.
> > > 
> > > Look at radix_tree_deref_slot(), and realize that without the 
> > > rcu_dereference(), the compiler would actually be allowed to think that it 
> > > can re-load anything from *pslot several times. So without my one-liner 
> > > patch, the compiler can actually do this:
> > > 
> > > 	register = load_from_memory(pslot)
> > > 	if (radix_tree_is_indirect_ptr(register))
> > > 		goto fail:
> > > 	return load_from_memory(pslot);
> > > 
> > >    fail:
> > > 	return RADIX_TREE_RETRY;
> > 
> > My guess is that Nick believes that the value in *pslot cannot change
> > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > to change within a given RCU grace period, and that Linus disagrees.
> 
> Nick's belief would indeed be true IFF all modifying ops including all
> uses of radix_tree_replace_slot() are serialized wrt. each other.
> 
> However, since radix_tree_deref_slot() is the counterpart of
> radix_tree_replace_slot(), one would indeed expect rcu_dereference()
> therein, much like Linus suggests.
> 
> While what Nick says is true, the lifetime management of the data
> objects is arranged externally from the radix tree -- I still think we
> need the rcu_dereference() even for that argument, as we want to support
> RCU lifetime management as well.

Makes sense to me!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re:
@ 2009-01-05 21:58                             ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz,
	Linus Torvalds, stable

On Mon, Jan 05, 2009 at 10:04:51PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-05 at 12:12 -0800, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> > > On Mon, 5 Jan 2009, Nick Piggin wrote:
> > > > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > > > Putting an rcu_dereference there might work, but I think it misses a 
> > > > subtlety of this code.
> > > 
> > > No, _you_ miss the subtlety of something that can change under you.
> > > 
> > > Look at radix_tree_deref_slot(), and realize that without the 
> > > rcu_dereference(), the compiler would actually be allowed to think that it 
> > > can re-load anything from *pslot several times. So without my one-liner 
> > > patch, the compiler can actually do this:
> > > 
> > > 	register = load_from_memory(pslot)
> > > 	if (radix_tree_is_indirect_ptr(register))
> > > 		goto fail:
> > > 	return load_from_memory(pslot);
> > > 
> > >    fail:
> > > 	return RADIX_TREE_RETRY;
> > 
> > My guess is that Nick believes that the value in *pslot cannot change
> > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > to change within a given RCU grace period, and that Linus disagrees.
> 
> Nick's belief would indeed be true IFF all modifying ops including all
> uses of radix_tree_replace_slot() are serialized wrt. each other.
> 
> However, since radix_tree_deref_slot() is the counterpart of
> radix_tree_replace_slot(), one would indeed expect rcu_dereference()
> therein, much like Linus suggests.
> 
> While what Nick says is true, the lifetime management of the data
> objects is arranged externally from the radix tree -- I still think we
> need the rcu_dereference() even for that argument, as we want to support
> RCU lifetime management as well.

Makes sense to me!

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re:
@ 2009-01-05 21:58                             ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-05 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Nick Piggin, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Mon, Jan 05, 2009 at 10:04:51PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-05 at 12:12 -0800, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2009 at 10:44:27AM -0800, Linus Torvalds wrote:
> > > On Mon, 5 Jan 2009, Nick Piggin wrote:
> > > > On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> > > > Putting an rcu_dereference there might work, but I think it misses a 
> > > > subtlety of this code.
> > > 
> > > No, _you_ miss the subtlety of something that can change under you.
> > > 
> > > Look at radix_tree_deref_slot(), and realize that without the 
> > > rcu_dereference(), the compiler would actually be allowed to think that it 
> > > can re-load anything from *pslot several times. So without my one-liner 
> > > patch, the compiler can actually do this:
> > > 
> > > 	register = load_from_memory(pslot)
> > > 	if (radix_tree_is_indirect_ptr(register))
> > > 		goto fail:
> > > 	return load_from_memory(pslot);
> > > 
> > >    fail:
> > > 	return RADIX_TREE_RETRY;
> > 
> > My guess is that Nick believes that the value in *pslot cannot change
> > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > to change within a given RCU grace period, and that Linus disagrees.
> 
> Nick's belief would indeed be true IFF all modifying ops including all
> uses of radix_tree_replace_slot() are serialized wrt. each other.
> 
> However, since radix_tree_deref_slot() is the counterpart of
> radix_tree_replace_slot(), one would indeed expect rcu_dereference()
> therein, much like Linus suggests.
> 
> While what Nick says is true, the lifetime management of the data
> objects is arranged externally from the radix tree -- I still think we
> need the rcu_dereference() even for that argument, as we want to support
> RCU lifetime management as well.

Makes sense to me!

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 21:57                             ` Paul E. McKenney
  (?)
@ 2009-01-06  2:05                               ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-06  2:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Mon, Jan 05, 2009 at 01:57:27PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > > 
> > > My guess is that Nick believes that the value in *pslot cannot change
> > > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > > to change within a given RCU grace period, and that Linus disagrees.
> > 
> > Oh, it's entirely possible that there are some lifetime rules or others 
> > that make it impossible for things to go from "not indirect" -> 
> > "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> > se.
> > 
> > What I'm disagreeing about is that Nick apparently thinks that this is all 
> > subtle code, and as a result we should add barriers in some very 
> > non-obvious places.
> > 
> > While _I_ think that the problem isn't properly solved by barriers, but by 
> > just making the code less subtle. If the barrier only exists because of 
> > the reload issue, then the obvious solution - to me - is to just use what 
> > is already the proper accessor function that forces a nice reload. That 
> > way the compiler is forced to create code that does what the source 
> > clearly means it to do, regardless of any barriers at all.
> > 
> > Barriers in general should be the _last_ thing added. And if they are 
> > added, they should be added as deeply in the call-chain as possible, so 
> > that we don't need to add them in multiple call-sites. Again, using the 
> > rcu_dereference() approach seems to solve that issue too - rather than add 
> > three barriers in three different places, we just add the proper 
> > dereference in _one_ place.
> 
> I don't have any argument with this line of reasoning, and am myself a bit
> puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
> Then again, I don't claim to fully understand what he is trying to do.

OK, granted I do need the ACCESS_ONCE. It is loading a pointer who's target
can be changed concurrently with the rcu algorithm. The rcu_derefernce
thing kind of set me thinking down the wrong track, because the object of the
pointer it loads is not RCU protected and doesn't need the memory barrier
(on alpha).

But... RCU radix tree is not only used for the pagecache, so it's probably not
worth complicating things to seperate out those two cases. rcu_dereference
might be the best fit.


> > > Whatever the answer, I would argue for -at- -least- a comment explaining
> > > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > > I must confess that it has been awhile since I have looked closely at
> > > the radix_tree code.  :-/
> > 
> > And I'm actually suprised that gcc can generate the problematic code in 
> > the first place. I'd expect that a "atomic_add_unless()" would always be 
> > at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> > barrier.
> > 
> > But because we inline it, and because we allow gcc to see that it doesn't 
> > do anything if it gets just the right value from memory, I guess gcc ends 
> > up able to change the "for()" loop so that the first iteration can exit 
> > specially, and then for that case (and no other case) it can cache 
> > variables over the whole atomic_add_unless().
> > 
> > Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> > says that the failure case doesn't contain any barriers is really _meant_ 
> > to be about the architecture-specific CPU barriers, not so much about 
> > something as simple as a compiler re-ordering. 
> > 
> > So while I think that we should use rcu_dereference() (regardless of any 
> > other issues), I _also_ think that part of the problem really is the 
> > excessive subtlety in the whole code, and the (obviously very surprising) 
> > fact that gcc could end up caching an unrelated memory load across that 
> > whole atomic op.
> > 
> > Maybe we should make atomics always imply a compiler barrier, even when 
> > they do not imply a memory barrier. The one exception would be the 
> > (special) case of "atomic_read()/atomic_set()", which don't really do any 
> > kind of complex operation at all, and where we really do want the compiler 
> > to be able to coalesce multiple atomic_reads() to a single one.
> > 
> > In contrast, there's no sense in allowing the compiler to coalesce a 
> > "atomic_add_unless()" with anything else. Making it a compiler barrier 
> > (possibly by uninlining it, or just adding a barrier to it) would also 
> > have avoided the whole subtle case - which is always a good thing.
> 
> That makes a lot of sense to me!

It would have avoided one problem (the same one my patch did). But it
doesn't solve the problem of the missing ACCESS_ONCE allowing the
pointer to be reloaded from the slot pointer.

Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
assembly for me too, I grafted the changelog onto that. Linus probably
you are using -Os?

--
Subject: mm lockless pagecache barrier fix

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be detected if the page has
been reallocated. However it can result in a lockup. 

Linus points out that ACCESS_ONCE is also required in that pointer load, even
if it's absence is not causing a bug on our particular build. The most general
way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.

Assembly of find_get_pages,
before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/radix-tree.h |    2 +-
 mm/filemap.c               |   23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -136,7 +136,7 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = *pslot;
+	void *ret = rcu_dereference(*pslot);
 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
 		ret = RADIX_TREE_RETRY;
 	return ret;


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  2:05                               ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-06  2:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz,
	Linus Torvalds, stable

On Mon, Jan 05, 2009 at 01:57:27PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > > 
> > > My guess is that Nick believes that the value in *pslot cannot change
> > > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > > to change within a given RCU grace period, and that Linus disagrees.
> > 
> > Oh, it's entirely possible that there are some lifetime rules or others 
> > that make it impossible for things to go from "not indirect" -> 
> > "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> > se.
> > 
> > What I'm disagreeing about is that Nick apparently thinks that this is all 
> > subtle code, and as a result we should add barriers in some very 
> > non-obvious places.
> > 
> > While _I_ think that the problem isn't properly solved by barriers, but by 
> > just making the code less subtle. If the barrier only exists because of 
> > the reload issue, then the obvious solution - to me - is to just use what 
> > is already the proper accessor function that forces a nice reload. That 
> > way the compiler is forced to create code that does what the source 
> > clearly means it to do, regardless of any barriers at all.
> > 
> > Barriers in general should be the _last_ thing added. And if they are 
> > added, they should be added as deeply in the call-chain as possible, so 
> > that we don't need to add them in multiple call-sites. Again, using the 
> > rcu_dereference() approach seems to solve that issue too - rather than add 
> > three barriers in three different places, we just add the proper 
> > dereference in _one_ place.
> 
> I don't have any argument with this line of reasoning, and am myself a bit
> puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
> Then again, I don't claim to fully understand what he is trying to do.

OK, granted I do need the ACCESS_ONCE. It is loading a pointer who's target
can be changed concurrently with the rcu algorithm. The rcu_derefernce
thing kind of set me thinking down the wrong track, because the object of the
pointer it loads is not RCU protected and doesn't need the memory barrier
(on alpha).

But... RCU radix tree is not only used for the pagecache, so it's probably not
worth complicating things to seperate out those two cases. rcu_dereference
might be the best fit.


> > > Whatever the answer, I would argue for -at- -least- a comment explaining
> > > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > > I must confess that it has been awhile since I have looked closely at
> > > the radix_tree code.  :-/
> > 
> > And I'm actually suprised that gcc can generate the problematic code in 
> > the first place. I'd expect that a "atomic_add_unless()" would always be 
> > at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> > barrier.
> > 
> > But because we inline it, and because we allow gcc to see that it doesn't 
> > do anything if it gets just the right value from memory, I guess gcc ends 
> > up able to change the "for()" loop so that the first iteration can exit 
> > specially, and then for that case (and no other case) it can cache 
> > variables over the whole atomic_add_unless().
> > 
> > Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> > says that the failure case doesn't contain any barriers is really _meant_ 
> > to be about the architecture-specific CPU barriers, not so much about 
> > something as simple as a compiler re-ordering. 
> > 
> > So while I think that we should use rcu_dereference() (regardless of any 
> > other issues), I _also_ think that part of the problem really is the 
> > excessive subtlety in the whole code, and the (obviously very surprising) 
> > fact that gcc could end up caching an unrelated memory load across that 
> > whole atomic op.
> > 
> > Maybe we should make atomics always imply a compiler barrier, even when 
> > they do not imply a memory barrier. The one exception would be the 
> > (special) case of "atomic_read()/atomic_set()", which don't really do any 
> > kind of complex operation at all, and where we really do want the compiler 
> > to be able to coalesce multiple atomic_reads() to a single one.
> > 
> > In contrast, there's no sense in allowing the compiler to coalesce a 
> > "atomic_add_unless()" with anything else. Making it a compiler barrier 
> > (possibly by uninlining it, or just adding a barrier to it) would also 
> > have avoided the whole subtle case - which is always a good thing.
> 
> That makes a lot of sense to me!

It would have avoided one problem (the same one my patch did). But it
doesn't solve the problem of the missing ACCESS_ONCE allowing the
pointer to be reloaded from the slot pointer.

Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
assembly for me too, I grafted the changelog onto that. Linus probably
you are using -Os?

--
Subject: mm lockless pagecache barrier fix

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be detected if the page has
been reallocated. However it can result in a lockup. 

Linus points out that ACCESS_ONCE is also required in that pointer load, even
if it's absence is not causing a bug on our particular build. The most general
way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.

Assembly of find_get_pages,
before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/radix-tree.h |    2 +-
 mm/filemap.c               |   23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -136,7 +136,7 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = *pslot;
+	void *ret = rcu_dereference(*pslot);
 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
 		ret = RADIX_TREE_RETRY;
 	return ret;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  2:05                               ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-06  2:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Mon, Jan 05, 2009 at 01:57:27PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > > 
> > > My guess is that Nick believes that the value in *pslot cannot change
> > > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > > to change within a given RCU grace period, and that Linus disagrees.
> > 
> > Oh, it's entirely possible that there are some lifetime rules or others 
> > that make it impossible for things to go from "not indirect" -> 
> > "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> > se.
> > 
> > What I'm disagreeing about is that Nick apparently thinks that this is all 
> > subtle code, and as a result we should add barriers in some very 
> > non-obvious places.
> > 
> > While _I_ think that the problem isn't properly solved by barriers, but by 
> > just making the code less subtle. If the barrier only exists because of 
> > the reload issue, then the obvious solution - to me - is to just use what 
> > is already the proper accessor function that forces a nice reload. That 
> > way the compiler is forced to create code that does what the source 
> > clearly means it to do, regardless of any barriers at all.
> > 
> > Barriers in general should be the _last_ thing added. And if they are 
> > added, they should be added as deeply in the call-chain as possible, so 
> > that we don't need to add them in multiple call-sites. Again, using the 
> > rcu_dereference() approach seems to solve that issue too - rather than add 
> > three barriers in three different places, we just add the proper 
> > dereference in _one_ place.
> 
> I don't have any argument with this line of reasoning, and am myself a bit
> puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
> Then again, I don't claim to fully understand what he is trying to do.

OK, granted I do need the ACCESS_ONCE. It is loading a pointer who's target
can be changed concurrently with the rcu algorithm. The rcu_derefernce
thing kind of set me thinking down the wrong track, because the object of the
pointer it loads is not RCU protected and doesn't need the memory barrier
(on alpha).

But... RCU radix tree is not only used for the pagecache, so it's probably not
worth complicating things to seperate out those two cases. rcu_dereference
might be the best fit.


> > > Whatever the answer, I would argue for -at- -least- a comment explaining
> > > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > > I must confess that it has been awhile since I have looked closely at
> > > the radix_tree code.  :-/
> > 
> > And I'm actually suprised that gcc can generate the problematic code in 
> > the first place. I'd expect that a "atomic_add_unless()" would always be 
> > at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> > barrier.
> > 
> > But because we inline it, and because we allow gcc to see that it doesn't 
> > do anything if it gets just the right value from memory, I guess gcc ends 
> > up able to change the "for()" loop so that the first iteration can exit 
> > specially, and then for that case (and no other case) it can cache 
> > variables over the whole atomic_add_unless().
> > 
> > Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> > says that the failure case doesn't contain any barriers is really _meant_ 
> > to be about the architecture-specific CPU barriers, not so much about 
> > something as simple as a compiler re-ordering. 
> > 
> > So while I think that we should use rcu_dereference() (regardless of any 
> > other issues), I _also_ think that part of the problem really is the 
> > excessive subtlety in the whole code, and the (obviously very surprising) 
> > fact that gcc could end up caching an unrelated memory load across that 
> > whole atomic op.
> > 
> > Maybe we should make atomics always imply a compiler barrier, even when 
> > they do not imply a memory barrier. The one exception would be the 
> > (special) case of "atomic_read()/atomic_set()", which don't really do any 
> > kind of complex operation at all, and where we really do want the compiler 
> > to be able to coalesce multiple atomic_reads() to a single one.
> > 
> > In contrast, there's no sense in allowing the compiler to coalesce a 
> > "atomic_add_unless()" with anything else. Making it a compiler barrier 
> > (possibly by uninlining it, or just adding a barrier to it) would also 
> > have avoided the whole subtle case - which is always a good thing.
> 
> That makes a lot of sense to me!

It would have avoided one problem (the same one my patch did). But it
doesn't solve the problem of the missing ACCESS_ONCE allowing the
pointer to be reloaded from the slot pointer.

Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
assembly for me too, I grafted the changelog onto that. Linus probably
you are using -Os?

--
Subject: mm lockless pagecache barrier fix

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to use a cached value of struct page pointer to
retry the operation with, rather than reload it. So the page might have been
removed from pagecache and freed (refcount==0) but the lookup would not correctly
notice the page is no longer in pagecache, and keep attempting to increment the
refcount and failing, until the page gets reallocated for something else. This
isn't a data corruption because the condition will be detected if the page has
been reallocated. However it can result in a lockup. 

Linus points out that ACCESS_ONCE is also required in that pointer load, even
if it's absence is not causing a bug on our particular build. The most general
way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.

Assembly of find_get_pages,
before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/radix-tree.h |    2 +-
 mm/filemap.c               |   23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -136,7 +136,7 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	void *ret = *pslot;
+	void *ret = rcu_dereference(*pslot);
 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
 		ret = RADIX_TREE_RETRY;
 	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-06  2:05                               ` Nick Piggin
  (?)
@ 2009-01-06  2:23                                 ` Paul E. McKenney
  -1 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-06  2:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Tue, Jan 06, 2009 at 03:05:50AM +0100, Nick Piggin wrote:
> On Mon, Jan 05, 2009 at 01:57:27PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> > > 
> > > 
> > > On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > > > 
> > > > My guess is that Nick believes that the value in *pslot cannot change
> > > > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > > > to change within a given RCU grace period, and that Linus disagrees.
> > > 
> > > Oh, it's entirely possible that there are some lifetime rules or others 
> > > that make it impossible for things to go from "not indirect" -> 
> > > "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> > > se.
> > > 
> > > What I'm disagreeing about is that Nick apparently thinks that this is all 
> > > subtle code, and as a result we should add barriers in some very 
> > > non-obvious places.
> > > 
> > > While _I_ think that the problem isn't properly solved by barriers, but by 
> > > just making the code less subtle. If the barrier only exists because of 
> > > the reload issue, then the obvious solution - to me - is to just use what 
> > > is already the proper accessor function that forces a nice reload. That 
> > > way the compiler is forced to create code that does what the source 
> > > clearly means it to do, regardless of any barriers at all.
> > > 
> > > Barriers in general should be the _last_ thing added. And if they are 
> > > added, they should be added as deeply in the call-chain as possible, so 
> > > that we don't need to add them in multiple call-sites. Again, using the 
> > > rcu_dereference() approach seems to solve that issue too - rather than add 
> > > three barriers in three different places, we just add the proper 
> > > dereference in _one_ place.
> > 
> > I don't have any argument with this line of reasoning, and am myself a bit
> > puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
> > Then again, I don't claim to fully understand what he is trying to do.
> 
> OK, granted I do need the ACCESS_ONCE. It is loading a pointer who's target
> can be changed concurrently with the rcu algorithm. The rcu_derefernce
> thing kind of set me thinking down the wrong track, because the object of the
> pointer it loads is not RCU protected and doesn't need the memory barrier
> (on alpha).
> 
> But... RCU radix tree is not only used for the pagecache, so it's probably not
> worth complicating things to seperate out those two cases. rcu_dereference
> might be the best fit.

Works for me!

> > > > Whatever the answer, I would argue for -at- -least- a comment explaining
> > > > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > > > I must confess that it has been awhile since I have looked closely at
> > > > the radix_tree code.  :-/
> > > 
> > > And I'm actually suprised that gcc can generate the problematic code in 
> > > the first place. I'd expect that a "atomic_add_unless()" would always be 
> > > at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> > > barrier.
> > > 
> > > But because we inline it, and because we allow gcc to see that it doesn't 
> > > do anything if it gets just the right value from memory, I guess gcc ends 
> > > up able to change the "for()" loop so that the first iteration can exit 
> > > specially, and then for that case (and no other case) it can cache 
> > > variables over the whole atomic_add_unless().
> > > 
> > > Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> > > says that the failure case doesn't contain any barriers is really _meant_ 
> > > to be about the architecture-specific CPU barriers, not so much about 
> > > something as simple as a compiler re-ordering. 
> > > 
> > > So while I think that we should use rcu_dereference() (regardless of any 
> > > other issues), I _also_ think that part of the problem really is the 
> > > excessive subtlety in the whole code, and the (obviously very surprising) 
> > > fact that gcc could end up caching an unrelated memory load across that 
> > > whole atomic op.
> > > 
> > > Maybe we should make atomics always imply a compiler barrier, even when 
> > > they do not imply a memory barrier. The one exception would be the 
> > > (special) case of "atomic_read()/atomic_set()", which don't really do any 
> > > kind of complex operation at all, and where we really do want the compiler 
> > > to be able to coalesce multiple atomic_reads() to a single one.
> > > 
> > > In contrast, there's no sense in allowing the compiler to coalesce a 
> > > "atomic_add_unless()" with anything else. Making it a compiler barrier 
> > > (possibly by uninlining it, or just adding a barrier to it) would also 
> > > have avoided the whole subtle case - which is always a good thing.
> > 
> > That makes a lot of sense to me!
> 
> It would have avoided one problem (the same one my patch did). But it
> doesn't solve the problem of the missing ACCESS_ONCE allowing the
> pointer to be reloaded from the slot pointer.

Agreed.

> Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
> assembly for me too, I grafted the changelog onto that. Linus probably
> you are using -Os?
> 
> --
> Subject: mm lockless pagecache barrier fix
> 
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Linus points out that ACCESS_ONCE is also required in that pointer load, even
> if it's absence is not causing a bug on our particular build. The most general
> way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.
> 
> Assembly of find_get_pages,
> before:
> .L220:
>         movq    (%rbx), %rax    #* ivtmp.1162, tmp82
>         movq    (%rax), %rdi    #, prephitmp.1149
> .L218:
>         testb   $1, %dil        #, prephitmp.1149
>         jne     .L217   #,
>         testq   %rdi, %rdi      # prephitmp.1149
>         je      .L203   #,
>         cmpq    $-1, %rdi       #, prephitmp.1149
>         je      .L217   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L218   #,
> 
> after:
> .L212:
>         movq    (%rbx), %rax    #* ivtmp.1109, tmp81
>         movq    (%rax), %rdi    #, ret
>         testb   $1, %dil        #, ret
>         jne     .L211   #,
>         testq   %rdi, %rdi      # ret
>         je      .L197   #,
>         cmpq    $-1, %rdi       #, ret
>         je      .L211   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L212   #,
> 
> (notice the obvious infinite loop in the first example, if page->count remains 0)

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/radix-tree.h |    2 +-
>  mm/filemap.c               |   23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -136,7 +136,7 @@ do {									\
>   */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  2:23                                 ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-06  2:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz,
	Linus Torvalds, stable

On Tue, Jan 06, 2009 at 03:05:50AM +0100, Nick Piggin wrote:
> On Mon, Jan 05, 2009 at 01:57:27PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> > > 
> > > 
> > > On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > > > 
> > > > My guess is that Nick believes that the value in *pslot cannot change
> > > > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > > > to change within a given RCU grace period, and that Linus disagrees.
> > > 
> > > Oh, it's entirely possible that there are some lifetime rules or others 
> > > that make it impossible for things to go from "not indirect" -> 
> > > "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> > > se.
> > > 
> > > What I'm disagreeing about is that Nick apparently thinks that this is all 
> > > subtle code, and as a result we should add barriers in some very 
> > > non-obvious places.
> > > 
> > > While _I_ think that the problem isn't properly solved by barriers, but by 
> > > just making the code less subtle. If the barrier only exists because of 
> > > the reload issue, then the obvious solution - to me - is to just use what 
> > > is already the proper accessor function that forces a nice reload. That 
> > > way the compiler is forced to create code that does what the source 
> > > clearly means it to do, regardless of any barriers at all.
> > > 
> > > Barriers in general should be the _last_ thing added. And if they are 
> > > added, they should be added as deeply in the call-chain as possible, so 
> > > that we don't need to add them in multiple call-sites. Again, using the 
> > > rcu_dereference() approach seems to solve that issue too - rather than add 
> > > three barriers in three different places, we just add the proper 
> > > dereference in _one_ place.
> > 
> > I don't have any argument with this line of reasoning, and am myself a bit
> > puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
> > Then again, I don't claim to fully understand what he is trying to do.
> 
> OK, granted I do need the ACCESS_ONCE. It is loading a pointer who's target
> can be changed concurrently with the rcu algorithm. The rcu_derefernce
> thing kind of set me thinking down the wrong track, because the object of the
> pointer it loads is not RCU protected and doesn't need the memory barrier
> (on alpha).
> 
> But... RCU radix tree is not only used for the pagecache, so it's probably not
> worth complicating things to seperate out those two cases. rcu_dereference
> might be the best fit.

Works for me!

> > > > Whatever the answer, I would argue for -at- -least- a comment explaining
> > > > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > > > I must confess that it has been awhile since I have looked closely at
> > > > the radix_tree code.  :-/
> > > 
> > > And I'm actually suprised that gcc can generate the problematic code in 
> > > the first place. I'd expect that a "atomic_add_unless()" would always be 
> > > at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> > > barrier.
> > > 
> > > But because we inline it, and because we allow gcc to see that it doesn't 
> > > do anything if it gets just the right value from memory, I guess gcc ends 
> > > up able to change the "for()" loop so that the first iteration can exit 
> > > specially, and then for that case (and no other case) it can cache 
> > > variables over the whole atomic_add_unless().
> > > 
> > > Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> > > says that the failure case doesn't contain any barriers is really _meant_ 
> > > to be about the architecture-specific CPU barriers, not so much about 
> > > something as simple as a compiler re-ordering. 
> > > 
> > > So while I think that we should use rcu_dereference() (regardless of any 
> > > other issues), I _also_ think that part of the problem really is the 
> > > excessive subtlety in the whole code, and the (obviously very surprising) 
> > > fact that gcc could end up caching an unrelated memory load across that 
> > > whole atomic op.
> > > 
> > > Maybe we should make atomics always imply a compiler barrier, even when 
> > > they do not imply a memory barrier. The one exception would be the 
> > > (special) case of "atomic_read()/atomic_set()", which don't really do any 
> > > kind of complex operation at all, and where we really do want the compiler 
> > > to be able to coalesce multiple atomic_reads() to a single one.
> > > 
> > > In contrast, there's no sense in allowing the compiler to coalesce a 
> > > "atomic_add_unless()" with anything else. Making it a compiler barrier 
> > > (possibly by uninlining it, or just adding a barrier to it) would also 
> > > have avoided the whole subtle case - which is always a good thing.
> > 
> > That makes a lot of sense to me!
> 
> It would have avoided one problem (the same one my patch did). But it
> doesn't solve the problem of the missing ACCESS_ONCE allowing the
> pointer to be reloaded from the slot pointer.

Agreed.

> Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
> assembly for me too, I grafted the changelog onto that. Linus probably
> you are using -Os?
> 
> --
> Subject: mm lockless pagecache barrier fix
> 
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Linus points out that ACCESS_ONCE is also required in that pointer load, even
> if it's absence is not causing a bug on our particular build. The most general
> way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.
> 
> Assembly of find_get_pages,
> before:
> .L220:
>         movq    (%rbx), %rax    #* ivtmp.1162, tmp82
>         movq    (%rax), %rdi    #, prephitmp.1149
> .L218:
>         testb   $1, %dil        #, prephitmp.1149
>         jne     .L217   #,
>         testq   %rdi, %rdi      # prephitmp.1149
>         je      .L203   #,
>         cmpq    $-1, %rdi       #, prephitmp.1149
>         je      .L217   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L218   #,
> 
> after:
> .L212:
>         movq    (%rbx), %rax    #* ivtmp.1109, tmp81
>         movq    (%rax), %rdi    #, ret
>         testb   $1, %dil        #, ret
>         jne     .L211   #,
>         testq   %rdi, %rdi      # ret
>         je      .L197   #,
>         cmpq    $-1, %rdi       #, ret
>         je      .L211   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L212   #,
> 
> (notice the obvious infinite loop in the first example, if page->count remains 0)

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/radix-tree.h |    2 +-
>  mm/filemap.c               |   23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -136,7 +136,7 @@ do {									\
>   */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  2:23                                 ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-06  2:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Tue, Jan 06, 2009 at 03:05:50AM +0100, Nick Piggin wrote:
> On Mon, Jan 05, 2009 at 01:57:27PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 05, 2009 at 12:39:14PM -0800, Linus Torvalds wrote:
> > > 
> > > 
> > > On Mon, 5 Jan 2009, Paul E. McKenney wrote:
> > > > 
> > > > My guess is that Nick believes that the value in *pslot cannot change
> > > > in such as way as to cause radix_tree_is_indirect_ptr()'s return value
> > > > to change within a given RCU grace period, and that Linus disagrees.
> > > 
> > > Oh, it's entirely possible that there are some lifetime rules or others 
> > > that make it impossible for things to go from "not indirect" -> 
> > > "indirect". So if that was Nick's point, then I'm not "disagreeing" per 
> > > se.
> > > 
> > > What I'm disagreeing about is that Nick apparently thinks that this is all 
> > > subtle code, and as a result we should add barriers in some very 
> > > non-obvious places.
> > > 
> > > While _I_ think that the problem isn't properly solved by barriers, but by 
> > > just making the code less subtle. If the barrier only exists because of 
> > > the reload issue, then the obvious solution - to me - is to just use what 
> > > is already the proper accessor function that forces a nice reload. That 
> > > way the compiler is forced to create code that does what the source 
> > > clearly means it to do, regardless of any barriers at all.
> > > 
> > > Barriers in general should be the _last_ thing added. And if they are 
> > > added, they should be added as deeply in the call-chain as possible, so 
> > > that we don't need to add them in multiple call-sites. Again, using the 
> > > rcu_dereference() approach seems to solve that issue too - rather than add 
> > > three barriers in three different places, we just add the proper 
> > > dereference in _one_ place.
> > 
> > I don't have any argument with this line of reasoning, and am myself a bit
> > puzzled as to why rcu_dereference() isn't the right tool for Nick's job.
> > Then again, I don't claim to fully understand what he is trying to do.
> 
> OK, granted I do need the ACCESS_ONCE. It is loading a pointer who's target
> can be changed concurrently with the rcu algorithm. The rcu_derefernce
> thing kind of set me thinking down the wrong track, because the object of the
> pointer it loads is not RCU protected and doesn't need the memory barrier
> (on alpha).
> 
> But... RCU radix tree is not only used for the pagecache, so it's probably not
> worth complicating things to seperate out those two cases. rcu_dereference
> might be the best fit.

Works for me!

> > > > Whatever the answer, I would argue for -at- -least- a comment explaining
> > > > why it is safe.  I am not seeing the objection to rcu_dereference(), but
> > > > I must confess that it has been awhile since I have looked closely at
> > > > the radix_tree code.  :-/
> > > 
> > > And I'm actually suprised that gcc can generate the problematic code in 
> > > the first place. I'd expect that a "atomic_add_unless()" would always be 
> > > at LEAST a compiler barrier, even if it isn't necessarily a CPU memory 
> > > barrier.
> > > 
> > > But because we inline it, and because we allow gcc to see that it doesn't 
> > > do anything if it gets just the right value from memory, I guess gcc ends 
> > > up able to change the "for()" loop so that the first iteration can exit 
> > > specially, and then for that case (and no other case) it can cache 
> > > variables over the whole atomic_add_unless().
> > > 
> > > Again, that's very fragile. The fact that Documentation/atomic_ops.txt 
> > > says that the failure case doesn't contain any barriers is really _meant_ 
> > > to be about the architecture-specific CPU barriers, not so much about 
> > > something as simple as a compiler re-ordering. 
> > > 
> > > So while I think that we should use rcu_dereference() (regardless of any 
> > > other issues), I _also_ think that part of the problem really is the 
> > > excessive subtlety in the whole code, and the (obviously very surprising) 
> > > fact that gcc could end up caching an unrelated memory load across that 
> > > whole atomic op.
> > > 
> > > Maybe we should make atomics always imply a compiler barrier, even when 
> > > they do not imply a memory barrier. The one exception would be the 
> > > (special) case of "atomic_read()/atomic_set()", which don't really do any 
> > > kind of complex operation at all, and where we really do want the compiler 
> > > to be able to coalesce multiple atomic_reads() to a single one.
> > > 
> > > In contrast, there's no sense in allowing the compiler to coalesce a 
> > > "atomic_add_unless()" with anything else. Making it a compiler barrier 
> > > (possibly by uninlining it, or just adding a barrier to it) would also 
> > > have avoided the whole subtle case - which is always a good thing.
> > 
> > That makes a lot of sense to me!
> 
> It would have avoided one problem (the same one my patch did). But it
> doesn't solve the problem of the missing ACCESS_ONCE allowing the
> pointer to be reloaded from the slot pointer.

Agreed.

> Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
> assembly for me too, I grafted the changelog onto that. Linus probably
> you are using -Os?
> 
> --
> Subject: mm lockless pagecache barrier fix
> 
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Linus points out that ACCESS_ONCE is also required in that pointer load, even
> if it's absence is not causing a bug on our particular build. The most general
> way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.
> 
> Assembly of find_get_pages,
> before:
> .L220:
>         movq    (%rbx), %rax    #* ivtmp.1162, tmp82
>         movq    (%rax), %rdi    #, prephitmp.1149
> .L218:
>         testb   $1, %dil        #, prephitmp.1149
>         jne     .L217   #,
>         testq   %rdi, %rdi      # prephitmp.1149
>         je      .L203   #,
>         cmpq    $-1, %rdi       #, prephitmp.1149
>         je      .L217   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L218   #,
> 
> after:
> .L212:
>         movq    (%rbx), %rax    #* ivtmp.1109, tmp81
>         movq    (%rax), %rdi    #, ret
>         testb   $1, %dil        #, ret
>         jne     .L211   #,
>         testq   %rdi, %rdi      # ret
>         je      .L197   #,
>         cmpq    $-1, %rdi       #, ret
>         je      .L211   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L212   #,
> 
> (notice the obvious infinite loop in the first example, if page->count remains 0)

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/radix-tree.h |    2 +-
>  mm/filemap.c               |   23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -136,7 +136,7 @@ do {									\
>   */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-06  2:05                               ` Nick Piggin
  (?)
@ 2009-01-06  2:29                                 ` Linus Torvalds
  -1 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-06  2:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton



On Tue, 6 Jan 2009, Nick Piggin wrote:
> 
> Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
> assembly for me too, I grafted the changelog onto that. Linus probably
> you are using -Os?

Ahh, yes. I am. That explains why I can't see any difference.

		Linus

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  2:29                                 ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-06  2:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, linux-kernel, Roman Kononov, xfs,
	Christoph Hellwig, Linux Memory Management List, Andrew Morton,
	Peter Klotz, stable



On Tue, 6 Jan 2009, Nick Piggin wrote:
> 
> Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
> assembly for me too, I grafted the changelog onto that. Linus probably
> you are using -Os?

Ahh, yes. I am. That explains why I can't see any difference.

		Linus

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  2:29                                 ` Linus Torvalds
  0 siblings, 0 replies; 74+ messages in thread
From: Linus Torvalds @ 2009-01-06  2:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton



On Tue, 6 Jan 2009, Nick Piggin wrote:
> 
> Sticking an rcu_dereference in radix_tree_deref_slot seems to fix the
> assembly for me too, I grafted the changelog onto that. Linus probably
> you are using -Os?

Ahh, yes. I am. That explains why I can't see any difference.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-06  2:05                               ` Nick Piggin
  (?)
@ 2009-01-06  8:38                                 ` Peter Klotz
  -1 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2009-01-06  8:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, Linus Torvalds, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

Nick Piggin wrote:
> --
> Subject: mm lockless pagecache barrier fix
> 
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Linus points out that ACCESS_ONCE is also required in that pointer load, even
> if it's absence is not causing a bug on our particular build. The most general
> way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.
> 
> Assembly of find_get_pages,
> before:
> .L220:
>         movq    (%rbx), %rax    #* ivtmp.1162, tmp82
>         movq    (%rax), %rdi    #, prephitmp.1149
> .L218:
>         testb   $1, %dil        #, prephitmp.1149
>         jne     .L217   #,
>         testq   %rdi, %rdi      # prephitmp.1149
>         je      .L203   #,
>         cmpq    $-1, %rdi       #, prephitmp.1149
>         je      .L217   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L218   #,
> 
> after:
> .L212:
>         movq    (%rbx), %rax    #* ivtmp.1109, tmp81
>         movq    (%rax), %rdi    #, ret
>         testb   $1, %dil        #, ret
>         jne     .L211   #,
>         testq   %rdi, %rdi      # ret
>         je      .L197   #,
>         cmpq    $-1, %rdi       #, ret
>         je      .L211   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L212   #,
> 
> (notice the obvious infinite loop in the first example, if page->count remains 0)
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/radix-tree.h |    2 +-
>  mm/filemap.c               |   23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -136,7 +136,7 @@ do {									\
>   */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;
> 
> 

The patch above fixes my problem. I did two complete test runs that 
normally fail rather quickly.

Regards, Peter.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  8:38                                 ` Peter Klotz
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2009-01-06  8:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Paul E. McKenney,
	Linus Torvalds, stable

Nick Piggin wrote:
> --
> Subject: mm lockless pagecache barrier fix
> 
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Linus points out that ACCESS_ONCE is also required in that pointer load, even
> if it's absence is not causing a bug on our particular build. The most general
> way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.
> 
> Assembly of find_get_pages,
> before:
> .L220:
>         movq    (%rbx), %rax    #* ivtmp.1162, tmp82
>         movq    (%rax), %rdi    #, prephitmp.1149
> .L218:
>         testb   $1, %dil        #, prephitmp.1149
>         jne     .L217   #,
>         testq   %rdi, %rdi      # prephitmp.1149
>         je      .L203   #,
>         cmpq    $-1, %rdi       #, prephitmp.1149
>         je      .L217   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L218   #,
> 
> after:
> .L212:
>         movq    (%rbx), %rax    #* ivtmp.1109, tmp81
>         movq    (%rax), %rdi    #, ret
>         testb   $1, %dil        #, ret
>         jne     .L211   #,
>         testq   %rdi, %rdi      # ret
>         je      .L197   #,
>         cmpq    $-1, %rdi       #, ret
>         je      .L211   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L212   #,
> 
> (notice the obvious infinite loop in the first example, if page->count remains 0)
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/radix-tree.h |    2 +-
>  mm/filemap.c               |   23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -136,7 +136,7 @@ do {									\
>   */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;
> 
> 

The patch above fixes my problem. I did two complete test runs that 
normally fail rather quickly.

Regards, Peter.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  8:38                                 ` Peter Klotz
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2009-01-06  8:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, Linus Torvalds, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

Nick Piggin wrote:
> --
> Subject: mm lockless pagecache barrier fix
> 
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Linus points out that ACCESS_ONCE is also required in that pointer load, even
> if it's absence is not causing a bug on our particular build. The most general
> way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.
> 
> Assembly of find_get_pages,
> before:
> .L220:
>         movq    (%rbx), %rax    #* ivtmp.1162, tmp82
>         movq    (%rax), %rdi    #, prephitmp.1149
> .L218:
>         testb   $1, %dil        #, prephitmp.1149
>         jne     .L217   #,
>         testq   %rdi, %rdi      # prephitmp.1149
>         je      .L203   #,
>         cmpq    $-1, %rdi       #, prephitmp.1149
>         je      .L217   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L218   #,
> 
> after:
> .L212:
>         movq    (%rbx), %rax    #* ivtmp.1109, tmp81
>         movq    (%rax), %rdi    #, ret
>         testb   $1, %dil        #, ret
>         jne     .L211   #,
>         testq   %rdi, %rdi      # ret
>         je      .L197   #,
>         cmpq    $-1, %rdi       #, ret
>         je      .L211   #,
>         movl    8(%rdi), %esi   # <variable>._count.counter, c
>         testl   %esi, %esi      # c
>         je      .L212   #,
> 
> (notice the obvious infinite loop in the first example, if page->count remains 0)
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  include/linux/radix-tree.h |    2 +-
>  mm/filemap.c               |   23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -136,7 +136,7 @@ do {									\
>   */
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;
> 
> 

The patch above fixes my problem. I did two complete test runs that 
normally fail rather quickly.

Regards, Peter.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-06  8:38                                 ` Peter Klotz
  (?)
@ 2009-01-06  8:43                                   ` Nick Piggin
  -1 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-06  8:43 UTC (permalink / raw)
  To: Peter Klotz
  Cc: Paul E. McKenney, Linus Torvalds, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Tue, Jan 06, 2009 at 09:38:15AM +0100, Peter Klotz wrote:
> >Index: linux-2.6/include/linux/radix-tree.h
> >===================================================================
> >--- linux-2.6.orig/include/linux/radix-tree.h
> >+++ linux-2.6/include/linux/radix-tree.h
> >@@ -136,7 +136,7 @@ do {						 \
> >  */
> > static inline void *radix_tree_deref_slot(void **pslot)
> > {
> >-	void *ret = *pslot;
> >+	void *ret = rcu_dereference(*pslot);
> > 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
> > 		ret = RADIX_TREE_RETRY;
> > 	return ret;
> >
> >
> 
> The patch above fixes my problem. I did two complete test runs that 
> normally fail rather quickly.

OK, thanks for reporting and testing. 

I think this patch is a candidate for -stable too.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  8:43                                   ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-06  8:43 UTC (permalink / raw)
  To: Peter Klotz
  Cc: linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Paul E. McKenney,
	Linus Torvalds, stable

On Tue, Jan 06, 2009 at 09:38:15AM +0100, Peter Klotz wrote:
> >Index: linux-2.6/include/linux/radix-tree.h
> >===================================================================
> >--- linux-2.6.orig/include/linux/radix-tree.h
> >+++ linux-2.6/include/linux/radix-tree.h
> >@@ -136,7 +136,7 @@ do {						 \
> >  */
> > static inline void *radix_tree_deref_slot(void **pslot)
> > {
> >-	void *ret = *pslot;
> >+	void *ret = rcu_dereference(*pslot);
> > 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
> > 		ret = RADIX_TREE_RETRY;
> > 	return ret;
> >
> >
> 
> The patch above fixes my problem. I did two complete test runs that 
> normally fail rather quickly.

OK, thanks for reporting and testing. 

I think this patch is a candidate for -stable too.

Thanks,
Nick

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06  8:43                                   ` Nick Piggin
  0 siblings, 0 replies; 74+ messages in thread
From: Nick Piggin @ 2009-01-06  8:43 UTC (permalink / raw)
  To: Peter Klotz
  Cc: Paul E. McKenney, Linus Torvalds, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On Tue, Jan 06, 2009 at 09:38:15AM +0100, Peter Klotz wrote:
> >Index: linux-2.6/include/linux/radix-tree.h
> >===================================================================
> >--- linux-2.6.orig/include/linux/radix-tree.h
> >+++ linux-2.6/include/linux/radix-tree.h
> >@@ -136,7 +136,7 @@ do {						 \
> >  */
> > static inline void *radix_tree_deref_slot(void **pslot)
> > {
> >-	void *ret = *pslot;
> >+	void *ret = rcu_dereference(*pslot);
> > 	if (unlikely(radix_tree_is_indirect_ptr(ret)))
> > 		ret = RADIX_TREE_RETRY;
> > 	return ret;
> >
> >
> 
> The patch above fixes my problem. I did two complete test runs that 
> normally fail rather quickly.

OK, thanks for reporting and testing. 

I think this patch is a candidate for -stable too.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-06  2:05                               ` Nick Piggin
  (?)
@ 2009-01-06 16:16                                 ` Roman Kononov
  -1 siblings, 0 replies; 74+ messages in thread
From: Roman Kononov @ 2009-01-06 16:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, Linus Torvalds, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On 2009-01-05 20:05 Nick Piggin said the following:
> Subject: mm lockless pagecache barrier fix
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;

3 systems are working fine for a few hours with the patch. They would 
fail within 20 minutes without it.

Thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06 16:16                                 ` Roman Kononov
  0 siblings, 0 replies; 74+ messages in thread
From: Roman Kononov @ 2009-01-06 16:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, linux-kernel, Roman Kononov, xfs,
	Christoph Hellwig, Linux Memory Management List, Andrew Morton,
	Peter Klotz, Linus Torvalds, stable

On 2009-01-05 20:05 Nick Piggin said the following:
> Subject: mm lockless pagecache barrier fix
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;

3 systems are working fine for a few hours with the patch. They would 
fail within 20 minutes without it.

Thanks.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06 16:16                                 ` Roman Kononov
  0 siblings, 0 replies; 74+ messages in thread
From: Roman Kononov @ 2009-01-06 16:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul E. McKenney, Linus Torvalds, Peter Klotz, stable,
	Linux Memory Management List, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs, Andrew Morton

On 2009-01-05 20:05 Nick Piggin said the following:
> Subject: mm lockless pagecache barrier fix
>  static inline void *radix_tree_deref_slot(void **pslot)
>  {
> -	void *ret = *pslot;
> +	void *ret = rcu_dereference(*pslot);
>  	if (unlikely(radix_tree_is_indirect_ptr(ret)))
>  		ret = RADIX_TREE_RETRY;
>  	return ret;

3 systems are working fine for a few hours with the patch. They would 
fail within 20 minutes without it.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
  2009-01-05 19:39                         ` Linus Torvalds
  (?)
@ 2009-01-06 17:17                           ` Paul E. McKenney
  -1 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-06 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 11:39:29AM -0800, Linus Torvalds wrote:
> On Mon, 5 Jan 2009, Linus Torvalds wrote:
> > Either the value can change, or it can not. It's that simple.
> > 
> > If it cannot change, then we can load it just once, or we can load it 
> > multiple times, and it won't matter. Barriers won't do anything but screw 
> > up the code.
> > 
> > If it can change from under us, you need to use rcu_dereference(), or 
> > open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
> > of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
> > like the test for the RADIX_TREE_INDIRECT_PTR bit.
> > 
> > And that was the fundamental problem.
> 
> Btw, this is the real issue with anything that does "locking vs 
> optimistic" accesses.
> 
> If you use locking, then by definition (if you did things right), the 
> values you are working with do not change. As a result, it doesn't matter 
> if the compiler re-orders accesses, splits them up, or coalesces them. 
> It's why normal code should never need barriers, because it doesn't matter 
> whether some access gets optimized away or gets done multiple times.
> 
> But whenever you use an optimistic algorithm, and the data may change 
> under you, you need to use barriers or other things to limit the things 
> the CPU and/or compiler does.
> 
> And yes, "rcu_dereference()" is one such thing - it's not a barrier in the 
> sense that it doesn't necessarily affect ordering of accesses to other 
> variables around it (although the read_barrier_depends() obviously _is_ a 
> very special kind of ordering wrt the pointer itself on alpha). But it 
> does make sure that the compiler at least does not coalesce - or split - 
> that _one_ particular access.
> 
> It's true that it has "rcu" in its name, and it's also true that that may 
> be a bit misleading in that it's very much useful not just for rcu, but 
> for _any_ algorithm that depends on rcu-like behavior - ie optimistic 
> accesses to data that may change underneath it. RCU is just the most 
> commonly used (and perhaps best codified) variant of that kind of code.

The codification is quite important -- otherwise RCU would be a knife
without a handle.  And some would no doubt argue that RCU is -still-
a knife without a handle, but so it goes.  It does still need more work.
And I hope that additional codification of other optimistic concurrency
algorithms will make them more usable as well.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06 17:17                           ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-06 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, linux-kernel, Roman Kononov, xfs, Christoph Hellwig,
	Linux Memory Management List, Andrew Morton, Peter Klotz, stable

On Mon, Jan 05, 2009 at 11:39:29AM -0800, Linus Torvalds wrote:
> On Mon, 5 Jan 2009, Linus Torvalds wrote:
> > Either the value can change, or it can not. It's that simple.
> > 
> > If it cannot change, then we can load it just once, or we can load it 
> > multiple times, and it won't matter. Barriers won't do anything but screw 
> > up the code.
> > 
> > If it can change from under us, you need to use rcu_dereference(), or 
> > open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
> > of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
> > like the test for the RADIX_TREE_INDIRECT_PTR bit.
> > 
> > And that was the fundamental problem.
> 
> Btw, this is the real issue with anything that does "locking vs 
> optimistic" accesses.
> 
> If you use locking, then by definition (if you did things right), the 
> values you are working with do not change. As a result, it doesn't matter 
> if the compiler re-orders accesses, splits them up, or coalesces them. 
> It's why normal code should never need barriers, because it doesn't matter 
> whether some access gets optimized away or gets done multiple times.
> 
> But whenever you use an optimistic algorithm, and the data may change 
> under you, you need to use barriers or other things to limit the things 
> the CPU and/or compiler does.
> 
> And yes, "rcu_dereference()" is one such thing - it's not a barrier in the 
> sense that it doesn't necessarily affect ordering of accesses to other 
> variables around it (although the read_barrier_depends() obviously _is_ a 
> very special kind of ordering wrt the pointer itself on alpha). But it 
> does make sure that the compiler at least does not coalesce - or split - 
> that _one_ particular access.
> 
> It's true that it has "rcu" in its name, and it's also true that that may 
> be a bit misleading in that it's very much useful not just for rcu, but 
> for _any_ algorithm that depends on rcu-like behavior - ie optimistic 
> accesses to data that may change underneath it. RCU is just the most 
> commonly used (and perhaps best codified) variant of that kind of code.

The codification is quite important -- otherwise RCU would be a knife
without a handle.  And some would no doubt argue that RCU is -still-
a knife without a handle, but so it goes.  It does still need more work.
And I hope that additional codification of other optimistic concurrency
algorithms will make them more usable as well.

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
@ 2009-01-06 17:17                           ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2009-01-06 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Klotz, stable, Linux Memory Management List,
	Christoph Hellwig, Roman Kononov, linux-kernel, xfs,
	Andrew Morton

On Mon, Jan 05, 2009 at 11:39:29AM -0800, Linus Torvalds wrote:
> On Mon, 5 Jan 2009, Linus Torvalds wrote:
> > Either the value can change, or it can not. It's that simple.
> > 
> > If it cannot change, then we can load it just once, or we can load it 
> > multiple times, and it won't matter. Barriers won't do anything but screw 
> > up the code.
> > 
> > If it can change from under us, you need to use rcu_dereference(), or 
> > open-code it with an ACCESS_ONCE() or put in barriers. But your placement 
> > of a barrier was NONSENSICAL. Your barrier didn't protect anything else - 
> > like the test for the RADIX_TREE_INDIRECT_PTR bit.
> > 
> > And that was the fundamental problem.
> 
> Btw, this is the real issue with anything that does "locking vs 
> optimistic" accesses.
> 
> If you use locking, then by definition (if you did things right), the 
> values you are working with do not change. As a result, it doesn't matter 
> if the compiler re-orders accesses, splits them up, or coalesces them. 
> It's why normal code should never need barriers, because it doesn't matter 
> whether some access gets optimized away or gets done multiple times.
> 
> But whenever you use an optimistic algorithm, and the data may change 
> under you, you need to use barriers or other things to limit the things 
> the CPU and/or compiler does.
> 
> And yes, "rcu_dereference()" is one such thing - it's not a barrier in the 
> sense that it doesn't necessarily affect ordering of accesses to other 
> variables around it (although the read_barrier_depends() obviously _is_ a 
> very special kind of ordering wrt the pointer itself on alpha). But it 
> does make sure that the compiler at least does not coalesce - or split - 
> that _one_ particular access.
> 
> It's true that it has "rcu" in its name, and it's also true that that may 
> be a bit misleading in that it's very much useful not just for rcu, but 
> for _any_ algorithm that depends on rcu-like behavior - ie optimistic 
> accesses to data that may change underneath it. RCU is just the most 
> commonly used (and perhaps best codified) variant of that kind of code.

The codification is quite important -- otherwise RCU would be a knife
without a handle.  And some would no doubt argue that RCU is -still-
a knife without a handle, but so it goes.  It does still need more work.
And I hope that additional codification of other optimistic concurrency
algorithms will make them more usable as well.

							Thanx, Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2009-01-05  6:48             ` Nick Piggin
@ 2011-07-14 11:23               ` Guus Sliepen
  -1 siblings, 0 replies; 74+ messages in thread
From: Guus Sliepen @ 2011-07-14 11:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Peter Klotz, Roman Kononov, linux-kernel, xfs

[-- Attachment #1: Type: text/plain, Size: 4406 bytes --]

Hello,

I'm having a problem with a system having an XFS filesystem on RAID locking up
fairly consistently when writing large amounts of data to it, with several
kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
processors. The kernel always logs this several times:

BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]

With different CPU# numbers, but always in kswapd0. Eventually the system will
really lock up, requiring a reset. During soft lockups (when file transfer
apparently stalled), merely typing "ps aux" would often cause the lockup to end
immediately. After googling I found this page:

https://patchwork.kernel.org/patch/789/

An unpatched vanilla 2.6.39.3 consistently locked up, however after patching it
(adding a barrier() after all 4 instances of if
(!page_cache_get_speculative(page))) the lockups never happened anymore, and
file transfer has been steady.

I also tested it with ext4, which doesn't give lockups on unpatched kernels,
but unfortunately mkfs.ext4 cannot create filesystems larger than 16TB yet, so
I have to use XFS instead.

On Mon, Jan 05, 2009 at 06:48:38AM -0000, Nick Piggin wrote:

> I believe this patch should solve it. Please test and confirm before
> I send it upstream.

Further comments on that thread in 2009 indicated the patch was very useful,
but it doesn't seem to have been applied upstream. Is there any reason this
patch should not be applied?

If necessary I can submit a reworked patch for 2.6.39.3 or 3.0 when that comes
out.

> ---
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Add a the required compiler barrier and comment to fix this.
[...]
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
> +++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
> @@ -794,8 +794,19 @@ repeat:
>  		if (unlikely(page == RADIX_TREE_RETRY))
>  			goto restart;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/*
> +			 * A failed page_cache_get_speculative operation does
> +			 * not imply any barriers (Documentation/atomic_ops.txt),
> +			 * and as such, we must force the compiler to deref the
> +			 * radix-tree slot again rather than using the cached
> +			 * value (because we need to give up if the page has been
> +			 * removed from the radix-tree, rather than looping until
> +			 * it gets reused for something else).
> +			 */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
> @@ -850,8 +861,11 @@ repeat:
>  		if (page->mapping == NULL || page->index != index)
>  			break;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/* barrier: see find_get_pages() */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
> @@ -904,8 +918,11 @@ repeat:
>  		if (unlikely(page == RADIX_TREE_RETRY))
>  			goto restart;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/* barrier: see find_get_pages() */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {

-- 
Met vriendelijke groet / with kind regards,
Guus Sliepen <Guus.Sliepen@astro.su.se>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2011-07-14 11:23               ` Guus Sliepen
  0 siblings, 0 replies; 74+ messages in thread
From: Guus Sliepen @ 2011-07-14 11:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Peter Klotz, linux-kernel, Roman Kononov, xfs


[-- Attachment #1.1: Type: text/plain, Size: 4406 bytes --]

Hello,

I'm having a problem with a system having an XFS filesystem on RAID locking up
fairly consistently when writing large amounts of data to it, with several
kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
processors. The kernel always logs this several times:

BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]

With different CPU# numbers, but always in kswapd0. Eventually the system will
really lock up, requiring a reset. During soft lockups (when file transfer
apparently stalled), merely typing "ps aux" would often cause the lockup to end
immediately. After googling I found this page:

https://patchwork.kernel.org/patch/789/

An unpatched vanilla 2.6.39.3 consistently locked up, however after patching it
(adding a barrier() after all 4 instances of if
(!page_cache_get_speculative(page))) the lockups never happened anymore, and
file transfer has been steady.

I also tested it with ext4, which doesn't give lockups on unpatched kernels,
but unfortunately mkfs.ext4 cannot create filesystems larger than 16TB yet, so
I have to use XFS instead.

On Mon, Jan 05, 2009 at 06:48:38AM -0000, Nick Piggin wrote:

> I believe this patch should solve it. Please test and confirm before
> I send it upstream.

Further comments on that thread in 2009 indicated the patch was very useful,
but it doesn't seem to have been applied upstream. Is there any reason this
patch should not be applied?

If necessary I can submit a reworked patch for 2.6.39.3 or 3.0 when that comes
out.

> ---
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Add a the required compiler barrier and comment to fix this.
[...]
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
> +++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
> @@ -794,8 +794,19 @@ repeat:
>  		if (unlikely(page == RADIX_TREE_RETRY))
>  			goto restart;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/*
> +			 * A failed page_cache_get_speculative operation does
> +			 * not imply any barriers (Documentation/atomic_ops.txt),
> +			 * and as such, we must force the compiler to deref the
> +			 * radix-tree slot again rather than using the cached
> +			 * value (because we need to give up if the page has been
> +			 * removed from the radix-tree, rather than looping until
> +			 * it gets reused for something else).
> +			 */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
> @@ -850,8 +861,11 @@ repeat:
>  		if (page->mapping == NULL || page->index != index)
>  			break;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/* barrier: see find_get_pages() */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
> @@ -904,8 +918,11 @@ repeat:
>  		if (unlikely(page == RADIX_TREE_RETRY))
>  			goto restart;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/* barrier: see find_get_pages() */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {

-- 
Met vriendelijke groet / with kind regards,
Guus Sliepen <Guus.Sliepen@astro.su.se>

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2011-07-14 11:23               ` Guus Sliepen
@ 2011-07-14 18:03                 ` Peter Klotz
  -1 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2011-07-14 18:03 UTC (permalink / raw)
  To: Guus Sliepen, Nick Piggin, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs

On 07/14/2011 01:23 PM, Guus Sliepen wrote:

> I'm having a problem with a system having an XFS filesystem on RAID locking up
> fairly consistently when writing large amounts of data to it, with several
> kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
> processors. The kernel always logs this several times:
>
> BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]
...
>> I believe this patch should solve it. Please test and confirm before
>> I send it upstream.
>
> Further comments on that thread in 2009 indicated the patch was very useful,
> but it doesn't seem to have been applied upstream. Is there any reason this
> patch should not be applied?

Hello Guus

This Bugzilla entry documents the XFS bug from 2009 in detail including 
links:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

The problem was finally solved by a patch proposed by Linus. This is the 
reason the original patch developed by Nick never made it into the kernel.

My tests back then showed that both patches fixed the problem.

It seems you have found a test case where just Nick's patch helps.

Regards, Peter.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2011-07-14 18:03                 ` Peter Klotz
  0 siblings, 0 replies; 74+ messages in thread
From: Peter Klotz @ 2011-07-14 18:03 UTC (permalink / raw)
  To: Guus Sliepen, Nick Piggin, Christoph Hellwig, Roman Kononov,
	linux-kernel, xfs

On 07/14/2011 01:23 PM, Guus Sliepen wrote:

> I'm having a problem with a system having an XFS filesystem on RAID locking up
> fairly consistently when writing large amounts of data to it, with several
> kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
> processors. The kernel always logs this several times:
>
> BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]
...
>> I believe this patch should solve it. Please test and confirm before
>> I send it upstream.
>
> Further comments on that thread in 2009 indicated the patch was very useful,
> but it doesn't seem to have been applied upstream. Is there any reason this
> patch should not be applied?

Hello Guus

This Bugzilla entry documents the XFS bug from 2009 in detail including 
links:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

The problem was finally solved by a patch proposed by Linus. This is the 
reason the original patch developed by Nick never made it into the kernel.

My tests back then showed that both patches fixed the problem.

It seems you have found a test case where just Nick's patch helps.

Regards, Peter.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
  2011-07-14 18:03                 ` Peter Klotz
@ 2011-07-14 19:29                   ` Guus Sliepen
  -1 siblings, 0 replies; 74+ messages in thread
From: Guus Sliepen @ 2011-07-14 19:29 UTC (permalink / raw)
  To: Peter Klotz
  Cc: Nick Piggin, Christoph Hellwig, Roman Kononov, linux-kernel, xfs

[-- Attachment #1: Type: text/plain, Size: 1444 bytes --]

On Thu, Jul 14, 2011 at 08:03:09PM +0200, Peter Klotz wrote:

> On 07/14/2011 01:23 PM, Guus Sliepen wrote:
> 
> >I'm having a problem with a system having an XFS filesystem on RAID locking up
> >fairly consistently when writing large amounts of data to it, with several
> >kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
> >processors. The kernel always logs this several times:
> >
> >BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]
[...]
> This Bugzilla entry documents the XFS bug from 2009 in detail
> including links:
> 
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

Aha, I did not look at that before.

> The problem was finally solved by a patch proposed by Linus. This is
> the reason the original patch developed by Nick never made it into
> the kernel.
> 
> My tests back then showed that both patches fixed the problem.
> 
> It seems you have found a test case where just Nick's patch helps.

Yes. I agree with Linus that the root cause should be fixed, not the symptoms.
I don't have time to dive in the kernel code myself, but I do have several
nearly identical machines where I can test things on. I will be happy to test
out patches and/or different kernel versions or kernel configurations, and I
can provide dmesg output and perhaps other information if necessary.

-- 
Met vriendelijke groet / with kind regards,
Guus Sliepen <Guus.Sliepen@astro.su.se>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: BUG: soft lockup - is this XFS problem?
@ 2011-07-14 19:29                   ` Guus Sliepen
  0 siblings, 0 replies; 74+ messages in thread
From: Guus Sliepen @ 2011-07-14 19:29 UTC (permalink / raw)
  To: Peter Klotz
  Cc: Christoph Hellwig, xfs, Nick Piggin, Roman Kononov, linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 1444 bytes --]

On Thu, Jul 14, 2011 at 08:03:09PM +0200, Peter Klotz wrote:

> On 07/14/2011 01:23 PM, Guus Sliepen wrote:
> 
> >I'm having a problem with a system having an XFS filesystem on RAID locking up
> >fairly consistently when writing large amounts of data to it, with several
> >kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
> >processors. The kernel always logs this several times:
> >
> >BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]
[...]
> This Bugzilla entry documents the XFS bug from 2009 in detail
> including links:
> 
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

Aha, I did not look at that before.

> The problem was finally solved by a patch proposed by Linus. This is
> the reason the original patch developed by Nick never made it into
> the kernel.
> 
> My tests back then showed that both patches fixed the problem.
> 
> It seems you have found a test case where just Nick's patch helps.

Yes. I agree with Linus that the root cause should be fixed, not the symptoms.
I don't have time to dive in the kernel code myself, but I do have several
nearly identical machines where I can test things on. I will be happy to test
out patches and/or different kernel versions or kernel configurations, and I
can provide dmesg output and perhaps other information if necessary.

-- 
Met vriendelijke groet / with kind regards,
Guus Sliepen <Guus.Sliepen@astro.su.se>

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2011-07-14 19:29 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-19  6:59 BUG: soft lockup - is this XFS problem? Roman Kononov
2008-12-23 17:12 ` Christoph Hellwig
2008-12-23 17:12   ` Christoph Hellwig
2008-12-30  4:23   ` Nick Piggin
2008-12-30  4:23     ` Nick Piggin
2009-01-03 21:44     ` Christoph Hellwig
2009-01-03 21:44       ` Christoph Hellwig
2009-01-05  1:48       ` Nick Piggin
2009-01-05  1:48         ` Nick Piggin
2009-01-05  4:19         ` Nick Piggin
2009-01-05  4:19           ` Nick Piggin
2009-01-05  6:48           ` Nick Piggin
2009-01-05  6:48             ` Nick Piggin
2009-01-05 14:25             ` Roman Kononov
2009-01-05 14:25               ` Roman Kononov
2009-01-05 16:21             ` Peter Klotz
2009-01-05 16:21               ` Peter Klotz
2009-01-05 16:41               ` [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?) Nick Piggin
2009-01-05 16:41                 ` Nick Piggin
2009-01-05 16:41                 ` Nick Piggin
2009-01-05 17:30                 ` Linus Torvalds
2009-01-05 17:30                   ` Linus Torvalds
2009-01-05 17:30                   ` Linus Torvalds
2009-01-05 18:00                   ` Nick Piggin
2009-01-05 18:00                     ` Nick Piggin
2009-01-05 18:00                     ` Nick Piggin
2009-01-05 18:44                     ` Linus Torvalds
2009-01-05 18:44                       ` Linus Torvalds
2009-01-05 18:44                       ` Linus Torvalds
2009-01-05 19:39                       ` Linus Torvalds
2009-01-05 19:39                         ` Linus Torvalds
2009-01-05 19:39                         ` Linus Torvalds
2009-01-06 17:17                         ` Paul E. McKenney
2009-01-06 17:17                           ` Paul E. McKenney
2009-01-06 17:17                           ` Paul E. McKenney
2009-01-05 20:12                       ` Paul E. McKenney
2009-01-05 20:12                         ` Paul E. McKenney
2009-01-05 20:12                         ` Paul E. McKenney
2009-01-05 20:39                         ` Linus Torvalds
2009-01-05 20:39                           ` Linus Torvalds
2009-01-05 20:39                           ` Linus Torvalds
2009-01-05 21:57                           ` Paul E. McKenney
2009-01-05 21:57                             ` Paul E. McKenney
2009-01-05 21:57                             ` Paul E. McKenney
2009-01-06  2:05                             ` Nick Piggin
2009-01-06  2:05                               ` Nick Piggin
2009-01-06  2:05                               ` Nick Piggin
2009-01-06  2:23                               ` Paul E. McKenney
2009-01-06  2:23                                 ` Paul E. McKenney
2009-01-06  2:23                                 ` Paul E. McKenney
2009-01-06  2:29                               ` Linus Torvalds
2009-01-06  2:29                                 ` Linus Torvalds
2009-01-06  2:29                                 ` Linus Torvalds
2009-01-06  8:38                               ` Peter Klotz
2009-01-06  8:38                                 ` Peter Klotz
2009-01-06  8:38                                 ` Peter Klotz
2009-01-06  8:43                                 ` Nick Piggin
2009-01-06  8:43                                   ` Nick Piggin
2009-01-06  8:43                                   ` Nick Piggin
2009-01-06 16:16                               ` Roman Kononov
2009-01-06 16:16                                 ` Roman Kononov
2009-01-06 16:16                                 ` Roman Kononov
2009-01-05 21:04                         ` [patch] mm: fix lockless pagecache reordering bug (was Peter Zijlstra
2009-01-05 21:04                           ` Peter Zijlstra
2009-01-05 21:04                           ` Peter Zijlstra
2009-01-05 21:58                           ` Paul E. McKenney
2009-01-05 21:58                             ` Paul E. McKenney
2009-01-05 21:58                             ` Paul E. McKenney
2011-07-14 11:23             ` BUG: soft lockup - is this XFS problem? Guus Sliepen
2011-07-14 11:23               ` Guus Sliepen
2011-07-14 18:03               ` Peter Klotz
2011-07-14 18:03                 ` Peter Klotz
2011-07-14 19:29                 ` Guus Sliepen
2011-07-14 19:29                   ` Guus Sliepen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.