From: Nick Piggin <npiggin@suse.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Klotz <peter.klotz@aon.at>,
	stable@kernel.org,
	Linux Memory Management List <linux-mm@kvack.org>,
	Christoph Hellwig <hch@infradead.org>,
	Roman Kononov <kernel@kononov.ftml.net>,
	linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG: soft lockup - is this XFS problem?)
Date: Mon, 5 Jan 2009 19:00:08 +0100
Message-ID: <20090105180008.GE32675@wotan.suse.de>
In-Reply-To: <alpine.LFD.2.00.0901050859430.3057@localhost.localdomain>

On Mon, Jan 05, 2009 at 09:30:55AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 5 Jan 2009, Nick Piggin wrote:
> > 
> > This patch should be applied to 2.6.29 and 27/28 stable kernels, please.
> 
> No. I think this patch is utter crap. But please feel free to educate me 
> on why that is not the case.
> 
> Here's my explanation:
> 
> Not only is it ugly (which is already sufficient ground to suspect it is 
> wrong or could at least be done better), but reading the comment, it makes 
> no sense at all. You only put the barrier in the "goto repeat" case, but 
> the thing is, if you worry about radix tree slot not being reloaded in the 
> repeat case, then you damn well should worry about it not being reloaded 
> in the non-repeat case too!

In which case atomic_inc_unless is defined to provide a barrier.

 
> The code is immediately followed by a test to see that the page is still 
> the same in the slot, ie this:
> 
>                 /*
>                  * Has the page moved?
>                  * This is part of the lockless pagecache protocol. See
>                  * include/linux/pagemap.h for details.
>                  */
>                 if (unlikely(page != *pagep)) {
> 
> and if you need a barrier for the repeat case, you need one for this case 
> too.
> 
> In other words, it looks like you fixed the symptom, but not the real 
> cause! That's not how we work in the kernel.
> 
> The real cause, btw, appears to be that radix_tree_deref_slot() is a piece 
> of slimy sh*t, and has not been correctly updated to RCU. The proper fix 
> doesn't require any barriers that I can see - I think the proper fix is 
> this simple one-liner.
> 
> If you use RCU to protect a data structure, then any data loaded from that 
> data structure that can change due to RCU should be loaded with 
> "rcu_dereference()". 

It doesn't need that because the last level pointers in the radix
tree are not necessarily under RCU, but whatever synchronisation
the caller uses (in this case, speculative page references, which
should not require smp_read_barrier_depends, AFAIKS). Putting an
rcu_dereference there might work, but I think it misses a subtlety
of this code.


> Now, I can't test this, because it makes absolutely no difference for me 
> (the diff isn't empty, but the asm changes seem to be all due to just gcc 
> variable numbering changing). I can't seem to see the buggy code. Maybe it 
> needs a specific compiler version, or some specific config option to 
> trigger?
> 
> So because I can't see the issue, I also obviously can't verify that it's 
> the only possible case. Maybe there is some other memory access that 
> should also be done with the proper rcu accessors?
> 
> Of course, it's also possible that we should just put a barrier in 
> page_cache_get_speculative(). That doesn't seem to make a whole lot of 
> conceptual sense, though (the same way that your barrier() didn't make any 
> sense - I don't see that the barrier has absolutely _anything_ to do with 
> whether the speculative getting of the page fails or not!)

When that fails, the caller can (almost) assume the pointer has changed.
So it has to load the new pointer to continue. The object pointed to is
not protected with RCU, nor is there a requirement to see a specific
load execution ordering. 
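
To make the retry argument above concrete, here is a rough userspace
sketch of the lookup loop, written with C11 atomics rather than kernel
primitives. It is only a model under stated assumptions: struct page is
reduced to a reference count, and every name ending in _sketch is
hypothetical, not the kernel's. The relaxed atomic load of the slot
imposes no ordering (matching the point that no specific load ordering
is required), but in practice it keeps the compiler from reusing a
value cached in a register when the loop goes back to repeat.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only a reference count. */
struct page {
	atomic_int refcount;
};

/*
 * Take a reference unless the count is already zero. The default C11
 * atomic operations used here are sequentially consistent, so the
 * success path is ordered against the later slot re-check.
 */
static bool get_page_unless_zero_sketch(struct page *page)
{
	int count = atomic_load(&page->refcount);

	while (count != 0) {
		/* On failure, count is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&page->refcount, &count,
						 count + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by get_page_unless_zero_sketch(). */
static void put_page_sketch(struct page *page)
{
	int old = atomic_fetch_sub(&page->refcount, 1);

	assert(old > 0);
	(void)old;
}

/* Model of the lockless lookup: slot plays the radix tree slot. */
static struct page *find_get_page_sketch(struct page *_Atomic *slot)
{
	struct page *page;

repeat:
	/* The slot is loaded again on every pass through the loop. */
	page = atomic_load_explicit(slot, memory_order_relaxed);
	if (page == NULL)
		return NULL;

	/* Speculative get failed: the slot has (almost) certainly changed. */
	if (!get_page_unless_zero_sketch(page))
		goto repeat;

	/* Has the page moved? If so, drop the reference and start over. */
	if (page != atomic_load_explicit(slot, memory_order_relaxed)) {
		put_page_sketch(page);
		goto repeat;
	}
	return page;
}
```

In this model, a lookup on a slot holding a page with refcount 1
returns that page with refcount 2, and a lookup on an empty slot
returns NULL. A slot holding a page whose count is zero only makes
progress once another thread updates the slot, which is why the load
at repeat has to be a real load each time around.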

>
> In general, I'd like fewer "band-aid" patches, and more "deep thinking" 
> patches. I'm not saying mine is very deep either, but I think it's at 
> least scratching the surface of the real problem rather than just trying to 
> cover it up.


