: Fix SMP-reordering race in mark_buffer_dirty

Message ID Pine.LNX.4.64.0804022058530.13737@artax.karlin.mff.cuni.cz
State New, archived
Series
  • : Fix SMP-reordering race in mark_buffer_dirty

Commit Message

Mikulas Patocka April 2, 2008, 7:20 p.m. UTC
Hi

It looks like someone overoptimized mark_buffer_dirty(). 

mark_buffer_dirty() is
void mark_buffer_dirty(struct buffer_head *bh)
{
        WARN_ON_ONCE(!buffer_uptodate(bh));
        if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
                __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
}

That buffer_dirty() test is not an atomic operation, so it may be reordered 
with other memory accesses.

So suppose this race:

CPU1:

write to buffer data
call mark_buffer_dirty()
test for !buffer_dirty(bh)

--- there is no synchronizing operation, so inside the CPU it may get 
reordered to:

test for !buffer_dirty(bh)
write to buffer data

CPU2:
clear_buffer_dirty(bh);
submit_bh(WRITE, bh);

The resulting operations may end up in this order:
CPU1: test for !buffer_dirty(bh) --- sees that the bit is set
CPU2: clear_buffer_dirty(bh);
CPU2: submit_bh(WRITE, bh);
CPU1: write to buffer data

So we have a clean buffer with modified data and this modification is 
going to be lost.
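
To make the window concrete, here is a user-space analogue of the two code 
paths (an illustration only -- the names and the C11 atomics are mine, not 
kernel code, and the program merely shows the shape of the paths; the lost 
update needs the cross-CPU interleaving above):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int data;		/* stands for the buffer contents */
static atomic_int dirty = 1;	/* stands for BH_Dirty; buffer starts dirty */

/* CPU1: modify the buffer, then mark it dirty via the plain test */
static void cpu1_update(int v)
{
	atomic_store_explicit(&data, v, memory_order_relaxed);
	/* no barrier here: the relaxed load below may be hoisted above the store */
	if (!atomic_load_explicit(&dirty, memory_order_relaxed))
		atomic_exchange(&dirty, 1);	/* analogue of test_set_buffer_dirty() */
}

/* CPU2: write-out path -- clear_buffer_dirty() followed by submit_bh() */
static void cpu2_writeout(void)
{
	atomic_store_explicit(&dirty, 0, memory_order_relaxed);
	printf("writing out data = %d\n",
	       atomic_load_explicit(&data, memory_order_relaxed));
}

int main(void)
{
	cpu1_update(42);
	cpu2_writeout();
	return 0;
}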

Mikulas


Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>


Comments

Linus Torvalds April 2, 2008, 7:44 p.m. UTC | #1
On Wed, 2 Apr 2008, Mikulas Patocka wrote:
> +	/*
> +	 * Make sure that the test for buffer_dirty(bh) is not reordered with
> +	 * previous modifications to the buffer data.
> +	 * -- mikulas
> +	 */
> +	smp_mb();
>  	WARN_ON_ONCE(!buffer_uptodate(bh));
>  	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))

At that point, the better patch is to just *remove* the buffer_dirty() 
test, and rely on the stronger ordering requirements of 
test_set_buffer_dirty().

The whole - and only - point of the buffer_dirty() check was to avoid the 
more expensive test_set_buffer_dirty() call, but it's only more expensive 
because of the barrier semantics. So if you add a barrier, the point goes 
away and you should instead remove the optimization.

(I also seriously doubt you can actually trigger this in real life, but 
simplifying the code is probably fine regardless).
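
Concretely, the simplification suggested here would reduce the function to 
something like this (a sketch of the idea only; compare the patch Linus 
posts later in the thread, which instead keeps the fast path behind an 
explicit smp_mb()):

void mark_buffer_dirty(struct buffer_head *bh)
{
	WARN_ON_ONCE(!buffer_uptodate(bh));
	/*
	 * test_set_buffer_dirty() is a locked read-modify-write, so it
	 * already orders the preceding writes to the buffer data against
	 * the dirty-bit test; no plain-read fast path, no extra smp_mb().
	 */
	if (!test_set_buffer_dirty(bh))
		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
}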

		Linus
Mikulas Patocka April 2, 2008, 9:03 p.m. UTC | #2
On Wed, 2 Apr 2008, Linus Torvalds wrote:

> On Wed, 2 Apr 2008, Mikulas Patocka wrote:
> > +	/*
> > +	 * Make sure that the test for buffer_dirty(bh) is not reordered with
> > +	 * previous modifications to the buffer data.
> > +	 * -- mikulas
> > +	 */
> > +	smp_mb();
> >  	WARN_ON_ONCE(!buffer_uptodate(bh));
> >  	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
> 
> At that point, the better patch is to just *remove* the buffer_dirty() 
> test, and rely on the stronger ordering requirements of 
> test_set_buffer_dirty().
> 
> The whole - and only - point of the buffer_dirty() check was to avoid the 
> more expensive test_set_buffer_dirty() call, but it's only more expensive 
> because of the barrier semantics. So if you add a barrier, the point goes 
> away and you should instead remove the optimization.
> 
> (I also seriously doubt you can actually trigger this in real life, but 
> simplifying the code is probably fine regardless).
> 
> 		Linus

I measured it:
On Core2, mfence is faster (8 ticks) than lock btsl (27 ticks)
On Pentium-4-prescott, mfence is 124 ticks and lock btsl is 86 ticks.
On Pentium-4-pre-prescott, mfence wins again (100) and lock btsl is 120.
On Athlon, mfence is 16 ticks and lock btsl is 19 ticks.

So you're right, the gain of mfence is so little that you can remove it 
and use only test_set_buffer_dirty.

I don't know if there are other architectures where smp_mb() would be 
significantly faster than test_and_set_bit.
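
For reference, the numbers above come from the kind of rdtsc microbenchmark 
sketched below (my own reconstruction, not Mikulas's actual test program; 
rdtsc is not serialized here, so treat the per-op figures as ballpark):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	enum { N = 10000000 };
	volatile int word = 0;
	uint64_t t0, t1, t2;
	int i;

	t0 = rdtsc();
	for (i = 0; i < N; i++)
		__asm__ volatile ("mfence" ::: "memory");
	t1 = rdtsc();
	for (i = 0; i < N; i++)
		__asm__ volatile ("lock; btsl $0, %0" : "+m" (word) : : "cc");
	t2 = rdtsc();

	printf("mfence:    %.1f ticks/op\n", (double)(t1 - t0) / N);
	printf("lock btsl: %.1f ticks/op\n", (double)(t2 - t1) / N);
	return 0;
}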

Mikulas
Linus Torvalds April 2, 2008, 9:31 p.m. UTC | #3
On Wed, 2 Apr 2008, Mikulas Patocka wrote:
> 
> So you're right, the gain of mfence is so little that you can remove it 
> and use only test_set_buffer_dirty.

Well, I suspect that part of the issue is that quite often you end up 
with *both* because the buffer wasn't already dirty from before.

Re-dirtying a dirty buffer is pretty common for things like bitmap blocks 
etc, so it's probably a worthy optimization if it has no cost, and on 
Core2 I suspect your version is worth it, but it's not like it's going to 
be necessarily a 99% kind of case. I suspect quite a lot of the 
mark_buffer_dirty() calls are actually on clean buffers.

(Of course, a valid argument is that if it was already dirty, we'll skip 
the other expensive parts, so only the "already dirty" case is worth 
optimizing for. Maybe true. There might also be cases where it means one 
less dirty cacheline in memory.)

> I don't know if there are other architectures where smp_mb() would be 
> significantly faster than test_and_set_bit.

Probably none, since test_and_set_bit() implies an smp_mb(), and 
generally the bigger cost is in the barrier rather than in the bit setting 
itself.

Core 2 is the outlier in having a noticeably faster "mfence" than atomic 
instructions (and judging by noises Intel makes, Nehalem will undo that 
outlier).

				Linus
Linus Torvalds April 2, 2008, 9:35 p.m. UTC | #4
On Wed, 2 Apr 2008, Linus Torvalds wrote:
> 
> Core 2 is the outlier in having a noticeably faster "mfence" than atomic 
> instructions

Side note: mfence is probably faster than 6 cycles on Core 2. I've seen it 
be basically zero-cost. Try adding a few memory ops around it - the timing 
will probably change. Core 2 has this odd behaviour that it sometimes does 
worse on the *really* trivial things that don't happen in real life.

			Linus
Andrew Morton April 2, 2008, 10:01 p.m. UTC | #5
On Wed, 2 Apr 2008 12:44:05 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 2 Apr 2008, Mikulas Patocka wrote:
> > +	/*
> > +	 * Make sure that the test for buffer_dirty(bh) is not reordered with
> > +	 * previous modifications to the buffer data.
> > +	 * -- mikulas
> > +	 */
> > +	smp_mb();
> >  	WARN_ON_ONCE(!buffer_uptodate(bh));
> >  	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
> 
> At that point, the better patch is to just *remove* the buffer_dirty() 
> test, and rely on the stronger ordering requirements of 
> test_set_buffer_dirty().
> 
> The whole - and only - point of the buffer_dirty() check was to avoid the 
> more expensive test_set_buffer_dirty() call, but it's only more expensive 
> because of the barrier semantics. So if you add a barrier, the point goes 
> away and you should instead remove the optimization.

But then the test-and-set of an already-set flag would newly cause the
cacheline to be dirtied, requiring additional bus usage to write it back?

The CPU's test-and-set-bit operation could of course optimise that away in
this case.  But does it?


Linus Torvalds April 2, 2008, 10:07 p.m. UTC | #6
On Wed, 2 Apr 2008, Andrew Morton wrote:
> 
> But then the test-and-set of an already-set flag would newly cause the
> cacheline to be dirtied, requiring additional bus usage to write it back?
> 
> The CPU's test-and-set-bit operation could of course optimise that away in
> this case.  But does it?

No, afaik no current x86 uarch will optimize away the write on a locked 
instruction if it turns out to be unnecessary.

Can somebody find a timing reason to have the ugly code?

		Linus
Mikulas Patocka April 2, 2008, 10:35 p.m. UTC | #7
On Wed, 2 Apr 2008, Linus Torvalds wrote:

> On Wed, 2 Apr 2008, Mikulas Patocka wrote:
> > 
> > So you're right, the gain of mfence is so little that you can remove it 
> > and use only test_set_buffer_dirty.
> 
> Well, I suspect that part of the issue is that quite often you end up 
> with *both* because the buffer wasn't already dirty from before.
> 
> Re-dirtying a dirty buffer is pretty common for things like bitmap blocks 
> etc, so it's probably a worthy optimization if it has no cost, and on 
> Core2 I suspect your version is worth it, but it's not like it's going to 
> be necessarily a 99% kind of case. I suspect quite a lot of the 
> mark_buffer_dirty() calls are actually on clean buffers.

The cost of doing the buffer lookup and update is much more than 20 
clocks, so the 20-clock mfence/lock difference can be forgotten.

> (Of course, a valid argument is that if it was already dirty, we'll skip 
> the other expensive parts, so only the "already dirty" case is worth 
> optimizing for. Maybe true. There might also be cases where it means one 
> less dirty cacheline in memory.)

That is another good example: when two CPUs are simultaneously updating 
the same buffer (e.g. two inodes in one block, or simultaneous writes to a 
bitmap), test_set_buffer_dirty will cause cacheline ping-pong (which is much 
more expensive than the 20-100 cycles for an interlocked instruction), 
whereas smp_mb()+buffer_dirty() leaves the cacheline untouched.

So maybe there's still a reason for smp_mb().

Mikulas

Mikulas Patocka April 2, 2008, 10:39 p.m. UTC | #8
On Wed, 2 Apr 2008, Linus Torvalds wrote:

> On Wed, 2 Apr 2008, Linus Torvalds wrote:
> > 
> > Core 2 is the outlier in having a noticeably faster "mfence" than atomic 
> > instructions
> 
> Side note: mfence is probably faster than 6 cycles on Core 2. I've seen it 
> be basically zero-cost. Try adding a few memory ops around it - the timing 
> will probably change. Core 2 has this odd behaviour that it sometimes does 
> worse on the *really* trivial things that don't happen in real life.
> 
> 			Linus

When I added memory reads around mfence, the combined result was worse 
than the sum of reads and mfence --- mfence alone 8 ticks, mfence+4reads 
18 ticks.

mfence+writes scales linearly, i.e. mfence + 4 writes does 12 ticks.

mfence can fully overlap with register operations.

Mikulas
Linus Torvalds April 2, 2008, 10:51 p.m. UTC | #9
On Thu, 3 Apr 2008, Mikulas Patocka wrote:
> 
> When I added memory reads around mfence, the combined result was worse 
> than the sum of reads and mfence --- mfence alone 8 ticks, mfence+4reads 
> 18 ticks.

Sorry, my test-program was buggy.

> mfence+writes scales linearly, i.e. mfence + 4 writes does 12 ticks.

I get different results now that I fixed my memop test. I get a loop with 
reads and writes taking 3 cycles without mfence, and 21 cycles with (read 
after, write before). 

			Linus
Mikulas Patocka April 2, 2008, 10:53 p.m. UTC | #10
On Wed, 2 Apr 2008, Linus Torvalds wrote:

> 
> 
> On Wed, 2 Apr 2008, Andrew Morton wrote:
> > 
> > But then the test-and-set of an already-set flag would newly cause the
> > cacheline to be dirtied, requiring additional bus usage to write it back?
> > 
> > The CPU's test-and-set-bit operation could of course optimise that away in
> > this case.  But does it?
> 
> No, afaik no current x86 uarch will optimize away the write on a locked 
> instruction if it turns out to be unnecessary.

No, it doesn't. Try this:

#include <string.h>
#include <pthread.h>

/* Each thread hammers "lock btsl" on a bit that is already set. */
void *pth(void *p)
{
        int i;
        for (i = 0; i < 100000000; i++)
                __asm__ volatile ("lock; btsl $0, %0" : : "m" (*(int *)p) : "cc");
        return NULL;
}

int args[2000];

int main(void)
{
        pthread_t t1, t2, t3, t4;
        memset(args, -1, sizeof args);  /* all bits set: btsl never changes anything */
        pthread_create(&t1, NULL, pth, &args[0]);
        pthread_create(&t2, NULL, pth, &args[16]);
        pthread_create(&t3, NULL, pth, &args[32]);
        pthread_create(&t4, NULL, pth, &args[48]);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        pthread_join(t4, NULL);
        return 0;
}

--- when the &args[] indices fall into the same cacheline, I get 9 times 
slower execution. I tried it on two dual-core Core 2 Xeons.

Mikulas

> Can somebody find a timing reason to have the ugly code?
> 
> 		Linus
> 
Linus Torvalds April 2, 2008, 11:52 p.m. UTC | #11
On Wed, 2 Apr 2008, Andrew Morton wrote:
> 
> But then the test-and-set of an already-set flag would newly cause the
> cacheline to be dirtied, requiring additional bus usage to write it back?

Looking around a bit, I don't see any realistic case where this could 
possibly be the case and that is performance-sensitive.

The VFS-level uses of mark_buffer_dirty() seem to all be coupled with 
other uses that clear or set other bits in the buffer status word, so the 
cacheline will always apparently be dirty. 

The low-level filesystems sometimes do it for things like block bitmap 
changes, and I could imagine that there it actually (a) is no longer in 
the cache and (b) the buffer really was dirty to start with, but in ext3 
for example, you'd end up in journal_dirty_metadata which spinlocks on the 
BH_State bit first etc etc.

So the cacheline *will* be dirty, and this function doesn't seem like it 
could possibly ever show up in a real profile for any real load anyway, so 
it seems odd to try to optimize it this way.

			Linus
Andrew Morton April 3, 2008, 2:12 a.m. UTC | #12
On Wed, 2 Apr 2008 16:52:14 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 2 Apr 2008, Andrew Morton wrote:
> > 
> > But then the test-and-set of an already-set flag would newly cause the
> > cacheline to be dirtied, requiring additional bus usage to write it back?
> 
> Looking around a bit, I don't see any realistic case where this could 
> possibly be the case and that is performance-sensitive.

You sure?  A pretty common case would be an overwrite of an already-dirty page,
and from a quick read the only places where we modify bh.b_state are the
set_buffer_uptodate() and clear_buffer_new() in __block_commit_write(),
both of which could/should be converted to use the same trick.  Like
__block_prepare_write(), which already does

	if (!buffer_uptodate(bh))
		set_buffer_uptodate(bh);
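
Spelled out, the conversion being suggested for __block_commit_write() would
be something like the fragment below (hypothetical, with the surrounding
per-buffer loop omitted; whether it is worth doing is exactly what is being
debated):

	if (!buffer_uptodate(bh))	/* safe: uptodate is sticky once set */
		set_buffer_uptodate(bh);
	if (buffer_new(bh))		/* skip the write if the bit is already clear */
		clear_buffer_new(bh);
	mark_buffer_dirty(bh);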


What happened here was that back in about, umm, 2001 we discovered one or two
code paths which, when optimised in this way, led to overall-measurable (not
just oprofile-measurable) improvements.  I don't recall which ones they
were.

So we then said oh-goody and sprinkled the same pattern all over the place
on the off-chance.  But I'm sure that over the ages we've let that
optimisation rot (witness __block_commit_write() above).

These were simpler times, and we didn't worry our little heads overly much
about this ordering stuff.

> The VFS-level uses of mark_buffer_dirty() seem to all be coupled with 
> other uses that clear or set other bits in the buffer status word, so the 
> cacheline will always apparently be dirty. 

As I say, I expect we could fix this if we want to.  The key point here is
that a page overwrite does not do lock_buffer(), so it should be possible
to do the whole operation without modifying bh.b_state.  If we wish to do
that.

> The low-level filesystems sometimes do it for things like block bitmap 
> changes, and I could imagine that there it actually (a) is no longer in 
> the cache and (b) the buffer really was dirty to start with, but in ext3 
> for example, you'd end up in journal_dirty_metadata which spinlocks on the 
> BH_State bit first etc etc.

Yeah, ext3 is probably a lost cause.  Anyone who cares about performance is
using ext2 and a UPS ;)

> So the cacheline *will* be dirty, and this function doesn't seem like it 
> could possibly ever show up in a real profile for any real load anyway, so 
> it seems odd to try to optimize it this way.
> 

I don't think the world would end if we took it out.  Particularly as not
many people use ext2 and we already broke it.

The trick will be to hunt down all the other places where we did a similar
thing and check them.
Linus Torvalds April 3, 2008, 2:34 p.m. UTC | #13
On Wed, 2 Apr 2008, Andrew Morton wrote:
> 
> You sure?  A pretty common case would be an overwrite of an already-dirty page,
> and from a quick read the only places where we modify bh.b_state are the
> set_buffer_uptodate() and clear_buffer_new() in __block_commit_write(),
> both of which could/should be converted to use the same trick.  Like
> __block_prepare_write(), which already does
> 
> 	if (!buffer_uptodate(bh))
> 		set_buffer_uptodate(bh);

Well, that particular optimization is safe, but it's safe because 
"uptodate" is sticky. Once it gets set, it is never reset.

But in general, it's simply a bad idea to do

	if (read)
		atomic-read-modify-write;

because it so often has races. This is pretty much exactly the same bug as 
we had not long ago with

	if (!waitqueue_empty(..))
		wake_up(..);

and for very similar reasons - the "read" part is very fast, yes, but it's 
also by definition not actually doing all the careful things that the 
atomic operation (whether a CPU-atomic one, or a software-written atomic 
with a spinlock one) does.

> What happened here was that back in about, umm, 2001 we discovered one or two
> code paths which, when optimised in this way, led to overall-measurable (not
> just oprofile-measurable) improvements.  I don't recall which ones they
> were.
> 
> So we then said oh-goody and sprinkled the same pattern all over the place
> on the off-chance.  But I'm sure that over the ages we've let that
> optimisation rot (witness __block_commit_write() above).

And the problem is that 

	if (!buffer_uptodate(bh))
		set_buffer_uptodate(bh);

really isn't "the same" optimization at all as

	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh)) {
			..

and the latter is simply fundamentally different.

> As I say, I expect we could fix this if we want to.  The key point here is
> that a page overwrite does not do lock_buffer(), so it should be possible
> to do the whole operation without modifying bh.b_state.  If we wish to do
> that.

Well, if we really want to do this op, then I'd rather make the code be 
really obvious what the smp_mb is about, but also make sure that we don't 
unnecessarily do *both* the smp_mb and the actual already-serialized bit 
operation.

But I'd be even happier if we only did these kinds of things when we have 
real performance-data that they help.

		Linus
---
 fs/buffer.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9819632..39ff144 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1181,7 +1181,20 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
 void mark_buffer_dirty(struct buffer_head *bh)
 {
 	WARN_ON_ONCE(!buffer_uptodate(bh));
-	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
+
+	/*
+	 * Very *carefully* optimize the it-is-already-dirty case.
+	 *
+	 * Don't let the final "is it dirty" escape to before we
+	 * perhaps modified the buffer.
+	 */
+	if (buffer_dirty(bh)) {
+		smp_mb();
+		if (buffer_dirty(bh))
+			return;
+	}
+
+	if (!test_set_buffer_dirty(bh))
 		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
 }
 

Patch

--- linux-2.6.25-rc8/fs/buffer.c_	2008-04-02 21:08:36.000000000 +0200
+++ linux-2.6.25-rc8/fs/buffer.c	2008-04-02 21:10:25.000000000 +0200
@@ -1180,6 +1180,12 @@ 
  */
 void mark_buffer_dirty(struct buffer_head *bh)
 {
+	/*
+	 * Make sure that the test for buffer_dirty(bh) is not reordered with
+	 * previous modifications to the buffer data.
+	 * -- mikulas
+	 */
+	smp_mb();
 	WARN_ON_ONCE(!buffer_uptodate(bh));
 	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
 		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);