linux-kernel.vger.kernel.org archive mirror
* [PATCH RESEND] Minimal fix for private_list handling races
@ 2008-01-22 17:10 Jan Kara
  2008-01-23  1:00 ` Nick Piggin
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-01-22 17:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

  Hi,

  as I got no answer for a week, I'm resending this fix for races in
private_list handling. Andrew, do you like them more than the previous
version?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

There are two possible races in the handling of private_list in the buffer
cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_map and moves the buffer to its local 'tmp' list. Now drop_buffers()
comes along, sees the buffer is on a list and so calls __remove_assoc_queue(),
which complains about b_assoc_map being cleared (as it cannot propagate a
possible IO error). This race has actually been observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on a bh which is already on the
'tmp' list of fsync_buffers_list(). As the buffer is on some list (note that
the check is performed without private_lock), it is not re-added to the
mapping's private_list, and after fsync_buffers_list() finishes we have a
dirty buffer which should be on private_list but isn't. This race has not
been reported, probably because most (but not all) callers of
mark_buffer_dirty_inode() hold i_mutex and are thus serialized with fsync().
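
Roughly, the two interleavings look like this (hypothetical sketches, for
illustration only):

Race 1:
  fsync_buffers_list()                 drop_buffers()
    __remove_assoc_queue(bh);
      /* b_assoc_map is now NULL */
    list_add(&bh->b_assoc_buffers, &tmp);
    spin_unlock(lock);  /* to do IO */
                                         if (!list_empty(&bh->b_assoc_buffers))
                                           __remove_assoc_queue(bh);
                                           /* warns: b_assoc_map is NULL, an
                                              IO error cannot be propagated */

Race 2:
  fsync_buffers_list()                 mark_buffer_dirty_inode()
    /* bh already sits on 'tmp' */
                                         /* dirty bit is set, then: */
                                         if (list_empty(&bh->b_assoc_buffers))
                                           /* false - bh is on 'tmp', so it is
                                              not put back on private_list */
    /* 'tmp' is drained; the dirty bh
       ends up on no list */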

Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves
a buffer to its dedicated list, and by reinserting the buffer into
private_list when it is found dirty after the IO has completed.

Signed-off-by: Jan Kara <jack@suse.cz>

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3ffb2b6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
 	struct buffer_head *bh;
 	struct list_head tmp;
+	struct address_space *mapping;
 	int err = 0, err2;
 
 	INIT_LIST_HEAD(&tmp);
@@ -801,9 +802,11 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	spin_lock(lock);
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
+		mapping = bh->b_assoc_map;
 		__remove_assoc_queue(bh);
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
 			list_add(&bh->b_assoc_buffers, &tmp);
+			bh->b_assoc_map = mapping;
 			if (buffer_dirty(bh)) {
 				get_bh(bh);
 				spin_unlock(lock);
@@ -828,8 +831,13 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
 			err = -EIO;
-		brelse(bh);
 		spin_lock(lock);
+		if (buffer_dirty(bh) && list_empty(&bh->b_assoc_buffers)) {
+			BUG_ON(!bh->b_assoc_map);
+			list_add(&bh->b_assoc_buffers,
+				 &bh->b_assoc_map->private_list);
+		}
+		brelse(bh);
 	}
 	
 	spin_unlock(lock);


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-22 17:10 [PATCH RESEND] Minimal fix for private_list handling races Jan Kara
@ 2008-01-23  1:00 ` Nick Piggin
  2008-01-23 13:30   ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2008-01-23  1:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1305 bytes --]

On Wednesday 23 January 2008 04:10, Jan Kara wrote:
>   Hi,
>
>   as I got no answer for a week, I'm resending this fix for races in
> private_list handling. Andrew, do you like them more than the previous
> version?

FWIW, I reviewed this, and it looks OK although I think some comments
would be in order.

What would be really nice is to avoid the use of b_assoc_buffers
completely in this function like I've attempted (untested). I don't
know if you'd actually call that an improvement...?

Couple of things I noticed while looking at this code.

- What is osync_buffers_list supposed to do? I couldn't actually
  work it out. Why do we care about waiting for these buffers on
  here that were added while waiting for writeout of other buffers
  to finish? Can we just remove it completely? I must be missing
  something.

- What are the get_bh(bh) things supposed to do? Protect the lifetime
  of a given bh while "lock" is dropped? That's nice, ignoring the
  fact that we brelse(bh) *before* taking the lock again... but isn't
  every single other buffer that we _haven't_ elevated its reference
  exposed to exactly the same lifetime problem? IOW, either it is not
  required at all, or it is required for _all_ buffers? (my patch
  should fix this).

Hmm, now I remember why I rewrote this file :P

[-- Attachment #2: fs-fsync_buffers_list-fix.patch --]
[-- Type: text/x-diff, Size: 1910 bytes --]

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -792,47 +792,53 @@ EXPORT_SYMBOL(__set_page_dirty_buffers);
  */
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
+	struct buffer_head *batch[16];
+	int i, idx, done;
 	struct buffer_head *bh;
-	struct list_head tmp;
 	int err = 0, err2;
 
-	INIT_LIST_HEAD(&tmp);
-
+again:
 	spin_lock(lock);
+	idx = 0;
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
 		__remove_assoc_queue(bh);
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
-			list_add(&bh->b_assoc_buffers, &tmp);
-			if (buffer_dirty(bh)) {
-				get_bh(bh);
-				spin_unlock(lock);
-				/*
-				 * Ensure any pending I/O completes so that
-				 * ll_rw_block() actually writes the current
-				 * contents - it is a noop if I/O is still in
-				 * flight on potentially older contents.
-				 */
-				ll_rw_block(SWRITE, 1, &bh);
-				brelse(bh);
-				spin_lock(lock);
-			}
+			batch[idx++] = bh;
+			get_bh(bh);
 		}
+
+		if (idx == 16)
+			break;
 	}
+	done = list_empty(list);
+	spin_unlock(lock);
 
-	while (!list_empty(&tmp)) {
-		bh = BH_ENTRY(tmp.prev);
-		list_del_init(&bh->b_assoc_buffers);
-		get_bh(bh);
-		spin_unlock(lock);
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
+		if (buffer_dirty(bh)) {
+			/*
+			 * Ensure any pending I/O completes so
+			 * that ll_rw_block() actually writes
+			 * the current contents - it is a noop
+			 * if I/O is still in flight on
+			 * potentially older contents.
+			 */
+			ll_rw_block(SWRITE, 1, &bh);
+		}
+	}
+	for (i = 0; i < idx; i++) {
+		bh = batch[i];
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
 			err = -EIO;
 		brelse(bh);
-		spin_lock(lock);
 	}
+
+	idx = 0;
+	if (!done)
+		goto again;
 	
-	spin_unlock(lock);
 	err2 = osync_buffers_list(lock, list);
 	if (err)
 		return err;


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-23  1:00 ` Nick Piggin
@ 2008-01-23 13:30   ` Jan Kara
  2008-01-23 15:05     ` Nick Piggin
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-01-23 13:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton

On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> >   Hi,
> >
> >   as I got no answer for a week, I'm resending this fix for races in
> > private_list handling. Andrew, do you like them more than the previous
> > version?
> 
> FWIW, I reviewed this, and it looks OK although I think some comments
> would be in order.
  Thanks.

> What would be really nice is to avoid the use of b_assoc_buffers
> completely in this function like I've attempted (untested). I don't
> know if you'd actually call that an improvement...?
  I thought about this solution as well. But the main issue I had with it is
that currently you nicely submit all the metadata buffers at once, so that
the block layer can sort them and write them in a nice order. With the array
you submit buffers 16 at a time (or any other fixed amount) and in mostly
random order... So IMHO fsync would become measurably slower.

> Couple of things I noticed while looking at this code.
> 
> - What is osync_buffers_list supposed to do? I couldn't actually
>   work it out. Why do we care about waiting for these buffers on
>   here that were added while waiting for writeout of other buffers
>   to finish? Can we just remove it completely? I must be missing
>   something.
  The problem here is that mark_buffer_dirty_inode() can move the buffer
from 'tmp' list back to private_list while we are waiting for another
buffer...
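
Roughly the sequence I mean (a sketch, not real code):

	fsync_buffers_list()
	  /* drains private_list onto 'tmp', dropping 'lock'
	     while it submits and waits for IO */
	                                 mark_buffer_dirty_inode(bh, inode)
	                                   /* bh ends up (back) on private_list */
	  /* 'tmp' is empty -> the main loops are done */
	  err2 = osync_buffers_list(lock, list);
	  /* waits for the in-flight IO on whatever landed on
	     private_list in the meantime */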

> - What are the get_bh(bh) things supposed to do? Protect the lifetime
>   of a given bh while "lock" is dropped? That's nice, ignoring the
>   fact that we brelse(bh) *before* taking the lock again... but isn't
>   every single other buffer that we _haven't_ elevated its reference
>   exposed to exactly the same lifetime problem? IOW, either it is not
>   required at all, or it is required for _all_ buffers? (my patch
>   should fix this).
  I think this get_bh() should stop try_to_free_buffers() from removing the
buffer. brelse() before taking the private_lock again is fine, because the
loop condition is while (!list_empty(&tmp)), so we really don't care what
happens with the buffer after we are done with it. So I think that logic is
actually fine.
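
I.e. the second loop in the existing code, annotated:

	while (!list_empty(&tmp)) {
		bh = BH_ENTRY(tmp.prev);
		list_del_init(&bh->b_assoc_buffers);
		get_bh(bh);		/* pin bh against try_to_free_buffers() */
		spin_unlock(lock);
		wait_on_buffer(bh);
		if (!buffer_uptodate(bh))
			err = -EIO;
		brelse(bh);		/* the next iteration re-reads tmp.prev,   */
		spin_lock(lock);	/* so nothing touches this bh afterwards - */
	}				/* dropping the ref before the lock is OK  */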

> Hmm, now I remember why I rewrote this file :P
  Well, I wrote a patch which cleaned up this logic and got rid of that tmp
list because the handling is really messy but Andrew didn't like it and
wanted just small changes if possible...

> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -792,47 +792,53 @@ EXPORT_SYMBOL(__set_page_dirty_buffers);
>   */
>  static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>  {
> +	struct buffer_head *batch[16];
> +	int i, idx, done;
>  	struct buffer_head *bh;
> -	struct list_head tmp;
>  	int err = 0, err2;
>  
> -	INIT_LIST_HEAD(&tmp);
> -
> +again:
>  	spin_lock(lock);
> +	idx = 0;
>  	while (!list_empty(list)) {
>  		bh = BH_ENTRY(list->next);
>  		__remove_assoc_queue(bh);
>  		if (buffer_dirty(bh) || buffer_locked(bh)) {
> -			list_add(&bh->b_assoc_buffers, &tmp);
> -			if (buffer_dirty(bh)) {
> -				get_bh(bh);
> -				spin_unlock(lock);
> -				/*
> -				 * Ensure any pending I/O completes so that
> -				 * ll_rw_block() actually writes the current
> -				 * contents - it is a noop if I/O is still in
> -				 * flight on potentially older contents.
> -				 */
> -				ll_rw_block(SWRITE, 1, &bh);
> -				brelse(bh);
> -				spin_lock(lock);
> -			}
> +			batch[idx++] = bh;
> +			get_bh(bh);
>  		}
> +
> +		if (idx == 16)
> +			break;
>  	}
> +	done = list_empty(list);
> +	spin_unlock(lock);
>  
> -	while (!list_empty(&tmp)) {
> -		bh = BH_ENTRY(tmp.prev);
> -		list_del_init(&bh->b_assoc_buffers);
> -		get_bh(bh);
> -		spin_unlock(lock);
> +	for (i = 0; i < idx; i++) {
> +		bh = batch[i];
> +		if (buffer_dirty(bh)) {
> +			/*
> +			 * Ensure any pending I/O completes so
> +			 * that ll_rw_block() actually writes
> +			 * the current contents - it is a noop
> +			 * if I/O is still in flight on
> +			 * potentially older contents.
> +			 */
> +			ll_rw_block(SWRITE, 1, &bh);
> +		}
> +	}
> +	for (i = 0; i < idx; i++) {
> +		bh = batch[i];
>  		wait_on_buffer(bh);
>  		if (!buffer_uptodate(bh))
>  			err = -EIO;
>  		brelse(bh);
> -		spin_lock(lock);
>  	}
> +
> +	idx = 0;
> +	if (!done)
> +		goto again;
>  	
> -	spin_unlock(lock);
>  	err2 = osync_buffers_list(lock, list);
>  	if (err)
>  		return err;

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-23 13:30   ` Jan Kara
@ 2008-01-23 15:05     ` Nick Piggin
  2008-01-23 15:48       ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2008-01-23 15:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, Andrew Morton

On Thursday 24 January 2008 00:30, Jan Kara wrote:
> On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> > On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> > >   Hi,
> > >
> > >   as I got no answer for a week, I'm resending this fix for races in
> > > private_list handling. Andrew, do you like them more than the previous
> > > version?
> >
> > FWIW, I reviewed this, and it looks OK although I think some comments
> > would be in order.
>
>   Thanks.
>
> > What would be really nice is to avoid the use of b_assoc_buffers
> > completely in this function like I've attempted (untested). I don't
> > know if you'd actually call that an improvement...?
>
>   I thought about this solution as well. But main issue I had with this
> solution is that currently, you nicely submit all the metadata buffers at
> once, so that block layer can sort them and write them in nice order. With
> the array you submit buffers by 16 (or any other fixed amount) and in
> mostly random order... So IMHO fsync would become measurably slower.

Oh, I don't know the filesystems very well... which ones would
attach a large number of metadata buffers to the inode?


> > Couple of things I noticed while looking at this code.
> >
> > - What is osync_buffers_list supposed to do? I couldn't actually
> >   work it out. Why do we care about waiting for these buffers on
> >   here that were added while waiting for writeout of other buffers
> >   to finish? Can we just remove it completely? I must be missing
> >   something.
>
>   The problem here is that mark_buffer_dirty_inode() can move the buffer
> from 'tmp' list back to private_list while we are waiting for another
> buffer...

Hmm, no, not while we're waiting for another buffer, because b_assoc_buffers
will not be empty. However, it is possible between removal from the inode
list and insertion onto the temp list, I think, because the

  if (list_empty(&bh->b_assoc_buffers)) {

check in mark_buffer_dirty_inode() is done without private_lock held. Nice.

With that in mind, doesn't your first patch suffer from a race due to
exactly this unlocked list_empty check when you are removing clean buffers
from the queue?

                             if (!buffer_dirty(bh) && !buffer_locked(bh))
mark_buffer_dirty()
if (list_empty(&bh->b_assoc_buffers))
     /* add */
                                 __remove_assoc_queue(bh);

Which results in the buffer being dirty but not on the ->private_list,
doesn't it?

But let's see... there must be a memory ordering problem here in existing
code anyway, because I don't see any barriers. Between b_assoc_buffers and
b_state (via buffer_dirty); fsync_buffers_list vs mark_buffer_dirty_inode,
right?


> > - What are the get_bh(bh) things supposed to do? Protect the lifetime
> >   of a given bh while "lock" is dropped? That's nice, ignoring the
> >   fact that we brelse(bh) *before* taking the lock again... but isn't
> >   every single other buffer that we _haven't_ elevated its reference
> >   exposed to exactly the same lifetime problem? IOW, either it is not
> >   required at all, or it is required for _all_ buffers? (my patch
> >   should fix this).
>
>   I think this get_bh() should stop try_to_free_buffers() from removing the
> buffer. brelse() before taking the private_lock is fine, because the loop
> actually checks for while (!list_empty(tmp)) so we really don't care what
> happens with the buffer after we are done with it. So I think that logic is
> actually fine.

Oh, of course. I overlooked the important fact that the tmp list is
also actually subject to modification via other threads exactly the
same as private_list...


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-23 15:05     ` Nick Piggin
@ 2008-01-23 15:48       ` Jan Kara
  2008-01-25  8:34         ` Nick Piggin
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2008-01-23 15:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton

On Thu 24-01-08 02:05:16, Nick Piggin wrote:
> On Thursday 24 January 2008 00:30, Jan Kara wrote:
> > On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> > > On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> > > >   Hi,
> > > >
> > > >   as I got no answer for a week, I'm resending this fix for races in
> > > > private_list handling. Andrew, do you like them more than the previous
> > > > version?
> > >
> > > FWIW, I reviewed this, and it looks OK although I think some comments
> > > would be in order.
> >
> >   Thanks.
> >
> > > What would be really nice is to avoid the use of b_assoc_buffers
> > > completely in this function like I've attempted (untested). I don't
> > > know if you'd actually call that an improvement...?
> >
> >   I thought about this solution as well. But main issue I had with this
> > solution is that currently, you nicely submit all the metadata buffers at
> > once, so that block layer can sort them and write them in nice order. With
> > the array you submit buffers by 16 (or any other fixed amount) and in
> > mostly random order... So IMHO fsync would become measurably slower.
> 
> Oh, I don't know the filesystems very well... which ones would
> attach a large number of metadata buffers to the inode?
  This logic is actually used only by a few filesystems - ext2 and UDF are
probably the most common ones. For example, for ext2 the indirect blocks
are on the list if the file has been freshly written, so that is roughly
1MB of metadata per 1GB of data (for 4KB blocks; with 1KB blocks it is 4MB
per 1GB). Because seeks are expensive, you could really end up with the
write being 16 times slower when you do it in 16 passes instead of one...
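
Back-of-the-envelope for those numbers (assuming 4-byte block pointers):

	4KB blocks: 4096 / 4 = 1024 pointers per indirect block
	            -> one 4KB indirect block per 4MB of data
	            -> ~1MB (256 buffers) of indirect blocks per 1GB
	1KB blocks: 1024 / 4 =  256 pointers per indirect block
	            -> one 1KB indirect block per 256KB of data
	            -> ~4MB of indirect blocks per 1GB

So a freshly written 1GB file easily has hundreds of associated buffers, and
with an array of 16 that means a corresponding number of separate submission
passes.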

> > > Couple of things I noticed while looking at this code.
> > >
> > > - What is osync_buffers_list supposed to do? I couldn't actually
> > >   work it out. Why do we care about waiting for these buffers on
> > >   here that were added while waiting for writeout of other buffers
> > >   to finish? Can we just remove it completely? I must be missing
> > >   something.
> >
> >   The problem here is that mark_buffer_dirty_inode() can move the buffer
> > from 'tmp' list back to private_list while we are waiting for another
> > buffer...
> 
> Hmm, no not while we're waiting for another buffer because b_assoc_buffers
> will not be empty. However it is possible between removing from the inode
> list and insertion onto the temp list I think, because
> 
>   if (list_empty(&bh->b_assoc_buffers)) {
> 
> check in mark_buffer_dirty_inode is done without private_lock held. Nice.
  Oh yes, that was it.

> With that in mind, doesn't your first patch suffer from a race due to
> exactly this unlocked list_empty check when you are removing clean buffers
> from the queue?
> 
>                              if (!buffer_dirty(bh) && !buffer_locked(bh))
> mark_buffer_dirty()
> if (list_empty(&bh->b_assoc_buffers))
>      /* add */
>                                  __remove_assoc_queue(bh);
> 
> Which results in the buffer being dirty but not on the ->private_list,
> doesn't it?
  Hmm, I'm not sure which patch you mean. The logic removing clean buffers
was in the first version (but there mark_buffer_dirty_inode() was written
differently). In the current version, we re-add the buffer to private_list
if it is found dirty in the second while loop of fsync_buffers_list(), and
that should be enough.

> But let's see... there must be a memory ordering problem here in existing
> code anyway, because I don't see any barriers. Between b_assoc_buffers and
> b_state (via buffer_dirty); fsync_buffers_list vs mark_buffer_dirty_inode,
> right?
  I'm not sure. What exactly do you mean? BTW: spin_lock is a memory barrier,
isn't it?
 
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-23 15:48       ` Jan Kara
@ 2008-01-25  8:34         ` Nick Piggin
  2008-01-25 18:16           ` Jan Kara
  2008-01-28 22:16           ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Nick Piggin @ 2008-01-25  8:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, Andrew Morton

On Thursday 24 January 2008 02:48, Jan Kara wrote:
> On Thu 24-01-08 02:05:16, Nick Piggin wrote:
> > On Thursday 24 January 2008 00:30, Jan Kara wrote:
> > > On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> > > > On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> > > > >   Hi,
> > > > >
> > > > >   as I got no answer for a week, I'm resending this fix for races
> > > > > in private_list handling. Andrew, do you like them more than the
> > > > > previous version?
> > > >
> > > > FWIW, I reviewed this, and it looks OK although I think some comments
> > > > would be in order.
> > >
> > >   Thanks.
> > >
> > > > What would be really nice is to avoid the use of b_assoc_buffers
> > > > completely in this function like I've attempted (untested). I don't
> > > > know if you'd actually call that an improvement...?
> > >
> > >   I thought about this solution as well. But main issue I had with this
> > > solution is that currently, you nicely submit all the metadata buffers
> > > at once, so that block layer can sort them and write them in nice
> > > order. With the array you submit buffers by 16 (or any other fixed
> > > amount) and in mostly random order... So IMHO fsync would become
> > > measurably slower.
> >
> > Oh, I don't know the filesystems very well... which ones would
> > attach a large number of metadata buffers to the inode?
>
>   This logic is actually used only by a few filesystems - ext2 and UDF are
> probably the most common ones. For example for ext2, the indirect blocks
> are on the list if the file is freshly written, so that is roughly around
> 1MB of metadata per 1GB of data (for 4KB blocks, with 1KB blocks it is 4MB
> per 1GB). Because seeks are expensive, you could really end up with the
> write being 16 times slower when you do it in 16 passes instead of one...

Yeah, that's fair enough I suppose. I wasn't thinking you'd have a
huge newly dirtied file, but it could happen. I don't want to cause
regressions.


> > With that in mind, doesn't your first patch suffer from a race due to
> > exactly this unlocked list_empty check when you are removing clean
> > buffers from the queue?
> >
> >                              if (!buffer_dirty(bh) && !buffer_locked(bh))
> > mark_buffer_dirty()
> > if (list_empty(&bh->b_assoc_buffers))
> >      /* add */
> >                                  __remove_assoc_queue(bh);
> >
> > Which results in the buffer being dirty but not on the ->private_list,
> > doesn't it?
>
>   Hmm, I'm not sure about which patch you speak. Logic with removing clean
> buffers has been in the first version (but there mark_buffer_dirty_inode()
> was written differently).

Ah, yes I see I missed that. I like that a lot better.


> In the current version, we readd buffer to 
> private_list if it is found dirty in the second while loop of
> fsync_buffers() and that should be enough.

Sure, I think there is still a data race though, but if there is one
it's already been there for a long time and nobody cares too much about
those anyway.


> > But let's see... there must be a memory ordering problem here in existing
> > code anyway, because I don't see any barriers. Between b_assoc_buffers
> > and b_state (via buffer_dirty); fsync_buffers_list vs
> > mark_buffer_dirty_inode, right?
>
> >   I'm not sure. What exactly do you mean? BTW: spin_lock is a memory
> barrier, isn't it?

In existing code:

mark_buffer_dirty_inode():              fsync_buffers_list():
 test_set_buffer_dirty(bh);              list_del_init(&bh->b_assoc_buffers);
 if (list_empty(&bh->b_assoc_buffers))   if (buffer_dirty(bh)) {
     ...                                   list_add(&bh->b_assoc_buffers, );

These two code sequences can run concurrently because only fsync_buffers_list
takes the lock.

So fsync_buffers_list can speculatively load bh->b_state before
its stores to clear b_assoc_buffers propagate to the CPU running
mark_buffer_dirty_inode.

So if there is a !dirty buffer on the list, then fsync_buffers_list will
remove it from the list, but mark_buffer_dirty_inode won't see it has been
removed from the list and won't re-add it. I think.

This is actually even possible to hit on x86 because they reorder loads
past stores. It needs a smp_mb() before the if (buffer_dirty(bh)) {}.
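
I.e. something like this on the fsync_buffers_list() side (just a sketch):

	__remove_assoc_queue(bh);	/* store: empties bh->b_assoc_buffers */
	smp_mb();			/* order that store before the load below */
	if (buffer_dirty(bh) || ...)	/* load: bh->b_state */

The mark_buffer_dirty_inode() side should already be ordered, because
test_set_buffer_dirty() is a value-returning atomic op and thus a full
barrier, if I remember the rules correctly.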

Actually I very much dislike testing list entries locklessly, because they
are not trivially atomic operations like single stores... which is another
reason why I like your first patch.


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-25  8:34         ` Nick Piggin
@ 2008-01-25 18:16           ` Jan Kara
  2008-01-28 22:16           ` Jan Kara
  1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2008-01-25 18:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton

On Fri 25-01-08 19:34:07, Nick Piggin wrote:
> On Thursday 24 January 2008 02:48, Jan Kara wrote:
> 
  <snip>

> > In the current version, we readd buffer to 
> > private_list if it is found dirty in the second while loop of
> > fsync_buffers() and that should be enough.
> 
> Sure, I think there is still a data race though, but if there is one
> it's already been there for a long time and nobody cares too much about
> those anyway.
> 
> 
> > > But let's see... there must be a memory ordering problem here in existing
> > > code anyway, because I don't see any barriers. Between b_assoc_buffers
> > > and b_state (via buffer_dirty); fsync_buffers_list vs
> > > mark_buffer_dirty_inode, right?
> >
> >   I'm not sure. What exactly do you mean? BTW: spin_lock is a memory
> > barrier, isn't it?
> 
> In existing code:
> 
> mark_buffer_dirty_inode():              fsync_buffers_list():
>  test_set_buffer_dirty(bh);              list_del_init(&bh->b_assoc_buffers);
>  if (list_empty(&bh->b_assoc_buffers))   if (buffer_dirty(bh)) {
>      ...                                   list_add(&bh->b_assoc_buffers, );
> 
> These two code sequences can run concurrently because only fsync_buffers_list
> takes the lock.
> 
> So fsync_buffers_list can speculatively load bh->b_state before
> its stores to clear b_assoc_buffers propagate to the CPU running
> mark_buffer_dirty_inode.
> 
> So if there is a !dirty buffer on the list, then fsync_buffers_list will
> remove it from the list, but mark_buffer_dirty_inode won't see it has been
> removed from the list and won't re-add it. I think.
> 
> This is actually even possible to hit on x86 because they reorder loads
> past stores. It needs a smp_mb() before the if (buffer_dirty(bh)) {}.
  OK, I'll believe you ;) I was never completely sure what all can happen
if two processors access+write some memory without exclusion by a lock...

> Actually I very much dislike testing list entries locklessly, because they
> are not trivially atomic operations like single stores... which is another
> reason why I like your first patch.
  Yes, that was actually the reason why I changed the checks from
list_empty() to b_assoc_map testing. Well, so I'll add the barrier and
maybe also change these list_empty() checks to b_assoc_map tests...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH RESEND] Minimal fix for private_list handling races
  2008-01-25  8:34         ` Nick Piggin
  2008-01-25 18:16           ` Jan Kara
@ 2008-01-28 22:16           ` Jan Kara
  1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2008-01-28 22:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Andrew Morton

On Fri 25-01-08 19:34:07, Nick Piggin wrote:
> > > But let's see... there must be a memory ordering problem here in existing
> > > code anyway, because I don't see any barriers. Between b_assoc_buffers
> > > and b_state (via buffer_dirty); fsync_buffers_list vs
> > > mark_buffer_dirty_inode, right?
> >
> >   I'm not sure. What exactly do you mean? BTW: spin_lock is a memory
> > barrier, isn't it?
> 
> In existing code:
> 
> mark_buffer_dirty_inode():              fsync_buffers_list():
>  test_set_buffer_dirty(bh);              list_del_init(&bh->b_assoc_buffers);
>  if (list_empty(&bh->b_assoc_buffers))   if (buffer_dirty(bh)) {
>      ...                                   list_add(&bh->b_assoc_buffers, );
> 
> These two code sequences can run concurrently because only fsync_buffers_list
> takes the lock.
> 
> So fsync_buffers_list can speculatively load bh->b_state before
> its stores to clear b_assoc_buffers propagate to the CPU running
> mark_buffer_dirty_inode.
> 
> So if there is a !dirty buffer on the list, then fsync_buffers_list will
> remove it from the list, but mark_buffer_dirty_inode won't see it has been
> removed from the list and won't re-add it. I think.
> 
> This is actually even possible to hit on x86 because they reorder loads
> past stores. It needs a smp_mb() before the if (buffer_dirty(bh)) {}.
> 
> Actually I very much dislike testing list entries locklessly, because they
> are not trivially atomic operations like single stores... which is another
> reason why I like your first patch.
  OK, Nick, how do you like the patch below?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

There are two possible races in the handling of private_list in the buffer
cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_map and moves the buffer to its local 'tmp' list. Now drop_buffers()
comes along, sees the buffer is on a list and so calls __remove_assoc_queue(),
which complains about b_assoc_map being cleared (as it cannot propagate a
possible IO error). This race has actually been observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on a bh which is already on the
'tmp' list of fsync_buffers_list(). As the buffer is on some list (note that
the check is performed without private_lock), it is not re-added to the
mapping's private_list, and after fsync_buffers_list() finishes we have a
dirty buffer which should be on private_list but isn't. This race has not
been reported, probably because most (but not all) callers of
mark_buffer_dirty_inode() hold i_mutex and are thus serialized with fsync().

Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves
a buffer to its dedicated list, and by reinserting the buffer into
private_list when it is found dirty after we have submitted it for IO.

Signed-off-by: Jan Kara <jack@suse.cz>

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..76e1ab4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -678,7 +678,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
 	} else {
 		BUG_ON(mapping->assoc_mapping != buffer_mapping);
 	}
-	if (list_empty(&bh->b_assoc_buffers)) {
+	if (!bh->b_assoc_map) {
 		spin_lock(&buffer_mapping->private_lock);
 		list_move_tail(&bh->b_assoc_buffers,
 				&mapping->private_list);
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
 	struct buffer_head *bh;
 	struct list_head tmp;
+	struct address_space *mapping;
 	int err = 0, err2;
 
 	INIT_LIST_HEAD(&tmp);
@@ -801,9 +802,14 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	spin_lock(lock);
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
+		mapping = bh->b_assoc_map;
 		__remove_assoc_queue(bh);
+		/* Avoid race with mark_buffer_dirty_inode() which does
+		 * a lockless check and we rely on seeing the dirty bit */
+		smp_mb();
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
 			list_add(&bh->b_assoc_buffers, &tmp);
+			bh->b_assoc_map = mapping;
 			if (buffer_dirty(bh)) {
 				get_bh(bh);
 				spin_unlock(lock);
@@ -822,8 +828,17 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 
 	while (!list_empty(&tmp)) {
 		bh = BH_ENTRY(tmp.prev);
-		list_del_init(&bh->b_assoc_buffers);
 		get_bh(bh);
+		mapping = bh->b_assoc_map;
+		__remove_assoc_queue(bh);
+		/* Avoid race with mark_buffer_dirty_inode() which does
+		 * a lockless check and we rely on seeing the dirty bit */
+		smp_mb();
+		if (buffer_dirty(bh)) {
+			list_add(&bh->b_assoc_buffers,
+				 &bh->b_assoc_map->private_list);
+			bh->b_assoc_map = mapping;
+		}
 		spin_unlock(lock);
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
@@ -1195,7 +1210,7 @@ void __brelse(struct buffer_head * buf)
 void __bforget(struct buffer_head *bh)
 {
 	clear_buffer_dirty(bh);
-	if (!list_empty(&bh->b_assoc_buffers)) {
+	if (bh->b_assoc_map) {
 		struct address_space *buffer_mapping = bh->b_page->mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -3037,7 +3052,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 	do {
 		struct buffer_head *next = bh->b_this_page;
 
-		if (!list_empty(&bh->b_assoc_buffers))
+		if (bh->b_assoc_map)
 			__remove_assoc_queue(bh);
 		bh = next;
 	} while (bh != head);


Thread overview: 8+ messages
2008-01-22 17:10 [PATCH RESEND] Minimal fix for private_list handling races Jan Kara
2008-01-23  1:00 ` Nick Piggin
2008-01-23 13:30   ` Jan Kara
2008-01-23 15:05     ` Nick Piggin
2008-01-23 15:48       ` Jan Kara
2008-01-25  8:34         ` Nick Piggin
2008-01-25 18:16           ` Jan Kara
2008-01-28 22:16           ` Jan Kara
