All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 1/4] fs: mark_inode_dirty barrier fix
@ 2010-11-22 13:05 Nick Piggin
  2010-11-22 13:06 ` [patch 2/4] fs: fsync inode dirty race fix Nick Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Nick Piggin @ 2010-11-22 13:05 UTC (permalink / raw)
  To: linux-fsdevel


fs: mark_inode_dirty barrier fix

Filesystems appear to be using ->dirty_inode, expecting that the dirtying
operating is done and visible to all CPUs (eg. setting private inode dirty
bits, without any barriers themselves). So release the dirty "critical
section" with a barrier before calling ->dirty_inode.

Cost is not significantly changed, because we're just moving the barrier.
Those filesystems that do use ->dirty_inode should have to care slightly
less about barriers, which is a good thing.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---

Here's some data integrity fixes for inode writeback cache. I'd like to
hear comments.

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-19 16:47:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-19 16:49:39.000000000 +1100
@@ -934,6 +934,15 @@ void __mark_inode_dirty(struct inode *in
 	bool wakeup_bdi = false;
 
 	/*
+	 * Make sure that changes are seen by all cpus before we test i_state
+	 * or mark anything as being dirty. Ie. all dirty state should be
+	 * written to the inode and visible. Like an "unlock" operation, the
+	 * mark_inode_dirty call must "release" our ordering window that is
+	 * opened when we started modifying the inode.
+	 */
+	smp_mb();
+
+	/*
 	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
 	 * dirty the inode itself
 	 */
@@ -942,12 +951,6 @@ void __mark_inode_dirty(struct inode *in
 			sb->s_op->dirty_inode(inode);
 	}
 
-	/*
-	 * make sure that changes are seen by all cpus before we test i_state
-	 * -- mikulas
-	 */
-	smp_mb();
-
 	/* avoid the locking if we can */
 	if ((inode->i_state & flags) == flags)
 		return;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch 2/4] fs: fsync inode dirty race fix
  2010-11-22 13:05 [patch 1/4] fs: mark_inode_dirty barrier fix Nick Piggin
@ 2010-11-22 13:06 ` Nick Piggin
  2010-11-22 13:23   ` Christoph Hellwig
  2010-11-22 13:07 ` [patch 3/4] fs: introduce inode dirty state helpers Nick Piggin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2010-11-22 13:06 UTC (permalink / raw)
  To: linux-fsdevel

It is incorrect to test inode dirty bits without participating in the inode
writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
writes out the particular bits, then clears I_SYNC when it is done. BTW. it
may not completely write all pages out, so I_DIRTY_PAGES would get set
again.

This is a standard pattern used throughout the kernel's writeback caches
(I_SYNC ~= I_WRITEBACK, if that makes it clearer).

And so it is not possible to determine an inode's dirty status just by
checking I_DIRTY bits. Especially not for the purpose of data integrity
syncs.

Missing the check for these bits means that fsync can complete while
writeback to the inode is underway. Inode writeback functions get this
right, so call into them rather than try to shortcut things by testing
dirty state improperly.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-11-19 16:44:39.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-11-19 16:49:42.000000000 +1100
@@ -895,11 +895,6 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	if (!(inode->i_state & I_DIRTY))
-		return ret;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
-
 	err = sync_inode_metadata(inode, 1);
 	if (ret == 0)
 		ret = err;
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-11-19 16:50:00.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-11-19 16:50:07.000000000 +1100
@@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	if (!(inode->i_state & I_DIRTY))
-		return 0;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return 0;
-
 	ret = sync_inode_metadata(inode, 1);
 
 	/* This is a good place to write the sb */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch 3/4] fs: introduce inode dirty state helpers
  2010-11-22 13:05 [patch 1/4] fs: mark_inode_dirty barrier fix Nick Piggin
  2010-11-22 13:06 ` [patch 2/4] fs: fsync inode dirty race fix Nick Piggin
@ 2010-11-22 13:07 ` Nick Piggin
  2010-11-22 13:25   ` Christoph Hellwig
  2010-11-22 13:09 ` [patch 4/4] ext2: inode sync fixes Nick Piggin
  2010-11-22 13:16 ` [patch 1/4] fs: mark_inode_dirty barrier fix Christoph Hellwig
  3 siblings, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2010-11-22 13:07 UTC (permalink / raw)
  To: linux-fsdevel

Inode dirty state cannot be securely tested without participating properly
in the inode writeback protocol. Some filesystems need to check this state,
so break out the code into helpers and make them available.

This could also be used to reduce strange interactions between background
writeback and fsync. Currently if we fsync a single page in a file, the
entire file gets requeued to the back of the background IO list, even if
it is due for writeout and has a large number of pages. That's left for
a later time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-22 23:19:38.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-22 23:19:40.000000000 +1100
@@ -299,6 +299,74 @@ static void inode_wait_for_writeback(str
 	}
 }
 
+int inode_writeback_begin(struct inode *inode, int wait)
+{
+	assert_spin_locked(&inode_lock);
+	if (!atomic_read(&inode->i_count))
+		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
+	else
+		WARN_ON(inode->i_state & I_WILL_FREE);
+
+	if (inode->i_state & I_SYNC) {
+		/*
+		 * If this inode is locked for writeback and we are not doing
+		 * writeback-for-data-integrity, skip it.
+		 */
+		if (!wait)
+			return 0;
+
+		/*
+		 * It's a data-integrity sync.  We must wait.
+		 */
+		inode_wait_for_writeback(inode);
+	}
+
+	BUG_ON(inode->i_state & I_SYNC);
+
+	inode->i_state |= I_SYNC;
+	inode->i_state &= ~I_DIRTY;
+
+	return 1;
+}
+
+int inode_writeback_end(struct inode *inode)
+{
+	int ret = 1;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON(!(inode->i_state & I_SYNC));
+
+	if (!(inode->i_state & I_FREEING)) {
+		if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+			/*
+			 * We didn't write back all the pages.  nfs_writepages()
+			 * sometimes bales out without doing anything.
+			 */
+			inode->i_state |= I_DIRTY_PAGES;
+			ret = 0;
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * Filesystems can dirty the inode during writeback
+			 * operations, such as delayed allocation during
+			 * submission or metadata updates after data IO
+			 * completion.
+			 */
+			redirty_tail(inode);
+		} else {
+			/*
+			 * The inode is clean.  At this point we either have
+			 * a reference to the inode or it's on it's way out.
+			 * No need to add it back to the LRU.
+			 */
+			list_del_init(&inode->i_wb_list);
+		}
+	}
+	inode->i_state &= ~I_SYNC;
+	inode_sync_complete(inode);
+
+	return ret;
+}
+
 /*
  * Write out an inode's dirty pages.  Called under inode_lock.  Either the
  * caller has ref on the inode (either via __iget or via syscall against an fd)
@@ -319,36 +387,15 @@ writeback_single_inode(struct inode *ino
 	unsigned dirty;
 	int ret;
 
-	if (!atomic_read(&inode->i_count))
-		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
-	else
-		WARN_ON(inode->i_state & I_WILL_FREE);
-
-	if (inode->i_state & I_SYNC) {
+	if (!inode_writeback_begin(inode, wbc->sync_mode == WB_SYNC_ALL)) {
 		/*
-		 * If this inode is locked for writeback and we are not doing
-		 * writeback-for-data-integrity, move it to b_more_io so that
-		 * writeback can proceed with the other inodes on s_io.
-		 *
 		 * We'll have another go at writing back this inode when we
 		 * completed a full scan of b_io.
 		 */
-		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
-			return 0;
-		}
-
-		/*
-		 * It's a data-integrity sync.  We must wait.
-		 */
-		inode_wait_for_writeback(inode);
+		requeue_io(inode);
+		return 0;
 	}
 
-	BUG_ON(inode->i_state & I_SYNC);
-
-	/* Set I_SYNC, reset I_DIRTY_PAGES */
-	inode->i_state |= I_SYNC;
-	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -381,47 +428,24 @@ writeback_single_inode(struct inode *ino
 	}
 
 	spin_lock(&inode_lock);
-	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (!inode_writeback_end(inode)) {
+		if (wbc->nr_to_write <= 0) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * slice used up: queue for next turn
 			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode);
-			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Filesystems can dirty the inode during writeback
-			 * operations, such as delayed allocation during
-			 * submission or metadata updates after data IO
-			 * completion.
-			 */
-			redirty_tail(inode);
+			requeue_io(inode);
 		} else {
 			/*
-			 * The inode is clean.  At this point we either have
-			 * a reference to the inode or it's on it's way out.
-			 * No need to add it back to the LRU.
+			 * Writeback blocked by something other than
+			 * congestion. Delay the inode for some time to
+			 * avoid spinning on the CPU (100% iowait)
+			 * retrying writeback of the dirty page/inode
+			 * that cannot be performed immediately.
 			 */
-			list_del_init(&inode->i_wb_list);
+			redirty_tail(inode);
 		}
 	}
-	inode_sync_complete(inode);
+
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch 4/4] ext2: inode sync fixes
  2010-11-22 13:05 [patch 1/4] fs: mark_inode_dirty barrier fix Nick Piggin
  2010-11-22 13:06 ` [patch 2/4] fs: fsync inode dirty race fix Nick Piggin
  2010-11-22 13:07 ` [patch 3/4] fs: introduce inode dirty state helpers Nick Piggin
@ 2010-11-22 13:09 ` Nick Piggin
  2010-11-22 13:16 ` [patch 1/4] fs: mark_inode_dirty barrier fix Christoph Hellwig
  3 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2010-11-22 13:09 UTC (permalink / raw)
  To: linux-fsdevel

ext2 relies on ->write_inode being called from sync_inode_metadata in
fsync in order to sync the inode. However I_DIRTY_SYNC gets cleared
after a call to this guy, and it doesn't actually write back and wait
on the inode block unless it is called for sync. This means that write_inode
from background writeback can kill the inode dirty bits without the data
getting to disk. Fsync will subsequently miss it.

The fix is for ->write_inode to dirty the buffers/cache, and then ->fsync to
write back the dirty data. In the full filesystem sync case, buffercache
writeback in the generic code will write back the dirty data. Other
filesystems could use ->sync_fs for this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---

This is the other side of the inode-metadata-writeback fuckup coin (I
suspect many other filesystems also have problems here, but I've only
dared to look at a couple as yet).


Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-11-22 23:20:44.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-11-22 23:38:26.000000000 +1100
@@ -1505,16 +1505,8 @@ static int __ext2_write_inode(struct ino
 	} else for (n = 0; n < EXT2_N_BLOCKS; n++)
 		raw_inode->i_block[n] = ei->i_data[n];
 	mark_buffer_dirty(bh);
-	if (do_sync) {
-		sync_dirty_buffer(bh);
-		if (buffer_req(bh) && !buffer_uptodate(bh)) {
-			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
-				sb->s_id, (unsigned long) ino);
-			err = -EIO;
-		}
-	}
-	ei->i_state &= ~EXT2_STATE_NEW;
 	brelse (bh);
+	ei->i_state &= ~EXT2_STATE_NEW;
 	return err;
 }
 
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c	2010-11-22 23:35:24.000000000 +1100
+++ linux-2.6/fs/ext2/file.c	2010-11-22 23:38:34.000000000 +1100
@@ -45,14 +45,30 @@ int ext2_fsync(struct file *file, int da
 	int ret;
 	struct super_block *sb = file->f_mapping->host->i_sb;
 	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+	ino_t ino = inode->i_ino;
+	struct buffer_head *bh;
+	struct ext2_inode *raw_inode;
 
 	ret = generic_file_fsync(file, datasync);
 	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
 		/* We don't really know where the IO error happened... */
 		ext2_error(sb, __func__,
 			   "detected IO error when writing metadata buffers");
+		return -EIO;
+	}
+
+	raw_inode = ext2_get_inode(sb, ino, &bh);
+	if (IS_ERR(raw_inode))
+ 		return -EIO;
+
+	sync_dirty_buffer(bh);
+	if (buffer_req(bh) && !buffer_uptodate(bh)) {
+		printk ("IO error syncing ext2 inode [%s:%08lx]\n",
+			sb->s_id, (unsigned long) ino);
 		ret = -EIO;
 	}
+	brelse (bh);
+
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 1/4] fs: mark_inode_dirty barrier fix
  2010-11-22 13:05 [patch 1/4] fs: mark_inode_dirty barrier fix Nick Piggin
                   ` (2 preceding siblings ...)
  2010-11-22 13:09 ` [patch 4/4] ext2: inode sync fixes Nick Piggin
@ 2010-11-22 13:16 ` Christoph Hellwig
  3 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2010-11-22 13:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel

On Tue, Nov 23, 2010 at 12:05:07AM +1100, Nick Piggin wrote:
> 
> fs: mark_inode_dirty barrier fix
> 
> Filesystems appear to be using ->dirty_inode, expecting that the dirtying
> operating is done and visible to all CPUs (eg. setting private inode dirty
> bits, without any barriers themselves). So release the dirty "critical
> section" with a barrier before calling ->dirty_inode.
> 
> Cost is not significantly changed, because we're just moving the barrier.
> Those filesystems that do use ->dirty_inode should have to care slightly
> less about barriers, which is a good thing.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Looks good to me.  I can't see any reason to have a barrier after
->dirty_inode.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/4] fs: fsync inode dirty race fix
  2010-11-22 13:06 ` [patch 2/4] fs: fsync inode dirty race fix Nick Piggin
@ 2010-11-22 13:23   ` Christoph Hellwig
  2010-11-22 13:43     ` Nick Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2010-11-22 13:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel

>  	ret = sync_mapping_buffers(inode->i_mapping);
> -	if (!(inode->i_state & I_DIRTY))
> -		return ret;
> -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> -		return ret;
> -
>
>  	err = sync_inode_metadata(inode, 1);

This makes fdatasync equivalent to fsync, which means a huge drop
in performance for database an virtualization performance.

I think the right aproach is to extend the sync_inode_metadata to
writeback_single_inode chain with a datasync parameter so that we
can do the correct decision there.

Note that there are a lot more fsync implementations than just
generic_file_fsync and exofs_fsync that have the same issue.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 3/4] fs: introduce inode dirty state helpers
  2010-11-22 13:07 ` [patch 3/4] fs: introduce inode dirty state helpers Nick Piggin
@ 2010-11-22 13:25   ` Christoph Hellwig
  2010-11-22 13:44     ` Nick Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2010-11-22 13:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel

On Tue, Nov 23, 2010 at 12:07:33AM +1100, Nick Piggin wrote:
> Inode dirty state cannot be securely tested without participating properly
> in the inode writeback protocol. Some filesystems need to check this state,
> so break out the code into helpers and make them available.
> 
> This could also be used to reduce strange interactions between background
> writeback and fsync. Currently if we fsync a single page in a file, the
> entire file gets requeued to the back of the background IO list, even if
> it is due for writeout and has a large number of pages. That's left for
> a later time.

If you want filesystems to use the helpers they need to be exported,
and good kerneldoc comments so that filesystem writers know what they
need to do.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 2/4] fs: fsync inode dirty race fix
  2010-11-22 13:23   ` Christoph Hellwig
@ 2010-11-22 13:43     ` Nick Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2010-11-22 13:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel

On Tue, Nov 23, 2010 at 12:23 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>       ret = sync_mapping_buffers(inode->i_mapping);
>> -     if (!(inode->i_state & I_DIRTY))
>> -             return ret;
>> -     if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
>> -             return ret;
>> -
>>
>>       err = sync_inode_metadata(inode, 1);
>
> This makes fdatasync equivalent to fsync, which means a huge drop
> in performance for database an virtualization performance.
>
> I think the right aproach is to extend the sync_inode_metadata to
> writeback_single_inode chain with a datasync parameter so that we
> can do the correct decision there.

I was leaning towards using the new inode writeback helpers, then
filesystems should use them to check inode dirty state without races.

You could still move datasync into the writeback function, but it
wouldn't be a requirement. More complex filesystem schemes seem
to need to actually test these flags.

> Note that there are a lot more fsync implementations than just
> generic_file_fsync and exofs_fsync that have the same issue.

Yes, they require patch 3/4, and a lot more testing than I had time
to do tonight. I'll send out some patches tomorrow.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch 3/4] fs: introduce inode dirty state helpers
  2010-11-22 13:25   ` Christoph Hellwig
@ 2010-11-22 13:44     ` Nick Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2010-11-22 13:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel

On Tue, Nov 23, 2010 at 12:25 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Nov 23, 2010 at 12:07:33AM +1100, Nick Piggin wrote:
>> Inode dirty state cannot be securely tested without participating properly
>> in the inode writeback protocol. Some filesystems need to check this state,
>> so break out the code into helpers and make them available.
>>
>> This could also be used to reduce strange interactions between background
>> writeback and fsync. Currently if we fsync a single page in a file, the
>> entire file gets requeued to the back of the background IO list, even if
>> it is due for writeout and has a large number of pages. That's left for
>> a later time.
>
> If you want filesystems to use the helpers they need to be exported,
> and good kerneldoc comments so that filesystem writers know what they
> need to do.

You're right, of course. I sent a half-baked patch there sorry. (Well, it should
be working quite well, but it needs exports and decent kerneldoc as you say).

Will send an update tomorrow.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2010-11-22 13:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-22 13:05 [patch 1/4] fs: mark_inode_dirty barrier fix Nick Piggin
2010-11-22 13:06 ` [patch 2/4] fs: fsync inode dirty race fix Nick Piggin
2010-11-22 13:23   ` Christoph Hellwig
2010-11-22 13:43     ` Nick Piggin
2010-11-22 13:07 ` [patch 3/4] fs: introduce inode dirty state helpers Nick Piggin
2010-11-22 13:25   ` Christoph Hellwig
2010-11-22 13:44     ` Nick Piggin
2010-11-22 13:09 ` [patch 4/4] ext2: inode sync fixes Nick Piggin
2010-11-22 13:16 ` [patch 1/4] fs: mark_inode_dirty barrier fix Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.