linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/7] icache dirty / sync fixes
@ 2010-11-23 14:06 npiggin
  2010-11-23 14:06 ` [patch 1/7] fs: mark_inode_dirty barrier fix npiggin
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

inode dirty / sync handling seems totally broken. I'd like some comments
on this series.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 1/7] fs: mark_inode_dirty barrier fix
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-23 14:06 ` [patch 2/7] fs: simple fsync race fix npiggin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: fs-mark_inode_dirty-barrier-fix.patch --]
[-- Type: text/plain, Size: 1610 bytes --]

Filesystems appear to be using ->dirty_inode, expecting that the dirtying
operating is done and visible to all CPUs (eg. setting private inode dirty
bits, without any barriers themselves). So release the dirty "critical
section" with a barrier before calling ->dirty_inode.

Cost is not significantly changed, because we're just moving the barrier.
Those filesystems that do use ->dirty_inode should have to care slightly
less about barriers, which is a good thing.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-19 16:47:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-19 16:49:39.000000000 +1100
@@ -934,6 +934,15 @@ void __mark_inode_dirty(struct inode *in
 	bool wakeup_bdi = false;
 
 	/*
+	 * Make sure that changes are seen by all cpus before we test i_state
+	 * or mark anything as being dirty. Ie. all dirty state should be
+	 * written to the inode and visible. Like an "unlock" operation, the
+	 * mark_inode_dirty call must "release" our ordering window that is
+	 * opened when we started modifying the inode.
+	 */
+	smp_mb();
+
+	/*
 	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
 	 * dirty the inode itself
 	 */
@@ -942,12 +951,6 @@ void __mark_inode_dirty(struct inode *in
 			sb->s_op->dirty_inode(inode);
 	}
 
-	/*
-	 * make sure that changes are seen by all cpus before we test i_state
-	 * -- mikulas
-	 */
-	smp_mb();
-
 	/* avoid the locking if we can */
 	if ((inode->i_state & flags) == flags)
 		return;



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 2/7] fs: simple fsync race fix
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
  2010-11-23 14:06 ` [patch 1/7] fs: mark_inode_dirty barrier fix npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-29 14:58   ` Christoph Hellwig
  2010-11-23 14:06 ` [patch 3/7] fs: introduce inode writeback helpers npiggin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: fs-sync-fixup.patch --]
[-- Type: text/plain, Size: 1984 bytes --]

It is incorrect to test inode dirty bits without participating in the inode
writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
writes out the particular bits, then clears I_SYNC when it is done. BTW. it
may not completely write all pages out, so I_DIRTY_PAGES would get set
again.

This is a standard pattern used throughout the kernel's writeback caches
(I_SYNC ~= I_WRITEBACK, if that makes it clearer).

And so it is not possible to determine an inode's dirty status just by
checking I_DIRTY bits. Especially not for the purpose of data integrity
syncs.

Missing the check for these bits means that fsync can complete while
writeback to the inode is underway. Inode writeback functions get this
right, so call into them rather than try to shortcut things by testing
dirty state improperly.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-11-19 16:44:39.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-11-19 16:49:42.000000000 +1100
@@ -895,11 +895,6 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	if (!(inode->i_state & I_DIRTY))
-		return ret;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
-
 	err = sync_inode_metadata(inode, 1);
 	if (ret == 0)
 		ret = err;
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-11-19 16:50:00.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-11-19 16:50:07.000000000 +1100
@@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	if (!(inode->i_state & I_DIRTY))
-		return 0;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return 0;
-
 	ret = sync_inode_metadata(inode, 1);
 
 	/* This is a good place to write the sb */



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 3/7] fs: introduce inode writeback helpers
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
  2010-11-23 14:06 ` [patch 1/7] fs: mark_inode_dirty barrier fix npiggin
  2010-11-23 14:06 ` [patch 2/7] fs: simple fsync race fix npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-29 15:13   ` Christoph Hellwig
  2010-11-23 14:06 ` [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback npiggin
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: fs-inode-clear-dirty-for-io.patch --]
[-- Type: text/plain, Size: 7238 bytes --]

Inode dirty state cannot be securely tested without participating properly
in the inode writeback protocol. Some filesystems need to check this state,
so break out the code into helpers and make them available.

This could also be used to reduce strange interactions between background
writeback and fsync. Currently if we fsync a single page in a file, the
entire file gets requeued to the back of the background IO list, even if
it is due for writeout and has a large number of pages. That's left for
a later time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-23 21:51:20.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-23 22:05:00.000000000 +1100
@@ -299,6 +299,102 @@ static void inode_wait_for_writeback(str
 	}
 }
 
+/**
+ * inode_writeback_begin -- prepare to writeback an inode
+ * @indoe: inode to write back
+ * @wait: synch writeout or not
+ * @Returns: 0 if wait == 0 and this call would block (due to other writeback).
+ *           otherwise returns 1.
+ *
+ * Context: inode_lock must be held, may be dropped. Returns with it held.
+ *
+ * inode_writeback_begin sets up an inode to be written back (data and/or
+ * metadata). This must be called before examining I_DIRTY state of the
+ * inode, and should be called at least before any data integrity writeout.
+ *
+ * If inode_writeback_begin returns 1, it must be followed by a call to
+ * inode_writeback_end.
+ */
+int inode_writeback_begin(struct inode *inode, int wait)
+{
+	assert_spin_locked(&inode_lock);
+	if (!atomic_read(&inode->i_count))
+		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
+	else
+		WARN_ON(inode->i_state & I_WILL_FREE);
+
+	if (inode->i_state & I_SYNC) {
+		/*
+		 * If this inode is locked for writeback and we are not doing
+		 * writeback-for-data-integrity, skip it.
+		 */
+		if (!wait)
+			return 0;
+
+		/*
+		 * It's a data-integrity sync.  We must wait.
+		 */
+		inode_wait_for_writeback(inode);
+	}
+
+	BUG_ON(inode->i_state & I_SYNC);
+
+	inode->i_state |= I_SYNC;
+	inode->i_state &= ~I_DIRTY;
+
+	return 1;
+}
+EXPORT_SYMBOL(inode_writeback_begin);
+
+/**
+ * inode_writeback_end - end a writeback section opened by inode_writeback_begin
+ * @inode: inode in question
+ * @Returns: 0 if the inode still has dirty pagecache, otherwise 1.
+ *
+ * Context: inode_lock must be held, not dropped.
+ *
+ * inode_writeback_end must follow a successful call to inode_writeback_begin
+ * after we have finished submitting writeback to the inode.
+ */
+int inode_writeback_end(struct inode *inode)
+{
+	int ret = 1;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON(!(inode->i_state & I_SYNC));
+
+	if (!(inode->i_state & I_FREEING)) {
+		if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+			/*
+			 * We didn't write back all the pages.  nfs_writepages()
+			 * sometimes bales out without doing anything.
+			 */
+			inode->i_state |= I_DIRTY_PAGES;
+			ret = 0;
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * Filesystems can dirty the inode during writeback
+			 * operations, such as delayed allocation during
+			 * submission or metadata updates after data IO
+			 * completion.
+			 */
+			redirty_tail(inode);
+		} else {
+			/*
+			 * The inode is clean.  At this point we either have
+			 * a reference to the inode or it's on it's way out.
+			 * No need to add it back to the LRU.
+			 */
+			list_del_init(&inode->i_wb_list);
+		}
+	}
+	inode->i_state &= ~I_SYNC;
+	inode_sync_complete(inode);
+
+	return ret;
+}
+EXPORT_SYMBOL(inode_writeback_end);
+
 /*
  * Write out an inode's dirty pages.  Called under inode_lock.  Either the
  * caller has ref on the inode (either via __iget or via syscall against an fd)
@@ -319,36 +415,15 @@ writeback_single_inode(struct inode *ino
 	unsigned dirty;
 	int ret;
 
-	if (!atomic_read(&inode->i_count))
-		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
-	else
-		WARN_ON(inode->i_state & I_WILL_FREE);
-
-	if (inode->i_state & I_SYNC) {
+	if (!inode_writeback_begin(inode, wbc->sync_mode == WB_SYNC_ALL)) {
 		/*
-		 * If this inode is locked for writeback and we are not doing
-		 * writeback-for-data-integrity, move it to b_more_io so that
-		 * writeback can proceed with the other inodes on s_io.
-		 *
 		 * We'll have another go at writing back this inode when we
 		 * completed a full scan of b_io.
 		 */
-		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
-			return 0;
-		}
-
-		/*
-		 * It's a data-integrity sync.  We must wait.
-		 */
-		inode_wait_for_writeback(inode);
+		requeue_io(inode);
+		return 0;
 	}
 
-	BUG_ON(inode->i_state & I_SYNC);
-
-	/* Set I_SYNC, reset I_DIRTY_PAGES */
-	inode->i_state |= I_SYNC;
-	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -381,47 +456,24 @@ writeback_single_inode(struct inode *ino
 	}
 
 	spin_lock(&inode_lock);
-	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (!inode_writeback_end(inode)) {
+		if (wbc->nr_to_write <= 0) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * slice used up: queue for next turn
 			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode);
-			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Filesystems can dirty the inode during writeback
-			 * operations, such as delayed allocation during
-			 * submission or metadata updates after data IO
-			 * completion.
-			 */
-			redirty_tail(inode);
+			requeue_io(inode);
 		} else {
 			/*
-			 * The inode is clean.  At this point we either have
-			 * a reference to the inode or it's on it's way out.
-			 * No need to add it back to the LRU.
+			 * Writeback blocked by something other than
+			 * congestion. Delay the inode for some time to
+			 * avoid spinning on the CPU (100% iowait)
+			 * retrying writeback of the dirty page/inode
+			 * that cannot be performed immediately.
 			 */
-			list_del_init(&inode->i_wb_list);
+			redirty_tail(inode);
 		}
 	}
-	inode_sync_complete(inode);
+
 	return ret;
 }
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-11-23 22:04:18.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-11-23 22:04:44.000000000 +1100
@@ -1765,6 +1765,8 @@ static inline void file_accessed(struct
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
 int sync_inode_metadata(struct inode *inode, int wait);
+int inode_writeback_begin(struct inode *inode, int wait);
+int inode_writeback_end(struct inode *inode);
 
 struct file_system_type {
 	const char *name;



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
                   ` (2 preceding siblings ...)
  2010-11-23 14:06 ` [patch 3/7] fs: introduce inode writeback helpers npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-29 14:59   ` Christoph Hellwig
  2010-11-23 14:06 ` [patch 5/7] fs: ext2 inode sync fix npiggin
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: fs-inode-dirty-error-fix.patch --]
[-- Type: text/plain, Size: 1136 bytes --]

Otherwise we think the inode is clean even if syncing failed.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-23 22:28:22.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-23 22:57:40.000000000 +1100
@@ -447,15 +447,25 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
-		int err = write_inode(inode, wbc);
+		int err;
+
+		spin_unlock(&inode_lock);
+		err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		spin_lock(&inode_lock);
+		if (err) {
+			/*
+			 * Inode writeout failed, restore inode metadata
+			 * dirty bits.
+			 */
+			inode->i_state |= dirty &
+					(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+		}
 	}
 
-	spin_lock(&inode_lock);
 	if (!inode_writeback_end(inode)) {
 		if (wbc->nr_to_write <= 0) {
 			/*



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 5/7] fs: ext2 inode sync fix
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
                   ` (3 preceding siblings ...)
  2010-11-23 14:06 ` [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-30 11:26   ` Boaz Harrosh
  2010-11-23 14:06 ` [patch 6/7] fs: fsync optimisations npiggin
  2010-11-23 14:06 ` [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems npiggin
  6 siblings, 1 reply; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: ext2-sync-fixes.patch --]
[-- Type: text/plain, Size: 4142 bytes --]

There is a big fuckup with inode metadata writeback (I suspect in a lot
of filesystems, but I've only dared to look at a couple as yet).

ext2 relies on ->write_inode being called from sync_inode_metadata in
fsync in order to sync the inode. However I_DIRTY_SYNC gets cleared
after a call to this guy, and it doesn't actually write back and wait
on the inode block unless it is called for sync. This means that write_inode
from background writeback can kill the inode dirty bits without the data
getting to disk. Fsync will subsequently miss it.

The fix is for ->write_inode to dirty the buffers/cache, and then ->fsync to
write back the dirty data. In the full filesystem sync case, buffercache
writeback in the generic code will write back the dirty data. Other
filesystems could use ->sync_fs for this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-11-23 21:56:05.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-11-23 22:05:02.000000000 +1100
@@ -1211,7 +1211,7 @@ static int ext2_setsize(struct inode *in
 	return 0;
 }
 
-static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
 					struct buffer_head **p)
 {
 	struct buffer_head * bh;
@@ -1505,16 +1505,8 @@ static int __ext2_write_inode(struct ino
 	} else for (n = 0; n < EXT2_N_BLOCKS; n++)
 		raw_inode->i_block[n] = ei->i_data[n];
 	mark_buffer_dirty(bh);
-	if (do_sync) {
-		sync_dirty_buffer(bh);
-		if (buffer_req(bh) && !buffer_uptodate(bh)) {
-			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
-				sb->s_id, (unsigned long) ino);
-			err = -EIO;
-		}
-	}
-	ei->i_state &= ~EXT2_STATE_NEW;
 	brelse (bh);
+	ei->i_state &= ~EXT2_STATE_NEW;
 	return err;
 }
 
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c	2010-11-23 21:56:05.000000000 +1100
+++ linux-2.6/fs/ext2/file.c	2010-11-23 22:05:02.000000000 +1100
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>
+#include <linux/buffer_head.h>
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
@@ -43,16 +44,33 @@ static int ext2_release_file (struct ino
 int ext2_fsync(struct file *file, int datasync)
 {
 	int ret;
-	struct super_block *sb = file->f_mapping->host->i_sb;
-	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+	struct inode *inode = file->f_mapping->host;
+	ino_t ino = inode->i_ino;
+	struct super_block *sb = inode->i_sb;
+	struct address_space *sb_mapping = sb->s_bdev->bd_inode->i_mapping;
+	struct buffer_head *bh;
+	struct ext2_inode *raw_inode;
 
 	ret = generic_file_fsync(file, datasync);
-	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
+	if (ret == -EIO || test_and_clear_bit(AS_EIO, &sb_mapping->flags)) {
 		/* We don't really know where the IO error happened... */
 		ext2_error(sb, __func__,
 			   "detected IO error when writing metadata buffers");
+		return -EIO;
+	}
+
+	raw_inode = ext2_get_inode(sb, ino, &bh);
+	if (IS_ERR(raw_inode))
+ 		return -EIO;
+
+	sync_dirty_buffer(bh);
+	if (buffer_req(bh) && !buffer_uptodate(bh)) {
+		printk ("IO error syncing ext2 inode [%s:%08lx]\n",
+			sb->s_id, (unsigned long) ino);
 		ret = -EIO;
 	}
+	brelse (bh);
+
 	return ret;
 }
 
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h	2010-11-23 21:56:05.000000000 +1100
+++ linux-2.6/fs/ext2/ext2.h	2010-11-23 22:05:02.000000000 +1100
@@ -124,6 +124,8 @@ extern int ext2_get_block(struct inode *
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
 extern void ext2_get_inode_flags(struct ext2_inode_info *);
+extern struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+					struct buffer_head **p);
 extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 6/7] fs: fsync optimisations
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
                   ` (4 preceding siblings ...)
  2010-11-23 14:06 ` [patch 5/7] fs: ext2 inode sync fix npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-29 15:03   ` Christoph Hellwig
  2010-11-23 14:06 ` [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems npiggin
  6 siblings, 1 reply; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: fs-datasync-opt.patch --]
[-- Type: text/plain, Size: 6864 bytes --]

Optimise fsync by adding a datasync parameter to sync_inode_metadata to
DTRT with writing back the inode (->write_inode in theory should have a
datasync parameter too perhaps, but that's for another time).

Also, implement the metadata sync optimally rather than reusing the
normal data writeback path. This means less useless moving the inode around the
writeback lists, and less dropping and retaking of inode_lock, and avoiding
the data writeback call with nr_pages == 0.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-11-23 22:57:45.000000000 +1100
@@ -883,7 +883,7 @@ static int pohmelfs_fsync(struct file *f
 {
 	struct inode *inode = file->f_mapping->host;
 
-	return sync_inode_metadata(inode, 1);
+	return sync_inode_metadata(inode, datasync, 1);
 }
 
 ssize_t pohmelfs_write(struct file *file, const char __user *buf,
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-11-23 22:57:45.000000000 +1100
@@ -48,7 +48,7 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	ret = sync_inode_metadata(inode, 1);
+	ret = sync_inode_metadata(inode, datasync, 1);
 
 	/* This is a good place to write the sb */
 	/* TODO: Sechedule an sb-sync on create */
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/fs/ext2/dir.c	2010-11-23 22:57:45.000000000 +1100
@@ -98,7 +98,7 @@ static int ext2_commit_chunk(struct page
 	if (IS_DIRSYNC(dir)) {
 		err = write_one_page(page, 1);
 		if (!err)
-			err = sync_inode_metadata(dir, 1);
+			err = sync_inode_metadata(dir, 0, 1);
 	} else {
 		unlock_page(page);
 	}
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-11-23 22:57:43.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-11-23 22:57:45.000000000 +1100
@@ -1203,7 +1203,7 @@ static int ext2_setsize(struct inode *in
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
 		sync_mapping_buffers(inode->i_mapping);
-		sync_inode_metadata(inode, 1);
+		sync_inode_metadata(inode, 0, 1);
 	} else {
 		mark_inode_dirty(inode);
 	}
Index: linux-2.6/fs/ext2/xattr.c
===================================================================
--- linux-2.6.orig/fs/ext2/xattr.c	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/fs/ext2/xattr.c	2010-11-23 22:57:45.000000000 +1100
@@ -699,7 +699,7 @@ ext2_xattr_set2(struct inode *inode, str
 	EXT2_I(inode)->i_file_acl = new_bh ? new_bh->b_blocknr : 0;
 	inode->i_ctime = CURRENT_TIME_SEC;
 	if (IS_SYNC(inode)) {
-		error = sync_inode_metadata(inode, 1);
+		error = sync_inode_metadata(inode, 0, 1);
 		/* In case sync failed due to ENOSPC the inode was actually
 		 * written (only some dirty data were not) so we just proceed
 		 * as if nothing happened and cleanup the unused block */
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-23 22:57:40.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-23 22:59:41.000000000 +1100
@@ -1307,7 +1307,7 @@ int sync_inode(struct inode *inode, stru
 EXPORT_SYMBOL(sync_inode);
 
 /**
- * sync_inode - write an inode to disk
+ * sync_inode_metadata - write an inode to disk
  * @inode: the inode to sync
  * @wait: wait for I/O to complete.
  *
@@ -1315,13 +1315,49 @@ EXPORT_SYMBOL(sync_inode);
  *
  * Note: only writes the actual inode, no associated data or other metadata.
  */
-int sync_inode_metadata(struct inode *inode, int wait)
+int sync_inode_metadata(struct inode *inode, int datasync, int wait)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct writeback_control wbc = {
 		.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
 		.nr_to_write = 0, /* metadata-only */
 	};
+	unsigned dirty, mask;
+	int ret = 0;
 
-	return sync_inode(inode, &wbc);
+	/*
+	 * This is a similar implementation to writeback_single_inode.
+	 * Keep them in sync.
+	 */
+	spin_lock(&inode_lock);
+	if (!inode_writeback_begin(inode, wait))
+		goto out;
+
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	if (!dirty)
+		goto out;
+	/*
+	 * Generic write_inode doesn't distinguish between sync and datasync,
+	 * so even a datasync can clear the sync state. Filesystems which
+	 * distiguish these cases must only clear 'mask' in their metadata
+	 * sync code.
+	 */
+	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+
+	spin_unlock(&inode_lock);
+	ret = write_inode(inode, &wbc);
+	spin_lock(&inode_lock);
+	if (ret)
+		inode->i_state |= dirty; /* couldn't write out inode */
+
+	inode_writeback_end(inode);
+
+out:
+	spin_unlock(&inode_lock);
+	return ret;
 }
 EXPORT_SYMBOL(sync_inode_metadata);
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-11-23 22:57:45.000000000 +1100
@@ -895,7 +895,7 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	err = sync_inode_metadata(inode, 1);
+	err = sync_inode_metadata(inode, datasync, 1);
 	if (ret == 0)
 		ret = err;
 	return ret;
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/fs/nfsd/vfs.c	2010-11-23 22:57:45.000000000 +1100
@@ -287,7 +287,7 @@ commit_metadata(struct svc_fh *fhp)
 
 	if (export_ops->commit_metadata)
 		return export_ops->commit_metadata(inode);
-	return sync_inode_metadata(inode, 1);
+	return sync_inode_metadata(inode, 0, 1);
 }
 
 /*
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-11-23 22:56:51.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-11-23 22:57:45.000000000 +1100
@@ -1764,7 +1764,7 @@ static inline void file_accessed(struct
 }
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
-int sync_inode_metadata(struct inode *inode, int wait);
+int sync_inode_metadata(struct inode *inode, int datasync, int wait);
 int inode_writeback_begin(struct inode *inode, int wait);
 int inode_writeback_end(struct inode *inode);
 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
                   ` (5 preceding siblings ...)
  2010-11-23 14:06 ` [patch 6/7] fs: fsync optimisations npiggin
@ 2010-11-23 14:06 ` npiggin
  2010-11-23 15:04   ` Steven Whitehouse
  2010-11-23 22:51   ` Dave Chinner
  6 siblings, 2 replies; 20+ messages in thread
From: npiggin @ 2010-11-23 14:06 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

[-- Attachment #1: fs-fix-dirty-flags.patch --]
[-- Type: text/plain, Size: 9377 bytes --]

Comments?

Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-11-23 22:57:45.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-11-23 22:59:47.000000000 +1100
@@ -373,6 +373,7 @@ static int pohmelfs_write_inode_create_c
 		dprintk("%s: parent: %llu, ino: %llu, inode: %p.\n",
 				__func__, parent->ino, n->ino, inode);
 
+		/* XXX: is this race WRT writeback? */
 		if (inode && (inode->i_state & I_DIRTY)) {
 			struct pohmelfs_inode *pi = POHMELFS_I(inode);
 			pohmelfs_write_create_inode(pi);
Index: linux-2.6/fs/gfs2/file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/file.c	2010-11-23 22:54:47.000000000 +1100
+++ linux-2.6/fs/gfs2/file.c	2010-11-24 00:58:42.000000000 +1100
@@ -557,23 +557,43 @@ static int gfs2_close(struct inode *inod
 static int gfs2_fsync(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
+	unsigned dirty, mask;
 	int ret = 0;
 
-	/* XXX: testing i_state is broken without proper synchronization */
-
 	if (gfs2_is_jdata(GFS2_I(inode))) {
 		gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
 		return 0;
 	}
 
-	if (sync_state != 0) {
-		if (!datasync)
-			ret = write_inode_now(inode, 0);
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	if (dirty) {
+		spin_unlock(&inode_lock);
+
+		if (!datasync) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_ALL,
+			};
+			ret = inode->i_sb->s_op->write_inode(inode, &wbc);
+		} else {
+			if (gfs2_is_stuffed(GFS2_I(inode)))
+				gfs2_log_flush(GFS2_SB(inode),
+						GFS2_I(inode)->i_gl);
+		}
 
-		if (gfs2_is_stuffed(GFS2_I(inode)))
-			gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
+		spin_lock(&inode_lock);
 	}
+	if (ret)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
 
 	return ret;
 }
Index: linux-2.6/fs/jffs2/fs.c
===================================================================
--- linux-2.6.orig/fs/jffs2/fs.c	2010-11-23 22:54:46.000000000 +1100
+++ linux-2.6/fs/jffs2/fs.c	2010-11-23 22:59:47.000000000 +1100
@@ -361,6 +361,7 @@ void jffs2_dirty_inode(struct inode *ino
 {
 	struct iattr iattr;
 
+	/* XXX: huh? How does this make sense? */
 	if (!(inode->i_state & I_DIRTY_DATASYNC)) {
 		D2(printk(KERN_DEBUG "jffs2_dirty_inode() not calling setattr() for ino #%lu\n", inode->i_ino));
 		return;
Index: linux-2.6/fs/jfs/file.c
===================================================================
--- linux-2.6.orig/fs/jfs/file.c	2010-11-23 22:54:47.000000000 +1100
+++ linux-2.6/fs/jfs/file.c	2010-11-24 00:38:08.000000000 +1100
@@ -19,6 +19,7 @@
 
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/writeback.h>
 #include <linux/quotaops.h>
 #include "jfs_incore.h"
 #include "jfs_inode.h"
@@ -31,18 +32,34 @@
 int jfs_fsync(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	int rc = 0;
+	unsigned dirty, mask;
+	int err = 0;
 
-	if (!(inode->i_state & I_DIRTY) ||
-	    (datasync && !(inode->i_state & I_DIRTY_DATASYNC))) {
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+
+	if (!dirty) {
 		/* Make sure committed changes hit the disk */
 		jfs_flush_journal(JFS_SBI(inode->i_sb)->log, 1);
-		return rc;
+	} else {
+		err = jfs_commit_inode(inode, 1);
 	}
 
-	rc |= jfs_commit_inode(inode, 1);
+	spin_lock(&inode_lock);
+	if (err)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
 
-	return rc ? -EIO : 0;
+	return err ? -EIO : 0;
 }
 
 static int jfs_open(struct inode *inode, struct file *file)
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2010-11-23 22:57:45.000000000 +1100
+++ linux-2.6/fs/nfsd/vfs.c	2010-11-23 22:59:47.000000000 +1100
@@ -969,10 +969,9 @@ static int wait_for_concurrent_writes(st
 		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
 	}
 
-	if (inode->i_state & I_DIRTY) {
-		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-		err = vfs_fsync(file, 0);
-	}
+	dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+	err = vfs_fsync(file, 0);
+
 	last_ino = inode->i_ino;
 	last_dev = inode->i_sb->s_dev;
 	return err;
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2010-11-23 22:54:47.000000000 +1100
+++ linux-2.6/fs/ocfs2/file.c	2010-11-24 00:33:24.000000000 +1100
@@ -176,12 +176,24 @@ static int ocfs2_sync_file(struct file *
 	journal_t *journal;
 	struct inode *inode = file->f_mapping->host;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	unsigned dirty, mask;
 
 	mlog_entry("(0x%p, %d, 0x%p, '%.*s')\n", file, datasync,
 		   file->f_path.dentry, file->f_path.dentry->d_name.len,
 		   file->f_path.dentry->d_name.name);
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) {
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+
+	if (datasync && dirty) {
+
 		/*
 		 * We still have to flush drive's caches to get data to the
 		 * platter
@@ -195,6 +207,12 @@ static int ocfs2_sync_file(struct file *
 	err = jbd2_journal_force_commit(journal);
 
 bail:
+	spin_lock(&inode_lock);
+	if (err)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
+
 	mlog_exit(err);
 
 	return (err < 0) ? -EIO : 0;
Index: linux-2.6/fs/ubifs/file.c
===================================================================
--- linux-2.6.orig/fs/ubifs/file.c	2010-11-23 22:54:47.000000000 +1100
+++ linux-2.6/fs/ubifs/file.c	2010-11-23 22:59:47.000000000 +1100
@@ -1313,11 +1313,9 @@ int ubifs_fsync(struct file *file, int d
 	 * VFS has already synchronized dirty pages for this inode. Synchronize
 	 * the inode unless this is a 'datasync()' call.
 	 */
-	if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
-		err = inode->i_sb->s_op->write_inode(inode, NULL);
-		if (err)
-			return err;
-	}
+	err = sync_inode_metadata(inode, datasync, 1);
+	if (err)
+		return err;
 
 	/*
 	 * Nodes related to this inode may still sit in a write-buffer. Flush
Index: linux-2.6/fs/ufs/truncate.c
===================================================================
--- linux-2.6.orig/fs/ufs/truncate.c	2010-11-23 22:54:47.000000000 +1100
+++ linux-2.6/fs/ufs/truncate.c	2010-11-23 22:59:47.000000000 +1100
@@ -479,7 +479,7 @@ int ufs_truncate(struct inode *inode, lo
 		retry |= ufs_trunc_tindirect (inode);
 		if (!retry)
 			break;
-		if (IS_SYNC(inode) && (inode->i_state & I_DIRTY))
+		if (IS_SYNC(inode))
 			ufs_sync_inode (inode);
 		blk_run_address_space(inode->i_mapping);
 		yield();
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c	2010-11-23 22:54:47.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2010-11-24 00:08:03.000000000 +1100
@@ -99,6 +99,7 @@ xfs_file_fsync(
 	struct xfs_trans	*tp;
 	int			error = 0;
 	int			log_flushed = 0;
+	unsigned		dirty, mask;
 
 	trace_xfs_file_fsync(ip);
 
@@ -132,9 +133,16 @@ xfs_file_fsync(
 	 * might gets cleared when the inode gets written out via the AIL
 	 * or xfs_iflush_cluster.
 	 */
-	if (((inode->i_state & I_DIRTY_DATASYNC) ||
-	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) &&
-	    ip->i_update_core) {
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+	if (dirty && ip->i_update_core) {
 		/*
 		 * Kick off a transaction to log the inode core to get the
 		 * updates.  The sync transaction will also force the log.
@@ -145,7 +153,7 @@ xfs_file_fsync(
 				XFS_FSYNC_TS_LOG_RES(ip->i_mount), 0, 0, 0);
 		if (error) {
 			xfs_trans_cancel(tp, 0);
-			return -error;
+			goto out;
 		}
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 
@@ -197,6 +205,13 @@ xfs_file_fsync(
 			xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
 	}
 
+out:
+	spin_lock(&inode_lock);
+	if (error)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
+
 	return -error;
 }
 
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-11-23 23:53:57.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-11-23 23:54:06.000000000 +1100
@@ -82,6 +82,7 @@ static struct hlist_head *inode_hashtabl
  * the i_state of an inode while it is in use..
  */
 DEFINE_SPINLOCK(inode_lock);
+EXPORT_SYMBOL(inode_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-11-23 14:06 ` [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems npiggin
@ 2010-11-23 15:04   ` Steven Whitehouse
  2010-11-23 22:51   ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Steven Whitehouse @ 2010-11-23 15:04 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

Hi,

On Wed, 2010-11-24 at 01:06 +1100, npiggin@kernel.dk wrote:
> plain text document attachment (fs-fix-dirty-flags.patch)
> Comments?
> 
> Index: linux-2.6/drivers/staging/pohmelfs/inode.c
> ===================================================================
> --- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-11-23 22:57:45.000000000 +1100
> +++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-11-23 22:59:47.000000000 +1100
> @@ -373,6 +373,7 @@ static int pohmelfs_write_inode_create_c
>  		dprintk("%s: parent: %llu, ino: %llu, inode: %p.\n",
>  				__func__, parent->ino, n->ino, inode);
>  
> +		/* XXX: is this race WRT writeback? */
>  		if (inode && (inode->i_state & I_DIRTY)) {
>  			struct pohmelfs_inode *pi = POHMELFS_I(inode);
>  			pohmelfs_write_create_inode(pi);
> Index: linux-2.6/fs/gfs2/file.c
> ===================================================================
> --- linux-2.6.orig/fs/gfs2/file.c	2010-11-23 22:54:47.000000000 +1100
> +++ linux-2.6/fs/gfs2/file.c	2010-11-24 00:58:42.000000000 +1100
> @@ -557,23 +557,43 @@ static int gfs2_close(struct inode *inod
>  static int gfs2_fsync(struct file *file, int datasync)
>  {
>  	struct inode *inode = file->f_mapping->host;
> -	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> +	unsigned dirty, mask;
>  	int ret = 0;
>  
> -	/* XXX: testing i_state is broken without proper synchronization */
> -
>  	if (gfs2_is_jdata(GFS2_I(inode))) {
>  		gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
>  		return 0;
>  	}
>  
> -	if (sync_state != 0) {
> -		if (!datasync)
> -			ret = write_inode_now(inode, 0);
> +	spin_lock(&inode_lock);
> +	inode_writeback_begin(inode, 1);
> +
> +	if (datasync)
> +		mask = I_DIRTY_DATASYNC;
> +	else
> +		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
> +	dirty = inode->i_state & mask;
> +	inode->i_state &= ~mask;
> +	if (dirty) {
> +		spin_unlock(&inode_lock);
> +
> +		if (!datasync) {
> +			struct writeback_control wbc = {
> +				.sync_mode = WB_SYNC_ALL,
> +			};
> +			ret = inode->i_sb->s_op->write_inode(inode, &wbc);
> +		} else {
> +			if (gfs2_is_stuffed(GFS2_I(inode)))
> +				gfs2_log_flush(GFS2_SB(inode),
> +						GFS2_I(inode)->i_gl);
> +		}
>  
> -		if (gfs2_is_stuffed(GFS2_I(inode)))
> -			gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> +		spin_lock(&inode_lock);
>  	}
> +	if (ret)
> +		inode->i_state |= dirty;
> +	inode_writeback_end(inode);
> +	spin_unlock(&inode_lock);
>  
>  	return ret;
>  }

The GFS2 changes seem to make sense to me, so:
Acked-by: Steven Whitehouse <swhiteho@redhat.com>

Steve.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-11-23 14:06 ` [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems npiggin
  2010-11-23 15:04   ` Steven Whitehouse
@ 2010-11-23 22:51   ` Dave Chinner
  2010-11-24  0:23     ` Nick Piggin
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2010-11-23 22:51 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Wed, Nov 24, 2010 at 01:06:17AM +1100, npiggin@kernel.dk wrote:
> Comments?

How did you test the changes?

> +++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2010-11-24 00:08:03.000000000 +1100
> @@ -99,6 +99,7 @@ xfs_file_fsync(
>  	struct xfs_trans	*tp;
>  	int			error = 0;
>  	int			log_flushed = 0;
> +	unsigned		dirty, mask;
>  
>  	trace_xfs_file_fsync(ip);
>  
> @@ -132,9 +133,16 @@ xfs_file_fsync(
>  	 * might gets cleared when the inode gets written out via the AIL
>  	 * or xfs_iflush_cluster.
>  	 */
> -	if (((inode->i_state & I_DIRTY_DATASYNC) ||
> -	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) &&
> -	    ip->i_update_core) {
> +	spin_lock(&inode_lock);
> +	inode_writeback_begin(inode, 1);
> +	if (datasync)
> +		mask = I_DIRTY_DATASYNC;
> +	else
> +		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
> +	dirty = inode->i_state & mask;
> +	inode->i_state &= ~mask;
> +	spin_unlock(&inode_lock);
> +	if (dirty && ip->i_update_core) {

It looks to me like the pattern "inode_writeback_begin(); get dirty
state from i_state" repeated for each filesystem is wrong. The
inode_writeback_begin() helper does this:

	inode->i_state &= ~I_DIRTY;

which clears all the dirty bits from the i_state, which means the
followup:

	dirty = inode->i_state & mask;

will always result in a zero value for dirty.  IOWs, this seems to
ensure that ->fsync never sees dirty inodes anymore. This will break
fsync on XFS, and probably on all the other filesystems you modified
to use this pattern as well.

Also, I think the pattern is racy with respect to concurrent page
cache dirtiers. i.e if the inode was dirtied between writeback and
->fsync() in vfs_fsync_range(), then this new code clears the
I_DIRTY_PAGES bit in i_state without writing back the dirty pages.

And FWIW, I'm not sure that we want to be propagating the inode_lock
into every filesystem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-11-23 22:51   ` Dave Chinner
@ 2010-11-24  0:23     ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2010-11-24  0:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: npiggin, linux-fsdevel, linux-kernel

On Wed, Nov 24, 2010 at 09:51:48AM +1100, Dave Chinner wrote:
> On Wed, Nov 24, 2010 at 01:06:17AM +1100, npiggin@kernel.dk wrote:
> > Comments?
> 
> How did you test the changes?

Not widely as yet, just tested a few filesystems passed deadlock and
bug tests. It's just in RFC state as yet.

 
> > +++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2010-11-24 00:08:03.000000000 +1100
> > @@ -99,6 +99,7 @@ xfs_file_fsync(
> >  	struct xfs_trans	*tp;
> >  	int			error = 0;
> >  	int			log_flushed = 0;
> > +	unsigned		dirty, mask;
> >  
> >  	trace_xfs_file_fsync(ip);
> >  
> > @@ -132,9 +133,16 @@ xfs_file_fsync(
> >  	 * might gets cleared when the inode gets written out via the AIL
> >  	 * or xfs_iflush_cluster.
> >  	 */
> > -	if (((inode->i_state & I_DIRTY_DATASYNC) ||
> > -	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) &&
> > -	    ip->i_update_core) {
> > +	spin_lock(&inode_lock);
> > +	inode_writeback_begin(inode, 1);
> > +	if (datasync)
> > +		mask = I_DIRTY_DATASYNC;
> > +	else
> > +		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
> > +	dirty = inode->i_state & mask;
> > +	inode->i_state &= ~mask;
> > +	spin_unlock(&inode_lock);
> > +	if (dirty && ip->i_update_core) {
> 
> It looks to me like the pattern "inode_writeback_begin(); get dirty
> state from i_state" repeated for each filesystem is wrong. The
> inode_writeback_begin() helper does this:
> 
> 	inode->i_state &= ~I_DIRTY;
> 
> which clears all the dirty bits from the i_state, which means the
> followup:
> 
> 	dirty = inode->i_state & mask;
> 
> will always result in a zero value for dirty.  IOWs, this seems to
> ensure that ->fsync never sees dirty inodes anymore. This will break
> fsync on XFS, and probably on all the other filesystems you modified
> to use this pattern as well.

Yes, the helper needs to do inode->i_state &= ~I_DIRTY_PAGES. Good
catch, thanks.

I had I_DIRTY there because I was initially going to return the
dirty bits, however some cases want to check/clear bits at different
times (eg. background writeout wants to clear DIRTY_PAGES then do
the pagecache writeback, and then test/clear the metadata dirty bits).

 
> Also, I think the pattern is racy with respect to concurrent page
> cache dirtiers. i.e if the inode was dirtied between writeback and
> ->fsync() in vfs_fsync_range(), then this new code clears the
> I_DIRTY_PAGES bit in i_state without writing back the dirty pages.

That gets caught in the writeback_end helper, same way as for background
writeout. It's useful to do this for the fsync helper so that the inode
actually gets marked clean if the pagecache writeback cleaned
everything.

> 
> And FWIW, I'm not sure that we want to be propagating the inode_lock
> into every filesystem...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/7] fs: simple fsync race fix
  2010-11-23 14:06 ` [patch 2/7] fs: simple fsync race fix npiggin
@ 2010-11-29 14:58   ` Christoph Hellwig
  2010-11-30  0:05     ` Nick Piggin
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2010-11-29 14:58 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

Instead of removing the data sync support here and re-adding it later
I would recommend moving the enhanced sync_inode_metadata earlier in
the series.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback
  2010-11-23 14:06 ` [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback npiggin
@ 2010-11-29 14:59   ` Christoph Hellwig
  2010-11-30  0:08     ` Nick Piggin
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2010-11-29 14:59 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Wed, Nov 24, 2010 at 01:06:14AM +1100, npiggin@kernel.dk wrote:
> Otherwise we think the inode is clean even if syncing failed.

The patch itself looks fine, but I'm not sure it's enough.  If we do
an synchronous writeout it could fail long after ->write_inode
has returned.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 6/7] fs: fsync optimisations
  2010-11-23 14:06 ` [patch 6/7] fs: fsync optimisations npiggin
@ 2010-11-29 15:03   ` Christoph Hellwig
  2010-11-30  0:11     ` Nick Piggin
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2010-11-29 15:03 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Wed, Nov 24, 2010 at 01:06:16AM +1100, npiggin@kernel.dk wrote:
> Optimise fsync by adding a datasync parameter to sync_inode_metadata to
> DTRT with writing back the inode (->write_inode in theory should have a
> datasync parameter too perhaps, but that's for another time).
> 
> Also, implement the metadata sync optimally rather than reusing the
> normal data writeback path. This means less useless moving the inode around the
> writeback lists, and less dropping and retaking of inode_lock, and avoiding
> the data writeback call with nr_pages == 0.

This patch looks good to me, but a few minor comments below:

> @@ -1315,13 +1315,49 @@ EXPORT_SYMBOL(sync_inode);
>   *
>   * Note: only writes the actual inode, no associated data or other metadata.
>   */
> -int sync_inode_metadata(struct inode *inode, int wait)
> +int sync_inode_metadata(struct inode *inode, int datasync, int wait)
>  {
> +	struct address_space *mapping = inode->i_mapping;
>  	struct writeback_control wbc = {
>  		.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
>  		.nr_to_write = 0, /* metadata-only */
>  	};
> +	unsigned dirty, mask;
> +	int ret = 0;
>  
> -	return sync_inode(inode, &wbc);
> +	/*
> +	 * This is a similar implementation to writeback_single_inode.
> +	 * Keep them in sync.
> +	 */

I'd move this comment to above the function.

> +	/*
> +	 * Generic write_inode doesn't distinguish between sync and datasync,
> +	 * so even a datasync can clear the sync state. Filesystems which
> +	 * distiguish these cases must only clear 'mask' in their metadata
> +	 * sync code.
> +	 */

I don't understand this comment.  Currenly filesystems never clear
i_state bits by themselves.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 3/7] fs: introduce inode writeback helpers
  2010-11-23 14:06 ` [patch 3/7] fs: introduce inode writeback helpers npiggin
@ 2010-11-29 15:13   ` Christoph Hellwig
  2010-11-30  0:22     ` Nick Piggin
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2010-11-29 15:13 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On Wed, Nov 24, 2010 at 01:06:13AM +1100, npiggin@kernel.dk wrote:
> Inode dirty state cannot be securely tested without participating properly
> in the inode writeback protocol. Some filesystems need to check this state,
> so break out the code into helpers and make them available.
> 
> This could also be used to reduce strange interactions between background
> writeback and fsync. Currently if we fsync a single page in a file, the
> entire file gets requeued to the back of the background IO list, even if
> it is due for writeout and has a large number of pages. That's left for
> a later time.

Generally looks fine, but as Dave already mentioned I'd rather keep
i_state manipulation outside the filesystems.  This could be done with
two wrappers like the following, which should also keep the churn
inside fsync implementations downs:

int fsync_begin(struct inode *inode, int datasync)
{
	int ret = 0;
	unsigned mask = I_DIRTY_DATASYNC;

	if (!datasync)
		mask |= I_DIRTY_SYNC;

	spin_lock(&inode_lock);
	if (!inode_writeback_begin(inode, 1))
		goto out;
	if (!(inode->i_state & mask))
		goto out;

	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
	ret = 1;
out:
	spin_unlock(&inode_lock);
	return ret;
}

static void fsync_end(struct inode *inode, int fail)
{
	spin_lock(&inode_lock);
	if (fail)
		inode->i_state |= I_DIRTY_SYNC | I_DIRTY_DATASYNC;
	inode_writeback_end(inode);
	spin_unlock(&inode_lock);
}

note that this one marks the inode fully dirty in case of a failure,
which is a bit overkill but keeps the interface simpler.  Given that
failure is fsync is catastrophic anyway (filesystem corruption, etc)
that seems fine to me.

Alternatively we could add a fsync_helper that gets a function
pointer with the ->write_inode signature and contains the above
code before and after it. generic_file_fsync would pass the real
->write_inode while other filesystems could pass specific routines.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/7] fs: simple fsync race fix
  2010-11-29 14:58   ` Christoph Hellwig
@ 2010-11-30  0:05     ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2010-11-30  0:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-fsdevel, linux-kernel

On Mon, Nov 29, 2010 at 09:58:18AM -0500, Christoph Hellwig wrote:
> Instead of removing the data sync support here and re-adding it later
> I would recommend moving the enhanced sync_inode_metadata earlier in
> the series.

Well, one is the minimum bug fix, and the other is a proper
improvement for doing metadata. I chose this approach because
it suits stable backports, and bisecting better.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback
  2010-11-29 14:59   ` Christoph Hellwig
@ 2010-11-30  0:08     ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2010-11-30  0:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-fsdevel, linux-kernel

On Mon, Nov 29, 2010 at 09:59:37AM -0500, Christoph Hellwig wrote:
> On Wed, Nov 24, 2010 at 01:06:14AM +1100, npiggin@kernel.dk wrote:
> > Otherwise we think the inode is clean even if syncing failed.
> 
> The patch itself looks fine, but I'm not sure it's enough.  If we do
> an synchronous writeout it could fail long after ->write_inode
> has returned.

Oh there are lots of holes in this buggy POS. Still more that I
haven't fixed, even before you think about error cases.

But, after a *successful* ->write_inode (whether sync or async),
then the filesystem is not going to get any more, unless they
mark the inode dirty again.

I think that's fine, so long as the dirty buffers or whatever are
properly synced at sync(2) / fsync(2) time.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 6/7] fs: fsync optimisations
  2010-11-29 15:03   ` Christoph Hellwig
@ 2010-11-30  0:11     ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2010-11-30  0:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-fsdevel, linux-kernel

On Mon, Nov 29, 2010 at 10:03:25AM -0500, Christoph Hellwig wrote:
> On Wed, Nov 24, 2010 at 01:06:16AM +1100, npiggin@kernel.dk wrote:
> > Optimise fsync by adding a datasync parameter to sync_inode_metadata to
> > DTRT with writing back the inode (->write_inode in theory should have a
> > datasync parameter too perhaps, but that's for another time).
> > 
> > Also, implement the metadata sync optimally rather than reusing the
> > normal data writeback path. This means less useless moving the inode around the
> > writeback lists, and less dropping and retaking of inode_lock, and avoiding
> > the data writeback call with nr_pages == 0.
> 
> This patch looks good to me, but a few minor comments below:

(BTW. it actually had a bug with writeback_end not being called in some
cases, in case you test it)

> > @@ -1315,13 +1315,49 @@ EXPORT_SYMBOL(sync_inode);
> >   *
> >   * Note: only writes the actual inode, no associated data or other metadata.
> >   */
> > -int sync_inode_metadata(struct inode *inode, int wait)
> > +int sync_inode_metadata(struct inode *inode, int datasync, int wait)
> >  {
> > +	struct address_space *mapping = inode->i_mapping;
> >  	struct writeback_control wbc = {
> >  		.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
> >  		.nr_to_write = 0, /* metadata-only */
> >  	};
> > +	unsigned dirty, mask;
> > +	int ret = 0;
> >  
> > -	return sync_inode(inode, &wbc);
> > +	/*
> > +	 * This is a similar implementation to writeback_single_inode.
> > +	 * Keep them in sync.
> > +	 */
> 
> I'd move this comment to above the function.

OK

> 
> > +	/*
> > +	 * Generic write_inode doesn't distinguish between sync and datasync,
> > +	 * so even a datasync can clear the sync state. Filesystems which
> > +	 * distiguish these cases must only clear 'mask' in their metadata
> > +	 * sync code.
> > +	 */
> 
> I don't understand this comment.  Currenly filesystems never clear
> i_state bits by themselves.

No, but with the new inode writeback routines they could. Basically
they might be expected to copy this function, and put their own
improvements in.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 3/7] fs: introduce inode writeback helpers
  2010-11-29 15:13   ` Christoph Hellwig
@ 2010-11-30  0:22     ` Nick Piggin
  0 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2010-11-30  0:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: npiggin, linux-fsdevel, linux-kernel

On Mon, Nov 29, 2010 at 10:13:27AM -0500, Christoph Hellwig wrote:
> On Wed, Nov 24, 2010 at 01:06:13AM +1100, npiggin@kernel.dk wrote:
> > Inode dirty state cannot be securely tested without participating properly
> > in the inode writeback protocol. Some filesystems need to check this state,
> > so break out the code into helpers and make them available.
> > 
> > This could also be used to reduce strange interactions between background
> > writeback and fsync. Currently if we fsync a single page in a file, the
> > entire file gets requeued to the back of the background IO list, even if
> > it is due for writeout and has a large number of pages. That's left for
> > a later time.
> 
> Generally looks fine, but as Dave already mentioned I'd rather keep
> i_state manipulation outside the filesystems.  This could be done with

I don't see a big problem with it. They already did load it previously
in way which required inode_lock (and was buggy in part because it
didn't take that lock).


> two wrappers like the following, which should also keep the churn
> inside fsync implementations downs:
> 
> int fsync_begin(struct inode *inode, int datasync)
> {
> 	int ret = 0;
> 	unsigned mask = I_DIRTY_DATASYNC;
> 
> 	if (!datasync)
> 		mask |= I_DIRTY_SYNC;
> 
> 	spin_lock(&inode_lock);
> 	if (!inode_writeback_begin(inode, 1))
> 		goto out;
> 	if (!(inode->i_state & mask))
> 		goto out;
> 
> 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> 	ret = 1;
> out:
> 	spin_unlock(&inode_lock);
> 	return ret;
> }
> 
> static void fsync_end(struct inode *inode, int fail)
> {
> 	spin_lock(&inode_lock);
> 	if (fail)
> 		inode->i_state |= I_DIRTY_SYNC | I_DIRTY_DATASYNC;
> 	inode_writeback_end(inode);
> 	spin_unlock(&inode_lock);
> }

I prefer not to do that because it doesn't give any control over
setting or clearing the state flags (which might be done more
intelligently by the filesystem and so this function might be
unusable), and just restricts how filesystems use inode_writeback_begin
and inode lock.

Basically if you are doing anything slightly smart, you can start
inode_writeback_begin to exclude concurrent writeout, and if the
inode_lock is held, you can also prevent new changes to dirty bits
and thus keep the generic inode dirty bits in synch with your filesystem
private state.

In short, I don't see anything wrong with exporting
inode_writeback_begin and allowing i_state manipulation by filesystems
that want to do interesting things. And the wrappers AFAIKS don't add
that much -- it's not very long or difficult code.


> note that this one marks the inode fully dirty in case of a failure,
> which is a bit overkill but keeps the interface simpler.  Given that
> failure is fsync is catastrophic anyway (filesystem corruption, etc)
> that seems fine to me.
> 
> Alternatively we could add a fsync_helper that gets a function
> pointer with the ->write_inode signature and contains the above
> code before and after it. generic_file_fsync would pass the real
> ->write_inode while other filesystems could pass specific routines.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 5/7] fs: ext2 inode sync fix
  2010-11-23 14:06 ` [patch 5/7] fs: ext2 inode sync fix npiggin
@ 2010-11-30 11:26   ` Boaz Harrosh
  0 siblings, 0 replies; 20+ messages in thread
From: Boaz Harrosh @ 2010-11-30 11:26 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

On 11/23/2010 04:06 PM, npiggin@kernel.dk wrote:
> There is a big fuckup with inode metadata writeback (I suspect in a lot
> of filesystems, but I've only dared to look at a couple as yet).
> 
> ext2 relies on ->write_inode being called from sync_inode_metadata in
> fsync in order to sync the inode. However I_DIRTY_SYNC gets cleared
> after a call to this guy, and it doesn't actually write back and wait
> on the inode block unless it is called for sync. This means that write_inode
> from background writeback can kill the inode dirty bits without the data
> getting to disk. Fsync will subsequently miss it.
> 
> The fix is for ->write_inode to dirty the buffers/cache, and then ->fsync to
> write back the dirty data. In the full filesystem sync case, buffercache
> writeback in the generic code will write back the dirty data. Other
> filesystems could use ->sync_fs for this.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> Index: linux-2.6/fs/ext2/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/inode.c	2010-11-23 21:56:05.000000000 +1100
> +++ linux-2.6/fs/ext2/inode.c	2010-11-23 22:05:02.000000000 +1100
> @@ -1211,7 +1211,7 @@ static int ext2_setsize(struct inode *in
>  	return 0;
>  }
>  
> -static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
> +struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
>  					struct buffer_head **p)
>  {
>  	struct buffer_head * bh;
> @@ -1505,16 +1505,8 @@ static int __ext2_write_inode(struct ino
>  	} else for (n = 0; n < EXT2_N_BLOCKS; n++)
>  		raw_inode->i_block[n] = ei->i_data[n];
>  	mark_buffer_dirty(bh);
> -	if (do_sync) {
> -		sync_dirty_buffer(bh);
> -		if (buffer_req(bh) && !buffer_uptodate(bh)) {
> -			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
> -				sb->s_id, (unsigned long) ino);
> -			err = -EIO;
> -		}
> -	}

Is .fsync the only case when .write_inode is called with do_sync==true ?

> -	ei->i_state &= ~EXT2_STATE_NEW;
>  	brelse (bh);
> +	ei->i_state &= ~EXT2_STATE_NEW;
>  	return err;
>  }
>  
> Index: linux-2.6/fs/ext2/file.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/file.c	2010-11-23 21:56:05.000000000 +1100
> +++ linux-2.6/fs/ext2/file.c	2010-11-23 22:05:02.000000000 +1100
> @@ -21,6 +21,7 @@
>  #include <linux/time.h>
>  #include <linux/pagemap.h>
>  #include <linux/quotaops.h>
> +#include <linux/buffer_head.h>
>  #include "ext2.h"
>  #include "xattr.h"
>  #include "acl.h"
> @@ -43,16 +44,33 @@ static int ext2_release_file (struct ino
>  int ext2_fsync(struct file *file, int datasync)
>  {
>  	int ret;
> -	struct super_block *sb = file->f_mapping->host->i_sb;
> -	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
> +	struct inode *inode = file->f_mapping->host;
> +	ino_t ino = inode->i_ino;
> +	struct super_block *sb = inode->i_sb;
> +	struct address_space *sb_mapping = sb->s_bdev->bd_inode->i_mapping;
> +	struct buffer_head *bh;
> +	struct ext2_inode *raw_inode;
>  
>  	ret = generic_file_fsync(file, datasync);
> -	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> +	if (ret == -EIO || test_and_clear_bit(AS_EIO, &sb_mapping->flags)) {
>  		/* We don't really know where the IO error happened... */
>  		ext2_error(sb, __func__,
>  			   "detected IO error when writing metadata buffers");
> +		return -EIO;
> +	}
> +
> +	raw_inode = ext2_get_inode(sb, ino, &bh);

You use the Read-inode-from-disk-if-not-present function to associate
a bufferhead with the inode, relying on it being in memory for sure.
That's ugly. Maybe a comment?

But do you need it, since above __ext2_write_inode did mark_buffer_dirty(bh);
is it not already included in the buffers flushed to disk?

> +	if (IS_ERR(raw_inode))
> + 		return -EIO;
> +
> +	sync_dirty_buffer(bh);
> +	if (buffer_req(bh) && !buffer_uptodate(bh)) {
> +		printk ("IO error syncing ext2 inode [%s:%08lx]\n",
> +			sb->s_id, (unsigned long) ino);

Should you not use ext2_error(sb, __func__, like above?

Boaz
>  		ret = -EIO;
>  	}
> +	brelse (bh);
> +
>  	return ret;
>  }
>  
> Index: linux-2.6/fs/ext2/ext2.h
> ===================================================================
> --- linux-2.6.orig/fs/ext2/ext2.h	2010-11-23 21:56:05.000000000 +1100
> +++ linux-2.6/fs/ext2/ext2.h	2010-11-23 22:05:02.000000000 +1100
> @@ -124,6 +124,8 @@ extern int ext2_get_block(struct inode *
>  extern int ext2_setattr (struct dentry *, struct iattr *);
>  extern void ext2_set_inode_flags(struct inode *inode);
>  extern void ext2_get_inode_flags(struct ext2_inode_info *);
> +extern struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
> +					struct buffer_head **p);
>  extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2010-11-30 11:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-23 14:06 [patch 0/7] icache dirty / sync fixes npiggin
2010-11-23 14:06 ` [patch 1/7] fs: mark_inode_dirty barrier fix npiggin
2010-11-23 14:06 ` [patch 2/7] fs: simple fsync race fix npiggin
2010-11-29 14:58   ` Christoph Hellwig
2010-11-30  0:05     ` Nick Piggin
2010-11-23 14:06 ` [patch 3/7] fs: introduce inode writeback helpers npiggin
2010-11-29 15:13   ` Christoph Hellwig
2010-11-30  0:22     ` Nick Piggin
2010-11-23 14:06 ` [patch 4/7] fs: preserve inode dirty bits on failed metadata writeback npiggin
2010-11-29 14:59   ` Christoph Hellwig
2010-11-30  0:08     ` Nick Piggin
2010-11-23 14:06 ` [patch 5/7] fs: ext2 inode sync fix npiggin
2010-11-30 11:26   ` Boaz Harrosh
2010-11-23 14:06 ` [patch 6/7] fs: fsync optimisations npiggin
2010-11-29 15:03   ` Christoph Hellwig
2010-11-30  0:11     ` Nick Piggin
2010-11-23 14:06 ` [patch 7/7] fs: fix or note I_DIRTY handling bugs in filesystems npiggin
2010-11-23 15:04   ` Steven Whitehouse
2010-11-23 22:51   ` Dave Chinner
2010-11-24  0:23     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).