All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 0/8] Inode data integrity patches
@ 2010-12-18  1:46 Nick Piggin
  2010-12-18  1:46 ` [patch 1/8] fs: mark_inode_dirty barrier fix Nick Piggin
                   ` (7 more replies)
  0 siblings, 8 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

These are not for merge quite yet, need a bit more changelog and comments
polishing, and splitting out filesystems patches, but if I don't get negative
comments about them, I'll be asking to merge them in next window.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 1/8] fs: mark_inode_dirty barrier fix
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-18  1:46 ` [patch 2/8] fs: simple fsync race fix Nick Piggin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-mark_inode_dirty-barrier-fix.patch --]
[-- Type: text/plain, Size: 1610 bytes --]

Filesystems appear to be using ->dirty_inode, expecting that the dirtying
operating is done and visible to all CPUs (eg. setting private inode dirty
bits, without any barriers themselves). So release the dirty "critical
section" with a barrier before calling ->dirty_inode.

Cost is not significantly changed, because we're just moving the barrier.
Those filesystems that do use ->dirty_inode should have to care slightly
less about barriers, which is a good thing.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-19 16:47:00.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-19 16:49:39.000000000 +1100
@@ -934,6 +934,15 @@ void __mark_inode_dirty(struct inode *in
 	bool wakeup_bdi = false;
 
 	/*
+	 * Make sure that changes are seen by all cpus before we test i_state
+	 * or mark anything as being dirty. Ie. all dirty state should be
+	 * written to the inode and visible. Like an "unlock" operation, the
+	 * mark_inode_dirty call must "release" our ordering window that is
+	 * opened when we started modifying the inode.
+	 */
+	smp_mb();
+
+	/*
 	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
 	 * dirty the inode itself
 	 */
@@ -942,12 +951,6 @@ void __mark_inode_dirty(struct inode *in
 			sb->s_op->dirty_inode(inode);
 	}
 
-	/*
-	 * make sure that changes are seen by all cpus before we test i_state
-	 * -- mikulas
-	 */
-	smp_mb();
-
 	/* avoid the locking if we can */
 	if ((inode->i_state & flags) == flags)
 		return;



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 2/8] fs: simple fsync race fix
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
  2010-12-18  1:46 ` [patch 1/8] fs: mark_inode_dirty barrier fix Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-18  1:46 ` [patch 3/8] fs: introduce inode writeback helpers Nick Piggin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-sync-fixup.patch --]
[-- Type: text/plain, Size: 1984 bytes --]

It is incorrect to test inode dirty bits without participating in the inode
writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
writes out the particular bits, then clears I_SYNC when it is done. BTW. it
may not completely write all pages out, so I_DIRTY_PAGES would get set
again.

This is a standard pattern used throughout the kernel's writeback caches
(I_SYNC ~= I_WRITEBACK, if that makes it clearer).

And so it is not possible to determine an inode's dirty status just by
checking I_DIRTY bits. Especially not for the purpose of data integrity
syncs.

Missing the check for these bits means that fsync can complete while
writeback to the inode is underway. Inode writeback functions get this
right, so call into them rather than try to shortcut things by testing
dirty state improperly.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>


Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-11-19 16:44:39.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-11-19 16:49:42.000000000 +1100
@@ -895,11 +895,6 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	if (!(inode->i_state & I_DIRTY))
-		return ret;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
-
 	err = sync_inode_metadata(inode, 1);
 	if (ret == 0)
 		ret = err;
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-11-19 16:50:00.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-11-19 16:50:07.000000000 +1100
@@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	if (!(inode->i_state & I_DIRTY))
-		return 0;
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return 0;
-
 	ret = sync_inode_metadata(inode, 1);
 
 	/* This is a good place to write the sb */



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 3/8] fs: introduce inode writeback helpers
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
  2010-12-18  1:46 ` [patch 1/8] fs: mark_inode_dirty barrier fix Nick Piggin
  2010-12-18  1:46 ` [patch 2/8] fs: simple fsync race fix Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-18  1:46 ` [patch 4/8] fs: preserve inode dirty bits on failed metadata writeback Nick Piggin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-inode-clear-dirty-for-io.patch --]
[-- Type: text/plain, Size: 7973 bytes --]

Inode dirty state cannot be securely tested without participating properly
in the inode writeback protocol. Some filesystems need to check this state,
so break out the code into helpers and make them available.

This could also be used to reduce strange interactions between background
writeback and fsync. Currently if we fsync a single page in a file, the
entire file gets requeued to the back of the background IO list, even if
it is due for writeout and has a large number of pages. That's left for
a later time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-12-16 18:33:14.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-12-16 18:33:17.000000000 +1100
@@ -299,26 +299,25 @@ static void inode_wait_for_writeback(str
 	}
 }
 
-/*
- * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+/**
+ * inode_writeback_begin -- prepare to writeback an inode
+ * @indoe: inode to write back
+ * @wait: synch writeout or not
+ * @Returns: 0 if wait == 0 and this call would block (due to other writeback).
+ *           otherwise returns 1.
  *
- * If `wait' is set, wait on the writeout.
+ * Context: inode_lock must be held, may be dropped. Returns with it held.
  *
- * The whole writeout design is quite complex and fragile.  We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
- * livelocks, etc.
+ * inode_writeback_begin sets up an inode to be written back (data and/or
+ * metadata). This must be called before examining I_DIRTY state of the
+ * inode, and should be called at least before any data integrity writeout.
  *
- * Called under inode_lock.
+ * If inode_writeback_begin returns 1, it must be followed by a call to
+ * inode_writeback_end.
  */
-static int
-writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+int inode_writeback_begin(struct inode *inode, int wait)
 {
-	struct address_space *mapping = inode->i_mapping;
-	unsigned dirty;
-	int ret;
-
+	assert_spin_locked(&inode_lock);
 	if (!atomic_read(&inode->i_count))
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
@@ -327,16 +326,10 @@ writeback_single_inode(struct inode *ino
 	if (inode->i_state & I_SYNC) {
 		/*
 		 * If this inode is locked for writeback and we are not doing
-		 * writeback-for-data-integrity, move it to b_more_io so that
-		 * writeback can proceed with the other inodes on s_io.
-		 *
-		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of b_io.
+		 * writeback-for-data-integrity, skip it.
 		 */
-		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode);
+		if (!wait)
 			return 0;
-		}
 
 		/*
 		 * It's a data-integrity sync.  We must wait.
@@ -346,9 +339,91 @@ writeback_single_inode(struct inode *ino
 
 	BUG_ON(inode->i_state & I_SYNC);
 
-	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
+
+	return 1;
+}
+EXPORT_SYMBOL(inode_writeback_begin);
+
+/**
+ * inode_writeback_end - end a writeback section opened by inode_writeback_begin
+ * @inode: inode in question
+ * @Returns: 0 if the inode still has dirty pagecache, otherwise 1.
+ *
+ * Context: inode_lock must be held, not dropped.
+ *
+ * inode_writeback_end must follow a successful call to inode_writeback_begin
+ * after we have finished submitting writeback to the inode.
+ */
+int inode_writeback_end(struct inode *inode)
+{
+	int ret = 1;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON(!(inode->i_state & I_SYNC));
+
+	if (!(inode->i_state & I_FREEING)) {
+		if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+			/*
+			 * We didn't write back all the pages.  nfs_writepages()
+			 * sometimes bales out without doing anything.
+			 */
+			inode->i_state |= I_DIRTY_PAGES;
+			ret = 0;
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * Filesystems can dirty the inode during writeback
+			 * operations, such as delayed allocation during
+			 * submission or metadata updates after data IO
+			 * completion.
+			 */
+			redirty_tail(inode);
+		} else {
+			/*
+			 * The inode is clean.  At this point we either have
+			 * a reference to the inode or it's on it's way out.
+			 * No need to add it back to the LRU.
+			 */
+			list_del_init(&inode->i_wb_list);
+		}
+	}
+	inode->i_state &= ~I_SYNC;
+	inode_sync_complete(inode);
+
+	return ret;
+}
+EXPORT_SYMBOL(inode_writeback_end);
+
+/*
+ * Write out an inode's dirty pages.  Called under inode_lock.  Either the
+ * caller has ref on the inode (either via __iget or via syscall against an fd)
+ * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ *
+ * If `wait' is set, wait on the writeout.
+ *
+ * The whole writeout design is quite complex and fragile.  We want to avoid
+ * starvation of particular inodes when others are being redirtied, prevent
+ * livelocks, etc.
+ *
+ * Called under inode_lock.
+ */
+static int
+writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	struct address_space *mapping = inode->i_mapping;
+	unsigned dirty;
+	int ret;
+
+	if (!inode_writeback_begin(inode, wbc->sync_mode == WB_SYNC_ALL)) {
+		/*
+		 * We'll have another go at writing back this inode when we
+		 * completed a full scan of b_io.
+		 */
+		requeue_io(inode);
+		return 0;
+	}
+
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -381,47 +456,24 @@ writeback_single_inode(struct inode *ino
 	}
 
 	spin_lock(&inode_lock);
-	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
-			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode);
-			}
-		} else if (inode->i_state & I_DIRTY) {
+	if (!inode_writeback_end(inode)) {
+		if (wbc->nr_to_write <= 0) {
 			/*
-			 * Filesystems can dirty the inode during writeback
-			 * operations, such as delayed allocation during
-			 * submission or metadata updates after data IO
-			 * completion.
+			 * slice used up: queue for next turn
 			 */
-			redirty_tail(inode);
+			requeue_io(inode);
 		} else {
 			/*
-			 * The inode is clean.  At this point we either have
-			 * a reference to the inode or it's on it's way out.
-			 * No need to add it back to the LRU.
+			 * Writeback blocked by something other than
+			 * congestion. Delay the inode for some time to
+			 * avoid spinning on the CPU (100% iowait)
+			 * retrying writeback of the dirty page/inode
+			 * that cannot be performed immediately.
 			 */
-			list_del_init(&inode->i_wb_list);
+			redirty_tail(inode);
 		}
 	}
-	inode_sync_complete(inode);
+
 	return ret;
 }
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-12-16 18:29:04.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-12-16 18:33:17.000000000 +1100
@@ -1766,6 +1766,8 @@ static inline void file_accessed(struct
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
 int sync_inode_metadata(struct inode *inode, int wait);
+int inode_writeback_begin(struct inode *inode, int wait);
+int inode_writeback_end(struct inode *inode);
 
 struct file_system_type {
 	const char *name;



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 4/8] fs: preserve inode dirty bits on failed metadata writeback
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
                   ` (2 preceding siblings ...)
  2010-12-18  1:46 ` [patch 3/8] fs: introduce inode writeback helpers Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-18  1:46 ` [patch 5/8] fs: ext2 inode sync fix Nick Piggin
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-inode-dirty-error-fix.patch --]
[-- Type: text/plain, Size: 1136 bytes --]

Otherwise we think the inode is clean even if syncing failed.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-23 22:28:22.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-23 22:57:40.000000000 +1100
@@ -447,15 +447,25 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
-		int err = write_inode(inode, wbc);
+		int err;
+
+		spin_unlock(&inode_lock);
+		err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		spin_lock(&inode_lock);
+		if (err) {
+			/*
+			 * Inode writeout failed, restore inode metadata
+			 * dirty bits.
+			 */
+			inode->i_state |= dirty &
+					(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+		}
 	}
 
-	spin_lock(&inode_lock);
 	if (!inode_writeback_end(inode)) {
 		if (wbc->nr_to_write <= 0) {
 			/*



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 5/8] fs: ext2 inode sync fix
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
                   ` (3 preceding siblings ...)
  2010-12-18  1:46 ` [patch 4/8] fs: preserve inode dirty bits on failed metadata writeback Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2011-01-07 19:08   ` Ted Ts'o
  2010-12-18  1:46 ` [patch 6/8] fs: fsync optimisations Nick Piggin
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: ext2-sync-fixes.patch --]
[-- Type: text/plain, Size: 11625 bytes --]

There is a big fuckup with inode metadata writeback (I suspect in a lot
of filesystems, but I've only dared to look at a couple as yet).

ext2 relies on ->write_inode being called from sync_inode_metadata in
fsync in order to sync the inode. However I_DIRTY_SYNC gets cleared
after a call to this guy, and it doesn't actually write back and wait
on the inode block unless it is called for sync. This means that write_inode
from background writeback can kill the inode dirty bits without the data
getting to disk. Fsync will subsequently miss it.

The fix is for ->write_inode to dirty the buffers/cache, and then ->fsync to
write back the dirty data. In the full filesystem sync case, buffercache
writeback in the generic code will write back the dirty data. Other
filesystems could use ->sync_fs for this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -1211,7 +1211,7 @@ static int ext2_setsize(struct inode *in
 	return 0;
 }
 
-static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
 					struct buffer_head **p)
 {
 	struct buffer_head * bh;
@@ -1505,16 +1505,8 @@ static int __ext2_write_inode(struct ino
 	} else for (n = 0; n < EXT2_N_BLOCKS; n++)
 		raw_inode->i_block[n] = ei->i_data[n];
 	mark_buffer_dirty(bh);
-	if (do_sync) {
-		sync_dirty_buffer(bh);
-		if (buffer_req(bh) && !buffer_uptodate(bh)) {
-			printk ("IO error syncing ext2 inode [%s:%08lx]\n",
-				sb->s_id, (unsigned long) ino);
-			err = -EIO;
-		}
-	}
-	ei->i_state &= ~EXT2_STATE_NEW;
 	brelse (bh);
+	ei->i_state &= ~EXT2_STATE_NEW;
 	return err;
 }
 
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c	2010-12-16 18:28:58.000000000 +1100
+++ linux-2.6/fs/ext2/file.c	2010-12-16 18:33:25.000000000 +1100
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>
+#include <linux/buffer_head.h>
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
@@ -43,16 +44,33 @@ static int ext2_release_file (struct ino
 int ext2_fsync(struct file *file, int datasync)
 {
 	int ret;
-	struct super_block *sb = file->f_mapping->host->i_sb;
-	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+	struct inode *inode = file->f_mapping->host;
+	ino_t ino = inode->i_ino;
+	struct super_block *sb = inode->i_sb;
+	struct address_space *sb_mapping = sb->s_bdev->bd_inode->i_mapping;
+	struct buffer_head *bh;
+	struct ext2_inode *raw_inode;
 
 	ret = generic_file_fsync(file, datasync);
-	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
+	if (ret == -EIO || test_and_clear_bit(AS_EIO, &sb_mapping->flags)) {
 		/* We don't really know where the IO error happened... */
 		ext2_error(sb, __func__,
 			   "detected IO error when writing metadata buffers");
+		return -EIO;
+	}
+
+	raw_inode = ext2_get_inode(sb, ino, &bh);
+	if (IS_ERR(raw_inode))
+ 		return -EIO;
+
+	sync_dirty_buffer(bh);
+	if (buffer_req(bh) && !buffer_uptodate(bh)) {
+		printk ("IO error syncing ext2 inode [%s:%08lx]\n",
+			sb->s_id, (unsigned long) ino);
 		ret = -EIO;
 	}
+	brelse (bh);
+
 	return ret;
 }
 
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ext2/ext2.h	2010-12-16 18:33:25.000000000 +1100
@@ -124,6 +124,8 @@ extern int ext2_get_block(struct inode *
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
 extern void ext2_get_inode_flags(struct ext2_inode_info *);
+extern struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+					struct buffer_head **p);
 extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
 
Index: linux-2.6/fs/adfs/inode.c
===================================================================
--- linux-2.6.orig/fs/adfs/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/adfs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -383,7 +383,7 @@ int adfs_write_inode(struct inode *inode
 	obj.attr	= ADFS_I(inode)->attr;
 	obj.size	= inode->i_size;
 
-	ret = adfs_dir_update(sb, &obj, wbc->sync_mode == WB_SYNC_ALL);
+	ret = adfs_dir_update(sb, &obj, 1 /* XXX: fix fsync and use 'wbc->sync_mode == WB_SYNC_ALL' */);
 	unlock_kernel();
 	return ret;
 }
Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/affs/file.c	2010-12-16 18:33:25.000000000 +1100
@@ -931,6 +931,7 @@ int affs_file_fsync(struct file *filp, i
 	int ret, err;
 
 	ret = write_inode_now(inode, 0);
+	/* XXX: could just sync the buffer been dirtied by write_inode */
 	err = sync_blockdev(inode->i_sb->s_bdev);
 	if (!ret)
 		ret = err;
Index: linux-2.6/fs/bfs/inode.c
===================================================================
--- linux-2.6.orig/fs/bfs/inode.c	2010-12-16 18:29:02.000000000 +1100
+++ linux-2.6/fs/bfs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -151,7 +151,7 @@ static int bfs_write_inode(struct inode
 	di->i_eoffset = cpu_to_le32(i_sblock * BFS_BSIZE + inode->i_size - 1);
 
 	mark_buffer_dirty(bh);
-	if (wbc->sync_mode == WB_SYNC_ALL) {
+	if (1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ ) {
 		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh))
 			err = -EIO;
Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/exofs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -1273,7 +1273,7 @@ static int exofs_update_inode(struct ino
 
 int exofs_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return exofs_update_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+	return exofs_update_inode(inode, 1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ );
 }
 
 /*
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2010-12-16 18:30:20.000000000 +1100
+++ linux-2.6/fs/ext4/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -5243,7 +5243,7 @@ int ext4_write_inode(struct inode *inode
 		err = __ext4_get_inode_loc(inode, &iloc, 0);
 		if (err)
 			return err;
-		if (wbc->sync_mode == WB_SYNC_ALL)
+		if (1 /* XXX: fix fxync and use wbc->sync_mode == WB_SYNC_ALL */)
 			sync_dirty_buffer(iloc.bh);
 		if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
 			EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr,
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2010-12-16 18:29:02.000000000 +1100
+++ linux-2.6/fs/fat/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -645,7 +645,7 @@ static int __fat_write_inode(struct inod
 	spin_unlock(&sbi->inode_hash_lock);
 	mark_buffer_dirty(bh);
 	err = 0;
-	if (wait)
+	if (1 /* XXX: fix fsync and use wait */)
 		err = sync_dirty_buffer(bh);
 	brelse(bh);
 	return err;
Index: linux-2.6/fs/jfs/inode.c
===================================================================
--- linux-2.6.orig/fs/jfs/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/jfs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -123,7 +123,7 @@ int jfs_commit_inode(struct inode *inode
 
 int jfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	int wait = wbc->sync_mode == WB_SYNC_ALL;
+	int wait = 1; /* XXX fix fsync and use wbc->sync_mode == WB_SYNC_ALL; */
 
 	if (test_cflag(COMMIT_Nolink, inode))
 		return 0;
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c	2010-12-16 18:29:02.000000000 +1100
+++ linux-2.6/fs/minix/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -576,7 +576,7 @@ static int minix_write_inode(struct inod
 		bh = V2_minix_update_inode(inode);
 	if (!bh)
 		return -EIO;
-	if (wbc->sync_mode == WB_SYNC_ALL && buffer_dirty(bh)) {
+	if (1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ && buffer_dirty(bh)) {
 		sync_dirty_buffer(bh);
 		if (buffer_req(bh) && !buffer_uptodate(bh)) {
 			printk("IO error syncing minix inode [%s:%08lx]\n",
Index: linux-2.6/fs/omfs/inode.c
===================================================================
--- linux-2.6.orig/fs/omfs/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/omfs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -169,7 +169,7 @@ static int __omfs_write_inode(struct ino
 
 static int omfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return __omfs_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+	return __omfs_write_inode(inode, 1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */);
 }
 
 int omfs_sync_inode(struct inode *inode)
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/reiserfs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -1635,6 +1635,8 @@ int reiserfs_write_inode(struct inode *i
 	 ** these cases are just when the system needs ram, not when the
 	 ** inode needs to reach disk for safety, and they can safely be
 	 ** ignored because the altered inode has already been logged.
+	 ** XXX: is this really OK? The caller clears the inode dirty bit, so
+	 ** a subsequent sync for integrity might never reach here.
 	 */
 	if (wbc->sync_mode == WB_SYNC_ALL && !(current->flags & PF_MEMALLOC)) {
 		reiserfs_write_lock(inode->i_sb);
Index: linux-2.6/fs/sysv/inode.c
===================================================================
--- linux-2.6.orig/fs/sysv/inode.c	2010-12-16 18:29:02.000000000 +1100
+++ linux-2.6/fs/sysv/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -286,7 +286,7 @@ static int __sysv_write_inode(struct ino
 		write3byte(sbi, (u8 *)&si->i_data[block],
 			&raw_inode->i_data[3*block]);
 	mark_buffer_dirty(bh);
-	if (wait) {
+	if (1 /* XXX: fix fsync and use wait */) {
                 sync_dirty_buffer(bh);
                 if (buffer_req(bh) && !buffer_uptodate(bh)) {
                         printk ("IO error syncing sysv inode [%s:%08x]\n",
Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/udf/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -1598,7 +1598,7 @@ static int udf_update_inode(struct inode
 
 	/* write the data blocks */
 	mark_buffer_dirty(bh);
-	if (do_sync) {
+	if (1 /* XXX fix fsync and use do_sync */) {
 		sync_dirty_buffer(bh);
 		if (buffer_write_io_error(bh)) {
 			printk(KERN_WARNING "IO error syncing udf inode "
Index: linux-2.6/fs/ufs/inode.c
===================================================================
--- linux-2.6.orig/fs/ufs/inode.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ufs/inode.c	2010-12-16 18:33:25.000000000 +1100
@@ -889,7 +889,7 @@ static int ufs_update_inode(struct inode
 	}
 		
 	mark_buffer_dirty(bh);
-	if (do_sync)
+	if (1 /* XXX: fix fsync and use do_sync */)
 		sync_dirty_buffer(bh);
 	brelse (bh);
 	



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 6/8] fs: fsync optimisations
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
                   ` (4 preceding siblings ...)
  2010-12-18  1:46 ` [patch 5/8] fs: ext2 inode sync fix Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-18  1:46 ` [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems Nick Piggin
  2010-12-18  1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
  7 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-datasync-opt.patch --]
[-- Type: text/plain, Size: 6884 bytes --]

Optimise fsync by adding a datasync parameter to sync_inode_metadata to
DTRT with writing back the inode (->write_inode in theory should have a
datasync parameter too perhaps, but that's for another time).

Also, implement the metadata sync optimally rather than reusing the
normal data writeback path. This means less useless moving the inode around the
writeback lists, and less dropping and retaking of inode_lock, and avoiding
the data writeback call with nr_pages == 0.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-12-16 18:29:02.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-12-16 18:33:29.000000000 +1100
@@ -883,7 +883,7 @@ static int pohmelfs_fsync(struct file *f
 {
 	struct inode *inode = file->f_mapping->host;
 
-	return sync_inode_metadata(inode, 1);
+	return sync_inode_metadata(inode, datasync, 1);
 }
 
 ssize_t pohmelfs_write(struct file *file, const char __user *buf,
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-12-16 18:33:16.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-12-16 18:33:29.000000000 +1100
@@ -48,7 +48,7 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	ret = sync_inode_metadata(inode, 1);
+	ret = sync_inode_metadata(inode, datasync, 1);
 
 	/* This is a good place to write the sb */
 	/* TODO: Sechedule an sb-sync on create */
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ext2/dir.c	2010-12-16 18:33:29.000000000 +1100
@@ -98,7 +98,7 @@ static int ext2_commit_chunk(struct page
 	if (IS_DIRSYNC(dir)) {
 		err = write_one_page(page, 1);
 		if (!err)
-			err = sync_inode_metadata(dir, 1);
+			err = sync_inode_metadata(dir, 0, 1);
 	} else {
 		unlock_page(page);
 	}
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-12-16 18:33:25.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-12-16 18:33:29.000000000 +1100
@@ -1203,7 +1203,7 @@ static int ext2_setsize(struct inode *in
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
 		sync_mapping_buffers(inode->i_mapping);
-		sync_inode_metadata(inode, 1);
+		sync_inode_metadata(inode, 0, 1);
 	} else {
 		mark_inode_dirty(inode);
 	}
Index: linux-2.6/fs/ext2/xattr.c
===================================================================
--- linux-2.6.orig/fs/ext2/xattr.c	2010-12-11 20:57:04.000000000 +1100
+++ linux-2.6/fs/ext2/xattr.c	2010-12-16 18:33:29.000000000 +1100
@@ -699,7 +699,7 @@ ext2_xattr_set2(struct inode *inode, str
 	EXT2_I(inode)->i_file_acl = new_bh ? new_bh->b_blocknr : 0;
 	inode->i_ctime = CURRENT_TIME_SEC;
 	if (IS_SYNC(inode)) {
-		error = sync_inode_metadata(inode, 1);
+		error = sync_inode_metadata(inode, 0, 1);
 		/* In case sync failed due to ENOSPC the inode was actually
 		 * written (only some dirty data were not) so we just proceed
 		 * as if nothing happened and cleanup the unused block */
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-12-16 18:33:22.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-12-16 18:33:29.000000000 +1100
@@ -1307,7 +1307,7 @@ int sync_inode(struct inode *inode, stru
 EXPORT_SYMBOL(sync_inode);
 
 /**
- * sync_inode - write an inode to disk
+ * sync_inode_metadata - write an inode to disk
  * @inode: the inode to sync
  * @wait: wait for I/O to complete.
  *
@@ -1315,13 +1315,50 @@ EXPORT_SYMBOL(sync_inode);
  *
  * Note: only writes the actual inode, no associated data or other metadata.
  */
-int sync_inode_metadata(struct inode *inode, int wait)
+int sync_inode_metadata(struct inode *inode, int datasync, int wait)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct writeback_control wbc = {
 		.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
 		.nr_to_write = 0, /* metadata-only */
 	};
+	unsigned dirty, mask;
+	int ret = 0;
 
-	return sync_inode(inode, &wbc);
+	/*
+	 * This is a similar implementation to writeback_single_inode.
+	 * Keep them in sync.
+	 */
+	spin_lock(&inode_lock);
+	if (!inode_writeback_begin(inode, wait))
+		goto out;
+
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	if (!dirty)
+		goto out_wb_end;
+	/*
+	 * Generic write_inode doesn't distinguish between sync and datasync,
+	 * so even a datasync can clear the sync state. Filesystems which
+	 * distiguish these cases must only clear 'mask' in their metadata
+	 * sync code.
+	 */
+	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+
+	spin_unlock(&inode_lock);
+	ret = write_inode(inode, &wbc);
+	spin_lock(&inode_lock);
+	if (ret)
+		inode->i_state |= dirty; /* couldn't write out inode */
+
+out_wb_end:
+	inode_writeback_end(inode);
+
+out:
+	spin_unlock(&inode_lock);
+	return ret;
 }
 EXPORT_SYMBOL(sync_inode_metadata);
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-12-16 18:33:16.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-12-16 18:33:29.000000000 +1100
@@ -895,7 +895,7 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	err = sync_inode_metadata(inode, 1);
+	err = sync_inode_metadata(inode, datasync, 1);
 	if (ret == 0)
 		ret = err;
 	return ret;
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2010-12-16 18:29:06.000000000 +1100
+++ linux-2.6/fs/nfsd/vfs.c	2010-12-16 18:33:29.000000000 +1100
@@ -287,7 +287,7 @@ commit_metadata(struct svc_fh *fhp)
 
 	if (export_ops->commit_metadata)
 		return export_ops->commit_metadata(inode);
-	return sync_inode_metadata(inode, 1);
+	return sync_inode_metadata(inode, 0, 1);
 }
 
 /*
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-12-16 18:33:17.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-12-16 18:33:29.000000000 +1100
@@ -1765,7 +1765,7 @@ static inline void file_accessed(struct
 }
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
-int sync_inode_metadata(struct inode *inode, int wait);
+int sync_inode_metadata(struct inode *inode, int datasync, int wait);
 int inode_writeback_begin(struct inode *inode, int wait);
 int inode_writeback_end(struct inode *inode);
 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
                   ` (5 preceding siblings ...)
  2010-12-18  1:46 ` [patch 6/8] fs: fsync optimisations Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-29 15:01   ` Christoph Hellwig
  2010-12-18  1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
  7 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-fix-dirty-flags.patch --]
[-- Type: text/plain, Size: 11209 bytes --]

Checking I_DIRTY* bits is racy unless we're under I_SYNC, because there
might be a concurrent writeback thread that has cleared I_DIRTY, but
has not yet completed the ->write_inode call.

This means our (typically integrity/fsync) operation can finish before
previously dirty data has been written safely to disk.

Solve it by exporting inode_writeback_begin/end to filesystems, and have
them use that before checking dirty flags, where it matters.

I'm not thrilled at exporting inode_lock, however I didn't see a good way
to make that code fully generic (eg. some cases want to avoid taking
and dropping locks if there is nothing to do, future users might want to
be smarter about dirty bits and keep their own inode dirty bits in synch
under the same lock, etc).

Not signed off yet

Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-12-18 03:04:10.000000000 +1100
@@ -373,6 +373,7 @@ static int pohmelfs_write_inode_create_c
 		dprintk("%s: parent: %llu, ino: %llu, inode: %p.\n",
 				__func__, parent->ino, n->ino, inode);
 
+		/* XXX: is this race WRT writeback? */
 		if (inode && (inode->i_state & I_DIRTY)) {
 			struct pohmelfs_inode *pi = POHMELFS_I(inode);
 			pohmelfs_write_create_inode(pi);
Index: linux-2.6/fs/gfs2/file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/file.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/gfs2/file.c	2010-12-18 03:04:10.000000000 +1100
@@ -557,7 +557,7 @@ static int gfs2_close(struct inode *inod
 static int gfs2_fsync(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
+	unsigned dirty, mask;
 	int ret = 0;
 
 	if (gfs2_is_jdata(GFS2_I(inode))) {
@@ -565,13 +565,35 @@ static int gfs2_fsync(struct file *file,
 		return 0;
 	}
 
-	if (sync_state != 0) {
-		if (!datasync)
-			ret = write_inode_now(inode, 0);
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
 
-		if (gfs2_is_stuffed(GFS2_I(inode)))
-			gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	if (dirty) {
+		spin_unlock(&inode_lock);
+
+		if (!datasync) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_ALL,
+			};
+			ret = inode->i_sb->s_op->write_inode(inode, &wbc);
+		} else {
+			if (gfs2_is_stuffed(GFS2_I(inode)))
+				gfs2_log_flush(GFS2_SB(inode),
+						GFS2_I(inode)->i_gl);
+		}
+
+		spin_lock(&inode_lock);
 	}
+	if (ret)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
 
 	return ret;
 }
Index: linux-2.6/fs/jffs2/fs.c
===================================================================
--- linux-2.6.orig/fs/jffs2/fs.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/jffs2/fs.c	2010-12-18 03:04:10.000000000 +1100
@@ -361,6 +361,7 @@ void jffs2_dirty_inode(struct inode *ino
 {
 	struct iattr iattr;
 
+	/* XXX: huh? How does this make sense? */
 	if (!(inode->i_state & I_DIRTY_DATASYNC)) {
 		D2(printk(KERN_DEBUG "jffs2_dirty_inode() not calling setattr() for ino #%lu\n", inode->i_ino));
 		return;
Index: linux-2.6/fs/jfs/file.c
===================================================================
--- linux-2.6.orig/fs/jfs/file.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/jfs/file.c	2010-12-18 03:04:10.000000000 +1100
@@ -19,6 +19,7 @@
 
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/writeback.h>
 #include <linux/quotaops.h>
 #include "jfs_incore.h"
 #include "jfs_inode.h"
@@ -31,18 +32,34 @@
 int jfs_fsync(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	int rc = 0;
+	unsigned dirty, mask;
+	int err = 0;
 
-	if (!(inode->i_state & I_DIRTY) ||
-	    (datasync && !(inode->i_state & I_DIRTY_DATASYNC))) {
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+
+	if (!dirty) {
 		/* Make sure committed changes hit the disk */
 		jfs_flush_journal(JFS_SBI(inode->i_sb)->log, 1);
-		return rc;
+	} else {
+		err = jfs_commit_inode(inode, 1);
 	}
 
-	rc |= jfs_commit_inode(inode, 1);
+	spin_lock(&inode_lock);
+	if (err)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
 
-	return rc ? -EIO : 0;
+	return err ? -EIO : 0;
 }
 
 static int jfs_open(struct inode *inode, struct file *file)
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/nfsd/vfs.c	2010-12-18 03:04:10.000000000 +1100
@@ -969,10 +969,9 @@ static int wait_for_concurrent_writes(st
 		dprintk("nfsd: write resume %d\n", task_pid_nr(current));
 	}
 
-	if (inode->i_state & I_DIRTY) {
-		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-		err = vfs_fsync(file, 0);
-	}
+	dprintk("nfsd: write sync %d\n", task_pid_nr(current));
+	err = vfs_fsync(file, 0);
+
 	last_ino = inode->i_ino;
 	last_dev = inode->i_sb->s_dev;
 	return err;
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/ocfs2/file.c	2010-12-18 03:04:10.000000000 +1100
@@ -176,12 +176,24 @@ static int ocfs2_sync_file(struct file *
 	journal_t *journal;
 	struct inode *inode = file->f_mapping->host;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	unsigned dirty, mask;
 
 	mlog_entry("(0x%p, %d, 0x%p, '%.*s')\n", file, datasync,
 		   file->f_path.dentry, file->f_path.dentry->d_name.len,
 		   file->f_path.dentry->d_name.name);
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC)) {
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+
+	if (datasync && dirty) {
+
 		/*
 		 * We still have to flush drive's caches to get data to the
 		 * platter
@@ -195,6 +207,12 @@ static int ocfs2_sync_file(struct file *
 	err = jbd2_journal_force_commit(journal);
 
 bail:
+	spin_lock(&inode_lock);
+	if (err)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
+
 	mlog_exit(err);
 
 	return (err < 0) ? -EIO : 0;
Index: linux-2.6/fs/ubifs/file.c
===================================================================
--- linux-2.6.orig/fs/ubifs/file.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/ubifs/file.c	2010-12-18 03:04:10.000000000 +1100
@@ -1313,11 +1313,9 @@ int ubifs_fsync(struct file *file, int d
 	 * VFS has already synchronized dirty pages for this inode. Synchronize
 	 * the inode unless this is a 'datasync()' call.
 	 */
-	if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
-		err = inode->i_sb->s_op->write_inode(inode, NULL);
-		if (err)
-			return err;
-	}
+	err = sync_inode_metadata(inode, datasync, 1);
+	if (err)
+		return err;
 
 	/*
 	 * Nodes related to this inode may still sit in a write-buffer. Flush
Index: linux-2.6/fs/ufs/truncate.c
===================================================================
--- linux-2.6.orig/fs/ufs/truncate.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/ufs/truncate.c	2010-12-18 03:04:10.000000000 +1100
@@ -479,7 +479,7 @@ int ufs_truncate(struct inode *inode, lo
 		retry |= ufs_trunc_tindirect (inode);
 		if (!retry)
 			break;
-		if (IS_SYNC(inode) && (inode->i_state & I_DIRTY))
+		if (IS_SYNC(inode))
 			ufs_sync_inode (inode);
 		blk_run_address_space(inode->i_mapping);
 		yield();
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2010-12-18 03:51:34.000000000 +1100
@@ -99,6 +99,7 @@ xfs_file_fsync(
 	struct xfs_trans	*tp;
 	int			error = 0;
 	int			log_flushed = 0;
+	unsigned		dirty, mask;
 
 	trace_xfs_file_fsync(ip);
 
@@ -110,6 +111,25 @@ xfs_file_fsync(
 	xfs_ioend_wait(ip);
 
 	/*
+	 * First check if the VFS inode is marked dirty.  All the dirtying
+	 * of non-transactional updates no goes through mark_inode_dirty*,
+	 * which allows us to distinguish beteeen pure timestamp updates
+	 * and i_size updates which need to be caught for fdatasync.
+	 * After that also theck for the dirty state in the XFS inode, which
+	 * might gets cleared when the inode gets written out via the AIL
+	 * or xfs_iflush_cluster.
+	 */
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+
+	/*
 	 * We always need to make sure that the required inode state is safe on
 	 * disk.  The inode might be clean but we still might need to force the
 	 * log because of committed transactions that haven't hit the disk yet.
@@ -123,18 +143,7 @@ xfs_file_fsync(
 	 */
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 
-	/*
-	 * First check if the VFS inode is marked dirty.  All the dirtying
-	 * of non-transactional updates no goes through mark_inode_dirty*,
-	 * which allows us to distinguish beteeen pure timestamp updates
-	 * and i_size updates which need to be caught for fdatasync.
-	 * After that also theck for the dirty state in the XFS inode, which
-	 * might gets cleared when the inode gets written out via the AIL
-	 * or xfs_iflush_cluster.
-	 */
-	if (((inode->i_state & I_DIRTY_DATASYNC) ||
-	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) &&
-	    ip->i_update_core) {
+	if (dirty && ip->i_update_core) {
 		/*
 		 * Kick off a transaction to log the inode core to get the
 		 * updates.  The sync transaction will also force the log.
@@ -145,7 +154,7 @@ xfs_file_fsync(
 				XFS_FSYNC_TS_LOG_RES(ip->i_mount), 0, 0, 0);
 		if (error) {
 			xfs_trans_cancel(tp, 0);
-			return -error;
+			goto out;
 		}
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 
@@ -197,6 +206,13 @@ xfs_file_fsync(
 			xfs_blkdev_issue_flush(ip->i_mount->m_rtdev_targp);
 	}
 
+out:
+	spin_lock(&inode_lock);
+	if (error)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
+
 	return -error;
 }
 
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-12-18 03:03:37.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-12-18 03:04:10.000000000 +1100
@@ -82,6 +82,7 @@ static struct hlist_head *inode_hashtabl
  * the i_state of an inode while it is in use..
  */
 DEFINE_SPINLOCK(inode_lock);
+EXPORT_SYMBOL(inode_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [patch 8/8] fs: add i_op->sync_inode
  2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
                   ` (6 preceding siblings ...)
  2010-12-18  1:46 ` [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems Nick Piggin
@ 2010-12-18  1:46 ` Nick Piggin
  2010-12-29 15:12   ` Christoph Hellwig
  2011-01-07 19:06   ` Ted Ts'o
  7 siblings, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2010-12-18  1:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel; +Cc: Andrew Morton

[-- Attachment #1: fs-add-sync_inode.patch --]
[-- Type: text/plain, Size: 24551 bytes --]

Add a sync_inode i_op, for filesystems that can fsync with the inode and
not requiring the file (which is most of them), and use it in filesystems
and nfsd to correctly sync things.

Get rid of sync_inode, unexport sync_inodes_sb and make it internal, remove the
'wait' parameter from sync_inode_metadata, and document these things a little
better.

The core of the problem is that ->write_inode(inode, sync==1) is not
guaranteed to have the inode metadata on disk. This is because filesystems
that do asynchronous ->write_inode stuff don't "obey" sync==1. Thus, generic
code cannot assume that ->write_inode is all that is required for a full
sync. (filesystem specific code can, but that's a bit ugly I'd like to phase
it out or clarify with different function names).

Not signed off yet

---
 Documentation/filesystems/porting |    6 ++++
 drivers/staging/pohmelfs/inode.c  |    2 -
 fs/ext2/dir.c                     |    4 ++
 fs/ext2/ext2.h                    |    4 +-
 fs/ext2/file.c                    |   32 +++------------------
 fs/ext2/inode.c                   |   38 ++++++++++++++++++++++++-
 fs/ext2/namei.c                   |    1 
 fs/ext2/xattr.c                   |    4 ++
 fs/fat/file.c                     |   33 ++++++++++------------
 fs/fat/inode.c                    |   21 +++++---------
 fs/fs-writeback.c                 |   23 ---------------
 fs/internal.h                     |    5 +++
 fs/libfs.c                        |   57 ++++++++++++++++++++++++++++++++++++++
 fs/nfs/write.c                    |   16 ++++------
 fs/nfsd/vfs.c                     |    3 ++
 fs/sync.c                         |    8 +++++
 include/linux/fs.h                |   12 +++++++-
 include/linux/writeback.h         |    1 
 18 files changed, 172 insertions(+), 98 deletions(-)

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/libfs.c	2010-12-18 03:51:40.000000000 +1100
@@ -880,6 +880,64 @@ struct dentry *generic_fh_to_parent(stru
 EXPORT_SYMBOL_GPL(generic_fh_to_parent);
 
 /**
+ * generic_sync_inode - generic sync implementation for simple filesystems
+ * @inode:	inode to synchronize
+ * @flags:	INODE_SYNC_ flag
+ * @start:	start range (if INODE_SYNC_DATA is set)
+ * @end:	end range (if INODE_SYNC_DATA is set)
+ * @Returns:	0 on success, otherwise -errno.
+ *
+ * This is a generic implementation of the sync_inode method for simple
+ * filesystems which track all non-inode metadata in the buffers list
+ * hanging off the address_space structure. Their ->write_inode call must
+ * _always_ synchronously write back inode metadata (regardless if it was
+ * called for sync or async, because there is no other way to get it on
+ * disk).
+ *
+ * More advanced filesystems will want to always asynchronously dirty inode
+ * metadata in ->write_inode (regardless if it is sync or async), and then
+ * synchronously verify that it is on disk in their ->fsync, ->sync_inode,
+ * and ->sync_fs routines. generic_sync_inode could still be used in their
+ * sync_inode call, so long as they subsequently verify the metadata is on
+ * disk.
+ */
+int generic_sync_inode(struct inode *inode, unsigned int flags,
+			loff_t start, loff_t end)
+{
+	struct address_space *mapping = inode->i_mapping;
+	int err, ret = 0;
+
+	if (flags & INODE_SYNC_DATA) {
+		err = filemap_write_and_wait_range(mapping, start, end);
+		if (!ret)
+			ret = err;
+	}
+
+	if (flags & INODE_SYNC_DATA_METADATA) {
+		/*
+		 * We need to protect against concurrent writers, which could
+		 * cause livelocks in fsync_buffers_list().
+		 */
+		mutex_lock(&inode->i_mutex);
+		err = sync_mapping_buffers(mapping);
+		if (!ret)
+			ret = err;
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	if (flags & (INODE_SYNC_DATA_METADATA | INODE_SYNC_METADATA)) {
+		int datasync = !(flags & INODE_SYNC_METADATA);
+
+		err = sync_inode_metadata(inode, datasync);
+		if (!ret)
+			ret = err;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(generic_sync_inode);
+
+/**
  * generic_file_fsync - generic fsync implementation for simple filesystems
  * @file:	file to synchronize
  * @datasync:	only synchronize essential metadata if true
@@ -895,7 +953,7 @@ int generic_file_fsync(struct file *file
 	int ret;
 
 	ret = sync_mapping_buffers(inode->i_mapping);
-	err = sync_inode_metadata(inode, datasync, 1);
+	err = sync_inode_metadata(inode, datasync);
 	if (ret == 0)
 		ret = err;
 	return ret;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/include/linux/fs.h	2010-12-18 03:51:40.000000000 +1100
@@ -1545,6 +1545,16 @@ struct file_operations {
 	int (*setlease)(struct file *, long, struct file_lock **);
 };
 
+#define INODE_SYNC_DATA			0x01
+#define INODE_SYNC_DATA_METADATA	0x02
+#define INODE_SYNC_METADATA		0x04
+
+#define INODE_SYNC_FSYNC		(INODE_SYNC_DATA |		\
+					 INODE_SYNC_DATA_METADATA |	\
+					 INODE_SYNC_METADATA )
+#define INODE_SYNC_FDATASYNC		(INODE_SYNC_DATA |		\
+					 INODE_SYNC_DATA_METADATA)
+
 struct inode_operations {
 	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
 	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
@@ -1573,6 +1583,7 @@ struct inode_operations {
 			  loff_t len);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
+	int (*sync_inode)(struct inode *, unsigned int, loff_t, loff_t);
 };
 
 struct seq_file;
@@ -1764,10 +1775,11 @@ static inline void file_accessed(struct
 		touch_atime(file->f_path.mnt, file->f_path.dentry);
 }
 
-int sync_inode(struct inode *inode, struct writeback_control *wbc);
-int sync_inode_metadata(struct inode *inode, int datasync, int wait);
+int sync_inode_metadata(struct inode *inode, int datasync);
 int inode_writeback_begin(struct inode *inode, int wait);
 int inode_writeback_end(struct inode *inode);
+int generic_sync_inode(struct inode *inode, unsigned int flags,
+			loff_t start, loff_t end);
 
 struct file_system_type {
 	const char *name;
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h	2010-12-18 02:08:42.000000000 +1100
+++ linux-2.6/fs/ext2/ext2.h	2010-12-18 03:51:40.000000000 +1100
@@ -124,10 +124,10 @@ extern int ext2_get_block(struct inode *
 extern int ext2_setattr (struct dentry *, struct iattr *);
 extern void ext2_set_inode_flags(struct inode *inode);
 extern void ext2_get_inode_flags(struct ext2_inode_info *);
-extern struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
-					struct buffer_head **p);
 extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext2_sync_inode(struct inode *inode, unsigned int flags,
+			loff_t start, loff_t end);
 
 /* ioctl.c */
 extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c	2010-12-18 02:08:42.000000000 +1100
+++ linux-2.6/fs/ext2/file.c	2010-12-18 03:51:40.000000000 +1100
@@ -43,35 +43,12 @@ static int ext2_release_file (struct ino
 
 int ext2_fsync(struct file *file, int datasync)
 {
-	int ret;
 	struct inode *inode = file->f_mapping->host;
-	ino_t ino = inode->i_ino;
-	struct super_block *sb = inode->i_sb;
-	struct address_space *sb_mapping = sb->s_bdev->bd_inode->i_mapping;
-	struct buffer_head *bh;
-	struct ext2_inode *raw_inode;
+	unsigned int flags = INODE_SYNC_DATA;
+	if (!datasync)
+		flags |= INODE_SYNC_METADATA;
 
-	ret = generic_file_fsync(file, datasync);
-	if (ret == -EIO || test_and_clear_bit(AS_EIO, &sb_mapping->flags)) {
-		/* We don't really know where the IO error happened... */
-		ext2_error(sb, __func__,
-			   "detected IO error when writing metadata buffers");
-		return -EIO;
-	}
-
-	raw_inode = ext2_get_inode(sb, ino, &bh);
-	if (IS_ERR(raw_inode))
- 		return -EIO;
-
-	sync_dirty_buffer(bh);
-	if (buffer_req(bh) && !buffer_uptodate(bh)) {
-		printk ("IO error syncing ext2 inode [%s:%08lx]\n",
-			sb->s_id, (unsigned long) ino);
-		ret = -EIO;
-	}
-	brelse (bh);
-
-	return ret;
+	return ext2_sync_inode(inode, flags, 0, LLONG_MAX);
 }
 
 /*
@@ -122,4 +99,5 @@ const struct inode_operations ext2_file_
 	.setattr	= ext2_setattr,
 	.check_acl	= ext2_check_acl,
 	.fiemap		= ext2_fiemap,
+	.sync_inode	= ext2_sync_inode,
 };
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c	2010-12-18 03:51:40.000000000 +1100
@@ -1202,13 +1202,14 @@ static int ext2_setsize(struct inode *in
 
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
-		sync_mapping_buffers(inode->i_mapping);
-		sync_inode_metadata(inode, 0, 1);
+		error = inode->i_op->sync_inode(inode,
+				INODE_SYNC_DATA_METADATA | INODE_SYNC_METADATA,
+				0, LLONG_MAX);
 	} else {
 		mark_inode_dirty(inode);
 	}
 
-	return 0;
+	return error;
 }
 
 struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
@@ -1515,6 +1516,39 @@ int ext2_write_inode(struct inode *inode
 	return __ext2_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
 }
 
+int ext2_sync_inode(struct inode *inode, unsigned int flags,
+			loff_t start, loff_t end)
+{
+	int ret;
+	ino_t ino = inode->i_ino;
+	struct super_block *sb = inode->i_sb;
+	struct address_space *sb_mapping = sb->s_bdev->bd_inode->i_mapping;
+	struct buffer_head *bh;
+	struct ext2_inode *raw_inode;
+
+	ret = generic_sync_inode(inode, flags, start, end);
+	if (ret == -EIO || test_and_clear_bit(AS_EIO, &sb_mapping->flags)) {
+		/* We don't really know where the IO error happened... */
+		ext2_error(sb, __func__,
+			   "detected IO error when syncing inode");
+		return -EIO;
+	}
+
+	raw_inode = ext2_get_inode(sb, ino, &bh);
+	if (IS_ERR(raw_inode))
+ 		return -EIO;
+
+	sync_dirty_buffer(bh);
+	if (buffer_req(bh) && !buffer_uptodate(bh)) {
+		printk ("IO error syncing ext2 inode [%s:%08lx]\n",
+			sb->s_id, (unsigned long) ino);
+		ret = -EIO;
+	}
+	brelse (bh);
+
+	return ret;
+}
+
 int ext2_setattr(struct dentry *dentry, struct iattr *iattr)
 {
 	struct inode *inode = dentry->d_inode;
Index: linux-2.6/fs/ext2/namei.c
===================================================================
--- linux-2.6.orig/fs/ext2/namei.c	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/fs/ext2/namei.c	2010-12-18 03:51:40.000000000 +1100
@@ -418,6 +418,7 @@ const struct inode_operations ext2_dir_i
 #endif
 	.setattr	= ext2_setattr,
 	.check_acl	= ext2_check_acl,
+	.sync_inode	= ext2_sync_inode,
 };
 
 const struct inode_operations ext2_special_inode_operations = {
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/ext2/dir.c	2010-12-18 03:51:40.000000000 +1100
@@ -98,7 +98,9 @@ static int ext2_commit_chunk(struct page
 	if (IS_DIRSYNC(dir)) {
 		err = write_one_page(page, 1);
 		if (!err)
-			err = sync_inode_metadata(dir, 0, 1);
+			err = dir->i_op->sync_inode(dir,
+				INODE_SYNC_DATA_METADATA | INODE_SYNC_METADATA,
+				0, LLONG_MAX);
 	} else {
 		unlock_page(page);
 	}
Index: linux-2.6/fs/ext2/xattr.c
===================================================================
--- linux-2.6.orig/fs/ext2/xattr.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/ext2/xattr.c	2010-12-18 03:51:40.000000000 +1100
@@ -699,7 +699,9 @@ ext2_xattr_set2(struct inode *inode, str
 	EXT2_I(inode)->i_file_acl = new_bh ? new_bh->b_blocknr : 0;
 	inode->i_ctime = CURRENT_TIME_SEC;
 	if (IS_SYNC(inode)) {
-		error = sync_inode_metadata(inode, 0, 1);
+		error = inode->i_op->sync_inode(inode,
+				INODE_SYNC_METADATA,
+				0, LLONG_MAX);
 		/* In case sync failed due to ENOSPC the inode was actually
 		 * written (only some dirty data were not) so we just proceed
 		 * as if nothing happened and cleanup the unused block */
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2010-12-18 03:04:10.000000000 +1100
+++ linux-2.6/fs/nfsd/vfs.c	2010-12-18 03:51:40.000000000 +1100
@@ -287,7 +287,10 @@ commit_metadata(struct svc_fh *fhp)
 
 	if (export_ops->commit_metadata)
 		return export_ops->commit_metadata(inode);
-	return sync_inode_metadata(inode, 0, 1);
+	if (inode->i_op->sync_inode)
+		return inode->i_op->sync_inode(inode, INODE_SYNC_METADATA,
+						0, 0);
+	return sync_inode_metadata(inode, 0);
 }
 
 /*
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/fs/sync.c	2010-12-18 03:51:40.000000000 +1100
@@ -142,8 +142,16 @@ void emergency_sync(void)
 int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	int err, ret;
 
+	if (inode->i_op->sync_inode) {
+		unsigned int flags = INODE_SYNC_DATA | INODE_SYNC_DATA_METADATA;
+		if (!datasync)
+			flags |= INODE_SYNC_METADATA;
+		return inode->i_op->sync_inode(inode, flags, start, end);
+	}
+
 	if (!file->f_op || !file->f_op->fsync) {
 		ret = -EINVAL;
 		goto out;
Index: linux-2.6/drivers/staging/pohmelfs/inode.c
===================================================================
--- linux-2.6.orig/drivers/staging/pohmelfs/inode.c	2010-12-18 03:04:10.000000000 +1100
+++ linux-2.6/drivers/staging/pohmelfs/inode.c	2010-12-18 03:51:40.000000000 +1100
@@ -884,7 +884,7 @@ static int pohmelfs_fsync(struct file *f
 {
 	struct inode *inode = file->f_mapping->host;
 
-	return sync_inode_metadata(inode, datasync, 1);
+	return sync_inode_metadata(inode, datasync);
 }
 
 ssize_t pohmelfs_write(struct file *file, const char __user *buf,
@@ -1951,7 +1951,7 @@ static struct dentry *pohmelfs_mount(str
  */
 static void pohmelfs_kill_super(struct super_block *sb)
 {
-	sync_inodes_sb(sb);
+	sync_filesystem(sb);
 	kill_anon_super(sb);
 }
 
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/fs/internal.h	2010-12-18 03:51:40.000000000 +1100
@@ -108,3 +108,8 @@ extern void release_open_intent(struct n
 extern int get_nr_dirty_inodes(void);
 extern void evict_inodes(struct super_block *);
 extern int invalidate_inodes(struct super_block *);
+
+/*
+ * fs-writeback.c
+ */
+extern void sync_inodes_sb(struct super_block *);
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/include/linux/writeback.h	2010-12-18 03:51:40.000000000 +1100
@@ -61,7 +61,6 @@ void writeback_inodes_sb(struct super_bl
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
-void sync_inodes_sb(struct super_block *);
 void writeback_inodes_wb(struct bdi_writeback *wb,
 		struct writeback_control *wbc);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-12-18 03:51:40.000000000 +1100
@@ -1229,7 +1229,8 @@ EXPORT_SYMBOL(writeback_inodes_sb_nr_if_
  * @sb: the superblock
  *
  * This function writes and waits on any dirty inode belonging to this
- * super_block. The number of pages synced is returned.
+ * super_block. Metadata may not be guaranteed to be on disk until a
+ * subsequent ->sync_fs call. The number of pages synced is returned.
  */
 void sync_inodes_sb(struct super_block *sb)
 {
@@ -1249,15 +1250,19 @@ void sync_inodes_sb(struct super_block *
 
 	wait_sb_inodes(sb);
 }
-EXPORT_SYMBOL(sync_inodes_sb);
 
 /**
- * write_inode_now	-	write an inode to disk
+ * write_inode_now	-	write an inode to disk (maybe)
  * @inode: inode to write to disk
  * @sync: whether the write should be synchronous or not
  *
- * This function commits an inode to disk immediately if it is dirty. This is
- * primarily needed by knfsd.
+ * This function writes an inode's data to disk, and then calls ->write_inode
+ * on the metadata if the inode metadata dirty bits are set. This doesn't
+ * necessarily sync the inode to disk, depending on whether the filesystem
+ * syncs its inode at ->write_inode time or not. So this function is not
+ * portable outside a specific filesystem when sync == 1.
+ *
+ * Do not add any new users of this function.
  *
  * The caller must either have a ref on the inode or must have set I_WILL_FREE.
  */
@@ -1285,41 +1290,17 @@ int write_inode_now(struct inode *inode,
 EXPORT_SYMBOL(write_inode_now);
 
 /**
- * sync_inode - write an inode and its pages to disk.
+ * sync_inode_metadata - call ->write_inode if needed
  * @inode: the inode to sync
- * @wbc: controls the writeback mode
- *
- * sync_inode() will write an inode and its pages to disk.  It will also
- * correctly update the inode on its superblock's dirty inode lists and will
- * update inode->i_state.
- *
- * The caller must have a ref on the inode.
- */
-int sync_inode(struct inode *inode, struct writeback_control *wbc)
-{
-	int ret;
-
-	spin_lock(&inode_lock);
-	ret = writeback_single_inode(inode, wbc);
-	spin_unlock(&inode_lock);
-	return ret;
-}
-EXPORT_SYMBOL(sync_inode);
-
-/**
- * sync_inode_metadata - write an inode to disk
- * @inode: the inode to sync
- * @wait: wait for I/O to complete.
- *
- * Write an inode to disk and adjust it's dirty state after completion.
+ * @datasync: look at datasync bits only
  *
- * Note: only writes the actual inode, no associated data or other metadata.
+ * This only calls ->write_inode, which doesn't necessarily ensure anything
+ * on disk. It should only be used as an fs helper.
  */
-int sync_inode_metadata(struct inode *inode, int datasync, int wait)
+int sync_inode_metadata(struct inode *inode, int datasync)
 {
-	struct address_space *mapping = inode->i_mapping;
 	struct writeback_control wbc = {
-		.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
+		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = 0, /* metadata-only */
 	};
 	unsigned dirty, mask;
@@ -1330,7 +1311,7 @@ int sync_inode_metadata(struct inode *in
 	 * Keep them in sync.
 	 */
 	spin_lock(&inode_lock);
-	if (!inode_writeback_begin(inode, wait))
+	if (!inode_writeback_begin(inode, 1))
 		goto out;
 
 	if (datasync)
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/fs/fat/file.c	2010-12-18 03:51:40.000000000 +1100
@@ -160,7 +160,6 @@ int fat_file_fsync(struct file *filp, in
 	return res ? res : err;
 }
 
-
 const struct file_operations fat_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -192,22 +191,9 @@ static int fat_cont_expand(struct inode
 	if (IS_SYNC(inode)) {
 		int err2;
 
-		/*
-		 * Opencode syncing since we don't have a file open to use
-		 * standard fsync path.
-		 */
-		err = filemap_fdatawrite_range(mapping, start,
-					       start + count - 1);
-		err2 = sync_mapping_buffers(mapping);
-		if (!err)
-			err = err2;
-		err2 = write_inode_now(inode, 1);
-		if (!err)
-			err = err2;
-		if (!err) {
-			err =  filemap_fdatawait_range(mapping, start,
-						       start + count - 1);
-		}
+		err = inode->i_op->sync_inode(inode,
+				INODE_SYNC_DATA | INODE_SYNC_DATA_METADATA |
+				INODE_SYNC_METADATA, start, start + count - 1);
 	}
 out:
 	return err;
@@ -440,7 +426,20 @@ int fat_setattr(struct dentry *dentry, s
 }
 EXPORT_SYMBOL_GPL(fat_setattr);
 
+static int fat_inode_sync_inode(struct inode *inode, unsigned int flags,
+			loff_t start, loff_t end)
+{
+	int err;
+
+	err = generic_sync_inode(inode, flags, start, end);
+	if (err)
+		return err;
+	err = sync_mapping_buffers(MSDOS_SB(inode->i_sb)->fat_inode->i_mapping);
+	return err;
+}
+
 const struct inode_operations fat_file_inode_operations = {
 	.setattr	= fat_setattr,
 	.getattr	= fat_getattr,
+	.sync_inode	= fat_inode_sync_inode,
 };
Index: linux-2.6/Documentation/filesystems/porting
===================================================================
--- linux-2.6.orig/Documentation/filesystems/porting	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/Documentation/filesystems/porting	2010-12-18 03:51:40.000000000 +1100
@@ -318,3 +318,9 @@ if it's zero is not *and* *never* *had*
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+--
+[mandatory]
+	Inode writeback and syncing has undergone an overhaul.
+write_inode_now doesn't sync (unless ->write_inode *always* syncs)
+
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2010-12-18 02:08:42.000000000 +1100
+++ linux-2.6/fs/fat/inode.c	2010-12-18 03:51:40.000000000 +1100
@@ -1534,20 +1534,15 @@ EXPORT_SYMBOL_GPL(fat_fill_super);
  */
 static int writeback_inode(struct inode *inode)
 {
-
-	int ret;
 	struct address_space *mapping = inode->i_mapping;
-	struct writeback_control wbc = {
-	       .sync_mode = WB_SYNC_NONE,
-	      .nr_to_write = 0,
-	};
-	/* if we used WB_SYNC_ALL, sync_inode waits for the io for the
-	* inode to finish.  So WB_SYNC_NONE is sent down to sync_inode
-	* and filemap_fdatawrite is used for the data blocks
-	*/
-	ret = sync_inode(inode, &wbc);
-	if (!ret)
-	       ret = filemap_fdatawrite(mapping);
+	int ret;
+
+	ret = sync_inode_metadata(inode, 0);
+	if (ret)
+		return ret;
+
+	ret = filemap_fdatawrite(mapping);
+
 	return ret;
 }
 
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c	2010-12-18 00:32:44.000000000 +1100
+++ linux-2.6/fs/nfs/write.c	2010-12-18 03:51:40.000000000 +1100
@@ -1416,8 +1416,7 @@ int nfs_commit_inode(struct inode *inode
 	return res;
 	/* Note: If we exit without ensuring that the commit is complete,
 	 * we must mark the inode as dirty. Otherwise, future calls to
-	 * sync_inode() with the WB_SYNC_ALL flag set will fail to ensure
-	 * that the data is on the disk.
+	 * ->write_inode() will fail to ensure that the data is on the disk.
 	 */
 out_mark_dirty:
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -1472,14 +1471,13 @@ int nfs_write_inode(struct inode *inode,
  */
 int nfs_wb_all(struct inode *inode)
 {
-	struct writeback_control wbc = {
-		.sync_mode = WB_SYNC_ALL,
-		.nr_to_write = LONG_MAX,
-		.range_start = 0,
-		.range_end = LLONG_MAX,
-	};
+	int ret;
 
-	return sync_inode(inode, &wbc);
+	ret = filemap_fdatawrite(inode->i_mapping);
+	if (ret)
+		return ret;
+	ret = sync_inode_metadata(inode, 0);
+	return ret;
 }
 
 int nfs_wb_page_cancel(struct inode *inode, struct page *page)
Index: linux-2.6/fs/exofs/file.c
===================================================================
--- linux-2.6.orig/fs/exofs/file.c	2010-12-18 02:08:49.000000000 +1100
+++ linux-2.6/fs/exofs/file.c	2010-12-18 03:51:40.000000000 +1100
@@ -48,7 +48,7 @@ static int exofs_file_fsync(struct file
 	struct inode *inode = filp->f_mapping->host;
 	struct super_block *sb;
 
-	ret = sync_inode_metadata(inode, datasync, 1);
+	ret = sync_inode_metadata(inode, datasync);
 
 	/* This is a good place to write the sb */
 	/* TODO: Sechedule an sb-sync on create */
Index: linux-2.6/fs/ubifs/file.c
===================================================================
--- linux-2.6.orig/fs/ubifs/file.c	2010-12-18 03:04:10.000000000 +1100
+++ linux-2.6/fs/ubifs/file.c	2010-12-18 03:51:40.000000000 +1100
@@ -1313,7 +1313,7 @@ int ubifs_fsync(struct file *file, int d
 	 * VFS has already synchronized dirty pages for this inode. Synchronize
 	 * the inode unless this is a 'datasync()' call.
 	 */
-	err = sync_inode_metadata(inode, datasync, 1);
+	err = sync_inode_metadata(inode, datasync);
 	if (err)
 		return err;
 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-12-18  1:46 ` [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems Nick Piggin
@ 2010-12-29 15:01   ` Christoph Hellwig
  2011-01-03 15:03     ` Steven Whitehouse
  2011-01-04  6:04     ` Nick Piggin
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2010-12-29 15:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, Andrew Morton, mfasheh, joel.becker,
	swhiteho

As mentioned last round I think the exporting of inode_lock and pusing
of the I_DIRTY* complexities into the filesystems can be avoided.  See
the patch below, which compiles and passes xfstests for xfs, but
otherwise isn't quite done yet.  The only code change vs the opencoded
variant in the patch is that we do a useless inode_lock roundtrip
for a non-dirty inode on gfs2, which is I think is acceptable,
especially once we have the lock split anyway.

The other thing I don't like yet is passing the datasync flag - the
callback shouldn't care about what we were called with, but rather
about which bits it needs to sync out - which the dirty flag already
tells us about.

IMHO the behaviour in ocfs2 and gfs2 that relies on it is plain wrong:

 - ocfs2 really should always force the journal if any bit we care about
   in the inode is dirty, and only do the pure cache flush is nothing
   we care about is dirty (similar to the more complex code in XFS)
 - gfs2 seems really weird.  Doesn't it need to do any log force
   if an inode has a pending transaction?  Currently it only does for
   stuffed inodes, and if datasync was set, which seems weird.  Also
   I can't see why we'd never want to call into ->write_inode to write
   out the inode for the datasync case - except for not catching
   timestamp updates fdatasync really isn't any different from fsync.


Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-12-29 15:45:37.153003373 +0100
+++ linux-2.6/fs/fs-writeback.c	2010-12-29 15:46:13.566021881 +0100
@@ -315,7 +315,7 @@ static void inode_wait_for_writeback(str
  * If inode_writeback_begin returns 1, it must be followed by a call to
  * inode_writeback_end.
  */
-int inode_writeback_begin(struct inode *inode, int wait)
+static int inode_writeback_begin(struct inode *inode, int wait)
 {
 	assert_spin_locked(&inode_lock);
 	if (!atomic_read(&inode->i_count))
@@ -356,7 +356,7 @@ EXPORT_SYMBOL(inode_writeback_begin);
  * inode_writeback_end must follow a successful call to inode_writeback_begin
  * after we have finished submitting writeback to the inode.
  */
-int inode_writeback_end(struct inode *inode)
+static int inode_writeback_end(struct inode *inode)
 {
 	int ret = 1;
 
@@ -1362,3 +1362,45 @@ out:
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode_metadata);
+
+/**
+ * fsync_helper - helper to implement writeback-coherent fsync
+ * @inode:	inode to sync
+ * @datasync:	only synchronize essential metadata if true
+ * @sync_inode:	actual fsync implementation
+ *
+ * This function should be used by all filesystems implementing that
+ * use VFS writeback and implement a complex fsync method that does not
+ * use sync_inode_metadata().  It ensures correct participation in the
+ * writeback locking protocol, and leaves all real work to the @sync_inode
+ * callback.  Note that the callback is also called if the inode wasn't
+ * found to be dirty - it's dirty argument contains the exact dirty bits
+ * found in the inode before releasing the inode lock.
+ */
+int fsync_helper(struct inode *inode, int datasync,
+	int (*sync_inode)(struct inode *inode, int datasync, int dirty))
+{
+	unsigned dirty, mask;
+	int ret = 0;
+
+	spin_lock(&inode_lock);
+	inode_writeback_begin(inode, 1);
+
+	if (datasync)
+		mask = I_DIRTY_DATASYNC;
+	else
+		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
+	dirty = inode->i_state & mask;
+	inode->i_state &= ~mask;
+	spin_unlock(&inode_lock);
+
+	ret = sync_inode(inode, datasync, dirty);
+
+	spin_lock(&inode_lock);
+	if (ret)
+		inode->i_state |= dirty;
+	inode_writeback_end(inode);
+	spin_unlock(&inode_lock);
+	return ret;
+}
+EXPORT_SYMBOL(fsync_helper);
Index: linux-2.6/fs/gfs2/file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/file.c	2010-12-29 15:45:37.159003932 +0100
+++ linux-2.6/fs/gfs2/file.c	2010-12-29 15:46:13.566021881 +0100
@@ -535,6 +535,21 @@ static int gfs2_close(struct inode *inod
 	return 0;
 }
 
+static int gfs2_fsync_int(struct inode *inode, int datasync, int dirty)
+{
+	if (!dirty)
+		return 0;
+	else if (!datasync) {
+		struct writeback_control wbc = {
+			.sync_mode = WB_SYNC_ALL,
+		};
+		return inode->i_sb->s_op->write_inode(inode, &wbc);
+	} else if (gfs2_is_stuffed(GFS2_I(inode)))
+		gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
+
+	return 0;
+}
+
 /**
  * gfs2_fsync - sync the dirty data for a file (across the cluster)
  * @file: the file that points to the dentry (we ignore this)
@@ -546,56 +561,18 @@ static int gfs2_close(struct inode *inod
  * is set. For stuffed inodes we must flush the log in order to
  * ensure that all data is on disk.
  *
- * The call to write_inode_now() is there to write back metadata and
- * the inode itself. It does also try and write the data, but thats
- * (hopefully) a no-op due to the VFS having already called filemap_fdatawrite()
- * for us.
- *
  * Returns: errno
  */
-
 static int gfs2_fsync(struct file *file, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
-	unsigned dirty, mask;
-	int ret = 0;
 
 	if (gfs2_is_jdata(GFS2_I(inode))) {
 		gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
 		return 0;
 	}
 
-	spin_lock(&inode_lock);
-	inode_writeback_begin(inode, 1);
-
-	if (datasync)
-		mask = I_DIRTY_DATASYNC;
-	else
-		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
-	dirty = inode->i_state & mask;
-	inode->i_state &= ~mask;
-	if (dirty) {
-		spin_unlock(&inode_lock);
-
-		if (!datasync) {
-			struct writeback_control wbc = {
-				.sync_mode = WB_SYNC_ALL,
-			};
-			ret = inode->i_sb->s_op->write_inode(inode, &wbc);
-		} else {
-			if (gfs2_is_stuffed(GFS2_I(inode)))
-				gfs2_log_flush(GFS2_SB(inode),
-						GFS2_I(inode)->i_gl);
-		}
-
-		spin_lock(&inode_lock);
-	}
-	if (ret)
-		inode->i_state |= dirty;
-	inode_writeback_end(inode);
-	spin_unlock(&inode_lock);
-
-	return ret;
+	return fsync_helper(inode, datasync, gfs2_fsync_int);
 }
 
 /**
Index: linux-2.6/fs/jfs/file.c
===================================================================
--- linux-2.6.orig/fs/jfs/file.c	2010-12-29 15:45:37.166004630 +0100
+++ linux-2.6/fs/jfs/file.c	2010-12-29 15:46:43.278255573 +0100
@@ -29,37 +29,20 @@
 #include "jfs_acl.h"
 #include "jfs_debug.h"
 
-int jfs_fsync(struct file *file, int datasync)
+static int jfs_fsync_int(struct inode *inode, int datasync, int dirty)
 {
-	struct inode *inode = file->f_mapping->host;
-	unsigned dirty, mask;
-	int err = 0;
-
-	spin_lock(&inode_lock);
-	inode_writeback_begin(inode, 1);
-
-	if (datasync)
-		mask = I_DIRTY_DATASYNC;
-	else
-		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
-	dirty = inode->i_state & mask;
-	inode->i_state &= ~mask;
-	spin_unlock(&inode_lock);
-
 	if (!dirty) {
 		/* Make sure committed changes hit the disk */
 		jfs_flush_journal(JFS_SBI(inode->i_sb)->log, 1);
-	} else {
-		err = jfs_commit_inode(inode, 1);
+		return 0;
 	}
 
-	spin_lock(&inode_lock);
-	if (err)
-		inode->i_state |= dirty;
-	inode_writeback_end(inode);
-	spin_unlock(&inode_lock);
+	return jfs_commit_inode(inode, 1) ? -EIO : 0;
+}
 
-	return err ? -EIO : 0;
+int jfs_fsync(struct file *file, int datasync)
+{
+	return fsync_helper(file->f_mapping->host, datasync, jfs_fsync_int);
 }
 
 static int jfs_open(struct inode *inode, struct file *file)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2010-12-29 15:45:37.204004211 +0100
+++ linux-2.6/include/linux/fs.h	2010-12-29 15:46:13.573272824 +0100
@@ -1766,8 +1766,8 @@ static inline void file_accessed(struct
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
 int sync_inode_metadata(struct inode *inode, int datasync, int wait);
-int inode_writeback_begin(struct inode *inode, int wait);
-int inode_writeback_end(struct inode *inode);
+int fsync_helper(struct inode *inode, int datasync,
+	int (*sync_inode)(struct inode *inode, int datasync, int dirty));
 
 struct file_system_type {
 	const char *name;
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2010-12-29 15:45:37.174003862 +0100
+++ linux-2.6/fs/ocfs2/file.c	2010-12-29 15:46:54.843005610 +0100
@@ -170,52 +170,26 @@ static int ocfs2_dir_release(struct inod
 	return 0;
 }
 
-static int ocfs2_sync_file(struct file *file, int datasync)
+static int ocfs2_fsync_int(struct inode *inode, int datasync, int dirty)
 {
-	int err = 0;
-	journal_t *journal;
-	struct inode *inode = file->f_mapping->host;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
-	unsigned dirty, mask;
-
-	mlog_entry("(0x%p, %d, 0x%p, '%.*s')\n", file, datasync,
-		   file->f_path.dentry, file->f_path.dentry->d_name.len,
-		   file->f_path.dentry->d_name.name);
-
-	spin_lock(&inode_lock);
-	inode_writeback_begin(inode, 1);
-	if (datasync)
-		mask = I_DIRTY_DATASYNC;
-	else
-		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
-	dirty = inode->i_state & mask;
-	inode->i_state &= ~mask;
-	spin_unlock(&inode_lock);
 
 	if (datasync && dirty) {
-
 		/*
 		 * We still have to flush drive's caches to get data to the
 		 * platter
 		 */
 		if (osb->s_mount_opt & OCFS2_MOUNT_BARRIER)
 			blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
-		goto bail;
+		return 0;
 	}
 
-	journal = osb->journal->j_journal;
-	err = jbd2_journal_force_commit(journal);
-
-bail:
-	spin_lock(&inode_lock);
-	if (err)
-		inode->i_state |= dirty;
-	inode_writeback_end(inode);
-	spin_unlock(&inode_lock);
-
-	mlog_exit(err);
+	return jbd2_journal_force_commit(osb->journal->j_journal) ? -EIO : 0;
+}
 
-	return (err < 0) ? -EIO : 0;
+static int ocfs2_sync_file(struct file *file, int datasync)
+{
+	return fsync_helper(file->f_mapping->host, datasync, ocfs2_fsync_int);
 }
 
 int ocfs2_should_update_atime(struct inode *inode,
Index: linux-2.6/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_file.c	2010-12-29 15:45:37.182003862 +0100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_file.c	2010-12-29 15:47:55.253005677 +0100
@@ -90,16 +90,15 @@ xfs_iozero(
 }
 
 STATIC int
-xfs_file_fsync(
-	struct file		*file,
-	int			datasync)
+xfs_fsync_int(
+	struct inode		*inode,
+	int			datasync,
+	int			dirty)
 {
-	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_trans	*tp;
 	int			error = 0;
 	int			log_flushed = 0;
-	unsigned		dirty, mask;
 
 	trace_xfs_file_fsync(ip);
 
@@ -111,25 +110,6 @@ xfs_file_fsync(
 	xfs_ioend_wait(ip);
 
 	/*
-	 * First check if the VFS inode is marked dirty.  All the dirtying
-	 * of non-transactional updates no goes through mark_inode_dirty*,
-	 * which allows us to distinguish beteeen pure timestamp updates
-	 * and i_size updates which need to be caught for fdatasync.
-	 * After that also theck for the dirty state in the XFS inode, which
-	 * might gets cleared when the inode gets written out via the AIL
-	 * or xfs_iflush_cluster.
-	 */
-	spin_lock(&inode_lock);
-	inode_writeback_begin(inode, 1);
-	if (datasync)
-		mask = I_DIRTY_DATASYNC;
-	else
-		mask = I_DIRTY_SYNC | I_DIRTY_DATASYNC;
-	dirty = inode->i_state & mask;
-	inode->i_state &= ~mask;
-	spin_unlock(&inode_lock);
-
-	/*
 	 * We always need to make sure that the required inode state is safe on
 	 * disk.  The inode might be clean but we still might need to force the
 	 * log because of committed transactions that haven't hit the disk yet.
@@ -143,6 +123,15 @@ xfs_file_fsync(
 	 */
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 
+	/*
+	 * First check if the VFS inode is marked dirty.  All the dirtying
+	 * of non-transactional updates no goes through mark_inode_dirty*,
+	 * which allows us to distinguish beteeen pure timestamp updates
+	 * and i_size updates which need to be caught for fdatasync.
+	 * After that also theck for the dirty state in the XFS inode, which
+	 * might gets cleared when the inode gets written out via the AIL
+	 * or xfs_iflush_cluster.
+	 */
 	if (dirty && ip->i_update_core) {
 		/*
 		 * Kick off a transaction to log the inode core to get the
@@ -207,15 +196,17 @@ xfs_file_fsync(
 	}
 
 out:
-	spin_lock(&inode_lock);
-	if (error)
-		inode->i_state |= dirty;
-	inode_writeback_end(inode);
-	spin_unlock(&inode_lock);
-
 	return -error;
 }
 
+STATIC int
+xfs_file_fsync(
+	struct file		*file,
+	int			datasync)
+{
+	return fsync_helper(file->f_mapping->host, datasync, xfs_fsync_int);
+}
+
 STATIC ssize_t
 xfs_file_aio_read(
 	struct kiocb		*iocb,
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-12-29 15:45:37.190004281 +0100
+++ linux-2.6/fs/inode.c	2010-12-29 15:46:13.581004071 +0100
@@ -82,7 +82,6 @@ static struct hlist_head *inode_hashtabl
  * the i_state of an inode while it is in use..
  */
 DEFINE_SPINLOCK(inode_lock);
-EXPORT_SYMBOL(inode_lock);
 
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2010-12-18  1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
@ 2010-12-29 15:12   ` Christoph Hellwig
  2011-01-04  6:27     ` Nick Piggin
  2011-01-07 19:06   ` Ted Ts'o
  1 sibling, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2010-12-29 15:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel, Andrew Morton

 - the sync_inodes_sb export removal looks fine to, but should be a
   separate patch with a good changelog
 - why is the sync_inode_metadata wait parameter removed?  Especially
   as we would need it for some callers that need to be converted to it.
   E.g. in this patch converts writeback_inode from non-blocking to
   blocking behaviour.
 - except for that the sync_inode() removal is fine, I had planned for
   that already.  But again, please a separate and well-documented
   patch.

As for the actualy sync_inode operation:  I don't think it's helpful.
The *_sync_inode helpers in ext2 and fat are fine, but there's little
point in going through an iops vector for it.  Also adding the file
syncing really complicates the API for now reason, epecially with
the range interface.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-12-29 15:01   ` Christoph Hellwig
@ 2011-01-03 15:03     ` Steven Whitehouse
  2011-01-03 16:58       ` Christoph Hellwig
  2011-01-04  6:04     ` Nick Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Steven Whitehouse @ 2011-01-03 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton, mfasheh,
	joel.becker

Hi,

On Wed, 2010-12-29 at 10:01 -0500, Christoph Hellwig wrote:
> As mentioned last round I think the exporting of inode_lock and pusing
> of the I_DIRTY* complexities into the filesystems can be avoided.  See
> the patch below, which compiles and passes xfstests for xfs, but
> otherwise isn't quite done yet.  The only code change vs the opencoded
> variant in the patch is that we do a useless inode_lock roundtrip
> for a non-dirty inode on gfs2, which is I think is acceptable,
> especially once we have the lock split anyway.
> 
> The other thing I don't like yet is passing the datasync flag - the
> callback shouldn't care about what we were called with, but rather
> about which bits it needs to sync out - which the dirty flag already
> tells us about.
> 
> IMHO the behaviour in ocfs2 and gfs2 that relies on it is plain wrong:
> 
>  - ocfs2 really should always force the journal if any bit we care about
>    in the inode is dirty, and only do the pure cache flush is nothing
>    we care about is dirty (similar to the more complex code in XFS)
>  - gfs2 seems really weird.  Doesn't it need to do any log force
>    if an inode has a pending transaction?  Currently it only does for
>    stuffed inodes, and if datasync was set, which seems weird.  Also
>    I can't see why we'd never want to call into ->write_inode to write
>    out the inode for the datasync case - except for not catching
>    timestamp updates fdatasync really isn't any different from fsync.
> 
The algorithm was intended to be:

 - With "journaled data" files
   - Do a log flush conditional upon the inode's glock
   - The core code then writes back any dirty pages

 - With regular files/directories
  - If datasync is not set, we need to write back the metadata including
timestamp updates, so that is done via ->write_inode. Note that an extra
complication here is that we need to get the glock on the inode if we
don't already have it in order to check and conditionally update the
atime.The call to ->write_inode includes an implicit (conditional) log
flush.
 - If datasync is set, we assume that only the data pages need to be
written out. My understanding of datasync was that it was only supposed
to write out data and never any of the metadata. The reason for the call
to flush the log for "stuffed" files is that the data shares a disk
block with the inode metadata, so we cannot avoid the log flush in this
case, since we must unpin the block to write it back.

There is something strange going on here though since there should be a
metadata sync included as well I think - I'm just working back through
the changes to see where that was lost at the moment,

Steve.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-03 15:03     ` Steven Whitehouse
@ 2011-01-03 16:58       ` Christoph Hellwig
  2011-01-04  7:12         ` Nick Piggin
  2011-01-04 14:22         ` Steven Whitehouse
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-03 16:58 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Christoph Hellwig, Nick Piggin, linux-fsdevel, linux-kernel,
	Andrew Morton, mfasheh, joel.becker

On Mon, Jan 03, 2011 at 03:03:29PM +0000, Steven Whitehouse wrote:
> 
>  - With "journaled data" files
>    - Do a log flush conditional upon the inode's glock
>    - The core code then writes back any dirty pages

Any data writeback is done before calling ->fsync.

>  - With regular files/directories
>   - If datasync is not set, we need to write back the metadata including
> timestamp updates, so that is done via ->write_inode. Note that an extra
> complication here is that we need to get the glock on the inode if we
> don't already have it in order to check and conditionally update the
> atime.The call to ->write_inode includes an implicit (conditional) log
> flush.
>  - If datasync is set, we assume that only the data pages need to be
> written out. My understanding of datasync was that it was only supposed
> to write out data and never any of the metadata. The reason for the call
> to flush the log for "stuffed" files is that the data shares a disk
> block with the inode metadata, so we cannot avoid the log flush in this
> case, since we must unpin the block to write it back.

What happens to indirect blocks, inode size updates, etc?  In general
the only correct form to use the datasync argument is along the lines
of:

	if ((inode->i_state & I_DIRTY_DATASYNC) ||
	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) {
		/* write out the inode */
	} else {
		/*
		 * VFS inode not dirty, no need to write it out.
		 *
		 * If the filesystem support asynchronous inode writes,
		 * we may have to wait for them here.
		 */
	}

or rather mostly correct, as pointed out by Nick in this series, that's
why the above gets replaced with an equivalent check that also
participates in the writeback locking protocol in this series.

For gfs2 on current mainline an fsync respecting that would look like:

static int gfs2_fsync(struct file *file, int datasync)
{
	struct inode *inode = file->f_mapping->host;
	struct gfs2_inode *ip = GFS2_I(inode);
	int ret = 0;

	if (gfs2_is_jdata(ip) {
		gfs2_log_flush(GFS2_SB(inode), ip);
		return 0;
	}

	if ((inode->i_state & I_DIRTY_DATASYNC) ||
	    ((inode->i_state & I_DIRTY_SYNC) && !datasync))
		sync_inode_metadata(inode, 1);
	else if (gfs2_is_stuffed(ip))
		gfs2_log_flush(GFS2_SB(inode), ip->i_gl);
}

Note that the asynchronous write_inode_now is replaced with a
sync_inode_metadata, which doesn't incorrectly write data again, and
makes sure we do a synchronous write.

I'm still not quite sure how the gfs2_log_flush are supposed to work.
What's the reason we don't need the ->write_inode call for journaled
data mode?  Also is it guaranteed that we might not have an asynchronous
transaction that update the inode in the log, e.g. why doesn't gfs2
need some sort of log flush even if the VFS inode is not dirty, unlike
most other journaled filesystems.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2010-12-29 15:01   ` Christoph Hellwig
  2011-01-03 15:03     ` Steven Whitehouse
@ 2011-01-04  6:04     ` Nick Piggin
  2011-01-04  6:39       ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  6:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton, mfasheh,
	joel.becker, swhiteho

On Wed, Dec 29, 2010 at 10:01:09AM -0500, Christoph Hellwig wrote:
> As mentioned last round I think the exporting of inode_lock and pusing
> of the I_DIRTY* complexities into the filesystems can be avoided.  See

Yes I did see that, I hoped to continue discussion of that detail.

Let me start out by saying OK I will agree to hold off that change
until inode_lock is removed at least, and concentrate on just the
fixes.

However I strongly believe that filesystems should be able to access
and manipulate the inode dirty state directly. If you agree with that,
then I think they should be able to access the lock required for that.
Filesystems will want to keep their internal state in synch with vfs
visible state most likely (eg. like your hfsplus patches), and _every_
time we do "loose" coupling between state bits like this (eg. page and
buffer state; page and pte state; etc), it turns out to be a huge mess
of races and subtle code and ordering.


> the patch below, which compiles and passes xfstests for xfs, but
> otherwise isn't quite done yet.  The only code change vs the opencoded
> variant in the patch is that we do a useless inode_lock roundtrip

I dislike this style, except where it has some real advantages like
get_block case. I prefer just to make the existing inode_writeback_begin
into a "__special" variant, and make inode_writeback_begin just do the
locking and masking for filesystems.


> for a non-dirty inode on gfs2, which is I think is acceptable,
> especially once we have the lock split anyway.

The bigger issue IMO is if filesystems want to be smarter with dirty
bit handling and keep more internal state in sync with it. I don't
see any problem at all with allowing them to lock the dirty state.
(but will hold off the patch for now, as said).
 
Thanks,
Nick


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2010-12-29 15:12   ` Christoph Hellwig
@ 2011-01-04  6:27     ` Nick Piggin
  2011-01-04  6:57       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  6:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Wed, Dec 29, 2010 at 10:12:46AM -0500, Christoph Hellwig wrote:
>  - the sync_inodes_sb export removal looks fine to, but should be a
>    separate patch with a good changelog

OK


>  - why is the sync_inode_metadata wait parameter removed?  Especially
>    as we would need it for some callers that need to be converted to it.
>    E.g. in this patch converts writeback_inode from non-blocking to
>    blocking behaviour.

Basically the blocking or non blocking behaviour of ->write_inode is
irrelevant and will go away. Specific behaviour cannot be relied upon from
generic code, only filesystems. But even filesystems should not
differentiate between blocking and non-blocking (where they do, they
have been buggy). That is the *other* big bug in the code (the first one
being syncing code not waiting for writeback).

So after my series, we guarantee ->write_inode is called after the inode
has been dirtied. We make no guarantees about what mode it is called in
(blocking or non blocking). So the filesystem should either have _all_
write_inode calls do sync writeout, or have them all do part of the op
(eg. clean struct inode by dirtying pagecache) and then syncing in fsync
and sync_fs. In the latter scheme, it doesn't make sense to do anything
more in the case of a "sync" call.

Basically you can spot the bugs by looking for where the .write_inode
functions differentiate the writeback mode.

So, that's all very confusing with the current set of APIs that are
called sync_inode, sync_inode_metadata, etc. So I'm trying to reduce and
remove these gradually.


>  - except for that the sync_inode() removal is fine, I had planned for
>    that already.  But again, please a separate and well-documented
>    patch.

Yes ok.

 
> As for the actualy sync_inode operation:  I don't think it's helpful.
> The *_sync_inode helpers in ext2 and fat are fine, but there's little
> point in going through an iops vector for it.  Also adding the file
> syncing really complicates the API for now reason, epecially with
> the range interface.

It does have a reason, which is the nfsd syncing callback -- using
sync_inode_metadata there is actually wrong and should really be
replaced with a warning that the fs cannot support such syncing.

See the problem I explained above -- it really needs to do a full
fsync call. But it doesn't have a file descriptor, and most filesystems
don't need one, so I think a sync_inode operation is nice.

Also giving filesystems the flexibility to do the data writeout can
be more optimal by doing all writeout at once or ordering to suit their
writeback patterns. Having range data could give some performance
advantages writing back mappings or commit operations over network. I
don't see it as a big complexity. It gives a chance to do range fsyncs
and sync_file_range and such properly too.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-04  6:04     ` Nick Piggin
@ 2011-01-04  6:39       ` Christoph Hellwig
  2011-01-04  7:52         ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-04  6:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton,
	mfasheh, joel.becker, swhiteho

On Tue, Jan 04, 2011 at 05:04:52PM +1100, Nick Piggin wrote:
> However I strongly believe that filesystems should be able to access
> and manipulate the inode dirty state directly. If you agree with that,
> then I think they should be able to access the lock required for that.
> Filesystems will want to keep their internal state in synch with vfs
> visible state most likely (eg. like your hfsplus patches), and _every_
> time we do "loose" coupling between state bits like this (eg. page and
> buffer state; page and pte state; etc), it turns out to be a huge mess
> of races and subtle code and ordering.

I've probably done the two most complicated fsync implementations in xfs
and hfsplys myself, and I'd really prefer the interface to be as simple
as possible.  The way the I_DIRTY_* flags and the datasync parameter to
->fsync interact are almost a receipe for getting it wrong, which in
fact most implementations that tried to be smart did.  See gfs2 and
ocfs2 comments in this threads for classic examples.

If we actually get filesystems that need to do smarts in
checking/clearing the I_DIRTY_* flag we can discuss proper interfaces
for it - duplicating guts of i_state manipulations sounds like a
relatively bad idea for that.  Note that my fsync_helper still gives
filesystems a lot of information, just not in a redundant and confusing
way.  The dirty argument still tells if datasync-relevant or
non-datasyns relevant (aka timestampts) metadata was dirty, so if a
filesystem needs to write them back in different ways it still can.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  6:27     ` Nick Piggin
@ 2011-01-04  6:57       ` Christoph Hellwig
  2011-01-04  8:03         ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-04  6:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 05:27:25PM +1100, Nick Piggin wrote:
> Basically the blocking or non blocking behaviour of ->write_inode is
> irrelevant and will go away. Specific behaviour cannot be relied upon from
> generic code, only filesystems. But even filesystems should not
> differentiate between blocking and non-blocking (where they do, they
> have been buggy). That is the *other* big bug in the code (the first one
> being syncing code not waiting for writeback).
>
> So after my series, we guarantee ->write_inode is called after the inode
> has been dirtied. We make no guarantees about what mode it is called in
> (blocking or non blocking). So the filesystem should either have _all_
> write_inode calls do sync writeout, or have them all do part of the op
> (eg. clean struct inode by dirtying pagecache) and then syncing in fsync
> and sync_fs. In the latter scheme, it doesn't make sense to do anything
> more in the case of a "sync" call.

There is absolutely no problem with writing out inodes asynchronously,
it's just that the writeback code can't really cope with it right now.
The fix is to only update i_state on I/O completion.

The blocking vs non-blocking difference is actually quite important for
performance still.  In fact for most modern filesystems we simply don't
every want to do a normal writeout for the !sync case.  Either it would
be a no-op, or do something like we do in xfs currently, where we make
sure to read the inode to start the read/modify/write cycle early, and
put it at the end of the delayed write queue.  The sync one on the other
hand just logs the inode, so that the log force in ->sync_fs/etc can
write them all out in one go.

> > As for the actualy sync_inode operation:  I don't think it's helpful.
> > The *_sync_inode helpers in ext2 and fat are fine, but there's little
> > point in going through an iops vector for it.  Also adding the file
> > syncing really complicates the API for now reason, epecially with
> > the range interface.
> 
> It does have a reason, which is the nfsd syncing callback -- using
> sync_inode_metadata there is actually wrong and should really be
> replaced with a warning that the fs cannot support such syncing.
> 
> See the problem I explained above -- it really needs to do a full
> fsync call. But it doesn't have a file descriptor, and most filesystems
> don't need one, so I think a sync_inode operation is nice.

It doesn't need to do an fsync, it's actually a quite different
operations.  That's why we added the ->commit_metadata operations.  What
NFSD really wants is to guarantee synchronous metadata operations.
Traditionally unix required the filesystems to be mounted using -o wsync
for that, but we try to emulate that from NFSD without affecting other
access to the filesystem.  The ->commit_metadata is there to ensure any
previously started async operations completes.  It does not have to
push all dirty state to disk like fsync.  Just compare the complexities
of xfs_file_fsync and xfs_fs_nfs_commit_metadata - the latter simply
checks if the inode has been logged, and if yes forces the log to disk
up to the last operation on this inode.  fsync does the same force if
the inode is not dirty, but otherwise has to start a new transactions.
It also has to wait for pending I/O completions before dealing with
metadata, and issue barriers to data devices, which is completly
superflous for nfs operations.

> Also giving filesystems the flexibility to do the data writeout can
> be more optimal by doing all writeout at once or ordering to suit their
> writeback patterns. Having range data could give some performance
> advantages writing back mappings or commit operations over network. I
> don't see it as a big complexity. It gives a chance to do range fsyncs
> and sync_file_range and such properly too.

We currently do all that just fine before calling into ->fsync.  That
doesn't mean we can't move it into ->fsync if a good reason for it comes
up, but

 a) it needs a good rational
 b) it should stay in ->fsync instead of beeing mixed up with the method
    used for other metadata writeback


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-03 16:58       ` Christoph Hellwig
@ 2011-01-04  7:12         ` Nick Piggin
  2011-01-04 14:22         ` Steven Whitehouse
  1 sibling, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  7:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Steven Whitehouse, Nick Piggin, linux-fsdevel, linux-kernel,
	Andrew Morton, mfasheh, joel.becker

On Mon, Jan 03, 2011 at 11:58:21AM -0500, Christoph Hellwig wrote:
> On Mon, Jan 03, 2011 at 03:03:29PM +0000, Steven Whitehouse wrote:
> > 
> >  - With "journaled data" files
> >    - Do a log flush conditional upon the inode's glock
> >    - The core code then writes back any dirty pages
> 
> Any data writeback is done before calling ->fsync.
> 
> >  - With regular files/directories
> >   - If datasync is not set, we need to write back the metadata including
> > timestamp updates, so that is done via ->write_inode. Note that an extra
> > complication here is that we need to get the glock on the inode if we
> > don't already have it in order to check and conditionally update the
> > atime.The call to ->write_inode includes an implicit (conditional) log
> > flush.
> >  - If datasync is set, we assume that only the data pages need to be
> > written out. My understanding of datasync was that it was only supposed
> > to write out data and never any of the metadata. The reason for the call
> > to flush the log for "stuffed" files is that the data shares a disk
> > block with the inode metadata, so we cannot avoid the log flush in this
> > case, since we must unpin the block to write it back.
> 
> What happens to indirect blocks, inode size updates, etc?  In general
> the only correct form to use the datasync argument is along the lines
> of:
> 
> 	if ((inode->i_state & I_DIRTY_DATASYNC) ||
> 	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) {
> 		/* write out the inode */
> 	} else {
> 		/*
> 		 * VFS inode not dirty, no need to write it out.
> 		 *
> 		 * If the filesystem support asynchronous inode writes,
> 		 * we may have to wait for them here.
> 		 */
> 	}
> 
> or rather mostly correct, as pointed out by Nick in this series, that's
> why the above gets replaced with an equivalent check that also
> participates in the writeback locking protocol in this series.

Just to recap, basically we have 2 main problems in vfs/filesystems:

- i_state dirtyness is checked outside the correct synchronization
  protcol, so it may be seen as clean before a concurrent writer
  has finished.

- .write_inode is only guaranteed to be called once regardless of sync
  or async mode, for a dirty inode at a sync point. Many filesystems
  were incorrectly assuming they would be called once *in synchronous
  mode*.

  The optimal approach for .write_inode seems to be clean the struct
  inode so that it may be eventually reclaimed. Then have your .fsync and
  .sync_fs implementations enforce the actual data integrity.

  Note that "clean struct inode" often means to copy the metadata
  somewhere else to be scheduled for asynch writeout. You have to be
  careful to note that if you allow the inode to be evicted at this
  point without data integrity point also in .evict_inode, then you need
  to keep in mind that .sync_fs (and subsequent .fsync, if the inode is
  re opened) need to still enforce integrity for these potentially
  evicted inodes.

Everyone happy with this? Please review your filesystems and look at
my patches :)

Thanks,
Nick


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-04  6:39       ` Christoph Hellwig
@ 2011-01-04  7:52         ` Nick Piggin
  2011-01-04  9:13           ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  7:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton, mfasheh,
	joel.becker, swhiteho

On Tue, Jan 04, 2011 at 01:39:44AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 05:04:52PM +1100, Nick Piggin wrote:
> > However I strongly believe that filesystems should be able to access
> > and manipulate the inode dirty state directly. If you agree with that,
> > then I think they should be able to access the lock required for that.
> > Filesystems will want to keep their internal state in synch with vfs
> > visible state most likely (eg. like your hfsplus patches), and _every_
> > time we do "loose" coupling between state bits like this (eg. page and
> > buffer state; page and pte state; etc), it turns out to be a huge mess
> > of races and subtle code and ordering.
> 
> I've probably done the two most complicated fsync implementations in xfs
> and hfsplys myself, and I'd really prefer the interface to be as simple
> as possible.

Agree and as I said I'll change inode_writeback_begin/end to do the
locking and just return the dirty bit mask etc.


>  The way the I_DIRTY_* flags and the datasync parameter to
> ->fsync interact are almost a receipe for getting it wrong, which in
> fact most implementations that tried to be smart did.  See gfs2 and
> ocfs2 comments in this threads for classic examples.

Right if you want a helper to get the correct mask of bits required
that's fine and I agree, but locking is a different issue too: if
filesystems are trying to keep private state in sync with vfs state,
then they _need_ to do it properly with the proper locking. I think
your hfsplus implementation had a bug or two in this area didn't it?
(although I ended up getting side tracked with all these bugs half
way through looking at that).


> If we actually get filesystems that need to do smarts in
> checking/clearing the I_DIRTY_* flag we can discuss proper interfaces
> for it - duplicating guts of i_state manipulations sounds like a
> relatively bad idea for that.

I disagree, but we'll explore it further later.

It is a couple of lines to check and clear dirty bits.  Not rocket
science and I think it is far better to make it explicit what is
happening to the filesystem. The disconnect between what the vfs is
doing and what the filesystems throught should be happening is what
caused all these bugs to start with.

Anyway, long story short, I'll drop the inode_lock export and move the
locking and manipulation into inode_writeback_begin/end for now.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  6:57       ` Christoph Hellwig
@ 2011-01-04  8:03         ` Nick Piggin
  2011-01-04  8:31           ` Nick Piggin
  2011-01-04  9:25           ` Christoph Hellwig
  0 siblings, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  8:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 01:57:37AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 05:27:25PM +1100, Nick Piggin wrote:
> > Basically the blocking or non blocking behaviour of ->write_inode is
> > irrelevant and will go away. Specific behaviour cannot be relied upon from
> > generic code, only filesystems. But even filesystems should not
> > differentiate between blocking and non-blocking (where they do, they
> > have been buggy). That is the *other* big bug in the code (the first one
> > being syncing code not waiting for writeback).
> >
> > So after my series, we guarantee ->write_inode is called after the inode
> > has been dirtied. We make no guarantees about what mode it is called in
> > (blocking or non blocking). So the filesystem should either have _all_
> > write_inode calls do sync writeout, or have them all do part of the op
> > (eg. clean struct inode by dirtying pagecache) and then syncing in fsync
> > and sync_fs. In the latter scheme, it doesn't make sense to do anything
> > more in the case of a "sync" call.
> 
> There is absolutely no problem with writing out inodes asynchronously,
> it's just that the writeback code can't really cope with it right now.
> The fix is to only update i_state on I/O completion.

I don't know what you mean by this. As a filesystem generic API,
.write_inode says almost nothing. Some filesystems will do one thing,
some may do something in future after patches to the vfs to allow more
fancy work, others do other things, generic code can't assume any of
that.

 
> The blocking vs non-blocking difference is actually quite important for
> performance still.

I don't see how it could be. All integrity points still have to wait for
all potentially non blocking write_inode, so if that's the case just make
*all* the write_inode non blocking to begin with. Simpler code and
you'd get more performance at least in the sync(3) case where writeback
and waiting of multiple inodes can be batched.


>  In fact for most modern filesystems we simply don't
> every want to do a normal writeout for the !sync case.  Either it would
> be a no-op, or do something like we do in xfs currently, where we make
> sure to read the inode to start the read/modify/write cycle early, and
> put it at the end of the delayed write queue.  The sync one on the other
> hand just logs the inode, so that the log force in ->sync_fs/etc can
> write them all out in one go.

Great, so why would you ever do a sync .write_inode?

 
> > > As for the actualy sync_inode operation:  I don't think it's helpful.
> > > The *_sync_inode helpers in ext2 and fat are fine, but there's little
> > > point in going through an iops vector for it.  Also adding the file
> > > syncing really complicates the API for now reason, epecially with
> > > the range interface.
> > 
> > It does have a reason, which is the nfsd syncing callback -- using
> > sync_inode_metadata there is actually wrong and should really be
> > replaced with a warning that the fs cannot support such syncing.
> > 
> > See the problem I explained above -- it really needs to do a full
> > fsync call. But it doesn't have a file descriptor, and most filesystems
> > don't need one, so I think a sync_inode operation is nice.
> 
> It doesn't need to do an fsync, it's actually a quite different
> operations.  That's why we added the ->commit_metadata operations.  What
> NFSD really wants is to guarantee synchronous metadata operations.
> Traditionally unix required the filesystems to be mounted using -o wsync
> for that, but we try to emulate that from NFSD without affecting other
> access to the filesystem.  The ->commit_metadata is there to ensure any
> previously started async operations completes.  It does not have to
> push all dirty state to disk like fsync.  Just compare the complexities
> of xfs_file_fsync and xfs_fs_nfs_commit_metadata - the latter simply
> checks if the inode has been logged, and if yes forces the log to disk
> up to the last operation on this inode.  fsync does the same force if
> the inode is not dirty, but otherwise has to start a new transactions.
> It also has to wait for pending I/O completions before dealing with
> metadata, and issue barriers to data devices, which is completly
> superflous for nfs operations.

Thanks for the interesting comments, but I didn't say it needs an fsync,
I said sync_inode_metadata is not sufficient by definition because it
guarantees nothing.

 
> > Also giving filesystems the flexibility to do the data writeout can
> > be more optimal by doing all writeout at once or ordering to suit their
> > writeback patterns. Having range data could give some performance
> > advantages writing back mappings or commit operations over network. I
> > don't see it as a big complexity. It gives a chance to do range fsyncs
> > and sync_file_range and such properly too.
> 
> We currently do all that just fine before calling into ->fsync.

What do you mean we do all that just fine? A filesystem can't schedule
data submission and waiting optimally versus metadata, it can't do
metadata operations or remote commits corresponding to data ranges, and
it doesn't support nfs sync operations.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  8:03         ` Nick Piggin
@ 2011-01-04  8:31           ` Nick Piggin
  2011-01-04  9:25             ` Christoph Hellwig
  2011-01-04  9:25           ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  8:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 07:03:23PM +1100, Nick Piggin wrote:
> On Tue, Jan 04, 2011 at 01:57:37AM -0500, Christoph Hellwig wrote:
> > > Also giving filesystems the flexibility to do the data writeout can
> > > be more optimal by doing all writeout at once or ordering to suit their
> > > writeback patterns. Having range data could give some performance
> > > advantages writing back mappings or commit operations over network. I
> > > don't see it as a big complexity. It gives a chance to do range fsyncs
> > > and sync_file_range and such properly too.
> > 
> > We currently do all that just fine before calling into ->fsync.
> 
> What do you mean we do all that just fine? A filesystem can't schedule
> data submission and waiting optimally versus metadata, it can't do
> metadata operations or remote commits corresponding to data ranges, and
> it doesn't support nfs sync operations.

And actually I think it is much better to have sync_inode, which means
we'll be able to get rid of commit_metadata (which should be an inode
operation anyway, not an export operation which really should deal with
exporting filesystems to a non-vfs namespace, not nfsd hacks).

commit_metadata would just be sync_inode with a null range or no data
sync flag set.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-04  7:52         ` Nick Piggin
@ 2011-01-04  9:13           ` Christoph Hellwig
  2011-01-04  9:28             ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-04  9:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton,
	mfasheh, joel.becker, swhiteho

On Tue, Jan 04, 2011 at 06:52:48PM +1100, Nick Piggin wrote:
> Right if you want a helper to get the correct mask of bits required
> that's fine and I agree, but locking is a different issue too: if
> filesystems are trying to keep private state in sync with vfs state,
> then they _need_ to do it properly with the proper locking. I think
> your hfsplus implementation had a bug or two in this area didn't it?
> (although I ended up getting side tracked with all these bugs half
> way through looking at that).

It's not locking, but ordering that was the issue.  It's an issue caused
by the VFS interfaces, but not really related to the area you work
on currently.  If ->dirty_inode told us what was dirtied it would be
a lot simpler.  Alternatively we should just stop requiring filesystems
to participate in the I_DIRTY_SYNC/DATASYNC protocol.  In general it's
much easier for the filesystem to keep that state by itself in proper
granularity.  But I lost the fight to have the timestamp updates go
through proper methods instead of just writing into the VFS inode and
marking the inode dirty long ago, so we'll have to live with it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  8:03         ` Nick Piggin
  2011-01-04  8:31           ` Nick Piggin
@ 2011-01-04  9:25           ` Christoph Hellwig
  2011-01-04  9:49             ` Nick Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-04  9:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 07:03:23PM +1100, Nick Piggin wrote:
> > There is absolutely no problem with writing out inodes asynchronously,
> > it's just that the writeback code can't really cope with it right now.
> > The fix is to only update i_state on I/O completion.
> 
> I don't know what you mean by this. As a filesystem generic API,
> .write_inode says almost nothing. Some filesystems will do one thing,
> some may do something in future after patches to the vfs to allow more
> fancy work, others do other things, generic code can't assume any of
> that.

There is no problem with having a method (let's call it write_inode for
simplicity, it could be sync_inode or got knows what else) to write
clear the dirty state of the VFS inode asynchronously, as long as we
make sure the bits are only updated once it's been successfull.  That
is for an async call we can't let the caller of ->write_inode update
it, but need to do it in an I/O completion callback.

> > The blocking vs non-blocking difference is actually quite important for
> > performance still.
> 
> I don't see how it could be. All integrity points still have to wait for
> all potentially non blocking write_inode, so if that's the case just make
> *all* the write_inode non blocking to begin with. Simpler code and
> you'd get more performance at least in the sync(3) case where writeback
> and waiting of multiple inodes can be batched.

The problem is that currently we almost never do a pure blocking
->write_inode.  The way the sync code is written we always do a
non-blocking one first, then a blocking one.  If you always do the
synchronous one we'll get a lot more overhead - the first previous
asynchronous one will write the inode (be it just into the log, or for
real), then we write back data, and then we'll have to write it again
becaus it has usually been redirtied again due to the data writeback in
the meantime.  If you want to get rid of the asynchronous ones you need
to get rid of the callers, and not just change the behaviour in the
implementations.

> >  In fact for most modern filesystems we simply don't
> > every want to do a normal writeout for the !sync case.  Either it would
> > be a no-op, or do something like we do in xfs currently, where we make
> > sure to read the inode to start the read/modify/write cycle early, and
> > put it at the end of the delayed write queue.  The sync one on the other
> > hand just logs the inode, so that the log force in ->sync_fs/etc can
> > write them all out in one go.
> 
> Great, so why would you ever do a sync .write_inode?

We need to propagate the VFS dirty state into the fs-internal state,
e.g. for XFS start a transaction.  The reason for that is that the VFS
simply writes timestamps into the inode and marks it dirty instead of
telling the filesystem about timestamp updates.  For XFS in
2.6.38+ timestamp updates and i_size updates are the only unlogged
metadata changes, and thus now the only thing going through
->write_inode.

> > > Also giving filesystems the flexibility to do the data writeout can
> > > be more optimal by doing all writeout at once or ordering to suit their
> > > writeback patterns. Having range data could give some performance
> > > advantages writing back mappings or commit operations over network. I
> > > don't see it as a big complexity. It gives a chance to do range fsyncs
> > > and sync_file_range and such properly too.
> > 
> > We currently do all that just fine before calling into ->fsync.
> 
> What do you mean we do all that just fine? A filesystem can't schedule
> data submission and waiting optimally versus metadata, it can't do
> metadata operations or remote commits corresponding to data ranges, and
> it doesn't support nfs sync operations.

A normal filesystems needs to wait for data I/O to finish before updating
the metadata for data integrity reasons - otherwise we can expose stale
data in case of a crash - that's one of the things we fixed a couple of
kernel releases ago.  As I said I have no problems moving the
data operations into ->fsync or adding range information if we actually
need it.  Show me a filesystem that really needs this to get good
performance and we can adapt the ->fsync prototype.  If we need it at
all we'll need it for network/distributed filesystems, though - which
have a tendency to actually require the file argument and won't be
able to use an inode operations.  Nevermind the fact that
file_operations are what we use for all data I/O operations, and messing
this up for the sake of it doesn't seem like an overly useful idea.

Care to explain what you mean with nfs sync operations?

> 
---end quoted text---

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  8:31           ` Nick Piggin
@ 2011-01-04  9:25             ` Christoph Hellwig
  2011-01-04  9:52               ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-04  9:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 07:31:32PM +1100, Nick Piggin wrote:
> And actually I think it is much better to have sync_inode, which means
> we'll be able to get rid of commit_metadata (which should be an inode
> operation anyway, not an export operation which really should deal with
> exporting filesystems to a non-vfs namespace, not nfsd hacks).
> 
> commit_metadata would just be sync_inode with a null range or no data
> sync flag set.

As explained in the previous mail it's not just not writing data, it's
conceptually quite different from fsync.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-04  9:13           ` Christoph Hellwig
@ 2011-01-04  9:28             ` Nick Piggin
  0 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  9:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton, mfasheh,
	joel.becker, swhiteho

On Tue, Jan 04, 2011 at 04:13:48AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 06:52:48PM +1100, Nick Piggin wrote:
> > Right if you want a helper to get the correct mask of bits required
> > that's fine and I agree, but locking is a different issue too: if
> > filesystems are trying to keep private state in sync with vfs state,
> > then they _need_ to do it properly with the proper locking. I think
> > your hfsplus implementation had a bug or two in this area didn't it?
> > (although I ended up getting side tracked with all these bugs half
> > way through looking at that).
> 
> It's not locking, but ordering that was the issue.  It's an issue caused

Right, but ordering should probably not be an issue if filesystem could
manage the dirty state and locking by itself. 


> by the VFS interfaces, but not really related to the area you work
> on currently.  If ->dirty_inode told us what was dirtied it would be

Well it's one side of the issue, the other side is indeed the dirtying
side.


> a lot simpler.  Alternatively we should just stop requiring filesystems
> to participate in the I_DIRTY_SYNC/DATASYNC protocol.  In general it's
> much easier for the filesystem to keep that state by itself in proper
> granularity.  But I lost the fight to have the timestamp updates go
> through proper methods instead of just writing into the VFS inode and
> marking the inode dirty long ago, so we'll have to live with it.

I wouldn't mind revisiting that, once these correctness fixes are in
(and yes that's another good reason to hold off with allowing
filesystems to do the i_state locking just yet).

Allowing the filesystem to entirely manage the setting and clearing of
dirty bits would be a good idea. Then just have a single bit that
specifies they want background writeout to run a callback together when
it does data writeout for that inode. Everything else would be done in
the fs.

Page dirtying has similarly silly conventions like having the caller
clear dirty before calling into ->writepage, which means the filesystem
has to have hacks like redirty_page_for_io and has a hard time keeping
page dirty state in sync with page private metadata. (completely
different issue but similar general problem of having things half
managed by one side and half by the other).


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  9:25           ` Christoph Hellwig
@ 2011-01-04  9:49             ` Nick Piggin
  2011-01-06 20:45               ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  9:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 04:25:01AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 07:03:23PM +1100, Nick Piggin wrote:
> > > There is absolutely no problem with writing out inodes asynchronously,
> > > it's just that the writeback code can't really cope with it right now.
> > > The fix is to only update i_state on I/O completion.
> > 
> > I don't know what you mean by this. As a filesystem generic API,
> > .write_inode says almost nothing. Some filesystems will do one thing,
> > some may do something in future after patches to the vfs to allow more
> > fancy work, others do other things, generic code can't assume any of
> > that.
> 
> There is no problem with having a method (let's call it write_inode for
> simplicity, it could be sync_inode or got knows what else) to write
> clear the dirty state of the VFS inode asynchronously, as long as we
> make sure the bits are only updated once it's been successfull.  That
> is for an async call we can't let the caller of ->write_inode update
> it, but need to do it in an I/O completion callback.

Right, but that would be a bit of a change to generic code and a really
big change to make all filesystems support the API.

What I am saying right now is that .write_inode doesn't mean anything
to generic code, except that it does something "nice" for
writeback/reclaim purposes, and it needs to be called within the
sync / fsync protocol.

But .write_inode on its own doesn't mean anything. Not even if 99% of
the filesystems implement it in a particular way. Hence,
sync_inode_metadata and such can't be used in generic code like nfsd.


> > > The blocking vs non-blocking difference is actually quite important for
> > > performance still.
> > 
> > I don't see how it could be. All integrity points still have to wait for
> > all potentially non blocking write_inode, so if that's the case just make
> > *all* the write_inode non blocking to begin with. Simpler code and
> > you'd get more performance at least in the sync(3) case where writeback
> > and waiting of multiple inodes can be batched.
> 
> The problem is that currently we almost never do a pure blocking
> ->write_inode.  The way the sync code is written we always do a
> non-blocking one first, then a blocking one.  If you always do the
> synchronous one we'll get a lot more overhead - the first previous
> asynchronous one will write the inode (be it just into the log, or for
> real), then we write back data, and then we'll have to write it again
> becaus it has usually been redirtied again due to the data writeback in
> the meantime.

It doesn't matter, the integrity still has to be enforced in .sync_fs,
because sync .write_inode may *never* get called, because of the fact
that async .write_inode also clears the inode metadata dirty bits.

So if .sync_fs has to enforce integrity anyway, then you don't ever need
to do any actual waiting in your sync .write_inode. See?


>  If you want to get rid of the asynchronous ones you need
> to get rid of the callers, and not just change the behaviour in the
> implementations.

The patches are working towards that. It's a lot less broken that it
was, and (barring my oversights, or existing fs bugs) it should actually
"work" now.

In cases where particular calls remain, it is for example where the
filesystem calls their own handler and so it knows what it can expect.


> > >  In fact for most modern filesystems we simply don't
> > > every want to do a normal writeout for the !sync case.  Either it would
> > > be a no-op, or do something like we do in xfs currently, where we make
> > > sure to read the inode to start the read/modify/write cycle early, and
> > > put it at the end of the delayed write queue.  The sync one on the other
> > > hand just logs the inode, so that the log force in ->sync_fs/etc can
> > > write them all out in one go.
> > 
> > Great, so why would you ever do a sync .write_inode?
> 
> We need to propagate the VFS dirty state into the fs-internal state,
> e.g. for XFS start a transaction.  The reason for that is that the VFS
> simply writes timestamps into the inode and marks it dirty instead of
> telling the filesystem about timestamp updates.  For XFS in
> 2.6.38+ timestamp updates and i_size updates are the only unlogged
> metadata changes, and thus now the only thing going through
> ->write_inode.

Well then you have a bug, because a sync .write_inode *may never get
called*. You may only get an async one, even in the case of fsync,
because async writeback clears the vfs dirty bits.


> > > > Also giving filesystems the flexibility to do the data writeout can
> > > > be more optimal by doing all writeout at once or ordering to suit their
> > > > writeback patterns. Having range data could give some performance
> > > > advantages writing back mappings or commit operations over network. I
> > > > don't see it as a big complexity. It gives a chance to do range fsyncs
> > > > and sync_file_range and such properly too.
> > > 
> > > We currently do all that just fine before calling into ->fsync.
> > 
> > What do you mean we do all that just fine? A filesystem can't schedule
> > data submission and waiting optimally versus metadata, it can't do
> > metadata operations or remote commits corresponding to data ranges, and
> > it doesn't support nfs sync operations.
> 
> A normal filesystems needs to wait for data I/O to finish before updating
> the metadata for data integrity reasons - otherwise we can expose stale
> data in case of a crash - that's one of the things we fixed a couple of
> kernel releases ago.  As I said I have no problems moving the

Yes, and that's a filesystem-implementation detail.


> data operations into ->fsync or adding range information if we actually
> need it.  Show me a filesystem that really needs this to get good
> performance and we can adapt the ->fsync prototype.  If we need it at
> all we'll need it for network/distributed filesystems, though - which
> have a tendency to actually require the file argument and won't be
> able to use an inode operations.  Nevermind the fact that
> file_operations are what we use for all data I/O operations, and messing
> this up for the sake of it doesn't seem like an overly useful idea.

Just if I'm adding this API it makes sense to add the obvious
parameters. A sync range does not "hugely complicate the API" at
all.

 
> Care to explain what you mean with nfs sync operations?

The one that is attempted to be implemented with sync_inode_metadata.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  9:25             ` Christoph Hellwig
@ 2011-01-04  9:52               ` Nick Piggin
  2011-01-06 20:49                 ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-04  9:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 04:25:41AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 07:31:32PM +1100, Nick Piggin wrote:
> > And actually I think it is much better to have sync_inode, which means
> > we'll be able to get rid of commit_metadata (which should be an inode
> > operation anyway, not an export operation which really should deal with
> > exporting filesystems to a non-vfs namespace, not nfsd hacks).
> > 
> > commit_metadata would just be sync_inode with a null range or no data
> > sync flag set.
> 
> As explained in the previous mail it's not just not writing data, it's
> conceptually quite different from fsync.

OK I missed that part about not requiring dirty metadata to be written,
just currently ongoing async operations. But then I don't understand how
it would be used by nfsd, how does nfsd start some async operation on
the inode metadata such that ->commit_metadata would do anything useful
for it?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems
  2011-01-03 16:58       ` Christoph Hellwig
  2011-01-04  7:12         ` Nick Piggin
@ 2011-01-04 14:22         ` Steven Whitehouse
  1 sibling, 0 replies; 42+ messages in thread
From: Steven Whitehouse @ 2011-01-04 14:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton, mfasheh,
	joel.becker

Hi,

On Mon, 2011-01-03 at 11:58 -0500, Christoph Hellwig wrote:
> On Mon, Jan 03, 2011 at 03:03:29PM +0000, Steven Whitehouse wrote:
> > 
> >  - With "journaled data" files
> >    - Do a log flush conditional upon the inode's glock
> >    - The core code then writes back any dirty pages
> 
> Any data writeback is done before calling ->fsync.
> 
> >  - With regular files/directories
> >   - If datasync is not set, we need to write back the metadata including
> > timestamp updates, so that is done via ->write_inode. Note that an extra
> > complication here is that we need to get the glock on the inode if we
> > don't already have it in order to check and conditionally update the
> > atime.The call to ->write_inode includes an implicit (conditional) log
> > flush.
> >  - If datasync is set, we assume that only the data pages need to be
> > written out. My understanding of datasync was that it was only supposed
> > to write out data and never any of the metadata. The reason for the call
> > to flush the log for "stuffed" files is that the data shares a disk
> > block with the inode metadata, so we cannot avoid the log flush in this
> > case, since we must unpin the block to write it back.
> 
> What happens to indirect blocks, inode size updates, etc?  In general
> the only correct form to use the datasync argument is along the lines
> of:
> 
> 	if ((inode->i_state & I_DIRTY_DATASYNC) ||
> 	    ((inode->i_state & I_DIRTY_SYNC) && !datasync)) {
> 		/* write out the inode */
> 	} else {
> 		/*
> 		 * VFS inode not dirty, no need to write it out.
> 		 *
> 		 * If the filesystem support asynchronous inode writes,
> 		 * we may have to wait for them here.
> 		 */
> 	}
> 
> or rather mostly correct, as pointed out by Nick in this series, that's
> why the above gets replaced with an equivalent check that also
> participates in the writeback locking protocol in this series.
> 
Yes, that looks much better than what we have at the moment.

> For gfs2 on current mainline an fsync respecting that would look like:
> 
> static int gfs2_fsync(struct file *file, int datasync)
> {
> 	struct inode *inode = file->f_mapping->host;
> 	struct gfs2_inode *ip = GFS2_I(inode);
> 	int ret = 0;
> 
> 	if (gfs2_is_jdata(ip) {
> 		gfs2_log_flush(GFS2_SB(inode), ip);
> 		return 0;
> 	}
> 
> 	if ((inode->i_state & I_DIRTY_DATASYNC) ||
> 	    ((inode->i_state & I_DIRTY_SYNC) && !datasync))
> 		sync_inode_metadata(inode, 1);
> 	else if (gfs2_is_stuffed(ip))
> 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl);
> }
> 
> Note that the asynchronous write_inode_now is replaced with a
> sync_inode_metadata, which doesn't incorrectly write data again, and
> makes sure we do a synchronous write.
>
> I'm still not quite sure how the gfs2_log_flush are supposed to work.
> What's the reason we don't need the ->write_inode call for journaled
> data mode?  Also is it guaranteed that we might not have an asynchronous
> transaction that update the inode in the log, e.g. why doesn't gfs2
> need some sort of log flush even if the VFS inode is not dirty, unlike
> most other journaled filesystems.
> 

I think that has just been missed due to the way in which the code has
developed. It appears to be needed to me, but originally all the
timestamp updates were handled internally by GFS2 and in a synchronous
manner, so that there was no need for ->write_inode() in that case. I
think that needs to be added now the vfs looks after atime updates
though, in order to be correct.

After the log flush there should also be a write on the metadata mapping
as per the inode_go_sync() function which is very similar (but not quite
similar enough to use the same code, I think) function,

Steve.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  9:49             ` Nick Piggin
@ 2011-01-06 20:45               ` Christoph Hellwig
  2011-01-07  4:47                 ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-06 20:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

> > The problem is that currently we almost never do a pure blocking
> > ->write_inode.  The way the sync code is written we always do a
> > non-blocking one first, then a blocking one.  If you always do the
> > synchronous one we'll get a lot more overhead - the first previous
> > asynchronous one will write the inode (be it just into the log, or for
> > real), then we write back data, and then we'll have to write it again
> > becaus it has usually been redirtied again due to the data writeback in
> > the meantime.
> 
> It doesn't matter, the integrity still has to be enforced in .sync_fs,
> because sync .write_inode may *never* get called, because of the fact
> that async .write_inode also clears the inode metadata dirty bits.
> 
> So if .sync_fs has to enforce integrity anyway, then you don't ever need
> to do any actual waiting in your sync .write_inode. See?

I'm not talking about the actual waiting.  Right now we have two
different use cases for ->write_inode:

 1) sync_mode == WB_SYNC_NONE

	This tells the filesystem to start an opportunistic writeout.

 2) sync_mode == WB_SYNC_ALL

	This tells the filesystem it needs to to a mandatory writeout.

Note that writeout is losely defined.  If a filesystems isn't
exportable or implements the commit_metadata operation it's indeed
enough to synchronize the state into internal fs data just enough for
->sync_fs.

Or that's how it should be.  As you pointed out the way the writeback
code treats the WB_SYNC_NONE writeouts makes this not work as expected.

There's various ways to fix this:

 1) the one you advocate, that is treating all ->write_inode calls as
    if they were WB_SYNC_ALL.  This does fix the issue of incorrectly
    updating the dirty state, but causes a lot of additional I/O -
    the way the sync process is designed we basically always call
    ->write_inode with WB_SYNC_NONE first, and then with WB_SYNC_ALL
 2) keep the WB_SYNC_NONE calls, but never update dirty state for them.
    This also fixes the i_dirty state updates, but allows filesystems
    that keep internal dirty state to be smarted about avoiding I/O
 3) remove the calls to ->write_inode with WB_SYNC_NONE.  This might
    work well for calls from the sync() callchain, but we rely on
    inode background writeback from the flusher threads in lots of
    places.  Note that we really do not want to block the flusher
    threads with blocking writes, which is another argument against
    (1).
 4) require ->write_inode to update the dirty state itself after
    the inode is on disk or in a data structure caught by ->sync_fs.
    This keeps optimal behaviour, but requires a lot of code changes.

If we want a quick fix only (2) seems feasibly to me, with the option
of implementing (4) and parts of (3) later on.

> > We need to propagate the VFS dirty state into the fs-internal state,
> > e.g. for XFS start a transaction.  The reason for that is that the VFS
> > simply writes timestamps into the inode and marks it dirty instead of
> > telling the filesystem about timestamp updates.  For XFS in
> > 2.6.38+ timestamp updates and i_size updates are the only unlogged
> > metadata changes, and thus now the only thing going through
> > ->write_inode.
> 
> Well then you have a bug, because a sync .write_inode *may never get
> called*. You may only get an async one, even in the case of fsync,
> because async writeback clears the vfs dirty bits.

Yes, the bug about updating the dirty state for WB_SYNC_NONE affects

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-04  9:52               ` Nick Piggin
@ 2011-01-06 20:49                 ` Christoph Hellwig
  2011-01-07  4:48                   ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-06 20:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Tue, Jan 04, 2011 at 08:52:31PM +1100, Nick Piggin wrote:
> OK I missed that part about not requiring dirty metadata to be written,
> just currently ongoing async operations. But then I don't understand how
> it would be used by nfsd, how does nfsd start some async operation on
> the inode metadata such that ->commit_metadata would do anything useful
> for it?

NFSD calls various inode operations (create/mkdir/mknod/link/symlink/
rename/unlink/rmdir/setattr) and then requires those operations to be
on disk before completing the request to the client, but it does not
require other dirty state to be written (data, unlogged size
or timestamp updates).  Take a look at the XFS implementation: it just
checks if the inode is still pinned (that is in the in-memory log, but
not commited to disk) and if so forces the log up to the log buffer
that contains the last changes to the inode.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-06 20:45               ` Christoph Hellwig
@ 2011-01-07  4:47                 ` Nick Piggin
  2011-01-07  7:24                   ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-07  4:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Thu, Jan 06, 2011 at 03:45:10PM -0500, Christoph Hellwig wrote:
> > > The problem is that currently we almost never do a pure blocking
> > > ->write_inode.  The way the sync code is written we always do a
> > > non-blocking one first, then a blocking one.  If you always do the
> > > synchronous one we'll get a lot more overhead - the first previous
> > > asynchronous one will write the inode (be it just into the log, or for
> > > real), then we write back data, and then we'll have to write it again
> > > becaus it has usually been redirtied again due to the data writeback in
> > > the meantime.
> > 
> > It doesn't matter, the integrity still has to be enforced in .sync_fs,
> > because sync .write_inode may *never* get called, because of the fact
> > that async .write_inode also clears the inode metadata dirty bits.
> > 
> > So if .sync_fs has to enforce integrity anyway, then you don't ever need
> > to do any actual waiting in your sync .write_inode. See?
> 
> I'm not talking about the actual waiting.  Right now we have two
> different use cases for ->write_inode:
> 
>  1) sync_mode == WB_SYNC_NONE
> 
> 	This tells the filesystem to start an opportunistic writeout.
> 
>  2) sync_mode == WB_SYNC_ALL
> 
> 	This tells the filesystem it needs to to a mandatory writeout.
> 
> Note that writeout is losely defined.  If a filesystems isn't
> exportable or implements the commit_metadata operation it's indeed
> enough to synchronize the state into internal fs data just enough for
> ->sync_fs.
> 
> Or that's how it should be.  As you pointed out the way the writeback
> code treats the WB_SYNC_NONE writeouts makes this not work as expected.
> 
> There's various ways to fix this:
> 
>  1) the one you advocate, that is treating all ->write_inode calls as
>     if they were WB_SYNC_ALL.  This does fix the issue of incorrectly
>     updating the dirty state, but causes a lot of additional I/O -
>     the way the sync process is designed we basically always call
>     ->write_inode with WB_SYNC_NONE first, and then with WB_SYNC_ALL
>  2) keep the WB_SYNC_NONE calls, but never update dirty state for them.
>     This also fixes the i_dirty state updates, but allows filesystems
>     that keep internal dirty state to be smarted about avoiding I/O
>  3) remove the calls to ->write_inode with WB_SYNC_NONE.  This might
>     work well for calls from the sync() callchain, but we rely on
>     inode background writeback from the flusher threads in lots of
>     places.  Note that we really do not want to block the flusher
>     threads with blocking writes, which is another argument against
>     (1).
>  4) require ->write_inode to update the dirty state itself after
>     the inode is on disk or in a data structure caught by ->sync_fs.
>     This keeps optimal behaviour, but requires a lot of code changes.
> 
> If we want a quick fix only (2) seems feasibly to me, with the option
> of implementing (4) and parts of (3) later on.

No, you misunderstand 1. I am saying they should be treated as
WB_SYNC_NONE.

In fact 2 would cause much more IO, because dirty writeout would
never clean them so it will just keep writing them out. I don't
know how 2 could be feasible.


> > > We need to propagate the VFS dirty state into the fs-internal state,
> > > e.g. for XFS start a transaction.  The reason for that is that the VFS
> > > simply writes timestamps into the inode and marks it dirty instead of
> > > telling the filesystem about timestamp updates.  For XFS in
> > > 2.6.38+ timestamp updates and i_size updates are the only unlogged
> > > metadata changes, and thus now the only thing going through
> > > ->write_inode.
> > 
> > Well then you have a bug, because a sync .write_inode *may never get
> > called*. You may only get an async one, even in the case of fsync,
> > because async writeback clears the vfs dirty bits.
> 
> Yes, the bug about updating the dirty state for WB_SYNC_NONE affects

So, back to my original question: what is the performance problem
with treating write_inode as WB_SYNC_NONE, and then having .fsync
and .sync_fs do the integrity?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-06 20:49                 ` Christoph Hellwig
@ 2011-01-07  4:48                   ` Nick Piggin
  2011-01-07  7:25                     ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2011-01-07  4:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Thu, Jan 06, 2011 at 03:49:11PM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 08:52:31PM +1100, Nick Piggin wrote:
> > OK I missed that part about not requiring dirty metadata to be written,
> > just currently ongoing async operations. But then I don't understand how
> > it would be used by nfsd, how does nfsd start some async operation on
> > the inode metadata such that ->commit_metadata would do anything useful
> > for it?
> 
> NFSD calls various inode operations (create/mkdir/mknod/link/symlink/
> rename/unlink/rmdir/setattr) and then requires those operations to be
> on disk before completing the request to the client, but it does not
> require other dirty state to be written (data, unlogged size
> or timestamp updates).  Take a look at the XFS implementation: it just
> checks if the inode is still pinned (that is in the in-memory log, but
> not commited to disk) and if so forces the log up to the log buffer
> that contains the last changes to the inode.

OK, I don't exactly see why a sync_inode with appropriate flag could
not solve that problem. I'll take a bit more look through nfs and
xfs. Thanks...


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07  4:47                 ` Nick Piggin
@ 2011-01-07  7:24                   ` Christoph Hellwig
  2011-01-07  7:29                     ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-07  7:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 03:47:34PM +1100, Nick Piggin wrote:
> No, you misunderstand 1. I am saying they should be treated as
> WB_SYNC_NONE.
> 
> In fact 2 would cause much more IO, because dirty writeout would
> never clean them so it will just keep writing them out. I don't
> know how 2 could be feasible.

WB_SYNC_NONE means ->write_inode behaves non-blocking.  That is
we do not block on memory allocations, and we do not take locks
blocking.  Most journaling filesystems currently take the easy
way out an make it a no-op due to that, but take a look at XFS
how complicated it is to avoid the blocking if you want a non-noop
implementation.

> So, back to my original question: what is the performance problem
> with treating write_inode as WB_SYNC_NONE, and then having .fsync
> and .sync_fs do the integrity?

See above - we'll block in the flusher thread and cause it to stall,
which is really nasty as it does all data I/O writeback.  The salling
may also block sync() although I don't think it's as important there.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07  4:48                   ` Nick Piggin
@ 2011-01-07  7:25                     ` Christoph Hellwig
  2011-01-11  3:44                       ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-07  7:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 03:48:36PM +1100, Nick Piggin wrote:
> OK, I don't exactly see why a sync_inode with appropriate flag could
> not solve that problem. I'll take a bit more look through nfs and
> xfs. Thanks...

We could also overload all inode operations into a single ioctl like
method, but that doesn't make it a good design.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07  7:24                   ` Christoph Hellwig
@ 2011-01-07  7:29                     ` Christoph Hellwig
  2011-01-07 13:10                       ` Christoph Hellwig
  2011-01-07 18:30                       ` Ted Ts'o
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-07  7:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 02:24:30AM -0500, Christoph Hellwig wrote:
> On Fri, Jan 07, 2011 at 03:47:34PM +1100, Nick Piggin wrote:
> > No, you misunderstand 1. I am saying they should be treated as
> > WB_SYNC_NONE.
> > 
> > In fact 2 would cause much more IO, because dirty writeout would
> > never clean them so it will just keep writing them out. I don't
> > know how 2 could be feasible.
> 
> WB_SYNC_NONE means ->write_inode behaves non-blocking.  That is
> we do not block on memory allocations, and we do not take locks
> blocking.  Most journaling filesystems currently take the easy
> way out an make it a no-op due to that, but take a look at XFS
> how complicated it is to avoid the blocking if you want a non-noop
> implementation.

Btw, there's an easy way how we could get this right, in fact
the write_inode in XFS is already trying to do it, it's just the
caller not copying with it:

 - if we can't get locks for a non-blocking ->write_inode we return
   EAGAIN, and the callers sets the dirty bits again.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07  7:29                     ` Christoph Hellwig
@ 2011-01-07 13:10                       ` Christoph Hellwig
  2011-01-07 18:30                       ` Ted Ts'o
  1 sibling, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-07 13:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 02:29:34AM -0500, Christoph Hellwig wrote:
> Btw, there's an easy way how we could get this right, in fact
> the write_inode in XFS is already trying to do it, it's just the
> caller not copying with it:
> 
>  - if we can't get locks for a non-blocking ->write_inode we return
>    EAGAIN, and the callers sets the dirty bits again.

I just tried to implement this and noticed we're actually doing this
inside XFS - if we get our EAGAIN error from the lower level code
in ->write_inode we do a manual mark_inode_dirty_sync().  So as far
as XFS is concerned ->write_inode always pushes data into a state
where ->sync_fs writes it out, or if it was called with WB_SYNC_NONE
and couldn't get the locks redirties the inode, and thus is not affected
by the issue you mentioned.  I think this is also a good model for
other filesystems to follow.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07  7:29                     ` Christoph Hellwig
  2011-01-07 13:10                       ` Christoph Hellwig
@ 2011-01-07 18:30                       ` Ted Ts'o
  2011-01-07 18:32                           ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Ted Ts'o @ 2011-01-07 18:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 02:29:34AM -0500, Christoph Hellwig wrote:
> Btw, there's an easy way how we could get this right, in fact
> the write_inode in XFS is already trying to do it, it's just the
> caller not copying with it:
> 
>  - if we can't get locks for a non-blocking ->write_inode we return
>    EAGAIN, and the callers sets the dirty bits again.

I like that solution; it might be one of the easier ways to maintain
backwards compatibility.  Especially since (correct me if I am wrong)
the simpler file systems which always write out the inode in the case
of a non-blocking write_inode, say, like say the fat file system, are
immune from this specific problem, right?

						- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07 18:30                       ` Ted Ts'o
@ 2011-01-07 18:32                           ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-07 18:32 UTC (permalink / raw)
  To: Ted Ts'o, Christoph Hellwig, Nick Piggin, linux-fsdevel,
	linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 01:30:47PM -0500, Ted Ts'o wrote:
> > the write_inode in XFS is already trying to do it, it's just the
> > caller not copying with it:
> > 
> >  - if we can't get locks for a non-blocking ->write_inode we return
> >    EAGAIN, and the callers sets the dirty bits again.
> 
> I like that solution; it might be one of the easier ways to maintain
> backwards compatibility.  Especially since (correct me if I am wrong)
> the simpler file systems which always write out the inode in the case
> of a non-blocking write_inode, say, like say the fat file system, are
> immune from this specific problem, right?

Yes.  Also one of the patches in Nick's series actually implements
this already.  Looks like we lost that fact when arguing about other
things..

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
@ 2011-01-07 18:32                           ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2011-01-07 18:32 UTC (permalink / raw)
  To: Ted Ts'o, Christoph Hellwig, Nick Piggin, linux-fsdevel,
	linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 01:30:47PM -0500, Ted Ts'o wrote:
> > the write_inode in XFS is already trying to do it, it's just the
> > caller not copying with it:
> > 
> >  - if we can't get locks for a non-blocking ->write_inode we return
> >    EAGAIN, and the callers sets the dirty bits again.
> 
> I like that solution; it might be one of the easier ways to maintain
> backwards compatibility.  Especially since (correct me if I am wrong)
> the simpler file systems which always write out the inode in the case
> of a non-blocking write_inode, say, like say the fat file system, are
> immune from this specific problem, right?

Yes.  Also one of the patches in Nick's series actually implements
this already.  Looks like we lost that fact when arguing about other
things..

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2010-12-18  1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
  2010-12-29 15:12   ` Christoph Hellwig
@ 2011-01-07 19:06   ` Ted Ts'o
  1 sibling, 0 replies; 42+ messages in thread
From: Ted Ts'o @ 2011-01-07 19:06 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel, Andrew Morton

On Sat, Dec 18, 2010 at 12:46:42PM +1100, Nick Piggin wrote:
> --- linux-2.6.orig/Documentation/filesystems/porting	2010-12-18 00:32:44.000000000 +1100
> +++ linux-2.6/Documentation/filesystems/porting	2010-12-18 03:51:40.000000000 +1100
> +--
> +[mandatory]
> +	Inode writeback and syncing has undergone an overhaul.
> +write_inode_now doesn't sync (unless ->write_inode *always* syncs)
> +

Could you perhaps give a bit more detail in the documentation about
what sync_inode() is supposed to do?  I gather that it is supposed to
(based on the flags), do the specified combination of (a) do writeback
for the pages corresponding to the specified range (if INODE_SYNC_DATA
is given), (b) write out the inode if necessary depending on
INODE_SYNC_DATA_METADATA (for fdatasync) and INODE_SYNC_DATA (for
fsync).

When sync_inode() returns, what is guaranteed is that the appropriate
buffer_heads are dirtied (for both the metadata and the data if the
file system is still using the buffer cache for data I/O) and/or the
data has been pushed to the block I/O system, but the sync_inode()
method does not wait force the dirtied buffers out to disk, and it
does not wait for any initiated I/O to complete (either at the block
I/O layer or guaranteeing that the data has hit the platters by
issuing a barrier request).

Do I have that right?  If so, could we please add that to the
documentation, just so it's clear?  Thanks!!

						- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 5/8] fs: ext2 inode sync fix
  2010-12-18  1:46 ` [patch 5/8] fs: ext2 inode sync fix Nick Piggin
@ 2011-01-07 19:08   ` Ted Ts'o
  0 siblings, 0 replies; 42+ messages in thread
From: Ted Ts'o @ 2011-01-07 19:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel, Andrew Morton

On Sat, Dec 18, 2010 at 12:46:39PM +1100, Nick Piggin wrote:
> There is a big fuckup with inode metadata writeback (I suspect in a lot
> of filesystems, but I've only dared to look at a couple as yet).
> 
> ext2 relies on ->write_inode being called from sync_inode_metadata in
> fsync in order to sync the inode. However I_DIRTY_SYNC gets cleared
> after a call to this guy, and it doesn't actually write back and wait
> on the inode block unless it is called for sync. This means that write_inode
> from background writeback can kill the inode dirty bits without the data
> getting to disk. Fsync will subsequently miss it.
> 
> The fix is for ->write_inode to dirty the buffers/cache, and then ->fsync to
> write back the dirty data. In the full filesystem sync case, buffercache
> writeback in the generic code will write back the dirty data. Other
> filesystems could use ->sync_fs for this.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>

What are your plans for this patch?  Is this something you were
planning on pushing during the merge window, or should file system
maintainers try to fix up their code, since I gather this particular
change is independent of the previous patches in this patchset.  Is
that correct?

						- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [patch 8/8] fs: add i_op->sync_inode
  2011-01-07  7:25                     ` Christoph Hellwig
@ 2011-01-11  3:44                       ` Nick Piggin
  0 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2011-01-11  3:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Andrew Morton

On Fri, Jan 07, 2011 at 02:25:10AM -0500, Christoph Hellwig wrote:
> On Fri, Jan 07, 2011 at 03:48:36PM +1100, Nick Piggin wrote:
> > OK, I don't exactly see why a sync_inode with appropriate flag could
> > not solve that problem. I'll take a bit more look through nfs and
> > xfs. Thanks...
> 
> We could also overload all inode operations into a single ioctl like
> method, but that doesn't make it a good design.

I've been a bit busy lately and haven't had time to continue this
discussion. Thanks for your input so far, and I'll get back to it soon.

I will try to get out a set of changes for current merge window that
fix the bulk of the problems without going into areas you disagree
with, and then we can have a smaller set of changes we to agree on for
next merge window. I will come back and reply to your points when I
get a bit more time.



^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2011-01-11  3:44 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-18  1:46 [patch 0/8] Inode data integrity patches Nick Piggin
2010-12-18  1:46 ` [patch 1/8] fs: mark_inode_dirty barrier fix Nick Piggin
2010-12-18  1:46 ` [patch 2/8] fs: simple fsync race fix Nick Piggin
2010-12-18  1:46 ` [patch 3/8] fs: introduce inode writeback helpers Nick Piggin
2010-12-18  1:46 ` [patch 4/8] fs: preserve inode dirty bits on failed metadata writeback Nick Piggin
2010-12-18  1:46 ` [patch 5/8] fs: ext2 inode sync fix Nick Piggin
2011-01-07 19:08   ` Ted Ts'o
2010-12-18  1:46 ` [patch 6/8] fs: fsync optimisations Nick Piggin
2010-12-18  1:46 ` [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems Nick Piggin
2010-12-29 15:01   ` Christoph Hellwig
2011-01-03 15:03     ` Steven Whitehouse
2011-01-03 16:58       ` Christoph Hellwig
2011-01-04  7:12         ` Nick Piggin
2011-01-04 14:22         ` Steven Whitehouse
2011-01-04  6:04     ` Nick Piggin
2011-01-04  6:39       ` Christoph Hellwig
2011-01-04  7:52         ` Nick Piggin
2011-01-04  9:13           ` Christoph Hellwig
2011-01-04  9:28             ` Nick Piggin
2010-12-18  1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
2010-12-29 15:12   ` Christoph Hellwig
2011-01-04  6:27     ` Nick Piggin
2011-01-04  6:57       ` Christoph Hellwig
2011-01-04  8:03         ` Nick Piggin
2011-01-04  8:31           ` Nick Piggin
2011-01-04  9:25             ` Christoph Hellwig
2011-01-04  9:52               ` Nick Piggin
2011-01-06 20:49                 ` Christoph Hellwig
2011-01-07  4:48                   ` Nick Piggin
2011-01-07  7:25                     ` Christoph Hellwig
2011-01-11  3:44                       ` Nick Piggin
2011-01-04  9:25           ` Christoph Hellwig
2011-01-04  9:49             ` Nick Piggin
2011-01-06 20:45               ` Christoph Hellwig
2011-01-07  4:47                 ` Nick Piggin
2011-01-07  7:24                   ` Christoph Hellwig
2011-01-07  7:29                     ` Christoph Hellwig
2011-01-07 13:10                       ` Christoph Hellwig
2011-01-07 18:30                       ` Ted Ts'o
2011-01-07 18:32                         ` Christoph Hellwig
2011-01-07 18:32                           ` Christoph Hellwig
2011-01-07 19:06   ` Ted Ts'o

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.