* [patch 1/6] mm: direct IO starvation improvement
@ 2008-12-10  7:24 Nick Piggin
  2008-12-10  7:25 ` [patch 2/6] fs: remove WB_SYNC_HOLD Nick Piggin
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-10  7:24 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Mikulas Patocka; +Cc: npiggin


Direct IO can invalidate and sync a lot of pagecache pages in the mapping. For
example, a 4K direct IO will actually try to sync and/or invalidate the
pagecache of the entire file, which might be many GB or TB in size.

Improve this by doing range syncs. Also, memory no longer has to be unmapped
to catch the dirty bits for syncing, because dirty mmap accounting keeps the
dirty bits coherent.

This fixes the immediate DM deadlocks when doing direct IO reads to a block
device with a mounted filesystem, if only by papering over the problem somewhat
rather than addressing the fsync starvation cases.
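
As an illustrative sketch (hypothetical values, not part of the patch): with
range syncs, a 4K direct IO read at a large offset writes back and waits on
only the pages it actually overlaps, instead of flushing [0, EOF]:

	loff_t pos = 1024 * 1024 * 1024;	/* e.g. 1GB into a huge file */
	size_t len = 4096;			/* one 4K direct IO */
	int retval;

	/* before: retval = filemap_write_and_wait(mapping); -- whole file */
	retval = filemap_write_and_wait_range(mapping, pos, pos + len - 1);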

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 mm/filemap.c |   16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1317,7 +1317,8 @@ generic_file_aio_read(struct kiocb *iocb
 			goto out; /* skip atime */
 		size = i_size_read(inode);
 		if (pos < size) {
-			retval = filemap_write_and_wait(mapping);
+			retval = filemap_write_and_wait_range(mapping, pos,
+					pos + iov_length(iov, nr_segs) - 1);
 			if (!retval) {
 				retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
@@ -2060,18 +2061,10 @@ generic_file_direct_write(struct kiocb *
 	if (count != ocount)
 		*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
 
-	/*
-	 * Unmap all mmappings of the file up-front.
-	 *
-	 * This will cause any pte dirty bits to be propagated into the
-	 * pageframes for the subsequent filemap_write_and_wait().
-	 */
 	write_len = iov_length(iov, *nr_segs);
 	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
-	if (mapping_mapped(mapping))
-		unmap_mapping_range(mapping, pos, write_len, 0);
 
-	written = filemap_write_and_wait(mapping);
+	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
 	if (written)
 		goto out;
 
@@ -2286,7 +2279,8 @@ generic_file_buffered_write(struct kiocb
 	 * the file data here, to try to honour O_DIRECT expectations.
 	 */
 	if (unlikely(file->f_flags & O_DIRECT) && written)
-		status = filemap_write_and_wait(mapping);
+		status = filemap_write_and_wait_range(mapping,
+					pos, pos + written - 1);
 
 	return written ? written : status;
 }


* [patch 2/6] fs: remove WB_SYNC_HOLD
  2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
@ 2008-12-10  7:25 ` Nick Piggin
  2008-12-10  7:27 ` [patch 3/6] fs: sync_sb_inodes fix Nick Piggin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-10  7:25 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Mikulas Patocka


Remove WB_SYNC_HOLD. The primary motivation is the design of my
anti-starvation code for fsync. It requires taking an inode lock
over the sync operation, so we could run into lock ordering problems
with multiple inodes. It is possible to take a single global lock to
solve the ordering problem, but then that would prevent a future
nice implementation of "sync multiple inodes" based on lock order via
inode address.

Seems like a backward step to remove this, but actually it is busted
anyway: we can't use the inode lists for the data integrity wait, because an
inode can be taken off the dirty lists while still under writeback.
In order to satisfy data integrity semantics, we should wait for it
to finish writeback, but if we only search the dirty lists, we'll miss
it.
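
A sketch of the window (illustrative timeline, not code from the patch):

	sync pass 1 (write): __writeback_single_inode() starts writeout, and
	                     the inode drops off the dirty lists while its
	                     pages are still under writeback
	sync pass 2 (wait):  only the dirty lists are searched, so the inode
	                     is missed and sync returns before its writeback
	                     completes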

It would be possible to have a "writeback" list for sys_sync, I suppose.
But why complicate things by optimising prematurely? For unmounting, we could
avoid the "livelock avoidance" code, which would be easier, but again that
seems premature IMO.

Fixing the existing data integrity problem will come next.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/fs-writeback.c         |   12 ++----------
 include/linux/writeback.h |    1 -
 2 files changed, 2 insertions(+), 11 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -421,9 +421,6 @@ __writeback_single_inode(struct inode *i
  * If we're a pdlfush thread, then implement pdflush collision avoidance
  * against the entire list.
  *
- * WB_SYNC_HOLD is a hack for sys_sync(): reattach the inode to sb->s_dirty so
- * that it can be located for waiting on in __writeback_single_inode().
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
@@ -499,10 +496,6 @@ void generic_sync_sb_inodes(struct super
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);
-		if (wbc->sync_mode == WB_SYNC_HOLD) {
-			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
-		}
 		if (current_is_pdflush())
 			writeback_release(bdi);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -588,8 +581,7 @@ restart:
 
 /*
  * writeback and wait upon the filesystem's dirty inodes.  The caller will
- * do this in two passes - one to write, and one to wait.  WB_SYNC_HOLD is
- * used to park the written inodes on sb->s_dirty for the wait pass.
+ * do this in two passes - one to write, and one to wait.
  *
  * A finite limit is set on the number of pages which will be written.
  * To prevent infinite livelock of sys_sync().
@@ -600,7 +592,7 @@ restart:
 void sync_inodes_sb(struct super_block *sb, int wait)
 {
 	struct writeback_control wbc = {
-		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_HOLD,
+		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -30,7 +30,6 @@ static inline int task_is_pdflush(struct
 enum writeback_sync_modes {
 	WB_SYNC_NONE,	/* Don't wait on anything */
 	WB_SYNC_ALL,	/* Wait on every mapping */
-	WB_SYNC_HOLD,	/* Hold the inode on sb_dirty for sys_sync() */
 };
 
 /*


* [patch 3/6] fs: sync_sb_inodes fix
  2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
  2008-12-10  7:25 ` [patch 2/6] fs: remove WB_SYNC_HOLD Nick Piggin
@ 2008-12-10  7:27 ` Nick Piggin
  2008-12-11 21:51   ` Andrew Morton
  2008-12-10  7:27 ` [patch 4/6] fs: sys_sync fix Nick Piggin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-10  7:27 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Mikulas Patocka

Fix the data integrity semantics required by sys_sync, by iterating over all
inodes and waiting for any pages under writeback after the initial writeout.
The comments explain the exact problem.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/fs-writeback.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 7 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -440,6 +440,7 @@ void generic_sync_sb_inodes(struct super
 				struct writeback_control *wbc)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
+	int sync = wbc->sync_mode == WB_SYNC_ALL;
 
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&sb->s_io))
@@ -516,7 +517,49 @@ void generic_sync_sb_inodes(struct super
 		if (!list_empty(&sb->s_more_io))
 			wbc->more_io = 1;
 	}
-	spin_unlock(&inode_lock);
+
+	if (sync) {
+		struct inode *inode, *old_inode = NULL;
+
+		/*
+		 * Data integrity sync. Must wait for all pages under writeback,
+		 * because there may have been pages dirtied before our sync
+		 * call, but which had writeout started before we write it out.
+		 * In which case, the inode may not be on the dirty list, but
+		 * we still have to wait for that writeout.
+		 */
+		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+			struct address_space *mapping;
+
+			if (inode->i_state & (I_FREEING|I_WILL_FREE))
+				continue;
+			mapping = inode->i_mapping;
+			if (mapping->nrpages == 0)
+				continue;
+			__iget(inode);
+			spin_unlock(&inode_lock);
+			/*
+			 * We hold a reference to 'inode' so it couldn't have
+			 * been removed from s_inodes list while we dropped the
+			 * inode_lock.  We cannot iput the inode now as we can
+			 * be holding the last reference and we cannot iput it
+			 * under inode_lock. So we keep the reference and iput
+			 * it later.
+			 */
+			iput(old_inode);
+			old_inode = inode;
+
+			filemap_fdatawait(mapping);
+
+			cond_resched();
+
+			spin_lock(&inode_lock);
+		}
+		spin_unlock(&inode_lock);
+		iput(old_inode);
+	} else
+		spin_unlock(&inode_lock);
+
 	return;		/* Leave any unwritten inodes on s_io */
 }
 EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
@@ -596,13 +639,16 @@ void sync_inodes_sb(struct super_block *
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
-	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
-	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 
-	wbc.nr_to_write = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
-			nr_dirty + nr_unstable;
-	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
+	if (!wait) {
+		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
+		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+
+		wbc.nr_to_write = nr_dirty + nr_unstable +
+			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	} else
+		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
+
 	sync_sb_inodes(sb, &wbc);
 }
 


* [patch 4/6] fs: sys_sync fix
  2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
  2008-12-10  7:25 ` [patch 2/6] fs: remove WB_SYNC_HOLD Nick Piggin
  2008-12-10  7:27 ` [patch 3/6] fs: sync_sb_inodes fix Nick Piggin
@ 2008-12-10  7:27 ` Nick Piggin
  2008-12-10  7:28 ` [patch 5/6] radix-tree: gang set if tagged operation Nick Piggin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-10  7:27 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Mikulas Patocka


The s_syncing livelock avoidance was breaking the data integrity guarantee of
sys_sync, by allowing sys_sync to skip writing or waiting for superblocks if
there is a concurrent sys_sync happening.

This livelock avoidance is much less important now that we don't have the
get_super_to_sync() call after every sb that we sync. This was replaced by
__put_super_and_need_restart.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 fs/fs-writeback.c  |   20 +-------------------
 include/linux/fs.h |    1 -
 2 files changed, 1 insertion(+), 20 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -652,18 +652,6 @@ void sync_inodes_sb(struct super_block *
 	sync_sb_inodes(sb, &wbc);
 }
 
-/*
- * Rather lame livelock avoidance.
- */
-static void set_sb_syncing(int val)
-{
-	struct super_block *sb;
-	spin_lock(&sb_lock);
-	list_for_each_entry_reverse(sb, &super_blocks, s_list)
-		sb->s_syncing = val;
-	spin_unlock(&sb_lock);
-}
-
 /**
  * sync_inodes - writes all inodes to disk
  * @wait: wait for completion
@@ -690,9 +678,6 @@ static void __sync_inodes(int wait)
 	spin_lock(&sb_lock);
 restart:
 	list_for_each_entry(sb, &super_blocks, s_list) {
-		if (sb->s_syncing)
-			continue;
-		sb->s_syncing = 1;
 		sb->s_count++;
 		spin_unlock(&sb_lock);
 		down_read(&sb->s_umount);
@@ -710,13 +695,10 @@ restart:
 
 void sync_inodes(int wait)
 {
-	set_sb_syncing(0);
 	__sync_inodes(0);
 
-	if (wait) {
-		set_sb_syncing(0);
+	if (wait)
 		__sync_inodes(1);
-	}
 }
 
 /**
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1121,7 +1121,6 @@ struct super_block {
 	struct rw_semaphore	s_umount;
 	struct mutex		s_lock;
 	int			s_count;
-	int			s_syncing;
 	int			s_need_sync_fs;
 	atomic_t		s_active;
 #ifdef CONFIG_SECURITY


* [patch 5/6] radix-tree: gang set if tagged operation
  2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
                   ` (2 preceding siblings ...)
  2008-12-10  7:27 ` [patch 4/6] fs: sys_sync fix Nick Piggin
@ 2008-12-10  7:28 ` Nick Piggin
  2008-12-11 21:20   ` Andrew Morton
  2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
  2008-12-12 16:04 ` [patch 1/6] mm: direct IO starvation improvement Jeff Moyer
  5 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-10  7:28 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Mikulas Patocka


Add a "radix_tree_gang_set_if_tagged" operation, which takes a range of indexes,
and sets a given tag, if any of the specified tags were found to be set. Used by
the next patch to set an "fsync" bit on any dirty and writeback pages.
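
For example, the next patch calls it under the mapping's tree lock, roughly
like this (a condensed sketch of the write_cache_pages() hunk there):

	spin_lock_irq(&mapping->tree_lock);
	radix_tree_gang_tag_set_if_tagged(&mapping->page_tree, index, end,
			(1UL << PAGECACHE_TAG_DIRTY) |
			(1UL << PAGECACHE_TAG_WRITEBACK),
			(1UL << PAGECACHE_TAG_FSYNC));
	spin_unlock_irq(&mapping->tree_lock);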

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 include/linux/radix-tree.h |    1 
 lib/radix-tree.c           |  107 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -183,6 +183,7 @@ unsigned int
 radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
+int radix_tree_gang_tag_set_if_tagged(struct radix_tree_root *root, unsigned long first_index, unsigned long last_index, unsigned long ifset, unsigned long thentag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -629,6 +629,113 @@ int radix_tree_tag_get(struct radix_tree
 EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
+int node_tag_set_if_tagged(struct radix_tree_node *node, unsigned int height, unsigned long first_index, unsigned long last_index, unsigned long ifset, unsigned long thentag)
+{
+	unsigned long first, last;
+	int offset_start, offset_end, i;
+	int ret = 0;
+	unsigned int shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	unsigned int size = RADIX_TREE_MAP_SIZE << shift;
+
+	first = first_index;
+	last = min(last_index, ((first + size) & ~(size-1)) - 1);
+
+	offset_start = (first_index >> shift) & RADIX_TREE_MAP_MASK;
+	offset_end = (last_index >> shift) & RADIX_TREE_MAP_MASK;
+	for (i = offset_start; i <= offset_end; i++) {
+
+		if (height > 1) {
+			struct radix_tree_node *slot;
+
+			slot = node->slots[i];
+			if (!slot)
+				goto notset;
+
+			if (node_tag_set_if_tagged(slot, height - 1,
+						first, last,
+						ifset, thentag)) {
+				if (ifset == 0x1)
+					BUG_ON(!tag_get(node, 0, i));
+				if (ifset == 0x2)
+					BUG_ON(!tag_get(node, 1, i));
+				if (ifset == 0x3)
+					BUG_ON(!tag_get(node, 0, i) &&
+						!tag_get(node, 1, i));
+				if (thentag & 0x4) {
+					if (!tag_get(node, 2, i))
+						tag_set(node, 2, i);
+				}
+				ret = 1;
+			}
+notset:
+			first = last+1;
+			last = min(last_index, first + size - 1);
+
+		} else {
+			if (ifset & 0x1) {
+				if (tag_get(node, 0, i))
+					goto isset;
+			}
+			if (ifset & 0x2) {
+				if (tag_get(node, 1, i))
+					goto isset;
+			}
+			continue;
+isset:
+			if (thentag & 0x4) {
+				if (!tag_get(node, 2, i))
+					tag_set(node, 2, i);
+			}
+			ret = 1;
+		}
+	}
+
+	return ret;
+}
+
+int radix_tree_gang_tag_set_if_tagged(struct radix_tree_root *root, unsigned long first_index, unsigned long last_index, unsigned long ifset, unsigned long thentag)
+{
+	unsigned int height;
+	struct radix_tree_node *slot;
+
+	BUG_ON(ifset & ~0x3);
+	BUG_ON(thentag & ~0x4);
+
+	height = root->height;
+	if (height == 0) {
+		/* set the root's tag bit */
+		if (first_index != 0)
+			return 0;
+
+		if (ifset & 0x1)
+			if (root_tag_get(root, 0))
+				goto isset;
+		if (ifset & 0x2)
+			if (root_tag_get(root, 1))
+				goto isset;
+		return 0;
+isset:
+		if (thentag & 0x4) {
+			if (!root_tag_get(root, 2))
+				root_tag_set(root, 2);
+		}
+		return 1;
+	}
+
+	slot = radix_tree_indirect_to_ptr(root->rnode);
+	if (node_tag_set_if_tagged(slot, height, first_index, last_index, ifset, thentag)) {
+		/* set the root's tag bit */
+		if (thentag & 0x4) {
+			if (!root_tag_get(root, 2))
+				root_tag_set(root, 2);
+		}
+
+		return 1;
+	}
+
+	return 0;
+}
+
 /**
  *	radix_tree_next_hole    -    find the next hole (not-present entry)
  *	@root:		tree root


* [patch 6/6] mm: fsync livelock avoidance
  2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
                   ` (3 preceding siblings ...)
  2008-12-10  7:28 ` [patch 5/6] radix-tree: gang set if tagged operation Nick Piggin
@ 2008-12-10  7:42 ` Nick Piggin
  2008-12-10  9:15   ` steve
                     ` (3 more replies)
  2008-12-12 16:04 ` [patch 1/6] mm: direct IO starvation improvement Jeff Moyer
  5 siblings, 4 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-10  7:42 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Mikulas Patocka

OK, there has not been any further discussion on this approach since I last
posted it, so I am going to go out on a limb and suggest that we take this
approach, if any, rather than Mikulas' one.

The advantages of my approach I think:
- nothing added to non-sync fastpaths
- close to theoretically fewest number of pages will be written / waited for
  in the fsync case
- works nicely even in unusual cases (eg. file with 1GB of dirty data, but
  the fsync_range only needs to sync a few pages will not get stuck behind
  a concurrent dirtier)

And some comments:
- adds 8 bytes to the radix tree node, but this doesn't really change how it
  fits into slab or cachelines, so the effective impact of this addition is
  basically zero.
- adds an extra lock, but as per the comments, this lock seems to be required
  in order to fix a bug anyway. And we already tend to hold i_mutex over at
  least some of the fsync operation. Although if anybody thinks this will be
  a problem, I'd like to hear.

Disadvantages:
- more complex. Although in a way I consider Mikulas' change to have
  more complex heuristics, which I don't like. I think Mikulas' version
  would be more complex to analyse at runtime. Also, much of the complexity
  comes from the extra lock, which as I said fixes a bug.

Any additions or disputes? :)

--
This patch fixes fsync starvation problems in the presence of concurrent
dirtiers.

To take an extreme example: if thread A calls fsync on a file with one dirty
page, at index 1 000 000; at the same time, thread B starts dirtying the
file from offset 0 onwards.

Thread B perhaps will be allowed to dirty 1 000 pages before hitting its dirty
threshold, then it will start throttling. Thread A will start writing out B's
pages. They'll proceed more or less in lockstep until thread B finishes
writing.

While these semantics are correct, we'd prefer a more timely notification that
pages dirty at the time fsync was called are safely on disk. In the above
scenario, we may have to wait until many times the machine's RAM capacity has
been written to disk before the fsync returns success. Ideally, thread A would
write the single page at index 1 000 000, then return.

This patch introduces a new pagecache tag, PAGECACHE_TAG_FSYNC. Data integrity
syncs then start by looking through the pagecache for pages which are DIRTY
and/or WRITEBACK within the requested range, and tagging all those as FSYNC.

Subsequent writeout and wait phases need then only look up those pages in the
pagecache which are tagged with PAGECACHE_TAG_FSYNC.

After the sync operation has completed, the FSYNC tags are removed from the
radix tree. This design requires exclusive usage of the FSYNC tags for the
duration of a sync operation, so a lock on the address space is required.
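
The resulting sequence is roughly the following (a condensed sketch of the
sync_page_range() change below; metadata sync and error handling omitted):

	mapping_fsync_lock(mapping);	/* exclusive user of the FSYNC tag */
	/* writeout: write_cache_pages() first tags all DIRTY/WRITEBACK pages
	 * in the range as FSYNC, then writes the FSYNC-tagged pages */
	filemap_fdatawrite_range(mapping, pos, pos + count - 1);
	/* wait: wait on FSYNC-tagged pages only, clearing the tag */
	wait_on_page_writeback_range_fsync(mapping, start, end);
	mapping_fsync_unlock(mapping);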

For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
logic from a couple of places. I don't know if this is really worth the added
complexity (EIO will still get reported, but it will just take a bit longer;
an app can't rely on specific behaviour or timeliness here).

This lock also solves a real data integrity problem that I only noticed as
I was writing the livelock avoidance code. If we consider the lock as the
solution to this bug, this makes the livelock avoidance code much more
attractive because then it does not introduce the new lock.

The bug is that fsync errors do not get propagated back up to the caller
properly in some cases. Consider where we write a page in the writeout path,
then it encounters an IO error and finishes writeback, in the meantime, another
process (eg. via sys_sync, or another fsync) clears the mapping error bits.
Then our fsync will have appeared to finish successfully, but actually should
have returned error.
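
Concretely (an illustrative interleaving):

	fsync A:    writes page; IO error completes writeback and sets AS_EIO
	sys_sync B: test_and_clear_bit(AS_EIO) reports the error to B and
	            clears the mapping error bit
	fsync A:    wait phase finds AS_EIO already clear and returns 0, when
	            it should have returned -EIO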

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
 drivers/usb/gadget/file_storage.c |    4 -
 fs/cifs/cifsfs.c                  |    7 +
 fs/fs-writeback.c                 |   13 ++-
 fs/gfs2/glops.c                   |    9 +-
 fs/gfs2/meta_io.c                 |    4 -
 fs/gfs2/ops_file.c                |   13 ++-
 fs/nfs/delegation.c               |    8 +-
 fs/nfsd/vfs.c                     |   10 +-
 fs/ocfs2/dlmglue.c                |    4 -
 fs/sync.c                         |   11 +-
 fs/xfs/linux-2.6/xfs_fs_subr.c    |   21 +++--
 include/linux/fs.h                |    7 +
 include/linux/pagemap.h           |    6 +
 include/linux/radix-tree.h        |    2 
 mm/filemap.c                      |  148 ++++++++++++++++++++++++++++++--------
 mm/page-writeback.c               |   35 ++++++++
 16 files changed, 237 insertions(+), 65 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -582,10 +582,12 @@ struct block_device {
 
 /*
  * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
- * radix trees
+ * radix trees. Also, to snapshot all pages required to be fsync'ed in order
+ * to obey data integrity semantics.
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_FSYNC	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
@@ -1808,11 +1810,14 @@ extern int write_inode_now(struct inode
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern int filemap_fdatawait_fsync(struct address_space *);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
 extern int wait_on_page_writeback_range(struct address_space *mapping,
 				pgoff_t start, pgoff_t end);
+extern int wait_on_page_writeback_range_fsync(struct address_space *mapping,
+				pgoff_t start, pgoff_t end);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -147,6 +147,28 @@ void remove_from_page_cache(struct page
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
+static int sleep_on_fsync(void *word)
+{
+	io_schedule();
+	return 0;
+}
+
+void mapping_fsync_lock(struct address_space *mapping)
+{
+	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
+							TASK_UNINTERRUPTIBLE);
+	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
+}
+
+void mapping_fsync_unlock(struct address_space *mapping)
+{
+	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
+	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
+	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
+	smp_mb__after_clear_bit();
+	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
+}
+
 static int sync_page(void *word)
 {
 	struct address_space *mapping;
@@ -287,7 +309,64 @@ int wait_on_page_writeback_range(struct
 
 			/* until radix tree lookup accepts end_index */
 			if (page->index > end)
-				continue;
+				break;
+
+			wait_on_page_writeback(page);
+			if (PageError(page))
+				ret = -EIO;
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+
+	/* Check for outstanding write errors */
+	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
+		ret = -ENOSPC;
+	if (test_and_clear_bit(AS_EIO, &mapping->flags))
+		ret = -EIO;
+
+	return ret;
+}
+
+int wait_on_page_writeback_range_fsync(struct address_space *mapping,
+				pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	int nr_pages;
+	int ret = 0;
+	pgoff_t index;
+
+	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
+
+	if (end < start)
+		goto out;
+
+	pagevec_init(&pvec, 0);
+	index = start;
+	while ((index <= end) &&
+			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			PAGECACHE_TAG_FSYNC,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
+		unsigned i;
+
+		spin_lock_irq(&mapping->tree_lock);
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/* until radix tree lookup accepts end_index */
+			if (page->index > end)
+				break;
+
+			radix_tree_tag_clear(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/* until radix tree lookup accepts end_index */
+			if (page->index > end)
+				break;
 
 			wait_on_page_writeback(page);
 			if (PageError(page))
@@ -303,6 +382,7 @@ int wait_on_page_writeback_range(struct
 	if (test_and_clear_bit(AS_EIO, &mapping->flags))
 		ret = -EIO;
 
+out:
 	return ret;
 }
 
@@ -325,18 +405,20 @@ int sync_page_range(struct inode *inode,
 {
 	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
 	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
+	int ret, ret2;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
 		return 0;
+	mutex_lock(&inode->i_mutex);
+	mapping_fsync_lock(mapping);
 	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0) {
-		mutex_lock(&inode->i_mutex);
+	if (ret == 0)
 		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		mutex_unlock(&inode->i_mutex);
-	}
+	mutex_unlock(&inode->i_mutex);
+	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
 	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
+		ret = ret2;
+	mapping_fsync_unlock(mapping);
 	return ret;
 }
 EXPORT_SYMBOL(sync_page_range);
@@ -357,15 +439,18 @@ int sync_page_range_nolock(struct inode
 {
 	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
 	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
+	int ret, ret2;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
 		return 0;
+	mapping_fsync_lock(mapping);
 	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
 	if (ret == 0)
 		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
+	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
 	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
+		ret = ret2;
+	mapping_fsync_unlock(mapping);
 	return ret;
 }
 EXPORT_SYMBOL(sync_page_range_nolock);
@@ -389,23 +474,30 @@ int filemap_fdatawait(struct address_spa
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
+int filemap_fdatawait_fsync(struct address_space *mapping)
+{
+	loff_t i_size = i_size_read(mapping->host);
+
+	if (i_size == 0)
+		return 0;
+
+	return wait_on_page_writeback_range_fsync(mapping, 0,
+				(i_size - 1) >> PAGE_CACHE_SHIFT);
+}
+
 int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
 
 	if (mapping->nrpages) {
+		int err2;
+
+		mapping_fsync_lock(mapping);
 		err = filemap_fdatawrite(mapping);
-		/*
-		 * Even if the above returned error, the pages may be
-		 * written partially (e.g. -ENOSPC), so we wait for it.
-		 * But the -EIO is special case, it may indicate the worst
-		 * thing (e.g. bug) happened, so we avoid waiting for it.
-		 */
-		if (err != -EIO) {
-			int err2 = filemap_fdatawait(mapping);
-			if (!err)
-				err = err2;
-		}
+		err2 = filemap_fdatawait_fsync(mapping);
+		if (!err)
+			err = err2;
+		mapping_fsync_unlock(mapping);
 	}
 	return err;
 }
@@ -428,16 +520,16 @@ int filemap_write_and_wait_range(struct
 	int err = 0;
 
 	if (mapping->nrpages) {
-		err = __filemap_fdatawrite_range(mapping, lstart, lend,
-						 WB_SYNC_ALL);
-		/* See comment of filemap_write_and_wait() */
-		if (err != -EIO) {
-			int err2 = wait_on_page_writeback_range(mapping,
+		int err2;
+
+		mapping_fsync_lock(mapping);
+		err = filemap_fdatawrite_range(mapping, lstart, lend);
+		err2 = wait_on_page_writeback_range_fsync(mapping,
 						lstart >> PAGE_CACHE_SHIFT,
 						lend >> PAGE_CACHE_SHIFT);
-			if (!err)
-				err = err2;
-		}
+		if (!err)
+			err = err2;
+		mapping_fsync_unlock(mapping);
 	}
 	return err;
 }
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -872,9 +872,11 @@ int write_cache_pages(struct address_spa
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
+	unsigned int tag = PAGECACHE_TAG_DIRTY;
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	int sync = wbc->sync_mode != WB_SYNC_NONE;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
+
+	if (sync) {
+		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
+		/*
+		 * If any pages are writeback or dirty, mark them fsync now.
+		 * These are the pages we need to wait on in order to meet our
+		 * data integrity contract.
+		 *
+		 * Writeback pages need to be tagged, so we'll wait for them
+		 * at the end of the writeout phase. However, the lookup below
+		 * could just look up pages which are _DIRTY AND _FSYNC,
+		 * because we don't care about them for the writeout phase.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
+							index, end,
+				(1UL << PAGECACHE_TAG_DIRTY) |
+				(1UL << PAGECACHE_TAG_WRITEBACK),
+				(1UL << PAGECACHE_TAG_FSYNC))) {
+			/* nothing tagged */
+			spin_unlock_irq(&mapping->tree_lock);
+			return 0;
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+		tag = PAGECACHE_TAG_FSYNC;
+	}
+
 retry:
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+			      tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -951,7 +980,7 @@ continue_unlock:
 			}
 
 			if (PageWriteback(page)) {
-				if (wbc->sync_mode != WB_SYNC_NONE)
+				if (sync)
 					wait_on_page_writeback(page);
 				else
 					goto continue_unlock;
@@ -981,7 +1010,7 @@ continue_unlock:
 				}
  			}
 
-			if (wbc->sync_mode == WB_SYNC_NONE) {
+			if (!sync) {
 				wbc->nr_to_write--;
 				if (wbc->nr_to_write <= 0) {
 					done = 1;
Index: linux-2.6/drivers/usb/gadget/file_storage.c
===================================================================
--- linux-2.6.orig/drivers/usb/gadget/file_storage.c
+++ linux-2.6/drivers/usb/gadget/file_storage.c
@@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
 
 	inode = filp->f_path.dentry->d_inode;
 	mutex_lock(&inode->i_mutex);
+	mapping_fsync_lock(inode->i_mapping);
 	rc = filemap_fdatawrite(inode->i_mapping);
 	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
 	if (!rc)
 		rc = err;
-	err = filemap_fdatawait(inode->i_mapping);
+	err = filemap_fdatawait_fsync(inode->i_mapping);
 	if (!rc)
 		rc = err;
+	mapping_fsync_unlock(inode->i_mapping);
 	mutex_unlock(&inode->i_mutex);
 	VLDBG(curlun, "fdatasync -> %d\n", rc);
 	return rc;
Index: linux-2.6/fs/cifs/cifsfs.c
===================================================================
--- linux-2.6.orig/fs/cifs/cifsfs.c
+++ linux-2.6/fs/cifs/cifsfs.c
@@ -992,12 +992,15 @@ static int cifs_oplock_thread(void *dumm
 				else if (CIFS_I(inode)->clientCanCacheRead == 0)
 					break_lease(inode, FMODE_WRITE);
 #endif
+				mapping_fsync_lock(inode->i_mapping);
 				rc = filemap_fdatawrite(inode->i_mapping);
 				if (CIFS_I(inode)->clientCanCacheRead == 0) {
-					waitrc = filemap_fdatawait(
+					waitrc = filemap_fdatawait_fsync(
 							      inode->i_mapping);
+					mapping_fsync_unlock(inode->i_mapping);
 					invalidate_remote_inode(inode);
-				}
+				} else
+					mapping_fsync_unlock(inode->i_mapping);
 				if (rc == 0)
 					rc = waitrc;
 			} else
Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c
+++ linux-2.6/fs/fs-writeback.c
@@ -282,6 +282,9 @@ __sync_single_inode(struct inode *inode,
 
 	spin_unlock(&inode_lock);
 
+	if (wait)
+		mapping_fsync_lock(mapping);
+
 	ret = do_writepages(mapping, wbc);
 
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
@@ -292,9 +295,10 @@ __sync_single_inode(struct inode *inode,
 	}
 
 	if (wait) {
-		int err = filemap_fdatawait(mapping);
+		int err = filemap_fdatawait_fsync(mapping);
 		if (ret == 0)
 			ret = err;
+		mapping_fsync_unlock(mapping);
 	}
 
 	spin_lock(&inode_lock);
@@ -779,17 +783,20 @@ int generic_osync_inode(struct inode *in
 	int need_write_inode_now = 0;
 	int err2;
 
-	if (what & OSYNC_DATA)
+	if (what & OSYNC_DATA) {
+		mapping_fsync_lock(mapping);
 		err = filemap_fdatawrite(mapping);
+	}
 	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
 		err2 = sync_mapping_buffers(mapping);
 		if (!err)
 			err = err2;
 	}
 	if (what & OSYNC_DATA) {
-		err2 = filemap_fdatawait(mapping);
+		err2 = filemap_fdatawait_fsync(mapping);
 		if (!err)
 			err = err2;
+		mapping_fsync_unlock(mapping);
 	}
 
 	spin_lock(&inode_lock);
Index: linux-2.6/fs/gfs2/glops.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glops.c
+++ linux-2.6/fs/gfs2/glops.c
@@ -158,15 +158,20 @@ static void inode_go_sync(struct gfs2_gl
 
 	if (test_bit(GLF_DIRTY, &gl->gl_flags)) {
 		gfs2_log_flush(gl->gl_sbd, gl);
+		mapping_fsync_lock(metamapping);
 		filemap_fdatawrite(metamapping);
 		if (ip) {
 			struct address_space *mapping = ip->i_inode.i_mapping;
+
+			mapping_fsync_lock(mapping);
 			filemap_fdatawrite(mapping);
-			error = filemap_fdatawait(mapping);
+			error = filemap_fdatawait_fsync(mapping);
 			mapping_set_error(mapping, error);
+			mapping_fsync_unlock(mapping);
 		}
-		error = filemap_fdatawait(metamapping);
+		error = filemap_fdatawait_fsync(metamapping);
 		mapping_set_error(metamapping, error);
+		mapping_fsync_unlock(metamapping);
 		clear_bit(GLF_DIRTY, &gl->gl_flags);
 		gfs2_ail_empty_gl(gl);
 	}
Index: linux-2.6/fs/gfs2/meta_io.c
===================================================================
--- linux-2.6.orig/fs/gfs2/meta_io.c
+++ linux-2.6/fs/gfs2/meta_io.c
@@ -121,8 +121,10 @@ void gfs2_meta_sync(struct gfs2_glock *g
 	struct address_space *mapping = gl->gl_aspace->i_mapping;
 	int error;
 
+	mapping_fsync_lock(mapping);
 	filemap_fdatawrite(mapping);
-	error = filemap_fdatawait(mapping);
+	error = filemap_fdatawait_fsync(mapping);
+	mapping_fsync_unlock(mapping);
 
 	if (error)
 		gfs2_io_error(gl->gl_sbd);
Index: linux-2.6/fs/gfs2/ops_file.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_file.c
+++ linux-2.6/fs/gfs2/ops_file.c
@@ -244,12 +244,17 @@ static int do_gfs2_set_flags(struct file
 			goto out;
 	}
 	if ((flags ^ new_flags) & GFS2_DIF_JDATA) {
+		struct address_space *mapping = inode->i_mapping;
+		int error2;
+
 		if (flags & GFS2_DIF_JDATA)
 			gfs2_log_flush(sdp, ip->i_gl);
-		error = filemap_fdatawrite(inode->i_mapping);
-		if (error)
-			goto out;
-		error = filemap_fdatawait(inode->i_mapping);
+		mapping_fsync_lock(mapping);
+		error = filemap_fdatawrite(mapping);
+		error2 = filemap_fdatawait_fsync(mapping);
+		mapping_fsync_unlock(mapping);
+		if (!error)
+			error = error2;
 		if (error)
 			goto out;
 	}
Index: linux-2.6/fs/nfs/delegation.c
===================================================================
--- linux-2.6.orig/fs/nfs/delegation.c
+++ linux-2.6/fs/nfs/delegation.c
@@ -216,9 +216,13 @@ out:
 /* Sync all data to disk upon delegation return */
 static void nfs_msync_inode(struct inode *inode)
 {
-	filemap_fdatawrite(inode->i_mapping);
+	struct address_space *mapping = inode->i_mapping;
+
+	mapping_fsync_lock(mapping);
+	filemap_fdatawrite(mapping);
 	nfs_wb_all(inode);
-	filemap_fdatawait(inode->i_mapping);
+	filemap_fdatawait_fsync(mapping);
+	mapping_fsync_unlock(mapping);
 }
 
 /*
Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c
+++ linux-2.6/fs/nfsd/vfs.c
@@ -752,14 +752,18 @@ static inline int nfsd_dosync(struct fil
 			      const struct file_operations *fop)
 {
 	struct inode *inode = dp->d_inode;
+	struct address_space *mapping = inode->i_mapping;
 	int (*fsync) (struct file *, struct dentry *, int);
-	int err;
+	int err, err2;
 
-	err = filemap_fdatawrite(inode->i_mapping);
+	mapping_fsync_lock(mapping);
+	err = filemap_fdatawrite(mapping);
 	if (err == 0 && fop && (fsync = fop->fsync))
 		err = fsync(filp, dp, 0);
+	err2 = filemap_fdatawait_fsync(mapping);
 	if (err == 0)
-		err = filemap_fdatawait(inode->i_mapping);
+		err = err2;
+	mapping_fsync_unlock(mapping);
 
 	return err;
 }
Index: linux-2.6/fs/ocfs2/dlmglue.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlmglue.c
+++ linux-2.6/fs/ocfs2/dlmglue.c
@@ -3284,6 +3284,7 @@ static int ocfs2_data_convert_worker(str
 	 */
 	unmap_mapping_range(mapping, 0, 0, 0);
 
+	mapping_fsync_lock(mapping);
 	if (filemap_fdatawrite(mapping)) {
 		mlog(ML_ERROR, "Could not sync inode %llu for downconvert!",
 		     (unsigned long long)OCFS2_I(inode)->ip_blkno);
@@ -3297,8 +3298,9 @@ static int ocfs2_data_convert_worker(str
 		 * for us above. We don't truncate pages if we're
 		 * blocking anything < EXMODE because we want to keep
 		 * them around in that case. */
-		filemap_fdatawait(mapping);
+		filemap_fdatawait_fsync(mapping);
 	}
+	mapping_fsync_unlock(mapping);
 
 out:
 	return UNBLOCK_CONTINUE;
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c
+++ linux-2.6/fs/sync.c
@@ -87,20 +87,22 @@ long do_fsync(struct file *file, int dat
 		goto out;
 	}
 
-	ret = filemap_fdatawrite(mapping);
-
 	/*
 	 * We need to protect against concurrent writers, which could cause
 	 * livelocks in fsync_buffers_list().
 	 */
 	mutex_lock(&mapping->host->i_mutex);
+	mapping_fsync_lock(mapping);
+	ret = filemap_fdatawrite(mapping);
+
 	err = file->f_op->fsync(file, file->f_path.dentry, datasync);
 	if (!ret)
 		ret = err;
 	mutex_unlock(&mapping->host->i_mutex);
-	err = filemap_fdatawait(mapping);
+	err = filemap_fdatawait_fsync(mapping);
 	if (!ret)
 		ret = err;
+	mapping_fsync_unlock(mapping);
 out:
 	return ret;
 }
@@ -268,8 +270,7 @@ int do_sync_mapping_range(struct address
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
-		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
-						WB_SYNC_ALL);
+		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
 		if (ret < 0)
 			goto out;
 	}
Index: linux-2.6/fs/xfs/linux-2.6/xfs_fs_subr.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_fs_subr.c
+++ linux-2.6/fs/xfs/linux-2.6/xfs_fs_subr.c
@@ -19,6 +19,7 @@
 #include "xfs_vnodeops.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_inode.h"
+#include <linux/writeback.h>
 
 int  fs_noerr(void) { return 0; }
 int  fs_nosys(void) { return ENOSYS; }
@@ -66,16 +67,22 @@ xfs_flush_pages(
 {
 	struct address_space *mapping = VFS_I(ip)->i_mapping;
 	int		ret = 0;
-	int		ret2;
 
 	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		xfs_iflags_clear(ip, XFS_ITRUNCATED);
-		ret = filemap_fdatawrite(mapping);
-		if (flags & XFS_B_ASYNC)
-			return ret;
-		ret2 = filemap_fdatawait(mapping);
-		if (!ret)
-			ret = ret2;
+		if (flags & XFS_B_ASYNC) {
+			ret = __filemap_fdatawrite_range(mapping,
+					0, LLONG_MAX, WB_SYNC_NONE);
+		} else {
+			int ret2;
+
+			mapping_fsync_lock(mapping);
+			ret = filemap_fdatawrite(mapping);
+			ret2 = filemap_fdatawait_fsync(mapping);
+			if (!ret)
+				ret = ret2;
+			mapping_fsync_unlock(mapping);
+		}
 	}
 	return ret;
 }
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -21,6 +21,7 @@
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
 #define AS_MM_ALL_LOCKS	(__GFP_BITS_SHIFT + 2)	/* under mm_take_all_locks() */
+#define AS_FSYNC_LOCK	(__GFP_BITS_SHIFT + 3)	/* under fsync */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
@@ -33,7 +34,7 @@ static inline void mapping_set_error(str
 }
 
 #ifdef CONFIG_UNEVICTABLE_LRU
-#define AS_UNEVICTABLE	(__GFP_BITS_SHIFT + 2)	/* e.g., ramdisk, SHM_LOCK */
+#define AS_UNEVICTABLE	(__GFP_BITS_SHIFT + 4)	/* e.g., ramdisk, SHM_LOCK */
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
 {
@@ -60,6 +61,9 @@ static inline int mapping_unevictable(st
 }
 #endif
 
+void mapping_fsync_lock(struct address_space *mapping);
+void mapping_fsync_unlock(struct address_space *mapping);
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;


* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
@ 2008-12-10  9:15   ` steve
  2008-12-11 21:51   ` Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: steve @ 2008-12-10  9:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, Mikulas Patocka

Hi,

On Wed, Dec 10, 2008 at 08:42:09AM +0100, Nick Piggin wrote:
> OK, there has not been any further discussion on this approach since I last
> posted it, so I am going to go out on a limb and suggest that we take this
> approach, if any, rather than Mikulas' one.
> 
[snip]
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
>  drivers/usb/gadget/file_storage.c |    4 -
>  fs/cifs/cifsfs.c                  |    7 +
>  fs/fs-writeback.c                 |   13 ++-
>  fs/gfs2/glops.c                   |    9 +-
>  fs/gfs2/meta_io.c                 |    4 -
>  fs/gfs2/ops_file.c                |   13 ++-

For the GFS2 bits:
Acked-by: Steven Whitehouse <swhiteho@redhat.com>

That seems a reasonable approach to me,

Steve.


* Re: [patch 5/6] radix-tree: gang set if tagged operation
  2008-12-10  7:28 ` [patch 5/6] radix-tree: gang set if tagged operation Nick Piggin
@ 2008-12-11 21:20   ` Andrew Morton
  2008-12-11 22:10     ` Nick Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 21:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Wed, 10 Dec 2008 08:28:50 +0100
Nick Piggin <npiggin@suse.de> wrote:

> 
> Add a "radix_tree_gang_set_if_tagged" operation, which takes a range of indexes,
> and sets a given tag, if any of the specified tags were found to be set. Used by
> the next patch to set an "fsync" bit on any dirty and writeback pages.
> 

Spent 15 seconds trying to decode the above, quickly gave up.

Because it's the code comments which matter, not the changelog.

Only there are no code comments.  The rest of the radix-tree API is
documented.

> ---
>  include/linux/radix-tree.h |    1 
>  lib/radix-tree.c           |  107 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 108 insertions(+)
> 
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -183,6 +183,7 @@ unsigned int
>  radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
>  		unsigned long first_index, unsigned int max_items,
>  		unsigned int tag);
> +int radix_tree_gang_tag_set_if_tagged(struct radix_tree_root *root, unsigned long first_index, unsigned long last_index, unsigned long ifset, unsigned long thentag);

aww, c'mon, that's just stupid.  167 columns.

>  int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
>  
>  static inline void radix_tree_preload_end(void)
> Index: linux-2.6/lib/radix-tree.c
> ===================================================================
> --- linux-2.6.orig/lib/radix-tree.c
> +++ linux-2.6/lib/radix-tree.c
> @@ -629,6 +629,113 @@ int radix_tree_tag_get(struct radix_tree
>  EXPORT_SYMBOL(radix_tree_tag_get);
>  #endif
>  
> +int node_tag_set_if_tagged(struct radix_tree_node *node, unsigned int height, unsigned long first_index, unsigned long last_index, unsigned long ifset, unsigned long thentag)

This identifier is global?

> +{
> +	unsigned long first, last;
> +	int offset_start, offset_end, i;
> +	int ret = 0;
> +	unsigned int shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> +	unsigned int size = RADIX_TREE_MAP_SIZE << shift;
> +
> +	first = first_index;
> +	last = min(last_index, ((first + size) & ~(size-1)) - 1);
> +
> +	offset_start = (first_index >> shift) & RADIX_TREE_MAP_MASK;
> +	offset_end = (last_index >> shift) & RADIX_TREE_MAP_MASK;
> +	for (i = offset_start; i <= offset_end; i++) {
> +
> +		if (height > 1) {
> +			struct radix_tree_node *slot;
> +
> +			slot = node->slots[i];
> +			if (!slot)
> +				goto notset;
> +
> +			if (node_tag_set_if_tagged(slot, height - 1,
> +						first, last,
> +						ifset, thentag)) {

It's recursive.

Considerable effort has been made to avoid recursion in the radix-tree
code and here it gets added with nary a mention in the code or the
changelog.  It *at least* needs justification by showing a calculation
of the maximum stack usage.  Which is dependent upon the value of
RADIX_TREE_MAP_SHIFT and is hence a constraint upon it.
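
(A rough illustrative calculation, not from the thread: with the default
RADIX_TREE_MAP_SHIFT of 6, a 64-bit index gives a maximum tree height of
DIV_ROUND_UP(64, 6) = 11, i.e. up to 11 nested node_tag_set_if_tagged()
frames; at some tens of bytes of locals per frame that is on the order of
0.5-1KB of stack.)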

(I have a vague feeling that we broke the don't-recur design a while back, but
re-breaking it is still bad).


> +				if (ifset == 0x1)
> +					BUG_ON(!tag_get(node, 0, i));
> +				if (ifset == 0x2)
> +					BUG_ON(!tag_get(node, 1, i));
> +				if (ifset == 0x3)
> +					BUG_ON(!tag_get(node, 0, i) &&
> +						!tag_get(node, 1, i));
> +				if (thentag & 0x4) {
> +					if (!tag_get(node, 2, i))
> +						tag_set(node, 2, i);
> +				}
> +				ret = 1;
> +			}
> +notset:
> +			first = last+1;
> +			last = min(last_index, first + size - 1);
> +
> +		} else {
> +			if (ifset & 0x1) {
> +				if (tag_get(node, 0, i))
> +					goto isset;
> +			}
> +			if (ifset & 0x2) {
> +				if (tag_get(node, 1, i))
> +					goto isset;
> +			}
> +			continue;
> +isset:
> +			if (thentag & 0x4) {
> +				if (!tag_get(node, 2, i))
> +					tag_set(node, 2, i);
> +			}
> +			ret = 1;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +int radix_tree_gang_tag_set_if_tagged(struct radix_tree_root *root, unsigned long first_index, unsigned long last_index, unsigned long ifset, unsigned long thentag)
> +{
> +	unsigned int height;
> +	struct radix_tree_node *slot;
> +
> +	BUG_ON(ifset & ~0x3);
> +	BUG_ON(thentag & ~0x4);

boggle.

So these things are mystery bitfields.  They need documentation!  The
code's not reviewable in this state.

afacit these two fields could have type `int' or `unsigned'?  No need
to add the additional overhead of 64-bit?


> +	height = root->height;
> +	if (height == 0) {
> +		/* set the root's tag bit */
> +		if (first_index != 0)
> +			return 0;
> +
> +		if (ifset & 0x1)
> +			if (root_tag_get(root, 0))
> +				goto isset;
> +		if (ifset & 0x2)
> +			if (root_tag_get(root, 1))
> +				goto isset;
> +		return 0;
> +isset:
> +		if (thentag & 0x4) {
> +			if (!root_tag_get(root, 2))
> +				root_tag_set(root, 2);
> +		}
> +		return 1;
> +	}
> +
> +	slot = radix_tree_indirect_to_ptr(root->rnode);
> +	if (node_tag_set_if_tagged(slot, height, first_index, last_index, ifset, thentag)) {
> +		/* set the root's tag bit */
> +		if (thentag & 0x4) {
> +			if (!root_tag_get(root, 2))
> +				root_tag_set(root, 2);
> +		}
> +
> +		return 1;
> +	}
> +
> +	return 0;
> +}

The rest of the radix-tree API is exported to modules, so this addition
should also be.



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
  2008-12-10  9:15   ` steve
@ 2008-12-11 21:51   ` Andrew Morton
  2008-12-11 22:32     ` Nick Piggin
  2008-12-11 21:51   ` Andrew Morton
  2008-12-11 22:23   ` Andrew Morton
  3 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 21:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Wed, 10 Dec 2008 08:42:09 +0100
Nick Piggin <npiggin@suse.de> wrote:

> OK, there has not been any further discussion on this approach since I last
> posted it, so I am going to go out on a limb and suggest that we take this
> approach, if any, rather than Mikulas' one.
> 
> The advantages of my approach I think:
> - nothing added to non-sync fastpaths
> - close to theoretically fewest number of pages will be written / waited for
>   in the fsync case
> - works nicely even in unusual cases (eg. file with 1GB of dirty data, but
>   the fsync_range only needs to sync a few pages will not get stuck behind
>   a concurrent dirtier)
> 
> And some comments:
> - adds 8 bytes to the radix tree node, but this doesn't really change how it
>   fits into slab or cachelines, so the effective impact of this addition is
>   basically zero.
> - adds an extra lock, but as per the comments, this lock seems to be required
>   in order to fix a bug anyway. And we already tend to hold i_mutex over at
>   least some of the fsync operation. Although if anybody thinks this will be
>   a problem, I'd like to hear.
> 
> Disadvantages:
> - more complex. Although in a way I consider Mikulas' change to have
>   more complex heuristics, which I don't like. I think Mikulas' version
>   would be more complex to analyse at runtime. Also, much of the complexity
>   comes from the extra lock, which as I said fixes a bug.
> 
> Any additions or disputes? :)
> 
> --
> This patch fixes fsync starvation problems in the presence of concurrent
> dirtiers.
> 
> To take an extreme example: if thread A calls fsync on a file with one dirty
> page, at index 1 000 000; at the same time, thread B starts dirtying the
> file from offset 0 onwards.
> 
> Thread B perhaps will be allowed to dirty 1 000 pages before hitting its dirty
> threshold, then it will start throttling. Thread A will start writing out B's
> pages. They'll proceed more or less in lockstep until thread B finishes
> writing.
> 
> While these semantics are correct, we'd prefer a more timely notification that
> pages dirty at the time fsync was called are safely on disk. In the above
> scenario, we may have to wait until many times the machine's RAM capacity has
> been written to disk before the fsync returns success. Ideally, thread A would
> write the single page at index 1 000 000, then return.
> 
> This patch introduces a new pagecache tag, PAGECACHE_TAG_FSYNC. Data integrity
> syncs then start by looking through the pagecache for pages which are DIRTY
> and/or WRITEBACK within the requested range, and tagging all those as FSYNC.
> 
> Subsequent writeout and wait phases need then only look up those pages in the
> pagecache which are tagged with PAGECACHE_TAG_FSYNC.
> 
> After the sync operation has completed, the FSYNC tags are removed from the
> radix tree. This design requires exclusive usage of the FSYNC tags for the
> duration of a sync operation, so a lock on the address space is required.
> 
> For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> logic from a couple of places. I don't know if this is really worth the added
> complexity (EIO will still get reported, but it will just take a bit longer;
> an app can't rely on specific behaviour or timeliness here).
> 
> This lock also solves a real data integrity problem that I only noticed as
> I was writing the livelock avoidance code. If we consider the lock as the
> solution to this bug, this makes the livelock avoidance code much more
> attractive because then it does not introduce the new lock.
> 
> The bug is that fsync errors do not get propagated back up to the caller
> properly in some cases. Consider where we write a page in the writeout path,
> then it encounters an IO error and finishes writeback, in the meantime, another
> process (eg. via sys_sync, or another fsync) clears the mapping error bits.
> Then our fsync will have appeared to finish successfully, but actually should
> have returned error.

Has *anybody* *ever* complained about this behaviour?  I think maybe
one person after sixish years?

Why fix it?

>
> ...
>
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -147,6 +147,28 @@ void remove_from_page_cache(struct page
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>  
> +static int sleep_on_fsync(void *word)
> +{
> +	io_schedule();
> +	return 0;
> +}
> +
> +void mapping_fsync_lock(struct address_space *mapping)
> +{
> +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> +							TASK_UNINTERRUPTIBLE);
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +}
> +
> +void mapping_fsync_unlock(struct address_space *mapping)
> +{
> +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> +	smp_mb__after_clear_bit();

hm, I wonder why clear_bit_unlock() didn't already do that.

The clear_bit_unlock() documentation is rather crappy.

> +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> +}

It wouldn't hurt to document this interface a little bit.

>  static int sync_page(void *word)
>  {
>  	struct address_space *mapping;
> @@ -287,7 +309,64 @@ int wait_on_page_writeback_range(struct
>  
>  			/* until radix tree lookup accepts end_index */
>  			if (page->index > end)
> -				continue;
> +				break;
> +
> +			wait_on_page_writeback(page);
> +			if (PageError(page))
> +				ret = -EIO;
> +		}
> +		pagevec_release(&pvec);
> +		cond_resched();
> +	}
> +
> +	/* Check for outstanding write errors */
> +	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
> +		ret = -ENOSPC;
> +	if (test_and_clear_bit(AS_EIO, &mapping->flags))
> +		ret = -EIO;
> +
> +	return ret;
> +}
> +
> +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)

We already have a wait_on_page_writeback_range().  The reader of your
code will be wondering why this variant exists, and how it differs. 
Sigh.


> +{
> +	struct pagevec pvec;
> +	int nr_pages;
> +	int ret = 0;
> +	pgoff_t index;
> +
> +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> +
> +	if (end < start)
> +		goto out;
> +
> +	pagevec_init(&pvec, 0);
> +	index = start;
> +	while ((index <= end) &&
> +			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> +			PAGECACHE_TAG_FSYNC,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
> +		unsigned i;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
> +
> +			radix_tree_tag_clear(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);

Don't mucky up the nice kernel code please.

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		for (i = 0; i < nr_pages; i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* until radix tree lookup accepts end_index */
> +			if (page->index > end)
> +				break;
>  
>  			wait_on_page_writeback(page);
>  			if (PageError(page))
> @@ -303,6 +382,7 @@ int wait_on_page_writeback_range(struct
>  	if (test_and_clear_bit(AS_EIO, &mapping->flags))
>  		ret = -EIO;
>  
> +out:
>  	return ret;
>  }
>  
> @@ -325,18 +405,20 @@ int sync_page_range(struct inode *inode,
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>  
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mutex_lock(&inode->i_mutex);

I am unable to determine why i_mutex is being taken here.

> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
> -	if (ret == 0) {
> -		mutex_lock(&inode->i_mutex);
> +	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		mutex_unlock(&inode->i_mutex);
> -	}
> +	mutex_unlock(&inode->i_mutex);

Nor why it was dropped at this particular place.

> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range);
> @@ -357,15 +439,18 @@ int sync_page_range_nolock(struct inode
>  {
>  	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
>  	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
> -	int ret;
> +	int ret, ret2;
>  
>  	if (!mapping_cap_writeback_dirty(mapping) || !count)
>  		return 0;
> +	mapping_fsync_lock(mapping);
>  	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
>  	if (ret == 0)
>  		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> +	ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
>  	if (ret == 0)
> -		ret = wait_on_page_writeback_range(mapping, start, end);
> +		ret = ret2;
> +	mapping_fsync_unlock(mapping);
>  	return ret;
>  }
>  EXPORT_SYMBOL(sync_page_range_nolock);
> @@ -389,23 +474,30 @@ int filemap_fdatawait(struct address_spa
>  }
>  EXPORT_SYMBOL(filemap_fdatawait);
>  
> +int filemap_fdatawait_fsync(struct address_space *mapping)
> +{
> +	loff_t i_size = i_size_read(mapping->host);
> +
> +	if (i_size == 0)
> +		return 0;
> +
> +	return wait_on_page_writeback_range_fsync(mapping, 0,
> +				(i_size - 1) >> PAGE_CACHE_SHIFT);
> +}

filemap_fdatawait() is documented...

>  int filemap_write_and_wait(struct address_space *mapping)
>  {
>  	int err = 0;
>  
>  	if (mapping->nrpages) {
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
>  		err = filemap_fdatawrite(mapping);
> -		/*
> -		 * Even if the above returned error, the pages may be
> -		 * written partially (e.g. -ENOSPC), so we wait for it.
> -		 * But the -EIO is special case, it may indicate the worst
> -		 * thing (e.g. bug) happened, so we avoid waiting for it.
> -		 */
> -		if (err != -EIO) {
> -			int err2 = filemap_fdatawait(mapping);
> -			if (!err)
> -				err = err2;
> -		}
> +		err2 = filemap_fdatawait_fsync(mapping);
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
> @@ -428,16 +520,16 @@ int filemap_write_and_wait_range(struct
>  	int err = 0;
>  
>  	if (mapping->nrpages) {
> -		err = __filemap_fdatawrite_range(mapping, lstart, lend,
> -						 WB_SYNC_ALL);
> -		/* See comment of filemap_write_and_wait() */
> -		if (err != -EIO) {
> -			int err2 = wait_on_page_writeback_range(mapping,
> +		int err2;
> +
> +		mapping_fsync_lock(mapping);
> +		err = filemap_fdatawrite_range(mapping, lstart, lend);
> +		err2 = wait_on_page_writeback_range_fsync(mapping,
>  						lstart >> PAGE_CACHE_SHIFT,
>  						lend >> PAGE_CACHE_SHIFT);
> -			if (!err)
> -				err = err2;
> -		}
> +		if (!err)
> +			err = err2;
> +		mapping_fsync_unlock(mapping);
>  	}
>  	return err;
>  }
>
> ...
>
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -872,9 +872,11 @@ int write_cache_pages(struct address_spa
>  	pgoff_t index;
>  	pgoff_t end;		/* Inclusive */
>  	pgoff_t done_index;
> +	unsigned int tag = PAGECACHE_TAG_DIRTY;
>  	int cycled;
>  	int range_whole = 0;
>  	long nr_to_write = wbc->nr_to_write;
> +	int sync = wbc->sync_mode != WB_SYNC_NONE;
>  
>  	if (wbc->nonblocking && bdi_write_congested(bdi)) {
>  		wbc->encountered_congestion = 1;
> @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
>  			range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> +
> +	if (sync) {
> +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));

hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)? 
Surprised.

> +		/*
> +		 * If any pages are writeback or dirty, mark them fsync now.
> +		 * These are the pages we need to wait in in order to meet our

s/in/on/

> +		 * data integrity contract.
> +		 *
> +		 * Writeback pages need to be tagged, so we'll wait for them
> +		 * at the end of the writeout phase. However, the lookup below
> +		 * could just look up pages which are _DIRTY AND _FSYNC,
> +		 * because we don't care about them for the writeout phase.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> +							index, end,
> +				(1UL << PAGECACHE_TAG_DIRTY) |
> +				(1UL << PAGECACHE_TAG_WRITEBACK),
> +				(1UL << PAGECACHE_TAG_FSYNC))) {

ooh, so that's what that thing does.

> +			/* nothing tagged */
> +			spin_unlock_irq(&mapping->tree_lock);
> +			return 0;

Can we please avoid the deeply-nested-return hand grenade?

> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +		tag = PAGECACHE_TAG_FSYNC;
> +	}
> +
>  retry:
>  	done_index = index;
>  	while (!done && (index <= end)) {
>  		int i;
>  
>  		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -			      PAGECACHE_TAG_DIRTY,
> +			      tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
>  			break;
> @@ -951,7 +980,7 @@ continue_unlock:
>  			}
>  
>  			if (PageWriteback(page)) {
> -				if (wbc->sync_mode != WB_SYNC_NONE)
> +				if (sync)
>  					wait_on_page_writeback(page);
>  				else
>  					goto continue_unlock;
> @@ -981,7 +1010,7 @@ continue_unlock:
>  				}
>   			}
>  
> -			if (wbc->sync_mode == WB_SYNC_NONE) {
> +			if (!sync) {
>  				wbc->nr_to_write--;
>  				if (wbc->nr_to_write <= 0) {
>  					done = 1;
> Index: linux-2.6/drivers/usb/gadget/file_storage.c
> ===================================================================
> --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> +++ linux-2.6/drivers/usb/gadget/file_storage.c
> @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
>  
>  	inode = filp->f_path.dentry->d_inode;
>  	mutex_lock(&inode->i_mutex);
> +	mapping_fsync_lock(mapping);

Dood. Do `make allmodconfig ; make'

>  	rc = filemap_fdatawrite(inode->i_mapping);
>  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
>  	if (!rc)
>  		rc = err;
> -	err = filemap_fdatawait(inode->i_mapping);
> +	err = filemap_fdatawait_fsync(inode->i_mapping);
>  	if (!rc)
>  		rc = err;
> +	mapping_fsync_unlock(mapping);
>
> ...
>


I won't apply this because of the build breakage.


* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
  2008-12-10  9:15   ` steve
  2008-12-11 21:51   ` Andrew Morton
@ 2008-12-11 21:51   ` Andrew Morton
  2008-12-11 22:23   ` Andrew Morton
  3 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 21:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Wed, 10 Dec 2008 08:42:09 +0100
Nick Piggin <npiggin@suse.de> wrote:

> This patch fixes fsync starvation problems in the presence of concurrent
> dirtiers.

One reject in xfs, which now does:

int
xfs_flush_pages(
	xfs_inode_t	*ip,
	xfs_off_t	first,
	xfs_off_t	last,
	uint64_t	flags,
	int		fiopt)
{
	struct address_space *mapping = VFS_I(ip)->i_mapping;
	int		ret = 0;
	int		ret2;

	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
		xfs_iflags_clear(ip, XFS_ITRUNCATED);
		ret = filemap_fdatawrite(mapping);
		if (flags & XFS_B_ASYNC)
			return -ret;
		ret2 = filemap_fdatawait(mapping);
		if (!ret)
			ret = ret2;
	}
	return -ret;
}

ie: it returns a +ve integer on error, heaven knows why.

So I did

int
xfs_flush_pages(
	xfs_inode_t	*ip,
	xfs_off_t	first,
	xfs_off_t	last,
	uint64_t	flags,
	int		fiopt)
{
	struct address_space *mapping = VFS_I(ip)->i_mapping;
	int		ret = 0;

	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
		xfs_iflags_clear(ip, XFS_ITRUNCATED);
		if (flags & XFS_B_ASYNC) {
			ret = __filemap_fdatawrite_range(mapping,
					0, LLONG_MAX, WB_SYNC_NONE);
		} else {
			int ret2;

			mapping_fsync_lock(mapping);
			ret = filemap_fdatawrite(mapping);
			ret2 = filemap_fdatawait_fsync(mapping);
			if (!ret)
				ret = ret2;
			mapping_fsync_unlock(mapping);
		}
	}
	return -ret;
}



* Re: [patch 3/6] fs: sync_sb_inodes fix
  2008-12-10  7:27 ` [patch 3/6] fs: sync_sb_inodes fix Nick Piggin
@ 2008-12-11 21:51   ` Andrew Morton
  2008-12-11 22:34     ` Nick Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 21:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Wed, 10 Dec 2008 08:27:07 +0100
Nick Piggin <npiggin@suse.de> wrote:

> +	if (sync) {
> +		struct inode *inode, *old_inode = NULL;
> +
> +		/*
> +		 * Data integrity sync. Must wait for all pages under writeback,
> +		 * because there may have been pages dirtied before our sync
> +		 * call, but which had writeout started before we write it out.
> +		 * In which case, the inode may not be on the dirty list, but
> +		 * we still have to wait for that writeout.
> +		 */
> +		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +			struct address_space *mapping;
> +
> +			if (inode->i_state & (I_FREEING|I_WILL_FREE))
> +				continue;
> +			mapping = inode->i_mapping;
> +			if (mapping->nrpages == 0)
> +				continue;
> +			__iget(inode);
> +			spin_unlock(&inode_lock);
> +			/*
> +			 * We hold a reference to 'inode' so it couldn't have
> +			 * been removed from s_inodes list while we dropped the
> +			 * inode_lock.  We cannot iput the inode now as we can
> +			 * be holding the last reference and we cannot iput it
> +			 * under inode_lock. So we keep the reference and iput
> +			 * it later.
> +			 */

hm, tricky.

Can umount run concurrently with this?  What will it say about the busy
inode?


> +			iput(old_inode);
> +			old_inode = inode;
> +
> +			filemap_fdatawait(mapping);
> +
> +			cond_resched();
> +
> +			spin_lock(&inode_lock);
> +		}
> +		spin_unlock(&inode_lock);
> +		iput(old_inode);
> +	} else


* Re: [patch 5/6] radix-tree: gang set if tagged operation
  2008-12-11 21:20   ` Andrew Morton
@ 2008-12-11 22:10     ` Nick Piggin
  0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-11 22:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 01:20:23PM -0800, Andrew Morton wrote:
> On Wed, 10 Dec 2008 08:28:50 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > 
> > Add a "radix_tree_gang_set_if_tagged" operation, which takes a range of indexes,
> > and sets a given tag, if any of the specified tags were found to be set. Used by
> > the next patch to set an "fsync" bit on any dirty and writeback pages.
> > 
> 
> Spent 15 seconds trying to decode the above, quickly gave up.
> 
> Because it's the code comments which matter, not the changelog.
> 
> Only there are no code comments.  The rest of the radix-tree API is
> documented.

Fair call on comments and style. You needn't apply this one if you
don't apply the next one either. The first 4 were the more important
data integrity ones. I'll send updated 5/6 and 6/6 later, and we can
decide if we want the starvation fix.

> > +	for (i = offset_start; i <= offset_end; i++) {
> > +
> > +		if (height > 1) {
> > +			struct radix_tree_node *slot;
> > +
> > +			slot = node->slots[i];
> > +			if (!slot)
> > +				goto notset;
> > +
> > +			if (node_tag_set_if_tagged(slot, height - 1,
> > +						first, last,
> > +						ifset, thentag)) {
> 
> It's recursive.
> 
> Considerable effort has been made to avoid recursion in the radix-tree
> code and here it gets added with nary a mention in the code or the
> changelog.  It *at least* needs justification by showing a calculation
> of the maximum stack usage.  Which is dependent upon the value of
> RADIX_TREE_MAP_SHIFT and is hence a constraint upon it.
> 
> (I have a vague feeling that we broke the don't-recur design a while back, but
> re-breaking it is still bad).

Hmm. It's rather tricky code, so I found recursion was easier :( I'll
try to take another look at it... I guess it could go a maximum of
16 levels with CONFIG_SMALL and a really big file. It doesn't use a
heap of stack, although it's hard to say because some architectures
save a lot of things.
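
Spelling out the worst case (a sketch; assumes RADIX_TREE_MAP_SHIFT = 4
under CONFIG_BASE_SMALL and 64-bit radix tree indexes):

/*
 * max recursion depth = ceil(BITS_PER_LONG / RADIX_TREE_MAP_SHIFT)
 *                     = ceil(64 / 4) = 16 levels
 *
 * At a guessed ~100 bytes per frame that is on the order of 1.6KB of
 * stack -- it fits in an 8KB kernel stack, but it is not free.
 */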

> The rest of the radix-tree API is exported to modules, so this addition
> should also be.

OK.



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
                     ` (2 preceding siblings ...)
  2008-12-11 21:51   ` Andrew Morton
@ 2008-12-11 22:23   ` Andrew Morton
  2008-12-11 22:45     ` Nick Piggin
  3 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 22:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

obtw,

On Wed, 10 Dec 2008 08:42:09 +0100
Nick Piggin <npiggin@suse.de> wrote:

> For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> logic from a couple of places. I don't know if this is really worth the added
> complexity (EIO will still get reported, but it will just take a bit longer;
> an app can't rely on specific behaviour or timeliness here).

This is ungood.  The device layer likes to twiddle thumbs for 30
seconds or more when it hits an IO error.  We went and made that 30,000
or more..




* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 21:51   ` Andrew Morton
@ 2008-12-11 22:32     ` Nick Piggin
  2008-12-11 22:41       ` Andrew Morton
  2008-12-11 22:45       ` Andrew Morton
  0 siblings, 2 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-11 22:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 01:51:11PM -0800, Andrew Morton wrote:
> On Wed, 10 Dec 2008 08:42:09 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> > 
> > This lock also solves a real data integrity problem that I only noticed as
> > I was writing the livelock avoidance code. If we consider the lock as the
> > solution to this bug, this makes the livelock avoidance code much more
> > attractive because then it does not introduce the new lock.
> > 
> > The bug is that fsync errors do not get propagated back up to the caller
> > properly in some cases. Consider where we write a page in the writeout path,
> > then it encounters an IO error and finishes writeback, in the meantime, another
> > process (eg. via sys_sync, or another fsync) clears the mapping error bits.
> > Then our fsync will have appeared to finish successfully, but actually should
> > have returned error.
> 
> Has *anybody* *ever* complained about this behaviour?  I think maybe
> one person after sixish years?

The livelock behaviour? (or the error propagation).

I first heard about it from Mikulas, where some dm tool locks up because
it does direct IO on the block device of a mounted filesystem (or something
like that). That case is actually mostly solved by my first patch in the
series. 

> Why fix it?

Good question. My earlier patches already in your tree removed some starvation
avoidance code because they were breaking data integrity semantics. So in
theory, your tree today is more susceptible to this sync/fsync starvation
than mainline. I care most about the correctness, and it would be great if
nobody cares about this starvation problem so we don't need the extra
complexity.

Thanks for the review.

> > +void mapping_fsync_lock(struct address_space *mapping)
> > +{
> > +	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK, sleep_on_fsync,
> > +							TASK_UNINTERRUPTIBLE);
> > +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> > +}
> > +
> > +void mapping_fsync_unlock(struct address_space *mapping)
> > +{
> > +	WARN_ON(mapping_tagged(mapping, PAGECACHE_TAG_FSYNC));
> > +	WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> > +	clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
> > +	smp_mb__after_clear_bit();
> 
> hm, I wonder why clear_bit_unlock() didn't already do that.

Strictly unlock semantics. So it has the mb before the clear bit,
but none after.
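
So the pattern the unlock path needs is (a sketch of the reasoning,
restating the hunk above):

clear_bit_unlock(AS_FSYNC_LOCK, &mapping->flags);
/*
 * clear_bit_unlock() only orders what came *before* the clear.  The
 * barrier below keeps the waitqueue check inside wake_up_bit() from
 * being done against a stale view of the bit, which could let a
 * sleeper miss its wakeup.
 */
smp_mb__after_clear_bit();
wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);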

 
> The clear_bit_unlock() documentation is rather crappy.
> 
> > +	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
> > +}
> 
> It wouldn't hurt to document this interface a little bit.

True.


> > +int wait_on_page_writeback_range_fsync(struct address_space *mapping,
> > +				pgoff_t start, pgoff_t end)
> 
> We already have a wait_on_page_writeback_range().  The reader of your
> code will be wondering why this variant exists, and how it differs. 
> Sigh.

Ah OK, wait_on_page_writeback_range_fsync() can be used when the caller
has set up the fsync tags and holds the mapping bit lock. So unconverted
filesystems can hopefully use the old code without blowing up (they would
just be more prone to starvation).
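
So the expected calling pattern is roughly (simplified from
sync_page_range_nolock() in the patch):

mapping_fsync_lock(mapping);		/* claim the FSYNC tag */
ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
ret2 = wait_on_page_writeback_range_fsync(mapping, start, end);
if (ret == 0)
	ret = ret2;
mapping_fsync_unlock(mapping);		/* all FSYNC tags cleared again */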

> >  	if (!mapping_cap_writeback_dirty(mapping) || !count)
> >  		return 0;
> > +	mutex_lock(&inode->i_mutex);
> 
> I am unable to determine why i_mutex is being taken here.

Oh, good point (which I probably didn't mention). mapping_fsync_lock
nests inside i_mutex. Will add that to documentation and lock order
graphs.
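
For reference, the ordering this series assumes would be (sketch):

/*
 * Lock ordering:
 *
 *   inode->i_mutex
 *     -> AS_FSYNC_LOCK		(mapping_fsync_lock)
 *       -> mapping->tree_lock
 */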

> > @@ -897,13 +899,40 @@ int write_cache_pages(struct address_spa
> >  			range_whole = 1;
> >  		cycled = 1; /* ignore range_cyclic tests */
> >  	}
> > +
> > +	if (sync) {
> > +		WARN_ON(!test_bit(AS_FSYNC_LOCK, &mapping->flags));
> 
> hm.  Is fsync the only caller of write_cache_pages(!WB_SYNC_NONE)? 
> Surprised.

No, some filesystems and other callers also use e.g. filemap_write_and_wait,
which should come through here too. But they also need to take the lock.

> > +		if (!radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
> > +							index, end,
> > +				(1UL << PAGECACHE_TAG_DIRTY) |
> > +				(1UL << PAGECACHE_TAG_WRITEBACK),
> > +				(1UL << PAGECACHE_TAG_FSYNC))) {
> 
> ooh, so that's what that thing does.

Maybe an example is better documentation than my waffling.
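
For instance, the call in the write_cache_pages() hunk boils down to
this (a sketch of the intended usage):

/*
 * Tag every page in [index, end] that is DIRTY or WRITEBACK with the
 * FSYNC tag, under tree_lock; returns 0 if no page in the range had
 * either tag set, in which case there is nothing to write or wait on.
 */
spin_lock_irq(&mapping->tree_lock);
tagged = radix_tree_gang_tag_set_if_tagged(&mapping->page_tree,
			index, end,
			(1UL << PAGECACHE_TAG_DIRTY) |
			(1UL << PAGECACHE_TAG_WRITEBACK),
			(1UL << PAGECACHE_TAG_FSYNC));
spin_unlock_irq(&mapping->tree_lock);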

 
> > +			/* nothing tagged */
> > +			spin_unlock_irq(&mapping->tree_lock);
> > +			return 0;
> 
> Can we please avoid the deeply-nested-return hand grenade?

Hmm, we could 

   goto out;
...
out:
 return ret;

But is that any less hand-grenade-y than the plain return?

> > ===================================================================
> > --- linux-2.6.orig/drivers/usb/gadget/file_storage.c
> > +++ linux-2.6/drivers/usb/gadget/file_storage.c
> > @@ -1873,13 +1873,15 @@ static int fsync_sub(struct lun *curlun)
> >  
> >  	inode = filp->f_path.dentry->d_inode;
> >  	mutex_lock(&inode->i_mutex);
> > +	mapping_fsync_lock(mapping);
> 
> Dood. Do `make allmodconfig ; make'
 
OK.


> >  	rc = filemap_fdatawrite(inode->i_mapping);
> >  	err = filp->f_op->fsync(filp, filp->f_path.dentry, 1);
> >  	if (!rc)
> >  		rc = err;
> > -	err = filemap_fdatawait(inode->i_mapping);
> > +	err = filemap_fdatawait_fsync(inode->i_mapping);
> >  	if (!rc)
> >  		rc = err;
> > +	mapping_fsync_unlock(mapping);
> >
> > ...
> >
> 
> 
> I won't apply this because of the build breakage.

OK.



* Re: [patch 3/6] fs: sync_sb_inodes fix
  2008-12-11 21:51   ` Andrew Morton
@ 2008-12-11 22:34     ` Nick Piggin
  0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-11 22:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 01:51:47PM -0800, Andrew Morton wrote:
> On Wed, 10 Dec 2008 08:27:07 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > +	if (sync) {
> > +		struct inode *inode, *old_inode = NULL;
> > +
> > +		/*
> > +		 * Data integrity sync. Must wait for all pages under writeback,
> > +		 * because there may have been pages dirtied before our sync
> > +		 * call, but which had writeout started before we write it out.
> > +		 * In which case, the inode may not be on the dirty list, but
> > +		 * we still have to wait for that writeout.
> > +		 */
> > +		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > +			struct address_space *mapping;
> > +
> > +			if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > +				continue;
> > +			mapping = inode->i_mapping;
> > +			if (mapping->nrpages == 0)
> > +				continue;
> > +			__iget(inode);
> > +			spin_unlock(&inode_lock);
> > +			/*
> > +			 * We hold a reference to 'inode' so it couldn't have
> > +			 * been removed from s_inodes list while we dropped the
> > +			 * inode_lock.  We cannot iput the inode now as we can
> > +			 * be holding the last reference and we cannot iput it
> > +			 * under inode_lock. So we keep the reference and iput
> > +			 * it later.
> > +			 */
> 
> hm, tricky.
> 
> Can umount run concurrently with this?  What will it say about the busy
> inode?

AFAIKS umount shouldn't because we've taken a reference on the superblock
higher up.
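
That is, the sync path pins the superblock roughly like this
(simplified sketch of the fs/super.c pattern of this era):

spin_lock(&sb_lock);
restart:
list_for_each_entry(sb, &super_blocks, s_list) {
	sb->s_count++;			/* pin: the sb can't be freed */
	spin_unlock(&sb_lock);
	down_read(&sb->s_umount);	/* excludes a concurrent umount */
	if (sb->s_root)			/* still mounted? */
		sync_inodes_sb(sb, wait);
	up_read(&sb->s_umount);
	spin_lock(&sb_lock);
	if (__put_super_and_need_restart(sb))	/* drops the pin */
		goto restart;
}
spin_unlock(&sb_lock);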



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 22:32     ` Nick Piggin
@ 2008-12-11 22:41       ` Andrew Morton
  2008-12-11 22:45       ` Andrew Morton
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 22:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Thu, 11 Dec 2008 23:32:13 +0100
Nick Piggin <npiggin@suse.de> wrote:

> > > +			/* nothing tagged */
> > > +			spin_unlock_irq(&mapping->tree_lock);
> > > +			return 0;
> > 
> > Can we please avoid the deeply-nested-return hand grenade?
> 
> Hmm, we could 
> 
>    goto out;
> ...
> out:
>  return ret;
> 
> But is that less hand grenadie than the plain return?

yep.  I've seen many many locking errors and resource leaks caused by
the multiple-return-statements mistake.



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 22:32     ` Nick Piggin
  2008-12-11 22:41       ` Andrew Morton
@ 2008-12-11 22:45       ` Andrew Morton
  2008-12-11 22:59         ` Nick Piggin
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 22:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Thu, 11 Dec 2008 23:32:13 +0100
Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Dec 11, 2008 at 01:51:11PM -0800, Andrew Morton wrote:
> > On Wed, 10 Dec 2008 08:42:09 +0100
> > Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > This lock also solves a real data integrity problem that I only noticed as
> > > I was writing the livelock avoidance code. If we consider the lock as the
> > > solution to this bug, this makes the livelock avoidance code much more
> > > attractive because then it does not introduce the new lock.
> > > 
> > > The bug is that fsync errors do not get propagated back up to the caller
> > > properly in some cases. Consider where we write a page in the writeout path,
> > > then it encounters an IO error and finishes writeback, in the meantime, another
> > > process (eg. via sys_sync, or another fsync) clears the mapping error bits.
> > > Then our fsync will have appeared to finish successfully, but actually should
> > > have returned error.
> > 
> > Has *anybody* *ever* complained about this behaviour?  I think maybe
> > one person after sixish years?
> 
> The livelock behaviour? (or the error propagation).
> 
> I first heard about it from Mikulas, where some dm tool locks up because
> it does direct IO on the block device of mounted filesystem (or something
> like that).

Does it actually lock up?  Or does it just take a loooong time?

Presumably it can be worked around in userspace.

> That case is actually mostly solved by my first patch in the
> series. 

mm-direct-io-starvation-improvement.patch?   I guess that would help
a lot.  I can't imagine why we didn't do that years ago???

Can we please determine whether that optimisation was sufficient
for Mikulas's example?

> > Why fix it?
> 
> Good question. My earlier patches already in your tree removed some starvation
> avoidance code because they were breaking data integrity semantics. So in
> theory, your tree today is more susceptible to this sync/fsync starvation
> than mainline. I care most about the correctness, and it would be great if
> nobody cares about this starvation problem so we don't need the extra
> complexity.

Yes, it does add quite a bit of complexity and more code.  It'd be good
if we could find some way of avoiding merging it.



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 22:23   ` Andrew Morton
@ 2008-12-11 22:45     ` Nick Piggin
  2008-12-11 23:14       ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-11 22:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 02:23:47PM -0800, Andrew Morton wrote:
> obtw,
> 
> On Wed, 10 Dec 2008 08:42:09 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> > logic from a couple of places. I don't know if this is really worth the added
> > complexity (EIO will still get reported, but it will just take a bit longer;
> > > an app can't rely on specific behaviour or timeliness here).
> 
> This is ungood.  The device layer likes to twiddle thumbs for 30
> seconds or more when it hits an IO error.  We went and made that 30,000
> or more..

It isn't really a good solution anyway, because I think it's much
less likely for writepage to return -EIO directly. Usually they
would come back via data IO completion asynchronously.

And if we are fsyncing so many requests anyway, we are likely going
to start blocking behind them in the submission path anyway (elevator
queues fill up).



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 22:45       ` Andrew Morton
@ 2008-12-11 22:59         ` Nick Piggin
  0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-11 22:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 02:45:02PM -0800, Andrew Morton wrote:
> On Thu, 11 Dec 2008 23:32:13 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > The livelock behaviour? (or the error propagation).
> > 
> > I first heard about it from Mikulas, where some dm tool locks up because
> > it does direct IO on the block device of mounted filesystem (or something
> > like that).
> 
> Does it actually lock up?  Or does it just take a loooong time?

Just takes a looong time.

 
> Presumably it can be worked around in userspace.

Depends. It's very easy to get stuck behind a dirtier, but OTOH,
do many applications do such a thing? I simply don't know.

Livelock due to sys_sync would be easier to hit (because then you have
lots of other apps all doing their own thing which you aren't in control
of), but I don't want to optimise for apps calling sys_sync, and the
sync command should basically be used only for testing.

Point is, I don't know :) Probably not a bad idea to leave it out
and then think about it again if anybody cares.
 

> > That case is actually mostly solved by my first ptach in the
> > series. 
> 
> mm-direct-io-starvation-improvement.patch?   I guess that would help
> a lot.  I can't imagine why we didn't do that years ago???
> 
> Can we please determine whether that optimisation was sufficient
> for Mikulas's example?

Yes, that patch. I was able to reproduce the problem here, and that
did solve it for me, but yes confirming it with the actual dm tools
would be nice.


> > > Why fix it?
> > 
> > Good question. My earlier patches already in your tree removed some starvation
> > avoidance code because they were breaking data integrity semantics. So in
> > theory, your tree today is more susceptible to this sync/fsync starvation
> > than mainline. I care most about the correctness, and it would be great if
> > nobody cares about this starvation problem so we don't need the extra
> > complexity.
> 
> Yes, it does add quite a bit of complexity and more code.  It'd be good
> if we could find some way of avoiding merging it.

Well, don't merge it yet then.

Oh, but we still have the problem of fsync errors possibly not getting
propagated... I guess that should be solved in a broken-out patch anyway.



* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 22:45     ` Nick Piggin
@ 2008-12-11 23:14       ` Andrew Morton
  2008-12-11 23:43         ` Nick Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-11 23:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Thu, 11 Dec 2008 23:45:14 +0100
Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Dec 11, 2008 at 02:23:47PM -0800, Andrew Morton wrote:
> > obtw,
> > 
> > On Wed, 10 Dec 2008 08:42:09 +0100
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> > > logic from a couple of places. I don't know if this is really worth the added
> > > complexity (EIO will still get reported, but it will just take a bit longer;
> > > an app can't rely on specific behaviour or timeliness here).
> > 
> > This is ungood.  The device layer likes to twiddle thumbs for 30
> > seconds or more when it hits an IO error.  We went and made that 30,000
> > or more..
> 
> It isn't really a good solution anyway,

what isn't a good solution to what?

> because I think it's much
> less likely for writepage to return -EIO directly. Usually they
> would come back via data IO completion asynchronously.

umm, maybe.  If all the file metadata is in pagecache.  Often it is not.

> And if we are fsyncing so many requests anyway, we are likely going
> to start blocking behind them in the submission path anyway (elevator
> queues fill up).


* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 23:14       ` Andrew Morton
@ 2008-12-11 23:43         ` Nick Piggin
  2008-12-12  0:39           ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-11 23:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 03:14:07PM -0800, Andrew Morton wrote:
> On Thu, 11 Dec 2008 23:45:14 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > > > For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> > > > logic from a couple of places. I don't know if this is really worth the added
> > > > complexity (EIO will still get reported, but it will just take a bit longer;
> > > > > an app can't rely on specific behaviour or timeliness here).
> > > 
> > > This is ungood.  The device layer likes to twiddle thumbs for 30
> > > seconds or more when it hits an IO error.  We went and made that 30,000
> > > or more..
> > 
> > It isn't really a good solution anyway,
> 
> what isn't a good solution to what?

To the problem of long waits on IO errors.

 
> > because I think it's much
> > less likely for writepage to return -EIO directly. Usually they
> > would come back via data IO completion asynchronously.
> 
> umm, maybe.  If all the file metadata is in pagecache.  Often it is not.

I'd say, often it *is* because the buffer layer allocates/maps/reserves
blocks for the page when it gets dirtied.

Of course these checks will catch some cases for some filesystems, but
they're not a good general solution to the problem of EIO errors taking
a long time, IMO, because there are other ways it can happen.
 


* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-11 23:43         ` Nick Piggin
@ 2008-12-12  0:39           ` Andrew Morton
  2008-12-12  4:01             ` Nick Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2008-12-12  0:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, mpatocka

On Fri, 12 Dec 2008 00:43:35 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Thu, Dec 11, 2008 at 03:14:07PM -0800, Andrew Morton wrote:
> > On Thu, 11 Dec 2008 23:45:14 +0100
> > Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > > > For simplicity, I have removed the "don't wait for writeout if we hit -EIO"
> > > > > logic from a couple of places. I don't know if this is really worth the added
> > > > > complexity (EIO will still get reported, but it will just take a bit longer;
> > > > > an app can't rely on specific behaviour or timeliness here).
> > > > 
> > > > This is ungood.  The device layer likes to twiddle thumbs for 30
> > > > seconds or more when it hits an IO error.  We went and made that 30,000
> > > > or more..
> > > 
> > > It isn't really a good solution anyway,
> > 
> > what isn't a good solution to what?
> 
> To the problem of long waits on IO errors.
> 
>  
> > > because I think it's much
> > > less likely for writepage to return -EIO directly. Usually they
> > > would come back via data IO completion asynchronously.
> > 
> > umm, maybe.  If all the file metadata is in pagecache.  Often it is not.
> 
> I'd say, often it *is* because the buffer layer allocates/maps/reserves
> blocks for the page when it gets dirtied.

Depends on the value of "often" ;)

Metadata can get evicted early because it's in lowmem.

We don't instantiate the blocks at fault time for MAP_SHARED writes
(do we?  we were thinking of doing so)

> Of course these checks will catch some cases for some filesystems, but
> they're not a good general solution to the problem of EIO errors taking
> a long time, IMO, because there are other ways it can happen.

Well it's one of those things where a 90% solution is 90% better than no
solution.


* Re: [patch 6/6] mm: fsync livelock avoidance
  2008-12-12  0:39           ` Andrew Morton
@ 2008-12-12  4:01             ` Nick Piggin
  0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-12  4:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, mpatocka

On Thu, Dec 11, 2008 at 04:39:43PM -0800, Andrew Morton wrote:
> On Fri, 12 Dec 2008 00:43:35 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> > > > because I think it's much
> > > > less likely for writepage to return -EIO directly. Usually they
> > > > would come back via data IO completion asynchronously.
> > > 
> > > umm, maybe.  If all the file metadata is in pagecache.  Often it is not.
> > 
> > I'd say, often it *is* because the buffer layer allocates/maps/reserves
> > blocks for the page when it gets dirtied.
> 
> Depends on the value of "often" ;)

Very true.

 
> Metadata can get evicted early because it's in lowmem.

Yes, although highmem isn't "often" these days :) And for non-highmem,
you would expect the page to be written back before buffer heads are
reclaimed.


> We don't instantiate the blocks at fault time for MAP_SHARED writes
> (do we?  we were thinking of doing so)

No. Well, some filesystems do I think. But not many. There were patches
for ext*. But I think MAP_SHARED writes aren't terribly common either.

 
> > Of course these checks will catch some cases for some filesystems, but
> > they're not a good general solution to the problem of EIO errors taking
> > a long time, IMO, because there are other ways it can happen.
> 
> Well it's one of those things where a 90% solution is 90% better than no
> solution.

Or maybe a 20% solution is better than no solution too -- it's not a lot
of code as is. It would have got a little more complex after my FSYNC
tag patch. But we haven't even decided if that's a good idea to begin
with, so let's not get ahead of ourselves :)

How can this be solved nicely? Have an error rate threshold in the block
device code, after which the queue shuts down and starts synchronously
rejecting requests?
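
Nothing like that exists in the block layer today, but as a sketch of
the idea (every name here is hypothetical):

struct req_error_gate {			/* hypothetical, one per queue */
	unsigned int	recent_errors;	/* consecutive failed requests */
	unsigned int	threshold;	/* trip point before failing fast */
	bool		tripped;
};

/* Hypothetical completion hook: track an error storm. */
static void error_gate_complete(struct req_error_gate *gate, int error)
{
	if (!error)
		gate->recent_errors = 0;	/* healthy IO resets the count */
	else if (++gate->recent_errors >= gate->threshold)
		gate->tripped = true;		/* stop retrying, shut the queue */
}

/* Hypothetical submission hook: reject synchronously once tripped. */
static bool error_gate_reject(struct req_error_gate *gate)
{
	return gate->tripped;			/* caller fails the request with -EIO */
}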



* Re: [patch 1/6] mm: direct IO starvation improvement
  2008-12-10  7:24 [patch 1/6] mm: direct IO starvation improvement Nick Piggin
                   ` (4 preceding siblings ...)
  2008-12-10  7:42 ` [patch 6/6] mm: fsync livelock avoidance Nick Piggin
@ 2008-12-12 16:04 ` Jeff Moyer
  5 siblings, 0 replies; 24+ messages in thread
From: Jeff Moyer @ 2008-12-12 16:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, Mikulas Patocka

Nick Piggin <npiggin@suse.de> writes:

> Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A
> 4K direct IO will actually try to sync and/or invalidate the pagecache of the
> entire file, for example (which might be many GB or TB large).
>
> Improve this by doing range syncs. Also, memory no longer has to be unmapped
> to catch the dirty bits for syncing, as dirty bits would remain coherent due to
> dirty mmap accounting.
>
> This fixes the immediate DM deadlocks when doing direct IO reads to block
> device with a mounted filesystem, if only by papering over the problem somewhat
> rather than addressing the fsync starvation cases.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Cheers,
Jeff

