All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@lst.de>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>,
	hch@lst.de
Subject: [PATCH] fsync: wait for data writeout completion before calling ->fsync
Date: Thu, 3 Sep 2009 00:18:38 +0200	[thread overview]
Message-ID: <20090902221838.GA5311@lst.de> (raw)
In-Reply-To: <1251899966-7316-1-git-send-email-jack@suse.cz>

I think we should add this one ontop:

-- 
Subject: [PATCH] fsync: wait for data writeout completion before calling ->fsync
From: Christoph Hellwig <hch@lst.de>

Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
the data, the calls into ->fsync to write out the metadata and then finally
calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
like a clever micro-optimization actually is nast trap for many filesystems.

For many modern filesystems i_size or other inode information is only
updated on I/O completion and we need to wait for I/O to finish before
we can write out the metadata.  For old fashionen filesystems that
instanciate blocks during the actual write and also update the metadata
at that point it opens up a large window were we could expose uninitialized
blocks after a crash.  While a few filesystems that need it already wait
for the I/O to finish inside their ->fsync methods it is rather suboptimal
as it is done under the i_mutex and also always for the whole file instead
of just a part as we could do for O_SYNC handling.

Here is a small audit of all fsync instances in the tree:

 - spufs_mfc_fsync:
 - ps3flash_fsync: 
 - vol_cdev_fsync:
 - printer_fsync:
 - fb_deferred_io_fsync:
 - bad_file_fsync:
 - simple_sync_file:

	don't care - filesystems/drivers do't use the page cache or are
	purely in-memory.

 - simple_fsync:
 - file_fsync:
 - affs_file_fsync:
 - fat_file_fsync:
 - jfs_fsync:
 - ubifs_fsync:
 - reiserfs_dir_fsync:
 - reiserfs_sync_file:

	never touch pagecache themselves.  We need to wait before if we do
	not want to expose stale data after an allocation.

 - afs_fsync:
 - fuse_fsync_common:

	do the waiting writeback itself in awkward ways, would benefit from
	proper semantics

 - block_fsync:

	Does a filemap_write_and_wait on the block device inode.  Because we
	now have f_mapping that is the same inode we call it on in vfs_fsync.
	So just removing it and letting the VFS do the work in one go would
	be an improvement.

 - btrfs_sync_file:
 - cifs_fsync:
 - xfs_file_fsync:

	need the wait first and currently do it themselves. would benefit from
	doing it outside i_mutex.

 - coda_fsync:
 - ecryptfs_fsync:
 - exofs_file_fsync:
 - shm_fsync:

	only passes the fsync through to the lower layer

 - ext3_sync_file:

	doesn't seem to care, comments are confusing.

 - ext4_sync_file:

	would need the wait to work correctly for delalloc mode with late
	i_size updates.  Otherwise the ext3 comment applies.


	currently implemens it's own writeback and wait in an odd way,
	could benefit from doing it properly.

 - gfs2_fsync:

	not needed for journaled data mode, but probably harmless there.
	Currently writes back data asynchronously itself.  Needs some
	major audit.

 - hostfs_fsync:

	just calls fsync/datasync on the host FD.  Without the wait before
	data might not even be inflight yet if we're unlucky.

 - hpfs_file_fsync:
 - ncp_fsync:

	no-ops.  Dangerous before and after.

 - jffs2_fsync:

	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

 - nfs_fsync_dir:

	just increments stats, claims all directory operations are synchronous

 - nfs_file_fsync:

	only writes out data???  Looks very odd.

 - nilfs_sync_file:

	looks like it expects all data done, but not sure from the code

 - ntfs_dir_fsync:
 - ntfs_file_fsync:

	appear to do their own data writeback.  Very convoluted code.

 - ocfs2_sync_file:

	does it's own data writeback, but no wait.  probably needs the wait.

 - smb_fsync:

	according to a comment expects all pages written already, probably needs
	the wait before.


This patch only changes vfs_fsync_range, removal of the wait in the methods
that have it is left to the filesystem maintainers.  Note that most
filesystems really do need an audit for their fsync methods given the
gems found in this very brief audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c	2009-09-02 15:03:41.073271287 -0300
+++ linux-2.6/fs/sync.c	2009-09-02 15:04:34.401269249 -0300
@@ -216,7 +216,7 @@ int vfs_fsync_range(struct file *file, s
 		goto out;
 	}
 
-	ret = filemap_fdatawrite_range(mapping, start, end);
+	ret = filemap_write_and_wait_range(mapping, start, end);
 
 	/*
 	 * We need to protect against concurrent writers, which could cause
@@ -228,9 +228,6 @@ int vfs_fsync_range(struct file *file, s
 		ret = err;
 	mutex_unlock(&mapping->host->i_mutex);
 
-	err = filemap_fdatawait_range(mapping, start, end);
-	if (!ret)
-		ret = err;
 out:
 	return ret;
 }

  parent reply	other threads:[~2009-09-02 22:18 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-02 13:59 [PATCH 0/16] Make O_SYNC handling use standard syncing path (version 4) Jan Kara
2009-09-02 13:59 ` [PATCH 01/16] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-09-02 13:59 ` [PATCH 02/16] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
2009-09-02 13:59   ` [Ocfs2-devel] " Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59 ` [PATCH 03/16] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
2009-09-02 13:59   ` [Ocfs2-devel] " Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59 ` [PATCH 04/16] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-09-02 13:59 ` [PATCH 05/16] ocfs2: " Jan Kara
2009-09-02 13:59   ` [Ocfs2-devel] " Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59 ` [PATCH 06/16] vfs: Rename generic_file_aio_write_nolock Jan Kara
2009-09-02 21:47   ` Christoph Hellwig
2009-09-03 10:24     ` Jan Kara
2009-09-03 15:37       ` Christoph Hellwig
2009-09-02 13:59 ` [PATCH 07/16] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
2009-09-02 13:59   ` [Ocfs2-devel] " Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59 ` [PATCH 08/16] ext2: Update comment about generic_osync_inode Jan Kara
2009-09-02 13:59 ` [PATCH 09/16] ext3: Remove syncing logic from ext3_file_write Jan Kara
2009-09-02 13:59 ` [PATCH 10/16] ext4: Remove syncing logic from ext4_file_write Jan Kara
2009-09-02 13:59 ` [PATCH 11/16] ntfs: Use new syncing helpers and update comments Jan Kara
2009-09-02 13:59 ` [PATCH 12/16] ocfs2: Update syncing after splicing to match generic version Jan Kara
2009-09-02 13:59   ` [Ocfs2-devel] " Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59 ` [PATCH 13/16] xfs: Convert sync_page_range() to simple filemap_write_and_wait_range() Jan Kara
2009-09-02 13:59   ` Jan Kara
2009-09-02 13:59 ` [PATCH 14/16] pohmelfs: Use new syncing helper Jan Kara
2009-09-02 13:59 ` [PATCH 15/16] fat: Opencode sync_page_range_nolock() Jan Kara
2009-09-02 13:59 ` [PATCH 16/16] vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() Jan Kara
2009-09-02 14:16 ` [PATCH 0/16] Make O_SYNC handling use standard syncing path (version 4) Christoph Hellwig
2009-09-02 22:18 ` Christoph Hellwig [this message]
2009-09-02 22:37   ` [PATCH] fsync: wait for data writeout completion before calling ->fsync Joel Becker
2009-09-03 10:47   ` Jan Kara
2009-09-03 15:39     ` Christoph Hellwig
2009-09-10 20:25 ` [PATCH 18/16] implement posix O_SYNC and O_DSYNC semantics Christoph Hellwig
2009-09-10 20:38   ` Trond Myklebust
2009-09-10 20:40     ` Christoph Hellwig
2009-09-10 20:43       ` Trond Myklebust
2009-09-10 20:44         ` Christoph Hellwig
2009-09-10 23:07   ` Andreas Dilger
2009-09-10 23:18     ` Christoph Hellwig
2009-09-11 19:16   ` [PATCHv2 " Christoph Hellwig
2009-09-14 16:54     ` Jan Kara
2009-09-14 17:02       ` Christoph Hellwig
2009-09-15 13:12       ` [PATCH] " Christoph Hellwig
2009-09-15 14:10         ` Jan Kara
2009-09-15 14:50         ` Ulrich Drepper
2009-09-17 17:16           ` Christoph Hellwig
2009-09-17 21:03         ` Kyle McMartin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090902221838.GA5311@lst.de \
    --to=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.