* [PATCH 0/17] Make O_SYNC handling use standard syncing path
@ 2009-08-19 16:04 Jan Kara
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
                   ` (17 more replies)
  0 siblings, 18 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch


  Hi,

  below is a patch series that unifies O_SYNC handling with the code doing
fsync(). After this, we have just one place forcing a single file to disk, so
filesystems like ext3 / ext4 don't have to force a transaction commit in
ext?_file_write for O_SYNC files / IS_SYNC inodes.
  The code is also cleaner this way (about 150 lines shorter), and we no
longer sync the inode several times as happened previously.
  The patchset touches quite a few filesystems. Hopefully I've added all the
relevant people to CC for the relevant patches. I've verified via blktrace
that for ext?, xfs, ocfs2 and fat the file is indeed forced to disk as
expected. Maintainers of pohmelfs and NTFS may want to check that everything
still works for them after applying the patches.
  Comments / review welcome.

								Honza

^ permalink raw reply	[flat|nested] 144+ messages in thread

* [PATCH 01/17] vfs: Introduce filemap_fdatawait_range
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:10   ` Christoph Hellwig
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara

This simple helper saves some filesystems the conversion from byte offsets
to page indexes and also makes the fdata* interface more complete.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 ++
 mm/filemap.c       |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 67888a9..cb365ad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2076,6 +2076,8 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
+				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..65b2e50 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -307,6 +307,26 @@ int wait_on_page_writeback_range(struct address_space *mapping,
 }
 
 /**
+ * filemap_fdatawait_range - wait for all under-writeback pages to complete in a given range
+ * @mapping: address space structure to wait for
+ * @start:	offset in bytes where the range starts
+ * @end:	offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the given address space
+ * in the given range and wait for all of them.
+ *
+ * This is just a simple wrapper so that callers don't have to convert offsets
+ * to page indexes themselves.
+ */
+int filemap_fdatawait_range(struct address_space *mapping, loff_t start,
+			    loff_t end)
+{
+	return wait_on_page_writeback_range(mapping, start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range);
+
+/**
  * sync_page_range - write and wait on all pages in the passed range
  * @inode:	target inode
  * @mapping:	target address_space
-- 
1.6.0.2



* [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Evgeniy Polyakov, ocfs2-devel, Joel Becker

Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
comments to write helpers explaining how they should be used and export
__generic_file_aio_write() since it will be used by some filesystems.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 +
 mm/filemap.c       |   57 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb365ad..1edd159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2195,6 +2195,8 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long,
+		loff_t *);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
diff --git a/mm/filemap.c b/mm/filemap.c
index 65b2e50..554a396 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2368,9 +2368,27 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+/**
+ * __generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure (file, offset, etc.)
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @ppos:	position where to write
+ *
+ * This function does all the work needed for actually writing data to a
+ * file. It does all basic checks, removes SUID from the file, updates
+ * modification times and calls proper subroutines depending on whether we
+ * do direct IO or a standard buffered write.
+ *
+ * It expects i_mutex to be grabbed unless we work on a block device or similar
+ * object which does not need locking at all.
+ *
+ * This function does *not* take care of syncing data in case of O_SYNC write.
+ * A caller has to handle it. This is mainly due to the fact that we want to
+ * avoid syncing under i_mutex.
+ */
+ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				 unsigned long nr_segs, loff_t *ppos)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2467,7 +2485,23 @@ out:
 	current->backing_dev_info = NULL;
 	return written ? written : err;
 }
+EXPORT_SYMBOL(__generic_file_aio_write);
+
 
+/**
+ * generic_file_aio_write_nolock - write data, usually to a device
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() which takes care of
+ * syncing the file in case of O_SYNC file. It does not take i_mutex for the
+ * write itself but may do so during syncing. It is meant for users like block
+ * devices which do not need i_mutex during write. If your filesystem needs to
+ * do a write but already holds i_mutex, use __generic_file_aio_write()
+ * directly and then sync the file like generic_file_aio_write().
+ */
 ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
@@ -2478,8 +2512,7 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
@@ -2492,6 +2525,17 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 }
 EXPORT_SYMBOL(generic_file_aio_write_nolock);
 
+/**
+ * generic_file_aio_write - write data to a file
+ * @iocb:	IO state structure
+ * @iov:	vector with data to write
+ * @nr_segs:	number of segments in the vector
+ * @pos:	position in file where to write
+ *
+ * This is a wrapper around __generic_file_aio_write() to be used by most
+ * filesystems. It takes care of syncing the file in case of O_SYNC file
+ * and acquires i_mutex as needed.
+ */
 ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
@@ -2503,8 +2547,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	BUG_ON(iocb->ki_pos != pos);
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
-			&iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-- 
1.6.0.2




* [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` Jan Kara
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, ocfs2-devel, Joel Becker, Felix Blyakher, xfs

generic_file_direct_write() and generic_file_buffered_write() called
generic_osync_inode() when invoked on an O_SYNC file or IS_SYNC inode. But
this is superfluous since generic_file_aio_write() does the syncing as well.
XFS and OCFS2, which call these functions directly, also handle syncing
themselves. So let's have a single place where syncing happens:
generic_file_aio_write().

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |   25 -------------------------
 1 files changed, 0 insertions(+), 25 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 554a396..3bee198 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 		*ppos = end;
 	}
-
-	/*
-	 * Sync the fs metadata but not the minor inode changes and
-	 * of course not the data as we did direct DMA for the IO.
-	 * i_mutex is held, which protects generic_osync_inode() from
-	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
-	 */
 out:
-	if ((written >= 0 || written == -EIOCBQUEUED) &&
-	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		if (err < 0)
-			written = err;
-	}
 	return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
@@ -2332,8 +2319,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	const struct address_space_operations *a_ops = mapping->a_ops;
-	struct inode *inode = mapping->host;
 	ssize_t status;
 	struct iov_iter i;
 
@@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
 	if (likely(status >= 0)) {
 		written += status;
 		*ppos = pos + status;
-
-		/*
-		 * For now, when the user asks for O_SYNC, we'll actually give
-		 * O_DSYNC
-		 */
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			if (!a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(inode, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
   	}
 	
 	/*
-- 
1.6.0.2




* [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (2 preceding siblings ...)
  2009-08-19 16:04   ` Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Evgeniy Polyakov

Use the new helper __generic_file_aio_write(). Since the filesystem takes
care of syncing by itself afterwards, no further changes are needed.

CC: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/staging/pohmelfs/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7b60579..17801a5 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -921,7 +921,7 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	if (ret)
 		goto err_out_unlock;
 
-	ret = generic_file_aio_write_nolock(&kiocb, &iov, 1, pos);
+	ret = __generic_file_aio_write(&kiocb, &iov, 1, &kiocb.ki_pos);
 	*ppos = kiocb.ki_pos;
 
 	mutex_unlock(&inode->i_mutex);
-- 
1.6.0.2



* [PATCH 05/17] ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, ocfs2-devel, Joel Becker

Use the new helper. We have to submit data pages ourselves in case of O_SYNC
write because __generic_file_aio_write does not do it for us. OCFS2 developers
might consider moving the sync out of i_mutex, which seems easily possible,
but that's out of scope for this patch.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 62442e4..1c71f0a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1870,8 +1870,7 @@ relock:
 			goto out_dio;
 		}
 	} else {
-		written = generic_file_aio_write_nolock(iocb, iov, nr_segs,
-							*ppos);
+		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
 	}
 
 out_dio:
@@ -1879,18 +1878,21 @@ out_dio:
 	BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));
 
 	if ((file->f_flags & O_SYNC && !direct_io) || IS_SYNC(inode)) {
-		/*
-		 * The generic write paths have handled getting data
-		 * to disk, but since we don't make use of the dirty
-		 * inode list, a manual journal commit is necessary
-		 * here.
-		 */
-		if (old_size != i_size_read(inode) ||
-		    old_clusters != OCFS2_I(inode)->ip_clusters) {
+		ret = filemap_fdatawrite_range(file->f_mapping, pos,
+					       pos + count - 1);
+		if (ret < 0)
+			written = ret;
+
+		if (!ret && (old_size != i_size_read(inode) ||
+		    old_clusters != OCFS2_I(inode)->ip_clusters)) {
 			ret = jbd2_journal_force_commit(osb->journal->j_journal);
 			if (ret < 0)
 				written = ret;
 		}
+
+		if (!ret)
+			ret = filemap_fdatawait_range(file->f_mapping, pos,
+						      pos + count - 1);
 	}
 
 	/* 
-- 
1.6.0.2




* [PATCH 06/17] vfs: Remove sync_page_range_nolock
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (4 preceding siblings ...)
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:21   ` Christoph Hellwig
  2009-08-19 16:04   ` Jan Kara
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara

The last user of sync_page_range_nolock() is generic_file_aio_write_nolock().
Now that all its callers have been converted to not hold i_mutex, we can
afford to call sync_page_range() instead of sync_page_range_nolock() from
there. This unnecessarily acquires i_mutex when syncing block devices, but
that happens in the fsync() path as well, and block_fsync() can just drop and
reacquire i_mutex if we care enough.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/writeback.h |    2 --
 mm/filemap.c              |   31 +------------------------------
 2 files changed, 1 insertions(+), 32 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3224820..46d6064 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -159,8 +159,6 @@ int write_cache_pages(struct address_space *mapping,
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count);
-int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
-			   loff_t pos, loff_t count);
 void set_page_dirty_balance(struct page *page, int page_mkwrite);
 void writeback_set_ratelimit(void);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 3bee198..dffdc60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -362,35 +362,6 @@ int sync_page_range(struct inode *inode, struct address_space *mapping,
 EXPORT_SYMBOL(sync_page_range);
 
 /**
- * sync_page_range_nolock - write & wait on all pages in the passed range without locking
- * @inode:	target inode
- * @mapping:	target address_space
- * @pos:	beginning offset in pages to write
- * @count:	number of bytes to write
- *
- * Note: Holding i_mutex across sync_page_range_nolock() is not a good idea
- * as it forces O_SYNC writers to different parts of the same file
- * to be serialised right until io completion.
- */
-int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
-			   loff_t pos, loff_t count)
-{
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
-
-	if (!mapping_cap_writeback_dirty(mapping) || !count)
-		return 0;
-	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0)
-		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
-	return ret;
-}
-EXPORT_SYMBOL(sync_page_range_nolock);
-
-/**
  * filemap_fdatawait - wait for all under-writeback pages to complete
  * @mapping: address space structure to wait for
  *
@@ -2492,7 +2463,7 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		ssize_t err;
 
-		err = sync_page_range_nolock(inode, mapping, pos, ret);
+		err = sync_page_range(inode, mapping, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` Jan Kara
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML
  Cc: hch, Jan Kara, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

Introduce a new function for generic inode syncing (generic_sync_file) and use
it from the fsync() path. Also introduce a new helper for syncing after a
synchronous write (generic_write_sync) that uses the generic function.

Use these new helpers for syncing from generic VFS functions.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/splice.c        |   22 ++++----------
 fs/sync.c          |   81 +++++++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    7 ++++
 mm/filemap.c       |   16 ++++------
 4 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 73766d2..8190237 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -976,25 +976,15 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 	if (ret > 0) {
 		unsigned long nr_pages;
+		int err;
 
-		*ppos += ret;
 		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-		/*
-		 * If file or inode is SYNC and we actually wrote some data,
-		 * sync it.
-		 */
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			int err;
-
-			mutex_lock(&inode->i_mutex);
-			err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-			mutex_unlock(&inode->i_mutex);
-
-			if (err)
-				ret = err;
-		}
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
 
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..fc320aa 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -176,23 +176,30 @@ int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 }
 
 /**
- * vfs_fsync - perform a fsync or fdatasync on a file
+ * generic_sync_file - helper to sync data & metadata to disk
  * @file:		file to sync
  * @dentry:		dentry of @file
- * @data:		only perform a fdatasync operation
+ * @start:		offset in bytes of the beginning of data range to sync
+ * @end:		offset in bytes of the end of data range (inclusive)
+ * @what:		what should be synced
  *
- * Write back data and metadata for @file to disk.  If @datasync is
- * set only metadata needed to access modified file data is written.
+ * What the function exactly does is controlled by the @what parameter:
  *
- * In case this function is called from nfsd @file may be %NULL and
- * only @dentry is set.  This can only happen when the filesystem
- * implements the export_operations API.
+ * If SYNC_SUBMIT_DATA is set, the function submits all pages in the given
+ * range to disk.
+ *
+ * The function always calls the ->fsync() callback of the filesystem. If
+ * SYNC_INODE is not set, we pass down the fact that it is just a datasync.
+ *
+ * If SYNC_WAIT_DATA is set, the function waits for writeback to finish
+ * in the given range.
  */
-int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+int generic_sync_file(struct file *file, struct dentry *dentry, loff_t start,
+		      loff_t end, int what)
 {
 	const struct file_operations *fop;
 	struct address_space *mapping;
-	int err, ret;
+	int err, ret = 0;
 
 	/*
 	 * Get mapping and operations from the file in case we have
@@ -212,23 +219,50 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 		goto out;
 	}
 
-	ret = filemap_fdatawrite(mapping);
+	if (what & SYNC_SUBMIT_DATA)
+		ret = filemap_fdatawrite_range(mapping, start, end);
 
 	/*
 	 * We need to protect against concurrent writers, which could cause
 	 * livelocks in fsync_buffers_list().
 	 */
 	mutex_lock(&mapping->host->i_mutex);
-	err = fop->fsync(file, dentry, datasync);
+	err = fop->fsync(file, dentry, !(what & SYNC_INODE));
 	if (!ret)
 		ret = err;
 	mutex_unlock(&mapping->host->i_mutex);
-	err = filemap_fdatawait(mapping);
-	if (!ret)
-		ret = err;
+
+	if (what & SYNC_WAIT_DATA) {
+		err = filemap_fdatawait_range(mapping, start, end);
+		if (!ret)
+			ret = err;
+	}
 out:
 	return ret;
 }
+EXPORT_SYMBOL(generic_sync_file);
+
+/**
+ * vfs_fsync - perform a fsync or fdatasync on a file
+ * @file:		file to sync
+ * @dentry:		dentry of @file
+ * @datasync:		only perform a fdatasync operation
+ *
+ * Write back data and metadata for @file to disk.  If @datasync is
+ * set only metadata needed to access modified file data is written.
+ *
+ * In case this function is called from nfsd @file may be %NULL and
+ * only @dentry is set.  This can only happen when the filesystem
+ * implements the export_operations API.
+ */
+int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+	int what = SYNC_SUBMIT_DATA | SYNC_WAIT_DATA;
+
+	if (!datasync)
+		what |= SYNC_INODE;
+	return generic_sync_file(file, dentry, 0, LLONG_MAX, what);
+}
 EXPORT_SYMBOL(vfs_fsync);
 
 static int do_fsync(unsigned int fd, int datasync)
@@ -254,6 +288,25 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 	return do_fsync(fd, 1);
 }
 
+/**
+ * generic_write_sync - perform syncing after a write if file / inode is sync
+ * @file:	file to which the write happened
+ * @pos:	offset where the write started
+ * @count:	length of the write
+ *
+ * This is just a simple wrapper around our general syncing function.
+ * FIXME: Make it inline?
+ */
+int generic_write_sync(struct file *file, loff_t pos, loff_t count)
+{
+	if (!(file->f_flags & O_SYNC) && !IS_SYNC(file->f_mapping->host))
+		return 0;
+	return generic_sync_file(file, file->f_path.dentry, pos,
+				 pos + count - 1,
+				 SYNC_SUBMIT_DATA | SYNC_WAIT_DATA);
+}
+EXPORT_SYMBOL(generic_write_sync);
+
 /*
  * sys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edd159..0f13049 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2088,7 +2088,14 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 
+/* Flags for generic_sync_file */
+#define SYNC_INODE		1
+#define SYNC_SUBMIT_DATA	2
+#define SYNC_WAIT_DATA		4
+extern int generic_sync_file(struct file *file, struct dentry *dentry,
+			   loff_t start, loff_t end, int what);
 extern int vfs_fsync(struct file *file, struct dentry *dentry, int datasync);
+extern int generic_write_sync(struct file *file, loff_t pos, loff_t count);
 extern void sync_supers(void);
 extern void emergency_sync(void);
 extern void emergency_remount(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index dffdc60..a29b35b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -39,11 +39,10 @@
 /*
  * FIXME: remove all knowledge of the buffer layer from the core VM
  */
-#include <linux/buffer_head.h> /* for generic_osync_inode */
+#include <linux/buffer_head.h> /* for try_to_free_buffers */
 
 #include <asm/mman.h>
 
-
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -2452,18 +2451,16 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
@@ -2486,8 +2483,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
@@ -2496,10 +2492,10 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 08/17] ext2: Update comment about generic_osync_inode
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (6 preceding siblings ...)
  2009-08-19 16:04   ` Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write Jan Kara
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-ext4

We rely on generic_write_sync() now.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e271303..1c1638f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -482,7 +482,7 @@ static int ext2_alloc_branch(struct inode *inode,
 		unlock_buffer(bh);
 		mark_buffer_dirty_inode(bh, inode);
 		/* We used to sync bh here if IS_SYNC(inode).
-		 * But we now rely upon generic_osync_inode()
+		 * But we now rely upon generic_write_sync()
 		 * and b_inode_buffers.  But not for directories.
 		 */
 		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (7 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 08/17] ext2: Update comment about generic_osync_inode Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04   ` Jan Kara
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-ext4

Syncing is now properly done by generic_file_aio_write(), so no special logic
is needed in ext3.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/file.c |   61 +-------------------------------------------------------
 1 files changed, 1 insertions(+), 60 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..51fee5f 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -51,71 +51,12 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
 	return 0;
 }
 
-static ssize_t
-ext3_file_write(struct kiocb *iocb, const struct iovec *iov,
-		unsigned long nr_segs, loff_t pos)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_path.dentry->d_inode;
-	ssize_t ret;
-	int err;
-
-	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-
-	/*
-	 * Skip flushing if there was an error, or if nothing was written.
-	 */
-	if (ret <= 0)
-		return ret;
-
-	/*
-	 * If the inode is IS_SYNC, or is O_SYNC and we are doing data
-	 * journalling then we need to make sure that we force the transaction
-	 * to disk to keep all metadata uptodate synchronously.
-	 */
-	if (file->f_flags & O_SYNC) {
-		/*
-		 * If we are non-data-journaled, then the dirty data has
-		 * already been flushed to backing store by generic_osync_inode,
-		 * and the inode has been flushed too if there have been any
-		 * modifications other than mere timestamp updates.
-		 *
-		 * Open question --- do we care about flushing timestamps too
-		 * if the inode is IS_SYNC?
-		 */
-		if (!ext3_should_journal_data(inode))
-			return ret;
-
-		goto force_commit;
-	}
-
-	/*
-	 * So we know that there has been no forced data flush.  If the inode
-	 * is marked IS_SYNC, we need to force one ourselves.
-	 */
-	if (!IS_SYNC(inode))
-		return ret;
-
-	/*
-	 * Open question #2 --- should we force data to disk here too?  If we
-	 * don't, the only impact is that data=writeback filesystems won't
-	 * flush data to disk automatically on IS_SYNC, only metadata (but
-	 * historically, that is what ext2 has done.)
-	 */
-
-force_commit:
-	err = ext3_force_commit(inode->i_sb);
-	if (err)
-		return err;
-	return ret;
-}
-
 const struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= ext3_file_write,
+	.aio_write	= generic_file_aio_write,
 	.unlocked_ioctl	= ext3_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext3_compat_ioctl,
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-ext4, tytso

The syncing is now properly handled by generic_file_aio_write(), so no
special ext4 code is needed.

CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c |   53 ++---------------------------------------------------
 1 files changed, 2 insertions(+), 51 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3f1873f..aafe432 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -58,10 +58,7 @@ static ssize_t
 ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_path.dentry->d_inode;
-	ssize_t ret;
-	int err;
+	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
 
 	/*
 	 * If we have encountered a bitmap-format file, the size limit
@@ -81,53 +78,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 		}
 	}
 
-	ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-	/*
-	 * Skip flushing if there was an error, or if nothing was written.
-	 */
-	if (ret <= 0)
-		return ret;
-
-	/*
-	 * If the inode is IS_SYNC, or is O_SYNC and we are doing data
-	 * journalling then we need to make sure that we force the transaction
-	 * to disk to keep all metadata uptodate synchronously.
-	 */
-	if (file->f_flags & O_SYNC) {
-		/*
-		 * If we are non-data-journaled, then the dirty data has
-		 * already been flushed to backing store by generic_osync_inode,
-		 * and the inode has been flushed too if there have been any
-		 * modifications other than mere timestamp updates.
-		 *
-		 * Open question --- do we care about flushing timestamps too
-		 * if the inode is IS_SYNC?
-		 */
-		if (!ext4_should_journal_data(inode))
-			return ret;
-
-		goto force_commit;
-	}
-
-	/*
-	 * So we know that there has been no forced data flush.  If the inode
-	 * is marked IS_SYNC, we need to force one ourselves.
-	 */
-	if (!IS_SYNC(inode))
-		return ret;
-
-	/*
-	 * Open question #2 --- should we force data to disk here too?  If we
-	 * don't, the only impact is that data=writeback filesystems won't
-	 * flush data to disk automatically on IS_SYNC, only metadata (but
-	 * historically, that is what ext2 has done.)
-	 */
-
-force_commit:
-	err = ext4_force_commit(inode->i_sb);
-	if (err)
-		return err;
-	return ret;
+	return generic_file_aio_write(iocb, iov, nr_segs, pos);
 }
 
 static struct vm_operations_struct ext4_file_vm_ops = {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 11/17] fat: Opencode sync_page_range_nolock()
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (9 preceding siblings ...)
  2009-08-19 16:04   ` Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 12/17] ntfs: Use new syncing helpers and update comments Jan Kara
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, OGAWA Hirofumi

fat_cont_expand() is the only user of sync_page_range_nolock(). It is also the
only caller of generic_osync_inode() that does not have an open file. So
open-code the needed actions for FAT, so that we can convert
generic_osync_inode() to the standard syncing path.

Update a comment about generic_osync_inode().

CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fat/file.c |   22 ++++++++++++++++++++--
 fs/fat/misc.c |    4 ++--
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..e8c159d 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -176,8 +176,26 @@ static int fat_cont_expand(struct inode *inode, loff_t size)
 
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	if (IS_SYNC(inode))
-		err = sync_page_range_nolock(inode, mapping, start, count);
+	if (IS_SYNC(inode)) {
+		int err2;
+
+		/*
+		 * Opencode syncing since we don't have a file open to use
+		 * standard fsync path.
+		 */
+		err = filemap_fdatawrite_range(mapping, start,
+					       start + count - 1);
+		err2 = sync_mapping_buffers(mapping);
+		if (!err)
+			err = err2;
+		err2 = write_inode_now(inode, 1);
+		if (!err)
+			err = err2;
+		if (!err) {
+			err =  filemap_fdatawait_range(mapping, start,
+						       start + count - 1);
+		}
+	}
 out:
 	return err;
 }
diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index a6c2047..4e35be8 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -119,8 +119,8 @@ int fat_chain_add(struct inode *inode, int new_dclus, int nr_cluster)
 		MSDOS_I(inode)->i_start = new_dclus;
 		MSDOS_I(inode)->i_logstart = new_dclus;
 		/*
-		 * Since generic_osync_inode() synchronize later if
-		 * this is not directory, we don't here.
+		 * Since generic_write_sync() synchronizes regular files later,
+		 * we only sync directories here.
 		 */
 		if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode)) {
 			ret = fat_sync_inode(inode);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 12/17] ntfs: Use new syncing helpers and update comments
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (10 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 11/17] fat: Opencode sync_page_range_nolock() Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Anton Altaparmakov, linux-ntfs-dev

Use new syncing helpers in .write and .aio_write functions. Also
remove superfluous syncing in ntfs_file_buffered_write() and update
comments about generic_osync_inode().

CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ntfs/file.c |   16 ++++------------
 fs/ntfs/mft.c  |   13 ++++++-------
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 3140a44..4350d49 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2076,14 +2076,6 @@ err_out:
 	*ppos = pos;
 	if (cached_page)
 		page_cache_release(cached_page);
-	/* For now, when the user asks for O_SYNC, we actually give O_DSYNC. */
-	if (likely(!status)) {
-		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(vi))) {
-			if (!mapping->a_ops->writepage || !is_sync_kiocb(iocb))
-				status = generic_osync_inode(vi, mapping,
-						OSYNC_METADATA|OSYNC_DATA);
-		}
-  	}
 	pagevec_lru_add_file(&lru_pvec);
 	ntfs_debug("Done.  Returning %s (written 0x%lx, status %li).",
 			written ? "written" : "status", (unsigned long)written,
@@ -2145,8 +2137,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	mutex_lock(&inode->i_mutex);
 	ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = sync_page_range(inode, mapping, pos, ret);
+	if (ret > 0) {
+		int err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
@@ -2173,8 +2165,8 @@ static ssize_t ntfs_file_writev(struct file *file, const struct iovec *iov,
 	if (ret == -EIOCBQUEUED)
 		ret = wait_on_sync_kiocb(&kiocb);
 	mutex_unlock(&inode->i_mutex);
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		int err = sync_page_range(inode, mapping, *ppos - ret, ret);
+	if (ret > 0) {
+		int err = generic_write_sync(file, *ppos - ret, ret);
 		if (err < 0)
 			ret = err;
 	}
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index 23bf684..1caa0ef 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -384,13 +384,12 @@ unm_err_out:
  * it is dirty in the inode meta data rather than the data page cache of the
  * inode, and thus there are no data pages that need writing out.  Therefore, a
  * full mark_inode_dirty() is overkill.  A mark_inode_dirty_sync(), on the
- * other hand, is not sufficient, because I_DIRTY_DATASYNC needs to be set to
- * ensure ->write_inode is called from generic_osync_inode() and this needs to
- * happen or the file data would not necessarily hit the device synchronously,
- * even though the vfs inode has the O_SYNC flag set.  Also, I_DIRTY_DATASYNC
- * simply "feels" better than just I_DIRTY_SYNC, since the file data has not
- * actually hit the block device yet, which is not what I_DIRTY_SYNC on its own
- * would suggest.
+ * other hand, is not sufficient, because ->write_inode needs to be called even
+ * in case of fdatasync. This needs to happen or the file data would not
+ * necessarily hit the device synchronously, even though the vfs inode has the
+ * O_SYNC flag set.  Also, I_DIRTY_DATASYNC simply "feels" better than just
+ * I_DIRTY_SYNC, since the file data has not actually hit the block device yet,
+ * which is not what I_DIRTY_SYNC on its own would suggest.
  */
 void __mark_mft_record_dirty(ntfs_inode *ni)
 {
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Joel Becker, ocfs2-devel

Update the ocfs2-specific splice code to use the generic syncing helper.

CC: Joel Becker <Joel.Becker@oracle.com>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c |   27 ++++++---------------------
 1 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 1c71f0a..bd7fdf8 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1990,31 +1990,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
 
 	if (ret > 0) {
 		unsigned long nr_pages;
+		int err;
 
-		*ppos += ret;
 		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-		/*
-		 * If file or inode is SYNC and we actually wrote some data,
-		 * sync it.
-		 */
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			int err;
-
-			mutex_lock(&inode->i_mutex);
-			err = ocfs2_rw_lock(inode, 1);
-			if (err < 0) {
-				mlog_errno(err);
-			} else {
-				err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-				ocfs2_rw_unlock(inode, 1);
-			}
-			mutex_unlock(&inode->i_mutex);
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
 
-			if (err)
-				ret = err;
-		}
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
@ 2009-08-19 16:04   ` Jan Kara
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Felix Blyakher, xfs

Use new generic_write_sync() helper from xfs_write().

CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/linux-2.6/xfs_lrw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index 7078974..aeb5a39 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -817,7 +817,7 @@ write_retry:
 		xfs_iunlock(xip, iolock);
 		if (need_i_mutex)
 			mutex_unlock(&inode->i_mutex);
-		error2 = sync_page_range(inode, mapping, pos, ret);
+		error2 = generic_write_sync(file, pos, ret);
 		if (!error)
 			error = error2;
 		if (need_i_mutex)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 15/17] pohmelfs: Use new syncing helper
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (13 preceding siblings ...)
  2009-08-19 16:04   ` Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, Evgeniy Polyakov

Use new generic_write_sync() helper instead of sync_page_range().

CC: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/staging/pohmelfs/inode.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 17801a5..a2e5eed 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -927,10 +927,10 @@ ssize_t pohmelfs_write(struct file *file, const char __user *buf,
 	mutex_unlock(&inode->i_mutex);
 	WARN_ON(ret < 0);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 		WARN_ON(ret < 0);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (14 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 15/17] pohmelfs: " Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-19 16:04 ` [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range() Jan Kara
  2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara, linux-nfs, Neil Brown, J. Bruce Fields

generic_file_direct_write() no longer calls generic_osync_inode(), so remove
the comment referring to it.

CC: linux-nfs@vger.kernel.org
CC: Neil Brown <neilb@suse.de>
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/nfs/direct.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 489fc01..fd5fcc9 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -934,9 +934,6 @@ out:
  * back into its cache.  We let the server do generic write
  * parameter checking and report problems.
  *
- * We also avoid an unnecessary invocation of generic_osync_inode(),
- * as it is fairly meaningless to sync the metadata of an NFS file.
- *
  * We eliminate local atime updates, see direct read above.
  *
  * We avoid unnecessary page cache invalidations for normal cached
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range()
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (15 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
@ 2009-08-19 16:04 ` Jan Kara
  2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
  17 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML; +Cc: hch, Jan Kara

Remove these two functions since nobody uses them anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         |   54 ---------------------------------------------
 include/linux/fs.h        |    5 ----
 include/linux/writeback.h |    2 -
 mm/filemap.c              |   35 -----------------------------
 4 files changed, 0 insertions(+), 96 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c54226b..d8dbef0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -737,57 +737,3 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
-
-/**
- * generic_osync_inode - flush all dirty data for a given inode to disk
- * @inode: inode to write
- * @mapping: the address_space that should be flushed
- * @what:  what to write and wait upon
- *
- * This can be called by file_write functions for files which have the
- * O_SYNC flag set, to flush dirty writes to disk.
- *
- * @what is a bitmask, specifying which part of the inode's data should be
- * written and waited upon.
- *
- *    OSYNC_DATA:     i_mapping's dirty data
- *    OSYNC_METADATA: the buffers at i_mapping->private_list
- *    OSYNC_INODE:    the inode itself
- */
-
-int generic_osync_inode(struct inode *inode, struct address_space *mapping, int what)
-{
-	int err = 0;
-	int need_write_inode_now = 0;
-	int err2;
-
-	if (what & OSYNC_DATA)
-		err = filemap_fdatawrite(mapping);
-	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
-		err2 = sync_mapping_buffers(mapping);
-		if (!err)
-			err = err2;
-	}
-	if (what & OSYNC_DATA) {
-		err2 = filemap_fdatawait(mapping);
-		if (!err)
-			err = err2;
-	}
-
-	spin_lock(&inode_lock);
-	if ((inode->i_state & I_DIRTY) &&
-	    ((what & OSYNC_INODE) || (inode->i_state & I_DIRTY_DATASYNC)))
-		need_write_inode_now = 1;
-	spin_unlock(&inode_lock);
-
-	if (need_write_inode_now) {
-		err2 = write_inode_now(inode, 1);
-		if (!err)
-			err = err2;
-	}
-	else
-		inode_sync_wait(inode);
-
-	return err;
-}
-EXPORT_SYMBOL(generic_osync_inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0f13049..9c33da7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1458,11 +1458,6 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
 #define DT_SOCK		12
 #define DT_WHT		14
 
-#define OSYNC_METADATA	(1<<0)
-#define OSYNC_DATA	(1<<1)
-#define OSYNC_INODE	(1<<2)
-int generic_osync_inode(struct inode *, struct address_space *, int);
-
 /*
  * This is the "filldir" function type, used by readdir() to let
  * the kernel specify what kind of dirent layout it wants to have.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 46d6064..1446694 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -157,8 +157,6 @@ int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
 		      void *data);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-int sync_page_range(struct inode *inode, struct address_space *mapping,
-			loff_t pos, loff_t count);
 void set_page_dirty_balance(struct page *page, int page_mkwrite);
 void writeback_set_ratelimit(void);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a29b35b..1e68a34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -326,41 +326,6 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start,
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
 /**
- * sync_page_range - write and wait on all pages in the passed range
- * @inode:	target inode
- * @mapping:	target address_space
- * @pos:	beginning offset in pages to write
- * @count:	number of bytes to write
- *
- * Write and wait upon all the pages in the passed range.  This is a "data
- * integrity" operation.  It waits upon in-flight writeout before starting and
- * waiting upon new writeout.  If there was an IO error, return it.
- *
- * We need to re-take i_mutex during the generic_osync_inode list walk because
- * it is otherwise livelockable.
- */
-int sync_page_range(struct inode *inode, struct address_space *mapping,
-			loff_t pos, loff_t count)
-{
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
-	int ret;
-
-	if (!mapping_cap_writeback_dirty(mapping) || !count)
-		return 0;
-	ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
-	if (ret == 0) {
-		mutex_lock(&inode->i_mutex);
-		ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
-		mutex_unlock(&inode->i_mutex);
-	}
-	if (ret == 0)
-		ret = wait_on_page_writeback_range(mapping, start, end);
-	return ret;
-}
-EXPORT_SYMBOL(sync_page_range);
-
-/**
  * filemap_fdatawait - wait for all under-writeback pages to complete
  * @mapping: address space structure to wait for
  *
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [Ocfs2-devel] [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
@ 2009-08-19 16:04   ` Jan Kara
  0 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-19 16:04 UTC (permalink / raw)
  To: LKML
  Cc: hch, Jan Kara, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

Introduce a new function for generic inode syncing (generic_sync_file()) and
use it in the fsync() path. Also introduce a new helper for syncing after a
synchronous write (generic_write_sync()) built on top of the generic function.

Use these new helpers for syncing in the generic VFS functions.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/splice.c        |   22 ++++----------
 fs/sync.c          |   81 +++++++++++++++++++++++++++++++++++++++++++---------
 include/linux/fs.h |    7 ++++
 mm/filemap.c       |   16 ++++------
 4 files changed, 86 insertions(+), 40 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 73766d2..8190237 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -976,25 +976,15 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 	if (ret > 0) {
 		unsigned long nr_pages;
+		int err;
 
-		*ppos += ret;
 		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-		/*
-		 * If file or inode is SYNC and we actually wrote some data,
-		 * sync it.
-		 */
-		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
-			int err;
-
-			mutex_lock(&inode->i_mutex);
-			err = generic_osync_inode(inode, mapping,
-						  OSYNC_METADATA|OSYNC_DATA);
-			mutex_unlock(&inode->i_mutex);
-
-			if (err)
-				ret = err;
-		}
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
 
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..fc320aa 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -176,23 +176,30 @@ int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 }
 
 /**
- * vfs_fsync - perform a fsync or fdatasync on a file
+ * generic_sync_file - helper to sync data & metadata to disk
  * @file:		file to sync
  * @dentry:		dentry of @file
- * @data:		only perform a fdatasync operation
+ * @start:		offset in bytes of the beginning of data range to sync
+ * @end:		offset in bytes of the end of data range (inclusive)
+ * @what:		what should be synced
  *
- * Write back data and metadata for @file to disk.  If @datasync is
- * set only metadata needed to access modified file data is written.
+ * What the function exactly does is controlled by the @what parameter:
  *
- * In case this function is called from nfsd @file may be %NULL and
- * only @dentry is set.  This can only happen when the filesystem
- * implements the export_operations API.
+ * If SYNC_SUBMIT_DATA is set, the function submits all pages in the given
+ * range to disk.
+ *
+ * The function always calls ->fsync() callback of the filesystem. If
+ * SYNC_INODE is not set, we pass down the fact that it is just a datasync.
+ *
+ * If SYNC_WAIT_DATA is set, the function waits for writeback to finish
+ * in the given range.
  */
-int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+int generic_sync_file(struct file *file, struct dentry *dentry, loff_t start,
+		      loff_t end, int what)
 {
 	const struct file_operations *fop;
 	struct address_space *mapping;
-	int err, ret;
+	int err, ret = 0;
 
 	/*
 	 * Get mapping and operations from the file in case we have
@@ -212,23 +219,50 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 		goto out;
 	}
 
-	ret = filemap_fdatawrite(mapping);
+	if (what & SYNC_SUBMIT_DATA)
+		ret = filemap_fdatawrite_range(mapping, start, end);
 
 	/*
 	 * We need to protect against concurrent writers, which could cause
 	 * livelocks in fsync_buffers_list().
 	 */
 	mutex_lock(&mapping->host->i_mutex);
-	err = fop->fsync(file, dentry, datasync);
+	err = fop->fsync(file, dentry, !(what & SYNC_INODE));
 	if (!ret)
 		ret = err;
 	mutex_unlock(&mapping->host->i_mutex);
-	err = filemap_fdatawait(mapping);
-	if (!ret)
-		ret = err;
+
+	if (what & SYNC_WAIT_DATA) {
+		err = filemap_fdatawait_range(mapping, start, end);
+		if (!ret)
+			ret = err;
+	}
 out:
 	return ret;
 }
+EXPORT_SYMBOL(generic_sync_file);
+
+/**
+ * vfs_fsync - perform a fsync or fdatasync on a file
+ * @file:		file to sync
+ * @dentry:		dentry of @file
+ * @datasync:		only perform a fdatasync operation
+ *
+ * Write back data and metadata for @file to disk.  If @datasync is
+ * set only metadata needed to access modified file data is written.
+ *
+ * In case this function is called from nfsd @file may be %NULL and
+ * only @dentry is set.  This can only happen when the filesystem
+ * implements the export_operations API.
+ */
+int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
+{
+	int what = SYNC_SUBMIT_DATA | SYNC_WAIT_DATA;
+
+	if (!datasync)
+		what |= SYNC_INODE;
+	return generic_sync_file(file, dentry, 0, LLONG_MAX, what);
+}
 EXPORT_SYMBOL(vfs_fsync);
 
 static int do_fsync(unsigned int fd, int datasync)
@@ -254,6 +288,25 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 	return do_fsync(fd, 1);
 }
 
+/**
+ * generic_write_sync - perform syncing after a write if file / inode is sync
+ * @file:	file to which the write happened
+ * @pos:	offset where the write started
+ * @count:	length of the write
+ *
+ * This is just a simple wrapper about our general syncing function.
+ * FIXME: Make it inline?
+ */
+int generic_write_sync(struct file *file, loff_t pos, loff_t count)
+{
+	if (!(file->f_flags & O_SYNC) && !IS_SYNC(file->f_mapping->host))
+		return 0;
+	return generic_sync_file(file, file->f_path.dentry, pos,
+				 pos + count - 1,
+				 SYNC_SUBMIT_DATA | SYNC_WAIT_DATA);
+}
+EXPORT_SYMBOL(generic_write_sync);
+
 /*
  * sys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1edd159..0f13049 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2088,7 +2088,14 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 
+/* Flags for generic_sync_file */
+#define SYNC_INODE		1
+#define SYNC_SUBMIT_DATA	2
+#define SYNC_WAIT_DATA		4
+extern int generic_sync_file(struct file *file, struct dentry *dentry,
+			   loff_t start, loff_t end, int what);
 extern int vfs_fsync(struct file *file, struct dentry *dentry, int datasync);
+extern int generic_write_sync(struct file *file, loff_t pos, loff_t count);
 extern void sync_supers(void);
 extern void emergency_sync(void);
 extern void emergency_remount(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index dffdc60..a29b35b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -39,11 +39,10 @@
 /*
  * FIXME: remove all knowledge of the buffer layer from the core VM
  */
-#include <linux/buffer_head.h> /* for generic_osync_inode */
+#include <linux/buffer_head.h> /* for try_to_free_buffers */
 
 #include <asm/mman.h>
 
-
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -2452,18 +2451,16 @@ ssize_t generic_file_aio_write_nolock(struct kiocb *iocb,
 		const struct iovec *iov, unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
@@ -2486,8 +2483,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
+	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
@@ -2496,10 +2492,10 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+	if (ret > 0) {
 		ssize_t err;
 
-		err = sync_page_range(inode, mapping, pos, ret);
+		err = generic_write_sync(file, pos, ret);
 		if (err < 0)
 			ret = err;
 	}
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 144+ messages in thread

* Re: [PATCH 01/17] vfs: Introduce filemap_fdatawait_range
  2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
@ 2009-08-19 16:10   ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch

On Wed, Aug 19, 2009 at 06:04:28PM +0200, Jan Kara wrote:
> This simple helper saves some filesystems conversion from byte offset
> to page numbers and also makes the fdata* interface more complete.

Looks good to me,


Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
@ 2009-08-19 16:11     ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, Evgeniy Polyakov, ocfs2-devel, Joel Becker

On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara wrote:
> Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> comments to write helpers explaining how they should be used and export
> __generic_file_aio_write() since it will be used by some filesystems.

The renaming and commenting look good to me, and the export is fine
with me given that we'll actually use it later on.


Reviewed-by: Christoph Hellwig <hch@lst.de>

Can we maybe get rid of generic_file_aio_write_nolock() after your patch
series?


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:04   ` Jan Kara
  (?)
@ 2009-08-19 16:18     ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker, Felix Blyakher, xfs

On Wed, Aug 19, 2009 at 06:04:30PM +0200, Jan Kara wrote:
> generic_file_direct_write() and generic_file_buffered_write() called
> generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
> this is superfluous since generic_file_aio_write() does the syncing as well.
> Also XFS and OCFS2 which call these functions directly handle syncing
> themselves. So let's have a single place where syncing happens:
> generic_file_aio_write().

Yeah, this is something that never made any sense to me.

> @@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
>  		}
>  		*ppos = end;
>  	}
> -
> -	/*
> -	 * Sync the fs metadata but not the minor inode changes and
> -	 * of course not the data as we did direct DMA for the IO.
> -	 * i_mutex is held, which protects generic_osync_inode() from
> -	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
> -	 */
>  out:
> -	if ((written >= 0 || written == -EIOCBQUEUED) &&
> -	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> -		if (err < 0)
> -			written = err;
> -	}
>  	return written;

Here we check (written >= 0 || written == -EIOCBQUEUED), but
generic_file_aio_write only cares about positive return values.  We
definitely do have a behaviour change here for partial AIO requests.  The
question is whether the previous behaviour made any sense.  If we do have an
O_SYNC aio+dio request, we would have to flush out the metadata after the
request has completed, not here.

> @@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
>  	if (likely(status >= 0)) {
>  		written += status;
>  		*ppos = pos + status;
> -
> -		/*
> -		 * For now, when the user asks for O_SYNC, we'll actually give
> -		 * O_DSYNC
> -		 */
> -		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -			if (!a_ops->writepage || !is_sync_kiocb(iocb))
> -				status = generic_osync_inode(inode, mapping,
> -						OSYNC_METADATA|OSYNC_DATA);
> -		}
>    	}

No problem with -EIOCBQUEUED here, but we change from doing
generic_osync_inode with OSYNC_DATA which does a full writeout of the
data to sync_page_range which only does the range writeout here.  That
should be fine (as we only need to sync that range), but should probably
be documented in the patch description.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 06/17] vfs: Remove sync_page_range_nolock
  2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
@ 2009-08-19 16:21   ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch

On Wed, Aug 19, 2009 at 06:04:33PM +0200, Jan Kara wrote:
> The last user of sync_page_range_nolock() is generic_file_aio_write_nolock().
> Now we have converted all its callers to not hold i_mutex and so we can
> afford to call sync_page_range() instead of sync_page_range_nolock() from
> there. This non-necessarily acquires i_mutex when syncing block devices but
> that's happening in fsync() path as well and block_fsync() may just drop +
> reacquire i_mutex if we care enough.

Looks good to me, but to stop people from using
generic_file_aio_write_nolock accidentally I would rename it to
blkdev_aio_write, move it to fs/block_dev.c and stop exporting it.  That
way any out-of-tree or merge-queue filesystem is forced to use
the appropriate __generic_file_aio_write.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:04   ` Jan Kara
  (?)
@ 2009-08-19 16:26     ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, hch, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

Looks good to me.  Eventually we should also use those SYNC_ flags all
through the fsync codepath, but I'll see if I can incorporate that in my
planned fsync rewrite.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:04   ` Jan Kara
@ 2009-08-19 16:33     ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-19 16:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, Felix Blyakher, xfs

On Wed, Aug 19, 2009 at 06:04:41PM +0200, Jan Kara wrote:
> diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> index 7078974..aeb5a39 100644
> --- a/fs/xfs/linux-2.6/xfs_lrw.c
> +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> @@ -817,7 +817,7 @@ write_retry:
>  		xfs_iunlock(xip, iolock);
>  		if (need_i_mutex)
>  			mutex_unlock(&inode->i_mutex);
> -		error2 = sync_page_range(inode, mapping, pos, ret);
> +		error2 = generic_write_sync(file, pos, ret);

I don't think this is very optimal for XFS.  This first starts an
asynchronous writeout of the range in generic_write_sync,
then calls into ->fsync which waits for all I/O on the file to finish,
then forces the log inside xfs_fsync, then waits for the range again in
generic_write_sync, and after this code calls into
xfs_write_sync_logforce which forces the log _again_.

We should be fine just doing your new filemap_fdatawrite_range helper
here and letting xfs_write_sync_logforce do the work.  Long term we
should just get rid of this stupid submit-writes-before-syncing-the-inode,
then-wait crap which is a pain for all modern filesystems.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
@ 2009-08-19 20:22     ` Evgeniy Polyakov
  -1 siblings, 0 replies; 144+ messages in thread
From: Evgeniy Polyakov @ 2009-08-19 20:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker

Hi.

On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara (jack@suse.cz) wrote:
> Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> comments to write helpers explaining how they should be used and export
> __generic_file_aio_write() since it will be used by some filesystems.

Looks good to me. And comments are definitely good.
Ack.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 16:11     ` [Ocfs2-devel] " Christoph Hellwig
@ 2009-08-20 12:04       ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-20 12:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker

On Wed 19-08-09 12:11:45, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara wrote:
> > Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> > comments to write helpers explaining how they should be used and export
> > __generic_file_aio_write() since it will be used by some filesystems.
> 
> The renaming and commenting looks good to me, and the export is fine
> with me given that we'll actually use it later on.
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Can we maybe get rid of generic_file_aio_write_nolock after your patch
> series?
  In principle we could. It's just that block devices don't need i_mutex
for serializing writes, so I found it dumb to serialize writes to the block
device needlessly. But we could move generic_file_aio_write_nolock to
blockdev.c and name it something like blockdev_aio_write() since no one
else uses it after my series. I guess I'll do that.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-19 16:26     ` Christoph Hellwig
  (?)
@ 2009-08-20 12:15       ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-20 12:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> Looks good to me.  Eventually we should also use those SYNC_ flags all
> through the fsync codepath, but I'll see if I can incorporate that in my
> planned fsync rewrite.
  Yes, I thought I'd leave that for later. BTW, it should be fairly easy to
teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
filesystem sets some flag in inode->i_mapping (or somewhere else) as is
needed for XFS, btrfs, etc.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 14/17] xfs: Use new syncing helper
  2009-08-19 16:33     ` Christoph Hellwig
@ 2009-08-20 12:22       ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-20 12:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, LKML, Felix Blyakher, xfs

On Wed 19-08-09 12:33:15, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:41PM +0200, Jan Kara wrote:
> > diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> > index 7078974..aeb5a39 100644
> > --- a/fs/xfs/linux-2.6/xfs_lrw.c
> > +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> > @@ -817,7 +817,7 @@ write_retry:
> >  		xfs_iunlock(xip, iolock);
> >  		if (need_i_mutex)
> >  			mutex_unlock(&inode->i_mutex);
> > -		error2 = sync_page_range(inode, mapping, pos, ret);
> > +		error2 = generic_write_sync(file, pos, ret);
> 
> I don't think this is very optimal for XFS.  This first starts an
> asynchronous writeout of the range in generic_write_sync,
> then calls into ->fsync which waits for all I/O on the file to finish,
> then forces the log inside xfs_fsync, then waits for the range again in
> generic_write_sync, and after this code calls into
> xfs_write_sync_logforce which forces the log _again_.
> 
> We should be fine just doing your new filemap_fdatawrite_range helper
> here and letting xfs_write_sync_logforce do the work.  Long term we
> should just get rid of this stupid submit-writes-before-syncing-the-inode,
> then-wait crap which is a pain for all modern filesystems.
  OK, I'll happily change this. I was just doing the straightforward
conversion and sync_page_range() essentially did what generic_write_sync()
does since it called fdatawrite(), write_inode_now(inode, 1) (here XFS was
forced to wait for pages and force the log), fdatawait().

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-19 20:22     ` [Ocfs2-devel] " Evgeniy Polyakov
@ 2009-08-20 12:31       ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-20 12:31 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Jan Kara, LKML, hch, ocfs2-devel, Joel Becker

On Thu 20-08-09 00:22:08, Evgeniy Polyakov wrote:
> Hi.
> 
> On Wed, Aug 19, 2009 at 06:04:29PM +0200, Jan Kara (jack@suse.cz) wrote:
> > Rename __generic_file_aio_write_nolock() to __generic_file_aio_write(), add
> > comments to write helpers explaining how they should be used and export
> > __generic_file_aio_write() since it will be used by some filesystems.
> 
> Looks good to me. And comments are definitely good.
> Ack.
  Thanks for the review. BTW: I've CCed you mainly due to pohmelfs changes.
Will you have time to look at them?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-20 12:31       ` [Ocfs2-devel] " Jan Kara
@ 2009-08-20 13:30         ` Evgeniy Polyakov
  -1 siblings, 0 replies; 144+ messages in thread
From: Evgeniy Polyakov @ 2009-08-20 13:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker

On Thu, Aug 20, 2009 at 02:31:37PM +0200, Jan Kara (jack@suse.cz) wrote:
>   Thanks for the review. BTW: I've CCed you mainly due to pohmelfs changes.
> Will you have time to look at them?

The one where you replaced one name with another? :)
Yes, I did.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
  2009-08-19 16:18     ` Christoph Hellwig
  (?)
@ 2009-08-20 13:31       ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-20 13:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, ocfs2-devel, Joel Becker, Felix Blyakher, xfs

On Wed 19-08-09 12:18:53, Christoph Hellwig wrote:
> On Wed, Aug 19, 2009 at 06:04:30PM +0200, Jan Kara wrote:
> > generic_file_direct_write() and generic_file_buffered_write() called
> > generic_osync_inode() if it was called on O_SYNC file or IS_SYNC inode. But
> > this is superfluous since generic_file_aio_write() does the syncing as well.
> > Also XFS and OCFS2 which call these functions directly handle syncing
> > themselves. So let's have a single place where syncing happens:
> > generic_file_aio_write().
> 
> Yeah, this is something that never made any sense to me.
> 
> > @@ -2187,20 +2187,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> >  		}
> >  		*ppos = end;
> >  	}
> > -
> > -	/*
> > -	 * Sync the fs metadata but not the minor inode changes and
> > -	 * of course not the data as we did direct DMA for the IO.
> > -	 * i_mutex is held, which protects generic_osync_inode() from
> > -	 * livelocking.  AIO O_DIRECT ops attempt to sync metadata here.
> > -	 */
> >  out:
> > -	if ((written >= 0 || written == -EIOCBQUEUED) &&
> > -	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
> > -		if (err < 0)
> > -			written = err;
> > -	}
> >  	return written;
> 
> Here we check (written >= 0 || written == -EIOCBQUEUED), but
> generic_file_aio_write only cares about positive return values.  We
> definitely do have a change here for partial AIO requests.
  Ah, that's a good point.

> The question is whether the previous behaviour made any sense.  If we do
> have an O_SYNC aio+dio request we would have to flush out the metadata
> after the request has completed, not here.
  Yes, that would be a correct behavior but I see no good way of doing
that. Flushing on EIOCBQUEUED mostly works for simple filesystems like
ext[23] where we don't do anything from end_io callback. The only risk
is if we crash after the transaction commits but before the direct IO is
done (we could expose stale data) but I'm not sure that is an issue with
real HW.
  Anyway, the question is what we should do about it.  For now, I'd call
generic_write_sync() even in the case where EIOCBQUEUED is returned, to
preserve the old behavior (EIOCBQUEUED does not seem to be returned from
the buffered write path, so we really just preserve the old behavior).

> > @@ -2343,16 +2328,6 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
> >  	if (likely(status >= 0)) {
> >  		written += status;
> >  		*ppos = pos + status;
> > -
> > -		/*
> > -		 * For now, when the user asks for O_SYNC, we'll actually give
> > -		 * O_DSYNC
> > -		 */
> > -		if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -			if (!a_ops->writepage || !is_sync_kiocb(iocb))
> > -				status = generic_osync_inode(inode, mapping,
> > -						OSYNC_METADATA|OSYNC_DATA);
> > -		}
> >    	}
> 
> No problem with -EIOCBQUEUED here, but we change from doing
> generic_osync_inode with OSYNC_DATA which does a full writeout of the
> data to sync_page_range which only does the range writeout here.  That
> should be fine (as we only need to sync that range), but should probably
> be documented in the patch description.
  Yes, will do.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-20 13:30         ` [Ocfs2-devel] " Evgeniy Polyakov
@ 2009-08-20 13:52           ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-20 13:52 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Jan Kara, LKML, hch, ocfs2-devel, Joel Becker

On Thu 20-08-09 17:30:28, Evgeniy Polyakov wrote:
> On Thu, Aug 20, 2009 at 02:31:37PM +0200, Jan Kara (jack@suse.cz) wrote:
> >   Thanks for the review. BTW: I've CCed you mainly due to pohmelfs changes.
> > Will you have time to look at them?
> 
> The one where you replaced one name with another? :)
> Yes, I did.
  Well, there were two patches. Both were similarly simple but there could
be some subtleties I've missed :). So can I add your Acked-by also to those
two pohmelfs patches?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments
  2009-08-20 13:52           ` [Ocfs2-devel] " Jan Kara
@ 2009-08-20 13:58             ` Evgeniy Polyakov
  -1 siblings, 0 replies; 144+ messages in thread
From: Evgeniy Polyakov @ 2009-08-20 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, Joel Becker

On Thu, Aug 20, 2009 at 03:52:45PM +0200, Jan Kara (jack@suse.cz) wrote:
>   Well, there were two patches. Both were similarly simple but there could

As of sync behaviour - pohmelfs will follow the usual VFS behaviour, so
if you managed to break this, we still will follow the mainstream :)

> be some subtleties I've missed :). So can I add your Acked-by also to those
> two pohmelfs patches?

Sure.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 12:15       ` Jan Kara
  (?)
@ 2009-08-20 16:27         ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-20 16:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Thu, Aug 20, 2009 at 02:15:31PM +0200, Jan Kara wrote:
> On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> > Looks good to me.  Eventually we should use those SYNC_ flags also all
> > through the fsync codepath, but I'll see if I can incorporate that in my
> > planned fsync rewrite.
>   Yes, I thought I'll leave that for later. BTW it should be fairly easy to
> teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
> filesystem sets some flag in inode->i_mapping (or somewhere else) as is
> needed for XFS, btrfs, etc.

Maybe you can help brainstorming, but I still can't see how the

  - write data
  - write inode
  - wait for data

ordering actually is a benefit in terms of semantics (I agree that it could be
faster in theory, but even that is debatable with today's seek latencies
in disks)

Think about a simple non-journaling filesystem like ext2:

 (1) blocks get allocated during ->write before putting data in
      - this dirties the inode because we update i_block/i_size/etc
 (2) we call fsync (or the O_SYNC handling code for that matter)
      - we start writeout of the data, which takes forever because the
	file is very large
      - then we write out the inode, including the i_size/i_blocks
	update
      - due to some reason this gets reordered before the data writeout
	finishes (without that happening there would be no benefit to
	this ordering anyway)
 (3) now we call filemap_fdatawait to wait for data I/O to finish


Now the system crashes between (2) and (3).  After that we do have
stale data in the inode in the area not written yet.

Is there some case between that simple filesystem and the i_size update
from the I/O completion handler in XFS/ext4 where this behaviour actually
buys us anything?  Any ext3 magic maybe?

^ permalink raw reply	[flat|nested] 144+ messages in thread

* O_DIRECT and barriers
  2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
                   ` (16 preceding siblings ...)
  2009-08-19 16:04 ` [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range() Jan Kara
@ 2009-08-20 22:12 ` Christoph Hellwig
  2009-08-21 11:40   ` Jens Axboe
  17 siblings, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-20 22:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-scsi

Btw, something semi-related I've been looking at recently:

Currently O_DIRECT writes bypass all kernel caches, but they do
use the disk caches.  We currently don't have any barrier support for
them at all, which is really bad for data integrity in virtualized
environments.  I've started thinking about how to implement this.

The simplest scheme would be to mark the last request of each
O_DIRECT write as barrier requests.  This works nicely from the FS
perspective and works with all hardware supporting barriers.  It's
massive overkill though - we really only need to flush the cache
after our request, and not before.  And for SCSI we would be much
better just setting the FUA bit on the commands and not require a
full cache flush at all.

The next scheme would be to simply always do a cache flush after
the direct I/O write has completed, but given that blkdev_issue_flush
blocks until the command is done that would a) require everyone to
use the end_io callback and b) spend a lot of time in that workqueue.
This only requires one full cache flush, but it's still suboptimal.

I have prototyped this for XFS, but I don't really like it.

The best scheme would be to get some highlevel FUA request in the
block layer which gets emulated by a post-command cache flush.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [Ocfs2-devel] [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version
  2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
@ 2009-08-21  1:36     ` Joel Becker
  -1 siblings, 0 replies; 144+ messages in thread
From: Joel Becker @ 2009-08-21  1:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, ocfs2-devel, mfasheh

On Wed, Aug 19, 2009 at 06:04:40PM +0200, Jan Kara wrote:
> Update ocfs2 specific splicing code to use generic syncing helper.
> 
> CC: Joel Becker <Joel.Becker@oracle.com>
> CC: ocfs2-devel@oss.oracle.com
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ocfs2/file.c |   27 ++++++---------------------
>  1 files changed, 6 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 1c71f0a..bd7fdf8 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1990,31 +1990,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
>  
>  	if (ret > 0) {
>  		unsigned long nr_pages;
> +		int err;
>  
> -		*ppos += ret;
>  		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>  
> -		/*
> -		 * If file or inode is SYNC and we actually wrote some data,
> -		 * sync it.
> -		 */
> -		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> -			int err;
> -
> -			mutex_lock(&inode->i_mutex);
> -			err = ocfs2_rw_lock(inode, 1);
> -			if (err < 0) {
> -				mlog_errno(err);
> -			} else {
> -				err = generic_osync_inode(inode, mapping,
> -						  OSYNC_METADATA|OSYNC_DATA);
> -				ocfs2_rw_unlock(inode, 1);
> -			}
> -			mutex_unlock(&inode->i_mutex);
> +		err = generic_write_sync(out, *ppos, ret);
> +		if (err)
> +			ret = err;
> +		else
> +			*ppos += ret;

	You've removed the rw_lock around the sync.  Any reason why?

Joel

-- 

"A narcissist is someone better looking than you are."  
         - Gore Vidal

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
@ 2009-08-21 11:40   ` Jens Axboe
  2009-08-21 13:54     ` Jamie Lokier
  2009-08-21 14:20     ` Christoph Hellwig
  0 siblings, 2 replies; 144+ messages in thread
From: Jens Axboe @ 2009-08-21 11:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-scsi

On Thu, Aug 20 2009, Christoph Hellwig wrote:
> Btw, something semi-related I've been looking at recently:
> 
> Currently O_DIRECT writes bypass all kernel caches, but they do
> use the disk caches.  We currently don't have any barrier support for
> them at all, which is really bad for data integrity in virtualized
> environments.  I've started thinking about how to implement this.
> 
> The simplest scheme would be to mark the last request of each
> O_DIRECT write as barrier requests.  This works nicely from the FS
> perspective and works with all hardware supporting barriers.  It's
> massive overkill though - we really only need to flush the cache
> after our request, and not before.  And for SCSI we would be much
> better just setting the FUA bit on the commands and not require a
> full cache flush at all.
> 
> The next scheme would be to simply always do a cache flush after
> the direct I/O write has completed, but given that blkdev_issue_flush
> blocks until the command is done that would a) require everyone to
> use the end_io callback and b) spend a lot of time in that workqueue.
> This only requires one full cache flush, but it's still suboptimal.
> 
> I have prototyped this for XFS, but I don't really like it.
> 
> The best scheme would be to get some highlevel FUA request in the
> block layer which gets emulated by a post-command cache flush.

I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
request is O_DIRECT and a WRITE, ala:

        if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
                WRITE_FUA;

I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 11:40   ` Jens Axboe
@ 2009-08-21 13:54     ` Jamie Lokier
  2009-08-21 14:26       ` Christoph Hellwig
  2009-08-21 14:20     ` Christoph Hellwig
  1 sibling, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-21 13:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-scsi

Jens Axboe wrote:
> On Thu, Aug 20 2009, Christoph Hellwig wrote:
> > Btw, something semi-related I've been looking at recently:
> > 
> > Currently O_DIRECT writes bypass all kernel caches, but they do
> > use the disk caches.  We currently don't have any barrier support for
> > them at all, which is really bad for data integrity in virtualized
> > environments.  I've started thinking about how to implement this.
> > 
> > The simplest scheme would be to mark the last request of each
> > O_DIRECT write as barrier requests.  This works nicely from the FS
> > perspective and works with all hardware supporting barriers.  It's
> > massive overkill though - we really only need to flush the cache
> > after our request, and not before.  And for SCSI we would be much
> > better just setting the FUA bit on the commands and not require a
> > full cache flush at all.
> > 
> > The next scheme would be to simply always do a cache flush after
> > the direct I/O write has completed, but given that blkdev_issue_flush
> > blocks until the command is done that would a) require everyone to
> > use the end_io callback and b) spend a lot of time in that workqueue.
> > This only requires one full cache flush, but it's still suboptimal.
> > 
> > I have prototyped this for XFS, but I don't really like it.
> > 
> > The best scheme would be to get some highlevel FUA request in the
> > block layer which gets emulated by a post-command cache flush.
> 
> I've talked to Chris about this in the past too, but I never got around
> to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
> without making too many changes, and we do have FUA support on most SATA
> drives too. Basically just a check in the driver for whether the
> request is O_DIRECT and a WRITE, ala:
> 
>         if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
>                 WRITE_FUA;
> 
> I know that FUA is used by that other OS, so I think we should be golden
> on the hw support side.

I've been thinking about this too, and for optimal performance with
VMs and also with databases, I think FUA is too strong.  (It's also
too weak on drives which don't have FUA.)

I would like to be able to get the same performance and integrity as
the kernel filesystems can get, and that means using barrier flushes
when a kernel filesystem would use them, and FUA when a kernel
filesystem would use that.  Preferably the same whether userspace is
using a file or a block device.

The conclusion I came to is that O_DIRECT users need a barrier flush
primitive.  FUA can either be deduced by the elevator, or signalled
explicitly by userspace.

Fortunately there's already a sensible API for both: fdatasync (and
aio_fsync) to mean flush, and O_DSYNC (or inferred from
flush-after-one-write) to mean FUA.

Those apply to files, but they could be made to have the same effect
with block devices, which would be nice for applications which can use
both.  I'll talk about files from here on; assume the idea is to
provide the same functions for block devices.

It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.  O_DSYNC or fdatasync ensures that unknown buffered
writes will be committed.  This is true for other operating systems
too, for the same reason, except that some other unixes will convert all
writes to buffered writes, not just corner cases, under various
circumstances that are hard for applications to detect.

So there's already a good match to using fdatasync and/or O_DSYNC for
O_DIRECT integrity.

If we define fdatasync's behaviour to be that it always causes a
barrier flush if there have been any WRITE commands to a disk since
the last barrier flush, in addition to its behaviour of flushing
cached pages, that would be enough for VM and database applications
to have good support for integrity.  Of course O_DSYNC would imply
the same after each write.

As an optimisation, I think that FUA might be best done by the
elevator detecting opportunities to do that, rather than explicitly
signalled.

For VMs, the highest performance (with integrity) will likely come from:

    If the guest requests a virtual disk with write cache enabled:

        - Host opens file/blockdev with O_DIRECT  (but *not O_DSYNC*)
        - Host maps guests WRITE commands to host writes
        - Host maps guests CACHE FLUSH commands to fdatasync on host

    If the guest requests a virtual disk with write cache disabled:

        - Host opens file/blockdev with O_DIRECT|O_DSYNC
        - Host maps guests WRITE commands to host writes
        - Host maps guests CACHE FLUSH commands to nothing

    That's with host configured to use O_DIRECT.  If the host is
    configured to not use O_DIRECT, the same logic applies except that
    O_DIRECT is simply omitted.  Nice and simple eh?

Databases and userspace filesystems would be encouraged to do the
equivalent.  In other words, databases would open with O_DIRECT or not
(depending on behaviour preferred), and use fdatasync for barriers, or
use O_DSYNC if they are not using fdatasync.
       
Notice how it conveniently does the right thing when the kernel falls
back to buffered writes without telling anyone.

Code written in that way should do the right thing (or as close as
it's possible to get) on other OSes too.

(Btw, from what I can tell from various Windows documentation, it maps
the equivalent of O_DIRECT|O_DSYNC to setting FUA on every disk write,
and it maps the equivalent of fsync to sending the disk a cache
flush command as well as writing file metadata.  There's no Windows
equivalent to O_SYNC or fdatasync.)

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 11:40   ` Jens Axboe
  2009-08-21 13:54     ` Jamie Lokier
@ 2009-08-21 14:20     ` Christoph Hellwig
  2009-08-21 15:06       ` James Bottomley
  1 sibling, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-21 14:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 01:40:10PM +0200, Jens Axboe wrote:
> I've talked to Chris about this in the past too, but I never got around
> to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
> without making too many changes, and we do have FUA support on most SATA
> drives too. Basically just a check in the driver for whether the
> request is O_DIRECT and a WRITE, ala:
> 
>         if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
>                 WRITE_FUA;
> 
> I know that FUA is used by that other OS, so I think we should be golden
> on the hw support side.

Just doing FUA should be pretty easy, in fact from my reading of the
code we already use FUA for barriers if supported, that is only drain
the queue, do a pre-flush for a barrier and then issue the actual
barrier write as FUA.

I can play around with getting rid of the pre-flush and doing cache
flush based emulation if FUA is not supported if you're fine with that.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 13:54     ` Jamie Lokier
@ 2009-08-21 14:26       ` Christoph Hellwig
  2009-08-21 15:24         ` Jamie Lokier
  2009-08-21 22:08         ` Theodore Tso
  0 siblings, 2 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-21 14:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Jens Axboe, Christoph Hellwig, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 02:54:03PM +0100, Jamie Lokier wrote:
> I've been thinking about this too, and for optimal performance with
> VMs and also with databases, I think FUA is too strong.  (It's also
> too weak, on drives which don't have FUA).


Why is FUA too strong?

> Fortunately there's already a sensible API for both: fdatasync (and
> aio_fsync) to mean flush, and O_DSYNC (or inferred from
> flush-after-one-write) to mean FUA.

I thought about this a lot.  It would be sensible to only require
the FUA semantics if O_SYNC is specified.  But from looking around at
users of O_DIRECT no one seems to actually specify O_SYNC with it.
And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
if O_DIRECT bypasses the filesystem cache there is nothing else
left to sync for a non-extending write.  That is until those pesky disk
write back caches come into play that no application writer wants or
should have to understand.

> It turns out that applications needing integrity must use fdatasync or
> O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> choose to use buffered writes at any time, with no signal to the
> application.

The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well.  Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [Ocfs2-devel] [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version
  2009-08-21  1:36     ` Joel Becker
@ 2009-08-21 14:30       ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-21 14:30 UTC (permalink / raw)
  To: Joel Becker; +Cc: Jan Kara, LKML, hch, ocfs2-devel, mfasheh

On Thu 20-08-09 18:36:17, Joel Becker wrote:
> On Wed, Aug 19, 2009 at 06:04:40PM +0200, Jan Kara wrote:
> > Update ocfs2 specific splicing code to use generic syncing helper.
> > 
> > CC: Joel Becker <Joel.Becker@oracle.com>
> > CC: ocfs2-devel@oss.oracle.com
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ocfs2/file.c |   27 ++++++---------------------
> >  1 files changed, 6 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> > index 1c71f0a..bd7fdf8 100644
> > --- a/fs/ocfs2/file.c
> > +++ b/fs/ocfs2/file.c
> > @@ -1990,31 +1990,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
> >  
> >  	if (ret > 0) {
> >  		unsigned long nr_pages;
> > +		int err;
> >  
> > -		*ppos += ret;
> >  		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> >  
> > -		/*
> > -		 * If file or inode is SYNC and we actually wrote some data,
> > -		 * sync it.
> > -		 */
> > -		if (unlikely((out->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > -			int err;
> > -
> > -			mutex_lock(&inode->i_mutex);
> > -			err = ocfs2_rw_lock(inode, 1);
> > -			if (err < 0) {
> > -				mlog_errno(err);
> > -			} else {
> > -				err = generic_osync_inode(inode, mapping,
> > -						  OSYNC_METADATA|OSYNC_DATA);
> > -				ocfs2_rw_unlock(inode, 1);
> > -			}
> > -			mutex_unlock(&inode->i_mutex);
> > +		err = generic_write_sync(out, *ppos, ret);
> > +		if (err)
> > +			ret = err;
> > +		else
> > +			*ppos += ret;
> 
> 	You've removed the rw_lock around the sync.  Any reason why?
  Ah, I should have written this in the changelog: generic_write_sync() will
acquire i_mutex, so to preserve lock ordering (i_mutex -> rw_lock) we cannot
hold rw_lock while calling generic_write_sync(). Furthermore, I didn't see a
point in holding it there since we don't hold it in the fsync() path either.
  I'll add something like this to the changelog.

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 14:20     ` Christoph Hellwig
@ 2009-08-21 15:06       ` James Bottomley
  2009-08-21 15:23         ` Christoph Hellwig
  0 siblings, 1 reply; 144+ messages in thread
From: James Bottomley @ 2009-08-21 15:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

On Fri, 2009-08-21 at 10:20 -0400, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 01:40:10PM +0200, Jens Axboe wrote:
> > I've talked to Chris about this in the past too, but I never got around
> > to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
> > without making too many changes, and we do have FUA support on most SATA
> > drives too. Basically just a check in the driver for whether the
> > request is O_DIRECT and a WRITE, ala:
> > 
> >         if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
> >                 WRITE_FUA;
> > 
> > I know that FUA is used by that other OS, so I think we should be golden
> > on the hw support side.
> 
> Just doing FUA should be pretty easy, in fact from my reading of the
> code we already use FUA for barriers if supported, that is only drain
> the queue, do a pre-flush for a barrier and then issue the actual
> barrier write as FUA.

I've never really understood why FUA is considered equivalent to a
barrier.  Our barrier semantics are that all I/Os before the barrier
should be safely on disk after the barrier executes.  The FUA semantics
are that *this write* should be safely on disk after it executes ... it
can still leave preceding writes in the cache.  I can see that if you're
only interested in metadata that making every metadata write a FUA and
leaving the cache to sort out data writes does give FS image
consistency.

How does FUA give us linux barrier semantics?

James



^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 15:06       ` James Bottomley
@ 2009-08-21 15:23         ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-21 15:23 UTC (permalink / raw)
  To: James Bottomley; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 09:06:10AM -0600, James Bottomley wrote:
> I've never really understood why FUA is considered equivalent to a
> barrier.  Our barrier semantics are that all I/Os before the barrier
> should be safely on disk after the barrier executes.  The FUA semantics
> are that *this write* should be safely on disk after it executes ... it
> can still leave preceding writes in the cache.  I can see that if you're
> only interested in metadata that making every metadata write a FUA and
> leaving the cache to sort out data writes does give FS image
> consistency.
> 
> How does FUA give us linux barrier semantics?

FUA by itself doesn't.

Think what use cases we have for barriers and/or FUA right now:

 - a cache flush.  Can only be implemented as a cache flush, obviously.
 - a barrier flush bio - can be implemented as
     o cache flush, write, cache flush
     o or more efficiently as cache flush, write with the FUA bit set

now there is a third use case for O_SYNC, O_DIRECT writes which actually
do have FUA-like semantics, that is we only guarantee the I/O is on disk,
but we do not make guarantees about ordering vs earlier writes.
Currently we (as in those few filesystems bothering despite the
VFS/generic helpers making it really hard) implement O_SYNC by:

 - doing one or multiple normal writes, and waiting on them
 - then issuing a cache flush - either explicitly via blkdev_issue_flush
   or implicitly as part of a barrier write for metadata

this could be done more efficiently by simply setting the FUA bit on these
requests if we had an API for it.  The same should apply to O_DIRECT,
except that currently we don't even try.
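The two strategies can be sketched as request sequences (a toy model;
the command names here are illustrative, not the block layer's):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two ways to make one stable write described above:
 * today's "write, wait, then flush the drive cache" sequence, versus
 * tagging the write itself with FUA when the device supports it. */
enum cmd { CMD_WRITE, CMD_WRITE_FUA, CMD_FLUSH };

/* Emit the device commands for one stable write; returns their count. */
static int stable_write(enum cmd *out, bool have_fua)
{
    int n = 0;
    if (have_fua) {
        out[n++] = CMD_WRITE_FUA;  /* one request, durable on completion */
    } else {
        out[n++] = CMD_WRITE;      /* lands in the volatile drive cache */
        out[n++] = CMD_FLUSH;      /* force the cache contents to media */
    }
    return n;
}
```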


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 16:27         ` Christoph Hellwig
  (?)
@ 2009-08-21 15:23           ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-21 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

On Thu 20-08-09 12:27:29, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 02:15:31PM +0200, Jan Kara wrote:
> > On Wed 19-08-09 12:26:38, Christoph Hellwig wrote:
> > > Looks good to me.  Eventually we should use those SYNC_ flags also all
> > > through the fsync codepath, but I'll see if I can incorporate that in my
> > > planned fsync rewrite.
> >   Yes, I thought I'll leave that for later. BTW it should be fairly easy to
> > teach generic_sync_file() to do fdatawait() before calling ->fsync() if the
> > filesystem sets some flag in inode->i_mapping (or somewhere else) as is
> > needed for XFS, btrfs, etc.
> 
> Maybe you can help brain storming, but I still can't see any way in that
> the
> 
>   - write data
>   - write inode
>   - wait for data
> 
> actually is a benefit in terms of semantics (I agree that it could be
> faster in theory, but even that is debatable with todays seek latencies
> in disks)
> 
> Think about a simple non-journaling filesystem like ext2:
> 
>  (1) blocks get allocated during ->write before putting data in
>       - this dirties the inode because we update i_block/i_size/etc
>  (2) we call fsync (or the O_SYNC handling code for that matter)
>       - we start writeout of the data, which takes forever because the
> 	file is very large
>       - then we write out the inode, including the i_size/i_blocks
> 	update
>       - due to some reason this gets reordered before the data writeout
> 	finishes (without that happening there would be no benefit to
> 	this ordering anyway)
>  (3) now we call filemap_fdatawait to wait for data I/O to finish
> 
> Now the system crashes between (2) and (3).  After that we do have
> stale data in the inode in the area not written yet.
  Yes, that's true.
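
That crash window can be stated as a toy model (illustrative names
only, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the "write data, write inode, wait for data" ordering
 * discussed above.  State: which updates had reached stable storage
 * when the system crashed. */
struct disk_state {
    bool inode_written;   /* i_size/i_blocks update on media */
    bool data_written;    /* the data blocks themselves on media */
};

/* After a crash, the file exposes stale (never-written) blocks exactly
 * when the on-disk inode already points at data that never made it. */
static bool exposes_stale_data(struct disk_state s)
{
    return s.inode_written && !s.data_written;
}
```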

> Is there some case between that simple filesystem and the i_size update
> from I/O completion handler in XFS/ext4 where this behaviour actually
> buys us anything?  Any ext3 magic maybe?
  Hmm, I can imagine it would buy us something in two cases (but looking at
the code, neither is implemented in such a way that it would really help
us in any way):
1) when an inode and its data are stored in one block (e.g. OCFS2 or UDF can
   do this)
2) when we journal data
  In the first case we would wait for the block with data to be written only
to submit it again because the inode was still dirty.
  In the second case, it would make sense if we waited for the transaction
commit in fdatawait() because only then is the data really on disk. But I
don't know of a fs which would do it - ext3 in data=journal mode just adds
page buffers to the current transaction in writepage and never sets
PageWriteback, so fdatawait() is a no-op for it. The page is pinned in memory
only by the fact that its buffer heads are part of a transaction and thus
cannot be freed.
  So currently I don't know about real cases where fdatawait after ->fsync()
would buy us anything...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 14:26       ` Christoph Hellwig
@ 2009-08-21 15:24         ` Jamie Lokier
  2009-08-21 17:45           ` Christoph Hellwig
  2009-08-21 22:08         ` Theodore Tso
  1 sibling, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-21 15:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 02:54:03PM +0100, Jamie Lokier wrote:
> > I've been thinking about this too, and for optimal performance with
> > VMs and also with databases, I think FUA is too strong.  (It's also
> > too weak, on drives which don't have FUA).
> 
> Why is FUA too strong?

In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers.  Others report
similar results.  This is with disks that don't have NCQ; good NCQ may
be better.

Using FUA for all writes should be equivalent to writing with write
cache disabled.

A journalling filesystem or database tends to write like this:

   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE

When a guest does that, for integrity it can be mapped to this on the
host with FUA:

   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA

or

   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) CACHE FLUSH
   (host) WRITE
   (host) CACHE FLUSH 
   (host) WRITE
   (host) WRITE
   (host) WRITE

We know from measurements that disabling the disk write cache is much
slower than using barriers, at least with some disks.

Assuming that WRITE FUA is equivalent to disabling write cache, we may
expect the WRITE FUA version to run much slower than the CACHE FLUSH
version.
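
The guest-to-host translation in the two variants above can be modelled
as follows (a sketch with made-up command names, not any real host's
code):

```c
#include <assert.h>
#include <stdbool.h>

enum cmd { CMD_WRITE, CMD_FLUSH, CMD_WRITE_FUA };

/* Translate a guest command stream into a host stream.  In FUA mode
 * every WRITE is upgraded to WRITE FUA and CACHE FLUSH becomes a no-op;
 * in flush mode both commands pass through unchanged.  Returns the
 * number of host commands emitted. */
static int translate(const enum cmd *in, int n, enum cmd *out, bool fua_mode)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (in[i] == CMD_WRITE)
            out[m++] = fua_mode ? CMD_WRITE_FUA : CMD_WRITE;
        else if (in[i] == CMD_FLUSH && !fua_mode)
            out[m++] = CMD_FLUSH;   /* dropped entirely in FUA mode */
    }
    return m;
}
```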

It's also too weak, of course, on drives which don't support FUA.
Then you have to use CACHE FLUSH anyway, so the code should support
that (or disable the write cache entirely, which also performs badly).
If you don't handle drives without FUA, then you're back to "integrity
sometimes, user must check type of hardware", which is something we're
trying to get away from.  Integrity should not be a surprise when the
application requests it.

> > Fortunately there's already a sensible API for both: fdatasync (and
> > aio_fsync) to mean flush, and O_DSYNC (or inferred from
> > flush-after-one-write) to mean FUA.
> 
> I thought about this a lot.  It would be sensible to only require
> the FUA semantics if O_SYNC is specified.  But from looking around at
> users of O_DIRECT no one seems to actually specify O_SYNC with it.

O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.

O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted to buffered writes
- which applies to all or nearly all OSes according to their
documentation (I've read a lot of them).

I notice that all applications I looked at which use O_DIRECT don't
attempt to determine when O_DIRECT will definitely result in direct
writes; they simply assume it can be used as a substitute for O_SYNC
or O_DSYNC, as long as you follow the alignment rules.  Generally they
leave it to the user to configure what they want, and often don't
explain the drive integrity issue, except to say "depends on the OS,
your mileage may vary, we can do nothing about it".

Imho, integrity should not be something which depends on the user
knowing the details of their hardware to decide application
configuration options - at least, not out of the box.

On a related note,
http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm
says:

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable code.

> And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
> if O_DIRECT bypasses the filesystem cache there is nothing else
> left to sync for a non-extending write.

Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
Ugh, how messy.

> That is until those pesky disk
> write back caches come into play that no application writer wants or
> should have to understand.

As far as I can tell, they generally go out of their way to avoid
understanding it, except as a vaguely uncomfortable awareness and pass
the problem on to the application's user.

Unfortunately just disabling the disk cache for O_DIRECT would make
its performance drop significantly, otherwise I'd say go for it.

> > It turns out that applications needing integrity must use fdatasync or
> > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > choose to use buffered writes at any time, with no signal to the
> > application.
> 
> The fallback was a relatively recent addition to the O_DIRECT semantics
> for broken filesystems that can't handle holes very well.  Fortunately
> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> semantics for that already.

Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
O_DIRECT either? :-)

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-21 15:23           ` Jan Kara
  (?)
@ 2009-08-21 15:32             ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-21 15:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Fri, Aug 21, 2009 at 05:23:39PM +0200, Jan Kara wrote:
>   Hmm, I can imagine it would buy us something in two cases (but looking at
> the code, neither is implemented in such a way that it would really help
> us in any way):
> 1) when an inode and it's data are stored in one block (e.g. OCFS2 or UDF) do
>    this.

>   In the first case we would wait for block with data to be written only to
> submit it again because inode was still dirty.

But we would not actually store it in the pagecache, or at least never
write it back via the pagecache in that case, but always just write it as
part of the inode.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-21 15:32             ` Christoph Hellwig
  (?)
@ 2009-08-21 15:48               ` Jan Kara
  -1 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-21 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, LKML, Evgeniy Polyakov, ocfs2-devel, Joel Becker,
	Felix Blyakher, xfs, Anton Altaparmakov, linux-ntfs-dev,
	OGAWA Hirofumi, linux-ext4, tytso

On Fri 21-08-09 11:32:35, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 05:23:39PM +0200, Jan Kara wrote:
> >   Hmm, I can imagine it would buy us something in two cases (but looking at
> > the code, neither is implemented in such a way that it would really help
> > us in any way):
> > 1) when an inode and its data are stored in one block (e.g. OCFS2 or UDF) do
> >    this.
> 
> >   In the first case we would wait for the block with data to be written
> > only to submit it again because the inode was still dirty.
> 
> But we would not actually store it in the pagecache, or at least never
> write it back via the pagecache in that case, but always just write it as
> part of the inode.
  Yes, probably. Actually, that's what UDF does. It's just that then the
sequence fdatawrite(), fdatawait() does not really send the data to disk.
But that probably does not matter.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 15:24         ` Jamie Lokier
@ 2009-08-21 17:45           ` Christoph Hellwig
  2009-08-21 19:18             ` Ric Wheeler
  2009-08-22  0:50             ` Jamie Lokier
  0 siblings, 2 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-21 17:45 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 04:24:59PM +0100, Jamie Lokier wrote:
> In measurements I've done, disabling a disk's write cache results in
> much slower ext3 filesystem writes than using barriers.  Others report
> similar results.  This is with disks that don't have NCQ; good NCQ may
> be better.

On a scsi disk and a SATA SSD with NCQ I get different results.  Most
workloads, in particular metadata-intensive ones and large streaming
writes, are noticeably better just turning off the write cache.  The only
ones that benefit from it are relatively small writes without O_SYNC
or many fsyncs.  This is however using XFS, which tends to issue many
more barriers than ext3.

> Using FUA for all writes should be equivalent to writing with write
> cache disabled.
> 
> A journalling filesystem or database tends to write like this:
> 
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE
>    (guest) CACHE FLUSH
>    (guest) WRITE
>    (guest) CACHE FLUSH
>    (guest) WRITE
>    (guest) WRITE
>    (guest) WRITE

In the optimal case, yeah.

> Assuming that WRITE FUA is equivalent to disabling write cache, we may
> expect the WRITE FUA version to run much slower than the CACHE FLUSH
> version.

For a workload that only does FUA writes, yeah.  That is however the use
case for virtual machines.  As I'm looking into those issues I will run
some benchmarks comparing both variants.

> It's also too weak, of course, on drives which don't support FUA.
> Then you have to use CACHE FLUSH anyway, so the code should support
> that (or disable the write cache entirely, which also performs badly).
> If you don't handle drives without FUA, then you're back to "integrity
> sometimes, user must check type of hardware", which is something we're
> trying to get away from.  Integrity should not be a surprise when the
> application requests it.

As mentioned in the previous mails FUA would only be an optimization
(if it ends up helping) we do need to support the cache flush case.

> > I thought about this a lot.  It would be sensible to only require
> > the FUA semantics if O_SYNC is specified.  But from looking around at
> > users of O_DIRECT no one seems to actually specify O_SYNC with it.
> 
> O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
> inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.

O_SYNC above is the Linux O_SYNC, aka POSIX O_DSYNC.

> O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> integrity problems when direct writes are converted to buffered writes
> - which applies to all or nearly all OSes according to their
> documentation (I've read a lot of them).

It did not happen on IRIX, where O_DIRECT originated, nor does it
happen on Linux when using XFS.  Then again, at least on Linux we
provide O_SYNC (that is Linux O_SYNC, aka POSIX O_DSYNC) semantics
for that case.

> Imho, integrity should not be something which depends on the user
> knowing the details of their hardware to decide application
> configuration options - at least, not out of the box.

That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
is not what users naively expect.  And the wording in our manpages also
suggests this behaviour, although it is not entirely clear:


O_DIRECT (Since Linux 2.4.10)

	Try to minimize cache effects of the I/O to and from this file.  In
	general this will degrade performance, but it is useful in special
	situations, such as when applications do their own caching.  File I/O
	is done directly to/from user space buffers.  The I/O is synchronous,
	that is,  at the completion of a read(2) or write(2), data is
	guaranteed to have been transferred.  See NOTES below for further
	discussion.

(And yeah, the whole wording is horrible, I will send an update once
we've sorted out the semantics, including caveats about older kernels)

> > And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
> > if O_DIRECT bypasses the filesystem cache there is nothing else
> > left to sync for a non-extending write.
> 
> Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
> Ugh, how messy.

Yes.  Except when using XFS with the "osyncisosync" mount option :)

> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
> O_DIRECT either? :-)

No.  In the generic code and filesystems I looked at it simply has no
effect at all.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 17:45           ` Christoph Hellwig
@ 2009-08-21 19:18             ` Ric Wheeler
  2009-08-22  0:50             ` Jamie Lokier
  1 sibling, 0 replies; 144+ messages in thread
From: Ric Wheeler @ 2009-08-21 19:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, Jens Axboe, linux-fsdevel, linux-scsi

On 08/21/2009 01:45 PM, Christoph Hellwig wrote:
> On Fri, Aug 21, 2009 at 04:24:59PM +0100, Jamie Lokier wrote:
>    
>> In measurements I've done, disabling a disk's write cache results in
>> much slower ext3 filesystem writes than using barriers.  Others report
>> similar results.  This is with disks that don't have NCQ; good NCQ may
>> be better.
>>      
> On a scsi disk and a SATA SSD with NCQ I get different results.  Most
> workloads, in particular metadata-intensive ones and large streaming
> writes, are noticeably better just turning off the write cache.  The only
> ones that benefit from it are relatively small writes without O_SYNC
> or many fsyncs.  This is however using XFS, which tends to issue many
> more barriers than ext3.
>    

With normal S-ATA disks, streaming write workloads on ext3 run twice as 
fast with barriers & write cache enabled in my testing.

Small file workloads were more even, if I remember correctly...

ric

>    
>> Using FUA for all writes should be equivalent to writing with write
>> cache disabled.
>>
>> A journalling filesystem or database tends to write like this:
>>
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) CACHE FLUSH
>>     (guest) WRITE
>>     (guest) CACHE FLUSH
>>     (guest) WRITE
>>     (guest) WRITE
>>     (guest) WRITE
>>      
> In the optimal case, yeah.
>
>    
>> Assuming that WRITE FUA is equivalent to disabling write cache, we may
>> expect the WRITE FUA version to run much slower than the CACHE FLUSH
>> version.
>>      
> For a workload that only does FUA writes, yeah.  That is however the use
> case for virtual machines.  As I'm looking into those issues I will run
> some benchmarks comparing both variants.
>
>    
>> It's also too weak, of course, on drives which don't support FUA.
>> Then you have to use CACHE FLUSH anyway, so the code should support
>> that (or disable the write cache entirely, which also performs badly).
>> If you don't handle drives without FUA, then you're back to "integrity
>> sometimes, user must check type of hardware", which is something we're
>> trying to get away from.  Integrity should not be a surprise when the
>> application requests it.
>>      
> As mentioned in the previous mails FUA would only be an optimization
> (if it ends up helping) we do need to support the cache flush case.
>
>    
>>> I thought about this a lot.  It would be sensible to only require
>>> the FUA semantics if O_SYNC is specified.  But from looking around at
>>> users of O_DIRECT no one seems to actually specify O_SYNC with it.
>>>        
>> O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
>> inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.
>>      
> O_SYNC above is the Linux O_SYNC, aka POSIX O_DSYNC.
>
>    
>> O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
>> integrity problems when direct writes are converted to buffered writes
>> - which applies to all or nearly all OSes according to their
>> documentation (I've read a lot of them).
>>      
> It did not happen on IRIX, where O_DIRECT originated, nor does it
> happen on Linux when using XFS.  Then again, at least on Linux we
> provide O_SYNC (that is Linux O_SYNC, aka POSIX O_DSYNC) semantics
> for that case.
>
>    
>> Imho, integrity should not be something which depends on the user
>> knowing the details of their hardware to decide application
>> configuration options - at least, not out of the box.
>>      
> That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
> is not what users naively expect.  And the wording in our manpages also
> suggests this behaviour, although it is not entirely clear:
>
>
> O_DIRECT (Since Linux 2.4.10)
>
> 	Try to minimize cache effects of the I/O to and from this file.  In
> 	general this will degrade performance, but it is useful in special
> 	situations, such as when applications do their own caching.  File I/O
> 	is done directly to/from user space buffers.  The I/O is synchronous,
> 	that is,  at the completion of a read(2) or write(2), data is
> 	guaranteed to have been transferred.  See NOTES below for further
> 	discussion.
>
> (And yeah, the whole wording is horrible, I will send an update once
> we've sorted out the semantics, including caveats about older kernels)
>
>    
>>> And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
>>> if O_DIRECT bypasses the filesystem cache there is nothing else
>>> left to sync for a non-extending write.
>>>        
>> Oh, O_SYNC means O_DSYNC?  I thought it was the other way around.
>> Ugh, how messy.
>>      
> Yes.  Except when using XFS with the "osyncisosync" mount option :)
>
>    
>>> The fallback was a relatively recent addition to the O_DIRECT semantics
>>> for broken filesystems that can't handle holes very well.  Fortunately
>>> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
>>> semantics for that already.
>>>        
>> Ok, so you're saying there's no _harm_ in specifying O_DSYNC with
>> O_DIRECT either? :-)
>>      
> No.  In the generic code and filesystems I looked at it simply has no
> effect at all.
>


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 14:26       ` Christoph Hellwig
  2009-08-21 15:24         ` Jamie Lokier
@ 2009-08-21 22:08         ` Theodore Tso
  2009-08-21 22:38           ` Joel Becker
                             ` (5 more replies)
  1 sibling, 6 replies; 144+ messages in thread
From: Theodore Tso @ 2009-08-21 22:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > It turns out that applications needing integrity must use fdatasync or
> > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > choose to use buffered writes at any time, with no signal to the
> > application.
> 
> The fallback was a relatively recent addition to the O_DIRECT semantics
> for broken filesystems that can't handle holes very well.  Fortunately
> enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> semantics for that already.

Um, actually, we don't.  If we did that, we would have to wait for a
journal commit to complete before allowing the write(2) to complete,
which would be especially painfully slow for ext3.

This question recently came up on the ext4 developer's list, because
of a question of how direct I/O to a preallocated (uninitialized)
extent should be handled.  Are we supposed to guarantee synchronous
updates of the metadata by the time write(2) returns, or not?  One of
the ext4 developers (I can't remember if it was Mingming or Eric)
asked an XFS developer what they did in that case, and I believe the
answer they were given was that XFS started a commit, but did *not*
wait for the commit to complete before returning from the Direct I/O
write.  In fact, they were told (I believe this was from an SGI
engineer, but I don't remember the name; we can track that down if
it's important) that if an application wanted to guarantee metadata
would be updated for an extending write, they had to use fsync() or
O_SYNC/O_DSYNC.  

Perhaps they were given an incorrect answer, but it's clear the
semantics of exactly how Direct I/O works in edge cases isn't well
defined, or at least clearly and widely understood.

I have an early draft (for discussion only) of what we think it means and
what is currently implemented in Linux, which I've put up (again, let
me emphasize) for *discussion* here:

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4.

						- Ted

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
@ 2009-08-21 22:38           ` Joel Becker
  2009-08-21 22:45           ` Joel Becker
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 144+ messages in thread
From: Joel Becker @ 2009-08-21 22:38 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> I have an early draft (for discussion only) of what we think it means and
> what is currently implemented in Linux, which I've put up (again, let
> me emphasize) for *discussion* here:
> 
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

	I think you mean "not well specified". ;-)

Joel

-- 

Life's Little Instruction Book #511

	"Call your mother."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
  2009-08-21 22:38           ` Joel Becker
@ 2009-08-21 22:45           ` Joel Becker
  2009-08-22  2:11             ` Theodore Tso
                               ` (2 more replies)
  2009-08-21 22:45           ` Joel Becker
                             ` (3 subsequent siblings)
  5 siblings, 3 replies; 144+ messages in thread
From: Joel Becker @ 2009-08-21 22:45 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> 
> Comments are welcome, either on the wiki's talk page, or directly to
> me, or to the linux-fsdevel or linux-ext4.

	In the section on perhaps not waiting for buffered fallback, we
need to clarify that O_DIRECT reads need to know to look in the
pagecache.  That is, if we decide that extending O_DIRECT writes without
fsync can return before the data hits the storage, the caller shouldn't
also have to call fsync() just to call read() of data they just wrote!

Joel

-- 

To spot the expert, pick the one who predicts the job will take the
longest and cost the most.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 17:45           ` Christoph Hellwig
  2009-08-21 19:18             ` Ric Wheeler
@ 2009-08-22  0:50             ` Jamie Lokier
  2009-08-22  2:19               ` Theodore Tso
  2009-08-24  2:34               ` Christoph Hellwig
  1 sibling, 2 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-22  0:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

Christoph Hellwig wrote:
> > O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> > integrity problems when direct writes are converted to buffered writes
> > - which applies to all or nearly all OSes according to their
> > documentation (I've read a lot of them).
> 
> It did not happen on IRIX, where O_DIRECT originated,

IRIX has an unusually sane O_DIRECT - at least according to its
documentation.  This is write(2):

     When attempting to write to a file with O_DIRECT or FDIRECT set,
     the portion being written can not be locked in memory by any
     process. In this case, -1 will be returned and errno will be set
     to EBUSY.

AIX however says this:

     In order to avoid consistency issues between programs that use
     Direct I/O and programs that use normal cached I/O, Direct I/O is
     by default used in an exclusive use mode. If there are multiple
     opens of a file and some of them are direct and others are not,
     the file will stay in its normal cached access mode. Only when
     the file is open exclusively by Direct I/O programs will the file
     be placed in Direct I/O mode.

     Similarly, if the file is mapped into virtual memory via the
     shmat() or mmap() system calls, then file will stay in normal
     cached mode.

     The JFS or JFS2 will attempt to move the file into Direct I/O
     mode any time the last conflicting, non-direct access is
     eliminated (either by close(), munmap(), or shmdt()
     subroutines). Changing the file from normal mode to Direct I/O
     mode can be rather expensive since it requires writing all
     modified pages to disk and removing all the file's pages from
     memory.

> nor does it happen on Linux when using XFS.  Then again, at least on
> Linux we provide O_SYNC (that is Linux O_SYNC, aka POSIX O_DSYNC)
> semantics for that case.

As Ted Ts'o pointed out, we don't.

> > Imho, integrity should not be something which depends on the user
> > knowing the details of their hardware to decide application
> > configuration options - at least, not out of the box.
> 
> That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
> is not what users naively expect.

Oh, I agree with that.  That comes from observing that quasi-portable
code using O_DIRECT needs to use O_DSYNC too because several OSes and
filesystems on those OSes revert to buffered writes under some
circumstances, in which case you want O_DSYNC anyway.  That has nothing
to do with hardware caches, but it's a lucky coincidence that
fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
would then make sense as an FUA equivalent.

> And the wording in our manpages also suggests this behaviour,
> although it is not entirely clear:
> 
> O_DIRECT (Since Linux 2.4.10)
> 	Try to minimize cache effects of the I/O to and from this file.  In
> 	general this will degrade performance, but it is useful in special
> 	situations, such as when applications do their own caching.  File I/O
> 	is done directly to/from user space buffers.  The I/O is synchronous,
> 	that is,  at the completion of a read(2) or write(2), data is
> 	guaranteed to have been transferred.  See NOTES below for further
> 	discussion.

Perhaps in the same way that fsync/fdatasync aren't clear on disk
cache behaviour either.  On Linux and some other OSes.

> (And yeah, the whole wording is horrible, I will send an update once
> we've sorted out the semantics, including caveats about older kernels)

One thing it's unhelpful about is performance.  O_DIRECT tends to
improve performance for applications that do their own caching; it
also improves whole-system performance when caching would cause
memory pressure, and on Linux O_DIRECT is necessary for AIO, which
can improve performance.

I have a 166MHz embedded device that I'm using O_DIRECT on to improve
performance - from 1MB/s to 10MB/s.

However if O_DIRECT is changed to force each write(2) through the disk
cache separately, then it will no longer provide this performance
boost at least with some kinds of disk.

That's why it's important not to change it casually.  Maybe it's the
right thing to do, but then it will be important to provide another
form of O_DIRECT which does not write through the disk cache, instead
providing a barrier capability.

(...After all, if we believed in integrity above everything then barriers
would be enabled for ext3 by default, *ahem*.)

Probably the best thing to do is look at what other OSes that are used
by databases etc. do with O_DIRECT, and if it makes sense, copy it.

What does IRIX do?  Does O_DIRECT on IRIX write through the drive's
cache?  What about Solaris?

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
                             ` (2 preceding siblings ...)
  2009-08-21 22:45           ` Joel Becker
@ 2009-08-22  0:56           ` Jamie Lokier
  2009-08-22  2:06             ` Theodore Tso
  2009-08-26  6:34           ` Dave Chinner
  2009-08-26  6:34           ` Dave Chinner
  5 siblings, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-22  0:56 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> > 
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Um, actually, we don't.  If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.
> 
> This question recently came up on the ext4 developer's list, because
> of a question of how direct I/O to an preallocated (uninitialized)
> extent should be handled.  Are we supposed to guarantee synchronous
> updates of the metadata by the time write(2) returns, or not?  One of
> the ext4 developers (I can't remember if it was Mingming or Eric)
> asked an XFS developer what they did in that case, and I believe the
> answer they were given was that XFS started a commit, but did *not*
> wait for the commit to complete before returning from the Direct I/O
> write.  In fact, they were told (I believe this was from an SGI
> engineer, but I don't remember the name; we can track that down if
> it's important) that if an application wanted to guarantee metadata
> would be updated for an extending write, they had to use fsync() or
> O_SYNC/O_DSYNC.  
> 
> Perhaps they were given an incorrect answer, but it's clear the
> semantics of exactly how Direct I/O works in edge cases isn't well
> defined, or at least clearly and widely understood.

And that's not even a hardware cache issue, just whether filesystem
metadata is written.

AIX behaves like XFS according to documentation:

    [ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have been written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

That's another reason to use O_DIRECT|O_DSYNC in moderately portable
code.

> I have an early draft (for discussion only) what we think it means and
> what is currently implemented in Linux, which I've put up, (again, let
> me emphasisize) for *discussion* here:
> 
> http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> 
> Comments are welcome, either on the wiki's talk page, or directly to
> me, or to the linux-fsdevel or linux-ext4.

I haven't read it yet.  One thing which comes to mind is that it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t. data-finding metadata, preallocation, file extending, hole
filling, unaligned access and what alignment is required, block
devices vs. files and different filesystems and behaviour-modifying
mount options, file open for buffered I/O on another descriptor, file
has mapped pages, mlocked pages, and of course drive cache write
through or not.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  0:56           ` Jamie Lokier
@ 2009-08-22  2:06             ` Theodore Tso
  0 siblings, 0 replies; 144+ messages in thread
From: Theodore Tso @ 2009-08-22  2:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Sat, Aug 22, 2009 at 01:56:13AM +0100, Jamie Lokier wrote:
> AIX behaves like XFS according to documentation:
> 
>     [ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm ]
> 
>     Direct I/O and Data I/O Integrity Completion
> 
>     Although direct I/O writes are done synchronously, they do not
>     provide synchronized I/O data integrity completion, as defined by
>     POSIX. Applications that need this feature should use O_DSYNC in
>     addition to O_DIRECT. O_DSYNC guarantees that all of the data and
>     enough of the metadata (for example, indirect blocks) have been written
>     to the stable store to be able to retrieve the data after a system
>     crash. O_DIRECT only writes the data; it does not write the
>     metadata.
> 
> That's another reason to use O_DIRECT|O_DSYNC in moderately portable
> code.

...or use fsync() when they need to guarantee that data has been
atomically written, but not before.  This becomes critically important
if the application is writing into a sparse file, or writing into
uninitialized blocks that were allocated using fallocate(); otherwise,
with O_DIRECT|O_DSYNC, the file system would have to do a commit
operation after each write, which could be a performance disaster.

> > http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> > 
> > Comments are welcome, either on the wiki's talk page, or directly to
> > me, or to the linux-fsdevel or linux-ext4.
> 
> I haven't read it yet.  One thing which comes to mind is it would be
> good to summarise what other OSes as well as Linux do with O_DIRECT
> w.r.t. data-finding metadata, preallocation, file extending, hole
> filling, unaligned access and what alignment is required, block
> devices vs. files and different filesystems and behaviour-modifying
> mount options, file open for buffered I/O on another descriptor, file
> has mapped pages, mlocked pages, and of course drive cache write
> through or not.

It's a wiki; contributions to define all of that are welcome.  :-)

We may want to carefully consider what we want to guarantee for all
time to application writers, and what we might want to leave open to
allow for performance optimizations by the kernel, though.

      	  	      		       	   	   - Ted

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:45           ` Joel Becker
@ 2009-08-22  2:11             ` Theodore Tso
  2009-08-24  2:42               ` Christoph Hellwig
  2009-08-24  2:37             ` Christoph Hellwig
  2009-08-24  2:37             ` Christoph Hellwig
  2 siblings, 1 reply; 144+ messages in thread
From: Theodore Tso @ 2009-08-22  2:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jamie Lokier, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 03:45:18PM -0700, Joel Becker wrote:
> On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> > http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
> > 
> > Comments are welcome, either on the wiki's talk page, or directly to
> > me, or to the linux-fsdevel or linux-ext4.
> 
> 	In the section on perhaps not waiting for buffered fallback, we
> need to clarify that O_DIRECT reads need to know to look in the
> pagecache.  That is, if we decide that extending O_DIRECT writes without
> fsync can return before the data hits the storage, the caller shouldn't
> also have to call fsync() just to call read() of data they just wrote!

Yeah, I guess we can only do that if the filesystem guarantees
coherence between the page cache and O_DIRECT reads; it's been a long
while since I've studied that code, so I'm not sure whether all
filesystems that support O_DIRECT provide this coherency or not
(I thought it was provided in the generic O_DIRECT routines, isn't
it?).

							- Ted

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  0:50             ` Jamie Lokier
@ 2009-08-22  2:19               ` Theodore Tso
  2009-08-22  2:31                 ` Theodore Tso
  2009-08-24  2:34               ` Christoph Hellwig
  1 sibling, 1 reply; 144+ messages in thread
From: Theodore Tso @ 2009-08-22  2:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Sat, Aug 22, 2009 at 01:50:06AM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > > O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
> > > integrity problems when direct writes are converted to buffered writes
> > > - which applies to all or nearly all OSes according to their
> > > documentation (I've read a lot of them).
> > 
> > It did not happen on IRIX, where O_DIRECT originated,
> 
> IRIX has an unusually sane O_DIRECT - at least according to its
> documentation.  This is write(2):
> 
>      When attempting to write to a file with O_DIRECT or FDIRECT set,
>      the portion being written can not be locked in memory by any
>      process. In this case, -1 will be returned and errno will be set
>      to EBUSY.

Can you forward a pointer to an Irix man page which describes its
O_DIRECT semantics (or at least what they claim in their man pages)?
I was looking for one on the web, but I couldn't seem to find any
on-line web pages for Irix.  

It'd be nice if we could also get permission from SGI to quote
relevant sections in the "Clarifying Direct I/O Semantics" wiki page,
in case we end up quoting more than what someone might consider fair
game for fair use; but for now, I'd be really happy getting something
that I could keep for reference purposes.  Was there anything more
than what you quoted in the Irix write(2) man page about O_DIRECT?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  2:19               ` Theodore Tso
@ 2009-08-22  2:31                 ` Theodore Tso
  0 siblings, 0 replies; 144+ messages in thread
From: Theodore Tso @ 2009-08-22  2:31 UTC (permalink / raw)
  To: Jamie Lokier, Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 10:19:56PM -0400, Theodore Tso wrote:
> Can you forward a pointer to an Irix man page which describes its
> O_DIRECT semantics (or at least what they claim in their man pages)?
> I was looking for one on the web, but I couldn't seem to find any
> on-line web pages for Irix.  

Never mind, I found it.  (And I've added the relevant bits to the wiki
article).

					- Ted

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  0:50             ` Jamie Lokier
  2009-08-22  2:19               ` Theodore Tso
@ 2009-08-24  2:34               ` Christoph Hellwig
  2009-08-27 14:34                 ` Jamie Lokier
  1 sibling, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-24  2:34 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Sat, Aug 22, 2009 at 01:50:06AM +0100, Jamie Lokier wrote:
> Oh, I agree with that.  That comes from observing that quasi-portable
> code using O_DIRECT needs to use O_DSYNC too because several OSes and
> filesystems on those OSes revert to buffered writes under some
> circumstances, in which case you want O_DSYNC too.  That has nothing
> to do with hardware caches, but it's a lucky coincidence that
> fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
> would then make sense as an FUA equivalent.

I agree.  I do however worry about all the software using O_DIRECT
that is around now.  Less so about the databases and HPC workloads on
expensive hardware, because they usually run on vendor-approved scsi
disks that have the write back cache disabled, but rather about things
like virtualization software or other things that get run on commodity
hardware.

Then again they already don't get what they expect and never did,
so if we clearly document and communicate the O_SYNC (that is Linux
O_SYNC) requirement we might be able to go with this.

> Perhaps in the same way that fsync/fdatasync aren't clear on disk
> cache behaviour either.  On Linux and some other OSes.

The disk write cache really is an implementation detail; it has no
business in Posix.

Posix seems to define the semantics for fdatasync and co. relatively
well (that is, if you like the specification speak in there):

"The fdatasync() function forces all currently queued I/O operations
 associated with the file indicated by file descriptor fildes to the
 synchronised I/O completion state."

"synchronised I/O data integrity completion

 o For read, when the operation has been completed or diagnosed if
   unsuccessful. The read is complete only when an image of the data has
   been successfully transferred to the requesting process. If there were
   any pending write requests affecting the data to be read at the time
   that the synchronised read operation was requested, these write
   requests shall be successfully transferred prior to reading the
   data."
 o For write, when the operation has been completed or diagnosed if
   unsuccessful. The write is complete only when the data specified in the
   write request is successfully transferred and all file system
   information required to retrieve the data is successfully transferred."

Given that it talks about the data being retrievable, a volatile cache
does not seem to meet the above criteria.  But yeah, it's horrible
language.

> What does IRIX do?  Does O_DIRECT on IRIX write through the drive's
> cache?  What about Solaris?

IRIX only came pre-packaged with SGI MIPS systems, which, like most of
the more expensive hardware, were not configured with volatile
write-back caches.  That is btw still the case for all the more
expensive hardware I have.  The whole issue with volatile write-back
caches is just too much of a data integrity nightmare to enable them
where your customers actually care about their data.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:45           ` Joel Becker
  2009-08-22  2:11             ` Theodore Tso
@ 2009-08-24  2:37             ` Christoph Hellwig
  2009-08-24  2:37             ` Christoph Hellwig
  2 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-24  2:37 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 03:45:18PM -0700, Joel Becker wrote:
> 	In the section on perhaps not waiting for buffered fallback, we
> need to clarify that O_DIRECT reads need to know to look in the
> pagecache.  That is, if we decide that extending O_DIRECT writes without
> fsync can return before the data hits the storage, the caller shouldn't
> also have to call fsync() just to call read() of data they just wrote!

The way the O_DIRECT fallback is implemented currently is that data does
hit the disk before return, thanks to a:

	err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
					SYNC_FILE_RANGE_WAIT_BEFORE|
					SYNC_FILE_RANGE_WRITE|
					SYNC_FILE_RANGE_WAIT_AFTER);

which I expected to also sync the required metadata to disk, which
it doesn't.  Which btw is really horrible semantics given that
we export that beast to userspace as a separate system call.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-22  2:11             ` Theodore Tso
@ 2009-08-24  2:42               ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-24  2:42 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 10:11:37PM -0400, Theodore Tso wrote:
> Yeah, I guess we can only do that if the filesystem guarantees
> coherence between the page cache and O_DIRECT reads; it's been a long
> while since I've studied that code, so I'm not sure whether all
> filesystems that support O_DIRECT provide this coherency (since I
> thought it was provided in the generic O_DIRECT routines, isn't it?)
> or not.

It's provided in the generic code, yes (or at least appears to be).

Note that xfstests has quite a few tests exercising it.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-21 22:08         ` Theodore Tso
                             ` (4 preceding siblings ...)
  2009-08-26  6:34           ` Dave Chinner
@ 2009-08-26  6:34           ` Dave Chinner
  2009-08-26 15:01             ` Jamie Lokier
  5 siblings, 1 reply; 144+ messages in thread
From: Dave Chinner @ 2009-08-26  6:34 UTC (permalink / raw)
  To: Theodore Tso, Christoph Hellwig, Jamie Lokier, Jens Axboe,
	linux-fsdevel, linux-scsi

On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> > 
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Um, actually, we don't.  If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.
> 
> This question recently came up on the ext4 developer's list, because
> of a question of how direct I/O to an preallocated (uninitialized)
> extent should be handled.  Are we supposed to guarantee synchronous
> updates of the metadata by the time write(2) returns, or not?  One of
> the ext4 developers (I can't remember if it was Mingming or Eric)
> asked an XFS developer what they did in that case, and I believe the
> answer they were given was that XFS started a commit, but did *not*
> wait for the commit to complete before returning from the Direct I/O
> write.  In fact, they were told (I believe this was from an SGI
> engineer, but I don't remember the name; we can track that down if
> it's important) that if an application wanted to guarantee metadata
> would be updated for an extending write, they had to use fsync() or
> O_SYNC/O_DSYNC.  

That would have been Eric asking me.  My answer was that O_DIRECT does
not imply any new data integrity guarantees associated with a
write(2) call - it just avoids system caches.  You get the same
guarantees of resiliency as a non-O_DIRECT write(2) call at
completion - it may or may not be there if you crash.  If you want
some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
call f[data]sync(2), just like all other IO.

Also, note that direct IO is not necessarily synchronous - you can
do asynchronous direct IO.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-26  6:34           ` Dave Chinner
@ 2009-08-26 15:01             ` Jamie Lokier
  2009-08-26 18:47               ` Theodore Tso
  0 siblings, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-26 15:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Tso, Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

Dave Chinner wrote:
> On Fri, Aug 21, 2009 at 06:08:52PM -0400, Theodore Tso wrote:
> > On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > > It turns out that applications needing integrity must use fdatasync or
> > > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > > choose to use buffered writes at any time, with no signal to the
> > > > application.
> > > 
> > > The fallback was a relatively recent addition to the O_DIRECT semantics
> > > for broken filesystems that can't handle holes very well.  Fortunately
> > > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > > semantics for that already.
> > 
> > Um, actually, we don't.  If we did that, we would have to wait for a
> > journal commit to complete before allowing the write(2) to complete,
> > which would be especially painfully slow for ext3.
> > 
> > This question recently came up on the ext4 developer's list, because
> > of a question of how direct I/O to an preallocated (uninitialized)
> > extent should be handled.  Are we supposed to guarantee synchronous
> > updates of the metadata by the time write(2) returns, or not?  One of
> > the ext4 developers (I can't remember if it was Mingming or Eric)
> > asked an XFS developer what they did in that case, and I believe the
> > answer they were given was that XFS started a commit, but did *not*
> > wait for the commit to complete before returning from the Direct I/O
> > write.  In fact, they were told (I believe this was from an SGI
> > engineer, but I don't remember the name; we can track that down if
> > it's important) that if an application wanted to guarantee metadata
> > would be updated for an extending write, they had to use fsync() or
> > O_SYNC/O_DSYNC.  
> 
> That would have been Eric asking me. My answer that O_DIRECT does
> not imply any new data integrity guarantees associated with a
> write(2) call - it just avoids system caches. You get the same
> guarantees of resiliency as a non-O_DIRECT write(2) call at
> completion - it may or may notbe there if you crash. If you want
> some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
> call f[data]sync(2) just like all other IO.
> 
> Also, note that direct IO is not necessarily synchronous - you can
> do asynchronous direct IO.....

I agree with all of the above, except:

  1. If the automatic O_SYNC fallback mentioned by Christoph is
     currently implemented at all, even in a subset of filesystems,
     then I think it should be removed.

     An app which wants integrity should be calling fsync/fdatasync or
     using O_DSYNC/O_SYNC explicitly - with fsync/fdatasync giving
     more control over batching.

     If it doesn't do any of those things, it may be using O_DIRECT
     for performance, and not wish to be penalised by an expensive
     O_SYNC on every individual write.  Especially when O_SYNC is
     fixed to commit drive caches.

  2. I agree with everything Dave said about needing to use some other
     mechanism for an integrity commit; O_DIRECT is not enough.

     We can't realistically make O_DIRECT (by itself) do integrity
     commits anyway, because on some drives that involves committing
     the drive cache, and it would be a large performance regression.
     Given O_DIRECT is often used for its performance, that's not an
     option.

  3. Currently none of the options provides good integrity commit.

     All of them fail to commit drive caches under some circumstances;
     even fsync on ext3 with barriers enabled (because it doesn't
     commit a journal record if there were writes but no inode change
     with data=ordered).

     This should be changed (or at least made optionally available),
     and that's all the more reason to avoid commit operations except
     when requested.

  4. On drives which need it, fdatasync/fsync must trigger a drive
     cache flush even when there is no dirty page cache to write,
     because dirty pages may have been written in the background
     already, and because O_DIRECT writes dirty the drive cache but
     not the page cache.

     A per-drive flag would make sense to optimise this: It is set by
     any non-FUA writes sent to the drive while the drive's writeback
     cache is enabled, and cleared when any cache flush command is
     sent.  When the flag is clear, further cache flush commands don't
     need to be sent.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-20 16:27         ` Christoph Hellwig
  (?)
@ 2009-08-26 18:22           ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-26 18:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> Maybe you can help brain storming, but I still can't see any way in that
> the
> 
>   - write data
>   - write inode
>   - wait for data
> 
> actually is a benefit in terms of semantics (I agree that it could be
> faster in theory, but even that is debatable with todays seek latencies
> in disks)

Btw, another reason why our current default is actively harmful:

	barriers

With volatile write caches we do have to flush the disk write cache in
->fsync, either implicitly by a metadata operation, or explicitly if
only data changed.  Unless the filesystem waits itself for the data
to hit the disk, like XFS or btrfs do, we will issue the cache flush
potentially before the data write has actually reached the disk cache.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-26 15:01             ` Jamie Lokier
@ 2009-08-26 18:47               ` Theodore Tso
  2009-08-27 14:50                 ` Jamie Lokier
  0 siblings, 1 reply; 144+ messages in thread
From: Theodore Tso @ 2009-08-26 18:47 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Dave Chinner, Christoph Hellwig, Jens Axboe, linux-fsdevel, linux-scsi

On Wed, Aug 26, 2009 at 04:01:02PM +0100, Jamie Lokier wrote:
>   1. If the automatic O_SYNC fallback mentioned by Christoph is
>      currently implemented at all, even in a subset of filesystems,
>      then I think it should be removed.

Could you clarify what you meant by "it" above?  I'm not sure I
understood what you were referring to.

Also, it sounds like you and Dave are mostly agreeing with the what
I've written here; is that true?

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

I'm trying to get consensus that this is both (a) an accurate
description of the state of affairs in Linux, and (b) what we think
things should be, before I start circulating it around application
developers (especially database developers), to make sure they have
the same understanding of O_DIRECT semantics as we have.

> 
>   4. On drives which need it, fdatasync/fsync must trigger a drive
>      cache flush even when there is no dirty page cache to write,
>      because dirty pages may have been written in the background
>      already, and because O_DIRECT writes dirty the drive cache but
>      not the page cache.
> 

I agree we *should* do this, but we're going to take a pretty serious
performance hit when we do.  Mac OS chickened out and added an
F_FULLFSYNC option:

http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fcntl.2.html

The concern is that there are GUI programmers that want to update state
files after every window resize or move, and after every click in a web
browser.  These GUI programmers then get cranky when changes get lost
after proprietary video drivers cause the laptop to lock up.  If we
make fsync() too burdensome, then fewer and fewer applications will
use it.  Evidently the MacOS developers decided that the set of
applications which really care about device cache flushes is much
smaller than the vast number of applications that need a lightweight
file flush.  Should we do the same?

It seems like an awful cop-out, but having seen, up front and
personal, how "aggressively stupid" some desktop programmers can be[1],
I can **certainly** understand why Apple chose the F_FULLFSYNC route.

[1] http://josefsipek.net/blahg/?p=364

    		  	     	      - Ted
				      (who really needs to get himself
				       an O_PONIES t-shirt :-)

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
  2009-08-26 18:22           ` Christoph Hellwig
  (?)
@ 2009-08-27  0:04             ` Christoph Hellwig
  -1 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Wed, Aug 26, 2009 at 02:22:36PM -0400, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> > Maybe you can help brain storming, but I still can't see any way in that
> > the
> > 
> >   - write data
> >   - write inode
> >   - wait for data
> > 
> > actually is a benefit in terms of semantics (I agree that it could be
> > faster in theory, but even that is debatable with todays seek latencies
> > in disks)
> 
> Btw, another reason why our current default is actively harmful:
> 
> 	barriers
> 
> With volatile write caches we do have to flush the disk write cache in
> ->fsync, either implicitly by a metadata operation, or explicitly if
> only data changed.  Unless the filesystem waits itself for the data
> to hit the disk, like XFS or btrfs do, we will issue the cache flush
> potentially before the data write has actually reached the disk cache.

Ok, this one failed the reality check - no matter how hard I tried
I could not reproduce that case in my test harness.  It turns out
cache flush requests are quite sensibly treated as barriers by the block
layer and thus we drain the queue before issuing the cache flush.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* [Ocfs2-devel] [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
@ 2009-08-27  0:04             ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, LKML, Evgeniy Polyakov, ocfs2-devel,
	Joel Becker, Felix Blyakher, xfs, Anton Altaparmakov,
	linux-ntfs-dev, OGAWA Hirofumi, linux-ext4, tytso

On Wed, Aug 26, 2009 at 02:22:36PM -0400, Christoph Hellwig wrote:
> On Thu, Aug 20, 2009 at 12:27:29PM -0400, Christoph Hellwig wrote:
> > Maybe you can help brain storming, but I still can't see any way in that
> > the
> > 
> >   - write data
> >   - write inode
> >   - wait for data
> > 
> > actually is a benefit in terms of semantics (I agree that it could be
> > faster in theory, but even that is debatable with todays seek latencies
> > in disks)
> 
> Btw, another reason why our current default is actively harmful:
> 
> 	barriers
> 
> With volatile write caches we do have to flush the disk write cache in
> ->fsync, either implicitly by a metadata operation, or explicitly if
> only data changed.  Unless the filesystems waits itself for the data
> to hit the disk like XFS or btrfs will be issue the cache flush
> potentially before the data write has actually reached the disk cache.

Ok, this one failed the reality check - no matter how hard I tried
I could not reproduce that case in my test harness.  It turns out
cache flush request are quite sensibly treated as barriers by the block
layer and thus we drain the queue before issueing the cache flush.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-24  2:34               ` Christoph Hellwig
@ 2009-08-27 14:34                 ` Jamie Lokier
  2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
  0 siblings, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-27 14:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-fsdevel, linux-scsi

Christoph Hellwig wrote:
> Then again they already don't get what they expect and never did,
> so if we clearly document and communicate the O_SYNC (that is, Linux
> O_SYNC) requirement we might be able to go with this.

I'm thinking, while we're looking at this, that now is a really good
time to split up O_SYNC and O_DSYNC.

We have separate fsync and fdatasync, so it should be quite tidy now.

Then we can document using O_DSYNC on Linux, which is fine for older
versions because it has the same value as O_SYNC at the moment.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: O_DIRECT and barriers
  2009-08-26 18:47               ` Theodore Tso
@ 2009-08-27 14:50                 ` Jamie Lokier
  0 siblings, 0 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-27 14:50 UTC (permalink / raw)
  To: Theodore Tso, Dave Chinner, Christoph Hellwig, Jens Axboe,
	linux-fsdevel, linux-scsi

Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 04:01:02PM +0100, Jamie Lokier wrote:
> >   1. If the automatic O_SYNC fallback mentioned by Christopher is
> >      currently implemented at all, even in a subset of filesystems,
> >      then I think it should be removed.
> 
> Could you clarify what you meant by "it" above?  I'm not sure I
> understood what you were referring to.

I meant the automatic O_SYNC fallback, in other words, if O_DIRECT
falls back to buffered writing, Chris said it automatically did
O_SYNC, and you followed up by saying it doesn't :-)

All I'm saying is if there's _some_ code doing O_SYNC writing when
O_DIRECT falls back to buffered, it should be ripped out.  Leave the
syncing to explicit fsync calls from userspace.

> >   4. On drives which need it, fdatasync/fsync must trigger a drive
> >      cache flush even when there is no dirty page cache to write,
> >      because dirty pages may have been written in the background
> >      already, and because O_DIRECT writes dirty the drive cache but
> >      not the page cache.
> > 
> 
> I agree we *should* do this, but we're going to take a pretty serious
> performance hit when we do.  Mac OS chickened out and added an
F_FULLFSYNC option:

I know about that one.  (I've done quite a lot of research on O_DIRECT
and fsync behaviours).  It's really unfortunate that they didn't
provide F_FULLDATASYNC, which is what a database or VM would ideally
use.

I think Vxfs provides a whole suite of mount options to adjust what
O_SYNC and fdatasync actually do.

> The concern is that there are GUI programmers that want to update state
> files after every window resize or move, and after every click in a web
> browser.  These GUI programmers then get cranky when changes get lost
> after proprietary video drivers cause the laptop to lock up.  If we
> make fsync() too burdensome, then fewer and fewer applications will
> use it.  Evidently the MacOS developers decided that the number of
> applications which really cared about doing device cache flushes was
> much smaller than the vast number of applications that need a
> lightweight file flush.  Should we do the same?

If fsync is cheap but doesn't commit changes properly - what's the
point in encouraging applications to use it?  Without drive cache
flushes, they will still lose changes occasionally.

(Btw, don't blame proprietary video drivers.  I see too many lockups
with open source video drivers too.)

> It seems like an awful cop-out, but having seen, up front and
> personal, how "aggressively stupid" some desktop programmers can be[1],
> I can **certainly** understand why Apple chose the F_FULLFSYNC route.

I did see a few of those threads, and I think your solution was genius.
Genius at keeping people quiet that is :-)

But it's also a good default.  fsync() isn't practical in shell
scripts or Makefiles, although that's really because "mv" lacks the
fsync option...
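The "mv lacks the fsync option" complaint refers to the classic durable-replace pattern. A minimal userspace sketch of it follows; the function name and error handling are illustrative, not from any message in this thread:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* The pattern "mv" is missing: write a temp file, fsync it, rename it
 * over the target, then fsync the directory so the rename itself is
 * durable.  Error handling abbreviated. */
static int durable_replace(const char *dir, const char *tmp,
                           const char *dst, const char *data)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
        fsync(fd) < 0) {                 /* data durable before rename */
        close(fd);
        return -1;
    }
    close(fd);
    if (rename(tmp, dst) < 0)            /* atomic replace */
        return -1;
    int dfd = open(dir, O_RDONLY);       /* make the rename durable too */
    if (dfd < 0)
        return -1;
    int ret = fsync(dfd);
    close(dfd);
    return ret;
}
```

Without the final fsync on the directory, a crash can leave the old name pointing at the old file even though the data itself made it to disk.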

Personally I side with "want some kind of full-system asynchronous
transactionality please".  (Possibly aka. O_PONIES :-)

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 14:34                 ` Jamie Lokier
@ 2009-08-27 17:10                   ` Christoph Hellwig
  2009-08-27 17:24                       ` Ulrich Drepper
  0 siblings, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-27 17:10 UTC (permalink / raw)
  To: Jamie Lokier, Ulrich Drepper; +Cc: linux-fsdevel, linux-kernel

On Thu, Aug 27, 2009 at 03:34:59PM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > Then again they already don't get what they expect and never did,
> > so if we clear document and communicate the O_SYNC (that is Linux
> > O_SYNC) requirement we might be able to go with this.
> 
> I'm thinking, while we're looking at this, that now is a really good
> time to split up O_SYNC and O_DSYNC.
> 
> We have separate fsync and fdatasync, so it should be quite tidy now.
> 
> Then we can document using O_DSYNC on Linux, which is fine for older
> versions because it has the same value as O_SYNC at the moment.

Technically we could easily make O_SYNC really mean O_SYNC and implement
a separate O_DSYNC at the kernel level.

The question is how to handle this at the libc level.  Currently glibc
defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
for olders.  I'm not sure how feasible this is, but maybe Ulrich has
some better ideas.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
@ 2009-08-27 17:24                       ` Ulrich Drepper
  0 siblings, 0 replies; 144+ messages in thread
From: Ulrich Drepper @ 2009-08-27 17:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/27/2009 10:10 AM, Christoph Hellwig wrote:
> The question is how to handle this at the libc level.  Currently glibc
> defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
> through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
> for olders.  I'm not sure how feasible this is, but maybe Ulrich has
> some better ideas.

The problem with O_* extensions is that the syscall doesn't fail if the 
flag is not handled.  This is a problem in the open implementation which 
can only be fixed with a new syscall.

Why can we not just go on and say we interpret O_SYNC like O_SYNC and 
O_SYNC|O_DSYNC like O_DSYNC?  The POSIX spec explicitly requires that 
the latter be handled like O_SYNC.

We could handle it by allocating two bits, only one is handled in the 
kernel.  If the O_DSYNC definition for userlevel would be different from 
the kernel definition then the kernel could interpret O_SYNC|O_DSYNC 
like O_DSYNC.  The libc would then have to translate the userlevel 
O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel 
and the application, the userlevel flag would be passed to the kernel 
and nothing bad happens.

The cleaner alternative is to have a sys_newopen which checks for 
unknown flags and fails in that case.
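The two-bit scheme above can be sketched in a few lines. All constant values below are illustrative placeholders, not the real kernel or glibc bits, and the function names are invented for the sketch:

```c
#include <assert.h>

/* Illustrative placeholder values -- not real kernel/glibc constants. */
#define K_O_SYNC   0x1000   /* existing kernel O_SYNC bit */
#define K_O_DSYNC  0x2000   /* new bit the kernel would learn to handle */
#define U_O_DSYNC  0x4000   /* userlevel O_DSYNC, deliberately different */

enum sync_mode { SYNC_NONE, SYNC_DATA, SYNC_ALL };

/* What a new libc's open() wrapper would do before the syscall:
 * translate the userlevel O_DSYNC into the bit the kernel knows. */
static int libc_translate(int uflags)
{
    if (uflags & U_O_DSYNC)
        uflags = (uflags & ~U_O_DSYNC) | K_O_DSYNC;
    return uflags;
}

/* Kernel side, per the proposal: O_SYNC alone means O_SYNC, while
 * O_SYNC|O_DSYNC is interpreted like O_DSYNC.  An unknown userlevel
 * bit arriving untranslated (old libc) is simply ignored. */
static enum sync_mode kernel_mode(int kflags)
{
    if (kflags & K_O_DSYNC)
        return SYNC_DATA;
    if (kflags & K_O_SYNC)
        return SYNC_ALL;
    return SYNC_NONE;
}
```

The last case is the backward-compatibility property described above: an old libc passes the unknown userlevel bit straight to the kernel, which ignores it.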

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-27 17:24                       ` Ulrich Drepper
@ 2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                           ` Ulrich Drepper
                                           ` (2 more replies)
  -1 siblings, 3 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 15:46 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Thu, Aug 27, 2009 at 10:24:28AM -0700, Ulrich Drepper wrote:
> The problem with O_* extensions is that the syscall doesn't fail if the  
> flag is not handled.  This is a problem in the open implementation which  
> can only be fixed with a new syscall.
>
> Why can we not just go on and say we interpret O_SYNC like O_SYNC and  
> O_SYNC|O_DSYNC like O_DSYNC?  The POSIX spec explicitly requires that  
> the latter be handled like O_SYNC.
>
> We could handle it by allocating two bits, only one is handled in the  
> kernel.  If the O_DSYNC definition for userlevel would be different from  
> the kernel definition then the kernel could interpret O_SYNC|O_DSYNC  
> like O_DSYNC.  The libc would then have to translate the userlevel  
> O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel  
> and the application, the userlevel flag would be passed to the kernel  
> and nothing bad happens.

What about the following variant:

 - given that our current O_SYNC really is, and always has been, POSIX
   O_DSYNC, keep the numerical value and rename it to O_DSYNC in
   the headers.
 - Add a new O_SYNC definition:

	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)

   and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
   present.
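The proposed layout can be sketched as follows. The constant values are illustrative (real kernel open flags are per-architecture octal values), and O_REALLY_SYNC is the hypothetical extra bit from the proposal:

```c
#include <assert.h>

/* Illustrative bit values only. */
#define MY_O_DSYNC       00010000  /* old Linux "O_SYNC", i.e. POSIX O_DSYNC */
#define MY_O_REALLY_SYNC 00100000  /* new bit: also sync non-essential metadata */
#define MY_O_SYNC        (MY_O_DSYNC | MY_O_REALLY_SYNC)

enum sync_mode { SYNC_DATA_ONLY, SYNC_DATA_AND_METADATA };

/* What a new kernel would check at sync time.  An old kernel masks off
 * the unknown MY_O_REALLY_SYNC bit, sees only MY_O_DSYNC inside
 * MY_O_SYNC, and silently degrades to today's data-sync behaviour. */
static enum sync_mode sync_mode_for(int flags)
{
    return (flags & MY_O_REALLY_SYNC) ? SYNC_DATA_AND_METADATA
                                      : SYNC_DATA_ONLY;
}
```

The key property is that O_SYNC contains the old O_DSYNC bit, so old binaries and old kernels keep exactly the behaviour they had before.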

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
@ 2009-08-28 16:06                           ` Ulrich Drepper
  2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 23:06                         ` Jamie Lokier
  2 siblings, 0 replies; 144+ messages in thread
From: Ulrich Drepper @ 2009-08-28 16:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 08:46 AM, Christoph Hellwig wrote:
>   - given that our current O_SYNC really is, and always has been,
>     POSIX O_DSYNC

If this is true, then this proposal would work, yes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:06                           ` Ulrich Drepper
@ 2009-08-28 16:17                           ` Christoph Hellwig
  2009-08-28 16:33                               ` Ulrich Drepper
  -1 siblings, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 16:17 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Fri, Aug 28, 2009 at 09:06:35AM -0700, Ulrich Drepper wrote:
> On 08/28/2009 08:46 AM, Christoph Hellwig wrote:
>>   - given that our current O_SYNC really is and always has been actuall
>>     Posix O_DSYNC
>
> If this is true, then this proposal would work, yes.

I'll put it on my todo list.  While reading through the Posix specs
I came up with some questions that you might be able to answer:

 - O_RSYNC basically means we need to commit atime updates before a
   read returns, right?  It would be easy to implement 
   it in a slightly suboptimal fashion, but is there any point?


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:17                           ` Christoph Hellwig
@ 2009-08-28 16:33                               ` Ulrich Drepper
  0 siblings, 0 replies; 144+ messages in thread
From: Ulrich Drepper @ 2009-08-28 16:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 09:17 AM, Christoph Hellwig wrote:
> I'll put it on my todo list.

Any ABI change like this takes a long time to trickle down.

If this is agreed to be the correct approach then adding the O_* 
definitions earlier is better.  Even if it isn't yet implemented.  Then, 
once the kernel side is implemented, programs are ready to use it.  I 
cannot jump the gun and define the flags myself first.


>   - O_RSYNC basically means we need to commit atime updates before a
>     read returns, right?

No, that's not it.

O_RSYNC on its own just means the data is successfully transferred to 
the calling process (always the case).

O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
in a cache and not yet on the medium, then the write to medium is 
successful before the read succeeds.

O_RSYNC|O_SYNC means the same plus the integrity of file meta 
information (access time etc).
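The O_RSYNC|O_DSYNC read rule described above can be modelled in a few lines. This is a toy model with invented names, purely to show the ordering, not real VFS code:

```c
#include <assert.h>

/* Toy model: a read that hits data still sitting dirty in the cache
 * must first get that data onto the medium before returning. */
#define NBLK 4
static char cache[NBLK];
static char medium[NBLK];
static int  dirty[NBLK];

static void cached_write(int blk, char c)
{
    cache[blk] = c;          /* data lands in the cache only */
    dirty[blk] = 1;
}

static char read_rsync_dsync(int blk)
{
    if (dirty[blk]) {        /* range-fdatasync before the read returns */
        medium[blk] = cache[blk];
        dirty[blk] = 0;
    }
    return cache[blk];
}
```

With O_RSYNC|O_SYNC the same model would also have to push file metadata (access time etc.) to the medium before the read returns.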

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:33                               ` Ulrich Drepper
@ 2009-08-28 16:41                               ` Christoph Hellwig
  2009-08-28 20:51                                   ` Ulrich Drepper
  -1 siblings, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 16:41 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel

On Fri, Aug 28, 2009 at 09:33:29AM -0700, Ulrich Drepper wrote:
> On 08/28/2009 09:17 AM, Christoph Hellwig wrote:
>> I'll put it on my todo list.
>
> Any ABI change like this takes a long time to trickle down.
>
> If this is agreed to be the correct approach then adding the O_*  
> definitions earlier is better.  Even if it isn't yet implemented.  Then,  
> once the kernel side is implemented, programs are ready to use it.  I  
> cannot jump the gun and define the flags myself first.

Yeah.  The implementation really is trivial in 2.6.32 - we basically
just need to change one function to check the new O_REALLY_SYNC flag
and pass down a 0 instead of a 1 to another routine in the generic
fs code, plus doing the same in a few filesystems opencoding it instead
of using the generic helpers.

So the logistics of doing the flags really is the biggest work here.
And I'm not entirely sure how to do it correctly.  Can we just switch
the current O_SYNC definition in the kernel headers to O_DSYNC while
adding the new O_SYNC, and have everything continue to work?

>>   - O_RSYNC basically means we need to commit atime updates before a
>>     read returns, right?
>
> No, that's not it.
>
> O_RSYNC on its own just means the data is successfully transferred to  
> the calling process (always the case).
>
> O_RSYNC|O_DSYNC means that if a read request hits data that is currently  
> in a cache and not yet on the medium, then the write to medium is  
> successful before the read succeeds.

That includes a write from another process?  So O_RSYNC basically means
doing a range-fdatasync before the actual read request?

Again, we could implement this easily if we care enough.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                           ` Ulrich Drepper
@ 2009-08-28 16:44                         ` Jamie Lokier
  2009-08-28 16:50                           ` Jamie Lokier
  2009-08-28 21:08                             ` Ulrich Drepper
  2009-08-28 23:06                         ` Jamie Lokier
  2 siblings, 2 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Christoph Hellwig wrote:
> On Thu, Aug 27, 2009 at 10:24:28AM -0700, Ulrich Drepper wrote:
> > The problem with O_* extensions is that the syscall doesn't fail if the  
> > flag is not handled.  This is a problem in the open implementation which  
> > can only be fixed with a new syscall.
> >
> > Why can we not just go on and say we interpret O_SYNC like O_SYNC and  
> > O_SYNC|O_DSYNC like O_DSYNC?  The POSIX spec explicitly requires that  
> > the latter be handled like O_SYNC.
> >
> > We could handle it by allocating two bits, only one is handled in the  
> > kernel.  If the O_DSYNC definition for userlevel would be different from  
> > the kernel definition then the kernel could interpret O_SYNC|O_DSYNC  
> > like O_DSYNC.  The libc would then have to translate the userlevel  
> > O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel  
> > and the application, the userlevel flag would be passed to the kernel  
> > and nothing bad happens.
> 
> What about hte following variant:
> 
>  - given that our current O_SYNC really is, and always has been, POSIX
>    O_DSYNC, keep the numerical value and rename it to O_DSYNC in
>    the headers.
>  - Add a new O_SYNC definition:
> 
> 	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)
> 
>    and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
>    present.

That looks good for the kernel.

However, for userspace, there's an issue with applications which were
compiled with an old libc and used O_SYNC.  Most of them probably
expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
didn't do it right.

But they *didn't know* that.

When using a newer kernel which actually implements O_SYNC behaviour,
I'm thinking those applications which asked for O_SYNC should get it,
even though they're still linked with an old libc.

That's because this thread is the first time I've heard that Linux
O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
many Googlings I've done about O_SYNC in applications and on different
OS, it'll be news to other people too.

(I always thought the "#define O_DSYNC O_SYNC" was because Linux
didn't implement the weaker O_DSYNC).

(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
headers?  That doesn't make sense: O_RSYNC has nothing to do with
writing.)

To achieve that, libc could implement two versions of open() at the
same time as it updates header files.  The new libc's __old_open() would
do:

    /* Only O_DSYNC is set for apps which were compiled
       against an old libc's headers. */
    if (flags & O_DSYNC)
        flags |= O_SYNC;

I'm not exactly sure how symbol versioning works, but perhaps the
header file in the new libc would need __REDIRECT_NTH to map open() to
__new_open(), which just calls the kernel.  This is to ensure .o and
.a files built with an old libc's headers but then linked to a new
libc will get __old_open().

Although libc's __new_open() could have this:

    /* Old kernels only look at O_DSYNC.  It's better than nothing. */
    if (flags & O_SYNC)
        flags |= O_DSYNC;

Imho, it's better to not do that, and instead have

    #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

as Chris suggests, in the libc header the same as the kernel header,
because that way applications which use the syscall() function or have
to invoke a syscall directly (I've seen clone-using code doing it),
won't spontaneously start losing their O_SYNCness on older kernels.
Unless there is some reason why "flags &= ~O_SYNC" is not permitted to
clear the O_DSYNC flag, or other reason why they must be separate flags.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:33                               ` Ulrich Drepper
@ 2009-08-28 16:46                               ` Jamie Lokier
  2009-08-29  0:59                                 ` Jamie Lokier
  -1 siblings, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:46 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> >  - O_RSYNC basically means we need to commit atime updates before a
> >    read returns, right?
> 
> No, that's not it.
> 
> O_RSYNC on its own just means the data is successfully transferred to 
> the calling process (always the case).
> 
> O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
> in a cache and not yet on the medium, then the write to medium is 
> successful before the read succeeds.
> 
> O_RSYNC|O_SYNC means the same plus the integrity of file meta 
> information (access time etc).

On several unixes, O_RSYNC means it will send the read to the
hardware, not relying on the cache.  This can be used to verify the
data which was written earlier, whether by O_DSYNC or fdatasync.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 16:50                           ` Jamie Lokier
  2009-08-28 21:08                             ` Ulrich Drepper
  1 sibling, 0 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-28 16:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Jamie Lokier wrote:
> That's because this thread is the first time I've heard that Linux
> O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
> many Googlings I've done about O_SYNC in applications and on different
> OS, it'll be news to other people too.
> 
> (I always thought the "#define O_DSYNC O_SYNC" was because Linux
> didn't implement the weaker O_DSYNC).

It looks like we're not the only ones.  AIX has:

http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm

    Before the O_DSYNC open mode existed, AIX applied O_DSYNC semantics to
    O_SYNC. For binary compatibility reasons, this behavior still
    exists. If true O_SYNC behavior is required, then both O_DSYNC and
    O_SYNC open flags must be specified. Exporting the XPG_SUS_ENV=ON
    environment variable also enables true O_SYNC behavior.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:41                               ` Christoph Hellwig
@ 2009-08-28 20:51                                   ` Ulrich Drepper
  0 siblings, 0 replies; 144+ messages in thread
From: Ulrich Drepper @ 2009-08-28 20:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jamie Lokier, linux-fsdevel, linux-kernel

On 08/28/2009 09:41 AM, Christoph Hellwig wrote:
> Yeah.  The implementation really is trivial in 2.6.32 - we basically
> just need to change one function to check the new O_REALLY_SYNC flag
> and pass down a 0 instead of a 1 to another routine in the generic
> fs code, plus doing the same in a few filesystems opencoding it instead
> of using the generic helpers.

I don't think you have to change anything.  As I wrote before, the 
kernel ignores unknown O_* flags.  It's usually a problem.  Here it is a 
positive thing.


> So the logistics of doing the flags really is the biggest work here.
> And I'm not entirely sure how to do it correctly.  Can we just switch
> the current O_SYNC definition in the kernel headers to O_DSYNC while
> adding the new O_SYNC and everything will continue to work?

No, that's not a good idea.  This would mean a program compiled with
newer headers is using the new O_SYNC, which isn't known to old kernels
and is ignored.  Such programs will then not even get the current
O_DSYNC benefits.


> That includes a write from another process?  So O_RSYNC basically means
> doing a range-fdatasync before the actual read request?

Yes.  You can easily see how this can be useful.


> Again, we could implement this easily if we care enough.

I think it can be useful at times.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 20:51                                   ` Ulrich Drepper
  (?)
@ 2009-08-28 21:08                                   ` Christoph Hellwig
  2009-08-28 21:16                                     ` Trond Myklebust
  2009-08-30 16:44                                     ` Jamie Lokier
  -1 siblings, 2 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 21:08 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Jamie Lokier, linux-fsdevel, linux-kernel, torvalds

On Fri, Aug 28, 2009 at 01:51:03PM -0700, Ulrich Drepper wrote:
> No, that's not a good idea.  This would mean a program compiled with  
> newer headers is using O_SYNC which isn't known to old kernels and  
> ignored.  Such programs will then not even get the current O_DSYNC 
> benefits.

Ok, let's agree on how to proceed:


once 2.6.31 is out we will do the following

 - do a global s/O_SYNC/O_DSYNC/g over the whole kernel tree
 - add this to include/asm-generic/fcntl.h and, in modified form,
   to arch headers not using it:

#ifndef O_FULLSYNC
#define O_FULLSYNC	02000000
#endif

#ifndef O_RSYNC
#define O_RSYNC		04000000
#endif

#define O_SYNC	(O_FULLSYNC|O_DSYNC)

 - during the normal merge window I will add a real implementation
   for O_FULLSYNC and O_RSYNC

P.S. better naming suggestions for O_FULLSYNC welcome

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 21:08                             ` Ulrich Drepper
  2009-08-28 21:08                             ` Ulrich Drepper
  1 sibling, 0 replies; 144+ messages in thread
From: Ulrich Drepper @ 2009-08-28 21:08 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> However, for userspace, there's an issue with applications which were
> compiled with an old libc and used O_SYNC.  Most of them probably
> expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
> didn't do it right.

Right.  But these programs apparently can live with the broken 
semantics.  I don't worry too much about this.  If people really need 
the fixed O_SYNC semantics then let them recompile their code.


> When using a newer kernel which actually implements O_SYNC behaviour,
> I'm thinking those applications which asked for O_SYNC should get it,
> even though they're still linked with an old libc.

In general yes, but it's too expensive.  Again, existing programs expect 
the current behavior and can live with it.


> (Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
> headers?  That doesn't make sense: O_RSYNC has nothing to do with
> writing.)

O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
that's the next best thing.  Of course I didn't know the Linux O_SYNC is 
really O_DSYNC.  In that context the definition doesn't make sense.


> Although libc's __new_open() could have this:
>
>      /* Old kernels only look at O_DSYNC.  It's better than nothing. */
>      if (flags & O_SYNC)
>          flags |= O_DSYNC;
>
> Imho, it's better to not do that, and instead have
>
>      #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

Why should it be better?  You're replacing something the compiler can do 
with zero cost with active code.


Again, these O_* constant changes are sufficient.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                                   ` Christoph Hellwig
@ 2009-08-28 21:16                                     ` Trond Myklebust
  2009-08-28 21:29                                       ` Christoph Hellwig
  2009-08-30 16:44                                     ` Jamie Lokier
  1 sibling, 1 reply; 144+ messages in thread
From: Trond Myklebust @ 2009-08-28 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ulrich Drepper, Jamie Lokier, linux-fsdevel, linux-kernel, torvalds

On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> 
>  - during the normal merge window I will add a real implementation for
>    for O_FULLSYNC and O_RSYNC
> 
> P.S. better naming suggestions for O_FULLSYNC welcome

Basically you are just ensuring that the metadata changes are being
synced together with the data changes, so how about O_ISYNC (inode
sync)?



^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:16                                     ` Trond Myklebust
@ 2009-08-28 21:29                                       ` Christoph Hellwig
  2009-08-28 21:43                                         ` Trond Myklebust
  0 siblings, 1 reply; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 21:29 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Ulrich Drepper, Jamie Lokier, linux-fsdevel,
	linux-kernel, torvalds

On Fri, Aug 28, 2009 at 05:16:14PM -0400, Trond Myklebust wrote:
> On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> > #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> > 
>  - during the normal merge window I will add a real implementation
>    for O_FULLSYNC and O_RSYNC
> > 
> > P.S. better naming suggestions for O_FULLSYNC welcome
> 
> Basically you are just ensuring that the metadata changes are being
> synced together with the data changes, so how about O_ISYNC (inode
> sync)?

Yeah.  Thinking about this a bit more we should define this flag
much more clearly.  In the obvious implementation it would not actually
do anything if it's set on its own.  We would only check it if O_DSYNC
is already set, to decide whether we want to set the datasync argument to
->fsync to 0 or 1 for the generic filesystems (and similar things for
filesystems not using the generic helper).

If we deem that this is too unsafe we could make sure O_DSYNC always
gets set whenever this flag is set in ->open, but if we make sure O_SYNC is defined
like the one above in the kernel headers and glibc we should be fine.

Although in that case a name that doesn't suggest that it actually does
something useful would be better.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:29                                       ` Christoph Hellwig
@ 2009-08-28 21:43                                         ` Trond Myklebust
  2009-08-28 22:39                                           ` Christoph Hellwig
  0 siblings, 1 reply; 144+ messages in thread
From: Trond Myklebust @ 2009-08-28 21:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ulrich Drepper, Jamie Lokier, linux-fsdevel, linux-kernel, torvalds

On Fri, 2009-08-28 at 17:29 -0400, Christoph Hellwig wrote:
> On Fri, Aug 28, 2009 at 05:16:14PM -0400, Trond Myklebust wrote:
> > On Fri, 2009-08-28 at 17:08 -0400, Christoph Hellwig wrote:
> > > #define O_SYNC	(O_FULLSYNC|O_DSYNC)
> > > 
> > >  - during the normal merge window I will add a real implementation
> > >    for O_FULLSYNC and O_RSYNC
> > > 
> > > P.S. better naming suggestions for O_FULLSYNC welcome
> > 
> > Basically you are just ensuring that the metadata changes are being
> > synced together with the data changes, so how about O_ISYNC (inode
> > sync)?
> 
> Yeah.  Thinking about this a bit more we should define this flag
> much more clearly.  In the obvious implementation it would not actually
> do anything if it's set on it's own.  We would only check it if O_DSYNC
> is already set to decided if we want to set the datasync argument to
> ->fsync to 0 or 1 for the generic filesystems (and similar things for
> filesystems not using the generic helper).
> 
> If we deem that this is too unsafe we could make sure O_DSYNC always
> gets set whenever this flag is set in ->open, but if we make sure O_SYNC is defined
> like the one above in the kernel headers and glibc we should be fine.
> 
> Although in that case a name that doesn't suggest that it actually does
> something useful would be better.

If you are going to automatically set O_DSYNC in open(), then
fcntl(F_SETFL) might get a bit nasty.

Imagine using it after the open in order to clear the O_ISYNC flag;
you'll still be left with the O_DSYNC (which you never set in the first
place). That would be confusing...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:43                                         ` Trond Myklebust
@ 2009-08-28 22:39                                           ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 22:39 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Ulrich Drepper, Jamie Lokier, linux-fsdevel,
	linux-kernel, torvalds

On Fri, Aug 28, 2009 at 05:43:05PM -0400, Trond Myklebust wrote:
> If you are going to automatically set O_DSYNC in open(), then
> fcntl(F_SETFL) might get a bit nasty.
> 
> Imagine using it after the open in order to clear the O_ISYNC flag;
> you'll still be left with the O_DSYNC (which you never set in the first
> place). That would be confusing...

Indeed, that's a killer argument for the first variant.  We just need
to make it extremely clear (manpage _and_ comments) that only O_SYNC is
an exposed user interface and that O_WHATEVER_SYNC is an implementation
detail.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 15:46                       ` Christoph Hellwig
  2009-08-28 16:06                           ` Ulrich Drepper
  2009-08-28 16:44                         ` Jamie Lokier
@ 2009-08-28 23:06                         ` Jamie Lokier
  2009-08-28 23:46                           ` Christoph Hellwig
  2 siblings, 1 reply; 144+ messages in thread
From: Jamie Lokier @ 2009-08-28 23:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel

Christoph Hellwig wrote:
>  - given that our current O_SYNC really is and always has been actually
>    POSIX O_DSYNC

Are you sure about this?

>From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :

    Error description

       LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

    Problem summary

       On AIX, the O_SYNC and O_DSYNC are different values and
       performance improvement are available because the inode does
       not need to be flushed for mtime changes only.
       On Linux the flags are the same, so performance is lost.
       when databases open files with O_DIRECT and O_SYNC.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 23:06                         ` Jamie Lokier
@ 2009-08-28 23:46                           ` Christoph Hellwig
  0 siblings, 0 replies; 144+ messages in thread
From: Christoph Hellwig @ 2009-08-28 23:46 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Ulrich Drepper, linux-fsdevel, linux-kernel

On Sat, Aug 29, 2009 at 12:06:23AM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> >  - given that our current O_SYNC really is and always has been actually
> >    POSIX O_DSYNC
> 
> Are you sure about this?
> 
> >From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :
> 
>     Error description
> 
>        LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

That is for GPFS, an out-of-tree filesystem with binary components.
It could be that they took Linux O_SYNC for real O_SYNC.  Any filesystem
using the generic helpers in Linux has gotten the O_DSYNC semantics at
least as long as I have worked on Linux filesystems, which is getting
close to 10 years now.  I'll do some code archaeology before we move
ahead with this, to be sure.


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 16:46                               ` Jamie Lokier
@ 2009-08-29  0:59                                 ` Jamie Lokier
  0 siblings, 0 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-29  0:59 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Jamie Lokier wrote:
> Ulrich Drepper wrote:
> > >  - O_RSYNC basically means we need to commit atime updates before a
> > >    read returns, right?
> > 
> > No, that's not it.
> > 
> > O_RSYNC on its own just means the data is successfully transferred to 
> > the calling process (always the case).
> > 
> > O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
> > in a cache and not yet on the medium, then the write to medium is 
> > successful before the read succeeds.
> > 
> > O_RSYNC|O_SYNC means the same plus the integrity of file meta 
> > information (access time etc).
> 
> On several unixes, O_RSYNC means it will send the read to the
> hardware, not relying on the cache.  This can be used to verify the
> data which was written earlier, whether by O_DSYNC or fdatasync.

I'm sure I read that in a couple of OS man pages, but I can't find it
again.  Maybe it was something more obscure than the mainstream
unices; maybe I imagined it.  Ho hum.  For now, forget I said anything.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                                   ` Christoph Hellwig
  2009-08-28 21:16                                     ` Trond Myklebust
@ 2009-08-30 16:44                                     ` Jamie Lokier
  1 sibling, 0 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-30 16:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ulrich Drepper, linux-fsdevel, linux-kernel, torvalds

Christoph Hellwig wrote:
> P.S. better naming suggestions for O_FULLSYNC welcome

O_FULLSYNC might get confused with Mac OS X's F_FULLFSYNC, which means
something else: an fsync that flushes through volatile hardware write caches.

(Might we even want to provide O_FULLSYNC and O_FULLDATASYNC to mean
that, eventually?)

O_ISYNC is a bit misleading if we don't really offer "flush just the
inode state" by itself.

So it should at least start with underscores: __O_ISYNC.

How about __O_SYNC_NEW with

    #define O_SYNC     (O_DSYNC|__O_SYNC_NEW)

I think that tells people reading the headers a bit about what to
expect on older kernels too.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                             ` Ulrich Drepper
  (?)
@ 2009-08-30 16:58                             ` Jamie Lokier
  -1 siblings, 0 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-30 16:58 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> >(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
> >headers?  That doesn't make sense: O_RSYNC has nothing to do with
> >writing.)
> 
> O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
> that's the next best thing.

That's an error - O_SYNC is not a superset of O_RSYNC.

O_SYNC (by itself) only affects writes.

O_RSYNC only affects reads.

In the absence of O_RSYNC support in the kernel, it's better to not
define O_RSYNC at all in userspace.  That tells applications they can
call fsync/fdatasync themselves before reading to get an equivalent
effect.

In fact O_RSYNC, when implemented correctly, can be used by
applications to get the effect of range-fsync/fdatasync when such
system calls aren't available (by reading a range), but not as
efficiently of course.  Defining O_RSYNC as O_SYNC fails to do that.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers
  2009-08-28 21:08                             ` Ulrich Drepper
  (?)
  (?)
@ 2009-08-30 17:48                             ` Jamie Lokier
  -1 siblings, 0 replies; 144+ messages in thread
From: Jamie Lokier @ 2009-08-30 17:48 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel

Ulrich Drepper wrote:
> On 08/28/2009 09:44 AM, Jamie Lokier wrote:
> >Although libc's __new_open() could have this:
> >
> >     /* Old kernels only look at O_DSYNC.  It's better than nothing. */
>     if (flags & O_SYNC)
> >         flags |= O_DSYNC;
> >
> >Imho, it's better to not do that, and instead have
> >
> >     #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)
> 
> Why should it be better?  You're replacing something the compiler can do 
> with zero cost with active code.

You misread; I said the zero cost thing is better.

The only reason you might use the active code is this:

    /* Upgrade O_DSYNC to O_SYNC. */

    flags = fcntl(fd, F_GETFL, 0);
    flags = (flags | O_SYNC) & ~O_DSYNC;
    fcntl(fd, F_SETFL, flags);

I'm not sure if that should work in POSIX.

-- Jamie

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH 05/17] ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  2009-08-21 17:23   ` Jan Kara
@ 2009-08-25 18:58     ` Joel Becker
  -1 siblings, 0 replies; 144+ messages in thread
From: Joel Becker @ 2009-08-25 18:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, hch, linux-fsdevel, ocfs2-devel

On Fri, Aug 21, 2009 at 07:23:55PM +0200, Jan Kara wrote:
> Use the new helper. We have to submit data pages ourselves in case of O_SYNC
> write because __generic_file_aio_write does not do it for us. OCFS2 developers
> might think about moving the sync out of i_mutex which seems to be easily
> possible but that's out of scope of this patch.
> 
> CC: ocfs2-devel@oss.oracle.com
> CC: Joel Becker <joel.becker@oracle.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Joel Becker <joel.becker@oracle.com>
 

-- 

"The real reason GNU ls is 8-bit-clean is so that they can
 start using ISO-8859-1 option characters."
	- Christopher Davis (ckd@loiosh.kei.com)

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 144+ messages in thread

* [PATCH 05/17] ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
  2009-08-21 17:23 [PATCH 0/17] Make O_SYNC handling use standard syncing path (Version 2) Jan Kara
@ 2009-08-21 17:23   ` Jan Kara
  0 siblings, 0 replies; 144+ messages in thread
From: Jan Kara @ 2009-08-21 17:23 UTC (permalink / raw)
  To: LKML; +Cc: hch, linux-fsdevel, Jan Kara, ocfs2-devel, Joel Becker

Use the new helper. We have to submit data pages ourselves in case of O_SYNC
write because __generic_file_aio_write does not do it for us. OCFS2 developers
might think about moving the sync out of i_mutex which seems to be easily
possible but that's out of scope of this patch.

CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c |   22 ++++++++++++----------
 1 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 62442e4..1c71f0a 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1870,8 +1870,7 @@ relock:
 			goto out_dio;
 		}
 	} else {
-		written = generic_file_aio_write_nolock(iocb, iov, nr_segs,
-							*ppos);
+		written = __generic_file_aio_write(iocb, iov, nr_segs, ppos);
 	}
 
 out_dio:
@@ -1879,18 +1878,21 @@ out_dio:
 	BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));
 
 	if ((file->f_flags & O_SYNC && !direct_io) || IS_SYNC(inode)) {
-		/*
-		 * The generic write paths have handled getting data
-		 * to disk, but since we don't make use of the dirty
-		 * inode list, a manual journal commit is necessary
-		 * here.
-		 */
-		if (old_size != i_size_read(inode) ||
-		    old_clusters != OCFS2_I(inode)->ip_clusters) {
+		ret = filemap_fdatawrite_range(file->f_mapping, pos,
+					       pos + count - 1);
+		if (ret < 0)
+			written = ret;
+
+		if (!ret && (old_size != i_size_read(inode) ||
+		    old_clusters != OCFS2_I(inode)->ip_clusters)) {
 			ret = jbd2_journal_force_commit(osb->journal->j_journal);
 			if (ret < 0)
 				written = ret;
 		}
+
+		if (!ret)
+			ret = filemap_fdatawait_range(file->f_mapping, pos,
+						      pos + count - 1);
 	}
 
 	/* 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 144+ messages in thread

end of thread, other threads:[~2009-08-30 17:49 UTC | newest]

Thread overview: 144+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-19 16:04 [PATCH 0/17] Make O_SYNC handling use standard syncing path Jan Kara
2009-08-19 16:04 ` [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-08-19 16:10   ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 02/17] vfs: Export __generic_file_aio_write() and add some comments Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:11   ` Christoph Hellwig
2009-08-19 16:11     ` [Ocfs2-devel] " Christoph Hellwig
2009-08-20 12:04     ` Jan Kara
2009-08-20 12:04       ` [Ocfs2-devel] " Jan Kara
2009-08-19 20:22   ` Evgeniy Polyakov
2009-08-19 20:22     ` [Ocfs2-devel] " Evgeniy Polyakov
2009-08-20 12:31     ` Jan Kara
2009-08-20 12:31       ` [Ocfs2-devel] " Jan Kara
2009-08-20 13:30       ` Evgeniy Polyakov
2009-08-20 13:30         ` [Ocfs2-devel] " Evgeniy Polyakov
2009-08-20 13:52         ` Jan Kara
2009-08-20 13:52           ` [Ocfs2-devel] " Jan Kara
2009-08-20 13:58           ` Evgeniy Polyakov
2009-08-20 13:58             ` [Ocfs2-devel] " Evgeniy Polyakov
2009-08-19 16:04 ` [PATCH 03/17] vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:18   ` Christoph Hellwig
2009-08-19 16:18     ` [Ocfs2-devel] " Christoph Hellwig
2009-08-19 16:18     ` Christoph Hellwig
2009-08-20 13:31     ` Jan Kara
2009-08-20 13:31       ` [Ocfs2-devel] " Jan Kara
2009-08-20 13:31       ` Jan Kara
2009-08-19 16:04 ` [PATCH 04/17] pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-08-19 16:04 ` [PATCH 05/17] ocfs2: " Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:04 ` [PATCH 06/17] vfs: Remove sync_page_range_nolock Jan Kara
2009-08-19 16:21   ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 07/17] vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:26   ` Christoph Hellwig
2009-08-19 16:26     ` [Ocfs2-devel] " Christoph Hellwig
2009-08-19 16:26     ` Christoph Hellwig
2009-08-20 12:15     ` Jan Kara
2009-08-20 12:15       ` [Ocfs2-devel] " Jan Kara
2009-08-20 12:15       ` Jan Kara
2009-08-20 16:27       ` Christoph Hellwig
2009-08-20 16:27         ` [Ocfs2-devel] " Christoph Hellwig
2009-08-20 16:27         ` Christoph Hellwig
2009-08-21 15:23         ` Jan Kara
2009-08-21 15:23           ` [Ocfs2-devel] " Jan Kara
2009-08-21 15:23           ` Jan Kara
2009-08-21 15:32           ` Christoph Hellwig
2009-08-21 15:32             ` [Ocfs2-devel] " Christoph Hellwig
2009-08-21 15:32             ` Christoph Hellwig
2009-08-21 15:48             ` Jan Kara
2009-08-21 15:48               ` [Ocfs2-devel] " Jan Kara
2009-08-21 15:48               ` Jan Kara
2009-08-26 18:22         ` Christoph Hellwig
2009-08-26 18:22           ` [Ocfs2-devel] " Christoph Hellwig
2009-08-26 18:22           ` Christoph Hellwig
2009-08-27  0:04           ` Christoph Hellwig
2009-08-27  0:04             ` [Ocfs2-devel] " Christoph Hellwig
2009-08-27  0:04             ` Christoph Hellwig
2009-08-19 16:04 ` [PATCH 08/17] ext2: Update comment about generic_osync_inode Jan Kara
2009-08-19 16:04 ` [PATCH 09/17] ext3: Remove syncing logic from ext3_file_write Jan Kara
2009-08-19 16:04 ` [PATCH 10/17] ext4: Remove syncing logic from ext4_file_write Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:04 ` [PATCH 11/17] fat: Opencode sync_page_range_nolock() Jan Kara
2009-08-19 16:04 ` [PATCH 12/17] ntfs: Use new syncing helpers and update comments Jan Kara
2009-08-19 16:04 ` [PATCH 13/17] ocfs2: Update syncing after splicing to match generic version Jan Kara
2009-08-19 16:04   ` [Ocfs2-devel] " Jan Kara
2009-08-21  1:36   ` Joel Becker
2009-08-21  1:36     ` Joel Becker
2009-08-21 14:30     ` Jan Kara
2009-08-21 14:30       ` Jan Kara
2009-08-19 16:04 ` [PATCH 14/17] xfs: Use new syncing helper Jan Kara
2009-08-19 16:04   ` Jan Kara
2009-08-19 16:33   ` Christoph Hellwig
2009-08-19 16:33     ` Christoph Hellwig
2009-08-20 12:22     ` Jan Kara
2009-08-20 12:22       ` Jan Kara
2009-08-19 16:04 ` [PATCH 15/17] pohmelfs: " Jan Kara
2009-08-19 16:04 ` [PATCH 16/17] nfs: Remove reference to generic_osync_inode from a comment Jan Kara
2009-08-19 16:04 ` [PATCH 17/17] vfs: Remove generic_osync_inode() and sync_page_range() Jan Kara
2009-08-20 22:12 ` O_DIRECT and barriers Christoph Hellwig
2009-08-21 11:40   ` Jens Axboe
2009-08-21 13:54     ` Jamie Lokier
2009-08-21 14:26       ` Christoph Hellwig
2009-08-21 15:24         ` Jamie Lokier
2009-08-21 17:45           ` Christoph Hellwig
2009-08-21 19:18             ` Ric Wheeler
2009-08-22  0:50             ` Jamie Lokier
2009-08-22  2:19               ` Theodore Tso
2009-08-22  2:31                 ` Theodore Tso
2009-08-24  2:34               ` Christoph Hellwig
2009-08-27 14:34                 ` Jamie Lokier
2009-08-27 17:10                   ` adding proper O_SYNC/O_DSYNC, was " Christoph Hellwig
2009-08-27 17:24                     ` Ulrich Drepper
2009-08-27 17:24                       ` Ulrich Drepper
2009-08-28 15:46                       ` Christoph Hellwig
2009-08-28 16:06                         ` Ulrich Drepper
2009-08-28 16:06                           ` Ulrich Drepper
2009-08-28 16:17                           ` Christoph Hellwig
2009-08-28 16:33                             ` Ulrich Drepper
2009-08-28 16:33                               ` Ulrich Drepper
2009-08-28 16:41                               ` Christoph Hellwig
2009-08-28 20:51                                 ` Ulrich Drepper
2009-08-28 20:51                                   ` Ulrich Drepper
2009-08-28 21:08                                   ` Christoph Hellwig
2009-08-28 21:16                                     ` Trond Myklebust
2009-08-28 21:29                                       ` Christoph Hellwig
2009-08-28 21:43                                         ` Trond Myklebust
2009-08-28 22:39                                           ` Christoph Hellwig
2009-08-30 16:44                                     ` Jamie Lokier
2009-08-28 16:46                               ` Jamie Lokier
2009-08-29  0:59                                 ` Jamie Lokier
2009-08-28 16:44                         ` Jamie Lokier
2009-08-28 16:50                           ` Jamie Lokier
2009-08-28 21:08                           ` Ulrich Drepper
2009-08-28 21:08                             ` Ulrich Drepper
2009-08-30 16:58                             ` Jamie Lokier
2009-08-30 17:48                             ` Jamie Lokier
2009-08-28 23:06                         ` Jamie Lokier
2009-08-28 23:46                           ` Christoph Hellwig
2009-08-21 22:08         ` Theodore Tso
2009-08-21 22:38           ` Joel Becker
2009-08-21 22:45           ` Joel Becker
2009-08-22  2:11             ` Theodore Tso
2009-08-24  2:42               ` Christoph Hellwig
2009-08-24  2:37             ` Christoph Hellwig
2009-08-24  2:37             ` Christoph Hellwig
2009-08-21 22:45           ` Joel Becker
2009-08-22  0:56           ` Jamie Lokier
2009-08-22  2:06             ` Theodore Tso
2009-08-26  6:34           ` Dave Chinner
2009-08-26  6:34           ` Dave Chinner
2009-08-26 15:01             ` Jamie Lokier
2009-08-26 18:47               ` Theodore Tso
2009-08-27 14:50                 ` Jamie Lokier
2009-08-21 14:20     ` Christoph Hellwig
2009-08-21 15:06       ` James Bottomley
2009-08-21 15:23         ` Christoph Hellwig
2009-08-21 16:59 [PATCH 01/17] vfs: Introduce filemap_fdatawait_range Jan Kara
2009-08-21 16:59 ` [PATCH 05/17] ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-08-21 17:23 [PATCH 0/17] Make O_SYNC handling use standard syncing path (Version 2) Jan Kara
2009-08-21 17:23 ` [PATCH 05/17] ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock Jan Kara
2009-08-21 17:23   ` Jan Kara
2009-08-25 18:58   ` Joel Becker
2009-08-25 18:58     ` Joel Becker