All of lore.kernel.org
 help / color / mirror / Atom feed
* ext2 + -osync: not as easy as it seems
@ 2009-01-13 13:14 Pavel Machek
  2009-01-13 13:45 ` Alan Cox
  0 siblings, 1 reply; 74+ messages in thread
From: Pavel Machek @ 2009-01-13 13:14 UTC (permalink / raw)
  To: kernel list, jack, tytso

Hi!

I tried to use ext2+ -osync, and was surprised that my data do not hit
the (SATA) disk. Fortunately I soon realized that I need hdparm -W0
/dev/sda.

Could the ext2 be teached to use barriers? -W0 hurts performance more
than neccessary...

Plus it has a nasty behaviour where it reverts to -W1 if disk
connection is momentarily lost. (If you unplug/replug SATA disk, linux
will happily rediscover and use it, but -W0 was already forgotten at
that point, right?)
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 13:14 ext2 + -osync: not as easy as it seems Pavel Machek
@ 2009-01-13 13:45 ` Alan Cox
  2009-01-13 14:03   ` Theodore Tso
  2009-01-13 14:42   ` ext2 + -osync: not as easy as it seems Pavel Machek
  0 siblings, 2 replies; 74+ messages in thread
From: Alan Cox @ 2009-01-13 13:45 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, jack, tytso

> Plus it has a nasty behaviour where it reverts to -W1 if disk
> connection is momentarily lost. (If you unplug/replug SATA disk, linux
> will happily rediscover and use it, but -W0 was already forgotten at
> that point, right?)

If you momentarily lose power your disk state isn't defined anyway. If
you do that with the SATA code it should treat it the same as a scsi disk
swap so it'll get a new device and the old fs will go down.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 13:45 ` Alan Cox
@ 2009-01-13 14:03   ` Theodore Tso
  2009-01-13 14:07     ` Jens Axboe
  2009-01-13 14:30     ` ext2 + -osync: not as easy as it seems Jan Kara
  2009-01-13 14:42   ` ext2 + -osync: not as easy as it seems Pavel Machek
  1 sibling, 2 replies; 74+ messages in thread
From: Theodore Tso @ 2009-01-13 14:03 UTC (permalink / raw)
  To: Alan Cox; +Cc: Pavel Machek, kernel list, jack, Jens Axboe

Adding a barrier shouldn't be that hard; just a matter adding a call
to blkdev_issue_flush() to ext2_sync_file() before it returns.

BTW, I think there's a stale documentation bug in in
block/blk-barrier.c, around line 305:

  * Description:
  *    Issue a flush for the block device in question. Caller can supply
  *    room for storing the error offset in case of a flush error, if they
- *    wish to.  Caller must run wait_for_completion() on its own.
+ *    wish to.
  */
 int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
 {

Jens, is that right?

						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 14:03   ` Theodore Tso
@ 2009-01-13 14:07     ` Jens Axboe
  2009-01-13 14:26       ` [PATCH] block: Fix documentation for blkdev_issue_flush() Theodore Ts'o
  2009-01-13 14:30     ` ext2 + -osync: not as easy as it seems Jan Kara
  1 sibling, 1 reply; 74+ messages in thread
From: Jens Axboe @ 2009-01-13 14:07 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Alan Cox, Pavel Machek, kernel list, jack

On Tue, Jan 13 2009, Theodore Tso wrote:
> Adding a barrier shouldn't be that hard; just a matter adding a call
> to blkdev_issue_flush() to ext2_sync_file() before it returns.
> 
> BTW, I think there's a stale documentation bug in in
> block/blk-barrier.c, around line 305:
> 
>   * Description:
>   *    Issue a flush for the block device in question. Caller can supply
>   *    room for storing the error offset in case of a flush error, if they
> - *    wish to.  Caller must run wait_for_completion() on its own.
> + *    wish to.
>   */
>  int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
>  {
> 
> Jens, is that right?

Yep, that is indeed stale!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH] block: Fix documentation for blkdev_issue_flush()
  2009-01-13 14:07     ` Jens Axboe
@ 2009-01-13 14:26       ` Theodore Ts'o
  2009-01-13 14:28         ` Jens Axboe
  0 siblings, 1 reply; 74+ messages in thread
From: Theodore Ts'o @ 2009-01-13 14:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kernel list, Theodore Ts'o

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 block/blk-barrier.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index 8eba4e4..f7dae57 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -302,7 +302,7 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
  * Description:
  *    Issue a flush for the block device in question. Caller can supply
  *    room for storing the error offset in case of a flush error, if they
- *    wish to.  Caller must run wait_for_completion() on its own.
+ *    wish to.
  */
 int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
 {
-- 
1.6.0.4.8.g36f27.dirty


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH] block: Fix documentation for blkdev_issue_flush()
  2009-01-13 14:26       ` [PATCH] block: Fix documentation for blkdev_issue_flush() Theodore Ts'o
@ 2009-01-13 14:28         ` Jens Axboe
  0 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2009-01-13 14:28 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: kernel list

On Tue, Jan 13 2009, Theodore Ts'o wrote:
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  block/blk-barrier.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index 8eba4e4..f7dae57 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -302,7 +302,7 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
>   * Description:
>   *    Issue a flush for the block device in question. Caller can supply
>   *    room for storing the error offset in case of a flush error, if they
> - *    wish to.  Caller must run wait_for_completion() on its own.
> + *    wish to.
>   */
>  int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
>  {

Thanks Ted, applied.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 14:03   ` Theodore Tso
  2009-01-13 14:07     ` Jens Axboe
@ 2009-01-13 14:30     ` Jan Kara
  2009-01-13 14:46       ` Theodore Tso
  2009-01-14  3:37       ` Fernando Luis Vázquez Cao
  1 sibling, 2 replies; 74+ messages in thread
From: Jan Kara @ 2009-01-13 14:30 UTC (permalink / raw)
  To: Theodore Tso, Alan Cox, Pavel Machek, kernel list, jack, Jens Axboe

On Tue 13-01-09 09:03:47, Theodore Tso wrote:
> Adding a barrier shouldn't be that hard; just a matter adding a call
> to blkdev_issue_flush() to ext2_sync_file() before it returns.
  Yes. Something like the patch below?

  But it's not the whole story. Strictly speaking we should also call
blkdev_issue_flush() whenever we write things because of O_SYNC or
O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
patch I've sent a while before).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

---
>From 0f5a1865a9f1538886699c5d13d3527343c88c11 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 13 Jan 2009 15:09:18 +0100
Subject: [PATCH] ext2: Add blk_issue_flush() to syncing paths

To be really safe that the data hit the platter, we should also flush drive's
writeback caches on fsync and for O_SYNC files or O_DIRSYNC inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/dir.c   |    5 ++++-
 fs/ext2/fsync.c |    7 +++++--
 fs/ext2/inode.c |    7 +++++++
 fs/ext2/xattr.c |    6 +++++-
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7fba549..9ab9347 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -25,6 +25,7 @@
 #include <linux/buffer_head.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
+#include <linux/blkdev.h>		/* for blkdev_issue_flush() */
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
@@ -96,8 +97,10 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 	}
 	unlock_page(page);
 
-	if (IS_DIRSYNC(dir))
+	if (IS_DIRSYNC(dir)) {
 		err = ext2_sync_inode(dir);
+		blkdev_issue_flush(dir->i_sb->s_bdev, NULL);
+	}
 
 	return err;
 }
diff --git a/fs/ext2/fsync.c b/fs/ext2/fsync.c
index fc66c93..9cd1838 100644
--- a/fs/ext2/fsync.c
+++ b/fs/ext2/fsync.c
@@ -24,6 +24,7 @@
 
 #include "ext2.h"
 #include <linux/buffer_head.h>		/* for sync_mapping_buffers() */
+#include <linux/blkdev.h>		/* for blkdev_issue_flush() */
 
 
 /*
@@ -39,12 +40,14 @@ int ext2_sync_file(struct file *file, struct dentry *dentry, int datasync)
 
 	ret = sync_mapping_buffers(inode->i_mapping);
 	if (!(inode->i_state & I_DIRTY))
-		return ret;
+		goto out;
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		return ret;
+		goto out;
 
 	err = ext2_sync_inode(inode);
 	if (ret == 0)
 		ret = err;
+out:
+	blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 23fff2f..49b479e 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -33,6 +33,7 @@
 #include <linux/mpage.h>
 #include <linux/fiemap.h>
 #include <linux/namei.h>
+#include <linux/blkdev.h>	/* for blkdev_issue_flush() */
 #include "ext2.h"
 #include "acl.h"
 #include "xip.h"
@@ -68,9 +69,14 @@ void ext2_delete_inode (struct inode * inode)
 	mark_inode_dirty(inode);
 	ext2_update_inode(inode, inode_needs_sync(inode));
 
+	/* Make sure inode deletion really gets to disk. Disk write caches
+	 * are flushed either in ext2_truncate() or we do it explicitly */
 	inode->i_size = 0;
 	if (inode->i_blocks)
 		ext2_truncate (inode);
+	else if (inode_needs_sync(inode))
+		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+
 	ext2_free_inode (inode);
 
 	return;
@@ -1104,6 +1110,7 @@ do_indirects:
 	if (inode_needs_sync(inode)) {
 		sync_mapping_buffers(inode->i_mapping);
 		ext2_sync_inode (inode);
+		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	} else {
 		mark_inode_dirty(inode);
 	}
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 987a526..d480216 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -60,6 +60,7 @@
 #include <linux/mbcache.h>
 #include <linux/quotaops.h>
 #include <linux/rwsem.h>
+#include <linux/blkdev.h>		/* for blkdev_issue_flush() */
 #include "ext2.h"
 #include "xattr.h"
 #include "acl.h"
@@ -702,6 +703,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
 				DQUOT_FREE_BLOCK(inode, 1);
 			goto cleanup;
 		}
+		blkdev_issue_flush(sb->s_bdev, NULL);
 	} else
 		mark_inode_dirty(inode);
 
@@ -792,8 +794,10 @@ ext2_xattr_delete_inode(struct inode *inode)
 			le32_to_cpu(HDR(bh)->h_refcount));
 		unlock_buffer(bh);
 		mark_buffer_dirty(bh);
-		if (IS_SYNC(inode))
+		if (IS_SYNC(inode)) {
 			sync_dirty_buffer(bh);
+			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		}
 		DQUOT_FREE_BLOCK(inode, 1);
 	}
 	EXT2_I(inode)->i_file_acl = 0;
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 13:45 ` Alan Cox
  2009-01-13 14:03   ` Theodore Tso
@ 2009-01-13 14:42   ` Pavel Machek
  1 sibling, 0 replies; 74+ messages in thread
From: Pavel Machek @ 2009-01-13 14:42 UTC (permalink / raw)
  To: Alan Cox; +Cc: kernel list, jack, tytso

On Tue 2009-01-13 13:45:03, Alan Cox wrote:
> > Plus it has a nasty behaviour where it reverts to -W1 if disk
> > connection is momentarily lost. (If you unplug/replug SATA disk, linux
> > will happily rediscover and use it, but -W0 was already forgotten at
> > that point, right?)
> 
> If you momentarily lose power your disk state isn't defined anyway. If
> you do that with the SATA code it should treat it the same as a scsi disk
> swap so it'll get a new device and the old fs will go down.

No, that is not what happened :-(. If I replug disk within 10 seconds,
it just behaves as if nothing happened, and continues operating on the
disk. AMD machine... but I don't think that matters.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 14:30     ` ext2 + -osync: not as easy as it seems Jan Kara
@ 2009-01-13 14:46       ` Theodore Tso
  2009-01-14  3:37       ` Fernando Luis Vázquez Cao
  1 sibling, 0 replies; 74+ messages in thread
From: Theodore Tso @ 2009-01-13 14:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Alan Cox, Pavel Machek, kernel list, Jens Axboe

On Tue, Jan 13, 2009 at 03:30:12PM +0100, Jan Kara wrote:
> 
>   But it's not the whole story. Strictly speaking we should also call
> blkdev_issue_flush() whenever we write things because of O_SYNC or
> O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
> patch I've sent a while before).

Looks good to me.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>

						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-13 14:30     ` ext2 + -osync: not as easy as it seems Jan Kara
  2009-01-13 14:46       ` Theodore Tso
@ 2009-01-14  3:37       ` Fernando Luis Vázquez Cao
  2009-01-14 10:35         ` Jan Kara
  1 sibling, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-14  3:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe, sandeen

(CCing Eric Sandeen)

On Tue, 2009-01-13 at 15:30 +0100, Jan Kara wrote:
> On Tue 13-01-09 09:03:47, Theodore Tso wrote:
> > Adding a barrier shouldn't be that hard; just a matter adding a call
> > to blkdev_issue_flush() to ext2_sync_file() before it returns.
>   Yes. Something like the patch below?
> 
>   But it's not the whole story. Strictly speaking we should also call
> blkdev_issue_flush() whenever we write things because of O_SYNC or
> O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
> patch I've sent a while before).

Commit d755fb384250d6bd7fd18a0930e71965acc8e72e added a call to
blkdev_issue_flush to ext4_sync_file, and looking at its ext3
counterpart it seems it might be needed there too.

I may be missing something, but is it possible to ensure the inode hits
the platter without the patch below?

--

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Subject: ext3: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync, we should call
blkdev_issue_flush if barriers are supported.

This is a straight port of a similar patch written by Eric Sandeen for
ext4 (commit d755fb384250d6bd7fd18a0930e71965acc8e72e).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc1-orig/fs/ext3/fsync.c linux-2.6.29-rc1/fs/ext3/fsync.c
--- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-14 11:45:47.000000000 +0900
@@ -27,6 +27,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/jbd.h>
+#include <linux/blkdev.h>
 #include <linux/ext3_fs.h>
 #include <linux/ext3_jbd.h>
 
@@ -45,6 +46,7 @@
 int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
 	int ret = 0;
 
 	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, s
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+
+		if (journal && (journal->j_flags & JFS_BARRIER))
+			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	}
 out:
 	return ret;



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14  3:37       ` Fernando Luis Vázquez Cao
@ 2009-01-14 10:35         ` Jan Kara
  2009-01-14 13:21           ` Theodore Tso
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-01-14 10:35 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe, sandeen

On Wed 14-01-09 12:37:19, Fernando Luis Vázquez Cao wrote:
> (CCing Eric Sandeen)
> 
> On Tue, 2009-01-13 at 15:30 +0100, Jan Kara wrote:
> > On Tue 13-01-09 09:03:47, Theodore Tso wrote:
> > > Adding a barrier shouldn't be that hard; just a matter adding a call
> > > to blkdev_issue_flush() to ext2_sync_file() before it returns.
> >   Yes. Something like the patch below?
> > 
> >   But it's not the whole story. Strictly speaking we should also call
> > blkdev_issue_flush() whenever we write things because of O_SYNC or
> > O_DIRSYNC flags. My patch does also that (it's based on the previous ext2
> > patch I've sent a while before).
> 
> Commit d755fb384250d6bd7fd18a0930e71965acc8e72e added a call to
> blkdev_issue_flush to ext4_sync_file, and looking at its ext3
> counterpart it seems it might be needed there too.
> 
> I may be missing something, but is it possible to ensure the inode hits
> the platter without the patch below?
  Yes, I noticed that yesterday as well. But then I was puzzled why ext4
would need the flush where it has it... sync_inode() has started and
committed a transaction which issued a barrier when the commit was done.
The only reason I could imagine is that barrier (although it is usually
translated to flushing writeback caches) actually means just an ordering
requirement and hence does not necessarily mean that the caches are
properly flushed. Is that so Eric?
  BTW: We should also issue the flush in the fdatasync() case, shouldn't
we?

								Honza

> From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> Subject: ext3: call blkdev_issue_flush on fsync
> 
> To ensure that bits are truly on-disk after an fsync, we should call
> blkdev_issue_flush if barriers are supported.
> 
> This is a straight port of a similar patch written by Eric Sandeen for
> ext4 (commit d755fb384250d6bd7fd18a0930e71965acc8e72e).
> 
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
> 
> diff -urNp linux-2.6.29-rc1-orig/fs/ext3/fsync.c linux-2.6.29-rc1/fs/ext3/fsync.c
> --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-14 11:45:47.000000000 +0900
> @@ -27,6 +27,7 @@
>  #include <linux/sched.h>
>  #include <linux/writeback.h>
>  #include <linux/jbd.h>
> +#include <linux/blkdev.h>
>  #include <linux/ext3_fs.h>
>  #include <linux/ext3_jbd.h>
>  
> @@ -45,6 +46,7 @@
>  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
>  {
>  	struct inode *inode = dentry->d_inode;
> +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
>  	int ret = 0;
>  
>  	J_ASSERT(ext3_journal_current_handle() == NULL);
> @@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, s
>  			.nr_to_write = 0, /* sys_fsync did this */
>  		};
>  		ret = sync_inode(inode, &wbc);
> +
> +		if (journal && (journal->j_flags & JFS_BARRIER))
> +			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>  	}
>  out:
>  	return ret;
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 10:35         ` Jan Kara
@ 2009-01-14 13:21           ` Theodore Tso
  2009-01-14 14:05             ` Jan Kara
  0 siblings, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-01-14 13:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
	kernel list, Jens Axboe, sandeen

On Wed, Jan 14, 2009 at 11:35:33AM +0100, Jan Kara wrote:
>   Yes, I noticed that yesterday as well. But then I was puzzled why ext4
> would need the flush where it has it... sync_inode() has started and
> committed a transaction which issued a barrier when the commit was done.

You're right; I'm not convinced we need the flush in ext4 (or ext3) at
all.  We write the data blocks, *then* we call ext4_write_inode(),
which will force a commit.  Now, if we apply that patch which
optimizes out commits if there are no dirty blocks, then we'll be
trouble, because we won't know for sure whether or not
ext4_write_inode() will have forced a journal commit.

If we optimize out the journal commit when there are no blocks
attached to the transaction, we could change the patch to only force a
flush if inode->i_state did not have I_DIRTY before the call to
sync_inode().  Does that sound sane?

> The only reason I could imagine is that barrier (although it is usually
> translated to flushing writeback caches) actually means just an ordering
> requirement and hence does not necessarily mean that the caches are
> properly flushed. Is that so Eric?

I'm not sure what you mean; if the barrier operation isn't flushing
all of the caches all the way out to the iron oxide, it's not going to
be working properly no matter where it is being called, whether it's
in ext4_sync_file() or in jbd2's journal_submit_commit_record().

						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 13:21           ` Theodore Tso
@ 2009-01-14 14:05             ` Jan Kara
  2009-01-14 14:08               ` Jens Axboe
  2009-01-14 14:12               ` Theodore Tso
  0 siblings, 2 replies; 74+ messages in thread
From: Jan Kara @ 2009-01-14 14:05 UTC (permalink / raw)
  To: Theodore Tso, Fernando Luis Vázquez Cao, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, sandeen

On Wed 14-01-09 08:21:46, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 11:35:33AM +0100, Jan Kara wrote:
> >   Yes, I noticed that yesterday as well. But then I was puzzled why ext4
> > would need the flush where it has it... sync_inode() has started and
> > committed a transaction which issued a barrier when the commit was done.
> 
> You're right; I'm not convinced we need the flush in ext4 (or ext3) at
> all.  We write the data blocks, *then* we call ext4_write_inode(),
> which will force a commit.  Now, if we apply that patch which
> optimizes out commits if there are no dirty blocks, then we'll be
> trouble, because we won't know for sure whether or not
> ext4_write_inode() will have forced a journal commit.
> 
> If we optimize out the journal commit when there are no blocks
> attached to the transaction, we could change the patch to only force a
> flush if inode->i_state did not have I_DIRTY before the call to
> sync_inode().  Does that sound sane?
  Yes. And also add a flush in case of fdatasync().

> > The only reason I could imagine is that barrier (although it is usually
> > translated to flushing writeback caches) actually means just an ordering
> > requirement and hence does not necessarily mean that the caches are
> > properly flushed. Is that so Eric?
> 
> I'm not sure what you mean; if the barrier operation isn't flushing
> all of the caches all the way out to the iron oxide, it's not going to
> be working properly no matter where it is being called, whether it's
> in ext4_sync_file() or in jbd2's journal_submit_commit_record().
  Well, I thought that a barrier, as an abstraction, only guarantees that
any IO which happened before the barrier hits the iron before any IO which
has been submitted after a barrier. This is actually enough for a
journalling to work correctly but it's not enough for fsync() guarantees.
But I might be wrong...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:05             ` Jan Kara
@ 2009-01-14 14:08               ` Jens Axboe
  2009-01-14 14:34                 ` Theodore Tso
  2009-02-12 16:43                 ` Eric Sandeen
  2009-01-14 14:12               ` Theodore Tso
  1 sibling, 2 replies; 74+ messages in thread
From: Jens Axboe @ 2009-01-14 14:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Fernando Luis Vázquez Cao, Alan Cox,
	Pavel Machek, kernel list, sandeen

On Wed, Jan 14 2009, Jan Kara wrote:
> > I'm not sure what you mean; if the barrier operation isn't flushing
> > all of the caches all the way out to the iron oxide, it's not going to
> > be working properly no matter where it is being called, whether it's
> > in ext4_sync_file() or in jbd2's journal_submit_commit_record().
>   Well, I thought that a barrier, as an abstraction, only guarantees that
> any IO which happened before the barrier hits the iron before any IO which
> has been submitted after a barrier. This is actually enough for a
> journalling to work correctly but it's not enough for fsync() guarantees.
> But I might be wrong...

It also guarentees that when you get a completion for that barrier
write, it's on safe storage. Think of it as a flush-write-flush
operation, in the presence of write back caching.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:05             ` Jan Kara
  2009-01-14 14:08               ` Jens Axboe
@ 2009-01-14 14:12               ` Theodore Tso
  2009-01-14 14:37                 ` Jan Kara
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-01-14 14:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
	kernel list, Jens Axboe, sandeen

On Wed, Jan 14, 2009 at 03:05:32PM +0100, Jan Kara wrote:
> On Wed 14-01-09 08:21:46, Theodore Tso wrote:
> > 
> > If we optimize out the journal commit when there are no blocks
> > attached to the transaction, we could change the patch to only force a
> > flush if inode->i_state did not have I_DIRTY before the call to
> > sync_inode().  Does that sound sane?
>   Yes. And also add a flush in case of fdatasync().

Um, we have that already; the sync_inode() followed by
blkdev_issue_flush() is the path taken by fdatasync(), I do believe.

>   Well, I thought that a barrier, as an abstraction, only guarantees that
> any IO which happened before the barrier hits the iron before any IO which
> has been submitted after a barrier. This is actually enough for a
> journalling to work correctly but it's not enough for fsync() guarantees.
> But I might be wrong...

Ah, yes, that's what you're getting at.  True, but for better or for
worse, we have no other interface other than blkdev_issue_flush().
This will guarantee that the data has made it to the disk controller,
but it won't necessarily guarantee that it will have made it onto the
disk platter, as I understand things; but I don't think we have any
other interfaces available to us at this point.

						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:08               ` Jens Axboe
@ 2009-01-14 14:34                 ` Theodore Tso
  2009-01-14 14:43                   ` Jens Axboe
  2009-02-12 16:43                 ` Eric Sandeen
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-01-14 14:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
	kernel list, sandeen

On Wed, Jan 14, 2009 at 03:08:04PM +0100, Jens Axboe wrote:
> 
> It also guarentees that when you get a completion for that barrier
> write, it's on safe storage. Think of it as a flush-write-flush
> operation, in the presence of write back caching.
> 

Is that true even if the barrier isn't attached to a write operation,
i.e., when using

      blkdev_issue_flush(sb, NULL);

?

					- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:12               ` Theodore Tso
@ 2009-01-14 14:37                 ` Jan Kara
  2009-01-14 16:59                   ` Theodore Tso
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-01-14 14:37 UTC (permalink / raw)
  To: Theodore Tso, Fernando Luis Vázquez Cao, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, sandeen

On Wed 14-01-09 09:12:04, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 03:05:32PM +0100, Jan Kara wrote:
> > On Wed 14-01-09 08:21:46, Theodore Tso wrote:
> > > 
> > > If we optimize out the journal commit when there are no blocks
> > > attached to the transaction, we could change the patch to only force a
> > > flush if inode->i_state did not have I_DIRTY before the call to
> > > sync_inode().  Does that sound sane?
> >   Yes. And also add a flush in case of fdatasync().
> 
> Um, we have that already; the sync_inode() followed by
> blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
  Maybe ext4-patch-queue changes that area but in Linus's tree I see:

  if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
         goto out;

  So if we just overwrite some data, we send them to disk via fdatawrite()
and then we quickly bail out from ext4_sync_file() without doing
blkdev_issue_flush().

> >   Well, I thought that a barrier, as an abstraction, only guarantees that
> > any IO which happened before the barrier hits the iron before any IO which
> > has been submitted after a barrier. This is actually enough for a
> > journalling to work correctly but it's not enough for fsync() guarantees.
> > But I might be wrong...
> 
> Ah, yes, that's what you're getting at.  True, but for better or for
> worse, we have no other interface other than blkdev_issue_flush().
> This will guarantee that the data has made it to the disk controller,
> but it won't necessarily guarantee that it will have made it onto the
> disk platter, as I understand things; but I don't think we have any
> other interfaces available to us at this point.
  As Jens wrote, it seems barrier guarantees more than I thought so we are
correct.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:34                 ` Theodore Tso
@ 2009-01-14 14:43                   ` Jens Axboe
  0 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2009-01-14 14:43 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
	kernel list, sandeen

On Wed, Jan 14 2009, Theodore Tso wrote:
> On Wed, Jan 14, 2009 at 03:08:04PM +0100, Jens Axboe wrote:
> > 
> > It also guarentees that when you get a completion for that barrier
> > write, it's on safe storage. Think of it as a flush-write-flush
> > operation, in the presence of write back caching.
> > 
> 
> Is that true even if the barrier isn't attached to a write operation,
> i.e., when using
> 
>       blkdev_issue_flush(sb, NULL);
> 
> ?

No, since there's no specific write to protect in that case. The above
will flush existing writes to platter, but has no bearing on writes
submitted later (other than the obvious effect that they will be on disk
later then previously submitted writes). So for blkdev_issue_flush(),
when that returns with good status, then the writes are safe. In the
absence of the barrier flag on a write, completion of that write may
just mean that the disk drive is aware of it, not that it is in any way
safe.

The hardware may even treat the flush as a noop and do absolutely
nothing, which is perfectly valid if it guarantees us that it will never
lose a submitted write.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:37                 ` Jan Kara
@ 2009-01-14 16:59                   ` Theodore Tso
  2009-01-15 12:06                     ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-01-14 16:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
	kernel list, Jens Axboe, sandeen

On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > Um, we have that already; the sync_inode() followed by
> > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
>
>   Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> 
>   if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
>          goto out;
> 
>   So if we just overwrite some data, we send them to disk via fdatawrite()
> and then we quickly bail out from ext4_sync_file() without doing
> blkdev_issue_flush().

So you're thinking about fdatawrite() being called by some code path
other than ext4_sync_file() before we call fsync()?  Yeah, that could
happen....  I think that will only happen if the file is opened
O_SYNC, but that raises another issue, which is that we're not forcing
a flush for writes when the file is opened O_SYNC.

							- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 16:59                   ` Theodore Tso
@ 2009-01-15 12:06                     ` Fernando Luis Vázquez Cao
  2009-01-15 23:45                       ` Jan Kara
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-15 12:06 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe, sandeen

On Wed, 2009-01-14 at 11:59 -0500, Theodore Tso wrote: 
> On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > > Um, we have that already; the sync_inode() followed by
> > > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
> >
> >   Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> > 
> >   if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> >          goto out;
> > 
> >   So if we just overwrite some data, we send them to disk via fdatawrite()
> > and then we quickly bail out from ext4_sync_file() without doing
> > blkdev_issue_flush().
> 
> So you're thinking about fdatawrite() being called by some code path
> other than ext4_sync_file() before we call fsync()?  Yeah, that could
> happen....  I think that will only happen if the file is opened
> O_SYNC, but that raises another issue, which is that we're not forcing
> a flush for writes when the file is opened O_SYNC.

Hi Jan, Ted

Is something like the patch below what you had in mind?

--

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Subject: ext3: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should call blkdev_issue_flush if barriers are supported.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

--- linux-2.6.29-rc1-orig/fs/ext4/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext4/fsync.c	2009-01-15 21:03:19.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
 {
 	struct inode *inode = dentry->d_inode;
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -79,22 +80,35 @@ int ext4_sync_file(struct file *file, st
 		goto out;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC))
+		goto flush_blkdev;
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
-		if (journal && (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		/*
+		 * When there are no blocks attached to the journal transaction
+		 * some optimizations are possible, but if there were dirty
+		 * pages sync_inode() should have ensured that all data gets
+		 * actually written to disk. Thus, we can skip
+		 * blkdev_issue_flush() below.
+		 */
+		if (!(i_state & I_DIRTY_PAGES))
+			goto flush_blkdev;
 	}
+
+	goto out;
+
+flush_blkdev:
+	if (journal && (journal->j_flags & JBD2_BARRIER))
+		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 out:
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-15 12:06                     ` Fernando Luis Vázquez Cao
@ 2009-01-15 23:45                       ` Jan Kara
  2009-01-16 12:31                         ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-01-15 23:45 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Theodore Tso, Jan Kara, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, sandeen

On Thu 15-01-09 21:06:51, Fernando Luis Vázquez Cao wrote:
> On Wed, 2009-01-14 at 11:59 -0500, Theodore Tso wrote: 
> > On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > > > Um, we have that already; the sync_inode() followed by
> > > > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
> > >
> > >   Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> > > 
> > >   if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > >          goto out;
> > > 
> > >   So if we just overwrite some data, we send them to disk via fdatawrite()
> > > and then we quickly bail out from ext4_sync_file() without doing
> > > blkdev_issue_flush().
> > 
> > So you're thinking about fdatawrite() being called by some code path
> > other than ext4_sync_file() before we call fsync()?  Yeah, that could
> > happen....  I think that will only happen if the file is opened
> > O_SYNC, but that raises another issue, which is that we're not forcing
> > a flush for writes when the file is opened O_SYNC.
> 
> Hi Jan, Ted
> 
> Is something like the patch below what you had in mind?
> 
> --
> 
> From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> Subject: ext3: call blkdev_issue_flush on fsync
> 
> To ensure that bits are truly on-disk after an fsync or fdatasync, we
> should call blkdev_issue_flush if barriers are supported.
> 
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
> 
> --- linux-2.6.29-rc1-orig/fs/ext4/fsync.c	2008-12-25 08:26:37.000000000 +0900
> +++ linux-2.6.29-rc1/fs/ext4/fsync.c	2009-01-15 21:03:19.000000000 +0900
> @@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
>  {
>  	struct inode *inode = dentry->d_inode;
>  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +	unsigned long i_state = inode->i_state;
>  	int ret = 0;
>  
>  	J_ASSERT(ext4_journal_current_handle() == NULL);
> @@ -79,22 +80,35 @@ int ext4_sync_file(struct file *file, st
>  		goto out;
>  	}
>  
> -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> -		goto out;
> +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> +		goto flush_blkdev;
>  
>  	/*
>  	 * The VFS has written the file data.  If the inode is unaltered
>  	 * then we need not start a commit.
>  	 */
> -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
>  		struct writeback_control wbc = {
>  			.sync_mode = WB_SYNC_ALL,
>  			.nr_to_write = 0, /* sys_fsync did this */
>  		};
>  		ret = sync_inode(inode, &wbc);
> -		if (journal && (journal->j_flags & JBD2_BARRIER))
> -			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> +		/*
> +		 * When there are no blocks attached to the journal transaction
> +		 * some optimizations are possible, but if there were dirty
> +		 * pages sync_inode() should have ensured that all data gets
> +		 * actually written to disk. Thus, we can skip
> +		 * blkdev_issue_flush() below.
> +		 */
> +		if (!(i_state & I_DIRTY_PAGES))
> +			goto flush_blkdev;
  Uh. Here I don't get it. When we did sync_inode(), blkdev_issue_flush()
is needed only if the journal does not do barriers. So I'd expect here:
	if (!(journal->j_flags & JBD2_BARRIER))
		goto flush_blkdev;
	goto out;

>  	}
> +
> +	goto out;
  But here we might need to issue a flush if there are some data written. So
I'd have here:
	if (!(i_state & I_DIRTY_PAGES))
		goto out;

> +
> +flush_blkdev:
> +	if (journal && (journal->j_flags & JBD2_BARRIER))
> +		blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>  out:
>  	return ret;
>  }

										Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-15 23:45                       ` Jan Kara
@ 2009-01-16 12:31                         ` Fernando Luis Vázquez Cao
  2009-01-16 13:55                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
  2009-01-16 13:59                           ` ext4: call blkdev_issue_flush " Fernando Luis Vázquez Cao
  0 siblings, 2 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-16 12:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe, sandeen

On Fri, 2009-01-16 at 00:45 +0100, Jan Kara wrote:
> On Thu 15-01-09 21:06:51, Fernando Luis Vázquez Cao wrote:
> > On Wed, 2009-01-14 at 11:59 -0500, Theodore Tso wrote: 
> > > On Wed, Jan 14, 2009 at 03:37:56PM +0100, Jan Kara wrote:
> > > > > Um, we have that already; the sync_inode() followed by
> > > > > blkdev_issue_flush() is the path taken by fdatasync(), I do believe.
> > > >
> > > >   Maybe ext4-patch-queue changes that area but in Linus's tree I see:
> > > > 
> > > >   if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > >          goto out;
> > > > 
> > > >   So if we just overwrite some data, we send them to disk via fdatawrite()
> > > > and then we quickly bail out from ext4_sync_file() without doing
> > > > blkdev_issue_flush().
> > > 
> > > So you're thinking about fdatawrite() being called by some code path
> > > other than ext4_sync_file() before we call fsync()?  Yeah, that could
> > > happen....  I think that will only happen if the file is opened
> > > O_SYNC, but that raises another issue, which is that we're not forcing
> > > a flush for writes when the file is opened O_SYNC.
> > 
> > Hi Jan, Ted
> > 
> > Is something like the patch below what you had in mind?
> > 
> > --
> > 
> > From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > Subject: ext3: call blkdev_issue_flush on fsync
> > 
> > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > should call blkdev_issue_flush if barriers are supported.
> > 
> > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > ---
> > 
> > --- linux-2.6.29-rc1-orig/fs/ext4/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > +++ linux-2.6.29-rc1/fs/ext4/fsync.c	2009-01-15 21:03:19.000000000 +0900
> > @@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
> >  {
> >  	struct inode *inode = dentry->d_inode;
> >  	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > +	unsigned long i_state = inode->i_state;
> >  	int ret = 0;
> >  
> >  	J_ASSERT(ext4_journal_current_handle() == NULL);
> > @@ -79,22 +80,35 @@ int ext4_sync_file(struct file *file, st
> >  		goto out;
> >  	}
> >  
> > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > -		goto out;
> > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > +		goto flush_blkdev;
> >  
> >  	/*
> >  	 * The VFS has written the file data.  If the inode is unaltered
> >  	 * then we need not start a commit.
> >  	 */
> > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> >  		struct writeback_control wbc = {
> >  			.sync_mode = WB_SYNC_ALL,
> >  			.nr_to_write = 0, /* sys_fsync did this */
> >  		};
> >  		ret = sync_inode(inode, &wbc);
> > -		if (journal && (journal->j_flags & JBD2_BARRIER))
> > -			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> > +		/*
> > +		 * When there are no blocks attached to the journal transaction
> > +		 * some optimizations are possible, but if there were dirty
> > +		 * pages sync_inode() should have ensured that all data gets
> > +		 * actually written to disk. Thus, we can skip
> > +		 * blkdev_issue_flush() below.
> > +		 */
> > +		if (!(i_state & I_DIRTY_PAGES))
> > +			goto flush_blkdev;
>   Uh. Here I don't get it. When we did sync_inode(), blkdev_issue_flush()
> is needed only if the journal does not do barriers. So I'd expect here:
> 	if (!(journal->j_flags & JBD2_BARRIER))
> 		goto flush_blkdev;
> 	goto out;

Ups, you are right. I somehow managed to mangle the logic that I
intended to put here and under flush_blkdev. By the way, I think that
the same check may be needed for the data==journal case too.

Thank you for the feedback, Jan.

I'll be replying to this email with new patches for ext2/ext3.

- Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* ext3: call blkdev_issue_flush on fsync
  2009-01-16 12:31                         ` Fernando Luis Vázquez Cao
@ 2009-01-16 13:55                           ` Fernando Luis Vázquez Cao
  2009-01-16 16:30                             ` Jan Kara
  2009-01-16 13:59                           ` ext4: call blkdev_issue_flush " Fernando Luis Vázquez Cao
  1 sibling, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-16 13:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

--- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
@@ -27,6 +27,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/jbd.h>
+#include <linux/blkdev.h>
 #include <linux/ext3_fs.h>
 #include <linux/ext3_jbd.h>
 
@@ -45,6 +46,8 @@
 int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
 	 */
 	if (ext3_should_journal_data(inode)) {
 		ret = ext3_force_commit(inode->i_sb);
+		if (!(journal->j_flags & JFS_BARRIER))
+			goto no_journal_barrier;
 		goto out;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC))
+		goto flush_blkdev;
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+		if (journal && !(journal->j_flags & JFS_BARRIER))
+			goto no_journal_barrier;
 	}
+
+flush_blkdev:
+	if (!(i_state & I_DIRTY_PAGES))
+		goto out;
+no_journal_barrier:
+	blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 out:
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* ext4: call blkdev_issue_flush on fsync
  2009-01-16 12:31                         ` Fernando Luis Vázquez Cao
  2009-01-16 13:55                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
@ 2009-01-16 13:59                           ` Fernando Luis Vázquez Cao
  1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-16 13:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

--- linux-2.6.29-rc1-orig/fs/ext4/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc1/fs/ext4/fsync.c	2009-01-16 22:18:51.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
 {
 	struct inode *inode = dentry->d_inode;
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,33 @@ int ext4_sync_file(struct file *file, st
 	 */
 	if (ext4_should_journal_data(inode)) {
 		ret = ext4_force_commit(inode->i_sb);
+		if (!(journal->j_flags & JBD2_BARRIER))
+			goto no_journal_barrier;
 		goto out;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC))
+		goto flush_blkdev;
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
-		if (journal && (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		if (journal && !(journal->j_flags & JBD2_BARRIER))
+			goto no_journal_barrier;
 	}
+
+flush_blkdev:
+	if (!(i_state & I_DIRTY_PAGES))
+		goto out;
+no_journal_barrier:
+	blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 out:
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-16 13:55                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
@ 2009-01-16 16:30                             ` Jan Kara
  2009-01-17  9:47                               ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-01-16 16:30 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> To ensure that bits are truly on-disk after an fsync or fdatasync, we
> should force a disk flush explicitly when there is dirty data/metadata
> and the journal didn't emit a write barrier (either because metadata is
> not being synched or barriers are disabled).
> 
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
  Only two minor nits:

> --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> @@ -27,6 +27,7 @@
>  #include <linux/sched.h>
>  #include <linux/writeback.h>
>  #include <linux/jbd.h>
> +#include <linux/blkdev.h>
>  #include <linux/ext3_fs.h>
>  #include <linux/ext3_jbd.h>
>  
> @@ -45,6 +46,8 @@
>  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
>  {
>  	struct inode *inode = dentry->d_inode;
> +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> +	unsigned long i_state = inode->i_state;
>  	int ret = 0;
>  
>  	J_ASSERT(ext3_journal_current_handle() == NULL);
> @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
>  	 */
>  	if (ext3_should_journal_data(inode)) {
>  		ret = ext3_force_commit(inode->i_sb);
> +		if (!(journal->j_flags & JFS_BARRIER))
> +			goto no_journal_barrier;
>  		goto out;
>  	}
>  
> -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> -		goto out;
> +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> +		goto flush_blkdev;
>  
>  	/*
>  	 * The VFS has written the file data.  If the inode is unaltered
>  	 * then we need not start a commit.
>  	 */
> -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
>  		struct writeback_control wbc = {
>  			.sync_mode = WB_SYNC_ALL,
>  			.nr_to_write = 0, /* sys_fsync did this */
>  		};
>  		ret = sync_inode(inode, &wbc);
> +		if (journal && !(journal->j_flags & JFS_BARRIER))
> +			goto no_journal_barrier;
  I cannot imagine "journal" will be NULL here.
  And we can also optimize here a bit and do "goto out" because here
we know the barrier has been issued.

>  	}
> +
> +flush_blkdev:
> +	if (!(i_state & I_DIRTY_PAGES))
> +		goto out;
> +no_journal_barrier:
> +	blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
>  out:
>  	return ret;
>  }

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-16 16:30                             ` Jan Kara
@ 2009-01-17  9:47                               ` Fernando Luis Vázquez Cao
  2009-01-17 10:00                                 ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-17  9:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > should force a disk flush explicitly when there is dirty data/metadata
> > and the journal didn't emit a write barrier (either because metadata is
> > not being synched or barriers are disabled).
> > 
> > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > ---
>   Only two minor nits:
> 
> > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> > @@ -27,6 +27,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/writeback.h>
> >  #include <linux/jbd.h>
> > +#include <linux/blkdev.h>
> >  #include <linux/ext3_fs.h>
> >  #include <linux/ext3_jbd.h>
> >  
> > @@ -45,6 +46,8 @@
> >  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> >  {
> >  	struct inode *inode = dentry->d_inode;
> > +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > +	unsigned long i_state = inode->i_state;
> >  	int ret = 0;
> >  
> >  	J_ASSERT(ext3_journal_current_handle() == NULL);
> > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> >  	 */
> >  	if (ext3_should_journal_data(inode)) {
> >  		ret = ext3_force_commit(inode->i_sb);
> > +		if (!(journal->j_flags & JFS_BARRIER))
> > +			goto no_journal_barrier;
> >  		goto out;
> >  	}
> >  
> > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > -		goto out;
> > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > +		goto flush_blkdev;
> >  
> >  	/*
> >  	 * The VFS has written the file data.  If the inode is unaltered
> >  	 * then we need not start a commit.
> >  	 */
> > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> >  		struct writeback_control wbc = {
> >  			.sync_mode = WB_SYNC_ALL,
> >  			.nr_to_write = 0, /* sys_fsync did this */
> >  		};
> >  		ret = sync_inode(inode, &wbc);
> > +		if (journal && !(journal->j_flags & JFS_BARRIER))
> > +			goto no_journal_barrier;
>   I cannot imagine "journal" will be NULL here.

I'll try to check whether that is always so just in case.

>   And we can also optimize here a bit and do "goto out" because here
> we know the barrier has been issued.

Yep, I was considering the same optimization. By the way, I was
wondering if we should honor ext3 and ext4's "barrier" mount option for
sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
What are your thoughts on this?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-17  9:47                               ` Fernando Luis Vázquez Cao
@ 2009-01-17 10:00                                 ` Fernando Luis Vázquez Cao
  2009-01-19 12:03                                   ` Jan Kara
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-17 10:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > should force a disk flush explicitly when there is dirty data/metadata
> > > and the journal didn't emit a write barrier (either because metadata is
> > > not being synched or barriers are disabled).
> > > 
> > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > > ---
> >   Only two minor nits:
> > 
> > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> > > @@ -27,6 +27,7 @@
> > >  #include <linux/sched.h>
> > >  #include <linux/writeback.h>
> > >  #include <linux/jbd.h>
> > > +#include <linux/blkdev.h>
> > >  #include <linux/ext3_fs.h>
> > >  #include <linux/ext3_jbd.h>
> > >  
> > > @@ -45,6 +46,8 @@
> > >  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > >  {
> > >  	struct inode *inode = dentry->d_inode;
> > > +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > +	unsigned long i_state = inode->i_state;
> > >  	int ret = 0;
> > >  
> > >  	J_ASSERT(ext3_journal_current_handle() == NULL);
> > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > >  	 */
> > >  	if (ext3_should_journal_data(inode)) {
> > >  		ret = ext3_force_commit(inode->i_sb);
> > > +		if (!(journal->j_flags & JFS_BARRIER))
> > > +			goto no_journal_barrier;
> > >  		goto out;
> > >  	}
> > >  
> > > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > -		goto out;
> > > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > +		goto flush_blkdev;
> > >  
> > >  	/*
> > >  	 * The VFS has written the file data.  If the inode is unaltered
> > >  	 * then we need not start a commit.
> > >  	 */
> > > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > >  		struct writeback_control wbc = {
> > >  			.sync_mode = WB_SYNC_ALL,
> > >  			.nr_to_write = 0, /* sys_fsync did this */
> > >  		};
> > >  		ret = sync_inode(inode, &wbc);
> > > +		if (journal && !(journal->j_flags & JFS_BARRIER))
> > > +			goto no_journal_barrier;
> >   I cannot imagine "journal" will be NULL here.
> 
> I'll try to check whether that is always so just in case.
> 
> >   And we can also optimize here a bit and do "goto out" because here
> > we know the barrier has been issued.
> 
> Yep, I was considering the same optimization. By the way, I was
> wondering if we should honor ext3 and ext4's "barrier" mount option for
> sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".

The last phrase should read " do not force a flush when "barrier=0" ".

Sorry for the noise.

- Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-17 10:00                                 ` Fernando Luis Vázquez Cao
@ 2009-01-19 12:03                                   ` Jan Kara
  2009-01-28  9:45                                     ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-01-19 12:03 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > and the journal didn't emit a write barrier (either because metadata is
> > > > not being synched or barriers are disabled).
> > > > 
> > > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > > > ---
> > >   Only two minor nits:
> > > 
> > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> > > > @@ -27,6 +27,7 @@
> > > >  #include <linux/sched.h>
> > > >  #include <linux/writeback.h>
> > > >  #include <linux/jbd.h>
> > > > +#include <linux/blkdev.h>
> > > >  #include <linux/ext3_fs.h>
> > > >  #include <linux/ext3_jbd.h>
> > > >  
> > > > @@ -45,6 +46,8 @@
> > > >  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > >  {
> > > >  	struct inode *inode = dentry->d_inode;
> > > > +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > +	unsigned long i_state = inode->i_state;
> > > >  	int ret = 0;
> > > >  
> > > >  	J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > >  	 */
> > > >  	if (ext3_should_journal_data(inode)) {
> > > >  		ret = ext3_force_commit(inode->i_sb);
> > > > +		if (!(journal->j_flags & JFS_BARRIER))
> > > > +			goto no_journal_barrier;
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > -		goto out;
> > > > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > +		goto flush_blkdev;
> > > >  
> > > >  	/*
> > > >  	 * The VFS has written the file data.  If the inode is unaltered
> > > >  	 * then we need not start a commit.
> > > >  	 */
> > > > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > >  		struct writeback_control wbc = {
> > > >  			.sync_mode = WB_SYNC_ALL,
> > > >  			.nr_to_write = 0, /* sys_fsync did this */
> > > >  		};
> > > >  		ret = sync_inode(inode, &wbc);
> > > > +		if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > +			goto no_journal_barrier;
> > >   I cannot imagine "journal" will be NULL here.
> > 
> > I'll try to check whether that is always so just in case.
> > 
> > >   And we can also optimize here a bit and do "goto out" because here
> > > we know the barrier has been issued.
> > 
> > Yep, I was considering the same optimization. By the way, I was
> > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> 
> The last phrase should read " do not force a flush when "barrier=0" ".
  I was hesitating about this a bit. But I don't think so. The reason is
that POSIX (or any other reasonable specification) mandates that fsync()
should return only after the data is safely on storage. So if we don't
flush blockdevice caches, we effectively violate POSIX and we should never
do that. With barriers the matter is a bit different - that is just a
filesystem specific thing, no standard guarantees anything.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-19 12:03                                   ` Jan Kara
@ 2009-01-28  9:45                                     ` Fernando Luis Vázquez Cao
  2009-01-28  9:55                                       ` Jan Kara
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-01-28  9:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Mon, 2009-01-19 at 13:03 +0100, Jan Kara wrote:
> On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> > On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > > and the journal didn't emit a write barrier (either because metadata is
> > > > > not being synched or barriers are disabled).
> > > > > 
> > > > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > > > > ---
> > > >   Only two minor nits:
> > > > 
> > > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> > > > > @@ -27,6 +27,7 @@
> > > > >  #include <linux/sched.h>
> > > > >  #include <linux/writeback.h>
> > > > >  #include <linux/jbd.h>
> > > > > +#include <linux/blkdev.h>
> > > > >  #include <linux/ext3_fs.h>
> > > > >  #include <linux/ext3_jbd.h>
> > > > >  
> > > > > @@ -45,6 +46,8 @@
> > > > >  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > >  {
> > > > >  	struct inode *inode = dentry->d_inode;
> > > > > +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > > +	unsigned long i_state = inode->i_state;
> > > > >  	int ret = 0;
> > > > >  
> > > > >  	J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > >  	 */
> > > > >  	if (ext3_should_journal_data(inode)) {
> > > > >  		ret = ext3_force_commit(inode->i_sb);
> > > > > +		if (!(journal->j_flags & JFS_BARRIER))
> > > > > +			goto no_journal_barrier;
> > > > >  		goto out;
> > > > >  	}
> > > > >  
> > > > > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > > -		goto out;
> > > > > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > > +		goto flush_blkdev;
> > > > >  
> > > > >  	/*
> > > > >  	 * The VFS has written the file data.  If the inode is unaltered
> > > > >  	 * then we need not start a commit.
> > > > >  	 */
> > > > > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > >  		struct writeback_control wbc = {
> > > > >  			.sync_mode = WB_SYNC_ALL,
> > > > >  			.nr_to_write = 0, /* sys_fsync did this */
> > > > >  		};
> > > > >  		ret = sync_inode(inode, &wbc);
> > > > > +		if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > > +			goto no_journal_barrier;
> > > >   I cannot imagine "journal" will be NULL here.
> > > 
> > > I'll try to check whether that is always so just in case.
> > > 
> > > >   And we can also optimize here a bit and do "goto out" because here
> > > > we know the barrier has been issued.
> > > 
> > > Yep, I was considering the same optimization. By the way, I was
> > > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> > 
> > The last phrase should read " do not force a flush when "barrier=0" ".
>   I was hesitating about this a bit. But I don't think so. The reason is
> that POSIX (or any other reasonable specification) mandates that fsync()
> should return only after the data is safely on storage. So if we don't
> flush blockdevice caches, we effectively violate POSIX and we should never
> do that. With barriers the matter is a bit different - that is just a
> filesystem specific thing, no standard guarantees anything.

Hi Jan,

Sorry it's taken me so long to get back to you.

Thinking a lit bit more about this issue, it occurred to me that adding
a new mount option à la existing "barrier" is likely to be preferable.
As an example where such an option could make sense, let's consider a
system with battery-backup cache devices. Since the battery-backup
guarantees the data still not committed to the platter will not vanish
in the event of a power down, it should be possible to obtain a
performance gain by optimizing out the device flush on every
fsync()/fdatasync() call.

If there is consensus on the propriety of this approach, I will send
updated patches.

Regards,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-28  9:45                                     ` Fernando Luis Vázquez Cao
@ 2009-01-28  9:55                                       ` Jan Kara
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-01-28  9:55 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Wed 28-01-09 18:45:13, Fernando Luis Vázquez Cao wrote:
> On Mon, 2009-01-19 at 13:03 +0100, Jan Kara wrote:
> > On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> > > On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > > > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > > > and the journal didn't emit a write barrier (either because metadata is
> > > > > > not being synched or barriers are disabled).
> > > > > > 
> > > > > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > > > > > ---
> > > > >   Only two minor nits:
> > > > > 
> > > > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > > > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> > > > > > @@ -27,6 +27,7 @@
> > > > > >  #include <linux/sched.h>
> > > > > >  #include <linux/writeback.h>
> > > > > >  #include <linux/jbd.h>
> > > > > > +#include <linux/blkdev.h>
> > > > > >  #include <linux/ext3_fs.h>
> > > > > >  #include <linux/ext3_jbd.h>
> > > > > >  
> > > > > > @@ -45,6 +46,8 @@
> > > > > >  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > > >  {
> > > > > >  	struct inode *inode = dentry->d_inode;
> > > > > > +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > > > +	unsigned long i_state = inode->i_state;
> > > > > >  	int ret = 0;
> > > > > >  
> > > > > >  	J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > > >  	 */
> > > > > >  	if (ext3_should_journal_data(inode)) {
> > > > > >  		ret = ext3_force_commit(inode->i_sb);
> > > > > > +		if (!(journal->j_flags & JFS_BARRIER))
> > > > > > +			goto no_journal_barrier;
> > > > > >  		goto out;
> > > > > >  	}
> > > > > >  
> > > > > > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > > > -		goto out;
> > > > > > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > > > +		goto flush_blkdev;
> > > > > >  
> > > > > >  	/*
> > > > > >  	 * The VFS has written the file data.  If the inode is unaltered
> > > > > >  	 * then we need not start a commit.
> > > > > >  	 */
> > > > > > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > >  		struct writeback_control wbc = {
> > > > > >  			.sync_mode = WB_SYNC_ALL,
> > > > > >  			.nr_to_write = 0, /* sys_fsync did this */
> > > > > >  		};
> > > > > >  		ret = sync_inode(inode, &wbc);
> > > > > > +		if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > > > +			goto no_journal_barrier;
> > > > >   I cannot imagine "journal" will be NULL here.
> > > > 
> > > > I'll try to check whether that is always so just in case.
> > > > 
> > > > >   And we can also optimize here a bit and do "goto out" because here
> > > > > we know the barrier has been issued.
> > > > 
> > > > Yep, I was considering the same optimization. By the way, I was
> > > > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > > > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> > > 
> > > The last phrase should read " do not force a flush when "barrier=0" ".
> >   I was hesitating about this a bit. But I don't think so. The reason is
> > that POSIX (or any other reasonable specification) mandates that fsync()
> > should return only after the data is safely on storage. So if we don't
> > flush blockdevice caches, we effectively violate POSIX and we should never
> > do that. With barriers the matter is a bit different - that is just a
> > filesystem specific thing, no standard guarantees anything.
> 
> Hi Jan,
> 
> Sorry it's taken me so long to get back to you.
> 
> Thinking a lit bit more about this issue, it occurred to me that adding
> a new mount option à la existing "barrier" is likely to be preferable.
> As an example where such an option could make sense, let's consider a
> system with battery-backup cache devices. Since the battery-backup
> guarantees the data still not committed to the platter will not vanish
> in the event of a power down, it should be possible to obtain a
> performance gain by optimizing out the device flush on every
> fsync()/fdatasync() call.
  Yes, I came to this conclusion as well when we were discussing how to fix
other filesystems which fail to issue flush in fsync
(https://kerneltrap.org/mailarchive/linux-ext4/2009/1/22/4788004/thread).
I think it makes sence there's a general mount option recognized at VFS
level which would set superblock flag whether flush on fsync is needed or
not. Then you could use this flag in your patch. 
  So if you have time, please write a (separate) patch introducing this
generic option. Thanks.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext3: call blkdev_issue_flush on fsync
  2009-01-28  9:55                                       ` Jan Kara
@ 2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
  2009-02-12 10:35                                           ` vfs: Improve readability off mount flag definitins by using offsets Fernando Luis Vázquez Cao
                                                             ` (6 more replies)
  0 siblings, 7 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Wed, 2009-01-28 at 10:55 +0100, Jan Kara wrote: 
> On Wed 28-01-09 18:45:13, Fernando Luis Vázquez Cao wrote:
> > On Mon, 2009-01-19 at 13:03 +0100, Jan Kara wrote:
> > > On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> > > > On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > > > > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > > > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > > > > and the journal didn't emit a write barrier (either because metadata is
> > > > > > > not being synched or barriers are disabled).
> > > > > > > 
> > > > > > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > > > > > > ---
> > > > > >   Only two minor nits:
> > > > > > 
> > > > > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
> > > > > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c	2009-01-16 22:18:53.000000000 +0900
> > > > > > > @@ -27,6 +27,7 @@
> > > > > > >  #include <linux/sched.h>
> > > > > > >  #include <linux/writeback.h>
> > > > > > >  #include <linux/jbd.h>
> > > > > > > +#include <linux/blkdev.h>
> > > > > > >  #include <linux/ext3_fs.h>
> > > > > > >  #include <linux/ext3_jbd.h>
> > > > > > >  
> > > > > > > @@ -45,6 +46,8 @@
> > > > > > >  int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > > > >  {
> > > > > > >  	struct inode *inode = dentry->d_inode;
> > > > > > > +	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > > > > +	unsigned long i_state = inode->i_state;
> > > > > > >  	int ret = 0;
> > > > > > >  
> > > > > > >  	J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > > > >  	 */
> > > > > > >  	if (ext3_should_journal_data(inode)) {
> > > > > > >  		ret = ext3_force_commit(inode->i_sb);
> > > > > > > +		if (!(journal->j_flags & JFS_BARRIER))
> > > > > > > +			goto no_journal_barrier;
> > > > > > >  		goto out;
> > > > > > >  	}
> > > > > > >  
> > > > > > > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > > > > -		goto out;
> > > > > > > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > > > > +		goto flush_blkdev;
> > > > > > >  
> > > > > > >  	/*
> > > > > > >  	 * The VFS has written the file data.  If the inode is unaltered
> > > > > > >  	 * then we need not start a commit.
> > > > > > >  	 */
> > > > > > > -	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > > +	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > > > >  		struct writeback_control wbc = {
> > > > > > >  			.sync_mode = WB_SYNC_ALL,
> > > > > > >  			.nr_to_write = 0, /* sys_fsync did this */
> > > > > > >  		};
> > > > > > >  		ret = sync_inode(inode, &wbc);
> > > > > > > +		if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > > > > +			goto no_journal_barrier;
> > > > > >   I cannot imagine "journal" will be NULL here.
> > > > > 
> > > > > I'll try to check whether that is always so just in case.
> > > > > 
> > > > > >   And we can also optimize here a bit and do "goto out" because here
> > > > > > we know the barrier has been issued.
> > > > > 
> > > > > Yep, I was considering the same optimization. By the way, I was
> > > > > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > > > > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
> > > > 
> > > > The last phrase should read " do not force a flush when "barrier=0" ".
> > >   I was hesitating about this a bit. But I don't think so. The reason is
> > > that POSIX (or any other reasonable specification) mandates that fsync()
> > > should return only after the data is safely on storage. So if we don't
> > > flush blockdevice caches, we effectively violate POSIX and we should never
> > > do that. With barriers the matter is a bit different - that is just a
> > > filesystem specific thing, no standard guarantees anything.
> > 
> > Hi Jan,
> > 
> > Sorry it's taken me so long to get back to you.
> > 
> > Thinking a lit bit more about this issue, it occurred to me that adding
> > a new mount option à la existing "barrier" is likely to be preferable.
> > As an example where such an option could make sense, let's consider a
> > system with battery-backup cache devices. Since the battery-backup
> > guarantees the data still not committed to the platter will not vanish
> > in the event of a power down, it should be possible to obtain a
> > performance gain by optimizing out the device flush on every
> > fsync()/fdatasync() call.
>   Yes, I came to this conclusion as well when we were discussing how to fix
> other filesystems which fail to issue flush in fsync
> (https://kerneltrap.org/mailarchive/linux-ext4/2009/1/22/4788004/thread).
> I think it makes sence there's a general mount option recognized at VFS
> level which would set superblock flag whether flush on fsync is needed or
> not. Then you could use this flag in your patch. 
>   So if you have time, please write a (separate) patch introducing this
> generic option. Thanks.

Sorry it's taken me so long to get back at you, but I got distracted
with some high priority stuff. I finally got around writing the patches
for both the kernel and util-linux which I am sending as a reply to this
email. Can you tell me if this is what you had in mind?

Regards,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* vfs: Improve readability off mount flag definitins by using offsets
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
@ 2009-02-12 10:35                                           ` Fernando Luis Vázquez Cao
  2009-02-12 10:36                                           ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
                                                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

Currently mount flags are defined using both decimal values and offsets which
is kind of confusing. Define all flags using offsets for the sake of
readability.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc3-orig/include/linux/fs.h linux-2.6.29-rc3/include/linux/fs.h
--- linux-2.6.29-rc3-orig/include/linux/fs.h	2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/include/linux/fs.h	2009-01-29 10:01:58.000000000 +0900
@@ -111,22 +111,22 @@ struct inodes_stat_t {
 /*
  * These are the fs-independent mount-flags: up to 32 flags are supported
  */
-#define MS_RDONLY	 1	/* Mount read-only */
-#define MS_NOSUID	 2	/* Ignore suid and sgid bits */
-#define MS_NODEV	 4	/* Disallow access to device special files */
-#define MS_NOEXEC	 8	/* Disallow program execution */
-#define MS_SYNCHRONOUS	16	/* Writes are synced at once */
-#define MS_REMOUNT	32	/* Alter flags of a mounted FS */
-#define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
-#define MS_DIRSYNC	128	/* Directory modifications are synchronous */
-#define MS_NOATIME	1024	/* Do not update access times. */
-#define MS_NODIRATIME	2048	/* Do not update directory access times */
-#define MS_BIND		4096
-#define MS_MOVE		8192
-#define MS_REC		16384
-#define MS_VERBOSE	32768	/* War is peace. Verbosity is silence.
+#define MS_RDONLY	(1<<0)	/* Mount read-only */
+#define MS_NOSUID	(1<<1)	/* Ignore suid and sgid bits */
+#define MS_NODEV	(1<<2)	/* Disallow access to device special files */
+#define MS_NOEXEC	(1<<3)	/* Disallow program execution */
+#define MS_SYNCHRONOUS	(1<<4)	/* Writes are synced at once */
+#define MS_REMOUNT	(1<<5)	/* Alter flags of a mounted FS */
+#define MS_MANDLOCK	(1<<6)	/* Allow mandatory locks on an FS */
+#define MS_DIRSYNC	(1<<7)	/* Directory modifications are synchronous */
+#define MS_NOATIME	(1<<10)	/* Do not update access times. */
+#define MS_NODIRATIME	(1<<11)	/* Do not update directory access times */
+#define MS_BIND		(1<<12)
+#define MS_MOVE		(1<<13)
+#define MS_REC		(1<<14)
+#define MS_VERBOSE	(1<<15)	/* War is peace. Verbosity is silence.
 				   MS_VERBOSE is deprecated. */
-#define MS_SILENT	32768
+#define MS_SILENT	(1<<15)
 #define MS_POSIXACL	(1<<16)	/* VFS does not apply the umask */
 #define MS_UNBINDABLE	(1<<17)	/* change to unbindable */
 #define MS_PRIVATE	(1<<18)	/* change to private */



^ permalink raw reply	[flat|nested] 74+ messages in thread

* vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
  2009-02-12 10:35                                           ` vfs: Improve readability off mount flag definitins by using offsets Fernando Luis Vázquez Cao
@ 2009-02-12 10:36                                           ` Fernando Luis Vázquez Cao
  2009-02-12 17:13                                             ` Eric Sandeen
  2009-02-12 10:37                                           ` util-linux: Add new mount options flushonfsync and noflushonfsync to mount Fernando Luis Vázquez Cao
                                                             ` (4 subsequent siblings)
  6 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

This mount flag will be used to determine whether the block device's write
cache should be flush or not on fsync()/fdatasync().

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc3-orig/fs/namespace.c linux-2.6.29-rc3/fs/namespace.c
--- linux-2.6.29-rc3-orig/fs/namespace.c	2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/fs/namespace.c	2009-01-29 18:40:04.000000000 +0900
@@ -1933,8 +1933,8 @@ long do_mount(char *dev_name, char *dir_
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
 
-	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
+	flags &= ~(MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_NOATIME |
+		   MS_NODIRATIME | MS_RELATIME | MS_ACTIVE | MS_KERNMOUNT);
 
 	/* ... and get the mountpoint */
 	retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
diff -urNp linux-2.6.29-rc3-orig/include/linux/fs.h linux-2.6.29-rc3/include/linux/fs.h
--- linux-2.6.29-rc3-orig/include/linux/fs.h	2009-01-29 10:04:18.000000000 +0900
+++ linux-2.6.29-rc3/include/linux/fs.h	2009-01-29 10:24:14.000000000 +0900
@@ -135,6 +135,8 @@ struct inodes_stat_t {
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+#define MS_FLUSHONFSYNC (1<<24)	/* Force block device flush on
+				   fsync()/fdatasync() */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 



^ permalink raw reply	[flat|nested] 74+ messages in thread

* util-linux: Add new mount options flushonfsync and noflushonfsync to mount
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
  2009-02-12 10:35                                           ` vfs: Improve readability off mount flag definitins by using offsets Fernando Luis Vázquez Cao
  2009-02-12 10:36                                           ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
@ 2009-02-12 10:37                                           ` Fernando Luis Vázquez Cao
  2009-02-12 10:38                                           ` util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page Fernando Luis Vázquez Cao
                                                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

This mount flag will be used to determine whether the block device's write
cache should be flush or not on fsync()/fdatasync().

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urp mount-orig/mount.c mount/mount.c
--- mount-orig/mount.c	2009-01-29 15:50:50.000000000 +0900
+++ mount/mount.c	2009-01-29 14:56:19.000000000 +0900
@@ -185,6 +185,13 @@ static const struct opt_map opt_map[] = 
   { "norelatime", 0, 1, MS_RELATIME }, /* Update access time without regard
 					  to mtime/ctime */
 #endif
+#ifdef MS_FLUSHONFSYNC
+  { "flushonfsync", 0, 0, MS_FLUSHONFSYNC },	/* Force block device flush on
+						   fsync()/fdatasync() */
+  { "noflushonfsync", 0, 1, MS_FLUSHONFSYNC },	/* Do not force block device
+						   flush on
+						   fsync()/fdatasync() */
+#endif
   { "nofail",	0, 0, MS_COMMENT},	/* Do not fail if ENOENT on dev */
   { NULL,	0, 0, 0		}
 };
diff -urp mount-orig/mount_constants.h mount/mount_constants.h
--- mount-orig/mount_constants.h	2008-04-22 17:59:53.000000000 +0900
+++ mount/mount_constants.h	2009-01-29 13:40:42.000000000 +0900
@@ -56,6 +56,10 @@
 #ifndef MS_SHARED
 #define MS_SHARED	(1<<20)	/* 1048576 Shared*/
 #endif
+#ifndef MS_FLUSHONFSYNC
+#define MS_FLUSHONFSYNC	(1<<24) /* Force block device flush on
+				   fsync()/fdatasync() */
+#endif
 /*
  * Magic mount flag number. Had to be or-ed to the flag values.
  */



^ permalink raw reply	[flat|nested] 74+ messages in thread

* util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
                                                             ` (2 preceding siblings ...)
  2009-02-12 10:37                                           ` util-linux: Add new mount options flushonfsync and noflushonfsync to mount Fernando Luis Vázquez Cao
@ 2009-02-12 10:38                                           ` Fernando Luis Vázquez Cao
  2009-02-12 10:38                                           ` block: Add block_flush_device() Fernando Luis Vázquez Cao
                                                             ` (2 subsequent siblings)
  6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urp mount-orig/mount.8 mount/mount.8
--- mount-orig/mount.8	2009-01-29 15:50:50.000000000 +0900
+++ mount/mount.8	2009-01-29 15:40:39.000000000 +0900
@@ -649,6 +649,18 @@ current modify or change time. (Similar 
 mutt or other applications that need to know if a file has been read
 since the last time it was modified.)
 .TP
+.B flushonfsync
+Force block device flush on fsync()/fdatasync(). When using storage equipped
+with write caches that can be explicitly flushed by the kernel, this guarantees
+that the data/metadata actually made it to the physical media after invoking
+the respective file data/metadata synchronization system call.
+.TP
+.B noflushonfsync
+Do not force a flush of the backing device's write cache when a file
+data/metadata synchronization system call is executed. Unless the device is
+equipped with a battery-backup write cache, this behavior can cause data loss
+and/or file system corruption in the event of a power failure.
+.TP
 .B noauto
 Can only be mounted explicitly (i.e., the
 .B \-a



^ permalink raw reply	[flat|nested] 74+ messages in thread

* block: Add block_flush_device()
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
                                                             ` (3 preceding siblings ...)
  2009-02-12 10:38                                           ` util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page Fernando Luis Vázquez Cao
@ 2009-02-12 10:38                                           ` Fernando Luis Vázquez Cao
  2009-02-12 10:39                                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
  2009-02-12 10:40                                           ` ext4: " Fernando Luis Vázquez Cao
  6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc3-orig/fs/buffer.c linux-2.6.29-rc3/fs/buffer.c
--- linux-2.6.29-rc3-orig/fs/buffer.c	2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/fs/buffer.c	2009-01-29 20:11:29.000000000 +0900
@@ -165,6 +165,24 @@ void end_buffer_write_sync(struct buffer
 	put_bh(bh);
 }
 
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+	int ret = 0;
+
+	if (!(sb->s_flags & MS_FLUSHONFSYNC))
+		return ret;
+
+	ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+	if (ret == -EOPNOTSUPP)
+		return 0;
+
+	return ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
+
 /*
  * Write out and wait upon all the dirty data associated with a block
  * device via its mapping.  Does not take the superblock lock.
diff -urNp linux-2.6.29-rc3-orig/include/linux/buffer_head.h linux-2.6.29-rc3/include/linux/buffer_head.h
--- linux-2.6.29-rc3-orig/include/linux/buffer_head.h	2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/include/linux/buffer_head.h	2009-01-29 19:26:33.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
 int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
 int nobh_writepage(struct page *page, get_block_t *get_block,
                         struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);
 
 void buffer_init(void);
 



^ permalink raw reply	[flat|nested] 74+ messages in thread

* ext3: call blkdev_issue_flush on fsync
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
                                                             ` (4 preceding siblings ...)
  2009-02-12 10:38                                           ` block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-02-12 10:39                                           ` Fernando Luis Vázquez Cao
  2009-02-12 10:40                                           ` ext4: " Fernando Luis Vázquez Cao
  6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc4-orig/fs/ext3/fsync.c linux-2.6.29-rc4/fs/ext3/fsync.c
--- linux-2.6.29-rc4-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc4/fs/ext3/fsync.c	2009-02-12 19:20:23.000000000 +0900
@@ -45,6 +45,8 @@
 int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,34 @@ int ext3_sync_file(struct file * file, s
 	 */
 	if (ext3_should_journal_data(inode)) {
 		ret = ext3_force_commit(inode->i_sb);
+		if (!(journal->j_flags & JFS_BARRIER))
+			goto no_journal_barrier;
 		goto out;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC))
+		goto flush_blkdev;
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+		if (journal && !(journal->j_flags & JFS_BARRIER))
+			goto no_journal_barrier;
+		goto out;
 	}
+
+flush_blkdev:
+	if (!(i_state & I_DIRTY_PAGES))
+		goto out;
+no_journal_barrier:
+	block_flush_device(inode->i_sb);
 out:
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* ext4: call blkdev_issue_flush on fsync
  2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
                                                             ` (5 preceding siblings ...)
  2009-02-12 10:39                                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
@ 2009-02-12 10:40                                           ` Fernando Luis Vázquez Cao
  2009-02-15 22:46                                             ` Theodore Tso
  6 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc4-orig/fs/ext4/fsync.c linux-2.6.29-rc4/fs/ext4/fsync.c
--- linux-2.6.29-rc4-orig/fs/ext4/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc4/fs/ext4/fsync.c	2009-02-12 19:22:28.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
 {
 	struct inode *inode = dentry->d_inode;
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,34 @@ int ext4_sync_file(struct file *file, st
 	 */
 	if (ext4_should_journal_data(inode)) {
 		ret = ext4_force_commit(inode->i_sb);
+		if (!(journal->j_flags & JBD2_BARRIER))
+			goto no_journal_barrier;
 		goto out;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC))
+		goto flush_blkdev;
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
-		if (journal && (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		if (journal && !(journal->j_flags & JBD2_BARRIER))
+			goto no_journal_barrier;
+		goto out;
 	}
+
+flush_blkdev:
+	if (!(i_state & I_DIRTY_PAGES))
+		goto out;
+no_journal_barrier:
+	block_flush_device(inode->i_sb);
 out:
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-01-14 14:08               ` Jens Axboe
  2009-01-14 14:34                 ` Theodore Tso
@ 2009-02-12 16:43                 ` Eric Sandeen
  2009-02-16 12:09                   ` Jens Axboe
  1 sibling, 1 reply; 74+ messages in thread
From: Eric Sandeen @ 2009-02-12 16:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Theodore Tso, Fernando Luis Vázquez Cao, Alan Cox,
	Pavel Machek, kernel list

Jens Axboe wrote:
> On Wed, Jan 14 2009, Jan Kara wrote:
>>> I'm not sure what you mean; if the barrier operation isn't flushing
>>> all of the caches all the way out to the iron oxide, it's not going to
>>> be working properly no matter where it is being called, whether it's
>>> in ext4_sync_file() or in jbd2's journal_submit_commit_record().
>>   Well, I thought that a barrier, as an abstraction, only guarantees that
>> any IO which happened before the barrier hits the iron before any IO which
>> has been submitted after a barrier. This is actually enough for a
>> journalling to work correctly but it's not enough for fsync() guarantees.
>> But I might be wrong...
> 
> It also guarentees that when you get a completion for that barrier
> write, it's on safe storage. Think of it as a flush-write-flush
> operation, in the presence of write back caching.

(sorry for chiming in so late)

Jens, isn't this just the way it's implemented today?  At some point
couldn't a barrier bio simply be a reordering barrier that the storage
can use when destaging the write cache, rather than the heavy-handed
flush-write-flush we have today?

I guess it's a question of the intended semantics of a barrier bio, vs.
today's implementation based on current hardware functionality...

-Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 10:36                                           ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
@ 2009-02-12 17:13                                             ` Eric Sandeen
  2009-02-12 17:29                                               ` Jeff Garzik
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-12 17:13 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, fernando, Ric Wheeler

Fernando Luis Vázquez Cao wrote:
> This mount flag will be used to determine whether the block device's write
> cache should be flush or not on fsync()/fdatasync().
> 
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---

Again, apologies for chiming in late.

But wouldn't it be better to make this a block device property rather
than a new filesystem mount option?

That way the filesystem can always do "the right thing" and call the
blkdev flush on fsync.

The block device *could* choose to ignore this in hardware if it knows
it's built with a nonvolatile write cache or if it has no write cache.

Somewhere in the middle, if an administrator knows they have a UPS they
trust and hardware that stays connected to it, they could tune the bdev
to ignore these flush requests.

Also that way if you have 8 partitions on a battery-backed blockdev, you
can tune it once, instead of needing to mount all 8 filesystems with the
new option.

Thoughts?

Thanks,
-Eric

> diff -urNp linux-2.6.29-rc3-orig/fs/namespace.c linux-2.6.29-rc3/fs/namespace.c
> --- linux-2.6.29-rc3-orig/fs/namespace.c	2009-01-29 09:47:51.000000000 +0900
> +++ linux-2.6.29-rc3/fs/namespace.c	2009-01-29 18:40:04.000000000 +0900
> @@ -1933,8 +1933,8 @@ long do_mount(char *dev_name, char *dir_
>  	if (flags & MS_RDONLY)
>  		mnt_flags |= MNT_READONLY;
>  
> -	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
> -		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
> +	flags &= ~(MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_NOATIME |
> +		   MS_NODIRATIME | MS_RELATIME | MS_ACTIVE | MS_KERNMOUNT);
>  
>  	/* ... and get the mountpoint */
>  	retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
> diff -urNp linux-2.6.29-rc3-orig/include/linux/fs.h linux-2.6.29-rc3/include/linux/fs.h
> --- linux-2.6.29-rc3-orig/include/linux/fs.h	2009-01-29 10:04:18.000000000 +0900
> +++ linux-2.6.29-rc3/include/linux/fs.h	2009-01-29 10:24:14.000000000 +0900
> @@ -135,6 +135,8 @@ struct inodes_stat_t {
>  #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
>  #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
>  #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
> +#define MS_FLUSHONFSYNC (1<<24)	/* Force block device flush on
> +				   fsync()/fdatasync() */
>  #define MS_ACTIVE	(1<<30)
>  #define MS_NOUSER	(1<<31)
>  
> 
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 17:13                                             ` Eric Sandeen
@ 2009-02-12 17:29                                               ` Jeff Garzik
  2009-02-14 15:36                                                 ` Christoph Hellwig
  2009-02-12 21:23                                               ` Jan Kara
  2009-02-13  1:14                                               ` Fernando Luis Vázquez Cao
  2 siblings, 1 reply; 74+ messages in thread
From: Jeff Garzik @ 2009-02-12 17:29 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Fernando Luis Vázquez Cao, Jan Kara, Theodore Tso, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler

Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
>> This mount flag will be used to determine whether the block device's write
>> cache should be flush or not on fsync()/fdatasync().
>>
>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>> ---
> 
> Again, apologies for chiming in late.
> 
> But wouldn't it be better to make this a block device property rather
> than a new filesystem mount option?
> 
> That way the filesystem can always do "the right thing" and call the
> blkdev flush on fsync.
> 
> The block device *could* choose to ignore this in hardware if it knows
> it's built with a nonvolatile write cache or if it has no write cache.

That would certainly be my preference -- turn this ON by default, and 
them if a layer NEEDS to ignore it, it can.

	Jeff





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 17:13                                             ` Eric Sandeen
  2009-02-12 17:29                                               ` Jeff Garzik
@ 2009-02-12 21:23                                               ` Jan Kara
  2009-02-12 21:30                                                 ` Eric Sandeen
  2009-02-13  1:14                                               ` Fernando Luis Vázquez Cao
  2 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-02-12 21:23 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Fernando Luis Vázquez Cao, Theodore Tso, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler

On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > This mount flag will be used to determine whether the block device's write
> > cache should be flush or not on fsync()/fdatasync().
> > 
> > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > ---
> 
> Again, apologies for chiming in late.
> 
> But wouldn't it be better to make this a block device property rather
> than a new filesystem mount option?
  Hum, that's an interesting idea. Yes, probably it makes even more sence
than a global mount option.

> That way the filesystem can always do "the right thing" and call the
> blkdev flush on fsync.
> 
> The block device *could* choose to ignore this in hardware if it knows
> it's built with a nonvolatile write cache or if it has no write cache.
> 
> Somewhere in the middle, if an administrator knows they have a UPS they
> trust and hardware that stays connected to it, they could tune the bdev
> to ignore these flush requests.
> 
> Also that way if you have 8 partitions on a battery-backed blockdev, you
> can tune it once, instead of needing to mount all 8 filesystems with the
> new option.
  Yes, but OTOH we should give sysadmin a possibility to enable / disable
it on just some partitions. I don't see a reasonable use for that but people
tend to do strange things ;) and here isn't probably a strong reason to not
allow them.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 21:23                                               ` Jan Kara
@ 2009-02-12 21:30                                                 ` Eric Sandeen
  2009-02-13  1:47                                                   ` Fernando Luis Vázquez Cao
  2009-02-13  2:23                                                   ` Theodore Tso
  0 siblings, 2 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-12 21:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fernando Luis Vázquez Cao, Theodore Tso, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler

Jan Kara wrote:
> On Thu 12-02-09 11:13:37, Eric Sandeen wrote:

...

>> Also that way if you have 8 partitions on a battery-backed blockdev, you
>> can tune it once, instead of needing to mount all 8 filesystems with the
>> new option.
>   Yes, but OTOH we should give sysadmin a possibility to enable / disable
> it on just some partitions. I don't see a reasonable use for that but people
> tend to do strange things ;) and here isn't probably a strong reason to not
> allow them.
> 
> 								Honza

But nobody has asked for that, have they?  So why offer it up a this point?

They could use LD_PRELOAD to make fsync a no-op if they really don't
care for it, I guess... though that's not easily per-fs either.

But do we really want to go out of our way to enable people to
short-circuit data integrity paths and then file bugs when their files
go missing? :)

(I guess the blockdev tunable is similarly dangerous, but it more
clearly meets the explicit need (writecache-safe devices))

-Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 17:13                                             ` Eric Sandeen
  2009-02-12 17:29                                               ` Jeff Garzik
  2009-02-12 21:23                                               ` Jan Kara
@ 2009-02-13  1:14                                               ` Fernando Luis Vázquez Cao
  2009-02-13  6:20                                                 ` Eric Sandeen
  2 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-13  1:14 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, fernando, Ric Wheeler

On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > This mount flag will be used to determine whether the block device's write
> > cache should be flush or not on fsync()/fdatasync().
> > 
> > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > ---
> 
> Again, apologies for chiming in late.
> 
> But wouldn't it be better to make this a block device property rather
> than a new filesystem mount option?
> 
> That way the filesystem can always do "the right thing" and call the
> blkdev flush on fsync.
> 
> The block device *could* choose to ignore this in hardware if it knows
> it's built with a nonvolatile write cache or if it has no write cache.
> 
> Somewhere in the middle, if an administrator knows they have a UPS they
> trust and hardware that stays connected to it, they could tune the bdev
> to ignore these flush requests.
> 
> Also that way if you have 8 partitions on a battery-backed blockdev, you
> can tune it once, instead of needing to mount all 8 filesystems with the
> new option.

The main reason I decided to go for the mount option approach is to be
consistent with what we do when it comes to write barriers. Treating one
as a mount option and the other as a (possibly) sysfs tunable property
seems a bit confusing to me.

Do you suggest using sysfs tunables instead?

- Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 21:30                                                 ` Eric Sandeen
@ 2009-02-13  1:47                                                   ` Fernando Luis Vázquez Cao
  2009-02-13  6:07                                                     ` Eric Sandeen
  2009-02-13  2:23                                                   ` Theodore Tso
  1 sibling, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-13  1:47 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, fernando, Ric Wheeler

On Thu, 2009-02-12 at 15:30 -0600, Eric Sandeen wrote:
> Jan Kara wrote:
> > On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
> 
> ...
> 
> >> Also that way if you have 8 partitions on a battery-backed blockdev, you
> >> can tune it once, instead of needing to mount all 8 filesystems with the
> >> new option.
> >   Yes, but OTOH we should give sysadmin a possibility to enable / disable
> > it on just some partitions. I don't see a reasonable use for that but people
> > tend to do strange things ;) and here isn't probably a strong reason to not
> > allow them.
> > 
> > 								Honza
> 
> But nobody has asked for that, have they?  So why offer it up a this point?
> 
> They could use LD_PRELOAD to make fsync a no-op if they really don't
> care for it, I guess... though that's not easily per-fs either.
> 
> But do we really want to go out of our way to enable people to
> short-circuit data integrity paths and then file bugs when their files
> go missing? :)

Well, it is just a matter of using safe defaults. IMHO, a scenario where
the administrator wants to optimize writes to a certain partition and
_explicitly_ clears MS_FLUSHONFSYNC on that superblock is not completely
unreasonable.

> (I guess the blockdev tunable is similarly dangerous, but it more
> clearly meets the explicit need (writecache-safe devices))

If distributions use sane defaults and we document the mount option or
bdev tunable properly I guess it might make sense to allow system
administrators to shoot themselves in the foot.

(By the way, in this patch-set a patch for mount(8) is included.)

- Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 21:30                                                 ` Eric Sandeen
  2009-02-13  1:47                                                   ` Fernando Luis Vázquez Cao
@ 2009-02-13  2:23                                                   ` Theodore Tso
  2009-02-22 14:15                                                     ` Pavel Machek
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-02-13  2:23 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
	kernel list, Jens Axboe, fernando, Ric Wheeler

On Thu, Feb 12, 2009 at 03:30:10PM -0600, Eric Sandeen wrote:
> >   Yes, but OTOH we should give sysadmin a possibility to enable / disable
> > it on just some partitions. I don't see a reasonable use for that but people
> > tend to do strange things ;) and here isn't probably a strong reason to not
> > allow them.
> > 
>
> But nobody has asked for that, have they?  So why offer it up a this point?
> 
> They could use LD_PRELOAD to make fsync a no-op if they really don't
> care for it, I guess... though that's not easily per-fs either.

Actually, Bart Samwel at FOSDEM talked to me and asked for something
similar --- what we came up which meant his request while still being
standards-compliant was a per-process personality flag which had three
options:

     *)  Always honor fsync() calls (the default)
     *)  Never honor fsync() calls
     *)  Only honor fsync() calls if a global "honor fsync" flag
           (which would be manipulated by the laptop mode scripts)
	   is set.

The flag would be reset to the default across a setuid exec, but would
otherwise be inherited across fork()'s.  It might be possible to
set/get the flag via a /proc interface.

The basic idea is that laptop systems where the system administrator
wants longer battery life (and trusts the battery not to suddenly give
out) more than they care about fsync() guarantees can set up a pam
library which sets the flag for at login time so that all of the
user's processes can be set up not to honor fsync() calls; however,
all of the system daemons would still function normally.   

       	   	  	  	      	       - Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13  1:47                                                   ` Fernando Luis Vázquez Cao
@ 2009-02-13  6:07                                                     ` Eric Sandeen
  0 siblings, 0 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-13  6:07 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, fernando, Ric Wheeler

Fernando Luis Vázquez Cao wrote:
> On Thu, 2009-02-12 at 15:30 -0600, Eric Sandeen wrote:
>> Jan Kara wrote:
>>> On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
>> ...
>>
>>>> Also that way if you have 8 partitions on a battery-backed blockdev, you
>>>> can tune it once, instead of needing to mount all 8 filesystems with the
>>>> new option.
>>>   Yes, but OTOH we should give sysadmin a possibility to enable / disable
>>> it on just some partitions. I don't see a reasonable use for that but people
>>> tend to do strange things ;) and here isn't probably a strong reason to not
>>> allow them.
>>>
>>> 								Honza
>> But nobody has asked for that, have they?  So why offer it up a this point?
>>
>> They could use LD_PRELOAD to make fsync a no-op if they really don't
>> care for it, I guess... though that's not easily per-fs either.
>>
>> But do we really want to go out of our way to enable people to
>> short-circuit data integrity paths and then file bugs when their files
>> go missing? :)
> 
> Well, it is just a matter of using safe defaults. IMHO, a scenario where
> the administrator wants to optimize writes to a certain partition and
> _explicitly_ clears MS_FLUSHONFSYNC on that superblock is not completely
> unreasonable.

One case is "this device is safe with a write cache and flush is not
necessary for data consistency" - that's the per-bdev setting.

The other case is "I don't really care and I just want to go faster" -
that's the per-mount setting.

I'd be much more likely to support the first case, as it's needed for
maximum performance w/o sacrificing correctness, when properly used.

The latter case is really only for cutting corners and giving people
more rope than they need to hang themselves.

>> (I guess the blockdev tunable is similarly dangerous, but it more
>> clearly meets the explicit need (writecache-safe devices))
> 
> If distributions use sane defaults and we document the mount option or
> bdev tunable properly I guess it might make sense to allow system
> administrators to shoot themselves in the foot.
> 
> (By the way, in this patch-set a patch for mount(8) is included.)

... one of the advantages of making it a bdev tunable is that you don't
have to mess with mount(8) :)

-Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13  1:14                                               ` Fernando Luis Vázquez Cao
@ 2009-02-13  6:20                                                 ` Eric Sandeen
  2009-02-13 10:36                                                   ` Fernando Luis Vázquez Cao
  2009-02-13 12:20                                                   ` Dave Chinner
  0 siblings, 2 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-13  6:20 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, fernando, Ric Wheeler

Fernando Luis Vázquez Cao wrote:
> On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
>> Fernando Luis Vázquez Cao wrote:
>>> This mount flag will be used to determine whether the block device's write
>>> cache should be flush or not on fsync()/fdatasync().
>>>
>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>> ---
>> Again, apologies for chiming in late.
>>
>> But wouldn't it be better to make this a block device property rather
>> than a new filesystem mount option?
>>
>> That way the filesystem can always do "the right thing" and call the
>> blkdev flush on fsync.
>>
>> The block device *could* choose to ignore this in hardware if it knows
>> it's built with a nonvolatile write cache or if it has no write cache.
>>
>> Somewhere in the middle, if an administrator knows they have a UPS they
>> trust and hardware that stays connected to it, they could tune the bdev
>> to ignore these flush requests.
>>
>> Also that way if you have 8 partitions on a battery-backed blockdev, you
>> can tune it once, instead of needing to mount all 8 filesystems with the
>> new option.
> 
> The main reason I decided to go for the mount option approach is to be
> consistent with what we do when it comes to write barriers. Treating one
> as a mount option and the other as a (possibly) sysfs tunable property
> seems a bit confusing to me.

well... technically, I think barriers really *should* mean "don't
reorder these writes, I need them this way for consistency" - and that
is really specific to the fs implementation, isn't it? (we just happen
to implement them as cache flushes) and so that is a per-fs setting, I
think.

Maybe there is no good argument for ignoring barriers on one fs, and
implementing them on another, other than playing fast & loose &
dangerous.... hrm.

> Do you suggest using sysfs tunables instead?

For a per-bdev flush setting, yes...

I guess I'll have to try to convince myself one way or another whether
barrier mount options are consistent with this view.  :)

I guess sometimes you do have workloads where you simply want speed, and
on a crash you start over.  In this case you don't care about barriers
(ordering constraints - if you don't care about fs integrity if fsck or
re-mkfs is ok) or flushing (caches - if you don't care about data
integrity, you regenerate your results).  That could vary from fs to fs....

I'm just a little leery of the "dangerous" mount option proliferation, I
guess.

-Eric

> - Fernando
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13  6:20                                                 ` Eric Sandeen
@ 2009-02-13 10:36                                                   ` Fernando Luis Vázquez Cao
  2009-02-13 12:20                                                   ` Dave Chinner
  1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-13 10:36 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
	Jens Axboe, fernando, Ric Wheeler

On Fri, 2009-02-13 at 00:20 -0600, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
> >> Fernando Luis Vázquez Cao wrote:
> >>> This mount flag will be used to determine whether the block device's write
> >>> cache should be flush or not on fsync()/fdatasync().
> >>>
> >>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> >>> ---
> >> Again, apologies for chiming in late.
> >>
> >> But wouldn't it be better to make this a block device property rather
> >> than a new filesystem mount option?
> >>
> >> That way the filesystem can always do "the right thing" and call the
> >> blkdev flush on fsync.
> >>
> >> The block device *could* choose to ignore this in hardware if it knows
> >> it's built with a nonvolatile write cache or if it has no write cache.
> >>
> >> Somewhere in the middle, if an administrator knows they have a UPS they
> >> trust and hardware that stays connected to it, they could tune the bdev
> >> to ignore these flush requests.
> >>
> >> Also that way if you have 8 partitions on a battery-backed blockdev, you
> >> can tune it once, instead of needing to mount all 8 filesystems with the
> >> new option.
> > 
> > The main reason I decided to go for the mount option approach is to be
> > consistent with what we do when it comes to write barriers. Treating one
> > as a mount option and the other as a (possibly) sysfs tunable property
> > seems a bit confusing to me.
> 
> well... technically, I think barriers really *should* mean "don't
> reorder these writes, I need them this way for consistency" - and that
> is really specific to the fs implementation, isn't it? (we just happen
> to implement them as cache flushes) and so that is a per-fs setting, I
> think.
> 
> Maybe there is no good argument for ignoring barriers on one fs, and
> implementing them on another, other than playing fast & loose &
> dangerous.... hrm.
> 
> > Do you suggest using sysfs tunables instead?
> 
> For a per-bdev flush setting, yes...
> 
> I guess I'll have to try to convince myself one way or another whether
> barrier mount options are consistent with this view.  :)

I went through the same process :), and finally concluded that in both
cases it all comes down a trade-off between between integrity (be it
filesystem integrity or data integrity) and speed. Since the desired
behavior could vary from filesystem to filesystem and for the sake of
consistency with barriers the mount option approach seemed to make more
sense.

> I guess sometimes you do have workloads where you simply want speed, and
> on a crash you start over.  In this case you don't care about barriers
> (ordering constraints - if you don't care about fs integrity if fsck or
> re-mkfs is ok) or flushing (caches - if you don't care about data
> integrity, you regenerate your results).  That could vary from fs to fs....
> 
> I'm just a little leery of the "dangerous" mount option proliferation, I
> guess.

If we can have distros make "barrier=1,flushonfsync" be the default
setting and document these two mount options properly by explicitly
indicating the dangers of deviating from these defaults, I think we'll
be heading in the right direction.

By the way, apart from the issue of whether flushonsync should be a
mount option or sysfs tunable is there any other issue with the patches?
Jan, have I addressed all your concerns?

Regards,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13  6:20                                                 ` Eric Sandeen
  2009-02-13 10:36                                                   ` Fernando Luis Vázquez Cao
@ 2009-02-13 12:20                                                   ` Dave Chinner
  2009-02-13 16:29                                                     ` Fernando Luis Vazquez Cao
  1 sibling, 1 reply; 74+ messages in thread
From: Dave Chinner @ 2009-02-13 12:20 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Fernando Luis Vázquez Cao, Jan Kara, Theodore Tso, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler

On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> I'm just a little leery of the "dangerous" mount option proliferation, I
> guess.

You're not the only one, Eric. It's bad enough having to explain to
users what barriers do once they have lost data after a power loss,
let alone confusing them further by adding more mount options they
will get wrong by accident....

Quite frankly, the VFS should do stuff that is slow and safe
and filesystems can choose to ignore the VFS (via filesystem
specific mount options) if they want to be fast and potentially
unsafe.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13 12:20                                                   ` Dave Chinner
@ 2009-02-13 16:29                                                     ` Fernando Luis Vazquez Cao
  2009-02-14 11:24                                                       ` Dave Chinner
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vazquez Cao @ 2009-02-13 16:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Sandeen, Fernando Luis Vázquez Cao, Jan Kara,
	Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	Ric Wheeler

On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > I'm just a little leery of the "dangerous" mount option proliferation, I
> > guess.
> 
> You're not the only one, Eric. It's bad enough having to explain to
> users what barriers do once they have lost data after a power loss,
> let alone confusing them further by adding more mount options they
> will get wrong by accident....

That is precisely the reason why we should use sensible defaults, which
in this case means enabling barriers and flushing disk caches on
fsync()/fdatasync() by default.

Adding either a new mount option (as you yourself suggest below) or a
sysfs tunable is desirable for those cases when we really do not need to
flush the disk write cache to guarantee integrity (battery-backed block
devices come to mind), or we want to be fast at the cost of potentially
losing some data.

> Quite frankly, the VFS should do stuff that is slow and safe
> and filesystems can choose to ignore the VFS (via filesystem
> specific mount options) if they want to be fast and potentially
> unsafe.

To avoid unnecessary flushes and allow for filesystem-specific
optimizations I was considering the following approach:

1- Add flushonfsync mount option (as an aside, I am of the opinion that
it should be set by default).
2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
flushes the underlying device accordingly. With this we would cover all
filesystems that use the vfs-provided file_fsync() as their fsync method
(commonly used filesystems such as fat fall in this group).
3- Advanced filesystems (ext3/4, XFS, btrfs, etc) which provide their
own fsync implementations are allowed to perform filesystem-specific
optimizations there to minimize the number of flushes and maximize
throughput.

In this patch set I implemented (1) and (3) for ext3/4 to have some code
to comment on.

Does this approach make sense? Thoughts?

- Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13 16:29                                                     ` Fernando Luis Vazquez Cao
@ 2009-02-14 11:24                                                       ` Dave Chinner
  2009-02-14 13:03                                                         ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Dave Chinner @ 2009-02-14 11:24 UTC (permalink / raw)
  To: Fernando Luis Vazquez Cao
  Cc: Eric Sandeen, Fernando Luis Vázquez Cao, Jan Kara,
	Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	Ric Wheeler

On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > guess.
> > 
> > You're not the only one, Eric. It's bad enough having to explain to
> > users what barriers do once they have lost data after a power loss,
> > let alone confusing them further by adding more mount options they
> > will get wrong by accident....
> 
> That is precisely the reason why we should use sensible defaults, which
> in this case means enabling barriers and flushing disk caches on
> fsync()/fdatasync() by default.
> 
> Adding either a new mount option (as you yourself suggest below) or a
> sysfs tunable is desirable for those cases when we really do not need to
> flush the disk write cache to guarantee integrity (battery-backed block
> devices come to mind), or we want to be fast at the cost of potentially
> losing some data.

Mount options are the wrong place for this. if you want to change
the behaviour of the block device, then it should be at that level.

> > Quite frankly, the VFS should do stuff that is slow and safe
> > and filesystems can choose to ignore the VFS (via filesystem
> > specific mount options) if they want to be fast and potentially
> > unsafe.
> 
> To avoid unnecessary flushes and allow for filesystem-specific
> optimizations I was considering the following approach:
> 
> 1- Add flushonfsync mount option (as an aside, I am of the opinion that
> it should be set by default).

No mount option - too confusing for someone to work out what
combination of barriers and flushing for things to work correctly.
Just make filesystems issue the necessary flush calls or barrier
IOs and allow the block devices to ignore flushes.

> 2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
> flushes the underlying device accordingly. With this we would cover all
> filesystems that use the vfs-provided file_fsync() as their fsync method
> (commonly used filesystems such as fat fall in this group).

Just make it flush the block device.

> 3- Advanced filesystems (ext3/4, XFS, btrfs, etc) which provide their
> own fsync implementations are allowed to perform filesystem-specific
> optimizations there to minimize the number of flushes and maximize
> throughput.

Um, you are describing what we already have in place.  Almost every
filesystem provides it's own ->fsync method, not just the "advanced"
ones. It is those methods that need to be fixed to issue flushes, not just
file_fsync().

> In this patch set I implemented (1) and (3) for ext3/4 to have some code
> to comment on.

I don't think we want (1) at all, and I thought that if ext3/4 are using
barriers then the barrier I/O issued by the journal does the flush
already. Hence (3) is redundant, right?

FWIW, block device flushes are implemented by barrier IOs, so if
the underlying block device doesn't support barriers then you can't
flush the cache, either...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-14 11:24                                                       ` Dave Chinner
@ 2009-02-14 13:03                                                         ` Fernando Luis Vázquez Cao
  2009-02-14 13:19                                                           ` Fernando Luis Vázquez Cao
  2009-02-15  2:48                                                           ` Dave Chinner
  0 siblings, 2 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-14 13:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
	Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler

On Sat, 2009-02-14 at 22:24 +1100, Dave Chinner wrote:
> On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> > On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > > guess.
> > > 
> > > You're not the only one, Eric. It's bad enough having to explain to
> > > users what barriers do once they have lost data after a power loss,
> > > let alone confusing them further by adding more mount options they
> > > will get wrong by accident....
> > 
> > That is precisely the reason why we should use sensible defaults, which
> > in this case means enabling barriers and flushing disk caches on
> > fsync()/fdatasync() by default.
> > 
> > Adding either a new mount option (as you yourself suggest below) or a
> > sysfs tunable is desirable for those cases when we really do not need to
> > flush the disk write cache to guarantee integrity (battery-backed block
> > devices come to mind), or we want to be fast at the cost of potentially
> > losing some data.
> 
> Mount options are the wrong place for this. if you want to change
> the behaviour of the block device, then it should be at that level.

To be more precise, what we are trying to change is the behavior of
fsync()/fdatasync(), which users might want to change on a per-partition
basis. I guess this is the reason the barrier switch was made a mount
option, and I just wanted to be consistent.

My fear is that making one of them a mount option (barriers) and the
other a sysfs-tunable block device property (device flushes on fsync())
might end up creating more confusion than using a mount option for both.

Anyway, I do not have strong feelings on this issue and if there is
consensus I would be willing to change the patches so that flushonfsync
becomes a per block device tunable instead.

> > > Quite frankly, the VFS should do stuff that is slow and safe
> > > and filesystems can choose to ignore the VFS (via filesystem
> > > specific mount options) if they want to be fast and potentially
> > > unsafe.
> > 
> > To avoid unnecessary flushes and allow for filesystem-specific
> > optimizations I was considering the following approach:
> > 
> > 1- Add flushonfsync mount option (as an aside, I am of the opinion that
> > it should be set by default).
> 
> No mount option - too confusing for someone to work out what
> combination of barriers and flushing for things to work correctly.

As I suggested in a previous email, it is just a matter of using a safe
combination by default so that users do not need to figure out anything.

> Just make filesystems issue the necessary flush calls or barrier IOs

"ext3: call blkdev_issue_flush on fsync" and "ext4: call
blkdev_issue_flush on fsync" in this patch set implement just that for
ext3/4.

>  and allow the block devices to ignore flushes.

Wouldn't it make more sense to avoid sending bios down the block layer
which we can know in advance are going to be ignored by the block
device?

> > 2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
> > flushes the underlying device accordingly. With this we would cover all
> > filesystems that use the vfs-provided file_fsync() as their fsync method
> > (commonly used filesystems such as fat fall in this group).
> 
> Just make it flush the block device.

I wrote a patch that does exactly that but, in addition, it checks
whether FLUSHONFSYNC is set to avoid sending unnecessary flushes down
the block layer (this patch is not included in this patch-set, but I
will add it in the next iteration).

As I mentioned above, if everyone thinks this small optimization not
elegant or an undesirable layering violation I will remove it.

> > 3- Advanced filesystems (ext3/4, XFS, btrfs, etc) which provide their
> > own fsync implementations are allowed to perform filesystem-specific
> > optimizations there to minimize the number of flushes and maximize
> > throughput.
> 
> Um, you are describing what we already have in place.  Almost every
> filesystem provides it's own ->fsync method, not just the "advanced"
> ones.

Yes, I know. There are some remarkable exceptions such as fat, though.

>  It is those methods that need to be fixed to issue flushes, not just
> file_fsync().

Exactly, and this patch-set is my first attempt at that. For the first
submission I limited myself to fixing only ext3/4 so that I can get some
early feedback on my approach before moving forward.

> > In this patch set I implemented (1) and (3) for ext3/4 to have some code
> > to comment on.
> 
> I don't think we want (1) at all, and I thought that if ext3/4 are using
> barriers then the barrier I/O issued by the journal does the flush
> already. Hence (3) is redundant, right?

No, it is no redundant because a barrier is not issued in all cases. The
aforementioned two patches fix ext3/4 by emitting a device flush only
when necessary (i.e. when a barrier would not be emitted).

My impression is that we all agree in the basic approach, the only point
of contention being whether filesystems/VFS should be allowed to
optimize out disk flushes when the user told the kernel to do so (be it
via a sysfs tunable or mount option).

Cheers,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-14 13:03                                                         ` Fernando Luis Vázquez Cao
@ 2009-02-14 13:19                                                           ` Fernando Luis Vázquez Cao
  2009-02-15  2:48                                                           ` Dave Chinner
  1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-14 13:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
	Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler

On Sat, 2009-02-14 at 22:03 +0900, Fernando Luis Vázquez Cao wrote:
> > > 2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
> > > flushes the underlying device accordingly. With this we would cover all
> > > filesystems that use the vfs-provided file_fsync() as their fsync method
> > > (commonly used filesystems such as fat fall in this group).
> > 
> > Just make it flush the block device.
> 
> I wrote a patch that does exactly that but, in addition, it checks
> whether FLUSHONFSYNC is set to avoid sending unnecessary flushes down
> the block layer (this patch is not included in this patch-set, but I
> will add it in the next iteration).

I probably should have mentioned in my proposal all filesystems would
just call the helper function block_flush_device() to flush the
underlying block device, unconditionally.

This helper function looks like this:

+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+       int ret = 0;
+
+       if (!(sb->s_flags & MS_FLUSHONFSYNC))
+               return ret;
+
+       ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+       if (ret == -EOPNOTSUPP)
+               return 0;
+
+       return ret;
+}

As you can see the check for flushonfsync is done here, so changing my
patches along the lines you suggest would be a trivial two lines patch.

- Fernando

> As I mentioned above, if everyone thinks this small optimization not
> elegant or an undesirable layering violation I will remove it.




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-12 17:29                                               ` Jeff Garzik
@ 2009-02-14 15:36                                                 ` Christoph Hellwig
  2009-02-15  7:23                                                   ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2009-02-14 15:36 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Eric Sandeen, Fernando Luis V?zquez Cao, Jan Kara, Theodore Tso,
	Alan Cox, Pavel Machek, kernel list, Jens Axboe, fernando,
	Ric Wheeler

On Thu, Feb 12, 2009 at 12:29:52PM -0500, Jeff Garzik wrote:
>> The block device *could* choose to ignore this in hardware if it knows
>> it's built with a nonvolatile write cache or if it has no write cache.
>
> That would certainly be my preference -- turn this ON by default, and  
> them if a layer NEEDS to ignore it, it can.

Yeah, and we should integrate this with the barriers settings.

I think the right setup is:

 - each gendisk has a variable to indicate if we have a write-back
   cache, which is filled from scsi inquiry data (or whatever the
   equivalent in the storage protocol is), but we allow an override
   from userspace if the admin knows better (if he really does or
   wants to play fast and lose is the admin's business)
 - filesystems do the right things by using barriers and cache flushes
   if they see the underlying device needs it.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-14 13:03                                                         ` Fernando Luis Vázquez Cao
  2009-02-14 13:19                                                           ` Fernando Luis Vázquez Cao
@ 2009-02-15  2:48                                                           ` Dave Chinner
  2009-02-15  7:11                                                             ` Fernando Luis Vázquez Cao
  1 sibling, 1 reply; 74+ messages in thread
From: Dave Chinner @ 2009-02-15  2:48 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
	Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler

On Sat, Feb 14, 2009 at 10:03:53PM +0900, Fernando Luis Vázquez Cao wrote:
> On Sat, 2009-02-14 at 22:24 +1100, Dave Chinner wrote:
> > On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> > > On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > > > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > > > guess.
> > > > 
> > > > You're not the only one, Eric. It's bad enough having to explain to
> > > > users what barriers do once they have lost data after a power loss,
> > > > let alone confusing them further by adding more mount options they
> > > > will get wrong by accident....
> > > 
> > > That is precisely the reason why we should use sensible defaults, which
> > > in this case means enabling barriers and flushing disk caches on
> > > fsync()/fdatasync() by default.
> > > 
> > > Adding either a new mount option (as you yourself suggest below) or a
> > > sysfs tunable is desirable for those cases when we really do not need to
> > > flush the disk write cache to guarantee integrity (battery-backed block
> > > devices come to mind), or we want to be fast at the cost of potentially
> > > losing some data.
> > 
> > Mount options are the wrong place for this. if you want to change
> > the behaviour of the block device, then it should be at that level.
> 
> To be more precise, what we are trying to change is the behavior of
> fsync()/fdatasync(), which users might want to change on a per-partition
> basis. I guess this is the reason the barrier switch was made a mount
> option, and I just wanted to be consistent.

This has no place in the kernel. Use LD_PRELOAD to make fsync() a
no-op.

> > No mount option - too confusing for someone to work out what
> > combination of barriers and flushing for things to work correctly.
> 
> As I suggested in a previous email, it is just a matter of using a safe
> combination by default so that users do not need to figure out anything.

Too many users think that they need to specify everything rather
than rely on defaults...

> > Just make filesystems issue the necessary flush calls or barrier IOs
> 
> "ext3: call blkdev_issue_flush on fsync" and "ext4: call
> blkdev_issue_flush on fsync" in this patch set implement just that for
> ext3/4.
> 
> >  and allow the block devices to ignore flushes.
> 
> Wouldn't it make more sense to avoid sending bios down the block layer
> which we can know in advance are going to be ignored by the block
> device?

As soon as the block layer reports EOPNOTSUPPORTED to a barrier IO,
the filesystem will switch them off and not issue them anymore.

> > I don't think we want (1) at all, and I thought that if ext3/4 are using
> > barriers then the barrier I/O issued by the journal does the flush
> > already. Hence (3) is redundant, right?
> 
> No, it is no redundant because a barrier is not issued in all cases. The
> aforementioned two patches fix ext3/4 by emitting a device flush only
> when necessary (i.e. when a barrier would not be emitted).

Then that is a filesystem fix, not something that requires VFS
modifications or new mount options....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-15  2:48                                                           ` Dave Chinner
@ 2009-02-15  7:11                                                             ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-15  7:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
	Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler

On Sun, 2009-02-15 at 13:48 +1100, Dave Chinner wrote:
> On Sat, Feb 14, 2009 at 10:03:53PM +0900, Fernando Luis Vázquez Cao wrote:
> > On Sat, 2009-02-14 at 22:24 +1100, Dave Chinner wrote:
> > > On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> > > > On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > > > > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > > > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > > > > guess.
> > > > > 
> > > > > You're not the only one, Eric. It's bad enough having to explain to
> > > > > users what barriers do once they have lost data after a power loss,
> > > > > let alone confusing them further by adding more mount options they
> > > > > will get wrong by accident....
> > > > 
> > > > That is precisely the reason why we should use sensible defaults, which
> > > > in this case means enabling barriers and flushing disk caches on
> > > > fsync()/fdatasync() by default.
> > > > 
> > > > Adding either a new mount option (as you yourself suggest below) or a
> > > > sysfs tunable is desirable for those cases when we really do not need to
> > > > flush the disk write cache to guarantee integrity (battery-backed block
> > > > devices come to mind), or we want to be fast at the cost of potentially
> > > > losing some data.
> > > 
> > > Mount options are the wrong place for this. if you want to change
> > > the behaviour of the block device, then it should be at that level.
> > 
> > To be more precise, what we are trying to change is the behavior of
> > fsync()/fdatasync(), which users might want to change on a per-partition
> > basis. I guess this is the reason the barrier switch was made a mount
> > option, and I just wanted to be consistent.
> 
> This has no place in the kernel. Use LD_PRELOAD to make fsync() a
> no-op.

The purpose of flushonfsync is not making fsync() a no-op and goes
beyond what we can currently achieve with LD_PRELOAD. For example, if we
send the data to disk but avoid flushing the block device's write cache
we can potentially improve I/O performance at the cost of compromising
data and filesystem integrity. This is a risk that those who play fast
and loose may want assume. By the way, sadly enough this is the way many
of the filesystems in Linus' tree behave. I just wanted to change this
situation by making all filesystems issue write-cache flushes by
default.

Some people suggested to leave a knob for those who wanted to revert to
the old behavior and I myself thought that it could make sense in some
cases so decided to add the tunable flushonsync.

If there is consensus flushonfsync should be a per-device tunable I am
more than willing to make it so. My goal is to fix all filesystem so
that they emit barriers and disk flushes when they should. flushonfsync
is just a nicety I added for those who, for whatever reason, still want
the old behavior.

For the next iteration of this patchset I will take out the contentious
bits and leave only the filesystem/VFS fixes so that we can move forward
while we discuss the propriety of adding a per-device or a
per-filesystem tunable such as flushonfsync to change the default (and
safe) behavior.

> > > No mount option - too confusing for someone to work out what
> > > combination of barriers and flushing for things to work correctly.
> > 
> > As I suggested in a previous email, it is just a matter of using a safe
> > combination by default so that users do not need to figure out anything.
> 
> Too many users think that they need to specify everything rather
> than rely on defaults...

Well that is their business. From my experience most admins in the field
do not stray from their enterprise-distro provided defaults.

> > > Just make filesystems issue the necessary flush calls or barrier IOs
> > 
> > "ext3: call blkdev_issue_flush on fsync" and "ext4: call
> > blkdev_issue_flush on fsync" in this patch set implement just that for
> > ext3/4.
> > 
> > >  and allow the block devices to ignore flushes.
> > 
> > Wouldn't it make more sense to avoid sending bios down the block layer
> > which we can know in advance are going to be ignored by the block
> > device?
> 
> As soon as the block layer reports EOPNOTSUPPORTED to a barrier IO,
> the filesystem will switch them off and not issue them anymore.

Yes, that certainly makes sense. But the point in discussion is whether
users should be allowed to switch them off (it arguably makes sense in
some scenarios). I am afraid that some users will not be happy if we do
not leave the door open for them to revert to the old behavior.

> > > I don't think we want (1) at all, and I thought that if ext3/4 are using
> > > barriers then the barrier I/O issued by the journal does the flush
> > > already. Hence (3) is redundant, right?
> > 
> > No, it is no redundant because a barrier is not issued in all cases. The
> > aforementioned two patches fix ext3/4 by emitting a device flush only
> > when necessary (i.e. when a barrier would not be emitted).
> 
> Then that is a filesystem fix, not something that requires VFS
> modifications or new mount options....

Yup, as mentioned above flushonfsync is just a nicety I added to the
second iteration of this patchset and is independent from the filesystem
fixes.

Regards,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-14 15:36                                                 ` Christoph Hellwig
@ 2009-02-15  7:23                                                   ` Fernando Luis Vázquez Cao
  2009-02-15 22:54                                                     ` Theodore Tso
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-15  7:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Garzik, Eric Sandeen, Jan Kara, Theodore Tso, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler

On Sat, 2009-02-14 at 10:36 -0500, Christoph Hellwig wrote:
> On Thu, Feb 12, 2009 at 12:29:52PM -0500, Jeff Garzik wrote:
> >> The block device *could* choose to ignore this in hardware if it knows
> >> it's built with a nonvolatile write cache or if it has no write cache.
> >
> > That would certainly be my preference -- turn this ON by default, and  
> > them if a layer NEEDS to ignore it, it can.
> 
> Yeah, and we should integrate this with the barriers settings.
> 
> I think the right setup is:
> 
>  - each gendisk has a variable to indicate if we have a write-back
>    cache, which is filled from scsi inquiry data (or whatever the
>    equivalent in the storage protocol is), but we allow an override
>    from userspace if the admin knows better (if he really does or
>    wants to play fast and lose is the admin's business)
>  - filesystems do the right things by using barriers and cache flushes
>    if they see the underlying device needs it.

That makes sense, but the contentious issue seems to be whether the
override from userspace you mention should take the form of mount option
or per-block device sysfs tunable instead. Making this override
(flushonfsync in my patches) be a mount option would be consistent with
what filesystems such as ext3/4 and xfs do when it comes to barriers,
but, if there is consensus, I would not mind turning it into a
per-device tunable instead.

You mentioned "we should integrate this with the barrier settings". Do
you imply we should make it a per-device tunable too? Should we keep the
barrier-related mount options some filesystems provide?

 - Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext4: call blkdev_issue_flush on fsync
  2009-02-12 10:40                                           ` ext4: " Fernando Luis Vázquez Cao
@ 2009-02-15 22:46                                             ` Theodore Tso
  2009-02-16  7:09                                               ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-02-15 22:46 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Thu, Feb 12, 2009 at 07:40:45PM +0900, Fernando Luis Vázquez Cao wrote:
> @@ -76,25 +77,34 @@ int ext4_sync_file(struct file *file, st
>  	 */
>  	if (ext4_should_journal_data(inode)) {
>  		ret = ext4_force_commit(inode->i_sb);
> +		if (!(journal->j_flags & JBD2_BARRIER))
> +			goto no_journal_barrier;
>  		goto out;
>  	}

All of these goto statements makes it one gigantic pile of spaghetti.
The code will be much more understable if you do:

		if (!(journal->j_flags & JBD2_BARRIER))
			block_flush_device(inode->i_sb);
		return ret;	

>  
> -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> -		goto out;
> +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> +		goto flush_blkdev;
>  

Maybe instead:
	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
		if (i_state & I_DIRTY_PAGES)
			block_flush_device(inode->i_sb);
		return ret;
	}


> -		if (journal && (journal->j_flags & JBD2_BARRIER))
> -			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> +		if (journal && !(journal->j_flags & JBD2_BARRIER))
> +			goto no_journal_barrier;
> +		goto out;

Maybe instead:
		if (journal && !(journal->j_flags & JBD2_BARRIER))
			block_flush_device(inode->i_sb);
		return ret;
  	}

I'm not a fanatic about eliminating all goto's, but "goto out" which
could be replaced by "return ret;" is just silly.

      	 	     	     	      - Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-15  7:23                                                   ` Fernando Luis Vázquez Cao
@ 2009-02-15 22:54                                                     ` Theodore Tso
  2009-02-16  4:29                                                       ` Eric Sandeen
  2009-02-16  7:47                                                         ` Fernando Luis Vázquez Cao
  0 siblings, 2 replies; 74+ messages in thread
From: Theodore Tso @ 2009-02-15 22:54 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Christoph Hellwig, Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler

On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
> You mentioned "we should integrate this with the barrier settings". Do
> you imply we should make it a per-device tunable too? Should we keep the
> barrier-related mount options some filesystems provide?
> 

Making barriers to be a per-device tunable makes sense.  The only
reason why we kept it as a mount option in ext4 is for benchmarking
purposes, and in ext3, because the filesystem predated the barrier
code, and there was a desire to be able to benchmark with and without
the old behavior --- and because akpm is still worried about the
performance hit of the barrier code, so he's been resistant about
change the default for ext3.

       	   				- Ted


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-15 22:54                                                     ` Theodore Tso
@ 2009-02-16  4:29                                                       ` Eric Sandeen
  2009-02-16  7:47                                                         ` Fernando Luis Vázquez Cao
  1 sibling, 0 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-16  4:29 UTC (permalink / raw)
  To: Theodore Tso, Fernando Luis Vázquez Cao, Christoph Hellwig,
	Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox, Pavel Machek,
	kernel list, Jens Axboe, fernando, Ric Wheeler

Theodore Tso wrote:
> On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
>> You mentioned "we should integrate this with the barrier settings". Do
>> you imply we should make it a per-device tunable too? Should we keep the
>> barrier-related mount options some filesystems provide?
>>
> 
> Making barriers to be a per-device tunable makes sense.  The only
> reason why we kept it as a mount option in ext4 is for benchmarking
> purposes, and in ext3, because the filesystem predated the barrier
> code, and there was a desire to be able to benchmark with and without
> the old behavior --- and because akpm is still worried about the
> performance hit of the barrier code, so he's been resistant about
> change the default for ext3.
> 
>        	   				- Ted

I agree that making barriers a per-device tunable probably does make the
most sense.  And talking w/ Dave, he points out that if it were this
way, the fs could just issue barrier IOs at every appropriate place, and
the bdev would just ignore them if they got tuned off, instead of the
current fs world of trying a barrier, always watching out for
-EOPNOTSUPP coming back up, then disabling barriers, etc.

-Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext4: call blkdev_issue_flush on fsync
  2009-02-15 22:46                                             ` Theodore Tso
@ 2009-02-16  7:09                                               ` Fernando Luis Vázquez Cao
  2009-02-16  7:25                                                 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16  7:09 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando

On Sun, 2009-02-15 at 17:46 -0500, Theodore Tso wrote:
> On Thu, Feb 12, 2009 at 07:40:45PM +0900, Fernando Luis Vázquez Cao wrote:
> > @@ -76,25 +77,34 @@ int ext4_sync_file(struct file *file, st
> >  	 */
> >  	if (ext4_should_journal_data(inode)) {
> >  		ret = ext4_force_commit(inode->i_sb);
> > +		if (!(journal->j_flags & JBD2_BARRIER))
> > +			goto no_journal_barrier;
> >  		goto out;
> >  	}
> 
> All of these goto statements makes it one gigantic pile of spaghetti.
> The code will be much more understable if you do:
> 
> 		if (!(journal->j_flags & JBD2_BARRIER))
> 			block_flush_device(inode->i_sb);
> 		return ret;	
> 
> >  
> > -	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > -		goto out;
> > +	if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > +		goto flush_blkdev;
> >  
> 
> Maybe instead:
> 	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
> 		if (i_state & I_DIRTY_PAGES)
> 			block_flush_device(inode->i_sb);
> 		return ret;
> 	}
> 
> 
> > -		if (journal && (journal->j_flags & JBD2_BARRIER))
> > -			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> > +		if (journal && !(journal->j_flags & JBD2_BARRIER))
> > +			goto no_journal_barrier;
> > +		goto out;
> 
> Maybe instead:
> 		if (journal && !(journal->j_flags & JBD2_BARRIER))
> 			block_flush_device(inode->i_sb);
> 		return ret;
>   	}
> 
> I'm not a fanatic about eliminating all goto's, but "goto out" which
> could be replaced by "return ret;" is just silly.

Yes, you are right. I just wanted to follow the current style of the
code, but I got carried away and took things a bit too far.

I'll be replying to this email with a new iteration of the patch-set
that leaves out the potentially conflictive bits and should be more
readable.

Regards,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 1/3] block: Add block_flush_device()
  2009-02-16  7:09                                               ` Fernando Luis Vázquez Cao
@ 2009-02-16  7:25                                                 ` Fernando Luis Vázquez Cao
  2009-02-16  7:29                                                 ` [2/3] ext3: call block_flush_device() on fsync Fernando Luis Vázquez Cao
  2009-02-16  7:31                                                 ` [PATCH 3/3] ext4: " Fernando Luis Vázquez Cao
  2 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16  7:25 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando, rwheeler, linux-fsdevel

This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc4-orig/fs/buffer.c linux-2.6.29-rc4/fs/buffer.c
--- linux-2.6.29-rc4-orig/fs/buffer.c	2009-02-16 14:45:11.000000000 +0900
+++ linux-2.6.29-rc4/fs/buffer.c	2009-02-16 14:53:26.000000000 +0900
@@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
 	put_bh(bh);
 }
 
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+	int ret = 0;
+
+	ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+	return (ret == -EOPNOTSUPP) ? 0 : ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
 /*
  * Write out and wait upon all the dirty data associated with a block
  * device via its mapping.  Does not take the superblock lock.
diff -urNp linux-2.6.29-rc4-orig/include/linux/buffer_head.h linux-2.6.29-rc4/include/linux/buffer_head.h
--- linux-2.6.29-rc4-orig/include/linux/buffer_head.h	2009-02-16 14:45:12.000000000 +0900
+++ linux-2.6.29-rc4/include/linux/buffer_head.h	2009-02-16 14:48:28.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
 int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
 int nobh_writepage(struct page *page, get_block_t *get_block,
                         struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);
 
 void buffer_init(void);
 



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [2/3] ext3: call  block_flush_device() on fsync
  2009-02-16  7:09                                               ` Fernando Luis Vázquez Cao
  2009-02-16  7:25                                                 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-02-16  7:29                                                 ` Fernando Luis Vázquez Cao
  2009-02-16  7:31                                                 ` [PATCH 3/3] ext4: " Fernando Luis Vázquez Cao
  2 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16  7:29 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando, rwheeler, linux-fsdevel

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc5-orig/fs/ext3/fsync.c linux-2.6.29-rc5/fs/ext3/fsync.c
--- linux-2.6.29-rc5-orig/fs/ext3/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc5/fs/ext3/fsync.c	2009-02-16 15:56:05.000000000 +0900
@@ -45,6 +45,8 @@
 int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
+	journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,30 @@ int ext3_sync_file(struct file * file, s
 	 */
 	if (ext3_should_journal_data(inode)) {
 		ret = ext3_force_commit(inode->i_sb);
-		goto out;
+		if (!(journal->j_flags & JFS_BARRIER))
+			block_flush_device(inode->i_sb);
+		return ret;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+		if (i_state & I_DIRTY_PAGES)
+			block_flush_device(inode->i_sb);
+		return ret;
+	}
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
+		if (journal && !(journal->j_flags & JFS_BARRIER))
+			block_flush_device(inode->i_sb);
 	}
-out:
+
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 3/3] ext4: call  block_flush_device() on fsync
  2009-02-16  7:09                                               ` Fernando Luis Vázquez Cao
  2009-02-16  7:25                                                 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
  2009-02-16  7:29                                                 ` [2/3] ext3: call block_flush_device() on fsync Fernando Luis Vázquez Cao
@ 2009-02-16  7:31                                                 ` Fernando Luis Vázquez Cao
  2 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16  7:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
	sandeen, fernando, rwheeler, linux-fsdevel

To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---

diff -urNp linux-2.6.29-rc5-orig/fs/ext4/fsync.c linux-2.6.29-rc5/fs/ext4/fsync.c
--- linux-2.6.29-rc5-orig/fs/ext4/fsync.c	2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc5/fs/ext4/fsync.c	2009-02-16 15:52:56.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
 {
 	struct inode *inode = dentry->d_inode;
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+	unsigned long i_state = inode->i_state;
 	int ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,30 @@ int ext4_sync_file(struct file *file, st
 	 */
 	if (ext4_should_journal_data(inode)) {
 		ret = ext4_force_commit(inode->i_sb);
-		goto out;
+		if (!(journal->j_flags & JBD2_BARRIER))
+			block_flush_device(inode->i_sb);
+		return ret;
 	}
 
-	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
-		goto out;
+	if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+		if (i_state & I_DIRTY_PAGES)
+			block_flush_device(inode->i_sb);
+		return ret;
+	}
 
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
 	 */
-	if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+	if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
 		ret = sync_inode(inode, &wbc);
-		if (journal && (journal->j_flags & JBD2_BARRIER))
-			blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+		if (journal && !(journal->j_flags & JBD2_BARRIER))
+			block_flush_device(inode->i_sb);
 	}
-out:
+
 	return ret;
 }



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-15 22:54                                                     ` Theodore Tso
@ 2009-02-16  7:47                                                         ` Fernando Luis Vázquez Cao
  2009-02-16  7:47                                                         ` Fernando Luis Vázquez Cao
  1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16  7:47 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler,
	linux-fsdevel

On Sun, 2009-02-15 at 17:54 -0500, Theodore Tso wrote:
> On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
> > You mentioned "we should integrate this with the barrier settings". Do
> > you imply we should make it a per-device tunable too? Should we keep the
> > barrier-related mount options some filesystems provide?
> > 
> 
> Making barriers to be a per-device tunable makes sense.  The only
> reason why we kept it as a mount option in ext4 is for benchmarking
> purposes, and in ext3, because the filesystem predated the barrier
> code, and there was a desire to be able to benchmark with and without
> the old behavior --- and because akpm is still worried about the
> performance hit of the barrier code, so he's been resistant about
> change the default for ext3.

Ok, I'll turn both barriers and flushonfsync into a sysfs-exported
per-device knob and see how it turns out.

By the way, should we also add/keep a mount option for "benchmarking
purposes"?. I guess that once we get the per-device tunable we probable
do not need it anymore.

Regards,

Fernando


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
@ 2009-02-16  7:47                                                         ` Fernando Luis Vázquez Cao
  0 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16  7:47 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox,
	Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler,
	linux-fsdevel

On Sun, 2009-02-15 at 17:54 -0500, Theodore Tso wrote:
> On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
> > You mentioned "we should integrate this with the barrier settings". Do
> > you imply we should make it a per-device tunable too? Should we keep the
> > barrier-related mount options some filesystems provide?
> > 
> 
> Making barriers to be a per-device tunable makes sense.  The only
> reason why we kept it as a mount option in ext4 is for benchmarking
> purposes, and in ext3, because the filesystem predated the barrier
> code, and there was a desire to be able to benchmark with and without
> the old behavior --- and because akpm is still worried about the
> performance hit of the barrier code, so he's been resistant about
> change the default for ext3.

Ok, I'll turn both barriers and flushonfsync into a sysfs-exported
per-device knob and see how it turns out.

By the way, should we also add/keep a mount option for "benchmarking
purposes"?. I guess that once we get the per-device tunable we probable
do not need it anymore.

Regards,

Fernando

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: ext2 + -osync: not as easy as it seems
  2009-02-12 16:43                 ` Eric Sandeen
@ 2009-02-16 12:09                   ` Jens Axboe
  0 siblings, 0 replies; 74+ messages in thread
From: Jens Axboe @ 2009-02-16 12:09 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, Theodore Tso, Fernando Luis Vázquez Cao, Alan Cox,
	Pavel Machek, kernel list

On Thu, Feb 12 2009, Eric Sandeen wrote:
> Jens Axboe wrote:
> > On Wed, Jan 14 2009, Jan Kara wrote:
> >>> I'm not sure what you mean; if the barrier operation isn't flushing
> >>> all of the caches all the way out to the iron oxide, it's not going to
> >>> be working properly no matter where it is being called, whether it's
> >>> in ext4_sync_file() or in jbd2's journal_submit_commit_record().
> >>   Well, I thought that a barrier, as an abstraction, only guarantees that
> >> any IO which happened before the barrier hits the iron before any IO which
> >> has been submitted after a barrier. This is actually enough for a
> >> journalling to work correctly but it's not enough for fsync() guarantees.
> >> But I might be wrong...
> > 
> > It also guarentees that when you get a completion for that barrier
> > write, it's on safe storage. Think of it as a flush-write-flush
> > operation, in the presence of write back caching.
> 
> (sorry for chiming in so late)
> 
> Jens, isn't this just the way it's implemented today?  At some point
> couldn't a barrier bio simply be a reordering barrier that the storage
> can use when destaging the write cache, rather than the heavy-handed
> flush-write-flush we have today?
> 
> I guess it's a question of the intended semantics of a barrier bio, vs.
> today's implementation based on current hardware functionality...

Right, the current implementation is largely an artefact of what we can
do with todays hardware. It would be quite reasonable to decouple the
flush from the reordering, if we were able to support that on the
hardware side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-13  2:23                                                   ` Theodore Tso
@ 2009-02-22 14:15                                                     ` Pavel Machek
  2009-02-22 20:51                                                       ` Eric Sandeen
  2009-02-22 23:19                                                       ` Theodore Tso
  0 siblings, 2 replies; 74+ messages in thread
From: Pavel Machek @ 2009-02-22 14:15 UTC (permalink / raw)
  To: Theodore Tso, Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao,
	Alan Cox, kernel list, Jens Axboe, fernando, Ric Wheeler

On Thu 2009-02-12 21:23:36, Theodore Tso wrote:
> On Thu, Feb 12, 2009 at 03:30:10PM -0600, Eric Sandeen wrote:
> > >   Yes, but OTOH we should give sysadmin a possibility to enable / disable
> > > it on just some partitions. I don't see a reasonable use for that but people
> > > tend to do strange things ;) and here isn't probably a strong reason to not
> > > allow them.
> > > 
> >
> > But nobody has asked for that, have they?  So why offer it up a this point?
> > 
> > They could use LD_PRELOAD to make fsync a no-op if they really don't
> > care for it, I guess... though that's not easily per-fs either.
> 
> Actually, Bart Samwel at FOSDEM talked to me and asked for something
> similar --- what we came up which meant his request while still being
> standards-compliant was a per-process personality flag which had three
> options:
> 
>      *)  Always honor fsync() calls (the default)
>      *)  Never honor fsync() calls
>      *)  Only honor fsync() calls if a global "honor fsync" flag
>            (which would be manipulated by the laptop mode scripts)
> 	   is set.
> 
> The flag would be reset to the default across a setuid exec, but would
> otherwise be inherited across fork()'s.  It might be possible to
> set/get the flag via a /proc interface.
> 
> The basic idea is that laptop systems where the system administrator
> wants longer battery life (and trusts the battery not to suddenly give
> out) more than they care about fsync() guarantees can set up a pam
> library which sets the flag for at login time so that all of the
> user's processes can be set up not to honor fsync() calls; however,
> all of the system daemons would still function normally.   

Sounds like posix violation to
me... '/sys/fsync_does_not_really_sync'?

Perhaps it is better done at glibc level? Environment variables
already mostly have semantics you want.....

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-22 14:15                                                     ` Pavel Machek
@ 2009-02-22 20:51                                                       ` Eric Sandeen
  2009-02-22 23:19                                                       ` Theodore Tso
  1 sibling, 0 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-22 20:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Tso, Jan Kara, Fernando Luis V?zquez Cao, Alan Cox,
	kernel list, Jens Axboe, fernando, Ric Wheeler

Pavel Machek wrote:
> On Thu 2009-02-12 21:23:36, Theodore Tso wrote:
>> On Thu, Feb 12, 2009 at 03:30:10PM -0600, Eric Sandeen wrote:
>>>>   Yes, but OTOH we should give sysadmin a possibility to enable / disable
>>>> it on just some partitions. I don't see a reasonable use for that but people
>>>> tend to do strange things ;) and here isn't probably a strong reason to not
>>>> allow them.
>>>>
>>> But nobody has asked for that, have they?  So why offer it up a this point?
>>>
>>> They could use LD_PRELOAD to make fsync a no-op if they really don't
>>> care for it, I guess... though that's not easily per-fs either.
>> Actually, Bart Samwel at FOSDEM talked to me and asked for something
>> similar --- what we came up which meant his request while still being
>> standards-compliant was a per-process personality flag which had three
>> options:
>>
>>      *)  Always honor fsync() calls (the default)
>>      *)  Never honor fsync() calls
>>      *)  Only honor fsync() calls if a global "honor fsync" flag
>>            (which would be manipulated by the laptop mode scripts)
>> 	   is set.
>>
>> The flag would be reset to the default across a setuid exec, but would
>> otherwise be inherited across fork()'s.  It might be possible to
>> set/get the flag via a /proc interface.
>>
>> The basic idea is that laptop systems where the system administrator
>> wants longer battery life (and trusts the battery not to suddenly give
>> out) more than they care about fsync() guarantees can set up a pam
>> library which sets the flag for at login time so that all of the
>> user's processes can be set up not to honor fsync() calls; however,
>> all of the system daemons would still function normally.   
> 
> Sounds like posix violation to
> me... '/sys/fsync_does_not_really_sync'?
> 
> Perhaps it is better done at glibc level? Environment variables
> already mostly have semantics you want.....
> 
> 									Pavel

One other thing that may be worth bringing up (just to muddy the waters
more) is OSX's handling of this stuff.

>From the fsync(2) manpage:

>      Note that while fsync() will flush all data from the host to the
>      drive (i.e. the "permanent storage device"), the drive itself may not
>      physically write the data to the platters for quite some time and it
>      may be written in an out-of-order sequence.
> 
>      Specifically, if the drive loses power or the OS crashes, the appli-
>      cation may find that only some or none of their data was written.
>      The disk drive may also re-order the data so that later writes may be
>      present, while earlier writes are not.
> 
>      This is not a theoretical edge case.  This scenario is easily repro-
>      duced with real world workloads and drive power failures.
> 
>      For applications that require tighter guarantees about the integrity
>      of their data, Mac OS X provides the F_FULLFSYNC fcntl.  The F_FULLF-
>      SYNC fcntl asks the drive to flush all buffered data to permanent
>      storage.  Applications, such as databases, that require a strict
>      ordering of writes should use F_FULLFSYNC to ensure that their data
>      is written in the order they expect.  Please see fcntl(2) for more
>      detail.

and from fcntl(2)

>      F_FULLFSYNC    Does the same thing as fsync(2) then asks the drive to
>                     flush all buffered data to the permanent storage
>                     device (arg is ignored).  This is currently imple-
>                     mented on HFS, MS-DOS (FAT), and Universal Disk Format
>                     (UDF) file systems.  The operation may take quite a
>                     while to complete.  Certain FireWire drives have also
>                     been known to ignore the request to flush their
>                     buffered data.

-Eric

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-22 14:15                                                     ` Pavel Machek
  2009-02-22 20:51                                                       ` Eric Sandeen
@ 2009-02-22 23:19                                                       ` Theodore Tso
  2009-02-22 23:42                                                         ` Jeff Garzik
  1 sibling, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-02-22 23:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao, Alan Cox,
	kernel list, Jens Axboe, fernando, Ric Wheeler

On Sun, Feb 22, 2009 at 03:15:33PM +0100, Pavel Machek wrote:
> 
> Sounds like posix violation to
> me... '/sys/fsync_does_not_really_sync'?

It's as much a Posix violation as noatime.  :-)

> Perhaps it is better done at glibc level? Environment variables
> already mostly have semantics you want.....

Environment variables own't allow us to switch fsync's on and off
while the process is running, so that the system can run "safe" while
it is on AC mains, but "low power" when on battery.

      	    	       	    	   	   - Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-22 23:19                                                       ` Theodore Tso
@ 2009-02-22 23:42                                                         ` Jeff Garzik
  2009-02-22 23:46                                                           ` Jeff Garzik
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Garzik @ 2009-02-22 23:42 UTC (permalink / raw)
  To: Theodore Tso, Pavel Machek, Eric Sandeen, Jan Kara,
	Fernando Luis V?zquez Cao, Alan Cox, kernel list, Jens Axboe,
	fernando, Ric Wheeler, Andrew Morton

Theodore Tso wrote:
> On Sun, Feb 22, 2009 at 03:15:33PM +0100, Pavel Machek wrote:
>> Sounds like posix violation to
>> me... '/sys/fsync_does_not_really_sync'?
> 
> It's as much a Posix violation as noatime.  :-)
> 
>> Perhaps it is better done at glibc level? Environment variables
>> already mostly have semantics you want.....
> 
> Environment variables own't allow us to switch fsync's on and off
> while the process is running, so that the system can run "safe" while
> it is on AC mains, but "low power" when on battery.

Overall, for each filesystem, we really should _default_ to a safe mode 
where fsync(2), fdatasync(2), sync_file_range(2) and transaction commits 
are guaranteed to be committed to stable storage -- i.e. SYNC CACHE / 
FLUSH CACHE, a forced unit access (FUA) bit, or guarantee that the 
underlying storage has disabled write-back caching.

Correctness should come before performance.  Linux has not had 
credibility here, in the ATA+ext[23] space at least, and it is embarrassing.

Modern SATA and SCSI disks can do tagged command queueing with the FUA 
bit, which does a lot to mitigate the performance loss seen with a less 
capable "FLUSH CACHE + barrier" setup.

	Jeff





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-22 23:42                                                         ` Jeff Garzik
@ 2009-02-22 23:46                                                           ` Jeff Garzik
  2009-02-23  1:23                                                             ` Theodore Tso
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Garzik @ 2009-02-22 23:46 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Pavel Machek, Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao,
	Alan Cox, kernel list, Jens Axboe, fernando, Ric Wheeler,
	Andrew Morton

Jeff Garzik wrote:
> Correctness should come before performance.  Linux has not had 
> credibility here, in the ATA+ext[23] space at least, and it is 
> embarrassing.


To be more clear / precise, this means actually performing the 
guarantees we claim to the user.  For example,

fsync(2) on ext2 should trigger a storage device writeback cache flush 
[or equivalent guarantee via FUA].

fsync(2) or journal commit on ext3 should trigger a flush [or equivalent 
guarantee via FUA].

Though, certainly, the user should be able to disable this strict 
behavior and trade correctness for performance.

	Jeff





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: vfs: Add MS_FLUSHONFSYNC mount flag
  2009-02-22 23:46                                                           ` Jeff Garzik
@ 2009-02-23  1:23                                                             ` Theodore Tso
  0 siblings, 0 replies; 74+ messages in thread
From: Theodore Tso @ 2009-02-23  1:23 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Pavel Machek, Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao,
	Alan Cox, kernel list, Jens Axboe, fernando, Ric Wheeler,
	Andrew Morton

On Sun, Feb 22, 2009 at 06:46:51PM -0500, Jeff Garzik wrote:
>
> To be more clear / precise, this means actually performing the  
> guarantees we claim to the user.  For example,
>
> fsync(2) on ext2 should trigger a storage device writeback cache flush  
> [or equivalent guarantee via FUA].
>
> fsync(2) or journal commit on ext3 should trigger a flush [or equivalent  
> guarantee via FUA].
>
> Though, certainly, the user should be able to disable this strict  
> behavior and trade correctness for performance.

No argument here.  The *default* should be that we respect fsync().
However, the user should be able to configure with a reasonable amount
of granularity tradeoffs between safety and performance and/or battery
usage.

						- Ted

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2009-02-23  1:24 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-13 13:14 ext2 + -osync: not as easy as it seems Pavel Machek
2009-01-13 13:45 ` Alan Cox
2009-01-13 14:03   ` Theodore Tso
2009-01-13 14:07     ` Jens Axboe
2009-01-13 14:26       ` [PATCH] block: Fix documentation for blkdev_issue_flush() Theodore Ts'o
2009-01-13 14:28         ` Jens Axboe
2009-01-13 14:30     ` ext2 + -osync: not as easy as it seems Jan Kara
2009-01-13 14:46       ` Theodore Tso
2009-01-14  3:37       ` Fernando Luis Vázquez Cao
2009-01-14 10:35         ` Jan Kara
2009-01-14 13:21           ` Theodore Tso
2009-01-14 14:05             ` Jan Kara
2009-01-14 14:08               ` Jens Axboe
2009-01-14 14:34                 ` Theodore Tso
2009-01-14 14:43                   ` Jens Axboe
2009-02-12 16:43                 ` Eric Sandeen
2009-02-16 12:09                   ` Jens Axboe
2009-01-14 14:12               ` Theodore Tso
2009-01-14 14:37                 ` Jan Kara
2009-01-14 16:59                   ` Theodore Tso
2009-01-15 12:06                     ` Fernando Luis Vázquez Cao
2009-01-15 23:45                       ` Jan Kara
2009-01-16 12:31                         ` Fernando Luis Vázquez Cao
2009-01-16 13:55                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
2009-01-16 16:30                             ` Jan Kara
2009-01-17  9:47                               ` Fernando Luis Vázquez Cao
2009-01-17 10:00                                 ` Fernando Luis Vázquez Cao
2009-01-19 12:03                                   ` Jan Kara
2009-01-28  9:45                                     ` Fernando Luis Vázquez Cao
2009-01-28  9:55                                       ` Jan Kara
2009-02-12 10:33                                         ` Fernando Luis Vázquez Cao
2009-02-12 10:35                                           ` vfs: Improve readability off mount flag definitins by using offsets Fernando Luis Vázquez Cao
2009-02-12 10:36                                           ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
2009-02-12 17:13                                             ` Eric Sandeen
2009-02-12 17:29                                               ` Jeff Garzik
2009-02-14 15:36                                                 ` Christoph Hellwig
2009-02-15  7:23                                                   ` Fernando Luis Vázquez Cao
2009-02-15 22:54                                                     ` Theodore Tso
2009-02-16  4:29                                                       ` Eric Sandeen
2009-02-16  7:47                                                       ` Fernando Luis Vázquez Cao
2009-02-16  7:47                                                         ` Fernando Luis Vázquez Cao
2009-02-12 21:23                                               ` Jan Kara
2009-02-12 21:30                                                 ` Eric Sandeen
2009-02-13  1:47                                                   ` Fernando Luis Vázquez Cao
2009-02-13  6:07                                                     ` Eric Sandeen
2009-02-13  2:23                                                   ` Theodore Tso
2009-02-22 14:15                                                     ` Pavel Machek
2009-02-22 20:51                                                       ` Eric Sandeen
2009-02-22 23:19                                                       ` Theodore Tso
2009-02-22 23:42                                                         ` Jeff Garzik
2009-02-22 23:46                                                           ` Jeff Garzik
2009-02-23  1:23                                                             ` Theodore Tso
2009-02-13  1:14                                               ` Fernando Luis Vázquez Cao
2009-02-13  6:20                                                 ` Eric Sandeen
2009-02-13 10:36                                                   ` Fernando Luis Vázquez Cao
2009-02-13 12:20                                                   ` Dave Chinner
2009-02-13 16:29                                                     ` Fernando Luis Vazquez Cao
2009-02-14 11:24                                                       ` Dave Chinner
2009-02-14 13:03                                                         ` Fernando Luis Vázquez Cao
2009-02-14 13:19                                                           ` Fernando Luis Vázquez Cao
2009-02-15  2:48                                                           ` Dave Chinner
2009-02-15  7:11                                                             ` Fernando Luis Vázquez Cao
2009-02-12 10:37                                           ` util-linux: Add new mount options flushonfsync and noflushonfsync to mount Fernando Luis Vázquez Cao
2009-02-12 10:38                                           ` util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page Fernando Luis Vázquez Cao
2009-02-12 10:38                                           ` block: Add block_flush_device() Fernando Luis Vázquez Cao
2009-02-12 10:39                                           ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
2009-02-12 10:40                                           ` ext4: " Fernando Luis Vázquez Cao
2009-02-15 22:46                                             ` Theodore Tso
2009-02-16  7:09                                               ` Fernando Luis Vázquez Cao
2009-02-16  7:25                                                 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
2009-02-16  7:29                                                 ` [2/3] ext3: call block_flush_device() on fsync Fernando Luis Vázquez Cao
2009-02-16  7:31                                                 ` [PATCH 3/3] ext4: " Fernando Luis Vázquez Cao
2009-01-16 13:59                           ` ext4: call blkdev_issue_flush " Fernando Luis Vázquez Cao
2009-01-13 14:42   ` ext2 + -osync: not as easy as it seems Pavel Machek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.