* vfs: Improve readability off mount flag definitins by using offsets
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
@ 2009-02-12 10:35 ` Fernando Luis Vázquez Cao
2009-02-12 10:36 ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
` (5 subsequent siblings)
6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:35 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
Currently mount flags are defined using both decimal values and offsets which
is kind of confusing. Define all flags using offsets for the sake of
readability.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc3-orig/include/linux/fs.h linux-2.6.29-rc3/include/linux/fs.h
--- linux-2.6.29-rc3-orig/include/linux/fs.h 2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/include/linux/fs.h 2009-01-29 10:01:58.000000000 +0900
@@ -111,22 +111,22 @@ struct inodes_stat_t {
/*
* These are the fs-independent mount-flags: up to 32 flags are supported
*/
-#define MS_RDONLY 1 /* Mount read-only */
-#define MS_NOSUID 2 /* Ignore suid and sgid bits */
-#define MS_NODEV 4 /* Disallow access to device special files */
-#define MS_NOEXEC 8 /* Disallow program execution */
-#define MS_SYNCHRONOUS 16 /* Writes are synced at once */
-#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
-#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
-#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
-#define MS_NOATIME 1024 /* Do not update access times. */
-#define MS_NODIRATIME 2048 /* Do not update directory access times */
-#define MS_BIND 4096
-#define MS_MOVE 8192
-#define MS_REC 16384
-#define MS_VERBOSE 32768 /* War is peace. Verbosity is silence.
+#define MS_RDONLY (1<<0) /* Mount read-only */
+#define MS_NOSUID (1<<1) /* Ignore suid and sgid bits */
+#define MS_NODEV (1<<2) /* Disallow access to device special files */
+#define MS_NOEXEC (1<<3) /* Disallow program execution */
+#define MS_SYNCHRONOUS (1<<4) /* Writes are synced at once */
+#define MS_REMOUNT (1<<5) /* Alter flags of a mounted FS */
+#define MS_MANDLOCK (1<<6) /* Allow mandatory locks on an FS */
+#define MS_DIRSYNC (1<<7) /* Directory modifications are synchronous */
+#define MS_NOATIME (1<<10) /* Do not update access times. */
+#define MS_NODIRATIME (1<<11) /* Do not update directory access times */
+#define MS_BIND (1<<12)
+#define MS_MOVE (1<<13)
+#define MS_REC (1<<14)
+#define MS_VERBOSE (1<<15) /* War is peace. Verbosity is silence.
MS_VERBOSE is deprecated. */
-#define MS_SILENT 32768
+#define MS_SILENT (1<<15)
#define MS_POSIXACL (1<<16) /* VFS does not apply the umask */
#define MS_UNBINDABLE (1<<17) /* change to unbindable */
#define MS_PRIVATE (1<<18) /* change to private */
^ permalink raw reply [flat|nested] 74+ messages in thread
* vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
2009-02-12 10:35 ` vfs: Improve readability off mount flag definitins by using offsets Fernando Luis Vázquez Cao
@ 2009-02-12 10:36 ` Fernando Luis Vázquez Cao
2009-02-12 17:13 ` Eric Sandeen
2009-02-12 10:37 ` util-linux: Add new mount options flushonfsync and noflushonfsync to mount Fernando Luis Vázquez Cao
` (4 subsequent siblings)
6 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:36 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
This mount flag will be used to determine whether the block device's write
cache should be flush or not on fsync()/fdatasync().
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc3-orig/fs/namespace.c linux-2.6.29-rc3/fs/namespace.c
--- linux-2.6.29-rc3-orig/fs/namespace.c 2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/fs/namespace.c 2009-01-29 18:40:04.000000000 +0900
@@ -1933,8 +1933,8 @@ long do_mount(char *dev_name, char *dir_
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
- flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
- MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
+ flags &= ~(MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_NOATIME |
+ MS_NODIRATIME | MS_RELATIME | MS_ACTIVE | MS_KERNMOUNT);
/* ... and get the mountpoint */
retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
diff -urNp linux-2.6.29-rc3-orig/include/linux/fs.h linux-2.6.29-rc3/include/linux/fs.h
--- linux-2.6.29-rc3-orig/include/linux/fs.h 2009-01-29 10:04:18.000000000 +0900
+++ linux-2.6.29-rc3/include/linux/fs.h 2009-01-29 10:24:14.000000000 +0900
@@ -135,6 +135,8 @@ struct inodes_stat_t {
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_FLUSHONFSYNC (1<<24) /* Force block device flush on
+ fsync()/fdatasync() */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 10:36 ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
@ 2009-02-12 17:13 ` Eric Sandeen
2009-02-12 17:29 ` Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-12 17:13 UTC (permalink / raw)
To: Fernando Luis Vázquez Cao
Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
Jens Axboe, fernando, Ric Wheeler
Fernando Luis Vázquez Cao wrote:
> This mount flag will be used to determine whether the block device's write
> cache should be flush or not on fsync()/fdatasync().
>
> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> ---
Again, apologies for chiming in late.
But wouldn't it be better to make this a block device property rather
than a new filesystem mount option?
That way the filesystem can always do "the right thing" and call the
blkdev flush on fsync.
The block device *could* choose to ignore this in hardware if it knows
it's built with a nonvolatile write cache or if it has no write cache.
Somewhere in the middle, if an administrator knows they have a UPS they
trust and hardware that stays connected to it, they could tune the bdev
to ignore these flush requests.
Also that way if you have 8 partitions on a battery-backed blockdev, you
can tune it once, instead of needing to mount all 8 filesystems with the
new option.
Thoughts?
Thanks,
-Eric
> diff -urNp linux-2.6.29-rc3-orig/fs/namespace.c linux-2.6.29-rc3/fs/namespace.c
> --- linux-2.6.29-rc3-orig/fs/namespace.c 2009-01-29 09:47:51.000000000 +0900
> +++ linux-2.6.29-rc3/fs/namespace.c 2009-01-29 18:40:04.000000000 +0900
> @@ -1933,8 +1933,8 @@ long do_mount(char *dev_name, char *dir_
> if (flags & MS_RDONLY)
> mnt_flags |= MNT_READONLY;
>
> - flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
> - MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
> + flags &= ~(MS_NOSUID | MS_NODEV | MS_NOEXEC | MS_NOATIME |
> + MS_NODIRATIME | MS_RELATIME | MS_ACTIVE | MS_KERNMOUNT);
>
> /* ... and get the mountpoint */
> retval = kern_path(dir_name, LOOKUP_FOLLOW, &path);
> diff -urNp linux-2.6.29-rc3-orig/include/linux/fs.h linux-2.6.29-rc3/include/linux/fs.h
> --- linux-2.6.29-rc3-orig/include/linux/fs.h 2009-01-29 10:04:18.000000000 +0900
> +++ linux-2.6.29-rc3/include/linux/fs.h 2009-01-29 10:24:14.000000000 +0900
> @@ -135,6 +135,8 @@ struct inodes_stat_t {
> #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
> #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
> #define MS_I_VERSION (1<<23) /* Update inode I_version field */
> +#define MS_FLUSHONFSYNC (1<<24) /* Force block device flush on
> + fsync()/fdatasync() */
> #define MS_ACTIVE (1<<30)
> #define MS_NOUSER (1<<31)
>
>
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 17:13 ` Eric Sandeen
@ 2009-02-12 17:29 ` Jeff Garzik
2009-02-14 15:36 ` Christoph Hellwig
2009-02-12 21:23 ` Jan Kara
2009-02-13 1:14 ` Fernando Luis Vázquez Cao
2 siblings, 1 reply; 74+ messages in thread
From: Jeff Garzik @ 2009-02-12 17:29 UTC (permalink / raw)
To: Eric Sandeen
Cc: Fernando Luis Vázquez Cao, Jan Kara, Theodore Tso, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler
Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
>> This mount flag will be used to determine whether the block device's write
>> cache should be flush or not on fsync()/fdatasync().
>>
>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>> ---
>
> Again, apologies for chiming in late.
>
> But wouldn't it be better to make this a block device property rather
> than a new filesystem mount option?
>
> That way the filesystem can always do "the right thing" and call the
> blkdev flush on fsync.
>
> The block device *could* choose to ignore this in hardware if it knows
> it's built with a nonvolatile write cache or if it has no write cache.
That would certainly be my preference -- turn this ON by default, and
them if a layer NEEDS to ignore it, it can.
Jeff
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 17:29 ` Jeff Garzik
@ 2009-02-14 15:36 ` Christoph Hellwig
2009-02-15 7:23 ` Fernando Luis Vázquez Cao
0 siblings, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2009-02-14 15:36 UTC (permalink / raw)
To: Jeff Garzik
Cc: Eric Sandeen, Fernando Luis V?zquez Cao, Jan Kara, Theodore Tso,
Alan Cox, Pavel Machek, kernel list, Jens Axboe, fernando,
Ric Wheeler
On Thu, Feb 12, 2009 at 12:29:52PM -0500, Jeff Garzik wrote:
>> The block device *could* choose to ignore this in hardware if it knows
>> it's built with a nonvolatile write cache or if it has no write cache.
>
> That would certainly be my preference -- turn this ON by default, and
> them if a layer NEEDS to ignore it, it can.
Yeah, and we should integrate this with the barriers settings.
I think the right setup is:
- each gendisk has a variable to indicate if we have a write-back
cache, which is filled from scsi inquiry data (or whatever the
equivalent in the storage protocol is), but we allow an override
from userspace if the admin knows better (if he really does or
wants to play fast and lose is the admin's business)
- filesystems do the right things by using barriers and cache flushes
if they see the underlying device needs it.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-14 15:36 ` Christoph Hellwig
@ 2009-02-15 7:23 ` Fernando Luis Vázquez Cao
2009-02-15 22:54 ` Theodore Tso
0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-15 7:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jeff Garzik, Eric Sandeen, Jan Kara, Theodore Tso, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler
On Sat, 2009-02-14 at 10:36 -0500, Christoph Hellwig wrote:
> On Thu, Feb 12, 2009 at 12:29:52PM -0500, Jeff Garzik wrote:
> >> The block device *could* choose to ignore this in hardware if it knows
> >> it's built with a nonvolatile write cache or if it has no write cache.
> >
> > That would certainly be my preference -- turn this ON by default, and
> > them if a layer NEEDS to ignore it, it can.
>
> Yeah, and we should integrate this with the barriers settings.
>
> I think the right setup is:
>
> - each gendisk has a variable to indicate if we have a write-back
> cache, which is filled from scsi inquiry data (or whatever the
> equivalent in the storage protocol is), but we allow an override
> from userspace if the admin knows better (if he really does or
> wants to play fast and lose is the admin's business)
> - filesystems do the right things by using barriers and cache flushes
> if they see the underlying device needs it.
That makes sense, but the contentious issue seems to be whether the
override from userspace you mention should take the form of mount option
or per-block device sysfs tunable instead. Making this override
(flushonfsync in my patches) be a mount option would be consistent with
what filesystems such as ext3/4 and xfs do when it comes to barriers,
but, if there is consensus, I would not mind turning it into a
per-device tunable instead.
You mentioned "we should integrate this with the barrier settings". Do
you imply we should make it a per-device tunable too? Should we keep the
barrier-related mount options some filesystems provide?
- Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-15 7:23 ` Fernando Luis Vázquez Cao
@ 2009-02-15 22:54 ` Theodore Tso
2009-02-16 4:29 ` Eric Sandeen
2009-02-16 7:47 ` Fernando Luis Vázquez Cao
0 siblings, 2 replies; 74+ messages in thread
From: Theodore Tso @ 2009-02-15 22:54 UTC (permalink / raw)
To: Fernando Luis Vázquez Cao
Cc: Christoph Hellwig, Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler
On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
> You mentioned "we should integrate this with the barrier settings". Do
> you imply we should make it a per-device tunable too? Should we keep the
> barrier-related mount options some filesystems provide?
>
Making barriers to be a per-device tunable makes sense. The only
reason why we kept it as a mount option in ext4 is for benchmarking
purposes, and in ext3, because the filesystem predated the barrier
code, and there was a desire to be able to benchmark with and without
the old behavior --- and because akpm is still worried about the
performance hit of the barrier code, so he's been resistant about
change the default for ext3.
- Ted
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-15 22:54 ` Theodore Tso
@ 2009-02-16 4:29 ` Eric Sandeen
2009-02-16 7:47 ` Fernando Luis Vázquez Cao
1 sibling, 0 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-16 4:29 UTC (permalink / raw)
To: Theodore Tso, Fernando Luis Vázquez Cao, Christoph Hellwig,
Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox, Pavel Machek,
kernel list, Jens Axboe, fernando, Ric Wheeler
Theodore Tso wrote:
> On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
>> You mentioned "we should integrate this with the barrier settings". Do
>> you imply we should make it a per-device tunable too? Should we keep the
>> barrier-related mount options some filesystems provide?
>>
>
> Making barriers to be a per-device tunable makes sense. The only
> reason why we kept it as a mount option in ext4 is for benchmarking
> purposes, and in ext3, because the filesystem predated the barrier
> code, and there was a desire to be able to benchmark with and without
> the old behavior --- and because akpm is still worried about the
> performance hit of the barrier code, so he's been resistant about
> change the default for ext3.
>
> - Ted
I agree that making barriers a per-device tunable probably does make the
most sense. And talking w/ Dave, he points out that if it were this
way, the fs could just issue barrier IOs at every appropriate place, and
the bdev would just ignore them if they got tuned off, instead of the
current fs world of trying a barrier, always watching out for
-EOPNOTSUPP coming back up, then disabling barriers, etc.
-Eric
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-15 22:54 ` Theodore Tso
@ 2009-02-16 7:47 ` Fernando Luis Vázquez Cao
2009-02-16 7:47 ` Fernando Luis Vázquez Cao
1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16 7:47 UTC (permalink / raw)
To: Theodore Tso
Cc: Christoph Hellwig, Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler,
linux-fsdevel
On Sun, 2009-02-15 at 17:54 -0500, Theodore Tso wrote:
> On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
> > You mentioned "we should integrate this with the barrier settings". Do
> > you imply we should make it a per-device tunable too? Should we keep the
> > barrier-related mount options some filesystems provide?
> >
>
> Making barriers to be a per-device tunable makes sense. The only
> reason why we kept it as a mount option in ext4 is for benchmarking
> purposes, and in ext3, because the filesystem predated the barrier
> code, and there was a desire to be able to benchmark with and without
> the old behavior --- and because akpm is still worried about the
> performance hit of the barrier code, so he's been resistant about
> change the default for ext3.
Ok, I'll turn both barriers and flushonfsync into a sysfs-exported
per-device knob and see how it turns out.
By the way, should we also add/keep a mount option for "benchmarking
purposes"?. I guess that once we get the per-device tunable we probable
do not need it anymore.
Regards,
Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
@ 2009-02-16 7:47 ` Fernando Luis Vázquez Cao
0 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16 7:47 UTC (permalink / raw)
To: Theodore Tso
Cc: Christoph Hellwig, Jeff Garzik, Eric Sandeen, Jan Kara, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler,
linux-fsdevel
On Sun, 2009-02-15 at 17:54 -0500, Theodore Tso wrote:
> On Sun, Feb 15, 2009 at 04:23:26PM +0900, Fernando Luis Vázquez Cao wrote:
> > You mentioned "we should integrate this with the barrier settings". Do
> > you imply we should make it a per-device tunable too? Should we keep the
> > barrier-related mount options some filesystems provide?
> >
>
> Making barriers to be a per-device tunable makes sense. The only
> reason why we kept it as a mount option in ext4 is for benchmarking
> purposes, and in ext3, because the filesystem predated the barrier
> code, and there was a desire to be able to benchmark with and without
> the old behavior --- and because akpm is still worried about the
> performance hit of the barrier code, so he's been resistant about
> change the default for ext3.
Ok, I'll turn both barriers and flushonfsync into a sysfs-exported
per-device knob and see how it turns out.
By the way, should we also add/keep a mount option for "benchmarking
purposes"?. I guess that once we get the per-device tunable we probable
do not need it anymore.
Regards,
Fernando
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 17:13 ` Eric Sandeen
2009-02-12 17:29 ` Jeff Garzik
@ 2009-02-12 21:23 ` Jan Kara
2009-02-12 21:30 ` Eric Sandeen
2009-02-13 1:14 ` Fernando Luis Vázquez Cao
2 siblings, 1 reply; 74+ messages in thread
From: Jan Kara @ 2009-02-12 21:23 UTC (permalink / raw)
To: Eric Sandeen
Cc: Fernando Luis Vázquez Cao, Theodore Tso, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler
On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > This mount flag will be used to determine whether the block device's write
> > cache should be flush or not on fsync()/fdatasync().
> >
> > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > ---
>
> Again, apologies for chiming in late.
>
> But wouldn't it be better to make this a block device property rather
> than a new filesystem mount option?
Hum, that's an interesting idea. Yes, probably it makes even more sence
than a global mount option.
> That way the filesystem can always do "the right thing" and call the
> blkdev flush on fsync.
>
> The block device *could* choose to ignore this in hardware if it knows
> it's built with a nonvolatile write cache or if it has no write cache.
>
> Somewhere in the middle, if an administrator knows they have a UPS they
> trust and hardware that stays connected to it, they could tune the bdev
> to ignore these flush requests.
>
> Also that way if you have 8 partitions on a battery-backed blockdev, you
> can tune it once, instead of needing to mount all 8 filesystems with the
> new option.
Yes, but OTOH we should give sysadmin a possibility to enable / disable
it on just some partitions. I don't see a reasonable use for that but people
tend to do strange things ;) and here isn't probably a strong reason to not
allow them.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 21:23 ` Jan Kara
@ 2009-02-12 21:30 ` Eric Sandeen
2009-02-13 1:47 ` Fernando Luis Vázquez Cao
2009-02-13 2:23 ` Theodore Tso
0 siblings, 2 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-12 21:30 UTC (permalink / raw)
To: Jan Kara
Cc: Fernando Luis Vázquez Cao, Theodore Tso, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler
Jan Kara wrote:
> On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
...
>> Also that way if you have 8 partitions on a battery-backed blockdev, you
>> can tune it once, instead of needing to mount all 8 filesystems with the
>> new option.
> Yes, but OTOH we should give sysadmin a possibility to enable / disable
> it on just some partitions. I don't see a reasonable use for that but people
> tend to do strange things ;) and here isn't probably a strong reason to not
> allow them.
>
> Honza
But nobody has asked for that, have they? So why offer it up a this point?
They could use LD_PRELOAD to make fsync a no-op if they really don't
care for it, I guess... though that's not easily per-fs either.
But do we really want to go out of our way to enable people to
short-circuit data integrity paths and then file bugs when their files
go missing? :)
(I guess the blockdev tunable is similarly dangerous, but it more
clearly meets the explicit need (writecache-safe devices))
-Eric
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 21:30 ` Eric Sandeen
@ 2009-02-13 1:47 ` Fernando Luis Vázquez Cao
2009-02-13 6:07 ` Eric Sandeen
2009-02-13 2:23 ` Theodore Tso
1 sibling, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-13 1:47 UTC (permalink / raw)
To: Eric Sandeen
Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
Jens Axboe, fernando, Ric Wheeler
On Thu, 2009-02-12 at 15:30 -0600, Eric Sandeen wrote:
> Jan Kara wrote:
> > On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
>
> ...
>
> >> Also that way if you have 8 partitions on a battery-backed blockdev, you
> >> can tune it once, instead of needing to mount all 8 filesystems with the
> >> new option.
> > Yes, but OTOH we should give sysadmin a possibility to enable / disable
> > it on just some partitions. I don't see a reasonable use for that but people
> > tend to do strange things ;) and here isn't probably a strong reason to not
> > allow them.
> >
> > Honza
>
> But nobody has asked for that, have they? So why offer it up a this point?
>
> They could use LD_PRELOAD to make fsync a no-op if they really don't
> care for it, I guess... though that's not easily per-fs either.
>
> But do we really want to go out of our way to enable people to
> short-circuit data integrity paths and then file bugs when their files
> go missing? :)
Well, it is just a matter of using safe defaults. IMHO, a scenario where
the administrator wants to optimize writes to a certain partition and
_explicitly_ clears MS_FLUSHONFSYNC on that superblock is not completely
unreasonable.
> (I guess the blockdev tunable is similarly dangerous, but it more
> clearly meets the explicit need (writecache-safe devices))
If distributions use sane defaults and we document the mount option or
bdev tunable properly I guess it might make sense to allow system
administrators to shoot themselves in the foot.
(By the way, in this patch-set a patch for mount(8) is included.)
- Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 1:47 ` Fernando Luis Vázquez Cao
@ 2009-02-13 6:07 ` Eric Sandeen
0 siblings, 0 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-13 6:07 UTC (permalink / raw)
To: Fernando Luis Vázquez Cao
Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
Jens Axboe, fernando, Ric Wheeler
Fernando Luis Vázquez Cao wrote:
> On Thu, 2009-02-12 at 15:30 -0600, Eric Sandeen wrote:
>> Jan Kara wrote:
>>> On Thu 12-02-09 11:13:37, Eric Sandeen wrote:
>> ...
>>
>>>> Also that way if you have 8 partitions on a battery-backed blockdev, you
>>>> can tune it once, instead of needing to mount all 8 filesystems with the
>>>> new option.
>>> Yes, but OTOH we should give sysadmin a possibility to enable / disable
>>> it on just some partitions. I don't see a reasonable use for that but people
>>> tend to do strange things ;) and here isn't probably a strong reason to not
>>> allow them.
>>>
>>> Honza
>> But nobody has asked for that, have they? So why offer it up a this point?
>>
>> They could use LD_PRELOAD to make fsync a no-op if they really don't
>> care for it, I guess... though that's not easily per-fs either.
>>
>> But do we really want to go out of our way to enable people to
>> short-circuit data integrity paths and then file bugs when their files
>> go missing? :)
>
> Well, it is just a matter of using safe defaults. IMHO, a scenario where
> the administrator wants to optimize writes to a certain partition and
> _explicitly_ clears MS_FLUSHONFSYNC on that superblock is not completely
> unreasonable.
One case is "this device is safe with a write cache and flush is not
necessary for data consistency" - that's the per-bdev setting.
The other case is "I don't really care and I just want to go faster" -
that's the per-mount setting.
I'd be much more likely to support the first case, as it's needed for
maximum performance w/o sacrificing correctness, when properly used.
The latter case is really only for cutting corners and giving people
more rope than they need to hang themselves.
>> (I guess the blockdev tunable is similarly dangerous, but it more
>> clearly meets the explicit need (writecache-safe devices))
>
> If distributions use sane defaults and we document the mount option or
> bdev tunable properly I guess it might make sense to allow system
> administrators to shoot themselves in the foot.
>
> (By the way, in this patch-set a patch for mount(8) is included.)
... one of the advantages of making it a bdev tunable is that you don't
have to mess with mount(8) :)
-Eric
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 21:30 ` Eric Sandeen
2009-02-13 1:47 ` Fernando Luis Vázquez Cao
@ 2009-02-13 2:23 ` Theodore Tso
2009-02-22 14:15 ` Pavel Machek
1 sibling, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-02-13 2:23 UTC (permalink / raw)
To: Eric Sandeen
Cc: Jan Kara, Fernando Luis Vázquez Cao, Alan Cox, Pavel Machek,
kernel list, Jens Axboe, fernando, Ric Wheeler
On Thu, Feb 12, 2009 at 03:30:10PM -0600, Eric Sandeen wrote:
> > Yes, but OTOH we should give sysadmin a possibility to enable / disable
> > it on just some partitions. I don't see a reasonable use for that but people
> > tend to do strange things ;) and here isn't probably a strong reason to not
> > allow them.
> >
>
> But nobody has asked for that, have they? So why offer it up a this point?
>
> They could use LD_PRELOAD to make fsync a no-op if they really don't
> care for it, I guess... though that's not easily per-fs either.
Actually, Bart Samwel at FOSDEM talked to me and asked for something
similar --- what we came up which meant his request while still being
standards-compliant was a per-process personality flag which had three
options:
*) Always honor fsync() calls (the default)
*) Never honor fsync() calls
*) Only honor fsync() calls if a global "honor fsync" flag
(which would be manipulated by the laptop mode scripts)
is set.
The flag would be reset to the default across a setuid exec, but would
otherwise be inherited across fork()'s. It might be possible to
set/get the flag via a /proc interface.
The basic idea is that laptop systems where the system administrator
wants longer battery life (and trusts the battery not to suddenly give
out) more than they care about fsync() guarantees can set up a pam
library which sets the flag for at login time so that all of the
user's processes can be set up not to honor fsync() calls; however,
all of the system daemons would still function normally.
- Ted
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 2:23 ` Theodore Tso
@ 2009-02-22 14:15 ` Pavel Machek
2009-02-22 20:51 ` Eric Sandeen
2009-02-22 23:19 ` Theodore Tso
0 siblings, 2 replies; 74+ messages in thread
From: Pavel Machek @ 2009-02-22 14:15 UTC (permalink / raw)
To: Theodore Tso, Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao,
Alan Cox, kernel list, Jens Axboe, fernando, Ric Wheeler
On Thu 2009-02-12 21:23:36, Theodore Tso wrote:
> On Thu, Feb 12, 2009 at 03:30:10PM -0600, Eric Sandeen wrote:
> > > Yes, but OTOH we should give sysadmin a possibility to enable / disable
> > > it on just some partitions. I don't see a reasonable use for that but people
> > > tend to do strange things ;) and here isn't probably a strong reason to not
> > > allow them.
> > >
> >
> > But nobody has asked for that, have they? So why offer it up a this point?
> >
> > They could use LD_PRELOAD to make fsync a no-op if they really don't
> > care for it, I guess... though that's not easily per-fs either.
>
> Actually, Bart Samwel at FOSDEM talked to me and asked for something
> similar --- what we came up which meant his request while still being
> standards-compliant was a per-process personality flag which had three
> options:
>
> *) Always honor fsync() calls (the default)
> *) Never honor fsync() calls
> *) Only honor fsync() calls if a global "honor fsync" flag
> (which would be manipulated by the laptop mode scripts)
> is set.
>
> The flag would be reset to the default across a setuid exec, but would
> otherwise be inherited across fork()'s. It might be possible to
> set/get the flag via a /proc interface.
>
> The basic idea is that laptop systems where the system administrator
> wants longer battery life (and trusts the battery not to suddenly give
> out) more than they care about fsync() guarantees can set up a pam
> library which sets the flag for at login time so that all of the
> user's processes can be set up not to honor fsync() calls; however,
> all of the system daemons would still function normally.
Sounds like posix violation to
me... '/sys/fsync_does_not_really_sync'?
Perhaps it is better done at glibc level? Environment variables
already mostly have semantics you want.....
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-22 14:15 ` Pavel Machek
@ 2009-02-22 20:51 ` Eric Sandeen
2009-02-22 23:19 ` Theodore Tso
1 sibling, 0 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-22 20:51 UTC (permalink / raw)
To: Pavel Machek
Cc: Theodore Tso, Jan Kara, Fernando Luis V?zquez Cao, Alan Cox,
kernel list, Jens Axboe, fernando, Ric Wheeler
Pavel Machek wrote:
> On Thu 2009-02-12 21:23:36, Theodore Tso wrote:
>> On Thu, Feb 12, 2009 at 03:30:10PM -0600, Eric Sandeen wrote:
>>>> Yes, but OTOH we should give sysadmin a possibility to enable / disable
>>>> it on just some partitions. I don't see a reasonable use for that but people
>>>> tend to do strange things ;) and here isn't probably a strong reason to not
>>>> allow them.
>>>>
>>> But nobody has asked for that, have they? So why offer it up a this point?
>>>
>>> They could use LD_PRELOAD to make fsync a no-op if they really don't
>>> care for it, I guess... though that's not easily per-fs either.
>> Actually, Bart Samwel at FOSDEM talked to me and asked for something
>> similar --- what we came up which meant his request while still being
>> standards-compliant was a per-process personality flag which had three
>> options:
>>
>> *) Always honor fsync() calls (the default)
>> *) Never honor fsync() calls
>> *) Only honor fsync() calls if a global "honor fsync" flag
>> (which would be manipulated by the laptop mode scripts)
>> is set.
>>
>> The flag would be reset to the default across a setuid exec, but would
>> otherwise be inherited across fork()'s. It might be possible to
>> set/get the flag via a /proc interface.
>>
>> The basic idea is that laptop systems where the system administrator
>> wants longer battery life (and trusts the battery not to suddenly give
>> out) more than they care about fsync() guarantees can set up a pam
>> library which sets the flag for at login time so that all of the
>> user's processes can be set up not to honor fsync() calls; however,
>> all of the system daemons would still function normally.
>
> Sounds like posix violation to
> me... '/sys/fsync_does_not_really_sync'?
>
> Perhaps it is better done at glibc level? Environment variables
> already mostly have semantics you want.....
>
> Pavel
One other thing that may be worth bringing up (just to muddy the waters
more) is OSX's handling of this stuff.
>From the fsync(2) manpage:
> Note that while fsync() will flush all data from the host to the
> drive (i.e. the "permanent storage device"), the drive itself may not
> physically write the data to the platters for quite some time and it
> may be written in an out-of-order sequence.
>
> Specifically, if the drive loses power or the OS crashes, the appli-
> cation may find that only some or none of their data was written.
> The disk drive may also re-order the data so that later writes may be
> present, while earlier writes are not.
>
> This is not a theoretical edge case. This scenario is easily repro-
> duced with real world workloads and drive power failures.
>
> For applications that require tighter guarantees about the integrity
> of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLF-
> SYNC fcntl asks the drive to flush all buffered data to permanent
> storage. Applications, such as databases, that require a strict
> ordering of writes should use F_FULLFSYNC to ensure that their data
> is written in the order they expect. Please see fcntl(2) for more
> detail.
and from fcntl(2)
> F_FULLFSYNC Does the same thing as fsync(2) then asks the drive to
> flush all buffered data to the permanent storage
> device (arg is ignored). This is currently imple-
> mented on HFS, MS-DOS (FAT), and Universal Disk Format
> (UDF) file systems. The operation may take quite a
> while to complete. Certain FireWire drives have also
> been known to ignore the request to flush their
> buffered data.
-Eric
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-22 14:15 ` Pavel Machek
2009-02-22 20:51 ` Eric Sandeen
@ 2009-02-22 23:19 ` Theodore Tso
2009-02-22 23:42 ` Jeff Garzik
1 sibling, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-02-22 23:19 UTC (permalink / raw)
To: Pavel Machek
Cc: Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao, Alan Cox,
kernel list, Jens Axboe, fernando, Ric Wheeler
On Sun, Feb 22, 2009 at 03:15:33PM +0100, Pavel Machek wrote:
>
> Sounds like posix violation to
> me... '/sys/fsync_does_not_really_sync'?
It's as much a Posix violation as noatime. :-)
> Perhaps it is better done at glibc level? Environment variables
> already mostly have semantics you want.....
Environment variables own't allow us to switch fsync's on and off
while the process is running, so that the system can run "safe" while
it is on AC mains, but "low power" when on battery.
- Ted
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-22 23:19 ` Theodore Tso
@ 2009-02-22 23:42 ` Jeff Garzik
2009-02-22 23:46 ` Jeff Garzik
0 siblings, 1 reply; 74+ messages in thread
From: Jeff Garzik @ 2009-02-22 23:42 UTC (permalink / raw)
To: Theodore Tso, Pavel Machek, Eric Sandeen, Jan Kara,
Fernando Luis V?zquez Cao, Alan Cox, kernel list, Jens Axboe,
fernando, Ric Wheeler, Andrew Morton
Theodore Tso wrote:
> On Sun, Feb 22, 2009 at 03:15:33PM +0100, Pavel Machek wrote:
>> Sounds like posix violation to
>> me... '/sys/fsync_does_not_really_sync'?
>
> It's as much a Posix violation as noatime. :-)
>
>> Perhaps it is better done at glibc level? Environment variables
>> already mostly have semantics you want.....
>
> Environment variables own't allow us to switch fsync's on and off
> while the process is running, so that the system can run "safe" while
> it is on AC mains, but "low power" when on battery.
Overall, for each filesystem, we really should _default_ to a safe mode
where fsync(2), fdatasync(2), sync_file_range(2) and transaction commits
are guaranteed to be committed to stable storage -- i.e. SYNC CACHE /
FLUSH CACHE, a forced unit access (FUA) bit, or guarantee that the
underlying storage has disabled write-back caching.
Correctness should come before performance. Linux has not had
credibility here, in the ATA+ext[23] space at least, and it is embarrassing.
Modern SATA and SCSI disks can do tagged command queueing with the FUA
bit, which does a lot to mitigate the performance loss seen with a less
capable "FLUSH CACHE + barrier" setup.
Jeff
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-22 23:42 ` Jeff Garzik
@ 2009-02-22 23:46 ` Jeff Garzik
2009-02-23 1:23 ` Theodore Tso
0 siblings, 1 reply; 74+ messages in thread
From: Jeff Garzik @ 2009-02-22 23:46 UTC (permalink / raw)
To: Theodore Tso
Cc: Pavel Machek, Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao,
Alan Cox, kernel list, Jens Axboe, fernando, Ric Wheeler,
Andrew Morton
Jeff Garzik wrote:
> Correctness should come before performance. Linux has not had
> credibility here, in the ATA+ext[23] space at least, and it is
> embarrassing.
To be more clear / precise, this means actually performing the
guarantees we claim to the user. For example,
fsync(2) on ext2 should trigger a storage device writeback cache flush
[or equivalent guarantee via FUA].
fsync(2) or journal commit on ext3 should trigger a flush [or equivalent
guarantee via FUA].
Though, certainly, the user should be able to disable this strict
behavior and trade correctness for performance.
Jeff
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-22 23:46 ` Jeff Garzik
@ 2009-02-23 1:23 ` Theodore Tso
0 siblings, 0 replies; 74+ messages in thread
From: Theodore Tso @ 2009-02-23 1:23 UTC (permalink / raw)
To: Jeff Garzik
Cc: Pavel Machek, Eric Sandeen, Jan Kara, Fernando Luis V?zquez Cao,
Alan Cox, kernel list, Jens Axboe, fernando, Ric Wheeler,
Andrew Morton
On Sun, Feb 22, 2009 at 06:46:51PM -0500, Jeff Garzik wrote:
>
> To be more clear / precise, this means actually performing the
> guarantees we claim to the user. For example,
>
> fsync(2) on ext2 should trigger a storage device writeback cache flush
> [or equivalent guarantee via FUA].
>
> fsync(2) or journal commit on ext3 should trigger a flush [or equivalent
> guarantee via FUA].
>
> Though, certainly, the user should be able to disable this strict
> behavior and trade correctness for performance.
No argument here. The *default* should be that we respect fsync().
However, the user should be able to configure with a reasonable amount
of granularity tradeoffs between safety and performance and/or battery
usage.
- Ted
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-12 17:13 ` Eric Sandeen
2009-02-12 17:29 ` Jeff Garzik
2009-02-12 21:23 ` Jan Kara
@ 2009-02-13 1:14 ` Fernando Luis Vázquez Cao
2009-02-13 6:20 ` Eric Sandeen
2 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-13 1:14 UTC (permalink / raw)
To: Eric Sandeen
Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
Jens Axboe, fernando, Ric Wheeler
On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > This mount flag will be used to determine whether the block device's write
> > cache should be flush or not on fsync()/fdatasync().
> >
> > Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> > ---
>
> Again, apologies for chiming in late.
>
> But wouldn't it be better to make this a block device property rather
> than a new filesystem mount option?
>
> That way the filesystem can always do "the right thing" and call the
> blkdev flush on fsync.
>
> The block device *could* choose to ignore this in hardware if it knows
> it's built with a nonvolatile write cache or if it has no write cache.
>
> Somewhere in the middle, if an administrator knows they have a UPS they
> trust and hardware that stays connected to it, they could tune the bdev
> to ignore these flush requests.
>
> Also that way if you have 8 partitions on a battery-backed blockdev, you
> can tune it once, instead of needing to mount all 8 filesystems with the
> new option.
The main reason I decided to go for the mount option approach is to be
consistent with what we do when it comes to write barriers. Treating one
as a mount option and the other as a (possibly) sysfs tunable property
seems a bit confusing to me.
Do you suggest using sysfs tunables instead?
- Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 1:14 ` Fernando Luis Vázquez Cao
@ 2009-02-13 6:20 ` Eric Sandeen
2009-02-13 10:36 ` Fernando Luis Vázquez Cao
2009-02-13 12:20 ` Dave Chinner
0 siblings, 2 replies; 74+ messages in thread
From: Eric Sandeen @ 2009-02-13 6:20 UTC (permalink / raw)
To: Fernando Luis Vázquez Cao
Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
Jens Axboe, fernando, Ric Wheeler
Fernando Luis Vázquez Cao wrote:
> On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
>> Fernando Luis Vázquez Cao wrote:
>>> This mount flag will be used to determine whether the block device's write
>>> cache should be flush or not on fsync()/fdatasync().
>>>
>>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
>>> ---
>> Again, apologies for chiming in late.
>>
>> But wouldn't it be better to make this a block device property rather
>> than a new filesystem mount option?
>>
>> That way the filesystem can always do "the right thing" and call the
>> blkdev flush on fsync.
>>
>> The block device *could* choose to ignore this in hardware if it knows
>> it's built with a nonvolatile write cache or if it has no write cache.
>>
>> Somewhere in the middle, if an administrator knows they have a UPS they
>> trust and hardware that stays connected to it, they could tune the bdev
>> to ignore these flush requests.
>>
>> Also that way if you have 8 partitions on a battery-backed blockdev, you
>> can tune it once, instead of needing to mount all 8 filesystems with the
>> new option.
>
> The main reason I decided to go for the mount option approach is to be
> consistent with what we do when it comes to write barriers. Treating one
> as a mount option and the other as a (possibly) sysfs tunable property
> seems a bit confusing to me.
well... technically, I think barriers really *should* mean "don't
reorder these writes, I need them this way for consistency" - and that
is really specific to the fs implementation, isn't it? (we just happen
to implement them as cache flushes) and so that is a per-fs setting, I
think.
Maybe there is no good argument for ignoring barriers on one fs, and
implementing them on another, other than playing fast & loose &
dangerous.... hrm.
> Do you suggest using sysfs tunables instead?
For a per-bdev flush setting, yes...
I guess I'll have to try to convince myself one way or another whether
barrier mount options are consistent with this view. :)
I guess sometimes you do have workloads where you simply want speed, and
on a crash you start over. In this case you don't care about barriers
(ordering constraints - if you don't care about fs integrity if fsck or
re-mkfs is ok) or flushing (caches - if you don't care about data
integrity, you regenerate your results). That could vary from fs to fs....
I'm just a little leery of the "dangerous" mount option proliferation, I
guess.
-Eric
> - Fernando
>
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 6:20 ` Eric Sandeen
@ 2009-02-13 10:36 ` Fernando Luis Vázquez Cao
2009-02-13 12:20 ` Dave Chinner
1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-13 10:36 UTC (permalink / raw)
To: Eric Sandeen
Cc: Jan Kara, Theodore Tso, Alan Cox, Pavel Machek, kernel list,
Jens Axboe, fernando, Ric Wheeler
On Fri, 2009-02-13 at 00:20 -0600, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
> >> Fernando Luis Vázquez Cao wrote:
> >>> This mount flag will be used to determine whether the block device's write
> >>> cache should be flush or not on fsync()/fdatasync().
> >>>
> >>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> >>> ---
> >> Again, apologies for chiming in late.
> >>
> >> But wouldn't it be better to make this a block device property rather
> >> than a new filesystem mount option?
> >>
> >> That way the filesystem can always do "the right thing" and call the
> >> blkdev flush on fsync.
> >>
> >> The block device *could* choose to ignore this in hardware if it knows
> >> it's built with a nonvolatile write cache or if it has no write cache.
> >>
> >> Somewhere in the middle, if an administrator knows they have a UPS they
> >> trust and hardware that stays connected to it, they could tune the bdev
> >> to ignore these flush requests.
> >>
> >> Also that way if you have 8 partitions on a battery-backed blockdev, you
> >> can tune it once, instead of needing to mount all 8 filesystems with the
> >> new option.
> >
> > The main reason I decided to go for the mount option approach is to be
> > consistent with what we do when it comes to write barriers. Treating one
> > as a mount option and the other as a (possibly) sysfs tunable property
> > seems a bit confusing to me.
>
> well... technically, I think barriers really *should* mean "don't
> reorder these writes, I need them this way for consistency" - and that
> is really specific to the fs implementation, isn't it? (we just happen
> to implement them as cache flushes) and so that is a per-fs setting, I
> think.
>
> Maybe there is no good argument for ignoring barriers on one fs, and
> implementing them on another, other than playing fast & loose &
> dangerous.... hrm.
>
> > Do you suggest using sysfs tunables instead?
>
> For a per-bdev flush setting, yes...
>
> I guess I'll have to try to convince myself one way or another whether
> barrier mount options are consistent with this view. :)
I went through the same process :), and finally concluded that in both
cases it all comes down a trade-off between between integrity (be it
filesystem integrity or data integrity) and speed. Since the desired
behavior could vary from filesystem to filesystem and for the sake of
consistency with barriers the mount option approach seemed to make more
sense.
> I guess sometimes you do have workloads where you simply want speed, and
> on a crash you start over. In this case you don't care about barriers
> (ordering constraints - if you don't care about fs integrity if fsck or
> re-mkfs is ok) or flushing (caches - if you don't care about data
> integrity, you regenerate your results). That could vary from fs to fs....
>
> I'm just a little leery of the "dangerous" mount option proliferation, I
> guess.
If we can have distros make "barrier=1,flushonfsync" be the default
setting and document these two mount options properly by explicitly
indicating the dangers of deviating from these defaults, I think we'll
be heading in the right direction.
By the way, apart from the issue of whether flushonsync should be a
mount option or sysfs tunable is there any other issue with the patches?
Jan, have I addressed all your concerns?
Regards,
Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 6:20 ` Eric Sandeen
2009-02-13 10:36 ` Fernando Luis Vázquez Cao
@ 2009-02-13 12:20 ` Dave Chinner
2009-02-13 16:29 ` Fernando Luis Vazquez Cao
1 sibling, 1 reply; 74+ messages in thread
From: Dave Chinner @ 2009-02-13 12:20 UTC (permalink / raw)
To: Eric Sandeen
Cc: Fernando Luis Vázquez Cao, Jan Kara, Theodore Tso, Alan Cox,
Pavel Machek, kernel list, Jens Axboe, fernando, Ric Wheeler
On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> I'm just a little leery of the "dangerous" mount option proliferation, I
> guess.
You're not the only one, Eric. It's bad enough having to explain to
users what barriers do once they have lost data after a power loss,
let alone confusing them further by adding more mount options they
will get wrong by accident....
Quite frankly, the VFS should do stuff that is slow and safe
and filesystems can choose to ignore the VFS (via filesystem
specific mount options) if they want to be fast and potentially
unsafe.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 12:20 ` Dave Chinner
@ 2009-02-13 16:29 ` Fernando Luis Vazquez Cao
2009-02-14 11:24 ` Dave Chinner
0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vazquez Cao @ 2009-02-13 16:29 UTC (permalink / raw)
To: Dave Chinner
Cc: Eric Sandeen, Fernando Luis Vázquez Cao, Jan Kara,
Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
Ric Wheeler
On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > I'm just a little leery of the "dangerous" mount option proliferation, I
> > guess.
>
> You're not the only one, Eric. It's bad enough having to explain to
> users what barriers do once they have lost data after a power loss,
> let alone confusing them further by adding more mount options they
> will get wrong by accident....
That is precisely the reason why we should use sensible defaults, which
in this case means enabling barriers and flushing disk caches on
fsync()/fdatasync() by default.
Adding either a new mount option (as you yourself suggest below) or a
sysfs tunable is desirable for those cases when we really do not need to
flush the disk write cache to guarantee integrity (battery-backed block
devices come to mind), or we want to be fast at the cost of potentially
losing some data.
> Quite frankly, the VFS should do stuff that is slow and safe
> and filesystems can choose to ignore the VFS (via filesystem
> specific mount options) if they want to be fast and potentially
> unsafe.
To avoid unnecessary flushes and allow for filesystem-specific
optimizations I was considering the following approach:
1- Add flushonfsync mount option (as an aside, I am of the opinion that
it should be set by default).
2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
flushes the underlying device accordingly. With this we would cover all
filesystems that use the vfs-provided file_fsync() as their fsync method
(commonly used filesystems such as fat fall in this group).
3- Advanced filesystems (ext3/4, XFS, btrfs, etc) which provide their
own fsync implementations are allowed to perform filesystem-specific
optimizations there to minimize the number of flushes and maximize
throughput.
In this patch set I implemented (1) and (3) for ext3/4 to have some code
to comment on.
Does this approach make sense? Thoughts?
- Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-13 16:29 ` Fernando Luis Vazquez Cao
@ 2009-02-14 11:24 ` Dave Chinner
2009-02-14 13:03 ` Fernando Luis Vázquez Cao
0 siblings, 1 reply; 74+ messages in thread
From: Dave Chinner @ 2009-02-14 11:24 UTC (permalink / raw)
To: Fernando Luis Vazquez Cao
Cc: Eric Sandeen, Fernando Luis Vázquez Cao, Jan Kara,
Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
Ric Wheeler
On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > guess.
> >
> > You're not the only one, Eric. It's bad enough having to explain to
> > users what barriers do once they have lost data after a power loss,
> > let alone confusing them further by adding more mount options they
> > will get wrong by accident....
>
> That is precisely the reason why we should use sensible defaults, which
> in this case means enabling barriers and flushing disk caches on
> fsync()/fdatasync() by default.
>
> Adding either a new mount option (as you yourself suggest below) or a
> sysfs tunable is desirable for those cases when we really do not need to
> flush the disk write cache to guarantee integrity (battery-backed block
> devices come to mind), or we want to be fast at the cost of potentially
> losing some data.
Mount options are the wrong place for this. if you want to change
the behaviour of the block device, then it should be at that level.
> > Quite frankly, the VFS should do stuff that is slow and safe
> > and filesystems can choose to ignore the VFS (via filesystem
> > specific mount options) if they want to be fast and potentially
> > unsafe.
>
> To avoid unnecessary flushes and allow for filesystem-specific
> optimizations I was considering the following approach:
>
> 1- Add flushonfsync mount option (as an aside, I am of the opinion that
> it should be set by default).
No mount option - too confusing for someone to work out what
combination of barriers and flushing for things to work correctly.
Just make filesystems issue the necessary flush calls or barrier
IOs and allow the block devices to ignore flushes.
> 2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
> flushes the underlying device accordingly. With this we would cover all
> filesystems that use the vfs-provided file_fsync() as their fsync method
> (commonly used filesystems such as fat fall in this group).
Just make it flush the block device.
> 3- Advanced filesystems (ext3/4, XFS, btrfs, etc) which provide their
> own fsync implementations are allowed to perform filesystem-specific
> optimizations there to minimize the number of flushes and maximize
> throughput.
Um, you are describing what we already have in place. Almost every
filesystem provides it's own ->fsync method, not just the "advanced"
ones. It is those methods that need to be fixed to issue flushes, not just
file_fsync().
> In this patch set I implemented (1) and (3) for ext3/4 to have some code
> to comment on.
I don't think we want (1) at all, and I thought that if ext3/4 are using
barriers then the barrier I/O issued by the journal does the flush
already. Hence (3) is redundant, right?
FWIW, block device flushes are implemented by barrier IOs, so if
the underlying block device doesn't support barriers then you can't
flush the cache, either...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-14 11:24 ` Dave Chinner
@ 2009-02-14 13:03 ` Fernando Luis Vázquez Cao
2009-02-14 13:19 ` Fernando Luis Vázquez Cao
2009-02-15 2:48 ` Dave Chinner
0 siblings, 2 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-14 13:03 UTC (permalink / raw)
To: Dave Chinner
Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler
On Sat, 2009-02-14 at 22:24 +1100, Dave Chinner wrote:
> On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> > On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > > guess.
> > >
> > > You're not the only one, Eric. It's bad enough having to explain to
> > > users what barriers do once they have lost data after a power loss,
> > > let alone confusing them further by adding more mount options they
> > > will get wrong by accident....
> >
> > That is precisely the reason why we should use sensible defaults, which
> > in this case means enabling barriers and flushing disk caches on
> > fsync()/fdatasync() by default.
> >
> > Adding either a new mount option (as you yourself suggest below) or a
> > sysfs tunable is desirable for those cases when we really do not need to
> > flush the disk write cache to guarantee integrity (battery-backed block
> > devices come to mind), or we want to be fast at the cost of potentially
> > losing some data.
>
> Mount options are the wrong place for this. if you want to change
> the behaviour of the block device, then it should be at that level.
To be more precise, what we are trying to change is the behavior of
fsync()/fdatasync(), which users might want to change on a per-partition
basis. I guess this is the reason the barrier switch was made a mount
option, and I just wanted to be consistent.
My fear is that making one of them a mount option (barriers) and the
other a sysfs-tunable block device property (device flushes on fsync())
might end up creating more confusion than using a mount option for both.
Anyway, I do not have strong feelings on this issue and if there is
consensus I would be willing to change the patches so that flushonfsync
becomes a per block device tunable instead.
> > > Quite frankly, the VFS should do stuff that is slow and safe
> > > and filesystems can choose to ignore the VFS (via filesystem
> > > specific mount options) if they want to be fast and potentially
> > > unsafe.
> >
> > To avoid unnecessary flushes and allow for filesystem-specific
> > optimizations I was considering the following approach:
> >
> > 1- Add flushonfsync mount option (as an aside, I am of the opinion that
> > it should be set by default).
>
> No mount option - too confusing for someone to work out what
> combination of barriers and flushing for things to work correctly.
As I suggested in a previous email, it is just a matter of using a safe
combination by default so that users do not need to figure out anything.
> Just make filesystems issue the necessary flush calls or barrier IOs
"ext3: call blkdev_issue_flush on fsync" and "ext4: call
blkdev_issue_flush on fsync" in this patch set implement just that for
ext3/4.
> and allow the block devices to ignore flushes.
Wouldn't it make more sense to avoid sending bios down the block layer
which we can know in advance are going to be ignored by the block
device?
> > 2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
> > flushes the underlying device accordingly. With this we would cover all
> > filesystems that use the vfs-provided file_fsync() as their fsync method
> > (commonly used filesystems such as fat fall in this group).
>
> Just make it flush the block device.
I wrote a patch that does exactly that but, in addition, it checks
whether FLUSHONFSYNC is set to avoid sending unnecessary flushes down
the block layer (this patch is not included in this patch-set, but I
will add it in the next iteration).
As I mentioned above, if everyone thinks this small optimization not
elegant or an undesirable layering violation I will remove it.
> > 3- Advanced filesystems (ext3/4, XFS, btrfs, etc) which provide their
> > own fsync implementations are allowed to perform filesystem-specific
> > optimizations there to minimize the number of flushes and maximize
> > throughput.
>
> Um, you are describing what we already have in place. Almost every
> filesystem provides it's own ->fsync method, not just the "advanced"
> ones.
Yes, I know. There are some remarkable exceptions such as fat, though.
> It is those methods that need to be fixed to issue flushes, not just
> file_fsync().
Exactly, and this patch-set is my first attempt at that. For the first
submission I limited myself to fixing only ext3/4 so that I can get some
early feedback on my approach before moving forward.
> > In this patch set I implemented (1) and (3) for ext3/4 to have some code
> > to comment on.
>
> I don't think we want (1) at all, and I thought that if ext3/4 are using
> barriers then the barrier I/O issued by the journal does the flush
> already. Hence (3) is redundant, right?
No, it is no redundant because a barrier is not issued in all cases. The
aforementioned two patches fix ext3/4 by emitting a device flush only
when necessary (i.e. when a barrier would not be emitted).
My impression is that we all agree in the basic approach, the only point
of contention being whether filesystems/VFS should be allowed to
optimize out disk flushes when the user told the kernel to do so (be it
via a sysfs tunable or mount option).
Cheers,
Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-14 13:03 ` Fernando Luis Vázquez Cao
@ 2009-02-14 13:19 ` Fernando Luis Vázquez Cao
2009-02-15 2:48 ` Dave Chinner
1 sibling, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-14 13:19 UTC (permalink / raw)
To: Dave Chinner
Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler
On Sat, 2009-02-14 at 22:03 +0900, Fernando Luis Vázquez Cao wrote:
> > > 2- Modify file_fsync() so that it checks whether FLUSHONFSYNC is set and
> > > flushes the underlying device accordingly. With this we would cover all
> > > filesystems that use the vfs-provided file_fsync() as their fsync method
> > > (commonly used filesystems such as fat fall in this group).
> >
> > Just make it flush the block device.
>
> I wrote a patch that does exactly that but, in addition, it checks
> whether FLUSHONFSYNC is set to avoid sending unnecessary flushes down
> the block layer (this patch is not included in this patch-set, but I
> will add it in the next iteration).
I probably should have mentioned in my proposal all filesystems would
just call the helper function block_flush_device() to flush the
underlying block device, unconditionally.
This helper function looks like this:
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+ int ret = 0;
+
+ if (!(sb->s_flags & MS_FLUSHONFSYNC))
+ return ret;
+
+ ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+ if (ret == -EOPNOTSUPP)
+ return 0;
+
+ return ret;
+}
As you can see the check for flushonfsync is done here, so changing my
patches along the lines you suggest would be a trivial two lines patch.
- Fernando
> As I mentioned above, if everyone thinks this small optimization not
> elegant or an undesirable layering violation I will remove it.
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-14 13:03 ` Fernando Luis Vázquez Cao
2009-02-14 13:19 ` Fernando Luis Vázquez Cao
@ 2009-02-15 2:48 ` Dave Chinner
2009-02-15 7:11 ` Fernando Luis Vázquez Cao
1 sibling, 1 reply; 74+ messages in thread
From: Dave Chinner @ 2009-02-15 2:48 UTC (permalink / raw)
To: Fernando Luis Vázquez Cao
Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler
On Sat, Feb 14, 2009 at 10:03:53PM +0900, Fernando Luis Vázquez Cao wrote:
> On Sat, 2009-02-14 at 22:24 +1100, Dave Chinner wrote:
> > On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> > > On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > > > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > > > guess.
> > > >
> > > > You're not the only one, Eric. It's bad enough having to explain to
> > > > users what barriers do once they have lost data after a power loss,
> > > > let alone confusing them further by adding more mount options they
> > > > will get wrong by accident....
> > >
> > > That is precisely the reason why we should use sensible defaults, which
> > > in this case means enabling barriers and flushing disk caches on
> > > fsync()/fdatasync() by default.
> > >
> > > Adding either a new mount option (as you yourself suggest below) or a
> > > sysfs tunable is desirable for those cases when we really do not need to
> > > flush the disk write cache to guarantee integrity (battery-backed block
> > > devices come to mind), or we want to be fast at the cost of potentially
> > > losing some data.
> >
> > Mount options are the wrong place for this. if you want to change
> > the behaviour of the block device, then it should be at that level.
>
> To be more precise, what we are trying to change is the behavior of
> fsync()/fdatasync(), which users might want to change on a per-partition
> basis. I guess this is the reason the barrier switch was made a mount
> option, and I just wanted to be consistent.
This has no place in the kernel. Use LD_PRELOAD to make fsync() a
no-op.
> > No mount option - too confusing for someone to work out what
> > combination of barriers and flushing for things to work correctly.
>
> As I suggested in a previous email, it is just a matter of using a safe
> combination by default so that users do not need to figure out anything.
Too many users think that they need to specify everything rather
than rely on defaults...
> > Just make filesystems issue the necessary flush calls or barrier IOs
>
> "ext3: call blkdev_issue_flush on fsync" and "ext4: call
> blkdev_issue_flush on fsync" in this patch set implement just that for
> ext3/4.
>
> > and allow the block devices to ignore flushes.
>
> Wouldn't it make more sense to avoid sending bios down the block layer
> which we can know in advance are going to be ignored by the block
> device?
As soon as the block layer reports EOPNOTSUPPORTED to a barrier IO,
the filesystem will switch them off and not issue them anymore.
> > I don't think we want (1) at all, and I thought that if ext3/4 are using
> > barriers then the barrier I/O issued by the journal does the flush
> > already. Hence (3) is redundant, right?
>
> No, it is no redundant because a barrier is not issued in all cases. The
> aforementioned two patches fix ext3/4 by emitting a device flush only
> when necessary (i.e. when a barrier would not be emitted).
Then that is a filesystem fix, not something that requires VFS
modifications or new mount options....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: vfs: Add MS_FLUSHONFSYNC mount flag
2009-02-15 2:48 ` Dave Chinner
@ 2009-02-15 7:11 ` Fernando Luis Vázquez Cao
0 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-15 7:11 UTC (permalink / raw)
To: Dave Chinner
Cc: Fernando Luis Vazquez Cao, Eric Sandeen, Jan Kara, Theodore Tso,
Alan Cox, Pavel Machek, kernel list, Jens Axboe, Ric Wheeler
On Sun, 2009-02-15 at 13:48 +1100, Dave Chinner wrote:
> On Sat, Feb 14, 2009 at 10:03:53PM +0900, Fernando Luis Vázquez Cao wrote:
> > On Sat, 2009-02-14 at 22:24 +1100, Dave Chinner wrote:
> > > On Sat, Feb 14, 2009 at 01:29:28AM +0900, Fernando Luis Vazquez Cao wrote:
> > > > On Fri, 2009-02-13 at 23:20 +1100, Dave Chinner wrote:
> > > > > On Fri, Feb 13, 2009 at 12:20:17AM -0600, Eric Sandeen wrote:
> > > > > > I'm just a little leery of the "dangerous" mount option proliferation, I
> > > > > > guess.
> > > > >
> > > > > You're not the only one, Eric. It's bad enough having to explain to
> > > > > users what barriers do once they have lost data after a power loss,
> > > > > let alone confusing them further by adding more mount options they
> > > > > will get wrong by accident....
> > > >
> > > > That is precisely the reason why we should use sensible defaults, which
> > > > in this case means enabling barriers and flushing disk caches on
> > > > fsync()/fdatasync() by default.
> > > >
> > > > Adding either a new mount option (as you yourself suggest below) or a
> > > > sysfs tunable is desirable for those cases when we really do not need to
> > > > flush the disk write cache to guarantee integrity (battery-backed block
> > > > devices come to mind), or we want to be fast at the cost of potentially
> > > > losing some data.
> > >
> > > Mount options are the wrong place for this. if you want to change
> > > the behaviour of the block device, then it should be at that level.
> >
> > To be more precise, what we are trying to change is the behavior of
> > fsync()/fdatasync(), which users might want to change on a per-partition
> > basis. I guess this is the reason the barrier switch was made a mount
> > option, and I just wanted to be consistent.
>
> This has no place in the kernel. Use LD_PRELOAD to make fsync() a
> no-op.
The purpose of flushonfsync is not making fsync() a no-op and goes
beyond what we can currently achieve with LD_PRELOAD. For example, if we
send the data to disk but avoid flushing the block device's write cache
we can potentially improve I/O performance at the cost of compromising
data and filesystem integrity. This is a risk that those who play fast
and loose may want assume. By the way, sadly enough this is the way many
of the filesystems in Linus' tree behave. I just wanted to change this
situation by making all filesystems issue write-cache flushes by
default.
Some people suggested to leave a knob for those who wanted to revert to
the old behavior and I myself thought that it could make sense in some
cases so decided to add the tunable flushonsync.
If there is consensus flushonfsync should be a per-device tunable I am
more than willing to make it so. My goal is to fix all filesystem so
that they emit barriers and disk flushes when they should. flushonfsync
is just a nicety I added for those who, for whatever reason, still want
the old behavior.
For the next iteration of this patchset I will take out the contentious
bits and leave only the filesystem/VFS fixes so that we can move forward
while we discuss the propriety of adding a per-device or a
per-filesystem tunable such as flushonfsync to change the default (and
safe) behavior.
> > > No mount option - too confusing for someone to work out what
> > > combination of barriers and flushing for things to work correctly.
> >
> > As I suggested in a previous email, it is just a matter of using a safe
> > combination by default so that users do not need to figure out anything.
>
> Too many users think that they need to specify everything rather
> than rely on defaults...
Well that is their business. From my experience most admins in the field
do not stray from their enterprise-distro provided defaults.
> > > Just make filesystems issue the necessary flush calls or barrier IOs
> >
> > "ext3: call blkdev_issue_flush on fsync" and "ext4: call
> > blkdev_issue_flush on fsync" in this patch set implement just that for
> > ext3/4.
> >
> > > and allow the block devices to ignore flushes.
> >
> > Wouldn't it make more sense to avoid sending bios down the block layer
> > which we can know in advance are going to be ignored by the block
> > device?
>
> As soon as the block layer reports EOPNOTSUPPORTED to a barrier IO,
> the filesystem will switch them off and not issue them anymore.
Yes, that certainly makes sense. But the point in discussion is whether
users should be allowed to switch them off (it arguably makes sense in
some scenarios). I am afraid that some users will not be happy if we do
not leave the door open for them to revert to the old behavior.
> > > I don't think we want (1) at all, and I thought that if ext3/4 are using
> > > barriers then the barrier I/O issued by the journal does the flush
> > > already. Hence (3) is redundant, right?
> >
> > No, it is no redundant because a barrier is not issued in all cases. The
> > aforementioned two patches fix ext3/4 by emitting a device flush only
> > when necessary (i.e. when a barrier would not be emitted).
>
> Then that is a filesystem fix, not something that requires VFS
> modifications or new mount options....
Yup, as mentioned above flushonfsync is just a nicety I added to the
second iteration of this patchset and is independent from the filesystem
fixes.
Regards,
Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* util-linux: Add new mount options flushonfsync and noflushonfsync to mount
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
2009-02-12 10:35 ` vfs: Improve readability off mount flag definitins by using offsets Fernando Luis Vázquez Cao
2009-02-12 10:36 ` vfs: Add MS_FLUSHONFSYNC mount flag Fernando Luis Vázquez Cao
@ 2009-02-12 10:37 ` Fernando Luis Vázquez Cao
2009-02-12 10:38 ` util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page Fernando Luis Vázquez Cao
` (3 subsequent siblings)
6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:37 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
This mount flag will be used to determine whether the block device's write
cache should be flush or not on fsync()/fdatasync().
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urp mount-orig/mount.c mount/mount.c
--- mount-orig/mount.c 2009-01-29 15:50:50.000000000 +0900
+++ mount/mount.c 2009-01-29 14:56:19.000000000 +0900
@@ -185,6 +185,13 @@ static const struct opt_map opt_map[] =
{ "norelatime", 0, 1, MS_RELATIME }, /* Update access time without regard
to mtime/ctime */
#endif
+#ifdef MS_FLUSHONFSYNC
+ { "flushonfsync", 0, 0, MS_FLUSHONFSYNC }, /* Force block device flush on
+ fsync()/fdatasync() */
+ { "noflushonfsync", 0, 1, MS_FLUSHONFSYNC }, /* Do not force block device
+ flush on
+ fsync()/fdatasync() */
+#endif
{ "nofail", 0, 0, MS_COMMENT}, /* Do not fail if ENOENT on dev */
{ NULL, 0, 0, 0 }
};
diff -urp mount-orig/mount_constants.h mount/mount_constants.h
--- mount-orig/mount_constants.h 2008-04-22 17:59:53.000000000 +0900
+++ mount/mount_constants.h 2009-01-29 13:40:42.000000000 +0900
@@ -56,6 +56,10 @@
#ifndef MS_SHARED
#define MS_SHARED (1<<20) /* 1048576 Shared*/
#endif
+#ifndef MS_FLUSHONFSYNC
+#define MS_FLUSHONFSYNC (1<<24) /* Force block device flush on
+ fsync()/fdatasync() */
+#endif
/*
* Magic mount flag number. Had to be or-ed to the flag values.
*/
^ permalink raw reply [flat|nested] 74+ messages in thread
* util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
` (2 preceding siblings ...)
2009-02-12 10:37 ` util-linux: Add new mount options flushonfsync and noflushonfsync to mount Fernando Luis Vázquez Cao
@ 2009-02-12 10:38 ` Fernando Luis Vázquez Cao
2009-02-12 10:38 ` block: Add block_flush_device() Fernando Luis Vázquez Cao
` (2 subsequent siblings)
6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:38 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urp mount-orig/mount.8 mount/mount.8
--- mount-orig/mount.8 2009-01-29 15:50:50.000000000 +0900
+++ mount/mount.8 2009-01-29 15:40:39.000000000 +0900
@@ -649,6 +649,18 @@ current modify or change time. (Similar
mutt or other applications that need to know if a file has been read
since the last time it was modified.)
.TP
+.B flushonfsync
+Force block device flush on fsync()/fdatasync(). When using storage equipped
+with write caches that can be explicitly flushed by the kernel, this guarantees
+that the data/metadata actually made it to the physical media after invoking
+the respective file data/metadata synchronization system call.
+.TP
+.B noflushonfsync
+Do not force a flush of the backing device's write cache when a file
+data/metadata synchronization system call is executed. Unless the device is
+equipped with a battery-backup write cache, this behavior can cause data loss
+and/or file system corruption in the event of a power failure.
+.TP
.B noauto
Can only be mounted explicitly (i.e., the
.B \-a
^ permalink raw reply [flat|nested] 74+ messages in thread
* block: Add block_flush_device()
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
` (3 preceding siblings ...)
2009-02-12 10:38 ` util-linux: Add explanation for new mount options flushonfsync and noflushonfsync to mount(8) man page Fernando Luis Vázquez Cao
@ 2009-02-12 10:38 ` Fernando Luis Vázquez Cao
2009-02-12 10:39 ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
2009-02-12 10:40 ` ext4: " Fernando Luis Vázquez Cao
6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:38 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc3-orig/fs/buffer.c linux-2.6.29-rc3/fs/buffer.c
--- linux-2.6.29-rc3-orig/fs/buffer.c 2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/fs/buffer.c 2009-01-29 20:11:29.000000000 +0900
@@ -165,6 +165,24 @@ void end_buffer_write_sync(struct buffer
put_bh(bh);
}
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+ int ret = 0;
+
+ if (!(sb->s_flags & MS_FLUSHONFSYNC))
+ return ret;
+
+ ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+ if (ret == -EOPNOTSUPP)
+ return 0;
+
+ return ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
+
/*
* Write out and wait upon all the dirty data associated with a block
* device via its mapping. Does not take the superblock lock.
diff -urNp linux-2.6.29-rc3-orig/include/linux/buffer_head.h linux-2.6.29-rc3/include/linux/buffer_head.h
--- linux-2.6.29-rc3-orig/include/linux/buffer_head.h 2009-01-29 09:47:51.000000000 +0900
+++ linux-2.6.29-rc3/include/linux/buffer_head.h 2009-01-29 19:26:33.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);
void buffer_init(void);
^ permalink raw reply [flat|nested] 74+ messages in thread
* ext3: call blkdev_issue_flush on fsync
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
` (4 preceding siblings ...)
2009-02-12 10:38 ` block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-02-12 10:39 ` Fernando Luis Vázquez Cao
2009-02-12 10:40 ` ext4: " Fernando Luis Vázquez Cao
6 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:39 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc4-orig/fs/ext3/fsync.c linux-2.6.29-rc4/fs/ext3/fsync.c
--- linux-2.6.29-rc4-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc4/fs/ext3/fsync.c 2009-02-12 19:20:23.000000000 +0900
@@ -45,6 +45,8 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,34 @@ int ext3_sync_file(struct file * file, s
*/
if (ext3_should_journal_data(inode)) {
ret = ext3_force_commit(inode->i_sb);
+ if (!(journal->j_flags & JFS_BARRIER))
+ goto no_journal_barrier;
goto out;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC))
+ goto flush_blkdev;
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (journal && !(journal->j_flags & JFS_BARRIER))
+ goto no_journal_barrier;
+ goto out;
}
+
+flush_blkdev:
+ if (!(i_state & I_DIRTY_PAGES))
+ goto out;
+no_journal_barrier:
+ block_flush_device(inode->i_sb);
out:
return ret;
}
^ permalink raw reply [flat|nested] 74+ messages in thread
* ext4: call blkdev_issue_flush on fsync
2009-02-12 10:33 ` Fernando Luis Vázquez Cao
` (5 preceding siblings ...)
2009-02-12 10:39 ` ext3: call blkdev_issue_flush on fsync Fernando Luis Vázquez Cao
@ 2009-02-12 10:40 ` Fernando Luis Vázquez Cao
2009-02-15 22:46 ` Theodore Tso
6 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-12 10:40 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc4-orig/fs/ext4/fsync.c linux-2.6.29-rc4/fs/ext4/fsync.c
--- linux-2.6.29-rc4-orig/fs/ext4/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc4/fs/ext4/fsync.c 2009-02-12 19:22:28.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,34 @@ int ext4_sync_file(struct file *file, st
*/
if (ext4_should_journal_data(inode)) {
ret = ext4_force_commit(inode->i_sb);
+ if (!(journal->j_flags & JBD2_BARRIER))
+ goto no_journal_barrier;
goto out;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC))
+ goto flush_blkdev;
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
- if (journal && (journal->j_flags & JBD2_BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ if (journal && !(journal->j_flags & JBD2_BARRIER))
+ goto no_journal_barrier;
+ goto out;
}
+
+flush_blkdev:
+ if (!(i_state & I_DIRTY_PAGES))
+ goto out;
+no_journal_barrier:
+ block_flush_device(inode->i_sb);
out:
return ret;
}
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: ext4: call blkdev_issue_flush on fsync
2009-02-12 10:40 ` ext4: " Fernando Luis Vázquez Cao
@ 2009-02-15 22:46 ` Theodore Tso
2009-02-16 7:09 ` Fernando Luis Vázquez Cao
0 siblings, 1 reply; 74+ messages in thread
From: Theodore Tso @ 2009-02-15 22:46 UTC (permalink / raw)
To: Fernando Luis Vázquez Cao
Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
On Thu, Feb 12, 2009 at 07:40:45PM +0900, Fernando Luis Vázquez Cao wrote:
> @@ -76,25 +77,34 @@ int ext4_sync_file(struct file *file, st
> */
> if (ext4_should_journal_data(inode)) {
> ret = ext4_force_commit(inode->i_sb);
> + if (!(journal->j_flags & JBD2_BARRIER))
> + goto no_journal_barrier;
> goto out;
> }
All of these goto statements makes it one gigantic pile of spaghetti.
The code will be much more understable if you do:
if (!(journal->j_flags & JBD2_BARRIER))
block_flush_device(inode->i_sb);
return ret;
>
> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> - goto out;
> + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> + goto flush_blkdev;
>
Maybe instead:
if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
if (i_state & I_DIRTY_PAGES)
block_flush_device(inode->i_sb);
return ret;
}
> - if (journal && (journal->j_flags & JBD2_BARRIER))
> - blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> + if (journal && !(journal->j_flags & JBD2_BARRIER))
> + goto no_journal_barrier;
> + goto out;
Maybe instead:
if (journal && !(journal->j_flags & JBD2_BARRIER))
block_flush_device(inode->i_sb);
return ret;
}
I'm not a fanatic about eliminating all goto's, but "goto out" which
could be replaced by "return ret;" is just silly.
- Ted
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: ext4: call blkdev_issue_flush on fsync
2009-02-15 22:46 ` Theodore Tso
@ 2009-02-16 7:09 ` Fernando Luis Vázquez Cao
2009-02-16 7:25 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
` (2 more replies)
0 siblings, 3 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16 7:09 UTC (permalink / raw)
To: Theodore Tso
Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando
On Sun, 2009-02-15 at 17:46 -0500, Theodore Tso wrote:
> On Thu, Feb 12, 2009 at 07:40:45PM +0900, Fernando Luis Vázquez Cao wrote:
> > @@ -76,25 +77,34 @@ int ext4_sync_file(struct file *file, st
> > */
> > if (ext4_should_journal_data(inode)) {
> > ret = ext4_force_commit(inode->i_sb);
> > + if (!(journal->j_flags & JBD2_BARRIER))
> > + goto no_journal_barrier;
> > goto out;
> > }
>
> All of these goto statements makes it one gigantic pile of spaghetti.
> The code will be much more understable if you do:
>
> if (!(journal->j_flags & JBD2_BARRIER))
> block_flush_device(inode->i_sb);
> return ret;
>
> >
> > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > - goto out;
> > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > + goto flush_blkdev;
> >
>
> Maybe instead:
> if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
> if (i_state & I_DIRTY_PAGES)
> block_flush_device(inode->i_sb);
> return ret;
> }
>
>
> > - if (journal && (journal->j_flags & JBD2_BARRIER))
> > - blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
> > + if (journal && !(journal->j_flags & JBD2_BARRIER))
> > + goto no_journal_barrier;
> > + goto out;
>
> Maybe instead:
> if (journal && !(journal->j_flags & JBD2_BARRIER))
> block_flush_device(inode->i_sb);
> return ret;
> }
>
> I'm not a fanatic about eliminating all goto's, but "goto out" which
> could be replaced by "return ret;" is just silly.
Yes, you are right. I just wanted to follow the current style of the
code, but I got carried away and took things a bit too far.
I'll be replying to this email with a new iteration of the patch-set
that leaves out the potentially conflictive bits and should be more
readable.
Regards,
Fernando
^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH 1/3] block: Add block_flush_device()
2009-02-16 7:09 ` Fernando Luis Vázquez Cao
@ 2009-02-16 7:25 ` Fernando Luis Vázquez Cao
2009-02-16 7:29 ` [2/3] ext3: call block_flush_device() on fsync Fernando Luis Vázquez Cao
2009-02-16 7:31 ` [PATCH 3/3] ext4: " Fernando Luis Vázquez Cao
2 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16 7:25 UTC (permalink / raw)
To: Theodore Tso
Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando, rwheeler, linux-fsdevel
This patch adds a helper function that should be used by filesystems that need
to flush the underlying block device on fsync()/fdatasync().
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc4-orig/fs/buffer.c linux-2.6.29-rc4/fs/buffer.c
--- linux-2.6.29-rc4-orig/fs/buffer.c 2009-02-16 14:45:11.000000000 +0900
+++ linux-2.6.29-rc4/fs/buffer.c 2009-02-16 14:53:26.000000000 +0900
@@ -165,6 +165,17 @@ void end_buffer_write_sync(struct buffer
put_bh(bh);
}
+/* Issue flush of write caches on the block device */
+int block_flush_device(struct super_block *sb)
+{
+ int ret = 0;
+
+ ret = blkdev_issue_flush(sb->s_bdev, NULL);
+
+ return (ret == -EOPNOTSUPP) ? 0 : ret;
+}
+EXPORT_SYMBOL(block_flush_device);
+
/*
* Write out and wait upon all the dirty data associated with a block
* device via its mapping. Does not take the superblock lock.
diff -urNp linux-2.6.29-rc4-orig/include/linux/buffer_head.h linux-2.6.29-rc4/include/linux/buffer_head.h
--- linux-2.6.29-rc4-orig/include/linux/buffer_head.h 2009-02-16 14:45:12.000000000 +0900
+++ linux-2.6.29-rc4/include/linux/buffer_head.h 2009-02-16 14:48:28.000000000 +0900
@@ -238,6 +238,7 @@ int nobh_write_end(struct file *, struct
int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);
+int block_flush_device(struct super_block *sb);
void buffer_init(void);
^ permalink raw reply [flat|nested] 74+ messages in thread
* [2/3] ext3: call block_flush_device() on fsync
2009-02-16 7:09 ` Fernando Luis Vázquez Cao
2009-02-16 7:25 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
@ 2009-02-16 7:29 ` Fernando Luis Vázquez Cao
2009-02-16 7:31 ` [PATCH 3/3] ext4: " Fernando Luis Vázquez Cao
2 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16 7:29 UTC (permalink / raw)
To: Theodore Tso
Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando, rwheeler, linux-fsdevel
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc5-orig/fs/ext3/fsync.c linux-2.6.29-rc5/fs/ext3/fsync.c
--- linux-2.6.29-rc5-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc5/fs/ext3/fsync.c 2009-02-16 15:56:05.000000000 +0900
@@ -45,6 +45,8 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -69,23 +71,30 @@ int ext3_sync_file(struct file * file, s
*/
if (ext3_should_journal_data(inode)) {
ret = ext3_force_commit(inode->i_sb);
- goto out;
+ if (!(journal->j_flags & JFS_BARRIER))
+ block_flush_device(inode->i_sb);
+ return ret;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+ if (i_state & I_DIRTY_PAGES)
+ block_flush_device(inode->i_sb);
+ return ret;
+ }
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (journal && !(journal->j_flags & JFS_BARRIER))
+ block_flush_device(inode->i_sb);
}
-out:
+
return ret;
}
^ permalink raw reply [flat|nested] 74+ messages in thread
* [PATCH 3/3] ext4: call block_flush_device() on fsync
2009-02-16 7:09 ` Fernando Luis Vázquez Cao
2009-02-16 7:25 ` [PATCH 1/3] block: Add block_flush_device() Fernando Luis Vázquez Cao
2009-02-16 7:29 ` [2/3] ext3: call block_flush_device() on fsync Fernando Luis Vázquez Cao
@ 2009-02-16 7:31 ` Fernando Luis Vázquez Cao
2 siblings, 0 replies; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2009-02-16 7:31 UTC (permalink / raw)
To: Theodore Tso
Cc: Jan Kara, Alan Cox, Pavel Machek, kernel list, Jens Axboe,
sandeen, fernando, rwheeler, linux-fsdevel
To ensure that bits are truly on-disk after an fsync or fdatasync, we
should force a disk flush explicitly when there is dirty data/metadata
and the journal didn't emit a write barrier (either because metadata is
not being synched or barriers are disabled).
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
---
diff -urNp linux-2.6.29-rc5-orig/fs/ext4/fsync.c linux-2.6.29-rc5/fs/ext4/fsync.c
--- linux-2.6.29-rc5-orig/fs/ext4/fsync.c 2008-12-25 08:26:37.000000000 +0900
+++ linux-2.6.29-rc5/fs/ext4/fsync.c 2009-02-16 15:52:56.000000000 +0900
@@ -48,6 +48,7 @@ int ext4_sync_file(struct file *file, st
{
struct inode *inode = dentry->d_inode;
journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ unsigned long i_state = inode->i_state;
int ret = 0;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -76,25 +77,30 @@ int ext4_sync_file(struct file *file, st
*/
if (ext4_should_journal_data(inode)) {
ret = ext4_force_commit(inode->i_sb);
- goto out;
+ if (!(journal->j_flags & JBD2_BARRIER))
+ block_flush_device(inode->i_sb);
+ return ret;
}
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- goto out;
+ if (datasync && !(i_state & I_DIRTY_DATASYNC)) {
+ if (i_state & I_DIRTY_PAGES)
+ block_flush_device(inode->i_sb);
+ return ret;
+ }
/*
* The VFS has written the file data. If the inode is unaltered
* then we need not start a commit.
*/
- if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
+ if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
- if (journal && (journal->j_flags & JBD2_BARRIER))
- blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
+ if (journal && !(journal->j_flags & JBD2_BARRIER))
+ block_flush_device(inode->i_sb);
}
-out:
+
return ret;
}
^ permalink raw reply [flat|nested] 74+ messages in thread