All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
@ 2019-04-09 11:49 Amir Goldstein
  2019-04-09 15:43 ` Jan Kara
  2019-04-17  5:45 ` [PATCH v4] " Amir Goldstein
  0 siblings, 2 replies; 10+ messages in thread
From: Amir Goldstein @ 2019-04-09 11:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, Al Viro, linux-fsdevel, linux-kernel

Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. Is is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
the reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
requested to wait for in-flight IO as well as to writeback of dirty
pages.  sync_file_range(2) man page describes this flag combintaion as
"write-for-data-integrity operation", although some may argue against
this definition.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
writeback, to perform the range sync request.

One intended use case for this API is creating a dependency between
a new file's content and its link into the namepsace without requiring
journal commit and flushing of disk volatile caches:

  fd = open(".", O_TMPFILE | O_RDWR);
  write(fd, newdata, count);
  sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
  linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);

For many local filesystems, ext4 and xfs included, the sequence above
will guaranty that after crash, either 'newfile' exists with 'newdata'
content or 'newfile' does not exists.  For some applications, this
guaranty is strong enough and the effects of sync_file_range(2), even
with WB_SYNC_ALL, are far less intrusive to other writers in the system
than the effects of fdatasync(2).

Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi folks,

I was debating with myself whether I should change behavior of
sync_file_range(2) do to what documentation says it does and flag names
suggest that they do or to introduce a new flag for the old/new
behavior.

At the moment, myself won with restoring old behavior without a new
flag, but with re-branding of existing flags. I do not see how there
could be users relying on existing behavior (TM).

An argument in favor of new flag is that with existing flags,
application doesn't know if kernel provides the guaranty or not.

An argument is favor of existing flags is to not further complicate the
man page, that is complicated enough as it is.
With this patch in upstream and stable kernels, the only thing I would add to
man page is:

  SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
+ (or SYNC_FILE_RANGE_WRITE_AND_WAIT)
   This is a write-for-data-integrity operation that will ensure that all pages
   in the specified range which were dirty when sync_file_range() was called are
-  committed to disk.
+  written to disk.  It should be noted that disk caches are not flushed by this
+  call, so there are no guarantees here that the data will be available on disk
+  after a crash.

Thought?

Amir.


 fs/sync.c               | 25 ++++++++++++++++++-------
 include/uapi/linux/fs.h |  3 +++
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..5cf6fdbae4de 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -18,8 +18,8 @@
 #include <linux/backing-dev.h>
 #include "internal.h"
 
-#define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
-			SYNC_FILE_RANGE_WAIT_AFTER)
+#define VALID_FLAGS (SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AND_WAIT | \
+		     SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
  * Do the filesystem syncing work. For simple filesystems
@@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 }
 
 /*
- * sys_sync_file_range() permits finely controlled syncing over a segment of
+ * ksys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
- * zero then sys_sync_file_range() will operate from offset out to EOF.
+ * zero then ksys_sync_file_range() will operate from offset out to EOF.
  *
  * The flag bits are:
  *
@@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * Useful combinations of the flag bits are:
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
- * in the range which were dirty on entry to sys_sync_file_range() are placed
+ * in the range which were dirty on entry to ksys_sync_file_range() are placed
  * under writeout.  This is a start-write-for-data-integrity operation.
  *
  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
@@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
  * for that operation to complete and to return the result.
  *
- * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
+ * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
+ * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
  * a traditional sync() operation.  This is a write-for-data-integrity operation
  * which will ensure that all pages in the range which were dirty on entry to
- * sys_sync_file_range() are committed to disk.
+ * ksys_sync_file_range() are written to disk.  It should be noted that disk
+ * caches are not flushed by this call, so there are no guarantees here that the
+ * data will be available on disk after a crash.
  *
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
@@ -338,6 +341,14 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 
 	mapping = f.file->f_mapping;
 	ret = 0;
+	if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
+		     SYNC_FILE_RANGE_WRITE_AND_WAIT) {
+		/* Unlike SYNC_FILE_RANGE_WRITE alone uses WB_SYNC_ALL */
+		ret = filemap_write_and_wait_range(mapping, offset, endbyte);
+		if (ret < 0)
+			goto out_put;
+	}
+
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
 		ret = file_fdatawait_range(f.file, offset, endbyte);
 		if (ret < 0)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 121e82ce296b..59c71fa8c553 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -320,6 +320,9 @@ struct fscrypt_key {
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
+#define SYNC_FILE_RANGE_WRITE_AND_WAIT	(SYNC_FILE_RANGE_WRITE | \
+					 SYNC_FILE_RANGE_WAIT_BEFORE | \
+					 SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
  * Flags for preadv2/pwritev2:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-09 11:49 [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback Amir Goldstein
@ 2019-04-09 15:43 ` Jan Kara
  2019-04-09 18:01   ` Amir Goldstein
  2019-04-17  5:45 ` [PATCH v4] " Amir Goldstein
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kara @ 2019-04-09 15:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Dave Chinner, Al Viro, linux-fsdevel, linux-kernel

On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> writeback") claims that sync_file_range(2) syscall was "created for
> userspace to be able to issue background writeout and so waiting for
> in-flight IO is undesirable there" and changes the writeback (back) to
> WB_SYNC_NONE.
> 
> This claim is only partially true. Is is true for users that use the flag
> SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> the reason for changing to WB_SYNC_NONE writeback.
> 
> However, that claim is not true for users that use that flag combination
> SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> requested to wait for in-flight IO as well as to writeback of dirty
> pages.  sync_file_range(2) man page describes this flag combintaion as
> "write-for-data-integrity operation", although some may argue against
> this definition.
> 
> Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> writeback, to perform the range sync request.
> 
> One intended use case for this API is creating a dependency between
> a new file's content and its link into the namepsace without requiring
> journal commit and flushing of disk volatile caches:
> 
>   fd = open(".", O_TMPFILE | O_RDWR);
>   write(fd, newdata, count);
>   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
>   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> 
> For many local filesystems, ext4 and xfs included, the sequence above
> will guaranty that after crash, either 'newfile' exists with 'newdata'
> content or 'newfile' does not exists.  For some applications, this
> guaranty is strong enough and the effects of sync_file_range(2), even
> with WB_SYNC_ALL, are far less intrusive to other writers in the system
> than the effects of fdatasync(2).

I agree that this paragraph is true but I don't want any userspace program
rely on this. We've been through this when ext3 got converted to ext4 and
it has caused a lot of complaints. Ext3 had provided rather strong data vs
metadata ordering due to the way journalling was implemented. Then ext4
came, implemented delayed allocation and somewhat changed how journalling
works and suddently userspace programmers were caught by surprise their code
working by luck stopped working. And they were complaining that when their
code worked without fsync(2) before, it should work after it as well. So it
took a lot of explaining that their applications are broken and Ted has
also implemented some workarounds to make at least the most common mistakes
silently fixed by the kernel most of the time.

So I'm against providing 90% data integrity guarantees because there'll be
many people who'll think they can get away with it and then complain when
they won't once our implementation changes.

Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
| SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
write-for-data-integrity.
								Honza

> 
> Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Hi folks,
> 
> I was debating with myself whether I should change behavior of
> sync_file_range(2) do to what documentation says it does and flag names
> suggest that they do or to introduce a new flag for the old/new
> behavior.
> 
> At the moment, myself won with restoring old behavior without a new
> flag, but with re-branding of existing flags. I do not see how there
> could be users relying on existing behavior (TM).
> 
> An argument in favor of new flag is that with existing flags,
> application doesn't know if kernel provides the guaranty or not.
> 
> An argument is favor of existing flags is to not further complicate the
> man page, that is complicated enough as it is.
> With this patch in upstream and stable kernels, the only thing I would add to
> man page is:
> 
>   SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
> + (or SYNC_FILE_RANGE_WRITE_AND_WAIT)
>    This is a write-for-data-integrity operation that will ensure that all pages
>    in the specified range which were dirty when sync_file_range() was called are
> -  committed to disk.
> +  written to disk.  It should be noted that disk caches are not flushed by this
> +  call, so there are no guarantees here that the data will be available on disk
> +  after a crash.
> 
> Thought?
> 
> Amir.
> 
> 
>  fs/sync.c               | 25 ++++++++++++++++++-------
>  include/uapi/linux/fs.h |  3 +++
>  2 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/sync.c b/fs/sync.c
> index b54e0541ad89..5cf6fdbae4de 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -18,8 +18,8 @@
>  #include <linux/backing-dev.h>
>  #include "internal.h"
>  
> -#define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
> -			SYNC_FILE_RANGE_WAIT_AFTER)
> +#define VALID_FLAGS (SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AND_WAIT | \
> +		     SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER)
>  
>  /*
>   * Do the filesystem syncing work. For simple filesystems
> @@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>  }
>  
>  /*
> - * sys_sync_file_range() permits finely controlled syncing over a segment of
> + * ksys_sync_file_range() permits finely controlled syncing over a segment of
>   * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
> - * zero then sys_sync_file_range() will operate from offset out to EOF.
> + * zero then ksys_sync_file_range() will operate from offset out to EOF.
>   *
>   * The flag bits are:
>   *
> @@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>   * Useful combinations of the flag bits are:
>   *
>   * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
> - * in the range which were dirty on entry to sys_sync_file_range() are placed
> + * in the range which were dirty on entry to ksys_sync_file_range() are placed
>   * under writeout.  This is a start-write-for-data-integrity operation.
>   *
>   * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
> @@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>   * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
>   * for that operation to complete and to return the result.
>   *
> - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
> + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
> + * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
>   * a traditional sync() operation.  This is a write-for-data-integrity operation
>   * which will ensure that all pages in the range which were dirty on entry to
> - * sys_sync_file_range() are committed to disk.
> + * ksys_sync_file_range() are written to disk.  It should be noted that disk
> + * caches are not flushed by this call, so there are no guarantees here that the
> + * data will be available on disk after a crash.
>   *
>   *
>   * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
> @@ -338,6 +341,14 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
>  
>  	mapping = f.file->f_mapping;
>  	ret = 0;
> +	if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
> +		     SYNC_FILE_RANGE_WRITE_AND_WAIT) {
> +		/* Unlike SYNC_FILE_RANGE_WRITE alone uses WB_SYNC_ALL */
> +		ret = filemap_write_and_wait_range(mapping, offset, endbyte);
> +		if (ret < 0)
> +			goto out_put;
> +	}
> +
>  	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
>  		ret = file_fdatawait_range(f.file, offset, endbyte);
>  		if (ret < 0)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 121e82ce296b..59c71fa8c553 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -320,6 +320,9 @@ struct fscrypt_key {
>  #define SYNC_FILE_RANGE_WAIT_BEFORE	1
>  #define SYNC_FILE_RANGE_WRITE		2
>  #define SYNC_FILE_RANGE_WAIT_AFTER	4
> +#define SYNC_FILE_RANGE_WRITE_AND_WAIT	(SYNC_FILE_RANGE_WRITE | \
> +					 SYNC_FILE_RANGE_WAIT_BEFORE | \
> +					 SYNC_FILE_RANGE_WAIT_AFTER)
>  
>  /*
>   * Flags for preadv2/pwritev2:
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-09 15:43 ` Jan Kara
@ 2019-04-09 18:01   ` Amir Goldstein
  2019-04-10  9:10     ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2019-04-09 18:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, Al Viro, linux-fsdevel, linux-kernel

On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > writeback") claims that sync_file_range(2) syscall was "created for
> > userspace to be able to issue background writeout and so waiting for
> > in-flight IO is undesirable there" and changes the writeback (back) to
> > WB_SYNC_NONE.
> >
> > This claim is only partially true. Is is true for users that use the flag
> > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > the reason for changing to WB_SYNC_NONE writeback.
> >
> > However, that claim is not true for users that use that flag combination
> > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> > requested to wait for in-flight IO as well as to writeback of dirty
> > pages.  sync_file_range(2) man page describes this flag combintaion as
> > "write-for-data-integrity operation", although some may argue against
> > this definition.
> >
> > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > writeback, to perform the range sync request.
> >
> > One intended use case for this API is creating a dependency between
> > a new file's content and its link into the namepsace without requiring
> > journal commit and flushing of disk volatile caches:
> >
> >   fd = open(".", O_TMPFILE | O_RDWR);
> >   write(fd, newdata, count);
> >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> >
> > For many local filesystems, ext4 and xfs included, the sequence above
> > will guaranty that after crash, either 'newfile' exists with 'newdata'
> > content or 'newfile' does not exists.  For some applications, this
> > guaranty is strong enough and the effects of sync_file_range(2), even
> > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > than the effects of fdatasync(2).
>
> I agree that this paragraph is true but I don't want any userspace program
> rely on this. We've been through this when ext3 got converted to ext4 and
> it has caused a lot of complaints. Ext3 had provided rather strong data vs
> metadata ordering due to the way journalling was implemented. Then ext4
> came, implemented delayed allocation and somewhat changed how journalling
> works and suddently userspace programmers were caught by surprise their code
> working by luck stopped working. And they were complaining that when their
> code worked without fsync(2) before, it should work after it as well. So it
> took a lot of explaining that their applications are broken and Ted has
> also implemented some workarounds to make at least the most common mistakes
> silently fixed by the kernel most of the time.
>
> So I'm against providing 90% data integrity guarantees because there'll be
> many people who'll think they can get away with it and then complain when
> they won't once our implementation changes.
>
> Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> write-for-data-integrity.

OK. I do agree that manpage is misleading and that the language describing
this flag combination should be toned down. I do not object to adding more
and more disclaimers about this API not being a data integrity API.

But the requirement I have is a real life workload and fdatasync(2) is not at
all a viable option when application is creating many files per second.

I need to export the functionality of data writeback to userspace.
Objecting to expose this functionality via the interface that has been
documented to expose it for years and used to expose it in the
past (until the Fixes commit) is a bit weird IMO, even though I do
completely relate to your concern.

I don't mind removing the "intended use case" part of commit message
if that helps reducing the chance that people misunderstand
the guaranties of the API.

Another thing I could do is introduce a new flag for sync_file_range()
that will really mean what I want it to mean (all data will be written back
and all related inode metadata changes will be committed to journal
before the next inode metadata change is committed). For the sake of
argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.

I could then extend the ->fsync() f_op 'datasync' argument to 'sync_flags'
and let sync_file_range() call the filesystem to request the data dependency.
By default, no filesystem will support the new flag, but xfs and ext4 could
implement flag support simply by calling filemap_write_and_wait_range().

This way, the contract between the application and filesystem is drawn in
a way that filesystem is still able to make writeback optimizations without
breaking the contract. For example, if ext4 changes in a way that
filemap_write_and_wait_range() no longer gives the data dependency
guaranty, or if ext4 is mounted with no journal, then ext4 could refuse
the data dependency request or it could implement it differently.

Does that sound any better? If so, any suggestion for a less horid name
for this new flag?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-09 18:01   ` Amir Goldstein
@ 2019-04-10  9:10     ` Jan Kara
  2019-04-10 10:42       ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2019-04-10  9:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Dave Chinner, Al Viro, linux-fsdevel, linux-kernel

On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@suse.cz> wrote:
> > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > writeback") claims that sync_file_range(2) syscall was "created for
> > > userspace to be able to issue background writeout and so waiting for
> > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > WB_SYNC_NONE.
> > >
> > > This claim is only partially true. Is is true for users that use the flag
> > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > the reason for changing to WB_SYNC_NONE writeback.
> > >
> > > However, that claim is not true for users that use that flag combination
> > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> > > requested to wait for in-flight IO as well as to writeback of dirty
> > > pages.  sync_file_range(2) man page describes this flag combintaion as
> > > "write-for-data-integrity operation", although some may argue against
> > > this definition.
> > >
> > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > writeback, to perform the range sync request.
> > >
> > > One intended use case for this API is creating a dependency between
> > > a new file's content and its link into the namepsace without requiring
> > > journal commit and flushing of disk volatile caches:
> > >
> > >   fd = open(".", O_TMPFILE | O_RDWR);
> > >   write(fd, newdata, count);
> > >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > >
> > > For many local filesystems, ext4 and xfs included, the sequence above
> > > will guaranty that after crash, either 'newfile' exists with 'newdata'
> > > content or 'newfile' does not exists.  For some applications, this
> > > guaranty is strong enough and the effects of sync_file_range(2), even
> > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > than the effects of fdatasync(2).
> >
> > I agree that this paragraph is true but I don't want any userspace program
> > rely on this. We've been through this when ext3 got converted to ext4 and
> > it has caused a lot of complaints. Ext3 had provided rather strong data vs
> > metadata ordering due to the way journalling was implemented. Then ext4
> > came, implemented delayed allocation and somewhat changed how journalling
> > works and suddently userspace programmers were caught by surprise their code
> > working by luck stopped working. And they were complaining that when their
> > code worked without fsync(2) before, it should work after it as well. So it
> > took a lot of explaining that their applications are broken and Ted has
> > also implemented some workarounds to make at least the most common mistakes
> > silently fixed by the kernel most of the time.
> >
> > So I'm against providing 90% data integrity guarantees because there'll be
> > many people who'll think they can get away with it and then complain when
> > they won't once our implementation changes.
> >
> > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > write-for-data-integrity.
> 
> OK. I do agree that manpage is misleading and that the language describing
> this flag combination should be toned down. I do not object to adding more
> and more disclaimers about this API not being a data integrity API.

I don't think we need more disclaimers, I'd just reword that
"write-for-data-integrity" bit.

> But the requirement I have is a real life workload and fdatasync(2) is not at
> all a viable option when application is creating many files per second.
> 
> I need to export the functionality of data writeback to userspace.
> Objecting to expose this functionality via the interface that has been
> documented to expose it for years and used to expose it in the
> past (until the Fixes commit) is a bit weird IMO, even though I do
> completely relate to your concern.
> 
> I don't mind removing the "intended use case" part of commit message
> if that helps reducing the chance that people misunderstand
> the guaranties of the API.
> 
> Another thing I could do is introduce a new flag for sync_file_range()
> that will really mean what I want it to mean (all data will be written back
> and all related inode metadata changes will be committed to journal
> before the next inode metadata change is committed). For the sake of
> argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.

So there are two parts to your requirements:

1) Data writeback for all file pages has been submitted (+ completed).
2) What happens with related metadata.

sync_file_range() is fine for 1) although I agree that WB_SYNC_NONE mode
can still end up skipping some pages due to unrelated writeback / lock
contention so having sync_file_range() mode doing WB_SYNC_ALL writeback
makes some sense to me.

What happens with 2) is pretty much up to the filesystem. You are speaking
about the journal but that is very much a filesystem specific detail.
What if the filesystem doesn't have journal at all? E.g. ext2? We cannot
really expose a generic API that silently fails to work based on fs
internal details.

I guess I fail to see what kind of guarantees from the fs you're looking
for? When you speak about journal, I guess you're looking for some kind of
ordering but it's hard to be sure about exact requirements. So can you
please spell those out? Once we know requirements, we can speak about the
API.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-10  9:10     ` Jan Kara
@ 2019-04-10 10:42       ` Amir Goldstein
  2019-04-11 10:16         ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2019-04-10 10:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, Al Viro, linux-fsdevel, linux-kernel

On Wed, Apr 10, 2019 at 12:10 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@suse.cz> wrote:
> > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > > writeback") claims that sync_file_range(2) syscall was "created for
> > > > userspace to be able to issue background writeout and so waiting for
> > > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > > WB_SYNC_NONE.
> > > >
> > > > This claim is only partially true. Is is true for users that use the flag
> > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > > the reason for changing to WB_SYNC_NONE writeback.
> > > >
> > > > However, that claim is not true for users that use that flag combination
> > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> > > > requested to wait for in-flight IO as well as to writeback of dirty
> > > > pages.  sync_file_range(2) man page describes this flag combintaion as
> > > > "write-for-data-integrity operation", although some may argue against
> > > > this definition.
> > > >
> > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > > writeback, to perform the range sync request.
> > > >
> > > > One intended use case for this API is creating a dependency between
> > > > a new file's content and its link into the namepsace without requiring
> > > > journal commit and flushing of disk volatile caches:
> > > >
> > > >   fd = open(".", O_TMPFILE | O_RDWR);
> > > >   write(fd, newdata, count);
> > > >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > > >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > > >
> > > > For many local filesystems, ext4 and xfs included, the sequence above
> > > > will guaranty that after crash, either 'newfile' exists with 'newdata'
> > > > content or 'newfile' does not exists.  For some applications, this
> > > > guaranty is strong enough and the effects of sync_file_range(2), even
> > > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > > than the effects of fdatasync(2).
> > >
> > > I agree that this paragraph is true but I don't want any userspace program
> > > rely on this. We've been through this when ext3 got converted to ext4 and
> > > it has caused a lot of complaints. Ext3 had provided rather strong data vs
> > > metadata ordering due to the way journalling was implemented. Then ext4
> > > came, implemented delayed allocation and somewhat changed how journalling
> > > works and suddently userspace programmers were caught by surprise their code
> > > working by luck stopped working. And they were complaining that when their
> > > code worked without fsync(2) before, it should work after it as well. So it
> > > took a lot of explaining that their applications are broken and Ted has
> > > also implemented some workarounds to make at least the most common mistakes
> > > silently fixed by the kernel most of the time.
> > >
> > > So I'm against providing 90% data integrity guarantees because there'll be
> > > many people who'll think they can get away with it and then complain when
> > > they won't once our implementation changes.
> > >
> > > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > > write-for-data-integrity.
> >
> > OK. I do agree that manpage is misleading and that the language describing
> > this flag combination should be toned down. I do not object to adding more
> > and more disclaimers about this API not being a data integrity API.
>
> I don't think we need more disclaimers, I'd just reword that
> "write-for-data-integrity" bit.
>
> > But the requirement I have is a real life workload and fdatasync(2) is not at
> > all a viable option when application is creating many files per second.
> >
> > I need to export the functionality of data writeback to userspace.
> > Objecting to expose this functionality via the interface that has been
> > documented to expose it for years and used to expose it in the
> > past (until the Fixes commit) is a bit weird IMO, even though I do
> > completely relate to your concern.
> >
> > I don't mind removing the "intended use case" part of commit message
> > if that helps reducing the chance that people misunderstand
> > the guaranties of the API.
> >
> > Another thing I could do is introduce a new flag for sync_file_range()
> > that will really mean what I want it to mean (all data will be written back
> > and all related inode metadata changes will be committed to journal
> > before the next inode metadata change is committed). For the sake of
> > argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.
>
> So there are two parts to your requirements:
>
> 1) Data writeback for all file pages has been submitted (+ completed).
> 2) What happens with related metadata.
>
> sync_file_range() is fine for 1) although I agree that WB_SYNC_NONE mode
> can still end up skipping some pages due to unrelated writeback / lock
> contention so having sync_file_range() mode doing WB_SYNC_ALL writeback
> makes some sense to me.
>

OK. does it make enough sense for you to ACK my patch if the
commit message was toned down to:

Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
the reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
Those users explicitly requested to wait for in-flight IO as well as to
writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
writeback, to perform the range sync request.

Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>

> What happens with 2) is pretty much up to the filesystem. You are speaking
> about the journal but that is very much a filesystem specific detail.
> What if the filesystem doesn't have journal at all? E.g. ext2? We cannot
> really expose a generic API that silently fails to work based on fs
> internal details.
>
> I guess I fail to see what kind of guarantees from the fs you're looking
> for? When you speak about journal, I guess you're looking for some kind of
> ordering but it's hard to be sure about exact requirements. So can you
> please spell those out? Once we know requirements, we can speak about the
> API.
>

The requirement is quite easy to explain with an example, less easy to
generalize. I illustrated the requirement with the example:
O_TMPFILE;write();DATA_BARRIER;linkat()

The requirement of the DATA_BARRIER operation is that if the effects
of linkat() are observed after crash, then the effects of write() must also
be observed after crash.

At this point one may say that fdatasync(2) meets the requirements
of a DATA_BARRIER operation, so I need to introduce a non-requirement.
The non-requirement is that DATA_BARRIER operation does not need
to be a DATA_INTEGRITY operation, meaning that after crash, there
is no requirement to be able to find or access the written data.

Now it should be clear to filesystem developers why the DATA_BARRIER
requirement is easier to meet with far less performance penalty than
fdatasync(2) on xfs and ext4.

Here is a clumsy attempt to generalize the guaranty of such operation,
without referring to internal fs implementation details:
"Make sure that all file data hits persistent storage before the next
 file metadata change hits persistent storage".

Does that make sense to you?
Is the requirement clear?

How does this sound for an API:
sync_file_range(fd, SYNC_FILE_RANGE_WRITE_DATA_BARRIER, 0, 0);

which individual filesystems could support or EOPNOTSUPP
via the ->fsync() file_operation?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-10 10:42       ` Amir Goldstein
@ 2019-04-11 10:16         ` Jan Kara
  2019-04-11 11:08           ` Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2019-04-11 10:16 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Dave Chinner, Al Viro, linux-fsdevel, linux-kernel

On Wed 10-04-19 13:42:23, Amir Goldstein wrote:
> On Wed, Apr 10, 2019 at 12:10 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> > > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > > > writeback") claims that sync_file_range(2) syscall was "created for
> > > > > userspace to be able to issue background writeout and so waiting for
> > > > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > > > WB_SYNC_NONE.
> > > > >
> > > > > This claim is only partially true. Is is true for users that use the flag
> > > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > > > the reason for changing to WB_SYNC_NONE writeback.
> > > > >
> > > > > However, that claim is not true for users that use that flag combination
> > > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> > > > > requested to wait for in-flight IO as well as to writeback of dirty
> > > > > pages.  sync_file_range(2) man page describes this flag combintaion as
> > > > > "write-for-data-integrity operation", although some may argue against
> > > > > this definition.
> > > > >
> > > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > > > writeback, to perform the range sync request.
> > > > >
> > > > > One intended use case for this API is creating a dependency between
> > > > > a new file's content and its link into the namepsace without requiring
> > > > > journal commit and flushing of disk volatile caches:
> > > > >
> > > > >   fd = open(".", O_TMPFILE | O_RDWR);
> > > > >   write(fd, newdata, count);
> > > > >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > > > >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > > > >
> > > > > For many local filesystems, ext4 and xfs included, the sequence above
> > > > > will guaranty that after crash, either 'newfile' exists with 'newdata'
> > > > > content or 'newfile' does not exists.  For some applications, this
> > > > > guaranty is strong enough and the effects of sync_file_range(2), even
> > > > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > > > than the effects of fdatasync(2).
> > > >
> > > > I agree that this paragraph is true but I don't want any userspace program
> > > > rely on this. We've been through this when ext3 got converted to ext4 and
> > > > it has caused a lot of complaints. Ext3 had provided rather strong data vs
> > > > metadata ordering due to the way journalling was implemented. Then ext4
> > > > came, implemented delayed allocation and somewhat changed how journalling
> > > > works and suddently userspace programmers were caught by surprise their code
> > > > working by luck stopped working. And they were complaining that when their
> > > > code worked without fsync(2) before, it should work after it as well. So it
> > > > took a lot of explaining that their applications are broken and Ted has
> > > > also implemented some workarounds to make at least the most common mistakes
> > > > silently fixed by the kernel most of the time.
> > > >
> > > > So I'm against providing 90% data integrity guarantees because there'll be
> > > > many people who'll think they can get away with it and then complain when
> > > > they won't once our implementation changes.
> > > >
> > > > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > > > write-for-data-integrity.
> > >
> > > OK. I do agree that manpage is misleading and that the language describing
> > > this flag combination should be toned down. I do not object to adding more
> > > and more disclaimers about this API not being a data integrity API.
> >
> > I don't think we need more disclaimers, I'd just reword that
> > "write-for-data-integrity" bit.
> >
> > > But the requirement I have is a real life workload and fdatasync(2) is not at
> > > all a viable option when application is creating many files per second.
> > >
> > > I need to export the functionality of data writeback to userspace.
> > > Objecting to expose this functionality via the interface that has been
> > > documented to expose it for years and used to expose it in the
> > > past (until the Fixes commit) is a bit weird IMO, even though I do
> > > completely relate to your concern.
> > >
> > > I don't mind removing the "intended use case" part of commit message
> > > if that helps reducing the chance that people misunderstand
> > > the guaranties of the API.
> > >
> > > Another thing I could do is introduce a new flag for sync_file_range()
> > > that will really mean what I want it to mean (all data will be written back
> > > and all related inode metadata changes will be committed to journal
> > > before the next inode metadata change is committed). For the sake of
> > > argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.
> >
> > So there are two parts to your requirements:
> >
> > 1) Data writeback for all file pages has been submitted (+ completed).
> > 2) What happens with related metadata.
> >
> > sync_file_range() is fine for 1) although I agree that WB_SYNC_NONE mode
> > can still end up skipping some pages due to unrelated writeback / lock
> > contention so having sync_file_range() mode doing WB_SYNC_ALL writeback
> > makes some sense to me.
> >
> 
> OK. does it make enough sense for you to ACK my patch if the
> commit message was toned down to:
> 
> Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> writeback") claims that sync_file_range(2) syscall was "created for
> userspace to be able to issue background writeout and so waiting for
> in-flight IO is undesirable there" and changes the writeback (back) to
> WB_SYNC_NONE.
> 
> This claim is only partially true. It is true for users that use the flag
> SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> the reason for changing to WB_SYNC_NONE writeback.
> 
> However, that claim is not true for users that use that flag combination
> SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
> Those users explicitly requested to wait for in-flight IO as well as to
> writeback of dirty pages.
> 
> Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
> and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> writeback, to perform the range sync request.
> 
> Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Yes, I'd be fine with that.

> > What happens with 2) is pretty much up to the filesystem. You are speaking
> > about the journal but that is very much a filesystem specific detail.
> > What if the filesystem doesn't have journal at all? E.g. ext2? We cannot
> > really expose a generic API that silently fails to work based on fs
> > internal details.
> >
> > I guess I fail to see what kind of guarantees from the fs you're looking
> > for? When you speak about journal, I guess you're looking for some kind of
> > ordering but it's hard to be sure about exact requirements. So can you
> > please spell those out? Once we know requirements, we can speak about the
> > API.
> >
> 
> The requirement is quite easy to explain with an example, less easy to
> generalize. I illustrated the requirement with the example:
> O_TMPFILE;write();DATA_BARRIER;linkat()
>
> The requirement of the DATA_BARRIER operation is that if the effects
> of linkat() are observed after crash, then the effects of write() must also
> be observed after crash.

OK, understood and I agree this makes sense to ask for... And you're not
the only one. Application programmers often *assume* that if they do
write(file_new), rename(file_new, file), they will either see old or new
file contents after a crash. They are wrong but ext4 has heuristics to
detect such behavior and at least initiate writeback on rename to make time
window when data loss can happen smaller. Because there was lot of
controversy around this when users were transitioning from ext3 (where this
worked) to ext4. If there was API with decent performance to achieve this,
I guess at least some applications would start using it.
 
> At this point one may say that fdatasync(2) meets the requirements
> of a DATA_BARRIER operation, so I need to introduce a non-requirement.
> The non-requirement is that DATA_BARRIER operation does not need
> to be a DATA_INTEGRITY operation, meaning that after crash, there
> is no requirement to be able to find or access the written data.

Well, I guess your other real requirement is good performance ;) How that's
achieved is internal to the filesystem.

> Now it should be clear to filesystem developers why the DATA_BARRIER
> requirement is easier to meet with far less performance penalty than
> fdatasync(2) on xfs and ext4.

Well, it won't be free either but yes, you can save a transaction commit
/ disk cache flush compared to fdatasync(2).

> Here is a clumsy attempt to generalize the guaranty of such operation,
> without referring to internal fs implementation details:
> "Make sure that all file data hits persistent storage before the next
>  file metadata change hits persistent storage".
> 
> Does that make sense to you?
> Is the requirement clear?
> 
> How does this sound for an API:
> sync_file_range(fd, SYNC_FILE_RANGE_WRITE_DATA_BARRIER, 0, 0);
> 
> which individual filesystems could support or EOPNOTSUPP
> via the ->fsync() file_operation?

Yes, this sounds workable to me but there are also other options - e.g.
somehow pass the information to linkat() / rename() / whatever that the
ordering is required. This might be easier for some filesystems to
implement. So it might be good to make a short session at LSF/MM about this
so that developers of other filesystems can say what interface they would
be able to implement...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-11 10:16         ` Jan Kara
@ 2019-04-11 11:08           ` Amir Goldstein
  0 siblings, 0 replies; 10+ messages in thread
From: Amir Goldstein @ 2019-04-11 11:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Al Viro, linux-fsdevel, linux-kernel,
	Andrew Morton, Anna Schumaker, Josef Bacik

On Thu, Apr 11, 2019 at 1:16 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 10-04-19 13:42:23, Amir Goldstein wrote:
> > On Wed, Apr 10, 2019 at 12:10 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> > > > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@suse.cz> wrote:
> > > > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > > > > writeback") claims that sync_file_range(2) syscall was "created for
> > > > > > userspace to be able to issue background writeout and so waiting for
> > > > > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > > > > WB_SYNC_NONE.
> > > > > >
> > > > > > This claim is only partially true. Is is true for users that use the flag
> > > > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > > > > the reason for changing to WB_SYNC_NONE writeback.
> > > > > >
> > > > > > However, that claim is not true for users that use that flag combination
> > > > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> > > > > > requested to wait for in-flight IO as well as to writeback of dirty
> > > > > > pages.  sync_file_range(2) man page describes this flag combintaion as
> > > > > > "write-for-data-integrity operation", although some may argue against
> > > > > > this definition.
> > > > > >
> > > > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > > > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > > > > writeback, to perform the range sync request.
> > > > > >
> > > > > > One intended use case for this API is creating a dependency between
> > > > > > a new file's content and its link into the namepsace without requiring
> > > > > > journal commit and flushing of disk volatile caches:
> > > > > >
> > > > > >   fd = open(".", O_TMPFILE | O_RDWR);
> > > > > >   write(fd, newdata, count);
> > > > > >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > > > > >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > > > > >
> > > > > > For many local filesystems, ext4 and xfs included, the sequence above
> > > > > > will guaranty that after crash, either 'newfile' exists with 'newdata'
> > > > > > content or 'newfile' does not exists.  For some applications, this
> > > > > > guaranty is strong enough and the effects of sync_file_range(2), even
> > > > > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > > > > than the effects of fdatasync(2).
> > > > >
> > > > > I agree that this paragraph is true but I don't want any userspace program
> > > > > rely on this. We've been through this when ext3 got converted to ext4 and
> > > > > it has caused a lot of complaints. Ext3 had provided rather strong data vs
> > > > > metadata ordering due to the way journalling was implemented. Then ext4
> > > > > came, implemented delayed allocation and somewhat changed how journalling
> > > > > works and suddently userspace programmers were caught by surprise their code
> > > > > working by luck stopped working. And they were complaining that when their
> > > > > code worked without fsync(2) before, it should work after it as well. So it
> > > > > took a lot of explaining that their applications are broken and Ted has
> > > > > also implemented some workarounds to make at least the most common mistakes
> > > > > silently fixed by the kernel most of the time.
> > > > >
> > > > > So I'm against providing 90% data integrity guarantees because there'll be
> > > > > many people who'll think they can get away with it and then complain when
> > > > > they won't once our implementation changes.
> > > > >
> > > > > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > > > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > > > > write-for-data-integrity.
> > > >
> > > > OK. I do agree that manpage is misleading and that the language describing
> > > > this flag combination should be toned down. I do not object to adding more
> > > > and more disclaimers about this API not being a data integrity API.
> > >
> > > I don't think we need more disclaimers, I'd just reword that
> > > "write-for-data-integrity" bit.
> > >
> > > > But the requirement I have is a real life workload and fdatasync(2) is not at
> > > > all a viable option when application is creating many files per second.
> > > >
> > > > I need to export the functionality of data writeback to userspace.
> > > > Objecting to expose this functionality via the interface that has been
> > > > documented to expose it for years and used to expose it in the
> > > > past (until the Fixes commit) is a bit weird IMO, even though I do
> > > > completely relate to your concern.
> > > >
> > > > I don't mind removing the "intended use case" part of commit message
> > > > if that helps reducing the chance that people misunderstand
> > > > the guaranties of the API.
> > > >
> > > > Another thing I could do is introduce a new flag for sync_file_range()
> > > > that will really mean what I want it to mean (all data will be written back
> > > > and all related inode metadata changes will be committed to journal
> > > > before the next inode metadata change is committed). For the sake of
> > > > argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.
> > >
> > > So there are two parts to your requirements:
> > >
> > > 1) Data writeback for all file pages has been submitted (+ completed).
> > > 2) What happens with related metadata.
> > >
> > > sync_file_range() is fine for 1) although I agree that WB_SYNC_NONE mode
> > > can still end up skipping some pages due to unrelated writeback / lock
> > > contention so having sync_file_range() mode doing WB_SYNC_ALL writeback
> > > makes some sense to me.
> > >
> >
> > OK. does it make enough sense for you to ACK my patch if the
> > commit message was toned down to:
> >
> > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > writeback") claims that sync_file_range(2) syscall was "created for
> > userspace to be able to issue background writeout and so waiting for
> > in-flight IO is undesirable there" and changes the writeback (back) to
> > WB_SYNC_NONE.
> >
> > This claim is only partially true. It is true for users that use the flag
> > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > the reason for changing to WB_SYNC_NONE writeback.
> >
> > However, that claim is not true for users that use that flag combination
> > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
> > Those users explicitly requested to wait for in-flight IO as well as to
> > writeback of dirty pages.
> >
> > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
> > and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > writeback, to perform the range sync request.
> >
> > Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Yes, I'd be fine with that.

Great.
I'll send V2 with your ACK.
I guess since Andrew merged the Fixes commit, I'll point to patch at
him as well.

>
> > > What happens with 2) is pretty much up to the filesystem. You are speaking
> > > about the journal but that is very much a filesystem specific detail.
> > > What if the filesystem doesn't have journal at all? E.g. ext2? We cannot
> > > really expose a generic API that silently fails to work based on fs
> > > internal details.
> > >
> > > I guess I fail to see what kind of guarantees from the fs you're looking
> > > for? When you speak about journal, I guess you're looking for some kind of
> > > ordering but it's hard to be sure about exact requirements. So can you
> > > please spell those out? Once we know requirements, we can speak about the
> > > API.
> > >
> >
> > The requirement is quite easy to explain with an example, less easy to
> > generalize. I illustrated the requirement with the example:
> > O_TMPFILE;write();DATA_BARRIER;linkat()
> >
> > The requirement of the DATA_BARRIER operation is that if the effects
> > of linkat() are observed after crash, then the effects of write() must also
> > be observed after crash.
>
> OK, understood and I agree this makes sense to ask for... And you're not
> the only one. Application programmers often *assume* that if they do
> write(file_new), rename(file_new, file), they will either see old or new
> file contents after a crash. They are wrong but ext4 has heuristics to
> detect such behavior and at least initiate writeback on rename to make time
> window when data loss can happen smaller. Because there was lot of
> controversy around this when users were transitioning from ext3 (where this
> worked) to ext4. If there was API with decent performance to achieve this,
> I guess at least some applications would start using it.
>
> > At this point one may say that fdatasync(2) meets the requirements
> > of a DATA_BARRIER operation, so I need to introduce a non-requirement.
> > The non-requirement is that DATA_BARRIER operation does not need
> > to be a DATA_INTEGRITY operation, meaning that after crash, there
> > is no requirement to be able to find or access the written data.
>
> Well, I guess your other real requirement is good performance ;) How that's
> achieved is internal to the filesystem.
>
> > Now it should be clear to filesystem developers why the DATA_BARRIER
> > requirement is easier to meet with far less performance penalty than
> > fdatasync(2) on xfs and ext4.
>
> Well, it won't be free either but yes, you can save a transaction commit
> / disk cache flush compared to fdatasync(2).
>
> > Here is a clumsy attempt to generalize the guaranty of such operation,
> > without referring to internal fs implementation details:
> > "Make sure that all file data hits persistent storage before the next
> >  file metadata change hits persistent storage".
> >
> > Does that make sense to you?
> > Is the requirement clear?
> >
> > How does this sound for an API:
> > sync_file_range(fd, SYNC_FILE_RANGE_WRITE_DATA_BARRIER, 0, 0);
> >
> > which individual filesystems could support or EOPNOTSUPP
> > via the ->fsync() file_operation?
>
> Yes, this sounds workable to me but there are also other options - e.g.
> somehow pass the information to linkat() / rename() / whatever that the
> ordering is required. This might be easier for some filesystems to
> implement.

I like this idea. rename() and linkat() are definitely the most probable
metadata operation that needs this barrier. But I wouldn't want to limit
this capability to rename() and linkat() alone.
Overlayfs for example, could use a similar data write barrier before
remove xattr (when "hydrating" a metacopy upper).

> So it might be good to make a short session at LSF/MM about this
> so that developers of other filesystems can say what interface they would
> be able to implement...

Yeh, its on my short list for things to discuss.
[CC: Anna & Josef]

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v4] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-09 11:49 [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback Amir Goldstein
  2019-04-09 15:43 ` Jan Kara
@ 2019-04-17  5:45 ` Amir Goldstein
  2019-04-19  0:02   ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2019-04-17  5:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Al Viro, linux-mm, linux-api, linux-fsdevel

Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
the reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
Those users explicitly requested to wait for in-flight IO as well as to
writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
writeback, to perform the full range sync request.

Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---

Andrew,

V2 of this patch is on your mmtotm queue.
However, I had already sent out V3 with a braino fix and Dave Chinner
just added more review comments which I had addressed in this version.

Thanks,
Amir.

Changes since v3:
- Remove unneeded change to VALID_FLAGS (Dave)
- Call file_fdatawait_range() before writeback (Dave)

Changes since v2:
- Return after filemap_write_and_wait_range()

Changes since v1:
- Remove non-guaranties of the API from commit message
- Added ACK by Jan

 fs/sync.c               | 20 +++++++++++++++-----
 include/uapi/linux/fs.h |  3 +++
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..1836328f1ae8 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 }
 
 /*
- * sys_sync_file_range() permits finely controlled syncing over a segment of
+ * ksys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
- * zero then sys_sync_file_range() will operate from offset out to EOF.
+ * zero then ksys_sync_file_range() will operate from offset out to EOF.
  *
  * The flag bits are:
  *
@@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * Useful combinations of the flag bits are:
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
- * in the range which were dirty on entry to sys_sync_file_range() are placed
+ * in the range which were dirty on entry to ksys_sync_file_range() are placed
  * under writeout.  This is a start-write-for-data-integrity operation.
  *
  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
@@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
  * for that operation to complete and to return the result.
  *
- * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
+ * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
+ * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
  * a traditional sync() operation.  This is a write-for-data-integrity operation
  * which will ensure that all pages in the range which were dirty on entry to
- * sys_sync_file_range() are committed to disk.
+ * ksys_sync_file_range() are written to disk.  It should be noted that disk
+ * caches are not flushed by this call, so there are no guarantees here that the
+ * data will be available on disk after a crash.
  *
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
@@ -344,6 +347,13 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 			goto out_put;
 	}
 
+	if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
+		     SYNC_FILE_RANGE_WRITE_AND_WAIT) {
+		/* Unlike SYNC_FILE_RANGE_WRITE alone, uses WB_SYNC_ALL */
+		ret = filemap_write_and_wait_range(mapping, offset, endbyte);
+		goto out_put;
+	}
+
 	if (flags & SYNC_FILE_RANGE_WRITE) {
 		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
 						 WB_SYNC_NONE);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 121e82ce296b..59c71fa8c553 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -320,6 +320,9 @@ struct fscrypt_key {
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
+#define SYNC_FILE_RANGE_WRITE_AND_WAIT	(SYNC_FILE_RANGE_WRITE | \
+					 SYNC_FILE_RANGE_WAIT_BEFORE | \
+					 SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
  * Flags for preadv2/pwritev2:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-17  5:45 ` [PATCH v4] " Amir Goldstein
@ 2019-04-19  0:02   ` Dave Chinner
  2019-04-19  7:29     ` [PATCH v5] " Amir Goldstein
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2019-04-19  0:02 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Andrew Morton, Jan Kara, Al Viro, linux-mm, linux-api, linux-fsdevel

On Wed, Apr 17, 2019 at 08:45:59AM +0300, Amir Goldstein wrote:
> Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> writeback") claims that sync_file_range(2) syscall was "created for
> userspace to be able to issue background writeout and so waiting for
> in-flight IO is undesirable there" and changes the writeback (back) to
> WB_SYNC_NONE.
> 
> This claim is only partially true. It is true for users that use the flag
> SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> the reason for changing to WB_SYNC_NONE writeback.
> 
> However, that claim is not true for users that use that flag combination
> SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
> Those users explicitly requested to wait for in-flight IO as well as to
> writeback of dirty pages.
> 
> Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
> and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> writeback, to perform the full range sync request.
> 
> Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
> Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> Acked-by: Jan Kara <jack@suse.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> ---
> 
> Andrew,
> 
> V2 of this patch is on your mmtotm queue.
> However, I had already sent out V3 with a braino fix and Dave Chinner
> just added more review comments which I had addressed in this version.
> 
> Thanks,
> Amir.
> 
> Changes since v3:
> - Remove unneeded change to VALID_FLAGS (Dave)
> - Call file_fdatawait_range() before writeback (Dave)
> 
> Changes since v2:
> - Return after filemap_write_and_wait_range()
> 
> Changes since v1:
> - Remove non-guaranties of the API from commit message
> - Added ACK by Jan
> 
>  fs/sync.c               | 20 +++++++++++++++-----
>  include/uapi/linux/fs.h |  3 +++
>  2 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/sync.c b/fs/sync.c
> index b54e0541ad89..1836328f1ae8 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>  }
>  
>  /*
> - * sys_sync_file_range() permits finely controlled syncing over a segment of
> + * ksys_sync_file_range() permits finely controlled syncing over a segment of
>   * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
> - * zero then sys_sync_file_range() will operate from offset out to EOF.
> + * zero then ksys_sync_file_range() will operate from offset out to EOF.
>   *
>   * The flag bits are:
>   *
> @@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>   * Useful combinations of the flag bits are:
>   *
>   * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
> - * in the range which were dirty on entry to sys_sync_file_range() are placed
> + * in the range which were dirty on entry to ksys_sync_file_range() are placed
>   * under writeout.  This is a start-write-for-data-integrity operation.
>   *
>   * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
> @@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>   * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
>   * for that operation to complete and to return the result.
>   *
> - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
> + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
> + * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
>   * a traditional sync() operation.  This is a write-for-data-integrity operation
>   * which will ensure that all pages in the range which were dirty on entry to
> - * sys_sync_file_range() are committed to disk.
> + * ksys_sync_file_range() are written to disk.  It should be noted that disk
> + * caches are not flushed by this call, so there are no guarantees here that the
> + * data will be available on disk after a crash.
>   *
>   *
>   * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
> @@ -344,6 +347,13 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
>  			goto out_put;
>  	}
>  
> +	if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
> +		     SYNC_FILE_RANGE_WRITE_AND_WAIT) {
> +		/* Unlike SYNC_FILE_RANGE_WRITE alone, uses WB_SYNC_ALL */
> +		ret = filemap_write_and_wait_range(mapping, offset, endbyte);
> +		goto out_put;
> +	}

Clunky, now that I look at it in context.

+	int	sync_mode = WB_SYNC_NONE;
+
+	if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
+		     SYNC_FILE_RANGE_WRITE_AND_WAIT)
+		sync_mode = WB_SYNC_ALL;

.....

	if (flags & SYNC_FILE_RANGE_WRITE) {
		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
-						 WB_SYNC_NONE);
+						 sync_mode);

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  2019-04-19  0:02   ` Dave Chinner
@ 2019-04-19  7:29     ` Amir Goldstein
  0 siblings, 0 replies; 10+ messages in thread
From: Amir Goldstein @ 2019-04-19  7:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Al Viro, linux-mm, linux-api, linux-fsdevel

Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
the reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
Those users explicitly requested to wait for in-flight IO as well as to
writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
and use WB_SYNC_ALL writeback to perform the full range sync request.

Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---

Andrew,

One more version addressing another comment by Dave Chinner.

Thanks,
Amir.

Changes since v4:
- Don't use filemap_write_and_wait_range() helper (Dave)

Changes since v3:
- Remove unneeded change to VALID_FLAGS (Dave)
- Call file_fdatawait_range() before writeback (Dave)

Changes since v2:
- Return after filemap_write_and_wait_range()

Changes since v1:
- Remove non-guaranties of the API from commit message
- Added ACK by Jan

 fs/sync.c               | 21 +++++++++++++++------
 include/uapi/linux/fs.h |  3 +++
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..9e8cd90e890f 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
 }
 
 /*
- * sys_sync_file_range() permits finely controlled syncing over a segment of
+ * ksys_sync_file_range() permits finely controlled syncing over a segment of
  * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
- * zero then sys_sync_file_range() will operate from offset out to EOF.
+ * zero then ksys_sync_file_range() will operate from offset out to EOF.
  *
  * The flag bits are:
  *
@@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * Useful combinations of the flag bits are:
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
- * in the range which were dirty on entry to sys_sync_file_range() are placed
+ * in the range which were dirty on entry to ksys_sync_file_range() are placed
  * under writeout.  This is a start-write-for-data-integrity operation.
  *
  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
@@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
  * for that operation to complete and to return the result.
  *
- * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
+ * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
+ * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
  * a traditional sync() operation.  This is a write-for-data-integrity operation
  * which will ensure that all pages in the range which were dirty on entry to
- * sys_sync_file_range() are committed to disk.
+ * ksys_sync_file_range() are written to disk.  It should be noted that disk
+ * caches are not flushed by this call, so there are no guarantees here that the
+ * data will be available on disk after a crash.
  *
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
@@ -345,8 +348,14 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
+		int sync_mode = WB_SYNC_NONE;
+
+		if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
+			     SYNC_FILE_RANGE_WRITE_AND_WAIT)
+			sync_mode = WB_SYNC_ALL;
+
 		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
-						 WB_SYNC_NONE);
+						 sync_mode);
 		if (ret < 0)
 			goto out_put;
 	}
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 121e82ce296b..59c71fa8c553 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -320,6 +320,9 @@ struct fscrypt_key {
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
+#define SYNC_FILE_RANGE_WRITE_AND_WAIT	(SYNC_FILE_RANGE_WRITE | \
+					 SYNC_FILE_RANGE_WAIT_BEFORE | \
+					 SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
  * Flags for preadv2/pwritev2:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-04-19 18:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-09 11:49 [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback Amir Goldstein
2019-04-09 15:43 ` Jan Kara
2019-04-09 18:01   ` Amir Goldstein
2019-04-10  9:10     ` Jan Kara
2019-04-10 10:42       ` Amir Goldstein
2019-04-11 10:16         ` Jan Kara
2019-04-11 11:08           ` Amir Goldstein
2019-04-17  5:45 ` [PATCH v4] " Amir Goldstein
2019-04-19  0:02   ` Dave Chinner
2019-04-19  7:29     ` [PATCH v5] " Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.