* [QUESTION] multiple fsync() vs single sync()
@ 2018-10-16 10:22 Romain Le Disez
  2018-10-16 12:57 ` Carlos Maiolino
  2018-10-17  1:16 ` Dave Chinner
  0 siblings, 2 replies; 8+ messages in thread
From: Romain Le Disez @ 2018-10-16 10:22 UTC
  To: linux-xfs

Hi all,

In this pseudo-code (extracted from OpenStack Swift [1]):
    fd=open("/tmp/tempfile", O_CREAT | O_WRONLY);
    write(fd, ...);
    fsetxattr(fd, ...);
    fsync(fd);
    rename("/tmp/tempfile", "/data/foobar");
    dirfd = open("/data", O_DIRECTORY | O_RDONLY);
    fsync(dirfd);

OR (the same without temporary file):
    fd=open("/data", O_TMPFILE | O_WRONLY);
    write(fd, ...);
    fsetxattr(fd, ...);
    fsync(fd);
    linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
    dirfd = open("/data", O_DIRECTORY | O_RDONLY);
    fsync(dirfd);


I’m guaranteed that, whatever happens, I’ll have either a complete file (data+xattr) or no file at all in the directory /data.

First question: is that a correct assumption, or are there any loopholes?

Second question: if I replace the two fsync() calls with one sync(), do I get the same guarantee?
    fd=open("/data", O_TMPFILE | O_WRONLY);
    write(fd, ...);
    fsetxattr(fd, ...);
    linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
    sync();

From what I understand of the FAQ [2], write_barrier guarantees that the journal (aka log) will be written before the inode (aka metadata). Did I miss something?

Many thanks for your help.

[1] https://github.com/openstack/swift/blob/2.19.0/swift/obj/diskfile.py#L1674-L1694
[2] http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

-- 
Romain



* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-16 10:22 [QUESTION] multiple fsync() vs single sync() Romain Le Disez
@ 2018-10-16 12:57 ` Carlos Maiolino
  2018-10-16 13:53   ` Stefan Ring
  2018-10-17  1:16 ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Carlos Maiolino @ 2018-10-16 12:57 UTC
  To: Romain Le Disez; +Cc: linux-xfs


Hi,

> 
> In this pseudo-code (extracted from OpenStack Swift [1]):
>     fd=open("/tmp/tempfile", O_CREAT | O_WRONLY);
>     write(fd, ...);
>     fsetxattr(fd, ...);
>     fsync(fd);
>     rename("/tmp/tempfile", "/data/foobar");
>     dirfd = open("/data", O_DIRECTORY | O_RDONLY);
>     fsync(dirfd);
> 
> OR (the same without temporary file):
>     fd=open("/data", O_TMPFILE | O_WRONLY);
>     write(fd, ...);
>     fsetxattr(fd, ...);
>     fsync(fd);
>     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
>     dirfd = open("/data", O_DIRECTORY | O_RDONLY);
>     fsync(dirfd);
> 
> 
> I’m guaranteed that, whatever happens, I’ll have either a complete file (data+xattr) or no file at all in the directory /data.
> 
> First question: is that a correct assumption, or are there any loopholes?

As long as your storage is not broken and you are not using a volatile
write cache, an fsync of both the file and the directory is enough.
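
For reference, the first sequence from the question can be sketched in Python
(the language Swift is written in) using only calls from the standard os
module. This is a minimal sketch under stated assumptions, not Swift's actual
code: the function name durable_write, the temporary-file naming, and the
optional xattrs argument are hypothetical, and the temporary file is created
inside the destination directory because rename() does not work across
filesystems.

```python
import os


def durable_write(directory, name, data, xattrs=None):
    """Write `data` to directory/name using the fsync + rename + fsync pattern.

    Hypothetical helper for illustration only.
    """
    # Assumption: the temp file lives in the destination directory so that
    # rename() never has to cross a filesystem boundary.
    tmp_path = os.path.join(directory, f".{name}.tmp.{os.getpid()}")
    fd = os.open(tmp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        for key, value in (xattrs or {}).items():
            os.setxattr(fd, key, value)  # Linux-only; needs xattr support
        os.fsync(fd)  # file data and xattrs reach stable storage first
    finally:
        os.close(fd)

    final_path = os.path.join(directory, name)
    os.rename(tmp_path, final_path)

    # fsync the directory so the new directory entry is durable as well.
    dirfd = os.open(directory, os.O_DIRECTORY | os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
    return final_path
```

A call such as durable_write("/data", "foobar", b"payload") returns only after
both fsync() calls have completed. Cleaning up a leftover temporary file after
an error is left out of the sketch.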

> 
> Second question: if I replace the two fsync() calls with one sync(), do I get the same guarantee?
>     fd=open("/data", O_TMPFILE | O_WRONLY);
>     write(fd, ...);
>     fsetxattr(fd, ...);
>     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
>     sync();

IIRC, sync() on Linux is supposed to give the same guarantees as syncfs(), since
we wait for IO completion on sync (POSIX doesn't guarantee that sync() waits
until everything is written to backing storage before returning, but Linux does
wait for IO completion).

The short answer is that sync() works the same way as if you ran fsync() on
every file on your filesystem. The question is: do you want to fsync() all files
in your filesystem? This may take far longer than a pair of fsync() calls on the
file and its directory. But it's your call; as I said, sync() will behave as if
you had run fsync() on every file/directory on your filesystem.

Cheers

-- 
Carlos


* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-16 12:57 ` Carlos Maiolino
@ 2018-10-16 13:53   ` Stefan Ring
  2018-10-16 14:09     ` Romain Le Disez
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Ring @ 2018-10-16 13:53 UTC
  To: romain.le-disez; +Cc: linux-xfs

On Tue, Oct 16, 2018 at 2:57 PM Carlos Maiolino <cmaiolino@redhat.com> wrote:
> >     fd=open("/data", O_TMPFILE | O_WRONLY);
> >     write(fd, ...);
> >     fsetxattr(fd, ...);
> >     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
> >     sync();
>
> IIRC, sync() on Linux is supposed to give the same guarantees as syncfs(), since
> we wait for IO completion on sync (POSIX doesn't guarantee that sync() waits
> until everything is written to backing storage before returning, but Linux does
> wait for IO completion).
>
> The short answer is that sync() works the same way as if you ran fsync() on
> every file on your filesystem. The question is: do you want to fsync() all files
> in your filesystem? This may take far longer than a pair of fsync() calls on the
> file and its directory. But it's your call; as I said, sync() will behave as if
> you had run fsync() on every file/directory on your filesystem.

But in what order? If I understood correctly, with the single sync()
call, he might end up with a directory entry referencing an incomplete
file. Which should not be possible in the case with the two fsyncs.


* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-16 13:53   ` Stefan Ring
@ 2018-10-16 14:09     ` Romain Le Disez
  2018-10-18 11:43       ` Carlos Maiolino
  0 siblings, 1 reply; 8+ messages in thread
From: Romain Le Disez @ 2018-10-16 14:09 UTC
  To: Stefan Ring; +Cc: linux-xfs


> On 16 Oct 2018, at 15:53, Stefan Ring <stefanrin@gmail.com> wrote:
> 
> But in what order? If I understood correctly, with the single sync()
> call, he might end up with a directory entry referencing an incomplete
> file. Which should not be possible in the case with the two fsyncs.

In what order, this is exactly my question :)

We are creating hundreds or thousands of files in a row. Converting thousands of fsync() to one sync() would be a great performance improvement, but I want to be sure we are not taking any risk with data consistency.

-- 
Romain



* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-16 10:22 [QUESTION] multiple fsync() vs single sync() Romain Le Disez
  2018-10-16 12:57 ` Carlos Maiolino
@ 2018-10-17  1:16 ` Dave Chinner
  2018-10-19  8:16   ` Romain Le Disez
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2018-10-17  1:16 UTC
  To: Romain Le Disez; +Cc: linux-xfs

On Tue, Oct 16, 2018 at 10:22:18AM +0000, Romain Le Disez wrote:
> Hi all,
> 
> In this pseudo-code (extracted from OpenStack Swift [1]):
>     fd=open("/tmp/tempfile", O_CREAT | O_WRONLY);
>     write(fd, ...);
>     fsetxattr(fd, ...);
>     fsync(fd);
>     rename("/tmp/tempfile", "/data/foobar");
>     dirfd = open("/data", O_DIRECTORY | O_RDONLY);
>     fsync(dirfd);
> 
> OR (the same without temporary file):
>     fd=open("/data", O_TMPFILE | O_WRONLY);
>     write(fd, ...);
>     fsetxattr(fd, ...);
>     fsync(fd);
>     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);

	linkat(fd, "",  AT_FDCWD, "/data/foobar", AT_EMPTY_PATH);

>     dirfd = open("/data", O_DIRECTORY | O_RDONLY);
>     fsync(dirfd);

> I’m guaranteed that, whatever happens, I’ll have either a
> complete file (data+xattr) or no file at all in the directory
> /data.

Yes.

> Second question: if I replace the two fsync() calls with one sync(), do I
> get the same guarantee?
>     fd=open("/data", O_TMPFILE | O_WRONLY);
>     write(fd, ...);
>     fsetxattr(fd, ...);
>     linkat(AT_FDCWD, "/proc/self/fd/" + fd, AT_FDCWD, "/data/foobar", AT_SYMLINK_FOLLOW);
>     sync();
> 
> From what I understand of the FAQ [2], write_barrier guarantees
> that the journal (aka log) will be written before the inode (aka
> metadata). Did I miss something?

"write barriers" don't exist anymore. What we have these days are
cache flushes to correctly order data/metadata IO vs journal IO.

The syncfs() operation (and sync(), which is just syncfs() across
all filesystems) writes outstanding data first, then asks the
filesystem to force metadata to stable storage. XFS does that with
a log flush, which issues a cache flush (data now on stable storage)
followed by FUA log writes (metadata now on stable storage in the
journal).

So, effectively, you get the same thing in both cases. The only
difference is that sync() does a lot more work than a couple of
fsync() operations, and does work system wide on filesystems and
files you don't care about. fsync() will always perform better on a
busy system than a sync call.
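
To confine the flush to the one filesystem that holds the data, syncfs() can
be called on any file descriptor opened on that filesystem. The following is a
minimal Python sketch, assuming a Linux libc that exports syncfs(2) and
reaching it through ctypes; the helper name sync_filesystem is hypothetical.

```python
import ctypes
import os

# Assumption: the C library already loaded into the process exports syncfs(2),
# as glibc does on Linux.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.syncfs.argtypes = [ctypes.c_int]
_libc.syncfs.restype = ctypes.c_int


def sync_filesystem(path):
    """Flush only the filesystem that contains `path` (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if _libc.syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), path)
    finally:
        os.close(fd)
```

Calling sync_filesystem("/data") flushes the filesystem mounted under /data
and leaves the others alone, whereas os.sync() covers every mounted
filesystem.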

Let the filesystem worry about optimising fsync calls necessary for
consistency and integrity purposes. If there was a faster way than
issuing fsync on only the objects that need it when required, then
everyone would be using it all the time....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-16 14:09     ` Romain Le Disez
@ 2018-10-18 11:43       ` Carlos Maiolino
  0 siblings, 0 replies; 8+ messages in thread
From: Carlos Maiolino @ 2018-10-18 11:43 UTC
  To: Romain Le Disez; +Cc: Stefan Ring, linux-xfs

On Tue, Oct 16, 2018 at 02:09:27PM +0000, Romain Le Disez wrote:
> 
> > On 16 Oct 2018, at 15:53, Stefan Ring <stefanrin@gmail.com> wrote:
> > 
> > But in what order? If I understood correctly, with the single sync()
> > call, he might end up with a directory entry referencing an incomplete
> > file. Which should not be possible in the case with the two fsyncs.
> 
> In what order, this is exactly my question :)
> 
> We are creating hundreds or thousands of files in a row. Converting thousands of fsync() to one sync() would be a great performance improvement, but I want to be sure we are not taking any risk with data consistency.

I honestly don't remember off the top of my head. A sync() will cause the whole
XFS log to be flushed, and the flush order, I believe, follows the order in
which the metadata was logged. But I don't believe that matters here. In
practice it doesn't matter which is flushed first, file or directory metadata:
if sync() fails, you must assume nothing got flushed at all, rather than guess
whether something got flushed.
But, as I mentioned before, and as Dave did too, unless you want to cause a
whole-filesystem flush every time a file is modified, use fsync() on the
specific files instead of a global sync().

Cheers



-- 
Carlos


* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-17  1:16 ` Dave Chinner
@ 2018-10-19  8:16   ` Romain Le Disez
  2018-10-19 12:12     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Romain Le Disez @ 2018-10-19  8:16 UTC
  To: Dave Chinner; +Cc: linux-xfs

Thanks all for your answers; they are really helpful, and I now have a clearer picture of how it works.

I have one last question.

If I’m in the process of creating 1000 files this way, but the server crashes before the syncfs() function was called, what will happen to the files that were already rename()/linkat()?

Do they follow the same ordering, so I’m sure they are either complete (all data/xattr + xfs metadata) or not in the destination directory?

Or, is syncfs() the only way to ensure this ordering?

Thanks a lot for your time.

-- 
Romain



* Re: [QUESTION] multiple fsync() vs single sync()
  2018-10-19  8:16   ` Romain Le Disez
@ 2018-10-19 12:12     ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2018-10-19 12:12 UTC
  To: Romain Le Disez; +Cc: linux-xfs

On Fri, Oct 19, 2018 at 08:16:27AM +0000, Romain Le Disez wrote:
> Thanks all for your answers; they are really helpful, and I now have a clearer picture of how it works.
> 
> I have one last question.
> 
> If I’m in the process of creating 1000 files this way, but the server crashes before the syncfs() function was called, what will happen to the files that were already rename()/linkat()?

Files up to a certain point will be there, in order, but the data in
those files is likely to be missing. Which files are there and which
files have data will be completely random.

i.e. you'll have a mess to clean up.

> Do they follow the same ordering, so I’m sure they are either complete (all data/xattr + xfs metadata) or not in the destination directory?
> 
> Or, is syncfs() the only way to ensure this ordering?

syncfs is like a bulk checkpoint. Until it completes, there are no
guarantees about anything. The only way to guarantee per-file data
integrity and ordering is to use fsync.
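
The bulk-checkpoint behaviour can be made explicit in code. The sketch below
is an illustration in Python rather than anything taken from Swift: the
function name write_batch_then_sync and the temporary-file naming are
hypothetical. It creates a batch of files without any per-file fsync() and
treats none of them as durable until the single sync() has returned.

```python
import os


def write_batch_then_sync(directory, items):
    """Create each (name, data) pair, then issue one sync() for the batch.

    Hypothetical helper: the returned names must not be treated as durable
    by the caller until this function has returned.
    """
    names = []
    for name, data in items:
        tmp_path = os.path.join(directory, f".{name}.tmp")
        with open(tmp_path, "wb") as f:
            f.write(data)  # no fsync here, data may still be in memory only
        os.rename(tmp_path, os.path.join(directory, name))
        names.append(name)

    # Until this call returns there are no guarantees: after a crash, any of
    # the names above may be missing, or present without its data.
    os.sync()
    return names
```

If the machine crashes before os.sync() returns, the whole batch has to be
considered suspect and cleaned up or rewritten, which is the trade-off
against calling fsync() per file.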

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

