* Symlink not persisted even after fsync
@ 2018-04-12 17:51 Jayashree Mohan
  2018-04-13  5:52 ` Amir Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Jayashree Mohan @ 2018-04-12 17:51 UTC (permalink / raw)
  To: linux-btrfs, fstests, linux-f2fs-devel; +Cc: Vijaychidambaram Velayudhan Pillai

Hi,

We came across what seems to be a crash consistency bug on btrfs and
f2fs. When we create a symlink for a file and fsync the symlink, on
recovery from crash, the fsync-ed file is missing.

You can reproduce this behaviour using this minimal workload :

1. symlink (foo, bar)
2. open bar
3. fsync bar
----crash here----

When we recover, we find that file bar is missing. This behaviour
seems unexpected as the fsynced file is lost on a crash. ext4 and xfs
correctly recover file bar. This seems like a bug. If not, could you
explain why?

Do let me know if I am missing some detail here.

Thanks,
Jayashree Mohan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
  2018-04-12 17:51 Symlink not persisted even after fsync Jayashree Mohan
@ 2018-04-13  5:52 ` Amir Goldstein
  2018-04-13 12:57   ` Vijay Chidambaram
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Amir Goldstein @ 2018-04-13  5:52 UTC (permalink / raw)
  To: Jayashree Mohan
  Cc: linux-btrfs, fstests, linux-f2fs-devel,
	Vijaychidambaram Velayudhan Pillai

On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan
<jayashree2912@gmail.com> wrote:
> Hi,
>
> We came across what seems to be a crash consistency bug on btrfs and
> f2fs. When we create a symlink for a file and fsync the symlink, on
> recovery from crash, the fsync-ed file is missing.
>
> You can reproduce this behaviour using this minimal workload :
>
> 1. symlink (foo, bar)
> 2. open bar
> 3. fsync bar
> ----crash here----
>
> When we recover, we find that file bar is missing. This behaviour
> seems unexpected as the fsynced file is lost on a crash. ext4 and xfs
> correctly recover file bar. This seems like a bug. If not, could you
> explain why?
>

Not a bug.

From man 2 fsync:

"Calling  fsync() does not necessarily ensure that the entry in the
 directory containing the file has also reached disk.  For that an
 explicit fsync() on a file descriptor for the directory is also needed."
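
A minimal sketch of the pattern the man page describes: fsync the new file, then fsync the directory that holds its entry. It assumes GNU coreutils 8.24 or later, where "sync FILE..." calls fsync() on each operand, directories included. The workdir and data names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: persist a new file and, separately, its directory entry.
# Assumption: coreutils `sync` accepts file operands and fsync()s each one.
workdir=$(mktemp -d)

echo "payload" > "$workdir/data"
sync "$workdir/data"     # fsync(data): file contents and inode
sync "$workdir"          # fsync(dir): the directory, and with it the new entry
dir_sync_status=$?

echo "directory fsync exit status: $dir_sync_status"
rm -rf "$workdir"
```

Whether the second call is strictly needed depends on the filesystem, which is the question the rest of this thread argues about.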

There is a reason why this behavior is not being reproduced in
ext4/xfs, but you should be able to reproduce a similar issue
like this:

1. symlink (foo, bar.tmp)
2. open bar.tmp
3. fsync bar.tmp
4. rename(bar.tmp, bar)
5. fsync bar
----crash here----

Thanks,
Amir.


* Re: Symlink not persisted even after fsync
  2018-04-13  5:52 ` Amir Goldstein
@ 2018-04-13 12:57   ` Vijay Chidambaram
       [not found]   ` <CAPaz=E+-baGSWhL3nD-8X4jn6rKdn2AVGLeqWh3EY5Nh-RodRA@mail.gmail.com>
  2018-04-13 14:06   ` Dave Chinner
  2 siblings, 0 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-13 12:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jayashree Mohan, linux-btrfs, fstests, linux-f2fs-devel

Hi Amir,

Thanks for the reply!

On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Not a bug.
>
> From man 2 fsync:
>
> "Calling  fsync() does not necessarily ensure that the entry in the
>  directory containing the file has also reached disk.  For that an
>  explicit fsync() on a file descriptor for the directory is also needed."

Are we understanding this right:

ext4 and xfs fsync the parent directory if a sym link file is
fsync-ed. But btrfs does not. Is this what we are seeing?

I agree that fsync of a file does not mean fsync of its directory
entry, but it seems odd to do it for regular files and not for sym
links. We do not see this behavior if we use a regular file instead of
a sym link file.

> There is a reason why this behavior is not being reproduced in
> ext4/xfs, but you should be able to reproduce a similar issue
> like this:
>
>
> 1. symlink (foo, bar.tmp)
> 2. open bar.tmp
> 3. fsync bar.tmp
> 4. rename(bar.tmp, bar)
> 5. fsync bar
> ----crash here----

I'm guessing xfs/ext4 detect the symlink-fsync pattern and fsync the
parent dir in our workload, but would miss it because of the rename in
the workload you provided?

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/


* Re: Symlink not persisted even after fsync
       [not found]   ` <CAPaz=E+-baGSWhL3nD-8X4jn6rKdn2AVGLeqWh3EY5Nh-RodRA@mail.gmail.com>
@ 2018-04-13 13:16     ` Amir Goldstein
  2018-04-13 14:39       ` Jayashree Mohan
  0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2018-04-13 13:16 UTC (permalink / raw)
  To: Vijaychidambaram Velayudhan Pillai
  Cc: Jayashree Mohan, linux-btrfs, fstests, linux-f2fs-devel

On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
> Hi Amir,
>
> Thanks for the reply!
>
> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>> Not a bug.
>>
>> From man 2 fsync:
>>
>> "Calling  fsync() does not necessarily ensure that the entry in the
>>  directory containing the file has also reached disk.  For that an
>>  explicit fsync() on a file descriptor for the directory is also needed."
>
>
> Are we understanding this right:
>
> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
> btrfs does not. Is this what we are seeing?

Nope.

You are seeing an unintentional fsync of parent, because both
parent update and symlink update are metadata updates that are
tracked by the same transaction.

fsync of symlink forces the current transaction to the journal,
pulling in the parent update with it.


>
> I agree that fsync of a file does not mean fsync of its directory entry, but
> it seems odd to do it for regular files and not for sym links. We do not see
> this behavior if we use a regular file instead of a sym link file.
>

fsync of regular file behaves differently than fsync of non regular file.
I suggest this read:
https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

>>
>> There is a reason why this behavior is not being reproduced in
>> ext4/xfs, but you should be able to reproduce a similar issue
>> like this:
>>
>>
>> 1. symlink (foo, bar.tmp)
>> 2. open bar.tmp
>> 3. fsync bar.tmp
>> 4. rename(bar.tmp, bar)
>> 5. fsync bar
>> ----crash here----
>
>
> I'm guessing xfs/ext4 detect the symlink-fsync pattern and fsync the parent
> dir in our workload, but would miss it because of the rename in the workload
> you provided?
>

No pattern detecting by xfs/ext4 AFAIK.
rename does not change metadata of victim, so fsync(bar)
may (depending on fs) trigger no metadata transaction commit.

Thanks,
Amir.


* Re: Symlink not persisted even after fsync
  2018-04-13  5:52 ` Amir Goldstein
  2018-04-13 12:57   ` Vijay Chidambaram
       [not found]   ` <CAPaz=E+-baGSWhL3nD-8X4jn6rKdn2AVGLeqWh3EY5Nh-RodRA@mail.gmail.com>
@ 2018-04-13 14:06   ` Dave Chinner
  2 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2018-04-13 14:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jayashree Mohan, linux-btrfs, fstests, linux-f2fs-devel,
	Vijaychidambaram Velayudhan Pillai

On Fri, Apr 13, 2018 at 08:52:19AM +0300, Amir Goldstein wrote:
> On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan
> <jayashree2912@gmail.com> wrote:
> > Hi,
> >
> > We came across what seems to be a crash consistency bug on btrfs and
> > f2fs. When we create a symlink for a file and fsync the symlink, on
> > recovery from crash, the fsync-ed file is missing.
> >
> > You can reproduce this behaviour using this minimal workload :
> >
> > 1. symlink (foo, bar)
> > 2. open bar
> > 3. fsync bar
> > ----crash here----
> >
> > When we recover, we find that file bar is missing. This behaviour
> > seems unexpected as the fsynced file is lost on a crash. ext4 and xfs
> > correctly recover file bar. This seems like a bug. If not, could you
> > explain why?
> >
> 
> Not a bug.

Actually, for a filesystem with strictly ordered metadata recovery
semantics, it is a bug.

> From man 2 fsync:
> 
> "Calling  fsync() does not necessarily ensure that the entry in the
>  directory containing the file has also reached disk.  For that an
>  explicit fsync() on a file descriptor for the directory is also needed."

We've been through this before, many times. This caveat does not
apply to strictly ordered metadata filesystems. If you fsync a file
on an ordered metadata filesystem, then all previous transactions
that are needed to reference the file are also committed.

The behaviour from ext4 and XFS is correct for strictly ordered
filesystems.  This is not a "fsync requirement", nor is it a general
linux filesystem requirement. It is a requirement of the desired
filesystem crash recovery mechanisms....

BTRFS is advertised as having strictly ordered metadata
recovery semantics, so it should behave the same way as ext4 and
XFS in tests like these. If it doesn't, then there's filesystem bugs
that need fixing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Symlink not persisted even after fsync
  2018-04-13 13:16     ` Amir Goldstein
@ 2018-04-13 14:39       ` Jayashree Mohan
  2018-04-14  1:20         ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Jayashree Mohan @ 2018-04-13 14:39 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vijaychidambaram Velayudhan Pillai, linux-btrfs, fstests,
	linux-f2fs-devel, Dave Chinner

Hey Dave,

Thanks for clarifying the crash recovery semantics of strictly
metadata ordered filesystems. We had a follow-up question in this
case.

On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
>> Hi Amir,
>>
>> Thanks for the reply!
>>
>> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>
>>> Not a bug.
>>>
>>> From man 2 fsync:
>>>
>>> "Calling  fsync() does not necessarily ensure that the entry in the
>>>  directory containing the file has also reached disk.  For that an
>>>  explicit fsync() on a file descriptor for the directory is also needed."
>>
>>
>> Are we understanding this right:
>>
>> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
>> btrfs does not. Is this what we are seeing?
>
> Nope.
>
> You are seeing an unintentional fsync of parent, because both
> parent update and symlink update are metadata updates that are
> tracked by the same transaction.
>
> fsync of symlink forces the current transaction to the journal,
> pulling in the parent update with it.
>
>
>>
>> I agree that fsync of a file does not mean fsync of its directory entry, but
>> it seems odd to do it for regular files and not for sym links. We do not see
>> this behavior if we use a regular file instead of a sym link file.
>>
>
> fsync of regular file behaves differently than fsync of non regular file.
> I suggest this read:
> https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
>
>>>
>>> There is a reason why this behavior is not being reproduced in
>>> ext4/xfs, but you should be able to reproduce a similar issue
>>> like this:
>>>
>>>
>>> 1. symlink (foo, bar.tmp)
>>> 2. open bar.tmp
>>> 3. fsync bar.tmp
>>> 4. rename(bar.tmp, bar)
>>> 5. fsync bar
>>> ----crash here----
>>

Going by your argument that all previous transactions that referenced
the file being fsync-ed need to be committed, should we expect xfs
(and ext4) to persist file bar in this case?

If that's expected, I'd like to bring to your notice that file bar is
not persisted in either xfs or ext4. Is there any other detail we
should be considering in this scenario?


>>
>> I'm guessing xfs/ext4 detect the symlink-fsync pattern and fsync the parent
>> dir in our workload, but would miss it because of the rename in the workload
>> you provided?
>>
>
> No pattern detecting by xfs/ext4 AFAIK.
> rename does not change metadata of victim, so fsync(bar)
> may (depending on fs) trigger no metadata transaction commit.
>
> Thanks,
> Amir.


* Re: Symlink not persisted even after fsync
  2018-04-13 14:39       ` Jayashree Mohan
@ 2018-04-14  1:20         ` Dave Chinner
  2018-04-14  3:27           ` Vijay Chidambaram
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2018-04-14  1:20 UTC (permalink / raw)
  To: Jayashree Mohan
  Cc: Amir Goldstein, Vijaychidambaram Velayudhan Pillai, linux-btrfs,
	fstests, linux-f2fs-devel

On Fri, Apr 13, 2018 at 09:39:27AM -0500, Jayashree Mohan wrote:
> Hey Dave,
> 
> Thanks for clarifying the crash recovery semantics of strictly
> metadata ordered filesystems. We had a follow-up question in this
> case.
> 
> On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
> >> Hi Amir,
> >>
> >> Thanks for the reply!
> >>
> >> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> >>>
> >>> Not a bug.
> >>>
> >>> From man 2 fsync:
> >>>
> >>> "Calling  fsync() does not necessarily ensure that the entry in the
> >>>  directory containing the file has also reached disk.  For that an
> >>>  explicit fsync() on a file descriptor for the directory is also needed."
> >>
> >>
> >> Are we understanding this right:
> >>
> >> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
> >> btrfs does not. Is this what we are seeing?
> >
> > Nope.
> >
> > You are seeing an unintentional fsync of parent, because both
> > parent update and symlink update are metadata updates that are
> > tracked by the same transaction.
> >
> > fsync of symlink forces the current transaction to the journal,
> > pulling in the parent update with it.
> >
> >
> >>
> >> I agree that fsync of a file does not mean fsync of its directory entry, but
> >> it seems odd to do it for regular files and not for sym links. We do not see
> >> this behavior if we use a regular file instead of a sym link file.
> >>
> >
> > fsync of regular file behaves differently than fsync of non regular file.
> > I suggest this read:
> > https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
> >
> >>>
> >>> There is a reason why this behavior is not being reproduced in
> >>> ext4/xfs, but you should be able to reproduce a similar issue
> >>> like this:
> >>>
> >>>
> >>> 1. symlink (foo, bar.tmp)
> >>> 2. open bar.tmp
> >>> 3. fsync bar.tmp
> >>> 4. rename(bar.tmp, bar)
> >>> 5. fsync bar
> >>> ----crash here----
> >>
> 
> Going by your argument that all previous transactions that referenced
> the file being fsync-ed need to be committed, should we expect xfs
> (and ext4) to persist file bar in this case?

No, that's not what I'm implying. I'm implying that there are
specific ordering dependencies that govern this behaviour, and
assuming that what the fsync man page says about files applies to
symlinks is not a valid assumption because files and symlinks are
not equivalent objects.

In these cases, you first have to ask "what are we actually running
fsync on?"

The fsync is being performed on the inode the symlink points to, not
the symlink. You can't directly open a symlink to fsync the symlink.
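
A small sketch of that point, assuming GNU stat on Linux: resolving the path bar the way open() does lands on the inode of foo, while the symlink has a separate inode of its own that ordinary path resolution never hands back. The file names follow the workload in this thread; the temporary directory is made up.

```shell
#!/bin/bash
# Sketch only: compare the inode reached through the symlink with the
# symlink's own inode. Assumes GNU stat (-c format, -L to dereference).
set -e
workdir=$(mktemp -d)
cd "$workdir"

touch foo
ln -s foo bar

target_inode=$(stat -c %i foo)        # inode of the regular file
followed_inode=$(stat -L -c %i bar)   # follows the link, as a plain open() would
link_inode=$(stat -c %i bar)          # lstat-style: the symlink's own inode

echo "foo: $target_inode, bar followed: $followed_inode, bar itself: $link_inode"
cd / && rm -rf "$workdir"
```

So a file descriptor obtained by opening bar refers to foo, and an fsync on it acts on foo.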

Then you have to ask "what is the dependency chain between the
parent directory, the symlink and the file it points to?"

The short answer is that symlinks have no direct relationship to the
object they point to. i.e. symlinks contain a path, not a reference
to a specific filesystem object.

IOWs, symlinks are really a directory construct, not a file.
However, there is no ordering dependency between a symlink and what
it points to. symlinks contain a path which needs to be resolved to
find out what it points to, and that may not even exist. Files have
no reference to symlinks that point at them, so there's no way we
can create an ordering dependency between file updates and any
symlink that points to them.
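
The "path, not a reference" property can be seen from the shell. This is a sketch assuming GNU coreutils (readlink, stat); the names are illustrative. The link is created before its target exists, and later resolves to whichever inode currently answers to the stored path.

```shell
#!/bin/bash
# Sketch only: a symlink stores text, and that text is resolved at lookup time.
set -e
workdir=$(mktemp -d)
cd "$workdir"

ln -s foo bar                  # foo does not exist yet; the link is still created
stored_path=$(readlink bar)    # the literal path held by the symlink
if [ -e bar ]; then dangling=no; else dangling=yes; fi

touch foo
first_inode=$(stat -L -c %i bar)
mv foo foo.old                 # keep the old inode alive under another name
touch foo                      # a different inode now answers to "foo"
second_inode=$(stat -L -c %i bar)

echo "bar stores '$stored_path' (dangling at creation: $dangling)"
echo "bar resolved to inode $first_inode, then to $second_inode"
cd / && rm -rf "$workdir"
```

Nothing in foo or foo.old records that bar exists, which is why no ordering dependency can be hung off the target.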

Directories, OTOH, contain a pointer to a reference counted object
(an inode) in their dirents. Hence if you add/remove directory
dirents that point to an inode, you also have to modify the inode
link counts as it records how many directory entries point at it.
That's a bi-directional atomic modification ordering dependency
between directories and inodes they point at.
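
The link-count side of this is easy to observe. A sketch assuming GNU stat on an ordinary Linux filesystem, with made-up names: adding a hard link changes the inode's link count, while adding a symlink to the same file does not touch the inode at all.

```shell
#!/bin/bash
# Sketch only: hard links modify the target inode, symlinks do not.
set -e
workdir=$(mktemp -d)
cd "$workdir"

touch foo
links_initial=$(stat -c %h foo)       # one dirent points at the inode
ln foo hard
links_after_hard=$(stat -c %h foo)    # the new dirent bumped the link count
ln -s foo soft
links_after_soft=$(stat -c %h foo)    # the symlink left the count alone

echo "link counts: $links_initial $links_after_hard $links_after_soft"
cd / && rm -rf "$workdir"
```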

So when we look at symlinks, the parent directory has an ordering
dependency with the symlink inode, not whatever is found by
resolving the path in the symlink data. IOWs, there is no ordering
relationship between the symlink's parent directory and whatever the
symlink points at. i.e. it's a one-way relationship, and so there is
no reverse ordering dependency that requires fsync() on the file to
force synchronisation of a symlink it knows nothing about.

i.e. the ordering dependency that exists with symlinks is between
the symlink and its parent directory, not whatever the symlink
points to. Hence fsyncing whatever the symlink points to does not
guarantee that the symlink is made stable because the symlink is not
part of the dependency chain of the object being fsync()d....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Symlink not persisted even after fsync
  2018-04-14  1:20         ` Dave Chinner
@ 2018-04-14  3:27           ` Vijay Chidambaram
  2018-04-14 21:55               ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-14  3:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jayashree Mohan, Amir Goldstein, linux-btrfs, fstests, linux-f2fs-devel

Hi Dave,

Thanks for the reply.

I feel like we are not talking about the same thing here.

What we are asking is: if you perform

fsync(symlink)
crash

can we expect to see the symlink file in the parent directory after
a crash given we didn't fsync the parent directory? Amir argues we
can't expect it. Your first email seemed to argue we should expect it.
ext4 and xfs have this behavior, which Amir argues is an
implementation side-effect, and not intended.

>> >>> 1. symlink (foo, bar.tmp)
>> >>> 2. open bar.tmp
>> >>> 3. fsync bar.tmp
>> >>> 4. rename(bar.tmp, bar)
>> >>> 5. fsync bar
>> >>> ----crash here----

The second workload that Amir constructed just moves the symlink
creation into a different transaction. In both workloads, we are
creating or renaming new symlinks and calling fsync on them. In both
cases we are not explicitly calling fsync on the parent directory.

Note that we are not saying if we call fsync on symlink file, it
should call fsync on the original file. We agree that should not be
done as the symlink file and the original link are two distinct
entities.

I believe in most journaling/copy-on-write file systems today, if you
call fsync on a new file, the fsync will persist the directory entry
of the new file in the parent directory (even though POSIX doesn't
really require this). It seems reasonable to extend this persistence
courtesy to symlinks (considering them just as normal files).

Thoughts from other btrfs developers?

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

On Fri, Apr 13, 2018 at 8:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Apr 13, 2018 at 09:39:27AM -0500, Jayashree Mohan wrote:
>> Hey Dave,
>>
>> Thanks for clarifying the crash recovery semantics of strictly
>> metadata ordered filesystems. We had a follow-up question in this
>> case.
>>
>> On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> > On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
>> >> Hi Amir,
>> >>
>> >> Thanks for the reply!
>> >>
>> >> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> >>>
>> >>> Not a bug.
>> >>>
>> >>> From man 2 fsync:
>> >>>
>> >>> "Calling  fsync() does not necessarily ensure that the entry in the
>> >>>  directory containing the file has also reached disk.  For that an
>> >>>  explicit fsync() on a file descriptor for the directory is also needed."
>> >>
>> >>
>> >> Are we understanding this right:
>> >>
>> >> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But
>> >> btrfs does not. Is this what we are seeing?
>> >
>> > Nope.
>> >
>> > You are seeing an unintentional fsync of parent, because both
>> > parent update and symlink update are metadata updates that are
>> > tracked by the same transaction.
>> >
>> > fsync of symlink forces the current transaction to the journal,
>> > pulling in the parent update with it.
>> >
>> >
>> >>
>> >> I agree that fsync of a file does not mean fsync of its directory entry, but
>> >> it seems odd to do it for regular files and not for sym links. We do not see
>> >> this behavior if we use a regular file instead of a sym link file.
>> >>
>> >
>> > fsync of regular file behaves differently than fsync of non regular file.
>> > I suggest this read:
>> > https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
>> >
>> >>>
>> >>> There is a reason why this behavior is not being reproduced in
>> >>> ext4/xfs, but you should be able to reproduce a similar issue
>> >>> like this:
>> >>>
>> >>>

>> >>
>>
>> Going by your argument that all previous transactions that referenced
>> the file being fsync-ed need to be committed, should we expect xfs
>> (and ext4) to persist file bar in this case?
>
> No, that's not what I'm implying. I'm implying that there are
> specific ordering dependencies that govern this behaviour, and
> assuming that what the fsync man page says about files applies to
> symlinks is not a valid assumption because files and symlinks are
> not equivalent objects.
>
> In these cases, you first have to ask "what are we actually running
> fsync on?"
>
> The fsync is being performed on the inode the symlink points to, not
> the symlink. You can't directly open a symlink to fsync the symlink.
>
> Then you have to ask "what is the dependency chain between the
> parent directory, the symlink and the file it points to?"
>
> The short answer is that symlinks have no direct relationship to the
> object they point to. i.e. symlinks contain a path, not a reference
> to a specific filesystem object.
>
> IOWs, symlinks are really a directory construct, not a file.
> However, there is no ordering dependency between a symlink and what
> it points to. symlinks contain a path which needs to be resolved to
> find out what it points to, and that may not even exist. Files have
> no reference to symlinks that point at them, so there's no way we
> can create an ordering dependency between file updates and any
> symlink that points to them.
>
> Directories, OTOH, contain a pointer to a reference counted object
> (an inode) in their dirents. Hence if you add/remove directory
> dirents that point to an inode, you also have to modify the inode
> link counts as it records how many directory entries point at it.
> That's a bi-directional atomic modification ordering dependency
> between directories and inodes they point at.
>
> So when we look at symlinks, the parent directory has an ordering
> dependency with the symlink inode, not whatever is found by
> resolving the path in the symlink data. IOWs, there is no ordering
> relationship between the symlink's parent directory and whatever the
> symlink points at. i.e. it's a one-way relationship, and so there is
> no reverse ordering dependency that requires fsync() on the file to
> force synchronisation of a symlink it knows nothing about.
>
> i.e. the ordering dependency that exists with symlinks is between
> the symlink and its parent directory, not whatever the symlink
> points to. Hence fsyncing whatever the symlink points to does not
> guarantee that the symlink is made stable because the symlink is not
> part of the dependency chain of the object being fsync()d....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: Symlink not persisted even after fsync
  2018-04-14  3:27           ` Vijay Chidambaram
@ 2018-04-14 21:55               ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2018-04-14 21:55 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Jayashree Mohan, Amir Goldstein, linux-btrfs, fstests, linux-f2fs-devel

On Fri, Apr 13, 2018 at 10:27:56PM -0500, Vijay Chidambaram wrote:
> Hi Dave,
> 
> Thanks for the reply.
> 
> I feel like we are not talking about the same thing here.
> 
> What we are asking is: if you perform
> 
> fsync(symlink)
> crash
> 
> can we expect to see the symlink file in the parent directory after
> a crash given we didn't fsync the parent directory? Amir argues we
> can't expect it. Your first email seemed to argue we should expect it.

My first email comments on Amir's quoting of behaviours for files vs
directories on fsync, and then applying those caveats to symlinks.
It probably wasn't that clear I was mainly trying to point out that
symlinks are not files, so they have different ordering
requirements. i.e. that you have to look at ordering requirements of
the filesystems, not the fsync() specification to determine what the
fsync behaviour is supposed to be.

My second email clarifies the ordering behaviour that is expected
with symlinks and the reason why you'll see different behaviour to
files w.r.t. fsync and parent directories.

> ext4 and xfs have this behavior, which Amir argues is an
> implementation side-effect, and not intended.
> 
> >> >>> 1. symlink (foo, bar.tmp)
> >> >>> 2. open bar.tmp
> >> >>> 3. fsync bar.tmp
> >> >>> 4. rename(bar.tmp, bar)
> >> >>> 5. fsync bar
> >> >>> ----crash here----
> 
> The second workload that Amir constructed just moves the symlink
> creation into a different transaction. In both workloads, we are
> creating or renaming new symlinks and calling fsync on them. In both
> cases we are not explicitly calling fsync on the parent directory.

Yes, I decided not to write all this "symlink behaviour is dependent
on initial conditions" stuff because, AFAIC, it is a pretty obvious
conclusion to draw from the ordering dependencies I described
between the symlink and the object it points at.

Script that demonstrates this is simple:

$ cat t.sh
#!/bin/bash

dev=/dev/vdb
mnt=/mnt/scratch
test_file=$mnt/foo

# 1. symlink (foo, bar.tmp)
# 2. open bar.tmp
# 3. fsync bar.tmp
# 4. rename(bar.tmp, bar)
# 5. fsync bar

umount $mnt
mount $dev $mnt

cd $mnt
rm -f foo bar.tmp bar
sync

# Don't fsync creation of foo, will see foo and bar.tmp after shutdown
touch foo
ln -s foo bar.tmp
xfs_io -c fsync bar.tmp
mv bar.tmp bar
xfs_io -c fsync bar
xfs_io -xc "shutdown" $mnt

cd ~
umount $mnt
mount $dev $mnt
cd $mnt
ls -l $mnt
rm -f foo bar.tmp bar
sync

# fsync foo then dirty it again, don't fsync bar.tmp; will see foo and bar after shutdown
touch foo
xfs_io -c fsync foo

touch foo
ln -s foo bar.tmp
mv bar.tmp bar
xfs_io -c fsync bar
xfs_io -xc "shutdown" $mnt


cd ~
umount $mnt
mount $dev $mnt
cd $mnt
ls -l $mnt
rm -f foo bar.tmp bar
sync

# fsync creation of foo, will see only foo after shutdown
touch foo
xfs_io -c fsync foo

ln -s foo bar.tmp
xfs_io -c fsync bar.tmp
mv bar.tmp bar
xfs_io -c fsync bar
xfs_io -xc "shutdown" $mnt

cd ~
umount $mnt
mount $dev $mnt
cd $mnt
ls -l $mnt
$

And the output is:

$ sudo umount /mnt/scratch ; sudo mount /dev/vdb /mnt/scratch ; sudo ./t.sh ;
total 0
lrwxrwxrwx. 1 root root 3 Apr 14 09:52 bar.tmp -> foo
-rw-r--r--. 1 root root 0 Apr 14 09:52 foo
total 0
lrwxrwxrwx. 1 root root 3 Apr 14 09:52 bar -> foo
-rw-r--r--. 1 root root 0 Apr 14 09:52 foo
total 0
-rw-r--r--. 1 root root 0 Apr 14 09:52 foo
$

i.e. it depends on the state of the original file as to what is
captured by the fsync of that file through the symlink. i.e.
symlinks have no ordering dependency with the object resolved from
the path in the symlink.


> Note that we are not saying if we call fsync on symlink file, it
> should call fsync on the original file. We agree that should not be
> done as the symlink file and the original link are two distinct
> entities.

"symlink file" - there's no such thing. It's either a symlink or a
regular file and it can't be both.

And, well, you can't fsync a symlink *inode*, anyway, because you
can't open it directly for IO operations.

> I believe in most journaling/copy-on-write file systems today, if you
> call fsync on a new file, the fsync will persist the directory entry
> of the new file in the parent directory (even though POSIX doesn't
> really require this).

Yes, that's the strict ordering dependency thing I talked about, and
it was something that btrfs got wrong for an awful long time.

> It seems reasonable to extend this persistence
> courtesy to symlinks (considering them just as normal files).

And no, that's not reasonable, because symlinks only contain a path
instead of a direct reference to any filesystem object. i.e. it's an
indirect reference, and that can be clearly seen by the fact that
symlinks are created and removed without referencing the object they
point to or caring whether it is even valid.

There is no way reliable ordering dependencies can be created for
indirect references, especially as symlinks can point to any type of
object (e.g. dir, blkdev, etc), it can point to something outside
the filesystem, and it can even point to something that doesn't
exist.

This also means that "fsync on a symlink" may, in fact, run a fsync
method of a completely different filesystem or subsystem. There is
no way this could possibly trigger a directory fsync of the symlink
parent, because the object being fsync()d may not even know what a
filesystem is...
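
To make the "any type of object" point concrete, here is a sketch assuming GNU stat and a Linux system where /dev/null is the usual character device. The link names are made up. Each symlink is created the same way, yet what an open() through it would reach differs completely.

```shell
#!/bin/bash
# Sketch only: symlinks pointing at a device node, a directory, and nothing.
set -e
workdir=$(mktemp -d)
cd "$workdir"

ln -s /dev/null to_chardev       # target lives outside this filesystem
ln -s "$workdir" to_dir          # target is a directory
ln -s /no/such/path to_nothing   # target does not exist

chardev_type=$(stat -L -c %F to_chardev)
dir_type=$(stat -L -c %F to_dir)
if [ -e to_nothing ]; then nothing_resolves=yes; else nothing_resolves=no; fi

echo "to_chardev -> $chardev_type"
echo "to_dir     -> $dir_type"
echo "to_nothing resolves: $nothing_resolves"
cd / && rm -rf "$workdir"
```

An fsync through to_chardev would be handled by the device's driver, not by the filesystem that stores the symlink.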

If you want a symlink to have ordering behaviour like a dirent
pointing to a regular file, then use hard links....
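
A sketch of the hard-link variant of the workload, again assuming a coreutils sync that fsync()s file operands. Here bar is a second directory entry for the same inode as foo, so the fsync of bar acts on an object whose link count was changed by creating bar. Whether bar is actually present after a crash still depends on the filesystem's ordering guarantees; this only shows the shape of the calls.

```shell
#!/bin/bash
# Sketch only: same workload with a hard link instead of a symlink.
set -e
workdir=$(mktemp -d)
cd "$workdir"

touch foo
sync foo       # fsync(foo): the file and, on ordered filesystems, its dirent
ln foo bar     # second dirent for the same inode; link count goes to 2
sync bar       # fsync on the inode that both foo and bar name

if [ foo -ef bar ]; then same_inode=yes; else same_inode=no; fi
echo "foo and bar share an inode: $same_inode"
cd / && rm -rf "$workdir"
```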

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

$

i.e. it depends on the state of the original file as to what is
captured by the fsync of that file through the symlink. i.e.
symlinks have no ordering dependency with the object resolved from
the path in the symlink.


> Note that we are not saying if we call fsync on symlink file, it
> should call fsync on the original file. We agree that should not be
> done as the symlink file and the original link are two distinct
> entities.

"symlink file" - there's no such thing. It's either a symlink or a
regular file and it can't be both.

And, well, you can't fsync a symlink *inode*, anyway, because you
can't open it directly for IO operations.

> I believe in most journaling/copy-on-write file systems today, if you
> call fsync on a new file, the fsync will persist the directory entry
> of the new file in the parent directory (even though POSIX doesn't
> really require this).

Yes, that's the strict ordering dependency thing I talked about, and
it was something that btrfs got wrong for an awful long time.

> It seems reasonable to extend this persistence
> courtesy to symlinks (considering them just as normal files).

And no, that's not reasonable, because symlinks only contain a path
instead of a direct reference to any filesystem object. i.e. it's an
indirect reference, and that can be clearly seen by the fact that
symlinks are created and removed without referencing the object they
point to or caring whether it is even valid.

There is no way reliable ordering dependencies can be created for
indirect references, especially as symlinks can point to any type of
object (e.g. dir, blkdev, etc), it can point to something outside
the filesystem, and it can even point to something that doesn't
exist.

This also means that "fsync on a symlink" may, in fact, run a fsync
method of a completely different filesystem or subsystem. There is
no way this could possibly trigger a directory fsync of the symlink
parent, because the object being fsync()d may not even know what a
filesystem is...

If you want a symlink to have ordering behaviour like a dirent
pointing to a regular file, then use hard links....
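
A minimal sketch of the hard-link alternative, assuming Linux and
Python's os module; the names foo and bar and the helper
hard_link_demo() are made up for illustration. It only shows that a
hard link is a second directory entry for the same inode, so an fd
opened through either name refers to the object that fsync() acts on.
Whether the fsync also persists the new name still depends on the
filesystem's ordering behaviour discussed above.

```python
import os
import tempfile


def hard_link_demo():
    """Return (same_inode, link_count) for a file with one extra hard link."""
    with tempfile.TemporaryDirectory() as d:
        foo = os.path.join(d, "foo")      # hypothetical original name
        bar = os.path.join(d, "bar")      # hypothetical hard-link name
        with open(foo, "w") as f:
            f.write("data\n")
        os.link(foo, bar)                 # second dirent, same inode
        fd = os.open(bar, os.O_RDONLY)    # direct reference, no path indirection
        try:
            os.fsync(fd)                  # acts on the inode shared by foo and bar
            st = os.fstat(fd)
        finally:
            os.close(fd)
        return st.st_ino == os.stat(foo).st_ino, st.st_nlink
```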

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
  2018-04-14 21:55               ` Dave Chinner
  (?)
@ 2018-04-15  1:13               ` Vijay Chidambaram
  2018-04-15  1:30                   ` Theodore Y. Ts'o
  -1 siblings, 1 reply; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-15  1:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jayashree Mohan, Amir Goldstein, linux-btrfs, fstests, linux-f2fs-devel

Hi Dave,

Thank you for your detailed reply.

I think we still have a misunderstanding. Bear with me, much of this
may seem obvious to you, but not to us and future readers of this
mailing list :)

We are *not* saying an fsync on a symlink file has to result in any
action on the original file. We understand the lack of ordering
constraints here.

Consider the following:

/p/t/testdir> touch orig
/p/t/testdir> ln -s orig symlink

/p/t/testdir> ls -l
total 8
-rw-r--r--  1 vijay  wheel  0 Apr 14 19:31 orig
lrwxr-xr-x  1 vijay  wheel  4 Apr 14 19:31 symlink -> orig

Here, there is a directory entry in testdir for symlink.

If we fsync symlink, is that directory entry persisted or not? That is
what we want to know. Regardless of whether symlink is a regular file
or an original file, it has a directory entry. We are saying *nothing*
about orig in this example.

If you fsync the symlink file ("symlink" in the example), does it
persist the directory entry for "symlink" also? Whatever relationship
exists between "testdir" and "orig", that relationship also exists
between "symlink" and "testdir".

From your emails, I believe the answer is "no". The answer seems to be
"yes" for regular files, although this seems like an implementation
side-effect on file systems like btrfs (it's not a guarantee btrfs
seeks to provide).

Regarding how we are able to fsync "symlink":

open(symlink) -> fails

but

fd = open(symlink, O_CREAT|O_RDWR) -> succeeds (even if symlink already exists)
fsync(fd) -> succeeds

So perhaps fsync on "symlink" is unsupported behavior that varies from
file system to file system? We saw ext4 and xfs had this behavior, so
we assumed it to be the default.

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
  2018-04-14 21:55               ` Dave Chinner
@ 2018-04-15  1:17                 ` Theodore Y. Ts'o
  -1 siblings, 0 replies; 25+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-15  1:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vijay Chidambaram, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

The only thing I would add to Dave's comments is that a lot of these
formal semantics are de facto, and not de jure.  If you take a look at
POSIX or the Single Unix Specification, they are remarkably silent
about how fsync works.

In fact POSIX/SUS doesn't even define "fsync on a directory".  In the
original POSIX, the O_DIRECTORY flag does not exist and the directory
stream object returned by opendir(3) does not have to be implemented using
a file descriptor[1].

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/opendir.html

In SUSv7, between adding openat(2) and fchdir(2), etc., the standards
body has backed itself into more-or-less admitting that on all
implementations that matter directory fd's really do exist.  But if
you take a look at what is stated about fsync(2), it only talks about
what it does in relation to _files_, and not directories, or anything else[2].

[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html

Furthermore, "strictly ordered metadata recovery semantics" is not
something which is formally in any kind of standards document.
Filesystem developers know what it means, and it gets encoded as
things like tests in xfstests.  But at the same time, we need to be
careful not to invent stricter "guarantees" than what is required by
the standards and the generally agreed-upon norms by file system
developers.

Otherwise we can have academics inventing guarantees, such as Pillai
et al.[3], and justifying this because they find applications have
better crash semantics with these new guarantees --- and instead of
saying that the applications are buggy, the paper proposes
that perhaps file systems should provide those extra guarantees.

[3]  https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

The problem with this is that providing those extra guarantees may very
often impose performance tradeoffs; and while I'm not saying that
the only thing file system authors should feel obliged to provide is
the bare minimum specified by POSIX (which doesn't require strictly
ordered metadata semantics), at the same time --- let's not go crazy.
There are cost-benefit decisions that need to be made.

So in the case of symlinks, the first thing I would ask is *why* do
application writers really want formal crash semantics for symlinks?
Is it a reasonable thing for them to want it?  And is it a good thing
for them to want, given that portable code should work on more than
just one file system, and certainly on more than one operating system
--- and there are no guarantees that all POSIX-compliant operating
systems will even *have* symlinks.  So in my opinion the best thing to
do is to assume that they exist for system administrator convenience,
and they aren't things which applications should be trying to use in
use cases where they need some kind of transactional semantics.

> And, well, you can't fsync a symlink *inode*, anyway, because you
> can't open it directly for IO operations.

Well.... you can get a fd on a symlink using O_PATH | O_NOFOLLOW.  fsync(2)
on such a fd doesn't work today, but one could imagine a future kernel
extension which adds to the system calls that can use a fd-on-a-symlink
beyond fchownat(2), fstatat(2), readlinkat(2), et al., and allows
fsync(2) to work.  (It would require VFS and file-system level
changes.)

But the first question to ask is *why*?  Is it worth the extra hair
and complexity?

Especially given that if the file system has ordered metadata
semantics after a crash, there are other ways that an application can
request the same semantics.
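
The mail does not say which other ways are meant.  One commonly used
candidate, shown here purely as an assumption, is to fsync the parent
directory after creating the symlink (the fsync(2) man page on Linux
recommends an explicit fsync on the directory when the directory entry
must reach disk).  The helper name symlink_durably() is hypothetical,
and as noted above POSIX itself does not define fsync on a directory.

```python
import os


def symlink_durably(target, linkpath):
    """Create a symlink, then fsync the directory that holds its entry."""
    os.symlink(target, linkpath)
    parent = os.path.dirname(os.path.abspath(linkpath))
    # Open the directory itself; O_DIRECTORY makes the open fail if
    # 'parent' turns out not to be a directory.
    dfd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dfd)   # ask the filesystem to persist the directory
    finally:
        os.close(dfd)
```

The symlink's target is never opened here, so this works even when the
symlink is dangling.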

					- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
  2018-04-15  1:13               ` Vijay Chidambaram
@ 2018-04-15  1:30                   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 25+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-15  1:30 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

On Sat, Apr 14, 2018 at 08:13:28PM -0500, Vijay Chidambaram wrote:
> 
> We are *not* saying an fsync on a symlink file has to result in any
> action on the original file. We understand the lack of ordering
> constraints here.

The problem is you're not being precise here.  The fsync(2) system
call operates on a file descriptor, not a file.  This is an important
distinction.

Suppose you have a symlink called "foo" which points to  "/tmp/bar/quux"

	fd = open("foo", O_CREAT|O_RDWR);

It will _fail_ if /tmp/bar does not exist:

% ls -l foo
0 lrwxrwxrwx 1 tytso tytso 13 Apr 14 21:21 foo -> /tmp/bar/quux
% echo test > foo
bash: foo: No such file or directory
% mkdir /tmp/bar
% echo test > foo
% ls -l /tmp/bar
total 4
4 -rw-r--r-- 1 tytso tytso 5 Apr 14 21:22 quux

When you open "foo", the resulting file descriptor is not associated
with the symlink.  The resulting file descriptor is the exact same
thing you would get if you had instead called:

	fd = open("/tmp/bar/quux", O_CREAT|O_RDWR);

Hence, when you call

       fsync(fd);

What you are calling fsync on is not the _symlink_, but the inode
which is named by /tmp/bar/quux.


Now, as I said in the e-mail I just sent, you _can_ do this:

	fd = open ("foo", O_PATH | O_NOFOLLOW);

And this *will* give you a file descriptor which is associated with
the symlink, and not the inode (if it exists) which the symlink points
at.  However, only a very limited number of system calls can
operate on that file descriptor --- and read(2), write(2), and
fsync(2) are not among them.
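
A minimal sketch of the distinction above, assuming Linux and Python's
os module; the helper name fsync_via_symlink() and the temporary paths
are made up for illustration.  It compares inode numbers to show which
object each file descriptor refers to, and records the errno that
fsync() returns on the O_PATH descriptor.

```python
import os
import tempfile


def fsync_via_symlink():
    """Return (fd_is_target, fd_is_symlink, pfd_is_symlink, fsync_errno)."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "quux")   # hypothetical target, not created yet
        link = os.path.join(d, "foo")      # hypothetical symlink name
        os.symlink(target, link)

        # open() follows the symlink, so O_CREAT creates the *target*.
        fd = os.open(link, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            ino = os.fstat(fd).st_ino
            fd_is_target = ino == os.stat(target).st_ino
            fd_is_symlink = ino == os.lstat(link).st_ino
            os.fsync(fd)                   # syncs the target inode, not the symlink
        finally:
            os.close(fd)

        # O_PATH | O_NOFOLLOW gives a descriptor on the symlink itself...
        pfd = os.open(link, os.O_PATH | os.O_NOFOLLOW)
        try:
            pfd_is_symlink = os.fstat(pfd).st_ino == os.lstat(link).st_ino
            try:
                os.fsync(pfd)              # ...but fsync() is refused on it
                fsync_errno = 0
            except OSError as e:
                fsync_errno = e.errno
        finally:
            os.close(pfd)
        return fd_is_target, fd_is_symlink, pfd_is_symlink, fsync_errno
```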

> fsync(fd) -> succeeds
> 
> So perhaps fsync on "symlink" is unsupported behavior that varies from
> file system to file system? We saw ext4 and xfs had this behavior, so
> we assumed it to be the default.

fsync on "symlink" doesn't exist at all today.  You were mistaken as
to what you were doing; what you were fsync'ing was the inode that was
created as the result of:

	fd = open("symlink", O_CREAT|O_RDWR);

... if you were able to create the file in the first place.  If "symlink"
points at /tmp/bar/quux, and /tmp/bar does not exist, the open will
fail.  Not because it was a symlink, but because the equivalent
open("/tmp/bar/quux", O_CREAT|O_RDWR) would have failed.
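
The equivalence can be checked with a short sketch, again assuming
Linux and Python's os module; open_errno() and compare_dangling_open()
are hypothetical helpers and the paths live in a throwaway temporary
directory rather than /tmp/bar.

```python
import os
import tempfile


def open_errno(path):
    """Try open(path, O_CREAT|O_RDWR) and return the errno (0 on success)."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    except OSError as e:
        return e.errno
    os.close(fd)
    return 0


def compare_dangling_open():
    """Open via a symlink and via its target when the target's dir is missing."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "bar", "quux")   # "bar" is never created
        link = os.path.join(d, "symlink")
        os.symlink(target, link)
        return open_errno(link), open_errno(target)
```

Both opens are expected to fail the same way, because the kernel
resolves the symlink first and then attempts the same create.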

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
  2018-04-15  1:17                 ` Theodore Y. Ts'o
  (?)
@ 2018-04-15  1:38                 ` Vijay Chidambaram
  -1 siblings, 0 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-15  1:38 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

Hi Ted,

Thanks for the reply.

On Sat, Apr 14, 2018 at 8:17 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> The only thing I would add to Dave's comments is that a lot of these
> formal semantics are de facto, and not de jure.  If you take a look at
> POSIX or the Single Unix Specification, they are remarkably silent
> about how fsync works.
>
> In fact POSIX/SUS doesn't even define "fsync on a directory".  In the
> original POSIX, the O_DIRECTORY flag does not exist and the directory
> stream object returned by opendir(3) does not have to be implemented using
> a file descriptor[1].
>
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/opendir.html
>
> In SUSv7, between adding openat(2) and fchdir(2), etc., the standards
> body has backed itself into more-or-less admitting that on all
> implementations that matter directory fd's really do exist.  But if
> you take a look at what is stated about fsync(2), it only talks about
> what it does in relation to _files_, and not directories, or anything else[2].
>
> [2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
>
> Furthermore, "strictly ordered metadata recovery semantics" is not
> something which is formally in any kind of standards document.
> Filesystem developers know what it means, and it gets encoded as
> things like tests in xfstests.  But at the same time, we need to be
> careful not to invent stricter "guarantees" than what is required by
> the standards and the generally agreed-upon norms by file system
> developers.


I agree the semantics are vaguely defined. On the other hand, we would
like to have rigorous testing of crash-consistency semantics, beyond
the ad-hoc collection of tests present in xfstests right now. That's
why these discussions are important.

To be clear, we absolutely do not want to test stuff the community
does not care about. But since it's not exactly written down anywhere,
our best course of action is to engage in discussion and figure it out
piece by piece.

>
> Otherwise we can have academics inventing guarantees, such as Pillai
> et al.[3], and justifying this because they find applications have
> better crash semantics with these new guarantees --- and instead of
> saying that the applications are buggy, the paper proposes
> that perhaps file systems should provide those extra guarantees.
>
> [3]  https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
>
> The problem with this is that providing those extra guarantees may very
> often impose performance tradeoffs; and while I'm not saying that
> the only thing file system authors should feel obliged to provide is
> the bare minimum specified by POSIX (which doesn't require strictly
> ordered metadata semantics), at the same time --- let's not go crazy.
> There are cost-benefit decisions that need to be made.


I was one of the authors on that paper, and I didn't know until today
you didn't like that work :) The paper did *not* suggest we support
invented guarantees without considering the performance impact.

As academics, it is our job to analyze existing solutions, and propose
what could be. I think this is healthy -- it is up to developers
themselves to figure out what it is they want to support, and what
they don't want to support.

>
> So in the case of symlinks, the first thing I would ask is *why* do
> application writers really want formal crash semantics for symlinks?
> Is it a reasonable thing for them to want it?  And is it a good thing
> for them to want, given that portable code should work on more than
> just one file system, and certainly on more than one operating system
> --- and there are no guarantees that all POSIX-compliant operating
> systems will even *have* symlinks.  So in my opinion the best thing to
> do is to assume that they exist for system administrator convenience,
> and they aren't things which applications should be trying to use in
> use cases where they need some kind of transactional semantics.
>
> > And, well, you can't fsync a symlink *inode*, anyway, because you
> > can't open it directly for IO operations.
>
> Well.... you can get a fd on a symlink using O_PATH | O_NOFOLLOW.  fsync(2)
> on such a fd doesn't work today, but one could imagine a future kernel
> extension which adds to the system calls that can use a fd-on-a-symlink
> beyond fchownat(2), fstatat(2), readlinkat(2), et al., and allows
> fsync(2) to work.  (It would require VFS and file-system level
> changes.)
>
> But the first question to ask is *why*?  Is it worth the extra hair
> and complexity?
>
> Especially given that if the file system has ordered metadata
> semantics after a crash, there are other ways that an application can
> request the same semantics.
>
>                                         - Ted


I don't disagree with any of this. But you can imagine how this can
all be confusing to file-system developers and research groups who
work on file systems: without formal documentation, what exactly
should they test or support? Clearly current file systems provide more
than just POSIX and therefore POSIX itself is not very useful.

But in any case, coming back to our main question, the conclusion
seems to be: symlinks aren't standard, so we shouldn't be studying
their crash-consistency properties. This is useful to know. Thanks!

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
  2018-04-15  1:30                   ` Theodore Y. Ts'o
  (?)
@ 2018-04-15  1:40                   ` Vijay Chidambaram
  -1 siblings, 0 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-15  1:40 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

Hi Ted,

On Sat, Apr 14, 2018 at 8:30 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> When you open "foo", the resulting file descriptor is not associated
> with the symlink.  The resulting file descriptor is the exact same
> thing you would get if you had instead called:
>
>         fd = open("/tmp/bar/quux", O_CREAT|O_RDWR);
>
> Hence, when you call
>
>        fsync(fd);
>
> What you are calling fsync on is not the _symlink_, but the inode
> which is named by /tmp/bar/quux.

Thank you! This was the crucial misunderstanding on our part. The
behavior makes a lot more sense now.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Symlink not persisted even after fsync
       [not found]                 ` <CAHWVdUXAyyeTGNXrtTTc+tUbA3t9TUjJPSF=M4Cetj4+d1w3eQ@mail.gmail.com>
@ 2018-04-15 14:13                   ` Theodore Y. Ts'o
  2018-04-16  0:10                     ` Vijay Chidambaram
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-15 14:13 UTC (permalink / raw)
  To: Vijaychidambaram Velayudhan Pillai
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

On Sat, Apr 14, 2018 at 08:35:45PM -0500, Vijaychidambaram Velayudhan Pillai wrote:
> I was one of the authors on that paper, and I didn't know until today you
> didn't like that work :) The paper did *not* suggest we support invented
> guarantees without considering the performance impact.

I hadn't noticed that you were one of the authors on that paper,
actually.

The problem with that paper was I don't think the researchers had
talked to anyone who had actually designed production file systems.
For example, the hypothetical ext3-fast file system
proposed in the paper has some real practical problems.  You can't
just switch between having the file contents being journaled via the
data=journal mode, and file contents being written via the normal page
cache mechanisms.  If you don't do some very heavy-weight, performance
killing special measures, data corruption is a very real possibility.

(If you're curious as to why, see the comments in the function
ext4_change_journal_flag() in fs/ext4/inode.c, which is called when
clearing the per-file data journal flag.  We need to stop the journal,
write all dirty, journalled buffers to disk, empty the journal, and
only then can we switch a file from using data journalling to the
normal ordered data mode handling.  Now imagine ext3-fast needing to
do all of this...)

The paper also talked in terms of what file system designers should
consider; it didn't really make the same recommendation to application
authors.  If you look at Table 3(c), which listed application
"vulnerabilities" under current file systems, for the applications
that do purport to provide robustness against crashes (e.g., Postgres,
LMDB, etc.), most of them actually work quite well, with few or no
vulnerabilities.  A notable exception is Zookeeper --- but that might be
an example where the application is just buggy, and should be fixed.

> I don't disagree with any of this. But you can imagine how this can all
> be confusing to file-system developers and research groups who work on file
> systems: without formal documentation, what exactly should they test or
> support? Clearly current file systems provide more than just POSIX and
> therefore POSIX itself is not very useful.

I agree that documenting what behavior applications can depend upon is
useful.  However, this needs to be done as a conversation --- and a
negotiation --- between application and file system developers.  (And
not necessarily just from one operating system, either!  Application
authors might care about whether they can get robustness guarantees on
other operating systems, such as Mac OS X.)  Also, the tradeoffs may
in some cases involve probabilities of data loss, and not hard guarantees.

Formal documentation also takes a lot of effort to write.  That's
probably why no one has tried to formally codify it since POSIX.  We
do have informal agreements, such as adding an implied data flush
after certain close or rename operations.  And sometimes these are
written up, but only informally.  A good example of this is the
O_PONIES controversy, wherein the negotiations/conversation happened
on various blog entries, and ultimately at an LSF/MM face-to-face
meeting:

	http://blahg.josefsipek.net/?p=364
	https://sandeen.net/wordpress/uncategorized/coming-clean-on-o_ponies/	
	https://lwn.net/Articles/322823/
	https://lwn.net/Articles/327601/
	https://lwn.net/Articles/351422/

Note that the implied file writebacks after certain renames and closes
(as documented at the end of https://lwn.net/Articles/322823/) were
implemented for ext4, and then after discussion at LSF/MM, there was
general agreement across multiple major file system maintainers that
we should all provide similar behavior.
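
For comparison, here is a sketch of the application-side
replace-via-rename pattern written so that it does not rely on any
implied writeback.  This is the commonly described idiom, not something
taken from the articles above; the helper name
replace_file_atomically() and the ".tmp" suffix are made up for
illustration, and the final directory fsync is a Linux convention
rather than a POSIX guarantee.

```python
import os


def replace_file_atomically(path, data):
    """Replace 'path' with 'data' (bytes) via write, fsync, rename."""
    tmp = path + ".tmp"                      # hypothetical temp-file name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                            # push Python's buffer to the kernel
        os.fsync(f.fileno())                 # explicit data flush before rename
    os.replace(tmp, path)                    # atomic rename over the old file
    parent = os.path.dirname(os.path.abspath(path))
    dfd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dfd)                        # persist the renamed directory entry
    finally:
        os.close(dfd)
```

An application that skips the explicit fsync of the temporary file is
relying on the informal agreement described above rather than on
anything written in a standard.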

So doing this kind of standardization, especially if you want to take
into account all of the stakeholders, takes time and is not easy.  If
you only take one point of view, you can have what happened with the C
standard, where the room was packed with compiler authors, who were
only interested in what kind of cool compiler optimizations they could
do, and completely ignored whether the resulting standard would
actually be useful to practicing system programmers.  Which is why the
Linux kernel is only really supported on gcc, and then with certain
optimizations allowed by the C standard explicitly turned off.  (Clang
support is almost there, but not everyone trust a kernel built by
Clang won't have some subtle, hard-to-debug problems...)

Academics could very well have a place in helping to facilitate the
conversation.  I think my primary concern with the Pillai paper is
that the authors apparently talked a whole bunch to application
authors, but not nearly as much to file system developers.

> But in any case, coming back to our main question, the conclusion seems to
> be: symlinks aren't standard, so we shouldn't be studying their
> crash-consistency properties. This is useful to know. Thanks!

Well, symlinks are standardized.  But what the standards say about
them is extremely limited.  And the crash-consistency property you
were looking at, namely what happens when fsync() is called on a file
descriptor opened via a symlink, is definitely not consistent with
either the POSIX/SUS standard, or historical practice by BSD and other
Unix systems, as well as Linux.

Cheers,

					- Ted

* Re: Symlink not persisted even after fsync
  2018-04-15 14:13                   ` Theodore Y. Ts'o
@ 2018-04-16  0:10                     ` Vijay Chidambaram
  2018-04-16  5:39                       ` Amir Goldstein
                                         ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-16  0:10 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

Hi Ted,

On Sun, Apr 15, 2018 at 9:13 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> On Sat, Apr 14, 2018 at 08:35:45PM -0500, Vijaychidambaram Velayudhan Pillai wrote:
>> I was one of the authors on that paper, and I didn't know until today you
>> didn't like that work :) The paper did *not* suggest we support invented
>> guarantees without considering the performance impact.
>
> I hadn't noticed that you were one of the authors on that paper,
> actually.
>
> The problem with that paper was I don't think the researchers had
> talked to anyone who had actually designed production file systems.
> For example, the hypothetical ext3-fast file system
> proposed in the paper has some real practical problems.  You can't
> just switch between having the file contents being journaled via the
> data=journal mode, and file contents being written via the normal page
> cache mechanisms.  If you don't do some very heavy-weight, performance
> killing special measures, data corruption is a very real possibility.

I don't think this is what the paper's ext3-fast does. All the paper
says is if you have a file system where the fsync of a file persisted
only data related to that file, it would increase performance.
ext3-fast is the name given to such a file system. Note that we do not
present a design of ext3-fast or analyze it in any detail. In fact, we
explicitly say "The ext3-fast file system (derived from inferences
provided by ALICE) seems interesting for application safety, though
further investigation is required into the validity of its design."

> I agree that documenting what behavior applications can depend upon is
> useful.  However, this needs to be done as a conversation --- and a
> negotiation --- between application and file system developers.  (And
> not necessarily just from one operating system, either!  Application
> authors might care about whether they can get robustness guarantees on
> other operating systems, such as Mac OS X.)  Also, the tradeoffs may
> in some cases involve probabilities of data loss, and not hard guarantees.
>
> Formal documentation also takes a lot of effort to write.  That's
> probably why no one has tried to formally codify it since POSIX.  We
> do have informal agreements, such as adding an implied data flush
> after certain close or rename operations.  And sometimes these are
> written up, but only informally.  A good example of this is the
> O_PONIES controversy, wherein the negotiations/conversation happened
> on various blog entries, and ultimately at an LSF/MM face-to-face
> meeting:
>
>         http://blahg.josefsipek.net/?p=364
>         https://sandeen.net/wordpress/uncategorized/coming-clean-on-o_ponies/
>         https://lwn.net/Articles/322823/
>         https://lwn.net/Articles/327601/
>         https://lwn.net/Articles/351422/
>
> Note that the implied file writebacks after certain renames and closes
> (as documented at the end of https://lwn.net/Articles/322823/) were
> implemented for ext4, and then after discussion at LSF/MM, there was
> general agreement across multiple major file system maintainers that
> we should all provide similar behavior.
>
> So doing this kind of standardization, especially if you want to take
> into account all of the stakeholders, takes time and is not easy.  If
> you only take one point of view, you can have what happened with the C
> standard, where the room was packed with compiler authors, who were
> only interested in what kind of cool compiler optimizations they could
> do, and completely ignored whether the resulting standard would
> actually be useful to practicing system programmers.  Which is why the
> Linux kernel is only really supported on gcc, and then with certain
> optimizations allowed by the C standard explicitly turned off.  (Clang
> support is almost there, but not everyone trusts that a kernel built by
> Clang won't have some subtle, hard-to-debug problems...)

I definitely agree it takes time and effort. I'm hoping our work on
CrashMonkey can help here, by codifying the crash-consistency
guarantees into tests that new file-system developers can use.

>
> Academics could very well have a place in helping to facilitate the
> conversation.  I think my primary concern with the Pillai paper is
> that the authors apparently talked a whole bunch to application
> authors, but not nearly as much to file system developers.

I agree with this criticism. This is why my research group engages
with the file-system community right from project start, as we have
been doing with CrashMonkey.

>> But in any case, coming back to our main question, the conclusion seems to
>> be: symlinks aren't standard, so we shouldn't be studying their
>> crash-consistency properties. This is useful to know. Thanks!
>
> Well, symlinks are standardized.  But what the standards say about
> them is extremely limited.  And the crash-consistency property you
> were looking at, namely what happens when fsync() is called on a file
> descriptor opened via a symlink, is definitely not consistent with
> either the POSIX/SUS standard, or historical practice by BSD and other
> Unix systems, as well as Linux.

Thanks! As I mentioned before, this is useful. I have a follow-up
question. Consider the following workload:

 creat foo
 link (foo, A/bar)
 fsync(foo)
 crash

In this case, after the file system recovers, do we expect foo's link
count to be 2 or 1? I would say 2, but POSIX is silent on this, so
thought I would confirm. The tricky part here is we are not calling
fsync() on directory A.

In this case, it's not a symlink; it's a hard link, so I would say the
link count for foo should be 2. But btrfs and F2FS show link count of
1 after a crash.
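
For anyone who wants to try the workload by hand, here is a rough Python
sketch of the three steps before the crash.  The helper name run_workload
and the root argument are hypothetical, directory A is created up front
because the workload assumes it exists, and the crash itself has to be
injected by an external tool such as CrashMonkey.

```python
import os


def run_workload(root):
    """creat foo; link(foo, A/bar); fsync(foo). Returns foo's link count."""
    foo = os.path.join(root, "foo")
    bar = os.path.join(root, "A", "bar")
    os.makedirs(os.path.join(root, "A"), exist_ok=True)

    # creat foo
    fd = os.open(foo, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        os.link(foo, bar)  # link(foo, A/bar)
        os.fsync(fd)       # fsync(foo); directory A is deliberately not fsynced
        # Link count as seen in memory, before any crash.
        return os.fstat(fd).st_nlink
    finally:
        os.close(fd)
```

Before the crash the returned link count is 2 on any file system that
supports hard links; the open question is what is visible after recovery.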

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

* Re: Symlink not persisted even after fsync
  2018-04-16  0:10                     ` Vijay Chidambaram
@ 2018-04-16  5:39                       ` Amir Goldstein
  2018-04-16 15:17                         ` Vijay Chidambaram
  2018-04-16  5:52                       ` Theodore Y. Ts'o
  2018-04-17  0:07                       ` Dave Chinner
  2 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2018-04-16  5:39 UTC (permalink / raw)
  To: Vijaychidambaram Velayudhan Pillai
  Cc: Theodore Y. Ts'o, Dave Chinner, Jayashree Mohan, linux-btrfs,
	fstests, linux-f2fs-devel

On Mon, Apr 16, 2018 at 3:10 AM, Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
[...]
> Consider the following workload:
>
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
>
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.
>
> In this case, it's not a symlink; it's a hard link, so I would say the
> link count for foo should be 2. But btrfs and F2FS show link count of
> 1 after a crash.
>

That sounds like a clear bug - nlink is metadata of inode foo, so
should be made persistent by fsync(foo).

For non-journaled fs you would need to fsync(A) to guarantee
seeing A/bar after crash, but for a journaled fs, if you didn't see
A/bar after crash and did see nlink 2 on foo then you would get
a filesystem inconsistency, so practically, fsync(foo) takes care
of persisting A/bar entry as well. But as you already understand,
these rules have not been formalized by a standard, instead, they
have been "formalized" by various fsck.* tools.
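
A portable application that cannot assume a journaled file system would
have to do both fsyncs explicitly.  The following is a small Python
sketch of that conservative approach, not a statement of what any
particular file system requires; fsync_path and link_durably are names
invented for this example, and fsync on a directory descriptor is assumed
to be allowed, as it is on Linux.

```python
import os


def fsync_path(path):
    """Open `path` read-only (file or directory) and fsync it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def link_durably(src, dst):
    """Create hard link dst -> src, then fsync the inode and dst's directory."""
    os.link(src, dst)
    fsync_path(src)                                    # inode metadata (nlink)
    fsync_path(os.path.dirname(os.path.abspath(dst)))  # the new directory entry
```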

Allow me to suggest a different framing for CrashMonkey.
You seem to be engaging in discussions with the community
about whether X behavior is a bug or not and as you can see
the answer depends on the filesystem (and sometimes on the
developer). Instead, you could declare that CrashMonkey
is a "Certification tool" to certify filesystems to a certain
crash consistency behavior. Then you can discuss with the
community about specific models that CrashMonkey should
be testing. The model describes the implicit dependencies
and ordering guarantees between operations.
Dave has mentioned the "strictly ordered metadata" model.
I do not know of any formal definition of this model for filesystems,
but you can take a shot at starting one and encoding it into
CrashMonkey. This sounds like a great paper to me.

I don't know if Btrfs and f2fs will qualify as "strictly ordered
metadata" and I don't know if they would want to qualify.
Mind you a filesystem can be crash consistent without
following "strictly ordered metadata". In fact, in many cases
"strictly ordered metadata" imposes performance penalty by
coupling together unrelated metadata updates (e.g. create
A/a and create B/b), but it is also quite hard to decouple them
because a future operation can create a dependency (e.g.
mv A/a B/b).

Thanks,
Amir.

* Re: Symlink not persisted even after fsync
  2018-04-16  0:10                     ` Vijay Chidambaram
  2018-04-16  5:39                       ` Amir Goldstein
@ 2018-04-16  5:52                       ` Theodore Y. Ts'o
  2018-04-16 15:09                         ` Vijay Chidambaram
  2018-04-17  0:07                       ` Dave Chinner
  2 siblings, 1 reply; 25+ messages in thread
From: Theodore Y. Ts'o @ 2018-04-16  5:52 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
> 
> I don't think this is what the paper's ext3-fast does. All the paper
> says is if you have a file system where the fsync of a file persisted
> only data related to that file, it would increase performance.
> ext3-fast is the name given to such a file system. Note that we do not
> present a design of ext3-fast or analyze it in any detail. In fact, we
> explicitly say "The ext3-fast file system (derived from inferences
> provided by ALICE) seems interesting for application safety, though
> further investigation is required into the validity of its design."

Well, the paper says that it's based on ext3's data=journal "Abstract
Persistence Model".  It's true that a design was not proposed --- but if
you don't propose a design, how do you know what the performance is or
whether it's even practical?  That's one of those things I find
extremely distasteful in the paper.  Sure, I can model a faster than
light interstellar engine ala Star Trek's Warp Drive --- and I can
talk about it having, say, better performance than a reaction drive.
But it doesn't tell us anything useful about whether it can be built,
or whether it's even useful to dream about it.

To me, that part of the paper really read as, "watch as I wave my
hands around wildly; notice that they never leave the ends of my arms!"

> Thanks! As I mentioned before, this is useful. I have a follow-up
> question. Consider the following workload:
> 
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
> 
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.
> 
> In this case, it's not a symlink; it's a hard link, so I would say the
> link count for foo should be 2. But btrfs and F2FS show link count of
> 1 after a crash.

Well, is the link count accurate?  That is to say, does A/bar exist?
I would think that the requirement that the file system be self
consistent is the most important consideration.

Cheers,

							- Ted

* Re: Symlink not persisted even after fsync
  2018-04-16  5:52                       ` Theodore Y. Ts'o
@ 2018-04-16 15:09                         ` Vijay Chidambaram
  0 siblings, 0 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-16 15:09 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Dave Chinner, Jayashree Mohan, Amir Goldstein, linux-btrfs,
	fstests, linux-f2fs-devel

On Mon, Apr 16, 2018 at 12:52 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
>>
>> I don't think this is what the paper's ext3-fast does. All the paper
>> says is if you have a file system where the fsync of a file persisted
>> only data related to that file, it would increase performance.
>> ext3-fast is the name given to such a file system. Note that we do not
>> present a design of ext3-fast or analyze it in any detail. In fact, we
>> explicitly say "The ext3-fast file system (derived from inferences
>> provided by ALICE) seems interesting for application safety, though
>> further investigation is required into the validity of its design."
>
> Well, the paper says that it's based on ext3's data=journal "Abstract
> Persistence Model".  It's true that a design was not proposed --- but if
> you don't propose a design, how do you know what the performance is or
> whether it's even practical?  That's one of those things I find
> extremely distasteful in the paper.  Sure, I can model a faster than
> light interstellar engine ala Star Trek's Warp Drive --- and I can
> talk about it having, say, better performance than a reaction drive.
> But it doesn't tell us anything useful about whether it can be built,
> or whether it's even useful to dream about it.
>
> To me, that part of the paper really read as, "watch as I wave my
> hands around wildly; notice that they never leave the ends of my arms!"

I partially understand where you are coming from, but your argument
seems to boil down to "don't say anything until you have worked out
every detail". I don't agree with this. Yes, it was speculative, but
we did have a fairly clear disclaimer.

To the point about it being obvious: you might be surprised at how
many people outside this community take it for granted that if you
fsync a file, only that file's contents and metadata will be persisted
:) So it was obvious to you, but truly shocking for many.

Btw, ext3-fast is what led to our CCFS work in FAST 17:
http://www.cs.utexas.edu/~vijay/papers/fast17-c2fs.pdf. In this paper,
we do show that if you divide your application writes into streams, it
is possible to persist only the data/metadata of one stream,
independent of the IO being done in other streams. So as it turned
out, it wasn't an impossible file-system design.

But we digress. I think we both agree that researchers should engage
more with the file-system community.

>
>> Thanks! As I mentioned before, this is useful. I have a follow-up
>> question. Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>>
>> In this case, it's not a symlink; it's a hard link, so I would say the
>> link count for foo should be 2. But btrfs and F2FS show link count of
>> 1 after a crash.
>
> Well, is the link count accurate?  That is to say, does A/bar exist?
> I would think that the requirement that the file system be self
> consistent is the most important consideration.

There are two ways to look at this.

1. A/bar does not exist, link count is 1, and so it is not a bug.

2. We are calling fsync on the inode when the inode's link count is 2.
So it should persist the inode plus the dependency that is A/bar. The
file system after a crash should show both A/bar and the file with
link count 2. This is what ext4, xfs, and F2FS do.
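
To make the two outcomes concrete, here is a hedged Python sketch of a
check that could be run on the mounted file system after recovery.  The
function name and the returned labels are made up for this example, and
it only looks at foo and A/bar, not at the rest of the file system.

```python
import os


def classify_recovered_state(root):
    """Classify what survived of foo and A/bar after crash recovery."""
    foo = os.path.join(root, "foo")
    bar = os.path.join(root, "A", "bar")
    if not os.path.exists(foo):
        return "foo missing"
    nlink = os.stat(foo).st_nlink
    bar_exists = os.path.exists(bar)
    if nlink == 2 and bar_exists and os.path.samefile(foo, bar):
        return "both persisted"      # view 2: link count 2 and A/bar present
    if nlink == 1 and not bar_exists:
        return "neither persisted"   # view 1: self-consistent, link lost
    return "inconsistent"            # link count and dirents disagree
```

Only the last label would indicate a file system that is not self
consistent; the first two correspond to the two views listed above.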

We've posted separately to figure out what semantics btrfs supports.

* Re: Symlink not persisted even after fsync
  2018-04-16  5:39                       ` Amir Goldstein
@ 2018-04-16 15:17                         ` Vijay Chidambaram
  0 siblings, 0 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-16 15:17 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Y. Ts'o, Dave Chinner, Jayashree Mohan, linux-btrfs,
	fstests, linux-f2fs-devel

On Mon, Apr 16, 2018 at 12:39 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Apr 16, 2018 at 3:10 AM, Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
> [...]
>> Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>>
>> In this case, it's not a symlink; it's a hard link, so I would say the
>> link count for foo should be 2. But btrfs and F2FS show link count of
>> 1 after a crash.
>>
>
> That sounds like a clear bug - nlink is metadata of inode foo, so
> should be made persistent by fsync(foo).

This is what we think as well. We have posted this as a separate
thread to confirm this with other btrfs developers.

> For non-journaled fs you would need to fsync(A) to guarantee
> seeing A/bar after crash, but for a journaled fs, if you didn't see
> A/bar after crash and did see nlink 2 on foo then you would get
> a filesystem inconsistency, so practically, fsync(foo) takes care
> of persisting A/bar entry as well. But as you already understand,
> these rules have not been formalized by a standard, instead, they
> have been "formalized" by various fsck.* tools.

I don't think fsck tools are very useful here: fsck could return the
file system to an empty state, and that would still be consistent.
fsck makes no guarantees about data loss. I think fsck is allowed to
truncate files, remove directory entries etc. which could lead to data
loss.

But I agree the guarantees haven't been formalized.

> Allow me to suggest a different framing for CrashMonkey.
> You seem to be engaging in discussions with the community
> about whether X behavior is a bug or not and as you can see
> the answer depends on the filesystem (and sometimes on the
> developer). Instead, you could declare that CrashMonkey
> is a "Certification tool" to certify filesystems to a certain
> crash consistency behavior. Then you can discuss with the
> community about specific models that CrashMonkey should
> be testing. The model describes the implicit dependencies
> and ordering guarantees between operations.
> Dave has mentioned the "strictly ordered metadata" model.
> I do not know of any formal definition of this model for filesystems,
> but you can take a shot at starting one and encoding it into
> CrashMonkey. This sounds like a great paper to me.

This is a great idea! We will be submitting the basic CrashMonkey
paper soon, so I don't know if we have enough time to do this.
Currently, we just explicitly say this behavior is supported by ext4,
but not btrfs, etc. So the bugs are file-system specific. But we would
definitely consider doing this in the future.

Btw, such models are what we introduced in the ALICE paper that Ted
had mentioned before. We called them "Abstract Persistence Models",
but it was essentially the same idea.

> I don't know if Btrfs and f2fs will qualify as "strictly ordered
> metadata" and I don't know if they would want to qualify.
> Mind you a filesystem can be crash consistent without
> following "strictly ordered metadata". In fact, in many cases
> "strictly ordered metadata" imposes performance penalty by
> coupling together unrelated metadata updates (e.g. create
> A/a and create B/b), but it is also quite hard to decouple them
> because a future operation can create a dependency (e.g.
> mv A/a B/b).

I agree that total ordering might lead to performance loss. I'm not
advocating for btrfs/F2FS to be totally ordered; I merely want them to
be clear about what guarantees they do provide.

* Re: Symlink not persisted even after fsync
  2018-04-16  0:10                     ` Vijay Chidambaram
  2018-04-16  5:39                       ` Amir Goldstein
  2018-04-16  5:52                       ` Theodore Y. Ts'o
@ 2018-04-17  0:07                       ` Dave Chinner
  2018-04-17  2:56                         ` Vijay Chidambaram
  2 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2018-04-17  0:07 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Theodore Y. Ts'o, Jayashree Mohan, Amir Goldstein,
	linux-btrfs, fstests, linux-f2fs-devel

On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
> Thanks! As I mentioned before, this is useful. I have a follow-up
> question. Consider the following workload:
> 
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
> 
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? 

So, strictly ordered behaviour:

create foo:
	- creates dirent in inode B and new inode A in an atomic
	  transaction sequence #1

link foo -> A/bar
	- creates dirent in inode C and bumps inode A link count in
	  an atomic transaction sequence #2.

fsync foo
	- looks at inode A, sees its "last modification" sequence
	  counter as #2
	- flushes all transactions up to and including #2 to the
	  journal.

See the dependency chain? Both the inodes and dirents in the create
operation and the link operation are chained to the inode foo via
the atomic transactions. Hence when we flush foo, we also flush the
dependent changes because of the change atomicity requirements....
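
The sequence above can be sketched as a toy model in Python.  This is
only an illustration of the ordering rule as described here, not code
from any real file system; the class OrderedJournal and its fields are
invented for the example, and inodes are named by role ("foo",
"parent_dir", "dir_A") rather than by the letters used above.

```python
class OrderedJournal:
    """Toy model: fsync flushes every transaction up to the inode's last one."""

    def __init__(self):
        self.next_seq = 1
        self.pending = []    # (seq, description) not yet in the journal
        self.durable = []    # (seq, description) flushed to the journal
        self.last_mod = {}   # inode name -> seq of last transaction touching it

    def commit(self, description, inodes):
        """Record one atomic transaction that modifies all of `inodes`."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append((seq, description))
        for inode in inodes:
            self.last_mod[inode] = seq
        return seq

    def fsync(self, inode):
        """Flush all pending transactions up to the inode's last modification."""
        target = self.last_mod.get(inode, 0)
        flushed = [t for t in self.pending if t[0] <= target]
        self.pending = [t for t in self.pending if t[0] > target]
        self.durable.extend(flushed)
        return [seq for seq, _ in flushed]


journal = OrderedJournal()
journal.commit("create foo", ["parent_dir", "foo"])    # sequence #1
journal.commit("link foo -> A/bar", ["dir_A", "foo"])  # sequence #2
journal.commit("create unrelated file", ["other_dir", "other"])  # sequence #3
print(journal.fsync("foo"))  # [1, 2]
```

In this model the third transaction stays pending, because it was
committed after the last change to foo.  Anything committed before that
point would be flushed along with it, related or not, since the rule is
purely "everything up to and including the sequence number".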

> I would say 2,

Correct, for strict ordering. But....

> but POSIX is silent on this,

Well, it's not silent, POSIX explicitly allows for fsync() to do
nothing and report success. Hence we can't really look to POSIX to
define how fsync() should behave.

> so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.

Right. But directory A has a dependent change linked to foo. If we
fsync() foo, we are persisting the link count change in that file,
and hence all the other changes related to that link count change
must also be flushed. Similarly, all the changes related to the
creation of foo must be flushed, too.

> In this case, it's not a symlink; it's a hard link, so I would say the
> link count for foo should be 2.

Right - that's the "reference counted object dependency" I referred
to. i.e. it's a bidirectional atomic dependency - either we show both
the new dirent and the link count change, or we show neither of
them.  Hence fsync on one object implies that we are also persisting
the related changes in the other object, too.

> But btrfs and F2FS show link count of
> 1 after a crash.

That may be valid if the dirent A/bar does not exist after recovery,
but it also means fsync() hasn't actually guaranteed inode changes
made prior to the fsync to be persistent on disk. i.e. that's a
violation of ordered metadata semantics and probably a bug.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Symlink not persisted even after fsync
  2018-04-17  0:07                       ` Dave Chinner
@ 2018-04-17  2:56                         ` Vijay Chidambaram
  0 siblings, 0 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2018-04-17  2:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Y. Ts'o, Jayashree Mohan, Amir Goldstein,
	linux-btrfs, fstests, linux-f2fs-devel

On Mon, Apr 16, 2018 at 7:07 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
>> Thanks! As I mentioned before, this is useful. I have a follow-up
>> question. Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1?
>
> So, strictly ordered behaviour:
>
> create foo:
>         - creates dirent in inode B and new inode A in an atomic
>           transaction sequence #1
>
> link foo -> A/bar
>         - creates dirent in inode C and bumps inode A link count in
>           an atomic transaction sequence #2.
>
> fsync foo
>         - looks at inode A, sees its "last modification" sequence
>           counter as #2
>         - flushes all transactions up to and including #2 to the
>           journal.
>
> See the dependency chain? Both the inodes and dirents in the create
> operation and the link operation are chained to the inode foo via
> the atomic transactions. Hence when we flush foo, we also flush the
> dependent changes because of the change atomicity requirements....
>
>> I would say 2,
>
> Correct, for strict ordering. But....
>
>> but POSIX is silent on this,
>
> Well, it's not silent, POSIX explicitly allows for fsync() to do
> nothing and report success. Hence we can't really look to POSIX to
> define how fsync() should behave.
>
>> so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>
> Right. But directory A has a dependent change linked to foo. If we
> fsync() foo, we are persisting the link count change in that file,
> and hence all the other changes related to that link count change
> must also be flushed. Similarly, all the changes related to the
> creation of foo must be flushed, too.
>
>> In this case, it's not a symlink; it's a hard link, so I would say the
>> link count for foo should be 2.
>
> Right - that's the "reference counted object dependency" I referred
> to. i.e. it's a bidirectional atomic dependency - either we show both
> the new dirent and the link count change, or we show neither of
> them.  Hence fsync on one object implies that we are also persisting
> the related changes in the other object, too.
>
>> But btrfs and F2FS show link count of
>> 1 after a crash.
>
> That may be valid if the dirent A/bar does not exist after recovery,
> but it also means fsync() hasn't actually guaranteed inode changes
> made prior to the fsync to be persistent on disk. i.e. that's a
> violation of ordered metadata semantics and probably a bug.

Great, this matches our understanding perfectly. We have separately
posted to the btrfs mailing list to confirm it is a bug. Thanks!

end of thread, other threads:[~2018-04-17  2:57 UTC | newest]

Thread overview: 25+ messages
2018-04-12 17:51 Symlink not persisted even after fsync Jayashree Mohan
2018-04-13  5:52 ` Amir Goldstein
2018-04-13 12:57   ` Vijay Chidambaram
     [not found]   ` <CAPaz=E+-baGSWhL3nD-8X4jn6rKdn2AVGLeqWh3EY5Nh-RodRA@mail.gmail.com>
2018-04-13 13:16     ` Amir Goldstein
2018-04-13 14:39       ` Jayashree Mohan
2018-04-14  1:20         ` Dave Chinner
2018-04-14  3:27           ` Vijay Chidambaram
2018-04-14 21:55             ` Dave Chinner
2018-04-14 21:55               ` Dave Chinner
2018-04-15  1:13               ` Vijay Chidambaram
2018-04-15  1:30                 ` Theodore Y. Ts'o
2018-04-15  1:30                   ` Theodore Y. Ts'o
2018-04-15  1:40                   ` Vijay Chidambaram
2018-04-15  1:17               ` Theodore Y. Ts'o
2018-04-15  1:17                 ` Theodore Y. Ts'o
2018-04-15  1:38                 ` Vijay Chidambaram
     [not found]                 ` <CAHWVdUXAyyeTGNXrtTTc+tUbA3t9TUjJPSF=M4Cetj4+d1w3eQ@mail.gmail.com>
2018-04-15 14:13                   ` Theodore Y. Ts'o
2018-04-16  0:10                     ` Vijay Chidambaram
2018-04-16  5:39                       ` Amir Goldstein
2018-04-16 15:17                         ` Vijay Chidambaram
2018-04-16  5:52                       ` Theodore Y. Ts'o
2018-04-16 15:09                         ` Vijay Chidambaram
2018-04-17  0:07                       ` Dave Chinner
2018-04-17  2:56                         ` Vijay Chidambaram
2018-04-13 14:06   ` Dave Chinner
