All of lore.kernel.org
 help / color / mirror / Atom feed
* Hard link not persisted on fsync
@ 2018-04-16 14:35 Jayashree Mohan
  2018-04-18  0:02 ` Jayashree Mohan
  2018-04-19 21:41 ` Zygo Blaxell
  0 siblings, 2 replies; 5+ messages in thread
From: Jayashree Mohan @ 2018-04-16 14:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Vijaychidambaram Velayudhan Pillai

Hi,

The following seems to be a crash consistency bug on btrfs, where in
the link count is not persisted even after a fsync on the original
file.

Consider the following workload :
creat foo
link (foo, A/bar)
fsync(foo)
---Crash---

Now, on recovery we expect the metadata of foo to be persisted i.e
have a link count of 2. However in btrfs, the link count is 1 and file
A/bar is not persisted. The expected behaviour would be to persist the
dependencies of inode foo. That is to say, shouldn't fsync of foo
persist A/bar and correctly update the link count?
Note that ext4, xfs and f2fs recover to the correct link count of 2
for the above workload.

Let us know what you think about this behavior.

Thanks,
Jayashree Mohan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Hard link not persisted on fsync
  2018-04-16 14:35 Hard link not persisted on fsync Jayashree Mohan
@ 2018-04-18  0:02 ` Jayashree Mohan
  2018-04-18  8:34   ` Filipe Manana
  2018-04-19 21:41 ` Zygo Blaxell
  1 sibling, 1 reply; 5+ messages in thread
From: Jayashree Mohan @ 2018-04-18  0:02 UTC (permalink / raw)
  To: linux-btrfs, Filipe Manana; +Cc: Vijaychidambaram Velayudhan Pillai

Hi,

A gentle reminder on the crash consistency bug that we found on btrfs:
Link count of a file is not persisted even after a fsync. We believe a
filesystem that ensures strictly ordered metadata behaviour should be
able to persist the hard link after a fsync on the original file.
Could you comment on why btrfs does not exhibit this behavior, and if
it's something you'd want to fix?

Thanks,
Jayashree Mohan



On Mon, Apr 16, 2018 at 9:35 AM, Jayashree Mohan
<jayashree2912@gmail.com> wrote:
> Hi,
>
> The following seems to be a crash consistency bug on btrfs, where in
> the link count is not persisted even after a fsync on the original
> file.
>
> Consider the following workload :
> creat foo
> link (foo, A/bar)
> fsync(foo)
> ---Crash---
>
> Now, on recovery we expect the metadata of foo to be persisted i.e
> have a link count of 2. However in btrfs, the link count is 1 and file
> A/bar is not persisted. The expected behaviour would be to persist the
> dependencies of inode foo. That is to say, shouldn't fsync of foo
> persist A/bar and correctly update the link count?
> Note that ext4, xfs and f2fs recover to the correct link count of 2
> for the above workload.
>
> Let us know what you think about this behavior.
>
> Thanks,
> Jayashree Mohan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Hard link not persisted on fsync
  2018-04-18  0:02 ` Jayashree Mohan
@ 2018-04-18  8:34   ` Filipe Manana
  0 siblings, 0 replies; 5+ messages in thread
From: Filipe Manana @ 2018-04-18  8:34 UTC (permalink / raw)
  To: Jayashree Mohan; +Cc: linux-btrfs, Vijaychidambaram Velayudhan Pillai

On Wed, Apr 18, 2018 at 1:02 AM, Jayashree Mohan
<jayashree2912@gmail.com> wrote:
> Hi,
>
> A gentle reminder on the crash consistency bug that we found on btrfs:

Why do you call it a consistency bug?
The filesystem does not stay in inconsistent state. The link count
stays 1 and the dentry used for fsync (foo) is persisted.
An inconsistency would be if we ended up with a link count of 2 and
only one of the dentries was persisted, or if we ended up with a link
count of 1 and both dentries were persisted.
Those cases would be detected by an fsck and would could fs operations
to fail unexpectedly (attemping to remove a dentry, etc).

> Link count of a file is not persisted even after a fsync. We believe a
> filesystem that ensures strictly ordered metadata behaviour should be
> able to persist the hard link after a fsync on the original file.

The thing is there's no written specification about what's the
expected and correct behavior.
Yes, I'm aware of Dave's explanation on strictly ordered metadata on
the other thread and transaction details, but things on btrfs work
very differently from xfs (I'm not saying he's wrong).

For me it makes more sense to persist the hard link, but again,
there's a lack of a specification that demands such behaviour, and I'm
not aware of applications needing that behaviour nor user reports
about it.

> Could you comment on why btrfs does not exhibit this behavior, and if
> it's something you'd want to fix?

I don't know the "why", as I am not the original author of the log
tree (what is used to implement fsync on btrfs), so either it's
accidental behavior (the most likely) or intentional.

Since it's not something that causes any corruption, fs inconsistency,
crash, etc, nor there are user reports complaining about this (afaik),
for me it would be far from a priority at the moment as I'm trying to
fix corruptions and similar issues (not necessarily caused by fsync).

Plus, adding such behaviour would have to be done carefully to not
impact performance, as checking if the node has multiple hard links
and which ones need to be persisted (created in the current,
uncommitted, transaction) may have a measurable impact. The current
behaviour it to only guarantee persisting the dentry used for the
fsync call.


>
> Thanks,
> Jayashree Mohan
>
>
>
> On Mon, Apr 16, 2018 at 9:35 AM, Jayashree Mohan
> <jayashree2912@gmail.com> wrote:
>> Hi,
>>
>> The following seems to be a crash consistency bug on btrfs, where in
>> the link count is not persisted even after a fsync on the original
>> file.
>>
>> Consider the following workload :
>> creat foo
>> link (foo, A/bar)
>> fsync(foo)
>> ---Crash---
>>
>> Now, on recovery we expect the metadata of foo to be persisted i.e
>> have a link count of 2. However in btrfs, the link count is 1 and file
>> A/bar is not persisted. The expected behaviour would be to persist the
>> dependencies of inode foo. That is to say, shouldn't fsync of foo
>> persist A/bar and correctly update the link count?
>> Note that ext4, xfs and f2fs recover to the correct link count of 2
>> for the above workload.
>>
>> Let us know what you think about this behavior.
>>
>> Thanks,
>> Jayashree Mohan

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Hard link not persisted on fsync
  2018-04-16 14:35 Hard link not persisted on fsync Jayashree Mohan
  2018-04-18  0:02 ` Jayashree Mohan
@ 2018-04-19 21:41 ` Zygo Blaxell
  2018-04-20 11:37   ` Jayashree Mohan
  1 sibling, 1 reply; 5+ messages in thread
From: Zygo Blaxell @ 2018-04-19 21:41 UTC (permalink / raw)
  To: Jayashree Mohan; +Cc: linux-btrfs, Vijaychidambaram Velayudhan Pillai

[-- Attachment #1: Type: text/plain, Size: 3334 bytes --]

On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote:
> Hi,
> 
> The following seems to be a crash consistency bug on btrfs, where in
> the link count is not persisted even after a fsync on the original
> file.
> 
> Consider the following workload :
> creat foo
> link (foo, A/bar)
> fsync(foo)
> ---Crash---
> 
> Now, on recovery we expect the metadata of foo to be persisted i.e
> have a link count of 2. However in btrfs, the link count is 1 and file
> A/bar is not persisted. The expected behaviour would be to persist the
> dependencies of inode foo. That is to say, shouldn't fsync of foo
> persist A/bar and correctly update the link count?

Those dependencies are backward.  foo's inode doesn't depend on anything
but the data in the file foo, and foo's inode itself.

"foo" and "A/bar" are dirents that both depend on the inode of foo, which
implies that "A" and "." must be updated atomically with foo's inode.
If you had called fsync(A) then we'd expect A/bar to exist and the inode
to have a link count of 2.  If you'd called fsync(.) then...well, you
didn't modify "." at all, so I guess either outcome is valid as long as
the inode link count matches the number of dirents referencing the inode.

But then...why does foo exist at all?  I'd expect at least some tests
would end without foo on disk either, since all that was fsync()ed was the
foo inode, not the foo dirent in the directory '.'.  Does btrfs combine
creating foo and updating foo's inode into a single atomic operation?
I vaguely recall that it does exactly that, in order to solve a bug
some years ago.  What happens if you add a rename, e.g.

	unlink foo2	# make sure foo2 doesn't exist
	creat foo
	rename(foo, foo2)
	link(foo2, A/bar)
	fsync(foo2)

Do you get foo or foo2?  I'd expect foo since you didn't fsync '.',
but maybe rename implies flush and you get foo2.

That's not to say that fsync is not a rich source of filesystem bugs.
btrfs did once have (and maybe still has?) a bug where renames and fsync
can create a dirent with no inode, e.g.

	loop continuously:
		creat foo
		write(foo, data)
		fsync(foo)
		rename(foo, bar)

and crash somewhere in the middle of the loop, which will create a
dirent "foo" that points to a non-existent inode.

Removing the "fsync" works around the bug.  rename() does a flush anyway,
so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem
inconsistency, especially when Googling recommends app developers to
sprinkle fsync()s indiscriminately in their code to prevent their data
from being mangled.

I haven't been tracking to see if that's fixed yet.  I last saw it on
4.11, but I have been aggressively avoiding fsync with eatmydata for
some years now.

> Note that ext4, xfs and f2fs recover to the correct link count of 2
> for the above workload.

Do those filesystems also work if you remove the fsync?  That may be
your answer:  they could be flushing the other metadata earlier, before
you call fsync().

> Let us know what you think about this behavior.
> 
> Thanks,
> Jayashree Mohan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Hard link not persisted on fsync
  2018-04-19 21:41 ` Zygo Blaxell
@ 2018-04-20 11:37   ` Jayashree Mohan
  0 siblings, 0 replies; 5+ messages in thread
From: Jayashree Mohan @ 2018-04-20 11:37 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Vijaychidambaram Velayudhan Pillai

Hi Zygo,

Thanks for the reply. Here are few responses about btrfs behavior that
you had questions about in your email.

On Thu, Apr 19, 2018 at 4:41 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote:
> > Hi,
> >
> > The following seems to be a crash consistency bug on btrfs, where in
> > the link count is not persisted even after a fsync on the original
> > file.
> >
> > Consider the following workload :
> > creat foo
> > link (foo, A/bar)
> > fsync(foo)
> > ---Crash---
> >
> > Now, on recovery we expect the metadata of foo to be persisted i.e
> > have a link count of 2. However in btrfs, the link count is 1 and file
> > A/bar is not persisted. The expected behaviour would be to persist the
> > dependencies of inode foo. That is to say, shouldn't fsync of foo
> > persist A/bar and correctly update the link count?
>
> Those dependencies are backward.  foo's inode doesn't depend on anything
> but the data in the file foo, and foo's inode itself.
>
> "foo" and "A/bar" are dirents that both depend on the inode of foo, which
> implies that "A" and "." must be updated atomically with foo's inode.
> If you had called fsync(A) then we'd expect A/bar to exist and the inode
> to have a link count of 2.  If you'd called fsync(.) then...well, you
> didn't modify "." at all, so I guess either outcome is valid as long as
> the inode link count matches the number of dirents referencing the inode.
>
> But then...why does foo exist at all?  I'd expect at least some tests
> would end without foo on disk either, since all that was fsync()ed was the
> foo inode, not the foo dirent in the directory '.'.  Does btrfs combine
> creating foo and updating foo's inode into a single atomic operation?
> I vaguely recall that it does exactly that, in order to solve a bug
> some years ago.  What happens if you add a rename, e.g.
>
>         unlink foo2     # make sure foo2 doesn't exist
>         creat foo
>         rename(foo, foo2)
>         link(foo2, A/bar)
>         fsync(foo2)
>
> Do you get foo or foo2?  I'd expect foo since you didn't fsync '.',
> but maybe rename implies flush and you get foo2.

This is the workload we tried:

creat foo
rename(foo, foo2)
mkdir A
link(foo2, A/bar)
fsync(foo2)

Btrfs persists foo2 here, and A/bar is not persisted. Also, the link
count of foo2 is 1, while on the other filesystems, A/bar persist as
well and foo2 has a link count 2.
I don't think it's the rename here that implies the flush - removing
the fsync(foo2) ends up not persisting A/bar. So looks like, its the
fsync that forces all dependencies to the disk.

As a variant of the above workload, consider the following:

mkdir A
sync
creat foo
rename(foo, foo2)
link(foo2, A/bar)
fsync(foo2)

The only difference between this workload and the one above is, where
directory A is created and whether its persisted previously or not. In
this modified workload, foo2 has a link count of 2 and ends up
persisting A/bar. Why is A/bar being persisted here even when we did
not explicitly call a fsync on directory A?

We don't seem to correctly understand the reason behind these
different behaviors. It will be useful if you could provide more
insight into this.

> That's not to say that fsync is not a rich source of filesystem bugs.
> btrfs did once have (and maybe still has?) a bug where renames and fsync
> can create a dirent with no inode, e.g.
>
>         loop continuously:
>                 creat foo
>                 write(foo, data)
>                 fsync(foo)
>                 rename(foo, bar)
>
> and crash somewhere in the middle of the loop, which will create a
> dirent "foo" that points to a non-existent inode.
>
> Removing the "fsync" works around the bug.  rename() does a flush anyway,
> so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem
> inconsistency, especially when Googling recommends app developers to
> sprinkle fsync()s indiscriminately in their code to prevent their data
> from being mangled.
>
> I haven't been tracking to see if that's fixed yet.  I last saw it on
> 4.11, but I have been aggressively avoiding fsync with eatmydata for
> some years now.
>
> > Note that ext4, xfs and f2fs recover to the correct link count of 2
> > for the above workload.
>
> Do those filesystems also work if you remove the fsync?  That may be
> your answer:  they could be flushing the other metadata earlier, before
> you call fsync().

>> creat foo
>> link (foo, A/bar)
> >fsync(foo)
>>---Crash---

Actually, they don't seem to work if we remove the fsync. The behavior
changes and we neither persist A/bar nor update the link count of foo
to 2. So this confirms that fsync is forcing the metadata and not the
link() syscall itself right?

Thanks,
Jayashree

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-04-20 11:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-16 14:35 Hard link not persisted on fsync Jayashree Mohan
2018-04-18  0:02 ` Jayashree Mohan
2018-04-18  8:34   ` Filipe Manana
2018-04-19 21:41 ` Zygo Blaxell
2018-04-20 11:37   ` Jayashree Mohan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.