linux-fsdevel.vger.kernel.org archive mirror
* metadata operation reordering regards to crash
@ 2018-09-14  9:06 焦晓冬
  2018-09-14 22:23 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: 焦晓冬 @ 2018-09-14  9:06 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, adilger.kernel; +Cc: linux-kernel

Hi, all,

A possibly somewhat complex question:
Do today's practical filesystems, e.g. extX and btrfs, preserve metadata
operation order across a crash or power failure?

What I know is that modern filesystems ensure metadata consistency
after a crash or power failure. Journaling filesystems like extX do that by
write-ahead logging of metadata operations into transactions. Other
filesystems do it in other ways; btrfs, for example, uses COW.

What I'm not yet clear on is whether these filesystems preserve
metadata operation order after a crash.

For example,
op 1.  rename(A, B)
op 2.  rename(C, D)

As mentioned above, metadata consistency is ensured after a crash.
Thus, B is either the original B (or does not exist) or has been replaced by A.
The same goes for D.

Is it possible that, after a crash, D has been replaced by C but B is still
the original file (or does not exist)?

Or, from the implementation's point of view, before the crash:
- in a journaling filesystem,
is the atomic transaction `rename(C, D)` permitted to be written to the on-disk
journal before the transaction `rename(A, B)`?
- in other filesystems, say btrfs,
is it permitted to reorder the atomic operations `rename(C, D)` and
`rename(A, B)` on their way to disk?

The question matters because many applications do this:
if (flag_file_says_need_generate_data) {
    open_write_sync_close(data_tmp);
    rename(data_tmp, data);

    open_write_sync_close(flag_file_tmp, no_need_to_generate_data);
    rename(flag_file_tmp, flag_file);
}
use_data_file();

If the flag file is there after a crash but the data file is not, that is a
problem.
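
To make the concern concrete, here is a small illustrative sketch in Python. It
models no particular filesystem, and the helper names (`ordered_states`,
`unordered_states`, `flag_without_data`) are made up for this example. It
assumes that, with ordering preserved, a crash can only leave a prefix of the
operation sequence on disk, while with reordering allowed any subset of the
operations may survive.

```python
from itertools import combinations

# The two metadata operations from the application pattern, in program order.
OPS = ("rename(data_tmp, data)", "rename(flag_file_tmp, flag_file)")


def ordered_states(ops):
    """Post-crash states if operations persist in program order (prefixes only)."""
    return [tuple(ops[:n]) for n in range(len(ops) + 1)]


def unordered_states(ops):
    """Post-crash states if operations may be reordered (any subset may survive)."""
    return [subset for n in range(len(ops) + 1) for subset in combinations(ops, n)]


def flag_without_data(state):
    """The problematic outcome: the flag rename persisted, the data rename did not."""
    return OPS[1] in state and OPS[0] not in state


for label, states in (("ordered", ordered_states(OPS)),
                      ("unordered", unordered_states(OPS))):
    bad = [s for s in states if flag_without_data(s)]
    print(label, "possible states:", len(states), "problematic:", len(bad))
```

Under these assumptions the problematic state is reachable only in the
unordered model, which is exactly the case the question asks about.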

Thanks,
Trol


* Re: metadata operation reordering regards to crash
  2018-09-14  9:06 metadata operation reordering regards to crash 焦晓冬
@ 2018-09-14 22:23 ` Dave Chinner
  2018-09-15  6:58   ` 焦晓冬
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2018-09-14 22:23 UTC (permalink / raw)
  To: 焦晓冬
  Cc: linux-fsdevel, linux-ext4, adilger.kernel, linux-kernel

On Fri, Sep 14, 2018 at 05:06:44PM +0800, 焦晓冬 wrote:
> Hi, all,
> 
> A probably bit of complex question:
> Does nowadays practical filesystems, eg., extX, btfs, preserve metadata
> operation order through a crash/power failure?

Yes.

Behaviour is filesystem dependent, but we have tests in fstests that
specifically exercise order preservation across filesystem failures.

> What I know is modern filesystems ensure metadata consistency
> after crash/power failure. Journal filesystems like extX do that by
> write-ahead logging of metadata operations into transactions. Other
> filesystems do that in various ways as btfs do that by COW.
> 
> What I'm not so far clear is whether these filesystems preserve
> metadata operation order after a crash.
> 
> For example,
> op 1.  rename(A, B)
> op 2.  rename(C, D)
> 
> As mentioned above,  metadata consistency is ensured after a crash.
> Thus, B is either the original B(or not exists) or has been replaced by A.
> The same to D.
> 
> Is it possible that, after a crash, D has been replaced by C but B is still
> the original file(or not exists)?

Not for XFS, ext4, btrfs or f2fs. Other filesystems might be
different.

Cheers,

Dave,
-- 
Dave Chinner
david@fromorbit.com


* Re: metadata operation reordering regards to crash
  2018-09-14 22:23 ` Dave Chinner
@ 2018-09-15  6:58   ` 焦晓冬
  2018-09-15 18:04     ` Andreas Dilger
  2018-09-16  1:18     ` Qu Wenruo
  0 siblings, 2 replies; 5+ messages in thread
From: 焦晓冬 @ 2018-09-15  6:58 UTC (permalink / raw)
  To: david, cmumford, linux-btrfs
  Cc: linux-fsdevel, linux-ext4, adilger.kernel, linux-kernel

On Sat, Sep 15, 2018 at 6:23 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Sep 14, 2018 at 05:06:44PM +0800, 焦晓冬 wrote:
> > Is it possible that, after a crash, D has been replaced by C but B is still
> > the original file(or not exists)?
>
> Not for XFS, ext4, btrfs or f2fs. Other filesystems might be
> different.

Thanks, Dave,

I found this archive:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31937.html

It seems the btrfs developers think reordering could happen.

It is a relatively old reply. Has the implementation changed? Or is there
some new standard that requires that reordering not happen?



* Re: metadata operation reordering regards to crash
  2018-09-15  6:58   ` 焦晓冬
@ 2018-09-15 18:04     ` Andreas Dilger
  2018-09-16  1:18     ` Qu Wenruo
  1 sibling, 0 replies; 5+ messages in thread
From: Andreas Dilger @ 2018-09-15 18:04 UTC (permalink / raw)
  To: 焦晓冬
  Cc: Dave Chinner, cmumford, linux-btrfs, linux-fsdevel,
	Ext4 Developers List, Linux Kernel Mailing List


On Sep 15, 2018, at 12:58 AM, 焦晓冬 <milestonejxd@gmail.com> wrote:
> 
> On Sat, Sep 15, 2018 at 6:23 AM Dave Chinner <david@fromorbit.com> wrote:
>> 
>> On Fri, Sep 14, 2018 at 05:06:44PM +0800, 焦晓冬 wrote:
>>> Is it possible that, after a crash, D has been replaced by C but B is still
>>> the original file(or not exists)?
>> 
>> Not for XFS, ext4, btrfs or f2fs. Other filesystems might be
>> different.
> 
> Thanks, Dave,
> 
> I found this archive:
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31937.html
> 
> It seems btrfs people thinks reordering could happen.
> 
> It is a relatively old reply. Has the implement changed? Or is there
> some new standard that requires reordering not happen?

There is nothing in POSIX that requires any particular ordering.  However,
the sequence "A, B, C, sync C" on ext3/ext4 has "always" resulted in A, B
also being sync'd to disk (including parent directory creation, etc).

For a while, ext4 with delayed allocation resulted in write A, rename A->B
causing "B" to potentially not have any data (commit v2.6.29-5120-g8750c6d).
While the applications are depending on non-POSIX behaviour, the operation
ordering behaviour has been around long enough that applications have grown to
depend on it, and consider the filesystem to have a bug when it doesn't
behave that way.

If you want to write a robust application, you should fsync() the files you
care about (possibly with AIO so you get a notification on completion rather
than waiting).
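
As a rough sketch of that advice (in Python, synchronous rather than AIO), the
hypothetical helper `replace_file_durably` below writes a temporary file,
fsync()s it, renames it over the target, and then fsync()s the parent
directory. The directory fsync is an addition not spelled out above: on Linux,
fsync(2) on a file does not necessarily persist its directory entry, so the
rename itself needs a sync on the directory. The `.tmp` suffix is likewise just
an assumption made for the example.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` without relying on filesystem ordering (sketch)."""
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temporary file
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # file contents are on stable storage before the rename
    finally:
        os.close(fd)

    os.rename(tmp_path, path)  # atomic replacement of the directory entry

    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Example use: the data file is made durable before the flag file is touched.
with tempfile.TemporaryDirectory() as workdir:
    replace_file_durably(os.path.join(workdir, "data"), b"generated data")
    replace_file_durably(os.path.join(workdir, "flag_file"), b"no_need_to_generate_data")
```

Because each call returns only after both the file and its directory entry have
been synced, the flag file cannot reach disk ahead of the data file, whatever
the filesystem does about ordering.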

Cheers, Andreas








* Re: metadata operation reordering regards to crash
  2018-09-15  6:58   ` 焦晓冬
  2018-09-15 18:04     ` Andreas Dilger
@ 2018-09-16  1:18     ` Qu Wenruo
  1 sibling, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2018-09-16  1:18 UTC (permalink / raw)
  To: 焦晓冬, david, cmumford, linux-btrfs
  Cc: linux-fsdevel, linux-ext4, adilger.kernel, linux-kernel





On 2018/9/15 下午2:58, 焦晓冬 wrote:
> On Sat, Sep 15, 2018 at 6:23 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Fri, Sep 14, 2018 at 05:06:44PM +0800, 焦晓冬 wrote:
>>> Is it possible that, after a crash, D has been replaced by C but B is still
>>> the original file(or not exists)?
>>
>> Not for XFS, ext4, btrfs or f2fs. Other filesystems might be
>> different.
> 
> Thanks, Dave,
> 
> I found this archive:
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31937.html
> 
> It seems btrfs people thinks reordering could happen.

It depends.

For default btrfs (using the log tree), it depends on the log replay code
(which is somewhat like a journal, but not completely the same).

Unfortunately I'm not an expert on that part, but the tree log is more a
performance optimization than a vital part of keeping the fs consistent.

But with the notreelog mount option, btrfs won't use the log tree and falls
back to sync() for every fsync(), due to its metadata organization.

In that case there is no reordering at all. It uses metadata CoW to
ensure everything is consistent.
A power loss then happens either before or after the superblock
write-back.
The old superblock always points to the old trees, and vice versa for the new
superblock.
So one will only see either the new fs or the old fs, which makes btrfs
metadata updates atomic.
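
A toy model of that "old fs or new fs" behaviour, in Python. This is only an
illustration of the idea, not btrfs code: the class `ToyCowFs` and its methods
are invented for the example, the "superblock" is just a reference to a
dictionary, and the log tree is not modelled at all.

```python
class ToyCowFs:
    """Toy model: a committed tree plus an uncommitted working copy."""

    def __init__(self, files):
        self._committed = dict(files)  # tree the on-disk "superblock" points to
        self._working = dict(files)    # CoW copy that pending operations modify

    def rename(self, src, dst):
        # Replaces dst if it exists; only the working copy changes.
        self._working[dst] = self._working.pop(src)

    def commit(self):
        # Models the superblock write-back: one atomic switch to the new tree.
        self._committed = dict(self._working)

    def crash_and_remount(self):
        # Uncommitted changes are lost; the superblock still points to a whole tree.
        self._working = dict(self._committed)
        return dict(self._committed)


fs = ToyCowFs({"A": "data-a", "C": "data-c"})

fs.rename("A", "B")
fs.rename("C", "D")
before = fs.crash_and_remount()  # power loss before the superblock write-back

fs.rename("A", "B")
fs.rename("C", "D")
fs.commit()
after = fs.crash_and_remount()   # power loss after the superblock write-back

print("crash before commit:", sorted(before))
print("crash after commit: ", sorted(after))
```

In this model a commit that lands between the two renames would expose only the
first one, so the second rename never becomes visible without the first.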

Thanks,
Qu

> 
> It is a relatively old reply. Has the implement changed? Or is there
> some new standard that requires reordering not happen?


