Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

From: Dave Chinner <david@fromorbit.com>
To: Jayashree <jaya@cs.utexas.edu>
Cc: fstests@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-doc@vger.kernel.org, vijay@cs.utexas.edu,
	amir73il@gmail.com, tytso@mit.edu, chao@kernel.org,
	fdmanana@gmail.com, corbet@lwn.net
Subject: Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems
Date: Thu, 14 Mar 2019 12:19:25 +1100
Message-ID: <20190314011925.GG23020@dastard>
In-Reply-To: <1552418820-18102-1-git-send-email-jaya@cs.utexas.edu>

On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> In this file, we document the crash-recovery guarantees
> provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> (SOMC), which is provided by xfs. It is not clear to us if other file systems
> provide SOMC.

FWIW, new kernel documents should be written in rst markup format,
not plain ascii text.

> 
> Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> We would be happy to modify the document if file-system
> developers claim that their system provides (or aims to provide) SOMC.
> 
> Changes since v1:
>   * Addressed few nits identified in the review
>   * Added the fsync guarantees for F2FS and its SOMC compliance
> ---
>  .../filesystems/crash-recovery-guarantees.txt      | 193 +++++++++++++++++++++
>  1 file changed, 193 insertions(+)
>  create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
> 
> diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> new file mode 100644
> index 0000000..be84964
> --- /dev/null
> +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> @@ -0,0 +1,193 @@
> +=====================================================================
> +File System Crash-Recovery Guarantees
> +=====================================================================
> +Linux file systems provide certain guarantees to user-space
> +applications about what happens to their data if the system crashes
> +(due to power loss or kernel panic). These are termed crash-recovery
> +guarantees.

These are termed "data integrity guarantees", not "crash recovery
guarantees".

i.e. crash recovery is a generic phrase describing the _mechanism_
used by some filesystems to implement the data integrity guarantees
the filesystem provides to userspace applications.

> +
> +Crash-recovery guarantees only pertain to data or metadata that has
> +been explicitly persisted to storage with fsync(), fdatasync(), or
> +sync() system calls.

Define data and metadata in terms of what they refer to when we talk
about data integrity guarantees.

Define "persisted to storage".

Also, data integrity guarantees are provided by more interfaces than
you mention. They also apply to syncfs(), FIFREEZE, files/dirs
opened with O_[D]SYNC, readv2/writev2 calls with RWF_[D]SYNC set,
inodes with the S_[DIR]SYNC on-disk attribute, mounts with
dirsync/wsync options, etc. "data integrity guarantees" encompass
all these operations, not just fsync/fdatasync/sync....
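
As an illustration of one of those other interfaces, here is a minimal
sketch (Python, not taken from the patch) of a write through a
descriptor opened with O_DSYNC. The helper name write_with_dsync is
made up for this example, and falling back to no flag when the
platform lacks O_DSYNC is an assumption made only so the sketch runs
everywhere.

```python
import os


def write_with_dsync(path: str, payload: bytes) -> int:
    """Write payload through a descriptor opened with O_DSYNC.

    With O_DSYNC, write() is expected to return only once the data has
    reached synchronized I/O data integrity completion, so no separate
    fdatasync() call is issued here.
    """
    # Assumption: fall back to 0 (no extra flag) where O_DSYNC is missing.
    dsync = getattr(os, "O_DSYNC", 0)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | dsync
    fd = os.open(path, flags, 0o644)
    try:
        # Small payloads are written in one call; returns bytes written.
        return os.write(fd, payload)
    finally:
        os.close(fd)
```

The point is only that the integrity guarantee is attached to the
open flag here, not to an explicit fsync() call.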

> By default, write(), mkdir(), and other
> +file-system related system calls only affect the in-memory state of
> +the file system.

That's a generalisation that is not always correct from the user's
or userspace developer's point of view. e.g. inodes with the sync
attribute set will default to synchronous on-disk state changes,
applications can use O_DSYNC/O_SYNC by default, etc....

> +The crash-recovery guarantees provided by most Linux file systems are
> +significantly stronger than what is required by POSIX. POSIX is vague,
> +even allowing fsync() to do nothing (Mac OSX takes advantage of
> +this).

Except when _POSIX_SYNCHRONIZED_IO is asserted, and then the
semantics filesystems must provide users are very explicit:

"[SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
fsync() function shall force all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronized I/O completion state. All I/O operations shall be
completed as defined for synchronized I/O file integrity completion.
[Option End]"

glibc asserts _POSIX_SYNCHRONIZED_IO (I'll use SIO from now on):

$ getconf _POSIX_SYNCHRONIZED_IO
200809
$
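
The same query can be made from a program. A minimal sketch in
Python, assuming the interpreter exposes the name SC_SYNCHRONIZED_IO
in os.sysconf_names (it returns None if it does not); the function
name posix_synchronized_io is made up for this example.

```python
import os


def posix_synchronized_io():
    """Return what sysconf reports for _POSIX_SYNCHRONIZED_IO, or None."""
    name = "SC_SYNCHRONIZED_IO"
    # Assumption: the name may be absent on some platforms or builds.
    if name not in getattr(os, "sysconf_names", {}):
        return None
    try:
        return os.sysconf(name)
    except (OSError, ValueError):
        return None


print(posix_synchronized_io())
```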

This means fsync() on Linux is supposed to conform to Section 3.376
"Synchronized I/O File Integrity Completion" of the specification,
which is a superset of the 3.375 "Synchronized I/O Data Integrity
Completion". Section 3.375 says:

"For write, when the operation has been completed or diagnosed if
unsuccessful. The write is complete only when the data specified in
the write request is successfully transferred and all file system
information required to retrieve the data is successfully
transferred."

https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_375

The key phrase here is "all file system information required to
retrieve the data". If the directory entry that points at the file
is not persisted with the file itself, then you can't retrieve the
data after a crash.  i.e. when _POSIX_SYNCHRONIZED_IO is asserted by
the system, the filesystem must guarantee this:

# touch A/foo
# echo "hello world" > A/foo
# fsync A/foo

persists the foo entry in the directory A, because that is
"filesystem information required to retrieve the data in the file
A/foo". i.e. if we crash here and A/foo is not present after
restart, then we've violated the POSIX specification for SIO.
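
The three shell steps above map onto system calls roughly as in the
following sketch (Python). The helper name create_and_fsync and the
optional sync_parent argument are made up for this example; the
parent directory fsync is what the man page semantics would require
an application to add, and is not needed where the filesystem
provides the SIO behaviour described above.

```python
import os


def create_and_fsync(dirpath, name, payload, sync_parent=False):
    """Create dirpath/name, write payload, and fsync the file."""
    path = os.path.join(dirpath, name)
    # touch A/foo
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)  # echo "hello world" > A/foo
        os.fsync(fd)           # fsync A/foo
    finally:
        os.close(fd)
    if sync_parent:
        # Defensive extra step: persist the directory entry explicitly.
        dfd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)
    return path
```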

IOWs, POSIX fsync w/ SIO semantics does not allow fsync() to do
nothing, but instead has explicit definitions of the behaviour
applications can expect.  The only "wiggle room" in this
specification is whether the meaning of "data transfer" includes
physically persisting the data to storage media or just moving it
into the device's volatile cache. On Linux, we've explicitly chosen
the former, because the latter does not provide SIO semantics as
data or referencing metadata can still be lost from the device's
volatile cache after transfer.

> However, the guarantees provided by file systems are not
> +documented, and vary between file systems. This document seeks to
> +describe the current crash-recovery guarantees provided by major Linux
> +file systems.
> +
> +What does the fsync() operation guarantee?
> +----------------------------------------------------
> +fsync() operation is meant to force the physical write of data
> +corresponding to a file from the buffer cache, along with the file
> +metadata. Note that the guarantees mentioned for each file system below
> +are in addition to the ones provided by POSIX.

a. What is a "physical write"?
b. Linux does not have a buffer cache. What about direct IO?
c. Exactly what "file metadata" are you talking about here?
d. Actually, it's not "in addition" to POSIX - what you are
documenting here is where filesystems do not conform to
the POSIX SIO specification....

> +POSIX
> +-----
> +fsync(file) : Flushes the data and metadata associated with the
> +file.  However, if the directory entry for the file has not been
> +previously persisted, or has been modified, it is not guaranteed to be
> +persisted by the fsync of the file [1].

These are the semantics defined in the Linux fsync(2) man page, and
as per the above, they are substantially /weaker/ than the POSIX
SIO specification glibc says we implement.

> What this means is, if a file
> +is newly created, you will have to fsync(parent directory) in addition
> +to fsync(file) in order to ensure that the file's directory entry has
> +safely reached the disk.

Define "safely reached disk" or use the same terms as previously
defined (i.e. "persisted to storage").

> +
> +fsync(dir) : Flushes directory data and directory entries. However if
> +you created a new file within the directory and wrote data to the
> +file, then the file data is not guaranteed to be persisted, unless an
> +explicit fsync() is issued on the file.

You talk about file metadata, then ignore what fsync does with
directory metadata...

> +ext4
> +-----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted (no need to explicitly persist the parent directory). However,
> +if you create multiple names of the file (hard links), then their directory
> +entries are not guaranteed to persist unless each one of the parent
> +directory entries are persisted [2].

So you use a specific example to indicate an exception where ext4
needs an explicit parent directory fsync (i.e. hard links to a
single file across multiple directories). That implies ext4 POSIX
SIO compliance is questionable, and it is definitely not SOMC
compliant. Further, it implies that transactional change atomicity
requirements are also violated. i.e. the inode is journalled with a
link count equivalent to all links existing, but not all the dirents
that point to the inode are persisted at the same time.

So from this example, ext4 is not SOMC compliant.

> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data.

What about the inodes that were created, removed or hard linked?
Does it ensure they exist (or have been correctly freed) after
fsync(dir), too?  (That hardlink behaviour makes me question
everything related to transaction atomicity in ext4 now.)

> +xfs
> +----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted.

Actually, it ensures the path all the way up to the root inode is
persisted. i.e. it guarantees the inode can be found after crash via
a path walk. Basically, XFS demonstrates POSIX SIO compliant
behaviour.

> Additionally, all the previous dependent modifications to
> +this file are also persisted.

That's the mechanism that provides the behaviour, not sure that's
relevant here.

FWIW, this description is pretty much useless to a reader who knows
nothing about XFS and what these terms actually mean.  IOWs, you
need to define "previous dependent modifications", "modification
dependency", etc before using them. Essentially, you need to
describe the observable behaviour here, not the implementation that
creates the behaviour.

> If any file shares an object
> +modification dependency with the fsync-ed file, then that file's
> +directory entry is also persisted.

Which you need to explain with references to the ext4 hardlink
failure and how XFS will persist all the hard link directory entries
for each hardlink all the way back up to the root. i.e. don't
describe the implementation, describe the observable behaviour.

> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data. As with files, fsync(dir) also persists
> +previous dependent metadata operations.
>
> +btrfs
> +------
> +fsync(file) : Ensures that a newly created file's directory entry
> +is persisted, along with the directory entries of all its hard links.
> +You do not need to explicitly fsync individual hard links to the file.

So how is that different to XFS? Why explicitly state the hard link
behaviour, but then not mention anything about dependencies and
propagation? Especially after doing exactly the opposite when
describing XFS....

> +fsync(dir) : All the file names within the directory will persist. All the
> +rename and unlink operations within the directory are persisted. Due
> +to the design choices made by btrfs, fsync of a directory could lead
> +to an iterative fsync on sub-directories, thereby requiring a full
> +file system commit. So btrfs does not advocate fsync of directories
> +[2].

I don't think this "recommendation" is appropriate for a document
describing behaviour. It's also indicative of btrfs not having SOMC
behaviour.

> +F2FS
> +----
> +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix),
> +F2FS only guarantees POSIX behaviour. However, it provides xfs-like

What does "only guarantees POSIX behaviour" actually mean? Because
it can mean "loses all your data on crash"....

> +guarantees if mounted with fsync-mode=strict option.

So, by default, f2fs will lose all your data on crash? And they call
that "POSIX" behaviour, despite glibc telling applications that the
system provides data integrity preserving fsync functionality?

Seems like a very badly named mount option and a terrible default -
basically we have "fast-and-loose" behaviour which has "eats your
data" data integrity semantics and "strict" which should be POSIX
SIO conformant.

> +fsync(symlink)
> +-------------
> +A symlink inode cannot be directly opened for IO, which means there is
> +no such thing as fsync of a symlink [3]. You could be tricked by the
> +fact that open and fsync of a symlink succeeds without returning a
> +error, but what happens in reality is as follows.
> +
> +Suppose we have a symlink “foo”, which points to the file “A/bar”
> +
> +fd = open(“foo”, O_CREAT | O_RDWR)
> +fsync(fd)
> +
> +Both the above operations succeed, but if you crash after fsync, the
> +symlink could be still missing.
> +
> +When you try to open the symlink “foo”, you are actually trying to
> +open the file that the symlink resolves to, which in this case is
> +“A/bar”. When you fsync the inode returned by the open system call, you
> +are actually persisting the file “A/bar” and not the symlink. Note
> +that if the file “A/bar” does not exist and you try the open the
> +symlink “foo” without the O_CREAT flag, then file open will fail. To
> +obtain the file descriptor associated with the symlink inode, you
> +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> +file descriptor obtained this way can be only used to indicate a
> +location in the file-system tree and to perform operations that act
> +purely at the file descriptor level. Operations like read(), write(),
> +fsync() etc cannot be performed on such file descriptors.
> +
> +Bottomline : You cannot fsync() a symlink.

You can fsync() the parent dir after it is created or removed
to persist that operation.
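
A minimal sketch of that in Python: create the symlink, then fsync
the directory that contains it. The helper names fsync_dir and
create_symlink_durably are made up for this example.

```python
import os


def fsync_dir(dirpath):
    """fsync a directory through a descriptor opened on it."""
    fd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def create_symlink_durably(target, linkpath):
    """Create linkpath -> target, then persist the parent directory."""
    os.symlink(target, linkpath)
    # The symlink itself cannot be fsync()ed, so sync its parent instead.
    fsync_dir(os.path.dirname(os.path.abspath(linkpath)))
```

The target does not need to exist for this to work, since only the
directory holding the link is opened.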

> +fsync(special files)
> +--------------------
> +Special files in Linux include block and character device files
> +(created using mknod), FIFO (created using mkfifo) etc. Just like the
> +behavior of fsync on symlinks described above, these special files do
> +not have an fsync function defined. Similar to symlinks, you
> +cannot fsync a special file [4].

You can fsync() the parent dir after it is created or removed
to persist that operation.

> +Strictly Ordered Metadata Consistency
> +-------------------------------------
> +With each file system providing varying levels of persistence
> +guarantees, a consensus in this regard, will benefit application
> +developers to work with certain fixed assumptions about file system
> +guarantees. Dave Chinner proposed a unified model called the
> +Strictly Ordered Metadata Consistency (SOMC) [5].
> +
> +Under this scheme, the file system guarantees to persist all previous
> +dependent modifications to the object upon fsync().  If you fsync() an
> +inode, it will persist all the changes required to reference the inode
> +and its data. SOMC can be defined as follows [6]:
> +
> +If op1 precedes op2 in program order (in-memory execution order), and
> +op1 and op2 share a dependency, then op2 must not be observed by a
> +user after recovery without also observing op1.
> +
> +Unfortunately, SOMC's definition depends upon whether two operations
> +share a dependency, which could be file-system specific. It might
> +require a developer to understand file-system internals to know if
> +SOMC would order one operation before another.

That's largely an internal implementation detail, and users should
not have to care about the internal implementation because the
fundamental dependencies are all defined by the directory hierarchy
relationships that users can see and manipulate.

i.e. fs internal dependencies only increase the size of the graph
that is persisted, but it will never be reduced to less than what
the user can observe in the directory hierarchy.

So this can be further refined:

	If op1 precedes op2 in program order (in-memory execution
	order), and op1 and op2 share a user visible reference, then
	op2 must not be observed by a user after recovery without
	also observing op1.

e.g. in the case of the parent directory - the parent has a link
count. Hence every create, unlink, rename, hard link, symlink, etc.
operation in a directory modifies a user visible link count
reference.  Hence fsync of one of those children will persist the
directory link count, and then all of the other preceding
transactions that modified the link count also need to be persisted.

But keep in mind this defines ordering, not the persistence set:

# touch {a,b,c,d}
# touch {1,2,3,4}
# fsync d
<crash>

SOMC doesn't require {1,2,3,4} to be in the persistence set and
hence present after recovery. It only requires {a,b,c,d} to be in
the persistence set.
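
To make the ordering rule concrete, here is a toy model in Python. It
is a sketch under stated assumptions, not how any filesystem tracks
dependencies: each operation is reduced to a name plus the set of
user visible objects it modifies, and each touch is assumed to modify
both the parent directory ("dir") and the file it creates. The names
Op, somc_violations and ops are made up for this example. The checker
only tests the ordering rule; it does not check that the fsync()ed
object itself is present.

```python
from typing import FrozenSet, List, Set, Tuple

# (operation name, user visible objects the operation modifies)
Op = Tuple[str, FrozenSet[str]]


def somc_violations(ops: List[Op], observed: Set[str]) -> List[Tuple[str, str]]:
    """Return (missing_op1, observed_op2) pairs that break the rule.

    ops is in program order; observed holds the operation names whose
    effects are visible after recovery.
    """
    violations = []
    for j, (name2, refs2) in enumerate(ops):
        if name2 not in observed:
            continue
        for name1, refs1 in ops[:j]:
            # op1 precedes op2, shares a reference, yet is not observed.
            if refs1 & refs2 and name1 not in observed:
                violations.append((name1, name2))
    return violations


# touch {a,b,c,d}; touch {1,2,3,4} -- all in the same parent directory.
ops = [(f"touch {n}", frozenset({"dir", n})) for n in "abcd1234"]

abcd = {f"touch {n}" for n in "abcd"}
print(somc_violations(ops, abcd))                # only {a,b,c,d} recovered
print(somc_violations(ops, abcd - {"touch c"}))  # d recovered without c
```

In this model, recovering only {a,b,c,d} yields no violations, and so
does recovering all eight files, while recovering d without c is
flagged.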

If you observe XFS behaviour, it will result in {1,2,3,4} also being
included in the persistence set, because it aggregates all the
changes to the parent directory into a single change per journal
checkpoint sequence and hence it cannot separate them at fsync time.

This, however, is an XFS journal implementation detail and not
something required by SOMC. The resulting behaviour is that XFS
generally persists more than SOMC requires, but the persistence set
that XFS calculates always maintains SOMC semantics, so it should
always do the right thing.

IOWs, a finer grained implementation of change dependencies could
result in providing exact, minimal persistence SOMC behaviour in
every situation, but don't expect that from XFS. It is likely that
experimental, explicit change dependency graph based filesystems like
featherstitch (*) would provide minimal scope SOMC persistence
behaviour, but that's out of the scope of this document.

(*) http://featherstitch.cs.ucla.edu/
http://featherstitch.cs.ucla.edu/publications/featherstitch-sosp07.pdf
https://lwn.net/Articles/354861/

> It is worth noting
> +that a file system can be crash-consistent (according to POSIX),
> +without providing SOMC [7].

"crash-consistent" doesn't mean "data integrity preserving", and
POSIX only talks about data integrity behaviour. "crash-consistent"
just means the filesystem is not in a corrupt state when it
recovers.

> +As an example, consider the following test case from xfstest
> +generic/342 [8]
> +-------
> +touch A/foo
> +echo “hello” >  A/foo
> +sync
> +
> +mv A/foo A/bar
> +echo “world” > A/foo
> +fsync A/foo
> +CRASH

[whacky utf-8(?) symbols.  Plain ascii text for documents, please.]
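
For reference, the quoted sequence can be written out with plain
ASCII quoting as a small Python program. This is a sketch of the
workload only, not the actual generic/342 test: the helper names
write_file and rename_then_recreate are made up for this example, and
the crash itself has to be injected externally after the final
fsync() returns.

```python
import os


def write_file(path, payload, sync):
    """Create or truncate path, write payload, optionally fsync it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        if sync:
            os.fsync(fd)
    finally:
        os.close(fd)


def rename_then_recreate(dir_a):
    """Run the rename-then-recreate workload inside directory dir_a."""
    foo = os.path.join(dir_a, "foo")
    bar = os.path.join(dir_a, "bar")
    write_file(foo, b"hello\n", sync=False)  # touch A/foo; echo "hello" > A/foo
    os.sync()                                # sync
    os.rename(foo, bar)                      # mv A/foo A/bar
    write_file(foo, b"world\n", sync=True)   # echo "world" > A/foo; fsync A/foo
    # CRASH would be injected here by an external tool.
```

What is visible after recovery depends on the filesystem; without a
crash both A/bar ("hello") and A/foo ("world") exist.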

> +What would you expect on recovery, if the file system crashed after
> +the final fsync returned successfully?
> +
> +Non-SOMC file systems will not persist the file
> +A/bar because it was not explicitly fsync-ed. But this means, you will
> +find only the file A/foo with data “world” after crash, thereby losing
> +the previously persisted file with data “hello”. You will need to
> +explicitly fsync the directory A to ensure the rename operation is
> +safely persisted on disk.
> +
> +Under SOMC, to correctly reference the new inode via A/foo,
> +the previous rename operation must persist as well. Therefore,
> +fsync() of A/foo will persist the renamed file A/bar as well.
> +On recovery you will find both A/bar (with data “hello”)
> +and A/foo (with data “world”).

You should describe the SOMC behaviour up front in the document,
because that is the behaviour this document is about.  Then describe
how the "man page fsync behaviour" and individual filesystems differ
from SOMC behaviour.

It would also be worth contrasting SOMC to historic ext3 behaviour
(globally ordered metadata and data), because that is the behaviour
that many application developers and users still want current
filesystems to emulate.

> +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> +and btrfs provide SOMC-like behaviour in this particular example.
> +However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
> +SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
> +btrfs provide strictly ordered metadata consistency.

btrfs does not provide SOMC w.r.t. fsync() - that much is clear from
the endless stream of fsync bugs that are being found and fixed.

Also, the hard link behaviour described for ext4 indicates that it
is not truly SOMC, either. From this, I'd consider ext4 a "mostly
SOMC" implementation, but it seems that there are aspects of
ext4/jbd2 dependency and/or atomicity tracking that don't fully
resolve cross-object transactional atomicity dependencies correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com