From: Dave Chinner <david@fromorbit.com>
To: Jayashree Mohan <jayashree2912@gmail.com>
Cc: linux-xfs@vger.kernel.org,
	Vijaychidambaram Velayudhan Pillai <vijay@cs.utexas.edu>,
	Ashlie Martinez <ashmrtn@utexas.edu>,
	fstests@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
Subject: Re: XFS crash consistency bug : Loss of fsynced metadata operation
Date: Wed, 14 Mar 2018 09:45:22 +1100
Message-ID: <20180313224522.GZ7000@dastard>
In-Reply-To: <CA+EzBbC9yQyxqiPOTCymrbL=cW-KZq2b8hunf7KEpiNZ9f3s4Q@mail.gmail.com>

On Tue, Mar 13, 2018 at 11:57:21AM -0500, Jayashree Mohan wrote:
> Hi Dave,
> 
> Thanks for the response. CrashMonkey assumes the following behavior of
> disk cache. Let me know if any of this sounds unreasonable.
> 
> Whenever the underlying storage device has an associated cache, the IO
> is marked completed the moment it reaches the disk cache. This does
> not guarantee that the disk cache would persist them in the same
> order, unless there is a Flush/FUA. The order of completed writes as
> seen by the user could be A, B, C, *Flush* D, E. However the disk
> cache could write these back to the persistent storage in the order
> say B, A, C, E, D. The only invariant it ensures is that writing in an
> order like A, C, E, B, D is not possible because writes A, B, C have
> to strictly happen before D and E. However you cannot ensure that
> (A, B, C) is written to the persistent storage in the same order.
> 
> CrashMonkey reorders bios in conformance to the guarantees provided by
> disk cache; we do not make any extra assumptions and we respect the
> barrier operations.

I think your model is wrong. Caches do not randomly re-order
completed IO operations to the *same LBA*. When a block is overwritten
the cache contains the overwrite data and the previous data is
discarded. The previous data may be on disk, but it's no longer in
the cache.
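
A minimal sketch of that overwrite behaviour, assuming a toy
write-back cache that holds at most one pending copy per LBA. The
class and method names (WriteBackCache, write_back, crash) are
hypothetical and are not taken from CrashMonkey or any kernel code:

```python
class WriteBackCache:
    """Toy volatile disk cache: at most one pending version per LBA."""

    def __init__(self):
        self.disk = {}     # persistent medium: LBA -> data
        self.pending = {}  # volatile cache: LBA -> newest completed write

    def write(self, lba, data):
        # An overwrite replaces the cached copy; the older data is discarded.
        self.pending[lba] = data

    def write_back(self, lba):
        # Persist the cached version of one LBA. The cache is free to
        # pick LBAs in any order, but it can only write what it holds.
        if lba in self.pending:
            self.disk[lba] = self.pending.pop(lba)

    def flush(self):
        # A flush persists everything still pending.
        for lba in list(self.pending):
            self.write_back(lba)

    def crash(self):
        # Anything that only lived in the cache is lost.
        self.pending.clear()
        return dict(self.disk)
```

In this model, once A' has been written to an LBA there is no copy of
A left in the cache that could be written back after it.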

e.g. take a dependent filesystem read-modify-write cycle (I'm
choosing this because that's the problem this fzero/fsync
"bug" is apparently demonstrating) where we write data to disk,
invalidate the kernel cache, read the data back off disk, zero it
in memory, then write it back to disk, all in the one LBA:

	<flush>
	write A to disk, invalidate kernel cache
	......
	read A from disk into kernel cache
	A' = <modify A>
	write A' to disk
	......
	<flush>

The disk cache model you are describing allows writes
to be reordered anywhere in the flush window regardless of their
inter-IO completion dependencies. Hence you're allowing temporally
ordered filesystem IO to the same LBA to be reordered like so:


	<flush>
	......
	write A'
	......
	read A
	A' = <modify A>
	......
	write A
	......
	<flush>

This violates causality. It's simply *not possible* for the disk
cache to contain A' before "write A", "read A" and the
in-memory modification of A have all been completed by the OS. Hence
there is no way for a crash to leave the disk cache or the
physical storage medium in a state that indicates it
stored A' on disk before it stored A.
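
The constraint can be sketched as a check on a proposed write-back
order. This is only an illustration under stated assumptions: the
function name is_valid_writeback_order is hypothetical, every
completed write is assumed to reach the media exactly once, and
"FLUSH" marks a barrier in the completion-ordered list:

```python
def is_valid_writeback_order(completed, persisted):
    """Return True if `persisted` is an order the cache could produce.

    completed: completion-ordered list of (name, lba) writes and
               "FLUSH" markers.
    persisted: list of write names in the order they hit the media.
    """
    pos = {name: i for i, name in enumerate(persisted)}
    epoch = 0   # incremented at every flush barrier
    seen = []   # (name, lba, epoch) of earlier completed writes
    for op in completed:
        if op == "FLUSH":
            epoch += 1
            continue
        name, lba = op
        for prev_name, prev_lba, prev_epoch in seen:
            # An earlier write must persist first if a flush separates
            # the two writes, or if both target the same LBA.
            must_precede = prev_epoch < epoch or prev_lba == lba
            if must_precede and pos[prev_name] > pos[name]:
                return False
        seen.append((name, lba, epoch))
    return True


# The quoted example: A, B, C, <flush>, D, E, all to different LBAs.
window = [("A", 1), ("B", 2), ("C", 3), "FLUSH", ("D", 4), ("E", 5)]
print(is_valid_writeback_order(window, ["B", "A", "C", "E", "D"]))  # True
print(is_valid_writeback_order(window, ["A", "C", "E", "B", "D"]))  # False

# The read-modify-write case: A then A' to the same LBA, no flush between.
rmw = [("A", 10), ("A'", 10)]
print(is_valid_writeback_order(rmw, ["A'", "A"]))  # False
```

The same-LBA test is the part the flush-only model leaves out; with
it, the reordered sequence above is rejected even though both writes
sit in the same flush window.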

> CrashMonkey therefore respects the guarantees provided by the disk
> cache, and assumes nothing more than that. I hope this provides more
> clarity on what
> CrashMonkey is trying to do, and why we think it is reasonable to do so.

It clearly demonstrates to me where CrashMonkey is broken and needs
fixing - it needs to respect the ordering of temporally separate IO
to the same LBA and not violate causality. Simulators that assume
time travel is possible are not useful to us.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
