From: Dave Chinner <dchinner@redhat.com>
To: Eric Wheeler <bcachefs@lists.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>,
	Brian Foster <bfoster@redhat.com>,
	linux-bcachefs@vger.kernel.org
Subject: Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
Date: Tue, 21 Feb 2023 09:19:37 +1100
Message-ID: <Y/Px+avr7z4EsMN+@rh>
In-Reply-To: <c7a8e8b2-33f5-341c-92ad-bf40b5634a35@ewheeler.net>

On Thu, Feb 16, 2023 at 12:04:34PM -0800, Eric Wheeler wrote:
> On Tue, 7 Feb 2023, Dave Chinner wrote:
> > On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> > > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > > > What do you need to know? The vast majority of the freeze
> > > > infrastructure is generic and the filesystem doesn't need to do
> > > > anything special. The only thing it needs to implement is
> > > > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > > > all the data has been written back and all user-driven operations
> > > > have either been stalled or drained at either the VFS or transaction
> > > > reservation points.
> > > > 
> > > > This requires the filesystem transaction start point to call
> > > > sb_start_intwrite() in a location the transaction start can block
> > > > safely forever, and to call sb_end_intwrite() when the transaction
> > > > is complete and being torn down. 
> > > > 
> > > > [Note that bcachefs might also require
> > > > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > > > handlers for and to protect against ioctl operations triggering
> > > > modifications during freezes.]
> > > > 
> > > > This allows freeze_super() to set a barrier to prevent new
> > > > transactions from starting, and to wait on transactions in flight to
> > > > drain. Once all transactions have drained, it will then call
> > > > > ->freeze_fs if it is defined so the filesystem can flush its
> > > > journal and dirty in-memory metadata so that it becomes consistent
> > > > on disk without requiring journal recovery to be run.
> > > > 
> > > > > This basically means that once ->freeze_fs completes, the filesystem
> > > > should be in the same state on-disk as if it were unmounted cleanly,
> > > > and the fs will not issue any more IO until the filesystem is
> > > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > > > tasks so that the filesystem can restart things that may be needed
> > > > for normal operation that were stopped during the freeze.
> > > > 
> > > > It's not all that complex anymore - a few hooks to enable
> > > > modification barriers to be placed and running the
> > > > writeback part of unmount in ->freeze_fs is the main component
> > > > of the work that needs to be done....
> > > 
> > > Thanks, that is a lot simpler than I was thinking - I guess I was
> > > thinking about task freezing for suspend, that definitely had some
> > > tricky bits.
> > 
> > "freezing for suspend" is a mess - it should just be calling
> > freeze_super() and letting the filesystem take care of everything to
> > do with freezing the filesystem. The whole idea that we can suspend
> > a filesystem safely by running sync() and stopping kernel threads and
> > workqueues from running is .... broken.
> > 
> > > Sounds like treating it as if we were remounting read-only
> > > is all we need to do.
> > 
> > Yup, pretty much. XFS shares all the log quiescing code between
> > ->freeze_fs and remount_ro. The only difference is that remount_ro
> > has to do all the work to write dirty data back to disk before
> > it quiesces the log....
> 
> Wait, so are you saying that XFS does not commit dirty buffers for 
> sleeping, only on remount_ro?

Yup. But this is not unique to XFS - every journalling filesystem
(ext3, ext4, jfs, etc) has exactly the same problem:
sync_filesystem() only guarantees that the filesystem is consistent
on disk, not that it is clean in memory.

And by "consistent on disk", that means all dirty metadata has been
written -to the journal- so that if a crash occurs immediately
afterwards, journal recovery on the next mount will ensure that the
filesystem is consistent.

IOWs, after sync_filesystem(), the filesystem is most definitely
*not idle* and *not clean in memory*, and that's where all the
issues with suspend end up coming from - it assumes sync() is all
that is needed to put a filesystem in an idle state....
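
To make the difference concrete, here is a toy userspace model, not
kernel code. Every toy_* name and struct field is hypothetical and
only mirrors the behaviour described above: toy_trans_start() stands
in for the sb_start_intwrite() point, toy_sync() for
sync_filesystem(), and toy_freeze() for freeze_super() plus
->freeze_fs. Where the real code would sleep, the model just returns
false.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a journalling filesystem (all names hypothetical). */
struct toy_fs {
	bool barrier;         /* set by freeze: no new transactions may start */
	int  active_trans;    /* transactions currently in flight */
	int  dirty_in_memory; /* metadata changes not yet written back in place */
	int  in_journal;      /* changes that exist on disk only in the journal */
};

/* Models the sb_start_intwrite() point; returns false where the real call would block. */
static bool toy_trans_start(struct toy_fs *fs)
{
	if (fs->barrier)
		return false;
	fs->active_trans++;
	return true;
}

/* Models commit plus sb_end_intwrite(): the change is now dirty in memory. */
static void toy_trans_commit(struct toy_fs *fs)
{
	assert(fs->active_trans > 0);
	fs->dirty_in_memory++;
	fs->active_trans--;
}

/* Models sync_filesystem(): everything dirty is logged to the journal,
 * but nothing is written back in place, so memory stays dirty. */
static void toy_sync(struct toy_fs *fs)
{
	fs->in_journal = fs->dirty_in_memory;
}

/* Models freeze_super() + ->freeze_fs: set the barrier, require in-flight
 * transactions to have drained, then write metadata back and empty the
 * journal. Returns false where the real code would wait for the drain. */
static bool toy_freeze(struct toy_fs *fs)
{
	fs->barrier = true;
	if (fs->active_trans > 0)
		return false;
	fs->dirty_in_memory = 0;
	fs->in_journal = 0;
	return true;
}

/* Models thaw: drop the barrier so transactions can start again. */
static void toy_thaw(struct toy_fs *fs)
{
	fs->barrier = false;
}

/* Would the next mount have to replay the journal? */
static bool toy_needs_recovery(const struct toy_fs *fs)
{
	return fs->in_journal > 0;
}

/* Is there no dirty metadata left in memory? */
static bool toy_clean_in_memory(const struct toy_fs *fs)
{
	return fs->dirty_in_memory == 0;
}
```

In this model, after toy_sync() the filesystem would still need
journal recovery and is still dirty in memory. Only a completed
toy_freeze() leaves it both clean in memory and with nothing to
replay, which is the state suspend actually wants.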

> In an ideal case I suppose the laptop ram is 
> still hot... But sometimes (ahem, far too often) I close my laptop and 
> forget about it, in which case the battery dies and of course then any 
> dirty pages are lost.  IMHO sleep should always be crash-safe.

Suspend is generally considered crash safe. The problems with
suspend stem from inconsistent in-memory vs on-disk filesystem state
in the suspend image - this causes problems on resume of the
suspended image, not on the next cold boot of the system.

-Dave.
-- 
Dave Chinner
dchinner@redhat.com

