From: Dave Chinner
To: Matthew Wilcox
Cc: David Howells, Theodore Ts'o, Trond Myklebust,
	"Darrick J. Wong", Jeff Layton, Andreas Dilger,
	Anna Schumaker, Bob Liu, Josef Bacik, Seth Jennings,
	Jens Axboe, Chris Mason, David Sterba, Minchan Kim,
	Steve French, NeilBrown, Dan Magenheimer, Ilya Dryomov
Subject: Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
Date: Mon, 27 Sep 2021 08:36:58 +1000
Message-ID: <20210926223658.GE1756565@dread.disaster.area>
In-Reply-To: <YU/>

On Sun, Sep 26, 2021 at 04:10:43AM +0100, Matthew Wilcox wrote:
> On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> > Ok, so if the filesystem is doing block mapping in the IO path now,
> > why does the swap file still need to map the file into a private
> > block mapping now?  i.e. all the work that iomap_swapfile_activate()
> > does for filesystems like XFS and ext4 - isn't this completely
> > redundant now that we are doing block mapping during swap file IO
> > via iomap_dio_rw()?
> Hi Dave,
> Thanks for bringing up all these points.  I think they all deserve to go
> into the documentation as "things to consider" for people implementing
> ->swap_rw for their filesystem.
> Something I don't think David perhaps made sufficiently clear is that
> regular DIO from userspace gets handled by ->read_iter and ->write_iter.
> This ->swap_rw op is used exclusively for, as the name suggests, swap DIO.
> So filesystems don't have to handle swap DIO and regular DIO the same
> way, and can split the allocation work between ->swap_activate and the
> iomap callback as they see fit (as long as they can guarantee the lack
> of deadlocks under memory pressure).

I understand this completely.

The point is that the implementation of ->swap_rw is to call
iomap_dio_rw() with the same ops as the normal DIO read/write path
uses. IOWs, apart from the IOCB_SWAP flag, there is no practical
difference between the "swap DIO" and "normal DIO" I/O paths.

> There are several advantages to using the DIO infrastructure for
> swap:
>  - unify block & net swap paths
>  - allow filesystems to _see_ swap IOs instead of being bypassed
>  - get rid of the swap extent rbtree
>  - allow writing compound pages to swap files instead of splitting
>    them
>  - allow ->readpage to be synchronous for better error reporting
>  - remove page_file_mapping() and page_file_offset()
> I suspect there are several problems with this patchset, but I'm not
> likely to have a chance to read it closely for a few days.  If you
> have time to give the XFS parts a good look, that would be fantastic.

That's what I've already done, and all the questions I've raised are
from asking a simple question: what happens if a transaction is
required to complete the iomap_dio_rw() swap write operation?

I mean, this is similar to the problems with IOCB_NOWAIT - we're
supposed to return -EAGAIN if we might block during IO submission,
and one of those situations we have to consider is "do we need to
run a transaction". If we get it wrong (and we do!), then the worst
thing that happens is that there is a long latency for IO
submission. It's a minor performance issue, not the end of the world.

The difference with IOCB_SWAP is that "don't do transactions during
iomap_dio_rw()" is a _hard requirement_ on both IO submission and
completion. That means, from now and forever, we will have to
guarantee a path through iomap_dio_rw() that will never run
transactions on an IO. That requirement needs to be enforced in
every block mapping callback into each filesystem, as this is
something the iomap infrastructure cannot enforce. Hence we'll have
to plumb IOCB_SWAP into a new IOMAP_SWAP iterator flag to pass to
the ->iomap_begin() DIO methods to ensure they do the right thing.
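
To make that plumbing concrete, here is a minimal userspace sketch, not
kernel code. Everything prefixed SKETCH_/sketch_ is made up for
illustration, the real ->iomap_begin() signature is different, and the
choice of -EINVAL as the refusal error is an arbitrary assumption. It
only shows the shape of the rule: a swap mapping request must never
reach the branch that runs a transaction.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model only. */
#define SKETCH_IOMAP_WRITE (1u << 0)
#define SKETCH_IOMAP_SWAP  (1u << 1)	/* stands in for the proposed IOMAP_SWAP */

/* Non-negative results: did the mapping call have to run a transaction? */
#define SKETCH_NO_TRANSACTION  0
#define SKETCH_RAN_TRANSACTION 1

/*
 * Model of a filesystem block mapping callback.
 *
 * 'allocated' says whether the range already has blocks behind it.
 * Returns SKETCH_NO_TRANSACTION or SKETCH_RAN_TRANSACTION on success,
 * or a negative errno if the request is refused.
 */
static int sketch_iomap_begin(bool allocated, unsigned int flags)
{
	/* Reject flag bits this model does not know about. */
	assert((flags & ~(SKETCH_IOMAP_WRITE | SKETCH_IOMAP_SWAP)) == 0);

	/* Reads and pure overwrites are simple lookups. */
	if (allocated || !(flags & SKETCH_IOMAP_WRITE))
		return SKETCH_NO_TRANSACTION;

	/*
	 * A write into unallocated space needs an allocation
	 * transaction. For swap that is forbidden, so fail instead.
	 * (Assumption: the error code here is arbitrary.)
	 */
	if (flags & SKETCH_IOMAP_SWAP)
		return -EINVAL;

	/* Normal DIO may allocate, at the cost of a transaction. */
	return SKETCH_RAN_TRANSACTION;
}
```

In this model a normal DIO write into a hole returns
SKETCH_RAN_TRANSACTION, while the same request with SKETCH_IOMAP_SWAP
set is refused. The check lives in the filesystem callback because, as
noted above, the iomap infrastructure cannot enforce it.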

And then the question becomes: what happens if the filesystem cannot
do the right thing? Can the swap code handle an error? e.g. the
first thing that xfs_direct_write_iomap_begin() and
xfs_read_iomap_begin() do is check whether the filesystem is shut down
and return -EIO in that case. IOWs, we've now got normal filesystem
"reject all IO" corruption protection mechanisms in play. Using
iomap_dio_rw() as it stands means that _all swapfile IO will fail_
if the filesystem shuts down.
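
A tiny userspace model of that behaviour, with made-up sketch_ names
standing in for the real XFS functions, shows why sharing the callback
matters: once swap goes through the same mapping entry point as normal
DIO, it inherits the shutdown check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-mount filesystem state. */
struct sketch_mount {
	bool shutdown;
};

/*
 * Model of a DIO block mapping callback: the first thing it does is
 * refuse all IO on a shut down filesystem.
 */
static int sketch_dio_iomap_begin(const struct sketch_mount *mp)
{
	assert(mp != NULL);

	if (mp->shutdown)
		return -EIO;
	return 0;
}

/*
 * Model of a ->swap_rw that reuses the normal DIO mapping callback.
 * Nothing here is swap specific, so the shutdown check applies.
 */
static int sketch_swap_rw(const struct sketch_mount *mp)
{
	return sketch_dio_iomap_begin(mp);
}
```

With shutdown set, sketch_swap_rw() returns -EIO for every request,
which is the "all swapfile IO will fail" case described above.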

Right now the swap file IO can keep going blissfully unaware of the
filesystem failure status. The open swapfile will prevent the
filesystem from being unmounted. Hence to unmount the shutdown
filesystem to correct the problem, first the swap file has to be
turned off, which means we have a fail-safe behaviour. Using the
iomap_dio_rw() path means that swapfile IO _can and will fail_.

AFAICT, swap IO errors are pretty much thrown away by the mm code;
the swap_writepage() return value is ignored or placed on the swap
cache address space and ignored. And it looks like the new read path
just sets PageError() and leaves it to callers to detect and deal
with a swapin failure because swap_readpage() is now void...

So it seems like using the DIO path introduces a whole new set of
failure cases into the swap IO path that haven't been considered
here. I can't see why we wouldn't be able to solve them, but these
considerations lead me to think that the use of DIO is based on an
incorrect assumption - DIO is not a "simple low level IO"
interface.

Hence I suspect that we'd be much better off with a new
iomap_swap_rw() implementation that just does what swap needs
without any of the complexity of the DIO API. Internally iomap can
share what it needs to share with the DIO path, but at this point
I'm not sure we should be overloading the iomap_dio_rw() path with
the semantics required by swap.

e.g. we limit iomap_swap_rw() to only accept written or unwritten
block mappings within file size on inodes with clean metadata (i.e.
pure overwrite to guarantee no modification transactions), and then
the fs provided ->iomap_begin callback can ignore shutdown state,
elide inode level locking, do read-only mappings, etc without adding
extra overhead to the existing DIO code path...
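
As a rough sketch of the kind of check such an iomap_swap_rw() could
apply to each mapping, here is a userspace model. All sketch_/SKETCH_
names and types are hypothetical, no such function exists in iomap
today, and -EINVAL is an arbitrary choice of error. It accepts only
written or unwritten mappings that cover the IO range, lie within the
file size, and belong to an inode with clean metadata.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping types, loosely modelled on iomap's. */
enum sketch_map_type {
	SKETCH_HOLE,
	SKETCH_DELALLOC,
	SKETCH_UNWRITTEN,
	SKETCH_MAPPED,
};

struct sketch_mapping {
	enum sketch_map_type type;
	uint64_t offset;	/* file offset the mapping starts at */
	uint64_t length;	/* bytes covered by the mapping */
};

struct sketch_inode {
	uint64_t size;		/* file size in bytes */
	bool metadata_dirty;	/* pending metadata modifications? */
};

/* Return 0 if swap IO over [pos, pos + len) is a pure overwrite. */
static int sketch_swap_mapping_ok(const struct sketch_inode *inode,
				  const struct sketch_mapping *map,
				  uint64_t pos, uint64_t len)
{
	assert(inode != NULL && map != NULL);

	/* Dirty metadata could force a transaction later. */
	if (inode->metadata_dirty)
		return -EINVAL;

	/* Holes and delalloc extents would need allocation. */
	if (map->type != SKETCH_MAPPED && map->type != SKETCH_UNWRITTEN)
		return -EINVAL;

	/* The IO must sit entirely inside the mapping (overflow safe). */
	if (pos < map->offset || len > map->length ||
	    pos - map->offset > map->length - len)
		return -EINVAL;

	/* The IO must not extend past the file size (overflow safe). */
	if (len > inode->size || pos > inode->size - len)
		return -EINVAL;

	return 0;
}
```

Anything that fails this check would need a transaction somewhere, so
the model refuses it up front instead of relying on every DIO callback
to get it right.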


Dave Chinner
