linux-cifs.vger.kernel.org archive mirror
* [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
@ 2021-09-24 17:17 David Howells
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: David Howells @ 2021-09-24 17:17 UTC
  To: willy, hch, trond.myklebust
  Cc: Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust,
	Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker,
	linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings,
	Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs,
	Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, dhowells,
	dhowells, darrick.wong, viro, jlayton, torvalds, linux-nfs,
	linux-mm, linux-fsdevel, linux-kernel


Hi Willy, Trond, Christoph,

Here's v3 of a change to make swapfile reads and writes use async DIO via
a new ->swap_rw() address_space method, rather than ->readpage() or
->direct_IO(), as requested by Willy.  This allows NFS to bypass the write
checks that prevent swapfiles from working, plus a bunch of other checks
that may or may not be necessary.

Whilst trying to make this work, I found that NFS's support for swapfiles
seems to have been non-functional since Aug 2019 (I think), so the first
patch fixes that.  Question is: do we actually *want* to keep this
functionality, given that it seems that no one's tested it with an upstream
kernel in the last couple of years?

There are additional patches to get rid of noop_direct_IO and replace it
with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
new ->swap_rw method and thence remove the direct BIO submission paths from
swap.

I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
that's necessary, but it seems accounting related.
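
The flag hand-off described above can be pictured with a small user-space
model.  This is only a sketch: the MODEL_* names and their bit values are
made up for illustration and are not the kernel's IOCB_SWAP/REQ_SWAP
definitions, and it does not show where the series performs the translation.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the kernel's real flag values differ. */
#define MODEL_IOCB_SWAP	(1u << 3)	/* models IOCB_SWAP on the kiocb */
#define MODEL_REQ_SWAP	(1u << 5)	/* models REQ_SWAP on the request */

/*
 * Derive request flags from kiocb flags: I/O marked as swap at the kiocb
 * level is tagged as swap at the request level; other I/O is left alone.
 */
static uint32_t model_req_flags(uint32_t ki_flags, uint32_t base_opf)
{
	if (ki_flags & MODEL_IOCB_SWAP)
		base_opf |= MODEL_REQ_SWAP;
	return base_opf;
}
```

Keeping the marker on the kiocb means a filesystem's DIO path does not need
to know it is servicing swap; only the point that builds the request does.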

The synchronous DIO paths on NFS, raw blockdev, ext4 swapfile and xfs
swapfile all seem to work fine.  Btrfs refuses to swapon because the file
might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

The async DIO paths fail spectacularly (from I/O errors to ATA failure
messages on the test disk using a normal swapspace); NFS just hangs.

My patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=swap-dio

I tested this using the procedure and program outlined in the NFS patch.

I also encountered occasional instances of the following warning with NFS, so
I'm wondering if there's a scheduling problem somewhere:

BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    in-flight: 1565:fill_page_cache_func
workqueue events_highpri: flags=0x10
  pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2
    in-flight: 1547:fill_page_cache_func
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
    in-flight: 1811:fill_page_cache_func
workqueue events_unbound: flags=0x2
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5
    pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn
workqueue events_power_efficient: flags=0x82
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6
    pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean
workqueue writeback: flags=0x4a
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4
    in-flight: 433(RESCUER):wb_workfn
workqueue rpciod: flags=0xa
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40
    in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule
    pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule
workqueue ext4-rsv-conversion: flags=0x2000a
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6
pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20
pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29
pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623
pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859

Note that this is due to DIO writes to NFS only, as far as I can tell, and
that no reads had happened yet.

Changes:
========
ver #3:
   - Introduced a new ->swap_rw() method.
   - Added feature support flags to the address_space_operations struct and
     got rid of the checks for ->direct_IO() and noop_direct_IO() and
     similar.
   - Implemented swap_rw for nfs, adjusting the direct I/O code paths.
   - Implemented swap_rw for blockdev, btrfs, ext4 and xfs.
   - Got rid of the return value on swap_readpage() as it's never checked.

ver #2:
   - Remove the callback param to __swap_writepage() as it's invariant.
   - Allocate the kiocb on the stack in sync mode.
   - Do an async DIO write if WB_SYNC_ALL isn't set.
   - Try to remove the BIO submission paths.

David

Link: https://lore.kernel.org/r/162876946134.3068428.15475611190876694695.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162879971699.3306668.8977537647318498651.stgit@warthog.procyon.org.uk/ # v2
---
David Howells (9):
      mm: Remove the callback func argument from __swap_writepage()
      mm: Add 'supports' field to the address_space_operations to list features
      mm: Make swap_readpage() void
      Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP
      mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage()
      mm: Make __swap_writepage() do async DIO if asked for it
      nfs: Fix write to swapfile failure due to generic_write_checks()
      block, btrfs, ext4, xfs: Implement swap_rw
      mm: Remove swap BIO paths and only use DIO paths


 Documentation/filesystems/vfs.rst |   8 +
 block/fops.c                      |   2 +
 drivers/block/loop.c              |   6 +-
 fs/9p/vfs_addr.c                  |   1 +
 fs/affs/file.c                    |   1 +
 fs/btrfs/inode.c                  |  14 +-
 fs/ceph/addr.c                    |  13 +-
 fs/cifs/file.c                    |  21 +-
 fs/direct-io.c                    |   2 +
 fs/erofs/data.c                   |   2 +-
 fs/exfat/inode.c                  |   1 +
 fs/ext2/inode.c                   |   4 +-
 fs/ext4/inode.c                   |  17 +-
 fs/f2fs/data.c                    |   1 +
 fs/fat/inode.c                    |   1 +
 fs/fcntl.c                        |   2 +-
 fs/fuse/dax.c                     |   2 +-
 fs/fuse/file.c                    |   1 +
 fs/gfs2/aops.c                    |   2 +-
 fs/hfs/inode.c                    |   1 +
 fs/hfsplus/inode.c                |   1 +
 fs/jfs/inode.c                    |   1 +
 fs/libfs.c                        |  12 -
 fs/nfs/direct.c                   |  28 +--
 fs/nfs/file.c                     |  15 +-
 fs/nilfs2/inode.c                 |   1 +
 fs/ntfs3/inode.c                  |   1 +
 fs/ocfs2/aops.c                   |   1 +
 fs/open.c                         |   3 +-
 fs/orangefs/inode.c               |   1 +
 fs/overlayfs/file.c               |   2 +-
 fs/overlayfs/inode.c              |   3 +-
 fs/reiserfs/inode.c               |   1 +
 fs/udf/file.c                     |   1 +
 fs/udf/inode.c                    |   1 +
 fs/xfs/xfs_aops.c                 |  13 +-
 fs/zonefs/super.c                 |   2 +-
 include/linux/bio.h               |   2 +
 include/linux/fs.h                |   7 +-
 include/linux/nfs_fs.h            |   2 +-
 include/linux/swap.h              |   2 +-
 mm/page_io.c                      | 356 +++++++++++++++---------------
 mm/swapfile.c                     |   4 +-
 43 files changed, 275 insertions(+), 287 deletions(-)




* [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-24 20:10   ` Matthew Wilcox
  2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC
  To: willy, hch, trond.myklebust
  Cc: Darrick J. Wong, Ilya Dryomov, Jeff Layton, ceph-devel,
	Steve French, linux-cifs, linux-xfs, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Rather than depending on .direct_IO to point to something to indicate that
direct I/O is supported, add a 'supports' bitmask that we can test, since
we only need one bit.

We can then remove noop_direct_IO, ceph_direct_io and cifs_direct_io.
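
As a rough user-space illustration of the idea (not part of the patch), the
check moves from "is the ->direct_IO pointer non-NULL?" to "is the feature
bit set?".  The struct below is a cut-down, hypothetical stand-in for
struct address_space_operations; only the AS_SUPPORTS_DIRECT_IO value and
the shape of the test are taken from the hunks further down.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Flag value as defined in the include/linux/fs.h hunk of this patch. */
#define AS_SUPPORTS_DIRECT_IO	0x00000001

/* Simplified model of struct address_space_operations (illustrative only). */
struct aops_model {
	unsigned int supports;		/* Bitmask of AS_SUPPORTS_* flags */
	ssize_t (*direct_IO)(void *iocb, void *iter);	/* may be NULL */
};

/* Same shape as the O_DIRECT test in do_dentry_open() after the patch. */
static int check_o_direct(const struct aops_model *a_ops)
{
	if (!a_ops || !(a_ops->supports & AS_SUPPORTS_DIRECT_IO))
		return -EINVAL;
	return 0;
}

/* iomap-style filesystem: no ->direct_IO callback, yet DIO is supported. */
static const struct aops_model iomap_like_aops = {
	.supports	= AS_SUPPORTS_DIRECT_IO,
};

/* Filesystem with no direct I/O support at all. */
static const struct aops_model no_dio_aops = { 0 };
```

With this layout, an ops table such as iomap_like_aops no longer needs a
dummy callback purely to make open(O_DIRECT) succeed.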

[Question: Some filesystems support read DIO but not write DIO - should I
 split the flag?]
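
To make the question concrete, here is one way a split could look.  This is
purely a sketch: AS_SUPPORTS_DIRECT_READ, AS_SUPPORTS_DIRECT_WRITE and
dio_direction_check() are hypothetical names that do not appear in this
series, and the combined mask is kept only so that existing users of
AS_SUPPORTS_DIRECT_IO would still build.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical split of the single flag; these names are not in the patch. */
#define AS_SUPPORTS_DIRECT_READ		0x00000001
#define AS_SUPPORTS_DIRECT_WRITE	0x00000002
#define AS_SUPPORTS_DIRECT_IO \
	(AS_SUPPORTS_DIRECT_READ | AS_SUPPORTS_DIRECT_WRITE)

/* Return 0 if direct I/O in the given direction is allowed, else -EINVAL. */
static int dio_direction_check(unsigned int supports, bool is_write)
{
	unsigned int need = is_write ? AS_SUPPORTS_DIRECT_WRITE
				     : AS_SUPPORTS_DIRECT_READ;

	return (supports & need) ? 0 : -EINVAL;
}
```

A read-only-DIO filesystem would then set just AS_SUPPORTS_DIRECT_READ, at
the cost of every caller having to know the direction when it tests the mask.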

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Darrick J. Wong <djwong@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: ceph-devel@vger.kernel.org
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 Documentation/filesystems/vfs.rst |    8 ++++++++
 block/fops.c                      |    1 +
 drivers/block/loop.c              |    6 +++---
 fs/9p/vfs_addr.c                  |    1 +
 fs/affs/file.c                    |    1 +
 fs/btrfs/inode.c                  |    2 +-
 fs/ceph/addr.c                    |   13 +------------
 fs/cifs/file.c                    |   21 +--------------------
 fs/erofs/data.c                   |    2 +-
 fs/exfat/inode.c                  |    1 +
 fs/ext2/inode.c                   |    4 +++-
 fs/ext4/inode.c                   |    8 ++++----
 fs/f2fs/data.c                    |    1 +
 fs/fat/inode.c                    |    1 +
 fs/fcntl.c                        |    2 +-
 fs/fuse/dax.c                     |    2 +-
 fs/fuse/file.c                    |    1 +
 fs/gfs2/aops.c                    |    2 +-
 fs/hfs/inode.c                    |    1 +
 fs/hfsplus/inode.c                |    1 +
 fs/jfs/inode.c                    |    1 +
 fs/libfs.c                        |   12 ------------
 fs/nfs/file.c                     |    1 +
 fs/nilfs2/inode.c                 |    1 +
 fs/ntfs3/inode.c                  |    1 +
 fs/ocfs2/aops.c                   |    1 +
 fs/open.c                         |    3 ++-
 fs/orangefs/inode.c               |    1 +
 fs/overlayfs/file.c               |    2 +-
 fs/overlayfs/inode.c              |    3 +--
 fs/reiserfs/inode.c               |    1 +
 fs/udf/file.c                     |    1 +
 fs/udf/inode.c                    |    1 +
 fs/xfs/xfs_aops.c                 |    4 ++--
 fs/zonefs/super.c                 |    2 +-
 include/linux/fs.h                |    4 +++-
 36 files changed, 53 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..abb844792d6a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -721,6 +721,7 @@ cache in your filesystem.  The following members are defined:
 .. code-block:: c
 
 	struct address_space_operations {
+		unsigned int supports;
 		int (*writepage)(struct page *page, struct writeback_control *wbc);
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
@@ -755,6 +756,13 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_deactivate)(struct file *);
 	};
 
+``supports``
+	provides a list of features supported by address_spaces using this
+	operations set.  The following feature support flags are provided:
+
+	``AS_SUPPORTS_DIRECT_IO``
+		Direct I/O is supported.
+
 ``writepage``
 	called by the VM to write a dirty page to backing store.  This
 	may happen for data integrity reasons (i.e. 'sync'), or to free
diff --git a/block/fops.c b/block/fops.c
index ffce6f6c68dd..84c64d814d0d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -384,6 +384,7 @@ const struct address_space_operations def_blk_aops = {
 	.direct_IO	= blkdev_direct_IO,
 	.migratepage	= buffer_migrate_page_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7bf4686af774..76f7a6d85815 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -237,9 +237,9 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	if (dio) {
 		if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
-				!(lo->lo_offset & dio_align) &&
-				mapping->a_ops->direct_IO &&
-				!lo->transfer)
+		    !(lo->lo_offset & dio_align) &&
+		    (mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO) &&
+		    !lo->transfer)
 			use_dio = true;
 		else
 			use_dio = false;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index cce9ace651a2..4910898af0d7 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -333,4 +333,5 @@ const struct address_space_operations v9fs_addr_operations = {
 	.invalidatepage = v9fs_invalidate_page,
 	.launder_page = v9fs_launder_page,
 	.direct_IO = v9fs_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 75ebd2b576ca..7488bd7d3e0c 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -460,6 +460,7 @@ const struct address_space_operations affs_aops = {
 	.write_end = affs_write_end,
 	.direct_IO = affs_direct_IO,
-	.bmap = _affs_bmap
+	.bmap = _affs_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static inline struct buffer_head *
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 487533c35ddb..b479c97e42fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10937,7 +10937,6 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readahead	= btrfs_readahead,
-	.direct_IO	= noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
 #ifdef CONFIG_MIGRATION
@@ -10947,6 +10946,7 @@ static const struct address_space_operations btrfs_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate	= btrfs_swap_activate,
 	.swap_deactivate = btrfs_swap_deactivate,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct inode_operations btrfs_file_inode_operations = {
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 99b80b5c7a93..086d4745b99e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1306,17 +1306,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-/*
- * we set .direct_IO to indicate direct io is supported, but since we
- * intercept O_DIRECT reads and writes early, this function should
- * never get called.
- */
-static ssize_t ceph_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-	WARN_ON(1);
-	return -EINVAL;
-}
-
 const struct address_space_operations ceph_aops = {
 	.readpage = ceph_readpage,
 	.readahead = ceph_readahead,
@@ -1327,7 +1316,7 @@ const struct address_space_operations ceph_aops = {
 	.set_page_dirty = ceph_set_page_dirty,
 	.invalidatepage = ceph_invalidatepage,
 	.releasepage = ceph_releasepage,
-	.direct_IO = ceph_direct_io,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static void ceph_block_sigs(sigset_t *oldset)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 6796fc73b304..a5787cf3d836 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4891,25 +4891,6 @@ void cifs_oplock_break(struct work_struct *work)
 	cifs_done_oplock_break(cinode);
 }
 
-/*
- * The presence of cifs_direct_io() in the address space ops vector
- * allowes open() O_DIRECT flags which would have failed otherwise.
- *
- * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests
- * so this method should never be called.
- *
- * Direct IO is not yet supported in the cached mode. 
- */
-static ssize_t
-cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-        /*
-         * FIXME
-         * Eventually need to support direct IO for non forcedirectio mounts
-         */
-        return -EINVAL;
-}
-
 static int cifs_swap_activate(struct swap_info_struct *sis,
 			      struct file *swap_file, sector_t *span)
 {
@@ -4974,7 +4955,6 @@ const struct address_space_operations cifs_addr_ops = {
 	.write_end = cifs_write_end,
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	.releasepage = cifs_release_page,
-	.direct_IO = cifs_direct_io,
 	.invalidatepage = cifs_invalidate_page,
 	.launder_page = cifs_launder_page,
 	/*
@@ -4984,6 +4964,7 @@ const struct address_space_operations cifs_addr_ops = {
 	 */
 	.swap_activate = cifs_swap_activate,
 	.swap_deactivate = cifs_swap_deactivate,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 9db829715652..30f19296b268 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -299,7 +299,7 @@ const struct address_space_operations erofs_raw_access_aops = {
 	.readpage = erofs_readpage,
 	.readahead = erofs_readahead,
 	.bmap = erofs_bmap,
-	.direct_IO = noop_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 #ifdef CONFIG_FS_DAX
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index ca37d4344361..f38f42282f54 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -500,6 +500,7 @@ static const struct address_space_operations exfat_aops = {
 	.write_end	= exfat_write_end,
 	.direct_IO	= exfat_direct_IO,
-	.bmap		= exfat_aop_bmap
+	.bmap		= exfat_aop_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static inline unsigned long exfat_hash(loff_t i_pos)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 333fa62661d5..4ad3655defd9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -974,6 +974,7 @@ const struct address_space_operations ext2_aops = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate	= block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
@@ -988,13 +989,14 @@ const struct address_space_operations ext2_nobh_aops = {
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
 	.error_remove_page	= generic_error_remove_page,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext2_dax_aops = {
 	.writepages		= ext2_dax_writepages,
-	.direct_IO		= noop_direct_IO,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.invalidatepage		= noop_invalidatepage,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..08d3541d8daa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3662,11 +3662,11 @@ static const struct address_space_operations ext4_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= noop_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
@@ -3680,10 +3680,10 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_journalled_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= noop_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_da_aops = {
@@ -3697,20 +3697,20 @@ static const struct address_space_operations ext4_da_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= noop_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_dax_aops = {
 	.writepages		= ext4_dax_writepages,
-	.direct_IO		= noop_direct_IO,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= noop_invalidatepage,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 void ext4_set_aops(struct inode *inode)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index f4fd6c246c9a..4c3643969b69 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4156,6 +4156,7 @@ const struct address_space_operations f2fs_dblock_aops = {
 #ifdef CONFIG_MIGRATION
 	.migratepage    = f2fs_migrate_page,
 #endif
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 void f2fs_clear_page_cache_dirty_tag(struct page *page)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index de0c9b013a85..4352981dfb82 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -351,6 +351,7 @@ static const struct address_space_operations fat_aops = {
 	.write_end	= fat_write_end,
 	.direct_IO	= fat_direct_IO,
-	.bmap		= _fat_bmap
+	.bmap		= _fat_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9c6c6a3e2de5..7308e8274ff9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -58,7 +58,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 	/* Pipe packetized mode is controlled by O_DIRECT flag */
 	if (!S_ISFIFO(inode->i_mode) && (arg & O_DIRECT)) {
 		if (!filp->f_mapping || !filp->f_mapping->a_ops ||
-			!filp->f_mapping->a_ops->direct_IO)
+		    !(filp->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 				return -EINVAL;
 	}
 
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 281d79f8b3d3..e39468fd7177 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1325,9 +1325,9 @@ bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi)
 
 static const struct address_space_operations fuse_dax_file_aops  = {
 	.writepages	= fuse_dax_writepages,
-	.direct_IO	= noop_direct_IO,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 	.invalidatepage	= noop_invalidatepage,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 void fuse_dax_inode_init(struct inode *inode)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 11404f8c21c7..3db64194d346 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3161,6 +3161,7 @@ static const struct address_space_operations fuse_file_aops  = {
 	.direct_IO	= fuse_direct_IO,
 	.write_begin	= fuse_write_begin,
 	.write_end	= fuse_write_end,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 void fuse_init_file_inode(struct inode *inode)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 005e920f5d4a..dc50b53d6abd 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -783,10 +783,10 @@ static const struct address_space_operations gfs2_aops = {
 	.releasepage = iomap_releasepage,
 	.invalidatepage = iomap_invalidatepage,
 	.bmap = gfs2_bmap,
-	.direct_IO = noop_direct_IO,
 	.migratepage = iomap_migrate_page,
 	.is_partially_uptodate = iomap_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations gfs2_jdata_aops = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 4a95a92546a0..5f9e5464a5bf 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -177,6 +177,7 @@ const struct address_space_operations hfs_aops = {
 	.bmap		= hfs_bmap,
 	.direct_IO	= hfs_direct_IO,
 	.writepages	= hfs_writepages,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 6fef67c2a9f0..9f0c27e5e115 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -174,6 +174,7 @@ const struct address_space_operations hfsplus_aops = {
 	.bmap		= hfsplus_bmap,
 	.direct_IO	= hfsplus_direct_IO,
 	.writepages	= hfsplus_writepages,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct dentry_operations hfsplus_dentry_operations = {
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 57ab424c05ff..a477267471a4 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -366,6 +366,7 @@ const struct address_space_operations jfs_aops = {
 	.write_end	= nobh_write_end,
 	.bmap		= jfs_bmap,
 	.direct_IO	= jfs_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/libfs.c b/fs/libfs.c
index 51b4de3b3447..c27f681291e5 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1182,18 +1182,6 @@ void noop_invalidatepage(struct page *page, unsigned int offset,
 }
 EXPORT_SYMBOL_GPL(noop_invalidatepage);
 
-ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	/*
-	 * iomap based filesystems support direct I/O without need for
-	 * this callback. However, it still needs to be set in
-	 * inode->a_ops so that open/fcntl know that direct I/O is
-	 * generally supported.
-	 */
-	return -EINVAL;
-}
-EXPORT_SYMBOL_GPL(noop_direct_IO);
-
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index aa353fd58240..7403ec6317cb 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -532,6 +532,7 @@ const struct address_space_operations nfs_file_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = nfs_swap_activate,
 	.swap_deactivate = nfs_swap_deactivate,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 2e8eb263cf0f..c57395c01817 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -307,6 +307,7 @@ const struct address_space_operations nilfs_aops = {
 	.invalidatepage		= block_invalidatepage,
 	.direct_IO		= nilfs_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static int nilfs_insert_inode_locked(struct inode *inode,
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index db2a5a4c38e4..7b3ac1ab5d04 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -1948,6 +1948,7 @@ const struct address_space_operations ntfs_aops = {
 	.direct_IO	= ntfs_direct_IO,
 	.bmap		= ntfs_bmap,
 	.set_page_dirty = __set_page_dirty_buffers,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations ntfs_aops_cmpr = {
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 68d11c295dd3..5a158975a4ff 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2466,4 +2466,5 @@ const struct address_space_operations ocfs2_aops = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate	= block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/open.c b/fs/open.c
index daa324606a41..d679dc0c1801 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -840,7 +840,8 @@ static int do_dentry_open(struct file *f,
 
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
+		if (!f->f_mapping->a_ops ||
+		    !(f->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 			return -EINVAL;
 	}
 
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index c1bb4c4b5d67..c5bad94dfbd0 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -641,6 +641,7 @@ static const struct address_space_operations orangefs_address_operations = {
 	.freepage = orangefs_freepage,
 	.launder_page = orangefs_launder_page,
 	.direct_IO = orangefs_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index d081faa55e83..87d05f1d718a 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -83,7 +83,7 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 
 	if (flags & O_DIRECT) {
 		if (!file->f_mapping->a_ops ||
-		    !file->f_mapping->a_ops->direct_IO)
+		    !(file->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 			return -EINVAL;
 	}
 
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 832b17589733..9902608b1715 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -660,8 +660,7 @@ static const struct inode_operations ovl_special_inode_operations = {
 };
 
 static const struct address_space_operations ovl_aops = {
-	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO		= noop_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index f49b72ccac4c..890d91847d58 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3436,4 +3436,5 @@ const struct address_space_operations reiserfs_address_space_operations = {
 	.bmap = reiserfs_aop_bmap,
 	.direct_IO = reiserfs_direct_IO,
 	.set_page_dirty = reiserfs_set_page_dirty,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 1baff8ddb754..2cb1b499e5c7 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -131,6 +131,7 @@ const struct address_space_operations udf_adinicb_aops = {
 	.write_begin	= udf_adinicb_write_begin,
 	.write_end	= udf_adinicb_write_end,
 	.direct_IO	= udf_adinicb_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 1d6b7a50736b..38b799b457d5 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -244,6 +244,7 @@ const struct address_space_operations udf_aops = {
 	.write_end	= generic_write_end,
 	.direct_IO	= udf_direct_IO,
 	.bmap		= udf_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 34fc6148032a..2a4570516591 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -548,17 +548,17 @@ const struct address_space_operations xfs_address_space_operations = {
 	.releasepage		= iomap_releasepage,
 	.invalidatepage		= iomap_invalidatepage,
 	.bmap			= xfs_vm_bmap,
-	.direct_IO		= noop_direct_IO,
 	.migratepage		= iomap_migrate_page,
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations xfs_dax_aops = {
 	.writepages		= xfs_dax_writepages,
-	.direct_IO		= noop_direct_IO,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.invalidatepage		= noop_invalidatepage,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index ddc346a9df9b..37ff541467e8 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -191,8 +191,8 @@ static const struct address_space_operations zonefs_file_aops = {
 	.migratepage		= iomap_migrate_page,
 	.is_partially_uptodate	= iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
-	.direct_IO		= noop_direct_IO,
 	.swap_activate		= zonefs_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static void zonefs_update_stats(struct inode *inode, loff_t new_isize)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e7a633353fd2..c909ca6c0eb6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -369,7 +369,10 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+#define AS_SUPPORTS_DIRECT_IO	0x00000001
+
 struct address_space_operations {
+	unsigned int supports; /* Bitmask of AS_SUPPORTS_* flags */
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
 
@@ -3391,7 +3394,6 @@ extern void simple_recursive_removal(struct dentry *,
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
 extern void noop_invalidatepage(struct page *page, unsigned int offset,
 		unsigned int length);
-extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 extern int simple_empty(struct dentry *);
 extern int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,




* Re: [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
@ 2021-09-24 20:10   ` Matthew Wilcox
  0 siblings, 0 replies; 10+ messages in thread
From: Matthew Wilcox @ 2021-09-24 20:10 UTC (permalink / raw)
  To: David Howells
  Cc: hch, trond.myklebust, Darrick J. Wong, Ilya Dryomov, Jeff Layton,
	ceph-devel, Steve French, linux-cifs, linux-xfs, linux-fsdevel,
	linux-mm, darrick.wong, viro, torvalds, linux-nfs, linux-kernel

On Fri, Sep 24, 2021 at 06:18:14PM +0100, David Howells wrote:
> Rather than depending on .direct_IO to point to something to indicate that
> direct I/O is supported, add a 'supports' bitmask that we can test, since
> we only need one bit.

Why would you add mapping->aops->supports instead of using one of the free
bits in mapping->flags?  enum mapping_flags in pagemap.h.

It could also be a per-fs flag, or per-sb flag, but it's fewer
dereferences at check time if it's in mapping->flags.
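
To illustrate the dereference count, here is a minimal userspace sketch,
not kernel code.  The demo_* structures are simplified stand-ins for
struct address_space and its ops table, and DEMO_AS_DIRECT_IO_BIT is a
hypothetical bit number, not an existing member of enum mapping_flags.
Only AS_SUPPORTS_DIRECT_IO comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define AS_SUPPORTS_DIRECT_IO	0x00000001u	/* value taken from the patch */
#define DEMO_AS_DIRECT_IO_BIT	7		/* hypothetical free bit in mapping->flags */

/* Simplified stand-in for struct address_space_operations. */
struct demo_aops {
	unsigned int supports;
};

/* Simplified stand-in for struct address_space. */
struct demo_mapping {
	unsigned long flags;
	const struct demo_aops *a_ops;
};

/* Patch approach: load mapping->a_ops, then load a_ops->supports. */
static bool demo_dio_via_aops(const struct demo_mapping *m)
{
	return m->a_ops && (m->a_ops->supports & AS_SUPPORTS_DIRECT_IO) != 0;
}

/* Suggested approach: a single load of mapping->flags. */
static bool demo_dio_via_flags(const struct demo_mapping *m)
{
	return (m->flags & (1UL << DEMO_AS_DIRECT_IO_BIT)) != 0;
}

/* The bit would be set once, when the mapping is set up. */
static void demo_mapping_init(struct demo_mapping *m,
			      const struct demo_aops *ops)
{
	m->flags = 0;
	m->a_ops = ops;
	if (ops && (ops->supports & AS_SUPPORTS_DIRECT_IO))
		m->flags |= 1UL << DEMO_AS_DIRECT_IO_BIT;
}
```

Both helpers give the same answer once demo_mapping_init() has run; the
flags variant just skips the pointer chase through a_ops.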



* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
@ 2021-09-25 23:42 ` Dave Chinner
  2021-09-26  3:10   ` Matthew Wilcox
  2021-09-27 20:07 ` David Sterba
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2021-09-25 23:42 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block,
	ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton,
	Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote:
> 
> Hi Willy, Trond, Christoph,
> 
> Here's v3 of a change to make reads and writes from the swapfile use async
> DIO, adding a new ->swap_rw() address_space method, rather than readpage()
> or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
> checks that prevent swapfiles from working, plus a bunch of other checks
> that may or may not be necessary.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> There are additional patches to get rid of noop_direct_IO and replace it
> with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> new ->swap_rw method and thence remove the direct BIO submission paths from
> swap.
> 
> I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> that's necessary, but it seems accounting related.
> 
> The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
> swapfile all seem to work fine.  Btrfs refuses to swapon because the file
> might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

Ok, so if the filesystem is doing block mapping in the IO path now,
why does the swap file still need to map the file into a private
block mapping?  i.e. all the work that iomap_swapfile_activate()
does for filesystems like XFS and ext4 - isn't this completely
redundant now that we are doing block mapping during swap file IO
via iomap_dio_rw()?

Actually, that path does all the "can we use this file as a swap
file" checking. So the extent iteration can't go away, just the swap
file mapping part (iomap_swapfile_add_extent()). This is necessary
to ensure there aren't any holes in the file, and we still need that
because the DIO write path will allocate into holes, which leads
me to my main concern here.

Using the DIO path opens up the possibility that the filesystem
could want to run transactions as part of the DIO. Right now we
support unwritten extents for swap files (so they don't have to be
written to allocate the backing store before activation) and that
means we'll be doing DIO to unwritten extents. IO completion of a
DIO write to an unwritten extent will run a transaction to convert
that extent to written. A similar problem with sparse files exists,
because allocation of blocks can be done from the DIO path, and that
requires transactions. File extension is another potential
transaction path we open up by using DIO writes for swap.

The problem is that a transaction run in swap IO context will
deadlock the filesystem. Either through the unbound memory demand of
metadata modification, or from needing log space that can't be freed
up because the metadata IO that will free the log space is waiting
on memory allocation that is waiting on swap IO...

I think some more thought needs to be put into controlling the
behaviour/semantics of the DIO path so that it can be safely used
by swap IO, because it's not a direct 1:1 behavioural mapping with
existing DIO and there are potential deadlock vectors we need to
avoid.
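
As a rough illustration of the activation-time check that has to stay
(the extent iteration that rejects holes), here is a userspace sketch.
It is not the iomap_swapfile_activate() code; the demo_* types and the
function name are hypothetical, and a gap between extents is treated
the same as an explicit hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_extent_type { DEMO_EXT_HOLE, DEMO_EXT_UNWRITTEN, DEMO_EXT_MAPPED };

/* Hypothetical, simplified description of one file extent. */
struct demo_extent {
	uint64_t offset;	/* byte offset in the file */
	uint64_t length;	/* length in bytes */
	enum demo_extent_type type;
};

/*
 * Walk the extents in file order and return 0 only if they cover
 * [0, size) contiguously with no holes.  Unwritten extents are
 * accepted, as they are for swap files today.
 */
static int demo_swapfile_check(const struct demo_extent *ext, size_t n,
			       uint64_t size)
{
	uint64_t pos = 0;

	for (size_t i = 0; i < n && pos < size; i++) {
		if (ext[i].offset != pos)	/* gap between extents */
			return -1;
		if (ext[i].type == DEMO_EXT_HOLE || ext[i].length == 0)
			return -1;
		pos += ext[i].length;
	}
	return pos >= size ? 0 : -1;
}
```

A check like this only proves the layout at swapon time; it says
nothing about what the write path will do later, which is the concern
above.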

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
@ 2021-09-26  3:10   ` Matthew Wilcox
  2021-09-26 22:36     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2021-09-26  3:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Howells, hch, trond.myklebust, Theodore Ts'o,
	linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong,
	Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> Ok, so if the filesystem is doing block mapping in the IO path now,
> why does the swap file still need to map the file into a private
> block mapping?  i.e. all the work that iomap_swapfile_activate()
> does for filesystems like XFS and ext4 - isn't this completely
> redundant now that we are doing block mapping during swap file IO
> via iomap_dio_rw()?

Hi Dave,

Thanks for bringing up all these points.  I think they all deserve to go
into the documentation as "things to consider" for people implementing
->swap_rw for their filesystem.

Something I don't think David perhaps made sufficiently clear is that
regular DIO from userspace gets handled by ->read_iter and ->write_iter.
This ->swap_rw op is used exclusively for, as the name suggests, swap DIO.
So filesystems don't have to handle swap DIO and regular DIO the same
way, and can split the allocation work between ->swap_activate and the
iomap callback as they see fit (as long as they can guarantee the lack
of deadlocks under memory pressure).

There are several advantages to using the DIO infrastructure for
swap:

 - unify block & net swap paths
 - allow filesystems to _see_ swap IOs instead of being bypassed
 - get rid of the swap extent rbtree
 - allow writing compound pages to swap files instead of splitting
   them
 - allow ->readpage to be synchronous for better error reporting
 - remove page_file_mapping() and page_file_offset()

I suspect there are several problems with this patchset, but I'm not
likely to have a chance to read it closely for a few days.  If you
have time to give the XFS parts a good look, that would be fantastic.


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-26  3:10   ` Matthew Wilcox
@ 2021-09-26 22:36     ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2021-09-26 22:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, hch, trond.myklebust, Theodore Ts'o,
	linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong,
	Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Sun, Sep 26, 2021 at 04:10:43AM +0100, Matthew Wilcox wrote:
> On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> > Ok, so if the filesystem is doing block mapping in the IO path now,
> > why does the swap file still need to map the file into a private
> > block mapping?  i.e. all the work that iomap_swapfile_activate()
> > does for filesystems like XFS and ext4 - isn't this completely
> > redundant now that we are doing block mapping during swap file IO
> > via iomap_dio_rw()?
> 
> Hi Dave,
> 
> Thanks for bringing up all these points.  I think they all deserve to go
> into the documentation as "things to consider" for people implementing
> ->swap_rw for their filesystem.
> 
> Something I don't think David perhaps made sufficiently clear is that
> regular DIO from userspace gets handled by ->read_iter and ->write_iter.
> This ->swap_rw op is used exclusively for, as the name suggests, swap DIO.
> So filesystems don't have to handle swap DIO and regular DIO the same
> way, and can split the allocation work between ->swap_activate and the
> iomap callback as they see fit (as long as they can guarantee the lack
> of deadlocks under memory pressure).

I understand this completely.

The point is that the implementation of ->swap_rw is to call
iomap_dio_rw() with the same ops as the normal DIO read/write path
uses. IOWs, apart from the IOCB_SWAP flag, there is no practical
difference between the "swap DIO" and "normal DIO" I/O paths.

> There are several advantages to using the DIO infrastructure for
> swap:
> 
>  - unify block & net swap paths
>  - allow filesystems to _see_ swap IOs instead of being bypassed
>  - get rid of the swap extent rbtree
>  - allow writing compound pages to swap files instead of splitting
>    them
>  - allow ->readpage to be synchronous for better error reporting
>  - remove page_file_mapping() and page_file_offset()
> 
> I suspect there are several problems with this patchset, but I'm not
> likely to have a chance to read it closely for a few days.  If you
> have time to give the XFS parts a good look, that would be fantastic.

That's what I've already done, and all the questions I've raised are
from asking a simple question: what happens if a transaction is
required to complete the iomap_dio_rw() swap write operation?

I mean, this is similar to the problems with IOCB_NOWAIT - we're
supposed to return -EAGAIN if we might block during IO submission,
and one of those situations we have to consider is "do we need to
run a transaction". If we get it wrong (and we do!), then the worst
thing that happens is that there is a long latency for IO
submission. It's a minor performance issue, not the end of the
world.

The difference with IOCB_SWAP is that "don't do transactions during
iomap_dio_rw()" is a _hard requirement_ on both IO submission and
completion. That means, from now and forever, we will have to
guarantee a path through iomap_dio_rw() that will never run
transactions on an IO. That requirement needs to be enforced in
every block mapping callback into each filesystem, as this is
something the iomap infrastructure cannot enforce. Hence we'll have
to plumb IOCB_SWAP into a new IOMAP_SWAP iterator flag to pass to
the ->iomap_begin() DIO methods to ensure they do the right thing.
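
To make the plumbing concrete, here is a userspace sketch of what such
a check in a block mapping callback might look like.  DEMO_IOMAP_SWAP
stands in for the proposed IOMAP_SWAP flag, which does not exist yet;
the request structure, the helper names and the choice of -EIO as the
error are all assumptions for illustration, not any filesystem's
actual ->iomap_begin().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_IOMAP_WRITE	0x1u
#define DEMO_IOMAP_SWAP		0x2u	/* hypothetical, models the proposed flag */

enum demo_map_state { DEMO_MAP_HOLE, DEMO_MAP_UNWRITTEN, DEMO_MAP_WRITTEN };

/* Hypothetical summary of one mapping request. */
struct demo_iomap_req {
	unsigned int flags;
	enum demo_map_state state;	/* state of the blocks under the IO */
	bool extends_file;		/* IO goes past the current file size */
};

/*
 * On the DIO write path, allocation into a hole, unwritten extent
 * conversion at completion and file extension all need a transaction.
 */
static bool demo_needs_transaction(const struct demo_iomap_req *r)
{
	if (!(r->flags & DEMO_IOMAP_WRITE))
		return false;
	return r->state != DEMO_MAP_WRITTEN || r->extends_file;
}

/* Swap IO must fail rather than run a transaction. */
static int demo_iomap_begin(const struct demo_iomap_req *r)
{
	if ((r->flags & DEMO_IOMAP_SWAP) && demo_needs_transaction(r))
		return -EIO;
	return 0;
}
```

The point is that this test has to live in each filesystem's callback,
because only the filesystem knows which mappings imply a transaction.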

And then the question becomes: what happens if the filesystem cannot
do the right thing? Can the swap code handle an error? e.g. the
first thing that xfs_direct_write_iomap_begin() and
xfs_read_iomap_begin() do is check if the filesystem is shut down
and return -EIO in that case. IOWs, we've now got normal filesystem
"reject all IO" corruption protection mechanisms in play. Using
iomap_dio_rw() as it stands means that _all swapfile IO will fail_
if the filesystem shuts down.

Right now the swap file IO can keep going blissfully unaware of the
filesystem failure status. The open swapfile will prevent the
filesystem from being unmounted. Hence to unmount the shutdown
filesystem to correct the problem, first the swap file has to be
turned off, which means we have a fail-safe behaviour. Using the
iomap_dio_rw() path means that swapfile IO _can and will fail_.

AFAICT, swap IO errors are pretty much thrown away by the mm code;
the swap_writepage() return value is ignored or placed on the swap
cache address space and ignored. And it looks like the new read path
just sets PageError() and leaves it to callers to detect and deal
with a swapin failure because swap_readpage() is now void...

So it seems like there's a whole new set of failure cases that using
the DIO path introduces into the swap IO path that haven't been
considered here. I can't see why we wouldn't be able to solve them,
but these considerations lead me to think that use of DIO is
based on an incorrect assumption - DIO is not a "simple low level
IO" interface.

Hence I suspect that we'd be much better off with a new
iomap_swap_rw() implementation that just does what swap needs
without any of the complexity of the DIO API. Internally iomap can
share what it needs to share with the DIO path, but at this point
I'm not sure we should be overloading the iomap_dio_rw() path with
the semantics required by swap.

e.g. we limit iomap_swap_rw() to only accept written or unwritten
block mappings within file size on inodes with clean metadata (i.e.
pure overwrite to guarantee no modification transactions), and then
the fs provided ->iomap_begin callback can ignore shutdown state,
elide inode level locking, do read-only mappings, etc without adding
extra overhead to the existing DIO code path...
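
A userspace sketch of the kind of up-front restriction described
above, under stated assumptions: iomap_swap_rw() does not exist, and
the demo_* structure, its fields and the -EINVAL return value are
hypothetical choices made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum demo_swap_map {
	DEMO_SWAP_HOLE,
	DEMO_SWAP_DELALLOC,
	DEMO_SWAP_UNWRITTEN,
	DEMO_SWAP_WRITTEN,
};

/* Hypothetical description of one swap IO and the mapping beneath it. */
struct demo_swap_io {
	uint64_t pos;			/* byte offset of the IO */
	uint64_t len;			/* length of the IO in bytes */
	uint64_t i_size;		/* current file size */
	enum demo_swap_map type;	/* mapping type under the IO */
	bool metadata_dirty;		/* inode has unlogged metadata changes */
};

/*
 * Accept only pure overwrites: the IO lies entirely within the file
 * size, the blocks are written or unwritten, and the inode metadata
 * is clean.  Anything else is rejected before any IO is issued.
 */
static int demo_swap_rw_check(const struct demo_swap_io *io)
{
	if (io->len == 0 || io->pos > io->i_size ||
	    io->len > io->i_size - io->pos)
		return -EINVAL;
	if (io->type != DEMO_SWAP_WRITTEN && io->type != DEMO_SWAP_UNWRITTEN)
		return -EINVAL;
	if (io->metadata_dirty)
		return -EINVAL;
	return 0;
}
```

The range test is written as a subtraction from i_size so that
pos + len cannot overflow.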

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
  2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
@ 2021-09-27 20:07 ` David Sterba
  2021-09-28  3:11 ` NeilBrown
  2021-09-29 15:45 ` David Howells
  4 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2021-09-27 20:07 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block,
	ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton,
	Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote:
> 
> Hi Willy, Trond, Christoph,
> 
> Here's v3 of a change to make reads and writes from the swapfile use async
> DIO, adding a new ->swap_rw() address_space method, rather than readpage()
> or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
> checks that prevent swapfiles from working, plus a bunch of other checks
> that may or may not be necessary.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> There are additional patches to get rid of noop_direct_IO and replace it
> with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> new ->swap_rw method and thence remove the direct BIO submission paths from
> swap.
> 
> I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> that's necessary, but it seems accounting related.
> 
> The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
> swapfile all seem to work fine.  Btrfs refuses to swapon because the file
> might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

There was probably some step missing. The file must not have holes, so
either do 'dd' to the right size or use fallocate (which is recommended
in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests
exercising swapfile (grep -l _format_swapfile tests/generic/*) so you
could try that without having to set up the swapfile manually.


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (2 preceding siblings ...)
  2021-09-27 20:07 ` David Sterba
@ 2021-09-28  3:11 ` NeilBrown
  2021-09-30 15:54   ` Steve French
  2021-09-29 15:45 ` David Howells
  4 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2021-09-28  3:11 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block,
	ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton,
	Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, Dan Magenheimer,
	linux-nfs, Ilya Dryomov, linux-btrfs, dhowells, viro, torvalds,
	linux-kernel

On Sat, 25 Sep 2021, David Howells wrote:
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?

SUSE definitely want to keep this functionality.  We have customers
using it.
I agree it would be good if it was being tested somewhere....

Thanks,
NeilBrown


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (3 preceding siblings ...)
  2021-09-28  3:11 ` NeilBrown
@ 2021-09-29 15:45 ` David Howells
  4 siblings, 0 replies; 10+ messages in thread
From: David Howells @ 2021-09-29 15:45 UTC (permalink / raw)
  To: dsterba
  Cc: dhowells, willy, Chris Mason, linux-block, ceph-devel, linux-mm,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, linux-nfs,
	Ilya Dryomov, linux-btrfs, linux-kernel

David Sterba <dsterba@suse.cz> wrote:

> > There are additional patches to get rid of noop_direct_IO and replace it
> > with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> > new ->swap_rw method and thence remove the direct BIO submission paths from
> > swap.
> > 
> > I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> > that's necessary, but it seems accounting related.
>
> There was probably some step missing. The file must not have holes, so
> either do 'dd' to the right size or use fallocate (which is recommended
> in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests
> exercising swapfile (grep -l _format_swapfile tests/generic/*) so you
> could try that without having to set up the swapfile manually.

Yeah.  As advised elsewhere, I removed the file and recreated it, doing the
chattr before extending the file.  At that point swapon succeeded.  Swapping
itself didn't work, though, and various userspace programs started dying.  I'm
guessing my btrfs_swap_rw() is wrong somehow.

David



* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-28  3:11 ` NeilBrown
@ 2021-09-30 15:54   ` Steve French
  0 siblings, 0 replies; 10+ messages in thread
From: Steve French @ 2021-09-30 15:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, Matthew Wilcox, Christoph Hellwig,
	Trond Myklebust, Theodore Ts'o, linux-block, ceph-devel,
	Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger,
	Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik,
	Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4,
	CIFS, Chris Mason, David Sterba, Minchan Kim, Steve French,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, Al Viro,
	Linus Torvalds, LKML

On Mon, Sep 27, 2021 at 10:12 PM NeilBrown <neilb@suse.de> wrote:
>
> On Sat, 25 Sep 2021, David Howells wrote:
> > Whilst trying to make this work, I found that NFS's support for swapfiles
> > seems to have been non-functional since Aug 2019 (I think), so the first
> > patch fixes that.  Question is: do we actually *want* to keep this
> > functionality, given that it seems that no one's tested it with an upstream
> > kernel in the last couple of years?
>
> SUSE definitely want to keep this functionality.  We have customers
> using it.
> I agree it would be good if it was being tested somewhere....
>

I am trying to work through the testing of swap over SMB3 mounts,
since there are use cases where you need to expand the swap space
to remote storage, so this requirement comes up.  The main difficulty
I run into is forgetting to mount with the options that store mode
bits (so that the swap file has the right permissions); debugging some
of the xfstests relating to swap can also be a little confusing.

-- 
Thanks,

Steve

