* [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
@ 2021-09-24 17:17 David Howells
  2021-09-24 17:18 ` [PATCH v3 1/9] mm: Remove the callback func argument from __swap_writepage() David Howells
                   ` (12 more replies)
  0 siblings, 13 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:17 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust,
	Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker,
	linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings,
	Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs,
	Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, dhowells,
	dhowells, darrick.wong, viro, jlayton, torvalds, linux-nfs,
	linux-mm, linux-fsdevel, linux-kernel


Hi Willy, Trond, Christoph,

Here's v3 of a change to make reads and writes from the swapfile use async
DIO, adding a new ->swap_rw() address_space method, rather than readpage()
or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
checks that prevent swapfiles from working, plus a bunch of other checks
that may or may not be necessary.
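
As a rough illustration of the idea (not code from this series), the
userspace sketch below models swap I/O being routed through a dedicated
per-filesystem hook instead of through the generic read/write paths and
their checks.  All mock_* names are hypothetical, and the
(iocb, iter)-style signature is an assumption made for the example:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct kiocb and struct iov_iter. */
struct mock_kiocb {
	long pos;		/* byte offset into the swapfile */
	int is_write;		/* non-zero for swap-out, zero for swap-in */
};

struct mock_iter {
	void *buf;		/* page-sized buffer being swapped */
	size_t len;
};

/* Hypothetical operations table with a dedicated swap I/O hook. */
struct mock_swap_aops {
	long (*swap_rw)(struct mock_kiocb *iocb, struct mock_iter *iter);
};

static char mock_backing[8192];	/* pretend swapfile contents */

/* A filesystem's hook: moves data directly, with no generic write checks. */
static long mock_fs_swap_rw(struct mock_kiocb *iocb, struct mock_iter *iter)
{
	if (iocb->pos < 0 ||
	    (size_t)iocb->pos + iter->len > sizeof(mock_backing))
		return -1;
	if (iocb->is_write)
		memcpy(mock_backing + iocb->pos, iter->buf, iter->len);
	else
		memcpy(iter->buf, mock_backing + iocb->pos, iter->len);
	return (long)iter->len;
}

/* The swap code's side: build the request and call the hook if present. */
static long mock_swap_page_io(const struct mock_swap_aops *aops, long pos,
			      void *page, size_t len, int is_write)
{
	struct mock_kiocb iocb = { .pos = pos, .is_write = is_write };
	struct mock_iter iter = { .buf = page, .len = len };

	if (!aops->swap_rw)
		return -1;
	return aops->swap_rw(&iocb, &iter);
}
```

The point of the sketch is only that the swap code has a single entry
point per filesystem for both directions.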

Whilst trying to make this work, I found that NFS's support for swapfiles
seems to have been non-functional since Aug 2019 (I think), so the first
patch fixes that.  Question is: do we actually *want* to keep this
functionality, given that it seems that no one's tested it with an upstream
kernel in the last couple of years?

There are additional patches to get rid of noop_direct_IO and replace it
with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
new ->swap_rw method and thence remove the direct BIO submission paths from
swap.

I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
that's necessary, but it seems to be accounting-related.
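
To make the flag translation concrete, here is a minimal userspace
sketch of a per-iocb flag being turned into a per-request flag when the
I/O is built.  The MOCK_* names and bit values are invented for the
example and are not the kernel's definitions:

```c
/* Hypothetical flag values, chosen only for this illustration. */
#define MOCK_IOCB_SWAP		(1u << 0)	/* set by the swap code on the iocb */
#define MOCK_REQ_OP_WRITE	1u		/* request opcode */
#define MOCK_REQ_SWAP		(1u << 8)	/* marks the request as swap I/O */

/* Derive the request flags for a write from the iocb flags. */
static unsigned int mock_dio_write_opf(unsigned int iocb_flags)
{
	unsigned int opf = MOCK_REQ_OP_WRITE;

	if (iocb_flags & MOCK_IOCB_SWAP)
		opf |= MOCK_REQ_SWAP;
	return opf;
}
```

With this shape, only I/O that the swap code itself submitted gets the
swap marker; ordinary direct writes are left alone.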

The synchronous DIO paths on NFS, raw blockdev, ext4 swapfiles and xfs
swapfiles all seem to work fine.  Btrfs refuses to swapon because the file
might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

The async DIO paths fail spectacularly, with symptoms ranging from I/O
errors to ATA failure messages on the test disk when using a normal
swapspace; NFS just hangs.

My patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=swap-dio

I tested this using the procedure and program outlined in the NFS patch.

I also encountered occasional instances of the following warning with NFS, so
I'm wondering if there's a scheduling problem somewhere:

BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    in-flight: 1565:fill_page_cache_func
workqueue events_highpri: flags=0x10
  pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2
    in-flight: 1547:fill_page_cache_func
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
    in-flight: 1811:fill_page_cache_func
workqueue events_unbound: flags=0x2
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5
    pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn
workqueue events_power_efficient: flags=0x82
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6
    pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean
workqueue writeback: flags=0x4a
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4
    in-flight: 433(RESCUER):wb_workfn
workqueue rpciod: flags=0xa
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40
    in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule
    pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule
workqueue ext4-rsv-conversion: flags=0x2000a
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6
pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20
pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29
pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623
pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859

Note that this is due to DIO writes to NFS only, as far as I can tell, and
that no reads had happened yet.

Changes:
========
ver #3:
   - Introduced a new ->swap_rw() method.
   - Added feature support flags to the address_space_operations struct and
     got rid of the checks for ->direct_IO() and noop_direct_IO() and
     similar.
   - Implemented swap_rw for nfs, adjusting the direct I/O code paths.
   - Implemented swap_rw for blockdev, btrfs, ext4 and xfs.
   - Got rid of the return value on swap_readpage() as it's never checked.

ver #2:
   - Remove the callback param to __swap_writepage() as it's invariant.
   - Allocate the kiocb on the stack in sync mode.
   - Do an async DIO write if WB_SYNC_ALL isn't set.
   - Try to remove the BIO submission paths.

David

Link: https://lore.kernel.org/r/162876946134.3068428.15475611190876694695.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162879971699.3306668.8977537647318498651.stgit@warthog.procyon.org.uk/ # v2
---
David Howells (9):
      mm: Remove the callback func argument from __swap_writepage()
      mm: Add 'supports' field to the address_space_operations to list features
      mm: Make swap_readpage() void
      Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP
      mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage()
      mm: Make __swap_writepage() do async DIO if asked for it
      nfs: Fix write to swapfile failure due to generic_write_checks()
      block, btrfs, ext4, xfs: Implement swap_rw
      mm: Remove swap BIO paths and only use DIO paths


 Documentation/filesystems/vfs.rst |   8 +
 block/fops.c                      |   2 +
 drivers/block/loop.c              |   6 +-
 fs/9p/vfs_addr.c                  |   1 +
 fs/affs/file.c                    |   1 +
 fs/btrfs/inode.c                  |  14 +-
 fs/ceph/addr.c                    |  13 +-
 fs/cifs/file.c                    |  21 +-
 fs/direct-io.c                    |   2 +
 fs/erofs/data.c                   |   2 +-
 fs/exfat/inode.c                  |   1 +
 fs/ext2/inode.c                   |   4 +-
 fs/ext4/inode.c                   |  17 +-
 fs/f2fs/data.c                    |   1 +
 fs/fat/inode.c                    |   1 +
 fs/fcntl.c                        |   2 +-
 fs/fuse/dax.c                     |   2 +-
 fs/fuse/file.c                    |   1 +
 fs/gfs2/aops.c                    |   2 +-
 fs/hfs/inode.c                    |   1 +
 fs/hfsplus/inode.c                |   1 +
 fs/jfs/inode.c                    |   1 +
 fs/libfs.c                        |  12 -
 fs/nfs/direct.c                   |  28 +--
 fs/nfs/file.c                     |  15 +-
 fs/nilfs2/inode.c                 |   1 +
 fs/ntfs3/inode.c                  |   1 +
 fs/ocfs2/aops.c                   |   1 +
 fs/open.c                         |   3 +-
 fs/orangefs/inode.c               |   1 +
 fs/overlayfs/file.c               |   2 +-
 fs/overlayfs/inode.c              |   3 +-
 fs/reiserfs/inode.c               |   1 +
 fs/udf/file.c                     |   1 +
 fs/udf/inode.c                    |   1 +
 fs/xfs/xfs_aops.c                 |  13 +-
 fs/zonefs/super.c                 |   2 +-
 include/linux/bio.h               |   2 +
 include/linux/fs.h                |   7 +-
 include/linux/nfs_fs.h            |   2 +-
 include/linux/swap.h              |   2 +-
 mm/page_io.c                      | 356 +++++++++++++++---------------
 mm/swapfile.c                     |   4 +-
 43 files changed, 275 insertions(+), 287 deletions(-)




* [PATCH v3 1/9] mm: Remove the callback func argument from __swap_writepage()
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Darrick J. Wong, Seth Jennings, Bob Liu, Minchan Kim,
	Dan Magenheimer, linux-block, linux-xfs, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Remove the callback func argument from __swap_writepage() as it's
end_swap_bio_write() in both places that call it.
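
The simplification can be pictured with a small userspace sketch.  The
mock_* names are hypothetical stand-ins and this is not the kernel code
itself; it only contrasts passing an invariant completion handler with
fixing it inside the function:

```c
/* Hypothetical stand-in for struct bio and its completion hook. */
struct mock_bio {
	void (*end_io)(struct mock_bio *bio);
	int completed;
};

/* The one completion handler that every caller used. */
static void mock_end_write(struct mock_bio *bio)
{
	bio->completed = 1;
}

/* Before: each caller has to pass the same handler in. */
static int mock_writepage_old(struct mock_bio *bio,
			      void (*end_write_func)(struct mock_bio *))
{
	bio->end_io = end_write_func;
	bio->end_io(bio);	/* pretend the I/O completed immediately */
	return 0;
}

/* After: the handler is an implementation detail and can be static. */
static int mock_writepage_new(struct mock_bio *bio)
{
	bio->end_io = mock_end_write;
	bio->end_io(bio);	/* pretend the I/O completed immediately */
	return 0;
}
```

Both variants end up with the same handler installed; the second just
stops exposing it in the interface.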

This reverts:

	commit 1eec6702a80e04416d528846a5ff2122484d95ec
	mm: allow for outstanding swap writeback accounting

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Darrick J. Wong <djwong@kernel.org>
cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
cc: Bob Liu <bob.liu@oracle.com>
cc: Minchan Kim <minchan@kernel.org>
cc: Dan Magenheimer <dan.magenheimer@oracle.com>
cc: linux-block@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 include/linux/swap.h |    4 +---
 mm/page_io.c         |    9 ++++-----
 mm/zswap.c           |    2 +-
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba52f3a3478e..576d40e33b1f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,9 +418,7 @@ extern void kswapd_stop(int nid);
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct page *page, bool do_poll);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio);
-extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
-	bio_end_io_t end_write_func);
+int __swap_writepage(struct page *page, struct writeback_control *wbc);
 extern int swap_set_page_dirty(struct page *page);
 
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
diff --git a/mm/page_io.c b/mm/page_io.c
index c493ce9ebcf5..afd18f6ec09e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,7 +26,7 @@
 #include <linux/uio.h>
 #include <linux/sched/task.h>
 
-void end_swap_bio_write(struct bio *bio)
+static void end_swap_bio_write(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
 
@@ -249,7 +249,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		end_page_writeback(page);
 		goto out;
 	}
-	ret = __swap_writepage(page, wbc, end_swap_bio_write);
+	ret = __swap_writepage(page, wbc);
 out:
 	return ret;
 }
@@ -282,8 +282,7 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
-int __swap_writepage(struct page *page, struct writeback_control *wbc,
-		bio_end_io_t end_write_func)
+int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct bio *bio;
 	int ret;
@@ -341,7 +340,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	bio_set_dev(bio, sis->bdev);
 	bio->bi_iter.bi_sector = swap_page_sector(page);
 	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
-	bio->bi_end_io = end_write_func;
+	bio->bi_end_io = end_swap_bio_write;
 	bio_add_page(bio, page, thp_size(page), 0);
 
 	bio_associate_blkg_from_page(bio, page);
diff --git a/mm/zswap.c b/mm/zswap.c
index 7944e3e57e78..f38e34917aa3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1011,7 +1011,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
 	SetPageReclaim(page);
 
 	/* start writeback */
-	__swap_writepage(page, &wbc, end_swap_bio_write);
+	__swap_writepage(page, &wbc);
 	put_page(page);
 	zswap_written_back_pages++;
 




* [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
  2021-09-24 17:18 ` [PATCH v3 1/9] mm: Remove the callback func argument from __swap_writepage() David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-24 20:10   ` Matthew Wilcox
  2021-09-24 17:18 ` [PATCH v3 3/9] mm: Make swap_readpage() void David Howells
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Darrick J. Wong, Ilya Dryomov, Jeff Layton, ceph-devel,
	Steve French, linux-cifs, linux-xfs, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Rather than depending on .direct_IO to point to something to indicate that
direct I/O is supported, add a 'supports' bitmask that we can test, since
we only need one bit.

We can then remove noop_direct_IO, ceph_direct_io and cifs_direct_io.
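
As a minimal userspace sketch of the check this enables (the mock_*
names and the flag's bit value are assumptions made for the example),
support is tested through the bitmask, so an operations table no longer
needs a dummy function pointer just to advertise direct I/O:

```c
#include <stdbool.h>
#include <stddef.h>

#define MOCK_AS_SUPPORTS_DIRECT_IO	0x1u	/* hypothetical bit value */

/* Hypothetical cut-down operations table. */
struct mock_aops {
	unsigned int supports;				/* feature bitmask */
	long (*direct_IO)(void *iocb, void *iter);	/* may be NULL */
};

/* O_DIRECT is allowed iff the feature bit is set; the pointer is ignored. */
static bool mock_can_open_o_direct(const struct mock_aops *aops)
{
	return aops && (aops->supports & MOCK_AS_SUPPORTS_DIRECT_IO);
}
```

A table that handles direct I/O elsewhere would set only the bit and
leave the pointer NULL, which is the case the no-op stubs used to cover.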

[Question: Some filesystems support read DIO but not write DIO - should I
 split the flag?]

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Darrick J. Wong <djwong@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: ceph-devel@vger.kernel.org
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 Documentation/filesystems/vfs.rst |    8 ++++++++
 block/fops.c                      |    1 +
 drivers/block/loop.c              |    6 +++---
 fs/9p/vfs_addr.c                  |    1 +
 fs/affs/file.c                    |    1 +
 fs/btrfs/inode.c                  |    2 +-
 fs/ceph/addr.c                    |   13 +------------
 fs/cifs/file.c                    |   21 +--------------------
 fs/erofs/data.c                   |    2 +-
 fs/exfat/inode.c                  |    1 +
 fs/ext2/inode.c                   |    4 +++-
 fs/ext4/inode.c                   |    8 ++++----
 fs/f2fs/data.c                    |    1 +
 fs/fat/inode.c                    |    1 +
 fs/fcntl.c                        |    2 +-
 fs/fuse/dax.c                     |    2 +-
 fs/fuse/file.c                    |    1 +
 fs/gfs2/aops.c                    |    2 +-
 fs/hfs/inode.c                    |    1 +
 fs/hfsplus/inode.c                |    1 +
 fs/jfs/inode.c                    |    1 +
 fs/libfs.c                        |   12 ------------
 fs/nfs/file.c                     |    1 +
 fs/nilfs2/inode.c                 |    1 +
 fs/ntfs3/inode.c                  |    1 +
 fs/ocfs2/aops.c                   |    1 +
 fs/open.c                         |    3 ++-
 fs/orangefs/inode.c               |    1 +
 fs/overlayfs/file.c               |    2 +-
 fs/overlayfs/inode.c              |    3 +--
 fs/reiserfs/inode.c               |    1 +
 fs/udf/file.c                     |    1 +
 fs/udf/inode.c                    |    1 +
 fs/xfs/xfs_aops.c                 |    4 ++--
 fs/zonefs/super.c                 |    2 +-
 include/linux/fs.h                |    4 +++-
 36 files changed, 53 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..abb844792d6a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -721,6 +721,7 @@ cache in your filesystem.  The following members are defined:
 .. code-block:: c
 
 	struct address_space_operations {
+		unsigned int supports;
 		int (*writepage)(struct page *page, struct writeback_control *wbc);
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
@@ -755,6 +756,13 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_deactivate)(struct file *);
 	};
 
+``supports``
+	provides a list of features supported by address_spaces using this
+	operations set.  The following feature support flags are provided:
+
+	``AS_SUPPORTS_DIRECT_IO``
+		Direct I/O is supported.
+
 ``writepage``
 	called by the VM to write a dirty page to backing store.  This
 	may happen for data integrity reasons (i.e. 'sync'), or to free
diff --git a/block/fops.c b/block/fops.c
index ffce6f6c68dd..84c64d814d0d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -384,6 +384,7 @@ const struct address_space_operations def_blk_aops = {
 	.direct_IO	= blkdev_direct_IO,
 	.migratepage	= buffer_migrate_page_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7bf4686af774..76f7a6d85815 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -237,9 +237,9 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	if (dio) {
 		if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
-				!(lo->lo_offset & dio_align) &&
-				mapping->a_ops->direct_IO &&
-				!lo->transfer)
+		    !(lo->lo_offset & dio_align) &&
+		    (mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO) &&
+		    !lo->transfer)
 			use_dio = true;
 		else
 			use_dio = false;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index cce9ace651a2..4910898af0d7 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -333,4 +333,5 @@ const struct address_space_operations v9fs_addr_operations = {
 	.invalidatepage = v9fs_invalidate_page,
 	.launder_page = v9fs_launder_page,
 	.direct_IO = v9fs_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 75ebd2b576ca..7488bd7d3e0c 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -460,6 +460,7 @@ const struct address_space_operations affs_aops = {
 	.write_end = affs_write_end,
 	.direct_IO = affs_direct_IO,
-	.bmap = _affs_bmap
+	.bmap = _affs_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static inline struct buffer_head *
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 487533c35ddb..b479c97e42fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10937,7 +10937,6 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readahead	= btrfs_readahead,
-	.direct_IO	= noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
 #ifdef CONFIG_MIGRATION
@@ -10947,6 +10946,7 @@ static const struct address_space_operations btrfs_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate	= btrfs_swap_activate,
 	.swap_deactivate = btrfs_swap_deactivate,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct inode_operations btrfs_file_inode_operations = {
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 99b80b5c7a93..086d4745b99e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1306,17 +1306,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-/*
- * we set .direct_IO to indicate direct io is supported, but since we
- * intercept O_DIRECT reads and writes early, this function should
- * never get called.
- */
-static ssize_t ceph_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-	WARN_ON(1);
-	return -EINVAL;
-}
-
 const struct address_space_operations ceph_aops = {
 	.readpage = ceph_readpage,
 	.readahead = ceph_readahead,
@@ -1327,7 +1316,7 @@ const struct address_space_operations ceph_aops = {
 	.set_page_dirty = ceph_set_page_dirty,
 	.invalidatepage = ceph_invalidatepage,
 	.releasepage = ceph_releasepage,
-	.direct_IO = ceph_direct_io,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static void ceph_block_sigs(sigset_t *oldset)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 6796fc73b304..a5787cf3d836 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4891,25 +4891,6 @@ void cifs_oplock_break(struct work_struct *work)
 	cifs_done_oplock_break(cinode);
 }
 
-/*
- * The presence of cifs_direct_io() in the address space ops vector
- * allowes open() O_DIRECT flags which would have failed otherwise.
- *
- * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests
- * so this method should never be called.
- *
- * Direct IO is not yet supported in the cached mode. 
- */
-static ssize_t
-cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-        /*
-         * FIXME
-         * Eventually need to support direct IO for non forcedirectio mounts
-         */
-        return -EINVAL;
-}
-
 static int cifs_swap_activate(struct swap_info_struct *sis,
 			      struct file *swap_file, sector_t *span)
 {
@@ -4974,7 +4955,6 @@ const struct address_space_operations cifs_addr_ops = {
 	.write_end = cifs_write_end,
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	.releasepage = cifs_release_page,
-	.direct_IO = cifs_direct_io,
 	.invalidatepage = cifs_invalidate_page,
 	.launder_page = cifs_launder_page,
 	/*
@@ -4984,6 +4964,7 @@ const struct address_space_operations cifs_addr_ops = {
 	 */
 	.swap_activate = cifs_swap_activate,
 	.swap_deactivate = cifs_swap_deactivate,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 9db829715652..30f19296b268 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -299,7 +299,7 @@ const struct address_space_operations erofs_raw_access_aops = {
 	.readpage = erofs_readpage,
 	.readahead = erofs_readahead,
 	.bmap = erofs_bmap,
-	.direct_IO = noop_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 #ifdef CONFIG_FS_DAX
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index ca37d4344361..f38f42282f54 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -500,6 +500,7 @@ static const struct address_space_operations exfat_aops = {
 	.write_end	= exfat_write_end,
 	.direct_IO	= exfat_direct_IO,
-	.bmap		= exfat_aop_bmap
+	.bmap		= exfat_aop_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static inline unsigned long exfat_hash(loff_t i_pos)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 333fa62661d5..4ad3655defd9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -974,6 +974,7 @@ const struct address_space_operations ext2_aops = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate	= block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
@@ -988,13 +989,14 @@ const struct address_space_operations ext2_nobh_aops = {
 	.writepages		= ext2_writepages,
 	.migratepage		= buffer_migrate_page,
 	.error_remove_page	= generic_error_remove_page,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext2_dax_aops = {
 	.writepages		= ext2_dax_writepages,
-	.direct_IO		= noop_direct_IO,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.invalidatepage		= noop_invalidatepage,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..08d3541d8daa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3662,11 +3662,11 @@ static const struct address_space_operations ext4_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= noop_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
@@ -3680,10 +3680,10 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_journalled_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= noop_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_da_aops = {
@@ -3697,20 +3697,20 @@ static const struct address_space_operations ext4_da_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
-	.direct_IO		= noop_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_dax_aops = {
 	.writepages		= ext4_dax_writepages,
-	.direct_IO		= noop_direct_IO,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= noop_invalidatepage,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 void ext4_set_aops(struct inode *inode)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index f4fd6c246c9a..4c3643969b69 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4156,6 +4156,7 @@ const struct address_space_operations f2fs_dblock_aops = {
 #ifdef CONFIG_MIGRATION
 	.migratepage    = f2fs_migrate_page,
 #endif
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 void f2fs_clear_page_cache_dirty_tag(struct page *page)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index de0c9b013a85..4352981dfb82 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -351,6 +351,7 @@ static const struct address_space_operations fat_aops = {
 	.write_end	= fat_write_end,
 	.direct_IO	= fat_direct_IO,
-	.bmap		= _fat_bmap
+	.bmap		= _fat_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9c6c6a3e2de5..7308e8274ff9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -58,7 +58,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 	/* Pipe packetized mode is controlled by O_DIRECT flag */
 	if (!S_ISFIFO(inode->i_mode) && (arg & O_DIRECT)) {
 		if (!filp->f_mapping || !filp->f_mapping->a_ops ||
-			!filp->f_mapping->a_ops->direct_IO)
+		    !(filp->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 				return -EINVAL;
 	}
 
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 281d79f8b3d3..e39468fd7177 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1325,9 +1325,9 @@ bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi)
 
 static const struct address_space_operations fuse_dax_file_aops  = {
 	.writepages	= fuse_dax_writepages,
-	.direct_IO	= noop_direct_IO,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 	.invalidatepage	= noop_invalidatepage,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 void fuse_dax_inode_init(struct inode *inode)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 11404f8c21c7..3db64194d346 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3161,6 +3161,7 @@ static const struct address_space_operations fuse_file_aops  = {
 	.direct_IO	= fuse_direct_IO,
 	.write_begin	= fuse_write_begin,
 	.write_end	= fuse_write_end,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 void fuse_init_file_inode(struct inode *inode)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 005e920f5d4a..dc50b53d6abd 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -783,10 +783,10 @@ static const struct address_space_operations gfs2_aops = {
 	.releasepage = iomap_releasepage,
 	.invalidatepage = iomap_invalidatepage,
 	.bmap = gfs2_bmap,
-	.direct_IO = noop_direct_IO,
 	.migratepage = iomap_migrate_page,
 	.is_partially_uptodate = iomap_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations gfs2_jdata_aops = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 4a95a92546a0..5f9e5464a5bf 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -177,6 +177,7 @@ const struct address_space_operations hfs_aops = {
 	.bmap		= hfs_bmap,
 	.direct_IO	= hfs_direct_IO,
 	.writepages	= hfs_writepages,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 6fef67c2a9f0..9f0c27e5e115 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -174,6 +174,7 @@ const struct address_space_operations hfsplus_aops = {
 	.bmap		= hfsplus_bmap,
 	.direct_IO	= hfsplus_direct_IO,
 	.writepages	= hfsplus_writepages,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct dentry_operations hfsplus_dentry_operations = {
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 57ab424c05ff..a477267471a4 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -366,6 +366,7 @@ const struct address_space_operations jfs_aops = {
 	.write_end	= nobh_write_end,
 	.bmap		= jfs_bmap,
 	.direct_IO	= jfs_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/libfs.c b/fs/libfs.c
index 51b4de3b3447..c27f681291e5 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1182,18 +1182,6 @@ void noop_invalidatepage(struct page *page, unsigned int offset,
 }
 EXPORT_SYMBOL_GPL(noop_invalidatepage);
 
-ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	/*
-	 * iomap based filesystems support direct I/O without need for
-	 * this callback. However, it still needs to be set in
-	 * inode->a_ops so that open/fcntl know that direct I/O is
-	 * generally supported.
-	 */
-	return -EINVAL;
-}
-EXPORT_SYMBOL_GPL(noop_direct_IO);
-
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index aa353fd58240..7403ec6317cb 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -532,6 +532,7 @@ const struct address_space_operations nfs_file_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = nfs_swap_activate,
 	.swap_deactivate = nfs_swap_deactivate,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 2e8eb263cf0f..c57395c01817 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -307,6 +307,7 @@ const struct address_space_operations nilfs_aops = {
 	.invalidatepage		= block_invalidatepage,
 	.direct_IO		= nilfs_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static int nilfs_insert_inode_locked(struct inode *inode,
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index db2a5a4c38e4..7b3ac1ab5d04 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -1948,6 +1948,7 @@ const struct address_space_operations ntfs_aops = {
 	.direct_IO	= ntfs_direct_IO,
 	.bmap		= ntfs_bmap,
 	.set_page_dirty = __set_page_dirty_buffers,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations ntfs_aops_cmpr = {
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 68d11c295dd3..5a158975a4ff 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2466,4 +2466,5 @@ const struct address_space_operations ocfs2_aops = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate	= block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/open.c b/fs/open.c
index daa324606a41..d679dc0c1801 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -840,7 +840,8 @@ static int do_dentry_open(struct file *f,
 
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
+		if (!f->f_mapping->a_ops ||
+		    !(f->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 			return -EINVAL;
 	}
 
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index c1bb4c4b5d67..c5bad94dfbd0 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -641,6 +641,7 @@ static const struct address_space_operations orangefs_address_operations = {
 	.freepage = orangefs_freepage,
 	.launder_page = orangefs_launder_page,
 	.direct_IO = orangefs_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index d081faa55e83..87d05f1d718a 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -83,7 +83,7 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 
 	if (flags & O_DIRECT) {
 		if (!file->f_mapping->a_ops ||
-		    !file->f_mapping->a_ops->direct_IO)
+		    !(file->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 			return -EINVAL;
 	}
 
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 832b17589733..9902608b1715 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -660,8 +660,7 @@ static const struct inode_operations ovl_special_inode_operations = {
 };
 
 static const struct address_space_operations ovl_aops = {
-	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO		= noop_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index f49b72ccac4c..890d91847d58 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3436,4 +3436,5 @@ const struct address_space_operations reiserfs_address_space_operations = {
 	.bmap = reiserfs_aop_bmap,
 	.direct_IO = reiserfs_direct_IO,
 	.set_page_dirty = reiserfs_set_page_dirty,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 1baff8ddb754..2cb1b499e5c7 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -131,6 +131,7 @@ const struct address_space_operations udf_adinicb_aops = {
 	.write_begin	= udf_adinicb_write_begin,
 	.write_end	= udf_adinicb_write_end,
 	.direct_IO	= udf_adinicb_direct_IO,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 1d6b7a50736b..38b799b457d5 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -244,6 +244,7 @@ const struct address_space_operations udf_aops = {
 	.write_end	= generic_write_end,
 	.direct_IO	= udf_direct_IO,
 	.bmap		= udf_bmap,
+	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 34fc6148032a..2a4570516591 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -548,17 +548,17 @@ const struct address_space_operations xfs_address_space_operations = {
 	.releasepage		= iomap_releasepage,
 	.invalidatepage		= iomap_invalidatepage,
 	.bmap			= xfs_vm_bmap,
-	.direct_IO		= noop_direct_IO,
 	.migratepage		= iomap_migrate_page,
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations xfs_dax_aops = {
 	.writepages		= xfs_dax_writepages,
-	.direct_IO		= noop_direct_IO,
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.invalidatepage		= noop_invalidatepage,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index ddc346a9df9b..37ff541467e8 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -191,8 +191,8 @@ static const struct address_space_operations zonefs_file_aops = {
 	.migratepage		= iomap_migrate_page,
 	.is_partially_uptodate	= iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
-	.direct_IO		= noop_direct_IO,
 	.swap_activate		= zonefs_swap_activate,
+	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
 static void zonefs_update_stats(struct inode *inode, loff_t new_isize)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e7a633353fd2..c909ca6c0eb6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -369,7 +369,10 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+#define AS_SUPPORTS_DIRECT_IO	0x00000001
+
 struct address_space_operations {
+	unsigned int supports; /* Bitmask of AS_SUPPORTS_* flags */
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
 
@@ -3391,7 +3394,6 @@ extern void simple_recursive_removal(struct dentry *,
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
 extern void noop_invalidatepage(struct page *page, unsigned int offset,
 		unsigned int length);
-extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 extern int simple_empty(struct dentry *);
 extern int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,




* [PATCH v3 3/9] mm: Make swap_readpage() void
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
  2021-09-24 17:18 ` [PATCH v3 1/9] mm: Remove the callback func argument from __swap_writepage() David Howells
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-24 22:07   ` Matthew Wilcox
  2021-09-24 17:18 ` [PATCH v3 4/9] Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP David Howells
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Jens Axboe, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

None of the callers of swap_readpage() actually check its return value and,
indeed, the operation may still be in progress, so remove the return value.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 include/linux/swap.h |    2 +-
 mm/page_io.c         |   11 +++--------
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 576d40e33b1f..293eba012d4f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,7 +416,7 @@ extern void kswapd_stop(int nid);
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
 /* linux/mm/page_io.c */
-extern int swap_readpage(struct page *page, bool do_poll);
+void swap_readpage(struct page *page, bool synchronous);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
 int __swap_writepage(struct page *page, struct writeback_control *wbc);
 extern int swap_set_page_dirty(struct page *page);
diff --git a/mm/page_io.c b/mm/page_io.c
index afd18f6ec09e..b9fe25101a39 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -352,10 +352,9 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
-int swap_readpage(struct page *page, bool synchronous)
+void swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
-	int ret = 0;
 	struct swap_info_struct *sis = page_swap_info(page);
 	blk_qc_t qc;
 	struct gendisk *disk;
@@ -382,15 +381,13 @@ int swap_readpage(struct page *page, bool synchronous)
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
+		if (!mapping->a_ops->readpage(swap_file, page))
 			count_vm_event(PSWPIN);
 		goto out;
 	}
 
 	if (sis->flags & SWP_SYNCHRONOUS_IO) {
-		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
-		if (!ret) {
+		if (!bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
 			if (trylock_page(page)) {
 				swap_slot_free_notify(page);
 				unlock_page(page);
@@ -401,7 +398,6 @@ int swap_readpage(struct page *page, bool synchronous)
 		}
 	}
 
-	ret = 0;
 	bio = bio_alloc(GFP_KERNEL, 1);
 	bio_set_dev(bio, sis->bdev);
 	bio->bi_opf = REQ_OP_READ;
@@ -435,7 +431,6 @@ int swap_readpage(struct page *page, bool synchronous)
 
 out:
 	psi_memstall_leave(&pflags);
-	return ret;
 }
 
 int swap_set_page_dirty(struct page *page)




* [PATCH v3 4/9] Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (2 preceding siblings ...)
  2021-09-24 17:18 ` [PATCH v3 3/9] mm: Make swap_readpage() void David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-26 21:56   ` Dave Chinner
  2021-09-24 17:18 ` [PATCH v3 5/9] mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage() David Howells
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Darrick J. Wong, linux-xfs, linux-block, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Introduce an IOCB_SWAP flag for the kiocb struct so that REQ_SWAP gets set
on the lower-level operation structures in generic code.
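
As a rough illustration of that flag propagation, here is a userspace sketch
(not kernel code).  The MODEL_* values and function names are invented for
the example; only the shape of the translation, one bit tested and one bit
set, follows the hunks in this patch.

```c
#include <assert.h>

/* Stand-in flag values for this sketch only, not the kernel's definitions. */
#define MODEL_IOCB_HIPRI	(1u << 3)
#define MODEL_IOCB_SWAP		(1u << 22)
#define MODEL_REQ_HIPRI		(1u << 0)
#define MODEL_REQ_SWAP		(1u << 1)

/* Translate per-iocb flags into per-request flags, one bit at a time. */
static unsigned int model_req_flags(unsigned int ki_flags)
{
	unsigned int op_flags = 0;

	if (ki_flags & MODEL_IOCB_HIPRI)
		op_flags |= MODEL_REQ_HIPRI;
	if (ki_flags & MODEL_IOCB_SWAP)
		op_flags |= MODEL_REQ_SWAP;
	return op_flags;
}

/* Usage: a swap iocb yields a swap-tagged request; a plain one does not. */
static void model_req_flags_example(void)
{
	assert(model_req_flags(MODEL_IOCB_SWAP) == MODEL_REQ_SWAP);
	assert(model_req_flags(0) == 0);
}
```

The point of routing this through the kiocb is that a filesystem's DIO code
does not need to know it is servicing swap; the generic layer tags the
request on its behalf.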

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-xfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 fs/direct-io.c      |    2 ++
 include/linux/bio.h |    2 ++
 include/linux/fs.h  |    1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b2e86e739d7a..76eec0a68fa4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1216,6 +1216,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
 		dio->op_flags |= REQ_HIPRI;
+	if (iocb->ki_flags & IOCB_SWAP)
+		dio->op_flags |= REQ_SWAP;
 
 	/*
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 00952e92eae1..b01133727494 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -787,6 +787,8 @@ static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
 	bio->bi_opf |= REQ_HIPRI;
 	if (!is_sync_kiocb(kiocb))
 		bio->bi_opf |= REQ_NOWAIT;
+	if (kiocb->ki_flags & IOCB_SWAP)
+		bio->bi_opf |= REQ_SWAP;
 }
 
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c909ca6c0eb6..c20f4423e2f1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -321,6 +321,7 @@ enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+#define IOCB_SWAP		(1 << 22)	/* Operation on a swapfile */
 
 struct kiocb {
 	struct file		*ki_filp;




* [PATCH v3 5/9] mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage()
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (3 preceding siblings ...)
  2021-09-24 17:18 ` [PATCH v3 4/9] Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-24 17:18 ` [PATCH v3 6/9] mm: Make __swap_writepage() do async DIO if asked for it David Howells
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Jens Axboe, Darrick J. Wong, linux-block, linux-xfs,
	linux-fsdevel, linux-mm, dhowells, dhowells, darrick.wong, viro,
	jlayton, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

Make swap_readpage() use the ->swap_rw() method on the filesystem to do
direct I/O rather than ->readpage() when accessing a swap file
(SWP_FS_OPS).

Make swap_writepage() similarly use ->swap_rw() rather than the
->direct_IO() method.
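
The async read path in this patch holds two references on its kiocb: one
for the submitter and one for the completion.  The following is a
single-threaded userspace sketch of that pattern, assuming a hypothetical
backend that either finishes inline or queues the request.  All model_*
names and the MODEL_QUEUED value are invented for the example, and a plain
int stands in for refcount_t.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_QUEUED	(-529L)	/* stand-in for -EIOCBQUEUED */

struct model_page {
	long result;	/* byte count or negative error */
	int done;	/* set once completion has run */
};

struct model_kiocb {
	int ref;		/* plain int here; the patch uses refcount_t */
	struct model_page *page;
	int *frees;		/* counts how often a kiocb is freed */
};

static struct model_kiocb *model_pending;	/* the one queued request */

static void model_put(struct model_kiocb *ki)
{
	assert(ki->ref > 0);
	if (--ki->ref == 0) {
		(*ki->frees)++;
		free(ki);
	}
}

/* Completion: record the result on the page, then drop the I/O's ref. */
static void model_complete(struct model_kiocb *ki, long ret)
{
	ki->page->result = ret;
	ki->page->done = 1;
	model_put(ki);
}

/* Hypothetical backends: one finishes inline, one queues the request. */
static long model_rw_inline(struct model_kiocb *ki)
{
	(void)ki;
	return 4096;
}

static long model_rw_queued(struct model_kiocb *ki)
{
	model_pending = ki;
	return MODEL_QUEUED;
}

static int model_read(long (*swap_rw)(struct model_kiocb *),
		      struct model_page *page, int *frees)
{
	struct model_kiocb *ki = calloc(1, sizeof(*ki));
	long ret;

	if (!ki)
		return -1;
	ki->ref = 2;	/* one for the submitter, one for the completion */
	ki->page = page;
	ki->frees = frees;

	ret = swap_rw(ki);
	if (ret != MODEL_QUEUED)
		model_complete(ki, ret);	/* finished inline */
	model_put(ki);				/* submitter's reference */
	return 0;
}
```

Whichever side drops its reference last frees the structure, so the
submitter can still touch the kiocb after the backend returns even if the
completion has already run.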

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-block@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 include/linux/fs.h |    2 +
 mm/page_io.c       |  106 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 98 insertions(+), 10 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c20f4423e2f1..c8f7724ecded 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -338,6 +338,7 @@ struct kiocb {
 	union {
 		unsigned int		ki_cookie; /* for ->iopoll */
 		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+		struct page	*ki_swap_page;	/* For swapfile_read/write */
 	};
 
 	randomized_struct_fields_end
@@ -404,6 +405,7 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, gfp_t);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
+	ssize_t (*swap_rw)(struct kiocb *, struct iov_iter *);
 	/*
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/page_io.c b/mm/page_io.c
index b9fe25101a39..6b1465699c72 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -4,7 +4,7 @@
  *
  *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
  *
- *  Swap reorganised 29.12.95, 
+ *  Swap reorganised 29.12.95,
  *  Asynchronous swapping added 30.12.95. Stephen Tweedie
  *  Removed race in async swapping. 14.4.1996. Bruno Haible
  *  Add swap of shared pages through the page cache. 20.2.1998. Stephen Tweedie
@@ -26,6 +26,22 @@
 #include <linux/uio.h>
 #include <linux/sched/task.h>
 
+/*
+ * Keep track of the kiocb we're using to do async DIO.  We have to
+ * refcount it until various things stop looking at the kiocb *after*
+ * calling ->ki_complete().
+ */
+struct swapfile_kiocb {
+	struct kiocb		iocb;
+	refcount_t		ref;
+};
+
+static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
+{
+	if (refcount_dec_and_test(&ki->ref))
+		kfree(ki);
+}
+
 static void end_swap_bio_write(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
@@ -302,11 +318,12 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 
 		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
 		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos = page_file_offset(page);
+		kiocb.ki_pos	= page_file_offset(page);
+		kiocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
 
 		set_page_writeback(page);
 		unlock_page(page);
-		ret = mapping->a_ops->direct_IO(&kiocb, &from);
+		ret = mapping->a_ops->swap_rw(&kiocb, &from);
 		if (ret == PAGE_SIZE) {
 			count_vm_event(PSWPOUT);
 			ret = 0;
@@ -323,8 +340,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 			 */
 			set_page_dirty(page);
 			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
-					   page_file_offset(page));
+			pr_err_ratelimited("Write error (%d) on dio swapfile (%llu)\n",
+					   ret, page_file_offset(page));
 		}
 		end_page_writeback(page);
 		return ret;
@@ -352,6 +369,79 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
+static void swapfile_read_complete(struct page *page, long ret)
+{
+	if (ret == page_size(page)) {
+		count_vm_event(PSWPIN);
+		SetPageUptodate(page);
+	} else {
+		SetPageError(page);
+		pr_err_ratelimited("Read error (%ld) on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	}
+
+	unlock_page(page);
+}
+
+static void __swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);
+
+	swapfile_read_complete(iocb->ki_swap_page, ret);
+	swapfile_put_kiocb(ki);
+}
+
+static void swapfile_read_sync(struct swap_info_struct *sis, struct page *page,
+			       struct iov_iter *to)
+{
+	struct kiocb kiocb;
+	struct file *swap_file = sis->swap_file;
+	int ret;
+
+	init_sync_kiocb(&kiocb, swap_file);
+	kiocb.ki_swap_page	= page;
+	kiocb.ki_pos		= page_file_offset(page);
+	kiocb.ki_flags		= IOCB_DIRECT | IOCB_SWAP;
+	ret = swap_file->f_mapping->a_ops->swap_rw(&kiocb, to);
+
+	swapfile_read_complete(page, ret);
+}
+
+static void swapfile_read(struct swap_info_struct *sis, struct page *page,
+			  bool synchronous)
+{
+	struct swapfile_kiocb *ki;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page = page,
+		.bv_len  = thp_size(page),
+		.bv_offset = 0
+	};
+	struct iov_iter to;
+	int ret;
+
+	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));
+
+	if (synchronous)
+		return swapfile_read_sync(sis, page, &to);
+
+	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
+	if (!ki)
+		return;
+
+	refcount_set(&ki->ref, 2);
+	init_sync_kiocb(&ki->iocb, swap_file);
+	ki->iocb.ki_swap_page	= page;
+	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_SWAP;
+	ki->iocb.ki_pos		= page_file_offset(page);
+	ki->iocb.ki_complete	= __swapfile_read_complete;
+
+	ret = swap_file->f_mapping->a_ops->swap_rw(&ki->iocb, &to);
+	if (ret != -EIOCBQUEUED)
+		__swapfile_read_complete(&ki->iocb, ret, 0);
+	swapfile_put_kiocb(ki);
+}
+
 void swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -378,11 +468,7 @@ void swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-
-		if (!mapping->a_ops->readpage(swap_file, page))
-			count_vm_event(PSWPIN);
+		swapfile_read(sis, page, synchronous);
 		goto out;
 	}
 




* [PATCH v3 6/9] mm: Make __swap_writepage() do async DIO if asked for it
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (4 preceding siblings ...)
  2021-09-24 17:18 ` [PATCH v3 5/9] mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage() David Howells
@ 2021-09-24 17:18 ` David Howells
  2021-09-24 17:19 ` [PATCH v3 7/9] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:18 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Darrick J. Wong, Trond Myklebust, linux-nfs, linux-block,
	linux-xfs, linux-fsdevel, linux-mm, dhowells, dhowells,
	darrick.wong, viro, jlayton, torvalds, linux-nfs, linux-mm,
	linux-fsdevel, linux-kernel

Make __swap_writepage()'s DIO path do sync DIO if the writeback control's
sync mode is WB_SYNC_ALL and async DIO if not.

Note that this causes hanging processes in sunrpc if the swapfile is on
NFS.  I'm not sure whether it's due to mis-scheduling or something else.
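
To make the sync/async choice and the return-value handling concrete, here
is a small userspace sketch.  The model_* names are invented for the
example; the mapping mirrors the "full page is success, short write is
-ENODATA, errors pass through" expression used in the diff below.

```c
#include <assert.h>
#include <errno.h>

enum model_sync_mode { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };

/* Sync DIO only when the caller asked for data-integrity writeback. */
static int model_wants_sync(enum model_sync_mode mode)
{
	return mode == MODEL_WB_SYNC_ALL;
}

/* Map a swap_rw()-style return onto what the writepage caller sees. */
static int model_write_status(long ret, long page_bytes)
{
	assert(page_bytes > 0);
	if (ret == page_bytes)
		return 0;		/* whole page written */
	if (ret >= 0)
		return -ENODATA;	/* short write */
	return (int)ret;		/* pass the error through */
}
```

For example, model_write_status(4096, 4096) yields 0, whereas a short write
of 100 bytes yields -ENODATA.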

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Darrick J. Wong <djwong@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: linux-nfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 mm/page_io.c |  133 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 92 insertions(+), 41 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 6b1465699c72..8f1199d59162 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -298,6 +298,96 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
+static void swapfile_write_complete(struct page *page, long ret)
+{
+	if (ret == thp_size(page)) {
+		count_swpout_vm_event(page);
+	} else {
+		/*
+		 * In the case of swap-over-nfs, this can be a
+		 * temporary failure if the system has limited memory
+		 * for allocating transmit buffers.  Mark the page
+		 * dirty and avoid rotate_reclaimable_page but
+		 * rate-limit the messages but do not flag PageError
+		 * like the normal direct-to-bio case as it could be
+		 * temporary.
+		 */
+		set_page_dirty(page);
+		ClearPageReclaim(page);
+		pr_err_ratelimited("Write error (%ld) on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	}
+	end_page_writeback(page);
+}
+
+static void __swapfile_write_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);
+
+	swapfile_write_complete(iocb->ki_swap_page, ret);
+	swapfile_put_kiocb(ki);
+}
+
+static int swapfile_write_sync(struct swap_info_struct *sis,
+			       struct page *page, struct writeback_control *wbc,
+			       struct iov_iter *from)
+{
+	struct kiocb kiocb;
+	struct file *swap_file = sis->swap_file;
+	int ret;
+
+	init_sync_kiocb(&kiocb, swap_file);
+	kiocb.ki_swap_page	= page;
+	kiocb.ki_pos		= page_file_offset(page);
+	kiocb.ki_flags		= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
+
+	set_page_writeback(page);
+	unlock_page(page);
+
+	ret = swap_file->f_mapping->a_ops->swap_rw(&kiocb, from);
+	swapfile_write_complete(page, ret);
+	return ret == page_size(page) ? 0 : ret >= 0 ? -ENODATA : ret;
+}
+
+static int swapfile_write(struct swap_info_struct *sis,
+			  struct page *page, struct writeback_control *wbc)
+{
+	struct swapfile_kiocb *ki;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page	= page,
+		.bv_len		= page_size(page),
+		.bv_offset	= 0
+	};
+	struct iov_iter from;
+	int ret;
+
+	iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
+
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return swapfile_write_sync(sis, page, wbc, &from);
+
+	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
+	if (!ki)
+		return -ENOMEM;
+
+	refcount_set(&ki->ref, 2);
+	init_sync_kiocb(&ki->iocb, swap_file);
+	ki->iocb.ki_swap_page	= page;
+	ki->iocb.ki_pos		= page_file_offset(page);
+	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
+	ki->iocb.ki_complete	= __swapfile_write_complete;
+
+	set_page_writeback(page);
+	unlock_page(page);
+	ret = swap_file->f_mapping->a_ops->swap_rw(&ki->iocb, &from);
+
+	if (ret != -EIOCBQUEUED)
+		__swapfile_write_complete(&ki->iocb, ret, 0);
+	swapfile_put_kiocb(ki);
+	return ret == page_size(page) ? 0 : ret >= 0 ? -ENODATA : ret;
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct bio *bio;
@@ -305,47 +395,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct kiocb kiocb;
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-		struct bio_vec bv = {
-			.bv_page = page,
-			.bv_len  = PAGE_SIZE,
-			.bv_offset = 0
-		};
-		struct iov_iter from;
-
-		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
-		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos	= page_file_offset(page);
-		kiocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
-
-		set_page_writeback(page);
-		unlock_page(page);
-		ret = mapping->a_ops->swap_rw(&kiocb, &from);
-		if (ret == PAGE_SIZE) {
-			count_vm_event(PSWPOUT);
-			ret = 0;
-		} else {
-			/*
-			 * In the case of swap-over-nfs, this can be a
-			 * temporary failure if the system has limited
-			 * memory for allocating transmit buffers.
-			 * Mark the page dirty and avoid
-			 * rotate_reclaimable_page but rate-limit the
-			 * messages but do not flag PageError like
-			 * the normal direct-to-bio case as it could
-			 * be temporary.
-			 */
-			set_page_dirty(page);
-			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error (%d) on dio swapfile (%llu)\n",
-					   ret, page_file_offset(page));
-		}
-		end_page_writeback(page);
-		return ret;
-	}
+	if (data_race(sis->flags & SWP_FS_OPS))
+		return swapfile_write(sis, page, wbc);
 
 	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
 	if (!ret) {




* [PATCH v3 7/9] nfs: Fix write to swapfile failure due to generic_write_checks()
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (5 preceding siblings ...)
  2021-09-24 17:18 ` [PATCH v3 6/9] mm: Make __swap_writepage() do async DIO if asked for it David Howells
@ 2021-09-24 17:19 ` David Howells
  2021-09-24 17:19 ` [PATCH v3 8/9] block, btrfs, ext4, xfs: Implement swap_rw David Howells
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:19 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Anna Schumaker, NeilBrown, Darrick J. Wong, linux-nfs, linux-mm,
	linux-fsdevel, dhowells, dhowells, darrick.wong, viro, jlayton,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Trying to use a swapfile on NFS results in every DIO write failing with
ETXTBSY because generic_write_checks(), as called by nfs_direct_write()
from nfs_direct_IO(), forbids writes to swapfiles.

Fix this by implementing the ->swap_rw() method for NFS and using that to
bypass the checks in generic_write_checks().  [I'm not sure whether we
still need to do some of the checks.]

Without this patch, the following is seen:

	Write error on dio swapfile (3800334336)

Altering __swap_writepage() to show the error shows:

	Write error (-26) on dio swapfile (3800334336)
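
The -26 above is -ETXTBSY on Linux.  As a userspace sketch of why the two
paths differ (all model_* names are invented for the example and only
imitate the control flow described here): the ordinary write path runs a
check that refuses swapfiles, whereas the swap path goes straight to the
direct write.

```c
#include <assert.h>
#include <errno.h>

struct model_inode {
	int is_swapfile;
};

/* Stand-in for the generic check: refuse writes to an active swapfile. */
static long model_write_checks(const struct model_inode *inode, long count)
{
	if (inode->is_swapfile)
		return -ETXTBSY;
	return count;
}

/* Stand-in for the actual direct write; pretends everything is written. */
static long model_direct_write(long count)
{
	return count;
}

/* Ordinary write path: checks first, then the direct write. */
static long model_file_write(const struct model_inode *inode, long count)
{
	long ret = model_write_checks(inode, count);

	if (ret <= 0)
		return ret;
	return model_direct_write(ret);
}

/* Swap path: goes straight to the direct write, skipping the checks. */
static long model_swap_write(const struct model_inode *inode, long count)
{
	assert(inode->is_swapfile);
	return model_direct_write(count);
}
```

With a swapfile inode, model_file_write() fails with -ETXTBSY while
model_swap_write() returns the full count, which is the behaviour the new
method is meant to provide.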

Tested by swapping off all swap partitions and then swapping on a prepared
NFS file (CONFIG_NFS_SWAP=y is also needed).  Enough copies of the
following program then need to be run to force swapping to occur (at least
one per gigabyte of RAM):

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	int main()
	{
		unsigned int pid = getpid(), iterations = 0;
		size_t i, j, size = 1024 * 1024 * 1024;
		char *p;
		bool mismatch;
		p = malloc(size);
		if (!p) {
			perror("malloc");
			exit(1);
		}
		srand(pid);
		for (i = 0; i < size; i += 4)
			*(unsigned int *)(p + i) = rand();
		do {
			for (j = 0; j < 16; j++) {
				for (i = 0; i < size; i += 4096)
					*(unsigned int *)(p + i) += 1;
				iterations++;
			}
			mismatch = false;
			srand(pid);
			for (i = 0; i < size; i += 4) {
				unsigned int r = rand();
				unsigned int v = *(unsigned int *)(p + i);
				if (i % 4096 == 0)
					v -= iterations;
				if (v != r) {
					fprintf(stderr, "mismatch %zx: %x != %x (diff %x)\n",
						i, v, r, v - r);
					mismatch = true;
				}
			}
		} while (!mismatch);
		exit(1);
	}


Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@primarydata.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: "NeilBrown" <neilb@suse.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Darrick J. Wong <darrick.wong@oracle.com>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-nfs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
---

 fs/nfs/direct.c        |   28 +++++++---------------------
 fs/nfs/file.c          |   14 ++++++--------
 include/linux/nfs_fs.h |    2 +-
 3 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 2e894fec036b..71da8054df7e 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -152,28 +152,18 @@ nfs_direct_count_bytes(struct nfs_direct_req *dreq,
 }
 
 /**
- * nfs_direct_IO - NFS address space operation for direct I/O
+ * nfs_swap_rw - Do direct I/O to a swapfile on NFS
  * @iocb: target I/O control block
  * @iter: I/O buffer
  *
  * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, for most direct IO, we
- * shunt off direct read and write requests before the VFS gets them,
- * so this method is only ever called for swap.
+ * the NFS client supports direct I/O for swap.
  */
-ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct inode *inode = iocb->ki_filp->f_mapping->host;
-
-	/* we only support swap file calling nfs_direct_IO */
-	if (!IS_SWAPFILE(inode))
-		return 0;
-
-	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
-
-	if (iov_iter_rw(iter) == READ)
-		return nfs_file_direct_read(iocb, iter);
-	return nfs_file_direct_write(iocb, iter);
+	if (iocb->ki_flags & IOCB_WRITE)
+		return nfs_file_direct_write(iocb, iter);
+	return nfs_file_direct_read(iocb, iter);
 }
 
 static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
@@ -894,7 +884,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 {
 	ssize_t result, requested;
-	size_t count;
+	size_t count = iov_iter_count(iter);
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
@@ -905,10 +895,6 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
-	result = generic_write_checks(iocb, iter);
-	if (result <= 0)
-		return result;
-	count = result;
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);
 
 	pos = iocb->ki_pos;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 7403ec6317cb..70dd49994751 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -523,7 +523,7 @@ const struct address_space_operations nfs_file_aops = {
 	.write_end = nfs_write_end,
 	.invalidatepage = nfs_invalidate_page,
 	.releasepage = nfs_release_page,
-	.direct_IO = nfs_direct_IO,
+	.swap_rw = nfs_swap_rw,
 #ifdef CONFIG_MIGRATION
 	.migratepage = nfs_migrate_page,
 #endif
@@ -616,14 +616,16 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	if (result)
 		return result;
 
-	if (iocb->ki_flags & IOCB_DIRECT)
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		result = generic_write_checks(iocb, from);
+		if (result <= 0)
+			return result;
 		return nfs_file_direct_write(iocb, from);
+	}
 
 	dprintk("NFS: write(%pD2, %zu@%Ld)\n",
 		file, iov_iter_count(from), (long long) iocb->ki_pos);
 
-	if (IS_SWAPFILE(inode))
-		goto out_swapfile;
 	/*
 	 * O_APPEND implies that we must revalidate the file length.
 	 */
@@ -678,10 +680,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
 out:
 	return result;
-
-out_swapfile:
-	printk(KERN_INFO "NFS: attempt to write to active swap file!\n");
-	return -ETXTBSY;
 }
 EXPORT_SYMBOL_GPL(nfs_file_write);
 
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index b9a8b925db43..4a8bd9e48237 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -493,7 +493,7 @@ static inline const struct cred *nfs_file_cred(struct file *file)
 /*
  * linux/fs/nfs/direct.c
  */
-extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
+extern ssize_t nfs_swap_rw(struct kiocb *, struct iov_iter *);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
 			struct iov_iter *iter);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,




* [PATCH v3 8/9] block, btrfs, ext4, xfs: Implement swap_rw
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (6 preceding siblings ...)
  2021-09-24 17:19 ` [PATCH v3 7/9] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
@ 2021-09-24 17:19 ` David Howells
  2021-09-24 17:19 ` [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths David Howells
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:19 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Jens Axboe, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong, linux-block,
	linux-btrfs, linux-ext4, linux-xfs, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Implement swap_rw for block devices, btrfs, ext4 and xfs.  This allows the
page swapping code to use direct-IO rather than direct bio submission,
whilst skipping the checks going via read/write_iter would entail.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Chris Mason <clm@fb.com>
cc: Josef Bacik <josef@toxicpanda.com>
cc: David Sterba <dsterba@suse.com>
cc: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-block@vger.kernel.org
cc: linux-btrfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 block/fops.c      |    1 +
 fs/btrfs/inode.c  |   12 +++++-------
 fs/ext4/inode.c   |    9 +++++++++
 fs/xfs/xfs_aops.c |    9 +++++++++
 4 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 84c64d814d0d..7ba37dfafae2 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -382,6 +382,7 @@ const struct address_space_operations def_blk_aops = {
 	.write_end	= blkdev_write_end,
 	.writepages	= blkdev_writepages,
 	.direct_IO	= blkdev_direct_IO,
+	.swap_rw	= blkdev_direct_IO,
 	.migratepage	= buffer_migrate_page_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 	.supports	= AS_SUPPORTS_DIRECT_IO,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b479c97e42fc..9ffcefecb3bb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10852,15 +10852,10 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	sis->highest_bit = bsi.nr_pages - 1;
 	return bsi.nr_extents;
 }
-#else
-static void btrfs_swap_deactivate(struct file *file)
-{
-}
 
-static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
-			       sector_t *span)
+static ssize_t btrfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 {
-	return -EOPNOTSUPP;
+	return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, NULL, 0);
 }
 #endif
 
@@ -10944,8 +10939,11 @@ static const struct address_space_operations btrfs_aops = {
 #endif
 	.set_page_dirty	= btrfs_set_page_dirty,
 	.error_remove_page = generic_error_remove_page,
+#ifdef CONFIG_SWAP
 	.swap_activate	= btrfs_swap_activate,
 	.swap_deactivate = btrfs_swap_deactivate,
+	.swap_rw	= btrfs_swap_rw,
+#endif
 	.supports	= AS_SUPPORTS_DIRECT_IO,
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 08d3541d8daa..3c14724d58a8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3651,6 +3651,11 @@ static int ext4_iomap_swap_activate(struct swap_info_struct *sis,
 				       &ext4_iomap_report_ops);
 }
 
+static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
+{
+	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
+}
+
 static const struct address_space_operations ext4_aops = {
 	.readpage		= ext4_readpage,
 	.readahead		= ext4_readahead,
@@ -3666,6 +3671,7 @@ static const struct address_space_operations ext4_aops = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.swap_rw		= ext4_swap_rw,
 	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
@@ -3683,6 +3689,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.swap_rw		= ext4_swap_rw,
 	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
@@ -3701,6 +3708,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.swap_rw		= ext4_swap_rw,
 	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
@@ -3710,6 +3718,7 @@ static const struct address_space_operations ext4_dax_aops = {
 	.bmap			= ext4_bmap,
 	.invalidatepage		= noop_invalidatepage,
 	.swap_activate		= ext4_iomap_swap_activate,
+	.swap_rw		= ext4_swap_rw,
 	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2a4570516591..23ade2cc8241 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -540,6 +540,13 @@ xfs_iomap_swapfile_activate(
 			&xfs_read_iomap_ops);
 }
 
+static ssize_t xfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
+{
+	if (iocb->ki_flags & IOCB_WRITE)
+		return iomap_dio_rw(iocb, iter, &xfs_direct_write_iomap_ops, NULL, 0);
+	return iomap_dio_rw(iocb, iter, &xfs_read_iomap_ops, NULL, 0);
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readahead		= xfs_vm_readahead,
@@ -552,6 +559,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.swap_rw		= xfs_swap_rw,
 	.supports		= AS_SUPPORTS_DIRECT_IO,
 };
 
@@ -560,5 +568,6 @@ const struct address_space_operations xfs_dax_aops = {
 	.set_page_dirty		= __set_page_dirty_no_writeback,
 	.invalidatepage		= noop_invalidatepage,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.swap_rw		= xfs_swap_rw,
 	.supports		= AS_SUPPORTS_DIRECT_IO,
 };




* [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (7 preceding siblings ...)
  2021-09-24 17:19 ` [PATCH v3 8/9] block, btrfs, ext4, xfs: Implement swap_rw David Howells
@ 2021-09-24 17:19 ` David Howells
  2021-09-25 14:56   ` Matthew Wilcox
  2021-09-25 15:36   ` David Howells
  2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 28+ messages in thread
From: David Howells @ 2021-09-24 17:19 UTC (permalink / raw)
  To: willy, hch, trond.myklebust
  Cc: Jens Axboe, Darrick J. Wong, linux-block, linux-xfs,
	linux-fsdevel, linux-mm, dhowells, dhowells, darrick.wong, viro,
	jlayton, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

Delete the BIO-generating swap read/write paths and always use ->swap_rw().
This puts the mapping layer in the filesystem.

[!] ALSO: Add a compile-time knob to disable swap by asynchronous DIO, only
    using synchronous DIO.  Async DIO doesn't seem to work, with ATA errors
    being chucked out by the swap-on-blockdev and swapfile-on-XFS.  It also
    misbehaves on NFS.

I have tested this with sync DIO on ext4-swapfile, xfs-swapfile, a raw
blockdev and NFS.  The first three work; NFS works for a while then grinds to
a halt, chucking out lists of blocked sunrpc operations (I suspect it can't
allocate memory somewhere).

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-block@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 mm/page_io.c  |  156 +++------------------------------------------------------
 mm/swapfile.c |    4 +
 2 files changed, 10 insertions(+), 150 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 8f1199d59162..b48318951380 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,6 +26,8 @@
 #include <linux/uio.h>
 #include <linux/sched/task.h>
 
+#define ONLY_USE_SYNC_DIO 1
+
 /*
  * Keep track of the kiocb we're using to do async DIO.  We have to
  * refcount it until various things stop looking at the kiocb *after*
@@ -42,30 +44,6 @@ static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
 		kfree(ki);
 }
 
-static void end_swap_bio_write(struct bio *bio)
-{
-	struct page *page = bio_first_page_all(bio);
-
-	if (bio->bi_status) {
-		SetPageError(page);
-		/*
-		 * We failed to write the page out to swap-space.
-		 * Re-dirty the page in order to avoid it being reclaimed.
-		 * Also print a dire warning that things will go BAD (tm)
-		 * very quickly.
-		 *
-		 * Also clear PG_reclaim to avoid rotate_reclaimable_page()
-		 */
-		set_page_dirty(page);
-		pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n",
-				     MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
-				     (unsigned long long)bio->bi_iter.bi_sector);
-		ClearPageReclaim(page);
-	}
-	end_page_writeback(page);
-	bio_put(bio);
-}
-
 static void swap_slot_free_notify(struct page *page)
 {
 	struct swap_info_struct *sis;
@@ -114,32 +92,6 @@ static void swap_slot_free_notify(struct page *page)
 	}
 }
 
-static void end_swap_bio_read(struct bio *bio)
-{
-	struct page *page = bio_first_page_all(bio);
-	struct task_struct *waiter = bio->bi_private;
-
-	if (bio->bi_status) {
-		SetPageError(page);
-		ClearPageUptodate(page);
-		pr_alert_ratelimited("Read-error on swap-device (%u:%u:%llu)\n",
-				     MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
-				     (unsigned long long)bio->bi_iter.bi_sector);
-		goto out;
-	}
-
-	SetPageUptodate(page);
-	swap_slot_free_notify(page);
-out:
-	unlock_page(page);
-	WRITE_ONCE(bio->bi_private, NULL);
-	bio_put(bio);
-	if (waiter) {
-		blk_wake_io_task(waiter);
-		put_task_struct(waiter);
-	}
-}
-
 int generic_swapfile_activate(struct swap_info_struct *sis,
 				struct file *swap_file,
 				sector_t *span)
@@ -279,25 +231,6 @@ static inline void count_swpout_vm_event(struct page *page)
 	count_vm_events(PSWPOUT, thp_nr_pages(page));
 }
 
-#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
-{
-	struct cgroup_subsys_state *css;
-	struct mem_cgroup *memcg;
-
-	memcg = page_memcg(page);
-	if (!memcg)
-		return;
-
-	rcu_read_lock();
-	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
-	bio_associate_blkg_from_css(bio, css);
-	rcu_read_unlock();
-}
-#else
-#define bio_associate_blkg_from_page(bio, page)		do { } while (0)
-#endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
-
 static void swapfile_write_complete(struct page *page, long ret)
 {
 	if (ret == thp_size(page)) {
@@ -364,7 +297,7 @@ static int swapfile_write(struct swap_info_struct *sis,
 
 	iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
 
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (ONLY_USE_SYNC_DIO || wbc->sync_mode == WB_SYNC_ALL)
 		return swapfile_write_sync(sis, page, wbc, &from);
 
 	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
@@ -390,40 +323,17 @@ static int swapfile_write(struct swap_info_struct *sis,
 
 int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct bio *bio;
-	int ret;
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (data_race(sis->flags & SWP_FS_OPS))
-		return swapfile_write(sis, page, wbc);
-
-	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
-	if (!ret) {
-		count_swpout_vm_event(page);
-		return 0;
-	}
-
-	bio = bio_alloc(GFP_NOIO, 1);
-	bio_set_dev(bio, sis->bdev);
-	bio->bi_iter.bi_sector = swap_page_sector(page);
-	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
-	bio->bi_end_io = end_swap_bio_write;
-	bio_add_page(bio, page, thp_size(page), 0);
-
-	bio_associate_blkg_from_page(bio, page);
-	count_swpout_vm_event(page);
-	set_page_writeback(page);
-	unlock_page(page);
-	submit_bio(bio);
-
-	return 0;
+	return swapfile_write(sis, page, wbc);
 }
 
 static void swapfile_read_complete(struct page *page, long ret)
 {
 	if (ret == page_size(page)) {
 		count_vm_event(PSWPIN);
+		swap_slot_free_notify(page);
 		SetPageUptodate(page);
 	} else {
 		SetPageError(page);
@@ -473,7 +383,7 @@ static void swapfile_read(struct swap_info_struct *sis, struct page *page,
 
 	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));
 
-	if (synchronous)
+	if (ONLY_USE_SYNC_DIO || synchronous)
 		return swapfile_read_sync(sis, page, &to);
 
 	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
@@ -495,10 +405,7 @@ static void swapfile_read(struct swap_info_struct *sis, struct page *page,
 
 void swap_readpage(struct page *page, bool synchronous)
 {
-	struct bio *bio;
 	struct swap_info_struct *sis = page_swap_info(page);
-	blk_qc_t qc;
-	struct gendisk *disk;
 	unsigned long pflags;
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
@@ -515,58 +422,9 @@ void swap_readpage(struct page *page, bool synchronous)
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
 		unlock_page(page);
-		goto out;
-	}
-
-	if (data_race(sis->flags & SWP_FS_OPS)) {
+	} else {
 		swapfile_read(sis, page, synchronous);
-		goto out;
 	}
-
-	if (sis->flags & SWP_SYNCHRONOUS_IO) {
-		if (!bdev_read_page(sis->bdev, swap_page_sector(page), page)) {
-			if (trylock_page(page)) {
-				swap_slot_free_notify(page);
-				unlock_page(page);
-			}
-
-			count_vm_event(PSWPIN);
-			goto out;
-		}
-	}
-
-	bio = bio_alloc(GFP_KERNEL, 1);
-	bio_set_dev(bio, sis->bdev);
-	bio->bi_opf = REQ_OP_READ;
-	bio->bi_iter.bi_sector = swap_page_sector(page);
-	bio->bi_end_io = end_swap_bio_read;
-	bio_add_page(bio, page, thp_size(page), 0);
-
-	disk = bio->bi_bdev->bd_disk;
-	/*
-	 * Keep this task valid during swap readpage because the oom killer may
-	 * attempt to access it in the page fault retry time check.
-	 */
-	if (synchronous) {
-		bio->bi_opf |= REQ_HIPRI;
-		get_task_struct(current);
-		bio->bi_private = current;
-	}
-	count_vm_event(PSWPIN);
-	bio_get(bio);
-	qc = submit_bio(bio);
-	while (synchronous) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(bio->bi_private))
-			break;
-
-		if (!blk_poll(disk->queue, qc, true))
-			blk_io_schedule();
-	}
-	__set_current_state(TASK_RUNNING);
-	bio_put(bio);
-
-out:
 	psi_memstall_leave(&pflags);
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 22d10f713848..95d2571e3727 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2918,6 +2918,8 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
 			return -EINVAL;
 		p->flags |= SWP_BLKDEV;
 	} else if (S_ISREG(inode->i_mode)) {
+		if (!inode->i_mapping->a_ops->swap_rw)
+			return -EINVAL;
 		p->bdev = inode->i_sb->s_bdev;
 	}
 
@@ -3165,7 +3167,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		name = NULL;
 		goto bad_swap;
 	}
-	swap_file = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
+	swap_file = file_open_name(name, O_RDWR | O_LARGEFILE | O_DIRECT, 0);
 	if (IS_ERR(swap_file)) {
 		error = PTR_ERR(swap_file);
 		swap_file = NULL;




* Re: [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features
  2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
@ 2021-09-24 20:10   ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2021-09-24 20:10 UTC (permalink / raw)
  To: David Howells
  Cc: hch, trond.myklebust, Darrick J. Wong, Ilya Dryomov, Jeff Layton,
	ceph-devel, Steve French, linux-cifs, linux-xfs, linux-fsdevel,
	linux-mm, darrick.wong, viro, torvalds, linux-nfs, linux-kernel

On Fri, Sep 24, 2021 at 06:18:14PM +0100, David Howells wrote:
> Rather than depending on .direct_IO to point to something to indicate that
> direct I/O is supported, add a 'supports' bitmask that we can test, since
> we only need one bit.

Why would you add mapping->aops->supports instead of using one of the free
bits in mapping->flags?  enum mapping_flags in pagemap.h.

It could also be a per-fs flag, or per-sb flag, but it's fewer
dereferences at check time if it's in mapping->flags.
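
To illustrate the dereference-count point, here is a minimal userspace
sketch.  It is not kernel code: the struct names, the AS_DIRECT_IO_FLAG
bit and both bit values are hypothetical stand-ins chosen for
illustration; only the AS_SUPPORTS_DIRECT_IO name comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: these structs mimic the shape of the kernel's
 * address_space and address_space_operations, nothing more. */
#define AS_SUPPORTS_DIRECT_IO	(1UL << 0)	/* bit in a_ops->supports; value assumed */
#define AS_DIRECT_IO_FLAG	(1UL << 0)	/* hypothetical bit in mapping->flags */

struct aops_model {
	unsigned long supports;
};

struct mapping_model {
	unsigned long flags;
	const struct aops_model *a_ops;
};

/* Two dependent loads: mapping -> a_ops, then a_ops -> supports. */
static bool dio_supported_via_aops(const struct mapping_model *m)
{
	return (m->a_ops->supports & AS_SUPPORTS_DIRECT_IO) != 0;
}

/* One load: mapping -> flags. */
static bool dio_supported_via_flags(const struct mapping_model *m)
{
	return (m->flags & AS_DIRECT_IO_FLAG) != 0;
}
```

Both helpers answer the same question; the second simply avoids chasing
the a_ops pointer before testing the bit.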



* Re: [PATCH v3 3/9] mm: Make swap_readpage() void
  2021-09-24 17:18 ` [PATCH v3 3/9] mm: Make swap_readpage() void David Howells
@ 2021-09-24 22:07   ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2021-09-24 22:07 UTC (permalink / raw)
  To: David Howells
  Cc: hch, trond.myklebust, Jens Axboe, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-kernel

On Fri, Sep 24, 2021 at 06:18:24PM +0100, David Howells wrote:
> None of the callers of swap_readpage() actually check its return value and,
> indeed, the operation may still be in progress, so remove the return value.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>


* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-24 17:19 ` [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths David Howells
@ 2021-09-25 14:56   ` Matthew Wilcox
  2021-09-25 15:36   ` David Howells
  1 sibling, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2021-09-25 14:56 UTC (permalink / raw)
  To: David Howells
  Cc: hch, trond.myklebust, Jens Axboe, Darrick J. Wong, linux-block,
	linux-xfs, linux-fsdevel, linux-mm, darrick.wong, viro, jlayton,
	torvalds, linux-nfs, linux-kernel

On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
> Delete the BIO-generating swap read/write paths and always use ->swap_rw().
> This puts the mapping layer in the filesystem.

Is SWP_FS_OPS now unused after this patch?

Also, do we still need ->swap_activate and ->swap_deactivate?


* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-24 17:19 ` [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths David Howells
  2021-09-25 14:56   ` Matthew Wilcox
@ 2021-09-25 15:36   ` David Howells
  2021-09-25 17:09     ` Matthew Wilcox
  2021-09-27 20:03     ` David Sterba
  1 sibling, 2 replies; 28+ messages in thread
From: David Howells @ 2021-09-25 15:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, hch, trond.myklebust, Jens Axboe, Darrick J. Wong,
	linux-block, linux-xfs, linux-fsdevel, linux-mm, darrick.wong,
	viro, jlayton, torvalds, linux-nfs, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
> > Delete the BIO-generating swap read/write paths and always use ->swap_rw().
> > This puts the mapping layer in the filesystem.
> 
> Is SWP_FS_OPS now unused after this patch?

Ummm.  Interesting question - it's only used in swap_set_page_dirty():

int swap_set_page_dirty(struct page *page)
{
	struct swap_info_struct *sis = page_swap_info(page);

	if (data_race(sis->flags & SWP_FS_OPS)) {
		struct address_space *mapping = sis->swap_file->f_mapping;

		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
		return mapping->a_ops->set_page_dirty(page);
	} else {
		return __set_page_dirty_no_writeback(page);
	}
}


> Also, do we still need ->swap_activate and ->swap_deactivate?

f2fs does quite a lot of work in its ->swap_activate(), as does btrfs.  I'm
not sure how necessary it is.  cifs looks like it intends to use it, but it's
not fully implemented yet.  zonefs and nfs do some checking, including hole
checking in nfs's case.  nfs also does some setting up for the sunrpc
transport.

btrfs, cifs, f2fs and nfs all supply ->swap_deactivate() to undo the effects
of the activation.

David



* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-25 15:36   ` David Howells
@ 2021-09-25 17:09     ` Matthew Wilcox
  2021-09-26 23:08       ` Damien Le Moal
  2021-09-27 20:03     ` David Sterba
  1 sibling, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2021-09-25 17:09 UTC (permalink / raw)
  To: David Howells
  Cc: hch, trond.myklebust, Jens Axboe, Darrick J. Wong, linux-block,
	linux-xfs, linux-fsdevel, linux-mm, darrick.wong, viro, jlayton,
	torvalds, linux-nfs, linux-kernel

On Sat, Sep 25, 2021 at 04:36:42PM +0100, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
> > > Delete the BIO-generating swap read/write paths and always use ->swap_rw().
> > > This puts the mapping layer in the filesystem.
> > 
> > Is SWP_FS_OPS now unused after this patch?
> 
> Ummm.  Interesting question - it's only used in swap_set_page_dirty():
> 
> int swap_set_page_dirty(struct page *page)
> {
> 	struct swap_info_struct *sis = page_swap_info(page);
> 
> 	if (data_race(sis->flags & SWP_FS_OPS)) {
> 		struct address_space *mapping = sis->swap_file->f_mapping;
> 
> 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> 		return mapping->a_ops->set_page_dirty(page);
> 	} else {
> 		return __set_page_dirty_no_writeback(page);
> 	}
> }

I suspect that's no longer necessary.  NFS was the only filesystem
using SWP_FS_OPS and ...

fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,

so it's not like NFS does anything special to reserve memory to write
back swap pages.

> > Also, do we still need ->swap_activate and ->swap_deactivate?
> 
> f2fs does quite a lot of work in its ->swap_activate(), as does btrfs.  I'm
> not sure how necessary it is.  cifs looks like it intends to use it, but it's
> not fully implemented yet.  zonefs and nfs do some checking, including hole
> checking in nfs's case.  nfs also does some setting up for the sunrpc
> transport.
> 
> btrfs, cifs, f2fs and nfs all supply ->swap_deactivate() to undo the effects
> of the activation.

Right ... so my question really is, now that we're doing I/O through
aops->direct_IO (or ->swap_rw), do those magic things need to be done?
After all, open(O_DIRECT) doesn't do these same magic things.  They're
really there to allow the direct-to-BIO path to work, and you're removing
that here.


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (8 preceding siblings ...)
  2021-09-24 17:19 ` [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths David Howells
@ 2021-09-25 23:42 ` Dave Chinner
  2021-09-26  3:10   ` Matthew Wilcox
  2021-09-27 20:07 ` David Sterba
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2021-09-25 23:42 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block,
	ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton,
	Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote:
> 
> Hi Willy, Trond, Christoph,
> 
> Here's v3 of a change to make reads and writes from the swapfile use async
> DIO, adding a new ->swap_rw() address_space method, rather than readpage()
> or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
> checks that prevent swapfiles from working, plus a bunch of other checks
> that may or may not be necessary.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> There are additional patches to get rid of noop_direct_IO and replace it
> with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> new ->swap_rw method and thence remove the direct BIO submission paths from
> swap.
> 
> I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> that's necessary, but it seems accounting related.
> 
> The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
> swapfile all seem to work fine.  Btrfs refuses to swapon because the file
> might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

Ok, so if the filesystem is doing block mapping in the IO path now,
why does the swap file still need to map the file into a private
block mapping now?  i.e. all the work that iomap_swapfile_activate()
does for filesystems like XFS and ext4 - is this completely
redundant now that we are doing block mapping during swap file IO
via iomap_dio_rw()?

Actually, that path does all the "can we use this file as a swap
file" checking. So the extent iteration can't go away, just the swap
file mapping part (iomap_swapfile_add_extent()). This is necessary
to ensure there aren't any holes in the file, and we still need that
because the DIO write path will allocate into holes, which leads
me to my main concern here.
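
The "no holes" check described above can be sketched in userspace
roughly as follows.  This is only a model under stated assumptions: the
extent struct, the enum and swapfile_extents_ok() are hypothetical
names, and the real activation path performs more checks than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extent model; not the kernel's iomap structures. */
enum ext_type { EXT_WRITTEN, EXT_UNWRITTEN, EXT_HOLE };

struct extent_model {
	unsigned long long start;	/* file offset in bytes */
	unsigned long long len;		/* length in bytes */
	enum ext_type type;
};

/*
 * Walk the extents in file order and accept the file only if they
 * tile [0, size) with no holes and no unmapped gaps.  Unwritten
 * extents are accepted, matching the current behaviour described in
 * the text.
 */
static bool swapfile_extents_ok(const struct extent_model *ext, size_t n,
				unsigned long long size)
{
	unsigned long long pos = 0;

	for (size_t i = 0; i < n && pos < size; i++) {
		if (ext[i].start != pos)
			return false;	/* gap between extents */
		if (ext[i].type == EXT_HOLE)
			return false;	/* a DIO write here would allocate */
		pos += ext[i].len;
	}
	return pos >= size;		/* mapping must reach the file size */
}
```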

Using the DIO path opens up the possibility that the filesystem
could want to run transactions as part of the DIO. Right now we
support unwritten extents for swap files (so they don't have to be
written to allocate the backing store before activation) and that
means we'll be doing DIO to unwritten extents. IO completion of a
DIO write to an unwritten extent will run a transaction to convert
that extent to written. A similar problem with sparse files exists,
because allocation of blocks can be done from the DIO path, and that
requires transactions. File extension is another potential
transaction path we open up by using DIO writes for swap.
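
The three transaction sources listed above can be summarised in a small
userspace sketch.  All names here (the enums and
swap_write_txn_reason()) are hypothetical, and the order in which the
cases are tested is an assumption made for illustration; a real
filesystem decides this inside its own mapping and completion code.

```c
#include <assert.h>

/* Hypothetical model of the state of the blocks a DIO write targets. */
enum blk_state { BLK_WRITTEN, BLK_UNWRITTEN, BLK_HOLE };

/* Why a DIO write might need a filesystem transaction. */
enum txn_reason {
	TXN_NONE,		/* plain overwrite of written blocks */
	TXN_ALLOCATE,		/* write into a hole allocates blocks */
	TXN_CONVERT_UNWRITTEN,	/* IO completion converts the extent to written */
	TXN_EXTEND_FILE,	/* write past EOF updates the file size */
};

static enum txn_reason swap_write_txn_reason(enum blk_state state,
					     unsigned long long pos,
					     unsigned long long len,
					     unsigned long long i_size)
{
	if (pos + len > i_size)
		return TXN_EXTEND_FILE;
	if (state == BLK_HOLE)
		return TXN_ALLOCATE;
	if (state == BLK_UNWRITTEN)
		return TXN_CONVERT_UNWRITTEN;
	return TXN_NONE;
}
```

In this model only the TXN_NONE case is free of transactions; every
other result marks a write that would need one.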

The problem is that a transaction run in swap IO context will
deadlock the filesystem. Either through the unbound memory demand of
metadata modification, or from needing log space that can't be freed
up because the metadata IO that will free the log space is waiting
on memory allocation that is waiting on swap IO...

I think some more thought needs to be put into controlling the
behaviour/semantics of the DIO path so that it can be safely used
by swap IO, because it's not a direct 1:1 behavioural mapping with
existing DIO and there are potential deadlock vectors we need to
avoid.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
@ 2021-09-26  3:10   ` Matthew Wilcox
  2021-09-26 22:36     ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2021-09-26  3:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Howells, hch, trond.myklebust, Theodore Ts'o,
	linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong,
	Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> Ok, so if the filesystem is doing block mapping in the IO path now,
> why does the swap file still need to map the file into a private
> block mapping now?  i.e. all the work that iomap_swapfile_activate()
> does for filesystems like XFS and ext4 - is this completely
> redundant now that we are doing block mapping during swap file IO
> via iomap_dio_rw()?

Hi Dave,

Thanks for bringing up all these points.  I think they all deserve to go
into the documentation as "things to consider" for people implementing
->swap_rw for their filesystem.

Something I don't think David perhaps made sufficiently clear is that
regular DIO from userspace gets handled by ->read_iter and ->write_iter.
This ->swap_rw op is used exclusively for, as the name suggests, swap DIO.
So filesystems don't have to handle swap DIO and regular DIO the same
way, and can split the allocation work between ->swap_activate and the
iomap callback as they see fit (as long as they can guarantee the lack
of deadlocks under memory pressure).
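
As a rough userspace illustration of that split: the model below has a
write_iter-style entry point that runs a write check rejecting
swapfiles, and a separate swap_rw-style entry point that goes straight
to the I/O.  Every name in it is hypothetical; write_checks_model()
only stands in for the kind of check generic_write_checks() does, and
do_dio_model() pretends the I/O always completes in full.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the bits of struct file we care about. */
struct file_model {
	bool is_swapfile;
};

/* Models a generic write check that refuses active swapfiles. */
static ssize_t write_checks_model(const struct file_model *f, size_t count)
{
	if (f->is_swapfile)
		return -ETXTBSY;
	return (ssize_t)count;
}

/* Pretend DIO submission: always transfers the full count. */
static ssize_t do_dio_model(size_t count)
{
	return (ssize_t)count;
}

/* Userspace-facing path: checks first, then I/O. */
static ssize_t write_iter_model(const struct file_model *f, size_t count)
{
	ssize_t ret = write_checks_model(f, count);

	if (ret <= 0)
		return ret;
	return do_dio_model((size_t)ret);
}

/* Swap-only path: no userspace write checks are applied. */
static ssize_t swap_rw_model(const struct file_model *f, size_t count)
{
	(void)f;
	return do_dio_model(count);
}
```

For a file marked as a swapfile, write_iter_model() fails with -ETXTBSY
while swap_rw_model() performs the transfer, which is the separation
the dedicated method makes possible.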

There are several advantages to using the DIO infrastructure for
swap:

 - unify block & net swap paths
 - allow filesystems to _see_ swap IOs instead of being bypassed
 - get rid of the swap extent rbtree
 - allow writing compound pages to swap files instead of splitting
   them
 - allow ->readpage to be synchronous for better error reporting
 - remove page_file_mapping() and page_file_offset()

I suspect there are several problems with this patchset, but I'm not
likely to have a chance to read it closely for a few days.  If you
have time to give the XFS parts a good look, that would be fantastic.


* Re: [PATCH v3 4/9] Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP
  2021-09-24 17:18 ` [PATCH v3 4/9] Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP David Howells
@ 2021-09-26 21:56   ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2021-09-26 21:56 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Darrick J. Wong, linux-xfs,
	linux-block, linux-fsdevel, linux-mm, darrick.wong, viro,
	jlayton, torvalds, linux-nfs, linux-kernel

On Fri, Sep 24, 2021 at 06:18:32PM +0100, David Howells wrote:
> Introduce an IOCB_SWAP flag for the kiocb struct such that the REQ_SWAP
> will get set on lower level operation structures in generic code.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Darrick J. Wong <djwong@kernel.org>
> cc: linux-xfs@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
> 
>  fs/direct-io.c      |    2 ++
>  include/linux/bio.h |    2 ++
>  include/linux/fs.h  |    1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b2e86e739d7a..76eec0a68fa4 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1216,6 +1216,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	}
>  	if (iocb->ki_flags & IOCB_HIPRI)
>  		dio->op_flags |= REQ_HIPRI;
> +	if (iocb->ki_flags & IOCB_SWAP)
> +		dio->op_flags |= REQ_SWAP;
>  
>  	/*
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 00952e92eae1..b01133727494 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -787,6 +787,8 @@ static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
>  	bio->bi_opf |= REQ_HIPRI;
>  	if (!is_sync_kiocb(kiocb))
>  		bio->bi_opf |= REQ_NOWAIT;
> +	if (kiocb->ki_flags & IOCB_SWAP)
> +		bio->bi_opf |= REQ_SWAP;
>  }
>  
>  struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c909ca6c0eb6..c20f4423e2f1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -321,6 +321,7 @@ enum rw_hint {
>  #define IOCB_NOIO		(1 << 20)
>  /* can use bio alloc cache */
>  #define IOCB_ALLOC_CACHE	(1 << 21)
> +#define IOCB_SWAP		(1 << 22)	/* Operation on a swapfile */
>  
>  struct kiocb {
>  	struct file		*ki_filp;

This doesn't set REQ_SWAP for the iomap based DIO path.
bio_set_polled() is only called from iomap for IOCB_HIPRI IO.
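
A minimal userspace sketch of the gap described above, not kernel code.
The MODEL_* names and both helper functions are hypothetical, and the
bit values are illustrative apart from the swap bit, which copies the
(1 << 22) from the quoted patch.  The first helper models a caller that
only reaches the polled-I/O helper for HIPRI requests, so the swap bit
is lost for ordinary swap I/O; the second translates each flag
independently:

```c
#include <stdint.h>

/* Illustrative stand-ins for kiocb flags; only the swap bit mirrors the patch. */
#define MODEL_IOCB_HIPRI (1u << 0)
#define MODEL_IOCB_SWAP  (1u << 22)

/* Illustrative stand-ins for request flags. */
#define MODEL_REQ_HIPRI  (1u << 0)
#define MODEL_REQ_SWAP   (1u << 1)

/*
 * Models the behaviour pointed out in the review: the swap bit is only
 * set inside the polled helper, and that helper is only invoked when
 * the request is HIPRI.
 */
uint32_t model_flags_as_posted(uint32_t ki_flags)
{
	uint32_t opf = 0;

	if (ki_flags & MODEL_IOCB_HIPRI) {
		opf |= MODEL_REQ_HIPRI;
		if (ki_flags & MODEL_IOCB_SWAP)
			opf |= MODEL_REQ_SWAP;
	}
	return opf;
}

/* Hypothetical alternative: translate each flag on its own. */
uint32_t model_flags_independent(uint32_t ki_flags)
{
	uint32_t opf = 0;

	if (ki_flags & MODEL_IOCB_HIPRI)
		opf |= MODEL_REQ_HIPRI;
	if (ki_flags & MODEL_IOCB_SWAP)
		opf |= MODEL_REQ_SWAP;
	return opf;
}
```

With only the swap bit set, the first helper returns 0 while the second
returns the swap request flag, which is the difference at issue.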

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-26  3:10   ` Matthew Wilcox
@ 2021-09-26 22:36     ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2021-09-26 22:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, hch, trond.myklebust, Theodore Ts'o,
	linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong,
	Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Sun, Sep 26, 2021 at 04:10:43AM +0100, Matthew Wilcox wrote:
> On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote:
> > Ok, so if the filesystem is doing block mapping in the IO path now,
> > why does the swap file still need to map the file into a private
> > block mapping now?  i.e. all the work that iomap_swapfile_activate()
> > does for filesystems like XFS and ext4 - isn't this completely
> > redundant now that we are doing block mapping during swap file IO
> > via iomap_dio_rw()?
> 
> Hi Dave,
> 
> Thanks for bringing up all these points.  I think they all deserve to go
> into the documentation as "things to consider" for people implementing
> ->swap_rw for their filesystem.
> 
> Something I don't think David perhaps made sufficiently clear is that
> regular DIO from userspace gets handled by ->read_iter and ->write_iter.
> This ->swap_rw op is used exclusively for, as the name suggests, swap DIO.
> So filesystems don't have to handle swap DIO and regular DIO the same
> way, and can split the allocation work between ->swap_activate and the
> iomap callback as they see fit (as long as they can guarantee the lack
> of deadlocks under memory pressure).

I understand this completely.

The point is that the implementation of ->swap_rw is to call
iomap_dio_rw() with the same ops as the normal DIO read/write path
uses. IOWs, apart from the IOCB_SWAP flag, there is no practical
difference between the "swap DIO" and "normal DIO" I/O paths.

> There are several advantages to using the DIO infrastructure for
> swap:
> 
>  - unify block & net swap paths
>  - allow filesystems to _see_ swap IOs instead of being bypassed
>  - get rid of the swap extent rbtree
>  - allow writing compound pages to swap files instead of splitting
>    them
>  - allow ->readpage to be synchronous for better error reporting
>  - remove page_file_mapping() and page_file_offset()
> 
> I suspect there are several problems with this patchset, but I'm not
> likely to have a chance to read it closely for a few days.  If you
> have time to give the XFS parts a good look, that would be fantastic.

That's what I've already done, and all the questions I've raised are
from asking a simple question: what happens if a transaction is
required to complete the iomap_dio_rw() swap write operation?

I mean, this is similar to the problems with IOCB_NOWAIT - we're
supposed to return -EAGAIN if we might block during IO submission,
and one of those situations we have to consider is "do we need to
run a transaction". If we get it wrong (and we do!), then the worst
thing that happens is that there is a long latency for IO
submission. It's a minor performance issue, not the end of the
world.

The difference with IOCB_SWAP is that "don't do transactions during
iomap_dio_rw()" is a _hard requirement_ on both IO submission and
completion. That means, from now and forever, we will have to
guarantee a path through iomap_dio_rw() that will never run
transactions on an IO. That requirement needs to be enforced in
every block mapping callback into each filesystem, as this is
something the iomap infrastructure cannot enforce. Hence we'll have
to plumb IOCB_SWAP into a new IOMAP_SWAP iterator flag to pass to
the ->iomap_begin() DIO methods to ensure they do the right thing.

And then the question becomes: what happens if the filesystem cannot
do the right thing? Can the swap code handle an error? e.g. the
first thing that xfs_direct_write_iomap_begin() and
xfs_read_iomap_begin() do is check if the filesystem is shut down
and return -EIO in that case. IOWs, we've now got normal filesystem
"reject all IO" corruption protection mechanisms in play. Using
iomap_dio_rw() as it stands means that _all swapfile IO will fail_
if the filesystem shuts down.

Right now the swap file IO can keep going blissfully unaware of the
filesystem failure status. The open swapfile will prevent the
filesystem from being unmounted. Hence to unmount the shutdown
filesystem to correct the problem, first the swap file has to be
turned off, which means we have a fail-safe behaviour. Using the
iomap_dio_rw() path means that swapfile IO _can and will fail_.

AFAICT, swap IO errors are pretty much thrown away by the mm code;
the swap_writepage() return value is ignored or placed on the swap
cache address space and ignored. And it looks like the new read path
just sets PageError() and leaves it to callers to detect and deal
with a swapin failure because swap_readpage() is now void...

So it seems like there's a whole new set of failure cases using the
DIO path introduces into the swap IO path that haven't been
considered here. I can't see why we wouldn't be able to solve them,
but these considerations lead me to think that use of the DIO is
based on an incorrect assumption - DIO is not a "simple low level
IO" interface.

Hence I suspect that we'd be much better off with a new
iomap_swap_rw() implementation that just does what swap needs
without any of the complexity of the DIO API. Internally iomap can
share what it needs to share with the DIO path, but at this point
I'm not sure we should be overloading the iomap_dio_rw() path with
the semantics required by swap.

e.g. we limit iomap_swap_rw() to only accept written or unwritten
block mappings within file size on inodes with clean metadata (i.e.
pure overwrite to guarantee no modification transactions), and then
the fs provided ->iomap_begin callback can ignore shutdown state,
elide inode level locking, do read-only mappings, etc without adding
extra overhead to the existing DIO code path...
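
The restriction suggested above can be sketched as a userspace model,
not kernel code.  Every name here (the enum, struct swap_mapping and
swap_mapping_ok()) is hypothetical; the enum only loosely mirrors the
kinds of mapping a filesystem can report.  The check accepts a request
only if it lands on a written or unwritten mapping, stays inside both
that mapping and the file size, and the inode metadata is clean:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mapping kinds, loosely modelled on what a filesystem reports. */
enum model_map_type {
	MODEL_HOLE,
	MODEL_DELALLOC,
	MODEL_MAPPED,
	MODEL_UNWRITTEN,
	MODEL_INLINE,
};

/* Hypothetical description of one block mapping, in bytes. */
struct swap_mapping {
	enum model_map_type type;
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* length of the mapping */
};

/*
 * Returns true only for a pure overwrite: written or unwritten blocks,
 * fully inside the mapping and the file size, with clean inode metadata,
 * so that no modification transaction would be needed.
 */
bool swap_mapping_ok(const struct swap_mapping *m, uint64_t pos, uint64_t len,
		     uint64_t i_size, bool metadata_dirty)
{
	if (metadata_dirty)
		return false;
	if (m->type != MODEL_MAPPED && m->type != MODEL_UNWRITTEN)
		return false;
	/* Request must lie within the file size (overflow-safe form). */
	if (len > i_size || pos > i_size - len)
		return false;
	/* Request must lie within the mapping itself. */
	if (pos < m->offset || pos - m->offset > m->length)
		return false;
	return len <= m->length - (pos - m->offset);
}
```

Anything that fails the check would have to be rejected before I/O
submission rather than handled by running a transaction.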

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-25 17:09     ` Matthew Wilcox
@ 2021-09-26 23:08       ` Damien Le Moal
  2021-09-27  1:25         ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2021-09-26 23:08 UTC (permalink / raw)
  To: Matthew Wilcox, David Howells
  Cc: hch, trond.myklebust, Jens Axboe, Darrick J. Wong, linux-block,
	linux-xfs, linux-fsdevel, linux-mm, darrick.wong, viro, jlayton,
	torvalds, linux-nfs, linux-kernel

On 2021/09/26 2:09, Matthew Wilcox wrote:
> On Sat, Sep 25, 2021 at 04:36:42PM +0100, David Howells wrote:
>> Matthew Wilcox <willy@infradead.org> wrote:
>>
>>> On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
>>>> Delete the BIO-generating swap read/write paths and always use ->swap_rw().
>>>> This puts the mapping layer in the filesystem.
>>>
>>> Is SWP_FS_OPS now unused after this patch?
>>
>> Ummm.  Interesting question - it's only used in swap_set_page_dirty():
>>
>> int swap_set_page_dirty(struct page *page)
>> {
>> 	struct swap_info_struct *sis = page_swap_info(page);
>>
>> 	if (data_race(sis->flags & SWP_FS_OPS)) {
>> 		struct address_space *mapping = sis->swap_file->f_mapping;
>>
>> 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
>> 		return mapping->a_ops->set_page_dirty(page);
>> 	} else {
>> 		return __set_page_dirty_no_writeback(page);
>> 	}
>> }
> 
> I suspect that's no longer necessary.  NFS was the only filesystem
> using SWP_FS_OPS and ...
> 
> fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> 
> so it's not like NFS does anything special to reserve memory to write
> back swap pages.
> 
>>> Also, do we still need ->swap_activate and ->swap_deactivate?
>>
>> f2fs does quite a lot of work in its ->swap_activate(), as does btrfs.  I'm
>> not sure how necessary it is.  cifs looks like it intends to use it, but it's
>> not fully implemented yet.  zonefs and nfs do some checking, including hole
>> checking in nfs's case.  nfs also does some setting up for the sunrpc
>> transport.
>>
>> btrfs, cifs, f2fs and nfs all supply ->swap_deactivate() to undo the effects
>> of the activation.
> 
> Right ... so my question really is, now that we're doing I/O through
> aops->direct_IO (or ->swap_rw), do those magic things need to be done?
> After all, open(O_DIRECT) doesn't do these same magic things.  They're
> really there to allow the direct-to-BIO path to work, and you're removing
> that here.

For zonefs, ->swap_activate() checks that the user is not trying to use a
sequential write only file for swap. Swap cannot work on these files as there
are no guarantees that the writes will be sequential.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-26 23:08       ` Damien Le Moal
@ 2021-09-27  1:25         ` Dave Chinner
  2021-09-27  1:41           ` Damien Le Moal
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2021-09-27  1:25 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Matthew Wilcox, David Howells, hch, trond.myklebust, Jens Axboe,
	Darrick J. Wong, linux-block, linux-xfs, linux-fsdevel, linux-mm,
	darrick.wong, viro, jlayton, torvalds, linux-nfs, linux-kernel

On Mon, Sep 27, 2021 at 08:08:53AM +0900, Damien Le Moal wrote:
> On 2021/09/26 2:09, Matthew Wilcox wrote:
> > On Sat, Sep 25, 2021 at 04:36:42PM +0100, David Howells wrote:
> >> Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >>> On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
> >>>> Delete the BIO-generating swap read/write paths and always use ->swap_rw().
> >>>> This puts the mapping layer in the filesystem.
> >>>
> >>> Is SWP_FS_OPS now unused after this patch?
> >>
> >> Ummm.  Interesting question - it's only used in swap_set_page_dirty():
> >>
> >> int swap_set_page_dirty(struct page *page)
> >> {
> >> 	struct swap_info_struct *sis = page_swap_info(page);
> >>
> >> 	if (data_race(sis->flags & SWP_FS_OPS)) {
> >> 		struct address_space *mapping = sis->swap_file->f_mapping;
> >>
> >> 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> >> 		return mapping->a_ops->set_page_dirty(page);
> >> 	} else {
> >> 		return __set_page_dirty_no_writeback(page);
> >> 	}
> >> }
> > 
> > I suspect that's no longer necessary.  NFS was the only filesystem
> > using SWP_FS_OPS and ...
> > 
> > fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
> > 
> > so it's not like NFS does anything special to reserve memory to write
> > back swap pages.
> > 
> >>> Also, do we still need ->swap_activate and ->swap_deactivate?
> >>
> >> f2fs does quite a lot of work in its ->swap_activate(), as does btrfs.  I'm
> >> not sure how necessary it is.  cifs looks like it intends to use it, but it's
> >> not fully implemented yet.  zonefs and nfs do some checking, including hole
> >> checking in nfs's case.  nfs also does some setting up for the sunrpc
> >> transport.
> >>
> >> btrfs, cifs, f2fs and nfs all supply ->swap_deactivate() to undo the effects
> >> of the activation.
> > 
> > Right ... so my question really is, now that we're doing I/O through
> > aops->direct_IO (or ->swap_rw), do those magic things need to be done?
> > After all, open(O_DIRECT) doesn't do these same magic things.  They're
> > really there to allow the direct-to-BIO path to work, and you're removing
> > that here.
> 
> For zonefs, ->swap_activate() checks that the user is not trying to use a
> sequential write only file for swap. Swap cannot work on these files as there
> are no guarantees that the writes will be sequential.

iomap_swapfile_activate() is used by ext4, XFS and zonefs. It checks
there are no holes in the file, no shared extents, no inline
extents, the swap info block device matches the block device the
extent is mapped to (i.e. filesystems can have more than one bdev,
swapfile only supports files on sb->s_bdev), etc.

Also, I noticed, iomap_swapfile_add_extent() filters out extents
that are smaller than PAGE_SIZE, and aligns larger extents to
PAGE_SIZE. This ensures that when fs block size != PAGE_SIZE,
only a single IO per page being swapped is required. i.e. the
DIO path may change the "one page, one bio, one IO" behaviour that
the current swapfile mapping guarantees.
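
The trimming described above can be modelled in userspace as follows.
This is a sketch under assumptions, not the kernel function: the
4096-byte page size, struct page_range and trim_extent_to_pages() are
all hypothetical.  The start of the extent is rounded up to a page
boundary and the end is rounded down, so an extent that does not cover
at least one whole page contributes nothing:

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Hypothetical result type: whole pages usable for swap in one extent. */
struct page_range {
	uint64_t first_page;	/* index of the first whole page */
	uint64_t nr_pages;	/* number of whole pages, 0 if none */
};

/*
 * Round the extent start up and the extent end down to page boundaries.
 * addr and length are in bytes.
 */
struct page_range trim_extent_to_pages(uint64_t addr, uint64_t length)
{
	uint64_t first = (addr + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	uint64_t next = (addr + length) / MODEL_PAGE_SIZE;
	struct page_range r;

	r.first_page = first;
	r.nr_pages = next > first ? next - first : 0;
	return r;
}
```

For example, a 2048-byte extent starting at byte 1024 yields zero pages,
while a 12288-byte extent starting at the same address yields two whole
pages beginning at page index 1.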

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-27  1:25         ` Dave Chinner
@ 2021-09-27  1:41           ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2021-09-27  1:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, David Howells, hch, trond.myklebust, Jens Axboe,
	Darrick J. Wong, linux-block, linux-xfs, linux-fsdevel, linux-mm,
	darrick.wong, viro, jlayton, torvalds, linux-nfs, linux-kernel

On 2021/09/27 10:25, Dave Chinner wrote:
> On Mon, Sep 27, 2021 at 08:08:53AM +0900, Damien Le Moal wrote:
>> On 2021/09/26 2:09, Matthew Wilcox wrote:
>>> On Sat, Sep 25, 2021 at 04:36:42PM +0100, David Howells wrote:
>>>> Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>>> On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
>>>>>> Delete the BIO-generating swap read/write paths and always use ->swap_rw().
>>>>>> This puts the mapping layer in the filesystem.
>>>>>
>>>>> Is SWP_FS_OPS now unused after this patch?
>>>>
>>>> Ummm.  Interesting question - it's only used in swap_set_page_dirty():
>>>>
>>>> int swap_set_page_dirty(struct page *page)
>>>> {
>>>> 	struct swap_info_struct *sis = page_swap_info(page);
>>>>
>>>> 	if (data_race(sis->flags & SWP_FS_OPS)) {
>>>> 		struct address_space *mapping = sis->swap_file->f_mapping;
>>>>
>>>> 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
>>>> 		return mapping->a_ops->set_page_dirty(page);
>>>> 	} else {
>>>> 		return __set_page_dirty_no_writeback(page);
>>>> 	}
>>>> }
>>>
>>> I suspect that's no longer necessary.  NFS was the only filesystem
>>> using SWP_FS_OPS and ...
>>>
>>> fs/nfs/file.c:  .set_page_dirty = __set_page_dirty_nobuffers,
>>>
>>> so it's not like NFS does anything special to reserve memory to write
>>> back swap pages.
>>>
>>>>> Also, do we still need ->swap_activate and ->swap_deactivate?
>>>>
>>>> f2fs does quite a lot of work in its ->swap_activate(), as does btrfs.  I'm
>>>> not sure how necessary it is.  cifs looks like it intends to use it, but it's
>>>> not fully implemented yet.  zonefs and nfs do some checking, including hole
>>>> checking in nfs's case.  nfs also does some setting up for the sunrpc
>>>> transport.
>>>>
>>>> btrfs, cifs, f2fs and nfs all supply ->swap_deactivate() to undo the effects
>>>> of the activation.
>>>
>>> Right ... so my question really is, now that we're doing I/O through
>>> aops->direct_IO (or ->swap_rw), do those magic things need to be done?
>>> After all, open(O_DIRECT) doesn't do these same magic things.  They're
>>> really there to allow the direct-to-BIO path to work, and you're removing
>>> that here.
>>
>> For zonefs, ->swap_activate() checks that the user is not trying to use a
>> sequential write only file for swap. Swap cannot work on these files as there
>> are no guarantees that the writes will be sequential.
> 
> iomap_swapfile_activate() is used by ext4, XFS and zonefs. It checks
> there are no holes in the file, no shared extents, no inline
> extents, the swap info block device matches the block device the
> extent is mapped to (i.e. filesystems can have more than one bdev,
> swapfile only supports files on sb->s_bdev), etc.

OK. But I was referring to the additional check in zonefs_swap_activate() before
iomap_swapfile_activate() is called. We must prevent that function from being
called for a full sequential write only zone file since such a file will pass all
checks (no hole, all extents written etc) but cannot be used for swap since it
is not writable when full (no overwrites allowed in sequential zones).

> 
> Also, I noticed, iomap_swapfile_add_extent() filters out extents
> that are smaller than PAGE_SIZE, and aligns larger extents to
> PAGE_SIZE. This ensures that when fs block size != PAGE_SIZE,
> only a single IO per page being swapped is required. i.e. the
> DIO path may change the "one page, one bio, one IO" behaviour that
> the current swapfile mapping guarantees.
> 
> Cheers,
> 
> Dave.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths
  2021-09-25 15:36   ` David Howells
  2021-09-25 17:09     ` Matthew Wilcox
@ 2021-09-27 20:03     ` David Sterba
  1 sibling, 0 replies; 28+ messages in thread
From: David Sterba @ 2021-09-27 20:03 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, hch, trond.myklebust, Jens Axboe,
	Darrick J. Wong, linux-block, linux-xfs, linux-fsdevel, linux-mm,
	darrick.wong, viro, jlayton, torvalds, linux-nfs, linux-kernel

On Sat, Sep 25, 2021 at 04:36:42PM +0100, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Fri, Sep 24, 2021 at 06:19:23PM +0100, David Howells wrote:
> > > Delete the BIO-generating swap read/write paths and always use ->swap_rw().
> > > This puts the mapping layer in the filesystem.
> > 
> > Is SWP_FS_OPS now unused after this patch?
> 
> Ummm.  Interesting question - it's only used in swap_set_page_dirty():
> 
> int swap_set_page_dirty(struct page *page)
> {
> 	struct swap_info_struct *sis = page_swap_info(page);
> 
> 	if (data_race(sis->flags & SWP_FS_OPS)) {
> 		struct address_space *mapping = sis->swap_file->f_mapping;
> 
> 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> 		return mapping->a_ops->set_page_dirty(page);
> 	} else {
> 		return __set_page_dirty_no_writeback(page);
> 	}
> }
> 
> 
> > Also, do we still need ->swap_activate and ->swap_deactivate?
> 
> f2fs does quite a lot of work in its ->swap_activate(), as does btrfs.  I'm
> not sure how necessary it is.

Yes we still need it for btrfs. Besides checking the conditions similar
to what iomap_swapfile_activate does on the file itself, we need to
exclude other operations potentially changing the mapping on the level
of block groups. This is namely relocation, used to implement several
other things like resize or balance. There's an exclusion at the
beginning of btrfs_swap_activate. Right now I don't see how we could
make sure that the swapfile requirements would be satisfied without it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (9 preceding siblings ...)
  2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
@ 2021-09-27 20:07 ` David Sterba
  2021-09-28  3:11 ` NeilBrown
  2021-09-29 15:45 ` David Howells
  12 siblings, 0 replies; 28+ messages in thread
From: David Sterba @ 2021-09-27 20:07 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block,
	ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton,
	Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro,
	torvalds, linux-kernel

On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote:
> 
> Hi Willy, Trond, Christoph,
> 
> Here's v3 of a change to make reads and writes from the swapfile use async
> DIO, adding a new ->swap_rw() address_space method, rather than readpage()
> or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
> checks that prevent swapfiles from working, plus a bunch of other checks
> that may or may not be necessary.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> There are additional patches to get rid of noop_direct_IO and replace it
> with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> new ->swap_rw method and thence remove the direct BIO submission paths from
> swap.
> 
> I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> that's necessary, but it seems accounting related.
> 
> The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
> swapfile all seem to work fine.  Btrfs refuses to swapon because the file
> might be CoW'd.  I've tried doing "chattr +C", but that didn't help.

There was probably some step missing. The file must not have holes, so
either do 'dd' to the right size or use fallocate (which is recommended
in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests
exercising swapfile (grep -l _format_swapfile tests/generic/*) so you
could try that without having to set up the swapfile manually.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (10 preceding siblings ...)
  2021-09-27 20:07 ` David Sterba
@ 2021-09-28  3:11 ` NeilBrown
  2021-09-30 15:54     ` Steve French
  2021-09-29 15:45 ` David Howells
  12 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2021-09-28  3:11 UTC (permalink / raw)
  To: David Howells
  Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block,
	ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton,
	Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu,
	Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason,
	David Sterba, Minchan Kim, Steve French, Dan Magenheimer,
	linux-nfs, Ilya Dryomov, linux-btrfs, dhowells, viro, torvalds,
	linux-kernel

On Sat, 25 Sep 2021, David Howells wrote:
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?

SUSE definitely want to keep this functionality.  We have customers
using it.
I agree it would be good if it was being tested somewhere....

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
                   ` (11 preceding siblings ...)
  2021-09-28  3:11 ` NeilBrown
@ 2021-09-29 15:45 ` David Howells
  12 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2021-09-29 15:45 UTC (permalink / raw)
  To: dsterba
  Cc: dhowells, willy, Chris Mason, linux-block, ceph-devel, linux-mm,
	linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, linux-nfs,
	Ilya Dryomov, linux-btrfs, linux-kernel

David Sterba <dsterba@suse.cz> wrote:

> > There are additional patches to get rid of noop_direct_IO and replace it
> > with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
> > new ->swap_rw method and thence remove the direct BIO submission paths from
> > swap.
> > 
> > I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
> > that's necessary, but it seems accounting related.
>
> There was probably some step missing. The file must not have holes, so
> either do 'dd' to the right size or use fallocate (which is recommended
> in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests
> exercising swapfile (grep -l _format_swapfile tests/generic/*) so you
> could try that without having to set up the swapfile manually.

Yeah.  As advised elsewhere, I removed the file and recreated it, doing the
chattr before extending the file.  At that point swapon succeeded.  Swapping
itself didn't work though, and various userspace programs started dying.  I'm
guessing my btrfs_swap_rw() is wrong somehow.

David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
  2021-09-28  3:11 ` NeilBrown
@ 2021-09-30 15:54     ` Steve French
  0 siblings, 0 replies; 28+ messages in thread
From: Steve French @ 2021-09-30 15:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Howells, Matthew Wilcox, Christoph Hellwig,
	Trond Myklebust, Theodore Ts'o, linux-block, ceph-devel,
	Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger,
	Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik,
	Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4,
	CIFS, Chris Mason, David Sterba, Minchan Kim, Steve French,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, Al Viro,
	Linus Torvalds, LKML

On Mon, Sep 27, 2021 at 10:12 PM NeilBrown <neilb@suse.de> wrote:
>
> On Sat, 25 Sep 2021, David Howells wrote:
> > Whilst trying to make this work, I found that NFS's support for swapfiles
> > seems to have been non-functional since Aug 2019 (I think), so the first
> > patch fixes that.  Question is: do we actually *want* to keep this
> > functionality, given that it seems that no one's tested it with an upstream
> > kernel in the last couple of years?
>
> SUSE definitely want to keep this functionality.  We have customers
> using it.
> I agree it would be good if it was being tested somewhere....
>

I am trying to work through the testing of swap over SMB3 mounts
since there are use cases where you need to expand the swap
space to remote storage and so this requirement comes up.  The main difficulty
I run into is forgetting to mount with the mount options that store mode bits
(so the swap file has the right permissions), and debugging some of the
xfstests relating to swap can be a little confusing.

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-09-30 15:54 UTC | newest]

Thread overview: 28+ messages
2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells
2021-09-24 17:18 ` [PATCH v3 1/9] mm: Remove the callback func argument from __swap_writepage() David Howells
2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells
2021-09-24 20:10   ` Matthew Wilcox
2021-09-24 17:18 ` [PATCH v3 3/9] mm: Make swap_readpage() void David Howells
2021-09-24 22:07   ` Matthew Wilcox
2021-09-24 17:18 ` [PATCH v3 4/9] Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP David Howells
2021-09-26 21:56   ` Dave Chinner
2021-09-24 17:18 ` [PATCH v3 5/9] mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage() David Howells
2021-09-24 17:18 ` [PATCH v3 6/9] mm: Make __swap_writepage() do async DIO if asked for it David Howells
2021-09-24 17:19 ` [PATCH v3 7/9] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
2021-09-24 17:19 ` [PATCH v3 8/9] block, btrfs, ext4, xfs: Implement swap_rw David Howells
2021-09-24 17:19 ` [PATCH v3 9/9] mm: Remove swap BIO paths and only use DIO paths David Howells
2021-09-25 14:56   ` Matthew Wilcox
2021-09-25 15:36   ` David Howells
2021-09-25 17:09     ` Matthew Wilcox
2021-09-26 23:08       ` Damien Le Moal
2021-09-27  1:25         ` Dave Chinner
2021-09-27  1:41           ` Damien Le Moal
2021-09-27 20:03     ` David Sterba
2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner
2021-09-26  3:10   ` Matthew Wilcox
2021-09-26 22:36     ` Dave Chinner
2021-09-27 20:07 ` David Sterba
2021-09-28  3:11 ` NeilBrown
2021-09-30 15:54   ` Steve French
2021-09-30 15:54     ` Steve French
2021-09-29 15:45 ` David Howells
