linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles
@ 2021-08-12 20:21 David Howells
  2021-08-12 20:22 ` [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: David Howells @ 2021-08-12 20:21 UTC (permalink / raw)
  To: willy
  Cc: Darrick J. Wong, Seth Jennings, Bob Liu, linux-nfs,
	Christoph Hellwig, Dan Magenheimer, Trond Myklebust,
	Trond Myklebust, Minchan Kim, dhowells, dhowells,
	trond.myklebust, darrick.wong, hch, viro, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel


Hi Willy, Trond,

Here's v2 of a change to make reads and writes from the swapfile use async
DIO, via the ->direct_IO() method, rather than readpage(), as requested by
Willy.

The consensus seems to be that this is probably the wrong approach and
->direct_IO() needs replacing with a swap-specific method - but that will
require a bunch of filesystems to be modified also.

Note that I'm refcounting the kiocb struct around the call to
->direct_IO().  This is required in cachefiles where I'm going in through
the ->read_iter/->write_iter methods as both core routines and filesystems
touch kiocb *after* calling the completion routine.  Should this practice
be disallowed?

I've also added an additional patch to try and remove the bio-submission
path entirely from swap_readpage/writepage code and only go down the
->direct_IO route, but this fails spectacularly (from I/O errors to ATA
failure messages on the test disk using a normal swapspace).  This was
suggested by Willy as the bio-submission code is a potential data corruptor
if it's asked to write out a compound page that crosses extent boundaries.

Whilst trying to make this work, I found that NFS's support for swapfiles
seems to have been non-functional since Aug 2019 (I think), so the first
patch fixes that.  Question is: do we actually *want* to keep this
functionality, given that it seems that no one's tested it with an upstream
kernel in the last couple of years?

My patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=swap-dio

I tested this using the procedure and program outlined in the first patch.

I also encountered occasional instances of the following warning, so I'm
wondering if there's a scheduling problem somewhere:

BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    in-flight: 1565:fill_page_cache_func
workqueue events_highpri: flags=0x10
  pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2
    in-flight: 1547:fill_page_cache_func
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
    in-flight: 1811:fill_page_cache_func
workqueue events_unbound: flags=0x2
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5
    pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn
workqueue events_power_efficient: flags=0x82
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6
    pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean
workqueue writeback: flags=0x4a
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4
    in-flight: 433(RESCUER):wb_workfn
workqueue rpciod: flags=0xa
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40
    in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule
    pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule
workqueue ext4-rsv-conversion: flags=0x2000a
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6
pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20
pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29
pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623
pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859

Note that this is due to DIO writes to NFS only, as far as I can tell, and
that no reads had happened yet.

Changes:
========
ver #2:
   - Remove the callback param to __swap_writepage() as it's invariant.
   - Allocate the kiocb on the stack in sync mode.
   - Do an async DIO write if WB_SYNC_ALL isn't set.
   - Try to remove the BIO submission paths.

David

Link: https://lore.kernel.org/r/162876946134.3068428.15475611190876694695.stgit@warthog.procyon.org.uk/ # v1
---
David Howells (5):
      nfs: Fix write to swapfile failure due to generic_write_checks()
      mm: Remove the callback func argument from __swap_writepage()
      mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
      mm: Make __swap_writepage() do async DIO if asked for it
      mm: Remove swap BIO paths and only use DIO paths [BROKEN]


 fs/direct-io.c       |   2 +
 include/linux/bio.h  |   2 +
 include/linux/fs.h   |   1 +
 include/linux/swap.h |   4 +-
 mm/page_io.c         | 379 ++++++++++++++++++++++---------------------
 mm/zswap.c           |   2 +-
 6 files changed, 204 insertions(+), 186 deletions(-)



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks()
  2021-08-12 20:21 [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles David Howells
@ 2021-08-12 20:22 ` David Howells
  2021-08-13  3:09   ` Matthew Wilcox
  2021-08-19 22:38   ` NeilBrown
  2021-08-12 20:22 ` [RFC PATCH v2 2/5] mm: Remove the callback func argument from __swap_writepage() David Howells
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: David Howells @ 2021-08-12 20:22 UTC (permalink / raw)
  To: willy
  Cc: Darrick J. Wong, Christoph Hellwig, Trond Myklebust, linux-nfs,
	dhowells, dhowells, trond.myklebust, darrick.wong, hch, viro,
	jlayton, sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

Trying to use a swapfile on NFS results in every DIO write failing with
ETXTBSY because generic_write_checks(), as called by nfs_direct_write()
from nfs_direct_IO(), forbids writes to swapfiles.

Fix this by introducing a new kiocb flag, IOCB_SWAP, that's set by the swap
code to indicate that the swapper is doing this operation and so overrule
the check in generic_write_checks().

Without this patch, the following is seen:

	Write error on dio swapfile (3800334336)

Altering __swap_writepage() to show the error shows:

	Write error (-26) on dio swapfile (3800334336)

Tested by swapping off all swap partitions and then swapping on a prepared
NFS file (CONFIG_NFS_SWAP=y is also needed).  Enough copies of the
following program then need to be run to force swapping to occur (at least
one per gigabyte of RAM):

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	int main()
	{
		unsigned int pid = getpid(), iterations = 0;
		size_t i, j, size = 1024 * 1024 * 1024;
		char *p;
		bool mismatch;
		p = malloc(size);
		if (!p) {
			perror("malloc");
			exit(1);
		}
		srand(pid);
		for (i = 0; i < size; i += 4)
			*(unsigned int *)(p + i) = rand();
		do {
			for (j = 0; j < 16; j++) {
				for (i = 0; i < size; i += 4096)
					*(unsigned int *)(p + i) += 1;
				iterations++;
			}
			mismatch = false;
			srand(pid);
			for (i = 0; i < size; i += 4) {
				unsigned int r = rand();
				unsigned int v = *(unsigned int *)(p + i);
				if (i % 4096 == 0)
					v -= iterations;
				if (v != r) {
					fprintf(stderr, "mismatch %zx: %x != %x (diff %x)\n",
						i, v, r, v - r);
					mismatch = true;
				}
			}
		} while (!mismatch);
		exit(1);
	}


Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Darrick J. Wong <darrick.wong@oracle.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Trond Myklebust <trond.myklebust@primarydata.com>
cc: linux-nfs@vger.kernel.org
---

 fs/read_write.c    |    2 +-
 include/linux/fs.h |    1 +
 mm/page_io.c       |    7 ++++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9db7adf160d2..daef721ca67e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1646,7 +1646,7 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	loff_t count;
 	int ret;
 
-	if (IS_SWAPFILE(inode))
+	if (IS_SWAPFILE(inode) && !(iocb->ki_flags & IOCB_SWAP))
 		return -ETXTBSY;
 
 	if (!iov_iter_count(from))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..b3e6a20f28ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -319,6 +319,7 @@ enum rw_hint {
 /* iocb->ki_waitq is valid */
 #define IOCB_WAITQ		(1 << 19)
 #define IOCB_NOIO		(1 << 20)
+#define IOCB_SWAP		(1 << 21)	/* This is a swap request */
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/mm/page_io.c b/mm/page_io.c
index d597bc6e6e45..edb72bf624d2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -303,7 +303,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
 		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos = page_file_offset(page);
+		kiocb.ki_pos	= page_file_offset(page);
+		kiocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
 
 		set_page_writeback(page);
 		unlock_page(page);
@@ -324,8 +325,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 			 */
 			set_page_dirty(page);
 			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
-					   page_file_offset(page));
+			pr_err_ratelimited("Write error (%d) on dio swapfile (%llu)\n",
+					   ret, page_file_offset(page));
 		}
 		end_page_writeback(page);
 		return ret;



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 2/5] mm: Remove the callback func argument from __swap_writepage()
  2021-08-12 20:21 [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles David Howells
  2021-08-12 20:22 ` [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
@ 2021-08-12 20:22 ` David Howells
  2021-08-13  7:03   ` Christoph Hellwig
  2021-08-12 20:22 ` [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: David Howells @ 2021-08-12 20:22 UTC (permalink / raw)
  To: willy
  Cc: Seth Jennings, Bob Liu, Minchan Kim, Dan Magenheimer, dhowells,
	dhowells, trond.myklebust, darrick.wong, hch, viro, jlayton,
	sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

Remove the callback func argument from __swap_writepage() as it's
end_swap_bio_write() in both places that call it.

This reverts:

	commit 1eec6702a80e04416d528846a5ff2122484d95ec
	mm: allow for outstanding swap writeback accounting

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
cc: Bob Liu <bob.liu@oracle.com>
cc: Minchan Kim <minchan@kernel.org>
cc: Dan Magenheimer <dan.magenheimer@oracle.com>
---

 include/linux/swap.h |    4 +---
 mm/page_io.c         |    9 ++++-----
 mm/zswap.c           |    2 +-
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 81801ba78b1e..b785bb041a44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -425,9 +425,7 @@ extern void kswapd_stop(int nid);
 /* linux/mm/page_io.c */
 extern int swap_readpage(struct page *page, bool do_poll);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio);
-extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
-	bio_end_io_t end_write_func);
+extern int __swap_writepage(struct page *page, struct writeback_control *wbc);
 extern int swap_set_page_dirty(struct page *page);
 
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
diff --git a/mm/page_io.c b/mm/page_io.c
index edb72bf624d2..62cabcdfcec6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,7 +26,7 @@
 #include <linux/uio.h>
 #include <linux/sched/task.h>
 
-void end_swap_bio_write(struct bio *bio)
+static void end_swap_bio_write(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
 
@@ -249,7 +249,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		end_page_writeback(page);
 		goto out;
 	}
-	ret = __swap_writepage(page, wbc, end_swap_bio_write);
+	ret = __swap_writepage(page, wbc);
 out:
 	return ret;
 }
@@ -282,8 +282,7 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
-int __swap_writepage(struct page *page, struct writeback_control *wbc,
-		bio_end_io_t end_write_func)
+int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct bio *bio;
 	int ret;
@@ -342,7 +341,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	bio_set_dev(bio, sis->bdev);
 	bio->bi_iter.bi_sector = swap_page_sector(page);
 	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
-	bio->bi_end_io = end_write_func;
+	bio->bi_end_io = end_swap_bio_write;
 	bio_add_page(bio, page, thp_size(page), 0);
 
 	bio_associate_blkg_from_page(bio, page);
diff --git a/mm/zswap.c b/mm/zswap.c
index 7944e3e57e78..f38e34917aa3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1011,7 +1011,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
 	SetPageReclaim(page);
 
 	/* start writeback */
-	__swap_writepage(page, &wbc, end_swap_bio_write);
+	__swap_writepage(page, &wbc);
 	put_page(page);
 	zswap_written_back_pages++;
 



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 20:21 [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles David Howells
  2021-08-12 20:22 ` [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
  2021-08-12 20:22 ` [RFC PATCH v2 2/5] mm: Remove the callback func argument from __swap_writepage() David Howells
@ 2021-08-12 20:22 ` David Howells
  2021-08-12 21:23   ` Matthew Wilcox
  2021-08-13  7:12   ` Christoph Hellwig
  2021-08-12 20:22 ` [RFC PATCH v2 4/5] mm: Make __swap_writepage() do async DIO if asked for it David Howells
  2021-08-12 20:22 ` [RFC PATCH v2 5/5] mm: Remove swap BIO paths and only use DIO paths [BROKEN] David Howells
  4 siblings, 2 replies; 13+ messages in thread
From: David Howells @ 2021-08-12 20:22 UTC (permalink / raw)
  To: willy
  Cc: dhowells, dhowells, trond.myklebust, darrick.wong, hch, viro,
	jlayton, sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
the ->direct_IO() method on the filesystem rather then ->readpage().

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/fs.h |    1 
 mm/page_io.c       |  115 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3e6a20f28ef..94c47b9b5b1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -336,6 +336,7 @@ struct kiocb {
 	union {
 		unsigned int		ki_cookie; /* for ->iopoll */
 		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+		struct page	*ki_swap_page;	/* For swapfile_read/write */
 	};
 
 	randomized_struct_fields_end
diff --git a/mm/page_io.c b/mm/page_io.c
index 62cabcdfcec6..92ec4a7b0545 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,6 +26,24 @@
 #include <linux/uio.h>
 #include <linux/sched/task.h>
 
+/*
+ * Keep track of the kiocb we're using to do async DIO.  We have to
+ * refcount it until various things stop looking at the kiocb *after*
+ * calling ->ki_complete().
+ */
+struct swapfile_kiocb {
+	struct kiocb		iocb;
+	refcount_t		ki_refcnt;
+};
+
+static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
+{
+	if (refcount_dec_and_test(&ki->ki_refcnt)) {
+		fput(ki->iocb.ki_filp);
+		kfree(ki);
+	}
+}
+
 static void end_swap_bio_write(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
@@ -353,6 +371,96 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
+static void __swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct page *page = iocb->ki_swap_page;
+
+	if (ret == PAGE_SIZE) {
+		count_vm_event(PSWPIN);
+		SetPageUptodate(page);
+	} else {
+		SetPageError(page);
+		pr_err_ratelimited("Read error (%ld) on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	}
+
+	unlock_page(page);
+}
+
+static void swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);
+
+	__swapfile_read_complete(iocb, ret, ret2);
+	swapfile_put_kiocb(ki);
+}
+
+static int swapfile_read_sync(struct swap_info_struct *sis, struct page *page)
+{
+	struct kiocb kiocb;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page	= page,
+		.bv_len		= thp_size(page),
+		.bv_offset	= 0
+	};
+	struct iov_iter to;
+	int ret;
+
+	init_sync_kiocb(&kiocb, swap_file);
+	kiocb.ki_swap_page	= page;
+	kiocb.ki_pos		= page_file_offset(page);
+	kiocb.ki_filp		= swap_file;
+	kiocb.ki_flags		= IOCB_DIRECT | IOCB_SWAP;
+	/* Should set IOCB_HIPRI too, but the box becomes unresponsive whilst
+	 * putting out occasional messages about the NFS sunrpc scheduling
+	 * tasks being hung.
+	 */
+
+	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));
+	ret = swap_file->f_mapping->a_ops->direct_IO(&kiocb, &to);
+
+	__swapfile_read_complete(&kiocb, ret, 0);
+	return (ret > 0) ? 0 : ret;
+}
+
+static int swapfile_read(struct swap_info_struct *sis, struct page *page,
+			 bool synchronous)
+{
+	struct swapfile_kiocb *ki;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page = page,
+		.bv_len  = thp_size(page),
+		.bv_offset = 0
+	};
+	struct iov_iter to;
+	int ret;
+
+	if (synchronous)
+		return swapfile_read_sync(sis, page);
+
+	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
+	if (!ki)
+		return -ENOMEM;
+
+	refcount_set(&ki->ki_refcnt, 2);
+	init_sync_kiocb(&ki->iocb, swap_file);
+	ki->iocb.ki_swap_page	= page;
+	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_SWAP;
+	ki->iocb.ki_pos		= page_file_offset(page);
+	ki->iocb.ki_filp	= get_file(swap_file);
+	ki->iocb.ki_complete	= swapfile_read_complete;
+
+	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));
+	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to);
+
+	if (ret != -EIOCBQUEUED)
+		swapfile_read_complete(&ki->iocb, ret, 0);
+	swapfile_put_kiocb(ki);
+	return (ret > 0) ? 0 : ret;
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -380,12 +488,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
-			count_vm_event(PSWPIN);
+		ret = swapfile_read(sis, page, synchronous);
 		goto out;
 	}
 



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 4/5] mm: Make __swap_writepage() do async DIO if asked for it
  2021-08-12 20:21 [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles David Howells
                   ` (2 preceding siblings ...)
  2021-08-12 20:22 ` [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
@ 2021-08-12 20:22 ` David Howells
  2021-08-12 20:22 ` [RFC PATCH v2 5/5] mm: Remove swap BIO paths and only use DIO paths [BROKEN] David Howells
  4 siblings, 0 replies; 13+ messages in thread
From: David Howells @ 2021-08-12 20:22 UTC (permalink / raw)
  To: willy
  Cc: Trond Myklebust, linux-nfs, dhowells, dhowells, trond.myklebust,
	darrick.wong, hch, viro, jlayton, sfrench, torvalds, linux-nfs,
	linux-mm, linux-fsdevel, linux-kernel

Make __swap_writepage()'s DIO path do sync DIO if the writeback control's
sync mode is WB_SYNC_ALL and async DIO if not.

Note that this causes hanging processes in sunrpc if the swapfile is on
NFS.  I'm not sure whether it's due to misscheduling or something else.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: linux-nfs@vger.kernel.org
---

 mm/page_io.c |  145 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 102 insertions(+), 43 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 92ec4a7b0545..dae7bbd7a842 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -300,6 +300,105 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
+static void __swapfile_write_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct page *page = iocb->ki_swap_page;
+
+	if (ret == thp_size(page)) {
+		count_vm_event(PSWPOUT);
+		ret = 0;
+	} else {
+		/*
+		 * In the case of swap-over-nfs, this can be a
+		 * temporary failure if the system has limited memory
+		 * for allocating transmit buffers.  Mark the page
+		 * dirty and avoid folio_rotate_reclaimable but
+		 * rate-limit the messages but do not flag PageError
+		 * like the normal direct-to-bio case as it could be
+		 * temporary.
+		 */
+		set_page_dirty(page);
+		ClearPageReclaim(page);
+		pr_err_ratelimited("Write error (%ld) on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	}
+	end_page_writeback(page);
+}
+
+static void swapfile_write_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);
+
+	__swapfile_write_complete(iocb, ret, ret2);
+	swapfile_put_kiocb(ki);
+}
+
+static int swapfile_write_sync(struct swap_info_struct *sis,
+			       struct page *page, struct writeback_control *wbc)
+{
+	struct kiocb kiocb;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page	= page,
+		.bv_len		= thp_size(page),
+		.bv_offset	= 0
+	};
+	struct iov_iter from;
+	int ret;
+
+	init_sync_kiocb(&kiocb, swap_file);
+	kiocb.ki_swap_page	= page;
+	kiocb.ki_pos		= page_file_offset(page);
+	kiocb.ki_flags		= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
+
+	set_page_writeback(page);
+	unlock_page(page);
+
+	iov_iter_bvec(&from, WRITE, &bv, 1, thp_size(page));
+	ret = swap_file->f_mapping->a_ops->direct_IO(&kiocb, &from);
+	__swapfile_write_complete(&kiocb, ret, 0);
+	return (ret > 0) ? 0 : ret;
+}
+
+static int swapfile_write(struct swap_info_struct *sis,
+			  struct page *page, struct writeback_control *wbc)
+{
+	struct swapfile_kiocb *ki;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page	= page,
+		.bv_len		= thp_size(page),
+		.bv_offset	= 0
+	};
+	struct iov_iter from;
+	int ret;
+
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return swapfile_write_sync(sis, page, wbc);
+
+	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
+	if (!ki)
+		return -ENOMEM;
+
+	refcount_set(&ki->ki_refcnt, 2);
+	iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
+	init_sync_kiocb(&ki->iocb, swap_file);
+	ki->iocb.ki_swap_page	= page;
+	ki->iocb.ki_pos		= page_file_offset(page);
+	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
+	ki->iocb.ki_complete	= swapfile_write_complete;
+	get_file(swap_file);
+
+	set_page_writeback(page);
+	unlock_page(page);
+	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &from);
+
+	if (ret != -EIOCBQUEUED)
+		swapfile_write_complete(&ki->iocb, ret, 0);
+	swapfile_put_kiocb(ki);
+	return (ret > 0) ? 0 : ret;
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct bio *bio;
@@ -307,47 +406,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc)
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct kiocb kiocb;
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-		struct bio_vec bv = {
-			.bv_page = page,
-			.bv_len  = PAGE_SIZE,
-			.bv_offset = 0
-		};
-		struct iov_iter from;
-
-		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
-		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos	= page_file_offset(page);
-		kiocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
-
-		set_page_writeback(page);
-		unlock_page(page);
-		ret = mapping->a_ops->direct_IO(&kiocb, &from);
-		if (ret == PAGE_SIZE) {
-			count_vm_event(PSWPOUT);
-			ret = 0;
-		} else {
-			/*
-			 * In the case of swap-over-nfs, this can be a
-			 * temporary failure if the system has limited
-			 * memory for allocating transmit buffers.
-			 * Mark the page dirty and avoid
-			 * folio_rotate_reclaimable but rate-limit the
-			 * messages but do not flag PageError like
-			 * the normal direct-to-bio case as it could
-			 * be temporary.
-			 */
-			set_page_dirty(page);
-			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error (%d) on dio swapfile (%llu)\n",
-					   ret, page_file_offset(page));
-		}
-		end_page_writeback(page);
-		return ret;
-	}
+	if (data_race(sis->flags & SWP_FS_OPS))
+		return swapfile_write(sis, page, wbc);
 
 	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
 	if (!ret) {
@@ -410,7 +470,6 @@ static int swapfile_read_sync(struct swap_info_struct *sis, struct page *page)
 	init_sync_kiocb(&kiocb, swap_file);
 	kiocb.ki_swap_page	= page;
 	kiocb.ki_pos		= page_file_offset(page);
-	kiocb.ki_filp		= swap_file;
 	kiocb.ki_flags		= IOCB_DIRECT | IOCB_SWAP;
 	/* Should set IOCB_HIPRI too, but the box becomes unresponsive whilst
 	 * putting out occasional messages about the NFS sunrpc scheduling
@@ -449,8 +508,8 @@ static int swapfile_read(struct swap_info_struct *sis, struct page *page,
 	ki->iocb.ki_swap_page	= page;
 	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_SWAP;
 	ki->iocb.ki_pos		= page_file_offset(page);
-	ki->iocb.ki_filp	= get_file(swap_file);
 	ki->iocb.ki_complete	= swapfile_read_complete;
+	get_file(swap_file);
 
 	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));
 	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to);



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 5/5] mm: Remove swap BIO paths and only use DIO paths [BROKEN]
  2021-08-12 20:21 [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles David Howells
                   ` (3 preceding siblings ...)
  2021-08-12 20:22 ` [RFC PATCH v2 4/5] mm: Make __swap_writepage() do async DIO if asked for it David Howells
@ 2021-08-12 20:22 ` David Howells
  2021-08-13  7:14   ` Christoph Hellwig
  4 siblings, 1 reply; 13+ messages in thread
From: David Howells @ 2021-08-12 20:22 UTC (permalink / raw)
  To: willy
  Cc: dhowells, dhowells, trond.myklebust, darrick.wong, hch, viro,
	jlayton, sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

[!] NOTE: This doesn't work and might damage your disk's contents.

Delete the BIO-generating swap read/write paths and always use
->direct_IO().  This puts the mapping layer in the filesystem.

This doesn't work - probably due to ki_pos being set to
page_file_offset(page) which then gets remapped.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/direct-io.c      |    2 +
 include/linux/bio.h |    2 +
 mm/page_io.c        |  156 ++-------------------------------------------------
 3 files changed, 9 insertions(+), 151 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b2e86e739d7a..76eec0a68fa4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1216,6 +1216,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
 		dio->op_flags |= REQ_HIPRI;
+	if (iocb->ki_flags & IOCB_SWAP)
+		dio->op_flags |= REQ_SWAP;
 
 	/*
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 2203b686e1f0..da75cfa72ed3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -816,6 +816,8 @@ static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
 	bio->bi_opf |= REQ_HIPRI;
 	if (!is_sync_kiocb(kiocb))
 		bio->bi_opf |= REQ_NOWAIT;
+	if (kiocb->ki_flags & IOCB_SWAP)
+		bio->bi_opf |= REQ_SWAP;
 }
 
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
diff --git a/mm/page_io.c b/mm/page_io.c
index dae7bbd7a842..fb260d9c3973 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -44,30 +44,6 @@ static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
 	}
 }
 
-static void end_swap_bio_write(struct bio *bio)
-{
-	struct page *page = bio_first_page_all(bio);
-
-	if (bio->bi_status) {
-		SetPageError(page);
-		/*
-		 * We failed to write the page out to swap-space.
-		 * Re-dirty the page in order to avoid it being reclaimed.
-		 * Also print a dire warning that things will go BAD (tm)
-		 * very quickly.
-		 *
-		 * Also clear PG_reclaim to avoid folio_rotate_reclaimable()
-		 */
-		set_page_dirty(page);
-		pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n",
-				     MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
-				     (unsigned long long)bio->bi_iter.bi_sector);
-		ClearPageReclaim(page);
-	}
-	end_page_writeback(page);
-	bio_put(bio);
-}
-
 static void swap_slot_free_notify(struct page *page)
 {
 	struct swap_info_struct *sis;
@@ -116,32 +92,6 @@ static void swap_slot_free_notify(struct page *page)
 	}
 }
 
-static void end_swap_bio_read(struct bio *bio)
-{
-	struct page *page = bio_first_page_all(bio);
-	struct task_struct *waiter = bio->bi_private;
-
-	if (bio->bi_status) {
-		SetPageError(page);
-		ClearPageUptodate(page);
-		pr_alert_ratelimited("Read-error on swap-device (%u:%u:%llu)\n",
-				     MAJOR(bio_dev(bio)), MINOR(bio_dev(bio)),
-				     (unsigned long long)bio->bi_iter.bi_sector);
-		goto out;
-	}
-
-	SetPageUptodate(page);
-	swap_slot_free_notify(page);
-out:
-	unlock_page(page);
-	WRITE_ONCE(bio->bi_private, NULL);
-	bio_put(bio);
-	if (waiter) {
-		blk_wake_io_task(waiter);
-		put_task_struct(waiter);
-	}
-}
-
 int generic_swapfile_activate(struct swap_info_struct *sis,
 				struct file *swap_file,
 				sector_t *span)
@@ -281,31 +231,12 @@ static inline void count_swpout_vm_event(struct page *page)
 	count_vm_events(PSWPOUT, thp_nr_pages(page));
 }
 
-#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
-{
-	struct cgroup_subsys_state *css;
-	struct mem_cgroup *memcg;
-
-	memcg = page_memcg(page);
-	if (!memcg)
-		return;
-
-	rcu_read_lock();
-	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
-	bio_associate_blkg_from_css(bio, css);
-	rcu_read_unlock();
-}
-#else
-#define bio_associate_blkg_from_page(bio, page)		do { } while (0)
-#endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
-
 static void __swapfile_write_complete(struct kiocb *iocb, long ret, long ret2)
 {
 	struct page *page = iocb->ki_swap_page;
 
 	if (ret == thp_size(page)) {
-		count_vm_event(PSWPOUT);
+		count_swpout_vm_event(page);
 		ret = 0;
 	} else {
 		/*
@@ -401,34 +332,10 @@ static int swapfile_write(struct swap_info_struct *sis,
 
 int __swap_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct bio *bio;
-	int ret;
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (data_race(sis->flags & SWP_FS_OPS))
-		return swapfile_write(sis, page, wbc);
-
-	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
-	if (!ret) {
-		count_swpout_vm_event(page);
-		return 0;
-	}
-
-	bio = bio_alloc(GFP_NOIO, 1);
-	bio_set_dev(bio, sis->bdev);
-	bio->bi_iter.bi_sector = swap_page_sector(page);
-	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
-	bio->bi_end_io = end_swap_bio_write;
-	bio_add_page(bio, page, thp_size(page), 0);
-
-	bio_associate_blkg_from_page(bio, page);
-	count_swpout_vm_event(page);
-	set_page_writeback(page);
-	unlock_page(page);
-	submit_bio(bio);
-
-	return 0;
+	return swapfile_write(sis, page, wbc);
 }
 
 static void __swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
@@ -437,6 +344,7 @@ static void __swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
 
 	if (ret == PAGE_SIZE) {
 		count_vm_event(PSWPIN);
+		swap_slot_free_notify(page);
 		SetPageUptodate(page);
 	} else {
 		SetPageError(page);
@@ -522,12 +430,9 @@ static int swapfile_read(struct swap_info_struct *sis, struct page *page,
 
 int swap_readpage(struct page *page, bool synchronous)
 {
-	struct bio *bio;
-	int ret = 0;
 	struct swap_info_struct *sis = page_swap_info(page);
-	blk_qc_t qc;
-	struct gendisk *disk;
 	unsigned long pflags;
+	int ret = 0;
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -543,60 +448,9 @@ int swap_readpage(struct page *page, bool synchronous)
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
 		unlock_page(page);
-		goto out;
-	}
-
-	if (data_race(sis->flags & SWP_FS_OPS)) {
+	} else {
 		ret = swapfile_read(sis, page, synchronous);
-		goto out;
-	}
-
-	if (sis->flags & SWP_SYNCHRONOUS_IO) {
-		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
-		if (!ret) {
-			if (trylock_page(page)) {
-				swap_slot_free_notify(page);
-				unlock_page(page);
-			}
-
-			count_vm_event(PSWPIN);
-			goto out;
-		}
 	}
-
-	ret = 0;
-	bio = bio_alloc(GFP_KERNEL, 1);
-	bio_set_dev(bio, sis->bdev);
-	bio->bi_opf = REQ_OP_READ;
-	bio->bi_iter.bi_sector = swap_page_sector(page);
-	bio->bi_end_io = end_swap_bio_read;
-	bio_add_page(bio, page, thp_size(page), 0);
-
-	disk = bio->bi_bdev->bd_disk;
-	/*
-	 * Keep this task valid during swap readpage because the oom killer may
-	 * attempt to access it in the page fault retry time check.
-	 */
-	if (synchronous) {
-		bio->bi_opf |= REQ_HIPRI;
-		get_task_struct(current);
-		bio->bi_private = current;
-	}
-	count_vm_event(PSWPIN);
-	bio_get(bio);
-	qc = submit_bio(bio);
-	while (synchronous) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(bio->bi_private))
-			break;
-
-		if (!blk_poll(disk->queue, qc, true))
-			blk_io_schedule();
-	}
-	__set_current_state(TASK_RUNNING);
-	bio_put(bio);
-
-out:
 	psi_memstall_leave(&pflags);
 	return ret;
 }



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 20:22 ` [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
@ 2021-08-12 21:23   ` Matthew Wilcox
  2021-08-13  7:12   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2021-08-12 21:23 UTC (permalink / raw)
  To: David Howells
  Cc: trond.myklebust, darrick.wong, hch, viro, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 09:22:24PM +0100, David Howells wrote:
> +++ b/include/linux/fs.h
> @@ -336,6 +336,7 @@ struct kiocb {
>  	union {
>  		unsigned int		ki_cookie; /* for ->iopoll */
>  		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
> +		struct page	*ki_swap_page;	/* For swapfile_read/write */

Nice idea.

> +static void __swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)

I would make this take a struct page * and just one 'ret'.

> +{
> +	struct page *page = iocb->ki_swap_page;
> +
> +	if (ret == PAGE_SIZE) {

page_size(page)?

> +	kiocb.ki_pos		= page_file_offset(page);

We talked about swap_file_pos(), right?

> +	ret = swap_file->f_mapping->a_ops->direct_IO(&kiocb, &to);
> +
> +	__swapfile_read_complete(&kiocb, ret, 0);
> +	return (ret > 0) ? 0 : ret;

What if it returns a short read?

> +static int swapfile_read(struct swap_info_struct *sis, struct page *page,
> +			 bool synchronous)
> +{
> +	struct swapfile_kiocb *ki;
> +	struct file *swap_file = sis->swap_file;
> +	struct bio_vec bv = {
> +		.bv_page = page,
> +		.bv_len  = thp_size(page),
> +		.bv_offset = 0
> +	};
> +	struct iov_iter to;
> +	int ret;
> +
> +	if (synchronous)
> +		return swapfile_read_sync(sis, page);

Seems a shame to set up the bio_vec and iov_iter twice.  Maybe call:

	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));

before swapfile_read_sync() and pass a pointer to 'to' to
swapfile_read_sync?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks()
  2021-08-12 20:22 ` [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
@ 2021-08-13  3:09   ` Matthew Wilcox
  2021-08-13  7:02     ` Christoph Hellwig
  2021-08-19 22:38   ` NeilBrown
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2021-08-13  3:09 UTC (permalink / raw)
  To: David Howells
  Cc: Darrick J. Wong, Christoph Hellwig, Trond Myklebust, linux-nfs,
	viro, jlayton, sfrench, torvalds, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Aug 12, 2021 at 09:22:06PM +0100, David Howells wrote:
> Trying to use a swapfile on NFS results in every DIO write failing with
> ETXTBSY because generic_write_checks(), as called by nfs_direct_write()
> from nfs_direct_IO(), forbids writes to swapfiles.

Why does nfs_direct_write() call generic_write_checks()?

ie call generic_write_checks() earlier, and only swap would bypass them.

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 2e894fec036b..7e2ca6b5fc5f 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -905,9 +905,6 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
-	result = generic_write_checks(iocb, iter);
-	if (result <= 0)
-		return result;
 	count = result;
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);
 
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 1fef107961bc..91b2e3214836 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -611,6 +611,10 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	errseq_t since;
 	int error;
 
+	result = generic_write_checks(iocb, from);
+	if (result < 0)
+		return result;
+
 	result = nfs_key_timeout_notify(file, inode);
 	if (result)
 		return result;
@@ -621,8 +625,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	dprintk("NFS: write(%pD2, %zu@%Ld)\n",
 		file, iov_iter_count(from), (long long) iocb->ki_pos);
 
-	if (IS_SWAPFILE(inode))
-		goto out_swapfile;
 	/*
 	 * O_APPEND implies that we must revalidate the file length.
 	 */
@@ -636,7 +638,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 
 	since = filemap_sample_wb_err(file->f_mapping);
 	nfs_start_io_write(inode);
-	result = generic_write_checks(iocb, from);
 	if (result > 0) {
 		current->backing_dev_info = inode_to_bdi(inode);
 		result = generic_perform_write(file, from, iocb->ki_pos);
@@ -677,10 +678,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
 out:
 	return result;
-
-out_swapfile:
-	printk(KERN_INFO "NFS: attempt to write to active swap file!\n");
-	return -ETXTBSY;
 }
 EXPORT_SYMBOL_GPL(nfs_file_write);
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks()
  2021-08-13  3:09   ` Matthew Wilcox
@ 2021-08-13  7:02     ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-08-13  7:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, Darrick J. Wong, Christoph Hellwig,
	Trond Myklebust, linux-nfs, viro, jlayton, sfrench, torvalds,
	linux-mm, linux-fsdevel, linux-kernel

On Fri, Aug 13, 2021 at 04:09:39AM +0100, Matthew Wilcox wrote:
> On Thu, Aug 12, 2021 at 09:22:06PM +0100, David Howells wrote:
> > Trying to use a swapfile on NFS results in every DIO write failing with
> > ETXTBSY because generic_write_checks(), as called by nfs_direct_write()
> > from nfs_direct_IO(), forbids writes to swapfiles.
> 
> Why does nfs_direct_write() call generic_write_checks()?
> 
> ie call generic_write_checks() earlier, and only swap would bypass them.

Yes, something like that is a good idea probably.  Additionally I'd like
to move to a separate of for swap I/O ASAP given that NFS only
implemens ->direct_IO for swap.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 2/5] mm: Remove the callback func argument from __swap_writepage()
  2021-08-12 20:22 ` [RFC PATCH v2 2/5] mm: Remove the callback func argument from __swap_writepage() David Howells
@ 2021-08-13  7:03   ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-08-13  7:03 UTC (permalink / raw)
  To: David Howells
  Cc: willy, Seth Jennings, Bob Liu, Minchan Kim, Dan Magenheimer,
	trond.myklebust, darrick.wong, hch, viro, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 09:22:15PM +0100, David Howells wrote:
> Remove the callback func argument from __swap_writepage() as it's
> end_swap_bio_write() in both places that call it.

Yeah.  I actually had a similar patch in a WIP tree for a while but
never got to end it out.

>  /* linux/mm/page_io.c */
>  extern int swap_readpage(struct page *page, bool do_poll);
>  extern int swap_writepage(struct page *page, struct writeback_control *wbc);
> -extern void end_swap_bio_write(struct bio *bio);
> -extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
> -	bio_end_io_t end_write_func);
> +extern int __swap_writepage(struct page *page, struct writeback_control *wbc);

Please also drop the extern while you're at it.

With that:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 20:22 ` [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
  2021-08-12 21:23   ` Matthew Wilcox
@ 2021-08-13  7:12   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-08-13  7:12 UTC (permalink / raw)
  To: David Howells
  Cc: willy, trond.myklebust, darrick.wong, hch, viro, jlayton,
	sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

> +/*
> + * Keep track of the kiocb we're using to do async DIO.  We have to
> + * refcount it until various things stop looking at the kiocb *after*
> + * calling ->ki_complete().
> + */
> +struct swapfile_kiocb {
> +	struct kiocb		iocb;
> +	refcount_t		ki_refcnt;
> +};

The ki_ prefix is a little strange here.

> +
> +static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
> +{
> +	if (refcount_dec_and_test(&ki->ki_refcnt)) {
> +		fput(ki->iocb.ki_filp);

What do we need the file reference for here?  The swap code has to have
higher level prevention for closing the file vs active I/O, at least the
block path seems to rely on that.

> +static void swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
> +{
> +	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);

Overly long line.

> +	/* Should set IOCB_HIPRI too, but the box becomes unresponsive whilst
> +	 * putting out occasional messages about the NFS sunrpc scheduling
> +	 * tasks being hung.
> +	 */

IOCB_HIPRI has a very specific meaning, so I'm not sure we should
use it never mind leave such a comment here.  Also this is not the
proper standard kernel comment style.

> +
> +	iov_iter_bvec(&to, READ, &bv, 1, thp_size(page));
> +	ret = swap_file->f_mapping->a_ops->direct_IO(&kiocb, &to);
> +
> +	__swapfile_read_complete(&kiocb, ret, 0);
> +	return (ret > 0) ? 0 : ret;

No need for the braces.

> +	return (ret > 0) ? 0 : ret;

Same here.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 5/5] mm: Remove swap BIO paths and only use DIO paths [BROKEN]
  2021-08-12 20:22 ` [RFC PATCH v2 5/5] mm: Remove swap BIO paths and only use DIO paths [BROKEN] David Howells
@ 2021-08-13  7:14   ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2021-08-13  7:14 UTC (permalink / raw)
  To: David Howells
  Cc: willy, trond.myklebust, darrick.wong, hch, viro, jlayton,
	sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Aug 12, 2021 at 09:22:41PM +0100, David Howells wrote:
> [!] NOTE: This doesn't work and might damage your disk's contents.
> 
> Delete the BIO-generating swap read/write paths and always use
> ->direct_IO().  This puts the mapping layer in the filesystem.
> 
> This doesn't work - probably due to ki_pos being set to
> page_file_offset(page) which then gets remapped.

Also because most common block file systems do not actually implement
a ->direct_IO that does anything (noop_direct_IO).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks()
  2021-08-12 20:22 ` [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
  2021-08-13  3:09   ` Matthew Wilcox
@ 2021-08-19 22:38   ` NeilBrown
  1 sibling, 0 replies; 13+ messages in thread
From: NeilBrown @ 2021-08-19 22:38 UTC (permalink / raw)
  To: David Howells
  Cc: willy, Darrick J. Wong, Christoph Hellwig, Trond Myklebust,
	linux-nfs, dhowells, viro, jlayton, sfrench, torvalds, linux-mm,
	linux-fsdevel, linux-kernel

On Fri, 13 Aug 2021, David Howells wrote:
> Trying to use a swapfile on NFS results in every DIO write failing with
> ETXTBSY because generic_write_checks(), as called by nfs_direct_write()
> from nfs_direct_IO(), forbids writes to swapfiles.

Could we just remove this check from generic_write_checks(), and instead
call deny_write_access() in swap_on?
Then user-space wouldn't be able to open a swap-file for write, so there
would be no need to check on every single write.

NeilBrown


> 
> Fix this by introducing a new kiocb flag, IOCB_SWAP, that's set by the swap
> code to indicate that the swapper is doing this operation and so overrule
> the check in generic_write_checks().
> 
> Without this patch, the following is seen:
> 
> 	Write error on dio swapfile (3800334336)
> 
> Altering __swap_writepage() to show the error shows:
> 
> 	Write error (-26) on dio swapfile (3800334336)
> 
> Tested by swapping off all swap partitions and then swapping on a prepared
> NFS file (CONFIG_NFS_SWAP=y is also needed).  Enough copies of the
> following program then need to be run to force swapping to occur (at least
> one per gigabyte of RAM):
> 
> 	#include <stdbool.h>
> 	#include <stdio.h>
> 	#include <stdlib.h>
> 	#include <unistd.h>
> 	#include <sys/mman.h>
> 	int main()
> 	{
> 		unsigned int pid = getpid(), iterations = 0;
> 		size_t i, j, size = 1024 * 1024 * 1024;
> 		char *p;
> 		bool mismatch;
> 		p = malloc(size);
> 		if (!p) {
> 			perror("malloc");
> 			exit(1);
> 		}
> 		srand(pid);
> 		for (i = 0; i < size; i += 4)
> 			*(unsigned int *)(p + i) = rand();
> 		do {
> 			for (j = 0; j < 16; j++) {
> 				for (i = 0; i < size; i += 4096)
> 					*(unsigned int *)(p + i) += 1;
> 				iterations++;
> 			}
> 			mismatch = false;
> 			srand(pid);
> 			for (i = 0; i < size; i += 4) {
> 				unsigned int r = rand();
> 				unsigned int v = *(unsigned int *)(p + i);
> 				if (i % 4096 == 0)
> 					v -= iterations;
> 				if (v != r) {
> 					fprintf(stderr, "mismatch %zx: %x != %x (diff %x)\n",
> 						i, v, r, v - r);
> 					mismatch = true;
> 				}
> 			}
> 		} while (!mismatch);
> 		exit(1);
> 	}
> 
> 
> Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Darrick J. Wong <darrick.wong@oracle.com>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Trond Myklebust <trond.myklebust@primarydata.com>
> cc: linux-nfs@vger.kernel.org
> ---
> 
>  fs/read_write.c    |    2 +-
>  include/linux/fs.h |    1 +
>  mm/page_io.c       |    7 ++++---
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 9db7adf160d2..daef721ca67e 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1646,7 +1646,7 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	loff_t count;
>  	int ret;
>  
> -	if (IS_SWAPFILE(inode))
> +	if (IS_SWAPFILE(inode) && !(iocb->ki_flags & IOCB_SWAP))
>  		return -ETXTBSY;
>  
>  	if (!iov_iter_count(from))
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 640574294216..b3e6a20f28ef 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -319,6 +319,7 @@ enum rw_hint {
>  /* iocb->ki_waitq is valid */
>  #define IOCB_WAITQ		(1 << 19)
>  #define IOCB_NOIO		(1 << 20)
> +#define IOCB_SWAP		(1 << 21)	/* This is a swap request */
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> diff --git a/mm/page_io.c b/mm/page_io.c
> index d597bc6e6e45..edb72bf624d2 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -303,7 +303,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  
>  		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
>  		init_sync_kiocb(&kiocb, swap_file);
> -		kiocb.ki_pos = page_file_offset(page);
> +		kiocb.ki_pos	= page_file_offset(page);
> +		kiocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
>  
>  		set_page_writeback(page);
>  		unlock_page(page);
> @@ -324,8 +325,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  			 */
>  			set_page_dirty(page);
>  			ClearPageReclaim(page);
> -			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
> -					   page_file_offset(page));
> +			pr_err_ratelimited("Write error (%d) on dio swapfile (%llu)\n",
> +					   ret, page_file_offset(page));
>  		}
>  		end_page_writeback(page);
>  		return ret;
> 
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-08-19 22:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-12 20:21 [RFC PATCH v2 0/5] mm: Fix NFS swapfiles and use DIO for swapfiles David Howells
2021-08-12 20:22 ` [RFC PATCH v2 1/5] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
2021-08-13  3:09   ` Matthew Wilcox
2021-08-13  7:02     ` Christoph Hellwig
2021-08-19 22:38   ` NeilBrown
2021-08-12 20:22 ` [RFC PATCH v2 2/5] mm: Remove the callback func argument from __swap_writepage() David Howells
2021-08-13  7:03   ` Christoph Hellwig
2021-08-12 20:22 ` [RFC PATCH v2 3/5] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
2021-08-12 21:23   ` Matthew Wilcox
2021-08-13  7:12   ` Christoph Hellwig
2021-08-12 20:22 ` [RFC PATCH v2 4/5] mm: Make __swap_writepage() do async DIO if asked for it David Howells
2021-08-12 20:22 ` [RFC PATCH v2 5/5] mm: Remove swap BIO paths and only use DIO paths [BROKEN] David Howells
2021-08-13  7:14   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).