All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles
@ 2021-08-12 11:57 David Howells
  2021-08-12 11:57 ` [PATCH 1/2] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: David Howells @ 2021-08-12 11:57 UTC (permalink / raw)
  To: willy
  Cc: dhowells, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel


Hi Willy, Trond,

Here's a change to make reads from the swapfile use async DIO rather than
readpage(), as requested by Willy.

Whilst trying to make this work, I found that NFS's support for swapfiles
seems to have been non-functional since Aug 2019 (I think), so the first
patch fixes that.  Question is: do we actually *want* to keep this
functionality, given that it seems that no one's tested it with an upstream
kernel in the last couple of years?

I tested this using the procedure and program outlined in the first patch.

I also encountered occasional instances of the following warning, so I'm
wondering if there's a scheduling problem somewhere:

BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    in-flight: 1565:fill_page_cache_func
workqueue events_highpri: flags=0x10
  pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2
    in-flight: 1547:fill_page_cache_func
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
    in-flight: 1811:fill_page_cache_func
workqueue events_unbound: flags=0x2
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5
    pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn
workqueue events_power_efficient: flags=0x82
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6
    pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean
workqueue writeback: flags=0x4a
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4
    in-flight: 433(RESCUER):wb_workfn
workqueue rpciod: flags=0xa
  pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40
    in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule
    pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule
workqueue ext4-rsv-conversion: flags=0x2000a
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6
pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20
pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29
pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623
pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859

Note that this is due to DIO writes to NFS only, as far as I can tell, and
that no reads had happened yet.

David
---
David Howells (2):
      nfs: Fix write to swapfile failure due to generic_write_checks()
      mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()


 mm/page_io.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 67 insertions(+), 6 deletions(-)



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] nfs: Fix write to swapfile failure due to generic_write_checks()
  2021-08-12 11:57 [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles David Howells
@ 2021-08-12 11:57 ` David Howells
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: David Howells @ 2021-08-12 11:57 UTC (permalink / raw)
  To: willy
  Cc: dhowells, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Trying to use a swapfile on NFS results in every DIO write failing with
ETXTBSY because generic_write_checks(), as called by nfs_direct_write()
from nfs_direct_IO(), forbids writes to swapfiles.

Fix this by introducing a new kiocb flag, IOCB_SWAP, that's set by the swap
code to indicate that the swapper is doing this operation and so overrule
the check in generic_write_checks().

Without this patch, the following is seen:

	Write error on dio swapfile (3800334336)

Altering __swap_writepage() to show the error shows:

	Write error (-26) on dio swapfile (3800334336)

Tested by swapping off all swap partitions and then swapping on a prepared
NFS file (CONFIG_NFS_SWAP=y is also needed).  Enough copies of the
following program then need to be run to force swapping to occur (at least
one per gigabyte of RAM):

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	int main()
	{
		unsigned int pid = getpid(), iterations = 0;
		size_t i, j, size = 1024 * 1024 * 1024;
		char *p;
		bool mismatch;
		p = malloc(size);
		if (!p) {
			perror("malloc");
			exit(1);
		}
		srand(pid);
		for (i = 0; i < size; i += 4)
			*(unsigned int *)(p + i) = rand();
		do {
			for (j = 0; j < 16; j++) {
				for (i = 0; i < size; i += 4096)
					*(unsigned int *)(p + i) += 1;
				iterations++;
			}
			mismatch = false;
			srand(pid);
			for (i = 0; i < size; i += 4) {
				unsigned int r = rand();
				unsigned int v = *(unsigned int *)(p + i);
				if (i % 4096 == 0)
					v -= iterations;
				if (v != r) {
					fprintf(stderr, "mismatch %zx: %x != %x (diff %x)\n",
						i, v, r, v - r);
					mismatch = true;
				}
			}
		} while (!mismatch);
		exit(1);
	}


Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Darrick J. Wong <darrick.wong@oracle.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Trond Myklebust <trond.myklebust@primarydata.com>
cc: linux-nfs@vger.kernel.org
---

 fs/read_write.c    |    2 +-
 include/linux/fs.h |    1 +
 mm/page_io.c       |    7 ++++---
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9db7adf160d2..daef721ca67e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1646,7 +1646,7 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	loff_t count;
 	int ret;
 
-	if (IS_SWAPFILE(inode))
+	if (IS_SWAPFILE(inode) && !(iocb->ki_flags & IOCB_SWAP))
 		return -ETXTBSY;
 
 	if (!iov_iter_count(from))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..b3e6a20f28ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -319,6 +319,7 @@ enum rw_hint {
 /* iocb->ki_waitq is valid */
 #define IOCB_WAITQ		(1 << 19)
 #define IOCB_NOIO		(1 << 20)
+#define IOCB_SWAP		(1 << 21)	/* This is a swap request */
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/mm/page_io.c b/mm/page_io.c
index d597bc6e6e45..edb72bf624d2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -303,7 +303,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
 		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos = page_file_offset(page);
+		kiocb.ki_pos	= page_file_offset(page);
+		kiocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE | IOCB_SWAP;
 
 		set_page_writeback(page);
 		unlock_page(page);
@@ -324,8 +325,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 			 */
 			set_page_dirty(page);
 			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
-					   page_file_offset(page));
+			pr_err_ratelimited("Write error (%d) on dio swapfile (%llu)\n",
+					   ret, page_file_offset(page));
 		}
 		end_page_writeback(page);
 		return ret;



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 11:57 [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles David Howells
  2021-08-12 11:57 ` [PATCH 1/2] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
@ 2021-08-12 11:57 ` David Howells
  2021-08-12 12:21   ` Christoph Hellwig
                     ` (4 more replies)
  2021-08-12 12:18 ` [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles Christoph Hellwig
  2021-08-13  2:59 ` Hillf Danton
  3 siblings, 5 replies; 18+ messages in thread
From: David Howells @ 2021-08-12 11:57 UTC (permalink / raw)
  To: willy
  Cc: dhowells, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
the ->direct_IO() method on the filesystem rather then ->readpage().

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 mm/page_io.c |   73 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 67 insertions(+), 6 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index edb72bf624d2..108f864cea28 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -354,6 +354,72 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+struct swapfile_kiocb {
+	struct kiocb		iocb;
+	struct page		*page;
+	refcount_t		ki_refcnt;
+};
+
+static void swapfile_put_kiocb(struct swapfile_kiocb *ki)
+{
+	if (refcount_dec_and_test(&ki->ki_refcnt)) {
+		fput(ki->iocb.ki_filp);
+		kfree(ki);
+	}
+}
+
+static void swapfile_read_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	struct swapfile_kiocb *ki = container_of(iocb, struct swapfile_kiocb, iocb);
+	struct page *page = ki->page;
+
+	if (ret == PAGE_SIZE) {
+		count_vm_event(PSWPIN);
+		SetPageUptodate(page);
+	} else {
+		pr_err_ratelimited("Read error (%ld) on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	}
+
+	unlock_page(page);
+	swapfile_put_kiocb(ki);
+}
+
+static int swapfile_read(struct swap_info_struct *sis, struct page *page,
+			 bool synchronous)
+{
+	struct swapfile_kiocb *ki;
+	struct file *swap_file = sis->swap_file;
+	struct bio_vec bv = {
+		.bv_page = page,
+		.bv_len  = PAGE_SIZE,
+		.bv_offset = 0
+	};
+	struct iov_iter to;
+	int ret;
+
+	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
+	if (!ki)
+		return -ENOMEM;
+
+	refcount_set(&ki->ki_refcnt, 2);
+	init_sync_kiocb(&ki->iocb, swap_file);
+	ki->page = page;
+	ki->iocb.ki_flags = IOCB_DIRECT | IOCB_SWAP;
+	ki->iocb.ki_pos	= page_file_offset(page);
+	ki->iocb.ki_filp = get_file(swap_file);
+	if (!synchronous)
+		ki->iocb.ki_complete = swapfile_read_complete;
+
+	iov_iter_bvec(&to, READ, &bv, 1, PAGE_SIZE);
+	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to);
+
+	if (ret != -EIOCBQUEUED)
+		swapfile_read_complete(&ki->iocb, ret, 0);
+	swapfile_put_kiocb(ki);
+	return (ret > 0) ? 0 : ret;
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -381,12 +447,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
-			count_vm_event(PSWPIN);
+		ret = swapfile_read(sis, page, synchronous);
 		goto out;
 	}
 



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles
  2021-08-12 11:57 [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles David Howells
  2021-08-12 11:57 ` [PATCH 1/2] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
@ 2021-08-12 12:18 ` Christoph Hellwig
  2021-08-13  2:59 ` Hillf Danton
  3 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2021-08-12 12:18 UTC (permalink / raw)
  To: David Howells
  Cc: willy, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 12:57:41PM +0100, David Howells wrote:
> 
> Hi Willy, Trond,
> 
> Here's a change to make reads from the swapfile use async DIO rather than
> readpage(), as requested by Willy.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?

Independ of the NFS use using the direct I/O code for swap seems like
the right thing to do in generlal.  e.g. for XFS a lookup in the extent
btree will be more efficient than the weird swap extent map.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
@ 2021-08-12 12:21   ` Christoph Hellwig
  2021-08-12 12:57   ` David Howells
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2021-08-12 12:21 UTC (permalink / raw)
  To: David Howells
  Cc: willy, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:
> Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
> the ->direct_IO() method on the filesystem rather then ->readpage().

->direct_IO is just a helper for ->read_iter and ->write_iter, so please
don't call it directly.  It actually is slowly on its way out, with at
at least all of the iomap implementations not using it, as well as various
other file systems.

> +	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
> +	if (!ki)
> +		return -ENOMEM;

for the synchronous case we could avoid this allocation and just use
arguments on stack.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
  2021-08-12 12:21   ` Christoph Hellwig
@ 2021-08-12 12:57   ` David Howells
  2021-08-12 15:39     ` Matthew Wilcox
  2021-08-12 13:00   ` Matthew Wilcox
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: David Howells @ 2021-08-12 12:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, willy, trond.myklebust, darrick.wong, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:
> > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
> > the ->direct_IO() method on the filesystem rather then ->readpage().
> 
> ->direct_IO is just a helper for ->read_iter and ->write_iter, so please
> don't call it directly.  It actually is slowly on its way out, with at
> at least all of the iomap implementations not using it, as well as various
> other file systems.

[Note that __swap_writepage() uses ->direct_IO().]

Calling ->write_iter is probably a bad idea here.  Imagine that it goes
through, say, generic_file_write_iter(), then __generic_file_write_iter() and
then generic_file_direct_write().  It adds a number of delays into the system,
including:

	- Taking the inode lock
	- Removing file privs
	- Cranking mtime, ctime, file version
	  - Doing mnt_want_write
	  - Setting the inode dirty
	- Waiting on pages in the range that are being written 
	- Walking over the pagecache to invalidate the range
	- Redoing the invalidation (can't be skipped since page 0 is pinned)

that we might want to skip as they'll end up being done for every page swapped
out.

> > +	ki = kzalloc(sizeof(*ki), GFP_KERNEL);
> > +	if (!ki)
> > +		return -ENOMEM;
> 
> for the synchronous case we could avoid this allocation and just use
> arguments on stack.

True.

David


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
  2021-08-12 12:21   ` Christoph Hellwig
  2021-08-12 12:57   ` David Howells
@ 2021-08-12 13:00   ` Matthew Wilcox
  2021-08-12 13:23   ` David Howells
  2021-08-12 13:37   ` David Howells
  4 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2021-08-12 13:00 UTC (permalink / raw)
  To: David Howells
  Cc: trond.myklebust, darrick.wong, hch, jlayton, sfrench, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:

I'm not quite sure why we need the refcount.

> +	refcount_set(&ki->ki_refcnt, 2);
> +	init_sync_kiocb(&ki->iocb, swap_file);
> +	ki->page = page;
> +	ki->iocb.ki_flags = IOCB_DIRECT | IOCB_SWAP;
> +	ki->iocb.ki_pos	= page_file_offset(page);
> +	ki->iocb.ki_filp = get_file(swap_file);
> +	if (!synchronous)
> +		ki->iocb.ki_complete = swapfile_read_complete;
> +
> +	iov_iter_bvec(&to, READ, &bv, 1, PAGE_SIZE);
> +	ret = swap_file->f_mapping->a_ops->direct_IO(&ki->iocb, &to);

After submitting the IO here ...

> +	if (ret != -EIOCBQUEUED)
> +		swapfile_read_complete(&ki->iocb, ret, 0);

We only touch the 'ki' here ... if the caller didn't call read_complete

> +	swapfile_put_kiocb(ki);

Except for here, which is only touched in order to put the refcount.

So why can't swapfile_read_complete() do the work of freeing the ki?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
                     ` (2 preceding siblings ...)
  2021-08-12 13:00   ` Matthew Wilcox
@ 2021-08-12 13:23   ` David Howells
  2021-08-12 13:37   ` David Howells
  4 siblings, 0 replies; 18+ messages in thread
From: David Howells @ 2021-08-12 13:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> After submitting the IO here ...
> 
> > +	if (ret != -EIOCBQUEUED)
> > +		swapfile_read_complete(&ki->iocb, ret, 0);
> 
> We only touch the 'ki' here ... if the caller didn't call read_complete
> 
> > +	swapfile_put_kiocb(ki);
> 
> Except for here, which is only touched in order to put the refcount.
> 
> So why can't swapfile_read_complete() do the work of freeing the ki?

When I was doing something similar for cachefiles, I couldn't get it to work
like that.  I'll have another look at that.

David


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
                     ` (3 preceding siblings ...)
  2021-08-12 13:23   ` David Howells
@ 2021-08-12 13:37   ` David Howells
  2021-08-12 13:50     ` Matthew Wilcox
  2021-08-12 14:16     ` David Howells
  4 siblings, 2 replies; 18+ messages in thread
From: David Howells @ 2021-08-12 13:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > After submitting the IO here ...
> > 
> > > +	if (ret != -EIOCBQUEUED)
> > > +		swapfile_read_complete(&ki->iocb, ret, 0);
> > 
> > We only touch the 'ki' here ... if the caller didn't call read_complete
> > 
> > > +	swapfile_put_kiocb(ki);
> > 
> > Except for here, which is only touched in order to put the refcount.
> > 
> > So why can't swapfile_read_complete() do the work of freeing the ki?
> 
> When I was doing something similar for cachefiles, I couldn't get it to work
> like that.  I'll have another look at that.

Ah, yes.  generic_file_direct_write() accesses in the kiocb *after* calling
->direct_IO(), so the kiocb *must not* go away until after
generic_file_direct_write() has returned.

David


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 13:37   ` David Howells
@ 2021-08-12 13:50     ` Matthew Wilcox
  2021-08-12 14:16     ` David Howells
  1 sibling, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2021-08-12 13:50 UTC (permalink / raw)
  To: David Howells
  Cc: trond.myklebust, darrick.wong, hch, jlayton, sfrench, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 02:37:59PM +0100, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
> 
> > Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > > After submitting the IO here ...
> > > 
> > > > +	if (ret != -EIOCBQUEUED)
> > > > +		swapfile_read_complete(&ki->iocb, ret, 0);
> > > 
> > > We only touch the 'ki' here ... if the caller didn't call read_complete
> > > 
> > > > +	swapfile_put_kiocb(ki);
> > > 
> > > Except for here, which is only touched in order to put the refcount.
> > > 
> > > So why can't swapfile_read_complete() do the work of freeing the ki?
> > 
> > When I was doing something similar for cachefiles, I couldn't get it to work
> > like that.  I'll have another look at that.
> 
> Ah, yes.  generic_file_direct_write() accesses in the kiocb *after* calling
> ->direct_IO(), so the kiocb *must not* go away until after
> generic_file_direct_write() has returned.

This is a read, not a write ... but we don't care about ki_pos being
updated, so that store can be conditioned on IOCB_SWAP being clear.
Or instead of storing directly to ki_pos, we take a pointer to ki_pos
and then redirect that pointer somewhere harmless.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 13:37   ` David Howells
  2021-08-12 13:50     ` Matthew Wilcox
@ 2021-08-12 14:16     ` David Howells
  1 sibling, 0 replies; 18+ messages in thread
From: David Howells @ 2021-08-12 14:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, trond.myklebust, darrick.wong, hch, jlayton, sfrench,
	torvalds, linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> This is a read, not a write ... but we don't care about ki_pos being
> updated, so that store can be conditioned on IOCB_SWAP being clear.
> Or instead of storing directly to ki_pos, we take a pointer to ki_pos
> and then redirect that pointer somewhere harmless.

But see also ext4_dio_read_iter(), for example.  That touches ki_filp after
starting the op.

David


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 12:57   ` David Howells
@ 2021-08-12 15:39     ` Matthew Wilcox
  2021-08-12 17:02       ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2021-08-12 15:39 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, trond.myklebust, darrick.wong, jlayton,
	sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Aug 12, 2021 at 01:57:05PM +0100, David Howells wrote:
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > On Thu, Aug 12, 2021 at 12:57:58PM +0100, David Howells wrote:
> > > Make swap_readpage(), when accessing a swap file (SWP_FS_OPS) use
> > > the ->direct_IO() method on the filesystem rather then ->readpage().
> > 
> > ->direct_IO is just a helper for ->read_iter and ->write_iter, so please
> > don't call it directly.  It actually is slowly on its way out, with at
> > at least all of the iomap implementations not using it, as well as various
> > other file systems.
> 
> [Note that __swap_writepage() uses ->direct_IO().]
> 
> Calling ->write_iter is probably a bad idea here.  Imagine that it goes
> through, say, generic_file_write_iter(), then __generic_file_write_iter() and
> then generic_file_direct_write().  It adds a number of delays into the system,
> including:
> 
> 	- Taking the inode lock
> 	- Removing file privs
> 	- Cranking mtime, ctime, file version
> 	  - Doing mnt_want_write
> 	  - Setting the inode dirty
> 	- Waiting on pages in the range that are being written 
> 	- Walking over the pagecache to invalidate the range
> 	- Redoing the invalidation (can't be skipped since page 0 is pinned)
> 
> that we might want to skip as they'll end up being done for every page swapped
> out.

I agree with David; we want something lower-level for swap to call into.
I'd suggest aops->swap_rw and an implementation might well look
something like:

static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
{
	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 15:39     ` Matthew Wilcox
@ 2021-08-12 17:02       ` Christoph Hellwig
  2021-08-12 17:48         ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2021-08-12 17:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, Christoph Hellwig, trond.myklebust, darrick.wong,
	jlayton, sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> I agree with David; we want something lower-level for swap to call into.
> I'd suggest aops->swap_rw and an implementation might well look
> something like:
> 
> static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> {
> 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> }

Yes, that might make sense and would also replace the awkward IOCB_SWAP
flag for the write side.

For file systems like ext4 and xfs that have an in-memory block mapping
tree this would be way better than the current version and also support
swap on say multi-device file systems properly.  We'd just need to be
careful to read the extent information in at extent_activate time,
by doing xfs_iread_extents for XFS or the equivalents in other file
systems.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 17:02       ` Christoph Hellwig
@ 2021-08-12 17:48         ` Darrick J. Wong
  2021-08-12 18:14           ` Matthew Wilcox
  2021-08-13  6:54           ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2021-08-12 17:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, David Howells, trond.myklebust, darrick.wong,
	jlayton, sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote:
> On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> > I agree with David; we want something lower-level for swap to call into.
> > I'd suggest aops->swap_rw and an implementation might well look
> > something like:
> > 
> > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> > {
> > 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> > }
> 
> Yes, that might make sense and would also replace the awkward IOCB_SWAP
> flag for the write side.
> 
> For file systems like ext4 and xfs that have an in-memory block mapping
> tree this would be way better than the current version and also support
> swap on say multi-device file systems properly.  We'd just need to be
> careful to read the extent information in at extent_activate time,
> by doing xfs_iread_extents for XFS or the equivalents in other file
> systems.

You'd still want to walk the extent map at activation time to reject
swapfiles with holes, shared extents, etc., right?

--D

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 17:48         ` Darrick J. Wong
@ 2021-08-12 18:14           ` Matthew Wilcox
  2021-08-12 20:13             ` Theodore Ts'o
  2021-08-13  6:54           ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2021-08-12 18:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, David Howells, trond.myklebust, darrick.wong,
	jlayton, sfrench, torvalds, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Aug 12, 2021 at 10:48:18AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> > > I agree with David; we want something lower-level for swap to call into.
> > > I'd suggest aops->swap_rw and an implementation might well look
> > > something like:
> > > 
> > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> > > {
> > > 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> > > }
> > 
> > Yes, that might make sense and would also replace the awkward IOCB_SWAP
> > flag for the write side.
> > 
> > For file systems like ext4 and xfs that have an in-memory block mapping
> > tree this would be way better than the current version and also support
> > swap on say multi-device file systems properly.  We'd just need to be
> > careful to read the extent information in at extent_activate time,
> > by doing xfs_iread_extents for XFS or the equivalents in other file
> > systems.
> 
> You'd still want to walk the extent map at activation time to reject
> swapfiles with holes, shared extents, etc., right?

Well ... this would actually allow the filesystem to break COWs and
allocate new blocks for holes.  Maybe you don't want to be doing that
in a low-memory situation though ;-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 18:14           ` Matthew Wilcox
@ 2021-08-12 20:13             ` Theodore Ts'o
  0 siblings, 0 replies; 18+ messages in thread
From: Theodore Ts'o @ 2021-08-12 20:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Christoph Hellwig, David Howells,
	trond.myklebust, darrick.wong, jlayton, sfrench, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 07:14:37PM +0100, Matthew Wilcox wrote:
> 
> Well ... this would actually allow the filesystem to break COWs and
> allocate new blocks for holes.  Maybe you don't want to be doing that
> in a low-memory situation though ;-)

I'm not sure the benefits are worth the costs.  You'd have to handle
ENOSPC errors, and it would require some kind of metadata journal
transaction, which could potentially block for any number of reasons
(not just due to memory allocations, but because you're waiting for a
journal commit to complete).  As you say, doing that in a low-memory
situation seems to be unneeded complexity.

					- Ted

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles
  2021-08-12 11:57 [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles David Howells
                   ` (2 preceding siblings ...)
  2021-08-12 12:18 ` [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles Christoph Hellwig
@ 2021-08-13  2:59 ` Hillf Danton
  3 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2021-08-13  2:59 UTC (permalink / raw)
  To: David Howells
  Cc: willy, trond.myklebust, hch, linux-nfs, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, 12 Aug 2021 12:57:41 +0100 David Howells wrote:
> 
> Hi Willy, Trond,
> 
> Here's a change to make reads from the swapfile use async DIO rather than
> readpage(), as requested by Willy.
> 
> Whilst trying to make this work, I found that NFS's support for swapfiles
> seems to have been non-functional since Aug 2019 (I think), so the first
> patch fixes that.  Question is: do we actually *want* to keep this
> functionality, given that it seems that no one's tested it with an upstream
> kernel in the last couple of years?
> 
> I tested this using the procedure and program outlined in the first patch.
> 
> I also encountered occasional instances of the following warning, so I'm
> wondering if there's a scheduling problem somewhere:
> 
> BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>     in-flight: 1565:fill_page_cache_func
> workqueue events_highpri: flags=0x10
>   pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2
>     in-flight: 1547:fill_page_cache_func
>   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
>     in-flight: 1811:fill_page_cache_func
> workqueue events_unbound: flags=0x2
>   pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5
>     pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn
> workqueue events_power_efficient: flags=0x82
>   pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6
>     pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean
> workqueue writeback: flags=0x4a
>   pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4
>     in-flight: 433(RESCUER):wb_workfn

Is it a memory tight scenario that got rescuer active?

> workqueue rpciod: flags=0xa
>   pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40
>     in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule
>     pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule
> workqueue ext4-rsv-conversion: flags=0x2000a
> pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6
> pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20
> pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29
> pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623
> pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859
> 
> Note that this is due to DIO writes to NFS only, as far as I can tell, and
> that no reads had happened yet.
> 
> David
> ---
> David Howells (2):
>       nfs: Fix write to swapfile failure due to generic_write_checks()
>       mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
> 
> 
>  mm/page_io.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 67 insertions(+), 6 deletions(-)


Print memory info to help understand the busy rescuer.

+++ x/kernel/workqueue.c
@@ -4710,12 +4710,16 @@ static void show_pwq(struct pool_workque
 	}
 	if (has_in_flight) {
 		bool comma = false;
+		bool rescuer = false;
 
 		pr_info("    in-flight:");
 		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
 			if (worker->current_pwq != pwq)
 				continue;
 
+			if (worker->rescue_wq)
+				rescuer = true;
+
 			pr_cont("%s %d%s:%ps", comma ? "," : "",
 				task_pid_nr(worker->task),
 				worker->rescue_wq ? "(RESCUER)" : "",
@@ -4725,6 +4729,11 @@ static void show_pwq(struct pool_workque
 			comma = true;
 		}
 		pr_cont("\n");
+		if (rescuer) {
+			pr_cont("\n");
+			show_free_areas(0, NULL);
+			pr_cont("\n");
+		}
 	}
 
 	list_for_each_entry(work, &pool->worklist, entry) {


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage()
  2021-08-12 17:48         ` Darrick J. Wong
  2021-08-12 18:14           ` Matthew Wilcox
@ 2021-08-13  6:54           ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2021-08-13  6:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Matthew Wilcox, David Howells,
	trond.myklebust, darrick.wong, jlayton, sfrench, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

On Thu, Aug 12, 2021 at 10:48:18AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 12, 2021 at 07:02:33PM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 12, 2021 at 04:39:40PM +0100, Matthew Wilcox wrote:
> > > I agree with David; we want something lower-level for swap to call into.
> > > I'd suggest aops->swap_rw and an implementation might well look
> > > something like:
> > > 
> > > static ssize_t ext4_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> > > {
> > > 	return iomap_dio_rw(iocb, iter, &ext4_iomap_ops, NULL, 0);
> > > }
> > 
> > Yes, that might make sense and would also replace the awkward IOCB_SWAP
> > flag for the write side.
> > 
> > For file systems like ext4 and xfs that have an in-memory block mapping
> > tree this would be way better than the current version and also support
> > swap on say multi-device file systems properly.  We'd just need to be
> > careful to read the extent information in at extent_activate time,
> > by doing xfs_iread_extents for XFS or the equivalents in other file
> > systems.
> 
> You'd still want to walk the extent map at activation time to reject
> swapfiles with holes, shared extents, etc., right?

Yes.  While direct I/O code could do allocation at swap I/O time that
probably is not a good idea due to the memory requirements.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-08-13  6:54 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-12 11:57 [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles David Howells
2021-08-12 11:57 ` [PATCH 1/2] nfs: Fix write to swapfile failure due to generic_write_checks() David Howells
2021-08-12 11:57 ` [PATCH 2/2] mm: Make swap_readpage() for SWP_FS_OPS use ->direct_IO() not ->readpage() David Howells
2021-08-12 12:21   ` Christoph Hellwig
2021-08-12 12:57   ` David Howells
2021-08-12 15:39     ` Matthew Wilcox
2021-08-12 17:02       ` Christoph Hellwig
2021-08-12 17:48         ` Darrick J. Wong
2021-08-12 18:14           ` Matthew Wilcox
2021-08-12 20:13             ` Theodore Ts'o
2021-08-13  6:54           ` Christoph Hellwig
2021-08-12 13:00   ` Matthew Wilcox
2021-08-12 13:23   ` David Howells
2021-08-12 13:37   ` David Howells
2021-08-12 13:50     ` Matthew Wilcox
2021-08-12 14:16     ` David Howells
2021-08-12 12:18 ` [PATCH 0/2] mm: Fix NFS swapfiles and use DIO read for swapfiles Christoph Hellwig
2021-08-13  2:59 ` Hillf Danton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.