linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/13] Repair SWAP-over-NFS
@ 2021-11-16  2:44 NeilBrown
  2021-11-16  2:44 ` [PATCH 09/13] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
                   ` (13 more replies)
  0 siblings, 14 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

swap-over-NFS currently has a variety of problems.

Due to a recently added check in generic_write_checks(), all writes to
swap currently fail.
With that fixed, there are various sources of deadlocks that can cause
a swapping system to freeze.

Swap has never worked over NFSv4 due to the occasional need to start the
state-management thread - which cannot happen under high memory
pressure.

This series addresses all the problems that I could find, and also
changes writes to be asynchronous, and both reads and writes to use
multi-page RPC requests when possible (the last 2 patches).

This last change causes interesting performance changes.  The rate of
writes to the swap file (measured in K/sec) increases by a factor of
about 5 (not precisely measured).  However, interactive response
degrades noticeably (response times of multiple seconds, but not
minutes).  So while it seems like it should be a good idea, I'm not sure
if we want it until it is better understood.

I'd be very happy if others could test out some swapping scenarios to
see how it performs.  I've been using
    stress-ng --brk 2 --stack 2 --bigheap 2
which doesn't give me any insight into whether more useful work is
getting done.

Apart from the last two patches, I think this series is ready.

Thanks,
NeilBrown

---

NeilBrown (13):
      NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write()
      NFS: do not take i_rwsem for swap IO
      MM: reclaim mustn't enter FS for swap-over-NFS
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory
      SUNRPC/auth: async tasks mustn't block waiting for memory
      SUNRPC/xprt: async tasks mustn't block waiting for memory
      SUNRPC: remove scheduling boost for "SWAPPER" tasks.
      NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
      SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
      NFSv4: keep state manager thread active if swap is enabled
      NFS: swap-out must always use STABLE writes.
      MM: use AIO/DIO for reads from SWP_FS_OPS swap-space
      MM: use AIO for DIO writes to swap


 fs/nfs/direct.c                 |  12 +-
 fs/nfs/file.c                   |  21 ++-
 fs/nfs/io.c                     |   9 ++
 fs/nfs/nfs4_fs.h                |   1 +
 fs/nfs/nfs4proc.c               |  20 +++
 fs/nfs/nfs4state.c              |  39 ++++-
 fs/nfs/read.c                   |   4 -
 fs/nfs/write.c                  |   2 +
 include/linux/nfs_fs.h          |   8 +-
 include/linux/nfs_xdr.h         |   2 +
 include/linux/sunrpc/auth.h     |   1 +
 include/linux/sunrpc/sched.h    |   1 -
 include/trace/events/sunrpc.h   |   1 -
 mm/page_io.c                    | 243 +++++++++++++++++++++++++++-----
 mm/vmscan.c                     |  12 +-
 net/sunrpc/auth.c               |   8 +-
 net/sunrpc/auth_gss/auth_gss.c  |   6 +-
 net/sunrpc/auth_unix.c          |  10 +-
 net/sunrpc/clnt.c               |   7 +-
 net/sunrpc/sched.c              |  29 ++--
 net/sunrpc/xprt.c               |  19 +--
 net/sunrpc/xprtrdma/transport.c |  10 +-
 net/sunrpc/xprtsock.c           |   8 ++
 23 files changed, 374 insertions(+), 99 deletions(-)

--
Signature



* [PATCH 01/13] NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write()
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (3 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 08/13] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 07/13] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

generic_write_checks() is not needed for swap-out writes, and fails if
they are attempted.
nfs_file_direct_write() currently calls generic_write_checks() and is in
turn called from:
  nfs_direct_IO  - only for swap-out
  nfs_file_write - for normal O_DIRECT writes

So move the generic_write_checks() call into nfs_file_write().  This
allows NFS swap-out writes to complete.

Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c |    5 +----
 fs/nfs/file.c   |    6 +++++-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 9cff8709c80a..1e80d243ba25 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -905,10 +905,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
-	result = generic_write_checks(iocb, iter);
-	if (result <= 0)
-		return result;
-	count = result;
+	count = iov_iter_count(iter);
 	nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);
 
 	pos = iocb->ki_pos;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 24e7dccce355..45d8180b7be3 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -615,8 +615,12 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	if (result)
 		return result;
 
-	if (iocb->ki_flags & IOCB_DIRECT)
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		result = generic_write_checks(iocb, from);
+		if (result <= 0)
+			return result;
 		return nfs_file_direct_write(iocb, from);
+	}
 
 	dprintk("NFS: write(%pD2, %zu@%Ld)\n",
 		file, iov_iter_count(from), (long long) iocb->ki_pos);




* [PATCH 02/13] NFS: do not take i_rwsem for swap IO
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
  2021-11-16  2:44 ` [PATCH 09/13] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
  2021-11-16  2:44 ` [PATCH 11/13] NFS: swap-out must always use STABLE writes NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  7:52   ` Christoph Hellwig
  2021-11-16  2:44 ` [PATCH 08/13] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

Taking the i_rwsem for swap IO triggers lockdep warnings regarding
possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
eventuate if a buffered read on the swapfile were attempted.

We don't need coherence with the page cache for a swap file, and
buffered writes are forbidden anyway.  There is no other need for
i_rwsem during direct IO.

So don't take the rwsem or set the NFS_INO_ODIRECT flag during IO to the
swap file.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/io.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/nfs/io.c b/fs/nfs/io.c
index b5551ed8f648..83b4dfbb826d 100644
--- a/fs/nfs/io.c
+++ b/fs/nfs/io.c
@@ -118,11 +118,18 @@ static void nfs_block_buffered(struct nfs_inode *nfsi, struct inode *inode)
  * NFS_INO_ODIRECT.
  * Note that buffered writes and truncates both take a write lock on
  * inode->i_rwsem, meaning that those are serialised w.r.t. O_DIRECT.
+ *
+ * When inode IS_SWAPFILE we ignore the flag and don't take the rwsem
+ * as it triggers lockdep warnings and possible deadlocks.
+ * Buffered writes are forbidden anyway, and buffered reads will not
+ * be coherent.
  */
 void
 nfs_start_io_direct(struct inode *inode)
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
+	if (IS_SWAPFILE(inode))
+		return;
 	/* Be an optimist! */
 	down_read(&inode->i_rwsem);
 	if (test_bit(NFS_INO_ODIRECT, &nfsi->flags) != 0)
@@ -144,5 +151,7 @@ nfs_start_io_direct(struct inode *inode)
 void
 nfs_end_io_direct(struct inode *inode)
 {
+	if (IS_SWAPFILE(inode))
+		return;
 	up_read(&inode->i_rwsem);
 }




* [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (10 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 06/13] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  8:32   ` Christoph Hellwig
  2021-11-18  1:43   ` kernel test robot
  2021-11-16  2:44 ` [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space NeilBrown
  2021-11-16  3:29 ` [PATCH 00/13] Repair SWAP-over-NFS Matthew Wilcox
  13 siblings, 2 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only down-grade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS are not being used.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/vmscan.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..049ff4494081 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1513,8 +1513,14 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* ->flags can be updated non-atomically (scan_swap_map_slots),
+		 * but that will never affect SWP_FS_OPS, so the data_race
+		 * is safe.
+		 */
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
-			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+			(PageSwapCache(page) &&
+			 !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
+			 (sc->gfp_mask & __GFP_IO));
 
 		/*
 		 * The number of dirty pages determines if a node is marked
@@ -1682,7 +1688,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 						goto activate_locked_split;
 				}
 
-				may_enter_fs = true;
+				if ((sc->gfp_mask & __GFP_FS) ||
+				    !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
+					may_enter_fs = true;
 
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
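The condition the first hunk adds to shrink_page_list() can be modelled as a pure function.  The following is a userspace C sketch with made-up flag values (GFP_FS, GFP_IO and SWP_FS_OPS here are stand-ins, not the kernel constants); it only illustrates the decision logic, not reclaim itself.

```c
#include <assert.h>

/* Userspace sketch of the reclaim decision: a swap-cache page may only
 * downgrade the requirement from __GFP_FS to __GFP_IO when the swap
 * device is NOT using filesystem ops.  Flag values are illustrative
 * stand-ins, not the kernel's gfp/swap constants.
 */
#define GFP_FS     0x1u
#define GFP_IO     0x2u
#define SWP_FS_OPS 0x4u

static int may_enter_fs(unsigned int gfp_mask, int page_in_swap_cache,
			unsigned int swap_flags)
{
	return (gfp_mask & GFP_FS) ||
	       (page_in_swap_cache &&
		!(swap_flags & SWP_FS_OPS) &&
		(gfp_mask & GFP_IO));
}
```

With SWP_FS_OPS set, a __GFP_IO allocation is no longer enough to let reclaim touch the page; only a full __GFP_FS context may.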




* [PATCH 04/13] SUNRPC/call_alloc: async tasks mustn't block waiting for memory
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (7 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 05/13] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 13/13] MM: use AIO for DIO writes to swap NeilBrown
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created, and we
depend on the single guaranteed rpciod thread to be able to handle
everything.  So it must not block waiting for memory.

Mempools are a particular problem, as memory can only be released back
to a mempool by an async rpc task running.  If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sched.c              |    4 +++-
 net/sunrpc/xprtrdma/transport.c |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index e2c835482791..d5b6e897f5a5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1023,8 +1023,10 @@ int rpc_malloc(struct rpc_task *task)
 	struct rpc_buffer *buf;
 	gfp_t gfp = GFP_NOFS;
 
+	if (RPC_IS_ASYNC(task))
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
 	if (RPC_IS_SWAPPER(task))
-		gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..a52277115500 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -574,8 +574,10 @@ xprt_rdma_allocate(struct rpc_task *task)
 	gfp_t flags;
 
 	flags = RPCRDMA_DEF_GFP;
+	if (RPC_IS_ASYNC(task))
+		flags = GFP_NOWAIT | __GFP_NOWARN;
 	if (RPC_IS_SWAPPER(task))
-		flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+		flags |= __GFP_MEMALLOC;
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))
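The flag policy both hunks implement can be expressed as one pure function.  Below is a userspace C sketch with stand-in constants (the SK_-prefixed values are illustrative, not the real kernel GFP flags); it models only the selection logic, not allocation.

```c
#include <assert.h>

/* Userspace sketch of the allocation-flag policy this patch introduces:
 * async tasks always get a non-blocking base (they must never sleep on
 * an rpciod thread), and swapper tasks additionally gain access to
 * memory reserves.  Flag values are illustrative stand-ins.
 */
#define SK_GFP_NOFS     0x1u
#define SK_GFP_NOWAIT   0x2u
#define SK_GFP_NOWARN   0x4u
#define SK_GFP_MEMALLOC 0x8u

static unsigned int rpc_alloc_gfp(int is_async, int is_swapper)
{
	unsigned int gfp = SK_GFP_NOFS;	/* default: may sleep, no FS recursion */

	if (is_async)
		gfp = SK_GFP_NOWAIT | SK_GFP_NOWARN;	/* never block */
	if (is_swapper)
		gfp |= SK_GFP_MEMALLOC;	/* may dip into memory reserves */
	return gfp;
}
```

Note the design choice: __GFP_MEMALLOC is now OR-ed onto whichever base was chosen, so a synchronous swapper task keeps a sleeping allocation while an async one stays non-blocking.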




* [PATCH 06/13] SUNRPC/xprt: async tasks mustn't block waiting for memory
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (9 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 13/13] MM: use AIO for DIO writes to swap NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS NeilBrown
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created, and we
depend on the single guaranteed rpciod thread to be able to handle
everything.  So it must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
workqueue threads and NFS can deadlock.  So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block.  However, it sets the error
to -EAGAIN, suggesting this will trigger a sleep.  It does not.  As we
can see in call_reserveresult(), only -ENOMEM causes a sleep; -EAGAIN
causes an immediate retry.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/xprt.c               |    5 ++++-
 net/sunrpc/xprtrdma/transport.c |    2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a02de2bddb28..47d207e416ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1687,12 +1687,15 @@ static bool xprt_throttle_congested(struct rpc_xprt *xprt, struct rpc_task *task
 static struct rpc_rqst *xprt_dynamic_alloc_slot(struct rpc_xprt *xprt)
 {
 	struct rpc_rqst *req = ERR_PTR(-EAGAIN);
+	gfp_t gfp_mask = GFP_NOFS;
 
 	if (xprt->num_reqs >= xprt->max_reqs)
 		goto out;
 	++xprt->num_reqs;
 	spin_unlock(&xprt->reserve_lock);
-	req = kzalloc(sizeof(struct rpc_rqst), GFP_NOFS);
+	if (current->flags & PF_WQ_WORKER)
+		gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
+	req = kzalloc(sizeof(struct rpc_rqst), gfp_mask);
 	spin_lock(&xprt->reserve_lock);
 	if (req != NULL)
 		goto out;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a52277115500..32df23796747 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -521,7 +521,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 	return;
 
 out_sleep:
-	task->tk_status = -EAGAIN;
+	task->tk_status = -ENOMEM;
 	xprt_add_backlog(xprt, task);
 }
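The status-code semantics the commit message describes can be sketched as a small dispatch function.  This is a userspace C illustration of the behaviour attributed to call_reserveresult() above; the enum and function are stand-ins, not the SUNRPC source.

```c
#include <assert.h>
#include <errno.h>

/* Userspace sketch: -ENOMEM makes the task sleep before retrying,
 * while -EAGAIN retries immediately and does NOT sleep, which is why
 * the patch changes the rdma out_sleep path to -ENOMEM.
 */
enum retry_action { RETRY_NOW, SLEEP_THEN_RETRY, FAIL_CALL };

static enum retry_action reserve_result_action(int status)
{
	switch (status) {
	case -EAGAIN:
		return RETRY_NOW;	/* immediate retry: may spin */
	case -ENOMEM:
		return SLEEP_THEN_RETRY;	/* back off first */
	default:
		return FAIL_CALL;
	}
}
```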
 




* [PATCH 09/13] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 11/13] NFS: swap-out must always use STABLE writes NeilBrown
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
to be used for some allocations.  This is needed in some of the cases
where it is currently provided, is unnecessary in others, and is missing
from some places that do need it.

Currently *all* tasks associated with an rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc(), but only swap writes need
it.  However, xdr_alloc_bvec() does not get GFP_MEMALLOC, though it
often does need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
then it blocks all other queued tasks.  So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
 1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
 2/ __rpc_execute() sets PF_MEMALLOC while handling any task
    with RPC_TASK_SWAPPER set, or when handling any task that
    holds the XPRT_LOCKED lock on an xprt used for swap.
    This removes the need for the RPC_IS_SWAPPER() test
    in ->buf_alloc handlers.
 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
    any task to a swapper xprt.  __rpc_execute() will clear it.
 4/ PF_MEMALLOC is set for all the connect workers.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/write.c                  |    2 ++
 net/sunrpc/clnt.c               |    2 --
 net/sunrpc/sched.c              |   20 +++++++++++++++++---
 net/sunrpc/xprt.c               |    3 +++
 net/sunrpc/xprtrdma/transport.c |    6 ++++--
 net/sunrpc/xprtsock.c           |    8 ++++++++
 6 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9b7619ce17a7..0c7a304c9e74 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1408,6 +1408,8 @@ static void nfs_initiate_write(struct nfs_pgio_header *hdr,
 {
 	int priority = flush_task_priority(how);
 
+	if (IS_SWAPFILE(hdr->inode))
+		task_setup_data->flags |= RPC_TASK_SWAPPER;
 	task_setup_data->priority = priority;
 	rpc_ops->write_setup(hdr, msg, &task_setup_data->rpc_client);
 	trace_nfs_initiate_write(hdr);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 238b2ef5491f..cb76fbea3ed5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1085,8 +1085,6 @@ void rpc_task_set_client(struct rpc_task *task, struct rpc_clnt *clnt)
 		task->tk_flags |= RPC_TASK_TIMEOUT;
 	if (clnt->cl_noretranstimeo)
 		task->tk_flags |= RPC_TASK_NO_RETRANS_TIMEOUT;
-	if (atomic_read(&clnt->cl_swapper))
-		task->tk_flags |= RPC_TASK_SWAPPER;
 	/* Add to the client's list of all tasks */
 	spin_lock(&clnt->cl_lock);
 	list_add_tail(&task->tk_task, &clnt->cl_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 256302bf6557..9020cedb7c95 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -869,6 +869,15 @@ void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata)
 		ops->rpc_release(calldata);
 }
 
+static bool xprt_needs_memalloc(struct rpc_xprt *xprt, struct rpc_task *tk)
+{
+	if (!xprt)
+		return false;
+	if (!atomic_read(&xprt->swapper))
+		return false;
+	return test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == tk;
+}
+
 /*
  * This is the RPC `scheduler' (or rather, the finite state machine).
  */
@@ -877,6 +886,7 @@ static void __rpc_execute(struct rpc_task *task)
 	struct rpc_wait_queue *queue;
 	int task_is_async = RPC_IS_ASYNC(task);
 	int status = 0;
+	unsigned long pflags = current->flags;
 
 	WARN_ON_ONCE(RPC_IS_QUEUED(task));
 	if (RPC_IS_QUEUED(task))
@@ -899,6 +909,10 @@ static void __rpc_execute(struct rpc_task *task)
 		}
 		if (!do_action)
 			break;
+		if (RPC_IS_SWAPPER(task) ||
+		    xprt_needs_memalloc(task->tk_xprt, task))
+			current->flags |= PF_MEMALLOC;
+
 		trace_rpc_task_run_action(task, do_action);
 		do_action(task);
 
@@ -936,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
 		rpc_clear_running(task);
 		spin_unlock(&queue->lock);
 		if (task_is_async)
-			return;
+			goto out;
 
 		/* sync task: sleep here */
 		trace_rpc_task_sync_sleep(task, task->tk_action);
@@ -960,6 +974,8 @@ static void __rpc_execute(struct rpc_task *task)
 
 	/* Release all resources associated with the task */
 	rpc_release_task(task);
+out:
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /*
@@ -1018,8 +1034,6 @@ int rpc_malloc(struct rpc_task *task)
 
 	if (RPC_IS_ASYNC(task))
 		gfp = GFP_NOWAIT | __GFP_NOWARN;
-	if (RPC_IS_SWAPPER(task))
-		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a0a2583fe941..0614e7463d4b 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1492,6 +1492,9 @@ bool xprt_prepare_transmit(struct rpc_task *task)
 		return false;
 
 	}
+	if (atomic_read(&xprt->swapper))
+		/* This will be clear in __rpc_execute */
+		current->flags |= PF_MEMALLOC;
 	return true;
 }
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 32df23796747..256b06a92391 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -239,8 +239,11 @@ xprt_rdma_connect_worker(struct work_struct *work)
 	struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
 						   rx_connect_worker.work);
 	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+	unsigned int pflags = current->flags;
 	int rc;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	rc = rpcrdma_xprt_connect(r_xprt);
 	xprt_clear_connecting(xprt);
 	if (!rc) {
@@ -254,6 +257,7 @@ xprt_rdma_connect_worker(struct work_struct *work)
 		rpcrdma_xprt_disconnect(r_xprt);
 	xprt_unlock_connect(xprt, r_xprt);
 	xprt_wake_pending_tasks(xprt, rc);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
@@ -576,8 +580,6 @@ xprt_rdma_allocate(struct rpc_task *task)
 	flags = RPCRDMA_DEF_GFP;
 	if (RPC_IS_ASYNC(task))
 		flags = GFP_NOWAIT | __GFP_NOWARN;
-	if (RPC_IS_SWAPPER(task))
-		flags |= __GFP_MEMALLOC;
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index ae48c9c84ee1..062ff1f37702 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2047,7 +2047,10 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock;
 	int status = -EIO;
+	unsigned int pflags = current->flags;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	sock = xs_create_sock(xprt, transport,
 			xs_addr(xprt)->sa_family, SOCK_DGRAM,
 			IPPROTO_UDP, false);
@@ -2067,6 +2070,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	xprt_clear_connecting(xprt);
 	xprt_unlock_connect(xprt, transport);
 	xprt_wake_pending_tasks(xprt, status);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
@@ -2226,7 +2230,10 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	struct socket *sock = transport->sock;
 	struct rpc_xprt *xprt = &transport->xprt;
 	int status;
+	unsigned int pflags = current->flags;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	if (!sock) {
 		sock = xs_create_sock(xprt, transport,
 				xs_addr(xprt)->sa_family, SOCK_STREAM,
@@ -2291,6 +2298,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	xprt_clear_connecting(xprt);
 out_unlock:
 	xprt_unlock_connect(xprt, transport);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
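The save/set/restore pattern used throughout this patch (in __rpc_execute() and the connect workers) can be illustrated in userspace.  This sketch mirrors the semantics of the kernel's current_restore_flags(); the flag value is an illustrative stand-in.

```c
#include <assert.h>

/* Userspace sketch: snapshot the flags word on entry, optionally set
 * PF_MEMALLOC while running, and on exit restore only that one bit
 * from the snapshot, so any other flag changes made meanwhile survive.
 */
#define SK_PF_MEMALLOC 0x0800u

static unsigned int restore_flags(unsigned int cur, unsigned int saved,
				  unsigned int mask)
{
	/* keep bits outside 'mask' from 'cur'; take 'mask' bits from 'saved' */
	return (cur & ~mask) | (saved & mask);
}
```

This is why xprt_prepare_transmit() can safely set PF_MEMALLOC and rely on __rpc_execute() to clear it: the restore acts only on the saved state of that one bit.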




* [PATCH 10/13] NFSv4: keep state manager thread active if swap is enabled
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (5 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 07/13] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 05/13] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it when
there is work to do.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c           |   19 +++++++++++++++----
 fs/nfs/nfs4_fs.h        |    1 +
 fs/nfs/nfs4proc.c       |   20 ++++++++++++++++++++
 fs/nfs/nfs4state.c      |   39 +++++++++++++++++++++++++++++++++------
 include/linux/nfs_xdr.h |    2 ++
 net/sunrpc/clnt.c       |    2 ++
 6 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 45d8180b7be3..59c271f42ea5 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -489,8 +489,10 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 {
 	unsigned long blocks;
 	long long isize;
-	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = file_inode(file);
+	struct rpc_clnt *clnt = NFS_CLIENT(inode);
+	struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
+	int ret;
 
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
@@ -503,14 +505,23 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 
 	*span = sis->pages;
 
-	return rpc_clnt_swap_activate(clnt);
+	ret = rpc_clnt_swap_activate(clnt);
+
+	if (ret == 0 && cl->rpc_ops->enable_swap)
+		cl->rpc_ops->enable_swap(inode);
+
+	return ret;
 }
 
 static void nfs_swap_deactivate(struct file *file)
 {
-	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+	struct inode *inode = file_inode(file);
+	struct rpc_clnt *clnt = NFS_CLIENT(inode);
+	struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
 
 	rpc_clnt_swap_deactivate(clnt);
+	if (cl->rpc_ops->disable_swap)
+		cl->rpc_ops->disable_swap(file_inode(file));
 }
 
 const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ed5eaca6801e..8a9ce0f42efd 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
 	NFS4CLNT_LEASE_MOVED,
 	NFS4CLNT_DELEGATION_EXPIRED,
 	NFS4CLNT_RUN_MANAGER,
+	NFS4CLNT_MANAGER_AVAILABLE,
 	NFS4CLNT_RECALL_RUNNING,
 	NFS4CLNT_RECALL_ANY_LAYOUT_READ,
 	NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ee3bc79f6ca3..ab6382f9cbf0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10347,6 +10347,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
 	return error + error2 + error3;
 }
 
+static void nfs4_enable_swap(struct inode *inode)
+{
+	/* The state manager thread must always be running.
+	 * It will notice the client is a swapper, and stay put.
+	 */
+	struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+	nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+	/* The state manager thread will now exit once it is
+	 * woken.
+	 */
+	wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
 static const struct inode_operations nfs4_dir_inode_operations = {
 	.create		= nfs_create,
 	.lookup		= nfs_lookup,
@@ -10423,6 +10441,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
 	.free_client	= nfs4_free_client,
 	.create_server	= nfs4_create_server,
 	.clone_server	= nfs_clone_server,
+	.enable_swap	= nfs4_enable_swap,
+	.disable_swap	= nfs4_disable_swap,
 };
 
 static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index ecc4594299d6..2408e9b98f68 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 {
 	struct task_struct *task;
 	char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+	struct rpc_clnt *cl = clp->cl_rpcclient;
+
+	while (cl != cl->cl_parent)
+		cl = cl->cl_parent;
 
 	set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
-	if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+	if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+		wake_up_var(&clp->cl_state);
 		return;
+	}
+	set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
 	__module_get(THIS_MODULE);
 	refcount_inc(&clp->cl_count);
 
@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 		printk(KERN_ERR "%s: kthread_run: %ld\n",
 			__func__, PTR_ERR(task));
 		nfs4_clear_state_manager_bit(clp);
+		clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
 		nfs_put_client(clp);
 		module_put(THIS_MODULE);
 	}
@@ -2661,11 +2669,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
 			clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
 		}
 
-		/* Did we race with an attempt to give us more work? */
-		if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
-			return;
-		if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
-			return;
+		return;
+
 	} while (refcount_read(&clp->cl_count) > 1 && !signalled());
 	goto out_drain;
 
@@ -2685,9 +2690,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
 static int nfs4_run_state_manager(void *ptr)
 {
 	struct nfs_client *clp = ptr;
+	struct rpc_clnt *cl = clp->cl_rpcclient;
+
+	while (cl != cl->cl_parent)
+		cl = cl->cl_parent;
 
 	allow_signal(SIGKILL);
+again:
+	set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
 	nfs4_state_manager(clp);
+	if (atomic_read(&cl->cl_swapper)) {
+		wait_var_event_interruptible(&clp->cl_state,
+					     test_bit(NFS4CLNT_RUN_MANAGER,
+						      &clp->cl_state));
+		if (atomic_read(&cl->cl_swapper) &&
+		    test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+			goto again;
+		/* Either no longer a swapper, or were signalled */
+	}
+	clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+	if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+	    test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+	    !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+		goto again;
+
 	nfs_put_client(clp);
 	module_put_and_exit(0);
 	return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 967a0098f0a9..04cf3a8fb949 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1795,6 +1795,8 @@ struct nfs_rpc_ops {
 	struct nfs_server *(*create_server)(struct fs_context *);
 	struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
 					   struct nfs_fattr *, rpc_authflavor_t);
+	void	(*enable_swap)(struct inode *inode);
+	void	(*disable_swap)(struct inode *inode);
 };
 
 /*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cb76fbea3ed5..4cb403a0f334 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3066,6 +3066,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
 int
 rpc_clnt_swap_activate(struct rpc_clnt *clnt)
 {
+	while (clnt != clnt->cl_parent)
+		clnt = clnt->cl_parent;
 	if (atomic_inc_return(&clnt->cl_swapper) == 1)
 		return rpc_clnt_iterate_for_each_xprt(clnt,
 				rpc_clnt_swap_activate_callback, NULL);



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (11 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  8:31   ` Christoph Hellwig
  2021-11-16  3:29 ` [PATCH 00/13] Repair SWAP-over-NFS Matthew Wilcox
  13 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

When pages are read from SWP_FS_OPS swap-space, the reads are submitted as
separate reads for each page.  This is generally less efficient than
larger reads.

We can use the block-plugging infrastructure to delay submitting the
read request until multiple contiguous pages have been collected.  This
requires using ->direct_IO to submit the read (as ->readpages isn't
suitable for swap).

If the caller schedules before unplugging, we hand the read-in task off
to the system workqueue to avoid any possible locking issues.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page_io.c |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 103 insertions(+), 4 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 9725c7e1eeea..30d613881995 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -282,6 +282,14 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
+struct swap_iocb {
+	struct blk_plug_cb	cb;	/* Must be first */
+	struct kiocb		iocb;
+	struct bio_vec		bvec[SWAP_CLUSTER_MAX];
+	struct work_struct	work;
+	int			pages;
+};
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -353,6 +361,59 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+static void sio_read_complete(struct kiocb *iocb, long ret)
+{
+	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+	int p;
+
+	for (p = 0; p < sio->pages; p++) {
+		struct page *page = sio->bvec[p].bv_page;
+
+		if (ret != PAGE_SIZE * sio->pages) {
+			SetPageError(page);
+			ClearPageUptodate(page);
+			pr_alert_ratelimited("Read-error on swap-device\n");
+		} else {
+			SetPageUptodate(page);
+			count_vm_event(PSWPIN);
+		}
+		unlock_page(page);
+	}
+	kfree(sio);
+}
+
+static void sio_read_unplug(struct blk_plug_cb *cb, bool from_schedule);
+
+static void sio_read_unplug_worker(struct work_struct *work)
+{
+	struct swap_iocb *sio = container_of(work, struct swap_iocb, work);
+	sio_read_unplug(&sio->cb, 0);
+}
+
+static void sio_read_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+	struct swap_iocb *sio = container_of(cb, struct swap_iocb, cb);
+	struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+	struct iov_iter from;
+	unsigned int nofs_flag;
+	int ret;
+
+	if (from_schedule) {
+		INIT_WORK(&sio->work, sio_read_unplug_worker);
+		schedule_work(&sio->work);
+		return;
+	}
+
+	iov_iter_bvec(&from, READ, sio->bvec,
+		      sio->pages, PAGE_SIZE * sio->pages);
+	/* nofs needs as ->direct_IO may take the same mutex it takes for write */
+	nofs_flag = memalloc_nofs_save();
+	ret = mapping->a_ops->direct_IO(&sio->iocb, &from);
+	memalloc_nofs_restore(nofs_flag);
+	if (ret != -EIOCBQUEUED)
+		sio_read_complete(&sio->iocb, ret);
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -380,10 +441,48 @@ int swap_readpage(struct page *page, bool synchronous)
 	if (data_race(sis->flags & SWP_FS_OPS)) {
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
-
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
-			count_vm_event(PSWPIN);
+		struct blk_plug_cb *cb;
+		struct swap_iocb *sio;
+		loff_t pos = page_file_offset(page);
+		struct blk_plug plug;
+		int p;
+
+		/* We are sometimes called without a plug active.
+		 * By calling blk_start_plug() here, we ensure blk_check_plugged
+		 * only fails if memory allocation fails.
+		 */
+		blk_start_plug(&plug);
+		cb = blk_check_plugged(sio_read_unplug, swap_file, sizeof(*sio));
+		sio = container_of(cb, struct swap_iocb, cb);
+		if (cb && sio->pages &&
+		    sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+			/* Not contiguous - hide this sio from lookup */
+			cb->data = NULL;
+			cb = blk_check_plugged(sio_read_unplug, swap_file,
+					       sizeof(*sio));
+			sio = container_of(cb, struct swap_iocb, cb);
+		}
+		if (!cb) {
+			blk_finish_plug(&plug);
+			ret = mapping->a_ops->readpage(swap_file, page);
+			if (!ret)
+				count_vm_event(PSWPIN);
+			goto out;
+		}
+		if (sio->pages == 0) {
+			init_sync_kiocb(&sio->iocb, swap_file);
+			sio->iocb.ki_pos = pos;
+			sio->iocb.ki_complete = sio_read_complete;
+		}
+		p = sio->pages;
+		sio->bvec[p].bv_page = page;
+		sio->bvec[p].bv_len = PAGE_SIZE;
+		sio->bvec[p].bv_offset = 0;
+		p += 1;
+		sio->pages = p;
+		if (p == ARRAY_SIZE(sio->bvec))
+			cb->data = NULL;
+		blk_finish_plug(&plug);
 		goto out;
 	}
 




* [PATCH 13/13] MM: use AIO for DIO writes to swap
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (8 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 04/13] SUNRPC/call_alloc: " NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 06/13] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

When swap-out goes through the filesystem (as with NFS), we currently
perform synchronous writes with ->direct_IO.  This serializes swap
writes and causes kswapd to block waiting for writes to complete.  This
is quite different to swap-out to a block device (always async), and
possibly hurts liveness.

So switch to AIO writes.  If the necessary kiocb structure cannot be
allocated, fall back to sync writes using a kiocb on the stack.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page_io.c |  136 ++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 103 insertions(+), 33 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 30d613881995..59a2d49e53c3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -25,6 +25,7 @@
 #include <linux/psi.h>
 #include <linux/uio.h>
 #include <linux/sched/task.h>
+#include "internal.h"
 
 void end_swap_bio_write(struct bio *bio)
 {
@@ -288,8 +289,70 @@ struct swap_iocb {
 	struct bio_vec		bvec[SWAP_CLUSTER_MAX];
 	struct work_struct	work;
 	int			pages;
+	bool			on_stack;
 };
 
+static void sio_aio_complete(struct kiocb *iocb, long ret)
+{
+	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+	int p;
+
+	if (ret != PAGE_SIZE * sio->pages) {
+		/*
+		 * In the case of swap-over-nfs, this can be a
+		 * temporary failure if the system has limited
+		 * memory for allocating transmit buffers.
+		 * Mark the page dirty and avoid
+		 * rotate_reclaimable_page but rate-limit the
+		 * messages but do not flag PageError like
+		 * the normal direct-to-bio case as it could
+		 * be temporary.
+		 */
+		pr_err_ratelimited("Write error on dio swapfile (%llu - %d pages)\n",
+				   page_file_offset(sio->bvec[0].bv_page),
+				   sio->pages);
+		for (p = 0; p < sio->pages; p++) {
+			set_page_dirty(sio->bvec[p].bv_page);
+			ClearPageReclaim(sio->bvec[p].bv_page);
+		}
+	}
+	for (p = 0; p < sio->pages; p++)
+		end_page_writeback(sio->bvec[p].bv_page);
+	if (!sio->on_stack)
+		kfree(sio);
+}
+
+static void sio_aio_unplug(struct blk_plug_cb *cb, bool from_schedule);
+
+static void sio_write_unplug_worker(struct work_struct *work)
+{
+	struct swap_iocb *sio = container_of(work, struct swap_iocb, work);
+	sio_aio_unplug(&sio->cb, 0);
+}
+
+static void sio_aio_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+	struct swap_iocb *sio = container_of(cb, struct swap_iocb, cb);
+	struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+	struct iov_iter from;
+	int ret;
+	unsigned int noreclaim_flag;
+
+	if (from_schedule) {
+		INIT_WORK(&sio->work, sio_write_unplug_worker);
+		queue_work(mm_percpu_wq, &sio->work);
+		return;
+	}
+
+	noreclaim_flag = memalloc_noreclaim_save();
+	iov_iter_bvec(&from, WRITE, sio->bvec,
+		      sio->pages, PAGE_SIZE * sio->pages);
+	ret = mapping->a_ops->direct_IO(&sio->iocb, &from);
+	memalloc_noreclaim_restore(noreclaim_flag);
+	if (ret != -EIOCBQUEUED)
+		sio_aio_complete(&sio->iocb, ret);
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -299,44 +362,51 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct kiocb kiocb;
+		struct swap_iocb *sio, sio_on_stack;
+		struct blk_plug_cb *cb;
 		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-		struct bio_vec bv = {
-			.bv_page = page,
-			.bv_len  = PAGE_SIZE,
-			.bv_offset = 0
-		};
-		struct iov_iter from;
-
-		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
-		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos = page_file_offset(page);
+		loff_t pos = page_file_offset(page);
+		int p;
 
 		set_page_writeback(page);
 		unlock_page(page);
-		ret = mapping->a_ops->direct_IO(&kiocb, &from);
-		if (ret == PAGE_SIZE) {
-			count_vm_event(PSWPOUT);
-			ret = 0;
-		} else {
-			/*
-			 * In the case of swap-over-nfs, this can be a
-			 * temporary failure if the system has limited
-			 * memory for allocating transmit buffers.
-			 * Mark the page dirty and avoid
-			 * folio_rotate_reclaimable but rate-limit the
-			 * messages but do not flag PageError like
-			 * the normal direct-to-bio case as it could
-			 * be temporary.
-			 */
-			set_page_dirty(page);
-			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
-					   page_file_offset(page));
+		cb = blk_check_plugged(sio_aio_unplug, swap_file, sizeof(*sio));
+		sio = container_of(cb, struct swap_iocb, cb);
+		if (cb && sio->pages &&
+		    sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+			/* Not contiguous - hide this sio from lookup */
+			cb->data = NULL;
+			cb = blk_check_plugged(sio_aio_unplug, swap_file,
+					       sizeof(*sio));
+			sio = container_of(cb, struct swap_iocb, cb);
 		}
-		end_page_writeback(page);
-		return ret;
+		if (!cb) {
+			sio = &sio_on_stack;
+			sio->pages = 0;
+			sio->on_stack = true;
+		}
+
+		if (sio->pages == 0) {
+			init_sync_kiocb(&sio->iocb, swap_file);
+			sio->iocb.ki_pos = pos;
+			if (sio != &sio_on_stack)
+				sio->iocb.ki_complete = sio_aio_complete;
+		}
+		p = sio->pages;
+		sio->bvec[p].bv_page = page;
+		sio->bvec[p].bv_len = PAGE_SIZE;
+		sio->bvec[p].bv_offset = 0;
+		p += 1;
+		sio->pages = p;
+		if (!cb)
+			sio_aio_unplug(&sio->cb, 0);
+		else if (p >= ARRAY_SIZE(sio->bvec))
+			/* Don't try to add to this */
+			cb->data = NULL;
+
+		count_vm_event(PSWPOUT);
+
+		return 0;
 	}
 
 	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);



^ permalink raw reply related	[flat|nested] 25+ messages in thread


* [PATCH 11/13] NFS: swap-out must always use STABLE writes.
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
  2021-11-16  2:44 ` [PATCH 09/13] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 02/13] NFS: do not take i_rwsem for swap IO NeilBrown
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

The commit handling code is not safe against memory-pressure deadlocks
when writing to swap.  In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.

swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does.  However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.

For safety, make it explicit in the code that direct writes used for
swap must always use FLUSH_COND_STABLE.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c        |   12 +++++++-----
 fs/nfs/file.c          |    2 +-
 include/linux/nfs_fs.h |    3 ++-
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1e80d243ba25..8d3b12402725 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -173,7 +173,7 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	if (iov_iter_rw(iter) == READ)
 		return nfs_file_direct_read(iocb, iter);
-	return nfs_file_direct_write(iocb, iter);
+	return nfs_file_direct_write(iocb, iter, FLUSH_STABLE);
 }
 
 static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
@@ -789,7 +789,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {
  */
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       struct iov_iter *iter,
-					       loff_t pos)
+					       loff_t pos, int ioflags)
 {
 	struct nfs_pageio_descriptor desc;
 	struct inode *inode = dreq->inode;
@@ -797,7 +797,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	size_t requested_bytes = 0;
 	size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize, PAGE_SIZE);
 
-	nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
+	nfs_pageio_init_write(&desc, inode, ioflags, false,
 			      &nfs_direct_write_completion_ops);
 	desc.pg_dreq = dreq;
 	get_dreq(dreq);
@@ -875,6 +875,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  * nfs_file_direct_write - file direct write operation for NFS files
  * @iocb: target I/O control block
  * @iter: vector of user buffers from which to write data
+ * @ioflags: flags for nfs_pageio_init_write()
  *
  * We use this function for direct writes instead of calling
  * generic_file_aio_write() in order to avoid taking the inode
@@ -891,7 +892,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  * Note that O_APPEND is not supported for NFS direct writes, as there
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
-ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+			      int ioflags)
 {
 	ssize_t result, requested;
 	size_t count;
@@ -935,7 +937,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	nfs_start_io_direct(inode);
 
-	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos, ioflags);
 
 	if (mapping->nrpages) {
 		invalidate_inode_pages2_range(mapping,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 59c271f42ea5..878a6a510a5e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -630,7 +630,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 		result = generic_write_checks(iocb, from);
 		if (result <= 0)
 			return result;
-		return nfs_file_direct_write(iocb, from);
+		return nfs_file_direct_write(iocb, from, FLUSH_COND_STABLE);
 	}
 
 	dprintk("NFS: write(%pD2, %zu@%Ld)\n",
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 5a605e51f4b1..ca312aea6bec 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -509,7 +509,8 @@ extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
 			struct iov_iter *iter);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
-			struct iov_iter *iter);
+				     struct iov_iter *iter,
+				     int ioflags);
 
 /*
  * linux/fs/nfs/dir.c




* [PATCH 08/13] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (2 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 02/13] NFS: do not take i_rwsem for swap IO NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 01/13] NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write() NeilBrown
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests.  This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/read.c                 |    4 ----
 include/linux/nfs_fs.h        |    5 -----
 include/linux/sunrpc/sched.h  |    1 -
 include/trace/events/sunrpc.h |    1 -
 net/sunrpc/auth.c             |    2 +-
 5 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index d11af2a9299c..a8f2b884306c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -194,10 +194,6 @@ static void nfs_initiate_read(struct nfs_pgio_header *hdr,
 			      const struct nfs_rpc_ops *rpc_ops,
 			      struct rpc_task_setup *task_setup_data, int how)
 {
-	struct inode *inode = hdr->inode;
-	int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
-
-	task_setup_data->flags |= swap_flags;
 	rpc_ops->read_setup(hdr, msg);
 	trace_nfs_initiate_read(hdr);
 }
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 05f249f20f55..5a605e51f4b1 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -45,11 +45,6 @@
  */
 #define NFS_MAX_TRANSPORTS 16
 
-/*
- * These are the default flags for swap requests
- */
-#define NFS_RPC_SWAPFLAGS		(RPC_TASK_SWAPPER|RPC_TASK_ROOTCREDS)
-
 /*
  * Size of the NFS directory verifier
  */
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index db964bb63912..56710f8056d3 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -124,7 +124,6 @@ struct rpc_task_setup {
 #define RPC_TASK_MOVEABLE	0x0004		/* nfs4.1+ rpc tasks */
 #define RPC_TASK_NULLCREDS	0x0010		/* Use AUTH_NULL credential */
 #define RPC_CALL_MAJORSEEN	0x0020		/* major timeout seen */
-#define RPC_TASK_ROOTCREDS	0x0040		/* force root creds */
 #define RPC_TASK_DYNAMIC	0x0080		/* task was kmalloc'ed */
 #define	RPC_TASK_NO_ROUND_ROBIN	0x0100		/* send requests on "main" xprt */
 #define RPC_TASK_SOFT		0x0200		/* Use soft timeouts */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 3a99358c262b..141dc34a450e 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -311,7 +311,6 @@ TRACE_EVENT(rpc_request,
 		{ RPC_TASK_MOVEABLE, "MOVEABLE" },			\
 		{ RPC_TASK_NULLCREDS, "NULLCREDS" },			\
 		{ RPC_CALL_MAJORSEEN, "MAJORSEEN" },			\
-		{ RPC_TASK_ROOTCREDS, "ROOTCREDS" },			\
 		{ RPC_TASK_DYNAMIC, "DYNAMIC" },			\
 		{ RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN" },		\
 		{ RPC_TASK_SOFT, "SOFT" },				\
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 6bfa19f9fa6a..682fcd24bf43 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -670,7 +670,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
 	/* If machine cred couldn't be bound, try a root cred */
 	if (new)
 		;
-	else if (cred == &machine_cred || (flags & RPC_TASK_ROOTCREDS))
+	else if (cred == &machine_cred)
 		new = rpcauth_bind_root_cred(task, lookupflags);
 	else if (flags & RPC_TASK_NULLCREDS)
 		new = authnull_ops.lookup_cred(NULL, NULL, 0);




* [PATCH 07/13] SUNRPC: remove scheduling boost for "SWAPPER" tasks.
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (4 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 01/13] NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write() NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 10/13] NFSv4: keep state manager thread active if swap is enabled NeilBrown
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks.  So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it.  If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sched.c |    7 -------
 net/sunrpc/xprt.c  |   11 -----------
 2 files changed, 18 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d5b6e897f5a5..256302bf6557 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -186,11 +186,6 @@ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue,
 
 /*
  * Add new request to wait queue.
- *
- * Swapper tasks always get inserted at the head of the queue.
- * This should avoid many nasty memory deadlocks and hopefully
- * improve overall performance.
- * Everyone else gets appended to the queue to ensure proper FIFO behavior.
  */
 static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
 		struct rpc_task *task,
@@ -199,8 +194,6 @@ static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
 	INIT_LIST_HEAD(&task->u.tk_wait.timer_list);
 	if (RPC_IS_PRIORITY(queue))
 		__rpc_add_wait_queue_priority(queue, task, queue_priority);
-	else if (RPC_IS_SWAPPER(task))
-		list_add(&task->u.tk_wait.list, &queue->tasks[0]);
 	else
 		list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]);
 	task->tk_waitqueue = queue;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 47d207e416ab..a0a2583fe941 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1354,17 +1354,6 @@ xprt_request_enqueue_transmit(struct rpc_task *task)
 				INIT_LIST_HEAD(&req->rq_xmit2);
 				goto out;
 			}
-		} else if (RPC_IS_SWAPPER(task)) {
-			list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
-				if (pos->rq_cong || pos->rq_bytes_sent)
-					continue;
-				if (RPC_IS_SWAPPER(pos->rq_task))
-					continue;
-				/* Note: req is added _before_ pos */
-				list_add_tail(&req->rq_xmit, &pos->rq_xmit);
-				INIT_LIST_HEAD(&req->rq_xmit2);
-				goto out;
-			}
 		} else if (!req->rq_seqno) {
 			list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
 				if (pos->rq_task->tk_owner != task->tk_owner)




* [PATCH 05/13] SUNRPC/auth: async tasks mustn't block waiting for memory
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (6 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 10/13] NFSv4: keep state manager thread active if swap is enabled NeilBrown
@ 2021-11-16  2:44 ` NeilBrown
  2021-11-16  2:44 ` [PATCH 04/13] SUNRPC/call_alloc: " NeilBrown
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  2:44 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton, Mel Gorman
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

mempools are a particular problem as memory can only be released back
to the mempool by an async rpc task running.  If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks.  So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory.  If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again.  HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/auth.h    |    1 +
 net/sunrpc/auth.c              |    6 +++++-
 net/sunrpc/auth_gss/auth_gss.c |    6 +++++-
 net/sunrpc/auth_unix.c         |   10 ++++++++--
 net/sunrpc/clnt.c              |    3 +++
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 98da816b5fc2..3e6ce288a7fc 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -99,6 +99,7 @@ struct rpc_auth_create_args {
 
 /* Flags for rpcauth_lookupcred() */
 #define RPCAUTH_LOOKUP_NEW		0x01	/* Accept an uninitialised cred */
+#define RPCAUTH_LOOKUP_ASYNC		0x02	/* Don't block waiting for memory */
 
 /*
  * Client authentication ops
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..6bfa19f9fa6a 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -615,6 +615,8 @@ rpcauth_bind_root_cred(struct rpc_task *task, int lookupflags)
 	};
 	struct rpc_cred *ret;
 
+	if (RPC_IS_ASYNC(task))
+		lookupflags |= RPCAUTH_LOOKUP_ASYNC;
 	ret = auth->au_ops->lookup_cred(auth, &acred, lookupflags);
 	put_cred(acred.cred);
 	return ret;
@@ -631,6 +633,8 @@ rpcauth_bind_machine_cred(struct rpc_task *task, int lookupflags)
 
 	if (!acred.principal)
 		return NULL;
+	if (RPC_IS_ASYNC(task))
+		lookupflags |= RPCAUTH_LOOKUP_ASYNC;
 	return auth->au_ops->lookup_cred(auth, &acred, lookupflags);
 }
 
@@ -654,7 +658,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
 	};
 
 	if (flags & RPC_TASK_ASYNC)
-		lookupflags |= RPCAUTH_LOOKUP_NEW;
+		lookupflags |= RPCAUTH_LOOKUP_NEW | RPCAUTH_LOOKUP_ASYNC;
 	if (task->tk_op_cred)
 		/* Task must use exactly this rpc_cred */
 		new = get_rpccred(task->tk_op_cred);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 5f42aa5fc612..df72d6301f78 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1341,7 +1341,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
 static struct rpc_cred *
 gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
 {
-	return rpcauth_lookup_credcache(auth, acred, flags, GFP_NOFS);
+	gfp_t gfp = GFP_NOFS;
+
+	if (flags & RPCAUTH_LOOKUP_ASYNC)
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
+	return rpcauth_lookup_credcache(auth, acred, flags, gfp);
 }
 
 static struct rpc_cred *
diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
index e7df1f782b2e..e5819265dd1b 100644
--- a/net/sunrpc/auth_unix.c
+++ b/net/sunrpc/auth_unix.c
@@ -43,8 +43,14 @@ unx_destroy(struct rpc_auth *auth)
 static struct rpc_cred *
 unx_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
 {
-	struct rpc_cred *ret = mempool_alloc(unix_pool, GFP_NOFS);
-
+	gfp_t gfp = GFP_NOFS;
+	struct rpc_cred *ret;
+
+	if (flags & RPCAUTH_LOOKUP_ASYNC)
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
+	ret = mempool_alloc(unix_pool, gfp);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
 	rpcauth_init_cred(ret, acred, auth, &unix_credops);
 	ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
 	return ret;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a312ea2bc440..238b2ef5491f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1745,6 +1745,9 @@ call_refreshresult(struct rpc_task *task)
 		task->tk_cred_retry--;
 		trace_rpc_retry_refresh_status(task);
 		return;
+	case -ENOMEM:
+		rpc_delay(task, HZ >> 4);
+		return;
 	}
 	trace_rpc_refresh_status(task);
 	rpc_call_rpcerror(task, status);




* Re: [PATCH 00/13] Repair SWAP-over-NFS
  2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
                   ` (12 preceding siblings ...)
  2021-11-16  2:44 ` [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space NeilBrown
@ 2021-11-16  3:29 ` Matthew Wilcox
  2021-11-16  3:55   ` NeilBrown
  13 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2021-11-16  3:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel

On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> swap-over-NFS currently has a variety of problems.
> 
> Due to a newish test in generic_write_checks(), all writes to swap
> currently fail.

And by "currently", you mean "for over two years" (August 2019).
Does swap-over-NFS (or any other network filesystem) actually have any
users, and should we fix it or rip it out?



* Re: [PATCH 00/13] Repair SWAP-over-NFS
  2021-11-16  3:29 ` [PATCH 00/13] Repair SWAP-over-NFS Matthew Wilcox
@ 2021-11-16  3:55   ` NeilBrown
  0 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16  3:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel

On Tue, 16 Nov 2021, Matthew Wilcox wrote:
> On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > swap-over-NFS currently has a variety of problems.
> > 
> > Due to a newish test in generic_write_checks(), all writes to swap
> > currently fail.
> 
> And by "currently", you mean "for over two years" (August 2019).

That's about the time scale for "enterprise" releases...
Actually, the earliest patches that impacted swap-over-NFS was more like
4 years ago.  I didn't bother tracking Fixes: tags for everything that
was a fix, as I didn't think it would really help and might encourage
people to backport little bits of the series which I wouldn't recommend.

> Does swap-over-NFS (or any other network filesystem) actually have any
> users, and should we fix it or rip it out?
> 
> 
We have at least one user (why else would I be working on this?).  I
think we have more, though they are presumably still on an earlier
release.

I'd prefer "fix it" over "rip it out".

I don't think any other network filesystem supports swap, but it is
not trivial to grep for.  There must be a 'swap_activate' method, and it
must return 0.  There must also be a direct_IO that works.
The only other network filesystem with swap_activate is cifs.  It
returns 0, but direct_IO returns -EINVAL.

NeilBrown


* Re: [PATCH 02/13] NFS: do not take i_rwsem for swap IO
  2021-11-16  2:44 ` [PATCH 02/13] NFS: do not take i_rwsem for swap IO NeilBrown
@ 2021-11-16  7:52   ` Christoph Hellwig
  2021-11-16 21:50     ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-11-16  7:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel

I'd really much prefer the variant we discussed before where
swap I/O uses its own method instead of overloading the normal
file I/O path all over.



* Re: [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space
  2021-11-16  2:44 ` [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space NeilBrown
@ 2021-11-16  8:31   ` Christoph Hellwig
  2021-11-16 21:46     ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-11-16  8:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel, linux-block

On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> When pages are read from SWP_FS_OPS swap-space, the reads are submitted as
> separate reads for each page.  This is generally less efficient than
> larger reads.
> 
> We can use the block-plugging infrastructure to delay submitting the
> read request until multiple contiguous pages have been collected.  This
> requires using ->direct_IO to submit the read (as ->readpages isn't
> suitable for swap).

Abusing the block code here seems a little ugly.  Also this won't
compile if CONFIG_BLOCK is not set, will it?

What is the problem with just batching up manually?

> +	/* nofs is needed as ->direct_IO may take the same mutex it takes for write */

Overly long line.


* Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS
  2021-11-16  2:44 ` [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS NeilBrown
@ 2021-11-16  8:32   ` Christoph Hellwig
  2021-11-16 21:35     ` NeilBrown
  2021-11-18  1:43   ` kernel test robot
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-11-16  8:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel

On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > +		/* ->flags can be updated non-atomically (scan_swap_map_slots),
> +		 * but that will never affect SWP_FS_OPS, so the data_race
> +		 * is safe.
> +		 */
>  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> +			(PageSwapCache(page) &&
> +			 !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
> +			 (sc->gfp_mask & __GFP_IO));

You might want to move the comment and SWP_FS_OPS into a little
inline helper.  That makes it a lot more readable and also avoids the
overly long line in the second hunk.


* Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS
  2021-11-16  8:32   ` Christoph Hellwig
@ 2021-11-16 21:35     ` NeilBrown
  2021-11-17  5:50       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-11-16 21:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel

On Tue, 16 Nov 2021, Christoph Hellwig wrote:
> On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > > +		/* ->flags can be updated non-atomically (scan_swap_map_slots),
> > +		 * but that will never affect SWP_FS_OPS, so the data_race
> > +		 * is safe.
> > +		 */
> >  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > +			(PageSwapCache(page) &&
> > +			 !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
> > +			 (sc->gfp_mask & __GFP_IO));
> 
> You might want to move the comment and SWP_FS_OPS into a little
> inline helper.  That makes it a lot more readable and also avoids the
> overly long line in the second hunk.

Yes, that's a good idea.  Something like this....

Thanks,
NeilBrown

From a85d09cc3d671c45e32d782454afeeaaaece96c7 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Fri, 29 Oct 2021 13:35:56 +1100
Subject: [PATCH] MM: reclaim mustn't enter FS for swap-over-NFS

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only downgrade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS is not being used.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/vmscan.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..e672fcc14bac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1464,6 +1464,21 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	return nr_succeeded;
 }
 
+static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_FS)
+		return true;
+	if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
+		return false;
+	/* We can "enter_fs" for swap-cache with only __GFP_IO
+	 * providing this isn't SWP_FS_OPS.
+	 * ->flags can be updated non-atomically (scan_swap_map_slots),
+	 * but that will never affect SWP_FS_OPS, so the data_race
+	 * is safe.
+	 */
+	return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1513,8 +1528,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
-		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
-			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+		may_enter_fs = test_may_enter_fs(page, sc->gfp_mask);
 
 		/*
 		 * The number of dirty pages determines if a node is marked
@@ -1682,7 +1696,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 						goto activate_locked_split;
 				}
 
-				may_enter_fs = true;
+				may_enter_fs = test_may_enter_fs(page,
+								 sc->gfp_mask);
 
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
-- 
2.33.1



* Re: [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space
  2021-11-16  8:31   ` Christoph Hellwig
@ 2021-11-16 21:46     ` NeilBrown
  0 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-11-16 21:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel, linux-block

On Tue, 16 Nov 2021, Christoph Hellwig wrote:
> On Tue, Nov 16, 2021 at 01:44:04PM +1100, NeilBrown wrote:
> > When pages are read from SWP_FS_OPS swap-space, the reads are submitted as
> > separate reads for each page.  This is generally less efficient than
> > larger reads.
> > 
> > We can use the block-plugging infrastructure to delay submitting the
> > read request until multiple contiguous pages have been collected.  This
> > requires using ->direct_IO to submit the read (as ->readpages isn't
> > suitable for swap).
> 
> Abusing the block code here seems a little ugly.  Also this won't
> compile if CONFIG_BLOCK is not set, will it?

There is nothing really block-layer-specific about the plugging
interfaces.  I think it would be quite reasonable to move them to lib/.
But you are correct that currently, without CONFIG_BLOCK, the code will
compile but not work.

> 
> What is the problem with just batching up manually?

That would require a bigger change to common code, which would only
benefit one user.  The plugging mechanism works well for batching
requests to a block device.  Why not use it for non-block-devices too?

Thanks,
NeilBrown


> 
> > +	/* nofs is needed as ->direct_IO may take the same mutex it takes for write */
> 
> Overly long line.
> 
> 


* Re: [PATCH 02/13] NFS: do not take i_rwsem for swap IO
  2021-11-16  7:52   ` Christoph Hellwig
@ 2021-11-16 21:50     ` NeilBrown
  2021-11-17  5:49       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-11-16 21:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, linux-nfs, linux-mm, linux-kernel

On Tue, 16 Nov 2021, Christoph Hellwig wrote:
> I'd really much prefer the variant we discussed before where
> swap I/O uses it's own method instead of overloading the normal
> file I/O path all over.
> 
> 
This would be David Howells' "mm: Use DIO for swap and fix NFS
swapfiles" series?  I'd be very happy to work with that once it lands.
I might try it out and see how the two work together.

Thanks,
NeilBrown


* Re: [PATCH 02/13] NFS: do not take i_rwsem for swap IO
  2021-11-16 21:50     ` NeilBrown
@ 2021-11-17  5:49       ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2021-11-17  5:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, linux-nfs, linux-mm, linux-kernel,
	David Howells

On Wed, Nov 17, 2021 at 08:50:12AM +1100, NeilBrown wrote:
> This would be David Howells' "mm: Use DIO for swap and fix NFS
> swapfiles" series?

Yes.

> I'd be very happy to work with that once it lands.
> I might try it out and see how two work together.

Dave: is it ok if Neil takes over the swap vs NFS work given that he's
working on a customer requirement?


* Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS
  2021-11-16 21:35     ` NeilBrown
@ 2021-11-17  5:50       ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2021-11-17  5:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, linux-nfs, linux-mm, linux-kernel

On Wed, Nov 17, 2021 at 08:35:23AM +1100, NeilBrown wrote:
> +	/* We can "enter_fs" for swap-cache with only __GFP_IO

normal kernel style would be the

	/*

on a line of its own.  But except for this nitpick, this looks good.


* Re: [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS
  2021-11-16  2:44 ` [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS NeilBrown
  2021-11-16  8:32   ` Christoph Hellwig
@ 2021-11-18  1:43   ` kernel test robot
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2021-11-18  1:43 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman
  Cc: kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel


Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.16-rc1]
[also build test ERROR on next-20211117]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master rostedt-trace/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211116-104822
base:    fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf
config: mips-randconfig-r031-20211116 (attached as .config)
compiler: mipsel-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/b2f1d12df57f816d09ef57fa73758fec820a23f1
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211116-104822
        git checkout b2f1d12df57f816d09ef57fa73758fec820a23f1
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   mm/vmscan.c: In function 'shrink_page_list':
>> mm/vmscan.c:1522:37: error: implicit declaration of function 'page_swap_info'; did you mean 'swp_swap_info'? [-Werror=implicit-function-declaration]
    1522 |                          !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
         |                                     ^~~~~~~~~~~~~~
   include/linux/compiler_types.h:291:27: note: in definition of macro '__unqual_scalar_typeof'
     291 |                 _Generic((x),                                           \
         |                           ^
   mm/vmscan.c:1522:27: note: in expansion of macro 'data_race'
    1522 |                          !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
         |                           ^~~~~~~~~
>> mm/vmscan.c:1522:57: error: invalid type argument of '->' (have 'int')
    1522 |                          !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
         |                                                         ^~
   include/linux/compiler_types.h:291:27: note: in definition of macro '__unqual_scalar_typeof'
     291 |                 _Generic((x),                                           \
         |                           ^
   mm/vmscan.c:1522:27: note: in expansion of macro 'data_race'
    1522 |                          !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
         |                           ^~~~~~~~~
   In file included from arch/mips/include/asm/bug.h:5,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/vmscan.c:15:
>> mm/vmscan.c:1522:57: error: invalid type argument of '->' (have 'int')
    1522 |                          !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
         |                                                         ^~
   include/linux/compiler.h:218:17: note: in definition of macro 'data_race'
     218 |                 expr;                                                   \
         |                 ^~~~
   In file included from <command-line>:
   mm/vmscan.c:1692:68: error: invalid type argument of '->' (have 'int')
    1692 |                                     !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
         |                                                                    ^~
   include/linux/compiler_types.h:291:27: note: in definition of macro '__unqual_scalar_typeof'
     291 |                 _Generic((x),                                           \
         |                           ^
   mm/vmscan.c:1692:38: note: in expansion of macro 'data_race'
    1692 |                                     !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
         |                                      ^~~~~~~~~
   In file included from arch/mips/include/asm/bug.h:5,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/vmscan.c:15:
   mm/vmscan.c:1692:68: error: invalid type argument of '->' (have 'int')
    1692 |                                     !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
         |                                                                    ^~
   include/linux/compiler.h:218:17: note: in definition of macro 'data_race'
     218 |                 expr;                                                   \
         |                 ^~~~
   cc1: some warnings being treated as errors


vim +1522 mm/vmscan.c

  1466	
  1467	/*
  1468	 * shrink_page_list() returns the number of reclaimed pages
  1469	 */
  1470	static unsigned int shrink_page_list(struct list_head *page_list,
  1471					     struct pglist_data *pgdat,
  1472					     struct scan_control *sc,
  1473					     struct reclaim_stat *stat,
  1474					     bool ignore_references)
  1475	{
  1476		LIST_HEAD(ret_pages);
  1477		LIST_HEAD(free_pages);
  1478		LIST_HEAD(demote_pages);
  1479		unsigned int nr_reclaimed = 0;
  1480		unsigned int pgactivate = 0;
  1481		bool do_demote_pass;
  1482	
  1483		memset(stat, 0, sizeof(*stat));
  1484		cond_resched();
  1485		do_demote_pass = can_demote(pgdat->node_id, sc);
  1486	
  1487	retry:
  1488		while (!list_empty(page_list)) {
  1489			struct address_space *mapping;
  1490			struct page *page;
  1491			enum page_references references = PAGEREF_RECLAIM;
  1492			bool dirty, writeback, may_enter_fs;
  1493			unsigned int nr_pages;
  1494	
  1495			cond_resched();
  1496	
  1497			page = lru_to_page(page_list);
  1498			list_del(&page->lru);
  1499	
  1500			if (!trylock_page(page))
  1501				goto keep;
  1502	
  1503			VM_BUG_ON_PAGE(PageActive(page), page);
  1504	
  1505			nr_pages = compound_nr(page);
  1506	
  1507			/* Account the number of base pages even though THP */
  1508			sc->nr_scanned += nr_pages;
  1509	
  1510			if (unlikely(!page_evictable(page)))
  1511				goto activate_locked;
  1512	
  1513			if (!sc->may_unmap && page_mapped(page))
  1514				goto keep_locked;
  1515	
  1516		/* ->flags can be updated non-atomically (scan_swap_map_slots),
  1517			 * but that will never affect SWP_FS_OPS, so the data_race
  1518			 * is safe.
  1519			 */
  1520			may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
  1521				(PageSwapCache(page) &&
> 1522				 !data_race(page_swap_info(page)->flags & SWP_FS_OPS) &&
  1523				 (sc->gfp_mask & __GFP_IO));
  1524	
  1525			/*
  1526			 * The number of dirty pages determines if a node is marked
  1527			 * reclaim_congested. kswapd will stall and start writing
  1528			 * pages if the tail of the LRU is all dirty unqueued pages.
  1529			 */
  1530			page_check_dirty_writeback(page, &dirty, &writeback);
  1531			if (dirty || writeback)
  1532				stat->nr_dirty++;
  1533	
  1534			if (dirty && !writeback)
  1535				stat->nr_unqueued_dirty++;
  1536	
  1537			/*
  1538			 * Treat this page as congested if the underlying BDI is or if
  1539			 * pages are cycling through the LRU so quickly that the
  1540			 * pages marked for immediate reclaim are making it to the
  1541			 * end of the LRU a second time.
  1542			 */
  1543			mapping = page_mapping(page);
  1544			if (((dirty || writeback) && mapping &&
  1545			     inode_write_congested(mapping->host)) ||
  1546			    (writeback && PageReclaim(page)))
  1547				stat->nr_congested++;
  1548	
  1549			/*
  1550			 * If a page at the tail of the LRU is under writeback, there
  1551			 * are three cases to consider.
  1552			 *
  1553			 * 1) If reclaim is encountering an excessive number of pages
  1554			 *    under writeback and this page is both under writeback and
  1555			 *    PageReclaim then it indicates that pages are being queued
  1556			 *    for IO but are being recycled through the LRU before the
  1557			 *    IO can complete. Waiting on the page itself risks an
  1558			 *    indefinite stall if it is impossible to writeback the
  1559			 *    page due to IO error or disconnected storage so instead
  1560			 *    note that the LRU is being scanned too quickly and the
  1561			 *    caller can stall after page list has been processed.
  1562			 *
  1563			 * 2) Global or new memcg reclaim encounters a page that is
  1564			 *    not marked for immediate reclaim, or the caller does not
  1565			 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
  1566			 *    not to fs). In this case mark the page for immediate
  1567			 *    reclaim and continue scanning.
  1568			 *
  1569			 *    Require may_enter_fs because we would wait on fs, which
  1570			 *    may not have submitted IO yet. And the loop driver might
  1571			 *    enter reclaim, and deadlock if it waits on a page for
  1572			 *    which it is needed to do the write (loop masks off
  1573			 *    __GFP_IO|__GFP_FS for this reason); but more thought
  1574			 *    would probably show more reasons.
  1575			 *
  1576			 * 3) Legacy memcg encounters a page that is already marked
  1577			 *    PageReclaim. memcg does not have any dirty pages
  1578			 *    throttling so we could easily OOM just because too many
  1579			 *    pages are in writeback and there is nothing else to
  1580			 *    reclaim. Wait for the writeback to complete.
  1581			 *
  1582			 * In cases 1) and 2) we activate the pages to get them out of
  1583			 * the way while we continue scanning for clean pages on the
  1584			 * inactive list and refilling from the active list. The
  1585			 * observation here is that waiting for disk writes is more
  1586			 * expensive than potentially causing reloads down the line.
  1587			 * Since they're marked for immediate reclaim, they won't put
  1588			 * memory pressure on the cache working set any longer than it
  1589			 * takes to write them to disk.
  1590			 */
  1591			if (PageWriteback(page)) {
  1592				/* Case 1 above */
  1593				if (current_is_kswapd() &&
  1594				    PageReclaim(page) &&
  1595				    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
  1596					stat->nr_immediate++;
  1597					goto activate_locked;
  1598	
  1599				/* Case 2 above */
  1600				} else if (writeback_throttling_sane(sc) ||
  1601				    !PageReclaim(page) || !may_enter_fs) {
  1602					/*
  1603					 * This is slightly racy - end_page_writeback()
  1604					 * might have just cleared PageReclaim, then
  1605					 * setting PageReclaim here end up interpreted
  1606					 * as PageReadahead - but that does not matter
  1607					 * enough to care.  What we do want is for this
  1608					 * page to have PageReclaim set next time memcg
  1609					 * reclaim reaches the tests above, so it will
  1610					 * then wait_on_page_writeback() to avoid OOM;
  1611					 * and it's also appropriate in global reclaim.
  1612					 */
  1613					SetPageReclaim(page);
  1614					stat->nr_writeback++;
  1615					goto activate_locked;
  1616	
  1617				/* Case 3 above */
  1618				} else {
  1619					unlock_page(page);
  1620					wait_on_page_writeback(page);
  1621					/* then go back and try same page again */
  1622					list_add_tail(&page->lru, page_list);
  1623					continue;
  1624				}
  1625			}
  1626	
  1627			if (!ignore_references)
  1628				references = page_check_references(page, sc);
  1629	
  1630			switch (references) {
  1631			case PAGEREF_ACTIVATE:
  1632				goto activate_locked;
  1633			case PAGEREF_KEEP:
  1634				stat->nr_ref_keep += nr_pages;
  1635				goto keep_locked;
  1636			case PAGEREF_RECLAIM:
  1637			case PAGEREF_RECLAIM_CLEAN:
  1638				; /* try to reclaim the page below */
  1639			}
  1640	
  1641			/*
  1642			 * Before reclaiming the page, try to relocate
  1643			 * its contents to another node.
  1644			 */
  1645			if (do_demote_pass &&
  1646			    (thp_migration_supported() || !PageTransHuge(page))) {
  1647				list_add(&page->lru, &demote_pages);
  1648				unlock_page(page);
  1649				continue;
  1650			}
  1651	
  1652			/*
  1653			 * Anonymous process memory has backing store?
  1654			 * Try to allocate it some swap space here.
  1655			 * Lazyfree page could be freed directly
  1656			 */
  1657			if (PageAnon(page) && PageSwapBacked(page)) {
  1658				if (!PageSwapCache(page)) {
  1659					if (!(sc->gfp_mask & __GFP_IO))
  1660						goto keep_locked;
  1661					if (page_maybe_dma_pinned(page))
  1662						goto keep_locked;
  1663					if (PageTransHuge(page)) {
  1664						/* cannot split THP, skip it */
  1665						if (!can_split_huge_page(page, NULL))
  1666							goto activate_locked;
  1667						/*
  1668						 * Split pages without a PMD map right
  1669						 * away. Chances are some or all of the
  1670						 * tail pages can be freed without IO.
  1671						 */
  1672						if (!compound_mapcount(page) &&
  1673						    split_huge_page_to_list(page,
  1674									    page_list))
  1675							goto activate_locked;
  1676					}
  1677					if (!add_to_swap(page)) {
  1678						if (!PageTransHuge(page))
  1679							goto activate_locked_split;
  1680						/* Fallback to swap normal pages */
  1681						if (split_huge_page_to_list(page,
  1682									    page_list))
  1683							goto activate_locked;
  1684	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  1685						count_vm_event(THP_SWPOUT_FALLBACK);
  1686	#endif
  1687						if (!add_to_swap(page))
  1688							goto activate_locked_split;
  1689					}
  1690	
  1691					if ((sc->gfp_mask & __GFP_FS) ||
  1692					    !data_race(page_swap_info(page)->flags & SWP_FS_OPS))
  1693						may_enter_fs = true;
  1694	
  1695					/* Adding to swap updated mapping */
  1696					mapping = page_mapping(page);
  1697				}
  1698			} else if (unlikely(PageTransHuge(page))) {
  1699				/* Split file THP */
  1700				if (split_huge_page_to_list(page, page_list))
  1701					goto keep_locked;
  1702			}
  1703	
  1704			/*
  1705			 * THP may get split above, need minus tail pages and update
  1706			 * nr_pages to avoid accounting tail pages twice.
  1707			 *
  1708			 * The tail pages that are added into swap cache successfully
  1709			 * reach here.
  1710			 */
  1711			if ((nr_pages > 1) && !PageTransHuge(page)) {
  1712				sc->nr_scanned -= (nr_pages - 1);
  1713				nr_pages = 1;
  1714			}
  1715	
  1716			/*
  1717			 * The page is mapped into the page tables of one or more
  1718			 * processes. Try to unmap it here.
  1719			 */
  1720			if (page_mapped(page)) {
  1721				enum ttu_flags flags = TTU_BATCH_FLUSH;
  1722				bool was_swapbacked = PageSwapBacked(page);
  1723	
  1724				if (unlikely(PageTransHuge(page)))
  1725					flags |= TTU_SPLIT_HUGE_PMD;
  1726	
  1727				try_to_unmap(page, flags);
  1728				if (page_mapped(page)) {
  1729					stat->nr_unmap_fail += nr_pages;
  1730					if (!was_swapbacked && PageSwapBacked(page))
  1731						stat->nr_lazyfree_fail += nr_pages;
  1732					goto activate_locked;
  1733				}
  1734			}
  1735	
  1736			if (PageDirty(page)) {
  1737				/*
  1738				 * Only kswapd can writeback filesystem pages
  1739				 * to avoid risk of stack overflow. But avoid
  1740				 * injecting inefficient single-page IO into
  1741				 * flusher writeback as much as possible: only
  1742				 * write pages when we've encountered many
  1743				 * dirty pages, and when we've already scanned
  1744				 * the rest of the LRU for clean pages and see
  1745				 * the same dirty pages again (PageReclaim).
  1746				 */
  1747				if (page_is_file_lru(page) &&
  1748				    (!current_is_kswapd() || !PageReclaim(page) ||
  1749				     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
  1750					/*
  1751					 * Immediately reclaim when written back.
  1752					 * Similar in principal to deactivate_page()
  1753					 * except we already have the page isolated
  1754					 * and know it's dirty
  1755					 */
  1756					inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
  1757					SetPageReclaim(page);
  1758	
  1759					goto activate_locked;
  1760				}
  1761	
  1762				if (references == PAGEREF_RECLAIM_CLEAN)
  1763					goto keep_locked;
  1764				if (!may_enter_fs)
  1765					goto keep_locked;
  1766				if (!sc->may_writepage)
  1767					goto keep_locked;
  1768	
  1769				/*
  1770				 * Page is dirty. Flush the TLB if a writable entry
  1771				 * potentially exists to avoid CPU writes after IO
  1772				 * starts and then write it out here.
  1773				 */
  1774				try_to_unmap_flush_dirty();
  1775				switch (pageout(page, mapping)) {
  1776				case PAGE_KEEP:
  1777					goto keep_locked;
  1778				case PAGE_ACTIVATE:
  1779					goto activate_locked;
  1780				case PAGE_SUCCESS:
  1781					stat->nr_pageout += thp_nr_pages(page);
  1782	
  1783					if (PageWriteback(page))
  1784						goto keep;
  1785					if (PageDirty(page))
  1786						goto keep;
  1787	
  1788					/*
  1789					 * A synchronous write - probably a ramdisk.  Go
  1790					 * ahead and try to reclaim the page.
  1791					 */
  1792					if (!trylock_page(page))
  1793						goto keep;
  1794					if (PageDirty(page) || PageWriteback(page))
  1795						goto keep_locked;
  1796					mapping = page_mapping(page);
  1797					fallthrough;
  1798				case PAGE_CLEAN:
  1799					; /* try to free the page below */
  1800				}
  1801			}
  1802	
  1803			/*
  1804			 * If the page has buffers, try to free the buffer mappings
  1805			 * associated with this page. If we succeed we try to free
  1806			 * the page as well.
  1807			 *
  1808			 * We do this even if the page is PageDirty().
  1809			 * try_to_release_page() does not perform I/O, but it is
  1810			 * possible for a page to have PageDirty set, but it is actually
  1811			 * clean (all its buffers are clean).  This happens if the
  1812			 * buffers were written out directly, with submit_bh(). ext3
  1813			 * will do this, as well as the blockdev mapping.
  1814			 * try_to_release_page() will discover that cleanness and will
  1815			 * drop the buffers and mark the page clean - it can be freed.
  1816			 *
  1817			 * Rarely, pages can have buffers and no ->mapping.  These are
  1818			 * the pages which were not successfully invalidated in
  1819			 * truncate_cleanup_page().  We try to drop those buffers here
  1820			 * and if that worked, and the page is no longer mapped into
  1821			 * process address space (page_count == 1) it can be freed.
  1822			 * Otherwise, leave the page on the LRU so it is swappable.
  1823			 */
  1824			if (page_has_private(page)) {
  1825				if (!try_to_release_page(page, sc->gfp_mask))
  1826					goto activate_locked;
  1827				if (!mapping && page_count(page) == 1) {
  1828					unlock_page(page);
  1829					if (put_page_testzero(page))
  1830						goto free_it;
  1831					else {
  1832						/*
  1833						 * rare race with speculative reference.
  1834						 * the speculative reference will free
  1835						 * this page shortly, so we may
  1836						 * increment nr_reclaimed here (and
  1837						 * leave it off the LRU).
  1838						 */
  1839						nr_reclaimed++;
  1840						continue;
  1841					}
  1842				}
  1843			}
  1844	
  1845			if (PageAnon(page) && !PageSwapBacked(page)) {
  1846				/* follow __remove_mapping for reference */
  1847				if (!page_ref_freeze(page, 1))
  1848					goto keep_locked;
  1849				/*
  1850				 * The page has only one reference left, which is
  1851				 * from the isolation. After the caller puts the
  1852				 * page back on lru and drops the reference, the
  1853				 * page will be freed anyway. It doesn't matter
  1854				 * which lru it goes. So we don't bother checking
  1855				 * PageDirty here.
  1856				 */
  1857				count_vm_event(PGLAZYFREED);
  1858				count_memcg_page_event(page, PGLAZYFREED);
  1859			} else if (!mapping || !__remove_mapping(mapping, page, true,
  1860								 sc->target_mem_cgroup))
  1861				goto keep_locked;
  1862	
  1863			unlock_page(page);
  1864	free_it:
  1865			/*
  1866			 * THP may get swapped out in a whole, need account
  1867			 * all base pages.
  1868			 */
  1869			nr_reclaimed += nr_pages;
  1870	
  1871			/*
  1872			 * Is there need to periodically free_page_list? It would
  1873			 * appear not as the counts should be low
  1874			 */
  1875			if (unlikely(PageTransHuge(page)))
  1876				destroy_compound_page(page);
  1877			else
  1878				list_add(&page->lru, &free_pages);
  1879			continue;
  1880	
  1881	activate_locked_split:
  1882			/*
  1883			 * The tail pages that are failed to add into swap cache
  1884			 * reach here.  Fixup nr_scanned and nr_pages.
  1885			 */
  1886			if (nr_pages > 1) {
  1887				sc->nr_scanned -= (nr_pages - 1);
  1888				nr_pages = 1;
  1889			}
  1890	activate_locked:
  1891			/* Not a candidate for swapping, so reclaim swap space. */
  1892			if (PageSwapCache(page) && (mem_cgroup_swap_full(page) ||
  1893							PageMlocked(page)))
  1894				try_to_free_swap(page);
  1895			VM_BUG_ON_PAGE(PageActive(page), page);
  1896			if (!PageMlocked(page)) {
  1897				int type = page_is_file_lru(page);
  1898				SetPageActive(page);
  1899				stat->nr_activate[type] += nr_pages;
  1900				count_memcg_page_event(page, PGACTIVATE);
  1901			}
  1902	keep_locked:
  1903			unlock_page(page);
  1904	keep:
  1905			list_add(&page->lru, &ret_pages);
  1906			VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
  1907		}
  1908		/* 'page_list' is always empty here */
  1909	
  1910		/* Migrate pages selected for demotion */
  1911		nr_reclaimed += demote_page_list(&demote_pages, pgdat);
  1912		/* Pages that could not be demoted are still in @demote_pages */
  1913		if (!list_empty(&demote_pages)) {
  1914			/* Pages which could not be demoted go back on @page_list for retry: */
  1915			list_splice_init(&demote_pages, page_list);
  1916			do_demote_pass = false;
  1917			goto retry;
  1918		}
  1919	
  1920		pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
  1921	
  1922		mem_cgroup_uncharge_list(&free_pages);
  1923		try_to_unmap_flush();
  1924		free_unref_page_list(&free_pages);
  1925	
  1926		list_splice(&ret_pages, page_list);
  1927		count_vm_events(PGACTIVATE, pgactivate);
  1928	
  1929		return nr_reclaimed;
  1930	}
  1931	
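As an aside for readers following the lazy-free branch above: page_ref_freeze(page, 1) succeeds only when the isolation reference is the last one remaining, which is what makes it safe to free the page without checking PageDirty. A minimal user-space sketch of that compare-and-swap pattern (hypothetical names and a simplified model, not the kernel implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page's reference count. */
struct page_model {
	atomic_int refcount;
};

/*
 * Model of the page_ref_freeze() idea: succeed only if the count is
 * exactly `expected`, atomically dropping it to 0 so that no concurrent
 * holder can be racing with the free.
 */
static bool ref_freeze(struct page_model *p, int expected)
{
	int old = expected;
	/* Single CAS: the sole remaining reference holder wins. */
	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}
```

If any other reference exists (refcount != expected), the CAS fails and the caller must keep the page locked and retry later, which is exactly the `goto keep_locked` path above.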

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33321 bytes --]


Thread overview: 25+ messages
2021-11-16  2:44 [PATCH 00/13] Repair SWAP-over-NFS NeilBrown
2021-11-16  2:44 ` [PATCH 09/13] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
2021-11-16  2:44 ` [PATCH 11/13] NFS: swap-out must always use STABLE writes NeilBrown
2021-11-16  2:44 ` [PATCH 02/13] NFS: do not take i_rwsem for swap IO NeilBrown
2021-11-16  7:52   ` Christoph Hellwig
2021-11-16 21:50     ` NeilBrown
2021-11-17  5:49       ` Christoph Hellwig
2021-11-16  2:44 ` [PATCH 08/13] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
2021-11-16  2:44 ` [PATCH 01/13] NFS: move generic_write_checks() call from nfs_file_direct_write() to nfs_file_write() NeilBrown
2021-11-16  2:44 ` [PATCH 07/13] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
2021-11-16  2:44 ` [PATCH 10/13] NFSv4: keep state manager thread active if swap is enabled NeilBrown
2021-11-16  2:44 ` [PATCH 05/13] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
2021-11-16  2:44 ` [PATCH 04/13] SUNRPC/call_alloc: " NeilBrown
2021-11-16  2:44 ` [PATCH 13/13] MM: use AIO for DIO writes to swap NeilBrown
2021-11-16  2:44 ` [PATCH 06/13] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
2021-11-16  2:44 ` [PATCH 03/13] MM: reclaim mustn't enter FS for swap-over-NFS NeilBrown
2021-11-16  8:32   ` Christoph Hellwig
2021-11-16 21:35     ` NeilBrown
2021-11-17  5:50       ` Christoph Hellwig
2021-11-18  1:43   ` kernel test robot
2021-11-16  2:44 ` [PATCH 12/13] MM: use AIO/DIO for reads from SWP_FS_OPS swap-space NeilBrown
2021-11-16  8:31   ` Christoph Hellwig
2021-11-16 21:46     ` NeilBrown
2021-11-16  3:29 ` [PATCH 00/13] Repair SWAP-over-NFS Matthew Wilcox
2021-11-16  3:55   ` NeilBrown
