linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/18 V2] Repair SWAP-over-NFS
@ 2021-12-16 23:48 NeilBrown
  2021-12-16 23:48 ` [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
                   ` (18 more replies)
  0 siblings, 19 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

swap-over-NFS currently has a variety of problems.

swap writes call generic_write_checks(), which always fails on a swap
file, so it completely fails.
Even without this, various deadlocks are possible - largely due to
improvements in NFS memory allocation (using NOFS instead of ATOMIC)
which weren't tested against swap-out.

NFS is the only filesystem that has supported fs-based swap IO, and it
hasn't worked for several releases, so now is a convenient time to clean
up the swap-via-filesystem interfaces - we cannot break anything !

So the first few patches here clean up and improve various parts of the
swap-via-filesystem code.  ->activate_swap() is given a cleaner
interface, a new ->swap_rw is introduced instead of burdening
->direct_IO, etc.

Current swap-to-filesystem code only ever submits single-page reads and
writes.  These patches change that to allow multi-page IO when adjacent
requests are submitted.  Writes are also changed to be async rather than
sync.  This substantially speeds up write throughput for swap-over-NFS.

Some of the NFS patches can land independently of the MM patches.  A few
require the MM patches to land first.

Thanks,
NeilBrown


---

NeilBrown (18):
      Structural cleanup for filesystem-based swap
      MM: create new mm/swap.h header file.
      MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
      MM: perform async writes to SWP_FS_OPS swap-space
      MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
      MM: submit multipage reads for SWP_FS_OPS swap-space
      MM: submit multipage write for SWP_FS_OPS swap-space
      MM: Add AS_CAN_DIO mapping flag
      NFS: rename nfs_direct_IO and use as ->swap_rw
      NFS: swap IO handling is slightly different for O_DIRECT IO
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory
      SUNRPC/auth: async tasks mustn't block waiting for memory
      SUNRPC/xprt: async tasks mustn't block waiting for memory
      SUNRPC: remove scheduling boost for "SWAPPER" tasks.
      NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
      SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
      NFSv4: keep state manager thread active if swap is enabled
      NFS: swap-out must always use STABLE writes.


 drivers/block/loop.c            |   4 +-
 fs/fcntl.c                      |   5 +-
 fs/inode.c                      |   3 +
 fs/nfs/direct.c                 |  56 ++++++----
 fs/nfs/file.c                   |  25 +++--
 fs/nfs/inode.c                  |   1 +
 fs/nfs/nfs4_fs.h                |   1 +
 fs/nfs/nfs4proc.c               |  20 ++++
 fs/nfs/nfs4state.c              |  39 ++++++-
 fs/nfs/read.c                   |   4 -
 fs/nfs/write.c                  |   2 +
 fs/open.c                       |   2 +-
 fs/overlayfs/file.c             |  10 +-
 include/linux/fs.h              |   2 +-
 include/linux/nfs_fs.h          |  11 +-
 include/linux/nfs_xdr.h         |   2 +
 include/linux/pagemap.h         |   3 +-
 include/linux/sunrpc/auth.h     |   1 +
 include/linux/sunrpc/sched.h    |   1 -
 include/linux/swap.h            | 121 --------------------
 include/linux/writeback.h       |   7 ++
 include/trace/events/sunrpc.h   |   1 -
 mm/madvise.c                    |   9 +-
 mm/memory.c                     |   3 +-
 mm/mincore.c                    |   1 +
 mm/page_alloc.c                 |   1 +
 mm/page_io.c                    | 189 ++++++++++++++++++++++++++------
 mm/shmem.c                      |   1 +
 mm/swap.h                       | 140 +++++++++++++++++++++++
 mm/swap_state.c                 |  32 ++++--
 mm/swapfile.c                   |   6 +
 mm/util.c                       |   1 +
 mm/vmscan.c                     |  31 +++++-
 net/sunrpc/auth.c               |   8 +-
 net/sunrpc/auth_gss/auth_gss.c  |   6 +-
 net/sunrpc/auth_unix.c          |  10 +-
 net/sunrpc/clnt.c               |   7 +-
 net/sunrpc/sched.c              |  29 +++--
 net/sunrpc/xprt.c               |  19 ++--
 net/sunrpc/xprtrdma/transport.c |  10 +-
 net/sunrpc/xprtsock.c           |   8 ++
 41 files changed, 558 insertions(+), 274 deletions(-)
 create mode 100644 mm/swap.h

--
Signature


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/18] Structural cleanup for filesystem-based swap
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
  2021-12-16 23:48 ` [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-17 10:33   ` kernel test robot
  2021-12-21  8:34   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space NeilBrown
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Linux primarily uses IO to block devices for swap, but can send the IO
requests to a filesystem.  This has only ever worked for NFS, and that
hasn't worked for a while due to a lack of testing.  This seems like a
good time for some tidy-up before restoring swap-over-NFS functionality.

This patch:
 - updates the documentation (both copies!) for swap_activate which
   is woefully out-of-date
 - introduces a new address_space operation "swap_rw" for swap IO.
   The code currently used ->readpage for reads and ->direct_IO for
   writes.  The former imposes a limit of one-page-at-a-time, the
   later means that direct writes and swap writes are encouraged to
   use the same path.  While similar, swap can often be simpler as
   it can assume that no allocation is needed, and coherence with the
   page cache is irrelevant.
 - move the responsibility for setting SWP_FS_OPS to ->swap_activate()
   and also requires it to always call add_swap_extent().  This makes
   it much easier to find filesystems that require SWP_FS_OPS.
 - drops the call to the filesystem for ->set_page_dirty().  These
   pages do not belong to the filesystem, and it has no interest
   in the dirty status.

writeout is switched to ->swap_rw, but read-in is not as that requires
too much change for this patch.

Both cifs and nfs set SWP_FS_OPS but neither provide a swap_rw, so both
will now fail to activate swap.  cifs never really tried to provide swap
support as ->direct_IO always returns an error.  NFS will be fixed up
with following patches.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/filesystems/locking.rst |   18 ++++++++++++------
 Documentation/filesystems/vfs.rst     |   17 ++++++++++++-----
 fs/cifs/file.c                        |    7 ++++++-
 fs/nfs/file.c                         |   17 +++++++++++++++--
 include/linux/fs.h                    |    1 +
 include/linux/swap.h                  |    1 -
 mm/page_io.c                          |   26 ++++++--------------------
 mm/swap_state.c                       |    2 +-
 mm/swapfile.c                         |   10 +++-------
 9 files changed, 56 insertions(+), 43 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d36fe79167b3..c2bb753bf688 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -265,8 +265,9 @@ prototypes::
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
-	int (*swap_activate)(struct file *);
+	int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 	int (*swap_deactivate)(struct file *);
+	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 
 locking rules:
 	All except set_page_dirty and freepage may block
@@ -295,6 +296,7 @@ is_partially_uptodate:	yes
 error_remove_page:	yes
 swap_activate:		no
 swap_deactivate:	no
+swap_rw:		yes, unlocks
 ======================	======================== =========	===============
 
 ->write_begin(), ->write_end() and ->readpage() may be called from
@@ -397,15 +399,19 @@ cleaned, or an error value if not. Note that in order to prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
-->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+->swap_activate() will be called to prepare the given file for swap.  It
+should perform any validation and preparation necessary to ensure that
+writes can be performed with minimal memory allocation.  It should call
+add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+the number of extents added.  If IO should be submitted through
+->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+directly to the block device ``sis->bdev``.
 
 ->swap_deactivate() will be called in the sys_swapoff()
 path after ->swap_activate() returned success.
 
+->swap_rw will be called for swap IO if ->swap_activate() set SWP_FS_OPS.
+
 file_lock_operations
 ====================
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..70d7ce335565 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -751,8 +751,9 @@ cache in your filesystem.  The following members are defined:
 					      unsigned long);
 		void (*is_dirty_writeback) (struct page *, bool *, bool *);
 		int (*error_remove_page) (struct mapping *mapping, struct page *page);
-		int (*swap_activate)(struct file *);
+		int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 		int (*swap_deactivate)(struct file *);
+		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 	};
 
 ``writepage``
@@ -959,15 +960,21 @@ cache in your filesystem.  The following members are defined:
 	unless you have them locked or reference counts increased.
 
 ``swap_activate``
-	Called when swapon is used on a file to allocate space if
-	necessary and pin the block lookup information in memory.  A
-	return value of zero indicates success, in which case this file
-	can be used to back swapspace.
+
+	Called to prepare the given file for swap.  It should perform
+	any validation and preparation necessary to ensure that writes
+	can be performed with minimal memory allocation.  It should call
+	add_swap_extent(), or the helper iomap_swapfile_activate(), and
+	return the number of extents added.  If IO should be submitted
+	through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+	be submitted directly to the block device ``sis->bdev``.
 
 ``swap_deactivate``
 	Called during swapoff on files where swap_activate was
 	successful.
 
+``swap_rw``
+	Called to read or write swap pages when swap_activate() set SWP_FS_OPS.
 
 The File Object
 ===============
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 9fee3af83a73..50bebf5f15cc 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4943,6 +4943,10 @@ static int cifs_swap_activate(struct swap_info_struct *sis,
 
 	cifs_dbg(FYI, "swap activate\n");
 
+	if (!swap_file->f_mapping->a_ops->swap_rw)
+		/* Cannot support swap */
+		return -EINVAL;
+
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
 	isize = inode->i_size;
@@ -4971,7 +4975,8 @@ static int cifs_swap_activate(struct swap_info_struct *sis,
 	 * from reading or writing the file
 	 */
 
-	return 0;
+	sis->flags |= SWP_FS_OPS;
+	return add_swap_extent(sis, 0, sis->max, 0);
 }
 
 static void cifs_swap_deactivate(struct file *file)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 24e7dccce355..0d33c95eefb6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -489,9 +489,14 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 {
 	unsigned long blocks;
 	long long isize;
+	int ret;
 	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
 	struct inode *inode = file->f_mapping->host;
 
+	if (!file->f_mapping->a_ops->swap_rw)
+		/* Cannot support swap */
+		return -EINVAL;
+
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
 	isize = inode->i_size;
@@ -501,9 +506,17 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		return -EINVAL;
 	}
 
+	ret = rpc_clnt_swap_activate(clnt);
+	if (ret)
+		return ret;
+	ret = add_swap_extent(sis, 0, sis->max, 0);
+	if (ret < 0) {
+		rpc_clnt_swap_deactivate(clnt);
+		return ret;
+	}
 	*span = sis->pages;
-
-	return rpc_clnt_swap_activate(clnt);
+	sis->flags |= SWP_FS_OPS;
+	return ret;
 }
 
 static void nfs_swap_deactivate(struct file *file)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbf812ce89a8..deaaf359cc49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -415,6 +415,7 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..10b2a92c1aa1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -427,7 +427,6 @@ extern int swap_writepage(struct page *page, struct writeback_control *wbc);
 extern void end_swap_bio_write(struct bio *bio);
 extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	bio_end_io_t end_write_func);
-extern int swap_set_page_dirty(struct page *page);
 
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
 		unsigned long nr_pages, sector_t start_block);
diff --git a/mm/page_io.c b/mm/page_io.c
index 9725c7e1eeea..cb617a4f59df 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -307,10 +307,9 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 		set_page_writeback(page);
 		unlock_page(page);
-		ret = mapping->a_ops->direct_IO(&kiocb, &from);
-		if (ret == PAGE_SIZE) {
+		ret = mapping->a_ops->swap_rw(&kiocb, &from);
+		if (ret == 0) {
 			count_vm_event(PSWPOUT);
-			ret = 0;
 		} else {
 			/*
 			 * In the case of swap-over-nfs, this can be a
@@ -378,10 +377,11 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
+		//struct file *swap_file = sis->swap_file;
+		//struct address_space *mapping = swap_file->f_mapping;
 
-		ret = mapping->a_ops->readpage(swap_file, page);
+		/* This needs to use ->swap_rw() */
+		ret = -EINVAL;
 		if (!ret)
 			count_vm_event(PSWPIN);
 		goto out;
@@ -434,17 +434,3 @@ int swap_readpage(struct page *page, bool synchronous)
 	psi_memstall_leave(&pflags);
 	return ret;
 }
-
-int swap_set_page_dirty(struct page *page)
-{
-	struct swap_info_struct *sis = page_swap_info(page);
-
-	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct address_space *mapping = sis->swap_file->f_mapping;
-
-		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-		return mapping->a_ops->set_page_dirty(page);
-	} else {
-		return __set_page_dirty_no_writeback(page);
-	}
-}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..616eb1d75b35 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -30,7 +30,7 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.set_page_dirty	= swap_set_page_dirty,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..419eacf474c5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2397,13 +2397,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 
 	if (mapping->a_ops->swap_activate) {
 		ret = mapping->a_ops->swap_activate(sis, swap_file, span);
-		if (ret >= 0)
-			sis->flags |= SWP_ACTIVATED;
-		if (!ret) {
-			sis->flags |= SWP_FS_OPS;
-			ret = add_swap_extent(sis, 0, sis->max, 0);
-			*span = sis->pages;
-		}
+		if (ret < 0)
+			return ret;
+		sis->flags |= SWP_ACTIVATED;
 		return ret;
 	}
 



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 02/18] MM: create new mm/swap.h header file.
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (2 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-17 10:03   ` kernel test robot
  2021-12-21  8:36   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space NeilBrown
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Many functions declared in include/linux/swap.h are only used within mm/

Create a new "mm/swap.h" and move some of these declarations there.
Remove the redundant 'extern' from the function declarations.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/swap.h |  121 -----------------------------------------------
 mm/madvise.c         |    1 
 mm/memory.c          |    1 
 mm/mincore.c         |    1 
 mm/page_alloc.c      |    1 
 mm/page_io.c         |    1 
 mm/shmem.c           |    1 
 mm/swap.h            |  129 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c      |    1 
 mm/swapfile.c        |    1 
 mm/util.c            |    1 
 mm/vmscan.c          |    1 
 12 files changed, 139 insertions(+), 121 deletions(-)
 create mode 100644 mm/swap.h

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 10b2a92c1aa1..6a0c25c0bb95 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -419,61 +419,18 @@ extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_SWAP
 
-#include <linux/blk_types.h> /* for bio_end_io_t */
-
-/* linux/mm/page_io.c */
-extern int swap_readpage(struct page *page, bool do_poll);
-extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio);
-extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
-	bio_end_io_t end_write_func);
-
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
 		unsigned long nr_pages, sector_t start_block);
 int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 		sector_t *);
 
-/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT	14
-#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry)			    \
-	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
-		>> SWAP_ADDRESS_SPACE_SHIFT])
 static inline unsigned long total_swapcache_pages(void)
 {
 	return global_node_page_state(NR_SWAPCACHE);
 }
 
-extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *page);
-extern void *get_shadow_from_swap_cache(swp_entry_t entry);
-extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			gfp_t gfp, void **shadowp);
-extern void __delete_from_swap_cache(struct page *page,
-			swp_entry_t entry, void *shadow);
-extern void delete_from_swap_cache(struct page *);
-extern void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end);
-extern void free_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
-extern struct page *lookup_swap_cache(swp_entry_t entry,
-				      struct vm_area_struct *vma,
-				      unsigned long addr);
-struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
-extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr,
-			bool do_poll);
-extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr,
-			bool *new_page_allocated);
-extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-				struct vm_fault *vmf);
-extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-				struct vm_fault *vmf);
-
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -527,12 +484,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 }
 
 #else /* CONFIG_SWAP */
-
-static inline int swap_readpage(struct page *page, bool do_poll)
-{
-	return 0;
-}
-
 static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 {
 	return NULL;
@@ -547,11 +498,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 {
 }
 
-static inline struct address_space *swap_address_space(swp_entry_t entry)
-{
-	return NULL;
-}
-
 #define get_nr_swap_pages()			0L
 #define total_swap_pages			0L
 #define total_swapcache_pages()			0UL
@@ -566,14 +512,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-static inline void free_swap_cache(struct page *page)
-{
-}
-
-static inline void show_swap_cache_info(void)
-{
-}
-
 /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
 #define free_swap_and_cache(e) is_pfn_swap_entry(e)
 
@@ -599,65 +537,6 @@ static inline void put_swap_page(struct page *page, swp_entry_t swp)
 {
 }
 
-static inline struct page *swap_cluster_readahead(swp_entry_t entry,
-				gfp_t gfp_mask, struct vm_fault *vmf)
-{
-	return NULL;
-}
-
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
-{
-	return NULL;
-}
-
-static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
-{
-	return 0;
-}
-
-static inline struct page *lookup_swap_cache(swp_entry_t swp,
-					     struct vm_area_struct *vma,
-					     unsigned long addr)
-{
-	return NULL;
-}
-
-static inline
-struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
-{
-	return find_get_page(mapping, index);
-}
-
-static inline int add_to_swap(struct page *page)
-{
-	return 0;
-}
-
-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
-{
-	return NULL;
-}
-
-static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
-					gfp_t gfp_mask, void **shadowp)
-{
-	return -1;
-}
-
-static inline void __delete_from_swap_cache(struct page *page,
-					swp_entry_t entry, void *shadow)
-{
-}
-
-static inline void delete_from_swap_cache(struct page *page)
-{
-}
-
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end)
-{
-}
 
 static inline int page_swapcount(struct page *page)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 8c927202bbe6..724470773582 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -33,6 +33,7 @@
 #include <asm/tlb.h>
 
 #include "internal.h"
+#include "swap.h"
 
 struct madvise_walk_private {
 	struct mmu_gather *tlb;
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..80bbfd449b40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -85,6 +85,7 @@
 
 #include "pgalloc-track.h"
 #include "internal.h"
+#include "swap.h"
 
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..f4f627325e12 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -20,6 +20,7 @@
 #include <linux/pgtable.h>
 
 #include <linux/uaccess.h>
+#include "swap.h"
 
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..c0b7a3878801 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -78,6 +78,7 @@
 #include "internal.h"
 #include "shuffle.h"
 #include "page_reporting.h"
+#include "swap.h"
 
 /* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
 typedef int __bitwise fpi_t;
diff --git a/mm/page_io.c b/mm/page_io.c
index cb617a4f59df..a9fe5de5dc32 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -25,6 +25,7 @@
 #include <linux/psi.h>
 #include <linux/uio.h>
 #include <linux/sched/task.h>
+#include "swap.h"
 
 void end_swap_bio_write(struct bio *bio)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index 18f93c2d68f1..993b6b3ca20f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -39,6 +39,7 @@
 #include <linux/frontswap.h>
 #include <linux/fs_parser.h>
 #include <linux/swapfile.h>
+#include "swap.h"
 
 static struct vfsmount *shm_mnt;
 
diff --git a/mm/swap.h b/mm/swap.h
new file mode 100644
index 000000000000..13e72a5023aa
--- /dev/null
+++ b/mm/swap.h
@@ -0,0 +1,129 @@
+
+#ifdef CONFIG_SWAP
+#include <linux/blk_types.h> /* for bio_end_io_t */
+
+/* linux/mm/page_io.c */
+int swap_readpage(struct page *page, bool do_poll);
+int swap_writepage(struct page *page, struct writeback_control *wbc);
+void end_swap_bio_write(struct bio *bio);
+int __swap_writepage(struct page *page, struct writeback_control *wbc,
+		     bio_end_io_t end_write_func);
+
+/* linux/mm/swap_state.c */
+/* One swap address space for each 64M swap space */
+#define SWAP_ADDRESS_SPACE_SHIFT	14
+#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
+extern struct address_space *swapper_spaces[];
+#define swap_address_space(entry)			    \
+	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
+		>> SWAP_ADDRESS_SPACE_SHIFT])
+
+void show_swap_cache_info(void);
+int add_to_swap(struct page *page);
+void *get_shadow_from_swap_cache(swp_entry_t entry);
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+		      gfp_t gfp, void **shadowp);
+void __delete_from_swap_cache(struct page *page,
+			      swp_entry_t entry, void *shadow);
+void delete_from_swap_cache(struct page *);
+void clear_shadow_from_swap_cache(int type, unsigned long begin,
+				  unsigned long end);
+void free_swap_cache(struct page *);
+struct page *lookup_swap_cache(swp_entry_t entry,
+			       struct vm_area_struct *vma,
+			       unsigned long addr);
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
+
+struct page *read_swap_cache_async(swp_entry_t, gfp_t,
+				   struct vm_area_struct *vma,
+				   unsigned long addr,
+				   bool do_poll);
+struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
+				     struct vm_area_struct *vma,
+				     unsigned long addr,
+				     bool *new_page_allocated);
+struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
+				    struct vm_fault *vmf);
+struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
+			      struct vm_fault *vmf);
+
+#else /* CONFIG_SWAP */
+static inline int swap_readpage(struct page *page, bool do_poll)
+{
+	return 0;
+}
+
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline void free_swap_cache(struct page *page)
+{
+}
+
+static inline void show_swap_cache_info(void)
+{
+}
+
+static inline struct page *swap_cluster_readahead(swp_entry_t entry,
+				gfp_t gfp_mask, struct vm_fault *vmf)
+{
+	return NULL;
+}
+
+static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+			struct vm_fault *vmf)
+{
+	return NULL;
+}
+
+static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
+{
+	return 0;
+}
+
+static inline struct page *lookup_swap_cache(swp_entry_t swp,
+					     struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return NULL;
+}
+
+static inline
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
+{
+	return find_get_page(mapping, index);
+}
+
+static inline int add_to_swap(struct page *page)
+{
+	return 0;
+}
+
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
+					gfp_t gfp_mask, void **shadowp)
+{
+	return -1;
+}
+
+static inline void __delete_from_swap_cache(struct page *page,
+					swp_entry_t entry, void *shadow)
+{
+}
+
+static inline void delete_from_swap_cache(struct page *page)
+{
+}
+
+static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
+				unsigned long end)
+{
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 616eb1d75b35..514b86b05488 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
+#include "swap.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 419eacf474c5..f23d9ff21cf8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -44,6 +44,7 @@
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
 #include <linux/swap_cgroup.h>
+#include "swap.h"
 
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
diff --git a/mm/util.c b/mm/util.c
index 741ba32a43ac..7e26387be090 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -27,6 +27,7 @@
 #include <linux/uaccess.h>
 
 #include "internal.h"
+#include "swap.h"
 
 /**
  * kfree_const - conditionally free memory
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..969bcdb4ca80 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
 #include <linux/balloon_compaction.h>
 
 #include "internal.h"
+#include "swap.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
  2021-12-16 23:48 ` [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
  2021-12-16 23:48 ` [PATCH 01/18] Structural cleanup for filesystem-based swap NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-20 12:16   ` Mark Hemment
  2021-12-21  8:40   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 02/18] MM: create new mm/swap.h header file NeilBrown
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

To submit an async read with ->swap_rw() we need to allocate
a structure to hold the kiocb and other details.  swap_readpage() cannot
handle transient failure, so create a mempool to provide the structures.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page_io.c  |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 mm/swap.h     |    1 +
 mm/swapfile.c |    5 +++++
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index a9fe5de5dc32..47d7e7866e33 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,6 +283,23 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
+struct swap_iocb {
+	struct kiocb		iocb;
+	struct bio_vec		bvec;
+};
+static mempool_t *sio_pool;
+
+int sio_pool_init(void)
+{
+	if (!sio_pool)
+		sio_pool = mempool_create_kmalloc_pool(
+			SWAP_CLUSTER_MAX, sizeof(struct swap_iocb));
+	if (sio_pool)
+		return 0;
+	else
+		return -ENOMEM;
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -353,6 +370,23 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+static void sio_read_complete(struct kiocb *iocb, long ret)
+{
+	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+	struct page *page = sio->bvec.bv_page;
+
+	if (ret != 0 && ret != PAGE_SIZE) {
+		SetPageError(page);
+		ClearPageUptodate(page);
+		pr_alert_ratelimited("Read-error on swap-device\n");
+	} else {
+		SetPageUptodate(page);
+		count_vm_event(PSWPIN);
+	}
+	unlock_page(page);
+	mempool_free(sio, sio_pool);
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -378,13 +412,25 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		//struct file *swap_file = sis->swap_file;
-		//struct address_space *mapping = swap_file->f_mapping;
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+		struct iov_iter from;
+		struct swap_iocb *sio;
+		loff_t pos = page_file_offset(page);
+
+		sio = mempool_alloc(sio_pool, GFP_KERNEL);
+		init_sync_kiocb(&sio->iocb, swap_file);
+		sio->iocb.ki_pos = pos;
+		sio->iocb.ki_complete = sio_read_complete;
+		sio->bvec.bv_page = page;
+		sio->bvec.bv_len = PAGE_SIZE;
+		sio->bvec.bv_offset = 0;
+
+		iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
+		ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+		if (ret != -EIOCBQUEUED)
+			sio_read_complete(&sio->iocb, ret);
 
-		/* This needs to use ->swap_rw() */
-		ret = -EINVAL;
-		if (!ret)
-			count_vm_event(PSWPIN);
 		goto out;
 	}
 
diff --git a/mm/swap.h b/mm/swap.h
index 13e72a5023aa..128a1d3e5558 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -3,6 +3,7 @@
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
 /* linux/mm/page_io.c */
+int sio_pool_init(void);
 int swap_readpage(struct page *page, bool do_poll);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void end_swap_bio_write(struct bio *bio);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f23d9ff21cf8..43539be38e68 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2401,6 +2401,11 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 		if (ret < 0)
 			return ret;
 		sis->flags |= SWP_ACTIVATED;
+		if ((sis->flags & SWP_FS_OPS) &&
+		    sio_pool_init() != 0) {
+			destroy_swap_extents(sis);
+			return -ENOMEM;
+		}
 		return ret;
 	}
 



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (3 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 02/18] MM: create new mm/swap.h header file NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-21  8:41   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 06/18] MM: submit multipage reads for " NeilBrown
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Writes to SWP_FS_OPS swapspace is currently synchronous.  To make it
async we need to allocate the kiocb struct which may block, but won't
block as long as waiting for the write to complete would block.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page_io.c |   69 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 47d7e7866e33..84859132c9c6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -300,6 +300,32 @@ int sio_pool_init(void)
 		return -ENOMEM;
 }
 
+static void sio_write_complete(struct kiocb *iocb, long ret)
+{
+	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+	struct page *page = sio->bvec.bv_page;
+
+	if (ret != 0 && ret != PAGE_SIZE) {
+		/*
+		 * In the case of swap-over-nfs, this can be a
+		 * temporary failure if the system has limited
+		 * memory for allocating transmit buffers.
+		 * Mark the page dirty and avoid
+		 * folio_rotate_reclaimable but rate-limit the
+		 * messages but do not flag PageError like
+		 * the normal direct-to-bio case as it could
+		 * be temporary.
+		 */
+		set_page_dirty(page);
+		ClearPageReclaim(page);
+		pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	} else
+		count_vm_event(PSWPOUT);
+	end_page_writeback(page);
+	mempool_free(sio, sio_pool);
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -309,42 +335,25 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct kiocb kiocb;
+		struct swap_iocb *sio;
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
-		struct bio_vec bv = {
-			.bv_page = page,
-			.bv_len  = PAGE_SIZE,
-			.bv_offset = 0
-		};
 		struct iov_iter from;
 
-		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
-		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos = page_file_offset(page);
-
 		set_page_writeback(page);
 		unlock_page(page);
-		ret = mapping->a_ops->swap_rw(&kiocb, &from);
-		if (ret == 0) {
-			count_vm_event(PSWPOUT);
-		} else {
-			/*
-			 * In the case of swap-over-nfs, this can be a
-			 * temporary failure if the system has limited
-			 * memory for allocating transmit buffers.
-			 * Mark the page dirty and avoid
-			 * folio_rotate_reclaimable but rate-limit the
-			 * messages but do not flag PageError like
-			 * the normal direct-to-bio case as it could
-			 * be temporary.
-			 */
-			set_page_dirty(page);
-			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
-					   page_file_offset(page));
-		}
-		end_page_writeback(page);
+		sio = mempool_alloc(sio_pool, GFP_NOIO);
+		init_sync_kiocb(&sio->iocb, swap_file);
+		sio->iocb.ki_complete = sio_write_complete;
+		sio->iocb.ki_pos = page_file_offset(page);
+		sio->bvec.bv_page = page;
+		sio->bvec.bv_len = PAGE_SIZE;
+		sio->bvec.bv_offset = 0;
+		iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+		ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+		if (ret != -EIOCBQUEUED)
+			sio_write_complete(&sio->iocb, ret);
+
 		return ret;
 	}
 



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-17  8:51   ` kernel test robot
  2021-12-21  8:43   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 01/18] Structural cleanup for filesystem-based swap NeilBrown
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only down-grade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS are not being used.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/vmscan.c |   21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 969bcdb4ca80..5f460d174b1b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1465,6 +1465,21 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	return nr_succeeded;
 }
 
+static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_FS)
+		return true;
+	if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
+		return false;
+	/* We can "enter_fs" for swap-cache with only __GFP_IO
+	 * providing this isn't SWP_FS_OPS.
+	 * ->flags can be updated non-atomicially (scan_swap_map_slots),
+	 * but that will never affect SWP_FS_OPS, so the data_race
+	 * is safe.
+	 */
+	return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1514,8 +1529,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
-		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
-			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+		may_enter_fs = test_may_enter_fs(page, sc->gfp_mask);
 
 		/*
 		 * The number of dirty pages determines if a node is marked
@@ -1683,7 +1697,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 						goto activate_locked_split;
 				}
 
-				may_enter_fs = true;
+				may_enter_fs = test_may_enter_fs(page,
+								 sc->gfp_mask);
 
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 06/18] MM: submit multipage reads for SWP_FS_OPS swap-space
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (4 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-17  7:09   ` kernel test robot
  2021-12-21  8:44   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 17/18] NFSv4: keep state manager thread active if swap is enabled NeilBrown
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

swap_readpage() is given one page at a time, but maybe called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
are single page reads.

With this patch we pass in a pointer-to-pointer when swap_readpage can
store state between calls - much like the effect of blk_plug.  After
calling swap_readpage() some number of times, the state will be passed
to swap_read_unplug() which can submit the combined request.

Some caller currently call blk_finish_plug() *before* the final call to
swap_readpage(), so the last page cannot be included.  This patch moves
blk_finish_plug() to after the last call, and calls swap_read_unplug()
there too.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/madvise.c    |    8 +++--
 mm/memory.c     |    2 +
 mm/page_io.c    |   95 ++++++++++++++++++++++++++++++++++++-------------------
 mm/swap.h       |   13 ++++++--
 mm/swap_state.c |   31 ++++++++++++------
 5 files changed, 100 insertions(+), 49 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 724470773582..a90870c7a2df 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -191,6 +191,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	pte_t *orig_pte;
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
+	struct swap_iocb *splug = NULL;
 
 	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 		return 0;
@@ -212,10 +213,11 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 			continue;
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-							vma, index, false);
+					     vma, index, false, &splug);
 		if (page)
 			put_page(page);
 	}
+	swap_read_unplug(splug);
 
 	return 0;
 }
@@ -231,6 +233,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 	XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
 	pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
 	struct page *page;
+	struct swap_iocb *splug = NULL;
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_index) {
@@ -243,13 +246,14 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 
 		swap = radix_to_swp_entry(page);
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
-							NULL, 0, false);
+					     NULL, 0, false, &splug);
 		if (page)
 			put_page(page);
 
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
+	swap_read_unplug(splug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 80bbfd449b40..0ca00f2a6890 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3538,7 +3538,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 				/* To provide entry to swap_readpage() */
 				set_page_private(page, entry.val);
-				swap_readpage(page, true);
+				swap_readpage(page, true, NULL);
 				set_page_private(page, 0);
 			}
 		} else {
diff --git a/mm/page_io.c b/mm/page_io.c
index 84859132c9c6..03fbf9463081 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -285,7 +285,8 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 
 struct swap_iocb {
 	struct kiocb		iocb;
-	struct bio_vec		bvec;
+	struct bio_vec		bvec[SWAP_CLUSTER_MAX];
+	int			pages;
 };
 static mempool_t *sio_pool;
 
@@ -303,7 +304,7 @@ int sio_pool_init(void)
 static void sio_write_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
-	struct page *page = sio->bvec.bv_page;
+	struct page *page = sio->bvec[0].bv_page;
 
 	if (ret != 0 && ret != PAGE_SIZE) {
 		/*
@@ -346,10 +347,10 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		init_sync_kiocb(&sio->iocb, swap_file);
 		sio->iocb.ki_complete = sio_write_complete;
 		sio->iocb.ki_pos = page_file_offset(page);
-		sio->bvec.bv_page = page;
-		sio->bvec.bv_len = PAGE_SIZE;
-		sio->bvec.bv_offset = 0;
-		iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+		sio->bvec[0].bv_page = page;
+		sio->bvec[0].bv_len = PAGE_SIZE;
+		sio->bvec[0].bv_offset = 0;
+		iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
 		ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
 		if (ret != -EIOCBQUEUED)
 			sio_write_complete(&sio->iocb, ret);
@@ -382,21 +383,25 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 static void sio_read_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
-	struct page *page = sio->bvec.bv_page;
-
-	if (ret != 0 && ret != PAGE_SIZE) {
-		SetPageError(page);
-		ClearPageUptodate(page);
-		pr_alert_ratelimited("Read-error on swap-device\n");
-	} else {
-		SetPageUptodate(page);
-		count_vm_event(PSWPIN);
+	int p;
+
+	for (p = 0; p < sio->pages; p++) {
+		struct page *page = sio->bvec[p].bv_page;
+		if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+			SetPageError(page);
+			ClearPageUptodate(page);
+			pr_alert_ratelimited("Read-error on swap-device\n");
+		} else {
+			SetPageUptodate(page);
+			count_vm_event(PSWPIN);
+		}
+		unlock_page(page);
 	}
-	unlock_page(page);
 	mempool_free(sio, sio_pool);
 }
 
-int swap_readpage(struct page *page, bool synchronous)
+int swap_readpage(struct page *page, bool synchronous,
+		  struct swap_iocb **plug)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -421,24 +426,35 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-		struct iov_iter from;
-		struct swap_iocb *sio;
+		struct swap_iocb *sio = NULL;
 		loff_t pos = page_file_offset(page);
 
-		sio = mempool_alloc(sio_pool, GFP_KERNEL);
-		init_sync_kiocb(&sio->iocb, swap_file);
-		sio->iocb.ki_pos = pos;
-		sio->iocb.ki_complete = sio_read_complete;
-		sio->bvec.bv_page = page;
-		sio->bvec.bv_len = PAGE_SIZE;
-		sio->bvec.bv_offset = 0;
-
-		iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
-		ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
-		if (ret != -EIOCBQUEUED)
-			sio_read_complete(&sio->iocb, ret);
+		if (*plug)
+			sio = *plug;
+		if (sio) {
+			if (sio->iocb.ki_filp != sis->swap_file ||
+			    sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+				swap_read_unplug(sio);
+				sio = NULL;
+			}
+		}
+		if (!sio) {
+			sio = mempool_alloc(sio_pool, GFP_KERNEL);
+			init_sync_kiocb(&sio->iocb, sis->swap_file);
+			sio->iocb.ki_pos = pos;
+			sio->iocb.ki_complete = sio_read_complete;
+			sio->pages = 0;
+		}
+		sio->bvec[sio->pages].bv_page = page;
+		sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+		sio->bvec[sio->pages].bv_offset = 0;
+		sio->pages += 1;
+		if (sio->pages == ARRAY_SIZE(sio->bvec) || !plug) {
+			swap_read_unplug(sio);
+			sio = NULL;
+		}
+		if (plug)
+			*plug = sio;
 
 		goto out;
 	}
@@ -490,3 +506,16 @@ int swap_readpage(struct page *page, bool synchronous)
 	psi_memstall_leave(&pflags);
 	return ret;
 }
+
+void __swap_read_unplug(struct swap_iocb *sio)
+{
+	struct iov_iter from;
+	struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+	int ret;
+
+	iov_iter_bvec(&from, READ, sio->bvec, sio->pages,
+		      PAGE_SIZE * sio->pages);
+	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+	if (ret != -EIOCBQUEUED)
+		sio_read_complete(&sio->iocb, ret);
+}
diff --git a/mm/swap.h b/mm/swap.h
index 128a1d3e5558..ce967abc5f46 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -4,7 +4,15 @@
 
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
-int swap_readpage(struct page *page, bool do_poll);
+struct swap_iocb;
+int swap_readpage(struct page *page, bool do_poll,
+		  struct swap_iocb **plug);
+void __swap_read_unplug(struct swap_iocb *plug);
+static inline void swap_read_unplug(struct swap_iocb *plug)
+{
+	if (unlikely(plug))
+		__swap_read_unplug(plug);
+}
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void end_swap_bio_write(struct bio *bio);
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -38,7 +46,8 @@ struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
 struct page *read_swap_cache_async(swp_entry_t, gfp_t,
 				   struct vm_area_struct *vma,
 				   unsigned long addr,
-				   bool do_poll);
+				   bool do_poll,
+				   struct swap_iocb **plug);
 struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
 				     struct vm_area_struct *vma,
 				     unsigned long addr,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 514b86b05488..5cb2c75fa247 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -520,14 +520,16 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  * the swap entry is no longer in use.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-		struct vm_area_struct *vma, unsigned long addr, bool do_poll)
+				   struct vm_area_struct *vma,
+				   unsigned long addr, bool do_poll,
+				   struct swap_iocb **plug)
 {
 	bool page_was_allocated;
 	struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
 			vma, addr, &page_was_allocated);
 
 	if (page_was_allocated)
-		swap_readpage(retpage, do_poll);
+		swap_readpage(retpage, do_poll, plug);
 
 	return retpage;
 }
@@ -621,10 +623,12 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long mask;
 	struct swap_info_struct *si = swp_swap_info(entry);
 	struct blk_plug plug;
+	struct swap_iocb *splug = NULL;
 	bool do_poll = true, page_allocated;
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
 
+	blk_start_plug(&plug);
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
 		goto skip;
@@ -638,7 +642,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (end_offset >= si->max)
 		end_offset = si->max - 1;
 
-	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = __read_swap_cache_async(
@@ -647,7 +650,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
+			swap_readpage(page, false, &splug);
 			if (offset != entry_offset) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
@@ -655,11 +658,14 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		}
 		put_page(page);
 	}
-	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
-	return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+	page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll,
+				     &splug);
+	blk_finish_plug(&plug);
+	swap_read_unplug(splug);
+	return page;
 }
 
 int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -790,6 +796,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 				       struct vm_fault *vmf)
 {
 	struct blk_plug plug;
+	struct swap_iocb *splug = NULL;
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
 	pte_t *pte, pentry;
@@ -800,11 +807,11 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		.win = 1,
 	};
 
+	blk_start_plug(&plug);
 	swap_ra_info(vmf, &ra_info);
 	if (ra_info.win == 1)
 		goto skip;
 
-	blk_start_plug(&plug);
 	for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
 	     i++, pte++) {
 		pentry = *pte;
@@ -820,7 +827,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
+			swap_readpage(page, false, &splug);
 			if (i != ra_info.offset) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
@@ -828,11 +835,13 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		}
 		put_page(page);
 	}
-	blk_finish_plug(&plug);
 	lru_add_drain();
 skip:
-	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
-				     ra_info.win == 1);
+	page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
+				     ra_info.win == 1, &splug);
+	blk_finish_plug(&plug);
+	swap_read_unplug(splug);
+	return page;
 }
 
 /**



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (14 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-20 12:21   ` Mark Hemment
  2021-12-16 23:48 ` [PATCH 09/18] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

swap_writepage() is given one page at a time, but may be called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
are single page reads.

With this patch we pass a pointer-to-pointer via the wbc.
swap_writepage can store state between calls - much like the pointer
passed explicitly to swap_readpage.  After calling swap_writepage() some
number of times, the state will be passed to swap_write_unplug() which
can submit the combined request.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/writeback.h |    7 +++
 mm/page_io.c              |   98 ++++++++++++++++++++++++++++++---------------
 mm/swap.h                 |    1 
 mm/vmscan.c               |    9 +++-
 4 files changed, 80 insertions(+), 35 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3bfd487d1dd2..16f780b618d2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -79,6 +79,13 @@ struct writeback_control {
 
 	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
 
+	/* To enable batching of swap writes to non-block-device backends,
+	 * "plug" can be set point to a 'struct swap_iocb *'.  When all swap
+	 * writes have been submitted, if with swap_iocb is not NULL,
+	 * swap_write_unplug() should be called.
+	 */
+	struct swap_iocb **plug;
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */
diff --git a/mm/page_io.c b/mm/page_io.c
index 03fbf9463081..92a31df467a2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -304,26 +304,30 @@ int sio_pool_init(void)
 static void sio_write_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
-	struct page *page = sio->bvec[0].bv_page;
+	int p;
 
-	if (ret != 0 && ret != PAGE_SIZE) {
-		/*
-		 * In the case of swap-over-nfs, this can be a
-		 * temporary failure if the system has limited
-		 * memory for allocating transmit buffers.
-		 * Mark the page dirty and avoid
-		 * folio_rotate_reclaimable but rate-limit the
-		 * messages but do not flag PageError like
-		 * the normal direct-to-bio case as it could
-		 * be temporary.
-		 */
-		set_page_dirty(page);
-		ClearPageReclaim(page);
-		pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
-				   ret, page_file_offset(page));
-	} else
-		count_vm_event(PSWPOUT);
-	end_page_writeback(page);
+	for (p = 0; p < sio->pages; p++) {
+		struct page *page = sio->bvec[p].bv_page;
+
+		if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+			/*
+			 * In the case of swap-over-nfs, this can be a
+			 * temporary failure if the system has limited
+			 * memory for allocating transmit buffers.
+			 * Mark the page dirty and avoid
+			 * folio_rotate_reclaimable but rate-limit the
+			 * messages but do not flag PageError like
+			 * the normal direct-to-bio case as it could
+			 * be temporary.
+			 */
+			set_page_dirty(page);
+			ClearPageReclaim(page);
+			pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+					   ret, page_file_offset(page));
+		} else
+			count_vm_event(PSWPOUT);
+		end_page_writeback(page);
+	}
 	mempool_free(sio, sio_pool);
 }
 
@@ -336,24 +340,39 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct swap_iocb *sio;
+		struct swap_iocb *sio = NULL;
 		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-		struct iov_iter from;
+		loff_t pos = page_file_offset(page);
 
 		set_page_writeback(page);
 		unlock_page(page);
-		sio = mempool_alloc(sio_pool, GFP_NOIO);
-		init_sync_kiocb(&sio->iocb, swap_file);
-		sio->iocb.ki_complete = sio_write_complete;
-		sio->iocb.ki_pos = page_file_offset(page);
-		sio->bvec[0].bv_page = page;
-		sio->bvec[0].bv_len = PAGE_SIZE;
-		sio->bvec[0].bv_offset = 0;
-		iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
-		ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
-		if (ret != -EIOCBQUEUED)
-			sio_write_complete(&sio->iocb, ret);
+
+		if (wbc->plug)
+			sio = *wbc->plug;
+		if (sio) {
+			if (sio->iocb.ki_filp != swap_file ||
+			    sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+				swap_write_unplug(sio);
+				sio = NULL;
+			}
+		}
+		if (!sio) {
+			sio = mempool_alloc(sio_pool, GFP_NOIO);
+			init_sync_kiocb(&sio->iocb, swap_file);
+			sio->iocb.ki_complete = sio_write_complete;
+			sio->iocb.ki_pos = pos;
+			sio->pages = 0;
+		}
+		sio->bvec[sio->pages].bv_page = page;
+		sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+		sio->bvec[sio->pages].bv_offset = 0;
+		sio->pages += 1;
+		if (sio->pages == ARRAY_SIZE(sio->bvec) || !wbc->plug) {
+			swap_write_unplug(sio);
+			sio = NULL;
+		}
+		if (wbc->plug)
+			*wbc->plug = sio;
 
 		return ret;
 	}
@@ -380,6 +399,19 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+void swap_write_unplug(struct swap_iocb *sio)
+{
+	struct iov_iter from;
+	struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+	int ret;
+
+	iov_iter_bvec(&from, WRITE, sio->bvec, sio->pages,
+		      PAGE_SIZE * sio->pages);
+	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+	if (ret != -EIOCBQUEUED)
+		sio_write_complete(&sio->iocb, ret);
+}
+
 static void sio_read_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
diff --git a/mm/swap.h b/mm/swap.h
index ce967abc5f46..f4d0edda6e59 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -13,6 +13,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 	if (unlikely(plug))
 		__swap_read_unplug(plug);
 }
+void swap_write_unplug(struct swap_iocb *sio);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void end_swap_bio_write(struct bio *bio);
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f460d174b1b..50a363e63102 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1123,7 +1123,8 @@ typedef enum {
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping,
+			 struct swap_iocb **plug)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -1170,6 +1171,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
 			.range_start = 0,
 			.range_end = LLONG_MAX,
 			.for_reclaim = 1,
+			.plug = plug,
 		};
 
 		SetPageReclaim(page);
@@ -1495,6 +1497,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
+	struct swap_iocb *plug = NULL;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
@@ -1780,7 +1783,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
-			switch (pageout(page, mapping)) {
+			switch (pageout(page, mapping, &plug)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -1934,6 +1937,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 
+	if (plug)
+		swap_write_unplug(plug);
 	return nr_reclaimed;
 }
 



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (13 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 16/18] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-19 13:38   ` Mark Hemment
  2021-12-21  8:46   ` Christoph Hellwig
  2021-12-16 23:48 ` [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
                   ` (3 subsequent siblings)
  18 siblings, 2 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Currently various places test if direct IO is possible on a file by
checking for the existence of the direct_IO address space operation.
This is a poor choice, as the direct_IO operation may not be used - it is
only used if the generic_file_*_iter functions are called for direct IO
and some filesystems - particularly NFS - don't do this.

Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
places to check this (avoiding a pointer dereference).
unlock_new_inode() will set this flag if ->direct_IO is present, so
filesystems do not need to be changed.

NFS *is* changed, to set the flag explicitly and discard the direct_IO
entry in the address_space_operations for files.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/block/loop.c    |    4 ++--
 fs/fcntl.c              |    5 +++--
 fs/inode.c              |    3 +++
 fs/nfs/file.c           |    1 -
 fs/nfs/inode.c          |    1 +
 fs/open.c               |    2 +-
 fs/overlayfs/file.c     |   10 ++++------
 include/linux/fs.h      |    2 +-
 include/linux/pagemap.h |    3 ++-
 9 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3a36cfaa855..ab4dee6c0fc3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -184,8 +184,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	if (dio) {
 		if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
-				!(lo->lo_offset & dio_align) &&
-				mapping->a_ops->direct_IO)
+		    !(lo->lo_offset & dio_align) &&
+		    test_bit(AS_CAN_DIO, &mapping->flags))
 			use_dio = true;
 		else
 			use_dio = false;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9c6c6a3e2de5..fcbf2dc44273 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -26,6 +26,7 @@
 #include <linux/memfd.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
+#include <linux/pagemap.h>
 
 #include <linux/poll.h>
 #include <asm/siginfo.h>
@@ -57,9 +58,9 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 
 	/* Pipe packetized mode is controlled by O_DIRECT flag */
 	if (!S_ISFIFO(inode->i_mode) && (arg & O_DIRECT)) {
-		if (!filp->f_mapping || !filp->f_mapping->a_ops ||
-			!filp->f_mapping->a_ops->direct_IO)
-				return -EINVAL;
+		if (!filp->f_mapping ||
+		    !test_bit(AS_CAN_DIO, &filp->f_mapping->flags))
+			return -EINVAL;
 	}
 
 	if (filp->f_op->check_flags)
diff --git a/fs/inode.c b/fs/inode.c
index 6b80a51129d5..bae65ccecdb1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1008,6 +1008,9 @@ EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
 void unlock_new_inode(struct inode *inode)
 {
 	lockdep_annotate_inode_mutex_key(inode);
+	if (inode->i_mapping->a_ops &&
+	    inode->i_mapping->a_ops->direct_IO)
+		set_bit(AS_CAN_DIO, &inode->i_mapping->flags);
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW & ~I_CREATING;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0d33c95eefb6..60842b774b56 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -536,7 +536,6 @@ const struct address_space_operations nfs_file_aops = {
 	.write_end = nfs_write_end,
 	.invalidatepage = nfs_invalidate_page,
 	.releasepage = nfs_release_page,
-	.direct_IO = nfs_direct_IO,
 #ifdef CONFIG_MIGRATION
 	.migratepage = nfs_migrate_page,
 #endif
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index fda530d5e764..e9d1097170b1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -496,6 +496,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 		if (S_ISREG(inode->i_mode)) {
 			inode->i_fop = NFS_SB(sb)->nfs_client->rpc_ops->file_ops;
 			inode->i_data.a_ops = &nfs_file_aops;
+			set_bit(AS_CAN_DIO, &inode->i_data.flags);
 			nfs_inode_init_regular(nfsi);
 		} else if (S_ISDIR(inode->i_mode)) {
 			inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
diff --git a/fs/open.c b/fs/open.c
index f732fb94600c..ff58874acd10 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -840,7 +840,7 @@ static int do_dentry_open(struct file *f,
 
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
+		if (!test_bit(AS_CAN_DIO, &f->f_mapping->flags))
 			return -EINVAL;
 	}
 
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index fa125feed0ff..21754edf5b62 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -13,6 +13,7 @@
 #include <linux/security.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/pagemap.h>
 #include "overlayfs.h"
 
 struct ovl_aio_req {
@@ -83,8 +84,7 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 		return -EPERM;
 
 	if (flags & O_DIRECT) {
-		if (!file->f_mapping->a_ops ||
-		    !file->f_mapping->a_ops->direct_IO)
+		if (!test_bit(AS_CAN_DIO, &file->f_mapping->flags))
 			return -EINVAL;
 	}
 
@@ -306,8 +306,7 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	ret = -EINVAL;
 	if (iocb->ki_flags & IOCB_DIRECT &&
-	    (!real.file->f_mapping->a_ops ||
-	     !real.file->f_mapping->a_ops->direct_IO))
+	    !test_bit(AS_CAN_DIO, &real.file->f_mapping->flags))
 		goto out_fdput;
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
@@ -367,8 +366,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	ret = -EINVAL;
 	if (iocb->ki_flags & IOCB_DIRECT &&
-	    (!real.file->f_mapping->a_ops ||
-	     !real.file->f_mapping->a_ops->direct_IO))
+	    !test_bit(AS_CAN_DIO, &real.file->f_mapping->flags))
 		goto out_fdput;
 
 	if (!ovl_should_sync(OVL_FS(inode->i_sb)))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index deaaf359cc49..1e954756b093 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -448,7 +448,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
  * @nrpages: Number of page entries, protected by the i_pages lock.
  * @writeback_index: Writeback starts here.
  * @a_ops: Methods.
- * @flags: Error bits and flags (AS_*).
+ * @flags: Error bits and flags (AS_*). (enum mapping_flags)
  * @wb_err: The most recent error which has occurred.
  * @private_lock: For use by the owner of the address_space.
  * @private_list: For use by the owner of the address_space.
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 605246452305..ceb599b6ba8b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -81,10 +81,11 @@ enum mapping_flags {
 	AS_ENOSPC	= 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= 3,	/* e.g., ramdisk, SHM_LOCK */
-	AS_EXITING	= 4, 	/* final truncate in progress */
+	AS_EXITING	= 4,	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
 	AS_LARGE_FOLIO_SUPPORT = 6,
+	AS_CAN_DIO	= 7,	/* DIO is supported */
 };
 
 /**



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 09/18] NFS: rename nfs_direct_IO and use as ->swap_rw
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (15 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 18/18] NFS: swap-out must always use STABLE writes NeilBrown
  2021-12-17 21:29 ` [PATCH 00/18 V2] Repair SWAP-over-NFS Anna Schumaker
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
while.  We now need a ->swap_rw function which behaves slightly
differently, returning zero for success rather than a byte count.

So modify nfs_direct_IO accordingly, rename it, and use it as the
->swap_rw function.

Note: it still won't work - that will be fixed in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c        |   23 ++++++++++-------------
 fs/nfs/file.c          |    5 +----
 include/linux/nfs_fs.h |    2 +-
 3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 9cff8709c80a..f1e169f3050a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -152,28 +152,25 @@ nfs_direct_count_bytes(struct nfs_direct_req *dreq,
 }
 
 /**
- * nfs_direct_IO - NFS address space operation for direct I/O
+ * nfs_swap_rw - NFS address space operation for swap I/O
  * @iocb: target I/O control block
  * @iter: I/O buffer
  *
- * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, for most direct IO, we
- * shunt off direct read and write requests before the VFS gets them,
- * so this method is only ever called for swap.
+ * Perform IO to the swap-file.  This is much like direct IO.
  */
-ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct inode *inode = iocb->ki_filp->f_mapping->host;
-
-	/* we only support swap file calling nfs_direct_IO */
-	if (!IS_SWAPFILE(inode))
-		return 0;
+	ssize_t ret;
 
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
 	if (iov_iter_rw(iter) == READ)
-		return nfs_file_direct_read(iocb, iter);
-	return nfs_file_direct_write(iocb, iter);
+		ret = nfs_file_direct_read(iocb, iter);
+	else
+		ret = nfs_file_direct_write(iocb, iter);
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 
 static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 60842b774b56..b620fe697158 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -493,10 +493,6 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
 	struct inode *inode = file->f_mapping->host;
 
-	if (!file->f_mapping->a_ops->swap_rw)
-		/* Cannot support swap */
-		return -EINVAL;
-
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
 	isize = inode->i_size;
@@ -544,6 +540,7 @@ const struct address_space_operations nfs_file_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = nfs_swap_activate,
 	.swap_deactivate = nfs_swap_deactivate,
+	.swap_rw = nfs_swap_rw,
 };
 
 /*
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 05f249f20f55..6329e6958718 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -510,7 +510,7 @@ static inline const struct cred *nfs_file_cred(struct file *file)
 /*
  * linux/fs/nfs/direct.c
  */
-extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
+extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
 			struct iov_iter *iter);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (6 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 17/18] NFSv4: keep state manager thread active if swap is enabled NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-20 15:02   ` Mark Hemment
  2021-12-16 23:48 ` [PATCH 12/18] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
   possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
   eventuate if a buffered read on the swapfile was attempted.

   We don't need coherence with the page cache for a swap file, and
   buffered writes are forbidden anyway.  There is no other need for
   i_rwsem during direct IO.  So never take it for swap_rw()

2/ generic_write_checks() explicitly forbids writes to swap, and
   performs checks that are not needed for swap.  So bypass it
   for swap_rw().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c        |   30 +++++++++++++++++++++---------
 fs/nfs/file.c          |    4 ++--
 include/linux/nfs_fs.h |    4 ++--
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index f1e169f3050a..eeff1b4e1a7c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -165,9 +165,9 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
 	if (iov_iter_rw(iter) == READ)
-		ret = nfs_file_direct_read(iocb, iter);
+		ret = nfs_file_direct_read(iocb, iter, true);
 	else
-		ret = nfs_file_direct_write(iocb, iter);
+		ret = nfs_file_direct_write(iocb, iter, true);
 	if (ret < 0)
 		return ret;
 	return 0;
@@ -421,6 +421,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
  * nfs_file_direct_read - file direct read operation for NFS files
  * @iocb: target I/O control block
  * @iter: vector of user buffers into which to read data
+ * @swap: flag indicating this is swap IO, not O_DIRECT IO
  *
  * We use this function for direct reads instead of calling
  * generic_file_aio_read() in order to avoid gfar's check to see if
@@ -436,7 +437,8 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
  * client must read the updated atime from the server back into its
  * cache.
  */
-ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
+			     bool swap)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
@@ -478,12 +480,14 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	if (iter_is_iovec(iter))
 		dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;
 
-	nfs_start_io_direct(inode);
+	if (!swap)
+		nfs_start_io_direct(inode);
 
 	NFS_I(inode)->read_io += count;
 	requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
 
-	nfs_end_io_direct(inode);
+	if (!swap)
+		nfs_end_io_direct(inode);
 
 	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
@@ -872,6 +876,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  * nfs_file_direct_write - file direct write operation for NFS files
  * @iocb: target I/O control block
  * @iter: vector of user buffers from which to write data
+ * @swap: flag indicating this is swap IO, not O_DIRECT IO
  *
  * We use this function for direct writes instead of calling
  * generic_file_aio_write() in order to avoid taking the inode
@@ -888,7 +893,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  * Note that O_APPEND is not supported for NFS direct writes, as there
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
-ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+			      bool swap)
 {
 	ssize_t result, requested;
 	size_t count;
@@ -902,7 +908,11 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
-	result = generic_write_checks(iocb, iter);
+	if (!swap)
+		result = generic_write_checks(iocb, iter);
+	else
+		/* bypass generic checks */
+		result =  iov_iter_count(iter);
 	if (result <= 0)
 		return result;
 	count = result;
@@ -933,7 +943,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 		dreq->iocb = iocb;
 	pnfs_init_ds_commit_info_ops(&dreq->ds_cinfo, inode);
 
-	nfs_start_io_direct(inode);
+	if (!swap)
+		nfs_start_io_direct(inode);
 
 	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
 
@@ -942,7 +953,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 					      pos >> PAGE_SHIFT, end);
 	}
 
-	nfs_end_io_direct(inode);
+	if (!swap)
+		nfs_end_io_direct(inode);
 
 	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index b620fe697158..996dfb3c74b2 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -161,7 +161,7 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t result;
 
 	if (iocb->ki_flags & IOCB_DIRECT)
-		return nfs_file_direct_read(iocb, to);
+		return nfs_file_direct_read(iocb, to, false);
 
 	dprintk("NFS: read(%pD2, %zu@%lu)\n",
 		iocb->ki_filp,
@@ -625,7 +625,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 		return result;
 
 	if (iocb->ki_flags & IOCB_DIRECT)
-		return nfs_file_direct_write(iocb, from);
+		return nfs_file_direct_write(iocb, from, false);
 
 	dprintk("NFS: write(%pD2, %zu@%Ld)\n",
 		file, iov_iter_count(from), (long long) iocb->ki_pos);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 6329e6958718..3a210478f665 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -512,9 +512,9 @@ static inline const struct cred *nfs_file_cred(struct file *file)
  */
 extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
-			struct iov_iter *iter);
+				    struct iov_iter *iter, bool swap);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
-			struct iov_iter *iter);
+				     struct iov_iter *iter, bool swap);
 
 /*
  * linux/fs/nfs/dir.c



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 11/18] SUNRPC/call_alloc: async tasks mustn't block waiting for memory
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (11 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 13/18] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 16/18] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sched.c              |    4 +++-
 net/sunrpc/xprtrdma/transport.c |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index e2c835482791..d5b6e897f5a5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1023,8 +1023,10 @@ int rpc_malloc(struct rpc_task *task)
 	struct rpc_buffer *buf;
 	gfp_t gfp = GFP_NOFS;
 
+	if (RPC_IS_ASYNC(task))
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
 	if (RPC_IS_SWAPPER(task))
-		gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..a52277115500 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -574,8 +574,10 @@ xprt_rdma_allocate(struct rpc_task *task)
 	gfp_t flags;
 
 	flags = RPCRDMA_DEF_GFP;
+	if (RPC_IS_ASYNC(task))
+		flags = GFP_NOWAIT | __GFP_NOWARN;
 	if (RPC_IS_SWAPPER(task))
-		flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+		flags |= __GFP_MEMALLOC;
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 12/18] SUNRPC/auth: async tasks mustn't block waiting for memory
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (7 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 15/18] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks.  So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory.  If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again.  HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/auth.h    |    1 +
 net/sunrpc/auth.c              |    6 +++++-
 net/sunrpc/auth_gss/auth_gss.c |    6 +++++-
 net/sunrpc/auth_unix.c         |   10 ++++++++--
 net/sunrpc/clnt.c              |    3 +++
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 98da816b5fc2..3e6ce288a7fc 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -99,6 +99,7 @@ struct rpc_auth_create_args {
 
 /* Flags for rpcauth_lookupcred() */
 #define RPCAUTH_LOOKUP_NEW		0x01	/* Accept an uninitialised cred */
+#define RPCAUTH_LOOKUP_ASYNC		0x02	/* Don't block waiting for memory */
 
 /*
  * Client authentication ops
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..6bfa19f9fa6a 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -615,6 +615,8 @@ rpcauth_bind_root_cred(struct rpc_task *task, int lookupflags)
 	};
 	struct rpc_cred *ret;
 
+	if (RPC_IS_ASYNC(task))
+		lookupflags |= RPCAUTH_LOOKUP_ASYNC;
 	ret = auth->au_ops->lookup_cred(auth, &acred, lookupflags);
 	put_cred(acred.cred);
 	return ret;
@@ -631,6 +633,8 @@ rpcauth_bind_machine_cred(struct rpc_task *task, int lookupflags)
 
 	if (!acred.principal)
 		return NULL;
+	if (RPC_IS_ASYNC(task))
+		lookupflags |= RPCAUTH_LOOKUP_ASYNC;
 	return auth->au_ops->lookup_cred(auth, &acred, lookupflags);
 }
 
@@ -654,7 +658,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
 	};
 
 	if (flags & RPC_TASK_ASYNC)
-		lookupflags |= RPCAUTH_LOOKUP_NEW;
+		lookupflags |= RPCAUTH_LOOKUP_NEW | RPCAUTH_LOOKUP_ASYNC;
 	if (task->tk_op_cred)
 		/* Task must use exactly this rpc_cred */
 		new = get_rpccred(task->tk_op_cred);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 5f42aa5fc612..df72d6301f78 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1341,7 +1341,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
 static struct rpc_cred *
 gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
 {
-	return rpcauth_lookup_credcache(auth, acred, flags, GFP_NOFS);
+	gfp_t gfp = GFP_NOFS;
+
+	if (flags & RPCAUTH_LOOKUP_ASYNC)
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
+	return rpcauth_lookup_credcache(auth, acred, flags, gfp);
 }
 
 static struct rpc_cred *
diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
index e7df1f782b2e..e5819265dd1b 100644
--- a/net/sunrpc/auth_unix.c
+++ b/net/sunrpc/auth_unix.c
@@ -43,8 +43,14 @@ unx_destroy(struct rpc_auth *auth)
 static struct rpc_cred *
 unx_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
 {
-	struct rpc_cred *ret = mempool_alloc(unix_pool, GFP_NOFS);
-
+	gfp_t gfp = GFP_NOFS;
+	struct rpc_cred *ret;
+
+	if (flags & RPCAUTH_LOOKUP_ASYNC)
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
+	ret = mempool_alloc(unix_pool, gfp);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
 	rpcauth_init_cred(ret, acred, auth, &unix_credops);
 	ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
 	return ret;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a312ea2bc440..238b2ef5491f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1745,6 +1745,9 @@ call_refreshresult(struct rpc_task *task)
 		task->tk_cred_retry--;
 		trace_rpc_retry_refresh_status(task);
 		return;
+	case -ENOMEM:
+		rpc_delay(task, HZ >> 4);
+		return;
 	}
 	trace_rpc_refresh_status(task);
 	rpc_call_rpcerror(task, status);



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 13/18] SUNRPC/xprt: async tasks mustn't block waiting for memory
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (10 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 14/18] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 11/18] SUNRPC/call_alloc: " NeilBrown
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
workqueue threads and NFS can deadlock.  So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block.  However it sets the error
to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
causes immediate retry.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/xprt.c               |    5 ++++-
 net/sunrpc/xprtrdma/transport.c |    2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a02de2bddb28..47d207e416ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1687,12 +1687,15 @@ static bool xprt_throttle_congested(struct rpc_xprt *xprt, struct rpc_task *task
 static struct rpc_rqst *xprt_dynamic_alloc_slot(struct rpc_xprt *xprt)
 {
 	struct rpc_rqst *req = ERR_PTR(-EAGAIN);
+	gfp_t gfp_mask = GFP_NOFS;
 
 	if (xprt->num_reqs >= xprt->max_reqs)
 		goto out;
 	++xprt->num_reqs;
 	spin_unlock(&xprt->reserve_lock);
-	req = kzalloc(sizeof(struct rpc_rqst), GFP_NOFS);
+	if (current->flags & PF_WQ_WORKER)
+		gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
+	req = kzalloc(sizeof(struct rpc_rqst), gfp_mask);
 	spin_lock(&xprt->reserve_lock);
 	if (req != NULL)
 		goto out;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a52277115500..32df23796747 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -521,7 +521,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 	return;
 
 out_sleep:
-	task->tk_status = -EAGAIN;
+	task->tk_status = -ENOMEM;
 	xprt_add_backlog(xprt, task);
 }
 



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 14/18] SUNRPC: remove scheduling boost for "SWAPPER" tasks.
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (9 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 15/18] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 13/18] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks.  So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it.  If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sched.c |    7 -------
 net/sunrpc/xprt.c  |   11 -----------
 2 files changed, 18 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d5b6e897f5a5..256302bf6557 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -186,11 +186,6 @@ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue,
 
 /*
  * Add new request to wait queue.
- *
- * Swapper tasks always get inserted at the head of the queue.
- * This should avoid many nasty memory deadlocks and hopefully
- * improve overall performance.
- * Everyone else gets appended to the queue to ensure proper FIFO behavior.
  */
 static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
 		struct rpc_task *task,
@@ -199,8 +194,6 @@ static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
 	INIT_LIST_HEAD(&task->u.tk_wait.timer_list);
 	if (RPC_IS_PRIORITY(queue))
 		__rpc_add_wait_queue_priority(queue, task, queue_priority);
-	else if (RPC_IS_SWAPPER(task))
-		list_add(&task->u.tk_wait.list, &queue->tasks[0]);
 	else
 		list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]);
 	task->tk_waitqueue = queue;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 47d207e416ab..a0a2583fe941 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1354,17 +1354,6 @@ xprt_request_enqueue_transmit(struct rpc_task *task)
 				INIT_LIST_HEAD(&req->rq_xmit2);
 				goto out;
 			}
-		} else if (RPC_IS_SWAPPER(task)) {
-			list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
-				if (pos->rq_cong || pos->rq_bytes_sent)
-					continue;
-				if (RPC_IS_SWAPPER(pos->rq_task))
-					continue;
-				/* Note: req is added _before_ pos */
-				list_add_tail(&req->rq_xmit, &pos->rq_xmit);
-				INIT_LIST_HEAD(&req->rq_xmit2);
-				goto out;
-			}
 		} else if (!req->rq_seqno) {
 			list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
 				if (pos->rq_task->tk_owner != task->tk_owner)



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 15/18] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (8 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 12/18] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 14/18] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests.  This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/read.c                 |    4 ----
 include/linux/nfs_fs.h        |    5 -----
 include/linux/sunrpc/sched.h  |    1 -
 include/trace/events/sunrpc.h |    1 -
 net/sunrpc/auth.c             |    2 +-
 5 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index d11af2a9299c..a8f2b884306c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -194,10 +194,6 @@ static void nfs_initiate_read(struct nfs_pgio_header *hdr,
 			      const struct nfs_rpc_ops *rpc_ops,
 			      struct rpc_task_setup *task_setup_data, int how)
 {
-	struct inode *inode = hdr->inode;
-	int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
-
-	task_setup_data->flags |= swap_flags;
 	rpc_ops->read_setup(hdr, msg);
 	trace_nfs_initiate_read(hdr);
 }
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 3a210478f665..5dce9129fe64 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -45,11 +45,6 @@
  */
 #define NFS_MAX_TRANSPORTS 16
 
-/*
- * These are the default flags for swap requests
- */
-#define NFS_RPC_SWAPFLAGS		(RPC_TASK_SWAPPER|RPC_TASK_ROOTCREDS)
-
 /*
  * Size of the NFS directory verifier
  */
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index db964bb63912..56710f8056d3 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -124,7 +124,6 @@ struct rpc_task_setup {
 #define RPC_TASK_MOVEABLE	0x0004		/* nfs4.1+ rpc tasks */
 #define RPC_TASK_NULLCREDS	0x0010		/* Use AUTH_NULL credential */
 #define RPC_CALL_MAJORSEEN	0x0020		/* major timeout seen */
-#define RPC_TASK_ROOTCREDS	0x0040		/* force root creds */
 #define RPC_TASK_DYNAMIC	0x0080		/* task was kmalloc'ed */
 #define	RPC_TASK_NO_ROUND_ROBIN	0x0100		/* send requests on "main" xprt */
 #define RPC_TASK_SOFT		0x0200		/* Use soft timeouts */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 3a99358c262b..141dc34a450e 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -311,7 +311,6 @@ TRACE_EVENT(rpc_request,
 		{ RPC_TASK_MOVEABLE, "MOVEABLE" },			\
 		{ RPC_TASK_NULLCREDS, "NULLCREDS" },			\
 		{ RPC_CALL_MAJORSEEN, "MAJORSEEN" },			\
-		{ RPC_TASK_ROOTCREDS, "ROOTCREDS" },			\
 		{ RPC_TASK_DYNAMIC, "DYNAMIC" },			\
 		{ RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN" },		\
 		{ RPC_TASK_SOFT, "SOFT" },				\
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 6bfa19f9fa6a..682fcd24bf43 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -670,7 +670,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
 	/* If machine cred couldn't be bound, try a root cred */
 	if (new)
 		;
-	else if (cred == &machine_cred || (flags & RPC_TASK_ROOTCREDS))
+	else if (cred == &machine_cred)
 		new = rpcauth_bind_root_cred(task, lookupflags);
 	else if (flags & RPC_TASK_NULLCREDS)
 		new = authnull_ops.lookup_cred(NULL, NULL, 0);



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 16/18] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (12 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 11/18] SUNRPC/call_alloc: " NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag NeilBrown
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
to be used for some allocations.  This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.

Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
then it blocks all other queued tasks.  So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
 1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
 2/ __rpc_execute() sets PF_MEMALLOC while handling any task
    with RPC_TASK_SWAPPER set, or when handling any task that
    holds the XPRT_LOCKED lock on an xprt used for swap.
    This removes the need for the RPC_IS_SWAPPER() test
    in ->buf_alloc handlers.
 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
    any task to a swapper xprt.  __rpc_execute() will clear it.
 3/ PF_MEMALLOC is set for all the connect workers.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/write.c                  |    2 ++
 net/sunrpc/clnt.c               |    2 --
 net/sunrpc/sched.c              |   20 +++++++++++++++++---
 net/sunrpc/xprt.c               |    3 +++
 net/sunrpc/xprtrdma/transport.c |    6 ++++--
 net/sunrpc/xprtsock.c           |    8 ++++++++
 6 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9b7619ce17a7..0c7a304c9e74 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1408,6 +1408,8 @@ static void nfs_initiate_write(struct nfs_pgio_header *hdr,
 {
 	int priority = flush_task_priority(how);
 
+	if (IS_SWAPFILE(hdr->inode))
+		task_setup_data->flags |= RPC_TASK_SWAPPER;
 	task_setup_data->priority = priority;
 	rpc_ops->write_setup(hdr, msg, &task_setup_data->rpc_client);
 	trace_nfs_initiate_write(hdr);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 238b2ef5491f..cb76fbea3ed5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1085,8 +1085,6 @@ void rpc_task_set_client(struct rpc_task *task, struct rpc_clnt *clnt)
 		task->tk_flags |= RPC_TASK_TIMEOUT;
 	if (clnt->cl_noretranstimeo)
 		task->tk_flags |= RPC_TASK_NO_RETRANS_TIMEOUT;
-	if (atomic_read(&clnt->cl_swapper))
-		task->tk_flags |= RPC_TASK_SWAPPER;
 	/* Add to the client's list of all tasks */
 	spin_lock(&clnt->cl_lock);
 	list_add_tail(&task->tk_task, &clnt->cl_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 256302bf6557..9020cedb7c95 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -869,6 +869,15 @@ void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata)
 		ops->rpc_release(calldata);
 }
 
+static bool xprt_needs_memalloc(struct rpc_xprt *xprt, struct rpc_task *tk)
+{
+	if (!xprt)
+		return false;
+	if (!atomic_read(&xprt->swapper))
+		return false;
+	return test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == tk;
+}
+
 /*
  * This is the RPC `scheduler' (or rather, the finite state machine).
  */
@@ -877,6 +886,7 @@ static void __rpc_execute(struct rpc_task *task)
 	struct rpc_wait_queue *queue;
 	int task_is_async = RPC_IS_ASYNC(task);
 	int status = 0;
+	unsigned long pflags = current->flags;
 
 	WARN_ON_ONCE(RPC_IS_QUEUED(task));
 	if (RPC_IS_QUEUED(task))
@@ -899,6 +909,10 @@ static void __rpc_execute(struct rpc_task *task)
 		}
 		if (!do_action)
 			break;
+		if (RPC_IS_SWAPPER(task) ||
+		    xprt_needs_memalloc(task->tk_xprt, task))
+			current->flags |= PF_MEMALLOC;
+
 		trace_rpc_task_run_action(task, do_action);
 		do_action(task);
 
@@ -936,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
 		rpc_clear_running(task);
 		spin_unlock(&queue->lock);
 		if (task_is_async)
-			return;
+			goto out;
 
 		/* sync task: sleep here */
 		trace_rpc_task_sync_sleep(task, task->tk_action);
@@ -960,6 +974,8 @@ static void __rpc_execute(struct rpc_task *task)
 
 	/* Release all resources associated with the task */
 	rpc_release_task(task);
+out:
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /*
@@ -1018,8 +1034,6 @@ int rpc_malloc(struct rpc_task *task)
 
 	if (RPC_IS_ASYNC(task))
 		gfp = GFP_NOWAIT | __GFP_NOWARN;
-	if (RPC_IS_SWAPPER(task))
-		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a0a2583fe941..0614e7463d4b 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1492,6 +1492,9 @@ bool xprt_prepare_transmit(struct rpc_task *task)
 		return false;
 
 	}
+	if (atomic_read(&xprt->swapper))
+		/* This will be clear in __rpc_execute */
+		current->flags |= PF_MEMALLOC;
 	return true;
 }
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 32df23796747..256b06a92391 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -239,8 +239,11 @@ xprt_rdma_connect_worker(struct work_struct *work)
 	struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
 						   rx_connect_worker.work);
 	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+	unsigned int pflags = current->flags;
 	int rc;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	rc = rpcrdma_xprt_connect(r_xprt);
 	xprt_clear_connecting(xprt);
 	if (!rc) {
@@ -254,6 +257,7 @@ xprt_rdma_connect_worker(struct work_struct *work)
 		rpcrdma_xprt_disconnect(r_xprt);
 	xprt_unlock_connect(xprt, r_xprt);
 	xprt_wake_pending_tasks(xprt, rc);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
@@ -576,8 +580,6 @@ xprt_rdma_allocate(struct rpc_task *task)
 	flags = RPCRDMA_DEF_GFP;
 	if (RPC_IS_ASYNC(task))
 		flags = GFP_NOWAIT | __GFP_NOWARN;
-	if (RPC_IS_SWAPPER(task))
-		flags |= __GFP_MEMALLOC;
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d8ee06a9650a..9d34c71004fa 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2047,7 +2047,10 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock;
 	int status = -EIO;
+	unsigned int pflags = current->flags;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	sock = xs_create_sock(xprt, transport,
 			xs_addr(xprt)->sa_family, SOCK_DGRAM,
 			IPPROTO_UDP, false);
@@ -2067,6 +2070,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	xprt_clear_connecting(xprt);
 	xprt_unlock_connect(xprt, transport);
 	xprt_wake_pending_tasks(xprt, status);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
@@ -2226,7 +2230,10 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	struct socket *sock = transport->sock;
 	struct rpc_xprt *xprt = &transport->xprt;
 	int status;
+	unsigned int pflags = current->flags;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	if (!sock) {
 		sock = xs_create_sock(xprt, transport,
 				xs_addr(xprt)->sa_family, SOCK_STREAM,
@@ -2291,6 +2298,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	xprt_clear_connecting(xprt);
 out_unlock:
 	xprt_unlock_connect(xprt, transport);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 17/18] NFSv4: keep state manager thread active if swap is enabled
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (5 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 06/18] MM: submit multipage reads for " NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-16 23:48 ` [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c           |   15 ++++++++++++---
 fs/nfs/nfs4_fs.h        |    1 +
 fs/nfs/nfs4proc.c       |   20 ++++++++++++++++++++
 fs/nfs/nfs4state.c      |   39 +++++++++++++++++++++++++++++++++------
 include/linux/nfs_xdr.h |    2 ++
 net/sunrpc/clnt.c       |    2 ++
 6 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 996dfb3c74b2..6ad054b9bbd0 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -490,8 +490,9 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	unsigned long blocks;
 	long long isize;
 	int ret;
-	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = file_inode(file);
+	struct rpc_clnt *clnt = NFS_CLIENT(inode);
+	struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
 
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
@@ -512,14 +513,22 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	}
 	*span = sis->pages;
 	sis->flags |= SWP_FS_OPS;
+
+	if (cl->rpc_ops->enable_swap)
+		cl->rpc_ops->enable_swap(inode);
+
 	return ret;
 }
 
 static void nfs_swap_deactivate(struct file *file)
 {
-	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+	struct inode *inode = file_inode(file);
+	struct rpc_clnt *clnt = NFS_CLIENT(inode);
+	struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
 
 	rpc_clnt_swap_deactivate(clnt);
+	if (cl->rpc_ops->disable_swap)
+		cl->rpc_ops->disable_swap(file_inode(file));
 }
 
 const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ed5eaca6801e..8a9ce0f42efd 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
 	NFS4CLNT_LEASE_MOVED,
 	NFS4CLNT_DELEGATION_EXPIRED,
 	NFS4CLNT_RUN_MANAGER,
+	NFS4CLNT_MANAGER_AVAILABLE,
 	NFS4CLNT_RECALL_RUNNING,
 	NFS4CLNT_RECALL_ANY_LAYOUT_READ,
 	NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ee3bc79f6ca3..ab6382f9cbf0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10347,6 +10347,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
 	return error + error2 + error3;
 }
 
+static void nfs4_enable_swap(struct inode *inode)
+{
+	/* The state manager thread must always be running.
+	 * It will notice the client is a swapper, and stay put.
+	 */
+	struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+	nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+	/* The state manager thread will now exit once it is
+	 * woken.
+	 */
+	wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
 static const struct inode_operations nfs4_dir_inode_operations = {
 	.create		= nfs_create,
 	.lookup		= nfs_lookup,
@@ -10423,6 +10441,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
 	.free_client	= nfs4_free_client,
 	.create_server	= nfs4_create_server,
 	.clone_server	= nfs_clone_server,
+	.enable_swap	= nfs4_enable_swap,
+	.disable_swap	= nfs4_disable_swap,
 };
 
 static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index f63dfa01001c..ebe470e6aa8f 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 {
 	struct task_struct *task;
 	char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+	struct rpc_clnt *cl = clp->cl_rpcclient;
+
+	while (cl != cl->cl_parent)
+		cl = cl->cl_parent;
 
 	set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
-	if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+	if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+		wake_up_var(&clp->cl_state);
 		return;
+	}
+	set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
 	__module_get(THIS_MODULE);
 	refcount_inc(&clp->cl_count);
 
@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 		printk(KERN_ERR "%s: kthread_run: %ld\n",
 			__func__, PTR_ERR(task));
 		nfs4_clear_state_manager_bit(clp);
+		clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
 		nfs_put_client(clp);
 		module_put(THIS_MODULE);
 	}
@@ -2665,11 +2673,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
 			clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
 		}
 
-		/* Did we race with an attempt to give us more work? */
-		if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
-			return;
-		if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
-			return;
+		return;
+
 	} while (refcount_read(&clp->cl_count) > 1 && !signalled());
 	goto out_drain;
 
@@ -2689,9 +2694,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
 static int nfs4_run_state_manager(void *ptr)
 {
 	struct nfs_client *clp = ptr;
+	struct rpc_clnt *cl = clp->cl_rpcclient;
+
+	while (cl != cl->cl_parent)
+		cl = cl->cl_parent;
 
 	allow_signal(SIGKILL);
+again:
+	set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
 	nfs4_state_manager(clp);
+	if (atomic_read(&cl->cl_swapper)) {
+		wait_var_event_interruptible(&clp->cl_state,
+					     test_bit(NFS4CLNT_RUN_MANAGER,
+						      &clp->cl_state));
+		if (atomic_read(&cl->cl_swapper) &&
+		    test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+			goto again;
+		/* Either no longer a swapper, or were signalled */
+	}
+	clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+	if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+	    test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+	    !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+		goto again;
+
 	nfs_put_client(clp);
 	module_put_and_exit(0);
 	return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 967a0098f0a9..04cf3a8fb949 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1795,6 +1795,8 @@ struct nfs_rpc_ops {
 	struct nfs_server *(*create_server)(struct fs_context *);
 	struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
 					   struct nfs_fattr *, rpc_authflavor_t);
+	void	(*enable_swap)(struct inode *inode);
+	void	(*disable_swap)(struct inode *inode);
 };
 
 /*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cb76fbea3ed5..4cb403a0f334 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3066,6 +3066,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
 int
 rpc_clnt_swap_activate(struct rpc_clnt *clnt)
 {
+	while (clnt != clnt->cl_parent)
+		clnt = clnt->cl_parent;
 	if (atomic_inc_return(&clnt->cl_swapper) == 1)
 		return rpc_clnt_iterate_for_each_xprt(clnt,
 				rpc_clnt_swap_activate_callback, NULL);



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 18/18] NFS: swap-out must always use STABLE writes.
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (16 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 09/18] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
@ 2021-12-16 23:48 ` NeilBrown
  2021-12-17 21:29 ` [PATCH 00/18 V2] Repair SWAP-over-NFS Anna Schumaker
  18 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-16 23:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

The commit handling code is not safe against memory-pressure deadlocks
when writing to swap.  In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.

swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does.  However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.

For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_COND_STABLE.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eeff1b4e1a7c..1317465150a6 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -790,7 +790,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {
  */
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       struct iov_iter *iter,
-					       loff_t pos)
+					       loff_t pos, int ioflags)
 {
 	struct nfs_pageio_descriptor desc;
 	struct inode *inode = dreq->inode;
@@ -798,7 +798,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	size_t requested_bytes = 0;
 	size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize, PAGE_SIZE);
 
-	nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
+	nfs_pageio_init_write(&desc, inode, ioflags, false,
 			      &nfs_direct_write_completion_ops);
 	desc.pg_dreq = dreq;
 	get_dreq(dreq);
@@ -904,6 +904,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct nfs_direct_req *dreq;
 	struct nfs_lock_context *l_ctx;
 	loff_t pos, end;
+	int ioflags = swap ? FLUSH_COND_STABLE : FLUSH_STABLE;
 
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
@@ -946,7 +947,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 	if (!swap)
 		nfs_start_io_direct(inode);
 
-	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos, ioflags);
 
 	if (mapping->nrpages) {
 		invalidate_inode_pages2_range(mapping,



^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 06/18] MM: submit multipage reads for SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 06/18] MM: submit multipage reads for " NeilBrown
@ 2021-12-17  7:09   ` kernel test robot
  2021-12-21  8:44   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-12-17  7:09 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next rostedt-trace/for-next linus/master v5.16-rc5]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master mszeredi-vfs/overlayfs-next next-20211216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base:   git://git.samba.org/sfrench/cifs-2.6.git for-next
config: nds32-allnoconfig (https://download.01.org/0day-ci/archive/20211217/202112171515.XWCl9bpF-lkp@intel.com/config)
compiler: nds32le-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/d34716a962c31e9e0a6e40a702e581a02b7e29f7
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
        git checkout d34716a962c31e9e0a6e40a702e581a02b7e29f7
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=nds32 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:3541:33: error: too many arguments to function 'swap_readpage'
    3541 |                                 swap_readpage(page, true, NULL);
         |                                 ^~~~~~~~~~~~~
   In file included from mm/memory.c:88:
   mm/swap.h:61:19: note: declared here
      61 | static inline int swap_readpage(struct page *page, bool do_poll)
         |                   ^~~~~~~~~~~~~


vim +/swap_readpage +3541 mm/memory.c

  3462	
  3463	/*
  3464	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  3465	 * but allow concurrent faults), and pte mapped but not yet locked.
  3466	 * We return with pte unmapped and unlocked.
  3467	 *
  3468	 * We return with the mmap_lock locked or unlocked in the same cases
  3469	 * as does filemap_fault().
  3470	 */
  3471	vm_fault_t do_swap_page(struct vm_fault *vmf)
  3472	{
  3473		struct vm_area_struct *vma = vmf->vma;
  3474		struct page *page = NULL, *swapcache;
  3475		struct swap_info_struct *si = NULL;
  3476		swp_entry_t entry;
  3477		pte_t pte;
  3478		int locked;
  3479		int exclusive = 0;
  3480		vm_fault_t ret = 0;
  3481		void *shadow = NULL;
  3482	
  3483		if (!pte_unmap_same(vmf))
  3484			goto out;
  3485	
  3486		entry = pte_to_swp_entry(vmf->orig_pte);
  3487		if (unlikely(non_swap_entry(entry))) {
  3488			if (is_migration_entry(entry)) {
  3489				migration_entry_wait(vma->vm_mm, vmf->pmd,
  3490						     vmf->address);
  3491			} else if (is_device_exclusive_entry(entry)) {
  3492				vmf->page = pfn_swap_entry_to_page(entry);
  3493				ret = remove_device_exclusive_entry(vmf);
  3494			} else if (is_device_private_entry(entry)) {
  3495				vmf->page = pfn_swap_entry_to_page(entry);
  3496				ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
  3497			} else if (is_hwpoison_entry(entry)) {
  3498				ret = VM_FAULT_HWPOISON;
  3499			} else {
  3500				print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
  3501				ret = VM_FAULT_SIGBUS;
  3502			}
  3503			goto out;
  3504		}
  3505	
  3506		/* Prevent swapoff from happening to us. */
  3507		si = get_swap_device(entry);
  3508		if (unlikely(!si))
  3509			goto out;
  3510	
  3511		delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
  3512		page = lookup_swap_cache(entry, vma, vmf->address);
  3513		swapcache = page;
  3514	
  3515		if (!page) {
  3516			if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
  3517			    __swap_count(entry) == 1) {
  3518				/* skip swapcache */
  3519				page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
  3520								vmf->address);
  3521				if (page) {
  3522					__SetPageLocked(page);
  3523					__SetPageSwapBacked(page);
  3524	
  3525					if (mem_cgroup_swapin_charge_page(page,
  3526						vma->vm_mm, GFP_KERNEL, entry)) {
  3527						ret = VM_FAULT_OOM;
  3528						goto out_page;
  3529					}
  3530					mem_cgroup_swapin_uncharge_swap(entry);
  3531	
  3532					shadow = get_shadow_from_swap_cache(entry);
  3533					if (shadow)
  3534						workingset_refault(page_folio(page),
  3535									shadow);
  3536	
  3537					lru_cache_add(page);
  3538	
  3539					/* To provide entry to swap_readpage() */
  3540					set_page_private(page, entry.val);
> 3541					swap_readpage(page, true, NULL);
  3542					set_page_private(page, 0);
  3543				}
  3544			} else {
  3545				page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
  3546							vmf);
  3547				swapcache = page;
  3548			}
  3549	
  3550			if (!page) {
  3551				/*
  3552				 * Back out if somebody else faulted in this pte
  3553				 * while we released the pte lock.
  3554				 */
  3555				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  3556						vmf->address, &vmf->ptl);
  3557				if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
  3558					ret = VM_FAULT_OOM;
  3559				delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
  3560				goto unlock;
  3561			}
  3562	
  3563			/* Had to read the page from swap area: Major fault */
  3564			ret = VM_FAULT_MAJOR;
  3565			count_vm_event(PGMAJFAULT);
  3566			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
  3567		} else if (PageHWPoison(page)) {
  3568			/*
  3569			 * hwpoisoned dirty swapcache pages are kept for killing
  3570			 * owner processes (which may be unknown at hwpoison time)
  3571			 */
  3572			ret = VM_FAULT_HWPOISON;
  3573			delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
  3574			goto out_release;
  3575		}
  3576	
  3577		locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
  3578	
  3579		delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
  3580		if (!locked) {
  3581			ret |= VM_FAULT_RETRY;
  3582			goto out_release;
  3583		}
  3584	
  3585		/*
  3586		 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
  3587		 * release the swapcache from under us.  The page pin, and pte_same
  3588		 * test below, are not enough to exclude that.  Even if it is still
  3589		 * swapcache, we need to check that the page's swap has not changed.
  3590		 */
  3591		if (unlikely((!PageSwapCache(page) ||
  3592				page_private(page) != entry.val)) && swapcache)
  3593			goto out_page;
  3594	
  3595		page = ksm_might_need_to_copy(page, vma, vmf->address);
  3596		if (unlikely(!page)) {
  3597			ret = VM_FAULT_OOM;
  3598			page = swapcache;
  3599			goto out_page;
  3600		}
  3601	
  3602		cgroup_throttle_swaprate(page, GFP_KERNEL);
  3603	
  3604		/*
  3605		 * Back out if somebody else already faulted in this pte.
  3606		 */
  3607		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
  3608				&vmf->ptl);
  3609		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
  3610			goto out_nomap;
  3611	
  3612		if (unlikely(!PageUptodate(page))) {
  3613			ret = VM_FAULT_SIGBUS;
  3614			goto out_nomap;
  3615		}
  3616	
  3617		/*
  3618		 * The page isn't present yet, go ahead with the fault.
  3619		 *
  3620		 * Be careful about the sequence of operations here.
  3621		 * To get its accounting right, reuse_swap_page() must be called
  3622		 * while the page is counted on swap but not yet in mapcount i.e.
  3623		 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
  3624		 * must be called after the swap_free(), or it will never succeed.
  3625		 */
  3626	
  3627		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
  3628		dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
  3629		pte = mk_pte(page, vma->vm_page_prot);
  3630		if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
  3631			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
  3632			vmf->flags &= ~FAULT_FLAG_WRITE;
  3633			ret |= VM_FAULT_WRITE;
  3634			exclusive = RMAP_EXCLUSIVE;
  3635		}
  3636		flush_icache_page(vma, page);
  3637		if (pte_swp_soft_dirty(vmf->orig_pte))
  3638			pte = pte_mksoft_dirty(pte);
  3639		if (pte_swp_uffd_wp(vmf->orig_pte)) {
  3640			pte = pte_mkuffd_wp(pte);
  3641			pte = pte_wrprotect(pte);
  3642		}
  3643		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
  3644		arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
  3645		vmf->orig_pte = pte;
  3646	
  3647		/* ksm created a completely new copy */
  3648		if (unlikely(page != swapcache && swapcache)) {
  3649			page_add_new_anon_rmap(page, vma, vmf->address, false);
  3650			lru_cache_add_inactive_or_unevictable(page, vma);
  3651		} else {
  3652			do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
  3653		}
  3654	
  3655		swap_free(entry);
  3656		if (mem_cgroup_swap_full(page) ||
  3657		    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
  3658			try_to_free_swap(page);
  3659		unlock_page(page);
  3660		if (page != swapcache && swapcache) {
  3661			/*
  3662			 * Hold the lock to avoid the swap entry to be reused
  3663			 * until we take the PT lock for the pte_same() check
  3664			 * (to avoid false positives from pte_same). For
  3665			 * further safety release the lock after the swap_free
  3666			 * so that the swap count won't change under a
  3667			 * parallel locked swapcache.
  3668			 */
  3669			unlock_page(swapcache);
  3670			put_page(swapcache);
  3671		}
  3672	
  3673		if (vmf->flags & FAULT_FLAG_WRITE) {
  3674			ret |= do_wp_page(vmf);
  3675			if (ret & VM_FAULT_ERROR)
  3676				ret &= VM_FAULT_ERROR;
  3677			goto out;
  3678		}
  3679	
  3680		/* No need to invalidate - it was non-present before */
  3681		update_mmu_cache(vma, vmf->address, vmf->pte);
  3682	unlock:
  3683		pte_unmap_unlock(vmf->pte, vmf->ptl);
  3684	out:
  3685		if (si)
  3686			put_swap_device(si);
  3687		return ret;
  3688	out_nomap:
  3689		pte_unmap_unlock(vmf->pte, vmf->ptl);
  3690	out_page:
  3691		unlock_page(page);
  3692	out_release:
  3693		put_page(page);
  3694		if (page != swapcache && swapcache) {
  3695			unlock_page(swapcache);
  3696			put_page(swapcache);
  3697		}
  3698		if (si)
  3699			put_swap_device(si);
  3700		return ret;
  3701	}
  3702	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
@ 2021-12-17  8:51   ` kernel test robot
  2021-12-21  8:43   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-12-17  8:51 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: llvm, kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next rostedt-trace/for-next linus/master v5.16-rc5]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master mszeredi-vfs/overlayfs-next next-20211216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base:   git://git.samba.org/sfrench/cifs-2.6.git for-next
config: arm-randconfig-r006-20211216 (https://download.01.org/0day-ci/archive/20211217/202112171635.JUIRMzHQ-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9043c3d65b11b442226015acfbf8167684586cfa)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/a8e1b1ffec6ade1545df519d254eae0400b7ec37
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
        git checkout a8e1b1ffec6ade1545df519d254eae0400b7ec37
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> mm/vmscan.c:1480:20: error: implicit declaration of function 'page_swap_info' [-Werror,-Wimplicit-function-declaration]
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                             ^
   mm/vmscan.c:1480:20: note: did you mean 'swp_swap_info'?
   include/linux/swap.h:487:40: note: 'swp_swap_info' declared here
   static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
                                          ^
>> mm/vmscan.c:1480:42: error: member reference type 'int' is not a pointer
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                             ~~~~~~~~~~~~~~~~~~~~  ^
   include/linux/compiler.h:216:28: note: expanded from macro 'data_race'
           __unqual_scalar_typeof(({ expr; })) __v = ({                    \
                                     ^~~~
   include/linux/compiler_types.h:291:13: note: expanded from macro '__unqual_scalar_typeof'
                   _Generic((x),                                           \
                             ^
>> mm/vmscan.c:1480:20: error: implicit declaration of function 'page_swap_info' [-Werror,-Wimplicit-function-declaration]
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                             ^
>> mm/vmscan.c:1480:42: error: member reference type 'int' is not a pointer
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                             ~~~~~~~~~~~~~~~~~~~~  ^
   include/linux/compiler.h:216:28: note: expanded from macro 'data_race'
           __unqual_scalar_typeof(({ expr; })) __v = ({                    \
                                     ^~~~
   include/linux/compiler_types.h:298:15: note: expanded from macro '__unqual_scalar_typeof'
                            default: (x)))
                                      ^
>> mm/vmscan.c:1480:20: error: implicit declaration of function 'page_swap_info' [-Werror,-Wimplicit-function-declaration]
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                             ^
>> mm/vmscan.c:1480:42: error: member reference type 'int' is not a pointer
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                             ~~~~~~~~~~~~~~~~~~~~  ^
   include/linux/compiler.h:218:3: note: expanded from macro 'data_race'
                   expr;                                                   \
                   ^~~~
>> mm/vmscan.c:1480:9: error: invalid argument type 'void' to unary expression
           return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   7 errors generated.


vim +/page_swap_info +1480 mm/vmscan.c

  1467	
  1468	static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
  1469	{
  1470		if (gfp_mask & __GFP_FS)
  1471			return true;
  1472		if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
  1473			return false;
  1474		/* We can "enter_fs" for swap-cache with only __GFP_IO
  1475		 * providing this isn't SWP_FS_OPS.
  1476		 * ->flags can be updated non-atomicially (scan_swap_map_slots),
  1477		 * but that will never affect SWP_FS_OPS, so the data_race
  1478		 * is safe.
  1479		 */
> 1480		return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
  1481	}
  1482	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/18] MM: create new mm/swap.h header file.
  2021-12-16 23:48 ` [PATCH 02/18] MM: create new mm/swap.h header file NeilBrown
@ 2021-12-17 10:03   ` kernel test robot
  2021-12-21  8:36   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-12-17 10:03 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: llvm, kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next rostedt-trace/for-next linus/master v5.16-rc5]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master next-20211216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base:   git://git.samba.org/sfrench/cifs-2.6.git for-next
config: hexagon-randconfig-r045-20211216 (https://download.01.org/0day-ci/archive/20211217/202112171739.uSeLyZ1M-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9043c3d65b11b442226015acfbf8167684586cfa)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/3dd9e64650d0340fd6469ba4f8abc183bb2eea15
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
        git checkout 3dd9e64650d0340fd6469ba4f8abc183bb2eea15
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

>> mm/memcontrol.c:5532:9: error: implicit declaration of function 'find_get_incore_page' [-Werror,-Wimplicit-function-declaration]
           return find_get_incore_page(vma->vm_file->f_mapping,
                  ^
>> mm/memcontrol.c:5532:9: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'struct page *' [-Wint-conversion]
           return find_get_incore_page(vma->vm_file->f_mapping,
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 warning and 1 error generated.


vim +/find_get_incore_page +5532 mm/memcontrol.c

90254a65833b67 Daisuke Nishimura       2010-05-26  5521  
87946a72283be3 Daisuke Nishimura       2010-05-26  5522  static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
48384b0b76f366 Peter Xu                2021-11-05  5523  			unsigned long addr, pte_t ptent)
87946a72283be3 Daisuke Nishimura       2010-05-26  5524  {
87946a72283be3 Daisuke Nishimura       2010-05-26  5525  	if (!vma->vm_file) /* anonymous vma */
87946a72283be3 Daisuke Nishimura       2010-05-26  5526  		return NULL;
1dfab5abcdd404 Johannes Weiner         2015-02-11  5527  	if (!(mc.flags & MOVE_FILE))
87946a72283be3 Daisuke Nishimura       2010-05-26  5528  		return NULL;
87946a72283be3 Daisuke Nishimura       2010-05-26  5529  
87946a72283be3 Daisuke Nishimura       2010-05-26  5530  	/* page is moved even if it's not RSS of this task(page-faulted). */
aa3b189551ad8e Hugh Dickins            2011-08-03  5531  	/* shmem/tmpfs may report page out on swap: account for that too. */
f5df8635c5a3c9 Matthew Wilcox (Oracle  2020-10-13 @5532) 	return find_get_incore_page(vma->vm_file->f_mapping,
f5df8635c5a3c9 Matthew Wilcox (Oracle  2020-10-13  5533) 			linear_page_index(vma, addr));
87946a72283be3 Daisuke Nishimura       2010-05-26  5534  }
87946a72283be3 Daisuke Nishimura       2010-05-26  5535  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/18] Structural cleanup for filesystem-based swap
  2021-12-16 23:48 ` [PATCH 01/18] Structural cleanup for filesystem-based swap NeilBrown
@ 2021-12-17 10:33   ` kernel test robot
  2021-12-21  8:34   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-12-17 10:33 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: llvm, kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next mszeredi-vfs/overlayfs-next rostedt-trace/for-next linus/master v5.16-rc5 next-20211216]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base:   git://git.samba.org/sfrench/cifs-2.6.git for-next
config: arm-randconfig-r005-20211216 (https://download.01.org/0day-ci/archive/20211217/202112171822.DW1WPE1G-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9043c3d65b11b442226015acfbf8167684586cfa)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/6443c9d01129c1a1c19f3df4a594b01e3772e6bd
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
        git checkout 6443c9d01129c1a1c19f3df4a594b01e3772e6bd
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/nfs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/nfs/file.c:512:8: error: implicit declaration of function 'add_swap_extent' [-Werror,-Wimplicit-function-declaration]
           ret = add_swap_extent(sis, 0, sis->max, 0);
                 ^
   1 error generated.


vim +/add_swap_extent +512 fs/nfs/file.c

   486	
   487	static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
   488							sector_t *span)
   489	{
   490		unsigned long blocks;
   491		long long isize;
   492		int ret;
   493		struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
   494		struct inode *inode = file->f_mapping->host;
   495	
   496		if (!file->f_mapping->a_ops->swap_rw)
   497			/* Cannot support swap */
   498			return -EINVAL;
   499	
   500		spin_lock(&inode->i_lock);
   501		blocks = inode->i_blocks;
   502		isize = inode->i_size;
   503		spin_unlock(&inode->i_lock);
   504		if (blocks*512 < isize) {
   505			pr_warn("swap activate: swapfile has holes\n");
   506			return -EINVAL;
   507		}
   508	
   509		ret = rpc_clnt_swap_activate(clnt);
   510		if (ret)
   511			return ret;
 > 512		ret = add_swap_extent(sis, 0, sis->max, 0);
   513		if (ret < 0) {
   514			rpc_clnt_swap_deactivate(clnt);
   515			return ret;
   516		}
   517		*span = sis->pages;
   518		sis->flags |= SWP_FS_OPS;
   519		return ret;
   520	}
   521	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/18 V2] Repair SWAP-over-NFS
  2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
                   ` (17 preceding siblings ...)
  2021-12-16 23:48 ` [PATCH 18/18] NFS: swap-out must always use STABLE writes NeilBrown
@ 2021-12-17 21:29 ` Anna Schumaker
  2021-12-19 21:07   ` NeilBrown
  18 siblings, 1 reply; 39+ messages in thread
From: Anna Schumaker @ 2021-12-17 21:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Chuck Lever, Andrew Morton, Mel Gorman,
	Christoph Hellwig, David Howells, Linux NFS Mailing List,
	linux-mm, Linux Kernel Mailing List

Hi Neil,

On Thu, Dec 16, 2021 at 7:07 PM NeilBrown <neilb@suse.de> wrote:
>
> swap-over-NFS currently has a variety of problems.
>
> swap writes call generic_write_checks(), which always fails on a swap
> file, so it completely fails.
> Even without this, various deadlocks are possible - largely due to
> improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> which weren't tested against swap-out.
>
> NFS is the only filesystem that has supported fs-based swap IO, and it
> hasn't worked for several releases, so now is a convenient time to clean
> up the swap-via-filesystem interfaces - we cannot break anything !
>
> So the first few patches here clean up and improve various parts of the
> swap-via-filesystem code.  ->activate_swap() is given a cleaner
> interface, a new ->swap_rw is introduced instead of burdening
> ->direct_IO, etc.
>
> Current swap-to-filesystem code only ever submits single-page reads and
> writes.  These patches change that to allow multi-page IO when adjacent
> requests are submitted.  Writes are also changed to be async rather than
> sync.  This substantially speeds up write throughput for swap-over-NFS.
>
> Some of the NFS patches can land independently of the MM patches.  A few
> require the MM patches to land first.

Thanks for fixing swap-over-NFS! Looks like it passes all the
swap-related xfstests except for generic/357 on NFS v4.2. This test
checks that we get -EINVAL on a reflinked swapfile, but I'm not sure
if there is a way to check for that on the client side but if you have
any ideas it would be nice to get that test passing while you're at
it!

Anna

>
> Thanks,
> NeilBrown
>
>
> ---
>
> NeilBrown (18):
>       Structural cleanup for filesystem-based swap
>       MM: create new mm/swap.h header file.
>       MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
>       MM: perform async writes to SWP_FS_OPS swap-space
>       MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
>       MM: submit multipage reads for SWP_FS_OPS swap-space
>       MM: submit multipage write for SWP_FS_OPS swap-space
>       MM: Add AS_CAN_DIO mapping flag
>       NFS: rename nfs_direct_IO and use as ->swap_rw
>       NFS: swap IO handling is slightly different for O_DIRECT IO
>       SUNRPC/call_alloc: async tasks mustn't block waiting for memory
>       SUNRPC/auth: async tasks mustn't block waiting for memory
>       SUNRPC/xprt: async tasks mustn't block waiting for memory
>       SUNRPC: remove scheduling boost for "SWAPPER" tasks.
>       NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
>       SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
>       NFSv4: keep state manager thread active if swap is enabled
>       NFS: swap-out must always use STABLE writes.
>
>
>  drivers/block/loop.c            |   4 +-
>  fs/fcntl.c                      |   5 +-
>  fs/inode.c                      |   3 +
>  fs/nfs/direct.c                 |  56 ++++++----
>  fs/nfs/file.c                   |  25 +++--
>  fs/nfs/inode.c                  |   1 +
>  fs/nfs/nfs4_fs.h                |   1 +
>  fs/nfs/nfs4proc.c               |  20 ++++
>  fs/nfs/nfs4state.c              |  39 ++++++-
>  fs/nfs/read.c                   |   4 -
>  fs/nfs/write.c                  |   2 +
>  fs/open.c                       |   2 +-
>  fs/overlayfs/file.c             |  10 +-
>  include/linux/fs.h              |   2 +-
>  include/linux/nfs_fs.h          |  11 +-
>  include/linux/nfs_xdr.h         |   2 +
>  include/linux/pagemap.h         |   3 +-
>  include/linux/sunrpc/auth.h     |   1 +
>  include/linux/sunrpc/sched.h    |   1 -
>  include/linux/swap.h            | 121 --------------------
>  include/linux/writeback.h       |   7 ++
>  include/trace/events/sunrpc.h   |   1 -
>  mm/madvise.c                    |   9 +-
>  mm/memory.c                     |   3 +-
>  mm/mincore.c                    |   1 +
>  mm/page_alloc.c                 |   1 +
>  mm/page_io.c                    | 189 ++++++++++++++++++++++++++------
>  mm/shmem.c                      |   1 +
>  mm/swap.h                       | 140 +++++++++++++++++++++++
>  mm/swap_state.c                 |  32 ++++--
>  mm/swapfile.c                   |   6 +
>  mm/util.c                       |   1 +
>  mm/vmscan.c                     |  31 +++++-
>  net/sunrpc/auth.c               |   8 +-
>  net/sunrpc/auth_gss/auth_gss.c  |   6 +-
>  net/sunrpc/auth_unix.c          |  10 +-
>  net/sunrpc/clnt.c               |   7 +-
>  net/sunrpc/sched.c              |  29 +++--
>  net/sunrpc/xprt.c               |  19 ++--
>  net/sunrpc/xprtrdma/transport.c |  10 +-
>  net/sunrpc/xprtsock.c           |   8 ++
>  41 files changed, 558 insertions(+), 274 deletions(-)
>  create mode 100644 mm/swap.h
>
> --
> Signature
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag
  2021-12-16 23:48 ` [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag NeilBrown
@ 2021-12-19 13:38   ` Mark Hemment
  2021-12-19 20:59     ` NeilBrown
  2021-12-21  8:46   ` Christoph Hellwig
  1 sibling, 1 reply; 39+ messages in thread
From: Mark Hemment @ 2021-12-19 13:38 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Thu, 16 Dec 2021 at 23:57, NeilBrown <neilb@suse.de> wrote:
>
> Currently various places test if direct IO is possible on a file by
> checking for the existence of the direct_IO address space operation.
> This is a poor choice, as the direct_IO operation may not be used - it is
> only used if the generic_file_*_iter functions are called for direct IO
> and some filesystems - particularly NFS - don't do this.
>
> Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> places to check this (avoiding a pointer dereference).
> unlock_new_inode() will set this flag if ->direct_IO is present, so
> filesystems do not need to be changed.
...
> diff --git a/fs/inode.c b/fs/inode.c
> index 6b80a51129d5..bae65ccecdb1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1008,6 +1008,9 @@ EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
>  void unlock_new_inode(struct inode *inode)
>  {
>         lockdep_annotate_inode_mutex_key(inode);
> +       if (inode->i_mapping->a_ops &&
> +           inode->i_mapping->a_ops->direct_IO)
> +               set_bit(AS_CAN_DIO, &inode->i_mapping->flags);
>         spin_lock(&inode->i_lock);
>         WARN_ON(!(inode->i_state & I_NEW));
>         inode->i_state &= ~I_NEW & ~I_CREATING;

Does d_instantiate_new() also need to set AS_CAN_DIO?

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag
  2021-12-19 13:38   ` Mark Hemment
@ 2021-12-19 20:59     ` NeilBrown
  0 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2021-12-19 20:59 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, 20 Dec 2021, Mark Hemment wrote:
> On Thu, 16 Dec 2021 at 23:57, NeilBrown <neilb@suse.de> wrote:
> >
> > Currently various places test if direct IO is possible on a file by
> > checking for the existence of the direct_IO address space operation.
> > This is a poor choice, as the direct_IO operation may not be used - it is
> > only used if the generic_file_*_iter functions are called for direct IO
> > and some filesystems - particularly NFS - don't do this.
> >
> > Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> > places to check this (avoiding a pointer dereference).
> > unlock_new_inode() will set this flag if ->direct_IO is present, so
> > filesystems do not need to be changed.
> ...
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 6b80a51129d5..bae65ccecdb1 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1008,6 +1008,9 @@ EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
> >  void unlock_new_inode(struct inode *inode)
> >  {
> >         lockdep_annotate_inode_mutex_key(inode);
> > +       if (inode->i_mapping->a_ops &&
> > +           inode->i_mapping->a_ops->direct_IO)
> > +               set_bit(AS_CAN_DIO, &inode->i_mapping->flags);
> >         spin_lock(&inode->i_lock);
> >         WARN_ON(!(inode->i_state & I_NEW));
> >         inode->i_state &= ~I_NEW & ~I_CREATING;
> 
> Does d_instantiate_new() also need to set AS_CAN_DIO?

Yes it does - thanks for catching that.  I'll update my patch.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/18 V2] Repair SWAP-over-NFS
  2021-12-17 21:29 ` [PATCH 00/18 V2] Repair SWAP-over-NFS Anna Schumaker
@ 2021-12-19 21:07   ` NeilBrown
  2021-12-21  8:48     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: NeilBrown @ 2021-12-19 21:07 UTC (permalink / raw)
  To: Anna Schumaker
  Cc: Trond Myklebust, Chuck Lever, Andrew Morton, Mel Gorman,
	Christoph Hellwig, David Howells, Linux NFS Mailing List,
	linux-mm, Linux Kernel Mailing List

On Sat, 18 Dec 2021, Anna Schumaker wrote:
> Hi Neil,
> 
> On Thu, Dec 16, 2021 at 7:07 PM NeilBrown <neilb@suse.de> wrote:
> >
> > swap-over-NFS currently has a variety of problems.
> >
> > swap writes call generic_write_checks(), which always fails on a swap
> > file, so it completely fails.
> > Even without this, various deadlocks are possible - largely due to
> > improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> > which weren't tested against swap-out.
> >
> > NFS is the only filesystem that has supported fs-based swap IO, and it
> > hasn't worked for several releases, so now is a convenient time to clean
> > up the swap-via-filesystem interfaces - we cannot break anything !
> >
> > So the first few patches here clean up and improve various parts of the
> > swap-via-filesystem code.  ->activate_swap() is given a cleaner
> > interface, a new ->swap_rw is introduced instead of burdening
> > ->direct_IO, etc.
> >
> > Current swap-to-filesystem code only ever submits single-page reads and
> > writes.  These patches change that to allow multi-page IO when adjacent
> > requests are submitted.  Writes are also changed to be async rather than
> > sync.  This substantially speeds up write throughput for swap-over-NFS.
> >
> > Some of the NFS patches can land independently of the MM patches.  A few
> > require the MM patches to land first.
> 
> Thanks for fixing swap-over-NFS! Looks like it passes all the
> swap-related xfstests except for generic/357 on NFS v4.2. This test
> checks that we get -EINVAL on a reflinked swapfile, but I'm not sure
> if there is a way to check for that on the client side but if you have
> any ideas it would be nice to get that test passing while you're at
> it!

Thanks for testing!.
I think that testing that swap fails on a reflinked file is bogus.  This
isn't an important part of the API, it is just an internal
implementation detail.
I certainly understand that it could be problematic implementing swap on
a reflinked file within XFS and it is perfectly acceptable to fail such
a request.  But if one day someone decided to implement it - should that
be seen as a regression?

Certainly over NFS there is no reason at all not to swap to a file that
happens to be reflinked on the server.
I don't think it even makes sense to test if the file has holes as the
current nfs_swap_activate() does.  I don't exactly object to the test,
but I think it is misguided and pointless.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space NeilBrown
@ 2021-12-20 12:16   ` Mark Hemment
  2021-12-21  8:40   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Mark Hemment @ 2021-12-20 12:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Thu, 16 Dec 2021 at 23:54, NeilBrown <neilb@suse.de> wrote:
>
> To submit an async read with ->swap_rw() we need to allocate
> a structure to hold the kiocb and other details.  swap_readpage() cannot
> handle transient failure, so create a mempool to provide the structures.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  mm/page_io.c  |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++------
>  mm/swap.h     |    1 +
>  mm/swapfile.c |    5 +++++
>  3 files changed, 58 insertions(+), 6 deletions(-)
...
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f23d9ff21cf8..43539be38e68 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2401,6 +2401,11 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
>                 if (ret < 0)
>                         return ret;
>                 sis->flags |= SWP_ACTIVATED;
> +               if ((sis->flags & SWP_FS_OPS) &&
> +                   sio_pool_init() != 0) {
> +                       destroy_swap_extents(sis);
> +                       return -ENOMEM;
> +               }
>                 return ret;
>         }

This code is called before 'swapon_mutex' is taken in the swapon code
path, so possible for multiple swapons to race here creating two (or
more) memory pools.

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
@ 2021-12-20 12:21   ` Mark Hemment
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Hemment @ 2021-12-20 12:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Thu, 16 Dec 2021 at 23:56, NeilBrown <neilb@suse.de> wrote:
>
> swap_writepage() is given one page at a time, but may be called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
> are single page reads.
>
> With this patch we pass a pointer-to-pointer via the wbc.
> swap_writepage can store state between calls - much like the pointer
> passed explicitly to swap_readpage.  After calling swap_writepage() some
> number of times, the state will be passed to swap_write_unplug() which
> can submit the combined request.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/writeback.h |    7 +++
>  mm/page_io.c              |   98 ++++++++++++++++++++++++++++++---------------
>  mm/swap.h                 |    1
>  mm/vmscan.c               |    9 +++-
>  4 files changed, 80 insertions(+), 35 deletions(-)
...
> +void swap_write_unplug(struct swap_iocb *sio)
> +{
> +       struct iov_iter from;
> +       struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
> +       int ret;
> +
> +       iov_iter_bvec(&from, WRITE, sio->bvec, sio->pages,
> +                     PAGE_SIZE * sio->pages);
> +       ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> +       if (ret != -EIOCBQUEUED)
> +               sio_write_complete(&sio->iocb, ret);
> +}
> +

As swap_write_unplug() is called from vmscan.c, need an 'no-op'
version (in "swap.h") for when !CONFIG_SWAP

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO
  2021-12-16 23:48 ` [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
@ 2021-12-20 15:02   ` Mark Hemment
  0 siblings, 0 replies; 39+ messages in thread
From: Mark Hemment @ 2021-12-20 15:02 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Thu, 16 Dec 2021 at 23:58, NeilBrown <neilb@suse.de> wrote:
>
> 1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
>    possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
>    eventuate if a buffered read on the swapfile was attempted.
>
>    We don't need coherence with the page cache for a swap file, and
>    buffered writes are forbidden anyway.  There is no other need for
>    i_rwsem during direct IO.  So never take it for swap_rw()

Agreed.  I cannot see an issue with not taking the sem.

> 2/ generic_write_checks() explicitly forbids writes to swap, and
>    performs checks that are not needed for swap.  So bypass it
>    for swap_rw().
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
...
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> @@ -625,7 +625,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
>                 return result;
>
>         if (iocb->ki_flags & IOCB_DIRECT)
> -               return nfs_file_direct_write(iocb, from);
> +               return nfs_file_direct_write(iocb, from, false);
>
>         dprintk("NFS: write(%pD2, %zu@%Ld)\n",
>                 file, iov_iter_count(from), (long long) iocb->ki_pos);

This code at the top of nfs/file.c should be removed;
    /* Hack for future NFS swap support */
    #ifndef IS_SWAPFILE
    # define IS_SWAPFILE(inode)     (0)
    #endif
As IS_SWAPFILE() is used just below this diff (to prevent buffered
writes), better be using a non-hacked version.

Mark

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/18] Structural cleanup for filesystem-based swap
  2021-12-16 23:48 ` [PATCH 01/18] Structural cleanup for filesystem-based swap NeilBrown
  2021-12-17 10:33   ` kernel test robot
@ 2021-12-21  8:34   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Linux primarily uses IO to block devices for swap, but can send the IO
> requests to a filesystem.  This has only ever worked for NFS, and that
> hasn't worked for a while due to a lack of testing.  This seems like a
> good time for some tidy-up before restoring swap-over-NFS functionality.

The changes look good to me, but I think this needs to be split into
separate, self-contained patches.

> 
> This patch:

Patch 1:

>  - updates the documentation (both copies!) for swap_activate which
>    is woefully out-of-date

Patch 2:

>  - drops the call to the filesystem for ->set_page_dirty().  These
>    pages do not belong to the filesystem, and it has no interest
>    in the dirty status.

Patch 3:


>  - move the responsibility for setting SWP_FS_OPS to ->swap_activate()
>    and also requires it to always call add_swap_extent().  This makes
>    it much easier to find filesystems that require SWP_FS_OPS.

Patch 4:

>  - introduces a new address_space operation "swap_rw" for swap IO.
>    The code currently used ->readpage for reads and ->direct_IO for
>    writes.  The former imposes a limit of one-page-at-a-time, the
>    later means that direct writes and swap writes are encouraged to
>    use the same path.  While similar, swap can often be simpler as
>    it can assume that no allocation is needed, and coherence with the
>    page cache is irrelevant.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/18] MM: create new mm/swap.h header file.
  2021-12-16 23:48 ` [PATCH 02/18] MM: create new mm/swap.h header file NeilBrown
  2021-12-17 10:03   ` kernel test robot
@ 2021-12-21  8:36   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Many functions declared in include/linux/swap.h are only used within mm/
> 
> Create a new "mm/swap.h" and move some of these declarations there.
> Remove the redundant 'extern' from the function declarations.

Good idea!  Looks good modulo the extra includes that the buildbot asks
for:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space NeilBrown
  2021-12-20 12:16   ` Mark Hemment
@ 2021-12-21  8:40   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

> +int sio_pool_init(void)
> +{
> +	if (!sio_pool)
> +		sio_pool = mempool_create_kmalloc_pool(
> +			SWAP_CLUSTER_MAX, sizeof(struct swap_iocb));

I can't see anything serializing access here, so we'll need a lock or
cmpxchg dance.

> +	if (sio_pool)
> +		return 0;
> +	else
> +		return -ENOMEM;

Nit: This would flow much nicer as:

	if (!sio_pool)
		return -ENOMEM;
	return 0;

>  int swap_readpage(struct page *page, bool synchronous)
>  {
>  	struct bio *bio;
> @@ -378,13 +412,25 @@ int swap_readpage(struct page *page, bool synchronous)
>  	}
>  
>  	if (data_race(sis->flags & SWP_FS_OPS)) {
> -		//struct file *swap_file = sis->swap_file;
> -		//struct address_space *mapping = swap_file->f_mapping;

This should not be left by the previous patch.  In fact I suspect the
part of the previous patch that adds ->swap_rw should probably be folded
into this patch.

> +		struct file *swap_file = sis->swap_file;
> +		struct address_space *mapping = swap_file->f_mapping;
> +		struct iov_iter from;
> +		struct swap_iocb *sio;
> +		loff_t pos = page_file_offset(page);
> +
> +		sio = mempool_alloc(sio_pool, GFP_KERNEL);
> +		init_sync_kiocb(&sio->iocb, swap_file);
> +		sio->iocb.ki_pos = pos;
> +		sio->iocb.ki_complete = sio_read_complete;
> +		sio->bvec.bv_page = page;
> +		sio->bvec.bv_len = PAGE_SIZE;
> +		sio->bvec.bv_offset = 0;
> +
> +		iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
> +		ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> +		if (ret != -EIOCBQUEUED)
> +			sio_read_complete(&sio->iocb, ret);
>  
>  		goto out;

I'd be tempted to split the SWP_FS_OPS into a helper to keep the
code tidy.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space NeilBrown
@ 2021-12-21  8:41   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Writes to SWP_FS_OPS swapspace is currently synchronous.  To make it
> async we need to allocate the kiocb struct which may block, but won't
> block as long as waiting for the write to complete would block.

Against a little helper for the SWP_FS_OPS case of __swap_writepage
would be nice.  But otherwise this looks good to me.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
  2021-12-17  8:51   ` kernel test robot
@ 2021-12-21  8:43   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> +static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
> +{
> +	if (gfp_mask & __GFP_FS)
> +		return true;
> +	if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
> +		return false;
> +	/* We can "enter_fs" for swap-cache with only __GFP_IO
> +	 * providing this isn't SWP_FS_OPS.
> +	 * ->flags can be updated non-atomicially (scan_swap_map_slots),
> +	 * but that will never affect SWP_FS_OPS, so the data_race
> +	 * is safe.
> +	 */
> +	return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);

Nit: the normal kernel comment style uses an empty

	/*

line to start the comment.

> @@ -1514,8 +1529,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>  		if (!sc->may_unmap && page_mapped(page))
>  			goto keep_locked;
>  
> -		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> -			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> +		may_enter_fs = test_may_enter_fs(page, sc->gfp_mask);
>  
>  		/*
>  		 * The number of dirty pages determines if a node is marked
> @@ -1683,7 +1697,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>  						goto activate_locked_split;
>  				}
>  
> -				may_enter_fs = true;
> +				may_enter_fs = test_may_enter_fs(page,
> +								 sc->gfp_mask);

Now that may_enter_fs is always reset using test_may_enter_fs, do we
even need the local variable?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 06/18] MM: submit multipage reads for SWP_FS_OPS swap-space
  2021-12-16 23:48 ` [PATCH 06/18] MM: submit multipage reads for " NeilBrown
  2021-12-17  7:09   ` kernel test robot
@ 2021-12-21  8:44   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:44 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Some caller currently call blk_finish_plug() *before* the final call to
> swap_readpage(), so the last page cannot be included.  This patch moves
> blk_finish_plug() to after the last call, and calls swap_read_unplug()
> there too.

Can you move this fix into a separate prep patch, preferably with a
Fixes tag so that it gets picked up for backports?

Otherwise this looks sensible to me.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag
  2021-12-16 23:48 ` [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag NeilBrown
  2021-12-19 13:38   ` Mark Hemment
@ 2021-12-21  8:46   ` Christoph Hellwig
  2022-01-19  3:54     ` NeilBrown
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:46 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Fri, Dec 17, 2021 at 10:48:23AM +1100, NeilBrown wrote:
> Currently various places test if direct IO is possible on a file by
> checking for the existence of the direct_IO address space operation.
> This is a poor choice, as the direct_IO operation may not be used - it is
> only used if the generic_file_*_iter functions are called for direct IO
> and some filesystems - particularly NFS - don't do this.
> 
> Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> places to check this (avoiding a pointer dereference).
> unlock_new_inode() will set this flag if ->direct_IO is present, so
> filesystems do not need to be changed.
> 
> NFS *is* changed, to set the flag explicitly and discard the direct_IO
> entry in the address_space_operations for files.

For other can flags related to file operations we usuall stash them into
file->f_mode.  Any reason to treat this different?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/18 V2] Repair SWAP-over-NFS
  2021-12-19 21:07   ` NeilBrown
@ 2021-12-21  8:48     ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2021-12-21  8:48 UTC (permalink / raw)
  To: NeilBrown
  Cc: Anna Schumaker, Trond Myklebust, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells,
	Linux NFS Mailing List, linux-mm, Linux Kernel Mailing List,
	Darrick J. Wong, linux-xfs, fstests

On Mon, Dec 20, 2021 at 08:07:15AM +1100, NeilBrown wrote:
> > Thanks for fixing swap-over-NFS! Looks like it passes all the
> > swap-related xfstests except for generic/357 on NFS v4.2. This test
> > checks that we get -EINVAL on a reflinked swapfile, but I'm not sure
> > if there is a way to check for that on the client side but if you have
> > any ideas it would be nice to get that test passing while you're at
> > it!
> 
> Thanks for testing!.
> I think that testing that swap fails on a reflinked file is bogus.  This
> isn't an important part of the API, it is just an internal
> implementation detail.
> I certainly understand that it could be problematic implementing swap on
> a reflinked file within XFS and it is perfectly acceptable to fail such
> a request.  But if one day someone decided to implement it - should that
> be seen as a regression?

Yes, there is really no fundamental reason not to support swap to
reflinked files, especially for NFS.  We'll need some kind of opt-in/out
for this test.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag
  2021-12-21  8:46   ` Christoph Hellwig
@ 2022-01-19  3:54     ` NeilBrown
  0 siblings, 0 replies; 39+ messages in thread
From: NeilBrown @ 2022-01-19  3:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Tue, 21 Dec 2021, Christoph Hellwig wrote:
> On Fri, Dec 17, 2021 at 10:48:23AM +1100, NeilBrown wrote:
> > Currently various places test if direct IO is possible on a file by
> > checking for the existence of the direct_IO address space operation.
> > This is a poor choice, as the direct_IO operation may not be used - it is
> > only used if the generic_file_*_iter functions are called for direct IO
> > and some filesystems - particularly NFS - don't do this.
> > 
> > Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> > places to check this (avoiding a pointer dereference).
> > unlock_new_inode() will set this flag if ->direct_IO is present, so
> > filesystems do not need to be changed.
> > 
> > NFS *is* changed, to set the flag explicitly and discard the direct_IO
> > entry in the address_space_operations for files.
> 
> For other can flags related to file operations we usuall stash them into
> file->f_mode.  Any reason to treat this different?

f_mode is a much better place to put the flag!  Thanks for that
suggestion and all the other feedback.  I'll post a new series early
next week.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2022-01-19  3:55 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-16 23:48 [PATCH 00/18 V2] Repair SWAP-over-NFS NeilBrown
2021-12-16 23:48 ` [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
2021-12-17  8:51   ` kernel test robot
2021-12-21  8:43   ` Christoph Hellwig
2021-12-16 23:48 ` [PATCH 01/18] Structural cleanup for filesystem-based swap NeilBrown
2021-12-17 10:33   ` kernel test robot
2021-12-21  8:34   ` Christoph Hellwig
2021-12-16 23:48 ` [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space NeilBrown
2021-12-20 12:16   ` Mark Hemment
2021-12-21  8:40   ` Christoph Hellwig
2021-12-16 23:48 ` [PATCH 02/18] MM: create new mm/swap.h header file NeilBrown
2021-12-17 10:03   ` kernel test robot
2021-12-21  8:36   ` Christoph Hellwig
2021-12-16 23:48 ` [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space NeilBrown
2021-12-21  8:41   ` Christoph Hellwig
2021-12-16 23:48 ` [PATCH 06/18] MM: submit multipage reads for " NeilBrown
2021-12-17  7:09   ` kernel test robot
2021-12-21  8:44   ` Christoph Hellwig
2021-12-16 23:48 ` [PATCH 17/18] NFSv4: keep state manager thread active if swap is enabled NeilBrown
2021-12-16 23:48 ` [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
2021-12-20 15:02   ` Mark Hemment
2021-12-16 23:48 ` [PATCH 12/18] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
2021-12-16 23:48 ` [PATCH 15/18] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
2021-12-16 23:48 ` [PATCH 14/18] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
2021-12-16 23:48 ` [PATCH 13/18] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
2021-12-16 23:48 ` [PATCH 11/18] SUNRPC/call_alloc: " NeilBrown
2021-12-16 23:48 ` [PATCH 16/18] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
2021-12-16 23:48 ` [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag NeilBrown
2021-12-19 13:38   ` Mark Hemment
2021-12-19 20:59     ` NeilBrown
2021-12-21  8:46   ` Christoph Hellwig
2022-01-19  3:54     ` NeilBrown
2021-12-16 23:48 ` [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
2021-12-20 12:21   ` Mark Hemment
2021-12-16 23:48 ` [PATCH 09/18] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
2021-12-16 23:48 ` [PATCH 18/18] NFS: swap-out must always use STABLE writes NeilBrown
2021-12-17 21:29 ` [PATCH 00/18 V2] Repair SWAP-over-NFS Anna Schumaker
2021-12-19 21:07   ` NeilBrown
2021-12-21  8:48     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).