linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/23 V3] Repair SWAP-over_NFS
@ 2022-01-24  3:48 NeilBrown
  2022-01-24  3:48 ` [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
                   ` (23 more replies)
  0 siblings, 24 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

This version of the series addresses the review comments received,
particularly from Christof.
Thanks to all for review and testing.

The patch adding mm/swap.h got a minor conflict when I rebaesd on
5.17-rc1, suggestion that it could easily get more conflicts in the
future.  It might be good if it could land before 5.17 comes out, to
avoid (some of) those conflicts.

I think (Though haven't checked) that all the NFS patch patches
except
      NFS: rename nfs_direct_IO and use as ->swap_rw
      NFS: swap IO handling is slightly different for O_DIRECT IO
can land independently of the MM patches, and can be moved to the
end of the series.  Maybe they could be held until after 5.18-rc1 if we
agree to proceed with these in the next merge window.

Intro from previous series is below.
Thanks,
NeilBrown

swap-over-NFS currently has a variety of problems.

swap writes call generic_write_checks(), which always fails on a swap
file, so it completely fails.
Even without this, various deadlocks are possible - largely due to
improvements in NFS memory allocation (using NOFS instead of ATOMIC)
which weren't tested against swap-out.

NFS is the only filesystem that has supported fs-based swap IO, and it
hasn't worked for several releases, so now is a convenient time to clean
up the swap-via-filesystem interfaces - we cannot break anything !

So the first few patches here clean up and improve various parts of the
swap-via-filesystem code.  ->activate_swap() is given a cleaner
interface, a new ->swap_rw is introduced instead of burdening
->direct_IO, etc.

Current swap-to-filesystem code only ever submits single-page reads and
writes.  These patches change that to allow multi-page IO when adjacent
requests are submitted.  Writes are also changed to be async rather than
sync.  This substantially speeds up write throughput for swap-over-NFS.

Some of the NFS patches can land independently of the MM patches.  A few
require the MM patches to land first.

---

NeilBrown (23):
      MM: create new mm/swap.h header file.
      MM: extend block-plugging to cover all swap reads with read-ahead
      MM: drop swap_set_page_dirty
      MM: move responsibility for setting SWP_FS_OPS to ->swap_activate
      MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
      MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space
      MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw
      DOC: update documentation for swap_activate and swap_rw
      MM: submit multipage reads for SWP_FS_OPS swap-space
      MM: submit multipage write for SWP_FS_OPS swap-space
      VFS: Add FMODE_CAN_ODIRECT file flag
      NFS: remove IS_SWAPFILE hack
      NFS: rename nfs_direct_IO and use as ->swap_rw
      NFS: swap IO handling is slightly different for O_DIRECT IO
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory
      SUNRPC/auth: async tasks mustn't block waiting for memory
      SUNRPC/xprt: async tasks mustn't block waiting for memory
      SUNRPC: remove scheduling boost for "SWAPPER" tasks.
      NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
      SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
      NFSv4: keep state manager thread active if swap is enabled
      NFS: swap-out must always use STABLE writes.
      SUNRPC: lock against ->sock changing during sysfs read


 Documentation/filesystems/locking.rst |  18 ++-
 Documentation/filesystems/vfs.rst     |  17 ++-
 drivers/block/loop.c                  |   4 +-
 fs/cifs/file.c                        |   7 +-
 fs/fcntl.c                            |   9 +-
 fs/nfs/direct.c                       |  56 ++++---
 fs/nfs/file.c                         |  39 +++--
 fs/nfs/nfs4_fs.h                      |   1 +
 fs/nfs/nfs4proc.c                     |  20 +++
 fs/nfs/nfs4state.c                    |  39 ++++-
 fs/nfs/read.c                         |   4 -
 fs/nfs/write.c                        |   2 +
 fs/open.c                             |   9 +-
 fs/overlayfs/file.c                   |  13 +-
 include/linux/fs.h                    |   4 +
 include/linux/nfs_fs.h                |  11 +-
 include/linux/nfs_xdr.h               |   2 +
 include/linux/sunrpc/auth.h           |   1 +
 include/linux/sunrpc/sched.h          |   1 -
 include/linux/swap.h                  |   7 +-
 include/linux/writeback.h             |   7 +
 include/trace/events/sunrpc.h         |   1 -
 mm/madvise.c                          |   8 +-
 mm/memory.c                           |   2 +-
 mm/page_io.c                          | 210 ++++++++++++++++++++------
 mm/swap.h                             |  26 +++-
 mm/swap_state.c                       |  33 ++--
 mm/swapfile.c                         |  13 +-
 mm/vmscan.c                           |  38 +++--
 net/sunrpc/auth.c                     |   8 +-
 net/sunrpc/auth_gss/auth_gss.c        |   6 +-
 net/sunrpc/auth_unix.c                |  10 +-
 net/sunrpc/clnt.c                     |   7 +-
 net/sunrpc/sched.c                    |  29 ++--
 net/sunrpc/sysfs.c                    |   5 +-
 net/sunrpc/xprt.c                     |  19 +--
 net/sunrpc/xprtrdma/transport.c       |  10 +-
 net/sunrpc/xprtsock.c                 |  15 +-
 38 files changed, 505 insertions(+), 206 deletions(-)

--
Signature



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 01/23] MM: create new mm/swap.h header file.
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (12 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 20/23] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-02-07 13:51   ` Geert Uytterhoeven
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Many functions declared in include/linux/swap.h are only used within mm/

Create a new "mm/swap.h" and move some of these declarations there.
Remove the redundant 'extern' from the function declarations.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/swap.h |  121 -----------------------------------------------
 mm/madvise.c         |    1 
 mm/memcontrol.c      |    1 
 mm/memory.c          |    1 
 mm/mincore.c         |    1 
 mm/page_alloc.c      |    1 
 mm/page_io.c         |    1 
 mm/shmem.c           |    1 
 mm/swap.h            |  129 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c      |    1 
 mm/swapfile.c        |    1 
 mm/util.c            |    1 
 mm/vmscan.c          |    1 
 13 files changed, 140 insertions(+), 121 deletions(-)
 create mode 100644 mm/swap.h

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1d38d9475c4d..3f54a8941c9d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -419,62 +419,19 @@ extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_SWAP
 
-#include <linux/blk_types.h> /* for bio_end_io_t */
-
-/* linux/mm/page_io.c */
-extern int swap_readpage(struct page *page, bool do_poll);
-extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio);
-extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
-	bio_end_io_t end_write_func);
 extern int swap_set_page_dirty(struct page *page);
-
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
 		unsigned long nr_pages, sector_t start_block);
 int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 		sector_t *);
 
-/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT	14
-#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry)			    \
-	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
-		>> SWAP_ADDRESS_SPACE_SHIFT])
 static inline unsigned long total_swapcache_pages(void)
 {
 	return global_node_page_state(NR_SWAPCACHE);
 }
 
-extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *page);
-extern void *get_shadow_from_swap_cache(swp_entry_t entry);
-extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
-			gfp_t gfp, void **shadowp);
-extern void __delete_from_swap_cache(struct page *page,
-			swp_entry_t entry, void *shadow);
-extern void delete_from_swap_cache(struct page *);
-extern void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end);
-extern void free_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
-extern struct page *lookup_swap_cache(swp_entry_t entry,
-				      struct vm_area_struct *vma,
-				      unsigned long addr);
-struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
-extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr,
-			bool do_poll);
-extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr,
-			bool *new_page_allocated);
-extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-				struct vm_fault *vmf);
-extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
-				struct vm_fault *vmf);
-
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
@@ -528,12 +485,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 }
 
 #else /* CONFIG_SWAP */
-
-static inline int swap_readpage(struct page *page, bool do_poll)
-{
-	return 0;
-}
-
 static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 {
 	return NULL;
@@ -548,11 +499,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 {
 }
 
-static inline struct address_space *swap_address_space(swp_entry_t entry)
-{
-	return NULL;
-}
-
 #define get_nr_swap_pages()			0L
 #define total_swap_pages			0L
 #define total_swapcache_pages()			0UL
@@ -567,14 +513,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-static inline void free_swap_cache(struct page *page)
-{
-}
-
-static inline void show_swap_cache_info(void)
-{
-}
-
 /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
 #define free_swap_and_cache(e) is_pfn_swap_entry(e)
 
@@ -600,65 +538,6 @@ static inline void put_swap_page(struct page *page, swp_entry_t swp)
 {
 }
 
-static inline struct page *swap_cluster_readahead(swp_entry_t entry,
-				gfp_t gfp_mask, struct vm_fault *vmf)
-{
-	return NULL;
-}
-
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
-{
-	return NULL;
-}
-
-static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
-{
-	return 0;
-}
-
-static inline struct page *lookup_swap_cache(swp_entry_t swp,
-					     struct vm_area_struct *vma,
-					     unsigned long addr)
-{
-	return NULL;
-}
-
-static inline
-struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
-{
-	return find_get_page(mapping, index);
-}
-
-static inline int add_to_swap(struct page *page)
-{
-	return 0;
-}
-
-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
-{
-	return NULL;
-}
-
-static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
-					gfp_t gfp_mask, void **shadowp)
-{
-	return -1;
-}
-
-static inline void __delete_from_swap_cache(struct page *page,
-					swp_entry_t entry, void *shadow)
-{
-}
-
-static inline void delete_from_swap_cache(struct page *page)
-{
-}
-
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end)
-{
-}
 
 static inline int page_swapcount(struct page *page)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..1ee4b7583379 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -35,6 +35,7 @@
 #include <asm/tlb.h>
 
 #include "internal.h"
+#include "swap.h"
 
 struct madvise_walk_private {
 	struct mmu_gather *tlb;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..9b7c8181a207 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,6 +66,7 @@
 #include <net/sock.h>
 #include <net/ip.h>
 #include "slab.h"
+#include "swap.h"
 
 #include <linux/uaccess.h>
 
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..d25372340107 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -86,6 +86,7 @@
 
 #include "pgalloc-track.h"
 #include "internal.h"
+#include "swap.h"
 
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..f4f627325e12 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -20,6 +20,7 @@
 #include <linux/pgtable.h>
 
 #include <linux/uaccess.h>
+#include "swap.h"
 
 static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..221aa3c10b78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -81,6 +81,7 @@
 #include "internal.h"
 #include "shuffle.h"
 #include "page_reporting.h"
+#include "swap.h"
 
 /* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
 typedef int __bitwise fpi_t;
diff --git a/mm/page_io.c b/mm/page_io.c
index 0bf8e40f4e57..f8c26092e869 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,6 +26,7 @@
 #include <linux/uio.h>
 #include <linux/sched/task.h>
 #include <linux/delayacct.h>
+#include "swap.h"
 
 void end_swap_bio_write(struct bio *bio)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..c8b8819fe2e6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -38,6 +38,7 @@
 #include <linux/hugetlb.h>
 #include <linux/fs_parser.h>
 #include <linux/swapfile.h>
+#include "swap.h"
 
 static struct vfsmount *shm_mnt;
 
diff --git a/mm/swap.h b/mm/swap.h
new file mode 100644
index 000000000000..13e72a5023aa
--- /dev/null
+++ b/mm/swap.h
@@ -0,0 +1,129 @@
+
+#ifdef CONFIG_SWAP
+#include <linux/blk_types.h> /* for bio_end_io_t */
+
+/* linux/mm/page_io.c */
+int swap_readpage(struct page *page, bool do_poll);
+int swap_writepage(struct page *page, struct writeback_control *wbc);
+void end_swap_bio_write(struct bio *bio);
+int __swap_writepage(struct page *page, struct writeback_control *wbc,
+		     bio_end_io_t end_write_func);
+
+/* linux/mm/swap_state.c */
+/* One swap address space for each 64M swap space */
+#define SWAP_ADDRESS_SPACE_SHIFT	14
+#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
+extern struct address_space *swapper_spaces[];
+#define swap_address_space(entry)			    \
+	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
+		>> SWAP_ADDRESS_SPACE_SHIFT])
+
+void show_swap_cache_info(void);
+int add_to_swap(struct page *page);
+void *get_shadow_from_swap_cache(swp_entry_t entry);
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+		      gfp_t gfp, void **shadowp);
+void __delete_from_swap_cache(struct page *page,
+			      swp_entry_t entry, void *shadow);
+void delete_from_swap_cache(struct page *);
+void clear_shadow_from_swap_cache(int type, unsigned long begin,
+				  unsigned long end);
+void free_swap_cache(struct page *);
+struct page *lookup_swap_cache(swp_entry_t entry,
+			       struct vm_area_struct *vma,
+			       unsigned long addr);
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
+
+struct page *read_swap_cache_async(swp_entry_t, gfp_t,
+				   struct vm_area_struct *vma,
+				   unsigned long addr,
+				   bool do_poll);
+struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
+				     struct vm_area_struct *vma,
+				     unsigned long addr,
+				     bool *new_page_allocated);
+struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
+				    struct vm_fault *vmf);
+struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
+			      struct vm_fault *vmf);
+
+#else /* CONFIG_SWAP */
+static inline int swap_readpage(struct page *page, bool do_poll)
+{
+	return 0;
+}
+
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline void free_swap_cache(struct page *page)
+{
+}
+
+static inline void show_swap_cache_info(void)
+{
+}
+
+static inline struct page *swap_cluster_readahead(swp_entry_t entry,
+				gfp_t gfp_mask, struct vm_fault *vmf)
+{
+	return NULL;
+}
+
+static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+			struct vm_fault *vmf)
+{
+	return NULL;
+}
+
+static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
+{
+	return 0;
+}
+
+static inline struct page *lookup_swap_cache(swp_entry_t swp,
+					     struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return NULL;
+}
+
+static inline
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
+{
+	return find_get_page(mapping, index);
+}
+
+static inline int add_to_swap(struct page *page)
+{
+	return 0;
+}
+
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
+					gfp_t gfp_mask, void **shadowp)
+{
+	return -1;
+}
+
+static inline void __delete_from_swap_cache(struct page *page,
+					swp_entry_t entry, void *shadow)
+{
+}
+
+static inline void delete_from_swap_cache(struct page *page)
+{
+}
+
+static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
+				unsigned long end)
+{
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..bb38453425c7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
+#include "swap.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bf0df7aa7158..71c7a31dd291 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -44,6 +44,7 @@
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
 #include <linux/swap_cgroup.h>
+#include "swap.h"
 
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
diff --git a/mm/util.c b/mm/util.c
index 7e43369064c8..619697e3d935 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -27,6 +27,7 @@
 #include <linux/uaccess.h>
 
 #include "internal.h"
+#include "swap.h"
 
 /**
  * kfree_const - conditionally free memory
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..5c734ffc6057 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
 #include <linux/balloon_compaction.h>
 
 #include "internal.h"
+#include "swap.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (6 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 07/23] MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  7:27   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 16/23] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Code that does swap read-ahead uses blk_start_plug() and
blk_finish_plug() to allow lower levels to combine multiple read-ahead
pages into a single request, but calls blk_finish_plug() *before*
submitting the original (non-ahead) read request.
This missed an opportunity to combine read requests.

This patch moves the blk_finish_plug to *after* all the reads.
This will likely combine the primary read with some of the "ahead"
reads, and that may slightly increase the latency of that read, but it
should more than make up for this by making more efficient use of the
storage path.

The patch mostly makes the code look more consistent.  Performance
change is unlikely to be noticeable.

Fixes-no-auto-backport: 3fb5c298b04e ("swap: allow swap readahead to be merged")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/swap_state.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index bb38453425c7..093ecf864200 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -625,6 +625,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
 
+	blk_start_plug(&plug);
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
 		goto skip;
@@ -638,7 +639,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (end_offset >= si->max)
 		end_offset = si->max - 1;
 
-	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = __read_swap_cache_async(
@@ -655,11 +655,12 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		}
 		put_page(page);
 	}
-	blk_finish_plug(&plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
-	return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+	page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+	blk_finish_plug(&plug);
+	return page;
 }
 
 int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -800,11 +801,11 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		.win = 1,
 	};
 
+	blk_start_plug(&plug);
 	swap_ra_info(vmf, &ra_info);
 	if (ra_info.win == 1)
 		goto skip;
 
-	blk_start_plug(&plug);
 	for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
 	     i++, pte++) {
 		pentry = *pte;
@@ -828,11 +829,12 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		}
 		put_page(page);
 	}
-	blk_finish_plug(&plug);
 	lru_add_drain();
 skip:
-	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
+	page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
 				     ra_info.win == 1);
+	blk_finish_plug(&plug);
+	return page;
 }
 
 /**




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 03/23] MM: drop swap_set_page_dirty
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
  2022-01-24  3:48 ` [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  7:28   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Pages that are written to swap are owned by the MM subsystem - not any
filesystem.

When such a page is passed to a filesystem to be written out to a
swap-file, the filesystem handles the data, but the page itself does not
belong to the filesystem.  So calling the filesystem's set_page_dirty
address_space operation makes no sense.  This is for pages in the given
address space, and a page to be written to swap does not exist in the
given address space.

So drop swap_set_page_dirty() which calls the address-space's
set_page_dirty, and alway use __set_page_dirty_no_writeback, which is
appropriate for pages being swapped out.

Fixes-no-auto-backport: 62c230bc1790 ("mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/swap.h |    1 -
 mm/page_io.c         |   14 --------------
 mm/swap_state.c      |    2 +-
 3 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3f54a8941c9d..a43929f7033e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -419,7 +419,6 @@ extern void kswapd_stop(int nid);
 
 #ifdef CONFIG_SWAP
 
-extern int swap_set_page_dirty(struct page *page);
 int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
 		unsigned long nr_pages, sector_t start_block);
 int generic_swapfile_activate(struct swap_info_struct *, struct file *,
diff --git a/mm/page_io.c b/mm/page_io.c
index f8c26092e869..34b12d6f94d7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -438,17 +438,3 @@ int swap_readpage(struct page *page, bool synchronous)
 	delayacct_swapin_end();
 	return ret;
 }
-
-int swap_set_page_dirty(struct page *page)
-{
-	struct swap_info_struct *sis = page_swap_info(page);
-
-	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct address_space *mapping = sis->swap_file->f_mapping;
-
-		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-		return mapping->a_ops->set_page_dirty(page);
-	} else {
-		return __set_page_dirty_no_writeback(page);
-	}
-}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 093ecf864200..d541594be1c3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -31,7 +31,7 @@
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.set_page_dirty	= swap_set_page_dirty,
+	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 04/23] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (8 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 16/23] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  7:30   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 06/23] MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space NeilBrown
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

If a filesystem wishes to handle all swap IO itself (via ->direct_IO),
rather than just providing devices addresses for submit_bio(),
SWP_FS_OPS must be set.
Currently the protocol for setting this it to have ->swap_activate
return zero.  In that case SWP_FS_OPS is set, and add_swap_extent()
is called for the entire file.

This is a little clumsy as different return values for ->swap_activate
have quite different meanings, and it makes it hard to search for which
filesystems require SWP_FS_OPS to be set.

So remove the special meaning of a zero return, and require the
filesystem to set SWP_FS_OPS if it so desires, and to always call
add_swap_extent() as required.

Currently only NFS and CIFS return zero for add_swap_extent().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/cifs/file.c       |    3 ++-
 fs/nfs/file.c        |   13 +++++++++++--
 include/linux/swap.h |    6 ++++++
 mm/swapfile.c        |   10 +++-------
 4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 59334be9ed3b..c795d4a9ec4a 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4974,7 +4974,8 @@ static int cifs_swap_activate(struct swap_info_struct *sis,
 	 * from reading or writing the file
 	 */
 
-	return 0;
+	sis->flags |= SWP_FS_OPS;
+	return add_swap_extent(sis, 0, sis->max, 0);
 }
 
 static void cifs_swap_deactivate(struct file *file)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 76d76acbc594..d5aa55c7edb0 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -488,6 +488,7 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 {
 	unsigned long blocks;
 	long long isize;
+	int ret;
 	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
 	struct inode *inode = file->f_mapping->host;
 
@@ -500,9 +501,17 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		return -EINVAL;
 	}
 
+	ret = rpc_clnt_swap_activate(clnt);
+	if (ret)
+		return ret;
+	ret = add_swap_extent(sis, 0, sis->max, 0);
+	if (ret < 0) {
+		rpc_clnt_swap_deactivate(clnt);
+		return ret;
+	}
 	*span = sis->pages;
-
-	return rpc_clnt_swap_activate(clnt);
+	sis->flags |= SWP_FS_OPS;
+	return ret;
 }
 
 static void nfs_swap_deactivate(struct file *file)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a43929f7033e..b57cff3c5ac2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -573,6 +573,12 @@ static inline swp_entry_t get_swap_page(struct page *page)
 	return entry;
 }
 
+static inline int add_swap_extent(struct swap_info_struct *sis,
+				  unsigned long start_page,
+				  unsigned long nr_pages, sector_t start_block)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SWAP */
 
 #ifdef CONFIG_THP_SWAP
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 71c7a31dd291..ed6028aea8bf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2347,13 +2347,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 
 	if (mapping->a_ops->swap_activate) {
 		ret = mapping->a_ops->swap_activate(sis, swap_file, span);
-		if (ret >= 0)
-			sis->flags |= SWP_ACTIVATED;
-		if (!ret) {
-			sis->flags |= SWP_FS_OPS;
-			ret = add_swap_extent(sis, 0, sis->max, 0);
-			*span = sis->pages;
-		}
+		if (ret < 0)
+			return ret;
+		sis->flags |= SWP_ACTIVATED;
 		return ret;
 	}
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  7:31   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 03/23] MM: drop swap_set_page_dirty NeilBrown
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only down-grade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS are not being used.

This makes the calculation of "may_enter_fs" slightly more complex, so
move it into a separate function.  with that done, there is little value
in maintaining the bool variable any more.  So replace the
may_enter_fs variable with a may_enter_fs() function.  This removes any
risk for the variable becoming out-of-date.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/swap.h   |    8 ++++++++
 mm/vmscan.c |   29 ++++++++++++++++++++---------
 2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 13e72a5023aa..5c676e55f288 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -47,6 +47,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			      struct vm_fault *vmf);
 
+static inline unsigned int page_swap_flags(struct page *page)
+{
+	return page_swap_info(page)->flags;
+}
 #else /* CONFIG_SWAP */
 static inline int swap_readpage(struct page *page, bool do_poll)
 {
@@ -126,4 +130,8 @@ static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
 {
 }
 
+static inline unsigned int page_swap_flags(struct page *page)
+{
+	return 0;
+}
 #endif /* CONFIG_SWAP */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5c734ffc6057..ad5026d06aa8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1506,6 +1506,22 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	return nr_succeeded;
 }
 
+static bool may_enter_fs(struct page *page, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_FS)
+		return true;
+	if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
+		return false;
+	/*
+	 * We can "enter_fs" for swap-cache with only __GFP_IO
+	 * providing this isn't SWP_FS_OPS.
+	 * ->flags can be updated non-atomicially (scan_swap_map_slots),
+	 * but that will never affect SWP_FS_OPS, so the data_race
+	 * is safe.
+	 */
+	return !data_race(page_swap_flags(page) & SWP_FS_OPS);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1531,7 +1547,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		enum page_references references = PAGEREF_RECLAIM;
-		bool dirty, writeback, may_enter_fs;
+		bool dirty, writeback;
 		unsigned int nr_pages;
 
 		cond_resched();
@@ -1555,9 +1571,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
-		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
-			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
-
 		/*
 		 * The number of dirty pages determines if a node is marked
 		 * reclaim_congested. kswapd will stall and start writing
@@ -1602,7 +1615,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		 *    not to fs). In this case mark the page for immediate
 		 *    reclaim and continue scanning.
 		 *
-		 *    Require may_enter_fs because we would wait on fs, which
+		 *    Require may_enter_fs() because we would wait on fs, which
 		 *    may not have submitted IO yet. And the loop driver might
 		 *    enter reclaim, and deadlock if it waits on a page for
 		 *    which it is needed to do the write (loop masks off
@@ -1634,7 +1647,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 			/* Case 2 above */
 			} else if (writeback_throttling_sane(sc) ||
-			    !PageReclaim(page) || !may_enter_fs) {
+			    !PageReclaim(page) || !may_enter_fs(page, sc->gfp_mask)) {
 				/*
 				 * This is slightly racy - end_page_writeback()
 				 * might have just cleared PageReclaim, then
@@ -1724,8 +1737,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 						goto activate_locked_split;
 				}
 
-				may_enter_fs = true;
-
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
 			}
@@ -1795,7 +1806,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
-			if (!may_enter_fs)
+			if (!may_enter_fs(page, sc->gfp_mask))
 				goto keep_locked;
 			if (!sc->may_writepage)
 				goto keep_locked;




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 06/23] MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (9 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 04/23] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:48   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 15/23] SUNRPC/call_alloc: async tasks mustn't block waiting for memory NeilBrown
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

swap currently uses ->readpage to read swap pages.  This can only
request one page at a time from the filesystem, which is not most
efficient.

swap uses ->direct_IO for writes which while this is adequate is an
inappropriate over-loading.  ->direct_IO may need to had handle allocate
space for holes or other details that are not relevant for swap.

So this patch introduces a new address_space operation: ->swap_rw.
In this patch it is used for reads, and a subsequent patch will switch
writes to use it.

No filesystem yet supports ->swap_rw, but that is not a problem because
no filesystem actually works with filesystem-based swap.
Only two filesystems set SWP_FS_OPS:
- cifs sets the flag, but ->direct_IO always fails so swap cannot work.
- nfs sets the flag, but ->direct_IO calls generic_write_checks()
  which has failed on swap files for several releases.

To ensure that a NULL ->swap_rw isn't called, ->activate_swap() for both
NFS and cifs are changed to fail if ->swap_rw is not set.  This can be
removed if/when the function is added.

Future patches will restore swap-over-NFS functionality.

To submit an async read with ->swap_rw() we need to allocate a structure
to hold the kiocb and other details.  swap_readpage() cannot handle
transient failure, so we create a mempool to provide the structures.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/cifs/file.c     |    4 +++
 fs/nfs/file.c      |    4 +++
 include/linux/fs.h |    1 +
 mm/page_io.c       |   68 +++++++++++++++++++++++++++++++++++++++++++++++-----
 mm/swap.h          |    1 +
 mm/swapfile.c      |    5 ++++
 6 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c795d4a9ec4a..b3898c4aa5ad 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4946,6 +4946,10 @@ static int cifs_swap_activate(struct swap_info_struct *sis,
 
 	cifs_dbg(FYI, "swap activate\n");
 
+	if (!swap_file->f_mapping->a_ops->swap_rw)
+		/* Cannot support swap */
+		return -EINVAL;
+
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
 	isize = inode->i_size;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d5aa55c7edb0..3dbef2c31567 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -492,6 +492,10 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
 	struct inode *inode = file->f_mapping->host;
 
+	if (!file->f_mapping->a_ops->swap_rw)
+		/* Cannot support swap */
+		return -EINVAL;
+
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
 	isize = inode->i_size;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3daaea16554..4fade3b20c87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -409,6 +409,7 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/mm/page_io.c b/mm/page_io.c
index 34b12d6f94d7..e90a3231f225 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -284,6 +284,25 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 #define bio_associate_blkg_from_page(bio, page)		do { } while (0)
 #endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */
 
+struct swap_iocb {
+	struct kiocb		iocb;
+	struct bio_vec		bvec;
+};
+static mempool_t *sio_pool;
+
+int sio_pool_init(void)
+{
+	if (!sio_pool) {
+		mempool_t *pool = mempool_create_kmalloc_pool(
+			SWAP_CLUSTER_MAX, sizeof(struct swap_iocb));
+		if (cmpxchg(&sio_pool, NULL, pool))
+			mempool_destroy(pool);
+	}
+	if (!sio_pool)
+		return -ENOMEM;
+	return 0;
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -355,6 +374,48 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+static void sio_read_complete(struct kiocb *iocb, long ret)
+{
+	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+	struct page *page = sio->bvec.bv_page;
+
+	if (ret != 0 && ret != PAGE_SIZE) {
+		SetPageError(page);
+		ClearPageUptodate(page);
+		pr_alert_ratelimited("Read-error on swap-device\n");
+	} else {
+		SetPageUptodate(page);
+		count_vm_event(PSWPIN);
+	}
+	unlock_page(page);
+	mempool_free(sio, sio_pool);
+}
+
+static int swap_readpage_fs(struct page *page)
+{
+	struct swap_info_struct *sis = page_swap_info(page);
+	struct file *swap_file = sis->swap_file;
+	struct address_space *mapping = swap_file->f_mapping;
+	struct iov_iter from;
+	struct swap_iocb *sio;
+	loff_t pos = page_file_offset(page);
+	int ret;
+
+	sio = mempool_alloc(sio_pool, GFP_KERNEL);
+	init_sync_kiocb(&sio->iocb, swap_file);
+	sio->iocb.ki_pos = pos;
+	sio->iocb.ki_complete = sio_read_complete;
+	sio->bvec.bv_page = page;
+	sio->bvec.bv_len = PAGE_SIZE;
+	sio->bvec.bv_offset = 0;
+
+	iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
+	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+	if (ret != -EIOCBQUEUED)
+		sio_read_complete(&sio->iocb, ret);
+	return ret;
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -381,12 +442,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-
-		ret = mapping->a_ops->readpage(swap_file, page);
-		if (!ret)
-			count_vm_event(PSWPIN);
+		ret = swap_readpage_fs(page);
 		goto out;
 	}
 
diff --git a/mm/swap.h b/mm/swap.h
index 5c676e55f288..e8ee995cf8d8 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -3,6 +3,7 @@
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
 /* linux/mm/page_io.c */
+int sio_pool_init(void);
 int swap_readpage(struct page *page, bool do_poll);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void end_swap_bio_write(struct bio *bio);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ed6028aea8bf..c800c17bf0c8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2350,6 +2350,11 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 		if (ret < 0)
 			return ret;
 		sis->flags |= SWP_ACTIVATED;
+		if ((sis->flags & SWP_FS_OPS) &&
+		    sio_pool_init() != 0) {
+			destroy_swap_extents(sis);
+			return -ENOMEM;
+		}
 		return ret;
 	}
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 07/23] MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (5 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 08/23] DOC: update documentation for swap_activate and swap_rw NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:49   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead NeilBrown
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

This patch switches swap-out to SWP_FS_OPS swap-spaces to use ->swap_rw
and makes the writes asynchronous, like they are for other swap spaces.

To make it async we need to allocate the kiocb struct from a mempool.
This may block, but won't block as long as waiting for the write to
complete.  At most it will wait for some previous swap IO to complete.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page_io.c |   93 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 53 insertions(+), 40 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index e90a3231f225..6e32ca35d9b6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -303,6 +303,57 @@ int sio_pool_init(void)
 	return 0;
 }
 
+static void sio_write_complete(struct kiocb *iocb, long ret)
+{
+	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+	struct page *page = sio->bvec.bv_page;
+
+	if (ret != 0 && ret != PAGE_SIZE) {
+		/*
+		 * In the case of swap-over-nfs, this can be a
+		 * temporary failure if the system has limited
+		 * memory for allocating transmit buffers.
+		 * Mark the page dirty and avoid
+		 * folio_rotate_reclaimable but rate-limit the
+		 * messages but do not flag PageError like
+		 * the normal direct-to-bio case as it could
+		 * be temporary.
+		 */
+		set_page_dirty(page);
+		ClearPageReclaim(page);
+		pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+				   ret, page_file_offset(page));
+	} else
+		count_vm_event(PSWPOUT);
+	end_page_writeback(page);
+	mempool_free(sio, sio_pool);
+}
+
+static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
+{
+	struct swap_iocb *sio;
+	struct swap_info_struct *sis = page_swap_info(page);
+	struct file *swap_file = sis->swap_file;
+	struct address_space *mapping = swap_file->f_mapping;
+	struct iov_iter from;
+	int ret;
+
+	set_page_writeback(page);
+	unlock_page(page);
+	sio = mempool_alloc(sio_pool, GFP_NOIO);
+	init_sync_kiocb(&sio->iocb, swap_file);
+	sio->iocb.ki_complete = sio_write_complete;
+	sio->iocb.ki_pos = page_file_offset(page);
+	sio->bvec.bv_page = page;
+	sio->bvec.bv_len = PAGE_SIZE;
+	sio->bvec.bv_offset = 0;
+	iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+	if (ret != -EIOCBQUEUED)
+		sio_write_complete(&sio->iocb, ret);
+	return ret;
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -311,46 +362,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (data_race(sis->flags & SWP_FS_OPS)) {
-		struct kiocb kiocb;
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
-		struct bio_vec bv = {
-			.bv_page = page,
-			.bv_len  = PAGE_SIZE,
-			.bv_offset = 0
-		};
-		struct iov_iter from;
-
-		iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
-		init_sync_kiocb(&kiocb, swap_file);
-		kiocb.ki_pos = page_file_offset(page);
-
-		set_page_writeback(page);
-		unlock_page(page);
-		ret = mapping->a_ops->direct_IO(&kiocb, &from);
-		if (ret == PAGE_SIZE) {
-			count_vm_event(PSWPOUT);
-			ret = 0;
-		} else {
-			/*
-			 * In the case of swap-over-nfs, this can be a
-			 * temporary failure if the system has limited
-			 * memory for allocating transmit buffers.
-			 * Mark the page dirty and avoid
-			 * folio_rotate_reclaimable but rate-limit the
-			 * messages but do not flag PageError like
-			 * the normal direct-to-bio case as it could
-			 * be temporary.
-			 */
-			set_page_dirty(page);
-			ClearPageReclaim(page);
-			pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
-					   page_file_offset(page));
-		}
-		end_page_writeback(page);
-		return ret;
-	}
+	if (data_race(sis->flags & SWP_FS_OPS))
+		return swap_writepage_fs(page, wbc);
 
 	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
 	if (!ret) {




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 08/23] DOC: update documentation for swap_activate and swap_rw
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (4 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 23/23] SUNRPC: lock against ->sock changing during sysfs read NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:50   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 07/23] MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw NeilBrown
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

This documentation for ->swap_activate() has been out-of-date for a long
time.  This patch updates it to match recent changes, and adds
documentation for the associated ->swap_rw()

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/filesystems/locking.rst |   18 ++++++++++++------
 Documentation/filesystems/vfs.rst     |   17 ++++++++++++-----
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 3f9b1497ebb8..fbb10378d5ee 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -260,8 +260,9 @@ prototypes::
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
-	int (*swap_activate)(struct file *);
+	int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 	int (*swap_deactivate)(struct file *);
+	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 
 locking rules:
 	All except set_page_dirty and freepage may block
@@ -290,6 +291,7 @@ is_partially_uptodate:	yes
 error_remove_page:	yes
 swap_activate:		no
 swap_deactivate:	no
+swap_rw:		yes, unlocks
 ======================	======================== =========	===============
 
 ->write_begin(), ->write_end() and ->readpage() may be called from
@@ -392,15 +394,19 @@ cleaned, or an error value if not. Note that in order to prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
-->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+->swap_activate() will be called to prepare the given file for swap.  It
+should perform any validation and preparation necessary to ensure that
+writes can be performed with minimal memory allocation.  It should call
+add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+the number of extents added.  If IO should be submitted through
+->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+directly to the block device ``sis->bdev``.
 
 ->swap_deactivate() will be called in the sys_swapoff()
 path after ->swap_activate() returned success.
 
+->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().
+
 file_lock_operations
 ====================
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..779d23fc7954 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -751,8 +751,9 @@ cache in your filesystem.  The following members are defined:
 					      unsigned long);
 		void (*is_dirty_writeback) (struct page *, bool *, bool *);
 		int (*error_remove_page) (struct mapping *mapping, struct page *page);
-		int (*swap_activate)(struct file *);
+		int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 		int (*swap_deactivate)(struct file *);
+		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
 	};
 
 ``writepage``
@@ -959,15 +960,21 @@ cache in your filesystem.  The following members are defined:
 	unless you have them locked or reference counts increased.
 
 ``swap_activate``
-	Called when swapon is used on a file to allocate space if
-	necessary and pin the block lookup information in memory.  A
-	return value of zero indicates success, in which case this file
-	can be used to back swapspace.
+
+	Called to prepare the given file for swap.  It should perform
+	any validation and preparation necessary to ensure that writes
+	can be performed with minimal memory allocation.  It should call
+	add_swap_extent(), or the helper iomap_swapfile_activate(), and
+	return the number of extents added.  If IO should be submitted
+	through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+	be submitted directly to the block device ``sis->bdev``.
 
 ``swap_deactivate``
 	Called during swapoff on files where swap_activate was
 	successful.
 
+``swap_rw``
+	Called to read or write swap pages when SWP_FS_OPS is set.
 
 The File Object
 ===============




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (13 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 01/23] MM: create new mm/swap.h header file NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:25   ` kernel test robot
                     ` (4 more replies)
  2022-01-24  3:48 ` [PATCH 12/23] NFS: remove IS_SWAPFILE hack NeilBrown
                   ` (8 subsequent siblings)
  23 siblings, 5 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

swap_readpage() is given one page at a time, but maybe called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
are single page reads.

With this patch we pass in a pointer-to-pointer when swap_readpage can
store state between calls - much like the effect of blk_plug.  After
calling swap_readpage() some number of times, the state will be passed
to swap_read_unplug() which can submit the combined request.

Some caller currently call blk_finish_plug() *before* the final call to
swap_readpage(), so the last page cannot be included.  This patch moves
blk_finish_plug() to after the last call, and calls swap_read_unplug()
there too.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/madvise.c    |    8 +++-
 mm/memory.c     |    2 +
 mm/page_io.c    |  102 +++++++++++++++++++++++++++++++++++--------------------
 mm/swap.h       |   16 +++++++--
 mm/swap_state.c |   19 +++++++---
 5 files changed, 98 insertions(+), 49 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1ee4b7583379..2b1ab30af141 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -225,6 +225,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	pte_t *orig_pte;
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
+	struct swap_iocb *splug = NULL;
 
 	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 		return 0;
@@ -246,10 +247,11 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 			continue;
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-							vma, index, false);
+					     vma, index, false, &splug);
 		if (page)
 			put_page(page);
 	}
+	swap_read_unplug(splug);
 
 	return 0;
 }
@@ -265,6 +267,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 	XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
 	pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
 	struct page *page;
+	struct swap_iocb *splug = NULL;
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_index) {
@@ -277,13 +280,14 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 
 		swap = radix_to_swp_entry(page);
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
-							NULL, 0, false);
+					     NULL, 0, false, &splug);
 		if (page)
 			put_page(page);
 
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
+	swap_read_unplug(splug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index d25372340107..8bd18c54eaa4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3559,7 +3559,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 				/* To provide entry to swap_readpage() */
 				set_page_private(page, entry.val);
-				swap_readpage(page, true);
+				swap_readpage(page, true, NULL);
 				set_page_private(page, 0);
 			}
 		} else {
diff --git a/mm/page_io.c b/mm/page_io.c
index 6e32ca35d9b6..bcf655d650c8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -286,7 +286,8 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 
 struct swap_iocb {
 	struct kiocb		iocb;
-	struct bio_vec		bvec;
+	struct bio_vec		bvec[SWAP_CLUSTER_MAX];
+	int			pages;
 };
 static mempool_t *sio_pool;
 
@@ -306,7 +307,7 @@ int sio_pool_init(void)
 static void sio_write_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
-	struct page *page = sio->bvec.bv_page;
+	struct page *page = sio->bvec[0].bv_page;
 
 	if (ret != 0 && ret != PAGE_SIZE) {
 		/*
@@ -344,10 +345,10 @@ static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 	init_sync_kiocb(&sio->iocb, swap_file);
 	sio->iocb.ki_complete = sio_write_complete;
 	sio->iocb.ki_pos = page_file_offset(page);
-	sio->bvec.bv_page = page;
-	sio->bvec.bv_len = PAGE_SIZE;
-	sio->bvec.bv_offset = 0;
-	iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+	sio->bvec[0].bv_page = page;
+	sio->bvec[0].bv_len = PAGE_SIZE;
+	sio->bvec[0].bv_offset = 0;
+	iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
 	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
 	if (ret != -EIOCBQUEUED)
 		sio_write_complete(&sio->iocb, ret);
@@ -390,46 +391,60 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 static void sio_read_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
-	struct page *page = sio->bvec.bv_page;
-
-	if (ret != 0 && ret != PAGE_SIZE) {
-		SetPageError(page);
-		ClearPageUptodate(page);
-		pr_alert_ratelimited("Read-error on swap-device\n");
-	} else {
-		SetPageUptodate(page);
-		count_vm_event(PSWPIN);
+	int p;
+
+	for (p = 0; p < sio->pages; p++) {
+		struct page *page = sio->bvec[p].bv_page;
+		if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+			SetPageError(page);
+			ClearPageUptodate(page);
+			pr_alert_ratelimited("Read-error on swap-device\n");
+		} else {
+			SetPageUptodate(page);
+			count_vm_event(PSWPIN);
+		}
+		unlock_page(page);
 	}
-	unlock_page(page);
 	mempool_free(sio, sio_pool);
 }
 
-static int swap_readpage_fs(struct page *page)
+static void swap_readpage_fs(struct page *page,
+			     struct swap_iocb **plug)
 {
 	struct swap_info_struct *sis = page_swap_info(page);
-	struct file *swap_file = sis->swap_file;
-	struct address_space *mapping = swap_file->f_mapping;
-	struct iov_iter from;
-	struct swap_iocb *sio;
+	struct swap_iocb *sio = NULL;
 	loff_t pos = page_file_offset(page);
-	int ret;
-
-	sio = mempool_alloc(sio_pool, GFP_KERNEL);
-	init_sync_kiocb(&sio->iocb, swap_file);
-	sio->iocb.ki_pos = pos;
-	sio->iocb.ki_complete = sio_read_complete;
-	sio->bvec.bv_page = page;
-	sio->bvec.bv_len = PAGE_SIZE;
-	sio->bvec.bv_offset = 0;
 
-	iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
-	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
-	if (ret != -EIOCBQUEUED)
-		sio_read_complete(&sio->iocb, ret);
-	return ret;
+	if (*plug)
+		sio = *plug;
+	if (sio) {
+		if (sio->iocb.ki_filp != sis->swap_file ||
+		    sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+			swap_read_unplug(sio);
+			sio = NULL;
+		}
+	}
+	if (!sio) {
+		sio = mempool_alloc(sio_pool, GFP_KERNEL);
+		init_sync_kiocb(&sio->iocb, sis->swap_file);
+		sio->iocb.ki_pos = pos;
+		sio->iocb.ki_complete = sio_read_complete;
+		sio->pages = 0;
+	}
+	sio->bvec[sio->pages].bv_page = page;
+	sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+	sio->bvec[sio->pages].bv_offset = 0;
+	sio->pages += 1;
+	if (sio->pages == ARRAY_SIZE(sio->bvec) || !plug) {
+		swap_read_unplug(sio);
+		sio = NULL;
+	}
+	if (plug)
+		*plug = sio;
 }
 
-int swap_readpage(struct page *page, bool synchronous)
+int swap_readpage(struct page *page, bool synchronous,
+		  struct swap_iocb **plug)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -455,7 +470,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 
 	if (data_race(sis->flags & SWP_FS_OPS)) {
-		ret = swap_readpage_fs(page);
+		swap_readpage_fs(page, plug);
 		goto out;
 	}
 
@@ -507,3 +522,16 @@ int swap_readpage(struct page *page, bool synchronous)
 	delayacct_swapin_end();
 	return ret;
 }
+
+void __swap_read_unplug(struct swap_iocb *sio)
+{
+	struct iov_iter from;
+	struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+	int ret;
+
+	iov_iter_bvec(&from, READ, sio->bvec, sio->pages,
+		      PAGE_SIZE * sio->pages);
+	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+	if (ret != -EIOCBQUEUED)
+		sio_read_complete(&sio->iocb, ret);
+}
diff --git a/mm/swap.h b/mm/swap.h
index e8ee995cf8d8..0c79b2478f3f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -4,7 +4,15 @@
 
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
-int swap_readpage(struct page *page, bool do_poll);
+struct swap_iocb;
+int swap_readpage(struct page *page, bool do_poll,
+		  struct swap_iocb **plug);
+void __swap_read_unplug(struct swap_iocb *plug);
+static inline void swap_read_unplug(struct swap_iocb *plug)
+{
+	if (unlikely(plug))
+		__swap_read_unplug(plug);
+}
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void end_swap_bio_write(struct bio *bio);
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -38,7 +46,8 @@ struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
 struct page *read_swap_cache_async(swp_entry_t, gfp_t,
 				   struct vm_area_struct *vma,
 				   unsigned long addr,
-				   bool do_poll);
+				   bool do_poll,
+				   struct swap_iocb **plug);
 struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
 				     struct vm_area_struct *vma,
 				     unsigned long addr,
@@ -53,7 +62,8 @@ static inline unsigned int page_swap_flags(struct page *page)
 	return page_swap_info(page)->flags;
 }
 #else /* CONFIG_SWAP */
-static inline int swap_readpage(struct page *page, bool do_poll)
+static inline int swap_readpage(struct page *page, bool do_poll,
+				struct swap_iocb **plug);
 {
 	return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d541594be1c3..5cb2c75fa247 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -520,14 +520,16 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  * the swap entry is no longer in use.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-		struct vm_area_struct *vma, unsigned long addr, bool do_poll)
+				   struct vm_area_struct *vma,
+				   unsigned long addr, bool do_poll,
+				   struct swap_iocb **plug)
 {
 	bool page_was_allocated;
 	struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
 			vma, addr, &page_was_allocated);
 
 	if (page_was_allocated)
-		swap_readpage(retpage, do_poll);
+		swap_readpage(retpage, do_poll, plug);
 
 	return retpage;
 }
@@ -621,6 +623,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long mask;
 	struct swap_info_struct *si = swp_swap_info(entry);
 	struct blk_plug plug;
+	struct swap_iocb *splug = NULL;
 	bool do_poll = true, page_allocated;
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
@@ -647,7 +650,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
+			swap_readpage(page, false, &splug);
 			if (offset != entry_offset) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
@@ -658,8 +661,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
-	page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+	page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll,
+				     &splug);
 	blk_finish_plug(&plug);
+	swap_read_unplug(splug);
 	return page;
 }
 
@@ -791,6 +796,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 				       struct vm_fault *vmf)
 {
 	struct blk_plug plug;
+	struct swap_iocb *splug = NULL;
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
 	pte_t *pte, pentry;
@@ -821,7 +827,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
+			swap_readpage(page, false, &splug);
 			if (i != ra_info.offset) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
@@ -832,8 +838,9 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 	lru_add_drain();
 skip:
 	page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
-				     ra_info.win == 1);
+				     ra_info.win == 1, &splug);
 	blk_finish_plug(&plug);
+	swap_read_unplug(splug);
 	return page;
 }
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (20 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:55   ` Christoph Hellwig
  2022-01-24 10:29   ` kernel test robot
  2022-01-24  3:48 ` [PATCH 13/23] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
  2022-02-07 17:55 ` [PATCH 00/23 V3] Repair SWAP-over_NFS Geert Uytterhoeven
  23 siblings, 2 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

swap_writepage() is given one page at a time, but may be called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
are single page reads.

With this patch we pass a pointer-to-pointer via the wbc.
swap_writepage can store state between calls - much like the pointer
passed explicitly to swap_readpage.  After calling swap_writepage() some
number of times, the state will be passed to swap_write_unplug() which
can submit the combined request.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/writeback.h |    7 +++
 mm/page_io.c              |  103 +++++++++++++++++++++++++++++----------------
 mm/swap.h                 |    1 
 mm/vmscan.c               |    9 +++-
 4 files changed, 82 insertions(+), 38 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..6dcaa0639c0d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -80,6 +80,13 @@ struct writeback_control {
 
 	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
 
+	/* To enable batching of swap writes to non-block-device backends,
+	 * "plug" can be set point to a 'struct swap_iocb *'.  When all swap
+	 * writes have been submitted, if with swap_iocb is not NULL,
+	 * swap_write_unplug() should be called.
+	 */
+	struct swap_iocb **plug;
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */
diff --git a/mm/page_io.c b/mm/page_io.c
index bcf655d650c8..b61c2cafc4f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -307,56 +307,74 @@ int sio_pool_init(void)
 static void sio_write_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
-	struct page *page = sio->bvec[0].bv_page;
+	int p;
 
-	if (ret != 0 && ret != PAGE_SIZE) {
-		/*
-		 * In the case of swap-over-nfs, this can be a
-		 * temporary failure if the system has limited
-		 * memory for allocating transmit buffers.
-		 * Mark the page dirty and avoid
-		 * folio_rotate_reclaimable but rate-limit the
-		 * messages but do not flag PageError like
-		 * the normal direct-to-bio case as it could
-		 * be temporary.
-		 */
-		set_page_dirty(page);
-		ClearPageReclaim(page);
-		pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
-				   ret, page_file_offset(page));
-	} else
-		count_vm_event(PSWPOUT);
-	end_page_writeback(page);
+	for (p = 0; p < sio->pages; p++) {
+		struct page *page = sio->bvec[p].bv_page;
+
+		if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+			/*
+			 * In the case of swap-over-nfs, this can be a
+			 * temporary failure if the system has limited
+			 * memory for allocating transmit buffers.
+			 * Mark the page dirty and avoid
+			 * folio_rotate_reclaimable but rate-limit the
+			 * messages but do not flag PageError like
+			 * the normal direct-to-bio case as it could
+			 * be temporary.
+			 */
+			set_page_dirty(page);
+			ClearPageReclaim(page);
+			pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+					   ret, page_file_offset(page));
+		} else
+			count_vm_event(PSWPOUT);
+		end_page_writeback(page);
+	}
 	mempool_free(sio, sio_pool);
 }
 
 static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
 {
-	struct swap_iocb *sio;
+	struct swap_iocb *sio = NULL;
 	struct swap_info_struct *sis = page_swap_info(page);
 	struct file *swap_file = sis->swap_file;
-	struct address_space *mapping = swap_file->f_mapping;
-	struct iov_iter from;
-	int ret;
+	loff_t pos = page_file_offset(page);
 
 	set_page_writeback(page);
 	unlock_page(page);
-	sio = mempool_alloc(sio_pool, GFP_NOIO);
-	init_sync_kiocb(&sio->iocb, swap_file);
-	sio->iocb.ki_complete = sio_write_complete;
-	sio->iocb.ki_pos = page_file_offset(page);
-	sio->bvec[0].bv_page = page;
-	sio->bvec[0].bv_len = PAGE_SIZE;
-	sio->bvec[0].bv_offset = 0;
-	iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
-	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
-	if (ret != -EIOCBQUEUED)
-		sio_write_complete(&sio->iocb, ret);
-	return ret;
+	if (wbc->plug)
+		sio = *wbc->plug;
+	if (sio) {
+		if (sio->iocb.ki_filp != swap_file ||
+		    sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+			swap_write_unplug(sio);
+			sio = NULL;
+		}
+	}
+	if (!sio) {
+		sio = mempool_alloc(sio_pool, GFP_NOIO);
+		init_sync_kiocb(&sio->iocb, swap_file);
+		sio->iocb.ki_complete = sio_write_complete;
+		sio->iocb.ki_pos = pos;
+		sio->pages = 0;
+	}
+	sio->bvec[sio->pages].bv_page = page;
+	sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+	sio->bvec[sio->pages].bv_offset = 0;
+	sio->pages += 1;
+	if (sio->pages == ARRAY_SIZE(sio->bvec) || !wbc->plug) {
+		swap_write_unplug(sio);
+		sio = NULL;
+	}
+	if (wbc->plug)
+		*wbc->plug = sio;
+
+	return 0;
 }
 
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
-		bio_end_io_t end_write_func)
+		     bio_end_io_t end_write_func)
 {
 	struct bio *bio;
 	int ret;
@@ -388,6 +406,19 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return 0;
 }
 
+void swap_write_unplug(struct swap_iocb *sio)
+{
+	struct iov_iter from;
+	struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+	int ret;
+
+	iov_iter_bvec(&from, WRITE, sio->bvec, sio->pages,
+		      PAGE_SIZE * sio->pages);
+	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+	if (ret != -EIOCBQUEUED)
+		sio_write_complete(&sio->iocb, ret);
+}
+
 static void sio_read_complete(struct kiocb *iocb, long ret)
 {
 	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
diff --git a/mm/swap.h b/mm/swap.h
index 0c79b2478f3f..0194ac153d40 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -13,6 +13,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 	if (unlikely(plug))
 		__swap_read_unplug(plug);
 }
+void swap_write_unplug(struct swap_iocb *sio);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void end_swap_bio_write(struct bio *bio);
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ad5026d06aa8..f75c71490921 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1164,7 +1164,8 @@ typedef enum {
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping,
+			 struct swap_iocb **plug)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -1211,6 +1212,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
 			.range_start = 0,
 			.range_end = LLONG_MAX,
 			.for_reclaim = 1,
+			.plug = plug,
 		};
 
 		SetPageReclaim(page);
@@ -1537,6 +1539,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
+	struct swap_iocb *plug = NULL;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
@@ -1817,7 +1820,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
-			switch (pageout(page, mapping)) {
+			switch (pageout(page, mapping, &plug)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -1971,6 +1974,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 
+	if (plug)
+		swap_write_unplug(plug);
 	return nr_reclaimed;
 }
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (19 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 21/23] NFSv4: keep state manager thread active if swap is enabled NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:56   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Currently various places test if direct IO is possible on a file by
checking for the existence of the direct_IO address space operation.
This is a poor choice, as the direct_IO operation may not be used - it is
only used if the generic_file_*_iter functions are called for direct IO
and some filesystems - particularly NFS - don't do this.

Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
various places to check this (avoiding pointer dereferences).
do_dentry_open() will set this flag if ->direct_IO is present, so
filesystems do not need to be changed.

NFS *is* changed, to set the flag explicitly and discard the direct_IO
entry in the address_space_operations for files.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/block/loop.c |    4 ++--
 fs/fcntl.c           |    9 ++++-----
 fs/nfs/file.c        |    3 ++-
 fs/open.c            |    9 ++++-----
 fs/overlayfs/file.c  |   13 ++++---------
 include/linux/fs.h   |    3 +++
 6 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 01cbbfc4e9e2..a2609dd79370 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -184,8 +184,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	if (dio) {
 		if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
-				!(lo->lo_offset & dio_align) &&
-				mapping->a_ops->direct_IO)
+		    !(lo->lo_offset & dio_align) &&
+		    (file->f_mode & FMODE_CAN_ODIRECT))
 			use_dio = true;
 		else
 			use_dio = false;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9c6c6a3e2de5..11e665242a76 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -56,11 +56,10 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 		   arg |= O_NONBLOCK;
 
 	/* Pipe packetized mode is controlled by O_DIRECT flag */
-	if (!S_ISFIFO(inode->i_mode) && (arg & O_DIRECT)) {
-		if (!filp->f_mapping || !filp->f_mapping->a_ops ||
-			!filp->f_mapping->a_ops->direct_IO)
-				return -EINVAL;
-	}
+	if (!S_ISFIFO(inode->i_mode) &&
+	    (arg & O_DIRECT) &&
+	    !(filp->f_mode & FMODE_CAN_ODIRECT))
+		return -EINVAL;
 
 	if (filp->f_op->check_flags)
 		error = filp->f_op->check_flags(arg);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 3dbef2c31567..9e2def045111 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -74,6 +74,8 @@ nfs_file_open(struct inode *inode, struct file *filp)
 		return res;
 
 	res = nfs_open(inode, filp);
+	if (res == 0)
+		filp->f_mode |= FMODE_CAN_ODIRECT;
 	return res;
 }
 
@@ -535,7 +537,6 @@ const struct address_space_operations nfs_file_aops = {
 	.write_end = nfs_write_end,
 	.invalidatepage = nfs_invalidate_page,
 	.releasepage = nfs_release_page,
-	.direct_IO = nfs_direct_IO,
 #ifdef CONFIG_MIGRATION
 	.migratepage = nfs_migrate_page,
 #endif
diff --git a/fs/open.c b/fs/open.c
index 9ff2f621b760..76ddf9014499 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -834,17 +834,16 @@ static int do_dentry_open(struct file *f,
 	if ((f->f_mode & FMODE_WRITE) &&
 	     likely(f->f_op->write || f->f_op->write_iter))
 		f->f_mode |= FMODE_CAN_WRITE;
+	if (f->f_mapping->a_ops && f->f_mapping->a_ops->direct_IO)
+		f->f_mode |= FMODE_CAN_ODIRECT;
 
 	f->f_write_hint = WRITE_LIFE_NOT_SET;
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
 
-	/* NB: we're sure to have correct a_ops only after f_op->open */
-	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
-			return -EINVAL;
-	}
+	if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
+		return -EINVAL;
 
 	/*
 	 * XXX: Huge page cache doesn't support writing yet. Drop all page
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index fa125feed0ff..9d69b4dbb8c4 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -82,11 +82,8 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 	if (((flags ^ file->f_flags) & O_APPEND) && IS_APPEND(inode))
 		return -EPERM;
 
-	if (flags & O_DIRECT) {
-		if (!file->f_mapping->a_ops ||
-		    !file->f_mapping->a_ops->direct_IO)
-			return -EINVAL;
-	}
+	if ((flags & O_DIRECT) && !(file->f_mode & FMODE_CAN_ODIRECT))
+		return -EINVAL;
 
 	if (file->f_op->check_flags) {
 		err = file->f_op->check_flags(flags);
@@ -306,8 +303,7 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	ret = -EINVAL;
 	if (iocb->ki_flags & IOCB_DIRECT &&
-	    (!real.file->f_mapping->a_ops ||
-	     !real.file->f_mapping->a_ops->direct_IO))
+	    !(real.file->f_mode & FMODE_CAN_ODIRECT))
 		goto out_fdput;
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
@@ -367,8 +363,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	ret = -EINVAL;
 	if (iocb->ki_flags & IOCB_DIRECT &&
-	    (!real.file->f_mapping->a_ops ||
-	     !real.file->f_mapping->a_ops->direct_IO))
+	    !(real.file->f_mode & FMODE_CAN_ODIRECT))
 		goto out_fdput;
 
 	if (!ovl_should_sync(OVL_FS(inode->i_sb)))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4fade3b20c87..48c021cbe327 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -161,6 +161,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is stream-like */
 #define FMODE_STREAM		((__force fmode_t)0x200000)
 
+/* File supports DIRECT IO */
+#define	FMODE_CAN_ODIRECT	((__force fmode_t)0x400000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 12/23] NFS: remove IS_SWAPFILE hack
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (14 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:56   ` Christoph Hellwig
  2022-01-24  3:48 ` [PATCH 19/23] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

This code is pointless as IS_SWAPFILE is always defined.
So remove it.

Suggested-by: Mark Hemment <markhemm@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 9e2def045111..4d4750738aeb 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -44,11 +44,6 @@
 
 static const struct vm_operations_struct nfs_file_vm_ops;
 
-/* Hack for future NFS swap support */
-#ifndef IS_SWAPFILE
-# define IS_SWAPFILE(inode)	(0)
-#endif
-
 int nfs_check_flags(int flags)
 {
 	if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 13/23] NFS: rename nfs_direct_IO and use as ->swap_rw
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (21 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:57   ` Christoph Hellwig
  2022-02-07 17:55 ` [PATCH 00/23 V3] Repair SWAP-over_NFS Geert Uytterhoeven
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a
while.  We now need a ->swap_rw function which behaves slightly
differently, returning zero for success rather than a byte count.

So modify nfs_direct_IO accordingly, rename it, and use it as the
->swap_rw function.

Note: it still won't work - that will be fixed in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c        |   23 ++++++++++-------------
 fs/nfs/file.c          |    5 +----
 include/linux/nfs_fs.h |    2 +-
 3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..b929dd5b0c3a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -153,28 +153,25 @@ nfs_direct_count_bytes(struct nfs_direct_req *dreq,
 }
 
 /**
- * nfs_direct_IO - NFS address space operation for direct I/O
+ * nfs_swap_rw - NFS address space operation for swap I/O
  * @iocb: target I/O control block
  * @iter: I/O buffer
  *
- * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, for most direct IO, we
- * shunt off direct read and write requests before the VFS gets them,
- * so this method is only ever called for swap.
+ * Perform IO to the swap-file.  This is much like direct IO.
  */
-ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct inode *inode = iocb->ki_filp->f_mapping->host;
-
-	/* we only support swap file calling nfs_direct_IO */
-	if (!IS_SWAPFILE(inode))
-		return 0;
+	ssize_t ret;
 
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
 	if (iov_iter_rw(iter) == READ)
-		return nfs_file_direct_read(iocb, iter);
-	return nfs_file_direct_write(iocb, iter);
+		ret = nfs_file_direct_read(iocb, iter);
+	else
+		ret = nfs_file_direct_write(iocb, iter);
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 
 static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 4d4750738aeb..91ff9ed05b06 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -489,10 +489,6 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
 	struct inode *inode = file->f_mapping->host;
 
-	if (!file->f_mapping->a_ops->swap_rw)
-		/* Cannot support swap */
-		return -EINVAL;
-
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
 	isize = inode->i_size;
@@ -540,6 +536,7 @@ const struct address_space_operations nfs_file_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = nfs_swap_activate,
 	.swap_deactivate = nfs_swap_deactivate,
+	.swap_rw = nfs_swap_rw,
 };
 
 /*
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 00835bacd236..29a5e579f26f 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -509,7 +509,7 @@ static inline const struct cred *nfs_file_cred(struct file *file)
 /*
  * linux/fs/nfs/direct.c
  */
-extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
+extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
 			struct iov_iter *iter);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
  2022-01-24  3:48 ` [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
  2022-01-24  3:48 ` [PATCH 03/23] MM: drop swap_set_page_dirty NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  8:58   ` Christoph Hellwig
  2022-01-24 13:22   ` Mark Hemment
  2022-01-24  3:48 ` [PATCH 22/23] NFS: swap-out must always use STABLE writes NeilBrown
                   ` (20 subsequent siblings)
  23 siblings, 2 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
   possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
   eventuate if a buffered read on the swapfile was attempted.

   We don't need coherence with the page cache for a swap file, and
   buffered writes are forbidden anyway.  There is no other need for
   i_rwsem during direct IO.  So never take it for swap_rw()

2/ generic_write_checks() explicitly forbids writes to swap, and
   performs checks that are not needed for swap.  So bypass it
   for swap_rw().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c        |   30 +++++++++++++++++++++---------
 fs/nfs/file.c          |    4 ++--
 include/linux/nfs_fs.h |    4 ++--
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index b929dd5b0c3a..43a956d7fd62 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -166,9 +166,9 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
 	if (iov_iter_rw(iter) == READ)
-		ret = nfs_file_direct_read(iocb, iter);
+		ret = nfs_file_direct_read(iocb, iter, true);
 	else
-		ret = nfs_file_direct_write(iocb, iter);
+		ret = nfs_file_direct_write(iocb, iter, true);
 	if (ret < 0)
 		return ret;
 	return 0;
@@ -422,6 +422,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
  * nfs_file_direct_read - file direct read operation for NFS files
  * @iocb: target I/O control block
  * @iter: vector of user buffers into which to read data
+ * @swap: flag indicating this is swap IO, not O_DIRECT IO
  *
  * We use this function for direct reads instead of calling
  * generic_file_aio_read() in order to avoid gfar's check to see if
@@ -437,7 +438,8 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
  * client must read the updated atime from the server back into its
  * cache.
  */
-ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
+			     bool swap)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
@@ -479,12 +481,14 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
 	if (iter_is_iovec(iter))
 		dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;
 
-	nfs_start_io_direct(inode);
+	if (!swap)
+		nfs_start_io_direct(inode);
 
 	NFS_I(inode)->read_io += count;
 	requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);
 
-	nfs_end_io_direct(inode);
+	if (!swap)
+		nfs_end_io_direct(inode);
 
 	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
@@ -873,6 +877,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  * nfs_file_direct_write - file direct write operation for NFS files
  * @iocb: target I/O control block
  * @iter: vector of user buffers from which to write data
+ * @swap: flag indicating this is swap IO, not O_DIRECT IO
  *
  * We use this function for direct writes instead of calling
  * generic_file_aio_write() in order to avoid taking the inode
@@ -889,7 +894,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
  * Note that O_APPEND is not supported for NFS direct writes, as there
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
-ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+			      bool swap)
 {
 	ssize_t result, requested;
 	size_t count;
@@ -903,7 +909,11 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
 
-	result = generic_write_checks(iocb, iter);
+	if (!swap)
+		result = generic_write_checks(iocb, iter);
+	else
+		/* bypass generic checks */
+		result =  iov_iter_count(iter);
 	if (result <= 0)
 		return result;
 	count = result;
@@ -934,7 +944,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 		dreq->iocb = iocb;
 	pnfs_init_ds_commit_info_ops(&dreq->ds_cinfo, inode);
 
-	nfs_start_io_direct(inode);
+	if (!swap)
+		nfs_start_io_direct(inode);
 
 	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
 
@@ -943,7 +954,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 					      pos >> PAGE_SHIFT, end);
 	}
 
-	nfs_end_io_direct(inode);
+	if (!swap)
+		nfs_end_io_direct(inode);
 
 	if (requested > 0) {
 		result = nfs_direct_wait(dreq);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 91ff9ed05b06..04ba56f223d3 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -159,7 +159,7 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t result;
 
 	if (iocb->ki_flags & IOCB_DIRECT)
-		return nfs_file_direct_read(iocb, to);
+		return nfs_file_direct_read(iocb, to, false);
 
 	dprintk("NFS: read(%pD2, %zu@%lu)\n",
 		iocb->ki_filp,
@@ -625,7 +625,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 		return result;
 
 	if (iocb->ki_flags & IOCB_DIRECT)
-		return nfs_file_direct_write(iocb, from);
+		return nfs_file_direct_write(iocb, from, false);
 
 	dprintk("NFS: write(%pD2, %zu@%Ld)\n",
 		file, iov_iter_count(from), (long long) iocb->ki_pos);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 29a5e579f26f..aba38dc4fd29 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -511,9 +511,9 @@ static inline const struct cred *nfs_file_cred(struct file *file)
  */
 extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
-			struct iov_iter *iter);
+				    struct iov_iter *iter, bool swap);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
-			struct iov_iter *iter);
+				     struct iov_iter *iter, bool swap);
 
 /*
  * linux/fs/nfs/dir.c




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 15/23] SUNRPC/call_alloc: async tasks mustn't block waiting for memory
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (10 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 06/23] MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 20/23] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sched.c              |    4 +++-
 net/sunrpc/xprtrdma/transport.c |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index e2c835482791..d5b6e897f5a5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1023,8 +1023,10 @@ int rpc_malloc(struct rpc_task *task)
 	struct rpc_buffer *buf;
 	gfp_t gfp = GFP_NOFS;
 
+	if (RPC_IS_ASYNC(task))
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
 	if (RPC_IS_SWAPPER(task))
-		gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..a52277115500 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -574,8 +574,10 @@ xprt_rdma_allocate(struct rpc_task *task)
 	gfp_t flags;
 
 	flags = RPCRDMA_DEF_GFP;
+	if (RPC_IS_ASYNC(task))
+		flags = GFP_NOWAIT | __GFP_NOWARN;
 	if (RPC_IS_SWAPPER(task))
-		flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+		flags |= __GFP_MEMALLOC;
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 16/23] SUNRPC/auth: async tasks mustn't block waiting for memory
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (7 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 04/23] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate NeilBrown
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running.  If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks.  So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory.  If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again.  HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/auth.h    |    1 +
 net/sunrpc/auth.c              |    6 +++++-
 net/sunrpc/auth_gss/auth_gss.c |    6 +++++-
 net/sunrpc/auth_unix.c         |   10 ++++++++--
 net/sunrpc/clnt.c              |    3 +++
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 98da816b5fc2..3e6ce288a7fc 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -99,6 +99,7 @@ struct rpc_auth_create_args {
 
 /* Flags for rpcauth_lookupcred() */
 #define RPCAUTH_LOOKUP_NEW		0x01	/* Accept an uninitialised cred */
+#define RPCAUTH_LOOKUP_ASYNC		0x02	/* Don't block waiting for memory */
 
 /*
  * Client authentication ops
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..6bfa19f9fa6a 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -615,6 +615,8 @@ rpcauth_bind_root_cred(struct rpc_task *task, int lookupflags)
 	};
 	struct rpc_cred *ret;
 
+	if (RPC_IS_ASYNC(task))
+		lookupflags |= RPCAUTH_LOOKUP_ASYNC;
 	ret = auth->au_ops->lookup_cred(auth, &acred, lookupflags);
 	put_cred(acred.cred);
 	return ret;
@@ -631,6 +633,8 @@ rpcauth_bind_machine_cred(struct rpc_task *task, int lookupflags)
 
 	if (!acred.principal)
 		return NULL;
+	if (RPC_IS_ASYNC(task))
+		lookupflags |= RPCAUTH_LOOKUP_ASYNC;
 	return auth->au_ops->lookup_cred(auth, &acred, lookupflags);
 }
 
@@ -654,7 +658,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
 	};
 
 	if (flags & RPC_TASK_ASYNC)
-		lookupflags |= RPCAUTH_LOOKUP_NEW;
+		lookupflags |= RPCAUTH_LOOKUP_NEW | RPCAUTH_LOOKUP_ASYNC;
 	if (task->tk_op_cred)
 		/* Task must use exactly this rpc_cred */
 		new = get_rpccred(task->tk_op_cred);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 5f42aa5fc612..df72d6301f78 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1341,7 +1341,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
 static struct rpc_cred *
 gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
 {
-	return rpcauth_lookup_credcache(auth, acred, flags, GFP_NOFS);
+	gfp_t gfp = GFP_NOFS;
+
+	if (flags & RPCAUTH_LOOKUP_ASYNC)
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
+	return rpcauth_lookup_credcache(auth, acred, flags, gfp);
 }
 
 static struct rpc_cred *
diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
index e7df1f782b2e..e5819265dd1b 100644
--- a/net/sunrpc/auth_unix.c
+++ b/net/sunrpc/auth_unix.c
@@ -43,8 +43,14 @@ unx_destroy(struct rpc_auth *auth)
 static struct rpc_cred *
 unx_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
 {
-	struct rpc_cred *ret = mempool_alloc(unix_pool, GFP_NOFS);
-
+	gfp_t gfp = GFP_NOFS;
+	struct rpc_cred *ret;
+
+	if (flags & RPCAUTH_LOOKUP_ASYNC)
+		gfp = GFP_NOWAIT | __GFP_NOWARN;
+	ret = mempool_alloc(unix_pool, gfp);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
 	rpcauth_init_cred(ret, acred, auth, &unix_credops);
 	ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
 	return ret;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a312ea2bc440..238b2ef5491f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1745,6 +1745,9 @@ call_refreshresult(struct rpc_task *task)
 		task->tk_cred_retry--;
 		trace_rpc_retry_refresh_status(task);
 		return;
+	case -ENOMEM:
+		rpc_delay(task, HZ >> 4);
+		return;
 	}
 	trace_rpc_refresh_status(task);
 	rpc_call_rpcerror(task, status);




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 17/23] SUNRPC/xprt: async tasks mustn't block waiting for memory
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (16 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 19/23] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 18/23] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.  So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
workqueue threads and NFS can deadlock.  So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block.  However it sets the error
to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
causes immediate retry.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/xprt.c               |    5 ++++-
 net/sunrpc/xprtrdma/transport.c |    2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a02de2bddb28..47d207e416ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1687,12 +1687,15 @@ static bool xprt_throttle_congested(struct rpc_xprt *xprt, struct rpc_task *task
 static struct rpc_rqst *xprt_dynamic_alloc_slot(struct rpc_xprt *xprt)
 {
 	struct rpc_rqst *req = ERR_PTR(-EAGAIN);
+	gfp_t gfp_mask = GFP_NOFS;
 
 	if (xprt->num_reqs >= xprt->max_reqs)
 		goto out;
 	++xprt->num_reqs;
 	spin_unlock(&xprt->reserve_lock);
-	req = kzalloc(sizeof(struct rpc_rqst), GFP_NOFS);
+	if (current->flags & PF_WQ_WORKER)
+		gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
+	req = kzalloc(sizeof(struct rpc_rqst), gfp_mask);
 	spin_lock(&xprt->reserve_lock);
 	if (req != NULL)
 		goto out;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a52277115500..32df23796747 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -521,7 +521,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 	return;
 
 out_sleep:
-	task->tk_status = -EAGAIN;
+	task->tk_status = -ENOMEM;
 	xprt_add_backlog(xprt, task);
 }
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 18/23] SUNRPC: remove scheduling boost for "SWAPPER" tasks.
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (17 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 17/23] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 21/23] NFSv4: keep state manager thread active if swap is enabled NeilBrown
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks.  So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it.  If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sched.c |    7 -------
 net/sunrpc/xprt.c  |   11 -----------
 2 files changed, 18 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d5b6e897f5a5..256302bf6557 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -186,11 +186,6 @@ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue,
 
 /*
  * Add new request to wait queue.
- *
- * Swapper tasks always get inserted at the head of the queue.
- * This should avoid many nasty memory deadlocks and hopefully
- * improve overall performance.
- * Everyone else gets appended to the queue to ensure proper FIFO behavior.
  */
 static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
 		struct rpc_task *task,
@@ -199,8 +194,6 @@ static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
 	INIT_LIST_HEAD(&task->u.tk_wait.timer_list);
 	if (RPC_IS_PRIORITY(queue))
 		__rpc_add_wait_queue_priority(queue, task, queue_priority);
-	else if (RPC_IS_SWAPPER(task))
-		list_add(&task->u.tk_wait.list, &queue->tasks[0]);
 	else
 		list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]);
 	task->tk_waitqueue = queue;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 47d207e416ab..a0a2583fe941 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1354,17 +1354,6 @@ xprt_request_enqueue_transmit(struct rpc_task *task)
 				INIT_LIST_HEAD(&req->rq_xmit2);
 				goto out;
 			}
-		} else if (RPC_IS_SWAPPER(task)) {
-			list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
-				if (pos->rq_cong || pos->rq_bytes_sent)
-					continue;
-				if (RPC_IS_SWAPPER(pos->rq_task))
-					continue;
-				/* Note: req is added _before_ pos */
-				list_add_tail(&req->rq_xmit, &pos->rq_xmit);
-				INIT_LIST_HEAD(&req->rq_xmit2);
-				goto out;
-			}
 		} else if (!req->rq_seqno) {
 			list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
 				if (pos->rq_task->tk_owner != task->tk_owner)




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 19/23] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (15 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 12/23] NFS: remove IS_SWAPFILE hack NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 17/23] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests.  This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/read.c                 |    4 ----
 include/linux/nfs_fs.h        |    5 -----
 include/linux/sunrpc/sched.h  |    1 -
 include/trace/events/sunrpc.h |    1 -
 net/sunrpc/auth.c             |    2 +-
 5 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index eb00229c1a50..cd797ce3a67c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -194,10 +194,6 @@ static void nfs_initiate_read(struct nfs_pgio_header *hdr,
 			      const struct nfs_rpc_ops *rpc_ops,
 			      struct rpc_task_setup *task_setup_data, int how)
 {
-	struct inode *inode = hdr->inode;
-	int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
-
-	task_setup_data->flags |= swap_flags;
 	rpc_ops->read_setup(hdr, msg);
 	trace_nfs_initiate_read(hdr);
 }
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index aba38dc4fd29..9e87752bdd00 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -45,11 +45,6 @@
  */
 #define NFS_MAX_TRANSPORTS 16
 
-/*
- * These are the default flags for swap requests
- */
-#define NFS_RPC_SWAPFLAGS		(RPC_TASK_SWAPPER|RPC_TASK_ROOTCREDS)
-
 /*
  * Size of the NFS directory verifier
  */
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index db964bb63912..56710f8056d3 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -124,7 +124,6 @@ struct rpc_task_setup {
 #define RPC_TASK_MOVEABLE	0x0004		/* nfs4.1+ rpc tasks */
 #define RPC_TASK_NULLCREDS	0x0010		/* Use AUTH_NULL credential */
 #define RPC_CALL_MAJORSEEN	0x0020		/* major timeout seen */
-#define RPC_TASK_ROOTCREDS	0x0040		/* force root creds */
 #define RPC_TASK_DYNAMIC	0x0080		/* task was kmalloc'ed */
 #define	RPC_TASK_NO_ROUND_ROBIN	0x0100		/* send requests on "main" xprt */
 #define RPC_TASK_SOFT		0x0200		/* Use soft timeouts */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 1e566ac4b812..ef9e9351cb2f 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -311,7 +311,6 @@ TRACE_EVENT(rpc_request,
 		{ RPC_TASK_MOVEABLE, "MOVEABLE" },			\
 		{ RPC_TASK_NULLCREDS, "NULLCREDS" },			\
 		{ RPC_CALL_MAJORSEEN, "MAJORSEEN" },			\
-		{ RPC_TASK_ROOTCREDS, "ROOTCREDS" },			\
 		{ RPC_TASK_DYNAMIC, "DYNAMIC" },			\
 		{ RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN" },		\
 		{ RPC_TASK_SOFT, "SOFT" },				\
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 6bfa19f9fa6a..682fcd24bf43 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -670,7 +670,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
 	/* If machine cred couldn't be bound, try a root cred */
 	if (new)
 		;
-	else if (cred == &machine_cred || (flags & RPC_TASK_ROOTCREDS))
+	else if (cred == &machine_cred)
 		new = rpcauth_bind_root_cred(task, lookupflags);
 	else if (flags & RPC_TASK_NULLCREDS)
 		new = authnull_ops.lookup_cred(NULL, NULL, 0);




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 20/23] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (11 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 15/23] SUNRPC/call_alloc: async tasks mustn't block waiting for memory NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 01/23] MM: create new mm/swap.h header file NeilBrown
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

rpc tasks can be marked as RPC_TASK_SWAPPER.  This causes GFP_MEMALLOC
to be used for some allocations.  This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.

Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.

GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.

xdr_alloc_bvec is called while the XPRT_LOCK is held.  If this blocks,
then it blocks all other queued tasks.  So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.

Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.

So with this patch:
 1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
 2/ __rpc_execute() sets PF_MEMALLOC while handling any task
    with RPC_TASK_SWAPPER set, or when handling any task that
    holds the XPRT_LOCKED lock on an xprt used for swap.
    This removes the need for the RPC_IS_SWAPPER() test
    in ->buf_alloc handlers.
 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
    any task to a swapper xprt.  __rpc_execute() will clear it.
 3/ PF_MEMALLOC is set for all the connect workers.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/write.c                  |    2 ++
 net/sunrpc/clnt.c               |    2 --
 net/sunrpc/sched.c              |   20 +++++++++++++++++---
 net/sunrpc/xprt.c               |    3 +++
 net/sunrpc/xprtrdma/transport.c |    6 ++++--
 net/sunrpc/xprtsock.c           |    8 ++++++++
 6 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 987a187bd39a..9f7176745fef 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1409,6 +1409,8 @@ static void nfs_initiate_write(struct nfs_pgio_header *hdr,
 {
 	int priority = flush_task_priority(how);
 
+	if (IS_SWAPFILE(hdr->inode))
+		task_setup_data->flags |= RPC_TASK_SWAPPER;
 	task_setup_data->priority = priority;
 	rpc_ops->write_setup(hdr, msg, &task_setup_data->rpc_client);
 	trace_nfs_initiate_write(hdr);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 238b2ef5491f..cb76fbea3ed5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1085,8 +1085,6 @@ void rpc_task_set_client(struct rpc_task *task, struct rpc_clnt *clnt)
 		task->tk_flags |= RPC_TASK_TIMEOUT;
 	if (clnt->cl_noretranstimeo)
 		task->tk_flags |= RPC_TASK_NO_RETRANS_TIMEOUT;
-	if (atomic_read(&clnt->cl_swapper))
-		task->tk_flags |= RPC_TASK_SWAPPER;
 	/* Add to the client's list of all tasks */
 	spin_lock(&clnt->cl_lock);
 	list_add_tail(&task->tk_task, &clnt->cl_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 256302bf6557..9020cedb7c95 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -869,6 +869,15 @@ void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata)
 		ops->rpc_release(calldata);
 }
 
+static bool xprt_needs_memalloc(struct rpc_xprt *xprt, struct rpc_task *tk)
+{
+	if (!xprt)
+		return false;
+	if (!atomic_read(&xprt->swapper))
+		return false;
+	return test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == tk;
+}
+
 /*
  * This is the RPC `scheduler' (or rather, the finite state machine).
  */
@@ -877,6 +886,7 @@ static void __rpc_execute(struct rpc_task *task)
 	struct rpc_wait_queue *queue;
 	int task_is_async = RPC_IS_ASYNC(task);
 	int status = 0;
+	unsigned long pflags = current->flags;
 
 	WARN_ON_ONCE(RPC_IS_QUEUED(task));
 	if (RPC_IS_QUEUED(task))
@@ -899,6 +909,10 @@ static void __rpc_execute(struct rpc_task *task)
 		}
 		if (!do_action)
 			break;
+		if (RPC_IS_SWAPPER(task) ||
+		    xprt_needs_memalloc(task->tk_xprt, task))
+			current->flags |= PF_MEMALLOC;
+
 		trace_rpc_task_run_action(task, do_action);
 		do_action(task);
 
@@ -936,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
 		rpc_clear_running(task);
 		spin_unlock(&queue->lock);
 		if (task_is_async)
-			return;
+			goto out;
 
 		/* sync task: sleep here */
 		trace_rpc_task_sync_sleep(task, task->tk_action);
@@ -960,6 +974,8 @@ static void __rpc_execute(struct rpc_task *task)
 
 	/* Release all resources associated with the task */
 	rpc_release_task(task);
+out:
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /*
@@ -1018,8 +1034,6 @@ int rpc_malloc(struct rpc_task *task)
 
 	if (RPC_IS_ASYNC(task))
 		gfp = GFP_NOWAIT | __GFP_NOWARN;
-	if (RPC_IS_SWAPPER(task))
-		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a0a2583fe941..0614e7463d4b 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1492,6 +1492,9 @@ bool xprt_prepare_transmit(struct rpc_task *task)
 		return false;
 
 	}
+	if (atomic_read(&xprt->swapper))
+		/* This will be clear in __rpc_execute */
+		current->flags |= PF_MEMALLOC;
 	return true;
 }
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 32df23796747..256b06a92391 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -239,8 +239,11 @@ xprt_rdma_connect_worker(struct work_struct *work)
 	struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
 						   rx_connect_worker.work);
 	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+	unsigned int pflags = current->flags;
 	int rc;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	rc = rpcrdma_xprt_connect(r_xprt);
 	xprt_clear_connecting(xprt);
 	if (!rc) {
@@ -254,6 +257,7 @@ xprt_rdma_connect_worker(struct work_struct *work)
 		rpcrdma_xprt_disconnect(r_xprt);
 	xprt_unlock_connect(xprt, r_xprt);
 	xprt_wake_pending_tasks(xprt, rc);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
@@ -576,8 +580,6 @@ xprt_rdma_allocate(struct rpc_task *task)
 	flags = RPCRDMA_DEF_GFP;
 	if (RPC_IS_ASYNC(task))
 		flags = GFP_NOWAIT | __GFP_NOWARN;
-	if (RPC_IS_SWAPPER(task))
-		flags |= __GFP_MEMALLOC;
 
 	if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
 				  flags))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d8ee06a9650a..9d34c71004fa 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2047,7 +2047,10 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock;
 	int status = -EIO;
+	unsigned int pflags = current->flags;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	sock = xs_create_sock(xprt, transport,
 			xs_addr(xprt)->sa_family, SOCK_DGRAM,
 			IPPROTO_UDP, false);
@@ -2067,6 +2070,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	xprt_clear_connecting(xprt);
 	xprt_unlock_connect(xprt, transport);
 	xprt_wake_pending_tasks(xprt, status);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**
@@ -2226,7 +2230,10 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	struct socket *sock = transport->sock;
 	struct rpc_xprt *xprt = &transport->xprt;
 	int status;
+	unsigned int pflags = current->flags;
 
+	if (atomic_read(&xprt->swapper))
+		current->flags |= PF_MEMALLOC;
 	if (!sock) {
 		sock = xs_create_sock(xprt, transport,
 				xs_addr(xprt)->sa_family, SOCK_STREAM,
@@ -2291,6 +2298,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	xprt_clear_connecting(xprt);
 out_unlock:
 	xprt_unlock_connect(xprt, transport);
+	current_restore_flags(pflags, PF_MEMALLOC);
 }
 
 /**




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 21/23] NFSv4: keep state manager thread active if swap is enabled
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (18 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 18/23] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag NeilBrown
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c           |   15 ++++++++++++---
 fs/nfs/nfs4_fs.h        |    1 +
 fs/nfs/nfs4proc.c       |   20 ++++++++++++++++++++
 fs/nfs/nfs4state.c      |   39 +++++++++++++++++++++++++++++++++------
 include/linux/nfs_xdr.h |    2 ++
 net/sunrpc/clnt.c       |    2 ++
 6 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 04ba56f223d3..ceacae8e7a38 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -486,8 +486,9 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	unsigned long blocks;
 	long long isize;
 	int ret;
-	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = file_inode(file);
+	struct rpc_clnt *clnt = NFS_CLIENT(inode);
+	struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
 
 	spin_lock(&inode->i_lock);
 	blocks = inode->i_blocks;
@@ -508,14 +509,22 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	}
 	*span = sis->pages;
 	sis->flags |= SWP_FS_OPS;
+
+	if (cl->rpc_ops->enable_swap)
+		cl->rpc_ops->enable_swap(inode);
+
 	return ret;
 }
 
 static void nfs_swap_deactivate(struct file *file)
 {
-	struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+	struct inode *inode = file_inode(file);
+	struct rpc_clnt *clnt = NFS_CLIENT(inode);
+	struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
 
 	rpc_clnt_swap_deactivate(clnt);
+	if (cl->rpc_ops->disable_swap)
+		cl->rpc_ops->disable_swap(file_inode(file));
 }
 
 const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ed5eaca6801e..8a9ce0f42efd 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
 	NFS4CLNT_LEASE_MOVED,
 	NFS4CLNT_DELEGATION_EXPIRED,
 	NFS4CLNT_RUN_MANAGER,
+	NFS4CLNT_MANAGER_AVAILABLE,
 	NFS4CLNT_RECALL_RUNNING,
 	NFS4CLNT_RECALL_ANY_LAYOUT_READ,
 	NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ee3bc79f6ca3..ab6382f9cbf0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10347,6 +10347,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
 	return error + error2 + error3;
 }
 
+static void nfs4_enable_swap(struct inode *inode)
+{
+	/* The state manager thread must always be running.
+	 * It will notice the client is a swapper, and stay put.
+	 */
+	struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+	nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+	/* The state manager thread will now exit once it is
+	 * woken.
+	 */
+	wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
 static const struct inode_operations nfs4_dir_inode_operations = {
 	.create		= nfs_create,
 	.lookup		= nfs_lookup,
@@ -10423,6 +10441,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
 	.free_client	= nfs4_free_client,
 	.create_server	= nfs4_create_server,
 	.clone_server	= nfs_clone_server,
+	.enable_swap	= nfs4_enable_swap,
+	.disable_swap	= nfs4_disable_swap,
 };
 
 static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index d88b779f9dd0..05d53c8a2ddb 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 {
 	struct task_struct *task;
 	char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+	struct rpc_clnt *cl = clp->cl_rpcclient;
+
+	while (cl != cl->cl_parent)
+		cl = cl->cl_parent;
 
 	set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
-	if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+	if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+		wake_up_var(&clp->cl_state);
 		return;
+	}
+	set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
 	__module_get(THIS_MODULE);
 	refcount_inc(&clp->cl_count);
 
@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 		printk(KERN_ERR "%s: kthread_run: %ld\n",
 			__func__, PTR_ERR(task));
 		nfs4_clear_state_manager_bit(clp);
+		clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
 		nfs_put_client(clp);
 		module_put(THIS_MODULE);
 	}
@@ -2665,11 +2673,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
 			clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
 		}
 
-		/* Did we race with an attempt to give us more work? */
-		if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
-			return;
-		if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
-			return;
+		return;
+
 	} while (refcount_read(&clp->cl_count) > 1 && !signalled());
 	goto out_drain;
 
@@ -2689,9 +2694,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
 static int nfs4_run_state_manager(void *ptr)
 {
 	struct nfs_client *clp = ptr;
+	struct rpc_clnt *cl = clp->cl_rpcclient;
+
+	while (cl != cl->cl_parent)
+		cl = cl->cl_parent;
 
 	allow_signal(SIGKILL);
+again:
+	set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
 	nfs4_state_manager(clp);
+	if (atomic_read(&cl->cl_swapper)) {
+		wait_var_event_interruptible(&clp->cl_state,
+					     test_bit(NFS4CLNT_RUN_MANAGER,
+						      &clp->cl_state));
+		if (atomic_read(&cl->cl_swapper) &&
+		    test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+			goto again;
+		/* Either no longer a swapper, or were signalled */
+	}
+	clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+	if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+	    test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+	    !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+		goto again;
+
 	nfs_put_client(clp);
 	module_put_and_kthread_exit(0);
 	return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 967a0098f0a9..04cf3a8fb949 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1795,6 +1795,8 @@ struct nfs_rpc_ops {
 	struct nfs_server *(*create_server)(struct fs_context *);
 	struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
 					   struct nfs_fattr *, rpc_authflavor_t);
+	void	(*enable_swap)(struct inode *inode);
+	void	(*disable_swap)(struct inode *inode);
 };
 
 /*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cb76fbea3ed5..4cb403a0f334 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3066,6 +3066,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
 int
 rpc_clnt_swap_activate(struct rpc_clnt *clnt)
 {
+	while (clnt != clnt->cl_parent)
+		clnt = clnt->cl_parent;
 	if (atomic_inc_return(&clnt->cl_swapper) == 1)
 		return rpc_clnt_iterate_for_each_xprt(clnt,
 				rpc_clnt_swap_activate_callback, NULL);




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 22/23] NFS: swap-out must always use STABLE writes.
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (2 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-26  3:45   ` Trond Myklebust
  2022-01-24  3:48 ` [PATCH 23/23] SUNRPC: lock against ->sock changing during sysfs read NeilBrown
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

The commit handling code is not safe against memory-pressure deadlocks
when writing to swap.  In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.

swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does.  However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.

For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_COND_STABLE.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/direct.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 43a956d7fd62..29c007b2a17a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -791,7 +791,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {
  */
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       struct iov_iter *iter,
-					       loff_t pos)
+					       loff_t pos, int ioflags)
 {
 	struct nfs_pageio_descriptor desc;
 	struct inode *inode = dreq->inode;
@@ -799,7 +799,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	size_t requested_bytes = 0;
 	size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize, PAGE_SIZE);
 
-	nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
+	nfs_pageio_init_write(&desc, inode, ioflags, false,
 			      &nfs_direct_write_completion_ops);
 	desc.pg_dreq = dreq;
 	get_dreq(dreq);
@@ -905,6 +905,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct nfs_direct_req *dreq;
 	struct nfs_lock_context *l_ctx;
 	loff_t pos, end;
+	int ioflags = swap ? FLUSH_COND_STABLE : FLUSH_STABLE;
 
 	dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
 		file, iov_iter_count(iter), (long long) iocb->ki_pos);
@@ -947,7 +948,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
 	if (!swap)
 		nfs_start_io_direct(inode);
 
-	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+	requested = nfs_direct_write_schedule_iovec(dreq, iter, pos, ioflags);
 
 	if (mapping->nrpages) {
 		invalidate_inode_pages2_range(mapping,




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 23/23] SUNRPC: lock against ->sock changing during sysfs read
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (3 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 22/23] NFS: swap-out must always use STABLE writes NeilBrown
@ 2022-01-24  3:48 ` NeilBrown
  2022-01-24  3:48 ` [PATCH 08/23] DOC: update documentation for swap_activate and swap_rw NeilBrown
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-24  3:48 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells
  Cc: linux-nfs, linux-mm, linux-kernel

->sock can be set to NULL asynchronously unless ->recv_mutex is held.
So it is important to hold that mutex.  Otherwise a sysfs read can
trigger an oops.
Commit 17f09d3f619a ("SUNRPC: Check if the xprt is connected before
handling sysfs reads") appears to attempt to fix this problem, but it
only narrows the race window.

Fixes: 17f09d3f619a ("SUNRPC: Check if the xprt is connected before handling sysfs reads")
Fixes: a8482488a7d6 ("SUNRPC query transport's source port")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/sunrpc/sysfs.c    |    5 ++++-
 net/sunrpc/xprtsock.c |    7 ++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sysfs.c b/net/sunrpc/sysfs.c
index 2766dd21935b..baaf65ea9e38 100644
--- a/net/sunrpc/sysfs.c
+++ b/net/sunrpc/sysfs.c
@@ -115,11 +115,14 @@ static ssize_t rpc_sysfs_xprt_srcaddr_show(struct kobject *kobj,
 	}
 
 	sock = container_of(xprt, struct sock_xprt, xprt);
-	if (kernel_getsockname(sock->sock, (struct sockaddr *)&saddr) < 0)
+	mutex_lock(&sock->recv_mutex);
+	if (sock->sock == NULL ||
+	    kernel_getsockname(sock->sock, (struct sockaddr *)&saddr) < 0)
 		goto out;
 
 	ret = sprintf(buf, "%pISc\n", &saddr);
 out:
+	mutex_unlock(&sock->recv_mutex);
 	xprt_put(xprt);
 	return ret + 1;
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 9d34c71004fa..3f2b766e9f82 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1641,7 +1641,12 @@ static int xs_get_srcport(struct sock_xprt *transport)
 unsigned short get_srcport(struct rpc_xprt *xprt)
 {
 	struct sock_xprt *sock = container_of(xprt, struct sock_xprt, xprt);
-	return xs_sock_getport(sock->sock);
+	unsigned short ret = 0;
+	mutex_lock(&sock->recv_mutex);
+	if (sock->sock)
+		ret = xs_sock_getport(sock->sock);
+	mutex_unlock(&sock->recv_mutex);
+	return ret;
 }
 EXPORT_SYMBOL(get_srcport);
 




^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead
  2022-01-24  3:48 ` [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead NeilBrown
@ 2022-01-24  7:27   ` Christoph Hellwig
  2022-01-26 21:47     ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  7:27 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> Code that does swap read-ahead uses blk_start_plug() and
> blk_finish_plug() to allow lower levels to combine multiple read-ahead
> pages into a single request, but calls blk_finish_plug() *before*
> submitting the original (non-ahead) read request.
> This missed an opportunity to combine read requests.
> 
> This patch moves the blk_finish_plug to *after* all the reads.
> This will likely combine the primary read with some of the "ahead"
> reads, and that may slightly increase the latency of that read, but it
> should more than make up for this by making more efficient use of the
> storage path.
> 
> The patch mostly makes the code look more consistent.  Performance
> change is unlikely to be noticeable.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> Fixes-no-auto-backport: 3fb5c298b04e ("swap: allow swap readahead to be merged")

Is this really a thing?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/23] MM: drop swap_set_page_dirty
  2022-01-24  3:48 ` [PATCH 03/23] MM: drop swap_set_page_dirty NeilBrown
@ 2022-01-24  7:28   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  7:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel, Matthew Wilcox

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> Pages that are written to swap are owned by the MM subsystem - not any
> filesystem.
> 
> When such a page is passed to a filesystem to be written out to a
> swap-file, the filesystem handles the data, but the page itself does not
> belong to the filesystem.  So calling the filesystem's set_page_dirty
> address_space operation makes no sense.  This is for pages in the given
> address space, and a page to be written to swap does not exist in the
> given address space.
> 
> So drop swap_set_page_dirty() which calls the address-space's
> set_page_dirty, and alway use __set_page_dirty_no_writeback, which is
> appropriate for pages being swapped out.

Yes, this looks sane to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

> 
> Fixes-no-auto-backport: 62c230bc1790 ("mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/swap.h |    1 -
>  mm/page_io.c         |   14 --------------
>  mm/swap_state.c      |    2 +-
>  3 files changed, 1 insertion(+), 16 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3f54a8941c9d..a43929f7033e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -419,7 +419,6 @@ extern void kswapd_stop(int nid);
>  
>  #ifdef CONFIG_SWAP
>  
> -extern int swap_set_page_dirty(struct page *page);
>  int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
>  		unsigned long nr_pages, sector_t start_block);
>  int generic_swapfile_activate(struct swap_info_struct *, struct file *,
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f8c26092e869..34b12d6f94d7 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -438,17 +438,3 @@ int swap_readpage(struct page *page, bool synchronous)
>  	delayacct_swapin_end();
>  	return ret;
>  }
> -
> -int swap_set_page_dirty(struct page *page)
> -{
> -	struct swap_info_struct *sis = page_swap_info(page);
> -
> -	if (data_race(sis->flags & SWP_FS_OPS)) {
> -		struct address_space *mapping = sis->swap_file->f_mapping;
> -
> -		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> -		return mapping->a_ops->set_page_dirty(page);
> -	} else {
> -		return __set_page_dirty_no_writeback(page);
> -	}
> -}
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 093ecf864200..d541594be1c3 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -31,7 +31,7 @@
>   */
>  static const struct address_space_operations swap_aops = {
>  	.writepage	= swap_writepage,
> -	.set_page_dirty	= swap_set_page_dirty,
> +	.set_page_dirty	= __set_page_dirty_no_writeback,
>  #ifdef CONFIG_MIGRATION
>  	.migratepage	= migrate_page,
>  #endif
> 
> 
---end quoted text---


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/23] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate
  2022-01-24  3:48 ` [PATCH 04/23] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate NeilBrown
@ 2022-01-24  7:30   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  7:30 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> If a filesystem wishes to handle all swap IO itself (via ->direct_IO),

Right now it's half ->direct_IO and half ->readpage and about to change
with your series, isn't it?

> rather than just providing devices addresses for submit_bio(),
> SWP_FS_OPS must be set.
> Currently the protocol for setting this it to have ->swap_activate
> return zero.  In that case SWP_FS_OPS is set, and add_swap_extent()
> is called for the entire file.
> 
> This is a little clumsy as different return values for ->swap_activate
> have quite different meanings, and it makes it hard to search for which
> filesystems require SWP_FS_OPS to be set.
> 
> So remove the special meaning of a zero return, and require the
> filesystem to set SWP_FS_OPS if it so desires, and to always call
> add_swap_extent() as required.
> 
> Currently only NFS and CIFS return zero for add_swap_extent().
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  7:31   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  7:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
> safe to enter the FS for reclaim.
> So only down-grade the requirement for swap pages to __GFP_IO after
> checking that SWP_FS_OPS are not being used.
> 
> This makes the calculation of "may_enter_fs" slightly more complex, so
> move it into a separate function.  with that done, there is little value
> in maintaining the bool variable any more.  So replace the
> may_enter_fs variable with a may_enter_fs() function.  This removes any
> risk for the variable becoming out-of-date.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  8:25   ` kernel test robot
  2022-01-24  8:52   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2022-01-24  8:25 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.17-rc1 next-20220124]
[cannot apply to trondmy-nfs/linux-next cifs/for-next hnaz-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over_NFS/20220124-115716
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20220124/202201241613.8J5z5arQ-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/63bff668aa0537d7ccef9ed428809fc16c1a6b6c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over_NFS/20220124-115716
        git checkout 63bff668aa0537d7ccef9ed428809fc16c1a6b6c
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from mm/vmscan.c:61:
   mm/swap.h:66:40: warning: 'struct swap_iocb' declared inside parameter list will not be visible outside of this definition or declaration
      66 |                                 struct swap_iocb **plug);
         |                                        ^~~~~~~~~
>> mm/swap.h:67:1: error: expected identifier or '(' before '{' token
      67 | {
         | ^
   mm/swap.h:65:19: warning: 'swap_readpage' declared 'static' but never defined [-Wunused-function]
      65 | static inline int swap_readpage(struct page *page, bool do_poll,
         |                   ^~~~~~~~~~~~~
--
   In file included from mm/memory.c:89:
   mm/swap.h:66:40: warning: 'struct swap_iocb' declared inside parameter list will not be visible outside of this definition or declaration
      66 |                                 struct swap_iocb **plug);
         |                                        ^~~~~~~~~
>> mm/swap.h:67:1: error: expected identifier or '(' before '{' token
      67 | {
         | ^
>> mm/swap.h:65:19: warning: 'swap_readpage' used but never defined
      65 | static inline int swap_readpage(struct page *page, bool do_poll,
         |                   ^~~~~~~~~~~~~
--
   In file included from mm/page_alloc.c:84:
   mm/swap.h:66:40: warning: 'struct swap_iocb' declared inside parameter list will not be visible outside of this definition or declaration
      66 |                                 struct swap_iocb **plug);
         |                                        ^~~~~~~~~
>> mm/swap.h:67:1: error: expected identifier or '(' before '{' token
      67 | {
         | ^
   mm/page_alloc.c:3821:15: warning: no previous prototype for 'should_fail_alloc_page' [-Wmissing-prototypes]
    3821 | noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
         |               ^~~~~~~~~~~~~~~~~~~~~~
   In file included from mm/page_alloc.c:84:
   mm/swap.h:65:19: warning: 'swap_readpage' declared 'static' but never defined [-Wunused-function]
      65 | static inline int swap_readpage(struct page *page, bool do_poll,
         |                   ^~~~~~~~~~~~~


vim +67 mm/swap.h

50dceef273a619 NeilBrown 2022-01-24  45  
50dceef273a619 NeilBrown 2022-01-24  46  struct page *read_swap_cache_async(swp_entry_t, gfp_t,
50dceef273a619 NeilBrown 2022-01-24  47  				   struct vm_area_struct *vma,
50dceef273a619 NeilBrown 2022-01-24  48  				   unsigned long addr,
63bff668aa0537 NeilBrown 2022-01-24  49  				   bool do_poll,
63bff668aa0537 NeilBrown 2022-01-24  50  				   struct swap_iocb **plug);
50dceef273a619 NeilBrown 2022-01-24  51  struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
50dceef273a619 NeilBrown 2022-01-24  52  				     struct vm_area_struct *vma,
50dceef273a619 NeilBrown 2022-01-24  53  				     unsigned long addr,
50dceef273a619 NeilBrown 2022-01-24  54  				     bool *new_page_allocated);
50dceef273a619 NeilBrown 2022-01-24  55  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
50dceef273a619 NeilBrown 2022-01-24  56  				    struct vm_fault *vmf);
50dceef273a619 NeilBrown 2022-01-24  57  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
50dceef273a619 NeilBrown 2022-01-24  58  			      struct vm_fault *vmf);
50dceef273a619 NeilBrown 2022-01-24  59  
12cf545fe71035 NeilBrown 2022-01-24  60  static inline unsigned int page_swap_flags(struct page *page)
12cf545fe71035 NeilBrown 2022-01-24  61  {
12cf545fe71035 NeilBrown 2022-01-24  62  	return page_swap_info(page)->flags;
12cf545fe71035 NeilBrown 2022-01-24  63  }
50dceef273a619 NeilBrown 2022-01-24  64  #else /* CONFIG_SWAP */
63bff668aa0537 NeilBrown 2022-01-24 @65  static inline int swap_readpage(struct page *page, bool do_poll,
63bff668aa0537 NeilBrown 2022-01-24 @66  				struct swap_iocb **plug);
50dceef273a619 NeilBrown 2022-01-24 @67  {
50dceef273a619 NeilBrown 2022-01-24  68  	return 0;
50dceef273a619 NeilBrown 2022-01-24  69  }
50dceef273a619 NeilBrown 2022-01-24  70  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 06/23] MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 06/23] MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  8:48   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:48 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 07/23] MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw
  2022-01-24  3:48 ` [PATCH 07/23] MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw NeilBrown
@ 2022-01-24  8:49   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
>  
> +static void sio_write_complete(struct kiocb *iocb, long ret)
> +{
> +	struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
> +	struct page *page = sio->bvec.bv_page;
> +
> +	if (ret != 0 && ret != PAGE_SIZE) {

ret == 0 isn't really a success case for ki_complete, it is something
that should never happen at all.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 08/23] DOC: update documentation for swap_activate and swap_rw
  2022-01-24  3:48 ` [PATCH 08/23] DOC: update documentation for swap_activate and swap_rw NeilBrown
@ 2022-01-24  8:50   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
  2022-01-24  8:25   ` kernel test robot
@ 2022-01-24  8:52   ` Christoph Hellwig
  2022-01-24  9:27   ` kernel test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> swap_readpage() is given one page at a time, but maybe called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
> are single page reads.
> 
> With this patch we pass in a pointer-to-pointer when swap_readpage can
> store state between calls - much like the effect of blk_plug.  After
> calling swap_readpage() some number of times, the state will be passed
> to swap_read_unplug() which can submit the combined request.
> 
> Some caller currently call blk_finish_plug() *before* the final call to
> swap_readpage(), so the last page cannot be included.  This patch moves
> blk_finish_plug() to after the last call, and calls swap_read_unplug()
> there too.

Buildbot complais that we might need a forward declaration of
struct swap_iocb, but otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
@ 2022-01-24  8:55   ` Christoph Hellwig
  2022-01-24 10:29   ` kernel test robot
  1 sibling, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> swap_writepage() is given one page at a time, but may be called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
> are single page reads.
> 
> With this patch we pass a pointer-to-pointer via the wbc.
> swap_writepage can store state between calls - much like the pointer
> passed explicitly to swap_readpage.  After calling swap_writepage() some
> number of times, the state will be passed to swap_write_unplug() which
> can submit the combined request.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/writeback.h |    7 +++
>  mm/page_io.c              |  103 +++++++++++++++++++++++++++++----------------
>  mm/swap.h                 |    1 
>  mm/vmscan.c               |    9 +++-
>  4 files changed, 82 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index fec248ab1fec..6dcaa0639c0d 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -80,6 +80,13 @@ struct writeback_control {
>  
>  	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
>  
> +	/* To enable batching of swap writes to non-block-device backends,
> +	 * "plug" can be set point to a 'struct swap_iocb *'.  When all swap
> +	 * writes have been submitted, if with swap_iocb is not NULL,
> +	 * swap_write_unplug() should be called.
> +	 */
> +	struct swap_iocb **plug;

Mayb plug isn't really the best name for something swap-specific in this
generic structure?

Also the above does not fit the normal kernel comment style with an
otherwise empty

	/*

line.

> +	for (p = 0; p < sio->pages; p++) {
> +		struct page *page = sio->bvec[p].bv_page;
> +
> +		if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
> +			/*
> +			 * In the case of swap-over-nfs, this can be a
> +			 * temporary failure if the system has limited
> +			 * memory for allocating transmit buffers.
> +			 * Mark the page dirty and avoid
> +			 * folio_rotate_reclaimable but rate-limit the
> +			 * messages but do not flag PageError like
> +			 * the normal direct-to-bio case as it could
> +			 * be temporary.
> +			 */
> +			set_page_dirty(page);
> +			ClearPageReclaim(page);
> +			pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
> +					   ret, page_file_offset(page));
> +		} else
> +			count_vm_event(PSWPOUT);

I'd rather check for the error condition ones and have separate loops
forthe success vs error cases instead of checking the condition again
and again.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag
  2022-01-24  3:48 ` [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag NeilBrown
@ 2022-01-24  8:56   ` Christoph Hellwig
  2022-01-26 22:14     ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> Currently various places test if direct IO is possible on a file by
> checking for the existence of the direct_IO address space operation.
> This is a poor choice, as the direct_IO operation may not be used - it is
> only used if the generic_file_*_iter functions are called for direct IO
> and some filesystems - particularly NFS - don't do this.
> 
> Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
> various places to check this (avoiding pointer dereferences).
> do_dentry_open() will set this flag if ->direct_IO is present, so
> filesystems do not need to be changed.
> 
> NFS *is* changed, to set the flag explicitly and discard the direct_IO
> entry in the address_space_operations for files.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

It would be nice to throw in a cleanup to remove noop_direct_IO as well.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 12/23] NFS: remove IS_SWAPFILE hack
  2022-01-24  3:48 ` [PATCH 12/23] NFS: remove IS_SWAPFILE hack NeilBrown
@ 2022-01-24  8:56   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> This code is pointless as IS_SWAPFILE is always defined.
> So remove it.
> 
> Suggested-by: Mark Hemment <markhemm@googlemail.com>
> Signed-off-by: NeilBrown <neilb@suse.de>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 13/23] NFS: rename nfs_direct_IO and use as ->swap_rw
  2022-01-24  3:48 ` [PATCH 13/23] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
@ 2022-01-24  8:57   ` Christoph Hellwig
  0 siblings, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

> +extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);

Might be worth to drop the pointless extern while you're at it.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO
  2022-01-24  3:48 ` [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
@ 2022-01-24  8:58   ` Christoph Hellwig
  2022-01-24 13:22   ` Mark Hemment
  1 sibling, 0 replies; 53+ messages in thread
From: Christoph Hellwig @ 2022-01-24  8:58 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

Same comment on the externs as for the previous one, but otherwise looks
good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
  2022-01-24  8:25   ` kernel test robot
  2022-01-24  8:52   ` Christoph Hellwig
@ 2022-01-24  9:27   ` kernel test robot
  2022-01-24 13:16   ` Mark Hemment
  2022-02-08 11:07   ` Geert Uytterhoeven
  4 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2022-01-24  9:27 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: llvm, kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.17-rc1 next-20220124]
[cannot apply to trondmy-nfs/linux-next cifs/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over_NFS/20220124-115716
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
config: i386-randconfig-a011-20220124 (https://download.01.org/0day-ci/archive/20220124/202201241747.X9gXaaeP-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9006bf424847bf91f0a624ffc27ad165c7b804c4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/63bff668aa0537d7ccef9ed428809fc16c1a6b6c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over_NFS/20220124-115716
        git checkout 63bff668aa0537d7ccef9ed428809fc16c1a6b6c
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from mm/vmscan.c:61:
   mm/swap.h:66:12: warning: declaration of 'struct swap_iocb' will not be visible outside of this function [-Wvisibility]
                                   struct swap_iocb **plug);
                                          ^
>> mm/swap.h:67:1: error: expected identifier or '('
   {
   ^
   1 warning and 1 error generated.
--
   In file included from mm/shmem.c:41:
   mm/swap.h:66:12: warning: declaration of 'struct swap_iocb' will not be visible outside of this function [-Wvisibility]
                                   struct swap_iocb **plug);
                                          ^
>> mm/swap.h:67:1: error: expected identifier or '('
   {
   ^
   In file included from mm/shmem.c:56:
   include/linux/mman.h:158:9: warning: division by zero is undefined [-Wdivision-by-zero]
                  _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/mman.h:136:21: note: expanded from macro '_calc_vm_trans'
      : ((x) & (bit1)) / ((bit1) / (bit2))))
                       ^ ~~~~~~~~~~~~~~~~~
   2 warnings and 1 error generated.
--
   In file included from mm/page_alloc.c:84:
   mm/swap.h:66:12: warning: declaration of 'struct swap_iocb' will not be visible outside of this function [-Wvisibility]
                                   struct swap_iocb **plug);
                                          ^
>> mm/swap.h:67:1: error: expected identifier or '('
   {
   ^
   mm/page_alloc.c:3821:15: warning: no previous prototype for function 'should_fail_alloc_page' [-Wmissing-prototypes]
   noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
                 ^
   mm/page_alloc.c:3821:10: note: declare 'static' if the function is not intended to be used outside of this translation unit
   noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
            ^
            static 
   2 warnings and 1 error generated.


vim +67 mm/swap.h

50dceef273a619 NeilBrown 2022-01-24  45  
50dceef273a619 NeilBrown 2022-01-24  46  struct page *read_swap_cache_async(swp_entry_t, gfp_t,
50dceef273a619 NeilBrown 2022-01-24  47  				   struct vm_area_struct *vma,
50dceef273a619 NeilBrown 2022-01-24  48  				   unsigned long addr,
63bff668aa0537 NeilBrown 2022-01-24  49  				   bool do_poll,
63bff668aa0537 NeilBrown 2022-01-24  50  				   struct swap_iocb **plug);
50dceef273a619 NeilBrown 2022-01-24  51  struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
50dceef273a619 NeilBrown 2022-01-24  52  				     struct vm_area_struct *vma,
50dceef273a619 NeilBrown 2022-01-24  53  				     unsigned long addr,
50dceef273a619 NeilBrown 2022-01-24  54  				     bool *new_page_allocated);
50dceef273a619 NeilBrown 2022-01-24  55  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
50dceef273a619 NeilBrown 2022-01-24  56  				    struct vm_fault *vmf);
50dceef273a619 NeilBrown 2022-01-24  57  struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
50dceef273a619 NeilBrown 2022-01-24  58  			      struct vm_fault *vmf);
50dceef273a619 NeilBrown 2022-01-24  59  
12cf545fe71035 NeilBrown 2022-01-24  60  static inline unsigned int page_swap_flags(struct page *page)
12cf545fe71035 NeilBrown 2022-01-24  61  {
12cf545fe71035 NeilBrown 2022-01-24  62  	return page_swap_info(page)->flags;
12cf545fe71035 NeilBrown 2022-01-24  63  }
50dceef273a619 NeilBrown 2022-01-24  64  #else /* CONFIG_SWAP */
63bff668aa0537 NeilBrown 2022-01-24  65  static inline int swap_readpage(struct page *page, bool do_poll,
63bff668aa0537 NeilBrown 2022-01-24 @66  				struct swap_iocb **plug);
50dceef273a619 NeilBrown 2022-01-24 @67  {
50dceef273a619 NeilBrown 2022-01-24  68  	return 0;
50dceef273a619 NeilBrown 2022-01-24  69  }
50dceef273a619 NeilBrown 2022-01-24  70  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
  2022-01-24  8:55   ` Christoph Hellwig
@ 2022-01-24 10:29   ` kernel test robot
  1 sibling, 0 replies; 53+ messages in thread
From: kernel test robot @ 2022-01-24 10:29 UTC (permalink / raw)
  To: NeilBrown, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, Christoph Hellwig, David Howells
  Cc: kbuild-all, Linux Memory Management List, linux-nfs, linux-kernel

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.17-rc1 next-20220124]
[cannot apply to trondmy-nfs/linux-next cifs/for-next hnaz-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over_NFS/20220124-115716
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20220124/202201241811.2ofGi6Q2-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/267352b9af826e20ab71b46a7cd70d51058b3030
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over_NFS/20220124-115716
        git checkout 267352b9af826e20ab71b46a7cd70d51058b3030
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from mm/vmscan.c:61:
   mm/swap.h:68:1: error: expected identifier or '(' before '{' token
      68 | {
         | ^
   mm/vmscan.c: In function 'shrink_page_list':
>> mm/vmscan.c:1978:17: error: implicit declaration of function 'swap_write_unplug'; did you mean 'swap_writepage'? [-Werror=implicit-function-declaration]
    1978 |                 swap_write_unplug(plug);
         |                 ^~~~~~~~~~~~~~~~~
         |                 swap_writepage
   In file included from mm/vmscan.c:61:
   mm/vmscan.c: At top level:
   mm/swap.h:66:19: warning: 'swap_readpage' declared 'static' but never defined [-Wunused-function]
      66 | static inline int swap_readpage(struct page *page, bool do_poll,
         |                   ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +1978 mm/vmscan.c

  1526	
  1527	/*
  1528	 * shrink_page_list() returns the number of reclaimed pages
  1529	 */
  1530	static unsigned int shrink_page_list(struct list_head *page_list,
  1531					     struct pglist_data *pgdat,
  1532					     struct scan_control *sc,
  1533					     struct reclaim_stat *stat,
  1534					     bool ignore_references)
  1535	{
  1536		LIST_HEAD(ret_pages);
  1537		LIST_HEAD(free_pages);
  1538		LIST_HEAD(demote_pages);
  1539		unsigned int nr_reclaimed = 0;
  1540		unsigned int pgactivate = 0;
  1541		bool do_demote_pass;
  1542		struct swap_iocb *plug = NULL;
  1543	
  1544		memset(stat, 0, sizeof(*stat));
  1545		cond_resched();
  1546		do_demote_pass = can_demote(pgdat->node_id, sc);
  1547	
  1548	retry:
  1549		while (!list_empty(page_list)) {
  1550			struct address_space *mapping;
  1551			struct page *page;
  1552			enum page_references references = PAGEREF_RECLAIM;
  1553			bool dirty, writeback;
  1554			unsigned int nr_pages;
  1555	
  1556			cond_resched();
  1557	
  1558			page = lru_to_page(page_list);
  1559			list_del(&page->lru);
  1560	
  1561			if (!trylock_page(page))
  1562				goto keep;
  1563	
  1564			VM_BUG_ON_PAGE(PageActive(page), page);
  1565	
  1566			nr_pages = compound_nr(page);
  1567	
  1568			/* Account the number of base pages even though THP */
  1569			sc->nr_scanned += nr_pages;
  1570	
  1571			if (unlikely(!page_evictable(page)))
  1572				goto activate_locked;
  1573	
  1574			if (!sc->may_unmap && page_mapped(page))
  1575				goto keep_locked;
  1576	
  1577			/*
  1578			 * The number of dirty pages determines if a node is marked
  1579			 * reclaim_congested. kswapd will stall and start writing
  1580			 * pages if the tail of the LRU is all dirty unqueued pages.
  1581			 */
  1582			page_check_dirty_writeback(page, &dirty, &writeback);
  1583			if (dirty || writeback)
  1584				stat->nr_dirty++;
  1585	
  1586			if (dirty && !writeback)
  1587				stat->nr_unqueued_dirty++;
  1588	
  1589			/*
  1590			 * Treat this page as congested if the underlying BDI is or if
  1591			 * pages are cycling through the LRU so quickly that the
  1592			 * pages marked for immediate reclaim are making it to the
  1593			 * end of the LRU a second time.
  1594			 */
  1595			mapping = page_mapping(page);
  1596			if (((dirty || writeback) && mapping &&
  1597			     inode_write_congested(mapping->host)) ||
  1598			    (writeback && PageReclaim(page)))
  1599				stat->nr_congested++;
  1600	
  1601			/*
  1602			 * If a page at the tail of the LRU is under writeback, there
  1603			 * are three cases to consider.
  1604			 *
  1605			 * 1) If reclaim is encountering an excessive number of pages
  1606			 *    under writeback and this page is both under writeback and
  1607			 *    PageReclaim then it indicates that pages are being queued
  1608			 *    for IO but are being recycled through the LRU before the
  1609			 *    IO can complete. Waiting on the page itself risks an
  1610			 *    indefinite stall if it is impossible to writeback the
  1611			 *    page due to IO error or disconnected storage so instead
  1612			 *    note that the LRU is being scanned too quickly and the
  1613			 *    caller can stall after page list has been processed.
  1614			 *
  1615			 * 2) Global or new memcg reclaim encounters a page that is
  1616			 *    not marked for immediate reclaim, or the caller does not
  1617			 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
  1618			 *    not to fs). In this case mark the page for immediate
  1619			 *    reclaim and continue scanning.
  1620			 *
  1621			 *    Require may_enter_fs() because we would wait on fs, which
  1622			 *    may not have submitted IO yet. And the loop driver might
  1623			 *    enter reclaim, and deadlock if it waits on a page for
  1624			 *    which it is needed to do the write (loop masks off
  1625			 *    __GFP_IO|__GFP_FS for this reason); but more thought
  1626			 *    would probably show more reasons.
  1627			 *
  1628			 * 3) Legacy memcg encounters a page that is already marked
  1629			 *    PageReclaim. memcg does not have any dirty pages
  1630			 *    throttling so we could easily OOM just because too many
  1631			 *    pages are in writeback and there is nothing else to
  1632			 *    reclaim. Wait for the writeback to complete.
  1633			 *
  1634			 * In cases 1) and 2) we activate the pages to get them out of
  1635			 * the way while we continue scanning for clean pages on the
  1636			 * inactive list and refilling from the active list. The
  1637			 * observation here is that waiting for disk writes is more
  1638			 * expensive than potentially causing reloads down the line.
  1639			 * Since they're marked for immediate reclaim, they won't put
  1640			 * memory pressure on the cache working set any longer than it
  1641			 * takes to write them to disk.
  1642			 */
  1643			if (PageWriteback(page)) {
  1644				/* Case 1 above */
  1645				if (current_is_kswapd() &&
  1646				    PageReclaim(page) &&
  1647				    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
  1648					stat->nr_immediate++;
  1649					goto activate_locked;
  1650	
  1651				/* Case 2 above */
  1652				} else if (writeback_throttling_sane(sc) ||
  1653				    !PageReclaim(page) || !may_enter_fs(page, sc->gfp_mask)) {
  1654					/*
  1655					 * This is slightly racy - end_page_writeback()
  1656					 * might have just cleared PageReclaim, then
  1657					 * setting PageReclaim here end up interpreted
  1658					 * as PageReadahead - but that does not matter
  1659					 * enough to care.  What we do want is for this
  1660					 * page to have PageReclaim set next time memcg
  1661					 * reclaim reaches the tests above, so it will
  1662					 * then wait_on_page_writeback() to avoid OOM;
  1663					 * and it's also appropriate in global reclaim.
  1664					 */
  1665					SetPageReclaim(page);
  1666					stat->nr_writeback++;
  1667					goto activate_locked;
  1668	
  1669				/* Case 3 above */
  1670				} else {
  1671					unlock_page(page);
  1672					wait_on_page_writeback(page);
  1673					/* then go back and try same page again */
  1674					list_add_tail(&page->lru, page_list);
  1675					continue;
  1676				}
  1677			}
  1678	
  1679			if (!ignore_references)
  1680				references = page_check_references(page, sc);
  1681	
  1682			switch (references) {
  1683			case PAGEREF_ACTIVATE:
  1684				goto activate_locked;
  1685			case PAGEREF_KEEP:
  1686				stat->nr_ref_keep += nr_pages;
  1687				goto keep_locked;
  1688			case PAGEREF_RECLAIM:
  1689			case PAGEREF_RECLAIM_CLEAN:
  1690				; /* try to reclaim the page below */
  1691			}
  1692	
  1693			/*
  1694			 * Before reclaiming the page, try to relocate
  1695			 * its contents to another node.
  1696			 */
  1697			if (do_demote_pass &&
  1698			    (thp_migration_supported() || !PageTransHuge(page))) {
  1699				list_add(&page->lru, &demote_pages);
  1700				unlock_page(page);
  1701				continue;
  1702			}
  1703	
  1704			/*
  1705			 * Anonymous process memory has backing store?
  1706			 * Try to allocate it some swap space here.
  1707			 * Lazyfree page could be freed directly
  1708			 */
  1709			if (PageAnon(page) && PageSwapBacked(page)) {
  1710				if (!PageSwapCache(page)) {
  1711					if (!(sc->gfp_mask & __GFP_IO))
  1712						goto keep_locked;
  1713					if (page_maybe_dma_pinned(page))
  1714						goto keep_locked;
  1715					if (PageTransHuge(page)) {
  1716						/* cannot split THP, skip it */
  1717						if (!can_split_huge_page(page, NULL))
  1718							goto activate_locked;
  1719						/*
  1720						 * Split pages without a PMD map right
  1721						 * away. Chances are some or all of the
  1722						 * tail pages can be freed without IO.
  1723						 */
  1724						if (!compound_mapcount(page) &&
  1725						    split_huge_page_to_list(page,
  1726									    page_list))
  1727							goto activate_locked;
  1728					}
  1729					if (!add_to_swap(page)) {
  1730						if (!PageTransHuge(page))
  1731							goto activate_locked_split;
  1732						/* Fallback to swap normal pages */
  1733						if (split_huge_page_to_list(page,
  1734									    page_list))
  1735							goto activate_locked;
  1736	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  1737						count_vm_event(THP_SWPOUT_FALLBACK);
  1738	#endif
  1739						if (!add_to_swap(page))
  1740							goto activate_locked_split;
  1741					}
  1742	
  1743					/* Adding to swap updated mapping */
  1744					mapping = page_mapping(page);
  1745				}
  1746			} else if (unlikely(PageTransHuge(page))) {
  1747				/* Split file THP */
  1748				if (split_huge_page_to_list(page, page_list))
  1749					goto keep_locked;
  1750			}
  1751	
  1752			/*
  1753			 * THP may get split above, need minus tail pages and update
  1754			 * nr_pages to avoid accounting tail pages twice.
  1755			 *
  1756			 * The tail pages that are added into swap cache successfully
  1757			 * reach here.
  1758			 */
  1759			if ((nr_pages > 1) && !PageTransHuge(page)) {
  1760				sc->nr_scanned -= (nr_pages - 1);
  1761				nr_pages = 1;
  1762			}
  1763	
  1764			/*
  1765			 * The page is mapped into the page tables of one or more
  1766			 * processes. Try to unmap it here.
  1767			 */
  1768			if (page_mapped(page)) {
  1769				enum ttu_flags flags = TTU_BATCH_FLUSH;
  1770				bool was_swapbacked = PageSwapBacked(page);
  1771	
  1772				if (unlikely(PageTransHuge(page)))
  1773					flags |= TTU_SPLIT_HUGE_PMD;
  1774	
  1775				try_to_unmap(page, flags);
  1776				if (page_mapped(page)) {
  1777					stat->nr_unmap_fail += nr_pages;
  1778					if (!was_swapbacked && PageSwapBacked(page))
  1779						stat->nr_lazyfree_fail += nr_pages;
  1780					goto activate_locked;
  1781				}
  1782			}
  1783	
  1784			if (PageDirty(page)) {
  1785				/*
  1786				 * Only kswapd can writeback filesystem pages
  1787				 * to avoid risk of stack overflow. But avoid
  1788				 * injecting inefficient single-page IO into
  1789				 * flusher writeback as much as possible: only
  1790				 * write pages when we've encountered many
  1791				 * dirty pages, and when we've already scanned
  1792				 * the rest of the LRU for clean pages and see
  1793				 * the same dirty pages again (PageReclaim).
  1794				 */
  1795				if (page_is_file_lru(page) &&
  1796				    (!current_is_kswapd() || !PageReclaim(page) ||
  1797				     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
  1798					/*
  1799					 * Immediately reclaim when written back.
  1800					 * Similar in principal to deactivate_page()
  1801					 * except we already have the page isolated
  1802					 * and know it's dirty
  1803					 */
  1804					inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
  1805					SetPageReclaim(page);
  1806	
  1807					goto activate_locked;
  1808				}
  1809	
  1810				if (references == PAGEREF_RECLAIM_CLEAN)
  1811					goto keep_locked;
  1812				if (!may_enter_fs(page, sc->gfp_mask))
  1813					goto keep_locked;
  1814				if (!sc->may_writepage)
  1815					goto keep_locked;
  1816	
  1817				/*
  1818				 * Page is dirty. Flush the TLB if a writable entry
  1819				 * potentially exists to avoid CPU writes after IO
  1820				 * starts and then write it out here.
  1821				 */
  1822				try_to_unmap_flush_dirty();
  1823				switch (pageout(page, mapping, &plug)) {
  1824				case PAGE_KEEP:
  1825					goto keep_locked;
  1826				case PAGE_ACTIVATE:
  1827					goto activate_locked;
  1828				case PAGE_SUCCESS:
  1829					stat->nr_pageout += thp_nr_pages(page);
  1830	
  1831					if (PageWriteback(page))
  1832						goto keep;
  1833					if (PageDirty(page))
  1834						goto keep;
  1835	
  1836					/*
  1837					 * A synchronous write - probably a ramdisk.  Go
  1838					 * ahead and try to reclaim the page.
  1839					 */
  1840					if (!trylock_page(page))
  1841						goto keep;
  1842					if (PageDirty(page) || PageWriteback(page))
  1843						goto keep_locked;
  1844					mapping = page_mapping(page);
  1845					fallthrough;
  1846				case PAGE_CLEAN:
  1847					; /* try to free the page below */
  1848				}
  1849			}
  1850	
  1851			/*
  1852			 * If the page has buffers, try to free the buffer mappings
  1853			 * associated with this page. If we succeed we try to free
  1854			 * the page as well.
  1855			 *
  1856			 * We do this even if the page is PageDirty().
  1857			 * try_to_release_page() does not perform I/O, but it is
  1858			 * possible for a page to have PageDirty set, but it is actually
  1859			 * clean (all its buffers are clean).  This happens if the
  1860			 * buffers were written out directly, with submit_bh(). ext3
  1861			 * will do this, as well as the blockdev mapping.
  1862			 * try_to_release_page() will discover that cleanness and will
  1863			 * drop the buffers and mark the page clean - it can be freed.
  1864			 *
  1865			 * Rarely, pages can have buffers and no ->mapping.  These are
  1866			 * the pages which were not successfully invalidated in
  1867			 * truncate_cleanup_page().  We try to drop those buffers here
  1868			 * and if that worked, and the page is no longer mapped into
  1869			 * process address space (page_count == 1) it can be freed.
  1870			 * Otherwise, leave the page on the LRU so it is swappable.
  1871			 */
  1872			if (page_has_private(page)) {
  1873				if (!try_to_release_page(page, sc->gfp_mask))
  1874					goto activate_locked;
  1875				if (!mapping && page_count(page) == 1) {
  1876					unlock_page(page);
  1877					if (put_page_testzero(page))
  1878						goto free_it;
  1879					else {
  1880						/*
  1881						 * rare race with speculative reference.
  1882						 * the speculative reference will free
  1883						 * this page shortly, so we may
  1884						 * increment nr_reclaimed here (and
  1885						 * leave it off the LRU).
  1886						 */
  1887						nr_reclaimed++;
  1888						continue;
  1889					}
  1890				}
  1891			}
  1892	
  1893			if (PageAnon(page) && !PageSwapBacked(page)) {
  1894				/* follow __remove_mapping for reference */
  1895				if (!page_ref_freeze(page, 1))
  1896					goto keep_locked;
  1897				/*
  1898				 * The page has only one reference left, which is
  1899				 * from the isolation. After the caller puts the
  1900				 * page back on lru and drops the reference, the
  1901				 * page will be freed anyway. It doesn't matter
  1902				 * which lru it goes. So we don't bother checking
  1903				 * PageDirty here.
  1904				 */
  1905				count_vm_event(PGLAZYFREED);
  1906				count_memcg_page_event(page, PGLAZYFREED);
  1907			} else if (!mapping || !__remove_mapping(mapping, page, true,
  1908								 sc->target_mem_cgroup))
  1909				goto keep_locked;
  1910	
  1911			unlock_page(page);
  1912	free_it:
  1913			/*
  1914			 * THP may get swapped out in a whole, need account
  1915			 * all base pages.
  1916			 */
  1917			nr_reclaimed += nr_pages;
  1918	
  1919			/*
  1920			 * Is there need to periodically free_page_list? It would
  1921			 * appear not as the counts should be low
  1922			 */
  1923			if (unlikely(PageTransHuge(page)))
  1924				destroy_compound_page(page);
  1925			else
  1926				list_add(&page->lru, &free_pages);
  1927			continue;
  1928	
  1929	activate_locked_split:
  1930			/*
  1931			 * The tail pages that are failed to add into swap cache
  1932			 * reach here.  Fixup nr_scanned and nr_pages.
  1933			 */
  1934			if (nr_pages > 1) {
  1935				sc->nr_scanned -= (nr_pages - 1);
  1936				nr_pages = 1;
  1937			}
  1938	activate_locked:
  1939			/* Not a candidate for swapping, so reclaim swap space. */
  1940			if (PageSwapCache(page) && (mem_cgroup_swap_full(page) ||
  1941							PageMlocked(page)))
  1942				try_to_free_swap(page);
  1943			VM_BUG_ON_PAGE(PageActive(page), page);
  1944			if (!PageMlocked(page)) {
  1945				int type = page_is_file_lru(page);
  1946				SetPageActive(page);
  1947				stat->nr_activate[type] += nr_pages;
  1948				count_memcg_page_event(page, PGACTIVATE);
  1949			}
  1950	keep_locked:
  1951			unlock_page(page);
  1952	keep:
  1953			list_add(&page->lru, &ret_pages);
  1954			VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
  1955		}
  1956		/* 'page_list' is always empty here */
  1957	
  1958		/* Migrate pages selected for demotion */
  1959		nr_reclaimed += demote_page_list(&demote_pages, pgdat);
  1960		/* Pages that could not be demoted are still in @demote_pages */
  1961		if (!list_empty(&demote_pages)) {
  1962			/* Pages which failed to demoted go back on @page_list for retry: */
  1963			list_splice_init(&demote_pages, page_list);
  1964			do_demote_pass = false;
  1965			goto retry;
  1966		}
  1967	
  1968		pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
  1969	
  1970		mem_cgroup_uncharge_list(&free_pages);
  1971		try_to_unmap_flush();
  1972		free_unref_page_list(&free_pages);
  1973	
  1974		list_splice(&ret_pages, page_list);
  1975		count_vm_events(PGACTIVATE, pgactivate);
  1976	
  1977		if (plug)
> 1978			swap_write_unplug(plug);
  1979		return nr_reclaimed;
  1980	}
  1981	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
                     ` (2 preceding siblings ...)
  2022-01-24  9:27   ` kernel test robot
@ 2022-01-24 13:16   ` Mark Hemment
  2022-01-26 22:04     ` NeilBrown
  2022-02-08 11:07   ` Geert Uytterhoeven
  4 siblings, 1 reply; 53+ messages in thread
From: Mark Hemment @ 2022-01-24 13:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, 24 Jan 2022 at 03:52, NeilBrown <neilb@suse.de> wrote:
>
> swap_readpage() is given one page at a time, but maybe called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
> are single page reads.
>
> With this patch we pass in a pointer-to-pointer when swap_readpage can
> store state between calls - much like the effect of blk_plug.  After
> calling swap_readpage() some number of times, the state will be passed
> to swap_read_unplug() which can submit the combined request.
>
> Some caller currently call blk_finish_plug() *before* the final call to
> swap_readpage(), so the last page cannot be included.  This patch moves
> blk_finish_plug() to after the last call, and calls swap_read_unplug()
> there too.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  mm/madvise.c    |    8 +++-
>  mm/memory.c     |    2 +
>  mm/page_io.c    |  102 +++++++++++++++++++++++++++++++++++--------------------
>  mm/swap.h       |   16 +++++++--
>  mm/swap_state.c |   19 +++++++---
>  5 files changed, 98 insertions(+), 49 deletions(-)
>
...
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 6e32ca35d9b6..bcf655d650c8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -390,46 +391,60 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  static void sio_read_complete(struct kiocb *iocb, long ret)
>  {
>         struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
> -       struct page *page = sio->bvec.bv_page;
> -
> -       if (ret != 0 && ret != PAGE_SIZE) {
> -               SetPageError(page);
> -               ClearPageUptodate(page);
> -               pr_alert_ratelimited("Read-error on swap-device\n");
> -       } else {
> -               SetPageUptodate(page);
> -               count_vm_event(PSWPIN);
> +       int p;
> +
> +       for (p = 0; p < sio->pages; p++) {
> +               struct page *page = sio->bvec[p].bv_page;
> +               if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
> +                       SetPageError(page);
> +                       ClearPageUptodate(page);
> +                       pr_alert_ratelimited("Read-error on swap-device\n");
> +               } else {
> +                       SetPageUptodate(page);
> +                       count_vm_event(PSWPIN);
> +               }
> +               unlock_page(page);
>         }
> -       unlock_page(page);
>         mempool_free(sio, sio_pool);
>  }

Trivial: on success, could be single call to count_vm_events(PSWPIN,
sio->pages).
Similar comment for PSWPOUT in sio_write_complete()

>
> -static int swap_readpage_fs(struct page *page)
> +static void swap_readpage_fs(struct page *page,
> +                            struct swap_iocb **plug)
>  {
>         struct swap_info_struct *sis = page_swap_info(page);
> -       struct file *swap_file = sis->swap_file;
> -       struct address_space *mapping = swap_file->f_mapping;
> -       struct iov_iter from;
> -       struct swap_iocb *sio;
> +       struct swap_iocb *sio = NULL;
>         loff_t pos = page_file_offset(page);
> -       int ret;
> -
> -       sio = mempool_alloc(sio_pool, GFP_KERNEL);
> -       init_sync_kiocb(&sio->iocb, swap_file);
> -       sio->iocb.ki_pos = pos;
> -       sio->iocb.ki_complete = sio_read_complete;
> -       sio->bvec.bv_page = page;
> -       sio->bvec.bv_len = PAGE_SIZE;
> -       sio->bvec.bv_offset = 0;
>
> -       iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
> -       ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> -       if (ret != -EIOCBQUEUED)
> -               sio_read_complete(&sio->iocb, ret);
> -       return ret;
> +       if (*plug)
> +               sio = *plug;

'plug' can be NULL when called from do_swap_page();
        if (plug && *plug)

Cheers,
Mark


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO
  2022-01-24  3:48 ` [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
  2022-01-24  8:58   ` Christoph Hellwig
@ 2022-01-24 13:22   ` Mark Hemment
  2022-01-26 22:51     ` NeilBrown
  1 sibling, 1 reply; 53+ messages in thread
From: Mark Hemment @ 2022-01-24 13:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, 24 Jan 2022 at 03:53, NeilBrown <neilb@suse.de> wrote:
>
> 1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
>    possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
>    eventuate if a buffered read on the swapfile was attempted.
>
>    We don't need coherence with the page cache for a swap file, and
>    buffered writes are forbidden anyway.  There is no other need for
>    i_rwsem during direct IO.  So never take it for swap_rw()
>
> 2/ generic_write_checks() explicitly forbids writes to swap, and
>    performs checks that are not needed for swap.  So bypass it
>    for swap_rw().
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/direct.c        |   30 +++++++++++++++++++++---------
>  fs/nfs/file.c          |    4 ++--
>  include/linux/nfs_fs.h |    4 ++--
>  3 files changed, 25 insertions(+), 13 deletions(-)
>
...
> @@ -943,7 +954,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
>                                               pos >> PAGE_SHIFT, end);
>         }
>
> -       nfs_end_io_direct(inode);
> +       if (!swap)
> +               nfs_end_io_direct(inode);

Just above this code diff, there is;
    if (mapping->nrpages) {
        invalidate_inode_pages2_range(mapping,
             pos >> PAGE_SHIFT, end);
    }

This invalidation looks strange/wrong for a NFS swap write.  Should it
be disabled for the swap case?

Cheers,
Mark


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 22/23] NFS: swap-out must always use STABLE writes.
  2022-01-24  3:48 ` [PATCH 22/23] NFS: swap-out must always use STABLE writes NeilBrown
@ 2022-01-26  3:45   ` Trond Myklebust
  2022-01-26 21:42     ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Trond Myklebust @ 2022-01-26  3:45 UTC (permalink / raw)
  To: mgorman, dhowells, hch, neilb, anna.schumaker, chuck.lever, akpm
  Cc: linux-mm, linux-nfs, linux-kernel

On Mon, 2022-01-24 at 14:48 +1100, NeilBrown wrote:
> The commit handling code is not safe against memory-pressure
> deadlocks
> when writing to swap.  In particular, nfs_commitdata_alloc() blocks
> indefinitely waiting for memory, and this can consume all available
> workqueue threads.
> 
> swap-out most likely uses STABLE writes anyway as COND_STABLE
> indicates
> that a stable write should be used if the write fits in a single
> request, and it normally does.  However if we ever swap with a small
> wsize, or gather unusually large numbers of pages for a single write,
> this might change.
> 
> For safety, make it explicit in the code that direct writes used for
> swap
> must always use FLUSH_COND_STABLE.

OK. Your explanation above has me extremely confused.

If you want to avoid commit, then you should be using FLUSH_STABLE,
since that forces the writes to be synchronous. FLUSH_COND_STABLE can
and will use unstable writes if it sees that there are more writes to
come.

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/direct.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 43a956d7fd62..29c007b2a17a 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -791,7 +791,7 @@ static const struct nfs_pgio_completion_ops
> nfs_direct_write_completion_ops = {
>   */
>  static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req
> *dreq,
>                                                struct iov_iter *iter,
> -                                              loff_t pos)
> +                                              loff_t pos, int
> ioflags)
>  {
>         struct nfs_pageio_descriptor desc;
>         struct inode *inode = dreq->inode;
> @@ -799,7 +799,7 @@ static ssize_t
> nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
>         size_t requested_bytes = 0;
>         size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize,
> PAGE_SIZE);
>  
> -       nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
> +       nfs_pageio_init_write(&desc, inode, ioflags, false,
>                               &nfs_direct_write_completion_ops);
>         desc.pg_dreq = dreq;
>         get_dreq(dreq);
> @@ -905,6 +905,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb,
> struct iov_iter *iter,
>         struct nfs_direct_req *dreq;
>         struct nfs_lock_context *l_ctx;
>         loff_t pos, end;
> +       int ioflags = swap ? FLUSH_COND_STABLE : FLUSH_STABLE;

This is an unacceptable change in behaviour for the non-swap case, so
NACK.

>  
>         dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
>                 file, iov_iter_count(iter), (long long) iocb-
> >ki_pos);
> @@ -947,7 +948,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb,
> struct iov_iter *iter,
>         if (!swap)
>                 nfs_start_io_direct(inode);
>  
> -       requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
> +       requested = nfs_direct_write_schedule_iovec(dreq, iter, pos,
> ioflags);
>  
>         if (mapping->nrpages) {
>                 invalidate_inode_pages2_range(mapping,
> 
> 

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
​
www.hammer.space


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 22/23] NFS: swap-out must always use STABLE writes.
  2022-01-26  3:45   ` Trond Myklebust
@ 2022-01-26 21:42     ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-26 21:42 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: mgorman, dhowells, hch, anna.schumaker, chuck.lever, akpm,
	linux-mm, linux-nfs, linux-kernel

On Wed, 26 Jan 2022, Trond Myklebust wrote:
> On Mon, 2022-01-24 at 14:48 +1100, NeilBrown wrote:
> > The commit handling code is not safe against memory-pressure
> > deadlocks
> > when writing to swap.  In particular, nfs_commitdata_alloc() blocks
> > indefinitely waiting for memory, and this can consume all available
> > workqueue threads.
> > 
> > swap-out most likely uses STABLE writes anyway as COND_STABLE
> > indicates
> > that a stable write should be used if the write fits in a single
> > request, and it normally does.  However if we ever swap with a small
> > wsize, or gather unusually large numbers of pages for a single write,
> > this might change.
> > 
> > For safety, make it explicit in the code that direct writes used for
> > swap
> > must always use FLUSH_COND_STABLE.
> 
> OK. Your explanation above has me extremely confused.
> 
> If you want to avoid commit, then you should be using FLUSH_STABLE,
> since that forces the writes to be synchronous. FLUSH_COND_STABLE can
> and will use unstable writes if it sees that there are more writes to
> come.
> 
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfs/direct.c |    7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> > index 43a956d7fd62..29c007b2a17a 100644
> > --- a/fs/nfs/direct.c
> > +++ b/fs/nfs/direct.c
> > @@ -791,7 +791,7 @@ static const struct nfs_pgio_completion_ops
> > nfs_direct_write_completion_ops = {
> >   */
> >  static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req
> > *dreq,
> >                                                struct iov_iter *iter,
> > -                                              loff_t pos)
> > +                                              loff_t pos, int
> > ioflags)
> >  {
> >         struct nfs_pageio_descriptor desc;
> >         struct inode *inode = dreq->inode;
> > @@ -799,7 +799,7 @@ static ssize_t
> > nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
> >         size_t requested_bytes = 0;
> >         size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize,
> > PAGE_SIZE);
> >  
> > -       nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
> > +       nfs_pageio_init_write(&desc, inode, ioflags, false,
> >                               &nfs_direct_write_completion_ops);
> >         desc.pg_dreq = dreq;
> >         get_dreq(dreq);
> > @@ -905,6 +905,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb,
> > struct iov_iter *iter,
> >         struct nfs_direct_req *dreq;
> >         struct nfs_lock_context *l_ctx;
> >         loff_t pos, end;
> > +       int ioflags = swap ? FLUSH_COND_STABLE : FLUSH_STABLE;
> 
> This is an unacceptable change in behaviour for the non-swap case, so
> NACK.
> 

Hi Trond,
 thanks for the review.
 You are right - I had that test exactly backwards.  I've fixed for the
 next version.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead
  2022-01-24  7:27   ` Christoph Hellwig
@ 2022-01-26 21:47     ` NeilBrown
  2022-01-26 23:09       ` Hugh Dickins
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2022-01-26 21:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, 24 Jan 2022, Christoph Hellwig wrote:
> On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> > Code that does swap read-ahead uses blk_start_plug() and
> > blk_finish_plug() to allow lower levels to combine multiple read-ahead
> > pages into a single request, but calls blk_finish_plug() *before*
> > submitting the original (non-ahead) read request.
> > This missed an opportunity to combine read requests.
> > 
> > This patch moves the blk_finish_plug to *after* all the reads.
> > This will likely combine the primary read with some of the "ahead"
> > reads, and that may slightly increase the latency of that read, but it
> > should more than make up for this by making more efficient use of the
> > storage path.
> > 
> > The patch mostly makes the code look more consistent.  Performance
> > change is unlikely to be noticeable.
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks.
> 
> > Fixes-no-auto-backport: 3fb5c298b04e ("swap: allow swap readahead to be merged")
> 
> Is this really a thing?
Maybe it should be.....
As I'm sure you guessed, I think it is valuable to record this
connection between commits, but I don't like it hasty automatic
backporting of patches where the (unknown) risk might exceed the (known)
value.  This is how I choose to state my displeasure.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24 13:16   ` Mark Hemment
@ 2022-01-26 22:04     ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-26 22:04 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Tue, 25 Jan 2022, Mark Hemment wrote:
> On Mon, 24 Jan 2022 at 03:52, NeilBrown <neilb@suse.de> wrote:
> >
> > swap_readpage() is given one page at a time, but maybe called repeatedly
> > in succession.
> > For block-device swapspace, the blk_plug functionality allows the
> > multiple pages to be combined together at lower layers.
> > That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> > only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
> > are single page reads.
> >
> > With this patch we pass in a pointer-to-pointer when swap_readpage can
> > store state between calls - much like the effect of blk_plug.  After
> > calling swap_readpage() some number of times, the state will be passed
> > to swap_read_unplug() which can submit the combined request.
> >
> > Some caller currently call blk_finish_plug() *before* the final call to
> > swap_readpage(), so the last page cannot be included.  This patch moves
> > blk_finish_plug() to after the last call, and calls swap_read_unplug()
> > there too.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  mm/madvise.c    |    8 +++-
> >  mm/memory.c     |    2 +
> >  mm/page_io.c    |  102 +++++++++++++++++++++++++++++++++++--------------------
> >  mm/swap.h       |   16 +++++++--
> >  mm/swap_state.c |   19 +++++++---
> >  5 files changed, 98 insertions(+), 49 deletions(-)
> >
> ...
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 6e32ca35d9b6..bcf655d650c8 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -390,46 +391,60 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
> >  static void sio_read_complete(struct kiocb *iocb, long ret)
> >  {
> >         struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
> > -       struct page *page = sio->bvec.bv_page;
> > -
> > -       if (ret != 0 && ret != PAGE_SIZE) {
> > -               SetPageError(page);
> > -               ClearPageUptodate(page);
> > -               pr_alert_ratelimited("Read-error on swap-device\n");
> > -       } else {
> > -               SetPageUptodate(page);
> > -               count_vm_event(PSWPIN);
> > +       int p;
> > +
> > +       for (p = 0; p < sio->pages; p++) {
> > +               struct page *page = sio->bvec[p].bv_page;
> > +               if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
> > +                       SetPageError(page);
> > +                       ClearPageUptodate(page);
> > +                       pr_alert_ratelimited("Read-error on swap-device\n");
> > +               } else {
> > +                       SetPageUptodate(page);
> > +                       count_vm_event(PSWPIN);
> > +               }
> > +               unlock_page(page);
> >         }
> > -       unlock_page(page);
> >         mempool_free(sio, sio_pool);
> >  }
> 
> Trivial: on success, could be single call to count_vm_events(PSWPIN,
> sio->pages).
> Similar comment for PSWPOUT in sio_write_complete()
> 
> >
> > -static int swap_readpage_fs(struct page *page)
> > +static void swap_readpage_fs(struct page *page,
> > +                            struct swap_iocb **plug)
> >  {
> >         struct swap_info_struct *sis = page_swap_info(page);
> > -       struct file *swap_file = sis->swap_file;
> > -       struct address_space *mapping = swap_file->f_mapping;
> > -       struct iov_iter from;
> > -       struct swap_iocb *sio;
> > +       struct swap_iocb *sio = NULL;
> >         loff_t pos = page_file_offset(page);
> > -       int ret;
> > -
> > -       sio = mempool_alloc(sio_pool, GFP_KERNEL);
> > -       init_sync_kiocb(&sio->iocb, swap_file);
> > -       sio->iocb.ki_pos = pos;
> > -       sio->iocb.ki_complete = sio_read_complete;
> > -       sio->bvec.bv_page = page;
> > -       sio->bvec.bv_len = PAGE_SIZE;
> > -       sio->bvec.bv_offset = 0;
> >
> > -       iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
> > -       ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> > -       if (ret != -EIOCBQUEUED)
> > -               sio_read_complete(&sio->iocb, ret);
> > -       return ret;
> > +       if (*plug)
> > +               sio = *plug;
> 
> 'plug' can be NULL when called from do_swap_page();
>         if (plug && *plug)

Thanks for catching that!  I actually want it to be
   if (plug)
       sio = *plug;

which nicely balances the
   if (plug)
       *plug = sio;
at the end of the function.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag
  2022-01-24  8:56   ` Christoph Hellwig
@ 2022-01-26 22:14     ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-26 22:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Mon, 24 Jan 2022, Christoph Hellwig wrote:
> On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> > Currently various places test if direct IO is possible on a file by
> > checking for the existence of the direct_IO address space operation.
> > This is a poor choice, as the direct_IO operation may not be used - it is
> > only used if the generic_file_*_iter functions are called for direct IO
> > and some filesystems - particularly NFS - don't do this.
> > 
> > Instead, introduce a new f_mode flag: FMODE_CAN_ODIRECT and change the
> > various places to check this (avoiding pointer dereferences).
> > do_dentry_open() will set this flag if ->direct_IO is present, so
> > filesystems do not need to be changed.
> > 
> > NFS *is* changed, to set the flag explicitly and discard the direct_IO
> > entry in the address_space_operations for files.
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> It would be nice to throw in a cleanup to remove noop_direct_IO as well.

I don't want to add this to the present series.  When it lands I'll send
patches to the various filesystems to switch to using FMODE_CAN_ODIRECT.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO
  2022-01-24 13:22   ` Mark Hemment
@ 2022-01-26 22:51     ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-26 22:51 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, linux-nfs,
	linux-mm, linux-kernel

On Tue, 25 Jan 2022, Mark Hemment wrote:
> On Mon, 24 Jan 2022 at 03:53, NeilBrown <neilb@suse.de> wrote:
> >
> > 1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
> >    possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
> >    eventuate if a buffered read on the swapfile was attempted.
> >
> >    We don't need coherence with the page cache for a swap file, and
> >    buffered writes are forbidden anyway.  There is no other need for
> >    i_rwsem during direct IO.  So never take it for swap_rw()
> >
> > 2/ generic_write_checks() explicitly forbids writes to swap, and
> >    performs checks that are not needed for swap.  So bypass it
> >    for swap_rw().
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfs/direct.c        |   30 +++++++++++++++++++++---------
> >  fs/nfs/file.c          |    4 ++--
> >  include/linux/nfs_fs.h |    4 ++--
> >  3 files changed, 25 insertions(+), 13 deletions(-)
> >
> ...
> > @@ -943,7 +954,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
> >                                               pos >> PAGE_SHIFT, end);
> >         }
> >
> > -       nfs_end_io_direct(inode);
> > +       if (!swap)
> > +               nfs_end_io_direct(inode);
> 
> Just above this code diff, there is;
>     if (mapping->nrpages) {
>         invalidate_inode_pages2_range(mapping,
>              pos >> PAGE_SHIFT, end);
>     }
> 
> This invalidation looks strange/wrong for a NFS swap write.  Should it
> be disabled for the swap case?

Yes, I think it should - particularly as we don't take the mutex in the
swap case.  Thanks!
This change improves the look of the code too :-)

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead
  2022-01-26 21:47     ` NeilBrown
@ 2022-01-26 23:09       ` Hugh Dickins
  2022-01-27  0:32         ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Hugh Dickins @ 2022-01-26 23:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, David Howells, linux-nfs, linux-mm,
	linux-kernel

On Thu, 27 Jan 2022, NeilBrown wrote:
> On Mon, 24 Jan 2022, Christoph Hellwig wrote:
> > On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> > > Code that does swap read-ahead uses blk_start_plug() and
> > > blk_finish_plug() to allow lower levels to combine multiple read-ahead
> > > pages into a single request, but calls blk_finish_plug() *before*
> > > submitting the original (non-ahead) read request.
> > > This missed an opportunity to combine read requests.

No, you're misunderstanding there.  All the necessary reads are issued
within the loop, between the plug and unplug: it does not skip over
the target page in the loop, but issues its read along with the rest.

But it has not kept any of those pages locked, nor even kept any
refcounts raised: so at the end has to look up the target page again
with the final read_swap_cache_async() (which also copes with the
highly unlikely case that the page got swapped out again meanwhile).

> > > 
> > > This patch moves the blk_finish_plug to *after* all the reads.
> > > This will likely combine the primary read with some of the "ahead"
> > > reads, and that may slightly increase the latency of that read, but it
> > > should more than make up for this by making more efficient use of the
> > > storage path.
> > > 
> > > The patch mostly makes the code look more consistent.  Performance
> > > change is unlikely to be noticeable.
> > 
> > Looks good:
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Thanks.
> > 
> > > Fixes-no-auto-backport: 3fb5c298b04e ("swap: allow swap readahead to be merged")
> > 
> > Is this really a thing?
> Maybe it should be.....
> As I'm sure you guessed, I think it is valuable to record this
> connection between commits, but I don't like it hasty automatic
> backporting of patches where the (unknown) risk might exceed the (known)
> value.  This is how I choose to state my displeasure.

I don't suppose your patch does any actual harm (beyond propagating a
misunderstanding), but it's certainly not a fix, and I think should
simply be dropped from the series.

(But please don't expect any comment from me on the rest:
SWP_FS_OPS has always been beyond my understanding.)

Hugh


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead
  2022-01-26 23:09       ` Hugh Dickins
@ 2022-01-27  0:32         ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2022-01-27  0:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, Chuck Lever,
	Andrew Morton, Mel Gorman, David Howells, linux-nfs, linux-mm,
	linux-kernel

On Thu, 27 Jan 2022, Hugh Dickins wrote:
> On Thu, 27 Jan 2022, NeilBrown wrote:
> > On Mon, 24 Jan 2022, Christoph Hellwig wrote:
> > > On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> > > > Code that does swap read-ahead uses blk_start_plug() and
> > > > blk_finish_plug() to allow lower levels to combine multiple read-ahead
> > > > pages into a single request, but calls blk_finish_plug() *before*
> > > > submitting the original (non-ahead) read request.
> > > > This missed an opportunity to combine read requests.
> 
> No, you're misunderstanding there.  All the necessary reads are issued
> within the loop, between the plug and unplug: it does not skip over
> the target page in the loop, but issues its read along with the rest.
> 
> But it has not kept any of those pages locked, nor even kept any
> refcounts raised: so at the end has to look up the target page again
> with the final read_swap_cache_async() (which also copes with the
> highly unlikely case that the page got swapped out again meanwhile).
> 
....
> 
> I don't suppose your patch does any actual harm (beyond propagating a
> misunderstanding), but it's certainly not a fix, and I think should
> simply be dropped from the series.

Thanks - I had missed that.  The code is correct, but looks wrong (to
me).
I've dropped the patch, but added a comment when I add
"swap_read_unplug()" to explain while plugging isn't needed for that
final read_swap_cache_async().

> 
> (But please don't expect any comment from me on the rest:
> SWP_FS_OPS has always been beyond my understanding.)

:-)

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 01/23] MM: create new mm/swap.h header file.
  2022-01-24  3:48 ` [PATCH 01/23] MM: create new mm/swap.h header file NeilBrown
@ 2022-02-07 13:51   ` Geert Uytterhoeven
  0 siblings, 0 replies; 53+ messages in thread
From: Geert Uytterhoeven @ 2022-02-07 13:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, open list:NFS,
	SUNRPC, AND...,
	Linux MM, Linux Kernel Mailing List

Hi Neil,

On Mon, Jan 24, 2022 at 5:26 PM NeilBrown <neilb@suse.de> wrote:
> Many functions declared in include/linux/swap.h are only used within mm/
>
> Create a new "mm/swap.h" and move some of these declarations there.
> Remove the redundant 'extern' from the function declarations.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: NeilBrown <neilb@suse.de>

Thanks for your patch!

mm/huge_memory.c: In function ‘__split_huge_page’:
mm/huge_memory.c:2423:16: error: implicit declaration of function
‘swap_address_space’; did you mean ‘exit_swap_address_space’?
[-Werror=implicit-function-declaration]
 2423 |   swap_cache = swap_address_space(entry);
      |                ^~~~~~~~~~~~~~~~~~
      |                exit_swap_address_space

mm/huge_memory.c needs to #include "swap.h", too.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/23 V3] Repair SWAP-over_NFS
  2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
                   ` (22 preceding siblings ...)
  2022-01-24  3:48 ` [PATCH 13/23] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
@ 2022-02-07 17:55 ` Geert Uytterhoeven
  23 siblings, 0 replies; 53+ messages in thread
From: Geert Uytterhoeven @ 2022-02-07 17:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, open list:NFS,
	SUNRPC, AND...,
	Linux MM, Linux Kernel Mailing List

Hi Neil,

On Mon, Jan 24, 2022 at 5:40 PM NeilBrown <neilb@suse.de> wrote:
> swap-over-NFS currently has a variety of problems.
>
> swap writes call generic_write_checks(), which always fails on a swap
> file, so it completely fails.
> Even without this, various deadlocks are possible - largely due to
> improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> which weren't tested against swap-out.
>
> NFS is the only filesystem that has supported fs-based swap IO, and it
> hasn't worked for several releases, so now is a convenient time to clean
> up the swap-via-filesystem interfaces - we cannot break anything !
>
> So the first few patches here clean up and improve various parts of the
> swap-via-filesystem code.  ->activate_swap() is given a cleaner
> interface, a new ->swap_rw is introduced instead of burdening
> ->direct_IO, etc.
>
> Current swap-to-filesystem code only ever submits single-page reads and
> writes.  These patches change that to allow multi-page IO when adjacent
> requests are submitted.  Writes are also changed to be async rather than
> sync.  This substantially speeds up write throughput for swap-over-NFS.
>
> Some of the NFS patches can land independently of the MM patches.  A few
> require the MM patches to land first.

Thanks for your series!
Swap over NFS was indeed broken last time I tried[1], but with your
series, it's working again on arm32 (RZ/A1 with 32 MiB of RAM, 100Mbps
Ethernet and Debian 9 nfsroot). My system was exercised using "apt
update", and the subsequent "apt upgrade" is still running, though
(it took more than 6 hours to build the apt dependency tree, now it's
trying hard to create a list of packages...).

Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>

BTW, I think you do want to run scripts/checkpatch.pl on your series,
and improve it by fixing a few of the reported warnings (function
definition arguments should also have an identifier name, missing
data_race() comment, missing SPDX-License-Identifier tag).

[1] https://lore.kernel.org/all/20191230153238.29878-1-geert+renesas@glider.be/

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space
  2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
                     ` (3 preceding siblings ...)
  2022-01-24 13:16   ` Mark Hemment
@ 2022-02-08 11:07   ` Geert Uytterhoeven
  4 siblings, 0 replies; 53+ messages in thread
From: Geert Uytterhoeven @ 2022-02-08 11:07 UTC (permalink / raw)
  To: NeilBrown
  Cc: Trond Myklebust, Anna Schumaker, Chuck Lever, Andrew Morton,
	Mel Gorman, Christoph Hellwig, David Howells, open list:NFS,
	SUNRPC, AND...,
	Linux MM, Linux Kernel Mailing List

Hi Neil,

On Mon, Jan 24, 2022 at 5:49 PM NeilBrown <neilb@suse.de> wrote:
> swap_readpage() is given one page at a time, but maybe called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y.  Consequently all swap reads over NFS
> are single page reads.
>
> With this patch we pass in a pointer-to-pointer when swap_readpage can
> store state between calls - much like the effect of blk_plug.  After
> calling swap_readpage() some number of times, the state will be passed
> to swap_read_unplug() which can submit the combined request.
>
> Some caller currently call blk_finish_plug() *before* the final call to
> swap_readpage(), so the last page cannot be included.  This patch moves
> blk_finish_plug() to after the last call, and calls swap_read_unplug()
> there too.
>
> Signed-off-by: NeilBrown <neilb@suse.de>

Thanks for your patch!

> --- a/mm/swap.h
> +++ b/mm/swap.h

> @@ -53,7 +62,8 @@ static inline unsigned int page_swap_flags(struct page *page)
>         return page_swap_info(page)->flags;
>  }
>  #else /* CONFIG_SWAP */
> -static inline int swap_readpage(struct page *page, bool do_poll)
> +static inline int swap_readpage(struct page *page, bool do_poll,
> +                               struct swap_iocb **plug);

Bogus semicolon.

>  {
>         return 0;
>  }

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2022-02-08 11:07 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-24  3:48 [PATCH 00/23 V3] Repair SWAP-over_NFS NeilBrown
2022-01-24  3:48 ` [PATCH 05/23] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space NeilBrown
2022-01-24  7:31   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 03/23] MM: drop swap_set_page_dirty NeilBrown
2022-01-24  7:28   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 14/23] NFS: swap IO handling is slightly different for O_DIRECT IO NeilBrown
2022-01-24  8:58   ` Christoph Hellwig
2022-01-24 13:22   ` Mark Hemment
2022-01-26 22:51     ` NeilBrown
2022-01-24  3:48 ` [PATCH 22/23] NFS: swap-out must always use STABLE writes NeilBrown
2022-01-26  3:45   ` Trond Myklebust
2022-01-26 21:42     ` NeilBrown
2022-01-24  3:48 ` [PATCH 23/23] SUNRPC: lock against ->sock changing during sysfs read NeilBrown
2022-01-24  3:48 ` [PATCH 08/23] DOC: update documentation for swap_activate and swap_rw NeilBrown
2022-01-24  8:50   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 07/23] MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw NeilBrown
2022-01-24  8:49   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 02/23] MM: extend block-plugging to cover all swap reads with read-ahead NeilBrown
2022-01-24  7:27   ` Christoph Hellwig
2022-01-26 21:47     ` NeilBrown
2022-01-26 23:09       ` Hugh Dickins
2022-01-27  0:32         ` NeilBrown
2022-01-24  3:48 ` [PATCH 16/23] SUNRPC/auth: async tasks mustn't block waiting for memory NeilBrown
2022-01-24  3:48 ` [PATCH 04/23] MM: move responsibility for setting SWP_FS_OPS to ->swap_activate NeilBrown
2022-01-24  7:30   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 06/23] MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space NeilBrown
2022-01-24  8:48   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 15/23] SUNRPC/call_alloc: async tasks mustn't block waiting for memory NeilBrown
2022-01-24  3:48 ` [PATCH 20/23] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC NeilBrown
2022-01-24  3:48 ` [PATCH 01/23] MM: create new mm/swap.h header file NeilBrown
2022-02-07 13:51   ` Geert Uytterhoeven
2022-01-24  3:48 ` [PATCH 09/23] MM: submit multipage reads for SWP_FS_OPS swap-space NeilBrown
2022-01-24  8:25   ` kernel test robot
2022-01-24  8:52   ` Christoph Hellwig
2022-01-24  9:27   ` kernel test robot
2022-01-24 13:16   ` Mark Hemment
2022-01-26 22:04     ` NeilBrown
2022-02-08 11:07   ` Geert Uytterhoeven
2022-01-24  3:48 ` [PATCH 12/23] NFS: remove IS_SWAPFILE hack NeilBrown
2022-01-24  8:56   ` Christoph Hellwig
2022-01-24  3:48 ` [PATCH 19/23] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS NeilBrown
2022-01-24  3:48 ` [PATCH 17/23] SUNRPC/xprt: async tasks mustn't block waiting for memory NeilBrown
2022-01-24  3:48 ` [PATCH 18/23] SUNRPC: remove scheduling boost for "SWAPPER" tasks NeilBrown
2022-01-24  3:48 ` [PATCH 21/23] NFSv4: keep state manager thread active if swap is enabled NeilBrown
2022-01-24  3:48 ` [PATCH 11/23] VFS: Add FMODE_CAN_ODIRECT file flag NeilBrown
2022-01-24  8:56   ` Christoph Hellwig
2022-01-26 22:14     ` NeilBrown
2022-01-24  3:48 ` [PATCH 10/23] MM: submit multipage write for SWP_FS_OPS swap-space NeilBrown
2022-01-24  8:55   ` Christoph Hellwig
2022-01-24 10:29   ` kernel test robot
2022-01-24  3:48 ` [PATCH 13/23] NFS: rename nfs_direct_IO and use as ->swap_rw NeilBrown
2022-01-24  8:57   ` Christoph Hellwig
2022-02-07 17:55 ` [PATCH 00/23 V3] Repair SWAP-over_NFS Geert Uytterhoeven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).