* [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
@ 2021-02-15 15:44 David Howells
  2021-02-15 15:44 ` [PATCH 01/33] iov_iter: Add ITER_XARRAY David Howells
                   ` (14 more replies)
  0 siblings, 15 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:44 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: linux-cifs, ceph-devel, Jeff Layton, Matthew Wilcox (Oracle),
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, dhowells, David Wysochanski, linux-kernel


Here's a set of patches to do two things:

 (1) Add a helper library to handle the new VM readahead interface.  This
     is intended to be used unconditionally by the filesystem (whether or
     not caching is enabled) and provides a common framework for doing
     caching, transparent huge pages and, in the future, possibly fscrypt
     and read bandwidth maximisation.  It also allows the netfs and the
     cache to align, expand and slice up a read request from the VM in
     various ways; the netfs need only provide a function to read a stretch
     of data into the pagecache and the helper takes care of the rest.

 (2) Add an alternative fscache/cachefiles I/O API that uses the kiocb
     facility to do async DIO to transfer data to/from the netfs's pages,
     rather than using readpage with wait queue snooping on one side and
     vfs_write() on the other.  It also uses less memory, since it doesn't
     do buffered I/O on the backing file.

     Note that this uses SEEK_HOLE/SEEK_DATA to locate the data available
     to be read from the cache.  Whilst this is an improvement over the
     bmap interface, it still has a problem with regard to a modern
     extent-based filesystem inserting or removing bridging blocks of
     zeros.  Fixing that requires a much greater overhaul.
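
To make the shape of (2) concrete, here's a minimal sketch (not code from
the series; the example_* names are invented) of a kiocb-based async DIO
read from the backing file into the netfs's pages, using the ITER_XARRAY
iterator added by the first patch:

	/* Completion handler, invoked if the DIO read goes asynchronous. */
	static void example_read_done(struct kiocb *iocb, long ret, long ret2)
	{
		/* Tell the read helper that the transfer has finished. */
	}

	static ssize_t example_cache_read(struct file *backing_file,
					  struct address_space *mapping,
					  loff_t pos, size_t len)
	{
		struct kiocb iocb = {
			.ki_filp	= backing_file,
			.ki_pos		= pos,
			.ki_flags	= IOCB_DIRECT,
			.ki_complete	= example_read_done,
		};
		struct iov_iter iter;

		/* Read from the cache file straight into the pagecache
		 * pages of the netfs inode covering [pos, pos + len).
		 */
		iov_iter_xarray(&iter, READ, &mapping->i_pages, pos, len);
		return call_read_iter(backing_file, &iocb, &iter);
	}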

This is a step towards overhauling the fscache API.  The change is opt-in
on the part of the network filesystem.  A netfs should not try to mix the
old and the new API because of conflicting ways of handling pages and the
PG_fscache page flag and because it would be mixing DIO with buffered I/O.
Further, the helper library can't be used with the old API.

This does not change any of the fscache cookie handling APIs or the way
invalidation is done.

In the near term, I intend to deprecate and remove the old I/O API
(fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
fscache_write_page() and fscache_uncache_page()) and eventually replace
most of fscache/cachefiles with something simpler and easier to follow.

The patchset contains five parts:

 (1) Some helper patches, including provision of an ITER_XARRAY iov
     iterator and a function to do readahead expansion.

 (2) Patches to add the netfs helper library.

 (3) A patch to add the fscache/cachefiles kiocb API.

 (4) Patches to add support in AFS for this.

 (5) Patches from Jeff Layton to add support in Ceph for this.

Dave Wysochanski also has patches for NFS for this, though they're not
included in this branch as there's an issue with pNFS.

With this, AFS without a cache passes all expected xfstests; with a cache,
there's an extra failure, but that's also there before these patches.
Fixing that probably requires a greater overhaul.  Ceph and NFS also pass
the expected tests.

These patches can be found also on:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-netfs-lib

For diffing reference, the tag for the 9th Feb pull request is
fscache-ioapi-20210203 and can be found in the same repository.



Changes
=======

 (v3) Rolled in the bug fixes.

      Adjusted the functions that unlock and wait for PG_fscache according
      to Linus's suggestion.

      Hold a ref on a page when PG_fscache is set as per Linus's
      suggestion.

      Dropped NFS support and added Ceph support.

 (v2) Fixed some bugs and added NFS support.


References
==========

These patches have been published for review before, firstly as part of a
larger set:

Link: https://lore.kernel.org/linux-fsdevel/158861203563.340223.7585359869938129395.stgit@warthog.procyon.org.uk/

Link: https://lore.kernel.org/linux-fsdevel/159465766378.1376105.11619976251039287525.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/159465821598.1377938.2046362270225008168.stgit@warthog.procyon.org.uk/

Link: https://lore.kernel.org/linux-fsdevel/160588455242.3465195.3214733858273019178.stgit@warthog.procyon.org.uk/

Then as a cut-down set:

Link: https://lore.kernel.org/linux-fsdevel/161118128472.1232039.11746799833066425131.stgit@warthog.procyon.org.uk/

Link: https://lore.kernel.org/linux-fsdevel/161161025063.2537118.2009249444682241405.stgit@warthog.procyon.org.uk/


Proposals/information about the design have been published here:

Link: https://lore.kernel.org/lkml/24942.1573667720@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/2758811.1610621106@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/1441311.1598547738@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/160655.1611012999@warthog.procyon.org.uk/

And requests for information:

Link: https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/4467.1579020509@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/3577430.1579705075@warthog.procyon.org.uk/

The NFS parts, though not included here, have been tested by someone who's
using fscache in production:

Link: https://listman.redhat.com/archives/linux-cachefs/2020-December/msg00000.html

I've posted partial patches to try and help 9p and cifs along:

Link: https://lore.kernel.org/linux-fsdevel/1514086.1605697347@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-cifs/1794123.1605713481@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/241017.1612263863@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-cifs/270998.1612265397@warthog.procyon.org.uk/

David
---
David Howells (27):
      iov_iter: Add ITER_XARRAY
      mm: Add an unlock function for PG_private_2/PG_fscache
      mm: Implement readahead_control pageset expansion
      vfs: Export rw_verify_area() for use by cachefiles
      netfs: Make a netfs helper module
      netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
      netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache()
      netfs: Provide readahead and readpage netfs helpers
      netfs: Add tracepoints
      netfs: Gather stats
      netfs: Add write_begin helper
      netfs: Define an interface to talk to a cache
      netfs: Hold a ref on a page when PG_private_2 is set
      fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
      afs: Disable use of the fscache I/O routines
      afs: Pass page into dirty region helpers to provide THP size
      afs: Print the operation debug_id when logging an unexpected data version
      afs: Move key to afs_read struct
      afs: Don't truncate iter during data fetch
      afs: Log remote unmarshalling errors
      afs: Set up the iov_iter before calling afs_extract_data()
      afs: Use ITER_XARRAY for writing
      afs: Wait on PG_fscache before modifying/releasing a page
      afs: Extract writeback extension into its own function
      afs: Prepare for use of THPs
      afs: Use the fs operation ops to handle FetchData completion
      afs: Use new fscache read helper API

Jeff Layton (6):
      ceph: disable old fscache readpage handling
      ceph: rework PageFsCache handling
      ceph: fix fscache invalidation
      ceph: convert readpage to fscache read helper
      ceph: plug write_begin into read helper
      ceph: convert ceph_readpages to ceph_readahead


 fs/Kconfig                    |    1 +
 fs/Makefile                   |    1 +
 fs/afs/Kconfig                |    1 +
 fs/afs/dir.c                  |  225 ++++---
 fs/afs/file.c                 |  470 ++++---------
 fs/afs/fs_operation.c         |    4 +-
 fs/afs/fsclient.c             |  108 +--
 fs/afs/inode.c                |    7 +-
 fs/afs/internal.h             |   58 +-
 fs/afs/rxrpc.c                |  150 ++---
 fs/afs/write.c                |  610 +++++++++--------
 fs/afs/yfsclient.c            |   82 +--
 fs/cachefiles/Makefile        |    1 +
 fs/cachefiles/interface.c     |    5 +-
 fs/cachefiles/internal.h      |    9 +
 fs/cachefiles/rdwr2.c         |  412 ++++++++++++
 fs/ceph/Kconfig               |    1 +
 fs/ceph/addr.c                |  535 ++++++---------
 fs/ceph/cache.c               |  125 ----
 fs/ceph/cache.h               |  101 +--
 fs/ceph/caps.c                |   10 +-
 fs/ceph/inode.c               |    1 +
 fs/ceph/super.h               |    1 +
 fs/fscache/Kconfig            |    1 +
 fs/fscache/Makefile           |    3 +-
 fs/fscache/internal.h         |    3 +
 fs/fscache/page.c             |    2 +-
 fs/fscache/page2.c            |  117 ++++
 fs/fscache/stats.c            |    1 +
 fs/internal.h                 |    5 -
 fs/netfs/Kconfig              |   23 +
 fs/netfs/Makefile             |    5 +
 fs/netfs/internal.h           |   97 +++
 fs/netfs/read_helper.c        | 1169 +++++++++++++++++++++++++++++++++
 fs/netfs/stats.c              |   59 ++
 fs/read_write.c               |    1 +
 include/linux/fs.h            |    1 +
 include/linux/fscache-cache.h |    4 +
 include/linux/fscache.h       |   40 +-
 include/linux/netfs.h         |  195 ++++++
 include/linux/pagemap.h       |    3 +
 include/net/af_rxrpc.h        |    2 +-
 include/trace/events/afs.h    |   74 +--
 include/trace/events/netfs.h  |  201 ++++++
 mm/filemap.c                  |   20 +
 mm/readahead.c                |   70 ++
 net/rxrpc/recvmsg.c           |    9 +-
 47 files changed, 3473 insertions(+), 1550 deletions(-)
 create mode 100644 fs/cachefiles/rdwr2.c
 create mode 100644 fs/fscache/page2.c
 create mode 100644 fs/netfs/Kconfig
 create mode 100644 fs/netfs/Makefile
 create mode 100644 fs/netfs/internal.h
 create mode 100644 fs/netfs/read_helper.c
 create mode 100644 fs/netfs/stats.c
 create mode 100644 include/linux/netfs.h
 create mode 100644 include/trace/events/netfs.h





* [PATCH 01/33] iov_iter: Add ITER_XARRAY
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
@ 2021-02-15 15:44 ` David Howells
  2021-02-15 15:44 ` [PATCH 02/33] mm: Add an unlock function for PG_private_2/PG_fscache David Howells
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:44 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Alexander Viro, Matthew Wilcox (Oracle), Christoph Hellwig,
	linux-mm, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, dhowells,
	Jeff Layton, David Wysochanski, linux-kernel

Add an iterator, ITER_XARRAY, that walks through a set of pages attached to
an xarray, starting at a given page and offset and walking for the
specified number of bytes.  The iterator supports transparent huge pages.

The caller must guarantee that the pages are all present and they must be
locked using PG_locked, PG_writeback or PG_fscache to prevent them from
going away or being migrated whilst they're being accessed.

This is useful for copying data from socket buffers to inodes in network
filesystems and for transferring data between those inodes and the cache
using direct I/O.

Whilst it is true that ITER_BVEC could be used instead, that would require
a bio_vec array to be allocated to refer to all the pages - which should be
redundant if inode->i_pages also points to all these pages.
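
For illustration (a sketch only, assuming the caller has already made the
pages present and pinned them), setting up and filling such an iterator
might look like:

	struct iov_iter iter;
	size_t copied;

	/* Describe the inode's pagecache pages covering [pos, pos + len)
	 * as the destination of a read-type transfer.
	 */
	iov_iter_xarray(&iter, READ, &inode->i_mapping->i_pages, pos, len);

	/* Splice data (say, unmarshalled from a socket buffer) into them. */
	copied = copy_to_iter(buffer, len, &iter);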

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 include/linux/uio.h |   11 ++
 lib/iov_iter.c      |  313 +++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 301 insertions(+), 23 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 72d88566694e..08b186df54ac 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -10,6 +10,7 @@
 #include <uapi/linux/uio.h>
 
 struct page;
+struct address_space;
 struct pipe_inode_info;
 
 struct kvec {
@@ -24,6 +25,7 @@ enum iter_type {
 	ITER_BVEC = 16,
 	ITER_PIPE = 32,
 	ITER_DISCARD = 64,
+	ITER_XARRAY = 128,
 };
 
 struct iov_iter {
@@ -39,6 +41,7 @@ struct iov_iter {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
+		struct xarray *xarray;
 		struct pipe_inode_info *pipe;
 	};
 	union {
@@ -47,6 +50,7 @@ struct iov_iter {
 			unsigned int head;
 			unsigned int start_head;
 		};
+		loff_t xarray_start;
 	};
 };
 
@@ -80,6 +84,11 @@ static inline bool iov_iter_is_discard(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_DISCARD;
 }
 
+static inline bool iov_iter_is_xarray(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_XARRAY;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
 	return i->type & (READ | WRITE);
@@ -221,6 +230,8 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_
 void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
 			size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
+void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
+		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index a21e6a5792c5..f53a57588489 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -78,7 +78,44 @@
 	}						\
 }
 
-#define iterate_all_kinds(i, n, v, I, B, K) {			\
+#define iterate_xarray(i, n, __v, skip, STEP) {		\
+	struct page *head = NULL;				\
+	size_t wanted = n, seg, offset;				\
+	loff_t start = i->xarray_start + skip;			\
+	pgoff_t index = start >> PAGE_SHIFT;			\
+	int j;							\
+								\
+	XA_STATE(xas, i->xarray, index);			\
+								\
+	rcu_read_lock();						\
+	xas_for_each(&xas, head, ULONG_MAX) {				\
+		if (xas_retry(&xas, head))				\
+			continue;					\
+		if (WARN_ON(xa_is_value(head)))				\
+			break;						\
+		if (WARN_ON(PageHuge(head)))				\
+			break;						\
+		for (j = (head->index < index) ? index - head->index : 0; \
+		     j < thp_nr_pages(head); j++) {			\
+			__v.bv_page = head + j;				\
+			offset = (i->xarray_start + skip) & ~PAGE_MASK;	\
+			seg = PAGE_SIZE - offset;			\
+			__v.bv_offset = offset;				\
+			__v.bv_len = min(n, seg);			\
+			(void)(STEP);					\
+			n -= __v.bv_len;				\
+			skip += __v.bv_len;				\
+			if (n == 0)					\
+				break;					\
+		}							\
+		if (n == 0)						\
+			break;						\
+	}							\
+	rcu_read_unlock();					\
+	n = wanted - n;						\
+}
+
+#define iterate_all_kinds(i, n, v, I, B, K, X) {		\
 	if (likely(n)) {					\
 		size_t skip = i->iov_offset;			\
 		if (unlikely(i->type & ITER_BVEC)) {		\
@@ -90,6 +127,9 @@
 			struct kvec v;				\
 			iterate_kvec(i, n, v, kvec, skip, (K))	\
 		} else if (unlikely(i->type & ITER_DISCARD)) {	\
+		} else if (unlikely(i->type & ITER_XARRAY)) {	\
+			struct bio_vec v;			\
+			iterate_xarray(i, n, v, skip, (X));	\
 		} else {					\
 			const struct iovec *iov;		\
 			struct iovec v;				\
@@ -98,7 +138,7 @@
 	}							\
 }
 
-#define iterate_and_advance(i, n, v, I, B, K) {			\
+#define iterate_and_advance(i, n, v, I, B, K, X) {		\
 	if (unlikely(i->count < n))				\
 		n = i->count;					\
 	if (i->count) {						\
@@ -123,6 +163,9 @@
 			i->kvec = kvec;				\
 		} else if (unlikely(i->type & ITER_DISCARD)) {	\
 			skip += n;				\
+		} else if (unlikely(i->type & ITER_XARRAY)) {	\
+			struct bio_vec v;			\
+			iterate_xarray(i, n, v, skip, (X))	\
 		} else {					\
 			const struct iovec *iov;		\
 			struct iovec v;				\
@@ -636,7 +679,9 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 		copyout(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
 		memcpy_to_page(v.bv_page, v.bv_offset,
 			       (from += v.bv_len) - v.bv_len, v.bv_len),
-		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
+		memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
+		memcpy_to_page(v.bv_page, v.bv_offset,
+			       (from += v.bv_len) - v.bv_len, v.bv_len)
 	)
 
 	return bytes;
@@ -752,6 +797,16 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 			bytes = curr_addr - s_addr - rem;
 			return bytes;
 		}
+		}),
+		({
+		rem = copy_mc_to_page(v.bv_page, v.bv_offset,
+				      (from += v.bv_len) - v.bv_len, v.bv_len);
+		if (rem) {
+			curr_addr = (unsigned long) from;
+			bytes = curr_addr - s_addr - rem;
+			rcu_read_unlock();
+			return bytes;
+		}
 		})
 	)
 
@@ -773,7 +828,9 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 		copyin((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -799,7 +856,9 @@ bool _copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i)
 		0;}),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	iov_iter_advance(i, bytes);
@@ -819,7 +878,9 @@ size_t _copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 					 v.iov_base, v.iov_len),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -854,7 +915,9 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
 		memcpy_page_flushcache((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
 		memcpy_flushcache((to += v.iov_len) - v.iov_len, v.iov_base,
-			v.iov_len)
+			v.iov_len),
+		memcpy_page_flushcache((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -878,7 +941,9 @@ bool _copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 		0;}),
 		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 
 	iov_iter_advance(i, bytes);
@@ -915,7 +980,7 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 {
 	if (unlikely(!page_copy_sane(page, offset, bytes)))
 		return 0;
-	if (i->type & (ITER_BVEC|ITER_KVEC)) {
+	if (i->type & (ITER_BVEC | ITER_KVEC | ITER_XARRAY)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
@@ -938,7 +1003,7 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 		WARN_ON(1);
 		return 0;
 	}
-	if (i->type & (ITER_BVEC|ITER_KVEC)) {
+	if (i->type & (ITER_BVEC | ITER_KVEC | ITER_XARRAY)) {
 		void *kaddr = kmap_atomic(page);
 		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
@@ -982,7 +1047,8 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 	iterate_and_advance(i, bytes, v,
 		clear_user(v.iov_base, v.iov_len),
 		memzero_page(v.bv_page, v.bv_offset, v.bv_len),
-		memset(v.iov_base, 0, v.iov_len)
+		memset(v.iov_base, 0, v.iov_len),
+		memzero_page(v.bv_page, v.bv_offset, v.bv_len)
 	)
 
 	return bytes;
@@ -1006,7 +1072,9 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 		copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
 		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
 				 v.bv_offset, v.bv_len),
-		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+		memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
+		memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+				 v.bv_offset, v.bv_len)
 	)
 	kunmap_atomic(kaddr);
 	return bytes;
@@ -1077,7 +1145,12 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		i->count -= size;
 		return;
 	}
-	iterate_and_advance(i, size, v, 0, 0, 0)
+	if (unlikely(iov_iter_is_xarray(i))) {
+		i->iov_offset += size;
+		i->count -= size;
+		return;
+	}
+	iterate_and_advance(i, size, v, 0, 0, 0, 0)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
@@ -1121,7 +1194,12 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 		return;
 	}
 	unroll -= i->iov_offset;
-	if (iov_iter_is_bvec(i)) {
+	if (iov_iter_is_xarray(i)) {
+		BUG(); /* We should never go beyond the start of the specified
+			* range since we might then be straying into pages that
+			* aren't pinned.
+			*/
+	} else if (iov_iter_is_bvec(i)) {
 		const struct bio_vec *bvec = i->bvec;
 		while (1) {
 			size_t n = (--bvec)->bv_len;
@@ -1158,9 +1236,9 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 		return i->count;	// it is a silly place, anyway
 	if (i->nr_segs == 1)
 		return i->count;
-	if (unlikely(iov_iter_is_discard(i)))
+	if (unlikely(iov_iter_is_discard(i) || iov_iter_is_xarray(i)))
 		return i->count;
-	else if (iov_iter_is_bvec(i))
+	if (iov_iter_is_bvec(i))
 		return min(i->count, i->bvec->bv_len - i->iov_offset);
 	else
 		return min(i->count, i->iov->iov_len - i->iov_offset);
@@ -1208,6 +1286,31 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_pipe);
 
+/**
+ * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @xarray: The xarray to access.
+ * @start: The start file position.
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to either draw data out of the pages attached to an
+ * inode or to inject data into those pages.  The pages *must* be prevented
+ * from evaporation, either by taking a ref on them or locking them by the
+ * caller.
+ */
+void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
+		     struct xarray *xarray, loff_t start, size_t count)
+{
+	BUG_ON(direction & ~1);
+	i->type = ITER_XARRAY | (direction & (READ | WRITE));
+	i->xarray = xarray;
+	i->xarray_start = start;
+	i->count = count;
+	i->iov_offset = 0;
+}
+EXPORT_SYMBOL(iov_iter_xarray);
+
 /**
  * iov_iter_discard - Initialise an I/O iterator that discards data
  * @i: The iterator to initialise.
@@ -1241,7 +1344,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	iterate_all_kinds(i, size, v,
 		(res |= (unsigned long)v.iov_base | v.iov_len, 0),
 		res |= v.bv_offset | v.bv_len,
-		res |= (unsigned long)v.iov_base | v.iov_len
+		res |= (unsigned long)v.iov_base | v.iov_len,
+		res |= v.bv_offset | v.bv_len
 	)
 	return res;
 }
@@ -1263,7 +1367,9 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 		(res |= (!res ? 0 : (unsigned long)v.bv_offset) |
 			(size != v.bv_len ? size : 0)),
 		(res |= (!res ? 0 : (unsigned long)v.iov_base) |
-			(size != v.iov_len ? size : 0))
+			(size != v.iov_len ? size : 0)),
+		(res |= (!res ? 0 : (unsigned long)v.bv_offset) |
+			(size != v.bv_len ? size : 0))
 		);
 	return res;
 }
@@ -1313,6 +1419,75 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
 }
 
+static ssize_t iter_xarray_copy_pages(struct page **pages, struct xarray *xa,
+				       pgoff_t index, unsigned int nr_pages)
+{
+	XA_STATE(xas, xa, index);
+	struct page *page;
+	unsigned int ret = 0;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		pages[ret] = find_subpage(page, xas.xa_index);
+		get_page(pages[ret]);
+		if (++ret == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+static ssize_t iter_xarray_get_pages(struct iov_iter *i,
+				     struct page **pages, size_t maxsize,
+				     unsigned maxpages, size_t *_start_offset)
+{
+	unsigned nr, offset;
+	pgoff_t index, count;
+	size_t size = maxsize, actual;
+	loff_t pos;
+
+	if (!size || !maxpages)
+		return 0;
+
+	pos = i->xarray_start + i->iov_offset;
+	index = pos >> PAGE_SHIFT;
+	offset = pos & ~PAGE_MASK;
+	*_start_offset = offset;
+
+	count = 1;
+	if (size > PAGE_SIZE - offset) {
+		size -= PAGE_SIZE - offset;
+		count += size >> PAGE_SHIFT;
+		size &= ~PAGE_MASK;
+		if (size)
+			count++;
+	}
+
+	if (count > maxpages)
+		count = maxpages;
+
+	nr = iter_xarray_copy_pages(pages, i->xarray, index, count);
+	if (nr == 0)
+		return 0;
+
+	actual = PAGE_SIZE * nr;
+	actual -= offset;
+	if (nr == count && size > 0) {
+		unsigned last_offset = (nr > 1) ? 0 : offset;
+		actual -= PAGE_SIZE - (last_offset + size);
+	}
+	return actual;
+}
+
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
@@ -1322,6 +1497,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
 	if (unlikely(iov_iter_is_pipe(i)))
 		return pipe_get_pages(i, pages, maxsize, maxpages, start);
+	if (unlikely(iov_iter_is_xarray(i)))
+		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	if (unlikely(iov_iter_is_discard(i)))
 		return -EFAULT;
 
@@ -1348,7 +1525,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return v.bv_len;
 	}),({
 		return -EFAULT;
-	})
+	}),
+	0
 	)
 	return 0;
 }
@@ -1392,6 +1570,51 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	return n;
 }
 
+static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   size_t *_start_offset)
+{
+	struct page **p;
+	unsigned nr, offset;
+	pgoff_t index, count;
+	size_t size = maxsize, actual;
+	loff_t pos;
+
+	if (!size)
+		return 0;
+
+	pos = i->xarray_start + i->iov_offset;
+	index = pos >> PAGE_SHIFT;
+	offset = pos & ~PAGE_MASK;
+	*_start_offset = offset;
+
+	count = 1;
+	if (size > PAGE_SIZE - offset) {
+		size -= PAGE_SIZE - offset;
+		count += size >> PAGE_SHIFT;
+		size &= ~PAGE_MASK;
+		if (size)
+			count++;
+	}
+
+	p = get_pages_array(count);
+	if (!p)
+		return -ENOMEM;
+	*pages = p;
+
+	nr = iter_xarray_copy_pages(p, i->xarray, index, count);
+	if (nr == 0)
+		return 0;
+
+	actual = PAGE_SIZE * nr;
+	actual -= offset;
+	if (nr == count && size > 0) {
+		unsigned last_offset = (nr > 1) ? 0 : offset;
+		actual -= PAGE_SIZE - (last_offset + size);
+	}
+	return actual;
+}
+
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -1403,6 +1626,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 
 	if (unlikely(iov_iter_is_pipe(i)))
 		return pipe_get_pages_alloc(i, pages, maxsize, start);
+	if (unlikely(iov_iter_is_xarray(i)))
+		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
 	if (unlikely(iov_iter_is_discard(i)))
 		return -EFAULT;
 
@@ -1435,7 +1660,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return v.bv_len;
 	}),({
 		return -EFAULT;
-	})
+	}), 0
 	)
 	return 0;
 }
@@ -1473,6 +1698,13 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 				      v.iov_base, v.iov_len,
 				      sum, off);
 		off += v.iov_len;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		sum = csum_and_memcpy((to += v.bv_len) - v.bv_len,
+				      p + v.bv_offset, v.bv_len,
+				      sum, off);
+		kunmap_atomic(p);
+		off += v.bv_len;
 	})
 	)
 	*csum = sum;
@@ -1514,6 +1746,13 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum,
 				      v.iov_base, v.iov_len,
 				      sum, off);
 		off += v.iov_len;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		sum = csum_and_memcpy((to += v.bv_len) - v.bv_len,
+				      p + v.bv_offset, v.bv_len,
+				      sum, off);
+		kunmap_atomic(p);
+		off += v.bv_len;
 	})
 	)
 	*csum = sum;
@@ -1559,6 +1798,13 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump,
 				     (from += v.iov_len) - v.iov_len,
 				     v.iov_len, sum, off);
 		off += v.iov_len;
+	}), ({
+		char *p = kmap_atomic(v.bv_page);
+		sum = csum_and_memcpy(p + v.bv_offset,
+				      (from += v.bv_len) - v.bv_len,
+				      v.bv_len, sum, off);
+		kunmap_atomic(p);
+		off += v.bv_len;
 	})
 	)
 	*csum = sum;
@@ -1608,6 +1854,21 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		npages = pipe_space_for_user(iter_head, pipe->tail, pipe);
 		if (npages >= maxpages)
 			return maxpages;
+	} else if (unlikely(iov_iter_is_xarray(i))) {
+		unsigned offset;
+
+		offset = (i->xarray_start + i->iov_offset) & ~PAGE_MASK;
+
+		npages = 1;
+		if (size > PAGE_SIZE - offset) {
+			size -= PAGE_SIZE - offset;
+			npages += size >> PAGE_SHIFT;
+			size &= ~PAGE_MASK;
+			if (size)
+				npages++;
+		}
+		if (npages >= maxpages)
+			return maxpages;
 	} else iterate_all_kinds(i, size, v, ({
 		unsigned long p = (unsigned long)v.iov_base;
 		npages += DIV_ROUND_UP(p + v.iov_len, PAGE_SIZE)
@@ -1624,7 +1885,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 			- p / PAGE_SIZE;
 		if (npages >= maxpages)
 			return maxpages;
-	})
+	}),
+	0
 	)
 	return npages;
 }
@@ -1637,7 +1899,7 @@ const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 		WARN_ON(1);
 		return NULL;
 	}
-	if (unlikely(iov_iter_is_discard(new)))
+	if (unlikely(iov_iter_is_discard(new) || iov_iter_is_xarray(new)))
 		return NULL;
 	if (iov_iter_is_bvec(new))
 		return new->bvec = kmemdup(new->bvec,
@@ -1842,7 +2104,12 @@ int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
 		kunmap(v.bv_page);
 		err;}), ({
 		w = v;
-		err = f(&w, context);})
+		err = f(&w, context);}), ({
+		w.iov_base = kmap(v.bv_page) + v.bv_offset;
+		w.iov_len = v.bv_len;
+		err = f(&w, context);
+		kunmap(v.bv_page);
+		err;})
 	)
 	return err;
 }





* [PATCH 02/33] mm: Add an unlock function for PG_private_2/PG_fscache
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
  2021-02-15 15:44 ` [PATCH 01/33] iov_iter: Add ITER_XARRAY David Howells
@ 2021-02-15 15:44 ` David Howells
  2021-02-16 10:26   ` Christoph Hellwig
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Howells @ 2021-02-15 15:44 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Linus Torvalds, Matthew Wilcox (Oracle), Alexander Viro,
	Christoph Hellwig, linux-mm, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel, dhowells,
	Jeff Layton, David Wysochanski, linux-kernel

Add a function, unlock_page_private_2(), to unlock PG_private_2, analogous
to the unlocking of PG_locked.  Add a kerneldoc banner to it indicating the
example usage case.

A wrapper will be added to the netfs header by the patch that creates that
header.

[This implements a suggestion by Linus to not mix the terminology of
 PG_private_2 and PG_fscache in the mm core function]
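
For illustration, the intended usage pattern is something like this (a
sketch only; the fscache-flavoured wrappers arrive in a later patch):

	/* A page is being written to the cache: 'lock' it by setting
	 * PG_private_2 (which PG_fscache aliases).
	 */
	SetPagePrivate2(page);

	/* ... the asynchronous write to the cache runs and completes ... */

	/* Drop the 'lock' and wake up anyone waiting on the bit. */
	unlock_page_private_2(page);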

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/1330473.1612974547@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjgA-74ddehziVk=XAEMTKswPu1Yw4uaro1R3ibs27ztw@mail.gmail.com/
---

 include/linux/pagemap.h |    1 +
 mm/filemap.c            |   20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d5570deff400..365a28ece763 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -591,6 +591,7 @@ extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
 extern void unlock_page(struct page *page);
+extern void unlock_page_private_2(struct page *page);
 
 /*
  * Return true if the page was successfully locked
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c9d564317a5..7d321152d579 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1466,6 +1466,26 @@ void unlock_page(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page);
 
+/**
+ * unlock_page_private_2 - Unlock a page that's locked with PG_private_2
+ * @page: The page
+ *
+ * Unlocks a page that's locked with PG_private_2 and wakes up sleepers in
+ * wait_on_page_private_2().
+ *
+ * This is, for example, used when a netfs page is being written to a local
+ * disk cache, thereby allowing writes to the cache for the same page to be
+ * serialised.
+ */
+void unlock_page_private_2(struct page *page)
+{
+	page = compound_head(page);
+	VM_BUG_ON_PAGE(!PagePrivate2(page), page);
+	clear_bit_unlock(PG_private_2, &page->flags);
+	wake_up_page_bit(page, PG_private_2);
+}
+EXPORT_SYMBOL(unlock_page_private_2);
+
 /**
  * end_page_writeback - end writeback against a page
  * @page: the page





* [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
  2021-02-15 15:44 ` [PATCH 01/33] iov_iter: Add ITER_XARRAY David Howells
  2021-02-15 15:44 ` [PATCH 02/33] mm: Add an unlock function for PG_private_2/PG_fscache David Howells
@ 2021-02-15 15:44 ` David Howells
  2021-02-16 10:32   ` Christoph Hellwig
                     ` (4 more replies)
  2021-02-15 15:45 ` [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles David Howells
                   ` (11 subsequent siblings)
  14 siblings, 5 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:44 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Matthew Wilcox (Oracle), Alexander Viro, Christoph Hellwig,
	linux-mm, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, dhowells,
	Jeff Layton, David Wysochanski, linux-kernel

Provide a function, readahead_expand(), that expands the set of pages
specified by a readahead_control object to encompass a revised area with a
proposed start and length.

The proposed area must include all of the old area and may be expanded yet
more by this function so that the edges align on (transparent huge) page
boundaries as allocated.

The expansion will be cut short if a page already exists in either of the
areas being expanded into.  Note that any expansion made in such a case is
not rolled back.

This will be used by fscache so that reads can be expanded to cache granule
boundaries, thereby allowing whole granules to be stored in the cache, but
there are other potential users also.
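
For example (a sketch only; the 256KiB granule size is an assumption made
for illustration), a netfs using fscache might round the window out to
cache granule boundaries like this:

	loff_t start = round_down(readahead_pos(ractl), 256 * 1024);
	size_t len = round_up(readahead_pos(ractl) + readahead_length(ractl),
			      256 * 1024) - start;

	readahead_expand(ractl, start, len);

	/* The expansion may have been cut short by a pre-existing page,
	 * so re-examine ractl rather than assuming the requested window.
	 */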

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 include/linux/pagemap.h |    2 +
 mm/readahead.c          |   70 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 365a28ece763..d2786607d297 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -761,6 +761,8 @@ extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
+void readahead_expand(struct readahead_control *ractl,
+		      loff_t new_start, size_t new_len);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/mm/readahead.c b/mm/readahead.c
index c5b0457415be..4446dada0bc2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -638,3 +638,73 @@ SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
 {
 	return ksys_readahead(fd, offset, count);
 }
+
+/**
+ * readahead_expand - Expand a readahead request
+ * @ractl: The request to be expanded
+ * @new_start: The revised start
+ * @new_len: The revised size of the request
+ *
+ * Attempt to expand a readahead request outwards from the current size to the
+ * specified size by inserting locked pages before and after the current window
+ * to increase the size to the new window.  This may involve the insertion of
+ * THPs, in which case the window may get expanded even beyond what was
+ * requested.
+ *
+ * The algorithm will stop if it encounters a conflicting page already in the
+ * pagecache and leave a smaller expansion than requested.
+ *
+ * The caller must check for this by examining the revised @ractl object for a
+ * different expansion than was requested.
+ */
+void readahead_expand(struct readahead_control *ractl,
+		      loff_t new_start, size_t new_len)
+{
+	struct address_space *mapping = ractl->mapping;
+	pgoff_t new_index, new_nr_pages;
+	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+
+	new_index = new_start / PAGE_SIZE;
+
+	/* Expand the leading edge downwards */
+	while (ractl->_index > new_index) {
+		unsigned long index = ractl->_index - 1;
+		struct page *page = xa_load(&mapping->i_pages, index);
+
+		if (page && !xa_is_value(page))
+			return; /* Page apparently present */
+
+		page = __page_cache_alloc(gfp_mask);
+		if (!page)
+			return;
+		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
+			put_page(page);
+			return;
+		}
+
+		ractl->_nr_pages++;
+		ractl->_index = page->index;
+	}
+
+	new_len += new_start - readahead_pos(ractl);
+	new_nr_pages = DIV_ROUND_UP(new_len, PAGE_SIZE);
+
+	/* Expand the trailing edge upwards */
+	while (ractl->_nr_pages < new_nr_pages) {
+		unsigned long index = ractl->_index + ractl->_nr_pages;
+		struct page *page = xa_load(&mapping->i_pages, index);
+
+		if (page && !xa_is_value(page))
+			return; /* Page apparently present */
+
+		page = __page_cache_alloc(gfp_mask);
+		if (!page)
+			return;
+		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
+			put_page(page);
+			return;
+		}
+		ractl->_nr_pages++;
+	}
+}
+EXPORT_SYMBOL(readahead_expand);





* [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (2 preceding siblings ...)
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
@ 2021-02-15 15:45 ` David Howells
  2021-02-16 10:26   ` Christoph Hellwig
  2021-02-16 11:55   ` David Howells
  2021-02-15 15:45 ` [PATCH 05/33] netfs: Make a netfs helper module David Howells
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:45 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Alexander Viro, Christoph Hellwig, Matthew Wilcox, linux-mm,
	linux-cachefs, linux-afs, linux-nfs, linux-cifs, ceph-devel,
	v9fs-developer, linux-fsdevel, dhowells, Jeff Layton,
	David Wysochanski, linux-kernel

Export rw_verify_area() so that cachefiles can use it before issuing
call_read_iter() and call_write_iter() to effect async DIO operations
against the cache.  This is analogous to aio_read() and aio_write().
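
A sketch of the intended call pattern (illustrative only; file, iocb and
iter are assumed to have been set up already, as in the kiocb sketch in
the cover letter):

	ssize_t ret;

	/* Vet the file range as read()/write() would, then issue the
	 * async DIO against the backing file.
	 */
	ret = rw_verify_area(READ, file, &iocb.ki_pos, len);
	if (ret < 0)
		return ret;
	ret = call_read_iter(file, &iocb, &iter);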

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/internal.h      |    5 -----
 fs/read_write.c    |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 77c50befbfbe..92e686249c40 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -164,11 +164,6 @@ extern char *simple_dname(struct dentry *, char *, int);
 extern void dput_to_list(struct dentry *, struct list_head *);
 extern void shrink_dentry_list(struct list_head *);
 
-/*
- * read_write.c
- */
-extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
-
 /*
  * pipe.c
  */
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..fe84e11245bd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -400,6 +400,7 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
 	return security_file_permission(file,
 				read_write == READ ? MAY_READ : MAY_WRITE);
 }
+EXPORT_SYMBOL(rw_verify_area);
 
 static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..493804856ab3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2760,6 +2760,7 @@ extern int notify_change(struct dentry *, struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
 extern int generic_permission(struct inode *, int);
 extern int __check_sticky(struct inode *dir, struct inode *inode);
+extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
 
 static inline bool execute_ok(struct inode *inode)
 {





* [PATCH 05/33] netfs: Make a netfs helper module
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (3 preceding siblings ...)
  2021-02-15 15:45 ` [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles David Howells
@ 2021-02-15 15:45 ` David Howells
  2021-02-15 15:45 ` [PATCH 06/33] netfs, mm: Move PG_fscache helper funcs to linux/netfs.h David Howells
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:45 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Jeff Layton, Matthew Wilcox, linux-mm, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	dhowells, David Wysochanski, Alexander Viro, linux-kernel

Make a netfs helper module to manage read request segmentation, caching
support and transparent huge page support on behalf of a network
filesystem.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/netfs/Kconfig |    8 ++++++++
 1 file changed, 8 insertions(+)
 create mode 100644 fs/netfs/Kconfig

diff --git a/fs/netfs/Kconfig b/fs/netfs/Kconfig
new file mode 100644
index 000000000000..2ebf90e6ca95
--- /dev/null
+++ b/fs/netfs/Kconfig
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config NETFS_SUPPORT
+	tristate "Support for network filesystem high-level I/O"
+	help
+	  This option enables support for network filesystems, including
+	  helpers for high-level buffered I/O, abstracting out read
+	  segmentation, local caching and transparent huge page support.





* [PATCH 06/33] netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (4 preceding siblings ...)
  2021-02-15 15:45 ` [PATCH 05/33] netfs: Make a netfs helper module David Howells
@ 2021-02-15 15:45 ` David Howells
  2021-02-15 15:45 ` [PATCH 07/33] netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache() David Howells
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:45 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Matthew Wilcox, linux-mm, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel, dhowells,
	Jeff Layton, David Wysochanski, Alexander Viro, linux-kernel

Move the PG_fscache related helper funcs (such as SetPageFsCache()) to
linux/netfs.h rather than linux/fscache.h as the intention is to move to a
model where they're used by the network filesystem and the helper library,
but not by fscache/cachefiles itself.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 include/linux/fscache.h |   11 +----------
 include/linux/netfs.h   |   25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/netfs.h

diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index a1c928fe98e7..1f8dc72369ee 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
 #include <linux/list_bl.h>
+#include <linux/netfs.h>
 
 #if defined(CONFIG_FSCACHE) || defined(CONFIG_FSCACHE_MODULE)
 #define fscache_available() (1)
@@ -29,16 +30,6 @@
 #endif
 
 
-/*
- * overload PG_private_2 to give us PG_fscache - this is used to indicate that
- * a page is currently backed by a local disk cache
- */
-#define PageFsCache(page)		PagePrivate2((page))
-#define SetPageFsCache(page)		SetPagePrivate2((page))
-#define ClearPageFsCache(page)		ClearPagePrivate2((page))
-#define TestSetPageFsCache(page)	TestSetPagePrivate2((page))
-#define TestClearPageFsCache(page)	TestClearPagePrivate2((page))
-
 /* pattern used to fill dead space in an index entry */
 #define FSCACHE_INDEX_DEADFILL_PATTERN 0x79
 
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
new file mode 100644
index 000000000000..b3d869ec7d2a
--- /dev/null
+++ b/include/linux/netfs.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Network filesystem support services.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * See Documentation/filesystems/netfs_library.rst for a description of the network filesystem interface declared here.
+ */
+
+#ifndef _LINUX_NETFS_H
+#define _LINUX_NETFS_H
+
+#include <linux/pagemap.h>
+
+/*
+ * Overload PG_private_2 to give us PG_fscache - this is used to indicate that
+ * a page is currently backed by a local disk cache
+ */
+#define PageFsCache(page)		PagePrivate2((page))
+#define SetPageFsCache(page)		SetPagePrivate2((page))
+#define ClearPageFsCache(page)		ClearPagePrivate2((page))
+#define TestSetPageFsCache(page)	TestSetPagePrivate2((page))
+#define TestClearPageFsCache(page)	TestClearPagePrivate2((page))
+
+#endif /* _LINUX_NETFS_H */





* [PATCH 07/33] netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache()
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (5 preceding siblings ...)
  2021-02-15 15:45 ` [PATCH 06/33] netfs, mm: Move PG_fscache helper funcs to linux/netfs.h David Howells
@ 2021-02-15 15:45 ` David Howells
  2021-02-15 15:45 ` [PATCH 08/33] netfs: Provide readahead and readpage netfs helpers David Howells
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:45 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Linus Torvalds, Matthew Wilcox, linux-mm, linux-cachefs,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, dhowells, Jeff Layton, David Wysochanski,
	Alexander Viro, linux-kernel

Add unlock_page_fscache() as an alias of unlock_page_private_2().  This
allows a page 'locked' with PG_fscache to be unlocked.

Add wait_on_page_fscache() to wait for PG_fscache to be unlocked.

[Linus suggested putting the fscache-themed functions into the
 caching-specific headers rather than pagemap.h]
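
An illustrative sketch of the pattern a netfs might then use before
modifying a page:

	/* Don't modify a page while it's still being written to the
	 * cache; wait for the cache write to finish first.
	 */
	lock_page(page);
	wait_on_page_fscache(page);
	/* ... modify the page ... */
	unlock_page(page);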

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/1330473.1612974547@warthog.procyon.org.uk/
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjgA-74ddehziVk=XAEMTKswPu1Yw4uaro1R3ibs27ztw@mail.gmail.com/
---

 include/linux/netfs.h |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index b3d869ec7d2a..f69703543788 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -22,4 +22,33 @@
 #define TestSetPageFsCache(page)	TestSetPagePrivate2((page))
 #define TestClearPageFsCache(page)	TestClearPagePrivate2((page))
 
+/**
+ * unlock_page_fscache - Unlock a page that's locked with PG_fscache
+ * @page: The page
+ *
+ * Unlocks a page that's locked with PG_fscache and wakes up sleepers in
+ * wait_on_page_fscache().  This page bit is used by the netfs helpers when a
+ * netfs page is being written to a local disk cache, thereby allowing writes
+ * to the cache for the same page to be serialised.
+ */
+static inline void unlock_page_fscache(struct page *page)
+{
+	unlock_page_private_2(page);
+}
+
+/**
+ * wait_on_page_fscache - Wait for PG_fscache to be cleared on a page
+ * @page: The page
+ *
+ * Wait for the PG_fscache (PG_private_2) page bit to be removed from a page.
+ * This is, for example, used to handle a netfs page being written to a local
+ * disk cache, thereby allowing writes to the cache for the same page to be
+ * serialised.
+ */
+static inline void wait_on_page_fscache(struct page *page)
+{
+	if (PageFsCache(page))
+		wait_on_page_bit(compound_head(page), PG_fscache);
+}
+
 #endif /* _LINUX_NETFS_H */





* [PATCH 08/33] netfs: Provide readahead and readpage netfs helpers
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (6 preceding siblings ...)
  2021-02-15 15:45 ` [PATCH 07/33] netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache() David Howells
@ 2021-02-15 15:45 ` David Howells
  2021-02-15 15:45 ` [PATCH 09/33] netfs: Add tracepoints David Howells
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:45 UTC
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Jeff Layton, Matthew Wilcox, linux-mm, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	dhowells, David Wysochanski, Alexander Viro, linux-kernel

Add a pair of helper functions:

 (*) netfs_readahead()
 (*) netfs_readpage()

to do the work of handling a readahead or a readpage, where the page(s)
that form part of the request may be split between the local cache, the
server or just require clearing, and may be single pages and transparent
huge pages.  This is all handled within the helper.

Note that while both will read from the cache if there is data present,
only netfs_readahead() will expand the request beyond what it was asked to
do, and only netfs_readahead() will write back to the cache.

netfs_readpage(), on the other hand, is synchronous and only fetches the
page (which might be a THP) it is asked for.

The netfs gives the helper parameters from the VM, the cache cookie it
wants to use (or NULL) and a table of operations, only one of which is
mandatory (a sketch of the opt-in appears after this list):

 (*) expand_readahead() [optional]

     Called to allow the netfs to request an expansion of a readahead
     request to meet its own alignment requirements.  This is done by
     changing rreq->start and rreq->len.

 (*) clamp_length() [optional]

     Called to allow the netfs to cut down a subrequest to meet its own
     boundary requirements.  If it does this, the helper will generate
     additional subrequests until the full request is satisfied.

 (*) is_still_valid() [optional]

     Called to find out if the data just read from the cache has been
     invalidated and must be reread from the server.

 (*) issue_op() [required]

     Called to ask the netfs to issue a read to the server.  The subrequest
     describes the read.  The read request holds information about the file
     being accessed.

     The netfs can cache information in rreq->netfs_priv.

     Upon completion, the netfs should set the error and the amount
     transferred, can also set FSCACHE_SREQ_CLEAR_TAIL, and should then
     call fscache_subreq_terminated().

 (*) done() [optional]

     Called after the pages have been unlocked.  The read request is still
     pinning the file and mapping and may still be pinning pages with
     PG_fscache.  rreq->error indicates any error that has been
     accumulated.

 (*) cleanup() [optional]

     Called when the helper is disposing of a finished read request.  This
     allows the netfs to clear rreq->netfs_priv.
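
As a sketch of the shape of such a table (hypothetical "myfs" names
again; note that netfs_alloc_read_request() calls ->init_rreq()
unconditionally in this version, so a stub must be supplied even though
only issue_op is documented as mandatory):

	static void myfs_init_rreq(struct netfs_read_request *rreq,
				   struct file *file)
	{
	}

	static void myfs_issue_op(struct netfs_read_subrequest *subreq)
	{
		/* Start a server read of the range subreq->start +
		 * subreq->transferred to subreq->start + subreq->len - 1;
		 * whatever completes it must then call
		 * netfs_subreq_terminated(subreq, transferred_or_error).
		 * To keep the sketch self-contained, just fail the read.
		 */
		netfs_subreq_terminated(subreq, -EIO);
	}

	static void myfs_cleanup(struct address_space *mapping,
				 void *netfs_priv)
	{
		kfree(netfs_priv);
	}

	static const struct netfs_read_request_ops myfs_req_ops = {
		.init_rreq	= myfs_init_rreq,
		.issue_op	= myfs_issue_op,
		.cleanup	= myfs_cleanup,
	};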

Netfs support is enabled with CONFIG_NETFS_SUPPORT=y.  It will be built
even if CONFIG_FSCACHE=n and in this case much of it should be optimised
away, allowing the filesystem to use it even when caching is disabled.

Changes:
 - Folded in a kerneldoc comment fix.
 - Folded in a fix for the error handling when ENOMEM occurs.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/Kconfig             |    1 
 fs/Makefile            |    1 
 fs/netfs/Makefile      |    6 
 fs/netfs/internal.h    |   61 ++++
 fs/netfs/read_helper.c |  717 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h  |   82 +++++
 6 files changed, 868 insertions(+)
 create mode 100644 fs/netfs/Makefile
 create mode 100644 fs/netfs/internal.h
 create mode 100644 fs/netfs/read_helper.c

diff --git a/fs/Kconfig b/fs/Kconfig
index aa4c12282301..a8ab2802c52e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -125,6 +125,7 @@ source "fs/overlayfs/Kconfig"
 
 menu "Caches"
 
+source "fs/netfs/Kconfig"
 source "fs/fscache/Kconfig"
 source "fs/cachefiles/Kconfig"
 
diff --git a/fs/Makefile b/fs/Makefile
index 999d1a23f036..e922c3bc0a43 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_PROFILING)		+= dcookies.o
 obj-$(CONFIG_DLM)		+= dlm/
  
 # Do not add any filesystems before this line
+obj-$(CONFIG_NETFS_SUPPORT)	+= netfs/
 obj-$(CONFIG_FSCACHE)		+= fscache/
 obj-$(CONFIG_REISERFS_FS)	+= reiserfs/
 obj-$(CONFIG_EXT4_FS)		+= ext4/
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
new file mode 100644
index 000000000000..4b4eff2ba369
--- /dev/null
+++ b/fs/netfs/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+netfs-y := \
+	read_helper.o
+
+obj-$(CONFIG_NETFS_SUPPORT) := netfs.o
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
new file mode 100644
index 000000000000..ee665c0e7dc8
--- /dev/null
+++ b/fs/netfs/internal.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Internal definitions for network filesystem support
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) "netfs: " fmt
+
+/*
+ * read_helper.c
+ */
+extern unsigned int netfs_debug;
+
+#define netfs_stat(x) do {} while(0)
+#define netfs_stat_d(x) do {} while(0)
+
+/*****************************************************************************/
+/*
+ * debug tracing
+ */
+#define dbgprintk(FMT, ...) \
+	printk("[%-6.6s] "FMT"\n", current->comm, ##__VA_ARGS__)
+
+#define kenter(FMT, ...) dbgprintk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define kleave(FMT, ...) dbgprintk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) dbgprintk(FMT, ##__VA_ARGS__)
+
+#ifdef __KDEBUG
+#define _enter(FMT, ...) kenter(FMT, ##__VA_ARGS__)
+#define _leave(FMT, ...) kleave(FMT, ##__VA_ARGS__)
+#define _debug(FMT, ...) kdebug(FMT, ##__VA_ARGS__)
+
+#elif defined(CONFIG_NETFS_DEBUG)
+#define _enter(FMT, ...)			\
+do {						\
+	if (netfs_debug)			\
+		kenter(FMT, ##__VA_ARGS__);	\
+} while (0)
+
+#define _leave(FMT, ...)			\
+do {						\
+	if (netfs_debug)			\
+		kleave(FMT, ##__VA_ARGS__);	\
+} while (0)
+
+#define _debug(FMT, ...)			\
+do {						\
+	if (netfs_debug)			\
+		kdebug(FMT, ##__VA_ARGS__);	\
+} while (0)
+
+#else
+#define _enter(FMT, ...) no_printk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define _leave(FMT, ...) no_printk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define _debug(FMT, ...) no_printk(FMT, ##__VA_ARGS__)
+#endif
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
new file mode 100644
index 000000000000..a35bd29928fa
--- /dev/null
+++ b/fs/netfs/read_helper.c
@@ -0,0 +1,717 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Network filesystem high-level read support.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/module.h>
+#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/sched/mm.h>
+#include <linux/task_io_accounting_ops.h>
+#include <linux/netfs.h>
+#include "internal.h"
+
+MODULE_DESCRIPTION("Network fs support");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+unsigned netfs_debug;
+module_param_named(debug, netfs_debug, uint, S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(netfs_debug, "Netfs support debugging mask");
+
+static void netfs_rreq_work(struct work_struct *);
+static void __netfs_put_subrequest(struct netfs_read_subrequest *);
+
+static void netfs_put_subrequest(struct netfs_read_subrequest *subreq)
+{
+	if (refcount_dec_and_test(&subreq->usage))
+		__netfs_put_subrequest(subreq);
+}
+
+static struct netfs_read_request *netfs_alloc_read_request(
+	const struct netfs_read_request_ops *ops, void *netfs_priv,
+	struct file *file)
+{
+	struct netfs_read_request *rreq;
+
+	rreq = kzalloc(sizeof(struct netfs_read_request), GFP_KERNEL);
+	if (rreq) {
+		rreq->netfs_ops	= ops;
+		rreq->netfs_priv = netfs_priv;
+		rreq->inode	= file_inode(file);
+		rreq->i_size	= i_size_read(rreq->inode);
+		INIT_LIST_HEAD(&rreq->subrequests);
+		INIT_WORK(&rreq->work, netfs_rreq_work);
+		refcount_set(&rreq->usage, 1);
+		__set_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
+		ops->init_rreq(rreq, file);
+	}
+
+	return rreq;
+}
+
+static void netfs_get_read_request(struct netfs_read_request *rreq)
+{
+	refcount_inc(&rreq->usage);
+}
+
+static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq;
+
+	while (!list_empty(&rreq->subrequests)) {
+		subreq = list_first_entry(&rreq->subrequests,
+					  struct netfs_read_subrequest, rreq_link);
+		list_del(&subreq->rreq_link);
+		netfs_put_subrequest(subreq);
+	}
+}
+
+static void netfs_free_read_request(struct work_struct *work)
+{
+	struct netfs_read_request *rreq =
+		container_of(work, struct netfs_read_request, work);
+	netfs_rreq_clear_subreqs(rreq);
+	if (rreq->netfs_priv)
+		rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
+	kfree(rreq);
+}
+
+static void netfs_put_read_request(struct netfs_read_request *rreq)
+{
+	if (refcount_dec_and_test(&rreq->usage)) {
+		if (in_softirq()) {
+			rreq->work.func = netfs_free_read_request;
+			if (!queue_work(system_unbound_wq, &rreq->work))
+				BUG();
+		} else {
+			netfs_free_read_request(&rreq->work);
+		}
+	}
+}
+
+/*
+ * Allocate and partially initialise an I/O request structure.
+ */
+static struct netfs_read_subrequest *netfs_alloc_subrequest(
+	struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq;
+
+	subreq = kzalloc(sizeof(struct netfs_read_subrequest), GFP_KERNEL);
+	if (subreq) {
+		INIT_LIST_HEAD(&subreq->rreq_link);
+		refcount_set(&subreq->usage, 2);
+		subreq->rreq = rreq;
+		netfs_get_read_request(rreq);
+	}
+
+	return subreq;
+}
+
+static void netfs_get_read_subrequest(struct netfs_read_subrequest *subreq)
+{
+	refcount_inc(&subreq->usage);
+}
+
+static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq)
+{
+	netfs_put_read_request(subreq->rreq);
+	kfree(subreq);
+}
+
+/*
+ * Clear the unread part of an I/O request.
+ */
+static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
+{
+	struct iov_iter iter;
+
+	iov_iter_xarray(&iter, READ, &subreq->rreq->mapping->i_pages,
+			subreq->start + subreq->transferred,
+			subreq->len   - subreq->transferred);
+	iov_iter_zero(iov_iter_count(&iter), &iter);
+}
+
+/*
+ * Fill a subrequest region with zeroes.
+ */
+static void netfs_fill_with_zeroes(struct netfs_read_request *rreq,
+				   struct netfs_read_subrequest *subreq)
+{
+	__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
+	netfs_subreq_terminated(subreq, 0);
+}
+
+/*
+ * Ask the netfs to issue a read request to the server for us.
+ *
+ * The netfs is expected to read from subreq->pos + subreq->transferred to
+ * subreq->pos + subreq->len - 1.  It may not backtrack and write data into the
+ * buffer prior to the transferred point as it might clobber dirty data
+ * obtained from the cache.
+ *
+ * Alternatively, the netfs is allowed to indicate one of two things:
+ *
+ * - NETFS_SREQ_SHORT_READ: A short read - it will get called again to try and
+ *   make progress.
+ *
+ * - NETFS_SREQ_CLEAR_TAIL: A short read - the rest of the buffer will be
+ *   cleared.
+ */
+static void netfs_read_from_server(struct netfs_read_request *rreq,
+				   struct netfs_read_subrequest *subreq)
+{
+	rreq->netfs_ops->issue_op(subreq);
+}
+
+/*
+ * Release those waiting.
+ */
+static void netfs_rreq_completed(struct netfs_read_request *rreq)
+{
+	netfs_rreq_clear_subreqs(rreq);
+	netfs_put_read_request(rreq);
+}
+
+/*
+ * Unlock the pages in a read operation.  We need to set PG_fscache on any
+ * pages we're going to write back before we unlock them.
+ */
+static void netfs_rreq_unlock(struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq;
+	struct page *page;
+	unsigned int iopos, account = 0;
+	pgoff_t start_page = rreq->start / PAGE_SIZE;
+	pgoff_t last_page = ((rreq->start + rreq->len) / PAGE_SIZE) - 1;
+	bool subreq_failed = false;
+	int i;
+
+	XA_STATE(xas, &rreq->mapping->i_pages, start_page);
+
+	if (test_bit(NETFS_RREQ_FAILED, &rreq->flags)) {
+		__clear_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags);
+		list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+			__clear_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags);
+		}
+	}
+
+	/* Walk through the pagecache and the I/O request lists simultaneously.
+	 * We may have a mixture of cached and uncached sections and we only
+	 * really want to write out the uncached sections.  This is slightly
+	 * complicated by the possibility that we might have huge pages with a
+	 * mixture inside.
+	 */
+	subreq = list_first_entry(&rreq->subrequests,
+				  struct netfs_read_subrequest, rreq_link);
+	iopos = 0;
+	subreq_failed = (subreq->error < 0);
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, last_page) {
+		unsigned int pgpos = (page->index - start_page) * PAGE_SIZE;
+		unsigned int pgend = pgpos + thp_size(page);
+		bool pg_failed = false;
+
+		for (;;) {
+			if (!subreq) {
+				pg_failed = true;
+				break;
+			}
+			if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags))
+				SetPageFsCache(page);
+			pg_failed |= subreq_failed;
+			if (pgend < iopos + subreq->len)
+				break;
+
+			account += subreq->transferred;
+			iopos += subreq->len;
+			if (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
+				subreq = list_next_entry(subreq, rreq_link);
+				subreq_failed = (subreq->error < 0);
+			} else {
+				subreq = NULL;
+				subreq_failed = false;
+			}
+			if (pgend == iopos)
+				break;
+		}
+
+		if (!pg_failed) {
+			for (i = 0; i < thp_nr_pages(page); i++)
+				flush_dcache_page(page + i);
+			SetPageUptodate(page);
+		}
+
+		if (!test_bit(NETFS_RREQ_DONT_UNLOCK_PAGES, &rreq->flags)) {
+			if (page->index == rreq->no_unlock_page &&
+			    test_bit(NETFS_RREQ_NO_UNLOCK_PAGE, &rreq->flags))
+				_debug("no unlock");
+			else
+				unlock_page(page);
+		}
+	}
+	rcu_read_unlock();
+
+	task_io_account_read(account);
+	if (rreq->netfs_ops->done)
+		rreq->netfs_ops->done(rreq);
+}
+
+/*
+ * Handle a short read.
+ */
+static void netfs_rreq_short_read(struct netfs_read_request *rreq,
+				  struct netfs_read_subrequest *subreq)
+{
+	__clear_bit(NETFS_SREQ_SHORT_READ, &subreq->flags);
+	__set_bit(NETFS_SREQ_SEEK_DATA_READ, &subreq->flags);
+
+	netfs_get_read_subrequest(subreq);
+	atomic_inc(&rreq->nr_rd_ops);
+	netfs_read_from_server(rreq, subreq);
+}
+
+/*
+ * Resubmit any short or failed operations.  Returns true if we got the rreq
+ * ref back.
+ */
+static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq;
+
+	WARN_ON(in_softirq());
+
+	/* We don't want terminating submissions trying to wake us up whilst
+	 * we're still going through the list.
+	 */
+	atomic_inc(&rreq->nr_rd_ops);
+
+	__clear_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags);
+	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+		if (subreq->error) {
+			if (subreq->source != NETFS_READ_FROM_CACHE)
+				break;
+			subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
+			subreq->error = 0;
+			netfs_get_read_subrequest(subreq);
+			atomic_inc(&rreq->nr_rd_ops);
+			netfs_read_from_server(rreq, subreq);
+		} else if (test_bit(NETFS_SREQ_SHORT_READ, &subreq->flags)) {
+			netfs_rreq_short_read(rreq, subreq);
+		}
+	}
+
+	/* If we decrement nr_rd_ops to 0, the usage ref belongs to us. */
+	if (atomic_dec_and_test(&rreq->nr_rd_ops))
+		return true;
+
+	wake_up_var(&rreq->nr_rd_ops);
+	return false;
+}
+
+/*
+ * Assess the state of a read request and decide what to do next.
+ *
+ * Note that we could be in an ordinary kernel thread, on a workqueue or in
+ * softirq context at this point.  We inherit a ref from the caller.
+ */
+static void netfs_rreq_assess(struct netfs_read_request *rreq)
+{
+again:
+	if (!test_bit(NETFS_RREQ_FAILED, &rreq->flags) &&
+	    test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags)) {
+		if (netfs_rreq_perform_resubmissions(rreq))
+			goto again;
+		return;
+	}
+
+	netfs_rreq_unlock(rreq);
+
+	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
+	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
+
+	netfs_rreq_completed(rreq);
+}
+
+static void netfs_rreq_work(struct work_struct *work)
+{
+	struct netfs_read_request *rreq =
+		container_of(work, struct netfs_read_request, work);
+	netfs_rreq_assess(rreq);
+}
+
+/*
+ * Handle the completion of all outstanding I/O operations on a read request.
+ * We inherit a ref from the caller.
+ */
+static void netfs_rreq_terminated(struct netfs_read_request *rreq)
+{
+	if (test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags) &&
+	    in_softirq()) {
+		if (!queue_work(system_unbound_wq, &rreq->work))
+			BUG();
+	} else {
+		netfs_rreq_assess(rreq);
+	}
+}
+
+/**
+ * netfs_subreq_terminated - Note the termination of an I/O operation.
+ * @subreq: The I/O request that has terminated.
+ * @transferred_or_error: The amount of data transferred or an error code.
+ *
+ * This tells the read helper that a contributory I/O operation has terminated,
+ * one way or another, and that it should integrate the results.
+ *
+ * The caller indicates in @transferred_or_error the outcome of the operation,
+ * supplying a positive value to indicate the number of bytes transferred, zero
+ * to indicate a retryable failure to transfer anything, or a negative error
+ * code.  The helper will look after reissuing I/O operations as
+ * appropriate and writing downloaded data to the cache.
+ *
+ * This may be called from a softirq handler, so we want to avoid taking the
+ * spinlock if we can.
+ */
+void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
+			     ssize_t transferred_or_error)
+{
+	struct netfs_read_request *rreq = subreq->rreq;
+	int u;
+
+	_enter("[%u]{%llx,%lx},%zd",
+	       subreq->debug_index, subreq->start, subreq->flags,
+	       transferred_or_error);
+
+	if (IS_ERR_VALUE(transferred_or_error)) {
+		subreq->error = transferred_or_error;
+		goto failed;
+	}
+
+	if (WARN_ON(transferred_or_error > subreq->len - subreq->transferred))
+		transferred_or_error = subreq->len - subreq->transferred;
+
+	subreq->error = 0;
+	subreq->transferred += transferred_or_error;
+	if (subreq->transferred < subreq->len)
+		goto incomplete;
+
+complete:
+	__clear_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags);
+	if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags))
+		set_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags);
+
+out:
+	/* If we decrement nr_rd_ops to 0, the ref belongs to us. */
+	u = atomic_dec_return(&rreq->nr_rd_ops);
+	if (u == 0)
+		netfs_rreq_terminated(rreq);
+	else if (u == 1)
+		wake_up_var(&rreq->nr_rd_ops);
+
+	netfs_put_subrequest(subreq);
+	return;
+
+incomplete:
+	if (test_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags)) {
+		netfs_clear_unread(subreq);
+		subreq->transferred = subreq->len;
+		goto complete;
+	}
+
+	if (transferred_or_error == 0) {
+		if (__test_and_set_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags)) {
+			subreq->error = -ENODATA;
+			goto failed;
+		}
+	} else {
+		__clear_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags);
+	}
+
+	__set_bit(NETFS_SREQ_SHORT_READ, &subreq->flags);
+	set_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags);
+	goto out;
+
+failed:
+	if (subreq->source == NETFS_READ_FROM_CACHE) {
+		set_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags);
+	} else {
+		set_bit(NETFS_RREQ_FAILED, &rreq->flags);
+		rreq->error = subreq->error;
+	}
+	goto out;
+}
+EXPORT_SYMBOL(netfs_subreq_terminated);
+
+static enum netfs_read_source netfs_cache_prepare_read(struct netfs_read_subrequest *subreq,
+						       loff_t i_size)
+{
+	struct netfs_read_request *rreq = subreq->rreq;
+
+	if (subreq->start >= rreq->i_size)
+		return NETFS_FILL_WITH_ZEROES;
+	return NETFS_DOWNLOAD_FROM_SERVER;
+}
+
+/*
+ * Work out what sort of subrequest the next one will be.
+ */
+static enum netfs_read_source
+netfs_rreq_prepare_read(struct netfs_read_request *rreq,
+			struct netfs_read_subrequest *subreq)
+{
+	enum netfs_read_source source;
+
+	_enter("%llx-%llx,%llx", subreq->start, subreq->start + subreq->len, rreq->i_size);
+
+	source = netfs_cache_prepare_read(subreq, rreq->i_size);
+	if (source == NETFS_INVALID_READ)
+		goto out;
+
+	if (source == NETFS_DOWNLOAD_FROM_SERVER) {
+		/* Call out to the netfs to let it shrink the request to fit
+		 * its own I/O sizes and boundaries.  If it shrinks it here,
+		 * the helper will issue further subrequests in parallel to
+		 * cover the remainder; if it wants to make serial calls, it
+		 * can indicate a short read and then we will call it again.
+		 */
+		if (subreq->len > rreq->i_size - subreq->start)
+			subreq->len = rreq->i_size - subreq->start;
+
+		if (rreq->netfs_ops->clamp_length &&
+		    !rreq->netfs_ops->clamp_length(subreq)) {
+			source = NETFS_INVALID_READ;
+			goto out;
+		}
+	}
+
+	if (WARN_ON(subreq->len == 0))
+		source = NETFS_INVALID_READ;
+
+out:
+	subreq->source = source;
+	return source;
+}
+
+/*
+ * Slice off a piece of a read request and submit an I/O request for it.
+ */
+static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
+				    unsigned int *_debug_index)
+{
+	struct netfs_read_subrequest *subreq;
+	enum netfs_read_source source;
+
+	subreq = netfs_alloc_subrequest(rreq);
+	if (!subreq)
+		return false;
+
+	subreq->debug_index	= (*_debug_index)++;
+	subreq->start		= rreq->start + rreq->submitted;
+	subreq->len		= rreq->len   - rreq->submitted;
+
+	_debug("slice %llx,%zx,%zx", subreq->start, subreq->len, rreq->submitted);
+	list_add_tail(&subreq->rreq_link, &rreq->subrequests);
+
+	/* Call out to the cache to find out what it can do with the remaining
+	 * subset.  It tells us in subreq->flags what it decided should be done
+	 * and adjusts subreq->len down if the subset crosses a cache boundary.
+	 *
+	 * Then, when we hand the subrequest to the netfs, it can choose to
+	 * take a subset of that (the starts must coincide), in which case we
+	 * go around the loop again and ask it to download the next piece.
+	 */
+	source = netfs_rreq_prepare_read(rreq, subreq);
+	if (source == NETFS_INVALID_READ)
+		goto subreq_failed;
+
+	atomic_inc(&rreq->nr_rd_ops);
+
+	rreq->submitted += subreq->len;
+
+	switch (source) {
+	case NETFS_FILL_WITH_ZEROES:
+		netfs_fill_with_zeroes(rreq, subreq);
+		break;
+	case NETFS_DOWNLOAD_FROM_SERVER:
+		netfs_read_from_server(rreq, subreq);
+		break;
+	default:
+		BUG();
+	}
+
+	return true;
+
+subreq_failed:
+	rreq->error = subreq->error;
+	netfs_put_subrequest(subreq);
+	return false;
+}
+
+static void netfs_rreq_expand(struct netfs_read_request *rreq,
+			      struct readahead_control *ractl)
+{
+	/* Give the netfs a chance to change the request parameters.  The
+	 * resultant request must contain the original region.
+	 */
+	if (rreq->netfs_ops->expand_readahead)
+		rreq->netfs_ops->expand_readahead(rreq);
+
+	/* Expand the request if the cache wants it to start earlier.  Note
+	 * that the expansion may get further extended if the VM wishes to
+	 * insert THPs and the preferred start and/or end wind up in the middle
+	 * of THPs.
+	 *
+	 * If this is the case, however, the THP size should be an integer
+	 * multiple of the cache granule size, so we get a whole number of
+	 * granules to deal with.
+	 */
+	if (rreq->start  != readahead_pos(ractl) ||
+	    rreq->len != readahead_length(ractl)) {
+		readahead_expand(ractl, rreq->start, rreq->len);
+		rreq->start  = readahead_pos(ractl);
+		rreq->len = readahead_length(ractl);
+	}
+}
+
+/**
+ * netfs_readahead - Helper to manage a read request
+ * @ractl: The description of the readahead request
+ * @ops: The network filesystem's operations for the helper to use
+ * @netfs_priv: Private netfs data to be retained in the request
+ *
+ * Fulfil a readahead request by drawing data from the cache if possible, or
+ * the netfs if not.  Space beyond the EOF is zero-filled.  Multiple I/O
+ * requests from different sources will get munged together.  If necessary, the
+ * readahead window can be expanded in either direction to a more convenient
+ * alignment for RPC efficiency or to make storage in the cache feasible.
+ *
+ * The calling netfs must provide a table of operations, only one of which,
+ * issue_op, is mandatory.  It may also be passed a private token, which will
+ * be retained in rreq->netfs_priv and will be cleaned up by ops->cleanup().
+ *
+ * This is usable whether or not caching is enabled.
+ */
+void netfs_readahead(struct readahead_control *ractl,
+		     const struct netfs_read_request_ops *ops,
+		     void *netfs_priv)
+{
+	struct netfs_read_request *rreq;
+	struct page *page;
+	unsigned int debug_index = 0;
+
+	_enter("%lx,%x", readahead_index(ractl), readahead_count(ractl));
+
+	if (readahead_count(ractl) == 0)
+		goto cleanup;
+
+	rreq = netfs_alloc_read_request(ops, netfs_priv, ractl->file);
+	if (!rreq)
+		goto cleanup;
+	rreq->mapping	= ractl->mapping;
+	rreq->start	= readahead_pos(ractl);
+	rreq->len	= readahead_length(ractl);
+
+	netfs_rreq_expand(rreq, ractl);
+
+	atomic_set(&rreq->nr_rd_ops, 1);
+	do {
+		if (!netfs_rreq_submit_slice(rreq, &debug_index))
+			break;
+
+	} while (rreq->submitted < rreq->len);
+
+	if (rreq->submitted == 0) {
+		netfs_put_read_request(rreq);
+		return;
+	}
+
+	// TODO: If we didn't submit enough reads, we need to try punting to
+	// a work queue.
+
+	while ((page = readahead_page(ractl)))
+		put_page(page);
+
+	/* If we decrement nr_rd_ops to 0, the ref belongs to us. */
+	if (atomic_dec_and_test(&rreq->nr_rd_ops))
+		netfs_rreq_assess(rreq);
+	return;
+
+cleanup:
+	if (netfs_priv)
+		ops->cleanup(ractl->mapping, netfs_priv);
+	return;
+}
+EXPORT_SYMBOL(netfs_readahead);
+
+/**
+ * netfs_readpage - Helper to manage a readpage request
+ * @file: The file to read from
+ * @page: The page to read
+ * @ops: The network filesystem's operations for the helper to use
+ * @netfs_priv: Private netfs data to be retained in the request
+ *
+ * Fulfil a readpage request by drawing data from the cache if possible, or the
+ * netfs if not.  Space beyond the EOF is zero-filled.  Multiple I/O requests
+ * from different sources will get munged together.
+ *
+ * The calling netfs must provide a table of operations, only one of which,
+ * issue_op, is mandatory.  It may also be passed a private token, which will
+ * be retained in rreq->netfs_priv and will be cleaned up by ops->cleanup().
+ *
+ * This is usable whether or not caching is enabled.
+ */
+int netfs_readpage(struct file *file,
+		   struct page *page,
+		   const struct netfs_read_request_ops *ops,
+		   void *netfs_priv)
+{
+	struct netfs_read_request *rreq;
+	unsigned int debug_index = 0;
+	int ret;
+
+	_enter("%lx", page->index);
+
+	rreq = netfs_alloc_read_request(ops, netfs_priv, file);
+	if (!rreq) {
+		if (netfs_priv)
+			ops->cleanup(page->mapping, netfs_priv);
+		unlock_page(page);
+		return -ENOMEM;
+	}
+	rreq->mapping	= page->mapping;
+	rreq->start	= page->index * PAGE_SIZE;
+	rreq->len	= thp_size(page);
+
+	netfs_get_read_request(rreq);
+
+	atomic_set(&rreq->nr_rd_ops, 1);
+	do {
+		if (!netfs_rreq_submit_slice(rreq, &debug_index))
+			break;
+
+	} while (rreq->submitted < rreq->len);
+
+	/* Keep nr_rd_ops incremented so that the ref always belongs to us, and
+	 * the service code isn't punted off to a random thread pool to
+	 * process.
+	 */
+	do {
+		wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
+		netfs_rreq_assess(rreq);
+	} while (test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags));
+
+	ret = rreq->error;
+	if (ret == 0 && rreq->submitted < rreq->len)
+		ret = -EIO;
+	netfs_put_read_request(rreq);
+	return ret;
+}
+EXPORT_SYMBOL(netfs_readpage);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index f69703543788..7f74d668a459 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -10,6 +10,8 @@
 #ifndef _LINUX_NETFS_H
 #define _LINUX_NETFS_H
 
+#include <linux/workqueue.h>
+#include <linux/fs.h>
 #include <linux/pagemap.h>
 
 /*
@@ -51,4 +53,84 @@ static inline void wait_on_page_fscache(struct page *page)
 		wait_on_page_bit(compound_head(page), PG_fscache);
 }
 
+enum netfs_read_source {
+	NETFS_FILL_WITH_ZEROES,
+	NETFS_DOWNLOAD_FROM_SERVER,
+	NETFS_READ_FROM_CACHE,
+	NETFS_INVALID_READ,
+} __mode(byte);
+
+/*
+ * Descriptor for a single component subrequest.
+ */
+struct netfs_read_subrequest {
+	struct netfs_read_request *rreq;	/* Supervising read request */
+	struct list_head	rreq_link;	/* Link in rreq->subrequests */
+	loff_t			start;		/* Where to start the I/O */
+	size_t			len;		/* Size of the I/O */
+	size_t			transferred;	/* Amount of data transferred */
+	refcount_t		usage;
+	short			error;		/* 0 or error that occurred */
+	unsigned short		debug_index;	/* Index in list (for debugging output) */
+	enum netfs_read_source	source;		/* Where to read from */
+	unsigned long		flags;
+#define NETFS_SREQ_WRITE_TO_CACHE	0	/* Set if should write to cache */
+#define NETFS_SREQ_CLEAR_TAIL		1	/* Set if the rest of the read should be cleared */
+#define NETFS_SREQ_SHORT_READ		2	/* Set if there was a short read from the cache */
+#define NETFS_SREQ_SEEK_DATA_READ	3	/* Set if ->read() should SEEK_DATA first */
+#define NETFS_SREQ_NO_PROGRESS		4	/* Set if we didn't manage to read any data */
+};
+
+/*
+ * Descriptor for a read helper request.  This is used to make multiple I/O
+ * requests on a variety of sources and then stitch the result together.
+ */
+struct netfs_read_request {
+	struct work_struct	work;
+	struct inode		*inode;		/* The file being accessed */
+	struct address_space	*mapping;	/* The mapping being accessed */
+	struct list_head	subrequests;	/* Requests to fetch I/O from disk or net */
+	void			*netfs_priv;	/* Private data for the netfs */
+	atomic_t		nr_rd_ops;	/* Number of read ops in progress */
+	size_t			submitted;	/* Amount submitted for I/O so far */
+	size_t			len;		/* Length of the request */
+	short			error;		/* 0 or error that occurred */
+	loff_t			i_size;		/* Size of the file */
+	loff_t			start;		/* Start position */
+	pgoff_t			no_unlock_page;	/* Don't unlock this page after read */
+	refcount_t		usage;
+	unsigned long		flags;
+#define NETFS_RREQ_INCOMPLETE_IO	0	/* Some ioreqs terminated short or with error */
+#define NETFS_RREQ_WRITE_TO_CACHE	1	/* Need to write to the cache */
+#define NETFS_RREQ_NO_UNLOCK_PAGE	2	/* Don't unlock no_unlock_page on completion */
+#define NETFS_RREQ_DONT_UNLOCK_PAGES	3	/* Don't unlock the pages on completion */
+#define NETFS_RREQ_FAILED		4	/* The request failed */
+#define NETFS_RREQ_IN_PROGRESS		5	/* Unlocked when the request completes */
+	const struct netfs_read_request_ops *netfs_ops;
+};
+
+/*
+ * Operations the network filesystem can/must provide to the helpers.
+ */
+struct netfs_read_request_ops {
+	void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
+	void (*expand_readahead)(struct netfs_read_request *rreq);
+	bool (*clamp_length)(struct netfs_read_subrequest *subreq);
+	void (*issue_op)(struct netfs_read_subrequest *subreq);
+	bool (*is_still_valid)(struct netfs_read_request *rreq);
+	void (*done)(struct netfs_read_request *rreq);
+	void (*cleanup)(struct address_space *mapping, void *netfs_priv);
+};
+
+struct readahead_control;
+extern void netfs_readahead(struct readahead_control *,
+			    const struct netfs_read_request_ops *,
+			    void *);
+extern int netfs_readpage(struct file *,
+			  struct page *,
+			  const struct netfs_read_request_ops *,
+			  void *);
+
+extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t);
+
 #endif /* _LINUX_NETFS_H */





* [PATCH 09/33] netfs: Add tracepoints
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (7 preceding siblings ...)
  2021-02-15 15:45 ` [PATCH 08/33] netfs: Provide readahead and readpage netfs helpers David Howells
@ 2021-02-15 15:45 ` David Howells
  2021-02-15 15:46 ` [PATCH 10/33] netfs: Gather stats David Howells
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:45 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Jeff Layton, Matthew Wilcox, linux-mm, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	dhowells, Jeff Layton, David Wysochanski, Matthew Wilcox (Oracle),
	Alexander Viro, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel

Add three tracepoints to track the activity of the read helpers:

 (1) netfs/netfs_read

     This logs entry to the read helpers and also expansion of the range in
     a readahead request.

 (2) netfs/netfs_rreq

     This logs the progress of netfs_read_request objects which track
     read requests.  A read request may be a compound of multiple
     subrequests.

 (3) netfs/netfs_sreq

     This logs the progress of netfs_read_subrequest objects, which track
     the contributions from various sources to a read request.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/netfs/read_helper.c       |   28 ++++++
 include/linux/netfs.h        |    2 
 include/trace/events/netfs.h |  199 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 229 insertions(+)
 create mode 100644 include/trace/events/netfs.h

diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index a35bd29928fa..53b04b105179 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -16,6 +16,8 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/netfs.h>
 #include "internal.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/netfs.h>
 
 MODULE_DESCRIPTION("Network fs support");
 MODULE_AUTHOR("Red Hat, Inc.");
@@ -38,6 +40,7 @@ static struct netfs_read_request *netfs_alloc_read_request(
 	const struct netfs_read_request_ops *ops, void *netfs_priv,
 	struct file *file)
 {
+	static atomic_t debug_ids;
 	struct netfs_read_request *rreq;
 
 	rreq = kzalloc(sizeof(struct netfs_read_request), GFP_KERNEL);
@@ -46,6 +49,7 @@ static struct netfs_read_request *netfs_alloc_read_request(
 		rreq->netfs_priv = netfs_priv;
 		rreq->inode	= file_inode(file);
 		rreq->i_size	= i_size_read(rreq->inode);
+		rreq->debug_id	= atomic_inc_return(&debug_ids);
 		INIT_LIST_HEAD(&rreq->subrequests);
 		INIT_WORK(&rreq->work, netfs_rreq_work);
 		refcount_set(&rreq->usage, 1);
@@ -80,6 +84,7 @@ static void netfs_free_read_request(struct work_struct *work)
 	netfs_rreq_clear_subreqs(rreq);
 	if (rreq->netfs_priv)
 		rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
 	kfree(rreq);
 }
 
@@ -122,6 +127,7 @@ static void netfs_get_read_subrequest(struct netfs_read_subrequest *subreq)
 
 static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq)
 {
+	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
 	netfs_put_read_request(subreq->rreq);
 	kfree(subreq);
 }
@@ -176,6 +182,7 @@ static void netfs_read_from_server(struct netfs_read_request *rreq,
  */
 static void netfs_rreq_completed(struct netfs_read_request *rreq)
 {
+	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
 	netfs_rreq_clear_subreqs(rreq);
 	netfs_put_read_request(rreq);
 }
@@ -214,6 +221,8 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
 	iopos = 0;
 	subreq_failed = (subreq->error < 0);
 
+	trace_netfs_rreq(rreq, netfs_rreq_trace_unlock);
+
 	rcu_read_lock();
 	xas_for_each(&xas, page, last_page) {
 		unsigned int pgpos = (page->index - start_page) * PAGE_SIZE;
@@ -274,6 +283,8 @@ static void netfs_rreq_short_read(struct netfs_read_request *rreq,
 	__clear_bit(NETFS_SREQ_SHORT_READ, &subreq->flags);
 	__set_bit(NETFS_SREQ_SEEK_DATA_READ, &subreq->flags);
 
+	trace_netfs_sreq(subreq, netfs_sreq_trace_resubmit_short);
+
 	netfs_get_read_subrequest(subreq);
 	atomic_inc(&rreq->nr_rd_ops);
 	netfs_read_from_server(rreq, subreq);
@@ -289,6 +300,8 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
 
 	WARN_ON(in_softirq());
 
+	trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
+
 	/* We don't want terminating submissions trying to wake us up whilst
 	 * we're still going through the list.
 	 */
@@ -301,6 +314,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
 				break;
 			subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
 			subreq->error = 0;
+			trace_netfs_sreq(subreq, netfs_sreq_trace_download_instead);
 			netfs_get_read_subrequest(subreq);
 			atomic_inc(&rreq->nr_rd_ops);
 			netfs_read_from_server(rreq, subreq);
@@ -325,6 +339,8 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
  */
 static void netfs_rreq_assess(struct netfs_read_request *rreq)
 {
+	trace_netfs_rreq(rreq, netfs_rreq_trace_assess);
+
 again:
 	if (!test_bit(NETFS_RREQ_FAILED, &rreq->flags) &&
 	    test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags)) {
@@ -409,6 +425,8 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
 		set_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags);
 
 out:
+	trace_netfs_sreq(subreq, netfs_sreq_trace_terminated);
+
 	/* If we decrement nr_rd_ops to 0, the ref belongs to us. */
 	u = atomic_dec_return(&rreq->nr_rd_ops);
 	if (u == 0)
@@ -497,6 +515,7 @@ netfs_rreq_prepare_read(struct netfs_read_request *rreq,
 
 out:
 	subreq->source = source;
+	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
 	return source;
 }
 
@@ -536,6 +555,7 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
 
 	rreq->submitted += subreq->len;
 
+	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 	switch (source) {
 	case NETFS_FILL_WITH_ZEROES:
 		netfs_fill_with_zeroes(rreq, subreq);
@@ -578,6 +598,9 @@ static void netfs_rreq_expand(struct netfs_read_request *rreq,
 		readahead_expand(ractl, rreq->start, rreq->len);
 		rreq->start  = readahead_pos(ractl);
 		rreq->len = readahead_length(ractl);
+
+		trace_netfs_read(rreq, readahead_pos(ractl), readahead_length(ractl),
+				 netfs_read_trace_expanded);
 	}
 }
 
@@ -619,6 +642,9 @@ void netfs_readahead(struct readahead_control *ractl,
 	rreq->start	= readahead_pos(ractl);
 	rreq->len	= readahead_length(ractl);
 
+	trace_netfs_read(rreq, readahead_pos(ractl), readahead_length(ractl),
+			 netfs_read_trace_readahead);
+
 	netfs_rreq_expand(rreq, ractl);
 
 	atomic_set(&rreq->nr_rd_ops, 1);
@@ -690,6 +716,8 @@ int netfs_readpage(struct file *file,
 	rreq->start	= page->index * PAGE_SIZE;
 	rreq->len	= thp_size(page);
 
+	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage);
+
 	netfs_get_read_request(rreq);
 
 	atomic_set(&rreq->nr_rd_ops, 1);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 7f74d668a459..24083dc0adfa 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -91,6 +91,8 @@ struct netfs_read_request {
 	struct address_space	*mapping;	/* The mapping being accessed */
 	struct list_head	subrequests;	/* Requests to fetch I/O from disk or net */
 	void			*netfs_priv;	/* Private data for the netfs */
+	unsigned int		debug_id;
+	unsigned int		cookie_debug_id;
 	atomic_t		nr_rd_ops;	/* Number of read ops in progress */
 	size_t			submitted;	/* Amount submitted for I/O so far */
 	size_t			len;		/* Length of the request */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
new file mode 100644
index 000000000000..12ad382764c5
--- /dev/null
+++ b/include/trace/events/netfs.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Network filesystem support module tracepoints
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM netfs
+
+#if !defined(_TRACE_NETFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NETFS_H
+
+#include <linux/tracepoint.h>
+
+/*
+ * Define enums for tracing information.
+ */
+#ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
+#define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
+
+enum netfs_read_trace {
+	netfs_read_trace_expanded,
+	netfs_read_trace_readahead,
+	netfs_read_trace_readpage,
+};
+
+enum netfs_rreq_trace {
+	netfs_rreq_trace_assess,
+	netfs_rreq_trace_done,
+	netfs_rreq_trace_free,
+	netfs_rreq_trace_resubmit,
+	netfs_rreq_trace_unlock,
+	netfs_rreq_trace_unmark,
+	netfs_rreq_trace_write,
+};
+
+enum netfs_sreq_trace {
+	netfs_sreq_trace_download_instead,
+	netfs_sreq_trace_free,
+	netfs_sreq_trace_prepare,
+	netfs_sreq_trace_resubmit_short,
+	netfs_sreq_trace_submit,
+	netfs_sreq_trace_terminated,
+	netfs_sreq_trace_write,
+	netfs_sreq_trace_write_term,
+};
+
+#endif
+
+#define netfs_read_traces					\
+	EM(netfs_read_trace_expanded,		"EXPANDED ")	\
+	EM(netfs_read_trace_readahead,		"READAHEAD")	\
+	E_(netfs_read_trace_readpage,		"READPAGE ")
+
+#define netfs_rreq_traces					\
+	EM(netfs_rreq_trace_assess,		"ASSESS")	\
+	EM(netfs_rreq_trace_done,		"DONE  ")	\
+	EM(netfs_rreq_trace_free,		"FREE  ")	\
+	EM(netfs_rreq_trace_resubmit,		"RESUBM")	\
+	EM(netfs_rreq_trace_unlock,		"UNLOCK")	\
+	EM(netfs_rreq_trace_unmark,		"UNMARK")	\
+	E_(netfs_rreq_trace_write,		"WRITE ")
+
+#define netfs_sreq_sources					\
+	EM(NETFS_FILL_WITH_ZEROES,		"ZERO")		\
+	EM(NETFS_DOWNLOAD_FROM_SERVER,		"DOWN")		\
+	EM(NETFS_READ_FROM_CACHE,		"READ")		\
+	E_(NETFS_INVALID_READ,			"INVL")
+
+#define netfs_sreq_traces					\
+	EM(netfs_sreq_trace_download_instead,	"RDOWN")	\
+	EM(netfs_sreq_trace_free,		"FREE ")	\
+	EM(netfs_sreq_trace_prepare,		"PREP ")	\
+	EM(netfs_sreq_trace_resubmit_short,	"SHORT")	\
+	EM(netfs_sreq_trace_submit,		"SUBMT")	\
+	EM(netfs_sreq_trace_terminated,		"TERM ")	\
+	EM(netfs_sreq_trace_write,		"WRITE")	\
+	E_(netfs_sreq_trace_write_term,		"WTERM")
+
+
+/*
+ * Export enum symbols via userspace.
+ */
+#undef EM
+#undef E_
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define E_(a, b) TRACE_DEFINE_ENUM(a);
+
+netfs_read_traces;
+netfs_rreq_traces;
+netfs_sreq_sources;
+netfs_sreq_traces;
+
+/*
+ * Now redefine the EM() and E_() macros to map the enums to the strings that
+ * will be printed in the output.
+ */
+#undef EM
+#undef E_
+#define EM(a, b)	{ a, b },
+#define E_(a, b)	{ a, b }
+
+TRACE_EVENT(netfs_read,
+	    TP_PROTO(struct netfs_read_request *rreq,
+		     loff_t start, size_t len,
+		     enum netfs_read_trace what),
+
+	    TP_ARGS(rreq, start, len, what),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		rreq		)
+		    __field(unsigned int,		cookie		)
+		    __field(loff_t,			start		)
+		    __field(size_t,			len		)
+		    __field(enum netfs_read_trace,	what		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->rreq	= rreq->debug_id;
+		    __entry->cookie	= rreq->cookie_debug_id;
+		    __entry->start	= start;
+		    __entry->len	= len;
+		    __entry->what	= what;
+			   ),
+
+	    TP_printk("R=%08x %s c=%08x s=%llx %zx",
+		      __entry->rreq,
+		      __print_symbolic(__entry->what, netfs_read_traces),
+		      __entry->cookie,
+		      __entry->start, __entry->len)
+	    );
+
+TRACE_EVENT(netfs_rreq,
+	    TP_PROTO(struct netfs_read_request *rreq,
+		     enum netfs_rreq_trace what),
+
+	    TP_ARGS(rreq, what),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		rreq		)
+		    __field(unsigned short,		flags		)
+		    __field(enum netfs_rreq_trace,	what		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->rreq	= rreq->debug_id;
+		    __entry->flags	= rreq->flags;
+		    __entry->what	= what;
+			   ),
+
+	    TP_printk("R=%08x %s f=%02x",
+		      __entry->rreq,
+		      __print_symbolic(__entry->what, netfs_rreq_traces),
+		      __entry->flags)
+	    );
+
+TRACE_EVENT(netfs_sreq,
+	    TP_PROTO(struct netfs_read_subrequest *sreq,
+		     enum netfs_sreq_trace what),
+
+	    TP_ARGS(sreq, what),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		rreq		)
+		    __field(unsigned short,		index		)
+		    __field(short,			error		)
+		    __field(unsigned short,		flags		)
+		    __field(enum netfs_read_source,	source		)
+		    __field(enum netfs_sreq_trace,	what		)
+		    __field(size_t,			len		)
+		    __field(size_t,			transferred	)
+		    __field(loff_t,			start		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->rreq	= sreq->rreq->debug_id;
+		    __entry->index	= sreq->debug_index;
+		    __entry->error	= sreq->error;
+		    __entry->flags	= sreq->flags;
+		    __entry->source	= sreq->source;
+		    __entry->what	= what;
+		    __entry->len	= sreq->len;
+		    __entry->transferred = sreq->transferred;
+		    __entry->start	= sreq->start;
+			   ),
+
+	    TP_printk("R=%08x[%u] %s %s f=%02x s=%llx %zx/%zx e=%d",
+		      __entry->rreq, __entry->index,
+		      __print_symbolic(__entry->what, netfs_sreq_traces),
+		      __print_symbolic(__entry->source, netfs_sreq_sources),
+		      __entry->flags,
+		      __entry->start, __entry->transferred, __entry->len,
+		      __entry->error)
+	    );
+
+#endif /* _TRACE_NETFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>





* [PATCH 10/33] netfs: Gather stats
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (8 preceding siblings ...)
  2021-02-15 15:45 ` [PATCH 09/33] netfs: Add tracepoints David Howells
@ 2021-02-15 15:46 ` David Howells
  2021-02-15 15:46 ` [PATCH 11/33] netfs: Add write_begin helper David Howells
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:46 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Jeff Layton, Matthew Wilcox, linux-mm, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	dhowells, Jeff Layton, David Wysochanski, Matthew Wilcox (Oracle),
	Alexander Viro, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel

Gather statistics from the netfs interface that can be exported through a
seqfile.  This is intended to be called by a later patch when viewing
/proc/fs/fscache/stats.
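
As a sketch of the intended wiring (fscache_stats_show() here is an
assumed name for that later patch's seq_file show routine, not something
added by this patch):

	static int fscache_stats_show(struct seq_file *m, void *v)
	{
		netfs_stats_show(m);	/* emit the "RdHelp :" lines */
		/* ... followed by fscache's own statistics ... */
		return 0;
	}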

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/netfs/Kconfig       |   15 +++++++++++++
 fs/netfs/Makefile      |    3 +--
 fs/netfs/internal.h    |   34 ++++++++++++++++++++++++++++++
 fs/netfs/read_helper.c |   23 ++++++++++++++++++++
 fs/netfs/stats.c       |   54 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h  |    1 +
 6 files changed, 128 insertions(+), 2 deletions(-)
 create mode 100644 fs/netfs/stats.c

diff --git a/fs/netfs/Kconfig b/fs/netfs/Kconfig
index 2ebf90e6ca95..578112713703 100644
--- a/fs/netfs/Kconfig
+++ b/fs/netfs/Kconfig
@@ -6,3 +6,18 @@ config NETFS_SUPPORT
 	  This option enables support for network filesystems, including
 	  helpers for high-level buffered I/O, abstracting out read
 	  segmentation, local caching and transparent huge page support.
+
+config NETFS_STATS
+	bool "Gather statistical information on local caching"
+	depends on NETFS_SUPPORT && PROC_FS
+	help
+	  This option causes statistical information to be gathered on local
+	  caching and exported through a file:
+
+		/proc/fs/fscache/stats
+
+	  The gathering of statistics adds a certain amount of overhead to
+	  execution as there are quite a few stats gathered, and on a
+	  multi-CPU system these may be on cachelines that keep bouncing
+	  between CPUs.  On the other hand, the stats are very useful for
+	  debugging purposes.  Saying 'Y' here is recommended.
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index 4b4eff2ba369..c15bfc966d96 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -1,6 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 
-netfs-y := \
-	read_helper.o
+netfs-y := read_helper.o stats.o
 
 obj-$(CONFIG_NETFS_SUPPORT) := netfs.o
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index ee665c0e7dc8..98b6f4516da1 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -16,8 +16,42 @@
  */
 extern unsigned int netfs_debug;
 
+/*
+ * stats.c
+ */
+#ifdef CONFIG_NETFS_STATS
+extern atomic_t netfs_n_rh_readahead;
+extern atomic_t netfs_n_rh_readpage;
+extern atomic_t netfs_n_rh_rreq;
+extern atomic_t netfs_n_rh_sreq;
+extern atomic_t netfs_n_rh_download;
+extern atomic_t netfs_n_rh_download_done;
+extern atomic_t netfs_n_rh_download_failed;
+extern atomic_t netfs_n_rh_download_instead;
+extern atomic_t netfs_n_rh_read;
+extern atomic_t netfs_n_rh_read_done;
+extern atomic_t netfs_n_rh_read_failed;
+extern atomic_t netfs_n_rh_zero;
+extern atomic_t netfs_n_rh_short_read;
+extern atomic_t netfs_n_rh_write;
+extern atomic_t netfs_n_rh_write_done;
+extern atomic_t netfs_n_rh_write_failed;
+
+
+static inline void netfs_stat(atomic_t *stat)
+{
+	atomic_inc(stat);
+}
+
+static inline void netfs_stat_d(atomic_t *stat)
+{
+	atomic_dec(stat);
+}
+
+#else
 #define netfs_stat(x) do {} while(0)
 #define netfs_stat_d(x) do {} while(0)
+#endif
 
 /*****************************************************************************/
 /*
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 53b04b105179..4f6f708f8f18 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -55,6 +55,7 @@ static struct netfs_read_request *netfs_alloc_read_request(
 		refcount_set(&rreq->usage, 1);
 		__set_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
 		ops->init_rreq(rreq, file);
+		netfs_stat(&netfs_n_rh_rreq);
 	}
 
 	return rreq;
@@ -86,6 +87,7 @@ static void netfs_free_read_request(struct work_struct *work)
 		rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
 	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
 	kfree(rreq);
+	netfs_stat_d(&netfs_n_rh_rreq);
 }
 
 static void netfs_put_read_request(struct netfs_read_request *rreq)
@@ -115,6 +117,7 @@ static struct netfs_read_subrequest *netfs_alloc_subrequest(
 		refcount_set(&subreq->usage, 2);
 		subreq->rreq = rreq;
 		netfs_get_read_request(rreq);
+		netfs_stat(&netfs_n_rh_sreq);
 	}
 
 	return subreq;
@@ -130,6 +133,7 @@ static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq)
 	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
 	netfs_put_read_request(subreq->rreq);
 	kfree(subreq);
+	netfs_stat_d(&netfs_n_rh_sreq);
 }
 
 /*
@@ -151,6 +155,7 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
 static void netfs_fill_with_zeroes(struct netfs_read_request *rreq,
 				   struct netfs_read_subrequest *subreq)
 {
+	netfs_stat(&netfs_n_rh_zero);
 	__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
 	netfs_subreq_terminated(subreq, 0);
 }
@@ -174,6 +179,7 @@ static void netfs_fill_with_zeroes(struct netfs_read_request *rreq,
 static void netfs_read_from_server(struct netfs_read_request *rreq,
 				   struct netfs_read_subrequest *subreq)
 {
+	netfs_stat(&netfs_n_rh_download);
 	rreq->netfs_ops->issue_op(subreq);
 }
 
@@ -283,6 +289,7 @@ static void netfs_rreq_short_read(struct netfs_read_request *rreq,
 	__clear_bit(NETFS_SREQ_SHORT_READ, &subreq->flags);
 	__set_bit(NETFS_SREQ_SEEK_DATA_READ, &subreq->flags);
 
+	netfs_stat(&netfs_n_rh_short_read);
 	trace_netfs_sreq(subreq, netfs_sreq_trace_resubmit_short);
 
 	netfs_get_read_subrequest(subreq);
@@ -314,6 +321,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
 				break;
 			subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
 			subreq->error = 0;
+			netfs_stat(&netfs_n_rh_download_instead);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_download_instead);
 			netfs_get_read_subrequest(subreq);
 			atomic_inc(&rreq->nr_rd_ops);
@@ -406,6 +414,17 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
 	       subreq->debug_index, subreq->start, subreq->flags,
 	       transferred_or_error);
 
+	switch (subreq->source) {
+	case NETFS_READ_FROM_CACHE:
+		netfs_stat(&netfs_n_rh_read_done);
+		break;
+	case NETFS_DOWNLOAD_FROM_SERVER:
+		netfs_stat(&netfs_n_rh_download_done);
+		break;
+	default:
+		break;
+	}
+
 	if (IS_ERR_VALUE(transferred_or_error)) {
 		subreq->error = transferred_or_error;
 		goto failed;
@@ -459,8 +478,10 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
 
 failed:
 	if (subreq->source == NETFS_READ_FROM_CACHE) {
+		netfs_stat(&netfs_n_rh_read_failed);
 		set_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags);
 	} else {
+		netfs_stat(&netfs_n_rh_download_failed);
 		set_bit(NETFS_RREQ_FAILED, &rreq->flags);
 		rreq->error = subreq->error;
 	}
@@ -642,6 +663,7 @@ void netfs_readahead(struct readahead_control *ractl,
 	rreq->start	= readahead_pos(ractl);
 	rreq->len	= readahead_length(ractl);
 
+	netfs_stat(&netfs_n_rh_readahead);
 	trace_netfs_read(rreq, readahead_pos(ractl), readahead_length(ractl),
 			 netfs_read_trace_readahead);
 
@@ -716,6 +738,7 @@ int netfs_readpage(struct file *file,
 	rreq->start	= page->index * PAGE_SIZE;
 	rreq->len	= thp_size(page);
 
+	netfs_stat(&netfs_n_rh_readpage);
 	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage);
 
 	netfs_get_read_request(rreq);
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
new file mode 100644
index 000000000000..df6ff5718f25
--- /dev/null
+++ b/fs/netfs/stats.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Netfs support statistics
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/seq_file.h>
+#include <linux/netfs.h>
+#include "internal.h"
+
+atomic_t netfs_n_rh_readahead;
+atomic_t netfs_n_rh_readpage;
+atomic_t netfs_n_rh_rreq;
+atomic_t netfs_n_rh_sreq;
+atomic_t netfs_n_rh_download;
+atomic_t netfs_n_rh_download_done;
+atomic_t netfs_n_rh_download_failed;
+atomic_t netfs_n_rh_download_instead;
+atomic_t netfs_n_rh_read;
+atomic_t netfs_n_rh_read_done;
+atomic_t netfs_n_rh_read_failed;
+atomic_t netfs_n_rh_zero;
+atomic_t netfs_n_rh_short_read;
+atomic_t netfs_n_rh_write;
+atomic_t netfs_n_rh_write_done;
+atomic_t netfs_n_rh_write_failed;
+
+void netfs_stats_show(struct seq_file *m)
+{
+	seq_printf(m, "RdHelp : RA=%u RP=%u rr=%u sr=%u\n",
+		   atomic_read(&netfs_n_rh_readahead),
+		   atomic_read(&netfs_n_rh_readpage),
+		   atomic_read(&netfs_n_rh_rreq),
+		   atomic_read(&netfs_n_rh_sreq));
+	seq_printf(m, "RdHelp : ZR=%u sh=%u\n",
+		   atomic_read(&netfs_n_rh_zero),
+		   atomic_read(&netfs_n_rh_short_read));
+	seq_printf(m, "RdHelp : DL=%u ds=%u df=%u di=%u\n",
+		   atomic_read(&netfs_n_rh_download),
+		   atomic_read(&netfs_n_rh_download_done),
+		   atomic_read(&netfs_n_rh_download_failed),
+		   atomic_read(&netfs_n_rh_download_instead));
+	seq_printf(m, "RdHelp : RD=%u rs=%u rf=%u\n",
+		   atomic_read(&netfs_n_rh_read),
+		   atomic_read(&netfs_n_rh_read_done),
+		   atomic_read(&netfs_n_rh_read_failed));
+	seq_printf(m, "RdHelp : WR=%u ws=%u wf=%u\n",
+		   atomic_read(&netfs_n_rh_write),
+		   atomic_read(&netfs_n_rh_write_done),
+		   atomic_read(&netfs_n_rh_write_failed));
+}
+EXPORT_SYMBOL(netfs_stats_show);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 24083dc0adfa..b8237b6f17cb 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -134,5 +134,6 @@ extern int netfs_readpage(struct file *,
 			  void *);
 
 extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t);
+extern void netfs_stats_show(struct seq_file *);
 
 #endif /* _LINUX_NETFS_H */





* [PATCH 11/33] netfs: Add write_begin helper
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (9 preceding siblings ...)
  2021-02-15 15:46 ` [PATCH 10/33] netfs: Gather stats David Howells
@ 2021-02-15 15:46 ` David Howells
  2021-02-15 15:46 ` [PATCH 12/33] netfs: Define an interface to talk to a cache David Howells
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:46 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Jeff Layton, Matthew Wilcox, linux-mm, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	dhowells, Jeff Layton, David Wysochanski, Matthew Wilcox (Oracle),
	Alexander Viro, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel

Add a helper to do the pre-reading work for the netfs write_begin address
space op.
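
A sketch of the intended use (hypothetical "myfs" names, with
myfs_req_ops as sketched under patch 08; the helper does the pre-read or
clearing work and hands back a locked, up-to-date page in *pagep):

	static int myfs_write_begin(struct file *file,
				    struct address_space *mapping,
				    loff_t pos, unsigned int len,
				    unsigned int flags,
				    struct page **pagep, void **fsdata)
	{
		return netfs_write_begin(file, mapping, pos, len, flags,
					 pagep, fsdata, &myfs_req_ops, NULL);
	}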

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/netfs/internal.h          |    2 +
 fs/netfs/read_helper.c       |  165 ++++++++++++++++++++++++++++++++++++++++++
 fs/netfs/stats.c             |   10 ++-
 include/linux/netfs.h        |    8 ++
 include/trace/events/netfs.h |    4 +
 5 files changed, 185 insertions(+), 4 deletions(-)

diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 98b6f4516da1..b7f2c4459f33 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -34,8 +34,10 @@ extern atomic_t netfs_n_rh_read_failed;
 extern atomic_t netfs_n_rh_zero;
 extern atomic_t netfs_n_rh_short_read;
 extern atomic_t netfs_n_rh_write;
+extern atomic_t netfs_n_rh_write_begin;
 extern atomic_t netfs_n_rh_write_done;
 extern atomic_t netfs_n_rh_write_failed;
+extern atomic_t netfs_n_rh_write_zskip;
 
 
 static inline void netfs_stat(atomic_t *stat)
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 4f6f708f8f18..d179a37b92fd 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -766,3 +766,168 @@ int netfs_readpage(struct file *file,
 	return ret;
 }
 EXPORT_SYMBOL(netfs_readpage);
+
+static void netfs_clear_thp(struct page *page)
+{
+	unsigned int i;
+
+	for (i = 0; i < thp_nr_pages(page); i++)
+		clear_highpage(page + i);
+}
+
+/**
+ * netfs_write_begin - Helper to prepare for writing
+ * @file: The file to read from
+ * @mapping: The mapping to read from
+ * @pos: File position at which the write will begin
+ * @len: The length of the write in this page
+ * @flags: AOP_* flags
+ * @_page: Where to put the resultant page
+ * @_fsdata: Place for the netfs to store a cookie
+ * @ops: The network filesystem's operations for the helper to use
+ * @netfs_priv: Private netfs data to be retained in the request
+ *
+ * Pre-read data for a write-begin request by drawing data from the cache if
+ * possible, or the netfs if not.  Space beyond the EOF is zero-filled.
+ * Multiple I/O requests from different sources will get munged together.  If
+ * necessary, the readahead window can be expanded in either direction to a
+ * more convenient alignment for RPC efficiency or to make storage in the cache
+ * feasible.
+ *
+ * The calling netfs must provide a table of operations, only one of which,
+ * issue_op, is mandatory.
+ *
+ * The check_write_begin() operation can be provided to check for and flush
+ * conflicting writes once the page is grabbed and locked.  It is passed a
+ * pointer to the fsdata cookie that gets returned to the VM to be passed to
+ * write_end.  It is permitted to sleep.  It should return 0 if the request
+ * should go ahead; unlock the page and return -EAGAIN to cause the page to
+ * be grabbed and locked again; or return an error.
+ *
+ * This is usable whether or not caching is enabled.
+ */
+int netfs_write_begin(struct file *file, struct address_space *mapping,
+		      loff_t pos, unsigned int len, unsigned int flags,
+		      struct page **_page, void **_fsdata,
+		      const struct netfs_read_request_ops *ops,
+		      void *netfs_priv)
+{
+	struct netfs_read_request *rreq;
+	struct page *page, *xpage;
+	struct inode *inode = file_inode(file);
+	unsigned int debug_index = 0;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	int pos_in_page = pos & ~PAGE_MASK;
+	loff_t size;
+	int ret;
+
+	struct readahead_control ractl = {
+		.file		= file,
+		.mapping	= mapping,
+		._index		= index,
+		._nr_pages	= 0,
+	};
+
+retry:
+	page = grab_cache_page_write_begin(mapping, index, 0);
+	if (!page)
+		return -ENOMEM;
+
+	if (ops->check_write_begin) {
+		/* Allow the netfs (eg. ceph) to flush conflicts. */
+		ret = ops->check_write_begin(file, pos, len, page, _fsdata);
+		if (ret < 0) {
+			if (ret == -EAGAIN)
+				goto retry;
+			goto error;
+		}
+	}
+
+	if (PageUptodate(page))
+		goto have_page;
+
+	/* If the page is beyond the EOF, we want to clear it - unless it's
+	 * within the cache granule containing the EOF, in which case we need
+	 * to preload the granule.
+	 */
+	size = i_size_read(inode);
+	if (!ops->is_cache_enabled(inode) &&
+	    ((pos_in_page == 0 && len == thp_size(page)) ||
+	     (pos >= size) ||
+	     (pos_in_page == 0 && (pos + len) >= size))) {
+		netfs_clear_thp(page);
+		SetPageUptodate(page);
+		netfs_stat(&netfs_n_rh_write_zskip);
+		goto have_page_no_wait;
+	}
+
+	ret = -ENOMEM;
+	rreq = netfs_alloc_read_request(ops, netfs_priv, file);
+	if (!rreq)
+		goto error;
+	rreq->mapping		= page->mapping;
+	rreq->start		= page->index * PAGE_SIZE;
+	rreq->len		= thp_size(page);
+	rreq->no_unlock_page	= page->index;
+	__set_bit(NETFS_RREQ_NO_UNLOCK_PAGE, &rreq->flags);
+	netfs_priv = NULL;
+
+	netfs_stat(&netfs_n_rh_write_begin);
+	trace_netfs_read(rreq, pos, len, netfs_read_trace_write_begin);
+
+	/* Expand the request to meet caching requirements and download
+	 * preferences.
+	 */
+	ractl._nr_pages = thp_nr_pages(page);
+	netfs_rreq_expand(rreq, &ractl);
+	netfs_get_read_request(rreq);
+
+	/* We hold the page locks, so we can drop the references */
+	while ((xpage = readahead_page(&ractl)))
+		if (xpage != page)
+			put_page(xpage);
+
+	atomic_set(&rreq->nr_rd_ops, 1);
+	do {
+		if (!netfs_rreq_submit_slice(rreq, &debug_index))
+			break;
+
+	} while (rreq->submitted < rreq->len);
+
+	/* Keep nr_rd_ops incremented so that the ref always belongs to us, and
+	 * the service code isn't punted off to a random thread pool to
+	 * process.
+	 */
+	for (;;) {
+		wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
+		netfs_rreq_assess(rreq);
+		if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
+			break;
+		cond_resched();
+	}
+
+	ret = rreq->error;
+	if (ret == 0 && rreq->submitted < rreq->len)
+		ret = -EIO;
+	netfs_put_read_request(rreq);
+	if (ret < 0)
+		goto error;
+
+have_page:
+	wait_on_page_fscache(page);
+have_page_no_wait:
+	if (netfs_priv)
+		ops->cleanup(netfs_priv, mapping);
+	*_page = page;
+	_leave(" = 0");
+	return 0;
+
+error:
+	unlock_page(page);
+	put_page(page);
+	if (netfs_priv)
+		ops->cleanup(netfs_priv, mapping);
+	_leave(" = %d", ret);
+	return ret;
+}
+EXPORT_SYMBOL(netfs_write_begin);
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index df6ff5718f25..dd7ad66ed07e 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -24,19 +24,23 @@ atomic_t netfs_n_rh_read_failed;
 atomic_t netfs_n_rh_zero;
 atomic_t netfs_n_rh_short_read;
 atomic_t netfs_n_rh_write;
+atomic_t netfs_n_rh_write_begin;
 atomic_t netfs_n_rh_write_done;
 atomic_t netfs_n_rh_write_failed;
+atomic_t netfs_n_rh_write_zskip;
 
 void netfs_stats_show(struct seq_file *m)
 {
-	seq_printf(m, "RdHelp : RA=%u RP=%u rr=%u sr=%u\n",
+	seq_printf(m, "RdHelp : RA=%u RP=%u WB=%u rr=%u sr=%u\n",
 		   atomic_read(&netfs_n_rh_readahead),
 		   atomic_read(&netfs_n_rh_readpage),
+		   atomic_read(&netfs_n_rh_write_begin),
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq));
-	seq_printf(m, "RdHelp : ZR=%u sh=%u\n",
+	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u\n",
 		   atomic_read(&netfs_n_rh_zero),
-		   atomic_read(&netfs_n_rh_short_read));
+		   atomic_read(&netfs_n_rh_short_read),
+		   atomic_read(&netfs_n_rh_write_zskip));
 	seq_printf(m, "RdHelp : DL=%u ds=%u df=%u di=%u\n",
 		   atomic_read(&netfs_n_rh_download),
 		   atomic_read(&netfs_n_rh_download_done),
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index b8237b6f17cb..ec9d1240ba49 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -115,11 +115,14 @@ struct netfs_read_request {
  * Operations the network filesystem can/must provide to the helpers.
  */
 struct netfs_read_request_ops {
+	bool (*is_cache_enabled)(struct inode *inode);
 	void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
 	void (*expand_readahead)(struct netfs_read_request *rreq);
 	bool (*clamp_length)(struct netfs_read_subrequest *subreq);
 	void (*issue_op)(struct netfs_read_subrequest *subreq);
 	bool (*is_still_valid)(struct netfs_read_request *rreq);
+	int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
+				 struct page *page, void **_fsdata);
 	void (*done)(struct netfs_read_request *rreq);
 	void (*cleanup)(struct address_space *mapping, void *netfs_priv);
 };
@@ -132,6 +135,11 @@ extern int netfs_readpage(struct file *,
 			  struct page *,
 			  const struct netfs_read_request_ops *,
 			  void *);
+extern int netfs_write_begin(struct file *, struct address_space *,
+			     loff_t, unsigned int, unsigned int, struct page **,
+			     void **,
+			     const struct netfs_read_request_ops *,
+			     void *);
 
 extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t);
 extern void netfs_stats_show(struct seq_file *);
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 12ad382764c5..a2bf6cd84bd4 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -22,6 +22,7 @@ enum netfs_read_trace {
 	netfs_read_trace_expanded,
 	netfs_read_trace_readahead,
 	netfs_read_trace_readpage,
+	netfs_read_trace_write_begin,
 };
 
 enum netfs_rreq_trace {
@@ -50,7 +51,8 @@ enum netfs_sreq_trace {
 #define netfs_read_traces					\
 	EM(netfs_read_trace_expanded,		"EXPANDED ")	\
 	EM(netfs_read_trace_readahead,		"READAHEAD")	\
-	E_(netfs_read_trace_readpage,		"READPAGE ")
+	EM(netfs_read_trace_readpage,		"READPAGE ")	\
+	E_(netfs_read_trace_write_begin,	"WRITEBEGN")
 
 #define netfs_rreq_traces					\
 	EM(netfs_rreq_trace_assess,		"ASSESS")	\





* [PATCH 12/33] netfs: Define an interface to talk to a cache
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (10 preceding siblings ...)
  2021-02-15 15:46 ` [PATCH 11/33] netfs: Add write_begin helper David Howells
@ 2021-02-15 15:46 ` David Howells
  2021-02-15 15:46 ` [PATCH 13/33] netfs: Hold a ref on a page when PG_private_2 is set David Howells
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:46 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Jeff Layton, Matthew Wilcox, linux-mm, linux-cachefs, linux-afs,
	linux-nfs, linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	dhowells, Jeff Layton, David Wysochanski, Matthew Wilcox (Oracle),
	Alexander Viro, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel

Add an interface to the netfs helper library for reading data from the
cache instead of downloading it from the server, and for writing data
just downloaded or cleared to the cache.

The API passes an iov_iter to the cache read/write routines to indicate the
data/buffer to be used.  This is done using the ITER_XARRAY type to provide
direct access to the netfs inode's pagecache.
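
For example, the read side describes its buffer along these lines (a
simplified sketch; mapping, start and len stand in for the subrequest's
parameters):

	struct iov_iter iter;

	/* Map the inode's pagecache pages covering [start, start + len)
	 * directly as the I/O buffer; no bio_vec or page list needs to
	 * be assembled first.
	 */
	iov_iter_xarray(&iter, READ, &mapping->i_pages, start, len);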

When the netfs's ->begin_cache_operation() method is called, it must fill
in the cache_resources in the netfs_read_request struct, including the
netfs_cache_ops table used by the helper library to talk to the cache.  The
helper library does not access the cache directly.
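
Concretely, a netfs using fscache could implement the hook along these
lines, using the fscache_begin_read_operation() helper added by this
patch (myfs_inode_cookie() is a made-up, filesystem-specific accessor
for the inode's cache cookie):

	static int myfs_begin_cache_operation(struct netfs_read_request *rreq)
	{
		struct fscache_cookie *cookie = myfs_inode_cookie(rreq->inode);

		/* On success, this fills in rreq->cache_resources,
		 * including the netfs_cache_ops table that the helper
		 * library will call.
		 */
		return fscache_begin_read_operation(rreq, cookie);
	}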

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 fs/netfs/read_helper.c        |  230 +++++++++++++++++++++++++++++++++++++++++
 fs/netfs/stats.c              |    3 -
 include/linux/fscache-cache.h |    4 +
 include/linux/fscache.h       |   17 +++
 include/linux/netfs.h         |   48 +++++++++
 5 files changed, 300 insertions(+), 2 deletions(-)

diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index d179a37b92fd..156941e0de61 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -86,6 +86,8 @@ static void netfs_free_read_request(struct work_struct *work)
 	if (rreq->netfs_priv)
 		rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
 	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
+	if (rreq->cache_resources.ops)
+		rreq->cache_resources.ops->end_operation(&rreq->cache_resources);
 	kfree(rreq);
 	netfs_stat_d(&netfs_n_rh_rreq);
 }
@@ -149,6 +151,32 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
 	iov_iter_zero(iov_iter_count(&iter), &iter);
 }
 
+static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error)
+{
+	struct netfs_read_subrequest *subreq = priv;
+
+	netfs_subreq_terminated(subreq, transferred_or_error);
+}
+
+/*
+ * Issue a read against the cache.
+ * - Eats the caller's ref on subreq.
+ */
+static void netfs_read_from_cache(struct netfs_read_request *rreq,
+				  struct netfs_read_subrequest *subreq,
+				  bool seek_data)
+{
+	struct netfs_cache_resources *cres = &rreq->cache_resources;
+	struct iov_iter iter;
+
+	iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages,
+			subreq->start + subreq->transferred,
+			subreq->len   - subreq->transferred);
+
+	cres->ops->read(cres, subreq->start, &iter, seek_data,
+			netfs_cache_read_terminated, subreq);
+}
+
 /*
  * Fill a subrequest region with zeroes.
  */
@@ -193,6 +221,141 @@ static void netfs_rreq_completed(struct netfs_read_request *rreq)
 	netfs_put_read_request(rreq);
 }
 
+/*
+ * Deal with the completion of writing the data to the cache.  We have to clear
+ * the PG_fscache bits on the pages involved and release the caller's ref.
+ *
+ * May be called in softirq mode and we inherit a ref from the caller.
+ */
+static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq;
+	struct page *page;
+	pgoff_t unlocked = 0;
+	bool have_unlocked = false;
+
+	rcu_read_lock();
+
+	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+		XA_STATE(xas, &rreq->mapping->i_pages, subreq->start / PAGE_SIZE);
+
+		xas_for_each(&xas, page, (subreq->start + subreq->len - 1) / PAGE_SIZE) {
+			/* We might have multiple writes from the same huge
+			 * page, but we mustn't unlock a page more than once.
+			 */
+			if (have_unlocked && page->index <= unlocked)
+				continue;
+			unlocked = page->index;
+			unlock_page_fscache(page);
+			have_unlocked = true;
+		}
+	}
+
+	rcu_read_unlock();
+	netfs_rreq_completed(rreq);
+}
+
+static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
+{
+	struct netfs_read_subrequest *subreq = priv;
+	struct netfs_read_request *rreq = subreq->rreq;
+
+	if (IS_ERR_VALUE(transferred_or_error)) {
+		subreq->error = transferred_or_error;
+		netfs_stat(&netfs_n_rh_write_failed);
+	} else {
+		subreq->error = 0;
+		netfs_stat(&netfs_n_rh_write_done);
+	}
+
+	trace_netfs_sreq(subreq, netfs_sreq_trace_write_term);
+
+	/* If we decrement nr_wr_ops to 0, the ref belongs to us. */
+	if (atomic_dec_and_test(&rreq->nr_wr_ops))
+		netfs_rreq_unmark_after_write(rreq);
+
+	netfs_put_subrequest(subreq);
+}
+
+/*
+ * Perform any outstanding writes to the cache.  We inherit a ref from the
+ * caller.
+ */
+static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
+{
+	struct netfs_cache_resources *cres = &rreq->cache_resources;
+	struct netfs_read_subrequest *subreq, *next, *p;
+	struct iov_iter iter;
+	loff_t pos;
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_write);
+
+	/* We don't want terminating writes trying to wake us up whilst we're
+	 * still going through the list.
+	 */
+	atomic_inc(&rreq->nr_wr_ops);
+
+	list_for_each_entry_safe(subreq, p, &rreq->subrequests, rreq_link) {
+		if (!test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
+			list_del_init(&subreq->rreq_link);
+			netfs_put_subrequest(subreq);
+		}
+	}
+
+	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+		/* Amalgamate adjacent writes */
+		pos = round_down(subreq->start, PAGE_SIZE);
+		if (pos != subreq->start) {
+			subreq->len += subreq->start - pos;
+			subreq->start = pos;
+		}
+		subreq->len = round_up(subreq->len, PAGE_SIZE);
+
+		while (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
+			next = list_next_entry(subreq, rreq_link);
+			if (next->start > subreq->start + subreq->len)
+				break;
+			subreq->len += next->len;
+			subreq->len = round_up(subreq->len, PAGE_SIZE);
+			list_del_init(&next->rreq_link);
+			netfs_put_subrequest(next);
+		}
+
+		iov_iter_xarray(&iter, WRITE, &rreq->mapping->i_pages,
+				subreq->start, subreq->len);
+
+		atomic_inc(&rreq->nr_wr_ops);
+		netfs_stat(&netfs_n_rh_write);
+		netfs_get_read_subrequest(subreq);
+		trace_netfs_sreq(subreq, netfs_sreq_trace_write);
+		cres->ops->write(cres, subreq->start, &iter,
+				 netfs_rreq_copy_terminated, subreq);
+	}
+
+	/* If we decrement nr_wr_ops to 0, the usage ref belongs to us. */
+	if (atomic_dec_and_test(&rreq->nr_wr_ops))
+		netfs_rreq_unmark_after_write(rreq);
+}
+
+static void netfs_rreq_write_to_cache_work(struct work_struct *work)
+{
+	struct netfs_read_request *rreq =
+		container_of(work, struct netfs_read_request, work);
+
+	netfs_rreq_do_write_to_cache(rreq);
+}
+
+static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq)
+{
+	if (in_softirq()) {
+		rreq->work.func = netfs_rreq_write_to_cache_work;
+		if (!queue_work(system_unbound_wq, &rreq->work))
+			BUG();
+	} else {
+		netfs_rreq_do_write_to_cache(rreq);
+	}
+}
+
 /*
  * Unlock the pages in a read operation.  We need to set PG_fscache on any
  * pages we're going to write back before we unlock them.
@@ -294,7 +457,10 @@ static void netfs_rreq_short_read(struct netfs_read_request *rreq,
 
 	netfs_get_read_subrequest(subreq);
 	atomic_inc(&rreq->nr_rd_ops);
-	netfs_read_from_server(rreq, subreq);
+	if (subreq->source == NETFS_READ_FROM_CACHE)
+		netfs_read_from_cache(rreq, subreq, true);
+	else
+		netfs_read_from_server(rreq, subreq);
 }
 
 /*
@@ -339,6 +505,25 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
 	return false;
 }
 
+/*
+ * Check to see if the data read is still valid.
+ */
+static void netfs_rreq_is_still_valid(struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq;
+
+	if (!rreq->netfs_ops->is_still_valid ||
+	    rreq->netfs_ops->is_still_valid(rreq))
+		return;
+
+	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+		if (subreq->source == NETFS_READ_FROM_CACHE) {
+			subreq->error = -ESTALE;
+			__set_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags);
+		}
+	}
+}
+
 /*
  * Assess the state of a read request and decide what to do next.
  *
@@ -350,6 +535,8 @@ static void netfs_rreq_assess(struct netfs_read_request *rreq)
 	trace_netfs_rreq(rreq, netfs_rreq_trace_assess);
 
 again:
+	netfs_rreq_is_still_valid(rreq);
+
 	if (!test_bit(NETFS_RREQ_FAILED, &rreq->flags) &&
 	    test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags)) {
 		if (netfs_rreq_perform_resubmissions(rreq))
@@ -362,6 +549,9 @@ static void netfs_rreq_assess(struct netfs_read_request *rreq)
 	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
 	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
 
+	if (test_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags))
+		return netfs_rreq_write_to_cache(rreq);
+
 	netfs_rreq_completed(rreq);
 }
 
@@ -493,7 +683,10 @@ static enum netfs_read_source netfs_cache_prepare_read(struct netfs_read_subrequ
 						       loff_t i_size)
 {
 	struct netfs_read_request *rreq = subreq->rreq;
+	struct netfs_cache_resources *cres = &rreq->cache_resources;
 
+	if (cres->ops)
+		return cres->ops->prepare_read(subreq, i_size);
 	if (subreq->start >= rreq->i_size)
 		return NETFS_FILL_WITH_ZEROES;
 	return NETFS_DOWNLOAD_FROM_SERVER;
@@ -584,6 +777,9 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
 	case NETFS_DOWNLOAD_FROM_SERVER:
 		netfs_read_from_server(rreq, subreq);
 		break;
+	case NETFS_READ_FROM_CACHE:
+		netfs_read_from_cache(rreq, subreq, false);
+		break;
 	default:
 		BUG();
 	}
@@ -596,9 +792,23 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
 	return false;
 }
 
+static void netfs_cache_expand_readahead(struct netfs_read_request *rreq,
+					 loff_t *_start, size_t *_len, loff_t i_size)
+{
+	struct netfs_cache_resources *cres = &rreq->cache_resources;
+
+	if (cres->ops && cres->ops->expand_readahead)
+		cres->ops->expand_readahead(cres, _start, _len, i_size);
+}
+
 static void netfs_rreq_expand(struct netfs_read_request *rreq,
 			      struct readahead_control *ractl)
 {
+	/* Give the cache a chance to change the request parameters.  The
+	 * resultant request must contain the original region.
+	 */
+	netfs_cache_expand_readahead(rreq, &rreq->start, &rreq->len, rreq->i_size);
+
 	/* Give the netfs a chance to change the request parameters.  The
 	 * resultant request must contain the original region.
 	 */
@@ -650,6 +860,7 @@ void netfs_readahead(struct readahead_control *ractl,
 	struct netfs_read_request *rreq;
 	struct page *page;
 	unsigned int debug_index = 0;
+	int ret;
 
 	_enter("%lx,%x", readahead_index(ractl), readahead_count(ractl));
 
@@ -667,6 +878,11 @@ void netfs_readahead(struct readahead_control *ractl,
 	trace_netfs_read(rreq, readahead_pos(ractl), readahead_length(ractl),
 			 netfs_read_trace_readahead);
 
+	if (ops->begin_cache_operation) {
+		ret = ops->begin_cache_operation(rreq);
+		if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS)
+			goto cleanup_free;
+	}
 	netfs_rreq_expand(rreq, ractl);
 
 	atomic_set(&rreq->nr_rd_ops, 1);
@@ -692,6 +908,9 @@ void netfs_readahead(struct readahead_control *ractl,
 		netfs_rreq_assess(rreq);
 	return;
 
+cleanup_free:
+	netfs_put_read_request(rreq);
+	return;
 cleanup:
 	if (netfs_priv)
 		ops->cleanup(ractl->mapping, netfs_priv);
@@ -741,6 +960,14 @@ int netfs_readpage(struct file *file,
 	netfs_stat(&netfs_n_rh_readpage);
 	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage);
 
+	if (ops->begin_cache_operation) {
+		ret = ops->begin_cache_operation(rreq);
+		if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS) {
+			unlock_page(page);
+			goto out;
+		}
+	}
+
 	netfs_get_read_request(rreq);
 
 	atomic_set(&rreq->nr_rd_ops, 1);
@@ -762,6 +989,7 @@ int netfs_readpage(struct file *file,
 	ret = rreq->error;
 	if (ret == 0 && rreq->submitted < rreq->len)
 		ret = -EIO;
+out:
 	netfs_put_read_request(rreq);
 	return ret;
 }
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index dd7ad66ed07e..9ae538c85378 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -31,10 +31,11 @@ atomic_t netfs_n_rh_write_zskip;
 
 void netfs_stats_show(struct seq_file *m)
 {
-	seq_printf(m, "RdHelp : RA=%u RP=%u WB=%u rr=%u sr=%u\n",
+	seq_printf(m, "RdHelp : RA=%u RP=%u WB=%u WBZ=%u rr=%u sr=%u\n",
 		   atomic_read(&netfs_n_rh_readahead),
 		   atomic_read(&netfs_n_rh_readpage),
 		   atomic_read(&netfs_n_rh_write_begin),
+		   atomic_read(&netfs_n_rh_write_zskip),
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq));
 	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u\n",
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 3f0b19dcfae7..3235ddbdcc09 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -304,6 +304,10 @@ struct fscache_cache_ops {
 
 	/* dissociate a cache from all the pages it was backing */
 	void (*dissociate_pages)(struct fscache_cache *cache);
+
+	/* Begin a read operation for the netfs lib */
+	int (*begin_read_operation)(struct netfs_read_request *rreq,
+				    struct fscache_retrieval *op);
 };
 
 extern struct fscache_cookie fscache_fsdef_index;
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index 1f8dc72369ee..0e4161ce347c 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -37,6 +37,7 @@ struct pagevec;
 struct fscache_cache_tag;
 struct fscache_cookie;
 struct fscache_netfs;
+struct netfs_read_request;
 
 typedef void (*fscache_rw_complete_t)(struct page *page,
 				      void *context,
@@ -217,6 +218,7 @@ extern void __fscache_readpages_cancel(struct fscache_cookie *cookie,
 extern void __fscache_disable_cookie(struct fscache_cookie *, const void *, bool);
 extern void __fscache_enable_cookie(struct fscache_cookie *, const void *, loff_t,
 				    bool (*)(void *), void *);
+extern int __fscache_begin_read_operation(struct netfs_read_request *, struct fscache_cookie *);
 
 /**
  * fscache_register_netfs - Register a filesystem as desiring caching services
@@ -831,4 +833,19 @@ void fscache_enable_cookie(struct fscache_cookie *cookie,
 					can_enable, data);
 }
 
+/**
+ * fscache_begin_read_operation - Begin a read operation for the netfs lib
+ * @rreq: The read request being undertaken
+ * @cookie: The cookie representing the cache object
+ */
+static inline
+int fscache_begin_read_operation(struct netfs_read_request *rreq,
+				 struct fscache_cookie *cookie)
+{
+	if (fscache_cookie_valid(cookie) && fscache_cookie_enabled(cookie))
+		return __fscache_begin_read_operation(rreq, cookie);
+	else
+		return -ENOBUFS;
+}
+
 #endif /* _LINUX_FSCACHE_H */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index ec9d1240ba49..b2589b39feb8 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -60,6 +60,17 @@ enum netfs_read_source {
 	NETFS_INVALID_READ,
 } __mode(byte);
 
+typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error);
+
+/*
+ * Resources required to do operations on a cache.
+ */
+struct netfs_cache_resources {
+	const struct netfs_cache_ops	*ops;
+	void				*cache_priv;
+	void				*cache_priv2;
+};
+
 /*
  * Descriptor for a single component subrequest.
  */
@@ -89,11 +100,13 @@ struct netfs_read_request {
 	struct work_struct	work;
 	struct inode		*inode;		/* The file being accessed */
 	struct address_space	*mapping;	/* The mapping being accessed */
+	struct netfs_cache_resources cache_resources;
 	struct list_head	subrequests;	/* Requests to fetch I/O from disk or net */
 	void			*netfs_priv;	/* Private data for the netfs */
 	unsigned int		debug_id;
 	unsigned int		cookie_debug_id;
 	atomic_t		nr_rd_ops;	/* Number of read ops in progress */
+	atomic_t		nr_wr_ops;	/* Number of write ops in progress */
 	size_t			submitted;	/* Amount submitted for I/O so far */
 	size_t			len;		/* Length of the request */
 	short			error;		/* 0 or error that occurred */
@@ -117,6 +130,7 @@ struct netfs_read_request {
 struct netfs_read_request_ops {
 	bool (*is_cache_enabled)(struct inode *inode);
 	void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
+	int (*begin_cache_operation)(struct netfs_read_request *rreq);
 	void (*expand_readahead)(struct netfs_read_request *rreq);
 	bool (*clamp_length)(struct netfs_read_subrequest *subreq);
 	void (*issue_op)(struct netfs_read_subrequest *subreq);
@@ -127,6 +141,40 @@ struct netfs_read_request_ops {
 	void (*cleanup)(struct address_space *mapping, void *netfs_priv);
 };
 
+/*
+ * Table of operations for access to a cache.  This is obtained by
+ * rreq->netfs_ops->begin_cache_operation().
+ */
+struct netfs_cache_ops {
+	/* End an operation */
+	void (*end_operation)(struct netfs_cache_resources *cres);
+
+	/* Read data from the cache */
+	int (*read)(struct netfs_cache_resources *cres,
+		    loff_t start_pos,
+		    struct iov_iter *iter,
+		    bool seek_data,
+		    netfs_io_terminated_t term_func,
+		    void *term_func_priv);
+
+	/* Write data to the cache */
+	int (*write)(struct netfs_cache_resources *cres,
+		     loff_t start_pos,
+		     struct iov_iter *iter,
+		     netfs_io_terminated_t term_func,
+		     void *term_func_priv);
+
+	/* Expand readahead request */
+	void (*expand_readahead)(struct netfs_cache_resources *cres,
+				 loff_t *_start, size_t *_len, loff_t i_size);
+
+	/* Prepare a read operation, shortening it to a cached/uncached
+	 * boundary as appropriate.
+	 */
+	enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq,
+					       loff_t i_size);
+};
+
 struct readahead_control;
 extern void netfs_readahead(struct readahead_control *,
 			    const struct netfs_read_request_ops *,





* [PATCH 13/33] netfs: Hold a ref on a page when PG_private_2 is set
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (11 preceding siblings ...)
  2021-02-15 15:46 ` [PATCH 12/33] netfs: Define an interface to talk to a cache David Howells
@ 2021-02-15 15:46 ` David Howells
  2021-02-15 18:05 ` [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] Jeff Layton
  2021-02-15 22:46 ` [PATCH 34/33] netfs: Use in_interrupt() not in_softirq() David Howells
  14 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 15:46 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Steve French, Dominique Martinet
  Cc: Linus Torvalds, Matthew Wilcox, Linus Torvalds, linux-mm,
	linux-cachefs, linux-afs, linux-nfs, linux-cifs, ceph-devel,
	v9fs-developer, linux-fsdevel, dhowells, Jeff Layton,
	David Wysochanski, Matthew Wilcox (Oracle),
	Alexander Viro, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, linux-kernel

Take a reference on a page when PG_private_2 is set and drop it once the
bit is unlocked.
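
In other words, the rule becomes: whoever sets PG_fscache (an alias of
PG_private_2) holds a page reference until the bit is unlocked.  A
minimal sketch of the pairing, using the helpers from earlier in this
series (the actual patch batches the puts through a pagevec):

	/* When marking a page for writing back to the cache: */
	get_page(page);
	SetPageFsCache(page);

	/* ...and once the cache write has completed: */
	unlock_page_fscache(page);	/* clear PG_fscache, wake waiters */
	put_page(page);			/* drop the ref taken above */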

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/1331025.1612974993@warthog.procyon.org.uk/
---

 fs/netfs/read_helper.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 156941e0de61..9191a3617d91 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -10,6 +10,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/slab.h>
 #include <linux/uio.h>
 #include <linux/sched/mm.h>
@@ -230,10 +231,13 @@ static void netfs_rreq_completed(struct netfs_read_request *rreq)
 static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
 {
 	struct netfs_read_subrequest *subreq;
+	struct pagevec pvec;
 	struct page *page;
 	pgoff_t unlocked = 0;
 	bool have_unlocked = false;
 
+	pagevec_init(&pvec);
+
 	rcu_read_lock();
 
 	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
@@ -247,6 +251,8 @@ static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
 				continue;
 			unlocked = page->index;
 			unlock_page_fscache(page);
+			if (pagevec_add(&pvec, page) == 0)
+				pagevec_release(&pvec);
 			have_unlocked = true;
 		}
 	}
@@ -403,8 +409,10 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
 				pg_failed = true;
 				break;
 			}
-			if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags))
+			if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
+				get_page(page);
 				SetPageFsCache(page);
+			}
 			pg_failed |= subreq_failed;
 			if (pgend < iopos + subreq->len)
 				break;





* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (12 preceding siblings ...)
  2021-02-15 15:46 ` [PATCH 13/33] netfs: Hold a ref on a page when PG_private_2 is set David Howells
@ 2021-02-15 18:05 ` Jeff Layton
  2021-02-16  0:40   ` Steve French
  2021-02-15 22:46 ` [PATCH 34/33] netfs: Use in_interrupt() not in_softirq() David Howells
  14 siblings, 1 reply; 47+ messages in thread
From: Jeff Layton @ 2021-02-15 18:05 UTC (permalink / raw)
  To: David Howells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet
  Cc: linux-cifs, ceph-devel, Matthew Wilcox, linux-cachefs,
	Alexander Viro, linux-mm, linux-afs, v9fs-developer,
	Christoph Hellwig, linux-fsdevel, linux-nfs, Linus Torvalds,
	David Wysochanski, linux-kernel

On Mon, 2021-02-15 at 15:44 +0000, David Howells wrote:
> Here's a set of patches to do two things:
> 
>  (1) Add a helper library to handle the new VM readahead interface.  This
>      is intended to be used unconditionally by the filesystem (whether or
>      not caching is enabled) and provides a common framework for doing
>      caching, transparent huge pages and, in the future, possibly fscrypt
>      and read bandwidth maximisation.  It also allows the netfs and the
>      cache to align, expand and slice up a read request from the VM in
>      various ways; the netfs need only provide a function to read a stretch
>      of data to the pagecache and the helper takes care of the rest.
> 
>  (2) Add an alternative fscache/cachfiles I/O API that uses the kiocb
>      facility to do async DIO to transfer data to/from the netfs's pages,
>      rather than using readpage with wait queue snooping on one side and
>      vfs_write() on the other.  It also uses less memory, since it doesn't
>      do buffered I/O on the backing file.
> 
>      Note that this uses SEEK_HOLE/SEEK_DATA to locate the data available
>      to be read from the cache.  Whilst this is an improvement from the
>      bmap interface, it still has a problem with regard to a modern
>      extent-based filesystem inserting or removing bridging blocks of
>      zeros.  Fixing that requires a much greater overhaul.
> 
> This is a step towards overhauling the fscache API.  The change is opt-in
> on the part of the network filesystem.  A netfs should not try to mix the
> old and the new API because of conflicting ways of handling pages and the
> PG_fscache page flag and because it would be mixing DIO with buffered I/O.
> Further, the helper library can't be used with the old API.
> 
> This does not change any of the fscache cookie handling APIs or the way
> invalidation is done.
> 
> In the near term, I intend to deprecate and remove the old I/O API
> (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
> fscache_write_page() and fscache_uncache_page()) and eventually replace
> most of fscache/cachefiles with something simpler and easier to follow.
> 
> The patchset contains five parts:
> 
>  (1) Some helper patches, including provision of an ITER_XARRAY iov
>      iterator and a function to do readahead expansion.
> 
>  (2) Patches to add the netfs helper library.
> 
>  (3) A patch to add the fscache/cachefiles kiocb API.
> 
>  (4) Patches to add support in AFS for this.
> 
>  (5) Patches from Jeff Layton to add support in Ceph for this.
> 
> Dave Wysochanski also has patches for NFS for this, though they're not
> included on this branch as there's an issue with PNFS.
> 
> With this, AFS without a cache passes all expected xfstests; with a cache,
> there's an extra failure, but that's also there before these patches.
> Fixing that probably requires a greater overhaul.  Ceph and NFS also pass
> the expected tests.
> 
> These patches can be found also on:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-netfs-lib
> 
> For diffing reference, the tag for the 9th Feb pull request is
> fscache-ioapi-20210203 and can be found in the same repository.
> 
> 
> 
> Changes
> =======
> 
>  (v3) Rolled in the bug fixes.
> 
>       Adjusted the functions that unlock and wait for PG_fscache according
>       to Linus's suggestion.
> 
>       Hold a ref on a page when PG_fscache is set as per Linus's
>       suggestion.
> 
>       Dropped NFS support and added Ceph support.
> 
>  (v2) Fixed some bugs and added NFS support.
> 
> 
> References
> ==========
> 
> These patches have been published for review before, firstly as part of a
> larger set:
> 
> Link: https://lore.kernel.org/linux-fsdevel/158861203563.340223.7585359869938129395.stgit@warthog.procyon.org.uk/
> 
> Link: https://lore.kernel.org/linux-fsdevel/159465766378.1376105.11619976251039287525.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/159465821598.1377938.2046362270225008168.stgit@warthog.procyon.org.uk/
> 
> Link: https://lore.kernel.org/linux-fsdevel/160588455242.3465195.3214733858273019178.stgit@warthog.procyon.org.uk/
> 
> Then as a cut-down set:
> 
> Link: https://lore.kernel.org/linux-fsdevel/161118128472.1232039.11746799833066425131.stgit@warthog.procyon.org.uk/
> 
> Link: https://lore.kernel.org/linux-fsdevel/161161025063.2537118.2009249444682241405.stgit@warthog.procyon.org.uk/
> 
> 
> Proposals/information about the design has been published here:
> 
> Link: https://lore.kernel.org/lkml/24942.1573667720@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/2758811.1610621106@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/1441311.1598547738@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/160655.1611012999@warthog.procyon.org.uk/
> 
> And requests for information:
> 
> Link: https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/4467.1579020509@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/3577430.1579705075@warthog.procyon.org.uk/
> 
> The NFS parts, though not included here, have been tested by someone who's
> using fscache in production:
> 
> Link: https://listman.redhat.com/archives/linux-cachefs/2020-December/msg00000.html
> 
> I've posted partial patches to try and help 9p and cifs along:
> 
> Link: https://lore.kernel.org/linux-fsdevel/1514086.1605697347@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-cifs/1794123.1605713481@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/241017.1612263863@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-cifs/270998.1612265397@warthog.procyon.org.uk/
> 
> David
> ---
> David Howells (27):
>       iov_iter: Add ITER_XARRAY
>       mm: Add an unlock function for PG_private_2/PG_fscache
>       mm: Implement readahead_control pageset expansion
>       vfs: Export rw_verify_area() for use by cachefiles
>       netfs: Make a netfs helper module
>       netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
>       netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache()
>       netfs: Provide readahead and readpage netfs helpers
>       netfs: Add tracepoints
>       netfs: Gather stats
>       netfs: Add write_begin helper
>       netfs: Define an interface to talk to a cache
>       netfs: Hold a ref on a page when PG_private_2 is set
>       fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
>       afs: Disable use of the fscache I/O routines
>       afs: Pass page into dirty region helpers to provide THP size
>       afs: Print the operation debug_id when logging an unexpected data version
>       afs: Move key to afs_read struct
>       afs: Don't truncate iter during data fetch
>       afs: Log remote unmarshalling errors
>       afs: Set up the iov_iter before calling afs_extract_data()
>       afs: Use ITER_XARRAY for writing
>       afs: Wait on PG_fscache before modifying/releasing a page
>       afs: Extract writeback extension into its own function
>       afs: Prepare for use of THPs
>       afs: Use the fs operation ops to handle FetchData completion
>       afs: Use new fscache read helper API
> 
> Jeff Layton (6):
>       ceph: disable old fscache readpage handling
>       ceph: rework PageFsCache handling
>       ceph: fix fscache invalidation
>       ceph: convert readpage to fscache read helper
>       ceph: plug write_begin into read helper
>       ceph: convert ceph_readpages to ceph_readahead
> 
> 
>  fs/Kconfig                    |    1 +
>  fs/Makefile                   |    1 +
>  fs/afs/Kconfig                |    1 +
>  fs/afs/dir.c                  |  225 ++++---
>  fs/afs/file.c                 |  470 ++++---------
>  fs/afs/fs_operation.c         |    4 +-
>  fs/afs/fsclient.c             |  108 +--
>  fs/afs/inode.c                |    7 +-
>  fs/afs/internal.h             |   58 +-
>  fs/afs/rxrpc.c                |  150 ++---
>  fs/afs/write.c                |  610 +++++++++--------
>  fs/afs/yfsclient.c            |   82 +--
>  fs/cachefiles/Makefile        |    1 +
>  fs/cachefiles/interface.c     |    5 +-
>  fs/cachefiles/internal.h      |    9 +
>  fs/cachefiles/rdwr2.c         |  412 ++++++++++++
>  fs/ceph/Kconfig               |    1 +
>  fs/ceph/addr.c                |  535 ++++++---------
>  fs/ceph/cache.c               |  125 ----
>  fs/ceph/cache.h               |  101 +--
>  fs/ceph/caps.c                |   10 +-
>  fs/ceph/inode.c               |    1 +
>  fs/ceph/super.h               |    1 +
>  fs/fscache/Kconfig            |    1 +
>  fs/fscache/Makefile           |    3 +-
>  fs/fscache/internal.h         |    3 +
>  fs/fscache/page.c             |    2 +-
>  fs/fscache/page2.c            |  117 ++++
>  fs/fscache/stats.c            |    1 +
>  fs/internal.h                 |    5 -
>  fs/netfs/Kconfig              |   23 +
>  fs/netfs/Makefile             |    5 +
>  fs/netfs/internal.h           |   97 +++
>  fs/netfs/read_helper.c        | 1169 +++++++++++++++++++++++++++++++++
>  fs/netfs/stats.c              |   59 ++
>  fs/read_write.c               |    1 +
>  include/linux/fs.h            |    1 +
>  include/linux/fscache-cache.h |    4 +
>  include/linux/fscache.h       |   40 +-
>  include/linux/netfs.h         |  195 ++++++
>  include/linux/pagemap.h       |    3 +
>  include/net/af_rxrpc.h        |    2 +-
>  include/trace/events/afs.h    |   74 +--
>  include/trace/events/netfs.h  |  201 ++++++
>  mm/filemap.c                  |   20 +
>  mm/readahead.c                |   70 ++
>  net/rxrpc/recvmsg.c           |    9 +-
>  47 files changed, 3473 insertions(+), 1550 deletions(-)
>  create mode 100644 fs/cachefiles/rdwr2.c
>  create mode 100644 fs/fscache/page2.c
>  create mode 100644 fs/netfs/Kconfig
>  create mode 100644 fs/netfs/Makefile
>  create mode 100644 fs/netfs/internal.h
>  create mode 100644 fs/netfs/read_helper.c
>  create mode 100644 fs/netfs/stats.c
>  create mode 100644 include/linux/netfs.h
>  create mode 100644 include/trace/events/netfs.h
> 
> 

Thanks David,

I did an xfstests run on ceph with a kernel based on this and it seemed
to do fine. I'll plan to pull this into the ceph-client/testing branch
and run it through the ceph kclient test harness. There are only a few
differences from the last run we did, so I'm not expecting big changes,
but I'll keep you posted.

-- 
Jeff Layton <jlayton@redhat.com>




* [PATCH 34/33] netfs: Use in_interrupt() not in_softirq()
  2021-02-15 15:44 [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] David Howells
                   ` (13 preceding siblings ...)
  2021-02-15 18:05 ` [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] Jeff Layton
@ 2021-02-15 22:46 ` David Howells
  2021-02-16  8:42   ` Christoph Hellwig
  2021-02-16  9:29   ` David Howells
  14 siblings, 2 replies; 47+ messages in thread
From: David Howells @ 2021-02-15 22:46 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, Marc Dionne, Anna Schumaker, Steve French,
	Dominique Martinet, linux-cifs, ceph-devel, Jeff Layton,
	Matthew Wilcox, linux-cachefs, Alexander Viro, linux-mm,
	linux-afs, v9fs-developer, Christoph Hellwig, linux-fsdevel,
	linux-nfs, Jeff Layton, David Wysochanski, linux-kernel

The in_softirq() check in netfs_rreq_terminated() works fine when the cache
is on a normal disk, where the completion handlers get called in softirq
context, but with an NVMe drive the completion handler may get called in
hard-IRQ context instead.

Fix this by using in_interrupt() instead of in_softirq() throughout the
read helpers, particularly where they decide whether to punt work that
might sleep off to a worker thread.
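
The pattern in question looks roughly like this (a simplified sketch of
the punting logic, not the literal diff below):

	static void example_terminated(struct netfs_read_request *rreq)
	{
		/* in_softirq() is false in hard-IRQ context (e.g. in an
		 * NVMe completion handler), so the old test would run
		 * code that might sleep directly from the IRQ handler.
		 * in_interrupt() covers both softirq and hard-IRQ.
		 */
		if (in_interrupt()) {
			if (!queue_work(system_unbound_wq, &rreq->work))
				BUG();
		} else {
			netfs_rreq_work(&rreq->work); /* process synchronously */
		}
	}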

The symptom involves warnings like the following appearing and the kernel
hanging:

 WARNING: CPU: 0 PID: 0 at kernel/softirq.c:175 __local_bh_enable_ip+0x35/0x50
 ...
 RIP: 0010:__local_bh_enable_ip+0x35/0x50
 ...
 Call Trace:
  <IRQ>
  rxrpc_kernel_begin_call+0x7d/0x1b0 [rxrpc]
  ? afs_rx_new_call+0x40/0x40 [kafs]
  ? afs_alloc_call+0x28/0x120 [kafs]
  afs_make_call+0x120/0x510 [kafs]
  ? afs_rx_new_call+0x40/0x40 [kafs]
  ? afs_alloc_flat_call+0xba/0x100 [kafs]
  ? __kmalloc+0x167/0x2f0
  ? afs_alloc_flat_call+0x9b/0x100 [kafs]
  afs_wait_for_operation+0x2d/0x200 [kafs]
  afs_do_sync_operation+0x16/0x20 [kafs]
  afs_req_issue_op+0x8c/0xb0 [kafs]
  netfs_rreq_assess+0x125/0x7d0 [netfs]
  ? cachefiles_end_operation+0x40/0x40 [cachefiles]
  netfs_subreq_terminated+0x117/0x220 [netfs]
  cachefiles_read_complete+0x21/0x60 [cachefiles]
  iomap_dio_bio_end_io+0xdd/0x110
  blk_update_request+0x20a/0x380
  blk_mq_end_request+0x1c/0x120
  nvme_process_cq+0x159/0x1f0 [nvme]
  nvme_irq+0x10/0x20 [nvme]
  __handle_irq_event_percpu+0x37/0x150
  handle_irq_event+0x49/0xb0
  handle_edge_irq+0x7c/0x200
  asm_call_irq_on_stack+0xf/0x20
  </IRQ>
  common_interrupt+0xad/0x120
  asm_common_interrupt+0x1e/0x40
 ...

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---
 read_helper.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 9191a3617d91..db582008b4bd 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -96,7 +96,7 @@ static void netfs_free_read_request(struct work_struct *work)
 static void netfs_put_read_request(struct netfs_read_request *rreq)
 {
 	if (refcount_dec_and_test(&rreq->usage)) {
-		if (in_softirq()) {
+		if (in_interrupt()) {
 			rreq->work.func = netfs_free_read_request;
 			if (!queue_work(system_unbound_wq, &rreq->work))
 				BUG();
@@ -353,7 +353,7 @@ static void netfs_rreq_write_to_cache_work(struct work_struct *work)
 
 static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq)
 {
-	if (in_softirq()) {
+	if (in_interrupt()) {
 		rreq->work.func = netfs_rreq_write_to_cache_work;
 		if (!queue_work(system_unbound_wq, &rreq->work))
 			BUG();
@@ -479,7 +479,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
 {
 	struct netfs_read_subrequest *subreq;
 
-	WARN_ON(in_softirq());
+	WARN_ON(in_interrupt());
 
 	trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
 
@@ -577,7 +577,7 @@ static void netfs_rreq_work(struct work_struct *work)
 static void netfs_rreq_terminated(struct netfs_read_request *rreq)
 {
 	if (test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags) &&
-	    in_softirq()) {
+	    in_interrupt()) {
 		if (!queue_work(system_unbound_wq, &rreq->work))
 			BUG();
 	} else {




* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-15 18:05 ` [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3] Jeff Layton
@ 2021-02-16  0:40   ` Steve French
  2021-02-16  2:10     ` Matthew Wilcox
  2021-02-16 11:01     ` Jeff Layton
  0 siblings, 2 replies; 47+ messages in thread
From: Steve French @ 2021-02-16  0:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: David Howells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, CIFS, ceph-devel, Matthew Wilcox,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML

Jeff,
What are the performance differences you are seeing (positive or
negative) with ceph and netfs, especially with simple examples like
file copy or grep of large files?

It could be good if netfs simplifies the problem experienced by
network filesystems on Linux with readahead on large sequential reads
- where we don't get as much parallelism due to only having one
readahead request at a time (thus in many cases there is 'dead time'
on either the network or the file server while waiting for the next
readpages request to be issued).   This can be a significant
performance problem for current readpages when network latency is long
(or e.g. in cases when network encryption is enabled, and hardware
offload is not available, so it is time-consuming for the server or client
to encrypt the packets).

Do you see netfs much faster than the current readpages for ceph?

Have you been able to get much benefit with ceph from the current netfs
approach to throttling/clamping readahead i/o?

On Mon, Feb 15, 2021 at 12:08 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Mon, 2021-02-15 at 15:44 +0000, David Howells wrote:
> > [full cover letter snipped; it is quoted in Jeff's reply above]
>
> Thanks David,
>
> I did an xfstests run on ceph with a kernel based on this and it seemed
> to do fine. I'll plan to pull this into the ceph-client/testing branch
> and run it through the ceph kclient test harness. There are only a few
> differences from the last run we did, so I'm not expecting big changes,
> but I'll keep you posted.
>
> --
> Jeff Layton <jlayton@redhat.com>
>


-- 
Thanks,

Steve



* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-16  0:40   ` Steve French
@ 2021-02-16  2:10     ` Matthew Wilcox
  2021-02-16  5:18       ` Steve French
                         ` (2 more replies)
  2021-02-16 11:01     ` Jeff Layton
  1 sibling, 3 replies; 47+ messages in thread
From: Matthew Wilcox @ 2021-02-16  2:10 UTC (permalink / raw)
  To: Steve French
  Cc: Jeff Layton, David Howells, Trond Myklebust, Anna Schumaker,
	Steve French, Dominique Martinet, CIFS, ceph-devel,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski

On Mon, Feb 15, 2021 at 06:40:27PM -0600, Steve French wrote:
> It could be good if netfs simplifies the problem experienced by
> network filesystems on Linux with readahead on large sequential reads
> - where we don't get as much parallelism due to only having one
> readahead request at a time (thus in many cases there is 'dead time'
> on either the network or the file server while waiting for the next
> readpages request to be issued).   This can be a significant
> performance problem for current readpages when network latency is long
> (or e.g. when network encryption is enabled and hardware offload is
> not available, so encrypting the packet is time consuming on the
> server or client).
> 
> Do you see netfs much faster than current readpages for ceph?
> 
> Have you been able to get much benefit from throttling readahead with
> ceph from the current netfs approach for clamping i/o?

The switch from readpages to readahead does help in a couple of corner
cases.  For example, if you have two processes reading the same file at
the same time, one will now block on the other (due to the page lock)
rather than submitting a mess of overlapping and partial reads.

We're not there yet on having multiple outstanding reads.  Bill and I
had a chat recently about how to make the readahead code detect that
it is in a "long fat pipe" situation (as opposed to just dealing with
a slow device), and submit extra readahead requests to make best use of
the bandwidth and minimise blocking of the application.
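
(To put rough, illustrative numbers on "long fat pipe": a 10Gbit/s path
with a 20ms RTT needs bandwidth x delay = 1.25GB/s x 0.02s = ~25MB in
flight to stay busy, so a single 256kB readahead window would leave such
a link ~99% idle.)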

That's not something for the netfs code to do though; we can get into
that situation with highly parallel SSDs.




* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-16  2:10     ` Matthew Wilcox
@ 2021-02-16  5:18       ` Steve French
  2021-02-16  5:22       ` Steve French
  2021-02-24 13:32       ` David Howells
  2 siblings, 0 replies; 47+ messages in thread
From: Steve French @ 2021-02-16  5:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, David Howells, Trond Myklebust, Anna Schumaker,
	Steve French, Dominique Martinet, CIFS, ceph-devel,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski

On Mon, Feb 15, 2021 at 8:10 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 15, 2021 at 06:40:27PM -0600, Steve French wrote:
> > It could be good if netfs simplifies the problem experienced by
> > network filesystems on Linux with readahead on large sequential reads
> > - where we don't get as much parallelism due to only having one
> > readahead request at a time (thus in many cases there is 'dead time'
> > on either the network or the file server while waiting for the next
> > readpages request to be issued).   This can be a significant
> > performance problem for current readpages when network latency is long
> > (or e.g. when network encryption is enabled and hardware offload is
> > not available, so encrypting the packet is time consuming on the
> > server or client).
> >
> > Do you see netfs much faster than current readpages for ceph?
> >
> > Have you been able to get much benefit from throttling readahead with
> > ceph from the current netfs approach for clamping i/o?
>
> The switch from readpages to readahead does help in a couple of corner
> cases.  For example, if you have two processes reading the same file at
> the same time, one will now block on the other (due to the page lock)
> rather than submitting a mess of overlapping and partial reads.
>
> We're not there yet on having multiple outstanding reads.  Bill and I
> had a chat recently about how to make the readahead code detect that
> it is in a "long fat pipe" situation (as opposed to just dealing with
> a slow device), and submit extra readahead requests to make best use of
> the bandwidth and minimise blocking of the application.
>
> That's not something for the netfs code to do though; we can get into
> that situation with highly parallel SSDs.

This (readahead behavior improvements in Linux, on single large file
sequential read workloads like cp or grep) gets particularly interesting
with SMB3 as multichannel becomes more common.  With one channel, having
one readahead request pending on the network is suboptimal - but not as
bad as when multichannel is negotiated.  Interestingly, in most cases two
network connections to the same server (different TCP sockets, but the
same mount, even in cases where there is only one network adapter) can
achieve better performance - but it still significantly lags Windows (and
probably other clients), since in Linux we don't keep multiple I/Os in
flight at one time (unless different files are being read at the same
time by different threads).  As network adapters are added and removed
from the server (other clients typically poll to detect interface
changes, and SMB3 also leverages the "witness protocol" to get
notification of adapter additions or removals), it would be helpful to
change the maximum number of readahead requests in flight.  In addition,
as the server throttles back (reducing the number of 'credits' granted to
the client), it will be important to give hints to the readahead logic
about reducing the number of readahead requests in flight.  Keeping
multiple readahead requests going is easy to imagine when multiple
processes are copying or reading files, but there are many scenarios
where we could do better by parallelizing a single process doing a copy,
ensuring that there is no 'dead time' on the network.
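
As a purely hypothetical sketch of the kind of hint I mean (every name
here is invented for illustration; nothing like this exists in cifs.ko
today):

	/* Cap the number of readahead I/Os kept in flight based on
	 * what the transport will currently bear. */
	static unsigned int smb3_max_ra_in_flight(unsigned int granted_credits,
						  unsigned int nr_channels)
	{
		/* Reserve half the credits for metadata and writes;
		 * allow at most a couple of reads per channel. */
		unsigned int from_credits = max(granted_credits / 2, 1U);
		unsigned int from_channels = nr_channels * 2;

		return min(from_credits, from_channels);
	}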


-- 
Thanks,

Steve



* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-16  2:10     ` Matthew Wilcox
  2021-02-16  5:18       ` Steve French
@ 2021-02-16  5:22       ` Steve French
  2021-02-23 20:27         ` Matthew Wilcox
  2021-02-24 13:32       ` David Howells
  2 siblings, 1 reply; 47+ messages in thread
From: Steve French @ 2021-02-16  5:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, David Howells, Trond Myklebust, Anna Schumaker,
	Steve French, Dominique Martinet, CIFS, ceph-devel,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski

On Mon, Feb 15, 2021 at 8:10 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 15, 2021 at 06:40:27PM -0600, Steve French wrote:
> > It could be good if netfs simplifies the problem experienced by
> > network filesystems on Linux with readahead on large sequential reads
> > - where we don't get as much parallelism due to only having one
> > readahead request at a time (thus in many cases there is 'dead time'
> > on either the network or the file server while waiting for the next
> > readpages request to be issued).   This can be a significant
> > performance problem for current readpages when network latency is long
> > (or e.g. when network encryption is enabled and hardware offload is
> > not available, so encrypting the packet is time consuming on the
> > server or client).
> >
> > Do you see netfs much faster than current readpages for ceph?
> >
> > Have you been able to get much benefit from throttling readahead with
> > ceph from the current netfs approach for clamping i/o?
>
> The switch from readpages to readahead does help in a couple of corner
> cases.  For example, if you have two processes reading the same file at
> the same time, one will now block on the other (due to the page lock)
> rather than submitting a mess of overlapping and partial reads.

Do you have a simple repro example of this we could try (fio, dbench, iozone
etc) to get some objective perf data?

My biggest worry is making sure that the switch to netfs doesn't degrade
performance (which might be a low bar now since current network file copy
perf seems to significantly lag at least Windows), and in some easy-to-understand
scenarios I want to make sure it actually helps perf.

-- 
Thanks,

Steve



* Re: [PATCH 34/33] netfs: Use in_interrupt() not in_softirq()
  2021-02-15 22:46 ` [PATCH 34/33] netfs: Use in_interrupt() not in_softirq() David Howells
@ 2021-02-16  8:42   ` Christoph Hellwig
  2021-02-16  9:06     ` Sebastian Andrzej Siewior
  2021-02-16  9:29   ` David Howells
  1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2021-02-16  8:42 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Marc Dionne, Anna Schumaker, Steve French,
	Dominique Martinet, linux-cifs, ceph-devel, Jeff Layton,
	Matthew Wilcox, linux-cachefs, Alexander Viro, linux-mm,
	linux-afs, v9fs-developer, Christoph Hellwig, linux-fsdevel,
	linux-nfs, Jeff Layton, David Wysochanski, linux-kernel,
	Sebastian Andrzej Siewior

On Mon, Feb 15, 2021 at 10:46:23PM +0000, David Howells wrote:
> The in_softirq() in netfs_rreq_terminated() works fine for the cache being
> on a normal disk, as the completion handlers may get called in softirq
> context, but for an NVMe drive, the completion handler may get called in
> IRQ context.
> 
> Fix to use in_interrupt() instead of in_softirq() throughout the read
> helpers, particularly when deciding whether to punt code that might sleep
> off to a worker thread.

We must not use either check, as they are all unreliable, especially
for PREEMPT-RT.



* Re: [PATCH 34/33] netfs: Use in_interrupt() not in_softirq()
  2021-02-16  8:42   ` Christoph Hellwig
@ 2021-02-16  9:06     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 47+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-02-16  9:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Trond Myklebust, Marc Dionne, Anna Schumaker,
	Steve French, Dominique Martinet, linux-cifs, ceph-devel,
	Jeff Layton, Matthew Wilcox, linux-cachefs, Alexander Viro,
	linux-mm, linux-afs, v9fs-developer, linux-fsdevel, linux-nfs,
	Jeff Layton, David Wysochanski, linux-kernel

On 2021-02-16 09:42:30 [+0100], Christoph Hellwig wrote:
> On Mon, Feb 15, 2021 at 10:46:23PM +0000, David Howells wrote:
> > The in_softirq() in netfs_rreq_terminated() works fine for the cache being
> > on a normal disk, as the completion handlers may get called in softirq
> > context, but for an NVMe drive, the completion handler may get called in
> > IRQ context.
> > 
> > Fix to use in_interrupt() instead of in_softirq() throughout the read
> > helpers, particularly when deciding whether to punt code that might sleep
> > off to a worker thread.
> 
> We must not use either check, as they are all unreliable, especially
> for PREEMPT-RT.

Yes, please. I'm trying to clean up the users one by one:
    https://lore.kernel.org/r/20200914204209.256266093@linutronix.de/
    https://lore.kernel.org/amd-gfx/20210209124439.408140-1-bigeasy@linutronix.de/

Sebastian



* Re: [PATCH 34/33] netfs: Use in_interrupt() not in_softirq()
  2021-02-15 22:46 ` [PATCH 34/33] netfs: Use in_interrupt() not in_softirq() David Howells
  2021-02-16  8:42   ` Christoph Hellwig
@ 2021-02-16  9:29   ` David Howells
  2021-02-16  9:30     ` Christoph Hellwig
  2021-02-18 14:02     ` [PATCH 34/33] netfs: Pass flag rather than use in_softirq() David Howells
  1 sibling, 2 replies; 47+ messages in thread
From: David Howells @ 2021-02-16  9:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Trond Myklebust, Marc Dionne, Anna Schumaker,
	Steve French, Dominique Martinet, linux-cifs, ceph-devel,
	Jeff Layton, Matthew Wilcox, linux-cachefs, Alexander Viro,
	linux-mm, linux-afs, v9fs-developer, linux-fsdevel, linux-nfs,
	Jeff Layton, David Wysochanski, linux-kernel,
	Sebastian Andrzej Siewior

Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Feb 15, 2021 at 10:46:23PM +0000, David Howells wrote:
> > The in_softirq() in netfs_rreq_terminated() works fine for the cache being
> > on a normal disk, as the completion handlers may get called in softirq
> > context, but for an NVMe drive, the completion handler may get called in
> > IRQ context.
> > 
> > Fix to use in_interrupt() instead of in_softirq() throughout the read
> > helpers, particularly when deciding whether to punt code that might sleep
> > off to a worker thread.
> 
> We must not use either check, as they are all unreliable, especially
> for PREEMPT-RT.

Is there a better way to do it?  The intent is to process the assessment phase
in the calling thread's context if possible rather than bumping over to a
worker thread.  For synchronous I/O, for example, that's done in the caller's
thread.  Maybe that's the answer - if it's known to be asynchronous, I have to
punt, but otherwise don't have to.

David




* Re: [PATCH 34/33] netfs: Use in_interrupt() not in_softirq()
  2021-02-16  9:29   ` David Howells
@ 2021-02-16  9:30     ` Christoph Hellwig
  2021-02-18 14:02     ` [PATCH 34/33] netfs: Pass flag rather than use in_softirq() David Howells
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-02-16  9:30 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Trond Myklebust, Marc Dionne, Anna Schumaker,
	Steve French, Dominique Martinet, linux-cifs, ceph-devel,
	Jeff Layton, Matthew Wilcox, linux-cachefs, Alexander Viro,
	linux-mm, linux-afs, v9fs-developer, linux-fsdevel, linux-nfs,
	Jeff Layton, David Wysochanski, linux-kernel,
	Sebastian Andrzej Siewior

On Tue, Feb 16, 2021 at 09:29:31AM +0000, David Howells wrote:
> Is there a better way to do it?  The intent is to process the assessment phase
> in the calling thread's context if possible rather than bumping over to a
> worker thread.  For synchronous I/O, for example, that's done in the caller's
> thread.  Maybe that's the answer - if it's known to be asynchronous, I have to
> punt, but otherwise don't have to.

Yes, I think you want an explicit flag instead.



* Re: [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles
  2021-02-15 15:45 ` [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles David Howells
@ 2021-02-16 10:26   ` Christoph Hellwig
  2021-02-16 11:55   ` David Howells
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-02-16 10:26 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Alexander Viro, Christoph Hellwig,
	Matthew Wilcox, linux-mm, linux-cachefs, linux-afs, linux-nfs,
	linux-cifs, ceph-devel, v9fs-developer, linux-fsdevel,
	Jeff Layton, David Wysochanski, linux-kernel

On Mon, Feb 15, 2021 at 03:45:01PM +0000, David Howells wrote:
> Export rw_verify_area() so that cachefiles can use it before issuing
> call_read_iter() and call_write_iter() to effect async DIO operations
> against the cache.  This is analogous to aio_read() and aio_write().

I don't think this is the right thing to do.  Instead of calling
into ->read_iter / ->write_iter directly this should be using helpers.

What prevents you from using vfs_iocb_iter_read and
vfs_iocb_iter_write which seem the right level of abstraction for this?
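
For illustration, roughly the shape of that substitution in
cachefiles_read() - a sketch only, reusing the cachefiles_kiocb fields
from this series, with error paths elided:

	/* Open-coded, which is what needs the rw_verify_area() export: */
	ret = rw_verify_area(READ, file, &ki->iocb.ki_pos, len - skipped);
	if (ret == 0)
		ret = call_read_iter(file, &ki->iocb, iter);

	/* With the helper, which does the verification itself: */
	ret = vfs_iocb_iter_read(file, &ki->iocb, iter);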



* Re: [PATCH 02/33] mm: Add an unlock function for PG_private_2/PG_fscache
  2021-02-15 15:44 ` [PATCH 02/33] mm: Add an unlock function for PG_private_2/PG_fscache David Howells
@ 2021-02-16 10:26   ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-02-16 10:26 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Linus Torvalds, Matthew Wilcox (Oracle),
	Alexander Viro, Christoph Hellwig, linux-mm, linux-cachefs,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, Jeff Layton, David Wysochanski, linux-kernel

> +extern void unlock_page_private_2(struct page *page);

No need for the extern.

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
@ 2021-02-16 10:32   ` Christoph Hellwig
  2021-02-16 13:22     ` Matthew Wilcox
  2021-02-16 11:48   ` David Howells
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2021-02-16 10:32 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Matthew Wilcox (Oracle),
	Alexander Viro, Christoph Hellwig, linux-mm, linux-cachefs,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, Jeff Layton, David Wysochanski, linux-kernel

On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> Provide a function, readahead_expand(), that expands the set of pages
> specified by a readahead_control object to encompass a revised area with a
> proposed start and length.
> 
> The proposed area must include all of the old area and may be expanded yet
> more by this function so that the edges align on (transparent huge) page
> boundaries as allocated.
> 
> The expansion will be cut short if a page already exists in either of the
> areas being expanded into.  Note that any expansion made in such a case is
> not rolled back.
> 
> This will be used by fscache so that reads can be expanded to cache granule
> boundaries, thereby allowing whole granules to be stored in the cache, but
> there are other potential users also.

So looking at linux-next this seems to have a user, but that user is
dead wood given that nothing implements ->expand_readahead.

Looking at the code structure I think netfs_readahead and
netfs_rreq_expand are a complete mess and need to be turned upside
down; that is, instead of calling back from netfs_readahead to the
calling file system, split it into a few helpers called by the
caller.

But even after this can't we just expose the cache granule boundary
to the VM so that the read-ahead request gets setup correctly from
the very beginning?



* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-16  0:40   ` Steve French
  2021-02-16  2:10     ` Matthew Wilcox
@ 2021-02-16 11:01     ` Jeff Layton
  1 sibling, 0 replies; 47+ messages in thread
From: Jeff Layton @ 2021-02-16 11:01 UTC (permalink / raw)
  To: Steve French
  Cc: David Howells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, CIFS, ceph-devel, Matthew Wilcox,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML

On Mon, 2021-02-15 at 18:40 -0600, Steve French wrote:
> Jeff,
> What are the performance differences you are seeing (positive or
> negative) with ceph and netfs, especially with simple examples like
> file copy or grep of large files?
> 
> It could be good if netfs simplifies the problem experienced by
> network filesystems on Linux with readahead on large sequential reads
> - where we don't get as much parallelism due to only having one
> readahead request at a time (thus in many cases there is 'dead time'
> on either the network or the file server while waiting for the next
> readpages request to be issued).   This can be a significant
> performance problem for current readpages when network latency is long
> (or e.g. when network encryption is enabled and hardware offload is
> not available, so encrypting the packet is time consuming on the
> server or client).
> 
> Do you see netfs much faster than current readpages for ceph?
> 
> Have you been able to get much benefit from throttling readahead with
> ceph from the current netfs approach for clamping i/o?
> 

I haven't seen big performance differences at all with this set. It's
pretty much a wash, and it doesn't seem to change how the I/Os are
ultimately driven on the wire. For instance, the clamp_length op
basically just mirrors what ceph does today -- it ensures that the
length of the I/O can't go past the end of the current object.
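
(Simplified sketch of roughly what that op does - the real version
differs in detail:)

	static bool ceph_netfs_clamp_length(struct netfs_read_subrequest *subreq)
	{
		struct inode *inode = subreq->rreq->mapping->host;
		struct ceph_inode_info *ci = ceph_inode(inode);
		u64 objno, objoff;
		u32 xlen;

		/* Truncate the subrequest at the end of the current
		 * RADOS object. */
		ceph_calc_file_object_mapping(&ci->i_layout, subreq->start,
					      subreq->len, &objno, &objoff,
					      &xlen);
		subreq->len = min_t(u64, subreq->len, xlen);
		return true;
	}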

The main benefits are that we get a large swath of readpage, readpages
and write_begin code out of ceph altogether. All of the netfses need to
gather and vet pages for I/O, etc. Most of that doesn't have anything to
do with the filesystem itself. By offloading that into the netfs lib,
most of that is taken care of for us and we don't need to bother with
doing that ourselves.

-- 
Jeff Layton <jlayton@redhat.com>




* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
  2021-02-16 10:32   ` Christoph Hellwig
@ 2021-02-16 11:48   ` David Howells
  2021-02-17 16:13   ` Matthew Wilcox
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-16 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox (Oracle)
  Cc: dhowells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Alexander Viro, linux-mm, linux-cachefs,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, Jeff Layton, David Wysochanski, linux-kernel

Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > Provide a function, readahead_expand(), that expands the set of pages
> > specified by a readahead_control object to encompass a revised area with a
> > proposed start and length.
> ...
> So looking at linux-next this seems to have a user, but that user is
> dead wood given that nothing implements ->expand_readahead.

Interesting question.  Code on my fscache-iter branch does implement this, but
I was asked to split the patchset up, so that's not in this subset.

> Looking at the code structure I think netfs_readahead and
> netfs_rreq_expand are a complete mess and need to be turned upside
> down; that is, instead of calling back from netfs_readahead to the
> calling file system, split it into a few helpers called by the
> caller.
> 
> But even after this can't we just expose the cache granule boundary
> to the VM so that the read-ahead request gets setup correctly from
> the very beginning?

You need to argue this one with Willy.  In my opinion, the VM should ask the
filesystem and the expansion be done before ->readahead() is called.  Willy
disagrees, however.

David




* Re: [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles
  2021-02-15 15:45 ` [PATCH 04/33] vfs: Export rw_verify_area() for use by cachefiles David Howells
  2021-02-16 10:26   ` Christoph Hellwig
@ 2021-02-16 11:55   ` David Howells
  1 sibling, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-16 11:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Alexander Viro, Matthew Wilcox, linux-mm,
	linux-cachefs, linux-afs, linux-nfs, linux-cifs, ceph-devel,
	v9fs-developer, linux-fsdevel, Jeff Layton, David Wysochanski,
	linux-kernel

Christoph Hellwig <hch@lst.de> wrote:

> > Export rw_verify_area() so that cachefiles can use it before issuing
> > call_read_iter() and call_write_iter() to effect async DIO operations
> > against the cache.  This is analogous to aio_read() and aio_write().
> 
> I don't think this is the right thing to do.  Instead of calling
> into ->read_iter / ->write_iter directly this should be using helpers.
> 
> What prevents you from using vfs_iocb_iter_read and
> vfs_iocb_iter_write which seem the right level of abstraction for this?

I don't think they existed when I wrote this code.  Should aio use that too,
btw?  I modelled my code on aio_read() and aio_write().

But I can certainly switch to using vfs_iocb_iter_read/write, though the
trivial checks are redundant.  I guess I'm missing the fsnotify call, though
(and is that missing in aio_read/write() also?).

David




* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-16 10:32   ` Christoph Hellwig
@ 2021-02-16 13:22     ` Matthew Wilcox
  2021-02-17 14:36       ` Mike Marshall
  2021-02-17 15:42       ` David Howells
  0 siblings, 2 replies; 47+ messages in thread
From: Matthew Wilcox @ 2021-02-16 13:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Alexander Viro, linux-mm, linux-cachefs,
	linux-afs, linux-nfs, linux-cifs, ceph-devel, v9fs-developer,
	linux-fsdevel, Jeff Layton, David Wysochanski, linux-kernel

On Tue, Feb 16, 2021 at 11:32:15AM +0100, Christoph Hellwig wrote:
> On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > Provide a function, readahead_expand(), that expands the set of pages
> > specified by a readahead_control object to encompass a revised area with a
> > proposed start and length.
> > 
> > The proposed area must include all of the old area and may be expanded yet
> > more by this function so that the edges align on (transparent huge) page
> > boundaries as allocated.
> > 
> > The expansion will be cut short if a page already exists in either of the
> > areas being expanded into.  Note that any expansion made in such a case is
> > not rolled back.
> > 
> > This will be used by fscache so that reads can be expanded to cache granule
> > boundaries, thereby allowing whole granules to be stored in the cache, but
> > there are other potential users also.
> 
> So looking at linux-next this seems to have a user, but that user is
> dead wood given that nothing implements ->expand_readahead.
> 
> Looking at the code structure I think netfs_readahead and
> netfs_rreq_expand are a complete mess and need to be turned upside
> down; that is, instead of calling back from netfs_readahead to the
> calling file system, split it into a few helpers called by the
> caller.

That's funny, we modelled it after iomap.

> But even after this can't we just expose the cache granule boundary
> to the VM so that the read-ahead request gets setup correctly from
> the very beginning?

The intent is that this be usable by filesystems which want to (for
example) compress variable sized blocks.  So they won't know which pages
they want to readahead until they're in their iomap actor routine,
see that the extent they're in is compressed, and find out how large
the extent is.
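
A hypothetical sketch (all myfs_* names invented) of how that looks from
the filesystem's side:

	static void myfs_readahead(struct readahead_control *ractl)
	{
		struct myfs_extent ex;

		/* Only here, inside the filesystem, do we learn the
		 * extent geometry. */
		myfs_map_extent(ractl->mapping->host, readahead_pos(ractl), &ex);

		/* A compressed extent has to be read and decompressed as
		 * a whole, so widen the window to cover it; the expansion
		 * stops early at any page already in the cache. */
		if (ex.compressed)
			readahead_expand(ractl, ex.start, ex.len);

		myfs_submit_read(ractl);
	}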



* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-16 13:22     ` Matthew Wilcox
@ 2021-02-17 14:36       ` Mike Marshall
  2021-02-17 15:42       ` David Howells
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Marshall @ 2021-02-17 14:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, David Howells, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, Alexander Viro,
	linux-mm, linux-cachefs, linux-afs, Linux NFS Mailing List,
	linux-cifs, ceph-devel, V9FS Developers, linux-fsdevel,
	Jeff Layton, David Wysochanski, LKML

I plan to try and use readahead_expand in Orangefs...

-Mike

On Tue, Feb 16, 2021 at 8:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 16, 2021 at 11:32:15AM +0100, Christoph Hellwig wrote:
> > On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > > Provide a function, readahead_expand(), that expands the set of pages
> > > specified by a readahead_control object to encompass a revised area with a
> > > proposed start and length.
> > >
> > > The proposed area must include all of the old area and may be expanded yet
> > > more by this function so that the edges align on (transparent huge) page
> > > boundaries as allocated.
> > >
> > > The expansion will be cut short if a page already exists in either of the
> > > areas being expanded into.  Note that any expansion made in such a case is
> > > not rolled back.
> > >
> > > This will be used by fscache so that reads can be expanded to cache granule
> > > boundaries, thereby allowing whole granules to be stored in the cache, but
> > > there are other potential users also.
> >
> > So looking at linux-next this seems to have a user, but that user is
> > dead wood given that nothing implements ->expand_readahead.
> >
> > Looking at the code structure I think netfs_readahead and
> > netfs_rreq_expand are a complete mess and need to be turned upside
> > down; that is, instead of calling back from netfs_readahead to the
> > calling file system, split it into a few helpers called by the
> > caller.
>
> That's funny, we modelled it after iomap.
>
> > But even after this can't we just expose the cache granule boundary
> > to the VM so that the read-ahead request gets setup correctly from
> > the very beginning?
>
> The intent is that this be usable by filesystems which want to (for
> example) compress variable sized blocks.  So they won't know which pages
> they want to readahead until they're in their iomap actor routine,
> see that the extent they're in is compressed, and find out how large
> the extent is.



* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-16 13:22     ` Matthew Wilcox
  2021-02-17 14:36       ` Mike Marshall
@ 2021-02-17 15:42       ` David Howells
  2021-02-17 16:59         ` Mike Marshall
  2021-02-17 22:20         ` David Howells
  1 sibling, 2 replies; 47+ messages in thread
From: David Howells @ 2021-02-17 15:42 UTC (permalink / raw)
  To: Mike Marshall
  Cc: dhowells, Matthew Wilcox, Christoph Hellwig, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, Alexander Viro,
	linux-mm, linux-cachefs, linux-afs, Linux NFS Mailing List,
	linux-cifs, ceph-devel, V9FS Developers, linux-fsdevel,
	Jeff Layton, David Wysochanski, LKML

Mike Marshall <hubcap@omnibond.com> wrote:

> I plan to try and use readahead_expand in Orangefs...

Would it help if I shuffled the readahead_expand patch to the bottom of the
pack?

David




* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
  2021-02-16 10:32   ` Christoph Hellwig
  2021-02-16 11:48   ` David Howells
@ 2021-02-17 16:13   ` Matthew Wilcox
  2021-02-17 22:34   ` David Howells
  2021-02-18 17:47   ` David Howells
  4 siblings, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2021-02-17 16:13 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Alexander Viro, Christoph Hellwig, linux-mm,
	linux-cachefs, linux-afs, linux-nfs, linux-cifs, ceph-devel,
	v9fs-developer, linux-fsdevel, Jeff Layton, David Wysochanski,
	linux-kernel

On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> +++ b/include/linux/pagemap.h
> @@ -761,6 +761,8 @@ extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
>  				  struct pagevec *pvec);
> +void readahead_expand(struct readahead_control *ractl,
> +		      loff_t new_start, size_t new_len);

If we're revising this patchset, I'd rather this lived with the other
readahead declarations, ie after the definition of readahead_control.

> +	/* Expand the trailing edge upwards */
> +	while (ractl->_nr_pages < new_nr_pages) {
> +		unsigned long index = ractl->_index + ractl->_nr_pages;
> +		struct page *page = xa_load(&mapping->i_pages, index);
> +
> +		if (page && !xa_is_value(page))
> +			return; /* Page apparently present */
> +
> +		page = __page_cache_alloc(gfp_mask);
> +		if (!page)
> +			return;
> +		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
> +			put_page(page);
> +			return;
> +		}
> +		ractl->_nr_pages++;
> +	}

We're defeating the ondemand_readahead() algorithm here.  Let's suppose
userspace is doing 64kB reads, the filesystem is OrangeFS which only
wants to do 4MB reads, the page cache is initially empty and there's
only one thread doing a sequential read.  ondemand_readahead() calls
get_init_ra_size() which tells it to allocate 128kB and set the async
marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
remainder of the 4MB.  After the app has read the first 64kB, it comes
back to read the next 64kB, sees the readahead marker and tries to trigger
the next batch of readahead, but it's already present, so it does nothing
(see page_cache_ra_unbounded() for what happens with pages present).

Then it keeps going through the 4MB that's been read, not seeing any more
readahead markers, gets to 4MB and asks for ... 256kB?  Not quite sure.
Anyway, it then has to wait for the next 4MB because the readahead didn't
overlap with the application processing.

So readahead_expand() needs to adjust the file's f_ra so that when the
application gets to 64kB, it kicks off the readahead of 4MB-8MB chunk (and
then when we get to 4MB+256kB, it kicks off the readahead of 8MB-12MB,
and so on).

Unless someone sees a better way to do this?  I don't
want to inadvertently break POSIX_FADV_WILLNEED which calls
force_page_cache_readahead() and should not perturb the kernel's
ondemand algorithm.  Perhaps we need to add an 'ra' pointer to the
ractl to indicate whether the file_ra_state should be updated by
readahead_expand()?
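
Concretely, something like this in readahead_expand()'s trailing-edge
loop (reusing the loop's locals from the patch above; the ractl->ra
pointer is the hypothetical addition):

	struct file_ra_state *ra = ractl->ra;	/* hypothetical member */

	/* Expand the trailing edge upwards */
	while (ractl->_nr_pages < new_nr_pages) {
		unsigned long index = ractl->_index + ractl->_nr_pages;
		struct page *page = xa_load(&mapping->i_pages, index);

		if (page && !xa_is_value(page))
			return; /* Page apparently present */

		page = __page_cache_alloc(gfp_mask);
		if (!page)
			return;
		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
			put_page(page);
			return;
		}
		ractl->_nr_pages++;
		/* Keep ondemand_readahead()'s bookkeeping in step so the
		 * async marker keeps firing as the window grows. */
		if (ra) {
			ra->size++;
			ra->async_size++;
		}
	}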



* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-17 15:42       ` David Howells
@ 2021-02-17 16:59         ` Mike Marshall
  2021-02-17 22:20         ` David Howells
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Marshall @ 2021-02-17 16:59 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Christoph Hellwig, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, Alexander Viro,
	linux-mm, linux-cachefs, linux-afs, Linux NFS Mailing List,
	linux-cifs, ceph-devel, V9FS Developers, linux-fsdevel,
	Jeff Layton, David Wysochanski, LKML

Matthew has looked at how I'm fumbling about trying to deal with
Orangefs's need for much larger than page-sized IO...

I think I need to implement orangefs_readahead and, from there, fire off
an asynchronous read.  While that's in flight I'll call readahead_page
with a rac that I've cranked up with readahead_expand, and when the read
gets done I'll have plenty of pages for the large IO I did.

Even if what I think I need to do is somewhere near right, the async
code in the Orangefs kernel module didn't make it into the upstream
version, so I have to refurbish that.  All that to say: I don't need
readahead_expand "tomorrow", but it fits into my plan to get Orangefs
the extra pages it needs without me having open-coded page cache code
in orangefs_readpage.
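
Roughly the shape I have in mind - untested, and orangefs_async_read is
a placeholder for the refurbished async path:

	static void orangefs_readahead(struct readahead_control *rac)
	{
		struct inode *inode = rac->mapping->host;
		loff_t start = readahead_pos(rac);
		struct page *page;
		int err;

		/* Blow the window out to something Orangefs-sized;
		 * readahead_expand() may trim the edges. */
		readahead_expand(rac, start, 4 * 1024 * 1024);

		/* One big IO for the whole (possibly trimmed) window. */
		err = orangefs_async_read(inode, start,
					  readahead_length(rac), rac);

		/* Hand the pages back once the data has landed. */
		while ((page = readahead_page(rac))) {
			page_endio(page, false, err);
			put_page(page);
		}
	}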

-Mike

On Wed, Feb 17, 2021 at 10:42 AM David Howells <dhowells@redhat.com> wrote:
>
> Mike Marshall <hubcap@omnibond.com> wrote:
>
> > I plan to try and use readahead_expand in Orangefs...
>
> Would it help if I shuffled the readahead_expand patch to the bottom of the
> pack?
>
> David
>



* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-17 15:42       ` David Howells
  2021-02-17 16:59         ` Mike Marshall
@ 2021-02-17 22:20         ` David Howells
  1 sibling, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-17 22:20 UTC (permalink / raw)
  To: Mike Marshall
  Cc: dhowells, Matthew Wilcox, Christoph Hellwig, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, Alexander Viro,
	linux-mm, linux-cachefs, linux-afs, Linux NFS Mailing List,
	linux-cifs, ceph-devel, V9FS Developers, linux-fsdevel,
	Jeff Layton, David Wysochanski, LKML

Mike Marshall <hubcap@omnibond.com> wrote:

> Matthew has looked at how I'm fumbling about
> trying to deal with Orangefs's need for much larger
> than page-sized IO...
> 
> I think I need to implement orangefs_readahead
> and from there fire off an asynchronous read
> and while that's going I'll call readahead_page
> with a rac that I've cranked up with readahead_expand
> and when the read gets done I'll have plenty of pages
> for the large IO I did.

Would the netfs helper lib in patches 5-13 here be of use to orangefs?  Most
of the information about it is on patch 8.

David




* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
                     ` (2 preceding siblings ...)
  2021-02-17 16:13   ` Matthew Wilcox
@ 2021-02-17 22:34   ` David Howells
  2021-02-17 22:49     ` Matthew Wilcox
  2021-02-18 17:47   ` David Howells
  4 siblings, 1 reply; 47+ messages in thread
From: David Howells @ 2021-02-17 22:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Christoph Hellwig, Mike Marshall, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, Alexander Viro,
	linux-mm, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, Jeff Layton,
	David Wysochanski, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> We're defeating the ondemand_readahead() algorithm here.  Let's suppose
> userspace is doing 64kB reads, the filesystem is OrangeFS which only
> wants to do 4MB reads, the page cache is initially empty and there's
> only one thread doing a sequential read.  ondemand_readahead() calls
> get_init_ra_size() which tells it to allocate 128kB and set the async
> marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
> remainder of the 4MB.  After the app has read the first 64kB, it comes
> back to read the next 64kB, sees the readahead marker and tries to trigger
> the next batch of readahead, but it's already present, so it does nothing
> (see page_cache_ra_unbounded() for what happens with pages present).

It sounds like Christoph is on the right track and the vm needs to ask
the filesystem (and by extension, the cache) before doing the allocation and
before setting the trigger flag.  Then we don't need to call back into the vm
to expand the readahead.

Also, there's Steve's request to try and keep at least two requests in flight
for CIFS/SMB at the same time to consider.

David




* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-17 22:34   ` David Howells
@ 2021-02-17 22:49     ` Matthew Wilcox
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2021-02-17 22:49 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Mike Marshall, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, Alexander Viro,
	linux-mm, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, linux-fsdevel, Jeff Layton,
	David Wysochanski, linux-kernel

On Wed, Feb 17, 2021 at 10:34:39PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > We're defeating the ondemand_readahead() algorithm here.  Let's suppose
> > userspace is doing 64kB reads, the filesystem is OrangeFS which only
> > wants to do 4MB reads, the page cache is initially empty and there's
> > only one thread doing a sequential read.  ondemand_readahead() calls
> > get_init_ra_size() which tells it to allocate 128kB and set the async
> > marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
> > remainder of the 4MB.  After the app has read the first 64kB, it comes
> > back to read the next 64kB, sees the readahead marker and tries to trigger
> > the next batch of readahead, but it's already present, so it does nothing
> > (see page_cache_ra_unbounded() for what happens with pages present).
> 
> It sounds like Christoph is on the right track and the vm needs to ask
> the filesystem (and by extension, the cache) before doing the allocation and
> before setting the trigger flag.  Then we don't need to call back into the vm
> to expand the readahead.

Doesn't work.  You could read my reply to Christoph, or try to figure out
how to get rid of
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n742
for yourself.

> Also, there's Steve's request to try and keep at least two requests in flight
> for CIFS/SMB at the same time to consider.

That's not relevant to this problem.



* [PATCH 34/33] netfs: Pass flag rather than use in_softirq()
  2021-02-16  9:29   ` David Howells
  2021-02-16  9:30     ` Christoph Hellwig
@ 2021-02-18 14:02     ` David Howells
  2021-02-18 15:06       ` Marc Dionne
                         ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: David Howells @ 2021-02-18 14:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Trond Myklebust, Marc Dionne, Anna Schumaker,
	Steve French, Dominique Martinet, linux-cifs, ceph-devel,
	Jeff Layton, Matthew Wilcox, linux-cachefs, Alexander Viro,
	linux-mm, linux-afs, v9fs-developer, linux-fsdevel, linux-nfs,
	Jeff Layton, David Wysochanski, linux-kernel,
	Sebastian Andrzej Siewior

Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Feb 16, 2021 at 09:29:31AM +0000, David Howells wrote:
> > Is there a better way to do it?  The intent is to process the assessment
> > phase in the calling thread's context if possible rather than bumping over
> > to a worker thread.  For synchronous I/O, for example, that's done in the
> > caller's thread.  Maybe that's the answer - if it's known to be
> > asynchronous, I have to punt, but otherwise don't have to.
> 
> Yes, I think you want an explicit flag instead.

How about the attached instead?

David
---
commit 29b3e9eed616db01f15c7998c062b4e501ea6582
Author: David Howells <dhowells@redhat.com>
Date:   Mon Feb 15 21:56:43 2021 +0000

    netfs: Pass flag rather than use in_softirq()
    
    The in_softirq() in netfs_rreq_terminated() works fine for the cache being
    on a normal disk, as the completion handlers may get called in softirq
    context, but for an NVMe drive, the completion handler may get called in
    IRQ context.
    
    Fix to pass a flag to netfs_subreq_terminated() to indicate whether the
    function is being called from a context in which we cannot do allocations,
    waits and I/O submissions (such as softirq or interrupt context).  If this
    flag is set, the netfs lib has to punt to a worker thread to handle
    anything like that.
    
    The symptom involves warnings like the following appearing and the kernel
    hanging:
    
     WARNING: CPU: 0 PID: 0 at kernel/softirq.c:175 __local_bh_enable_ip+0x35/0x50
     ...
     RIP: 0010:__local_bh_enable_ip+0x35/0x50
     ...
     Call Trace:
      <IRQ>
      rxrpc_kernel_begin_call+0x7d/0x1b0 [rxrpc]
      ? afs_rx_new_call+0x40/0x40 [kafs]
      ? afs_alloc_call+0x28/0x120 [kafs]
      afs_make_call+0x120/0x510 [kafs]
      ? afs_rx_new_call+0x40/0x40 [kafs]
      ? afs_alloc_flat_call+0xba/0x100 [kafs]
      ? __kmalloc+0x167/0x2f0
      ? afs_alloc_flat_call+0x9b/0x100 [kafs]
      afs_wait_for_operation+0x2d/0x200 [kafs]
      afs_do_sync_operation+0x16/0x20 [kafs]
      afs_req_issue_op+0x8c/0xb0 [kafs]
      netfs_rreq_assess+0x125/0x7d0 [netfs]
      ? cachefiles_end_operation+0x40/0x40 [cachefiles]
      netfs_subreq_terminated+0x117/0x220 [netfs]
      cachefiles_read_complete+0x21/0x60 [cachefiles]
      iomap_dio_bio_end_io+0xdd/0x110
      blk_update_request+0x20a/0x380
      blk_mq_end_request+0x1c/0x120
      nvme_process_cq+0x159/0x1f0 [nvme]
      nvme_irq+0x10/0x20 [nvme]
      __handle_irq_event_percpu+0x37/0x150
      handle_irq_event+0x49/0xb0
      handle_edge_irq+0x7c/0x200
      asm_call_irq_on_stack+0xf/0x20
      </IRQ>
      common_interrupt+0xad/0x120
      asm_common_interrupt+0x1e/0x40
     ...
    
    Reported-by: Marc Dionne <marc.dionne@auristor.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Matthew Wilcox <willy@infradead.org>
    cc: linux-mm@kvack.org
    cc: linux-cachefs@redhat.com
    cc: linux-afs@lists.infradead.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: ceph-devel@vger.kernel.org
    cc: v9fs-developer@lists.sourceforge.net
    cc: linux-fsdevel@vger.kernel.org

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 8f28d4f4cfd7..6dcdbbfb48e2 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -223,7 +223,7 @@ static void afs_fetch_data_notify(struct afs_operation *op)
 
 	if (subreq) {
 		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
-		netfs_subreq_terminated(subreq, error ?: req->actual_len);
+		netfs_subreq_terminated(subreq, error ?: req->actual_len, false);
 		req->subreq = NULL;
 	} else if (req->done) {
 		req->done(req);
@@ -289,7 +289,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
 
 	fsreq = afs_alloc_read(GFP_NOFS);
 	if (!fsreq)
-		return netfs_subreq_terminated(subreq, -ENOMEM);
+		return netfs_subreq_terminated(subreq, -ENOMEM, false);
 
 	fsreq->subreq	= subreq;
 	fsreq->pos	= subreq->start + subreq->transferred;
@@ -304,7 +304,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
 
 	ret = afs_fetch_data(fsreq->vnode, fsreq);
 	if (ret < 0)
-		return netfs_subreq_terminated(subreq, ret);
+		return netfs_subreq_terminated(subreq, ret, false);
 }
 
 static int afs_symlink_readpage(struct page *page)
diff --git a/fs/cachefiles/rdwr2.c b/fs/cachefiles/rdwr2.c
index 4cea5a2a2d6e..40668bfe6688 100644
--- a/fs/cachefiles/rdwr2.c
+++ b/fs/cachefiles/rdwr2.c
@@ -23,6 +23,7 @@ struct cachefiles_kiocb {
 	};
 	netfs_io_terminated_t	term_func;
 	void			*term_func_priv;
+	bool			was_async;
 };
 
 static inline void cachefiles_put_kiocb(struct cachefiles_kiocb *ki)
@@ -43,10 +44,9 @@ static void cachefiles_read_complete(struct kiocb *iocb, long ret, long ret2)
 	_enter("%ld,%ld", ret, ret2);
 
 	if (ki->term_func) {
-		if (ret < 0)
-			ki->term_func(ki->term_func_priv, ret);
-		else
-			ki->term_func(ki->term_func_priv, ki->skipped + ret);
+		if (ret >= 0)
+			ret += ki->skipped;
+		ki->term_func(ki->term_func_priv, ret, ki->was_async);
 	}
 
 	cachefiles_put_kiocb(ki);
@@ -114,6 +114,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
 	ki->skipped		= skipped;
 	ki->term_func		= term_func;
 	ki->term_func_priv	= term_func_priv;
+	ki->was_async		= true;
 
 	if (ki->term_func)
 		ki->iocb.ki_complete = cachefiles_read_complete;
@@ -141,6 +142,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
 		ret = -EINTR;
 		fallthrough;
 	default:
+		ki->was_async = false;
 		cachefiles_read_complete(&ki->iocb, ret, 0);
 		if (ret > 0)
 			ret = 0;
@@ -156,7 +158,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
 	kfree(ki);
 presubmission_error:
 	if (term_func)
-		term_func(term_func_priv, ret < 0 ? ret : skipped);
+		term_func(term_func_priv, ret < 0 ? ret : skipped, false);
 	return ret;
 }
 
@@ -175,7 +177,7 @@ static void cachefiles_write_complete(struct kiocb *iocb, long ret, long ret2)
 	__sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
 
 	if (ki->term_func)
-		ki->term_func(ki->term_func_priv, ret);
+		ki->term_func(ki->term_func_priv, ret, ki->was_async);
 
 	cachefiles_put_kiocb(ki);
 }
@@ -214,6 +216,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
 	ki->len			= len;
 	ki->term_func		= term_func;
 	ki->term_func_priv	= term_func_priv;
+	ki->was_async		= true;
 
 	if (ki->term_func)
 		ki->iocb.ki_complete = cachefiles_write_complete;
@@ -250,6 +253,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
 		ret = -EINTR;
 		/* Fall through */
 	default:
+		ki->was_async = false;
 		cachefiles_write_complete(&ki->iocb, ret, 0);
 		if (ret > 0)
 			ret = 0;
@@ -265,7 +269,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
 	kfree(ki);
 presubmission_error:
 	if (term_func)
-		term_func(term_func_priv, -ENOMEM);
+		term_func(term_func_priv, -ENOMEM, false);
 	return -ENOMEM;
 }
 
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 0dd64d31eff6..dcfd805d168e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -221,7 +221,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
 	if (err >= 0 && err < subreq->len)
 		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
 
-	netfs_subreq_terminated(subreq, err);
+	netfs_subreq_terminated(subreq, err, true);
 
 	num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
 	ceph_put_page_vector(osd_data->pages, num_pages, false);
@@ -276,7 +276,7 @@ static void ceph_netfs_issue_op(struct netfs_read_subrequest *subreq)
 out:
 	ceph_osdc_put_request(req);
 	if (err)
-		netfs_subreq_terminated(subreq, err);
+		netfs_subreq_terminated(subreq, err, false);
 	dout("%s: result %d\n", __func__, err);
 }
 
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 9191a3617d91..5f5de8278499 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -29,12 +29,13 @@ module_param_named(debug, netfs_debug, uint, S_IWUSR | S_IRUGO);
 MODULE_PARM_DESC(netfs_debug, "Netfs support debugging mask");
 
 static void netfs_rreq_work(struct work_struct *);
-static void __netfs_put_subrequest(struct netfs_read_subrequest *);
+static void __netfs_put_subrequest(struct netfs_read_subrequest *, bool);
 
-static void netfs_put_subrequest(struct netfs_read_subrequest *subreq)
+static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
+				 bool was_async)
 {
 	if (refcount_dec_and_test(&subreq->usage))
-		__netfs_put_subrequest(subreq);
+		__netfs_put_subrequest(subreq, was_async);
 }
 
 static struct netfs_read_request *netfs_alloc_read_request(
@@ -67,7 +68,8 @@ static void netfs_get_read_request(struct netfs_read_request *rreq)
 	refcount_inc(&rreq->usage);
 }
 
-static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
+static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq,
+				     bool was_async)
 {
 	struct netfs_read_subrequest *subreq;
 
@@ -75,7 +77,7 @@ static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
 		subreq = list_first_entry(&rreq->subrequests,
 					  struct netfs_read_subrequest, rreq_link);
 		list_del(&subreq->rreq_link);
-		netfs_put_subrequest(subreq);
+		netfs_put_subrequest(subreq, was_async);
 	}
 }
 
@@ -83,7 +85,7 @@ static void netfs_free_read_request(struct work_struct *work)
 {
 	struct netfs_read_request *rreq =
 		container_of(work, struct netfs_read_request, work);
-	netfs_rreq_clear_subreqs(rreq);
+	netfs_rreq_clear_subreqs(rreq, false);
 	if (rreq->netfs_priv)
 		rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
 	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
@@ -93,10 +95,10 @@ static void netfs_free_read_request(struct work_struct *work)
 	netfs_stat_d(&netfs_n_rh_rreq);
 }
 
-static void netfs_put_read_request(struct netfs_read_request *rreq)
+static void netfs_put_read_request(struct netfs_read_request *rreq, bool was_async)
 {
 	if (refcount_dec_and_test(&rreq->usage)) {
-		if (in_softirq()) {
+		if (was_async) {
 			rreq->work.func = netfs_free_read_request;
 			if (!queue_work(system_unbound_wq, &rreq->work))
 				BUG();
@@ -131,12 +133,15 @@ static void netfs_get_read_subrequest(struct netfs_read_subrequest *subreq)
 	refcount_inc(&subreq->usage);
 }
 
-static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq)
+static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq,
+				   bool was_async)
 {
+	struct netfs_read_request *rreq = subreq->rreq;
+
 	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
-	netfs_put_read_request(subreq->rreq);
 	kfree(subreq);
 	netfs_stat_d(&netfs_n_rh_sreq);
+	netfs_put_read_request(rreq, was_async);
 }
 
 /*
@@ -152,11 +157,12 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
 	iov_iter_zero(iov_iter_count(&iter), &iter);
 }
 
-static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error)
+static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error,
+					bool was_async)
 {
 	struct netfs_read_subrequest *subreq = priv;
 
-	netfs_subreq_terminated(subreq, transferred_or_error);
+	netfs_subreq_terminated(subreq, transferred_or_error, was_async);
 }
 
 /*
@@ -186,7 +192,7 @@ static void netfs_fill_with_zeroes(struct netfs_read_request *rreq,
 {
 	netfs_stat(&netfs_n_rh_zero);
 	__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
-	netfs_subreq_terminated(subreq, 0);
+	netfs_subreq_terminated(subreq, 0, false);
 }
 
 /*
@@ -215,11 +221,11 @@ static void netfs_read_from_server(struct netfs_read_request *rreq,
 /*
  * Release those waiting.
  */
-static void netfs_rreq_completed(struct netfs_read_request *rreq)
+static void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async)
 {
 	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
-	netfs_rreq_clear_subreqs(rreq);
-	netfs_put_read_request(rreq);
+	netfs_rreq_clear_subreqs(rreq, was_async);
+	netfs_put_read_request(rreq, was_async);
 }
 
 /*
@@ -228,7 +234,8 @@ static void netfs_rreq_completed(struct netfs_read_request *rreq)
  *
  * May be called in softirq mode and we inherit a ref from the caller.
  */
-static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
+static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq,
+					  bool was_async)
 {
 	struct netfs_read_subrequest *subreq;
 	struct pagevec pvec;
@@ -258,10 +265,11 @@ static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
 	}
 
 	rcu_read_unlock();
-	netfs_rreq_completed(rreq);
+	netfs_rreq_completed(rreq, was_async);
 }
 
-static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
+static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error,
+				       bool was_async)
 {
 	struct netfs_read_subrequest *subreq = priv;
 	struct netfs_read_request *rreq = subreq->rreq;
@@ -278,9 +286,9 @@ static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
 
 	/* If we decrement nr_wr_ops to 0, the ref belongs to us. */
 	if (atomic_dec_and_test(&rreq->nr_wr_ops))
-		netfs_rreq_unmark_after_write(rreq);
+		netfs_rreq_unmark_after_write(rreq, was_async);
 
-	netfs_put_subrequest(subreq);
+	netfs_put_subrequest(subreq, was_async);
 }
 
 /*
@@ -304,7 +312,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
 	list_for_each_entry_safe(subreq, p, &rreq->subrequests, rreq_link) {
 		if (!test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
 			list_del_init(&subreq->rreq_link);
-			netfs_put_subrequest(subreq);
+			netfs_put_subrequest(subreq, false);
 		}
 	}
 
@@ -324,7 +332,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
 			subreq->len += next->len;
 			subreq->len = round_up(subreq->len, PAGE_SIZE);
 			list_del_init(&next->rreq_link);
-			netfs_put_subrequest(next);
+			netfs_put_subrequest(next, false);
 		}
 
 		iov_iter_xarray(&iter, WRITE, &rreq->mapping->i_pages,
@@ -340,7 +348,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
 
 	/* If we decrement nr_wr_ops to 0, the usage ref belongs to us. */
 	if (atomic_dec_and_test(&rreq->nr_wr_ops))
-		netfs_rreq_unmark_after_write(rreq);
+		netfs_rreq_unmark_after_write(rreq, false);
 }
 
 static void netfs_rreq_write_to_cache_work(struct work_struct *work)
@@ -351,9 +359,10 @@ static void netfs_rreq_write_to_cache_work(struct work_struct *work)
 	netfs_rreq_do_write_to_cache(rreq);
 }
 
-static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq)
+static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq,
+				      bool was_async)
 {
-	if (in_softirq()) {
+	if (was_async) {
 		rreq->work.func = netfs_rreq_write_to_cache_work;
 		if (!queue_work(system_unbound_wq, &rreq->work))
 			BUG();
@@ -479,7 +488,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
 {
 	struct netfs_read_subrequest *subreq;
 
-	WARN_ON(in_softirq());
+	WARN_ON(in_interrupt());
 
 	trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
 
@@ -538,7 +547,7 @@ static void netfs_rreq_is_still_valid(struct netfs_read_request *rreq)
  * Note that we could be in an ordinary kernel thread, on a workqueue or in
  * softirq context at this point.  We inherit a ref from the caller.
  */
-static void netfs_rreq_assess(struct netfs_read_request *rreq)
+static void netfs_rreq_assess(struct netfs_read_request *rreq, bool was_async)
 {
 	trace_netfs_rreq(rreq, netfs_rreq_trace_assess);
 
@@ -558,30 +567,31 @@ static void netfs_rreq_assess(struct netfs_read_request *rreq)
 	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
 
 	if (test_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags))
-		return netfs_rreq_write_to_cache(rreq);
+		return netfs_rreq_write_to_cache(rreq, was_async);
 
-	netfs_rreq_completed(rreq);
+	netfs_rreq_completed(rreq, was_async);
 }
 
 static void netfs_rreq_work(struct work_struct *work)
 {
 	struct netfs_read_request *rreq =
 		container_of(work, struct netfs_read_request, work);
-	netfs_rreq_assess(rreq);
+	netfs_rreq_assess(rreq, false);
 }
 
 /*
  * Handle the completion of all outstanding I/O operations on a read request.
  * We inherit a ref from the caller.
  */
-static void netfs_rreq_terminated(struct netfs_read_request *rreq)
+static void netfs_rreq_terminated(struct netfs_read_request *rreq,
+				  bool was_async)
 {
 	if (test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags) &&
-	    in_softirq()) {
+	    was_async) {
 		if (!queue_work(system_unbound_wq, &rreq->work))
 			BUG();
 	} else {
-		netfs_rreq_assess(rreq);
+		netfs_rreq_assess(rreq, was_async);
 	}
 }
 
@@ -589,6 +599,7 @@ static void netfs_rreq_terminated(struct netfs_read_request *rreq)
  * netfs_subreq_terminated - Note the termination of an I/O operation.
  * @subreq: The I/O request that has terminated.
  * @transferred_or_error: The amount of data transferred or an error code.
+ * @was_async: The termination was asynchronous.
  *
  * This tells the read helper that a contributory I/O operation has terminated,
  * one way or another, and that it should integrate the results.
@@ -599,11 +610,12 @@ static void netfs_rreq_terminated(struct netfs_read_request *rreq)
  * error code.  The helper will look after reissuing I/O operations as
  * appropriate and writing downloaded data to the cache.
  *
- * This may be called from a softirq handler, so we want to avoid taking the
- * spinlock if we can.
+ * If @was_async is true, the caller might be running in softirq or interrupt
+ * context and we can't sleep.
  */
 void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
-			     ssize_t transferred_or_error)
+			     ssize_t transferred_or_error,
+			     bool was_async)
 {
 	struct netfs_read_request *rreq = subreq->rreq;
 	int u;
@@ -647,11 +659,11 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
 	/* If we decrement nr_rd_ops to 0, the ref belongs to us. */
 	u = atomic_dec_return(&rreq->nr_rd_ops);
 	if (u == 0)
-		netfs_rreq_terminated(rreq);
+		netfs_rreq_terminated(rreq, was_async);
 	else if (u == 1)
 		wake_up_var(&rreq->nr_rd_ops);
 
-	netfs_put_subrequest(subreq);
+	netfs_put_subrequest(subreq, was_async);
 	return;
 
 incomplete:
@@ -796,7 +808,7 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
 
 subreq_failed:
 	rreq->error = subreq->error;
-	netfs_put_subrequest(subreq);
+	netfs_put_subrequest(subreq, false);
 	return false;
 }
 
@@ -901,7 +913,7 @@ void netfs_readahead(struct readahead_control *ractl,
 	} while (rreq->submitted < rreq->len);
 
 	if (rreq->submitted == 0) {
-		netfs_put_read_request(rreq);
+		netfs_put_read_request(rreq, false);
 		return;
 	}
 
@@ -913,11 +925,11 @@ void netfs_readahead(struct readahead_control *ractl,
 
 	/* If we decrement nr_rd_ops to 0, the ref belongs to us. */
 	if (atomic_dec_and_test(&rreq->nr_rd_ops))
-		netfs_rreq_assess(rreq);
+		netfs_rreq_assess(rreq, false);
 	return;
 
 cleanup_free:
-	netfs_put_read_request(rreq);
+	netfs_put_read_request(rreq, false);
 	return;
 cleanup:
 	if (netfs_priv)
@@ -991,14 +1003,14 @@ int netfs_readpage(struct file *file,
 	 */
 	do {
 		wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
-		netfs_rreq_assess(rreq);
+		netfs_rreq_assess(rreq, false);
 	} while (test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags));
 
 	ret = rreq->error;
 	if (ret == 0 && rreq->submitted < rreq->len)
 		ret = -EIO;
 out:
-	netfs_put_read_request(rreq);
+	netfs_put_read_request(rreq, false);
 	return ret;
 }
 EXPORT_SYMBOL(netfs_readpage);
@@ -1136,7 +1148,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	 */
 	for (;;) {
 		wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
-		netfs_rreq_assess(rreq);
+		netfs_rreq_assess(rreq, false);
 		if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
 			break;
 		cond_resched();
@@ -1145,7 +1157,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	ret = rreq->error;
 	if (ret == 0 && rreq->submitted < rreq->len)
 		ret = -EIO;
-	netfs_put_read_request(rreq);
+	netfs_put_read_request(rreq, false);
 	if (ret < 0)
 		goto error;
 
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index b2589b39feb8..c22b64db237d 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -60,7 +60,8 @@ enum netfs_read_source {
 	NETFS_INVALID_READ,
 } __mode(byte);
 
-typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error);
+typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error,
+				      bool was_async);
 
 /*
  * Resources required to do operations on a cache.
@@ -189,7 +190,7 @@ extern int netfs_write_begin(struct file *, struct address_space *,
 			     const struct netfs_read_request_ops *,
 			     void *);
 
-extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t);
+extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
 extern void netfs_stats_show(struct seq_file *);
 
 #endif /* _LINUX_NETFS_H */
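
As a rough illustration of the contract the new netfs_io_terminated_t
signature creates, here is a minimal sketch of a cache backend's
completion path, modelled on the cachefiles changes in this patch; the
names my_kiocb, my_read_complete and my_submit_read are hypothetical,
not part of the patch:

	struct my_kiocb {
		struct kiocb		iocb;
		netfs_io_terminated_t	term_func;
		void			*term_func_priv;
		bool			was_async;
	};

	/* ->ki_complete() may run in softirq or hard IRQ context, so
	 * report whatever asynchronicity was recorded at submission time.
	 */
	static void my_read_complete(struct kiocb *iocb, long ret, long ret2)
	{
		struct my_kiocb *ki = container_of(iocb, struct my_kiocb, iocb);

		if (ki->term_func)
			ki->term_func(ki->term_func_priv, ret, ki->was_async);
	}

	static void my_submit_read(struct my_kiocb *ki, struct file *file,
				   struct iov_iter *iter)
	{
		ssize_t ret;

		/* Assume asynchronous completion until proven otherwise. */
		ki->was_async = true;
		ki->iocb.ki_complete = my_read_complete;

		ret = call_read_iter(file, &ki->iocb, iter);
		if (ret != -EIOCBQUEUED) {
			/* The I/O completed (or failed) synchronously in
			 * process context, so the terminal function is
			 * allowed to sleep.
			 */
			ki->was_async = false;
			my_read_complete(&ki->iocb, ret, 0);
		}
	}

The point is that was_async describes the context in which the terminal
function runs, not how the I/O was issued: the synchronous fallback
clears it so that the netfs helpers can do their assessment work inline
rather than punting to a workqueue.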



^ permalink raw reply related	[flat|nested] 47+ messages in thread
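
On the consuming side, netfs_rreq_write_to_cache() and
netfs_rreq_terminated() above both reduce to the same punt-or-run-inline
decision.  A condensed sketch of that pattern follows (my_dispatch is a
hypothetical helper, not something the patch adds):

	/* Run the work item inline if the caller may sleep, otherwise
	 * defer it to a worker thread.  rreq->work is reused, so it must
	 * not already be queued - hence the BUG() on failure.
	 */
	static void my_dispatch(struct netfs_read_request *rreq,
				bool was_async, work_func_t worker)
	{
		if (was_async) {
			rreq->work.func = worker;
			if (!queue_work(system_unbound_wq, &rreq->work))
				BUG();
		} else {
			worker(&rreq->work);
		}
	}

Note that in_softirq() could not have expressed this reliably: the NVMe
completion in the trace below arrives in hard IRQ context, which is
equally unable to sleep, which is also why the sanity check in
netfs_rreq_perform_resubmissions() is widened to WARN_ON(in_interrupt()).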

* Re: [PATCH 34/33] netfs: Pass flag rather than use in_softirq()
  2021-02-18 14:02     ` [PATCH 34/33] netfs: Pass flag rather than use in_softirq() David Howells
@ 2021-02-18 15:06       ` Marc Dionne
  2021-02-18 15:16       ` Marc Dionne
  2021-02-19  9:01       ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 47+ messages in thread
From: Marc Dionne @ 2021-02-18 15:06 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, linux-cifs, ceph-devel, Jeff Layton,
	Matthew Wilcox, linux-cachefs, Alexander Viro, linux-mm,
	linux-afs, v9fs-developer, linux-fsdevel, linux-nfs, Jeff Layton,
	David Wysochanski, Linux Kernel Mailing List,
	Sebastian Andrzej Siewior

On Thu, Feb 18, 2021 at 10:03 AM David Howells <dhowells@redhat.com> wrote:
>
> Christoph Hellwig <hch@lst.de> wrote:
>
> > On Tue, Feb 16, 2021 at 09:29:31AM +0000, David Howells wrote:
> > > Is there a better way to do it?  The intent is to process the assessment
> > > phase in the calling thread's context if possible rather than bumping over
> > > to a worker thread.  For synchronous I/O, for example, that's done in the
> > > caller's thread.  Maybe that's the answer - if it's known to be
> > > asynchronous, I have to punt, but otherwise don't have to.
> >
> > Yes, I think you want an explicit flag instead.
>
> How about the attached instead?
>
> David
> ---
> commit 29b3e9eed616db01f15c7998c062b4e501ea6582
> Author: David Howells <dhowells@redhat.com>
> Date:   Mon Feb 15 21:56:43 2021 +0000
>
>     netfs: Pass flag rather than use in_softirq()
>
>     The in_softirq() check in netfs_rreq_terminated() works fine when the
>     cache is on a normal disk, as the completion handlers get called in
>     softirq context, but with an NVMe drive the completion handler may get
>     called in hard IRQ context instead.
>
>     Fix this by passing a flag to netfs_subreq_terminated() to indicate
>     whether it is being called from a context in which we cannot do
>     allocations, waits and I/O submissions (such as softirq or interrupt
>     context).  If this flag is set, the netfs library has to punt to a
>     worker thread to handle any such work.
>
>     The symptom is that warnings like the following appear and the kernel
>     hangs:
>
>      WARNING: CPU: 0 PID: 0 at kernel/softirq.c:175 __local_bh_enable_ip+0x35/0x50
>      ...
>      RIP: 0010:__local_bh_enable_ip+0x35/0x50
>      ...
>      Call Trace:
>       <IRQ>
>       rxrpc_kernel_begin_call+0x7d/0x1b0 [rxrpc]
>       ? afs_rx_new_call+0x40/0x40 [kafs]
>       ? afs_alloc_call+0x28/0x120 [kafs]
>       afs_make_call+0x120/0x510 [kafs]
>       ? afs_rx_new_call+0x40/0x40 [kafs]
>       ? afs_alloc_flat_call+0xba/0x100 [kafs]
>       ? __kmalloc+0x167/0x2f0
>       ? afs_alloc_flat_call+0x9b/0x100 [kafs]
>       afs_wait_for_operation+0x2d/0x200 [kafs]
>       afs_do_sync_operation+0x16/0x20 [kafs]
>       afs_req_issue_op+0x8c/0xb0 [kafs]
>       netfs_rreq_assess+0x125/0x7d0 [netfs]
>       ? cachefiles_end_operation+0x40/0x40 [cachefiles]
>       netfs_subreq_terminated+0x117/0x220 [netfs]
>       cachefiles_read_complete+0x21/0x60 [cachefiles]
>       iomap_dio_bio_end_io+0xdd/0x110
>       blk_update_request+0x20a/0x380
>       blk_mq_end_request+0x1c/0x120
>       nvme_process_cq+0x159/0x1f0 [nvme]
>       nvme_irq+0x10/0x20 [nvme]
>       __handle_irq_event_percpu+0x37/0x150
>       handle_irq_event+0x49/0xb0
>       handle_edge_irq+0x7c/0x200
>       asm_call_irq_on_stack+0xf/0x20
>       </IRQ>
>       common_interrupt+0xad/0x120
>       asm_common_interrupt+0x1e/0x40
>      ...
>
>     Reported-by: Marc Dionne <marc.dionne@auristor.com>
>     Signed-off-by: David Howells <dhowells@redhat.com>
>     cc: Matthew Wilcox <willy@infradead.org>
>     cc: linux-mm@kvack.org
>     cc: linux-cachefs@redhat.com
>     cc: linux-afs@lists.infradead.org
>     cc: linux-nfs@vger.kernel.org
>     cc: linux-cifs@vger.kernel.org
>     cc: ceph-devel@vger.kernel.org
>     cc: v9fs-developer@lists.sourceforge.net
>     cc: linux-fsdevel@vger.kernel.org
>
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index 8f28d4f4cfd7..6dcdbbfb48e2 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -223,7 +223,7 @@ static void afs_fetch_data_notify(struct afs_operation *op)
>
>         if (subreq) {
>                 __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
> -               netfs_subreq_terminated(subreq, error ?: req->actual_len);
> +               netfs_subreq_terminated(subreq, error ?: req->actual_len, false);
>                 req->subreq = NULL;
>         } else if (req->done) {
>                 req->done(req);
> @@ -289,7 +289,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>
>         fsreq = afs_alloc_read(GFP_NOFS);
>         if (!fsreq)
> -               return netfs_subreq_terminated(subreq, -ENOMEM);
> +               return netfs_subreq_terminated(subreq, -ENOMEM, false);
>
>         fsreq->subreq   = subreq;
>         fsreq->pos      = subreq->start + subreq->transferred;
> @@ -304,7 +304,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>
>         ret = afs_fetch_data(fsreq->vnode, fsreq);
>         if (ret < 0)
> -               return netfs_subreq_terminated(subreq, ret);
> +               return netfs_subreq_terminated(subreq, ret, false);
>  }
>
>  static int afs_symlink_readpage(struct page *page)
> diff --git a/fs/cachefiles/rdwr2.c b/fs/cachefiles/rdwr2.c
> index 4cea5a2a2d6e..40668bfe6688 100644
> --- a/fs/cachefiles/rdwr2.c
> +++ b/fs/cachefiles/rdwr2.c
> @@ -23,6 +23,7 @@ struct cachefiles_kiocb {
>         };
>         netfs_io_terminated_t   term_func;
>         void                    *term_func_priv;
> +       bool                    was_async;
>  };
>
>  static inline void cachefiles_put_kiocb(struct cachefiles_kiocb *ki)
> @@ -43,10 +44,9 @@ static void cachefiles_read_complete(struct kiocb *iocb, long ret, long ret2)
>         _enter("%ld,%ld", ret, ret2);
>
>         if (ki->term_func) {
> -               if (ret < 0)
> -                       ki->term_func(ki->term_func_priv, ret);
> -               else
> -                       ki->term_func(ki->term_func_priv, ki->skipped + ret);
> +               if (ret >= 0)
> +                       ret += ki->skipped;
> +               ki->term_func(ki->term_func_priv, ret, ki->was_async);
>         }
>
>         cachefiles_put_kiocb(ki);
> @@ -114,6 +114,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
>         ki->skipped             = skipped;
>         ki->term_func           = term_func;
>         ki->term_func_priv      = term_func_priv;
> +       ki->was_async           = true;
>
>         if (ki->term_func)
>                 ki->iocb.ki_complete = cachefiles_read_complete;
> @@ -141,6 +142,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
>                 ret = -EINTR;
>                 fallthrough;
>         default:
> +               ki->was_async = false;
>                 cachefiles_read_complete(&ki->iocb, ret, 0);
>                 if (ret > 0)
>                         ret = 0;
> @@ -156,7 +158,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
>         kfree(ki);
>  presubmission_error:
>         if (term_func)
> -               term_func(term_func_priv, ret < 0 ? ret : skipped);
> +               term_func(term_func_priv, ret < 0 ? ret : skipped, false);
>         return ret;
>  }
>
> @@ -175,7 +177,7 @@ static void cachefiles_write_complete(struct kiocb *iocb, long ret, long ret2)
>         __sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>
>         if (ki->term_func)
> -               ki->term_func(ki->term_func_priv, ret);
> +               ki->term_func(ki->term_func_priv, ret, ki->was_async);
>
>         cachefiles_put_kiocb(ki);
>  }
> @@ -214,6 +216,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
>         ki->len                 = len;
>         ki->term_func           = term_func;
>         ki->term_func_priv      = term_func_priv;
> +       ki->was_async           = true;
>
>         if (ki->term_func)
>                 ki->iocb.ki_complete = cachefiles_write_complete;
> @@ -250,6 +253,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
>                 ret = -EINTR;
>                 /* Fall through */
>         default:
> +               ki->was_async = false;
>                 cachefiles_write_complete(&ki->iocb, ret, 0);
>                 if (ret > 0)
>                         ret = 0;
> @@ -265,7 +269,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
>         kfree(ki);
>  presubmission_error:
>         if (term_func)
> -               term_func(term_func_priv, -ENOMEM);
> +               term_func(term_func_priv, -ENOMEM, false);
>         return -ENOMEM;
>  }
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 0dd64d31eff6..dcfd805d168e 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -221,7 +221,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
>         if (err >= 0 && err < subreq->len)
>                 __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
>
> -       netfs_subreq_terminated(subreq, err);
> +       netfs_subreq_terminated(subreq, err, true);
>
>         num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
>         ceph_put_page_vector(osd_data->pages, num_pages, false);
> @@ -276,7 +276,7 @@ static void ceph_netfs_issue_op(struct netfs_read_subrequest *subreq)
>  out:
>         ceph_osdc_put_request(req);
>         if (err)
> -               netfs_subreq_terminated(subreq, err);
> +               netfs_subreq_terminated(subreq, err, false);
>         dout("%s: result %d\n", __func__, err);
>  }
>
> diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
> index 9191a3617d91..5f5de8278499 100644
> --- a/fs/netfs/read_helper.c
> +++ b/fs/netfs/read_helper.c
> @@ -29,12 +29,13 @@ module_param_named(debug, netfs_debug, uint, S_IWUSR | S_IRUGO);
>  MODULE_PARM_DESC(netfs_debug, "Netfs support debugging mask");
>
>  static void netfs_rreq_work(struct work_struct *);
> -static void __netfs_put_subrequest(struct netfs_read_subrequest *);
> +static void __netfs_put_subrequest(struct netfs_read_subrequest *, bool);
>
> -static void netfs_put_subrequest(struct netfs_read_subrequest *subreq)
> +static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
> +                                bool was_async)
>  {
>         if (refcount_dec_and_test(&subreq->usage))
> -               __netfs_put_subrequest(subreq);
> +               __netfs_put_subrequest(subreq, was_async);
>  }
>
>  static struct netfs_read_request *netfs_alloc_read_request(
> @@ -67,7 +68,8 @@ static void netfs_get_read_request(struct netfs_read_request *rreq)
>         refcount_inc(&rreq->usage);
>  }
>
> -static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
> +static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq,
> +                                    bool was_async)
>  {
>         struct netfs_read_subrequest *subreq;
>
> @@ -75,7 +77,7 @@ static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
>                 subreq = list_first_entry(&rreq->subrequests,
>                                           struct netfs_read_subrequest, rreq_link);
>                 list_del(&subreq->rreq_link);
> -               netfs_put_subrequest(subreq);
> +               netfs_put_subrequest(subreq, was_async);
>         }
>  }
>
> @@ -83,7 +85,7 @@ static void netfs_free_read_request(struct work_struct *work)
>  {
>         struct netfs_read_request *rreq =
>                 container_of(work, struct netfs_read_request, work);
> -       netfs_rreq_clear_subreqs(rreq);
> +       netfs_rreq_clear_subreqs(rreq, false);
>         if (rreq->netfs_priv)
>                 rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
>         trace_netfs_rreq(rreq, netfs_rreq_trace_free);
> @@ -93,10 +95,10 @@ static void netfs_free_read_request(struct work_struct *work)
>         netfs_stat_d(&netfs_n_rh_rreq);
>  }
>
> -static void netfs_put_read_request(struct netfs_read_request *rreq)
> +static void netfs_put_read_request(struct netfs_read_request *rreq, bool was_async)
>  {
>         if (refcount_dec_and_test(&rreq->usage)) {
> -               if (in_softirq()) {
> +               if (was_async) {
>                         rreq->work.func = netfs_free_read_request;
>                         if (!queue_work(system_unbound_wq, &rreq->work))
>                                 BUG();
> @@ -131,12 +133,15 @@ static void netfs_get_read_subrequest(struct netfs_read_subrequest *subreq)
>         refcount_inc(&subreq->usage);
>  }
>
> -static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq)
> +static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq,
> +                                  bool was_async)
>  {
> +       struct netfs_read_request *rreq = subreq->rreq;
> +
>         trace_netfs_sreq(subreq, netfs_sreq_trace_free);
> -       netfs_put_read_request(subreq->rreq);
>         kfree(subreq);
>         netfs_stat_d(&netfs_n_rh_sreq);
> +       netfs_put_read_request(rreq, was_async);
>  }
>
>  /*
> @@ -152,11 +157,12 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
>         iov_iter_zero(iov_iter_count(&iter), &iter);
>  }
>
> -static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error)
> +static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error,
> +                                       bool was_async)
>  {
>         struct netfs_read_subrequest *subreq = priv;
>
> -       netfs_subreq_terminated(subreq, transferred_or_error);
> +       netfs_subreq_terminated(subreq, transferred_or_error, was_async);
>  }
>
>  /*
> @@ -186,7 +192,7 @@ static void netfs_fill_with_zeroes(struct netfs_read_request *rreq,
>  {
>         netfs_stat(&netfs_n_rh_zero);
>         __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
> -       netfs_subreq_terminated(subreq, 0);
> +       netfs_subreq_terminated(subreq, 0, false);
>  }
>
>  /*
> @@ -215,11 +221,11 @@ static void netfs_read_from_server(struct netfs_read_request *rreq,
>  /*
>   * Release those waiting.
>   */
> -static void netfs_rreq_completed(struct netfs_read_request *rreq)
> +static void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async)
>  {
>         trace_netfs_rreq(rreq, netfs_rreq_trace_done);
> -       netfs_rreq_clear_subreqs(rreq);
> -       netfs_put_read_request(rreq);
> +       netfs_rreq_clear_subreqs(rreq, was_async);
> +       netfs_put_read_request(rreq, was_async);
>  }
>
>  /*
> @@ -228,7 +234,8 @@ static void netfs_rreq_completed(struct netfs_read_request *rreq)
>   *
>   * May be called in softirq mode and we inherit a ref from the caller.
>   */
> -static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
> +static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq,
> +                                         bool was_async)
>  {
>         struct netfs_read_subrequest *subreq;
>         struct pagevec pvec;
> @@ -258,10 +265,11 @@ static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
>         }
>
>         rcu_read_unlock();
> -       netfs_rreq_completed(rreq);
> +       netfs_rreq_completed(rreq, was_async);
>  }
>
> -static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
> +static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error,
> +                                      bool was_async)
>  {
>         struct netfs_read_subrequest *subreq = priv;
>         struct netfs_read_request *rreq = subreq->rreq;
> @@ -278,9 +286,9 @@ static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
>
>         /* If we decrement nr_wr_ops to 0, the ref belongs to us. */
>         if (atomic_dec_and_test(&rreq->nr_wr_ops))
> -               netfs_rreq_unmark_after_write(rreq);
> +               netfs_rreq_unmark_after_write(rreq, was_async);
>
> -       netfs_put_subrequest(subreq);
> +       netfs_put_subrequest(subreq, was_async);
>  }
>
>  /*
> @@ -304,7 +312,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
>         list_for_each_entry_safe(subreq, p, &rreq->subrequests, rreq_link) {
>                 if (!test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
>                         list_del_init(&subreq->rreq_link);
> -                       netfs_put_subrequest(subreq);
> +                       netfs_put_subrequest(subreq, false);
>                 }
>         }
>
> @@ -324,7 +332,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
>                         subreq->len += next->len;
>                         subreq->len = round_up(subreq->len, PAGE_SIZE);
>                         list_del_init(&next->rreq_link);
> -                       netfs_put_subrequest(next);
> +                       netfs_put_subrequest(next, false);
>                 }
>
>                 iov_iter_xarray(&iter, WRITE, &rreq->mapping->i_pages,
> @@ -340,7 +348,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
>
>         /* If we decrement nr_wr_ops to 0, the usage ref belongs to us. */
>         if (atomic_dec_and_test(&rreq->nr_wr_ops))
> -               netfs_rreq_unmark_after_write(rreq);
> +               netfs_rreq_unmark_after_write(rreq, false);
>  }
>
>  static void netfs_rreq_write_to_cache_work(struct work_struct *work)
> @@ -351,9 +359,10 @@ static void netfs_rreq_write_to_cache_work(struct work_struct *work)
>         netfs_rreq_do_write_to_cache(rreq);
>  }
>
> -static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq)
> +static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq,
> +                                     bool was_async)
>  {
> -       if (in_softirq()) {
> +       if (was_async) {
>                 rreq->work.func = netfs_rreq_write_to_cache_work;
>                 if (!queue_work(system_unbound_wq, &rreq->work))
>                         BUG();
> @@ -479,7 +488,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
>  {
>         struct netfs_read_subrequest *subreq;
>
> -       WARN_ON(in_softirq());
> +       WARN_ON(in_interrupt());
>
>         trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
>
> @@ -538,7 +547,7 @@ static void netfs_rreq_is_still_valid(struct netfs_read_request *rreq)
>   * Note that we could be in an ordinary kernel thread, on a workqueue or in
>   * softirq context at this point.  We inherit a ref from the caller.
>   */
> -static void netfs_rreq_assess(struct netfs_read_request *rreq)
> +static void netfs_rreq_assess(struct netfs_read_request *rreq, bool was_async)
>  {
>         trace_netfs_rreq(rreq, netfs_rreq_trace_assess);
>
> @@ -558,30 +567,31 @@ static void netfs_rreq_assess(struct netfs_read_request *rreq)
>         wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
>
>         if (test_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags))
> -               return netfs_rreq_write_to_cache(rreq);
> +               return netfs_rreq_write_to_cache(rreq, was_async);
>
> -       netfs_rreq_completed(rreq);
> +       netfs_rreq_completed(rreq, was_async);
>  }
>
>  static void netfs_rreq_work(struct work_struct *work)
>  {
>         struct netfs_read_request *rreq =
>                 container_of(work, struct netfs_read_request, work);
> -       netfs_rreq_assess(rreq);
> +       netfs_rreq_assess(rreq, false);
>  }
>
>  /*
>   * Handle the completion of all outstanding I/O operations on a read request.
>   * We inherit a ref from the caller.
>   */
> -static void netfs_rreq_terminated(struct netfs_read_request *rreq)
> +static void netfs_rreq_terminated(struct netfs_read_request *rreq,
> +                                 bool was_async)
>  {
>         if (test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags) &&
> -           in_softirq()) {
> +           was_async) {
>                 if (!queue_work(system_unbound_wq, &rreq->work))
>                         BUG();
>         } else {
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, was_async);
>         }
>  }
>
> @@ -589,6 +599,7 @@ static void netfs_rreq_terminated(struct netfs_read_request *rreq)
>   * netfs_subreq_terminated - Note the termination of an I/O operation.
>   * @subreq: The I/O request that has terminated.
>   * @transferred_or_error: The amount of data transferred or an error code.
> > + * @was_async: The termination was asynchronous.
>   *
>   * This tells the read helper that a contributory I/O operation has terminated,
>   * one way or another, and that it should integrate the results.
> @@ -599,11 +610,12 @@ static void netfs_rreq_terminated(struct netfs_read_request *rreq)
>   * error code.  The helper will look after reissuing I/O operations as
>   * appropriate and writing downloaded data to the cache.
>   *
> - * This may be called from a softirq handler, so we want to avoid taking the
> - * spinlock if we can.
> + * If @was_async is true, the caller might be running in softirq or interrupt
> + * context and we can't sleep.
>   */
>  void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
> -                            ssize_t transferred_or_error)
> +                            ssize_t transferred_or_error,
> +                            bool was_async)
>  {
>         struct netfs_read_request *rreq = subreq->rreq;
>         int u;
> @@ -647,11 +659,11 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
>         /* If we decrement nr_rd_ops to 0, the ref belongs to us. */
>         u = atomic_dec_return(&rreq->nr_rd_ops);
>         if (u == 0)
> -               netfs_rreq_terminated(rreq);
> +               netfs_rreq_terminated(rreq, was_async);
>         else if (u == 1)
>                 wake_up_var(&rreq->nr_rd_ops);
>
> -       netfs_put_subrequest(subreq);
> +       netfs_put_subrequest(subreq, was_async);
>         return;
>
>  incomplete:
> @@ -796,7 +808,7 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
>
>  subreq_failed:
>         rreq->error = subreq->error;
> -       netfs_put_subrequest(subreq);
> +       netfs_put_subrequest(subreq, false);
>         return false;
>  }
>
> @@ -901,7 +913,7 @@ void netfs_readahead(struct readahead_control *ractl,
>         } while (rreq->submitted < rreq->len);
>
>         if (rreq->submitted == 0) {
> -               netfs_put_read_request(rreq);
> +               netfs_put_read_request(rreq, false);
>                 return;
>         }
>
> @@ -913,11 +925,11 @@ void netfs_readahead(struct readahead_control *ractl,
>
>         /* If we decrement nr_rd_ops to 0, the ref belongs to us. */
>         if (atomic_dec_and_test(&rreq->nr_rd_ops))
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, false);
>         return;
>
>  cleanup_free:
> -       netfs_put_read_request(rreq);
> +       netfs_put_read_request(rreq, false);
>         return;
>  cleanup:
>         if (netfs_priv)
> @@ -991,14 +1003,14 @@ int netfs_readpage(struct file *file,
>          */
>         do {
>                 wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, false);
>         } while (test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags));
>
>         ret = rreq->error;
>         if (ret == 0 && rreq->submitted < rreq->len)
>                 ret = -EIO;
>  out:
> -       netfs_put_read_request(rreq);
> +       netfs_put_read_request(rreq, false);
>         return ret;
>  }
>  EXPORT_SYMBOL(netfs_readpage);
> @@ -1136,7 +1148,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
>          */
>         for (;;) {
>                 wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, false);
>                 if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
>                         break;
>                 cond_resched();
> @@ -1145,7 +1157,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
>         ret = rreq->error;
>         if (ret == 0 && rreq->submitted < rreq->len)
>                 ret = -EIO;
> -       netfs_put_read_request(rreq);
> +       netfs_put_read_request(rreq, false);
>         if (ret < 0)
>                 goto error;
>
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index b2589b39feb8..c22b64db237d 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -60,7 +60,8 @@ enum netfs_read_source {
>         NETFS_INVALID_READ,
>  } __mode(byte);
>
> -typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error);
> +typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error,
> +                                     bool was_async);
>
>  /*
>   * Resources required to do operations on a cache.
> @@ -189,7 +190,7 @@ extern int netfs_write_begin(struct file *, struct address_space *,
>                              const struct netfs_read_request_ops *,
>                              void *);
>
> -extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t);
> +extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
>  extern void netfs_stats_show(struct seq_file *);
>
>  #endif /* _LINUX_NETFS_H */
>
>

Looks good in testing.

Tested-by: Marc Dionne <marc.dionne@auristor.com>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 34/33] netfs: Pass flag rather than use in_softirq()
  2021-02-18 14:02     ` [PATCH 34/33] netfs: Pass flag rather than use in_softirq() David Howells
  2021-02-18 15:06       ` Marc Dionne
@ 2021-02-18 15:16       ` Marc Dionne
  2021-02-19  9:01       ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 47+ messages in thread
From: Marc Dionne @ 2021-02-18 15:16 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, linux-cifs, ceph-devel, Jeff Layton,
	Matthew Wilcox, linux-cachefs, Alexander Viro, linux-mm,
	linux-afs, v9fs-developer, linux-fsdevel, linux-nfs, Jeff Layton,
	David Wysochanski, Linux Kernel Mailing List,
	Sebastian Andrzej Siewior

On Thu, Feb 18, 2021 at 10:03 AM David Howells <dhowells@redhat.com> wrote:
>
> Christoph Hellwig <hch@lst.de> wrote:
>
> > On Tue, Feb 16, 2021 at 09:29:31AM +0000, David Howells wrote:
> > > Is there a better way to do it?  The intent is to process the assessment
> > > phase in the calling thread's context if possible rather than bumping over
> > > to a worker thread.  For synchronous I/O, for example, that's done in the
> > > caller's thread.  Maybe that's the answer - if it's known to be
> > > asynchronous, I have to punt, but otherwise don't have to.
> >
> > Yes, i think you want an explicit flag instead.
>
> How about the attached instead?
>
> David
> ---
> commit 29b3e9eed616db01f15c7998c062b4e501ea6582
> Author: David Howells <dhowells@redhat.com>
> Date:   Mon Feb 15 21:56:43 2021 +0000
>
>     netfs: Pass flag rather than use in_softirq()
>
>     The in_softirq() in netfs_rreq_terminated() works fine for the cache being
>     on a normal disk, as the completion handlers may get called in softirq
>     context, but for an NVMe drive, the completion handler may get called in
>     IRQ context.
>
>     Fix to pass a flag to netfs_subreq_terminated() to indicate whether we
>     think the function isn't being called from a context in which we can do
>     allocations, waits and I/O submissions (such as softirq or interrupt
>     context).  If this flag is set, netfs lib has to punt to a worker thread to
>     handle anything like that.
>
>     The symptom involves warnings like the following appearing and the kernel
>     hanging:
>
>      WARNING: CPU: 0 PID: 0 at kernel/softirq.c:175 __local_bh_enable_ip+0x35/0x50
>      ...
>      RIP: 0010:__local_bh_enable_ip+0x35/0x50
>      ...
>      Call Trace:
>       <IRQ>
>       rxrpc_kernel_begin_call+0x7d/0x1b0 [rxrpc]
>       ? afs_rx_new_call+0x40/0x40 [kafs]
>       ? afs_alloc_call+0x28/0x120 [kafs]
>       afs_make_call+0x120/0x510 [kafs]
>       ? afs_rx_new_call+0x40/0x40 [kafs]
>       ? afs_alloc_flat_call+0xba/0x100 [kafs]
>       ? __kmalloc+0x167/0x2f0
>       ? afs_alloc_flat_call+0x9b/0x100 [kafs]
>       afs_wait_for_operation+0x2d/0x200 [kafs]
>       afs_do_sync_operation+0x16/0x20 [kafs]
>       afs_req_issue_op+0x8c/0xb0 [kafs]
>       netfs_rreq_assess+0x125/0x7d0 [netfs]
>       ? cachefiles_end_operation+0x40/0x40 [cachefiles]
>       netfs_subreq_terminated+0x117/0x220 [netfs]
>       cachefiles_read_complete+0x21/0x60 [cachefiles]
>       iomap_dio_bio_end_io+0xdd/0x110
>       blk_update_request+0x20a/0x380
>       blk_mq_end_request+0x1c/0x120
>       nvme_process_cq+0x159/0x1f0 [nvme]
>       nvme_irq+0x10/0x20 [nvme]
>       __handle_irq_event_percpu+0x37/0x150
>       handle_irq_event+0x49/0xb0
>       handle_edge_irq+0x7c/0x200
>       asm_call_irq_on_stack+0xf/0x20
>       </IRQ>
>       common_interrupt+0xad/0x120
>       asm_common_interrupt+0x1e/0x40
>      ...
>
>     Reported-by: Marc Dionne <marc.dionne@auristor.com>
>     Signed-off-by: David Howells <dhowells@redhat.com>
>     cc: Matthew Wilcox <willy@infradead.org>
>     cc: linux-mm@kvack.org
>     cc: linux-cachefs@redhat.com
>     cc: linux-afs@lists.infradead.org
>     cc: linux-nfs@vger.kernel.org
>     cc: linux-cifs@vger.kernel.org
>     cc: ceph-devel@vger.kernel.org
>     cc: v9fs-developer@lists.sourceforge.net
>     cc: linux-fsdevel@vger.kernel.org
>
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index 8f28d4f4cfd7..6dcdbbfb48e2 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -223,7 +223,7 @@ static void afs_fetch_data_notify(struct afs_operation *op)
>
>         if (subreq) {
>                 __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
> -               netfs_subreq_terminated(subreq, error ?: req->actual_len);
> +               netfs_subreq_terminated(subreq, error ?: req->actual_len, false);
>                 req->subreq = NULL;
>         } else if (req->done) {
>                 req->done(req);
> @@ -289,7 +289,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>
>         fsreq = afs_alloc_read(GFP_NOFS);
>         if (!fsreq)
> -               return netfs_subreq_terminated(subreq, -ENOMEM);
> +               return netfs_subreq_terminated(subreq, -ENOMEM, false);
>
>         fsreq->subreq   = subreq;
>         fsreq->pos      = subreq->start + subreq->transferred;
> @@ -304,7 +304,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>
>         ret = afs_fetch_data(fsreq->vnode, fsreq);
>         if (ret < 0)
> -               return netfs_subreq_terminated(subreq, ret);
> +               return netfs_subreq_terminated(subreq, ret, false);
>  }
>
>  static int afs_symlink_readpage(struct page *page)
> diff --git a/fs/cachefiles/rdwr2.c b/fs/cachefiles/rdwr2.c
> index 4cea5a2a2d6e..40668bfe6688 100644
> --- a/fs/cachefiles/rdwr2.c
> +++ b/fs/cachefiles/rdwr2.c
> @@ -23,6 +23,7 @@ struct cachefiles_kiocb {
>         };
>         netfs_io_terminated_t   term_func;
>         void                    *term_func_priv;
> +       bool                    was_async;
>  };
>
>  static inline void cachefiles_put_kiocb(struct cachefiles_kiocb *ki)
> @@ -43,10 +44,9 @@ static void cachefiles_read_complete(struct kiocb *iocb, long ret, long ret2)
>         _enter("%ld,%ld", ret, ret2);
>
>         if (ki->term_func) {
> -               if (ret < 0)
> -                       ki->term_func(ki->term_func_priv, ret);
> -               else
> -                       ki->term_func(ki->term_func_priv, ki->skipped + ret);
> +               if (ret >= 0)
> +                       ret += ki->skipped;
> +               ki->term_func(ki->term_func_priv, ret, ki->was_async);
>         }
>
>         cachefiles_put_kiocb(ki);
> @@ -114,6 +114,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
>         ki->skipped             = skipped;
>         ki->term_func           = term_func;
>         ki->term_func_priv      = term_func_priv;
> +       ki->was_async           = true;
>
>         if (ki->term_func)
>                 ki->iocb.ki_complete = cachefiles_read_complete;
> @@ -141,6 +142,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
>                 ret = -EINTR;
>                 fallthrough;
>         default:
> +               ki->was_async = false;
>                 cachefiles_read_complete(&ki->iocb, ret, 0);
>                 if (ret > 0)
>                         ret = 0;
> @@ -156,7 +158,7 @@ static int cachefiles_read(struct netfs_cache_resources *cres,
>         kfree(ki);
>  presubmission_error:
>         if (term_func)
> -               term_func(term_func_priv, ret < 0 ? ret : skipped);
> +               term_func(term_func_priv, ret < 0 ? ret : skipped, false);
>         return ret;
>  }
>
> @@ -175,7 +177,7 @@ static void cachefiles_write_complete(struct kiocb *iocb, long ret, long ret2)
>         __sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>
>         if (ki->term_func)
> -               ki->term_func(ki->term_func_priv, ret);
> +               ki->term_func(ki->term_func_priv, ret, ki->was_async);
>
>         cachefiles_put_kiocb(ki);
>  }
> @@ -214,6 +216,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
>         ki->len                 = len;
>         ki->term_func           = term_func;
>         ki->term_func_priv      = term_func_priv;
> +       ki->was_async           = true;
>
>         if (ki->term_func)
>                 ki->iocb.ki_complete = cachefiles_write_complete;
> @@ -250,6 +253,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
>                 ret = -EINTR;
>                 /* Fall through */
>         default:
> +               ki->was_async = false;
>                 cachefiles_write_complete(&ki->iocb, ret, 0);
>                 if (ret > 0)
>                         ret = 0;
> @@ -265,7 +269,7 @@ static int cachefiles_write(struct netfs_cache_resources *cres,
>         kfree(ki);
>  presubmission_error:
>         if (term_func)
> -               term_func(term_func_priv, -ENOMEM);
> +               term_func(term_func_priv, -ENOMEM, false);
>         return -ENOMEM;
>  }
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 0dd64d31eff6..dcfd805d168e 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -221,7 +221,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
>         if (err >= 0 && err < subreq->len)
>                 __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
>
> -       netfs_subreq_terminated(subreq, err);
> +       netfs_subreq_terminated(subreq, err, true);
>
>         num_pages = calc_pages_for(osd_data->alignment, osd_data->length);
>         ceph_put_page_vector(osd_data->pages, num_pages, false);
> @@ -276,7 +276,7 @@ static void ceph_netfs_issue_op(struct netfs_read_subrequest *subreq)
>  out:
>         ceph_osdc_put_request(req);
>         if (err)
> -               netfs_subreq_terminated(subreq, err);
> +               netfs_subreq_terminated(subreq, err, false);
>         dout("%s: result %d\n", __func__, err);
>  }
>
> diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
> index 9191a3617d91..5f5de8278499 100644
> --- a/fs/netfs/read_helper.c
> +++ b/fs/netfs/read_helper.c
> @@ -29,12 +29,13 @@ module_param_named(debug, netfs_debug, uint, S_IWUSR | S_IRUGO);
>  MODULE_PARM_DESC(netfs_debug, "Netfs support debugging mask");
>
>  static void netfs_rreq_work(struct work_struct *);
> -static void __netfs_put_subrequest(struct netfs_read_subrequest *);
> +static void __netfs_put_subrequest(struct netfs_read_subrequest *, bool);
>
> -static void netfs_put_subrequest(struct netfs_read_subrequest *subreq)
> +static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
> +                                bool was_async)
>  {
>         if (refcount_dec_and_test(&subreq->usage))
> -               __netfs_put_subrequest(subreq);
> +               __netfs_put_subrequest(subreq, was_async);
>  }
>
>  static struct netfs_read_request *netfs_alloc_read_request(
> @@ -67,7 +68,8 @@ static void netfs_get_read_request(struct netfs_read_request *rreq)
>         refcount_inc(&rreq->usage);
>  }
>
> -static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
> +static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq,
> +                                    bool was_async)
>  {
>         struct netfs_read_subrequest *subreq;
>
> @@ -75,7 +77,7 @@ static void netfs_rreq_clear_subreqs(struct netfs_read_request *rreq)
>                 subreq = list_first_entry(&rreq->subrequests,
>                                           struct netfs_read_subrequest, rreq_link);
>                 list_del(&subreq->rreq_link);
> -               netfs_put_subrequest(subreq);
> +               netfs_put_subrequest(subreq, was_async);
>         }
>  }
>
> @@ -83,7 +85,7 @@ static void netfs_free_read_request(struct work_struct *work)
>  {
>         struct netfs_read_request *rreq =
>                 container_of(work, struct netfs_read_request, work);
> -       netfs_rreq_clear_subreqs(rreq);
> +       netfs_rreq_clear_subreqs(rreq, false);
>         if (rreq->netfs_priv)
>                 rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv);
>         trace_netfs_rreq(rreq, netfs_rreq_trace_free);
> @@ -93,10 +95,10 @@ static void netfs_free_read_request(struct work_struct *work)
>         netfs_stat_d(&netfs_n_rh_rreq);
>  }
>
> -static void netfs_put_read_request(struct netfs_read_request *rreq)
> +static void netfs_put_read_request(struct netfs_read_request *rreq, bool was_async)
>  {
>         if (refcount_dec_and_test(&rreq->usage)) {
> -               if (in_softirq()) {
> +               if (was_async) {
>                         rreq->work.func = netfs_free_read_request;
>                         if (!queue_work(system_unbound_wq, &rreq->work))
>                                 BUG();
> @@ -131,12 +133,15 @@ static void netfs_get_read_subrequest(struct netfs_read_subrequest *subreq)
>         refcount_inc(&subreq->usage);
>  }
>
> -static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq)
> +static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq,
> +                                  bool was_async)
>  {
> +       struct netfs_read_request *rreq = subreq->rreq;
> +
>         trace_netfs_sreq(subreq, netfs_sreq_trace_free);
> -       netfs_put_read_request(subreq->rreq);
>         kfree(subreq);
>         netfs_stat_d(&netfs_n_rh_sreq);
> +       netfs_put_read_request(rreq, was_async);
>  }
>
>  /*
> @@ -152,11 +157,12 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
>         iov_iter_zero(iov_iter_count(&iter), &iter);
>  }
>
> -static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error)
> +static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error,
> +                                       bool was_async)
>  {
>         struct netfs_read_subrequest *subreq = priv;
>
> -       netfs_subreq_terminated(subreq, transferred_or_error);
> +       netfs_subreq_terminated(subreq, transferred_or_error, was_async);
>  }
>
>  /*
> @@ -186,7 +192,7 @@ static void netfs_fill_with_zeroes(struct netfs_read_request *rreq,
>  {
>         netfs_stat(&netfs_n_rh_zero);
>         __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
> -       netfs_subreq_terminated(subreq, 0);
> +       netfs_subreq_terminated(subreq, 0, false);
>  }
>
>  /*
> @@ -215,11 +221,11 @@ static void netfs_read_from_server(struct netfs_read_request *rreq,
>  /*
>   * Release those waiting.
>   */
> -static void netfs_rreq_completed(struct netfs_read_request *rreq)
> +static void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async)
>  {
>         trace_netfs_rreq(rreq, netfs_rreq_trace_done);
> -       netfs_rreq_clear_subreqs(rreq);
> -       netfs_put_read_request(rreq);
> +       netfs_rreq_clear_subreqs(rreq, was_async);
> +       netfs_put_read_request(rreq, was_async);
>  }
>
>  /*
> @@ -228,7 +234,8 @@ static void netfs_rreq_completed(struct netfs_read_request *rreq)
>   *
>   * May be called in softirq mode and we inherit a ref from the caller.
>   */
> -static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
> +static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq,
> +                                         bool was_async)
>  {
>         struct netfs_read_subrequest *subreq;
>         struct pagevec pvec;
> @@ -258,10 +265,11 @@ static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq)
>         }
>
>         rcu_read_unlock();
> -       netfs_rreq_completed(rreq);
> +       netfs_rreq_completed(rreq, was_async);
>  }
>
> -static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
> +static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error,
> +                                      bool was_async)
>  {
>         struct netfs_read_subrequest *subreq = priv;
>         struct netfs_read_request *rreq = subreq->rreq;
> @@ -278,9 +286,9 @@ static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error)
>
>         /* If we decrement nr_wr_ops to 0, the ref belongs to us. */
>         if (atomic_dec_and_test(&rreq->nr_wr_ops))
> -               netfs_rreq_unmark_after_write(rreq);
> +               netfs_rreq_unmark_after_write(rreq, was_async);
>
> -       netfs_put_subrequest(subreq);
> +       netfs_put_subrequest(subreq, was_async);
>  }
>
>  /*
> @@ -304,7 +312,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
>         list_for_each_entry_safe(subreq, p, &rreq->subrequests, rreq_link) {
>                 if (!test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
>                         list_del_init(&subreq->rreq_link);
> -                       netfs_put_subrequest(subreq);
> +                       netfs_put_subrequest(subreq, false);
>                 }
>         }
>
> @@ -324,7 +332,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
>                         subreq->len += next->len;
>                         subreq->len = round_up(subreq->len, PAGE_SIZE);
>                         list_del_init(&next->rreq_link);
> -                       netfs_put_subrequest(next);
> +                       netfs_put_subrequest(next, false);
>                 }
>
>                 iov_iter_xarray(&iter, WRITE, &rreq->mapping->i_pages,
> @@ -340,7 +348,7 @@ static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
>
>         /* If we decrement nr_wr_ops to 0, the usage ref belongs to us. */
>         if (atomic_dec_and_test(&rreq->nr_wr_ops))
> -               netfs_rreq_unmark_after_write(rreq);
> +               netfs_rreq_unmark_after_write(rreq, false);
>  }
>
>  static void netfs_rreq_write_to_cache_work(struct work_struct *work)
> @@ -351,9 +359,10 @@ static void netfs_rreq_write_to_cache_work(struct work_struct *work)
>         netfs_rreq_do_write_to_cache(rreq);
>  }
>
> -static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq)
> +static void netfs_rreq_write_to_cache(struct netfs_read_request *rreq,
> +                                     bool was_async)
>  {
> -       if (in_softirq()) {
> +       if (was_async) {
>                 rreq->work.func = netfs_rreq_write_to_cache_work;
>                 if (!queue_work(system_unbound_wq, &rreq->work))
>                         BUG();
> @@ -479,7 +488,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_read_request *rreq)
>  {
>         struct netfs_read_subrequest *subreq;
>
> -       WARN_ON(in_softirq());
> +       WARN_ON(in_interrupt());
>
>         trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
>
> @@ -538,7 +547,7 @@ static void netfs_rreq_is_still_valid(struct netfs_read_request *rreq)
>   * Note that we could be in an ordinary kernel thread, on a workqueue or in
>   * softirq context at this point.  We inherit a ref from the caller.
>   */
> -static void netfs_rreq_assess(struct netfs_read_request *rreq)
> +static void netfs_rreq_assess(struct netfs_read_request *rreq, bool was_async)
>  {
>         trace_netfs_rreq(rreq, netfs_rreq_trace_assess);
>
> @@ -558,30 +567,31 @@ static void netfs_rreq_assess(struct netfs_read_request *rreq)
>         wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
>
>         if (test_bit(NETFS_RREQ_WRITE_TO_CACHE, &rreq->flags))
> -               return netfs_rreq_write_to_cache(rreq);
> +               return netfs_rreq_write_to_cache(rreq, was_async);
>
> -       netfs_rreq_completed(rreq);
> +       netfs_rreq_completed(rreq, was_async);
>  }
>
>  static void netfs_rreq_work(struct work_struct *work)
>  {
>         struct netfs_read_request *rreq =
>                 container_of(work, struct netfs_read_request, work);
> -       netfs_rreq_assess(rreq);
> +       netfs_rreq_assess(rreq, false);
>  }
>
>  /*
>   * Handle the completion of all outstanding I/O operations on a read request.
>   * We inherit a ref from the caller.
>   */
> -static void netfs_rreq_terminated(struct netfs_read_request *rreq)
> +static void netfs_rreq_terminated(struct netfs_read_request *rreq,
> +                                 bool was_async)
>  {
>         if (test_bit(NETFS_RREQ_INCOMPLETE_IO, &rreq->flags) &&
> -           in_softirq()) {
> +           was_async) {
>                 if (!queue_work(system_unbound_wq, &rreq->work))
>                         BUG();
>         } else {
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, was_async);
>         }
>  }
>
> @@ -589,6 +599,7 @@ static void netfs_rreq_terminated(struct netfs_read_request *rreq)
>   * netfs_subreq_terminated - Note the termination of an I/O operation.
>   * @subreq: The I/O request that has terminated.
>   * @transferred_or_error: The amount of data transferred or an error code.
> + * @was_async: The termination was asynchronous
>   *
>   * This tells the read helper that a contributory I/O operation has terminated,
>   * one way or another, and that it should integrate the results.
> @@ -599,11 +610,12 @@ static void netfs_rreq_terminated(struct netfs_read_request *rreq)
>   * error code.  The helper will look after reissuing I/O operations as
>   * appropriate and writing downloaded data to the cache.
>   *
> - * This may be called from a softirq handler, so we want to avoid taking the
> - * spinlock if we can.
> + * If @was_async is true, the caller might be running in softirq or interrupt
> + * context and we can't sleep.
>   */
>  void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
> -                            ssize_t transferred_or_error)
> +                            ssize_t transferred_or_error,
> +                            bool was_async)
>  {
>         struct netfs_read_request *rreq = subreq->rreq;
>         int u;
> @@ -647,11 +659,11 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
>         /* If we decrement nr_rd_ops to 0, the ref belongs to us. */
>         u = atomic_dec_return(&rreq->nr_rd_ops);
>         if (u == 0)
> -               netfs_rreq_terminated(rreq);
> +               netfs_rreq_terminated(rreq, was_async);
>         else if (u == 1)
>                 wake_up_var(&rreq->nr_rd_ops);
>
> -       netfs_put_subrequest(subreq);
> +       netfs_put_subrequest(subreq, was_async);
>         return;
>
>  incomplete:
> @@ -796,7 +808,7 @@ static bool netfs_rreq_submit_slice(struct netfs_read_request *rreq,
>
>  subreq_failed:
>         rreq->error = subreq->error;
> -       netfs_put_subrequest(subreq);
> +       netfs_put_subrequest(subreq, false);
>         return false;
>  }
>
> @@ -901,7 +913,7 @@ void netfs_readahead(struct readahead_control *ractl,
>         } while (rreq->submitted < rreq->len);
>
>         if (rreq->submitted == 0) {
> -               netfs_put_read_request(rreq);
> +               netfs_put_read_request(rreq, false);
>                 return;
>         }
>
> @@ -913,11 +925,11 @@ void netfs_readahead(struct readahead_control *ractl,
>
>         /* If we decrement nr_rd_ops to 0, the ref belongs to us. */
>         if (atomic_dec_and_test(&rreq->nr_rd_ops))
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, false);
>         return;
>
>  cleanup_free:
> -       netfs_put_read_request(rreq);
> +       netfs_put_read_request(rreq, false);
>         return;
>  cleanup:
>         if (netfs_priv)
> @@ -991,14 +1003,14 @@ int netfs_readpage(struct file *file,
>          */
>         do {
>                 wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, false);
>         } while (test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags));
>
>         ret = rreq->error;
>         if (ret == 0 && rreq->submitted < rreq->len)
>                 ret = -EIO;
>  out:
> -       netfs_put_read_request(rreq);
> +       netfs_put_read_request(rreq, false);
>         return ret;
>  }
>  EXPORT_SYMBOL(netfs_readpage);
> @@ -1136,7 +1148,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
>          */
>         for (;;) {
>                 wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
> -               netfs_rreq_assess(rreq);
> +               netfs_rreq_assess(rreq, false);
>                 if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
>                         break;
>                 cond_resched();
> @@ -1145,7 +1157,7 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
>         ret = rreq->error;
>         if (ret == 0 && rreq->submitted < rreq->len)
>                 ret = -EIO;
> -       netfs_put_read_request(rreq);
> +       netfs_put_read_request(rreq, false);
>         if (ret < 0)
>                 goto error;
>
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index b2589b39feb8..c22b64db237d 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -60,7 +60,8 @@ enum netfs_read_source {
>         NETFS_INVALID_READ,
>  } __mode(byte);
>
> -typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error);
> +typedef void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error,
> +                                     bool was_async);
>
>  /*
>   * Resources required to do operations on a cache.
> @@ -189,7 +190,7 @@ extern int netfs_write_begin(struct file *, struct address_space *,
>                              const struct netfs_read_request_ops *,
>                              void *);
>
> -extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t);
> +extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
>  extern void netfs_stats_show(struct seq_file *);
>
>  #endif /* _LINUX_NETFS_H */
>
>

Looks good in testing.

Tested-by: Marc Dionne <marc.dionne@auristor.com>
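
To see the revised contract in use: a filesystem's I/O completion path
would now look roughly like the sketch below.  Only
netfs_subreq_terminated() and its (subrequest, result, was_async)
signature come from the patch above; the afs-flavoured request type and
function name are placeholders:

	/* Sketch only: the request type and helper name are made up.
	 * The point is that "this may be softirq/interrupt context" is
	 * now passed in explicitly rather than guessed via in_softirq().
	 */
	static void afs_req_done(struct afs_read *req, ssize_t result,
				 bool was_async)
	{
		/* Hand the outcome (bytes transferred or -errno) to the
		 * helper library; it must not sleep if was_async is true.
		 */
		netfs_subreq_terminated(req->subreq, result, was_async);
	}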



* Re: [PATCH 03/33] mm: Implement readahead_control pageset expansion
  2021-02-15 15:44 ` [PATCH 03/33] mm: Implement readahead_control pageset expansion David Howells
                     ` (3 preceding siblings ...)
  2021-02-17 22:34   ` David Howells
@ 2021-02-18 17:47   ` David Howells
  4 siblings, 0 replies; 47+ messages in thread
From: David Howells @ 2021-02-18 17:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Trond Myklebust, Anna Schumaker, Steve French,
	Dominique Martinet, Alexander Viro, Christoph Hellwig, linux-mm,
	linux-cachefs, linux-afs, linux-nfs, linux-cifs, ceph-devel,
	v9fs-developer, linux-fsdevel, Jeff Layton, David Wysochanski,
	linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> So readahead_expand() needs to adjust the file's f_ra so that when the
> application gets to 64kB, it kicks off the readahead of the 4MB-8MB chunk (and
> then when we get to 4MB+256kB, it kicks off the readahead of 8MB-12MB,
> and so on).

Ummm...  Two questions:

Firstly, how do I do that?  Set ->async_size?  And to what?  The expansion
could be 2MB from a ceph stripe or 256kB from the cache.  Just to add to the fun,
the leading edge of the window might also be rounded downwards and the RA
trigger could be before where the app is going to start reading.

Secondly, what happens if, say, a 4MB read is covered by a single 4MB THP?

David




* Re: [PATCH 34/33] netfs: Pass flag rather than use in_softirq()
  2021-02-18 14:02     ` [PATCH 34/33] netfs: Pass flag rather than use in_softirq() David Howells
  2021-02-18 15:06       ` Marc Dionne
  2021-02-18 15:16       ` Marc Dionne
@ 2021-02-19  9:01       ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 47+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-02-19  9:01 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Trond Myklebust, Marc Dionne, Anna Schumaker,
	Steve French, Dominique Martinet, linux-cifs, ceph-devel,
	Jeff Layton, Matthew Wilcox, linux-cachefs, Alexander Viro,
	linux-mm, linux-afs, v9fs-developer, linux-fsdevel, linux-nfs,
	Jeff Layton, David Wysochanski, linux-kernel

On 2021-02-18 14:02:36 [+0000], David Howells wrote:
> How about the attached instead?

Thank you for that flag.

> David

Sebastian



* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-16  5:22       ` Steve French
@ 2021-02-23 20:27         ` Matthew Wilcox
  2021-02-24  4:57           ` Steve French
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Wilcox @ 2021-02-23 20:27 UTC (permalink / raw)
  To: Steve French
  Cc: Jeff Layton, David Howells, Trond Myklebust, Anna Schumaker,
	Steve French, Dominique Martinet, CIFS, ceph-devel,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski,
	Jaegeuk Kim, Chao Yu, linux-f2fs-devel

On Mon, Feb 15, 2021 at 11:22:20PM -0600, Steve French wrote:
> On Mon, Feb 15, 2021 at 8:10 PM Matthew Wilcox <willy@infradead.org> wrote:
> > The switch from readpages to readahead does help in a couple of corner
> > cases.  For example, if you have two processes reading the same file at
> > the same time, one will now block on the other (due to the page lock)
> > rather than submitting a mess of overlapping and partial reads.
> 
> Do you have a simple repro example of this we could try (fio, dbench, iozone
> etc) to get some objective perf data?

I don't.  The problem was noted by the f2fs people, so maybe they have a
reproducer.

> My biggest worry is making sure that the switch to netfs doesn't degrade
> performance (which might be a low bar now since current network file copy
> perf seems to significantly lag at least Windows), and in some easy to understand
> scenarios want to make sure it actually helps perf.

I had a question about that ... you've mentioned having 4x4MB reads
outstanding as being the way to get optimum performance.  Is there a
significant performance difference between 4x4MB, 16x1MB and 64x256kB?
I'm concerned about having "too large" an I/O on the wire at a given time.
For example, a 1Gbps link carries roughly 125MB/s.  That's a minimum
latency of about 32us for a 4kB page, but about 32ms for a 4MB page.

"For very simple tasks, people can perceive latencies down to 2 ms or less"
(https://danluu.com/input-lag/)
so going all the way to 4MB I/Os takes us well into the perceptible
latency range, whereas a 256kB I/O is only around 2ms.

So could you do some experiments with fio doing direct I/O to see if
it takes significantly longer to do, say, 1TB of I/O in 4MB chunks vs
256kB chunks?  Obviously use threads to keep lots of I/Os outstanding.
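
For instance (an untested sketch - the mount point, test file and job
parameters here are placeholders, not a tuned benchmark):

	# 1TB of sequential direct reads in 4MB chunks, four in flight:
	fio --name=big --filename=/mnt/cifs/testfile --rw=read --direct=1 \
	    --ioengine=libaio --bs=4M --iodepth=4 --size=1T

	# The same volume of I/O in 256kB chunks, 64 in flight:
	fio --name=small --filename=/mnt/cifs/testfile --rw=read --direct=1 \
	    --ioengine=libaio --bs=256k --iodepth=64 --size=1T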



* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-23 20:27         ` Matthew Wilcox
@ 2021-02-24  4:57           ` Steve French
  0 siblings, 0 replies; 47+ messages in thread
From: Steve French @ 2021-02-24  4:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Layton, David Howells, Trond Myklebust, Anna Schumaker,
	Steve French, Dominique Martinet, CIFS, ceph-devel,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski,
	Jaegeuk Kim, Chao Yu, linux-f2fs-devel, Case van Rij

On Tue, Feb 23, 2021 at 2:28 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 15, 2021 at 11:22:20PM -0600, Steve French wrote:
> > On Mon, Feb 15, 2021 at 8:10 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > The switch from readpages to readahead does help in a couple of corner
> > > cases.  For example, if you have two processes reading the same file at
> > > the same time, one will now block on the other (due to the page lock)
> > > rather than submitting a mess of overlapping and partial reads.
> >
> > Do you have a simple repro example of this we could try (fio, dbench, iozone
> > etc) to get some objective perf data?
>
> I don't.  The problem was noted by the f2fs people, so maybe they have a
> reproducer.
>
> > My biggest worry is making sure that the switch to netfs doesn't degrade
> > performance (which might be a low bar now since current network file copy
> > perf seems to significantly lag at least Windows), and in some easy to understand
> > scenarios want to make sure it actually helps perf.
>
> I had a question about that ... you've mentioned having 4x4MB reads
> outstanding as being the way to get optimum performance.  Is there a
> significant performance difference between 4x4MB, 16x1MB and 64x256kB?
> I'm concerned about having "too large" an I/O on the wire at a given time.
> For example, a 1Gbps link carries roughly 125MB/s.  That's a minimum
> latency of about 32us for a 4kB page, but about 32ms for a 4MB page.
>
> "For very simple tasks, people can perceive latencies down to 2 ms or less"
> (https://danluu.com/input-lag/)
> so going all the way to 4MB I/Os takes us well into the perceptible
> latency range, whereas a 256kB I/O is only around 2ms.
>
> So could you do some experiments with fio doing direct I/O to see if
> it takes significantly longer to do, say, 1TB of I/O in 4MB chunks vs
> 256kB chunks?  Obviously use threads to keep lots of I/Os outstanding.

That is a good question, and it has been months since I last ran
experiments along those lines.  Obviously the results will vary
depending on whether RDMA and multichannel are in use, but assuming the
'normal' low-end network configuration - i.e. a 1Gbps link with no RDMA
or multichannel - I could do some more recent experiments.

What I noticed in the past was that server performance for simple
workloads like cp or grep increased with network I/O size up to a point:
anything smaller than a 256KB wire size was bad.  Performance improved
significantly from 256KB to 512KB to 1MB, but only very slightly from
1MB to 2MB to 4MB, and sometimes degraded at 8MB (IIRC 8MB is the
maximum commonly supported by SMB3 servers).  That was with a single
1Gb adapter, though (no multichannel).

But in those examples there wasn't a lot of concurrency on the wire.

I also did some experiments with increasing the readahead size (which
causes cifs.ko to issue more than one async read, though presumably
still with some 'dead time').  That seemed to help the performance of
some sequential read examples (e.g. grep or cp) against some servers,
but I didn't try enough variety of server targets to feel confident
about that change, especially with netfs coming.

e.g. a change I experimented with was from:

         sb->s_bdi->ra_pages = cifs_sb->ctx->rsize / PAGE_SIZE;

to:

         sb->s_bdi->ra_pages = 2 * cifs_sb->ctx->rsize / PAGE_SIZE;
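         /* ra_pages is in units of PAGE_SIZE pages, so (assuming rsize
          * is a multiple of PAGE_SIZE) this doubles the readahead
          * window from one rsize-worth of pages to two. */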

and it did seem to help a little.

I would expect that 8x1MB (i.e. trying to keep eight 1MB reads in
flight) should keep the network mostly busy and not lead to too much
dead time on the server, client or network, and that it is 'good
enough' in many readahead use cases (at least for non-RDMA,
non-multichannel mounts on a slower network) to keep the pipe full.  I
would expect the performance to be similar to the equivalent using 2MB
reads (e.g. 4x2MB), and perhaps better than 2x4MB.  Below a 1MB I/O
size on the wire I would expect to see degradation due to
packet-processing and task-switching overhead.  It would definitely be
worth doing more experimentation here.
-- 
Thanks,

Steve



* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-16  2:10     ` Matthew Wilcox
  2021-02-16  5:18       ` Steve French
  2021-02-16  5:22       ` Steve French
@ 2021-02-24 13:32       ` David Howells
  2021-02-24 15:51         ` Matthew Wilcox
  2 siblings, 1 reply; 47+ messages in thread
From: David Howells @ 2021-02-24 13:32 UTC (permalink / raw)
  To: Steve French
  Cc: dhowells, Matthew Wilcox, Jeff Layton, Trond Myklebust,
	Anna Schumaker, Steve French, Dominique Martinet, CIFS,
	ceph-devel, linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski

Steve French <smfrench@gmail.com> wrote:

> This (readahead behavior improvements in Linux, on single large file
> sequential read workloads like cp or grep) gets particularly interesting
> with SMB3 as multichannel becomes more common.  With one channel, having
> only one readahead request pending on the network is suboptimal - but
> not as bad as when multichannel is negotiated.  Interestingly, in most
> cases two network connections to the same server (different TCP sockets,
> but the same mount, even in cases where there is only one network
> adapter) can achieve better performance - but this still significantly
> lags Windows (and probably other clients), as in Linux we don't keep
> multiple I/Os in flight at one time (unless different files are being
> read at the same time by different threads).

I think it should be relatively straightforward to make the netfs_readahead()
function generate multiple read requests.  If I wasn't handed sufficient pages
by the VM upfront to do two or more read requests, I would need to do extra
expansion.  There are a couple of ways this could be done:

 (1) I could expand the readahead_control after fully starting a read request
     and then create another independent read request, and another for how
     ever many we want.

 (2) I could expand the readahead_control first to cover however many requests
     I'm going to generate, then chop it up into individual read requests.
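
A rough sketch of option (1), using the readahead_expand() function added
by this series - netfs_max_rreqs() and netfs_rreq_submit_from() are
hypothetical stand-ins for whatever the real mechanism would turn out to
be:

	static void netfs_readahead_multi(struct readahead_control *ractl)
	{
		size_t slice = 4 * 1024 * 1024;	/* e.g. one 4MB read */
		unsigned int i;

		for (i = 0; i < netfs_max_rreqs(ractl); i++) {
			/* Fully form and dispatch one read request from
			 * the current window.
			 */
			netfs_rreq_submit_from(ractl);

			/* Then extend the window so the next request has
			 * pages to consume; the VM may decline if pages
			 * can't be allocated or are already cached.
			 */
			readahead_expand(ractl, readahead_pos(ractl),
					 readahead_length(ractl) + slice);
		}
	}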

However, generating larger requests means we're more likely to run into a
problem for the cache: if we can't allocate enough pages to fill out a cache
block, we don't have enough data to write to the cache.  Further, if the pages
are just unlocked and abandoned, readpage will be called to read them
individually - which means they likely won't get cached unless the cache
granularity is PAGE_SIZE.  But that's probably okay if ENOMEM occurred.

There are some other considerations too:

 (*) I would need to query the filesystem to find out if I should create
     another request.  The fs would have to keep track of how many I/O reqs
     are in flight and what the limit is.

 (*) How and where should the readahead triggers be emplaced?  I'm guessing
     that each block would need a trigger and that this should cause more
     requests to be generated until we hit the limit.

 (*) I would probably need to shuffle the request generation for the second
     and subsequent blocks in a single netfs_readahead() call off to a
     worker thread, because netfs_readahead() will probably be running in
     the calling process's kernel-side context, and generating the blocks
     inline would block the application from proceeding and consuming the
     pages already committed.

David




* Re: [PATCH 00/33] Network fs helper library & fscache kiocb API [ver #3]
  2021-02-24 13:32       ` David Howells
@ 2021-02-24 15:51         ` Matthew Wilcox
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2021-02-24 15:51 UTC (permalink / raw)
  To: David Howells
  Cc: Steve French, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steve French, Dominique Martinet, CIFS, ceph-devel,
	linux-cachefs, Alexander Viro, linux-mm, linux-afs,
	v9fs-developer, Christoph Hellwig, linux-fsdevel, linux-nfs,
	Linus Torvalds, David Wysochanski, LKML, William Kucharski

On Wed, Feb 24, 2021 at 01:32:02PM +0000, David Howells wrote:
> Steve French <smfrench@gmail.com> wrote:
> 
> > This (readahead behavior improvements in Linux, on single large file
> > sequential read workloads like cp or grep) gets particularly interesting
> > with SMB3 as multichannel becomes more common.  With one channel, having
> > only one readahead request pending on the network is suboptimal - but
> > not as bad as when multichannel is negotiated.  Interestingly, in most
> > cases two network connections to the same server (different TCP sockets,
> > but the same mount, even in cases where there is only one network
> > adapter) can achieve better performance - but this still significantly
> > lags Windows (and probably other clients), as in Linux we don't keep
> > multiple I/Os in flight at one time (unless different files are being
> > read at the same time by different threads).
> 
> I think it should be relatively straightforward to make the netfs_readahead()
> function generate multiple read requests.  If I wasn't handed sufficient pages
> by the VM upfront to do two or more read requests, I would need to do extra
> expansion.  There are a couple of ways this could be done:

I don't think this is a job for netfs_readahead().  We can get into a
similar situation with SSDs or RAID arrays where ideally we would have
several outstanding readahead requests.

If your drive is connected through a 1Gbps link (eg PCIe gen 1 x1) and
has a latency of 10ms seek time, with one outstanding read, each read
needs to be 12.5MB in size in order to saturate the bus.  If the device
supports 128 outstanding commands, each read need only be 100kB.

We need the core readahead code to handle this situation.  My suggestion
for doing this is to send off an extra readahead request every time we
hit a !Uptodate page.  It looks something like this (assuming the app
is processing the data fast and always hits the !Uptodate case) ...

1. hit 0,
	set readahead size to 64kB,
	mark 32kB as Readahead, send read for 0-64kB
	wait for 0-64kB to complete
2. hit 32kB (Readahead), no reads outstanding
	inc readahead size to 128kB,
	mark 128kB as Readahead, send read for 64k-192kB
3. hit 64kB (!Uptodate), one read outstanding
	mark 256kB as Readahead, send read for 192-320kB
	mark 384kB as Readahead, send read for 320-448kB
	wait for 64-192kB to complete
4. hit 128kB (Readahead), two reads outstanding
	inc readahead size to 256kB,
	mark 576kB as Readahead, send read for 448-704kB
5. hit 192kB (!Uptodate), three reads outstanding
	mark 832kB as Readahead, send read for 704-960kB
	mark 1088kB as Readahead, send read for 960-1216kB
	wait for 192-320kB to complete
6. hit 256kB (Readahead), four reads outstanding
	mark 1344kB as Readahead, send read for 1216-1472kB
7. hit 320kB (!Uptodate), five reads outstanding
	mark 1600kB as Readahead, send read for 1472-1728kB
	mark 1856kB as Readahead, send read for 1728-1984kB
	wait for 320-448kB to complete
8. hit 384kB (Readahead), six reads outstanding
	mark 2112kB as Readahead, send read for 1984-2240kB
9. hit 448kB (!Uptodate), seven reads outstanding
	mark 2368kB as Readahead, send read for 2240-2496kB
	mark 2624kB as Readahead, send read for 2496-2752kB
	wait for 448-704kB to complete
10. hit 576kB (Readahead), eight reads outstanding
	mark 2880kB as Readahead, send read for 2752-3008kB

...

Once we stop hitting !Uptodate pages, we'll maintain the number of pages
marked as Readahead, and thus keep the number of readahead requests
at the level it determined was necessary to keep the link saturated.
I think we may need to put a parallelism cap in the bdi so that a device
which is just slow instead of at the end of a long fat pipe doesn't get
overwhelmed with requests.
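
As a sketch, in the buffered read path that scheme might reduce to
something like the fragment below.  PageReadahead(), PageUptodate() and
page_cache_async_readahead() all exist today; the extra-kick helper is
invented for illustration:

	/* Hitting the PG_readahead marker sustains the current level
	 * of parallelism.
	 */
	if (PageReadahead(page))
		page_cache_async_readahead(mapping, ra, filp, page,
					   index, last_index - index);

	/* A page that isn't Uptodate means we outran the I/O, so raise
	 * parallelism by kicking off one extra chunk (hypothetical
	 * helper) before waiting as the read path already does.
	 */
	if (!PageUptodate(page)) {
		page_cache_extra_readahead(mapping, ra, filp, index);
		wait_on_page_locked_killable(page);
	}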


