* [RFC PATCH 00/12] netfs: Experimental write helpers, fscrypt and compression
@ 2021-07-21 13:44 David Howells
  2021-07-21 13:44 ` [RFC PATCH 01/12] afs: Sort out symlink reading David Howells
                   ` (13 more replies)
  0 siblings, 14 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel


Hi all,

I've been working on extending the netfs helper library to provide write
support (even VM support) for the filesystems that want to use it, with an
eye to building in transparent content crypto support (eg. fscrypt) - so
that content-encrypted data is stored in fscache in encrypted form - and
also client-side compression (something that cifs/smb supports, I believe,
and something that afs may acquire in the future).

This brings interesting issues with PAGE_SIZE potentially being smaller
than the I/O block size, and thus having to make sure pages that aren't
locally modified stay retained.  Note that whilst folios could, in theory,
help here, a folio requires contiguous RAM.

So here are the changes I have so far (WARNING: they're experimental, may
contain debugging stuff, notes and extra bits, and aren't fully implemented
yet).  The changes can also be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-regions-experimental

With this, I can do simple reads and writes through afs, and the modified
data can be written encrypted to both the server and the cache (though I
haven't yet written the decryption side of it).


REGION LISTS
============

One of the things that I've been experimenting with is keeping a list of
dirty regions on each inode separate from the dirty bits on the page.
Region records can then be used to manage the spans of pages required to
construct a crypto or compression buffer.
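
As a concrete illustration, the sort of per-inode record this implies looks
something like the following.  The field names here are made up for the
example - the actual struct netfs_dirty_region in the patches carries rather
more state (type, flush group, bounding box and so on):

	struct example_dirty_region {
		struct list_head dirty_link;	/* Link in inode's dirty-region list */
		loff_t		start;		/* First dirtied byte */
		loff_t		end;		/* One past the last dirtied byte */
		pgoff_t		bound_first;	/* First page spanned by the I/O block */
		pgoff_t		bound_last;	/* Last page spanned by the I/O block */
		refcount_t	ref;
	};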

With these records in place, other possibilities become available:

 (1) I can use a separate list of regions to keep track of the pending and
     active writes to an inode.  This allows the inode lock to be dropped
     as soon as the region is added, with the record acting as a region
     lock.  (Truncate/fallocate is then just a special sort of direct write
     operation).

 (2) Keeping a list of active writes allows parallel non-overlapping[*]
     write operations to the pagecache and possibly parallel DIO write
     operations to the server (I think cifs to Windows allows this).

     [*] Non-overlapping in the sense that, under some circumstances, they
     	 aren't even allowed to touch the same page/folio.

 (3) After downloading data from the server, we may need to write it to the
     cache.  This can be deferred to the VM writeback mechanism by adding a
     'semi-dirty' region and marking the pages dirty.

 (4) Regions can be grouped so that the groups have to be flushed in order,
     thereby allowing ceph snaps and fsync to be implemented by the same
     mechanism and offering a possible RWF_BARRIER for pwritev2().

 (5) No need for write_begin/write_end in the filesystem - this is handled
     using the region in netfs_perform_write().

 (6) PG_fscache is no longer required as the region record can track this.

 (7) page->private isn't needed to track dirty state.

I also keep a flush list of regions that need writing.  If we can't manage
to lock all the pages in a part of the region we want to change, we drop
the locks we've taken and defer.  Since the following should be true:

 - since a dirty region represents data in the pagecache that needs
   writing, the pages containing that data must be present in RAM;

 - a region in the flushing state acts as an exclusive lock against
   overlapping active writers (which must wait for it);

 - the ->releasepage() and ->migratepage() methods can be used to prevent
   the page from being lost

it might be feasible to access the page *without* taking the page lock.

The flusher can split an active region in order to write out part of it,
provided it does so at a page boundary that's at or less than the dirtied
point.  This prevents an in-progress write pinning the entirety of memory.
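
A minimal sketch of that split rule (the helper name is invented for the
example):

	/* Pick a page-aligned split point at or below the point dirtied so
	 * far, so that the head of an in-progress write can be flushed
	 * without waiting for the write to finish.
	 */
	static loff_t example_split_point(loff_t region_start, loff_t dirtied_to)
	{
		loff_t split = round_down(dirtied_to, PAGE_SIZE);

		return split > region_start ? split : 0; /* 0: nothing to split off yet */
	}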

An alternative to using region records that I'm pondering is to pull the
NFS code for page handling into the netfs lib.  I'm not sure that this
would make it easier to handle multipage spans, though, as releasepage
would need to look at the pages either side.  "Regions" would also be
concocted on the fly by writepages() - but, again, this may require the
involvement of other pages so I would have to be extremely careful of
deadlock.


PROVISION OF BUFFERING
======================

Another of the things I'm experimenting with is sticking buffers in xarray
form in the read and write request structs.

On the read side, this allows a buffer larger than the requested size to be
employed, with the option to discard the excess data or splice it over into
the pagecache - for instance if we get a compressed blob that we don't know
the size of yet or that is larger than the hole we have available in the
pagecache.  A second buffer can be employed to decrypt into, or decryption
can be done in place, depending on whether we want to copy the encrypted
data to the pagecache.
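
The reads then target the buffer rather than the mapping directly.  A
minimal sketch, using the iov_iter_xarray() call that the patches below rely
on, with the page set-up elided (pos and count are the file position and the
buffer size):

	struct xarray buffer;	/* Bounce buffer; pages attached as needed */
	struct iov_iter iter;

	xa_init(&buffer);
	/* ... attach the target pagecache pages plus any padding pages ... */
	iov_iter_xarray(&iter, READ, &buffer, pos, count);
	/* The netfs or the cache reads into 'iter'; excess data can then be
	 * spliced into pagecache holes or discarded.
	 */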

On the write side, this can be used to encrypt into, with the buffer then
being written to the cache and the server rather than the original.  If
compression is involved, we might want two buffers: we might need to copy
the original into the first buffer so that it doesn't change during
compression, then compress into the second buffer (which could then be
encrypted - if that makes sense).

With regard to DIO, if crypto is required, the helpers would copy the data
in or out of separate buffers, crypting the buffers and uploading or
downloading the buffers to/from the server.  I could even make it handle
RMW for smaller writes, but that needs care because of the possibility of
colliding with conflicting changes made remotely.


HOW NETFSLIB WOULD BE USED
==========================

In the scheme I'm experimenting with, I envision that a filesystem would
add a netfs context directly after its inode, e.g.:

	struct afs_vnode {
		struct {
			struct inode	vfs_inode;
			struct netfs_i_context netfs_ctx;
		};
		...
	};

and then point many of its inode, address space and VM methods directly at
netfslib, e.g.:

	const struct file_operations afs_file_operations = {
		.open		= afs_open,
		.release	= afs_release,
		.llseek		= generic_file_llseek,
		.read_iter	= generic_file_read_iter,
		.write_iter	= netfs_file_write_iter,
		.mmap		= afs_file_mmap,
		.splice_read	= generic_file_splice_read,
		.splice_write	= iter_file_splice_write,
		.fsync		= netfs_fsync,
		.lock		= afs_lock,
		.flock		= afs_flock,
	};

	const struct address_space_operations afs_file_aops = {
		.readpage	= netfs_readpage,
		.readahead	= netfs_readahead,
		.releasepage	= netfs_releasepage,
		.invalidatepage	= netfs_invalidatepage,
		.writepage	= netfs_writepage,
		.writepages	= netfs_writepages,
	};

	static const struct vm_operations_struct afs_vm_ops = {
		.fault		= filemap_fault,
		.map_pages	= filemap_map_pages,
		.page_mkwrite	= netfs_page_mkwrite,
	};

though it can, of course, wrap them if it needs to.  The inode context
stores any required caching cookie, crypto management parameters and an
operations table.
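
Setting the context up is then a one-liner in the filesystem's inode
initialisation path - roughly what the afs conversion in patch 05 below
does:

	netfs_i_context_init(inode, &afs_req_ops);
	netfs_i_context(inode)->cache = cookie;	/* fscache cookie, if caching */

	/* And, for example, on open(O_TRUNC): */
	set_bit(NETFS_ICTX_NEW_CONTENT, &netfs_i_context(inode)->flags);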

The netfs lib would be providing helpers for write_iter, page_mkwrite,
writepage, writepages, fsync, truncation and remote invalidation - the idea
being that the filesystem then just needs to provide hooks to perform read
and write RPC operations plus other optional hooks for the maintenance of
state and to help manage grouping, shaping and slicing I/O operations and
doing content crypto, e.g.:

	const struct netfs_request_ops afs_req_ops = {
		.init_rreq		= afs_init_rreq,
		.begin_cache_operation	= afs_begin_cache_operation,
		.check_write_begin	= afs_check_write_begin,
		.issue_op		= afs_req_issue_op,
		.cleanup		= afs_priv_cleanup,
		.init_dirty_region	= afs_init_dirty_region,
		.free_dirty_region	= afs_free_dirty_region,
		.update_i_size		= afs_update_i_size,
		.init_wreq		= afs_init_wreq,
		.add_write_streams	= afs_add_write_streams,
		.encrypt_block		= afs_encrypt_block,
	};
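
For illustration, the read hook then amounts to little more than marshalling
what netfslib hands over and dispatching the RPC.  This is afs's issue_op,
abridged from the patches below (the request allocation and its failure path
are elided):

	static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
	{
		struct afs_vnode *vnode = AFS_FS_I(subreq->rreq->inode);
		struct afs_read *fsreq;

		/* ... allocate fsreq, calling netfs_subreq_terminated() on failure ... */

		fsreq->subreq	= subreq;
		fsreq->pos	= subreq->start + subreq->len - iov_iter_count(&subreq->iter);
		fsreq->len	= iov_iter_count(&subreq->iter);
		fsreq->key	= subreq->rreq->netfs_priv;
		fsreq->vnode	= vnode;
		fsreq->iter	= &subreq->iter;

		afs_fetch_data(fsreq->vnode, fsreq);
	}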


SERVICES THE HELPERS WOULD PROVIDE
==================================

The helpers are intended to transparently provide a number of services to
all the filesystems that want to use them:

 (1) Handling of multipage folios.  The helpers provide iov_iters to the
     filesystem indicating the pages to be read/written.  These may point
     into the pagecache, may point to userspace for unencrypted DIO or may
     point to a separate buffer for cryption/compression.  The fs doesn't
     see any pages/folios unless it wants to.

 (2) Handling of content encryption (e.g. fscrypt).  Encrypted data should
     be encrypted in fscache.  The helpers will write the downloaded
     encrypted data to the cache and will write modified data to the cache
     after it has been encrypted.

     The filesystem will provide the actual crypto, though the helpers can
     do the block-by-block iteration and setting up of scatterlists.  The
     intention is that if fscrypt is being used, the helper will be there.

 (3) Handling of compressed data.  If the data is stored in compressed
     blocks on the server, with the client doing the (de)compression
     locally, the support for handling that is similar to the crypto
     support.

     The helpers will provide the buffers and the filesystem will provide
     the compression, though the filesystem can expand the buffers as
     needed.

 (4) Handling of I/O block sizes larger than page size.  If the filesystem
     needs to perform a block RPC I/O that's larger than page size - say it
     has to deal with full-file crypto or a large compression blocksize -
     the helpers will retain and gather together enough pages to form the
     larger units needed to handle writes.

     For a read of a larger block size, the helpers create a buffer of the
     required size, pad it with extra pages as necessary and read into
     that.  The extra pages can then be spliced into holes in the pagecache
     rather than being discarded.

 (5) Handling of direct I/O.  The helpers will break down DIO requests into
     slices based on the rsize/wsize and can also do content crypto and
     (de)compression on the data.

     In the encrypted case, I would, initially at least, make it so that
     the DIO blocksize is set to a multiple of the crypto blocksize (see
     the sketch after this list).  I could allow it to be smaller: when
     reading, I can just discard the excess, but on writing I would need to
     implement some sort of RMW cycle.

 (6) Handling of remote invalidation.  The helpers would be able to operate
     in a number of modes when local modifications exist:

	- discard all local changes
	- download the new version and reapply local changes
	- keep local version and overwrite server version
	- stash local version and replace with new version

 (7) Handling of disconnected operation.  Given feedback from the
     filesystem to indicate when we're in disconnected operation, the
     helpers would save modified data only to the cache, along with a list
     of modified regions.

     Upon reconnection, we would need to sync back to the server - and then
     the handling of remote invalidation would apply when we hit a conflict.
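
As a concrete example of the rounding mentioned in (5), constraining a DIO
slice might look something like this (the helper is invented for the
illustration and assumes a power-of-two crypto blocksize):

	static size_t example_dio_slice(size_t remaining, size_t wsize,
					size_t crypto_bsize)
	{
		size_t slice = min(remaining, wsize);

		/* Trim to a whole number of crypto blocks so no RMW is needed */
		return round_down(slice, crypto_bsize);
	}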


THE CHANGES I'VE MADE SO FAR
============================

The attached patches make a partial attempt at the above and partially
convert the afs filesystem to use them.  It is by no means complete,
however, and almost certainly contains bugs beyond the bits not yet wholly
implemented.

To this end:

 (1) struct netfs_dirty_region defines a region.  This is primarily used to
     track which portions of an inode's pagecache are dirty and in what
     manner.  Not all dirty regions are equal.

 (2) Contiguous dirty regions may be mergeable or one may supersede part of
     another (a local modification supersedes a download), depending on
     type, state and other stuff.  netfs_merge_dirty_region() deals with
     this.

 (3) A read from the server will generate a semi-dirty region that is to be
     written to the cache only.  Such writes to the cache are then driven
     by the VM, no longer being dispatched automatically on completion of
     the read.

 (4) DSYNC writes supersede ordinary writes, may not be merged and are
     flushed immediately.  The writer then waits for that region to finish
     flushing.  (Untested)

 (5) Every region belongs to a flush group.  This provides the opportunity
     for writes to be grouped and for the groups to be flushed in order.
     netfs_flush_region() will flush older regions.  (Untested)

 (6) The netfs_dirty_region struct is used to manage write operations on an
     inode.  The inode has two lists for this: pending writes and active
     writes.  A write request is initially put onto the pending list until
     the region it wishes to modify becomes free of active writes, then
     it's moved to the active list.

     Writes on the active list are not allowed to overlap in their reserved
     regions.  This acts as a region lock, allowing the inode lock to be
     dropped immediately after the record is queued.

 (7) Each region has a bounding box that indicates where the start and end
     of the pages involved are.  The bounding box is expanded to fit
     crypto, compression and cache blocksize requirements (a minimal
     rounding sketch follows this list).

     Incompatible writes are not allowed to share bounding boxes (e.g. DIO
     writes may not overlap with other writes as the pagecache needs
     invalidation thereafter).

     This is extra complicated with the advent of THPs/multipage folios, as
     the page boundaries then become variable.

     It might make sense to keep track of partially invalid regions too and
     require them to be downloaded before allowing them to be read.

 (8) An active write is not permitted to proceed until any flushing regions
     it overlaps with are complete.  At that point, it is also added to the
     dirty list.  As it progresses, its dirty region is expanded and the
     writeback manager may split off part of that to make space.  Once it
     is complete, it becomes an ordinary dirty region (if not DIO).

 (9) When a writeback of part of a region occurs, pages in the bounding box
     may be pinned as well as pages containing the modifications as
     necessary to perform crypto/compression.

(10) We then have the situation where a page may be holding modifications
     from different dirty regions.  Under some circumstances (such as the
     file being freshly created locally), these will be merged, bridging
     the gaps with zeros.

     However, if such regions cannot be merged, if we write out one region,
     we have to be careful not to clear the dirty mark on the page if
     there's another dirty region on it.  Similarly, the writeback mark
     might need maintaining after a region completes writing.

     Note that a 'page' might actually be a multipage folio and could be
     quite large - possibly multiple megabytes.

(11) writepage() is an issue.  The VM might call us to ask that a page in
     the middle of a dirty region be flushed.  However, the page is locked
     by the caller and we might need pages from either side (which might
     also be locked) to actually perform the write.

     What I'm thinking of here is to have netfs_writepage() find the dirty
     region(s) contributory to a dirty page and put them on the flush queue
     and then return to the VM saying it couldn't be done at this time.
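
Referring back to (7), the bounding-box expansion is essentially a matter of
rounding the dirtied byte range out to the relevant block boundaries and
taking the pages those cover.  A minimal sketch (the helper is invented for
the illustration and assumes a power-of-two blocksize; dirty_end is
exclusive and *last comes back inclusive):

	static void example_expand_bounds(pgoff_t *first, pgoff_t *last,
					  loff_t dirty_start, loff_t dirty_end,
					  size_t blocksize)
	{
		*first = round_down(dirty_start, blocksize) >> PAGE_SHIFT;
		*last  = (round_up(dirty_end, blocksize) - 1) >> PAGE_SHIFT;
	}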


David

Proposals/information about previous parts of the design have been
published here:

Link: https://lore.kernel.org/r/24942.1573667720@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/2758811.1610621106@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/1441311.1598547738@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160655.1611012999@warthog.procyon.org.uk/

v5 of the read helper patches was here:

Link: https://lore.kernel.org/r/161653784755.2770958.11820491619308713741.stgit@warthog.procyon.org.uk/


---
David Howells (12):
      afs: Sort out symlink reading
      netfs: Add an iov_iter to the read subreq for the network fs/cache to use
      netfs: Remove netfs_read_subrequest::transferred
      netfs: Use a buffer in netfs_read_request and add pages to it
      netfs: Add a netfs inode context
      netfs: Keep lists of pending, active, dirty and flushed regions
      netfs: Initiate write request from a dirty region
      netfs: Keep dirty mark for pages with more than one dirty region
      netfs: Send write request to multiple destinations
      netfs: Do encryption in write preparatory phase
      netfs: Put a list of regions in /proc/fs/netfs/regions
      netfs: Export some read-request ref functions


 fs/afs/callback.c            |   2 +-
 fs/afs/dir.c                 |   2 +-
 fs/afs/dynroot.c             |   1 +
 fs/afs/file.c                | 193 ++------
 fs/afs/inode.c               |  25 +-
 fs/afs/internal.h            |  27 +-
 fs/afs/super.c               |   9 +-
 fs/afs/write.c               | 397 ++++-----------
 fs/ceph/addr.c               |   2 +-
 fs/netfs/Makefile            |  11 +-
 fs/netfs/dio_helper.c        | 140 ++++++
 fs/netfs/internal.h          | 104 ++++
 fs/netfs/main.c              | 104 ++++
 fs/netfs/objects.c           | 218 +++++++++
 fs/netfs/read_helper.c       | 460 ++++++++++++-----
 fs/netfs/stats.c             |  22 +-
 fs/netfs/write_back.c        | 592 ++++++++++++++++++++++
 fs/netfs/write_helper.c      | 924 +++++++++++++++++++++++++++++++++++
 fs/netfs/write_prep.c        | 160 ++++++
 fs/netfs/xa_iterator.h       | 116 +++++
 include/linux/netfs.h        | 273 ++++++++++-
 include/trace/events/netfs.h | 325 +++++++++++-
 22 files changed, 3488 insertions(+), 619 deletions(-)
 create mode 100644 fs/netfs/dio_helper.c
 create mode 100644 fs/netfs/main.c
 create mode 100644 fs/netfs/objects.c
 create mode 100644 fs/netfs/write_back.c
 create mode 100644 fs/netfs/write_helper.c
 create mode 100644 fs/netfs/write_prep.c
 create mode 100644 fs/netfs/xa_iterator.h




* [RFC PATCH 01/12] afs: Sort out symlink reading
  2021-07-21 13:44 David Howells
@ 2021-07-21 13:44 ` David Howells
  2021-07-21 16:20   ` Jeff Layton
  2021-07-26  9:44   ` David Howells
  2021-07-21 13:44 ` [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use David Howells
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

afs_readpage() doesn't get a file pointer when called for a symlink, so
split symlink reading out into its own readpage op rather than sharing the
regular-file one.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c     |   14 +++++++++-----
 fs/afs/inode.c    |    6 +++---
 fs/afs/internal.h |    3 ++-
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index ca0d993add65..c9c21ad0e7c9 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -19,6 +19,7 @@
 
 static int afs_file_mmap(struct file *file, struct vm_area_struct *vma);
 static int afs_readpage(struct file *file, struct page *page);
+static int afs_symlink_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned int offset,
 			       unsigned int length);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
@@ -46,7 +47,7 @@ const struct inode_operations afs_file_inode_operations = {
 	.permission	= afs_permission,
 };
 
-const struct address_space_operations afs_fs_aops = {
+const struct address_space_operations afs_file_aops = {
 	.readpage	= afs_readpage,
 	.readahead	= afs_readahead,
 	.set_page_dirty	= afs_set_page_dirty,
@@ -60,6 +61,12 @@ const struct address_space_operations afs_fs_aops = {
 	.writepages	= afs_writepages,
 };
 
+const struct address_space_operations afs_symlink_aops = {
+	.readpage	= afs_symlink_readpage,
+	.releasepage	= afs_releasepage,
+	.invalidatepage	= afs_invalidatepage,
+};
+
 static const struct vm_operations_struct afs_vm_ops = {
 	.fault		= filemap_fault,
 	.map_pages	= filemap_map_pages,
@@ -321,7 +328,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
 	afs_fetch_data(fsreq->vnode, fsreq);
 }
 
-static int afs_symlink_readpage(struct page *page)
+static int afs_symlink_readpage(struct file *file, struct page *page)
 {
 	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
 	struct afs_read *fsreq;
@@ -386,9 +393,6 @@ const struct netfs_read_request_ops afs_req_ops = {
 
 static int afs_readpage(struct file *file, struct page *page)
 {
-	if (!file)
-		return afs_symlink_readpage(page);
-
 	return netfs_readpage(file, page, &afs_req_ops, NULL);
 }
 
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index bef6f5ccfb09..cf7b66957c6f 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -105,7 +105,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
 		inode->i_mode	= S_IFREG | (status->mode & S_IALLUGO);
 		inode->i_op	= &afs_file_inode_operations;
 		inode->i_fop	= &afs_file_operations;
-		inode->i_mapping->a_ops	= &afs_fs_aops;
+		inode->i_mapping->a_ops	= &afs_file_aops;
 		break;
 	case AFS_FTYPE_DIR:
 		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);
@@ -123,11 +123,11 @@ static int afs_inode_init_from_status(struct afs_operation *op,
 			inode->i_mode	= S_IFDIR | 0555;
 			inode->i_op	= &afs_mntpt_inode_operations;
 			inode->i_fop	= &afs_mntpt_file_operations;
-			inode->i_mapping->a_ops	= &afs_fs_aops;
+			inode->i_mapping->a_ops	= &afs_symlink_aops;
 		} else {
 			inode->i_mode	= S_IFLNK | status->mode;
 			inode->i_op	= &afs_symlink_inode_operations;
-			inode->i_mapping->a_ops	= &afs_fs_aops;
+			inode->i_mapping->a_ops	= &afs_symlink_aops;
 		}
 		inode_nohighmem(inode);
 		break;
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 791cf02e5696..ccdde00ada8a 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -1050,7 +1050,8 @@ extern void afs_dynroot_depopulate(struct super_block *);
 /*
  * file.c
  */
-extern const struct address_space_operations afs_fs_aops;
+extern const struct address_space_operations afs_file_aops;
+extern const struct address_space_operations afs_symlink_aops;
 extern const struct inode_operations afs_file_inode_operations;
 extern const struct file_operations afs_file_operations;
 extern const struct netfs_read_request_ops afs_req_ops;




* [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use
  2021-07-21 13:44 David Howells
  2021-07-21 13:44 ` [RFC PATCH 01/12] afs: Sort out symlink reading David Howells
@ 2021-07-21 13:44 ` David Howells
  2021-07-21 17:16   ` Jeff Layton
  2021-07-21 17:20   ` David Howells
  2021-07-21 13:45 ` [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred David Howells
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Add an iov_iter to the read subrequest and set it up to define the
destination buffer to write into.  This will allow future patches to point
to a bounce buffer instead for the purposes of handling oversize reads,
decryption (where we want to save the encrypted data to the cache) and
decompression.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c          |    6 +-----
 fs/netfs/read_helper.c |    5 ++++-
 include/linux/netfs.h  |    2 ++
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index c9c21ad0e7c9..ca529f23515a 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -319,11 +319,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
 	fsreq->len	= subreq->len   - subreq->transferred;
 	fsreq->key	= subreq->rreq->netfs_priv;
 	fsreq->vnode	= vnode;
-	fsreq->iter	= &fsreq->def_iter;
-
-	iov_iter_xarray(&fsreq->def_iter, READ,
-			&fsreq->vnode->vfs_inode.i_mapping->i_pages,
-			fsreq->pos, fsreq->len);
+	fsreq->iter	= &subreq->iter;
 
 	afs_fetch_data(fsreq->vnode, fsreq);
 }
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 0b6cd3b8734c..715f3e9c380d 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -150,7 +150,7 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
 {
 	struct iov_iter iter;
 
-	iov_iter_xarray(&iter, WRITE, &subreq->rreq->mapping->i_pages,
+	iov_iter_xarray(&iter, READ, &subreq->rreq->mapping->i_pages,
 			subreq->start + subreq->transferred,
 			subreq->len   - subreq->transferred);
 	iov_iter_zero(iov_iter_count(&iter), &iter);
@@ -745,6 +745,9 @@ netfs_rreq_prepare_read(struct netfs_read_request *rreq,
 	if (WARN_ON(subreq->len == 0))
 		source = NETFS_INVALID_READ;
 
+	iov_iter_xarray(&subreq->iter, READ, &rreq->mapping->i_pages,
+			subreq->start, subreq->len);
+
 out:
 	subreq->source = source;
 	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index fe9887768292..5e4fafcc9480 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/uio.h>
 
 /*
  * Overload PG_private_2 to give us PG_fscache - this is used to indicate that
@@ -112,6 +113,7 @@ struct netfs_cache_resources {
 struct netfs_read_subrequest {
 	struct netfs_read_request *rreq;	/* Supervising read request */
 	struct list_head	rreq_link;	/* Link in rreq->subrequests */
+	struct iov_iter		iter;		/* Iterator for this subrequest */
 	loff_t			start;		/* Where to start the I/O */
 	size_t			len;		/* Size of the I/O */
 	size_t			transferred;	/* Amount of data transferred */




* [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred
  2021-07-21 13:44 David Howells
  2021-07-21 13:44 ` [RFC PATCH 01/12] afs: Sort out symlink reading David Howells
  2021-07-21 13:44 ` [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use David Howells
@ 2021-07-21 13:45 ` David Howells
  2021-07-21 17:43   ` Jeff Layton
  2021-07-21 18:54   ` David Howells
  2021-07-21 13:45 ` [RFC PATCH 04/12] netfs: Use a buffer in netfs_read_request and add pages to it David Howells
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Remove netfs_read_subrequest::transferred as it's now redundant: the
remaining count on the iterator added to the subrequest can be used instead.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c                |    4 ++--
 fs/netfs/read_helper.c       |   26 ++++----------------------
 include/linux/netfs.h        |    1 -
 include/trace/events/netfs.h |   12 ++++++------
 4 files changed, 12 insertions(+), 31 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index ca529f23515a..82e945dbe379 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -315,8 +315,8 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
 		return netfs_subreq_terminated(subreq, -ENOMEM, false);
 
 	fsreq->subreq	= subreq;
-	fsreq->pos	= subreq->start + subreq->transferred;
-	fsreq->len	= subreq->len   - subreq->transferred;
+	fsreq->pos	= subreq->start + subreq->len - iov_iter_count(&subreq->iter);
+	fsreq->len	= iov_iter_count(&subreq->iter);
 	fsreq->key	= subreq->rreq->netfs_priv;
 	fsreq->vnode	= vnode;
 	fsreq->iter	= &subreq->iter;
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 715f3e9c380d..5e1a9be48130 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -148,12 +148,7 @@ static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq,
  */
 static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
 {
-	struct iov_iter iter;
-
-	iov_iter_xarray(&iter, READ, &subreq->rreq->mapping->i_pages,
-			subreq->start + subreq->transferred,
-			subreq->len   - subreq->transferred);
-	iov_iter_zero(iov_iter_count(&iter), &iter);
+	iov_iter_zero(iov_iter_count(&subreq->iter), &subreq->iter);
 }
 
 static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error,
@@ -173,14 +168,9 @@ static void netfs_read_from_cache(struct netfs_read_request *rreq,
 				  bool seek_data)
 {
 	struct netfs_cache_resources *cres = &rreq->cache_resources;
-	struct iov_iter iter;
 
 	netfs_stat(&netfs_n_rh_read);
-	iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages,
-			subreq->start + subreq->transferred,
-			subreq->len   - subreq->transferred);
-
-	cres->ops->read(cres, subreq->start, &iter, seek_data,
+	cres->ops->read(cres, subreq->start, &subreq->iter, seek_data,
 			netfs_cache_read_terminated, subreq);
 }
 
@@ -419,7 +409,7 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
 			if (pgend < iopos + subreq->len)
 				break;
 
-			account += subreq->transferred;
+			account += subreq->len - iov_iter_count(&subreq->iter);
 			iopos += subreq->len;
 			if (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
 				subreq = list_next_entry(subreq, rreq_link);
@@ -635,15 +625,8 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
 		goto failed;
 	}
 
-	if (WARN(transferred_or_error > subreq->len - subreq->transferred,
-		 "Subreq overread: R%x[%x] %zd > %zu - %zu",
-		 rreq->debug_id, subreq->debug_index,
-		 transferred_or_error, subreq->len, subreq->transferred))
-		transferred_or_error = subreq->len - subreq->transferred;
-
 	subreq->error = 0;
-	subreq->transferred += transferred_or_error;
-	if (subreq->transferred < subreq->len)
+	if (iov_iter_count(&subreq->iter))
 		goto incomplete;
 
 complete:
@@ -667,7 +650,6 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
 incomplete:
 	if (test_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags)) {
 		netfs_clear_unread(subreq);
-		subreq->transferred = subreq->len;
 		goto complete;
 	}
 
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 5e4fafcc9480..45d40c622205 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -116,7 +116,6 @@ struct netfs_read_subrequest {
 	struct iov_iter		iter;		/* Iterator for this subrequest */
 	loff_t			start;		/* Where to start the I/O */
 	size_t			len;		/* Size of the I/O */
-	size_t			transferred;	/* Amount of data transferred */
 	refcount_t		usage;
 	short			error;		/* 0 or error that occurred */
 	unsigned short		debug_index;	/* Index in list (for debugging output) */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 4d470bffd9f1..04ac29fc700f 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -190,7 +190,7 @@ TRACE_EVENT(netfs_sreq,
 		    __field(enum netfs_read_source,	source		)
 		    __field(enum netfs_sreq_trace,	what		)
 		    __field(size_t,			len		)
-		    __field(size_t,			transferred	)
+		    __field(size_t,			remain		)
 		    __field(loff_t,			start		)
 			     ),
 
@@ -202,7 +202,7 @@ TRACE_EVENT(netfs_sreq,
 		    __entry->source	= sreq->source;
 		    __entry->what	= what;
 		    __entry->len	= sreq->len;
-		    __entry->transferred = sreq->transferred;
+		    __entry->remain	= iov_iter_count(&sreq->iter);
 		    __entry->start	= sreq->start;
 			   ),
 
@@ -211,7 +211,7 @@ TRACE_EVENT(netfs_sreq,
 		      __print_symbolic(__entry->what, netfs_sreq_traces),
 		      __print_symbolic(__entry->source, netfs_sreq_sources),
 		      __entry->flags,
-		      __entry->start, __entry->transferred, __entry->len,
+		      __entry->start, __entry->len - __entry->remain, __entry->len,
 		      __entry->error)
 	    );
 
@@ -230,7 +230,7 @@ TRACE_EVENT(netfs_failure,
 		    __field(enum netfs_read_source,	source		)
 		    __field(enum netfs_failure,		what		)
 		    __field(size_t,			len		)
-		    __field(size_t,			transferred	)
+		    __field(size_t,			remain		)
 		    __field(loff_t,			start		)
 			     ),
 
@@ -242,7 +242,7 @@ TRACE_EVENT(netfs_failure,
 		    __entry->source	= sreq ? sreq->source : NETFS_INVALID_READ;
 		    __entry->what	= what;
 		    __entry->len	= sreq ? sreq->len : 0;
-		    __entry->transferred = sreq ? sreq->transferred : 0;
+		    __entry->remain	= sreq ? iov_iter_count(&sreq->iter) : 0;
 		    __entry->start	= sreq ? sreq->start : 0;
 			   ),
 
@@ -250,7 +250,7 @@ TRACE_EVENT(netfs_failure,
 		      __entry->rreq, __entry->index,
 		      __print_symbolic(__entry->source, netfs_sreq_sources),
 		      __entry->flags,
-		      __entry->start, __entry->transferred, __entry->len,
+		      __entry->start, __entry->len - __entry->remain, __entry->len,
 		      __print_symbolic(__entry->what, netfs_failures),
 		      __entry->error)
 	    );




* [RFC PATCH 04/12] netfs: Use a buffer in netfs_read_request and add pages to it
  2021-07-21 13:44 David Howells
                   ` (2 preceding siblings ...)
  2021-07-21 13:45 ` [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred David Howells
@ 2021-07-21 13:45 ` David Howells
  2021-07-21 13:45 ` [RFC PATCH 05/12] netfs: Add a netfs inode context David Howells
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Add an "output" buffer to the netfs_read_request struct.  This is an xarray
to which the intended destination pages can be added, supplemented by
additional pages to bring the buffer up to a sufficient size to be the
output for an overlarge read, decryption and/or decompression.

The readahead_expand() function will only expand the requested pageset up
to a point where it runs into an already extant page at either end - which
means that the resulting buffer might not be large enough or may be
misaligned for our purposes.

With this, we can make sure we have a useful buffer and we can splice the
extra pages from it into the pagecache if there are holes we can plug.

The read buffer could also be useful in the future to perform RMW cycles
when fixing up after disconnected operation or direct I/O with
smaller-than-preferred granularity.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/netfs/read_helper.c |  166 ++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/netfs.h  |    1 
 2 files changed, 154 insertions(+), 13 deletions(-)

diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 5e1a9be48130..b03bc5b0da5a 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -28,6 +28,7 @@ module_param_named(debug, netfs_debug, uint, S_IWUSR | S_IRUGO);
 MODULE_PARM_DESC(netfs_debug, "Netfs support debugging mask");
 
 static void netfs_rreq_work(struct work_struct *);
+static void netfs_rreq_clear_buffer(struct netfs_read_request *);
 static void __netfs_put_subrequest(struct netfs_read_subrequest *, bool);
 
 static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
@@ -51,6 +52,7 @@ static struct netfs_read_request *netfs_alloc_read_request(
 		rreq->inode	= file_inode(file);
 		rreq->i_size	= i_size_read(rreq->inode);
 		rreq->debug_id	= atomic_inc_return(&debug_ids);
+		xa_init(&rreq->buffer);
 		INIT_LIST_HEAD(&rreq->subrequests);
 		INIT_WORK(&rreq->work, netfs_rreq_work);
 		refcount_set(&rreq->usage, 1);
@@ -90,6 +92,7 @@ static void netfs_free_read_request(struct work_struct *work)
 	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
 	if (rreq->cache_resources.ops)
 		rreq->cache_resources.ops->end_operation(&rreq->cache_resources);
+	netfs_rreq_clear_buffer(rreq);
 	kfree(rreq);
 	netfs_stat_d(&netfs_n_rh_rreq);
 }
@@ -727,7 +730,7 @@ netfs_rreq_prepare_read(struct netfs_read_request *rreq,
 	if (WARN_ON(subreq->len == 0))
 		source = NETFS_INVALID_READ;
 
-	iov_iter_xarray(&subreq->iter, READ, &rreq->mapping->i_pages,
+	iov_iter_xarray(&subreq->iter, READ, &rreq->buffer,
 			subreq->start, subreq->len);
 
 out:
@@ -838,6 +841,133 @@ static void netfs_rreq_expand(struct netfs_read_request *rreq,
 	}
 }
 
+/*
+ * Clear a read buffer, discarding the pages which have XA_MARK_0 set.
+ */
+static void netfs_rreq_clear_buffer(struct netfs_read_request *rreq)
+{
+	struct page *page;
+	XA_STATE(xas, &rreq->buffer, 0);
+
+	rcu_read_lock();
+	xas_for_each_marked(&xas, page, ULONG_MAX, XA_MARK_0) {
+		put_page(page);
+	}
+	rcu_read_unlock();
+	xa_destroy(&rreq->buffer);
+}
+
+static int xa_insert_set_mark(struct xarray *xa, unsigned long index,
+			      void *entry, xa_mark_t mark, gfp_t gfp_mask)
+{
+	int ret;
+
+	xa_lock(xa);
+	ret = __xa_insert(xa, index, entry, gfp_mask);
+	if (ret == 0)
+		__xa_set_mark(xa, index, mark);
+	xa_unlock(xa);
+	return ret;
+}
+
+/*
+ * Create the specified range of pages in the buffer attached to the read
+ * request.  The pages are marked with XA_MARK_0 so that we know that these
+ * need freeing later.
+ */
+static int netfs_rreq_add_pages_to_buffer(struct netfs_read_request *rreq,
+					  pgoff_t index, pgoff_t to, gfp_t gfp_mask)
+{
+	struct page *page;
+	int ret;
+
+	if (to + 1 == index) /* Page range is inclusive */
+		return 0;
+
+	do {
+		page = __page_cache_alloc(gfp_mask);
+		if (!page)
+			return -ENOMEM;
+		page->index = index;
+		ret = xa_insert_set_mark(&rreq->buffer, index, page, XA_MARK_0,
+					 gfp_mask);
+		if (ret < 0) {
+			__free_page(page);
+			return ret;
+		}
+
+		index += thp_nr_pages(page);
+	} while (index < to);
+
+	return 0;
+}
+
+/*
+ * Set up a buffer into which data will be read or decrypted/decompressed.
+ * The pages to be read into are attached to this buffer and the gaps filled in
+ * to form a contiguous region.
+ */
+static int netfs_rreq_set_up_buffer(struct netfs_read_request *rreq,
+				    struct readahead_control *ractl,
+				    struct page *keep,
+				    pgoff_t have_index, unsigned int have_pages)
+{
+	struct page *page;
+	gfp_t gfp_mask = readahead_gfp_mask(rreq->mapping);
+	unsigned int want_pages = have_pages;
+	pgoff_t want_index = have_index;
+	int ret;
+
+#if 0
+	want_index = round_down(want_index, 256 * 1024 / PAGE_SIZE);
+	want_pages += have_index - want_index;
+	want_pages = round_up(want_pages, 256 * 1024 / PAGE_SIZE);
+
+	kdebug("setup %lx-%lx -> %lx-%lx",
+	       have_index, have_index + have_pages - 1,
+	       want_index, want_index + want_pages - 1);
+#endif
+
+	ret = netfs_rreq_add_pages_to_buffer(rreq, want_index, have_index - 1,
+					     gfp_mask);
+	if (ret < 0)
+		return ret;
+	have_pages += have_index - want_index;
+
+	ret = netfs_rreq_add_pages_to_buffer(rreq, have_index + have_pages,
+					     want_index + want_pages - 1,
+					     gfp_mask);
+	if (ret < 0)
+		return ret;
+
+	/* Transfer the pages proposed by the VM into the buffer along with
+	 * their page refs.  The locks will be dropped in netfs_rreq_unlock().
+	 */
+	if (ractl) {
+		while ((page = readahead_page(ractl))) {
+			if (page == keep)
+				get_page(page);
+			ret = xa_insert_set_mark(&rreq->buffer, page->index, page,
+						 XA_MARK_0, gfp_mask);
+			if (ret < 0) {
+				if (page != keep)
+					unlock_page(page);
+				put_page(page);
+				return ret;
+			}
+		}
+	} else {
+		get_page(keep);
+		ret = xa_insert_set_mark(&rreq->buffer, keep->index, keep,
+					 XA_MARK_0, gfp_mask);
+		if (ret < 0) {
+			put_page(keep);
+			return ret;
+		}
+	}
+	return 0;
+}
+
 /**
  * netfs_readahead - Helper to manage a read request
  * @ractl: The description of the readahead request
@@ -861,7 +991,6 @@ void netfs_readahead(struct readahead_control *ractl,
 		     void *netfs_priv)
 {
 	struct netfs_read_request *rreq;
-	struct page *page;
 	unsigned int debug_index = 0;
 	int ret;
 
@@ -889,6 +1018,12 @@ void netfs_readahead(struct readahead_control *ractl,
 
 	netfs_rreq_expand(rreq, ractl);
 
+	/* Set up the output buffer */
+	ret = netfs_rreq_set_up_buffer(rreq, ractl, NULL,
+				       readahead_index(ractl), readahead_count(ractl));
+	if (ret < 0)
+		goto cleanup_free;
+
 	atomic_set(&rreq->nr_rd_ops, 1);
 	do {
 		if (!netfs_rreq_submit_slice(rreq, &debug_index))
@@ -896,12 +1031,6 @@ void netfs_readahead(struct readahead_control *ractl,
 
 	} while (rreq->submitted < rreq->len);
 
-	/* Drop the refs on the pages here rather than in the cache or
-	 * filesystem.  The locks will be dropped in netfs_rreq_unlock().
-	 */
-	while ((page = readahead_page(ractl)))
-		put_page(page);
-
 	/* If we decrement nr_rd_ops to 0, the ref belongs to us. */
 	if (atomic_dec_and_test(&rreq->nr_rd_ops))
 		netfs_rreq_assess(rreq, false);
@@ -967,6 +1096,12 @@ int netfs_readpage(struct file *file,
 	netfs_stat(&netfs_n_rh_readpage);
 	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage);
 
+	/* Set up the output buffer */
+	ret = netfs_rreq_set_up_buffer(rreq, NULL, page,
+				       page_index(page), thp_nr_pages(page));
+	if (ret < 0)
+		goto out;
+
 	netfs_get_read_request(rreq);
 
 	atomic_set(&rreq->nr_rd_ops, 1);
@@ -1134,13 +1269,18 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	 */
 	ractl._nr_pages = thp_nr_pages(page);
 	netfs_rreq_expand(rreq, &ractl);
-	netfs_get_read_request(rreq);
 
-	/* We hold the page locks, so we can drop the references */
-	while ((xpage = readahead_page(&ractl)))
-		if (xpage != page)
-			put_page(xpage);
+	/* Set up the output buffer */
+	ret = netfs_rreq_set_up_buffer(rreq, &ractl, page,
+				       readahead_index(&ractl), readahead_count(&ractl));
+	if (ret < 0) {
+		while ((xpage = readahead_page(&ractl)))
+			if (xpage != page)
+				put_page(xpage);
+		goto error_put;
+	}
 
+	netfs_get_read_request(rreq);
 	atomic_set(&rreq->nr_rd_ops, 1);
 	do {
 		if (!netfs_rreq_submit_slice(rreq, &debug_index))
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 45d40c622205..815001fe7a76 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -138,6 +138,7 @@ struct netfs_read_request {
 	struct address_space	*mapping;	/* The mapping being accessed */
 	struct netfs_cache_resources cache_resources;
 	struct list_head	subrequests;	/* Requests to fetch I/O from disk or net */
+	struct xarray		buffer;		/* Decryption/decompression buffer */
 	void			*netfs_priv;	/* Private data for the netfs */
 	unsigned int		debug_id;
 	atomic_t		nr_rd_ops;	/* Number of read ops in progress */




* [RFC PATCH 05/12] netfs: Add a netfs inode context
  2021-07-21 13:44 David Howells
                   ` (3 preceding siblings ...)
  2021-07-21 13:45 ` [RFC PATCH 04/12] netfs: Use a buffer in netfs_read_request and add pages to it David Howells
@ 2021-07-21 13:45 ` David Howells
  2021-07-21 13:46 ` [RFC PATCH 06/12] netfs: Keep lists of pending, active, dirty and flushed regions David Howells
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:

	struct my_inode {
		struct {
			struct inode		vfs_inode;
			struct netfs_i_context	netfs_ctx;
		};
	};

The netfs_i_context struct contains two fields for the network filesystem
to use:

	struct netfs_i_context {
		...
		struct fscache_cookie	*cache;
		unsigned long		flags;
	#define NETFS_ICTX_NEW_CONTENT	0
	};

There's a pointer to the cache cookie and a flag to indicate that the
content in the file is locally generated and entirely new (ie. the file was
just created locally or was truncated to nothing).

Two functions are provided to help with this:

 (1) void netfs_i_context_init(struct inode *inode,
			       const struct netfs_request_ops *ops);

     Initialise the netfs context and set the operations.

 (2) struct netfs_i_context *netfs_i_context(struct inode *inode);

     Find the netfs context from the inode struct.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/callback.c      |    2 -
 fs/afs/dir.c           |    2 -
 fs/afs/dynroot.c       |    1 
 fs/afs/file.c          |   29 ++---------
 fs/afs/inode.c         |   10 ++--
 fs/afs/internal.h      |   13 ++---
 fs/afs/super.c         |    2 -
 fs/afs/write.c         |    7 +--
 fs/ceph/addr.c         |    2 -
 fs/netfs/internal.h    |   11 ++++
 fs/netfs/read_helper.c |  124 ++++++++++++++++++++++--------------------------
 fs/netfs/stats.c       |    1 
 include/linux/netfs.h  |   66 +++++++++++++++++++++-----
 13 files changed, 146 insertions(+), 124 deletions(-)

diff --git a/fs/afs/callback.c b/fs/afs/callback.c
index 7d9b23d981bf..0d4b9678ad22 100644
--- a/fs/afs/callback.c
+++ b/fs/afs/callback.c
@@ -41,7 +41,7 @@ void __afs_break_callback(struct afs_vnode *vnode, enum afs_cb_break_reason reas
 {
 	_enter("");
 
-	clear_bit(AFS_VNODE_NEW_CONTENT, &vnode->flags);
+	clear_bit(NETFS_ICTX_NEW_CONTENT, &netfs_i_context(&vnode->vfs_inode)->flags);
 	if (test_and_clear_bit(AFS_VNODE_CB_PROMISED, &vnode->flags)) {
 		vnode->cb_break++;
 		afs_clear_permits(vnode);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index ac829e63c570..a4c9cd6de622 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -1350,7 +1350,7 @@ static void afs_vnode_new_inode(struct afs_operation *op)
 	}
 
 	vnode = AFS_FS_I(inode);
-	set_bit(AFS_VNODE_NEW_CONTENT, &vnode->flags);
+	set_bit(NETFS_ICTX_NEW_CONTENT, &netfs_i_context(&vnode->vfs_inode)->flags);
 	if (!op->error)
 		afs_cache_permit(vnode, op->key, vnode->cb_break, &vp->scb);
 	d_instantiate(op->dentry, inode);
diff --git a/fs/afs/dynroot.c b/fs/afs/dynroot.c
index db832cc931c8..f120bcb8bf73 100644
--- a/fs/afs/dynroot.c
+++ b/fs/afs/dynroot.c
@@ -76,6 +76,7 @@ struct inode *afs_iget_pseudo_dir(struct super_block *sb, bool root)
 	/* there shouldn't be an existing inode */
 	BUG_ON(!(inode->i_state & I_NEW));
 
+	netfs_i_context_init(inode, NULL);
 	inode->i_size		= 0;
 	inode->i_mode		= S_IFDIR | S_IRUGO | S_IXUGO;
 	if (root) {
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 82e945dbe379..1861e4ecc2ce 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -18,13 +18,11 @@
 #include "internal.h"
 
 static int afs_file_mmap(struct file *file, struct vm_area_struct *vma);
-static int afs_readpage(struct file *file, struct page *page);
 static int afs_symlink_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned int offset,
 			       unsigned int length);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 
-static void afs_readahead(struct readahead_control *ractl);
 static ssize_t afs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 
 const struct file_operations afs_file_operations = {
@@ -48,8 +46,8 @@ const struct inode_operations afs_file_inode_operations = {
 };
 
 const struct address_space_operations afs_file_aops = {
-	.readpage	= afs_readpage,
-	.readahead	= afs_readahead,
+	.readpage	= netfs_readpage,
+	.readahead	= netfs_readahead,
 	.set_page_dirty	= afs_set_page_dirty,
 	.launder_page	= afs_launder_page,
 	.releasepage	= afs_releasepage,
@@ -153,7 +151,8 @@ int afs_open(struct inode *inode, struct file *file)
 	}
 
 	if (file->f_flags & O_TRUNC)
-		set_bit(AFS_VNODE_NEW_CONTENT, &vnode->flags);
+		set_bit(NETFS_ICTX_NEW_CONTENT,
+			&netfs_i_context(&vnode->vfs_inode)->flags);
 
 	fscache_use_cookie(afs_vnode_cache(vnode), file->f_mode & FMODE_WRITE);
 
@@ -351,13 +350,6 @@ static void afs_init_rreq(struct netfs_read_request *rreq, struct file *file)
 	rreq->netfs_priv = key_get(afs_file_key(file));
 }
 
-static bool afs_is_cache_enabled(struct inode *inode)
-{
-	struct fscache_cookie *cookie = afs_vnode_cache(AFS_FS_I(inode));
-
-	return fscache_cookie_enabled(cookie) && cookie->cache_priv;
-}
-
 static int afs_begin_cache_operation(struct netfs_read_request *rreq)
 {
 	struct afs_vnode *vnode = AFS_FS_I(rreq->inode);
@@ -378,25 +370,14 @@ static void afs_priv_cleanup(struct address_space *mapping, void *netfs_priv)
 	key_put(netfs_priv);
 }
 
-const struct netfs_read_request_ops afs_req_ops = {
+const struct netfs_request_ops afs_req_ops = {
 	.init_rreq		= afs_init_rreq,
-	.is_cache_enabled	= afs_is_cache_enabled,
 	.begin_cache_operation	= afs_begin_cache_operation,
 	.check_write_begin	= afs_check_write_begin,
 	.issue_op		= afs_req_issue_op,
 	.cleanup		= afs_priv_cleanup,
 };
 
-static int afs_readpage(struct file *file, struct page *page)
-{
-	return netfs_readpage(file, page, &afs_req_ops, NULL);
-}
-
-static void afs_readahead(struct readahead_control *ractl)
-{
-	netfs_readahead(ractl, &afs_req_ops, NULL);
-}
-
 int afs_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	fscache_unpin_writeback(wbc, afs_vnode_cache(AFS_FS_I(inode)));
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index cf7b66957c6f..3e9e388245a1 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -430,7 +430,7 @@ static void afs_get_inode_cache(struct afs_vnode *vnode)
 	struct afs_vnode_cache_aux aux;
 
 	if (vnode->status.type != AFS_FTYPE_FILE) {
-		vnode->cache = NULL;
+		vnode->netfs_ctx.cache = NULL;
 		return;
 	}
 
@@ -440,7 +440,7 @@ static void afs_get_inode_cache(struct afs_vnode *vnode)
 	key.vnode_id_ext[1]	= htonl(vnode->fid.vnode_hi);
 	afs_set_cache_aux(vnode, &aux);
 
-	vnode->cache = fscache_acquire_cookie(
+	vnode->netfs_ctx.cache = fscache_acquire_cookie(
 		vnode->volume->cache,
 		vnode->status.type == AFS_FTYPE_FILE ? 0 : FSCACHE_ADV_SINGLE_CHUNK,
 		&key, sizeof(key),
@@ -479,6 +479,7 @@ struct inode *afs_iget(struct afs_operation *op, struct afs_vnode_param *vp)
 		return inode;
 	}
 
+	netfs_i_context_init(inode, &afs_req_ops);
 	ret = afs_inode_init_from_status(op, vp, vnode);
 	if (ret < 0)
 		goto bad_inode;
@@ -535,6 +536,7 @@ struct inode *afs_root_iget(struct super_block *sb, struct key *key)
 	_debug("GOT ROOT INODE %p { vl=%llx }", inode, as->volume->vid);
 
 	BUG_ON(!(inode->i_state & I_NEW));
+	netfs_i_context_init(inode, &afs_req_ops);
 
 	vnode = AFS_FS_I(inode);
 	vnode->cb_v_break = as->volume->cb_v_break,
@@ -803,9 +805,9 @@ void afs_evict_inode(struct inode *inode)
 	}
 
 #ifdef CONFIG_AFS_FSCACHE
-	fscache_relinquish_cookie(vnode->cache,
+	fscache_relinquish_cookie(vnode->netfs_ctx.cache,
 				  test_bit(AFS_VNODE_DELETED, &vnode->flags));
-	vnode->cache = NULL;
+	vnode->netfs_ctx.cache = NULL;
 #endif
 
 	afs_prune_wb_keys(vnode);
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index ccdde00ada8a..e0204dde4b50 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -615,15 +615,15 @@ enum afs_lock_state {
  * leak from one inode to another.
  */
 struct afs_vnode {
-	struct inode		vfs_inode;	/* the VFS's inode record */
+	struct {
+		struct inode	vfs_inode;	/* the VFS's inode record */
+		struct netfs_i_context netfs_ctx; /* Netfslib context */
+	};
 
 	struct afs_volume	*volume;	/* volume on which vnode resides */
 	struct afs_fid		fid;		/* the file identifier for this inode */
 	struct afs_file_status	status;		/* AFS status info for this file */
 	afs_dataversion_t	invalid_before;	/* Child dentries are invalid before this */
-#ifdef CONFIG_AFS_FSCACHE
-	struct fscache_cookie	*cache;		/* caching cookie */
-#endif
 	struct afs_permits __rcu *permit_cache;	/* cache of permits so far obtained */
 	struct mutex		io_lock;	/* Lock for serialising I/O on this mutex */
 	struct rw_semaphore	validate_lock;	/* lock for validating this vnode */
@@ -640,7 +640,6 @@ struct afs_vnode {
 #define AFS_VNODE_MOUNTPOINT	5		/* set if vnode is a mountpoint symlink */
 #define AFS_VNODE_AUTOCELL	6		/* set if Vnode is an auto mount point */
 #define AFS_VNODE_PSEUDODIR	7 		/* set if Vnode is a pseudo directory */
-#define AFS_VNODE_NEW_CONTENT	8		/* Set if file has new content (create/trunc-0) */
 #define AFS_VNODE_SILLY_DELETED	9		/* Set if file has been silly-deleted */
 #define AFS_VNODE_MODIFYING	10		/* Set if we're performing a modification op */
 
@@ -666,7 +665,7 @@ struct afs_vnode {
 static inline struct fscache_cookie *afs_vnode_cache(struct afs_vnode *vnode)
 {
 #ifdef CONFIG_AFS_FSCACHE
-	return vnode->cache;
+	return vnode->netfs_ctx.cache;
 #else
 	return NULL;
 #endif
@@ -1054,7 +1053,7 @@ extern const struct address_space_operations afs_file_aops;
 extern const struct address_space_operations afs_symlink_aops;
 extern const struct inode_operations afs_file_inode_operations;
 extern const struct file_operations afs_file_operations;
-extern const struct netfs_read_request_ops afs_req_ops;
+extern const struct netfs_request_ops afs_req_ops;
 
 extern int afs_cache_wb_key(struct afs_vnode *, struct afs_file *);
 extern void afs_put_wb_key(struct afs_wb_key *);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 85e52c78f44f..29c1178beb72 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -692,7 +692,7 @@ static struct inode *afs_alloc_inode(struct super_block *sb)
 	vnode->lock_key		= NULL;
 	vnode->permit_cache	= NULL;
 #ifdef CONFIG_AFS_FSCACHE
-	vnode->cache		= NULL;
+	vnode->netfs_ctx.cache	= NULL;
 #endif
 
 	vnode->flags		= 1 << AFS_VNODE_UNSET;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 3be3a594124c..a244187f3503 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -49,8 +49,7 @@ int afs_write_begin(struct file *file, struct address_space *mapping,
 	 * file.  We need to do this before we get a lock on the page in case
 	 * there's more than one writer competing for the same cache block.
 	 */
-	ret = netfs_write_begin(file, mapping, pos, len, flags, &page, fsdata,
-				&afs_req_ops, NULL);
+	ret = netfs_write_begin(file, mapping, pos, len, flags, &page, fsdata);
 	if (ret < 0)
 		return ret;
 
@@ -76,7 +75,7 @@ int afs_write_begin(struct file *file, struct address_space *mapping,
 		 * spaces to be merged into writes.  If it's not, only write
 		 * back what the user gives us.
 		 */
-		if (!test_bit(AFS_VNODE_NEW_CONTENT, &vnode->flags) &&
+		if (!test_bit(NETFS_ICTX_NEW_CONTENT, &vnode->netfs_ctx.flags) &&
 		    (to < f || from > t))
 			goto flush_conflicting_write;
 	}
@@ -557,7 +556,7 @@ static ssize_t afs_write_back_from_locked_page(struct address_space *mapping,
 	unsigned long priv;
 	unsigned int offset, to, len, max_len;
 	loff_t i_size = i_size_read(&vnode->vfs_inode);
-	bool new_content = test_bit(AFS_VNODE_NEW_CONTENT, &vnode->flags);
+	bool new_content = test_bit(NETFS_ICTX_NEW_CONTENT, &vnode->netfs_ctx.flags);
 	long count = wbc->nr_to_write;
 	int ret;
 
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index a1e2813731d1..a8a41254e691 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -305,7 +305,7 @@ static void ceph_readahead_cleanup(struct address_space *mapping, void *priv)
 		ceph_put_cap_refs(ci, got);
 }
 
-static const struct netfs_read_request_ops ceph_netfs_read_ops = {
+static const struct netfs_request_ops ceph_netfs_read_ops = {
 	.init_rreq		= ceph_init_rreq,
 	.is_cache_enabled	= ceph_is_cache_enabled,
 	.begin_cache_operation	= ceph_begin_cache_operation,
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index b7f2c4459f33..4805d9fc8808 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -5,6 +5,10 @@
  * Written by David Howells (dhowells@redhat.com)
  */
 
+#include <linux/netfs.h>
+#include <linux/fscache.h>
+#include <trace/events/netfs.h>
+
 #ifdef pr_fmt
 #undef pr_fmt
 #endif
@@ -50,6 +54,13 @@ static inline void netfs_stat_d(atomic_t *stat)
 	atomic_dec(stat);
 }
 
+static inline bool netfs_is_cache_enabled(struct inode *inode)
+{
+	struct fscache_cookie *cookie = netfs_i_cookie(inode);
+
+	return fscache_cookie_enabled(cookie) && cookie->cache_priv;
+}
+
 #else
 #define netfs_stat(x) do {} while(0)
 #define netfs_stat_d(x) do {} while(0)
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index b03bc5b0da5a..aa98ecf6df6b 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -14,7 +14,6 @@
 #include <linux/uio.h>
 #include <linux/sched/mm.h>
 #include <linux/task_io_accounting_ops.h>
-#include <linux/netfs.h>
 #include "internal.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/netfs.h>
@@ -38,26 +37,27 @@ static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
 		__netfs_put_subrequest(subreq, was_async);
 }
 
-static struct netfs_read_request *netfs_alloc_read_request(
-	const struct netfs_read_request_ops *ops, void *netfs_priv,
-	struct file *file)
+static struct netfs_read_request *netfs_alloc_read_request(struct address_space *mapping,
+							   struct file *file)
 {
 	static atomic_t debug_ids;
+	struct inode *inode = file ? file_inode(file) : mapping->host;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
 	struct netfs_read_request *rreq;
 
 	rreq = kzalloc(sizeof(struct netfs_read_request), GFP_KERNEL);
 	if (rreq) {
-		rreq->netfs_ops	= ops;
-		rreq->netfs_priv = netfs_priv;
-		rreq->inode	= file_inode(file);
-		rreq->i_size	= i_size_read(rreq->inode);
+		rreq->mapping	= mapping;
+		rreq->inode	= inode;
+		rreq->netfs_ops	= ctx->ops;
+		rreq->i_size	= i_size_read(inode);
 		rreq->debug_id	= atomic_inc_return(&debug_ids);
 		xa_init(&rreq->buffer);
 		INIT_LIST_HEAD(&rreq->subrequests);
 		INIT_WORK(&rreq->work, netfs_rreq_work);
 		refcount_set(&rreq->usage, 1);
 		__set_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
-		ops->init_rreq(rreq, file);
+		ctx->ops->init_rreq(rreq, file);
 		netfs_stat(&netfs_n_rh_rreq);
 	}
 
@@ -971,8 +971,6 @@ static int netfs_rreq_set_up_buffer(struct netfs_read_request *rreq,
 /**
  * netfs_readahead - Helper to manage a read request
  * @ractl: The description of the readahead request
- * @ops: The network filesystem's operations for the helper to use
- * @netfs_priv: Private netfs data to be retained in the request
  *
  * Fulfil a readahead request by drawing data from the cache if possible, or
  * the netfs if not.  Space beyond the EOF is zero-filled.  Multiple I/O
@@ -980,34 +978,31 @@ static int netfs_rreq_set_up_buffer(struct netfs_read_request *rreq,
  * readahead window can be expanded in either direction to a more convenient
  * alignment for RPC efficiency or to make storage in the cache feasible.
  *
- * The calling netfs must provide a table of operations, only one of which,
- * issue_op, is mandatory.  It may also be passed a private token, which will
- * be retained in rreq->netfs_priv and will be cleaned up by ops->cleanup().
+ * The calling netfs must initialise a netfs context contiguous to the vfs
+ * inode before calling this.
  *
  * This is usable whether or not caching is enabled.
  */
-void netfs_readahead(struct readahead_control *ractl,
-		     const struct netfs_read_request_ops *ops,
-		     void *netfs_priv)
+void netfs_readahead(struct readahead_control *ractl)
 {
 	struct netfs_read_request *rreq;
+	struct netfs_i_context *ctx = netfs_i_context(ractl->mapping->host);
 	unsigned int debug_index = 0;
 	int ret;
 
 	_enter("%lx,%x", readahead_index(ractl), readahead_count(ractl));
 
 	if (readahead_count(ractl) == 0)
-		goto cleanup;
+		return;
 
-	rreq = netfs_alloc_read_request(ops, netfs_priv, ractl->file);
+	rreq = netfs_alloc_read_request(ractl->mapping, ractl->file);
 	if (!rreq)
-		goto cleanup;
-	rreq->mapping	= ractl->mapping;
+		return;
 	rreq->start	= readahead_pos(ractl);
 	rreq->len	= readahead_length(ractl);
 
-	if (ops->begin_cache_operation) {
-		ret = ops->begin_cache_operation(rreq);
+	if (ctx->ops->begin_cache_operation) {
+		ret = ctx->ops->begin_cache_operation(rreq);
 		if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS)
 			goto cleanup_free;
 	}
@@ -1039,10 +1034,6 @@ void netfs_readahead(struct readahead_control *ractl,
 cleanup_free:
 	netfs_put_read_request(rreq, false);
 	return;
-cleanup:
-	if (netfs_priv)
-		ops->cleanup(ractl->mapping, netfs_priv);
-	return;
 }
 EXPORT_SYMBOL(netfs_readahead);
 
@@ -1050,43 +1041,34 @@ EXPORT_SYMBOL(netfs_readahead);
  * netfs_readpage - Helper to manage a readpage request
  * @file: The file to read from
  * @page: The page to read
- * @ops: The network filesystem's operations for the helper to use
- * @netfs_priv: Private netfs data to be retained in the request
  *
  * Fulfil a readpage request by drawing data from the cache if possible, or the
  * netfs if not.  Space beyond the EOF is zero-filled.  Multiple I/O requests
  * from different sources will get munged together.
  *
- * The calling netfs must provide a table of operations, only one of which,
- * issue_op, is mandatory.  It may also be passed a private token, which will
- * be retained in rreq->netfs_priv and will be cleaned up by ops->cleanup().
+ * The calling netfs must initialise a netfs context contiguous to the vfs
+ * inode before calling this.
  *
  * This is usable whether or not caching is enabled.
  */
-int netfs_readpage(struct file *file,
-		   struct page *page,
-		   const struct netfs_read_request_ops *ops,
-		   void *netfs_priv)
+int netfs_readpage(struct file *file, struct page *page)
 {
+	struct address_space *mapping = page_file_mapping(page);
 	struct netfs_read_request *rreq;
+	struct netfs_i_context *ctx = netfs_i_context(mapping->host);
 	unsigned int debug_index = 0;
 	int ret;
 
 	_enter("%lx", page_index(page));
 
-	rreq = netfs_alloc_read_request(ops, netfs_priv, file);
-	if (!rreq) {
-		if (netfs_priv)
-			ops->cleanup(netfs_priv, page_file_mapping(page));
-		unlock_page(page);
-		return -ENOMEM;
-	}
-	rreq->mapping	= page_file_mapping(page);
+	rreq = netfs_alloc_read_request(mapping, file);
+	if (!rreq)
+		goto nomem;
 	rreq->start	= page_file_offset(page);
 	rreq->len	= thp_size(page);
 
-	if (ops->begin_cache_operation) {
-		ret = ops->begin_cache_operation(rreq);
+	if (ctx->ops->begin_cache_operation) {
+		ret = ctx->ops->begin_cache_operation(rreq);
 		if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS) {
 			unlock_page(page);
 			goto out;
@@ -1128,6 +1110,9 @@ int netfs_readpage(struct file *file,
 out:
 	netfs_put_read_request(rreq, false);
 	return ret;
+nomem:
+	unlock_page(page);
+	return -ENOMEM;
 }
 EXPORT_SYMBOL(netfs_readpage);
 
@@ -1136,6 +1121,7 @@ EXPORT_SYMBOL(netfs_readpage);
  * @page: page being prepared
  * @pos: starting position for the write
  * @len: length of write
+ * @always_fill: T if the page should always be completely filled/cleared
  *
  * In some cases, write_begin doesn't need to read at all:
  * - full page write
@@ -1145,14 +1131,24 @@ EXPORT_SYMBOL(netfs_readpage);
  * If any of these criteria are met, then zero out the unwritten parts
  * of the page and return true. Otherwise, return false.
  */
-static bool netfs_skip_page_read(struct page *page, loff_t pos, size_t len)
+static bool netfs_skip_page_read(struct page *page, loff_t pos, size_t len,
+				 bool always_fill)
 {
 	struct inode *inode = page->mapping->host;
 	loff_t i_size = i_size_read(inode);
 	size_t offset = offset_in_thp(page, pos);
+	size_t plen = thp_size(page);
+
+	if (unlikely(always_fill)) {
+		if (pos - offset + len <= i_size)
+			return false; /* Page entirely before EOF */
+		zero_user_segment(page, 0, plen);
+		SetPageUptodate(page);
+		return true;
+	}
 
 	/* Full page write */
-	if (offset == 0 && len >= thp_size(page))
+	if (offset == 0 && len >= plen)
 		return true;
 
 	/* pos beyond last page in the file */
@@ -1165,7 +1161,7 @@ static bool netfs_skip_page_read(struct page *page, loff_t pos, size_t len)
 
 	return false;
 zero_out:
-	zero_user_segments(page, 0, offset, offset + len, thp_size(page));
+	zero_user_segments(page, 0, offset, offset + len, plen);
 	return true;
 }
 
@@ -1178,8 +1174,6 @@ static bool netfs_skip_page_read(struct page *page, loff_t pos, size_t len)
  * @flags: AOP_* flags
  * @_page: Where to put the resultant page
  * @_fsdata: Place for the netfs to store a cookie
- * @ops: The network filesystem's operations for the helper to use
- * @netfs_priv: Private netfs data to be retained in the request
  *
  * Pre-read data for a write-begin request by drawing data from the cache if
  * possible, or the netfs if not.  Space beyond the EOF is zero-filled.
@@ -1198,17 +1192,19 @@ static bool netfs_skip_page_read(struct page *page, loff_t pos, size_t len)
  * should go ahead; unlock the page and return -EAGAIN to cause the page to be
  * regot; or return an error.
  *
+ * The calling netfs must initialise a netfs context contiguous to the vfs
+ * inode before calling this.
+ *
  * This is usable whether or not caching is enabled.
  */
 int netfs_write_begin(struct file *file, struct address_space *mapping,
 		      loff_t pos, unsigned int len, unsigned int flags,
-		      struct page **_page, void **_fsdata,
-		      const struct netfs_read_request_ops *ops,
-		      void *netfs_priv)
+		      struct page **_page, void **_fsdata)
 {
 	struct netfs_read_request *rreq;
 	struct page *page, *xpage;
 	struct inode *inode = file_inode(file);
+	struct netfs_i_context *ctx = netfs_i_context(inode);
 	unsigned int debug_index = 0;
 	pgoff_t index = pos >> PAGE_SHIFT;
 	int ret;
@@ -1220,9 +1216,9 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	if (!page)
 		return -ENOMEM;
 
-	if (ops->check_write_begin) {
+	if (ctx->ops->check_write_begin) {
 		/* Allow the netfs (eg. ceph) to flush conflicts. */
-		ret = ops->check_write_begin(file, pos, len, page, _fsdata);
+		ret = ctx->ops->check_write_begin(file, pos, len, page, _fsdata);
 		if (ret < 0) {
 			trace_netfs_failure(NULL, NULL, ret, netfs_fail_check_write_begin);
 			if (ret == -EAGAIN)
@@ -1238,25 +1234,23 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	 * within the cache granule containing the EOF, in which case we need
 	 * to preload the granule.
 	 */
-	if (!ops->is_cache_enabled(inode) &&
-	    netfs_skip_page_read(page, pos, len)) {
+	if (!netfs_is_cache_enabled(inode) &&
+	    netfs_skip_page_read(page, pos, len, false)) {
 		netfs_stat(&netfs_n_rh_write_zskip);
 		goto have_page_no_wait;
 	}
 
 	ret = -ENOMEM;
-	rreq = netfs_alloc_read_request(ops, netfs_priv, file);
+	rreq = netfs_alloc_read_request(mapping, file);
 	if (!rreq)
 		goto error;
-	rreq->mapping		= page->mapping;
 	rreq->start		= page_offset(page);
 	rreq->len		= thp_size(page);
 	rreq->no_unlock_page	= page->index;
 	__set_bit(NETFS_RREQ_NO_UNLOCK_PAGE, &rreq->flags);
-	netfs_priv = NULL;
 
-	if (ops->begin_cache_operation) {
-		ret = ops->begin_cache_operation(rreq);
+	if (ctx->ops->begin_cache_operation) {
+		ret = ctx->ops->begin_cache_operation(rreq);
 		if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS)
 			goto error_put;
 	}
@@ -1314,8 +1308,6 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	if (ret < 0)
 		goto error;
 have_page_no_wait:
-	if (netfs_priv)
-		ops->cleanup(netfs_priv, mapping);
 	*_page = page;
 	_leave(" = 0");
 	return 0;
@@ -1325,8 +1317,6 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 error:
 	unlock_page(page);
 	put_page(page);
-	if (netfs_priv)
-		ops->cleanup(netfs_priv, mapping);
 	_leave(" = %d", ret);
 	return ret;
 }
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 9ae538c85378..5510a7a14a40 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -7,7 +7,6 @@
 
 #include <linux/export.h>
 #include <linux/seq_file.h>
-#include <linux/netfs.h>
 #include "internal.h"
 
 atomic_t netfs_n_rh_readahead;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 815001fe7a76..35bcd916c3a0 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -157,14 +157,25 @@ struct netfs_read_request {
 #define NETFS_RREQ_DONT_UNLOCK_PAGES	3	/* Don't unlock the pages on completion */
 #define NETFS_RREQ_FAILED		4	/* The request failed */
 #define NETFS_RREQ_IN_PROGRESS		5	/* Unlocked when the request completes */
-	const struct netfs_read_request_ops *netfs_ops;
+	const struct netfs_request_ops *netfs_ops;
+};
+
+/*
+ * Per-inode description.  This must be directly after the inode struct.
+ */
+struct netfs_i_context {
+	const struct netfs_request_ops *ops;
+#ifdef CONFIG_FSCACHE
+	struct fscache_cookie	*cache;
+#endif
+	unsigned long		flags;
+#define NETFS_ICTX_NEW_CONTENT	0		/* Set if file has new content (create/trunc-0) */
 };
 
 /*
  * Operations the network filesystem can/must provide to the helpers.
  */
-struct netfs_read_request_ops {
-	bool (*is_cache_enabled)(struct inode *inode);
+struct netfs_request_ops {
 	void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
 	int (*begin_cache_operation)(struct netfs_read_request *rreq);
 	void (*expand_readahead)(struct netfs_read_request *rreq);
@@ -218,20 +229,49 @@ struct netfs_cache_ops {
 };
 
 struct readahead_control;
-extern void netfs_readahead(struct readahead_control *,
-			    const struct netfs_read_request_ops *,
-			    void *);
-extern int netfs_readpage(struct file *,
-			  struct page *,
-			  const struct netfs_read_request_ops *,
-			  void *);
+extern void netfs_readahead(struct readahead_control *);
+extern int netfs_readpage(struct file *, struct page *);
 extern int netfs_write_begin(struct file *, struct address_space *,
 			     loff_t, unsigned int, unsigned int, struct page **,
-			     void **,
-			     const struct netfs_read_request_ops *,
-			     void *);
+			     void **);
 
 extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
 extern void netfs_stats_show(struct seq_file *);
 
+/**
+ * netfs_i_context - Get the netfs inode context from the inode
+ * @inode: The inode to query
+ *
+ * This function gets the netfs lib inode context from the network filesystem's
+ * inode.  It expects it to follow on directly from the VFS inode struct.
+ */
+static inline struct netfs_i_context *netfs_i_context(struct inode *inode)
+{
+	return (struct netfs_i_context *)(inode + 1);
+}
+
+static inline void netfs_i_context_init(struct inode *inode,
+					const struct netfs_request_ops *ops)
+{
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+
+	ctx->ops = ops;
+}
+
+/**
+ * netfs_i_cookie - Get the cache cookie from the inode
+ * @inode: The inode to query
+ *
+ * Get the caching cookie (if enabled) from the network filesystem's inode.
+ */
+static inline struct fscache_cookie *netfs_i_cookie(struct inode *inode)
+{
+#ifdef CONFIG_FSCACHE
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+	return ctx->cache;
+#else
+	return NULL;
+#endif
+}
+
 #endif /* _LINUX_NETFS_H */
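
For reference, here's a short editorial sketch (not part of the patches; the
"myfs" names are hypothetical) of what the new per-inode context asks of a
filesystem: struct netfs_i_context must sit immediately after the struct
inode, because netfs_i_context() locates it with (inode + 1), and the ops
table is attached with netfs_i_context_init() when the inode is set up -
which is what the afs changes above rely on via vnode->netfs_ctx.

	#include <linux/fs.h>
	#include <linux/netfs.h>

	/* Hypothetical filesystem inode: the netfs context must directly
	 * follow the VFS inode so that netfs_i_context() can find it.
	 */
	struct myfs_inode {
		struct inode		vfs_inode;
		struct netfs_i_context	netfs_ctx;
		/* ... filesystem-private fields ... */
	};

	/* Assumed to be defined elsewhere by the filesystem. */
	extern const struct netfs_request_ops myfs_req_ops;

	static void myfs_init_netfs_context(struct inode *inode)
	{
		/* Attach the ops table; a cache cookie, if any, is set later. */
		netfs_i_context_init(inode, &myfs_req_ops);
	#ifdef CONFIG_FSCACHE
		netfs_i_context(inode)->cache = NULL;
	#endif
	}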



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 06/12] netfs: Keep lists of pending, active, dirty and flushed regions
  2021-07-21 13:44 David Howells
                   ` (4 preceding siblings ...)
  2021-07-21 13:45 ` [RFC PATCH 05/12] netfs: Add a netfs inode context David Howells
@ 2021-07-21 13:46 ` David Howells
  2021-07-21 13:46 ` [RFC PATCH 07/12] netfs: Initiate write request from a dirty region David Howells
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

This looks nice, in theory, and has the following features:

 (*) Things are managed with write records.

     (-) A WRITE is a region defined by an outer bounding box that spans
         the pages that are involved and an inner region that contains the
         actual modifications.

     (-) The bounding box must encompass all the data that will be
     	 necessary to perform a write operation to the server (for example,
     	 if we want to encrypt with a 64K block size when we have 4K
     	 pages).

 (*) There are four lists of write records:

     (-) The PENDING LIST holds writes that are blocked by another active
         write.  This list is in order of submission to avoid starvation,
         and the writes held therein may overlap.

     (-) The ACTIVE LIST holds writes that have been granted exclusive
         access to a patch of the file.  This list is in order of starting
         position and the regions held therein may not overlap.

     (-) The DIRTY LIST holds regions that have been modified.  This list
         is also in order of starting position; regions may not overlap,
         though they can be merged.

     (-) The FLUSH LIST holds regions that require writing back.  This
         list is ordered by flush group.

 (*) An active region acts as an exclusion zone on part of the range,
     allowing the inode sem to be dropped once the region is on a list.

     (-) A DIO write creates its own exclusive region that must not overlap
         with any other dirty region.

     (-) An active write may overlap one or more dirty regions.

     (-) A dirty region may be overlapped by one or more writes.

     (-) If an active write overlaps with an incompatible dirty region,
         that region gets flushed and the active write has to wait for it
         to complete.

 (*) When an active write completes, the region is inserted or merged into
     the dirty list.

     (-) Merging can only happen between compatible regions.

     (-) Contiguous dirty regions can be merged (see the sketch after
         this list).

     (-) If an inode has all new content, generated locally, dirty regions
         that have contiguous/overlapping bounding boxes can be merged,
         bridging any gaps with zeros.

     (-) O_DSYNC causes the region to be flushed immediately.

 (*) There's a queue of groups of regions and those regions must be flushed
     in order.

     (-) If a region in a group needs flushing, then all prior groups must
     	 be flushed first.
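
To make the merge rules above concrete, here's a minimal userspace sketch
(illustrative only - the structure and function names are made up and this
is not part of the patches) of when two compatible regions may be coalesced
and how the merged extent is computed:

	#include <stdbool.h>
	#include <stdio.h>

	struct range { unsigned long long start, end; };

	struct region {
		struct range bounds;	/* Pages spanned, rounded to the I/O block size */
		struct range dirty;	/* Bytes actually modified */
	};

	/* Assuming a precedes b: compatible regions can merge if their dirty
	 * ranges touch, or, when the inode's content is all locally
	 * generated, if their bounding boxes touch (the gap is then bridged
	 * with zeros).
	 */
	static bool can_merge(const struct region *a, const struct region *b,
			      bool all_new_content)
	{
		if (a->dirty.end >= b->dirty.start)
			return true;
		if (all_new_content && a->bounds.end >= b->bounds.start)
			return true;
		return false;
	}

	/* Fold b into a by taking the union of the extents. */
	static void merge(struct region *a, const struct region *b)
	{
		if (b->bounds.end > a->bounds.end)
			a->bounds.end = b->bounds.end;
		if (b->dirty.end > a->dirty.end)
			a->dirty.end = b->dirty.end;
	}

	int main(void)
	{
		struct region a = { .bounds = { 0, 65536 },      .dirty = { 1000, 3000 } };
		struct region b = { .bounds = { 65536, 131072 }, .dirty = { 70000, 80000 } };

		if (can_merge(&a, &b, true)) {
			merge(&a, &b);
			printf("merged: dirty %llu-%llu within bounds %llu-%llu\n",
			       a.dirty.start, a.dirty.end, a.bounds.start, a.bounds.end);
		}
		return 0;
	}

The example assumes a 64K block size, so the two bounding boxes abut even
though the dirty byte ranges don't; with all-new content the gap between
them is notionally bridged with zeros, as per the rule above.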


TRICKY BITS
===========

 (*) The active and dirty lists are O(n) search time.  An interval tree
     might be a better option; see the sketch after this list for the
     linear scan it would replace.

 (*) Having four list_heads is a lot of memory per inode.

 (*) Activating pending writes.

     (-) The pending list can contain a bunch of writes that can overlap.

     (-) When an active write completes, it is removed from the active
     	 queue and usually added to the dirty queue (except DIO, DSYNC).
     	 This makes a hole.

     (-) One or more pending writes can then be moved over, but care has to
     	 be taken not to misorder them to avoid starvation.

     (-) When a pending write is added to the active list, it may require
     	 part of the dirty list to be flushed.

 (*) A write that has been put onto the active queue may have to wait for
     flushing to complete.

 (*) How should an active write interact with a dirty region?

     (-) A dirty region may get flushed even whilst it is being modified on
         the assumption that the active write record will get added to the
         dirty list and cause a follow-up write to the server.

 (*) RAM pinning.

     (-) An active write could pin a lot of pages, thereby causing a large
     	 write to run the system out of RAM.

     (-) Allow active writes to start being flushed whilst still being
     	 modified.

     (-) Use a scheduler hook to decant the modified portion into the dirty
     	 list when the modifying task is switched away from?

 (*) Bounding box and variably-sized pages/folios.

     (-) The bounding box needs to be rounded out to the page boundaries so
         that DIO writes can claim exclusivity on a series of pages,
         allowing them to be invalidated.

     (-) Allocation of higher-order folios could be limited in scope so
     	 that they don't escape the requested bounding box.

     (-) Bounding boxes could be enlarged to allow for larger folios.

     (-) Overlarge bounding boxes can be shrunk later, possibly on merging
     	 into the dirty list.

     (-) Ordinary writes can have overlapping bounding boxes, even if
     	 they're otherwise incompatible.
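
This is the linear scan referred to in the first item of the list above: a
userspace sketch (illustrative names only, not part of the patches) of the
conflict check against the start-ordered active list that an interval tree
would replace:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	struct range { unsigned long long start, end; };

	struct active_write {
		struct range reserved;			/* Byte range this write has claimed */
		const struct active_write *next;	/* List is sorted by reserved.start */
	};

	/* Equivalent, for non-empty ranges, to the __overlaps() test in
	 * write_helper.c.
	 */
	static bool overlaps(const struct range *a, const struct range *b)
	{
		return a->start < b->end && b->start < a->end;
	}

	/* Walk the sorted active list; a new write may only be activated if
	 * its reservation clashes with nobody, otherwise it goes on the
	 * pending list.  The ordering lets the walk stop early, but the cost
	 * is still linear in the number of active writes.
	 */
	static bool can_activate(const struct active_write *active_list,
				 const struct range *candidate)
	{
		const struct active_write *w;

		for (w = active_list; w; w = w->next) {
			if (w->reserved.start >= candidate->end)
				break;		/* Sorted: nothing later can clash */
			if (overlaps(&w->reserved, candidate))
				return false;
		}
		return true;
	}

	int main(void)
	{
		struct active_write w2 = { .reserved = { 8192, 12288 }, .next = NULL };
		struct active_write w1 = { .reserved = { 0, 4096 },     .next = &w2 };
		struct range new_write = { 4096, 8192 };

		printf("can activate: %d\n", can_activate(&w1, &new_write));
		return 0;
	}

An interval tree keyed on the reserved range would answer the same query in
O(log n) plus the number of overlaps found.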
---

 fs/afs/file.c                |   30 +
 fs/afs/internal.h            |    7 
 fs/afs/write.c               |  166 --------
 fs/netfs/Makefile            |    8 
 fs/netfs/dio_helper.c        |  140 ++++++
 fs/netfs/internal.h          |   32 +
 fs/netfs/objects.c           |  113 +++++
 fs/netfs/read_helper.c       |   94 ++++
 fs/netfs/stats.c             |    5 
 fs/netfs/write_helper.c      |  908 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h        |   98 +++++
 include/trace/events/netfs.h |  180 ++++++++
 12 files changed, 1604 insertions(+), 177 deletions(-)
 create mode 100644 fs/netfs/dio_helper.c
 create mode 100644 fs/netfs/objects.c
 create mode 100644 fs/netfs/write_helper.c

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 1861e4ecc2ce..8400cdf086b6 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -30,7 +30,7 @@ const struct file_operations afs_file_operations = {
 	.release	= afs_release,
 	.llseek		= generic_file_llseek,
 	.read_iter	= generic_file_read_iter,
-	.write_iter	= afs_file_write,
+	.write_iter	= netfs_file_write_iter,
 	.mmap		= afs_file_mmap,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
@@ -53,8 +53,6 @@ const struct address_space_operations afs_file_aops = {
 	.releasepage	= afs_releasepage,
 	.invalidatepage	= afs_invalidatepage,
 	.direct_IO	= afs_direct_IO,
-	.write_begin	= afs_write_begin,
-	.write_end	= afs_write_end,
 	.writepage	= afs_writepage,
 	.writepages	= afs_writepages,
 };
@@ -370,12 +368,38 @@ static void afs_priv_cleanup(struct address_space *mapping, void *netfs_priv)
 	key_put(netfs_priv);
 }
 
+static void afs_init_dirty_region(struct netfs_dirty_region *region, struct file *file)
+{
+	region->netfs_priv = key_get(afs_file_key(file));
+}
+
+static void afs_free_dirty_region(struct netfs_dirty_region *region)
+{
+	key_put(region->netfs_priv);
+}
+
+static void afs_update_i_size(struct file *file, loff_t new_i_size)
+{
+	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
+	loff_t i_size;
+
+	write_seqlock(&vnode->cb_lock);
+	i_size = i_size_read(&vnode->vfs_inode);
+	if (new_i_size > i_size)
+		i_size_write(&vnode->vfs_inode, new_i_size);
+	write_sequnlock(&vnode->cb_lock);
+	fscache_update_cookie(afs_vnode_cache(vnode), NULL, &new_i_size);
+}
+
 const struct netfs_request_ops afs_req_ops = {
 	.init_rreq		= afs_init_rreq,
 	.begin_cache_operation	= afs_begin_cache_operation,
 	.check_write_begin	= afs_check_write_begin,
 	.issue_op		= afs_req_issue_op,
 	.cleanup		= afs_priv_cleanup,
+	.init_dirty_region	= afs_init_dirty_region,
+	.free_dirty_region	= afs_free_dirty_region,
+	.update_i_size		= afs_update_i_size,
 };
 
 int afs_write_inode(struct inode *inode, struct writeback_control *wbc)
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index e0204dde4b50..0d01ed2fe8fa 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -1511,15 +1511,8 @@ extern int afs_check_volume_status(struct afs_volume *, struct afs_operation *);
  * write.c
  */
 extern int afs_set_page_dirty(struct page *);
-extern int afs_write_begin(struct file *file, struct address_space *mapping,
-			loff_t pos, unsigned len, unsigned flags,
-			struct page **pagep, void **fsdata);
-extern int afs_write_end(struct file *file, struct address_space *mapping,
-			loff_t pos, unsigned len, unsigned copied,
-			struct page *page, void *fsdata);
 extern int afs_writepage(struct page *, struct writeback_control *);
 extern int afs_writepages(struct address_space *, struct writeback_control *);
-extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *);
 extern int afs_fsync(struct file *, loff_t, loff_t, int);
 extern vm_fault_t afs_page_mkwrite(struct vm_fault *vmf);
 extern void afs_prune_wb_keys(struct afs_vnode *);
diff --git a/fs/afs/write.c b/fs/afs/write.c
index a244187f3503..e6e2e924c8ae 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -27,152 +27,6 @@ int afs_set_page_dirty(struct page *page)
 	return fscache_set_page_dirty(page, afs_vnode_cache(AFS_FS_I(page->mapping->host)));
 }
 
-/*
- * Prepare to perform part of a write to a page.  Note that len may extend
- * beyond the end of the page.
- */
-int afs_write_begin(struct file *file, struct address_space *mapping,
-		    loff_t pos, unsigned len, unsigned flags,
-		    struct page **_page, void **fsdata)
-{
-	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
-	struct page *page;
-	unsigned long priv;
-	unsigned f, from;
-	unsigned t, to;
-	int ret;
-
-	_enter("{%llx:%llu},%llx,%x",
-	       vnode->fid.vid, vnode->fid.vnode, pos, len);
-
-	/* Prefetch area to be written into the cache if we're caching this
-	 * file.  We need to do this before we get a lock on the page in case
-	 * there's more than one writer competing for the same cache block.
-	 */
-	ret = netfs_write_begin(file, mapping, pos, len, flags, &page, fsdata);
-	if (ret < 0)
-		return ret;
-
-	from = offset_in_thp(page, pos);
-	len = min_t(size_t, len, thp_size(page) - from);
-	to = from + len;
-
-try_again:
-	/* See if this page is already partially written in a way that we can
-	 * merge the new write with.
-	 */
-	if (PagePrivate(page)) {
-		priv = page_private(page);
-		f = afs_page_dirty_from(page, priv);
-		t = afs_page_dirty_to(page, priv);
-		ASSERTCMP(f, <=, t);
-
-		if (PageWriteback(page)) {
-			trace_afs_page_dirty(vnode, tracepoint_string("alrdy"), page);
-			goto flush_conflicting_write;
-		}
-		/* If the file is being filled locally, allow inter-write
-		 * spaces to be merged into writes.  If it's not, only write
-		 * back what the user gives us.
-		 */
-		if (!test_bit(NETFS_ICTX_NEW_CONTENT, &vnode->netfs_ctx.flags) &&
-		    (to < f || from > t))
-			goto flush_conflicting_write;
-	}
-
-	*_page = find_subpage(page, pos / PAGE_SIZE);
-	_leave(" = 0");
-	return 0;
-
-	/* The previous write and this write aren't adjacent or overlapping, so
-	 * flush the page out.
-	 */
-flush_conflicting_write:
-	_debug("flush conflict");
-	ret = write_one_page(page);
-	if (ret < 0)
-		goto error;
-
-	ret = lock_page_killable(page);
-	if (ret < 0)
-		goto error;
-	goto try_again;
-
-error:
-	put_page(page);
-	_leave(" = %d", ret);
-	return ret;
-}
-
-/*
- * Finalise part of a write to a page.  Note that len may extend beyond the end
- * of the page.
- */
-int afs_write_end(struct file *file, struct address_space *mapping,
-		  loff_t pos, unsigned len, unsigned copied,
-		  struct page *subpage, void *fsdata)
-{
-	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
-	struct page *page = thp_head(subpage);
-	unsigned long priv;
-	unsigned int f, from = offset_in_thp(page, pos);
-	unsigned int t, to = from + copied;
-	loff_t i_size, write_end_pos;
-
-	_enter("{%llx:%llu},{%lx}",
-	       vnode->fid.vid, vnode->fid.vnode, page->index);
-
-	len = min_t(size_t, len, thp_size(page) - from);
-	if (!PageUptodate(page)) {
-		if (copied < len) {
-			copied = 0;
-			goto out;
-		}
-
-		SetPageUptodate(page);
-	}
-
-	if (copied == 0)
-		goto out;
-
-	write_end_pos = pos + copied;
-
-	i_size = i_size_read(&vnode->vfs_inode);
-	if (write_end_pos > i_size) {
-		write_seqlock(&vnode->cb_lock);
-		i_size = i_size_read(&vnode->vfs_inode);
-		if (write_end_pos > i_size)
-			i_size_write(&vnode->vfs_inode, write_end_pos);
-		write_sequnlock(&vnode->cb_lock);
-		fscache_update_cookie(afs_vnode_cache(vnode), NULL, &write_end_pos);
-	}
-
-	if (PagePrivate(page)) {
-		priv = page_private(page);
-		f = afs_page_dirty_from(page, priv);
-		t = afs_page_dirty_to(page, priv);
-		if (from < f)
-			f = from;
-		if (to > t)
-			t = to;
-		priv = afs_page_dirty(page, f, t);
-		set_page_private(page, priv);
-		trace_afs_page_dirty(vnode, tracepoint_string("dirty+"), page);
-	} else {
-		priv = afs_page_dirty(page, from, to);
-		attach_page_private(page, (void *)priv);
-		trace_afs_page_dirty(vnode, tracepoint_string("dirty"), page);
-	}
-
-	if (set_page_dirty(page))
-		_debug("dirtied %lx", page->index);
-
-out:
-	unlock_page(page);
-	put_page(page);
-	return copied;
-}
-
 /*
  * kill all the pages in the given range
  */
@@ -812,26 +666,6 @@ int afs_writepages(struct address_space *mapping,
 	return ret;
 }
 
-/*
- * write to an AFS file
- */
-ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct afs_vnode *vnode = AFS_FS_I(file_inode(iocb->ki_filp));
-	size_t count = iov_iter_count(from);
-
-	_enter("{%llx:%llu},{%zu},",
-	       vnode->fid.vid, vnode->fid.vnode, count);
-
-	if (IS_SWAPFILE(&vnode->vfs_inode)) {
-		printk(KERN_INFO
-		       "AFS: Attempt to write to active swap file!\n");
-		return -EBUSY;
-	}
-
-	return generic_file_write_iter(iocb, from);
-}
-
 /*
  * flush any dirty pages for this process, and check for write errors.
  * - the return status from this call provides a reliable indication of
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index c15bfc966d96..3e11453ad2c5 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -1,5 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 
-netfs-y := read_helper.o stats.o
+netfs-y := \
+	objects.o \
+	read_helper.o \
+	write_helper.o
+# dio_helper.o
+
+netfs-$(CONFIG_NETFS_STATS) += stats.o
 
 obj-$(CONFIG_NETFS_SUPPORT) := netfs.o
diff --git a/fs/netfs/dio_helper.c b/fs/netfs/dio_helper.c
new file mode 100644
index 000000000000..3072de344601
--- /dev/null
+++ b/fs/netfs/dio_helper.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Network filesystem high-level DIO support.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/sched/mm.h>
+#include <linux/backing-dev.h>
+#include <linux/task_io_accounting_ops.h>
+#include <linux/netfs.h>
+#include "internal.h"
+#include <trace/events/netfs.h>
+
+/*
+ * Perform a direct I/O write to a netfs server.
+ */
+ssize_t netfs_file_direct_write(struct netfs_dirty_region *region,
+				struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file	*file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct inode	*inode = mapping->host;
+	loff_t		pos = iocb->ki_pos, last;
+	ssize_t		written;
+	size_t		write_len;
+	pgoff_t		end;
+	int		ret;
+
+	write_len = iov_iter_count(from);
+	last = pos + write_len - 1;
+	end = last >> PAGE_SHIFT;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/* If there are pages to writeback, return */
+		if (filemap_range_has_page(file->f_mapping, pos, last))
+			return -EAGAIN;
+	} else {
+		ret = filemap_write_and_wait_range(mapping, pos, last);
+		if (ret)
+			return ret;
+	}
+
+	/* After a write we want buffered reads to be sure to go to disk to get
+	 * the new data.  We invalidate clean cached page from the region we're
+	 * about to write.  We do this *before* the write so that we can return
+	 * without clobbering -EIOCBQUEUED from ->direct_IO().
+	 */
+	ret = invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end);
+	if (ret) {
+		/* If the page can't be invalidated, return 0 to fall back to
+		 * buffered write.
+		 */
+		return ret == -EBUSY ? 0 : ret;
+	}
+
+	written = mapping->a_ops->direct_IO(iocb, from);
+
+	/* Finally, try again to invalidate clean pages which might have been
+	 * cached by non-direct readahead, or faulted in by get_user_pages()
+	 * if the source of the write was an mmap'ed region of the file
+	 * we're writing.  Either one is a pretty crazy thing to do,
+	 * so we don't support it 100%.  If this invalidation
+	 * fails, tough, the write still worked...
+	 *
+	 * Most of the time we do not need this since dio_complete() will do
+	 * the invalidation for us. However there are some file systems that
+	 * do not end up with dio_complete() being called, so let's not break
+	 * them by removing it completely.
+	 *
+	 * Noticeable example is a blkdev_direct_IO().
+	 *
+	 * Skip invalidation for async writes or if mapping has no pages.
+	 */
+	if (written > 0 && mapping->nrpages &&
+	    invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end))
+		dio_warn_stale_pagecache(file);
+
+	if (written > 0) {
+		pos += written;
+		write_len -= written;
+		if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
+			i_size_write(inode, pos);
+			mark_inode_dirty(inode);
+		}
+		iocb->ki_pos = pos;
+	}
+	if (written != -EIOCBQUEUED)
+		iov_iter_revert(from, write_len - iov_iter_count(from));
+out:
+#if 0
+			/*
+		 * If the write stopped short of completing, fall back to
+		 * buffered writes.  Some filesystems do this for writes to
+		 * holes, for example.  For DAX files, a buffered write will
+		 * not succeed (even if it did, DAX does not handle dirty
+		 * page-cache pages correctly).
+		 */
+		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
+			goto out;
+
+		status = netfs_perform_write(region, file, from, pos = iocb->ki_pos);
+		/*
+		 * If generic_perform_write() returned a synchronous error
+		 * then we want to return the number of bytes which were
+		 * direct-written, or the error code if that was zero.  Note
+		 * that this differs from normal direct-io semantics, which
+		 * will return -EFOO even if some bytes were written.
+		 */
+		if (unlikely(status < 0)) {
+			err = status;
+			goto out;
+		}
+		/*
+		 * We need to ensure that the page cache pages are written to
+		 * disk and invalidated to preserve the expected O_DIRECT
+		 * semantics.
+		 */
+		endbyte = pos + status - 1;
+		err = filemap_write_and_wait_range(mapping, pos, endbyte);
+		if (err == 0) {
+			iocb->ki_pos = endbyte + 1;
+			written += status;
+			invalidate_mapping_pages(mapping,
+						 pos >> PAGE_SHIFT,
+						 endbyte >> PAGE_SHIFT);
+		} else {
+			/*
+			 * We don't know how much we wrote, so just return
+			 * the number of bytes which were direct-written
+			 */
+		}
+#endif
+	return written;
+}
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 4805d9fc8808..77ceab694348 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -15,11 +15,41 @@
 
 #define pr_fmt(fmt) "netfs: " fmt
 
+/*
+ * dio_helper.c
+ */
+ssize_t netfs_file_direct_write(struct netfs_dirty_region *region,
+				struct kiocb *iocb, struct iov_iter *from);
+
+/*
+ * objects.c
+ */
+struct netfs_flush_group *netfs_get_flush_group(struct netfs_flush_group *group);
+void netfs_put_flush_group(struct netfs_flush_group *group);
+struct netfs_dirty_region *netfs_alloc_dirty_region(void);
+struct netfs_dirty_region *netfs_get_dirty_region(struct netfs_i_context *ctx,
+						  struct netfs_dirty_region *region,
+						  enum netfs_region_trace what);
+void netfs_free_dirty_region(struct netfs_i_context *ctx, struct netfs_dirty_region *region);
+void netfs_put_dirty_region(struct netfs_i_context *ctx,
+			    struct netfs_dirty_region *region,
+			    enum netfs_region_trace what);
+
 /*
  * read_helper.c
  */
 extern unsigned int netfs_debug;
 
+int netfs_prefetch_for_write(struct file *file, struct page *page, loff_t pos, size_t len,
+			     bool always_fill);
+
+/*
+ * write_helper.c
+ */
+void netfs_flush_region(struct netfs_i_context *ctx,
+			struct netfs_dirty_region *region,
+			enum netfs_dirty_trace why);
+
 /*
  * stats.c
  */
@@ -42,6 +72,8 @@ extern atomic_t netfs_n_rh_write_begin;
 extern atomic_t netfs_n_rh_write_done;
 extern atomic_t netfs_n_rh_write_failed;
 extern atomic_t netfs_n_rh_write_zskip;
+extern atomic_t netfs_n_wh_region;
+extern atomic_t netfs_n_wh_flush_group;
 
 
 static inline void netfs_stat(atomic_t *stat)
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
new file mode 100644
index 000000000000..ba1e052aa352
--- /dev/null
+++ b/fs/netfs/objects.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Object lifetime handling and tracing.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+/**
+ * netfs_new_flush_group - Create a new write flush group
+ * @inode: The inode for which this is a flush group.
+ * @netfs_priv: Netfs private data to include in the new group
+ *
+ * Create a new flush group and add it to the tail of the inode's group list.
+ * Flush groups are used to control the order in which dirty data is written
+ * back to the server.
+ *
+ * The caller must hold ctx->lock.
+ */
+struct netfs_flush_group *netfs_new_flush_group(struct inode *inode, void *netfs_priv)
+{
+	struct netfs_flush_group *group;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (group) {
+		group->netfs_priv = netfs_priv;
+		INIT_LIST_HEAD(&group->region_list);
+		refcount_set(&group->ref, 1);
+		netfs_stat(&netfs_n_wh_flush_group);
+		list_add_tail(&group->group_link, &ctx->flush_groups);
+	}
+	return group;
+}
+EXPORT_SYMBOL(netfs_new_flush_group);
+
+struct netfs_flush_group *netfs_get_flush_group(struct netfs_flush_group *group)
+{
+	refcount_inc(&group->ref);
+	return group;
+}
+
+void netfs_put_flush_group(struct netfs_flush_group *group)
+{
+	if (group && refcount_dec_and_test(&group->ref)) {
+		netfs_stat_d(&netfs_n_wh_flush_group);
+		kfree(group);
+	}
+}
+
+struct netfs_dirty_region *netfs_alloc_dirty_region(void)
+{
+	struct netfs_dirty_region *region;
+
+	region = kzalloc(sizeof(struct netfs_dirty_region), GFP_KERNEL);
+	if (region)
+		netfs_stat(&netfs_n_wh_region);
+	return region;
+}
+
+struct netfs_dirty_region *netfs_get_dirty_region(struct netfs_i_context *ctx,
+						  struct netfs_dirty_region *region,
+						  enum netfs_region_trace what)
+{
+	int ref;
+
+	__refcount_inc(&region->ref, &ref);
+	trace_netfs_ref_region(region->debug_id, ref + 1, what);
+	return region;
+}
+
+void netfs_free_dirty_region(struct netfs_i_context *ctx,
+			     struct netfs_dirty_region *region)
+{
+	if (region) {
+		trace_netfs_ref_region(region->debug_id, 0, netfs_region_trace_free);
+		if (ctx->ops->free_dirty_region)
+			ctx->ops->free_dirty_region(region);
+		netfs_put_flush_group(region->group);
+		netfs_stat_d(&netfs_n_wh_region);
+		kfree(region);
+	}
+}
+
+void netfs_put_dirty_region(struct netfs_i_context *ctx,
+			    struct netfs_dirty_region *region,
+			    enum netfs_region_trace what)
+{
+	bool dead;
+	int ref;
+
+	if (!region)
+		return;
+	dead = __refcount_dec_and_test(&region->ref, &ref);
+	trace_netfs_ref_region(region->debug_id, ref - 1, what);
+	if (dead) {
+		if (!list_empty(&region->active_link) ||
+		    !list_empty(&region->dirty_link)) {
+			spin_lock(&ctx->lock);
+			list_del_init(&region->active_link);
+			list_del_init(&region->dirty_link);
+			spin_unlock(&ctx->lock);
+		}
+		netfs_free_dirty_region(ctx, region);
+	}
+}
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index aa98ecf6df6b..bfcdbbd32f4c 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -1321,3 +1321,97 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	return ret;
 }
 EXPORT_SYMBOL(netfs_write_begin);
+
+/*
+ * Preload the data into a page we're proposing to write into.
+ */
+int netfs_prefetch_for_write(struct file *file, struct page *page,
+			     loff_t pos, size_t len, bool always_fill)
+{
+	struct address_space *mapping = page_file_mapping(page);
+	struct netfs_read_request *rreq;
+	struct netfs_i_context *ctx = netfs_i_context(mapping->host);
+	struct page *xpage;
+	unsigned int debug_index = 0;
+	int ret;
+
+	DEFINE_READAHEAD(ractl, file, NULL, mapping, page_index(page));
+
+	/* If the page is beyond the EOF, we want to clear it - unless it's
+	 * within the cache granule containing the EOF, in which case we need
+	 * to preload the granule.
+	 */
+	if (!netfs_is_cache_enabled(mapping->host)) {
+		if (netfs_skip_page_read(page, pos, len, always_fill)) {
+			netfs_stat(&netfs_n_rh_write_zskip);
+			ret = 0;
+			goto error;
+		}
+	}
+
+	ret = -ENOMEM;
+	rreq = netfs_alloc_read_request(mapping, file);
+	if (!rreq)
+		goto error;
+	rreq->start		= page_offset(page);
+	rreq->len		= thp_size(page);
+	rreq->no_unlock_page	= page_index(page);
+	__set_bit(NETFS_RREQ_NO_UNLOCK_PAGE, &rreq->flags);
+
+	if (ctx->ops->begin_cache_operation) {
+		ret = ctx->ops->begin_cache_operation(rreq);
+		if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS)
+			goto error_put;
+	}
+
+	netfs_stat(&netfs_n_rh_write_begin);
+	trace_netfs_read(rreq, pos, len, netfs_read_trace_prefetch_for_write);
+
+	/* Expand the request to meet caching requirements and download
+	 * preferences.
+	 */
+	ractl._nr_pages = thp_nr_pages(page);
+	netfs_rreq_expand(rreq, &ractl);
+
+	/* Set up the output buffer */
+	ret = netfs_rreq_set_up_buffer(rreq, &ractl, page,
+				       readahead_index(&ractl), readahead_count(&ractl));
+	if (ret < 0) {
+		while ((xpage = readahead_page(&ractl)))
+			if (xpage != page)
+				put_page(xpage);
+		goto error_put;
+	}
+
+	netfs_get_read_request(rreq);
+	atomic_set(&rreq->nr_rd_ops, 1);
+	do {
+		if (!netfs_rreq_submit_slice(rreq, &debug_index))
+			break;
+
+	} while (rreq->submitted < rreq->len);
+
+	/* Keep nr_rd_ops incremented so that the ref always belongs to us, and
+	 * the service code isn't punted off to a random thread pool to
+	 * process.
+	 */
+	for (;;) {
+		wait_var_event(&rreq->nr_rd_ops, atomic_read(&rreq->nr_rd_ops) == 1);
+		netfs_rreq_assess(rreq, false);
+		if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags))
+			break;
+		cond_resched();
+	}
+
+	ret = rreq->error;
+	if (ret == 0 && rreq->submitted < rreq->len) {
+		trace_netfs_failure(rreq, NULL, ret, netfs_fail_short_write_begin);
+		ret = -EIO;
+	}
+
+error_put:
+	netfs_put_read_request(rreq, false);
+error:
+	_leave(" = %d", ret);
+	return ret;
+}
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 5510a7a14a40..7c079ca47b5b 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -27,6 +27,8 @@ atomic_t netfs_n_rh_write_begin;
 atomic_t netfs_n_rh_write_done;
 atomic_t netfs_n_rh_write_failed;
 atomic_t netfs_n_rh_write_zskip;
+atomic_t netfs_n_wh_region;
+atomic_t netfs_n_wh_flush_group;
 
 void netfs_stats_show(struct seq_file *m)
 {
@@ -54,5 +56,8 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_rh_write),
 		   atomic_read(&netfs_n_rh_write_done),
 		   atomic_read(&netfs_n_rh_write_failed));
+	seq_printf(m, "WrHelp : R=%u F=%u\n",
+		   atomic_read(&netfs_n_wh_region),
+		   atomic_read(&netfs_n_wh_flush_group));
 }
 EXPORT_SYMBOL(netfs_stats_show);
diff --git a/fs/netfs/write_helper.c b/fs/netfs/write_helper.c
new file mode 100644
index 000000000000..a8c58eaa84d0
--- /dev/null
+++ b/fs/netfs/write_helper.c
@@ -0,0 +1,908 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Network filesystem high-level write support.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+static atomic_t netfs_region_debug_ids;
+
+static bool __overlaps(loff_t start1, loff_t end1, loff_t start2, loff_t end2)
+{
+	return (start1 < start2) ? end1 > start2 : end2 > start1;
+}
+
+static bool overlaps(struct netfs_range *a, struct netfs_range *b)
+{
+	return __overlaps(a->start, a->end, b->start, b->end);
+}
+
+static int wait_on_region(struct netfs_dirty_region *region,
+			  enum netfs_region_state state)
+{
+	return wait_var_event_interruptible(&region->state,
+					    READ_ONCE(region->state) >= state);
+}
+
+/*
+ * Grab a page for writing.  We don't lock it at this point as we have yet to
+ * preemptively trigger a fault-in - but we need to know how large the page
+ * will be before we try that.
+ */
+static struct page *netfs_grab_page_for_write(struct address_space *mapping,
+					      loff_t pos, size_t len_remaining)
+{
+	struct page *page;
+	int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT;
+
+	page = pagecache_get_page(mapping, pos >> PAGE_SHIFT, fgp_flags,
+				  mapping_gfp_mask(mapping));
+	if (!page)
+		return ERR_PTR(-ENOMEM);
+	wait_for_stable_page(page);
+	return page;
+}
+
+/*
+ * Initialise a new dirty page group.  The caller is responsible for setting
+ * the type and any flags that they want.
+ */
+static void netfs_init_dirty_region(struct netfs_dirty_region *region,
+				    struct inode *inode, struct file *file,
+				    enum netfs_region_type type,
+				    unsigned long flags,
+				    loff_t start, loff_t end)
+{
+	struct netfs_flush_group *group;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+
+	region->state		= NETFS_REGION_IS_PENDING;
+	region->type		= type;
+	region->flags		= flags;
+	region->reserved.start	= start;
+	region->reserved.end	= end;
+	region->dirty.start	= start;
+	region->dirty.end	= start;
+	region->bounds.start	= round_down(start, ctx->bsize);
+	region->bounds.end	= round_up(end, ctx->bsize);
+	region->i_size		= i_size_read(inode);
+	region->debug_id	= atomic_inc_return(&netfs_region_debug_ids);
+	INIT_LIST_HEAD(&region->active_link);
+	INIT_LIST_HEAD(&region->dirty_link);
+	INIT_LIST_HEAD(&region->flush_link);
+	refcount_set(&region->ref, 1);
+	spin_lock_init(&region->lock);
+	if (file && ctx->ops->init_dirty_region)
+		ctx->ops->init_dirty_region(region, file);
+	if (!region->group) {
+		group = list_last_entry(&ctx->flush_groups,
+					struct netfs_flush_group, group_link);
+		region->group = netfs_get_flush_group(group);
+		list_add_tail(&region->flush_link, &group->region_list);
+	}
+	trace_netfs_ref_region(region->debug_id, 1, netfs_region_trace_new);
+	trace_netfs_dirty(ctx, region, NULL, netfs_dirty_trace_new);
+}
+
+/*
+ * Queue a region for flushing.  Regions may need to be flushed in the right
+ * order (e.g. ceph snaps) and so we may need to chuck other regions onto the
+ * flush queue first.
+ *
+ * The caller must hold ctx->lock.
+ */
+void netfs_flush_region(struct netfs_i_context *ctx,
+			struct netfs_dirty_region *region,
+			enum netfs_dirty_trace why)
+{
+	struct netfs_flush_group *group;
+
+	LIST_HEAD(flush_queue);
+
+	kenter("%x", region->debug_id);
+
+	if (test_bit(NETFS_REGION_FLUSH_Q, &region->flags) ||
+	    region->group->flush)
+		return;
+
+	trace_netfs_dirty(ctx, region, NULL, why);
+
+	/* If the region isn't in the bottom flush group, we need to flush out
+	 * all of the flush groups below it.
+	 */
+	while (!list_is_first(&region->group->group_link, &ctx->flush_groups)) {
+		group = list_first_entry(&ctx->flush_groups,
+					 struct netfs_flush_group, group_link);
+		group->flush = true;
+		list_del_init(&group->group_link);
+		list_splice_tail_init(&group->region_list, &ctx->flush_queue);
+		netfs_put_flush_group(group);
+	}
+
+	set_bit(NETFS_REGION_FLUSH_Q, &region->flags);
+	list_move_tail(&region->flush_link, &ctx->flush_queue);
+}
+
+/*
+ * Decide if/how a write can be merged with a dirty region.
+ */
+static enum netfs_write_compatibility netfs_write_compatibility(
+	struct netfs_i_context *ctx,
+	struct netfs_dirty_region *old,
+	struct netfs_dirty_region *candidate)
+{
+	if (old->type == NETFS_REGION_DIO ||
+	    old->type == NETFS_REGION_DSYNC ||
+	    old->state >= NETFS_REGION_IS_FLUSHING ||
+	    /* The bounding boxes of DSYNC writes can overlap with those of
+	     * other DSYNC writes and ordinary writes.
+	     */
+	    candidate->group != old->group ||
+	    old->group->flush)
+		return NETFS_WRITES_INCOMPATIBLE;
+	if (!ctx->ops->is_write_compatible) {
+		if (candidate->type == NETFS_REGION_DSYNC)
+			return NETFS_WRITES_SUPERSEDE;
+		return NETFS_WRITES_COMPATIBLE;
+	}
+	return ctx->ops->is_write_compatible(ctx, old, candidate);
+}
+
+/*
+ * Split a dirty region.
+ */
+static struct netfs_dirty_region *netfs_split_dirty_region(
+	struct netfs_i_context *ctx,
+	struct netfs_dirty_region *region,
+	struct netfs_dirty_region **spare,
+	unsigned long long pos)
+{
+	struct netfs_dirty_region *tail = *spare;
+
+	*spare = NULL;
+	*tail = *region;
+	region->dirty.end = pos;
+	tail->dirty.start = pos;
+	tail->debug_id = atomic_inc_return(&netfs_region_debug_ids);
+
+	refcount_set(&tail->ref, 1);
+	INIT_LIST_HEAD(&tail->active_link);
+	netfs_get_flush_group(tail->group);
+	spin_lock_init(&tail->lock);
+	// TODO: grab cache resources
+
+	// need to split the bounding box?
+	__set_bit(NETFS_REGION_SUPERSEDED, &tail->flags);
+	if (ctx->ops->split_dirty_region)
+		ctx->ops->split_dirty_region(tail);
+	list_add(&tail->dirty_link, &region->dirty_link);
+	list_add(&tail->flush_link, &region->flush_link);
+	trace_netfs_dirty(ctx, tail, region, netfs_dirty_trace_split);
+	return tail;
+}
+
+/*
+ * Queue a write for access to the pagecache.  The caller must hold ctx->lock.
+ * The NETFS_REGION_PENDING flag will be cleared when it's possible to proceed.
+ */
+static void netfs_queue_write(struct netfs_i_context *ctx,
+			      struct netfs_dirty_region *candidate)
+{
+	struct netfs_dirty_region *r;
+	struct list_head *p;
+
+	/* We must wait for any overlapping pending writes */
+	list_for_each_entry(r, &ctx->pending_writes, active_link) {
+		if (overlaps(&candidate->bounds, &r->bounds)) {
+			if (overlaps(&candidate->reserved, &r->reserved) ||
+			    netfs_write_compatibility(ctx, r, candidate) ==
+			    NETFS_WRITES_INCOMPATIBLE)
+				goto add_to_pending_queue;
+		}
+	}
+
+	/* We mustn't let the request overlap with the reservation of any other
+	 * active writes, though it can overlap with a bounding box if the
+	 * writes are compatible.
+	 */
+	list_for_each(p, &ctx->active_writes) {
+		r = list_entry(p, struct netfs_dirty_region, active_link);
+		if (r->bounds.end <= candidate->bounds.start)
+			continue;
+		if (r->bounds.start >= candidate->bounds.end)
+			break;
+		if (overlaps(&candidate->bounds, &r->bounds)) {
+			if (overlaps(&candidate->reserved, &r->reserved) ||
+			    netfs_write_compatibility(ctx, r, candidate) ==
+			    NETFS_WRITES_INCOMPATIBLE)
+				goto add_to_pending_queue;
+		}
+	}
+
+	/* We can install the record in the active list to reserve our slot */
+	list_add(&candidate->active_link, p);
+
+	/* Okay, we've reserved our slot in the active queue */
+	smp_store_release(&candidate->state, NETFS_REGION_IS_RESERVED);
+	trace_netfs_dirty(ctx, candidate, NULL, netfs_dirty_trace_reserved);
+	wake_up_var(&candidate->state);
+	kleave(" [go]");
+	return;
+
+add_to_pending_queue:
+	/* We get added to the pending list and then we have to wait */
+	list_add(&candidate->active_link, &ctx->pending_writes);
+	trace_netfs_dirty(ctx, candidate, NULL, netfs_dirty_trace_wait_pend);
+	kleave(" [wait pend]");
+}
+
+/*
+ * Make sure there's a flush group.
+ */
+static int netfs_require_flush_group(struct inode *inode)
+{
+	struct netfs_flush_group *group;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+
+	if (list_empty(&ctx->flush_groups)) {
+		kdebug("new flush group");
+		group = netfs_new_flush_group(inode, NULL);
+		if (!group)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+/*
+ * Create a dirty region record for the write we're about to do and add it to
+ * the list of regions.  We may need to wait for conflicting writes to
+ * complete.
+ */
+static struct netfs_dirty_region *netfs_prepare_region(struct inode *inode,
+						       struct file *file,
+						       loff_t start, size_t len,
+						       enum netfs_region_type type,
+						       unsigned long flags)
+{
+	struct netfs_dirty_region *candidate;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+	loff_t end = start + len;
+	int ret;
+
+	ret = netfs_require_flush_group(inode);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	candidate = netfs_alloc_dirty_region();
+	if (!candidate)
+		return ERR_PTR(-ENOMEM);
+
+	netfs_init_dirty_region(candidate, inode, file, type, flags, start, end);
+
+	spin_lock(&ctx->lock);
+	netfs_queue_write(ctx, candidate);
+	spin_unlock(&ctx->lock);
+	return candidate;
+}
+
+/*
+ * Activate a write.  This adds it to the dirty list and does any necessary
+ * flushing and superceding there.  The caller must provide a spare region
+ * record so that we can split a dirty record if we need to supersede it.
+ */
+static void __netfs_activate_write(struct netfs_i_context *ctx,
+				   struct netfs_dirty_region *candidate,
+				   struct netfs_dirty_region **spare)
+{
+	struct netfs_dirty_region *r;
+	struct list_head *p;
+	enum netfs_write_compatibility comp;
+	bool conflicts = false;
+
+	/* See if there are any dirty regions that need flushing first. */
+	list_for_each(p, &ctx->dirty_regions) {
+		r = list_entry(p, struct netfs_dirty_region, dirty_link);
+		if (r->bounds.end <= candidate->bounds.start)
+			continue;
+		if (r->bounds.start >= candidate->bounds.end)
+			break;
+
+		if (list_empty(&candidate->dirty_link) &&
+		    r->dirty.start > candidate->dirty.start)
+			list_add_tail(&candidate->dirty_link, p);
+
+		comp = netfs_write_compatibility(ctx, r, candidate);
+		switch (comp) {
+		case NETFS_WRITES_INCOMPATIBLE:
+			netfs_flush_region(ctx, r, netfs_dirty_trace_flush_conflict);
+			conflicts = true;
+			continue;
+
+		case NETFS_WRITES_SUPERSEDE:
+			if (!overlaps(&candidate->reserved, &r->dirty))
+				continue;
+			if (r->dirty.start < candidate->dirty.start) {
+				/* The region overlaps the beginning of our
+				 * region, we split it and mark the overlapping
+				 * part as superseded.  We insert ourself
+				 * between.
+				 */
+				r = netfs_split_dirty_region(ctx, r, spare,
+							     candidate->reserved.start);
+				list_add_tail(&candidate->dirty_link, &r->dirty_link);
+				p = &r->dirty_link; /* Advance the for-loop */
+			} else  {
+				/* The region is after ours, so make sure we're
+				 * inserted before it.
+				 */
+				if (list_empty(&candidate->dirty_link))
+					list_add_tail(&candidate->dirty_link, &r->dirty_link);
+				set_bit(NETFS_REGION_SUPERSEDED, &r->flags);
+				trace_netfs_dirty(ctx, candidate, r, netfs_dirty_trace_supersedes);
+			}
+			continue;
+
+		case NETFS_WRITES_COMPATIBLE:
+			continue;
+		}
+	}
+
+	if (list_empty(&candidate->dirty_link))
+		list_add_tail(&candidate->dirty_link, p);
+	netfs_get_dirty_region(ctx, candidate, netfs_region_trace_get_dirty);
+
+	if (conflicts) {
+		/* The caller must wait for the flushes to complete. */
+		trace_netfs_dirty(ctx, candidate, NULL, netfs_dirty_trace_wait_active);
+		kleave(" [wait flush]");
+		return;
+	}
+
+	/* Okay, we're cleared to proceed. */
+	smp_store_release(&candidate->state, NETFS_REGION_IS_ACTIVE);
+	trace_netfs_dirty(ctx, candidate, NULL, netfs_dirty_trace_active);
+	wake_up_var(&candidate->state);
+	kleave(" [go]");
+	return;
+}
+
+static int netfs_activate_write(struct netfs_i_context *ctx,
+				struct netfs_dirty_region *region)
+{
+	struct netfs_dirty_region *spare;
+
+	spare = netfs_alloc_dirty_region();
+	if (!spare)
+		return -ENOMEM;
+
+	spin_lock(&ctx->lock);
+	__netfs_activate_write(ctx, region, &spare);
+	spin_unlock(&ctx->lock);
+	netfs_free_dirty_region(ctx, spare);
+	return 0;
+}
+
+/*
+ * Merge a completed active write into the list of dirty regions.  The region
+ * can be in one of a number of states:
+ *
+ * - Ordinary write, error, no data copied.		Discard.
+ * - Ordinary write, unflushed.				Dirty
+ * - Ordinary write, flush started.			Dirty
+ * - Ordinary write, completed/failed.			Discard.
+ * - DIO write,      completed/failed.			Discard.
+ * - DSYNC write, error before flush.			As ordinary.
+ * - DSYNC write, flushed in progress, EINTR.		Dirty (supersede).
+ * - DSYNC write, written to server and cache.		Dirty (supersede)/Discard.
+ * - DSYNC write, written to server but not yet cache.	Dirty.
+ *
+ * Once we've dealt with this record, we see about activating some other writes
+ * to fill the activity hole.
+ *
+ * This eats the caller's ref on the region.
+ */
+static void netfs_merge_dirty_region(struct netfs_i_context *ctx,
+				     struct netfs_dirty_region *region)
+{
+	struct netfs_dirty_region *p, *q, *front;
+	bool new_content = test_bit(NETFS_ICTX_NEW_CONTENT, &ctx->flags);
+	LIST_HEAD(graveyard);
+
+	list_del_init(&region->active_link);
+
+	switch (region->type) {
+	case NETFS_REGION_DIO:
+		list_move_tail(&region->dirty_link, &graveyard);
+		goto discard;
+
+	case NETFS_REGION_DSYNC:
+		/* A DSYNC write may have overwritten some dirty data
+		 * and caused the writeback of other dirty data.
+		 */
+		goto scan_forwards;
+
+	case NETFS_REGION_ORDINARY:
+		if (region->dirty.end == region->dirty.start) {
+			list_move_tail(&region->dirty_link, &graveyard);
+			goto discard;
+		}
+		goto scan_backwards;
+	}
+
+scan_backwards:
+	kdebug("scan_backwards");
+	/* Search backwards for a preceding record that we might be able to
+	 * merge with.  We skip over any intervening flush-in-progress records.
+	 */
+	p = front = region;
+	list_for_each_entry_continue_reverse(p, &ctx->dirty_regions, dirty_link) {
+		kdebug("- back %x", p->debug_id);
+		if (p->state >= NETFS_REGION_IS_FLUSHING)
+			continue;
+		if (p->state == NETFS_REGION_IS_ACTIVE)
+			break;
+		if (p->bounds.end < region->bounds.start)
+			break;
+		if (p->dirty.end >= region->dirty.start || new_content)
+			goto merge_backwards;
+	}
+	goto scan_forwards;
+
+merge_backwards:
+	kdebug("merge_backwards");
+	if (test_bit(NETFS_REGION_SUPERSEDED, &p->flags) ||
+	    netfs_write_compatibility(ctx, p, region) != NETFS_WRITES_COMPATIBLE)
+		goto scan_forwards;
+
+	front = p;
+	front->bounds.end = max(front->bounds.end, region->bounds.end);
+	front->dirty.end  = max(front->dirty.end,  region->dirty.end);
+	set_bit(NETFS_REGION_SUPERSEDED, &region->flags);
+	list_del_init(&region->flush_link);
+	trace_netfs_dirty(ctx, front, region, netfs_dirty_trace_merged_back);
+
+scan_forwards:
+	/* Subsume forwards any records this one covers.  There should be no
+	 * non-supersedeable incompatible regions in our range as we would have
+	 * flushed and waited for them before permitting this write to start.
+	 *
+	 * There can, however, be regions undergoing flushing which we need to
+	 * skip over and not merge with.
+	 */
+	kdebug("scan_forwards");
+	p = region;
+	list_for_each_entry_safe_continue(p, q, &ctx->dirty_regions, dirty_link) {
+		kdebug("- forw %x", p->debug_id);
+		if (p->state >= NETFS_REGION_IS_FLUSHING)
+			continue;
+		if (p->state == NETFS_REGION_IS_ACTIVE)
+			break;
+		if (p->dirty.start > region->dirty.end &&
+		    (!new_content || p->bounds.start > region->bounds.end))
+			break;
+
+		if (region->dirty.end >= p->dirty.end) {
+			/* Entirely subsumed */
+			list_move_tail(&p->dirty_link, &graveyard);
+			list_del_init(&p->flush_link);
+			trace_netfs_dirty(ctx, front, p, netfs_dirty_trace_merged_sub);
+			continue;
+		}
+
+		goto merge_forwards;
+	}
+	goto merge_complete;
+
+merge_forwards:
+	kdebug("merge_forwards");
+	if (test_bit(NETFS_REGION_SUPERSEDED, &p->flags) ||
+	    netfs_write_compatibility(ctx, p, front) == NETFS_WRITES_SUPERSEDE) {
+		/* If a region was partially superseded by us, we need to roll
+		 * it forwards and remove the superseded flag.
+		 */
+		if (p->dirty.start < front->dirty.end) {
+			p->dirty.start = front->dirty.end;
+			clear_bit(NETFS_REGION_SUPERSEDED, &p->flags);
+		}
+		trace_netfs_dirty(ctx, p, front, netfs_dirty_trace_superseded);
+		goto merge_complete;
+	}
+
+	/* Simply merge overlapping/contiguous ordinary areas together. */
+	front->bounds.end = max(front->bounds.end, p->bounds.end);
+	front->dirty.end  = max(front->dirty.end,  p->dirty.end);
+	list_move_tail(&p->dirty_link, &graveyard);
+	list_del_init(&p->flush_link);
+	trace_netfs_dirty(ctx, front, p, netfs_dirty_trace_merged_forw);
+
+merge_complete:
+	if (test_bit(NETFS_REGION_SUPERSEDED, &region->flags)) {
+		list_move_tail(&region->dirty_link, &graveyard);
+	}
+discard:
+	while (!list_empty(&graveyard)) {
+		p = list_first_entry(&graveyard, struct netfs_dirty_region, dirty_link);
+		list_del_init(&p->dirty_link);
+		smp_store_release(&p->state, NETFS_REGION_IS_COMPLETE);
+		trace_netfs_dirty(ctx, p, NULL, netfs_dirty_trace_complete);
+		wake_up_var(&p->state);
+		netfs_put_dirty_region(ctx, p, netfs_region_trace_put_merged);
+	}
+}
+
+/*
+ * Start pending writes in a window we've created by the removal of an active
+ * write.  The writes are bundled onto the given queue and it's left as an
+ * exercise for the caller to actually start them.
+ */
+static void netfs_start_pending_writes(struct netfs_i_context *ctx,
+				       struct list_head *prev_p,
+				       struct list_head *queue)
+{
+	struct netfs_dirty_region *prev = NULL, *next = NULL, *p, *q;
+	struct netfs_range window = { 0, ULLONG_MAX };
+
+	if (prev_p != &ctx->active_writes) {
+		prev = list_entry(prev_p, struct netfs_dirty_region, active_link);
+		window.start = prev->reserved.end;
+		if (!list_is_last(prev_p, &ctx->active_writes)) {
+			next = list_next_entry(prev, active_link);
+			window.end = next->reserved.start;
+		}
+	} else if (!list_empty(&ctx->active_writes)) {
+		next = list_last_entry(&ctx->active_writes,
+				       struct netfs_dirty_region, active_link);
+		window.end = next->reserved.start;
+	}
+
+	list_for_each_entry_safe(p, q, &ctx->pending_writes, active_link) {
+		bool skip = false;
+
+		if (!overlaps(&p->reserved, &window))
+			continue;
+
+		/* Narrow the window when we find a region that requires more
+		 * than we can immediately provide.  The queue is in submission
+		 * order and we need to prevent starvation.
+		 */
+		if (p->type == NETFS_REGION_DIO) {
+			if (p->bounds.start < window.start) {
+				window.start = p->bounds.start;
+				skip = true;
+			}
+			if (p->bounds.end > window.end) {
+				window.end = p->bounds.end;
+				skip = true;
+			}
+		} else {
+			if (p->reserved.start < window.start) {
+				window.start = p->reserved.start;
+				skip = true;
+			}
+			if (p->reserved.end > window.end) {
+				window.end = p->reserved.end;
+				skip = true;
+			}
+		}
+		if (window.start >= window.end)
+			break;
+		if (skip)
+			continue;
+
+		/* Okay, we have a gap that's large enough to start this write
+		 * in.  Make sure it's compatible with any region its bounds
+		 * overlap.
+		 */
+		if (prev &&
+		    p->bounds.start < prev->bounds.end &&
+		    netfs_write_compatibility(ctx, prev, p) == NETFS_WRITES_INCOMPATIBLE) {
+			window.start = max(window.start, p->bounds.end);
+			skip = true;
+		}
+
+		if (next &&
+		    p->bounds.end > next->bounds.start &&
+		    netfs_write_compatibility(ctx, next, p) == NETFS_WRITES_INCOMPATIBLE) {
+			window.end = min(window.end, p->bounds.start);
+			skip = true;
+		}
+		if (window.start >= window.end)
+			break;
+		if (skip)
+			continue;
+
+		/* Okay, we can start this write. */
+		trace_netfs_dirty(ctx, p, NULL, netfs_dirty_trace_start_pending);
+		list_move(&p->active_link,
+			  prev ? &prev->active_link : &ctx->active_writes);
+		list_add_tail(&p->dirty_link, queue);
+		if (p->type == NETFS_REGION_DIO)
+			window.start = p->bounds.end;
+		else
+			window.start = p->reserved.end;
+		prev = p;
+	}
+}
+
+/*
+ * We completed the modification phase of a write.  We need to fix up the dirty
+ * list, remove this region from the active list and start waiters.
+ */
+static void netfs_commit_write(struct netfs_i_context *ctx,
+			       struct netfs_dirty_region *region)
+{
+	struct netfs_dirty_region *p;
+	struct list_head *prev;
+	LIST_HEAD(queue);
+
+	spin_lock(&ctx->lock);
+	smp_store_release(&region->state, NETFS_REGION_IS_DIRTY);
+	trace_netfs_dirty(ctx, region, NULL, netfs_dirty_trace_commit);
+	wake_up_var(&region->state);
+
+	prev = region->active_link.prev;
+	netfs_merge_dirty_region(ctx, region);
+	if (!list_empty(&ctx->pending_writes))
+		netfs_start_pending_writes(ctx, prev, &queue);
+	spin_unlock(&ctx->lock);
+
+	while (!list_empty(&queue)) {
+		p = list_first_entry(&queue, struct netfs_dirty_region, dirty_link);
+		list_del_init(&p->dirty_link);
+		smp_store_release(&p->state, NETFS_REGION_IS_DIRTY);
+		wake_up_var(&p->state);
+	}
+}
+
+/*
+ * Write data into a prereserved region of the pagecache attached to a netfs
+ * inode.
+ */
+static ssize_t netfs_perform_write(struct netfs_dirty_region *region,
+				   struct kiocb *iocb, struct iov_iter *i)
+{
+	struct file *file = iocb->ki_filp;
+	struct netfs_i_context *ctx = netfs_i_context(file_inode(file));
+	struct page *page;
+	ssize_t written = 0, ret;
+	loff_t new_pos, i_size;
+	bool always_fill = false;
+
+	do {
+		size_t plen;
+		size_t offset;	/* Offset into pagecache page */
+		size_t bytes;	/* Bytes to write to page */
+		size_t copied;	/* Bytes copied from user */
+		bool relock = false;
+
+		page = netfs_grab_page_for_write(file->f_mapping, region->dirty.end,
+						 iov_iter_count(i));
+		if (!page)
+			return -ENOMEM;
+
+		plen = thp_size(page);
+		offset = region->dirty.end - page_file_offset(page);
+		bytes = min_t(size_t, plen - offset, iov_iter_count(i));
+
+		kdebug("segment %zx @%zx", bytes, offset);
+
+		if (!PageUptodate(page)) {
+			unlock_page(page); /* Avoid deadlocking fault-in */
+			relock = true;
+		}
+
+		/* Bring in the user page that we will copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 *
+		 * Not only is this an optimisation, but it is also required
+		 * to check that the address is actually valid, when atomic
+		 * usercopies are used, below.
+		 */
+		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+			kdebug("fault-in");
+			ret = -EFAULT;
+			goto error_page;
+		}
+
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto error_page;
+		}
+
+		if (relock) {
+			ret = lock_page_killable(page);
+			if (ret < 0)
+				goto error_page;
+		}
+
+redo_prefetch:
+		/* Prefetch area to be written into the cache if we're caching
+		 * this file.  We need to do this before we get a lock on the
+		 * page in case there's more than one writer competing for the
+		 * same cache block.
+		 */
+		if (!PageUptodate(page)) {
+			ret = netfs_prefetch_for_write(file, page, region->dirty.end,
+						       bytes, always_fill);
+			kdebug("prefetch %zx", ret);
+			if (ret < 0)
+				goto error_page;
+		}
+
+		if (mapping_writably_mapped(page->mapping))
+			flush_dcache_page(page);
+		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
+		flush_dcache_page(page);
+		kdebug("copied %zx", copied);
+
+		/*  Deal with a (partially) failed copy */
+		if (!PageUptodate(page)) {
+			if (copied == 0) {
+				ret = -EFAULT;
+				goto error_page;
+			}
+			if (copied < bytes) {
+				iov_iter_revert(i, copied);
+				always_fill = true;
+				goto redo_prefetch;
+			}
+			SetPageUptodate(page);
+		}
+
+		/* Update the inode size if we moved the EOF marker */
+		new_pos = region->dirty.end + copied;
+		i_size = i_size_read(file_inode(file));
+		if (new_pos > i_size) {
+			if (ctx->ops->update_i_size) {
+				ctx->ops->update_i_size(file, new_pos);
+			} else {
+				i_size_write(file_inode(file), new_pos);
+				fscache_update_cookie(ctx->cache, NULL, &new_pos);
+			}
+		}
+
+		/* Update the region appropriately */
+		if (i_size > region->i_size)
+			region->i_size = i_size;
+		smp_store_release(&region->dirty.end, new_pos);
+
+		trace_netfs_dirty(ctx, region, NULL, netfs_dirty_trace_modified);
+		set_page_dirty(page);
+		unlock_page(page);
+		put_page(page);
+		page = NULL;
+
+		cond_resched();
+
+		written += copied;
+
+		balance_dirty_pages_ratelimited(file->f_mapping);
+	} while (iov_iter_count(i));
+
+out:
+	if (likely(written)) {
+		kdebug("written");
+		iocb->ki_pos += written;
+
+		/* Flush and wait for a write that requires immediate synchronisation. */
+		if (region->type == NETFS_REGION_DSYNC) {
+			kdebug("dsync");
+			spin_lock(&ctx->lock);
+			netfs_flush_region(ctx, region, netfs_dirty_trace_flush_dsync);
+			spin_unlock(&ctx->lock);
+
+			ret = wait_on_region(region, NETFS_REGION_IS_COMPLETE);
+			if (ret < 0)
+				written = ret;
+		}
+	}
+
+	netfs_commit_write(ctx, region);
+	return written ? written : ret;
+
+error_page:
+	unlock_page(page);
+	put_page(page);
+	goto out;
+}
+
+/**
+ * netfs_file_write_iter - write data to a file
+ * @iocb:	IO state structure
+ * @from:	iov_iter with data to write
+ *
+ * This is intended to be called by a network filesystem as its ->write_iter()
+ * method.  It sets up a dirty region to cover the write, performs the write
+ * through the netfs helpers and handles inode locking, O_SYNC and O_DSYNC.
+ * Return:
+ * * negative error code if no data has been written at all or if a
+ *   synchronous flush failed for an O_DSYNC write
+ * * number of bytes written, even for truncated writes
+ */
+ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct netfs_dirty_region *region = NULL;
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+	enum netfs_region_type type;
+	unsigned long flags = 0;
+	ssize_t ret;
+
+	printk("\n");
+	kenter("%llx,%zx,%llx", iocb->ki_pos, iov_iter_count(from), i_size_read(inode));
+
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto error_unlock;
+
+	if (iocb->ki_flags & IOCB_DIRECT)
+		type = NETFS_REGION_DIO;
+	else if (iocb->ki_flags & IOCB_DSYNC)
+		type = NETFS_REGION_DSYNC;
+	else
+		type = NETFS_REGION_ORDINARY;
+	if (iocb->ki_flags & IOCB_SYNC)
+		__set_bit(NETFS_REGION_SYNC, &flags);
+
+	region = netfs_prepare_region(inode, file, iocb->ki_pos,
+				      iov_iter_count(from), type, flags);
+	if (IS_ERR(region)) {
+		ret = PTR_ERR(region);
+		goto error_unlock;
+	}
+
+	trace_netfs_write_iter(region, iocb, from);
+
+	/* We can write back this queue in page reclaim */
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = file_remove_privs(file);
+	if (ret)
+		goto error_unlock;
+
+	ret = file_update_time(file);
+	if (ret)
+		goto error_unlock;
+
+	inode_unlock(inode);
+
+	ret = wait_on_region(region, NETFS_REGION_IS_RESERVED);
+	if (ret < 0)
+		goto error;
+
+	ret = netfs_activate_write(ctx, region);
+	if (ret < 0)
+		goto error;
+
+	/* The region excludes overlapping writes and is used to synchronise
+	 * versus flushes.
+	 */
+	if (iocb->ki_flags & IOCB_DIRECT)
+		ret = -EOPNOTSUPP; //netfs_file_direct_write(region, iocb, from);
+	else
+		ret = netfs_perform_write(region, iocb, from);
+
+out:
+	netfs_put_dirty_region(ctx, region, netfs_region_trace_put_write_iter);
+	current->backing_dev_info = NULL;
+	return ret;
+
+error_unlock:
+	inode_unlock(inode);
+error:
+	if (region)
+		netfs_commit_write(ctx, region);
+	goto out;
+}
+EXPORT_SYMBOL(netfs_file_write_iter);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 35bcd916c3a0..fc91711d3178 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -165,17 +165,95 @@ struct netfs_read_request {
  */
 struct netfs_i_context {
 	const struct netfs_request_ops *ops;
+	struct list_head	pending_writes;	/* List of writes waiting to begin */
+	struct list_head	active_writes;	/* List of writes being applied */
+	struct list_head	dirty_regions;	/* List of dirty regions in the pagecache */
+	struct list_head	flush_groups;	/* Writeable region ordering queue */
+	struct list_head	flush_queue;	/* Regions that need to be flushed */
 #ifdef CONFIG_FSCACHE
 	struct fscache_cookie	*cache;
 #endif
 	unsigned long		flags;
 #define NETFS_ICTX_NEW_CONTENT	0		/* Set if file has new content (create/trunc-0) */
+	spinlock_t		lock;
+	unsigned int		rsize;		/* Maximum read size */
+	unsigned int		wsize;		/* Maximum write size */
+	unsigned int		bsize;		/* Min block size for bounding box */
+	unsigned int		inval_counter;	/* Number of invalidations made */
+};
+
+/*
+ * Descriptor for a set of writes that will need to be flushed together.
+ */
+struct netfs_flush_group {
+	struct list_head	group_link;	/* Link in i_context->flush_groups */
+	struct list_head	region_list;	/* List of regions in this group */
+	void			*netfs_priv;
+	refcount_t		ref;
+	bool			flush;
+};
+
+struct netfs_range {
+	unsigned long long	start;		/* Start of region */
+	unsigned long long	end;		/* End of region */
+};
+
+/* State of a netfs_dirty_region */
+enum netfs_region_state {
+	NETFS_REGION_IS_PENDING,	/* Proposed write is waiting on an active write */
+	NETFS_REGION_IS_RESERVED,	/* Writable region is reserved, waiting on flushes */
+	NETFS_REGION_IS_ACTIVE,		/* Write is actively modifying the pagecache */
+	NETFS_REGION_IS_DIRTY,		/* Region is dirty */
+	NETFS_REGION_IS_FLUSHING,	/* Region is being flushed */
+	NETFS_REGION_IS_COMPLETE,	/* Region has been completed (stored/invalidated) */
+} __attribute__((mode(byte)));
+
+enum netfs_region_type {
+	NETFS_REGION_ORDINARY,		/* Ordinary write */
+	NETFS_REGION_DIO,		/* Direct I/O write */
+	NETFS_REGION_DSYNC,		/* O_DSYNC/RWF_DSYNC write */
+} __attribute__((mode(byte)));
+
+/*
+ * Descriptor for a dirty region that has a common set of parameters and can
+ * feasibly be written back in one go.  These are held in an ordered list.
+ *
+ * Regions are not allowed to overlap, though they may be merged.
+ */
+struct netfs_dirty_region {
+	struct netfs_flush_group *group;
+	struct list_head	active_link;	/* Link in i_context->pending/active_writes */
+	struct list_head	dirty_link;	/* Link in i_context->dirty_regions */
+	struct list_head	flush_link;	/* Link in group->region_list or
+						 * i_context->flush_queue */
+	spinlock_t		lock;
+	void			*netfs_priv;	/* Private data for the netfs */
+	struct netfs_range	bounds;		/* Bounding box including all affected pages */
+	struct netfs_range	reserved;	/* The region reserved against other writes */
+	struct netfs_range	dirty;		/* The region that has been modified */
+	loff_t			i_size;		/* Size of the file */
+	enum netfs_region_type	type;
+	enum netfs_region_state	state;
+	unsigned long		flags;
+#define NETFS_REGION_SYNC	0		/* Set if metadata sync required (RWF_SYNC) */
+#define NETFS_REGION_FLUSH_Q	1		/* Set if region is on flush queue */
+#define NETFS_REGION_SUPERSEDED	2		/* Set if region is being superseded */
+	unsigned int		debug_id;
+	refcount_t		ref;
+};
+
+enum netfs_write_compatibility {
+	NETFS_WRITES_COMPATIBLE,	/* Dirty regions can be directly merged */
+	NETFS_WRITES_SUPERSEDE,		/* Second write can supersede the first without first
+					 * having to be flushed (eg. authentication, DSYNC) */
+	NETFS_WRITES_INCOMPATIBLE,	/* Second write must wait for first (eg. DIO, ceph snap) */
 };
 
 /*
  * Operations the network filesystem can/must provide to the helpers.
  */
 struct netfs_request_ops {
+	/* Read request handling */
 	void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
 	int (*begin_cache_operation)(struct netfs_read_request *rreq);
 	void (*expand_readahead)(struct netfs_read_request *rreq);
@@ -186,6 +264,17 @@ struct netfs_request_ops {
 				 struct page *page, void **_fsdata);
 	void (*done)(struct netfs_read_request *rreq);
 	void (*cleanup)(struct address_space *mapping, void *netfs_priv);
+
+	/* Dirty region handling */
+	void (*init_dirty_region)(struct netfs_dirty_region *region, struct file *file);
+	void (*split_dirty_region)(struct netfs_dirty_region *region);
+	void (*free_dirty_region)(struct netfs_dirty_region *region);
+	enum netfs_write_compatibility (*is_write_compatible)(
+		struct netfs_i_context *ctx,
+		struct netfs_dirty_region *old_region,
+		struct netfs_dirty_region *candidate);
+	bool (*check_compatible_write)(struct netfs_dirty_region *region, struct file *file);
+	void (*update_i_size)(struct file *file, loff_t i_size);
 };
 
 /*
@@ -234,9 +323,11 @@ extern int netfs_readpage(struct file *, struct page *);
 extern int netfs_write_begin(struct file *, struct address_space *,
 			     loff_t, unsigned int, unsigned int, struct page **,
 			     void **);
+extern ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
 
 extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
 extern void netfs_stats_show(struct seq_file *);
+extern struct netfs_flush_group *netfs_new_flush_group(struct inode *, void *);
 
 /**
  * netfs_i_context - Get the netfs inode context from the inode
@@ -256,6 +347,13 @@ static inline void netfs_i_context_init(struct inode *inode,
 	struct netfs_i_context *ctx = netfs_i_context(inode);
 
 	ctx->ops = ops;
+	ctx->bsize = PAGE_SIZE;
+	INIT_LIST_HEAD(&ctx->pending_writes);
+	INIT_LIST_HEAD(&ctx->active_writes);
+	INIT_LIST_HEAD(&ctx->dirty_regions);
+	INIT_LIST_HEAD(&ctx->flush_groups);
+	INIT_LIST_HEAD(&ctx->flush_queue);
+	spin_lock_init(&ctx->lock);
 }
 
 /**
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 04ac29fc700f..808433e6ddd3 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -23,6 +23,7 @@ enum netfs_read_trace {
 	netfs_read_trace_readahead,
 	netfs_read_trace_readpage,
 	netfs_read_trace_write_begin,
+	netfs_read_trace_prefetch_for_write,
 };
 
 enum netfs_rreq_trace {
@@ -56,12 +57,43 @@ enum netfs_failure {
 	netfs_fail_prepare_write,
 };
 
+enum netfs_dirty_trace {
+	netfs_dirty_trace_active,
+	netfs_dirty_trace_commit,
+	netfs_dirty_trace_complete,
+	netfs_dirty_trace_flush_conflict,
+	netfs_dirty_trace_flush_dsync,
+	netfs_dirty_trace_merged_back,
+	netfs_dirty_trace_merged_forw,
+	netfs_dirty_trace_merged_sub,
+	netfs_dirty_trace_modified,
+	netfs_dirty_trace_new,
+	netfs_dirty_trace_reserved,
+	netfs_dirty_trace_split,
+	netfs_dirty_trace_start_pending,
+	netfs_dirty_trace_superseded,
+	netfs_dirty_trace_supersedes,
+	netfs_dirty_trace_wait_active,
+	netfs_dirty_trace_wait_pend,
+};
+
+enum netfs_region_trace {
+	netfs_region_trace_get_dirty,
+	netfs_region_trace_get_wreq,
+	netfs_region_trace_put_discard,
+	netfs_region_trace_put_merged,
+	netfs_region_trace_put_write_iter,
+	netfs_region_trace_free,
+	netfs_region_trace_new,
+};
+
 #endif
 
 #define netfs_read_traces					\
 	EM(netfs_read_trace_expanded,		"EXPANDED ")	\
 	EM(netfs_read_trace_readahead,		"READAHEAD")	\
 	EM(netfs_read_trace_readpage,		"READPAGE ")	\
+	EM(netfs_read_trace_prefetch_for_write,	"PREFETCHW")	\
 	E_(netfs_read_trace_write_begin,	"WRITEBEGN")
 
 #define netfs_rreq_traces					\
@@ -98,6 +130,46 @@ enum netfs_failure {
 	EM(netfs_fail_short_write_begin,	"short-write-begin")	\
 	E_(netfs_fail_prepare_write,		"prep-write")
 
+#define netfs_region_types					\
+	EM(NETFS_REGION_ORDINARY,		"ORD")		\
+	EM(NETFS_REGION_DIO,			"DIO")		\
+	E_(NETFS_REGION_DSYNC,			"DSY")
+
+#define netfs_region_states					\
+	EM(NETFS_REGION_IS_PENDING,		"pend")		\
+	EM(NETFS_REGION_IS_RESERVED,		"resv")		\
+	EM(NETFS_REGION_IS_ACTIVE,		"actv")		\
+	EM(NETFS_REGION_IS_DIRTY,		"drty")		\
+	EM(NETFS_REGION_IS_FLUSHING,		"flsh")		\
+	E_(NETFS_REGION_IS_COMPLETE,		"done")
+
+#define netfs_dirty_traces					\
+	EM(netfs_dirty_trace_active,		"ACTIVE    ")	\
+	EM(netfs_dirty_trace_commit,		"COMMIT    ")	\
+	EM(netfs_dirty_trace_complete,		"COMPLETE  ")	\
+	EM(netfs_dirty_trace_flush_conflict,	"FLSH CONFL")	\
+	EM(netfs_dirty_trace_flush_dsync,	"FLSH DSYNC")	\
+	EM(netfs_dirty_trace_merged_back,	"MERGE BACK")	\
+	EM(netfs_dirty_trace_merged_forw,	"MERGE FORW")	\
+	EM(netfs_dirty_trace_merged_sub,	"SUBSUMED  ")	\
+	EM(netfs_dirty_trace_modified,		"MODIFIED  ")	\
+	EM(netfs_dirty_trace_new,		"NEW       ")	\
+	EM(netfs_dirty_trace_reserved,		"RESERVED  ")	\
+	EM(netfs_dirty_trace_split,		"SPLIT     ")	\
+	EM(netfs_dirty_trace_start_pending,	"START PEND")	\
+	EM(netfs_dirty_trace_superseded,	"SUPERSEDED")	\
+	EM(netfs_dirty_trace_supersedes,	"SUPERSEDES")	\
+	EM(netfs_dirty_trace_wait_active,	"WAIT ACTV ")	\
+	E_(netfs_dirty_trace_wait_pend,		"WAIT PEND ")
+
+#define netfs_region_traces					\
+	EM(netfs_region_trace_get_dirty,	"GET DIRTY  ")	\
+	EM(netfs_region_trace_get_wreq,		"GET WREQ   ")	\
+	EM(netfs_region_trace_put_discard,	"PUT DISCARD")	\
+	EM(netfs_region_trace_put_merged,	"PUT MERGED ")	\
+	EM(netfs_region_trace_put_write_iter,	"PUT WRITER ")	\
+	EM(netfs_region_trace_free,		"FREE       ")	\
+	E_(netfs_region_trace_new,		"NEW        ")
 
 /*
  * Export enum symbols via userspace.
@@ -112,6 +184,9 @@ netfs_rreq_traces;
 netfs_sreq_sources;
 netfs_sreq_traces;
 netfs_failures;
+netfs_region_types;
+netfs_region_states;
+netfs_dirty_traces;
 
 /*
  * Now redefine the EM() and E_() macros to map the enums to the strings that
@@ -255,6 +330,111 @@ TRACE_EVENT(netfs_failure,
 		      __entry->error)
 	    );
 
+TRACE_EVENT(netfs_write_iter,
+	    TP_PROTO(struct netfs_dirty_region *region, struct kiocb *iocb,
+		     struct iov_iter *from),
+
+	    TP_ARGS(region, iocb, from),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		region		)
+		    __field(unsigned long long,		start		)
+		    __field(size_t,			len		)
+		    __field(unsigned int,		flags		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->region	= region->debug_id;
+		    __entry->start	= iocb->ki_pos;
+		    __entry->len	= iov_iter_count(from);
+		    __entry->flags	= iocb->ki_flags;
+			   ),
+
+	    TP_printk("D=%x WRITE-ITER s=%llx l=%zx f=%x",
+		      __entry->region, __entry->start, __entry->len, __entry->flags)
+	    );
+
+TRACE_EVENT(netfs_ref_region,
+	    TP_PROTO(unsigned int region_debug_id, int ref,
+		     enum netfs_region_trace what),
+
+	    TP_ARGS(region_debug_id, ref, what),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		region		)
+		    __field(int,			ref		)
+		    __field(enum netfs_region_trace,	what		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->region	= region_debug_id;
+		    __entry->ref	= ref;
+		    __entry->what	= what;
+			   ),
+
+	    TP_printk("D=%x %s r=%u",
+		      __entry->region,
+		      __print_symbolic(__entry->what, netfs_region_traces),
+		      __entry->ref)
+	    );
+
+TRACE_EVENT(netfs_dirty,
+	    TP_PROTO(struct netfs_i_context *ctx,
+		     struct netfs_dirty_region *region,
+		     struct netfs_dirty_region *region2,
+		     enum netfs_dirty_trace why),
+
+	    TP_ARGS(ctx, region, region2, why),
+
+	    TP_STRUCT__entry(
+		    __field(ino_t,			ino		)
+		    __field(unsigned long long,		bounds_start	)
+		    __field(unsigned long long,		bounds_end	)
+		    __field(unsigned long long,		reserved_start	)
+		    __field(unsigned long long,		reserved_end	)
+		    __field(unsigned long long,		dirty_start	)
+		    __field(unsigned long long,		dirty_end	)
+		    __field(unsigned int,		debug_id	)
+		    __field(unsigned int,		debug_id2	)
+		    __field(enum netfs_region_type,	type		)
+		    __field(enum netfs_region_state,	state		)
+		    __field(unsigned short,		flags		)
+		    __field(unsigned int,		ref		)
+		    __field(enum netfs_dirty_trace,	why		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->ino		= (((struct inode *)ctx) - 1)->i_ino;
+		    __entry->why		= why;
+		    __entry->bounds_start	= region->bounds.start;
+		    __entry->bounds_end		= region->bounds.end;
+		    __entry->reserved_start	= region->reserved.start;
+		    __entry->reserved_end	= region->reserved.end;
+		    __entry->dirty_start	= region->dirty.start;
+		    __entry->dirty_end		= region->dirty.end;
+		    __entry->debug_id		= region->debug_id;
+		    __entry->type		= region->type;
+		    __entry->state		= region->state;
+		    __entry->flags		= region->flags;
+		    __entry->debug_id2		= region2 ? region2->debug_id : 0;
+			   ),
+
+	    TP_printk("i=%lx D=%x %s %s dt=%04llx-%04llx bb=%04llx-%04llx rs=%04llx-%04llx %s f=%x XD=%x",
+		      __entry->ino, __entry->debug_id,
+		      __print_symbolic(__entry->why, netfs_dirty_traces),
+		      __print_symbolic(__entry->type, netfs_region_types),
+		      __entry->dirty_start,
+		      __entry->dirty_end,
+		      __entry->bounds_start,
+		      __entry->bounds_end,
+		      __entry->reserved_start,
+		      __entry->reserved_end,
+		      __print_symbolic(__entry->state, netfs_region_states),
+		      __entry->flags,
+		      __entry->debug_id2
+		      )
+	    );
+
 #endif /* _TRACE_NETFS_H */
 
 /* This part must be outside protection */



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 07/12] netfs: Initiate write request from a dirty region
  2021-07-21 13:44 David Howells
                   ` (5 preceding siblings ...)
  2021-07-21 13:46 ` [RFC PATCH 06/12] netfs: Keep lists of pending, active, dirty and flushed regions David Howells
@ 2021-07-21 13:46 ` David Howells
  2021-07-21 13:46 ` [RFC PATCH 08/12] netfs: Keep dirty mark for pages with more than one " David Howells
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Handle the initiation of writeback of a piece of the dirty list.  The first
region on the flush list is extracted and a write request is set up to
manage it.  The pages in the affected region are flipped from dirty to
writeback-in-progress.

The writeback is then dispatched (which currently just logs a "--- WRITE
---" message to dmesg and then abandons it).

Notes:

 (*) A page may host multiple disjoint dirty regions, each with its own
     netfs_dirty_region, and a region may span multiple pages (see the
     snippet after these notes).  Dirty regions are not permitted to
     overlap, though they may be merged if they would otherwise overlap.

 (*) A page may be involved in multiple simultaneous writebacks.  Each one
     is managed by a separate netfs_dirty_region and netfs_write_request.

 (*) Multiple pages may be required to form a write (for crypto/compression
     purposes) and so adjacent non-dirty pages may also get marked for
     writeback.
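
As a concrete illustration of the first note above, a region is tracked as a
byte range and the span of pages it covers is only worked out when writeback
is kicked off; netfs_extract_dirty_region() below derives it like so:

	wreq->start = region->dirty.start;
	wreq->len   = region->dirty.end - region->dirty.start;
	wreq->first =  region->dirty.start      / PAGE_SIZE;
	wreq->last  = (region->dirty.end - 1)   / PAGE_SIZE;

so a region that doesn't begin or end on a page boundary shares its first
and/or last page with whatever else lives there.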

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c                |  128 ++----------------
 fs/netfs/Makefile            |    1 
 fs/netfs/internal.h          |   16 ++
 fs/netfs/objects.c           |   78 +++++++++++
 fs/netfs/read_helper.c       |   34 +++++
 fs/netfs/stats.c             |    6 +
 fs/netfs/write_back.c        |  306 ++++++++++++++++++++++++++++++++++++++++++
 fs/netfs/xa_iterator.h       |   85 ++++++++++++
 include/linux/netfs.h        |   35 +++++
 include/trace/events/netfs.h |   72 ++++++++++
 10 files changed, 642 insertions(+), 119 deletions(-)
 create mode 100644 fs/netfs/write_back.c
 create mode 100644 fs/netfs/xa_iterator.h

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 8400cdf086b6..a6d483fe4e74 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -19,9 +19,6 @@
 
 static int afs_file_mmap(struct file *file, struct vm_area_struct *vma);
 static int afs_symlink_readpage(struct file *file, struct page *page);
-static void afs_invalidatepage(struct page *page, unsigned int offset,
-			       unsigned int length);
-static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 
 static ssize_t afs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 
@@ -50,17 +47,17 @@ const struct address_space_operations afs_file_aops = {
 	.readahead	= netfs_readahead,
 	.set_page_dirty	= afs_set_page_dirty,
 	.launder_page	= afs_launder_page,
-	.releasepage	= afs_releasepage,
-	.invalidatepage	= afs_invalidatepage,
+	.releasepage	= netfs_releasepage,
+	.invalidatepage	= netfs_invalidatepage,
 	.direct_IO	= afs_direct_IO,
 	.writepage	= afs_writepage,
-	.writepages	= afs_writepages,
+	.writepages	= netfs_writepages,
 };
 
 const struct address_space_operations afs_symlink_aops = {
 	.readpage	= afs_symlink_readpage,
-	.releasepage	= afs_releasepage,
-	.invalidatepage	= afs_invalidatepage,
+	.releasepage	= netfs_releasepage,
+	.invalidatepage	= netfs_invalidatepage,
 };
 
 static const struct vm_operations_struct afs_vm_ops = {
@@ -378,6 +375,11 @@ static void afs_free_dirty_region(struct netfs_dirty_region *region)
 	key_put(region->netfs_priv);
 }
 
+static void afs_init_wreq(struct netfs_write_request *wreq)
+{
+	//wreq->netfs_priv = key_get(afs_file_key(file));
+}
+
 static void afs_update_i_size(struct file *file, loff_t new_i_size)
 {
 	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
@@ -400,6 +402,7 @@ const struct netfs_request_ops afs_req_ops = {
 	.init_dirty_region	= afs_init_dirty_region,
 	.free_dirty_region	= afs_free_dirty_region,
 	.update_i_size		= afs_update_i_size,
+	.init_wreq		= afs_init_wreq,
 };
 
 int afs_write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -408,115 +411,6 @@ int afs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	return 0;
 }
 
-/*
- * Adjust the dirty region of the page on truncation or full invalidation,
- * getting rid of the markers altogether if the region is entirely invalidated.
- */
-static void afs_invalidate_dirty(struct page *page, unsigned int offset,
-				 unsigned int length)
-{
-	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
-	unsigned long priv;
-	unsigned int f, t, end = offset + length;
-
-	priv = page_private(page);
-
-	/* we clean up only if the entire page is being invalidated */
-	if (offset == 0 && length == thp_size(page))
-		goto full_invalidate;
-
-	 /* If the page was dirtied by page_mkwrite(), the PTE stays writable
-	  * and we don't get another notification to tell us to expand it
-	  * again.
-	  */
-	if (afs_is_page_dirty_mmapped(priv))
-		return;
-
-	/* We may need to shorten the dirty region */
-	f = afs_page_dirty_from(page, priv);
-	t = afs_page_dirty_to(page, priv);
-
-	if (t <= offset || f >= end)
-		return; /* Doesn't overlap */
-
-	if (f < offset && t > end)
-		return; /* Splits the dirty region - just absorb it */
-
-	if (f >= offset && t <= end)
-		goto undirty;
-
-	if (f < offset)
-		t = offset;
-	else
-		f = end;
-	if (f == t)
-		goto undirty;
-
-	priv = afs_page_dirty(page, f, t);
-	set_page_private(page, priv);
-	trace_afs_page_dirty(vnode, tracepoint_string("trunc"), page);
-	return;
-
-undirty:
-	trace_afs_page_dirty(vnode, tracepoint_string("undirty"), page);
-	clear_page_dirty_for_io(page);
-full_invalidate:
-	trace_afs_page_dirty(vnode, tracepoint_string("inval"), page);
-	detach_page_private(page);
-}
-
-/*
- * invalidate part or all of a page
- * - release a page and clean up its private data if offset is 0 (indicating
- *   the entire page)
- */
-static void afs_invalidatepage(struct page *page, unsigned int offset,
-			       unsigned int length)
-{
-	_enter("{%lu},%u,%u", page->index, offset, length);
-
-	BUG_ON(!PageLocked(page));
-
-	if (PagePrivate(page))
-		afs_invalidate_dirty(page, offset, length);
-
-	wait_on_page_fscache(page);
-	_leave("");
-}
-
-/*
- * release a page and clean up its private state if it's not busy
- * - return true if the page can now be released, false if not
- */
-static int afs_releasepage(struct page *page, gfp_t gfp_flags)
-{
-	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
-
-	_enter("{{%llx:%llu}[%lu],%lx},%x",
-	       vnode->fid.vid, vnode->fid.vnode, page->index, page->flags,
-	       gfp_flags);
-
-	/* deny if page is being written to the cache and the caller hasn't
-	 * elected to wait */
-#ifdef CONFIG_AFS_FSCACHE
-	if (PageFsCache(page)) {
-		if (!(gfp_flags & __GFP_DIRECT_RECLAIM) || !(gfp_flags & __GFP_FS))
-			return false;
-		wait_on_page_fscache(page);
-		fscache_note_page_release(afs_vnode_cache(vnode));
-	}
-#endif
-
-	if (PagePrivate(page)) {
-		trace_afs_page_dirty(vnode, tracepoint_string("rel"), page);
-		detach_page_private(page);
-	}
-
-	/* indicate that the page can be released */
-	_leave(" = T");
-	return 1;
-}
-
 /*
  * Handle setting up a memory mapping on an AFS file.
  */
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index 3e11453ad2c5..a201fd7b22cf 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -3,6 +3,7 @@
 netfs-y := \
 	objects.o \
 	read_helper.o \
+	write_back.o \
 	write_helper.o
 # dio_helper.o
 
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 77ceab694348..fe85581d8ac0 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -8,6 +8,7 @@
 #include <linux/netfs.h>
 #include <linux/fscache.h>
 #include <trace/events/netfs.h>
+#include "xa_iterator.h"
 
 #ifdef pr_fmt
 #undef pr_fmt
@@ -34,6 +35,19 @@ void netfs_free_dirty_region(struct netfs_i_context *ctx, struct netfs_dirty_reg
 void netfs_put_dirty_region(struct netfs_i_context *ctx,
 			    struct netfs_dirty_region *region,
 			    enum netfs_region_trace what);
+struct netfs_write_request *netfs_alloc_write_request(struct address_space *mapping,
+						      bool is_dio);
+void netfs_get_write_request(struct netfs_write_request *wreq,
+			     enum netfs_wreq_trace what);
+void netfs_free_write_request(struct work_struct *work);
+void netfs_put_write_request(struct netfs_write_request *wreq,
+			     bool was_async, enum netfs_wreq_trace what);
+
+static inline void netfs_see_write_request(struct netfs_write_request *wreq,
+					   enum netfs_wreq_trace what)
+{
+	trace_netfs_ref_wreq(wreq->debug_id, refcount_read(&wreq->usage), what);
+}
 
 /*
  * read_helper.c
@@ -46,6 +60,7 @@ int netfs_prefetch_for_write(struct file *file, struct page *page, loff_t pos, s
 /*
  * write_helper.c
  */
+void netfs_writeback_worker(struct work_struct *work);
 void netfs_flush_region(struct netfs_i_context *ctx,
 			struct netfs_dirty_region *region,
 			enum netfs_dirty_trace why);
@@ -74,6 +89,7 @@ extern atomic_t netfs_n_rh_write_failed;
 extern atomic_t netfs_n_rh_write_zskip;
 extern atomic_t netfs_n_wh_region;
 extern atomic_t netfs_n_wh_flush_group;
+extern atomic_t netfs_n_wh_wreq;
 
 
 static inline void netfs_stat(atomic_t *stat)
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index ba1e052aa352..6e9b2a00076d 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -111,3 +111,81 @@ void netfs_put_dirty_region(struct netfs_i_context *ctx,
 		netfs_free_dirty_region(ctx, region);
 	}
 }
+
+struct netfs_write_request *netfs_alloc_write_request(struct address_space *mapping,
+						      bool is_dio)
+{
+	static atomic_t debug_ids;
+	struct inode *inode = mapping->host;
+	struct netfs_i_context *ctx = netfs_i_context(inode);
+	struct netfs_write_request *wreq;
+
+	wreq = kzalloc(sizeof(struct netfs_write_request), GFP_KERNEL);
+	if (wreq) {
+		wreq->mapping	= mapping;
+		wreq->inode	= inode;
+		wreq->netfs_ops	= ctx->ops;
+		wreq->debug_id	= atomic_inc_return(&debug_ids);
+		xa_init(&wreq->buffer);
+		INIT_WORK(&wreq->work, netfs_writeback_worker);
+		refcount_set(&wreq->usage, 1);
+		ctx->ops->init_wreq(wreq);
+		netfs_stat(&netfs_n_wh_wreq);
+		trace_netfs_ref_wreq(wreq->debug_id, 1, netfs_wreq_trace_new);
+	}
+
+	return wreq;
+}
+
+void netfs_get_write_request(struct netfs_write_request *wreq,
+			     enum netfs_wreq_trace what)
+{
+	int ref;
+
+	__refcount_inc(&wreq->usage, &ref);
+	trace_netfs_ref_wreq(wreq->debug_id, ref + 1, what);
+}
+
+void netfs_free_write_request(struct work_struct *work)
+{
+	struct netfs_write_request *wreq =
+		container_of(work, struct netfs_write_request, work);
+	struct netfs_i_context *ctx = netfs_i_context(wreq->inode);
+	struct page *page;
+	pgoff_t index;
+
+	if (wreq->netfs_priv)
+		wreq->netfs_ops->cleanup(wreq->mapping, wreq->netfs_priv);
+	trace_netfs_ref_wreq(wreq->debug_id, 0, netfs_wreq_trace_free);
+	if (wreq->cache_resources.ops)
+		wreq->cache_resources.ops->end_operation(&wreq->cache_resources);
+	if (wreq->region)
+		netfs_put_dirty_region(ctx, wreq->region,
+				       netfs_region_trace_put_wreq);
+	xa_for_each(&wreq->buffer, index, page) {
+		__free_page(page);
+	}
+	xa_destroy(&wreq->buffer);
+	kfree(wreq);
+	netfs_stat_d(&netfs_n_wh_wreq);
+}
+
+void netfs_put_write_request(struct netfs_write_request *wreq,
+			     bool was_async, enum netfs_wreq_trace what)
+{
+	unsigned int debug_id = wreq->debug_id;
+	bool dead;
+	int ref;
+
+	dead = __refcount_dec_and_test(&wreq->usage, &ref);
+	trace_netfs_ref_wreq(debug_id, ref - 1, what);
+	if (dead) {
+		if (was_async) {
+			wreq->work.func = netfs_free_write_request;
+			if (!queue_work(system_unbound_wq, &wreq->work))
+				BUG();
+		} else {
+			netfs_free_write_request(&wreq->work);
+		}
+	}
+}
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index bfcdbbd32f4c..0b771f2f5449 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -1415,3 +1415,37 @@ int netfs_prefetch_for_write(struct file *file, struct page *page,
 	_leave(" = %d", ret);
 	return ret;
 }
+
+/*
+ * Invalidate part or all of a page
+ * - release a page and clean up its private data if offset is 0 (indicating
+ *   the entire page)
+ */
+void netfs_invalidatepage(struct page *page, unsigned int offset, unsigned int length)
+{
+	_enter("{%lu},%u,%u", page->index, offset, length);
+
+	wait_on_page_fscache(page);
+}
+EXPORT_SYMBOL(netfs_invalidatepage);
+
+/*
+ * Release a page and clean up its private state if it's not busy
+ * - return true if the page can now be released, false if not
+ */
+int netfs_releasepage(struct page *page, gfp_t gfp_flags)
+{
+	struct netfs_i_context *ctx = netfs_i_context(page->mapping->host);
+
+	kenter("");
+
+	if (PageFsCache(page)) {
+		if (!(gfp_flags & __GFP_DIRECT_RECLAIM) || !(gfp_flags & __GFP_FS))
+			return false;
+		wait_on_page_fscache(page);
+		fscache_note_page_release(ctx->cache);
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(netfs_releasepage);
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 7c079ca47b5b..ac2510f8cab0 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -29,6 +29,7 @@ atomic_t netfs_n_rh_write_failed;
 atomic_t netfs_n_rh_write_zskip;
 atomic_t netfs_n_wh_region;
 atomic_t netfs_n_wh_flush_group;
+atomic_t netfs_n_wh_wreq;
 
 void netfs_stats_show(struct seq_file *m)
 {
@@ -56,8 +57,9 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_rh_write),
 		   atomic_read(&netfs_n_rh_write_done),
 		   atomic_read(&netfs_n_rh_write_failed));
-	seq_printf(m, "WrHelp : R=%u F=%u\n",
+	seq_printf(m, "WrHelp : R=%u F=%u wr=%u\n",
 		   atomic_read(&netfs_n_wh_region),
-		   atomic_read(&netfs_n_wh_flush_group));
+		   atomic_read(&netfs_n_wh_flush_group),
+		   atomic_read(&netfs_n_wh_wreq));
 }
 EXPORT_SYMBOL(netfs_stats_show);
diff --git a/fs/netfs/write_back.c b/fs/netfs/write_back.c
new file mode 100644
index 000000000000..9fcb2ac50ebb
--- /dev/null
+++ b/fs/netfs/write_back.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Network filesystem high-level write support.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+/*
+ * Process a write request.
+ */
+static void netfs_writeback(struct netfs_write_request *wreq)
+{
+	kdebug("--- WRITE ---");
+}
+
+void netfs_writeback_worker(struct work_struct *work)
+{
+	struct netfs_write_request *wreq =
+		container_of(work, struct netfs_write_request, work);
+
+	netfs_see_write_request(wreq, netfs_wreq_trace_see_work);
+	netfs_writeback(wreq);
+	netfs_put_write_request(wreq, false, netfs_wreq_trace_put_work);
+}
+
+/*
+ * Flush some of the dirty queue.
+ */
+static int netfs_flush_dirty(struct address_space *mapping,
+			     struct writeback_control *wbc,
+			     struct netfs_range *range,
+			     loff_t *next)
+{
+	struct netfs_dirty_region *p, *q;
+	struct netfs_i_context *ctx = netfs_i_context(mapping->host);
+
+	kenter("%llx-%llx", range->start, range->end);
+
+	spin_lock(&ctx->lock);
+
+	/* Scan forwards to find dirty regions containing the suggested start
+	 * point.
+	 */
+	list_for_each_entry_safe(p, q, &ctx->dirty_regions, dirty_link) {
+		_debug("D=%x %llx-%llx", p->debug_id, p->dirty.start, p->dirty.end);
+		if (p->dirty.end <= range->start)
+			continue;
+		if (p->dirty.start >= range->end)
+			break;
+		if (p->state != NETFS_REGION_IS_DIRTY)
+			continue;
+		if (test_bit(NETFS_REGION_FLUSH_Q, &p->flags))
+			continue;
+
+		netfs_flush_region(ctx, p, netfs_dirty_trace_flush_writepages);
+	}
+
+	spin_unlock(&ctx->lock);
+	return 0;
+}
+
+static int netfs_unlock_pages_iterator(struct page *page)
+{
+	unlock_page(page);
+	put_page(page);
+	return 0;
+}
+
+/*
+ * Unlock all the pages in a range.
+ */
+static void netfs_unlock_pages(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end)
+{
+	netfs_iterate_pages(mapping, start, end, netfs_unlock_pages_iterator);
+}
+
+static int netfs_lock_pages_iterator(struct xa_state *xas,
+				     struct page *page,
+				     struct netfs_write_request *wreq,
+				     struct writeback_control *wbc)
+{
+	int ret = 0;
+
+	/* At this point we hold neither the i_pages lock nor the
+	 * page lock: the page may be truncated or invalidated
+	 * (changing page->mapping to NULL), or even swizzled
+	 * back from swapper_space to tmpfs file mapping
+	 */
+	if (wbc->sync_mode != WB_SYNC_NONE) {
+		xas_pause(xas);
+		rcu_read_unlock();
+		ret = lock_page_killable(page);
+		rcu_read_lock();
+	} else {
+		if (!trylock_page(page))
+			ret = -EBUSY;
+	}
+
+	return ret;
+}
+
+/*
+ * Lock all the pages in a range and add them to the write request.
+ */
+static int netfs_lock_pages(struct address_space *mapping,
+			    struct writeback_control *wbc,
+			    struct netfs_write_request *wreq)
+{
+	pgoff_t last = wreq->last;
+	int ret;
+
+	kenter("%lx-%lx", wreq->first, wreq->last);
+	ret = netfs_iterate_get_pages(mapping, wreq->first, wreq->last,
+				      netfs_lock_pages_iterator, wreq, wbc);
+	if (ret < 0)
+		goto failed;
+
+	if (wreq->last < last) {
+		kdebug("Some pages missing %lx < %lx", wreq->last, last);
+		ret = -EIO;
+		goto failed;
+	}
+
+	return 0;
+
+failed:
+	netfs_unlock_pages(mapping, wreq->first, wreq->last);
+	return ret;
+}
+
+static int netfs_set_page_writeback(struct page *page)
+{
+	/* Now we need to clear the dirty flags on any page that's not shared
+	 * with any other dirty region.
+	 */
+	if (!clear_page_dirty_for_io(page))
+		BUG();
+
+	/* We set writeback unconditionally because a page may participate in
+	 * more than one simultaneous writeback.
+	 */
+	set_page_writeback(page);
+	return 0;
+}
+
+/*
+ * Extract a region to write back.
+ */
+static struct netfs_dirty_region *netfs_extract_dirty_region(
+	struct netfs_i_context *ctx,
+	struct netfs_write_request *wreq)
+{
+	struct netfs_dirty_region *region = NULL, *spare;
+
+	spare = netfs_alloc_dirty_region();
+	if (!spare)
+		return NULL;
+
+	spin_lock(&ctx->lock);
+
+	if (list_empty(&ctx->flush_queue))
+		goto out;
+
+	region = list_first_entry(&ctx->flush_queue,
+				  struct netfs_dirty_region, flush_link);
+
+	wreq->region = netfs_get_dirty_region(ctx, region, netfs_region_trace_get_wreq);
+	wreq->start  = region->dirty.start;
+	wreq->len    = region->dirty.end - region->dirty.start;
+	wreq->first  =  region->dirty.start    / PAGE_SIZE;
+	wreq->last   = (region->dirty.end - 1) / PAGE_SIZE;
+
+	/* TODO: Split the region if it's larger than a certain size.  This is
+	 * tricky as we need to observe page, crypto and compression block
+	 * boundaries.  The crypto/comp bounds are defined by ctx->bsize, but
+	 * we don't know where the page boundaries are.
+	 *
+	 * All of these boundaries, however, must be pow-of-2 sized and
+	 * pow-of-2 aligned, so they never partially overlap
+	 */
+
+	smp_store_release(&region->state, NETFS_REGION_IS_FLUSHING);
+	trace_netfs_dirty(ctx, region, NULL, netfs_dirty_trace_flushing);
+	wake_up_var(&region->state);
+	list_del_init(&region->flush_link);
+
+out:
+	spin_unlock(&ctx->lock);
+	netfs_free_dirty_region(ctx, spare);
+	kleave(" = D=%x", region ? region->debug_id : 0);
+	return region;
+}
+
+/*
+ * Schedule a write for the first region on the flush queue.
+ */
+static int netfs_begin_write(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	struct netfs_write_request *wreq;
+	struct netfs_dirty_region *region;
+	struct netfs_i_context *ctx = netfs_i_context(mapping->host);
+	int ret;
+
+	wreq = netfs_alloc_write_request(mapping, false);
+	if (!wreq)
+		return -ENOMEM;
+
+	ret = 0;
+	region = netfs_extract_dirty_region(ctx, wreq);
+	if (!region)
+		goto error;
+
+	ret = netfs_lock_pages(mapping, wbc, wreq);
+	if (ret < 0)
+		goto error;
+
+	trace_netfs_wreq(wreq);
+
+	netfs_iterate_pages(mapping, wreq->first, wreq->last,
+			    netfs_set_page_writeback);
+	netfs_unlock_pages(mapping, wreq->first, wreq->last);
+	iov_iter_xarray(&wreq->source, WRITE, &wreq->mapping->i_pages,
+			wreq->start, wreq->len);
+
+	if (!queue_work(system_unbound_wq, &wreq->work))
+		BUG();
+
+	kleave(" = %lu", wreq->last - wreq->first + 1);
+	return wreq->last - wreq->first + 1;
+
+error:
+	netfs_put_write_request(wreq, wbc->sync_mode != WB_SYNC_NONE,
+				netfs_wreq_trace_put_discard);
+	kleave(" = %d", ret);
+	return ret;
+}
+
+/**
+ * netfs_writepages - Initiate writeback to the server and cache
+ * @mapping: The pagecache to write from
+ * @wbc: Hints from the VM as to what to write
+ *
+ * This is a helper intended to be called directly from a network filesystem's
+ * address space operations table to perform writeback to the server and the
+ * cache.
+ *
+ * We have to be careful as we can end up racing with setattr() truncating the
+ * pagecache since the caller doesn't take a lock here to prevent it.
+ */
+int netfs_writepages(struct address_space *mapping,
+		     struct writeback_control *wbc)
+{
+	struct netfs_range range;
+	loff_t next = 0;
+	int ret;
+
+	kenter("%lx,%llx-%llx,%u,%c%c%c%c,%u,%u",
+	       wbc->nr_to_write,
+	       wbc->range_start, wbc->range_end,
+	       wbc->sync_mode,
+	       wbc->for_kupdate		? 'k' : '-',
+	       wbc->for_background	? 'b' : '-',
+	       wbc->for_reclaim		? 'r' : '-',
+	       wbc->for_sync		? 's' : '-',
+	       wbc->tagged_writepages,
+	       wbc->range_cyclic);
+
+	//dump_stack();
+
+	if (wbc->range_cyclic) {
+		range.start = mapping->writeback_index * PAGE_SIZE;
+		range.end   = ULLONG_MAX;
+		ret = netfs_flush_dirty(mapping, wbc, &range, &next);
+		if (range.start > 0 && wbc->nr_to_write > 0 && ret == 0) {
+			range.start = 0;
+			range.end   = mapping->writeback_index * PAGE_SIZE;
+			ret = netfs_flush_dirty(mapping, wbc, &range, &next);
+		}
+		mapping->writeback_index = next / PAGE_SIZE;
+	} else if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
+		range.start = 0;
+		range.end   = ULLONG_MAX;
+		ret = netfs_flush_dirty(mapping, wbc, &range, &next);
+		if (wbc->nr_to_write > 0 && ret == 0)
+			mapping->writeback_index = next;
+	} else {
+		range.start = wbc->range_start;
+		range.end   = wbc->range_end + 1;
+		ret = netfs_flush_dirty(mapping, wbc, &range, &next);
+	}
+
+	if (ret == 0)
+		ret = netfs_begin_write(mapping, wbc);
+
+	_leave(" = %d", ret);
+	return ret;
+}
+EXPORT_SYMBOL(netfs_writepages);
diff --git a/fs/netfs/xa_iterator.h b/fs/netfs/xa_iterator.h
new file mode 100644
index 000000000000..3f37827f0f99
--- /dev/null
+++ b/fs/netfs/xa_iterator.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* xarray iterator macros for netfslib.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+/*
+ * Iterate over a range of pages.  xarray locks are not held over the iterator
+ * function, so it can sleep if necessary.  The start and end positions are
+ * updated to indicate the span of pages actually processed.
+ */
+#define netfs_iterate_pages(MAPPING, START, END, ITERATOR, ...)		\
+	({								\
+		unsigned long __it_index;				\
+		struct page *page;					\
+		pgoff_t __it_start = (START);				\
+		pgoff_t __it_end = (END);				\
+		pgoff_t __it_tmp;					\
+		int ret = 0;						\
+									\
+		(END) = __it_start;					\
+		xa_for_each_range(&(MAPPING)->i_pages, __it_index, page, \
+				  __it_start, __it_end) {		\
+			if (xa_is_value(page)) {			\
+				ret = -EIO; /* Not a real page. */	\
+				break;					\
+			}						\
+			if (__it_index < (START))			\
+				(START) = __it_index;			\
+			ret = ITERATOR(page, ##__VA_ARGS__);		\
+			if (ret < 0)					\
+				break;					\
+			__it_tmp = __it_index + thp_nr_pages(page) - 1;	\
+			if (__it_tmp > (END))				\
+				(END) = __it_tmp;			\
+		}							\
+		ret;							\
+	})
+
+/*
+ * Iterate over a set of pages, getting each one before calling the iteration
+ * function.  The iteration function may drop the RCU read lock, but should
+ * call xas_pause() before it does so.  The start and end positions are updated
+ * to indicate the span of pages actually processed.
+ */
+#define netfs_iterate_get_pages(MAPPING, START, END, ITERATOR, ...)	\
+	({								\
+		unsigned long __it_index;				\
+		struct page *page;					\
+		pgoff_t __it_start = (START);				\
+		pgoff_t __it_end = (END);				\
+		pgoff_t __it_tmp;					\
+		int ret = 0;						\
+									\
+		XA_STATE(xas, &(MAPPING)->i_pages, __it_start);		\
+		(END) = __it_start;					\
+		rcu_read_lock();					\
+		for (page = xas_load(&xas); page; page = xas_next_entry(&xas, __it_end)) { \
+			if (xas_retry(&xas, page))			\
+				continue;				\
+			if (xa_is_value(page))				\
+				break;					\
+			if (!page_cache_get_speculative(page)) {	\
+				xas_reset(&xas);			\
+				continue;				\
+			}						\
+			if (unlikely(page != xas_reload(&xas))) {	\
+				put_page(page);				\
+				xas_reset(&xas);			\
+				continue;				\
+			}						\
+			__it_index = page_index(page);			\
+			if (__it_index < (START))			\
+				(START) = __it_index;			\
+			ret = ITERATOR(&xas, page, ##__VA_ARGS__);	\
+			if (ret < 0)					\
+				break;					\
+			__it_tmp = __it_index + thp_nr_pages(page) - 1; \
+			if (__it_tmp > (END))				\
+				(END) = __it_tmp;			\
+		}							\
+		rcu_read_unlock();					\
+		ret;							\
+	})
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index fc91711d3178..9f874e7ed45a 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -242,6 +242,35 @@ struct netfs_dirty_region {
 	refcount_t		ref;
 };
 
+/*
+ * Descriptor for a write request.  This is used to manage the preparation and
+ * storage of a sequence of dirty data - its compression/encryption and its
+ * writing to one or more servers and the cache.
+ *
+ * The prepared data is buffered here.
+ */
+struct netfs_write_request {
+	struct work_struct	work;
+	struct inode		*inode;		/* The file being accessed */
+	struct address_space	*mapping;	/* The mapping being accessed */
+	struct netfs_dirty_region *region;	/* The region we're writing back */
+	struct netfs_cache_resources cache_resources;
+	struct xarray		buffer;		/* Buffer for encrypted/compressed data */
+	struct iov_iter		source;		/* The iterator to be used */
+	struct list_head	write_link;	/* Link in i_context->write_requests */
+	void			*netfs_priv;	/* Private data for the netfs */
+	unsigned int		debug_id;
+	short			error;		/* 0 or error that occurred */
+	loff_t			i_size;		/* Size of the file */
+	loff_t			start;		/* Start position */
+	size_t			len;		/* Length of the request */
+	pgoff_t			first;		/* First page included */
+	pgoff_t			last;		/* Last page included */
+	refcount_t		usage;
+	unsigned long		flags;
+	const struct netfs_request_ops *netfs_ops;
+};
+
 enum netfs_write_compatibility {
 	NETFS_WRITES_COMPATIBLE,	/* Dirty regions can be directly merged */
 	NETFS_WRITES_SUPERSEDE,		/* Second write can supersede the first without first
@@ -275,6 +304,9 @@ struct netfs_request_ops {
 		struct netfs_dirty_region *candidate);
 	bool (*check_compatible_write)(struct netfs_dirty_region *region, struct file *file);
 	void (*update_i_size)(struct file *file, loff_t i_size);
+
+	/* Write request handling */
+	void (*init_wreq)(struct netfs_write_request *wreq);
 };
 
 /*
@@ -324,6 +356,9 @@ extern int netfs_write_begin(struct file *, struct address_space *,
 			     loff_t, unsigned int, unsigned int, struct page **,
 			     void **);
 extern ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
+extern int netfs_writepages(struct address_space *mapping, struct writeback_control *wbc);
+extern void netfs_invalidatepage(struct page *page, unsigned int offset, unsigned int length);
+extern int netfs_releasepage(struct page *page, gfp_t gfp_flags);
 
 extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
 extern void netfs_stats_show(struct seq_file *);
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 808433e6ddd3..e70abb5033e6 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -63,6 +63,8 @@ enum netfs_dirty_trace {
 	netfs_dirty_trace_complete,
 	netfs_dirty_trace_flush_conflict,
 	netfs_dirty_trace_flush_dsync,
+	netfs_dirty_trace_flush_writepages,
+	netfs_dirty_trace_flushing,
 	netfs_dirty_trace_merged_back,
 	netfs_dirty_trace_merged_forw,
 	netfs_dirty_trace_merged_sub,
@@ -82,11 +84,20 @@ enum netfs_region_trace {
 	netfs_region_trace_get_wreq,
 	netfs_region_trace_put_discard,
 	netfs_region_trace_put_merged,
+	netfs_region_trace_put_wreq,
 	netfs_region_trace_put_write_iter,
 	netfs_region_trace_free,
 	netfs_region_trace_new,
 };
 
+enum netfs_wreq_trace {
+	netfs_wreq_trace_free,
+	netfs_wreq_trace_put_discard,
+	netfs_wreq_trace_put_work,
+	netfs_wreq_trace_see_work,
+	netfs_wreq_trace_new,
+};
+
 #endif
 
 #define netfs_read_traces					\
@@ -149,6 +160,8 @@ enum netfs_region_trace {
 	EM(netfs_dirty_trace_complete,		"COMPLETE  ")	\
 	EM(netfs_dirty_trace_flush_conflict,	"FLSH CONFL")	\
 	EM(netfs_dirty_trace_flush_dsync,	"FLSH DSYNC")	\
+	EM(netfs_dirty_trace_flush_writepages,	"WRITEPAGES")	\
+	EM(netfs_dirty_trace_flushing,		"FLUSHING  ")	\
 	EM(netfs_dirty_trace_merged_back,	"MERGE BACK")	\
 	EM(netfs_dirty_trace_merged_forw,	"MERGE FORW")	\
 	EM(netfs_dirty_trace_merged_sub,	"SUBSUMED  ")	\
@@ -167,10 +180,19 @@ enum netfs_region_trace {
 	EM(netfs_region_trace_get_wreq,		"GET WREQ   ")	\
 	EM(netfs_region_trace_put_discard,	"PUT DISCARD")	\
 	EM(netfs_region_trace_put_merged,	"PUT MERGED ")	\
+	EM(netfs_region_trace_put_wreq,		"PUT WREQ   ")	\
 	EM(netfs_region_trace_put_write_iter,	"PUT WRITER ")	\
 	EM(netfs_region_trace_free,		"FREE       ")	\
 	E_(netfs_region_trace_new,		"NEW        ")
 
+#define netfs_wreq_traces					\
+	EM(netfs_wreq_trace_free,		"FREE       ")	\
+	EM(netfs_wreq_trace_put_discard,	"PUT DISCARD")	\
+	EM(netfs_wreq_trace_put_work,		"PUT WORK   ")	\
+	EM(netfs_wreq_trace_see_work,		"SEE WORK   ")	\
+	E_(netfs_wreq_trace_new,		"NEW        ")
+
+
 /*
  * Export enum symbols via userspace.
  */
@@ -187,6 +209,7 @@ netfs_failures;
 netfs_region_types;
 netfs_region_states;
 netfs_dirty_traces;
+netfs_wreq_traces;
 
 /*
  * Now redefine the EM() and E_() macros to map the enums to the strings that
@@ -435,6 +458,55 @@ TRACE_EVENT(netfs_dirty,
 		      )
 	    );
 
+TRACE_EVENT(netfs_wreq,
+	    TP_PROTO(struct netfs_write_request *wreq),
+
+	    TP_ARGS(wreq),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		wreq		)
+		    __field(unsigned int,		cookie		)
+		    __field(loff_t,			start		)
+		    __field(size_t,			len		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->wreq	= wreq->debug_id;
+		    __entry->cookie	= wreq->cache_resources.debug_id;
+		    __entry->start	= wreq->start;
+		    __entry->len	= wreq->len;
+			   ),
+
+	    TP_printk("W=%08x c=%08x s=%llx %zx",
+		      __entry->wreq,
+		      __entry->cookie,
+		      __entry->start, __entry->len)
+	    );
+
+TRACE_EVENT(netfs_ref_wreq,
+	    TP_PROTO(unsigned int wreq_debug_id, int ref,
+		     enum netfs_wreq_trace what),
+
+	    TP_ARGS(wreq_debug_id, ref, what),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		wreq		)
+		    __field(int,			ref		)
+		    __field(enum netfs_wreq_trace,	what		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->wreq	= wreq_debug_id;
+		    __entry->ref	= ref;
+		    __entry->what	= what;
+			   ),
+
+	    TP_printk("W=%08x %s r=%u",
+		      __entry->wreq,
+		      __print_symbolic(__entry->what, netfs_wreq_traces),
+		      __entry->ref)
+	    );
+
 #endif /* _TRACE_NETFS_H */
 
 /* This part must be outside protection */




* [RFC PATCH 08/12] netfs: Keep dirty mark for pages with more than one dirty region
  2021-07-21 13:44 David Howells
                   ` (6 preceding siblings ...)
  2021-07-21 13:46 ` [RFC PATCH 07/12] netfs: Initiate write request from a dirty region David Howells
@ 2021-07-21 13:46 ` David Howells
  2021-07-21 13:46 ` [RFC PATCH 09/12] netfs: Send write request to multiple destinations David Howells
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

If a page has more than one dirty region overlapping it, then we mustn't
clear the dirty mark when we want to flush one of them.

Make netfs_set_page_writeback() check the adjacent dirty regions to see if
they overlap the page(s) of the region we're interested in, and if they do,
leave the page marked dirty.
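
Put another way, a neighbouring region only forces the dirty mark to be kept
if it intersects the byte range spanned by the page.  A minimal sketch of the
test (the helper name is invented; the code below open-codes the check while
walking ctx->dirty_regions in both directions):

static bool netfs_region_overlaps_page(const struct netfs_dirty_region *r,
				       loff_t pos, size_t size)
{
	/* Only regions that have reached NETFS_REGION_IS_DIRTY count here. */
	if (r->state < NETFS_REGION_IS_DIRTY)
		return false;
	return r->dirty.start < pos + size && r->dirty.end > pos;
}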

NOTES:

 (1) Might want to discount the overlapping regions if they're being
     flushed (in which case they wouldn't normally want to hold the dirty
     bit).

 (2) Similarly, the writeback mark should not be cleared if the page is
     still being written back by another, overlapping region.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/netfs/write_back.c |   41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/write_back.c b/fs/netfs/write_back.c
index 9fcb2ac50ebb..5c779cb12345 100644
--- a/fs/netfs/write_back.c
+++ b/fs/netfs/write_back.c
@@ -135,12 +135,47 @@ static int netfs_lock_pages(struct address_space *mapping,
 	return ret;
 }
 
-static int netfs_set_page_writeback(struct page *page)
+static int netfs_set_page_writeback(struct page *page,
+				    struct netfs_i_context *ctx,
+				    struct netfs_write_request *wreq)
 {
+	struct netfs_dirty_region *region = wreq->region, *r;
+	loff_t pos = page_offset(page);
+	bool clear_dirty = true;
+
 	/* Now we need to clear the dirty flags on any page that's not shared
 	 * with any other dirty region.
 	 */
-	if (!clear_page_dirty_for_io(page))
+	spin_lock(&ctx->lock);
+	if (pos < region->dirty.start) {
+		r = region;
+		list_for_each_entry_continue_reverse(r, &ctx->dirty_regions, dirty_link) {
+			if (r->dirty.end <= pos)
+				break;
+			if (r->state < NETFS_REGION_IS_DIRTY)
+				continue;
+			kdebug("keep-dirty-b %lx reg=%x r=%x",
+			       page->index, region->debug_id, r->debug_id);
+			clear_dirty = false;
+		}
+	}
+
+	pos += thp_size(page);
+	if (pos > region->dirty.end) {
+		r = region;
+		list_for_each_entry_continue(r, &ctx->dirty_regions, dirty_link) {
+			if (r->dirty.start >= pos)
+				break;
+			if (r->state < NETFS_REGION_IS_DIRTY)
+				continue;
+			kdebug("keep-dirty-f %lx reg=%x r=%x",
+			       page->index, region->debug_id, r->debug_id);
+			clear_dirty = false;
+		}
+	}
+	spin_unlock(&ctx->lock);
+
+	if (clear_dirty && !clear_page_dirty_for_io(page))
 		BUG();
 
 	/* We set writeback unconditionally because a page may participate in
@@ -225,7 +260,7 @@ static int netfs_begin_write(struct address_space *mapping,
 	trace_netfs_wreq(wreq);
 
 	netfs_iterate_pages(mapping, wreq->first, wreq->last,
-			    netfs_set_page_writeback);
+			    netfs_set_page_writeback, ctx, wreq);
 	netfs_unlock_pages(mapping, wreq->first, wreq->last);
 	iov_iter_xarray(&wreq->source, WRITE, &wreq->mapping->i_pages,
 			wreq->start, wreq->len);




* [RFC PATCH 09/12] netfs: Send write request to multiple destinations
  2021-07-21 13:44 David Howells
                   ` (7 preceding siblings ...)
  2021-07-21 13:46 ` [RFC PATCH 08/12] netfs: Keep dirty mark for pages with more than one " David Howells
@ 2021-07-21 13:46 ` David Howells
  2021-07-21 13:46 ` [RFC PATCH 10/12] netfs: Do encryption in write preparatory phase David Howells
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Write requests are set up to have a number of "write streams", whereby each
stream writes the entire request to a different destination.  Destination
types include server uploads and cache writes.

Each stream may be segmented into a series of writes that can be issued
consecutively, for example uploading to an AFS server, writing to a cache
or both.
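
As a purely illustrative sketch (not part of this patch), a filesystem that
mirrors each write to two servers could add two upload streams.  Everything
prefixed "myfs_" is invented; the netfs_* names are the ones introduced here,
and the matching worker is sketched after the list of phases below.

static void myfs_upload_worker(struct work_struct *work);

static void myfs_add_write_streams(struct netfs_write_request *wreq)
{
	/* One stream per destination server; both consume wreq->source. */
	netfs_set_up_write_stream(wreq, NETFS_UPLOAD_TO_SERVER,
				  myfs_upload_worker);
	netfs_set_up_write_stream(wreq, NETFS_UPLOAD_TO_SERVER,
				  myfs_upload_worker);
}

The filesystem's inode-context setup would also set ctx->n_wstreams = 2 so
that netfs_alloc_write_request() reserves space for both streams.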

A request has, or will have, a number of phases:

 (1) Preparation.  The data may need to be copied into a buffer and
     compressed or encrypted.  The modified data would then be stored to
     the cache or the server.

 (2) Writing.  Each stream writes the data.

 (3) Completion.  The pages are cleaned or redirtied as appropriate and the
     dirty list is updated to remove the now flushed region.  Waiting write
     requests that are wholly within the range now made available are woken
     up.
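
Continuing the hypothetical myfs sketch above, the per-stream worker performs
phase (2) and feeds into phase (3); myfs_store_data() is invented, everything
else comes from this patch:

static void myfs_upload_worker(struct work_struct *work)
{
	struct netfs_write_stream *stream =
		container_of(work, struct netfs_write_stream, work);
	struct netfs_write_request *wreq = netfs_stream_to_wreq(stream);
	ssize_t ret;

	/* Phase (2): push the prepared data to this stream's server. */
	ret = myfs_store_data(wreq, stream->index, &wreq->source);

	/* Phase (3) runs once the last outstanding stream reports in. */
	netfs_write_stream_completed(stream, ret, false);
	netfs_put_write_request(wreq, false, netfs_wreq_trace_put_stream_work);
}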

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c                |    1 
 fs/afs/inode.c               |   13 ++
 fs/afs/internal.h            |    2 
 fs/afs/write.c               |  179 ++++++------------------------
 fs/netfs/internal.h          |    6 +
 fs/netfs/objects.c           |   25 ++++
 fs/netfs/stats.c             |   14 ++
 fs/netfs/write_back.c        |  249 ++++++++++++++++++++++++++++++++++++++++++
 fs/netfs/write_helper.c      |   28 +++--
 fs/netfs/xa_iterator.h       |   31 +++++
 include/linux/netfs.h        |   65 +++++++++++
 include/trace/events/netfs.h |   61 ++++++++++
 12 files changed, 515 insertions(+), 159 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index a6d483fe4e74..22030d5191cd 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -403,6 +403,7 @@ const struct netfs_request_ops afs_req_ops = {
 	.free_dirty_region	= afs_free_dirty_region,
 	.update_i_size		= afs_update_i_size,
 	.init_wreq		= afs_init_wreq,
+	.add_write_streams	= afs_add_write_streams,
 };
 
 int afs_write_inode(struct inode *inode, struct writeback_control *wbc)
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3e9e388245a1..a6ae031461c7 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -449,6 +449,15 @@ static void afs_get_inode_cache(struct afs_vnode *vnode)
 #endif
 }
 
+static void afs_set_netfs_context(struct afs_vnode *vnode)
+{
+	struct netfs_i_context *ctx = netfs_i_context(&vnode->vfs_inode);
+
+	netfs_i_context_init(&vnode->vfs_inode, &afs_req_ops);
+	ctx->n_wstreams = 1;
+	ctx->bsize = PAGE_SIZE;
+}
+
 /*
  * inode retrieval
  */
@@ -479,7 +488,7 @@ struct inode *afs_iget(struct afs_operation *op, struct afs_vnode_param *vp)
 		return inode;
 	}
 
-	netfs_i_context_init(inode, &afs_req_ops);
+	afs_set_netfs_context(vnode);
 	ret = afs_inode_init_from_status(op, vp, vnode);
 	if (ret < 0)
 		goto bad_inode;
@@ -536,10 +545,10 @@ struct inode *afs_root_iget(struct super_block *sb, struct key *key)
 	_debug("GOT ROOT INODE %p { vl=%llx }", inode, as->volume->vid);
 
 	BUG_ON(!(inode->i_state & I_NEW));
-	netfs_i_context_init(inode, &afs_req_ops);
 
 	vnode = AFS_FS_I(inode);
 	vnode->cb_v_break = as->volume->cb_v_break,
+	afs_set_netfs_context(vnode);
 
 	op = afs_alloc_operation(key, as->volume);
 	if (IS_ERR(op)) {
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 0d01ed2fe8fa..32a36b96cc9b 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -1512,12 +1512,12 @@ extern int afs_check_volume_status(struct afs_volume *, struct afs_operation *);
  */
 extern int afs_set_page_dirty(struct page *);
 extern int afs_writepage(struct page *, struct writeback_control *);
-extern int afs_writepages(struct address_space *, struct writeback_control *);
 extern int afs_fsync(struct file *, loff_t, loff_t, int);
 extern vm_fault_t afs_page_mkwrite(struct vm_fault *vmf);
 extern void afs_prune_wb_keys(struct afs_vnode *);
 extern int afs_launder_page(struct page *);
 extern ssize_t afs_file_direct_write(struct kiocb *, struct iov_iter *);
+extern void afs_add_write_streams(struct netfs_write_request *);
 
 /*
  * xattr.c
diff --git a/fs/afs/write.c b/fs/afs/write.c
index e6e2e924c8ae..0668389f3466 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -13,6 +13,7 @@
 #include <linux/pagevec.h>
 #include <linux/netfs.h>
 #include <linux/fscache.h>
+#include <trace/events/netfs.h>
 #include "internal.h"
 
 static void afs_write_to_cache(struct afs_vnode *vnode, loff_t start, size_t len,
@@ -120,31 +121,9 @@ static void afs_redirty_pages(struct writeback_control *wbc,
  */
 static void afs_pages_written_back(struct afs_vnode *vnode, loff_t start, unsigned int len)
 {
-	struct address_space *mapping = vnode->vfs_inode.i_mapping;
-	struct page *page;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
 	_enter("{%llx:%llu},{%x @%llx}",
 	       vnode->fid.vid, vnode->fid.vnode, len, start);
 
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, page, end) {
-		if (!PageWriteback(page)) {
-			kdebug("bad %x @%llx page %lx %lx", len, start, page->index, end);
-			ASSERT(PageWriteback(page));
-		}
-
-		trace_afs_page_dirty(vnode, tracepoint_string("clear"), page);
-		detach_page_private(page);
-		page_endio(page, true, 0);
-	}
-
-	rcu_read_unlock();
-
 	afs_prune_wb_keys(vnode);
 	_leave("");
 }
@@ -281,6 +260,39 @@ static int afs_store_data(struct afs_vnode *vnode, struct iov_iter *iter, loff_t
 	return afs_put_operation(op);
 }
 
+static void afs_upload_to_server(struct netfs_write_stream *stream,
+				 struct netfs_write_request *wreq)
+{
+	struct afs_vnode *vnode = AFS_FS_I(wreq->inode);
+	ssize_t ret;
+
+	kenter("%u", stream->index);
+
+	trace_netfs_wstr(stream, netfs_write_stream_submit);
+	ret = afs_store_data(vnode, &wreq->source, wreq->start, false);
+	netfs_write_stream_completed(stream, ret, false);
+}
+
+static void afs_upload_to_server_worker(struct work_struct *work)
+{
+	struct netfs_write_stream *stream = container_of(work, struct netfs_write_stream, work);
+	struct netfs_write_request *wreq = netfs_stream_to_wreq(stream);
+
+	afs_upload_to_server(stream, wreq);
+	netfs_put_write_request(wreq, false, netfs_wreq_trace_put_stream_work);
+}
+
+/*
+ * Add write streams to a write request.  We need to add a single stream for
+ * the server we're writing to.
+ */
+void afs_add_write_streams(struct netfs_write_request *wreq)
+{
+	kenter("");
+	netfs_set_up_write_stream(wreq, NETFS_UPLOAD_TO_SERVER,
+				  afs_upload_to_server_worker);
+}
+
 /*
  * Extend the region to be written back to include subsequent contiguously
  * dirty pages if possible, but don't sleep while doing so.
@@ -543,129 +555,6 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
-/*
- * write a region of pages back to the server
- */
-static int afs_writepages_region(struct address_space *mapping,
-				 struct writeback_control *wbc,
-				 loff_t start, loff_t end, loff_t *_next)
-{
-	struct page *page;
-	ssize_t ret;
-	int n;
-
-	_enter("%llx,%llx,", start, end);
-
-	do {
-		pgoff_t index = start / PAGE_SIZE;
-
-		n = find_get_pages_range_tag(mapping, &index, end / PAGE_SIZE,
-					     PAGECACHE_TAG_DIRTY, 1, &page);
-		if (!n)
-			break;
-
-		start = (loff_t)page->index * PAGE_SIZE; /* May regress with THPs */
-
-		_debug("wback %lx", page->index);
-
-		/* At this point we hold neither the i_pages lock nor the
-		 * page lock: the page may be truncated or invalidated
-		 * (changing page->mapping to NULL), or even swizzled
-		 * back from swapper_space to tmpfs file mapping
-		 */
-		if (wbc->sync_mode != WB_SYNC_NONE) {
-			ret = lock_page_killable(page);
-			if (ret < 0) {
-				put_page(page);
-				return ret;
-			}
-		} else {
-			if (!trylock_page(page)) {
-				put_page(page);
-				return 0;
-			}
-		}
-
-		if (page->mapping != mapping || !PageDirty(page)) {
-			start += thp_size(page);
-			unlock_page(page);
-			put_page(page);
-			continue;
-		}
-
-		if (PageWriteback(page) || PageFsCache(page)) {
-			unlock_page(page);
-			if (wbc->sync_mode != WB_SYNC_NONE) {
-				wait_on_page_writeback(page);
-#ifdef CONFIG_AFS_FSCACHE
-				wait_on_page_fscache(page);
-#endif
-			}
-			put_page(page);
-			continue;
-		}
-
-		if (!clear_page_dirty_for_io(page))
-			BUG();
-		ret = afs_write_back_from_locked_page(mapping, wbc, page, start, end);
-		put_page(page);
-		if (ret < 0) {
-			_leave(" = %zd", ret);
-			return ret;
-		}
-
-		start += ret;
-
-		cond_resched();
-	} while (wbc->nr_to_write > 0);
-
-	*_next = start;
-	_leave(" = 0 [%llx]", *_next);
-	return 0;
-}
-
-/*
- * write some of the pending data back to the server
- */
-int afs_writepages(struct address_space *mapping,
-		   struct writeback_control *wbc)
-{
-	struct afs_vnode *vnode = AFS_FS_I(mapping->host);
-	loff_t start, next;
-	int ret;
-
-	_enter("");
-
-	/* We have to be careful as we can end up racing with setattr()
-	 * truncating the pagecache since the caller doesn't take a lock here
-	 * to prevent it.
-	 */
-	if (wbc->sync_mode == WB_SYNC_ALL)
-		down_read(&vnode->validate_lock);
-	else if (!down_read_trylock(&vnode->validate_lock))
-		return 0;
-
-	if (wbc->range_cyclic) {
-		start = mapping->writeback_index * PAGE_SIZE;
-		ret = afs_writepages_region(mapping, wbc, start, LLONG_MAX, &next);
-		if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
-			ret = afs_writepages_region(mapping, wbc, 0, start,
-						    &next);
-		mapping->writeback_index = next / PAGE_SIZE;
-	} else if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
-		ret = afs_writepages_region(mapping, wbc, 0, LLONG_MAX, &next);
-		if (wbc->nr_to_write > 0 && ret == 0)
-			mapping->writeback_index = next;
-	} else {
-		ret = afs_writepages_region(mapping, wbc,
-					    wbc->range_start, wbc->range_end, &next);
-	}
-
-	up_read(&vnode->validate_lock);
-	_leave(" = %d", ret);
-	return ret;
-}
-
 /*
  * flush any dirty pages for this process, and check for write errors.
  * - the return status from this call provides a reliable indication of
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index fe85581d8ac0..6fdf9e5663f7 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -89,7 +89,13 @@ extern atomic_t netfs_n_rh_write_failed;
 extern atomic_t netfs_n_rh_write_zskip;
 extern atomic_t netfs_n_wh_region;
 extern atomic_t netfs_n_wh_flush_group;
+extern atomic_t netfs_n_wh_upload;
+extern atomic_t netfs_n_wh_upload_done;
+extern atomic_t netfs_n_wh_upload_failed;
 extern atomic_t netfs_n_wh_wreq;
+extern atomic_t netfs_n_wh_write;
+extern atomic_t netfs_n_wh_write_done;
+extern atomic_t netfs_n_wh_write_failed;
 
 
 static inline void netfs_stat(atomic_t *stat)
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 6e9b2a00076d..8926b4230d91 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -119,16 +119,29 @@ struct netfs_write_request *netfs_alloc_write_request(struct address_space *mapp
 	struct inode *inode = mapping->host;
 	struct netfs_i_context *ctx = netfs_i_context(inode);
 	struct netfs_write_request *wreq;
+	unsigned int n_streams = ctx->n_wstreams, i;
+	bool cached;
 
-	wreq = kzalloc(sizeof(struct netfs_write_request), GFP_KERNEL);
+	if (!is_dio && netfs_is_cache_enabled(inode)) {
+		n_streams++;
+		cached = true;
+	}
+
+	wreq = kzalloc(struct_size(wreq, streams, n_streams), GFP_KERNEL);
 	if (wreq) {
 		wreq->mapping	= mapping;
 		wreq->inode	= inode;
 		wreq->netfs_ops	= ctx->ops;
+		wreq->max_streams = n_streams;
 		wreq->debug_id	= atomic_inc_return(&debug_ids);
+		if (cached)
+			__set_bit(NETFS_WREQ_WRITE_TO_CACHE, &wreq->flags);
 		xa_init(&wreq->buffer);
 		INIT_WORK(&wreq->work, netfs_writeback_worker);
+		for (i = 0; i < n_streams; i++)
+			INIT_LIST_HEAD(&wreq->streams[i].subrequests);
 		refcount_set(&wreq->usage, 1);
+		atomic_set(&wreq->outstanding, 1);
 		ctx->ops->init_wreq(wreq);
 		netfs_stat(&netfs_n_wh_wreq);
 		trace_netfs_ref_wreq(wreq->debug_id, 1, netfs_wreq_trace_new);
@@ -170,6 +183,15 @@ void netfs_free_write_request(struct work_struct *work)
 	netfs_stat_d(&netfs_n_wh_wreq);
 }
 
+/**
+ * netfs_put_write_request - Drop a reference on a write request descriptor.
+ * @wreq: The write request to drop
+ * @was_async: True if being called in a non-sleeping context
+ * @what: Reason code, to be displayed in trace line
+ *
+ * Drop a reference on a write request and schedule it for destruction
+ * after the last ref is gone.
+ */
 void netfs_put_write_request(struct netfs_write_request *wreq,
 			     bool was_async, enum netfs_wreq_trace what)
 {
@@ -189,3 +211,4 @@ void netfs_put_write_request(struct netfs_write_request *wreq,
 		}
 	}
 }
+EXPORT_SYMBOL(netfs_put_write_request);
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index ac2510f8cab0..a02d95bba158 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -30,6 +30,12 @@ atomic_t netfs_n_rh_write_zskip;
 atomic_t netfs_n_wh_region;
 atomic_t netfs_n_wh_flush_group;
 atomic_t netfs_n_wh_wreq;
+atomic_t netfs_n_wh_upload;
+atomic_t netfs_n_wh_upload_done;
+atomic_t netfs_n_wh_upload_failed;
+atomic_t netfs_n_wh_write;
+atomic_t netfs_n_wh_write_done;
+atomic_t netfs_n_wh_write_failed;
 
 void netfs_stats_show(struct seq_file *m)
 {
@@ -61,5 +67,13 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_wh_region),
 		   atomic_read(&netfs_n_wh_flush_group),
 		   atomic_read(&netfs_n_wh_wreq));
+	seq_printf(m, "WrHelp : UL=%u us=%u uf=%u\n",
+		   atomic_read(&netfs_n_wh_upload),
+		   atomic_read(&netfs_n_wh_upload_done),
+		   atomic_read(&netfs_n_wh_upload_failed));
+	seq_printf(m, "WrHelp : WR=%u ws=%u wf=%u\n",
+		   atomic_read(&netfs_n_wh_write),
+		   atomic_read(&netfs_n_wh_write_done),
+		   atomic_read(&netfs_n_wh_write_failed));
 }
 EXPORT_SYMBOL(netfs_stats_show);
diff --git a/fs/netfs/write_back.c b/fs/netfs/write_back.c
index 5c779cb12345..15cc0e1b9acf 100644
--- a/fs/netfs/write_back.c
+++ b/fs/netfs/write_back.c
@@ -11,12 +11,259 @@
 #include <linux/slab.h>
 #include "internal.h"
 
+static int netfs_redirty_iterator(struct xa_state *xas, struct page *page)
+{
+	__set_page_dirty_nobuffers(page);
+	account_page_redirty(page);
+	end_page_writeback(page);
+	return 0;
+}
+
+/*
+ * Redirty all the pages in a given range.
+ */
+static void netfs_redirty_pages(struct netfs_write_request *wreq)
+{
+	_enter("%lx-%lx", wreq->first, wreq->last);
+
+	netfs_iterate_pinned_pages(wreq->mapping, wreq->first, wreq->last,
+				   netfs_redirty_iterator);
+	_leave("");
+}
+
+static int netfs_end_writeback_iterator(struct xa_state *xas, struct page *page)
+{
+	end_page_writeback(page);
+	return 0;
+}
+
+/*
+ * Fix up the dirty list upon completion of write.
+ */
+static void netfs_fix_up_dirty_list(struct netfs_write_request *wreq)
+{
+	struct netfs_dirty_region *region = wreq->region, *r;
+	struct netfs_i_context *ctx = netfs_i_context(wreq->inode);
+	unsigned long long available_to;
+	struct list_head *lower, *upper, *p;
+
+	netfs_iterate_pinned_pages(wreq->mapping, wreq->first, wreq->last,
+				   netfs_end_writeback_iterator);
+
+	spin_lock(&ctx->lock);
+
+	/* Find the bounds of the region we're going to make available. */
+	lower = &ctx->dirty_regions;
+	r = region;
+	list_for_each_entry_continue_reverse(r, &ctx->dirty_regions, dirty_link) {
+		_debug("- back %x", r->debug_id);
+		if (r->state >= NETFS_REGION_IS_DIRTY) {
+			lower = &r->dirty_link;
+			break;
+		}
+	}
+
+	available_to = ULLONG_MAX;
+	upper = &ctx->dirty_regions;
+	r = region;
+	list_for_each_entry_continue(r, &ctx->dirty_regions, dirty_link) {
+		_debug("- forw %x", r->debug_id);
+		if (r->state >= NETFS_REGION_IS_DIRTY) {
+			available_to = r->dirty.start;
+			upper = &r->dirty_link;
+			break;
+		}
+	}
+
+	/* Remove this region and we can start any waiters that are wholly
+	 * inside of the now-available region.
+	 */
+	list_del_init(&region->dirty_link);
+
+	for (p = lower->next; p != upper; p = p->next) {
+		r = list_entry(p, struct netfs_dirty_region, dirty_link);
+		if (r->reserved.end <= available_to) {
+			smp_store_release(&r->state, NETFS_REGION_IS_ACTIVE);
+			trace_netfs_dirty(ctx, r, NULL, netfs_dirty_trace_activate);
+			wake_up_var(&r->state);
+		}
+	}
+
+	spin_unlock(&ctx->lock);
+	netfs_put_dirty_region(ctx, region, netfs_region_trace_put_dirty);
+}
+
+/*
+ * Process a completed write request once all the component streams have been
+ * completed.
+ */
+static void netfs_write_completed(struct netfs_write_request *wreq, bool was_async)
+{
+	struct netfs_i_context *ctx = netfs_i_context(wreq->inode);
+	unsigned int s;
+
+	for (s = 0; s < wreq->n_streams; s++) {
+		struct netfs_write_stream *stream = &wreq->streams[s];
+		if (!stream->error)
+			continue;
+		switch (stream->dest) {
+		case NETFS_UPLOAD_TO_SERVER:
+			/* Depending on the type of failure, this may prevent
+			 * writeback completion unless we're in disconnected
+			 * mode.
+			 */
+			if (!wreq->error)
+				wreq->error = stream->error;
+			break;
+
+		case NETFS_WRITE_TO_CACHE:
+			/* Failure doesn't prevent writeback completion unless
+			 * we're in disconnected mode.
+			 */
+			if (stream->error != -ENOBUFS)
+				ctx->ops->invalidate_cache(wreq);
+			break;
+
+		default:
+			WARN_ON_ONCE(1);
+			if (!wreq->error)
+				wreq->error = -EIO;
+			break;	/* fall through to the error handling below */
+		}
+	}
+
+	if (wreq->error)
+		netfs_redirty_pages(wreq);
+	else
+		netfs_fix_up_dirty_list(wreq);
+	netfs_put_write_request(wreq, was_async, netfs_wreq_trace_put_for_outstanding);
+}
+
+/*
+ * Deal with the completion of writing the data to the cache.
+ */
+void netfs_write_stream_completed(void *_stream, ssize_t transferred_or_error,
+				  bool was_async)
+{
+	struct netfs_write_stream *stream = _stream;
+	struct netfs_write_request *wreq = netfs_stream_to_wreq(stream);
+
+	if (IS_ERR_VALUE(transferred_or_error))
+		stream->error = transferred_or_error;
+	switch (stream->dest) {
+	case NETFS_UPLOAD_TO_SERVER:
+		if (stream->error)
+			netfs_stat(&netfs_n_wh_upload_failed);
+		else
+			netfs_stat(&netfs_n_wh_upload_done);
+		break;
+	case NETFS_WRITE_TO_CACHE:
+		if (stream->error)
+			netfs_stat(&netfs_n_wh_write_failed);
+		else
+			netfs_stat(&netfs_n_wh_write_done);
+		break;
+	case NETFS_INVALID_WRITE:
+		break;
+	}
+
+	trace_netfs_wstr(stream, netfs_write_stream_complete);
+	if (atomic_dec_and_test(&wreq->outstanding))
+		netfs_write_completed(wreq, was_async);
+}
+EXPORT_SYMBOL(netfs_write_stream_completed);
+
+static void netfs_write_to_cache_stream(struct netfs_write_stream *stream,
+					struct netfs_write_request *wreq)
+{
+	trace_netfs_wstr(stream, netfs_write_stream_submit);
+	fscache_write_to_cache(netfs_i_cookie(wreq->inode), wreq->mapping,
+			       wreq->start, wreq->len, wreq->region->i_size,
+			       netfs_write_stream_completed, stream);
+}
+
+static void netfs_write_to_cache_stream_worker(struct work_struct *work)
+{
+	struct netfs_write_stream *stream = container_of(work, struct netfs_write_stream, work);
+	struct netfs_write_request *wreq = netfs_stream_to_wreq(stream);
+
+	netfs_write_to_cache_stream(stream, wreq);
+	netfs_put_write_request(wreq, false, netfs_wreq_trace_put_stream_work);
+}
+
+/**
+ * netfs_set_up_write_stream - Allocate, set up and launch a write stream.
+ * @wreq: The write request to attach the stream to
+ * @dest: The destination type
+ * @worker: The worker function to handle the write(s)
+ *
+ * Allocate the next write stream from a write request and queue the worker to
+ * make it happen.
+ */
+void netfs_set_up_write_stream(struct netfs_write_request *wreq,
+			       enum netfs_write_dest dest, work_func_t worker)
+{
+	struct netfs_write_stream *stream;
+	unsigned int s = wreq->n_streams++;
+
+	kenter("%u,%u", s, dest);
+
+	stream		= &wreq->streams[s];
+	stream->dest	= dest;
+	stream->index	= s;
+	INIT_WORK(&stream->work, worker);
+	atomic_inc(&wreq->outstanding);
+	trace_netfs_wstr(stream, netfs_write_stream_setup);
+
+	switch (stream->dest) {
+	case NETFS_UPLOAD_TO_SERVER:
+		netfs_stat(&netfs_n_wh_upload);
+		break;
+	case NETFS_WRITE_TO_CACHE:
+		netfs_stat(&netfs_n_wh_write);
+		break;
+	case NETFS_INVALID_WRITE:
+		BUG();
+	}
+
+	netfs_get_write_request(wreq, netfs_wreq_trace_get_stream_work);
+	if (!queue_work(system_unbound_wq, &stream->work))
+		netfs_put_write_request(wreq, false, netfs_wreq_trace_put_discard);
+}
+EXPORT_SYMBOL(netfs_set_up_write_stream);
+
+/*
+ * Set up a stream for writing to the cache.
+ */
+static void netfs_set_up_write_to_cache(struct netfs_write_request *wreq)
+{
+	netfs_set_up_write_stream(wreq, NETFS_WRITE_TO_CACHE,
+				  netfs_write_to_cache_stream_worker);
+}
+
 /*
  * Process a write request.
+ *
+ * All the pages in the bounding box have had a ref taken on them and those
+ * covering the dirty region have been marked as being written back and their
+ * dirty bits provisionally cleared.
  */
 static void netfs_writeback(struct netfs_write_request *wreq)
 {
-	kdebug("--- WRITE ---");
+	struct netfs_i_context *ctx = netfs_i_context(wreq->inode);
+
+	kenter("");
+
+	/* TODO: Encrypt or compress the region as appropriate */
+
+	/* ->outstanding > 0 carries a ref */
+	netfs_get_write_request(wreq, netfs_wreq_trace_get_for_outstanding);
+
+	if (test_bit(NETFS_WREQ_WRITE_TO_CACHE, &wreq->flags))
+		netfs_set_up_write_to_cache(wreq);
+	ctx->ops->add_write_streams(wreq);
+	if (atomic_dec_and_test(&wreq->outstanding))
+		netfs_write_completed(wreq, false);
 }
 
 void netfs_writeback_worker(struct work_struct *work)
diff --git a/fs/netfs/write_helper.c b/fs/netfs/write_helper.c
index a8c58eaa84d0..fa048e3882ea 100644
--- a/fs/netfs/write_helper.c
+++ b/fs/netfs/write_helper.c
@@ -139,18 +139,30 @@ static enum netfs_write_compatibility netfs_write_compatibility(
 	struct netfs_dirty_region *old,
 	struct netfs_dirty_region *candidate)
 {
-	if (old->type == NETFS_REGION_DIO ||
-	    old->type == NETFS_REGION_DSYNC ||
-	    old->state >= NETFS_REGION_IS_FLUSHING ||
-	    /* The bounding boxes of DSYNC writes can overlap with those of
-	     * other DSYNC writes and ordinary writes.
-	     */
+	/* Regions being actively flushed can't be merged with */
+	if (old->state >= NETFS_REGION_IS_FLUSHING ||
 	    candidate->group != old->group ||
-	    old->group->flush)
+	    old->group->flush) {
+		kleave(" = INCOM [flush]");
 		return NETFS_WRITES_INCOMPATIBLE;
+	}
+
+	/* The bounding boxes of DSYNC writes can overlap with those of other
+	 * DSYNC writes and ordinary writes.  DIO writes cannot overlap at all.
+	 */
+	if (candidate->type == NETFS_REGION_DIO ||
+	    old->type == NETFS_REGION_DIO ||
+	    old->type == NETFS_REGION_DSYNC) {
+		kleave(" = INCOM [dio/dsy]");
+		return NETFS_WRITES_INCOMPATIBLE;
+	}
+
 	if (!ctx->ops->is_write_compatible) {
-		if (candidate->type == NETFS_REGION_DSYNC)
+		if (candidate->type == NETFS_REGION_DSYNC) {
+			kleave(" = SUPER [dsync]");
 			return NETFS_WRITES_SUPERSEDE;
+		}
+		kleave(" = COMPT");
 		return NETFS_WRITES_COMPATIBLE;
 	}
 	return ctx->ops->is_write_compatible(ctx, old, candidate);
diff --git a/fs/netfs/xa_iterator.h b/fs/netfs/xa_iterator.h
index 3f37827f0f99..67e1daa964ab 100644
--- a/fs/netfs/xa_iterator.h
+++ b/fs/netfs/xa_iterator.h
@@ -5,6 +5,37 @@
  * Written by David Howells (dhowells@redhat.com)
  */
 
+/*
+ * Iterate over a set of pages that we hold pinned with the writeback flag.
+ * The iteration function may drop the RCU read lock, but should call
+ * xas_pause() before it does so.
+ */
+#define netfs_iterate_pinned_pages(MAPPING, START, END, ITERATOR, ...)	\
+	({								\
+		struct page *page;					\
+		pgoff_t __it_start = (START);				\
+		pgoff_t __it_end = (END);				\
+		int ret = 0;						\
+									\
+		XA_STATE(xas, &(MAPPING)->i_pages, __it_start);		\
+		rcu_read_lock();					\
+		for (page = xas_load(&xas); page; page = xas_next_entry(&xas, __it_end)) { \
+			if (xas_retry(&xas, page))			\
+				continue;				\
+			if (xa_is_value(page))				\
+				break;					\
+			if (unlikely(page != xas_reload(&xas))) {	\
+				xas_reset(&xas);			\
+				continue;				\
+			}						\
+			ret = ITERATOR(&xas, page, ##__VA_ARGS__);	\
+			if (ret < 0)					\
+				break;					\
+		}							\
+		rcu_read_unlock();					\
+		ret;							\
+	})
+
 /*
  * Iterate over a range of pages.  xarray locks are not held over the iterator
  * function, so it can sleep if necessary.  The start and end positions are
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9f874e7ed45a..9d50c2933863 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -19,6 +19,8 @@
 #include <linux/pagemap.h>
 #include <linux/uio.h>
 
+enum netfs_wreq_trace;
+
 /*
  * Overload PG_private_2 to give us PG_fscache - this is used to indicate that
  * a page is currently backed by a local disk cache
@@ -180,6 +182,7 @@ struct netfs_i_context {
 	unsigned int		wsize;		/* Maximum write size */
 	unsigned int		bsize;		/* Min block size for bounding box */
 	unsigned int		inval_counter;	/* Number of invalidations made */
+	unsigned char		n_wstreams;	/* Number of write streams to allocate */
 };
 
 /*
@@ -242,12 +245,53 @@ struct netfs_dirty_region {
 	refcount_t		ref;
 };
 
+enum netfs_write_dest {
+	NETFS_UPLOAD_TO_SERVER,
+	NETFS_WRITE_TO_CACHE,
+	NETFS_INVALID_WRITE,
+} __mode(byte);
+
+/*
+ * Descriptor for a write subrequest.  Each subrequest represents an individual
+ * write to a server or a cache.
+ */
+struct netfs_write_subrequest {
+	struct netfs_write_request *wreq;	/* Supervising write request */
+	struct list_head	stream_link;	/* Link in stream->subrequests */
+	loff_t			start;		/* Where to start the I/O */
+	size_t			len;		/* Size of the I/O */
+	size_t			transferred;	/* Amount of data transferred */
+	refcount_t		usage;
+	short			error;		/* 0 or error that occurred */
+	unsigned short		debug_index;	/* Index in list (for debugging output) */
+	unsigned char		stream_index;	/* Which stream we're part of */
+	enum netfs_write_dest	dest;		/* Where to write to */
+};
+
+/*
+ * Descriptor for a write stream.  Each stream represents a sequence of writes
+ * to a destination, where a stream covers the entirety of the write request.
+ * All of a stream goes to the same destination - and that destination might be
+ * a server, a cache, a journal.
+ *
+ * Each stream may be split up into separate subrequests according to different
+ * rules.
+ */
+struct netfs_write_stream {
+	struct work_struct	work;
+	struct list_head	subrequests;	/* The subrequests comprising this stream */
+	enum netfs_write_dest	dest;		/* Where to write to */
+	unsigned char		index;		/* Index in wreq->streams[] */
+	short			error;		/* 0 or error that occurred */
+};
+
 /*
  * Descriptor for a write request.  This is used to manage the preparation and
  * storage of a sequence of dirty data - its compression/encryption and its
  * writing to one or more servers and the cache.
  *
- * The prepared data is buffered here.
+ * The prepared data is buffered here, and then the streams are used to
+ * distribute the buffer to various destinations (servers, caches, etc.).
  */
 struct netfs_write_request {
 	struct work_struct	work;
@@ -260,15 +304,20 @@ struct netfs_write_request {
 	struct list_head	write_link;	/* Link in i_context->write_requests */
 	void			*netfs_priv;	/* Private data for the netfs */
 	unsigned int		debug_id;
+	unsigned char		max_streams;	/* Number of streams allocated */
+	unsigned char		n_streams;	/* Number of streams in use */
 	short			error;		/* 0 or error that occurred */
 	loff_t			i_size;		/* Size of the file */
 	loff_t			start;		/* Start position */
 	size_t			len;		/* Length of the request */
 	pgoff_t			first;		/* First page included */
 	pgoff_t			last;		/* Last page included */
+	atomic_t		outstanding;	/* Number of outstanding writes */
 	refcount_t		usage;
 	unsigned long		flags;
+#define NETFS_WREQ_WRITE_TO_CACHE	0	/* Need to write to the cache */
 	const struct netfs_request_ops *netfs_ops;
+	struct netfs_write_stream streams[];	/* Individual write streams */
 };
 
 enum netfs_write_compatibility {
@@ -307,6 +356,8 @@ struct netfs_request_ops {
 
 	/* Write request handling */
 	void (*init_wreq)(struct netfs_write_request *wreq);
+	void (*add_write_streams)(struct netfs_write_request *wreq);
+	void (*invalidate_cache)(struct netfs_write_request *wreq);
 };
 
 /*
@@ -363,6 +414,12 @@ extern int netfs_releasepage(struct page *page, gfp_t gfp_flags);
 extern void netfs_subreq_terminated(struct netfs_read_subrequest *, ssize_t, bool);
 extern void netfs_stats_show(struct seq_file *);
 extern struct netfs_flush_group *netfs_new_flush_group(struct inode *, void *);
+extern void netfs_set_up_write_stream(struct netfs_write_request *wreq,
+				      enum netfs_write_dest dest, work_func_t worker);
+extern void netfs_put_write_request(struct netfs_write_request *wreq,
+				    bool was_async, enum netfs_wreq_trace what);
+extern void netfs_write_stream_completed(void *_stream, ssize_t transferred_or_error,
+					 bool was_async);
 
 /**
  * netfs_i_context - Get the netfs inode context from the inode
@@ -407,4 +464,10 @@ static inline struct fscache_cookie *netfs_i_cookie(struct inode *inode)
 #endif
 }
 
+static inline
+struct netfs_write_request *netfs_stream_to_wreq(struct netfs_write_stream *stream)
+{
+	return container_of(stream, struct netfs_write_request, streams[stream->index]);
+}
+
 #endif /* _LINUX_NETFS_H */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index e70abb5033e6..aa002725b209 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -59,6 +59,7 @@ enum netfs_failure {
 
 enum netfs_dirty_trace {
 	netfs_dirty_trace_active,
+	netfs_dirty_trace_activate,
 	netfs_dirty_trace_commit,
 	netfs_dirty_trace_complete,
 	netfs_dirty_trace_flush_conflict,
@@ -82,6 +83,7 @@ enum netfs_dirty_trace {
 enum netfs_region_trace {
 	netfs_region_trace_get_dirty,
 	netfs_region_trace_get_wreq,
+	netfs_region_trace_put_dirty,
 	netfs_region_trace_put_discard,
 	netfs_region_trace_put_merged,
 	netfs_region_trace_put_wreq,
@@ -92,12 +94,22 @@ enum netfs_region_trace {
 
 enum netfs_wreq_trace {
 	netfs_wreq_trace_free,
+	netfs_wreq_trace_get_for_outstanding,
+	netfs_wreq_trace_get_stream_work,
 	netfs_wreq_trace_put_discard,
+	netfs_wreq_trace_put_for_outstanding,
+	netfs_wreq_trace_put_stream_work,
 	netfs_wreq_trace_put_work,
 	netfs_wreq_trace_see_work,
 	netfs_wreq_trace_new,
 };
 
+enum netfs_write_stream_trace {
+	netfs_write_stream_complete,
+	netfs_write_stream_setup,
+	netfs_write_stream_submit,
+};
+
 #endif
 
 #define netfs_read_traces					\
@@ -156,6 +168,7 @@ enum netfs_wreq_trace {
 
 #define netfs_dirty_traces					\
 	EM(netfs_dirty_trace_active,		"ACTIVE    ")	\
+	EM(netfs_dirty_trace_activate,		"ACTIVATE  ")	\
 	EM(netfs_dirty_trace_commit,		"COMMIT    ")	\
 	EM(netfs_dirty_trace_complete,		"COMPLETE  ")	\
 	EM(netfs_dirty_trace_flush_conflict,	"FLSH CONFL")	\
@@ -178,6 +191,7 @@ enum netfs_wreq_trace {
 #define netfs_region_traces					\
 	EM(netfs_region_trace_get_dirty,	"GET DIRTY  ")	\
 	EM(netfs_region_trace_get_wreq,		"GET WREQ   ")	\
+	EM(netfs_region_trace_put_dirty,	"PUT DIRTY  ")	\
 	EM(netfs_region_trace_put_discard,	"PUT DISCARD")	\
 	EM(netfs_region_trace_put_merged,	"PUT MERGED ")	\
 	EM(netfs_region_trace_put_wreq,		"PUT WREQ   ")	\
@@ -187,11 +201,24 @@ enum netfs_wreq_trace {
 
 #define netfs_wreq_traces					\
 	EM(netfs_wreq_trace_free,		"FREE       ")	\
+	EM(netfs_wreq_trace_get_for_outstanding,"GET OUTSTND")	\
+	EM(netfs_wreq_trace_get_stream_work,	"GET S-WORK ")	\
 	EM(netfs_wreq_trace_put_discard,	"PUT DISCARD")	\
+	EM(netfs_wreq_trace_put_for_outstanding,"PUT OUTSTND")	\
+	EM(netfs_wreq_trace_put_stream_work,	"PUT S-WORK ")	\
 	EM(netfs_wreq_trace_put_work,		"PUT WORK   ")	\
 	EM(netfs_wreq_trace_see_work,		"SEE WORK   ")	\
 	E_(netfs_wreq_trace_new,		"NEW        ")
 
+#define netfs_write_destinations				\
+	EM(NETFS_UPLOAD_TO_SERVER,		"UPLD")		\
+	EM(NETFS_WRITE_TO_CACHE,		"WRIT")		\
+	E_(NETFS_INVALID_WRITE,			"INVL")
+
+#define netfs_write_stream_traces		\
+	EM(netfs_write_stream_complete,		"DONE ")	\
+	EM(netfs_write_stream_setup,		"SETUP")	\
+	E_(netfs_write_stream_submit,		"SUBMT")
 
 /*
  * Export enum symbols via userspace.
@@ -210,6 +237,8 @@ netfs_region_types;
 netfs_region_states;
 netfs_dirty_traces;
 netfs_wreq_traces;
+netfs_write_destinations;
+netfs_write_stream_traces;
 
 /*
  * Now redefine the EM() and E_() macros to map the enums to the strings that
@@ -507,6 +536,38 @@ TRACE_EVENT(netfs_ref_wreq,
 		      __entry->ref)
 	    );
 
+TRACE_EVENT(netfs_wstr,
+	    TP_PROTO(struct netfs_write_stream *stream,
+		     enum netfs_write_stream_trace what),
+
+	    TP_ARGS(stream, what),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,		wreq		)
+		    __field(unsigned char,		stream		)
+		    __field(short,			error		)
+		    __field(unsigned short,		flags		)
+		    __field(enum netfs_write_dest,	dest		)
+		    __field(enum netfs_write_stream_trace, what		)
+			     ),
+
+	    TP_fast_assign(
+		    struct netfs_write_request *wreq =
+		    container_of(stream, struct netfs_write_request, streams[stream->index]);
+		    __entry->wreq	= wreq->debug_id;
+		    __entry->stream	= stream->index;
+		    __entry->error	= stream->error;
+		    __entry->dest	= stream->dest;
+		    __entry->what	= what;
+			   ),
+
+	    TP_printk("W=%08x[%u] %s %s e=%d",
+		      __entry->wreq, __entry->stream,
+		      __print_symbolic(__entry->what, netfs_write_stream_traces),
+		      __print_symbolic(__entry->dest, netfs_write_destinations),
+		      __entry->error)
+	    );
+
 #endif /* _TRACE_NETFS_H */
 
 /* This part must be outside protection */




* [RFC PATCH 10/12] netfs: Do encryption in write preparatory phase
  2021-07-21 13:44 David Howells
                   ` (8 preceding siblings ...)
  2021-07-21 13:46 ` [RFC PATCH 09/12] netfs: Send write request to multiple destinations David Howells
@ 2021-07-21 13:46 ` David Howells
  2021-07-21 13:47 ` [RFC PATCH 11/12] netfs: Put a list of regions in /proc/fs/netfs/regions David Howells
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

When dealing with an encrypted or compressed file, we gather together
sufficient pages from the pagecache to constitute a logical
crypto/compression block, allocate a bounce buffer and then ask the
filesystem to encrypt/compress between the buffers.  The bounce buffer is
then passed to the filesystem to upload.

The network filesystem must set a flag to indicate what service is desired
and what the logical block size will be.
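
For instance, a network filesystem would opt in from its inode-context setup,
along these lines (the 4096-byte block size and the "myfs_" op are invented
for illustration; the flag, the crypto_bsize field and the encrypt_block
method are the ones added by this patch):

	/* In the netfs inode context initialisation: */
	ctx->crypto_bsize = ilog2(4096);	/* log2 of the crypto block size */
	__set_bit(NETFS_ICTX_ENCRYPTED, &ctx->flags);

	/* ...and in its netfs_request_ops: */
	.encrypt_block	= myfs_encrypt_block,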

The netfs library iterates through each block to be processed, providing a
pair of scatterlists to describe the source and destination buffers.

Note that it should be possible in future to encrypt/compress DIO writes
also by this same mechanism.

A mock-up block-encryption function for afs is included for illustration.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/file.c         |    1 
 fs/afs/inode.c        |    6 ++
 fs/afs/internal.h     |    5 ++
 fs/afs/super.c        |    7 ++
 fs/afs/write.c        |   49 +++++++++++++++
 fs/netfs/Makefile     |    3 +
 fs/netfs/internal.h   |    5 ++
 fs/netfs/write_back.c |    6 ++
 fs/netfs/write_prep.c |  160 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    6 ++
 10 files changed, 246 insertions(+), 2 deletions(-)
 create mode 100644 fs/netfs/write_prep.c

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 22030d5191cd..8a6be8d2b426 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -404,6 +404,7 @@ const struct netfs_request_ops afs_req_ops = {
 	.update_i_size		= afs_update_i_size,
 	.init_wreq		= afs_init_wreq,
 	.add_write_streams	= afs_add_write_streams,
+	.encrypt_block		= afs_encrypt_block,
 };
 
 int afs_write_inode(struct inode *inode, struct writeback_control *wbc)
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index a6ae031461c7..7cad099c3bb1 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -452,10 +452,16 @@ static void afs_get_inode_cache(struct afs_vnode *vnode)
 static void afs_set_netfs_context(struct afs_vnode *vnode)
 {
 	struct netfs_i_context *ctx = netfs_i_context(&vnode->vfs_inode);
+	struct afs_super_info *as = AFS_FS_S(vnode->vfs_inode.i_sb);
 
 	netfs_i_context_init(&vnode->vfs_inode, &afs_req_ops);
 	ctx->n_wstreams = 1;
 	ctx->bsize = PAGE_SIZE;
+	if (as->fscrypt) {
+		kdebug("ENCRYPT!");
+		ctx->crypto_bsize = ilog2(4096);
+		__set_bit(NETFS_ICTX_ENCRYPTED, &ctx->flags);
+	}
 }
 
 /*
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 32a36b96cc9b..b5f7c3659a0a 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -51,6 +51,7 @@ struct afs_fs_context {
 	bool			autocell;	/* T if set auto mount operation */
 	bool			dyn_root;	/* T if dynamic root */
 	bool			no_cell;	/* T if the source is "none" (for dynroot) */
+	bool			fscrypt;	/* T if content encryption is engaged */
 	enum afs_flock_mode	flock_mode;	/* Partial file-locking emulation mode */
 	afs_voltype_t		type;		/* type of volume requested */
 	unsigned int		volnamesz;	/* size of volume name */
@@ -230,6 +231,7 @@ struct afs_super_info {
 	struct afs_volume	*volume;	/* volume record */
 	enum afs_flock_mode	flock_mode:8;	/* File locking emulation mode */
 	bool			dyn_root;	/* True if dynamic root */
+	bool			fscrypt;	/* T if content encryption is engaged */
 };
 
 static inline struct afs_super_info *AFS_FS_S(struct super_block *sb)
@@ -1518,6 +1520,9 @@ extern void afs_prune_wb_keys(struct afs_vnode *);
 extern int afs_launder_page(struct page *);
 extern ssize_t afs_file_direct_write(struct kiocb *, struct iov_iter *);
 extern void afs_add_write_streams(struct netfs_write_request *);
+extern bool afs_encrypt_block(struct netfs_write_request *, loff_t, size_t,
+			      struct scatterlist *, unsigned int,
+			      struct scatterlist *, unsigned int);
 
 /*
  * xattr.c
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 29c1178beb72..53f35ec7b17b 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -71,6 +71,7 @@ enum afs_param {
 	Opt_autocell,
 	Opt_dyn,
 	Opt_flock,
+	Opt_fscrypt,
 	Opt_source,
 };
 
@@ -86,6 +87,7 @@ static const struct fs_parameter_spec afs_fs_parameters[] = {
 	fsparam_flag  ("autocell",	Opt_autocell),
 	fsparam_flag  ("dyn",		Opt_dyn),
 	fsparam_enum  ("flock",		Opt_flock, afs_param_flock),
+	fsparam_flag  ("fscrypt",	Opt_fscrypt),
 	fsparam_string("source",	Opt_source),
 	{}
 };
@@ -342,6 +344,10 @@ static int afs_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->flock_mode = result.uint_32;
 		break;
 
+	case Opt_fscrypt:
+		ctx->fscrypt = true;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -516,6 +522,7 @@ static struct afs_super_info *afs_alloc_sbi(struct fs_context *fc)
 			as->cell = afs_use_cell(ctx->cell, afs_cell_trace_use_sbi);
 			as->volume = afs_get_volume(ctx->volume,
 						    afs_volume_trace_get_alloc_sbi);
+			as->fscrypt = ctx->fscrypt;
 		}
 	}
 	return as;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 0668389f3466..d2b7cb1a4668 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -13,6 +13,7 @@
 #include <linux/pagevec.h>
 #include <linux/netfs.h>
 #include <linux/fscache.h>
+#include <crypto/skcipher.h>
 #include <trace/events/netfs.h>
 #include "internal.h"
 
@@ -293,6 +294,54 @@ void afs_add_write_streams(struct netfs_write_request *wreq)
 				  afs_upload_to_server_worker);
 }
 
+/*
+ * Encrypt part of a write for fscrypt.
+ */
+bool afs_encrypt_block(struct netfs_write_request *wreq, loff_t pos, size_t len,
+		       struct scatterlist *source_sg, unsigned int n_source,
+		       struct scatterlist *dest_sg, unsigned int n_dest)
+{
+	struct crypto_sync_skcipher *ci;
+	struct crypto_skcipher *tfm;
+	struct skcipher_request *req;
+	u8 session_key[8], iv[8];
+	int ret;
+
+	kenter("%llx", pos);
+
+	ci = crypto_alloc_sync_skcipher("pcbc(fcrypt)", 0, 0);
+	if (IS_ERR(ci)) {
+		_debug("no cipher");
+		ret = PTR_ERR(ci);
+		goto error;
+	}
+	tfm = &ci->base;
+
+	ret = crypto_sync_skcipher_setkey(ci, session_key, sizeof(session_key));
+	if (ret < 0)
+		goto error_ci;
+
+	ret = -ENOMEM;
+	req = skcipher_request_alloc(tfm, GFP_NOFS);
+	if (!req)
+		goto error_ci;
+
+	memset(iv, 0, sizeof(iv));
+	skcipher_request_set_sync_tfm(req, ci);
+	skcipher_request_set_callback(req, 0, NULL, NULL);
+	skcipher_request_set_crypt(req, source_sg, dest_sg, len, iv);
+	ret = crypto_skcipher_encrypt(req);
+
+	skcipher_request_free(req);
+error_ci:
+	crypto_free_sync_skcipher(ci);
+error:
+	if (ret < 0)
+		wreq->error = ret;
+	kleave(" = %d", ret);
+	return ret == 0;
+}
+
 /*
  * Extend the region to be written back to include subsequent contiguously
  * dirty pages if possible, but don't sleep while doing so.
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index a201fd7b22cf..a7c3a9173ac0 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -4,7 +4,8 @@ netfs-y := \
 	objects.o \
 	read_helper.o \
 	write_back.o \
-	write_helper.o
+	write_helper.o \
+	write_prep.o
 # dio_helper.o
 
 netfs-$(CONFIG_NETFS_STATS) += stats.o
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 6fdf9e5663f7..381ca64062eb 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -65,6 +65,11 @@ void netfs_flush_region(struct netfs_i_context *ctx,
 			struct netfs_dirty_region *region,
 			enum netfs_dirty_trace why);
 
+/*
+ * write_prep.c
+ */
+bool netfs_prepare_wreq(struct netfs_write_request *wreq);
+
 /*
  * stats.c
  */
diff --git a/fs/netfs/write_back.c b/fs/netfs/write_back.c
index 15cc0e1b9acf..7363c3324602 100644
--- a/fs/netfs/write_back.c
+++ b/fs/netfs/write_back.c
@@ -254,7 +254,9 @@ static void netfs_writeback(struct netfs_write_request *wreq)
 
 	kenter("");
 
-	/* TODO: Encrypt or compress the region as appropriate */
+	if (test_bit(NETFS_ICTX_ENCRYPTED, &ctx->flags) &&
+	    !netfs_prepare_wreq(wreq))
+		goto out;
 
 	/* ->outstanding > 0 carries a ref */
 	netfs_get_write_request(wreq, netfs_wreq_trace_get_for_outstanding);
@@ -262,6 +264,8 @@ static void netfs_writeback(struct netfs_write_request *wreq)
 	if (test_bit(NETFS_WREQ_WRITE_TO_CACHE, &wreq->flags))
 		netfs_set_up_write_to_cache(wreq);
 	ctx->ops->add_write_streams(wreq);
+
+out:
 	if (atomic_dec_and_test(&wreq->outstanding))
 		netfs_write_completed(wreq, false);
 }
diff --git a/fs/netfs/write_prep.c b/fs/netfs/write_prep.c
new file mode 100644
index 000000000000..f0a9dfd92a18
--- /dev/null
+++ b/fs/netfs/write_prep.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Network filesystem high-level write support.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+/*
+ * Allocate a bunch of pages and add them into the xarray buffer starting at
+ * the given index.
+ */
+static int netfs_alloc_buffer(struct xarray *xa, pgoff_t index, unsigned int nr_pages)
+{
+	struct page *page;
+	unsigned int n;
+	int ret = 0;
+	LIST_HEAD(list);
+
+	kenter("");
+
+	n = alloc_pages_bulk_list(GFP_NOIO, nr_pages, &list);
+	if (n < nr_pages)
+		ret = -ENOMEM;	/* skip insertion; free any partial allocation */
+
+	while (ret == 0 &&
+	       (page = list_first_entry_or_null(&list, struct page, lru))) {
+		list_del(&page->lru);
+		ret = xa_insert(xa, index++, page, GFP_NOIO);
+		if (ret < 0)
+			break;
+	}
+
+	while ((page = list_first_entry_or_null(&list, struct page, lru))) {
+		list_del(&page->lru);
+		__free_page(page);
+	}
+	return ret;
+}
+
+/*
+ * Populate a scatterlist from pages in an xarray.
+ */
+static int netfs_xarray_to_sglist(struct xarray *xa, loff_t pos, size_t len,
+				  struct scatterlist *sg, unsigned int n_sg)
+{
+	struct scatterlist *p = sg;
+	struct page *head = NULL;
+	size_t seg, offset, skip = 0;
+	loff_t start = pos;
+	pgoff_t index = start >> PAGE_SHIFT;
+	int j;
+
+	XA_STATE(xas, xa, index);
+
+	sg_init_table(sg, n_sg);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, head, ULONG_MAX) {
+		kdebug("LOAD %lx %px", head->index, head);
+		if (xas_retry(&xas, head))
+			continue;
+		if (WARN_ON(xa_is_value(head)) || WARN_ON(PageHuge(head)))
+			break;
+		for (j = (head->index < index) ? index - head->index : 0;
+		     j < thp_nr_pages(head); j++
+		     ) {
+			offset = (pos + skip) & ~PAGE_MASK;
+			seg = min(len, PAGE_SIZE - offset);
+
+			kdebug("[%zx] %lx %zx @%zx", p - sg, (head + j)->index, seg, offset);
+			sg_set_page(p++, head + j, seg, offset);
+
+			len -= seg;
+			skip += seg;
+			if (len == 0)
+				break;
+		}
+		if (len == 0)
+			break;
+	}
+
+	rcu_read_unlock();
+	if (len > 0) {
+		WARN_ON(len > 0);
+		return -EIO;
+	}
+
+	sg_mark_end(p - 1);
+	kleave(" = %zd", p - sg);
+	return p - sg;
+}
+
+/*
+ * Perform content encryption on the data to be written before we write it to
+ * the server and the cache.
+ */
+static bool netfs_prepare_encrypt(struct netfs_write_request *wreq)
+{
+	struct netfs_i_context *ctx = netfs_i_context(wreq->inode);
+	struct scatterlist source_sg[16], dest_sg[16];
+	unsigned int bsize = 1 << ctx->crypto_bsize, n_source, n_dest;
+	loff_t pos;
+	size_t n;
+	int ret;
+
+	ret = netfs_alloc_buffer(&wreq->buffer, wreq->first, wreq->last - wreq->first + 1);
+	if (ret < 0)
+		goto error;
+
+	pos = round_down(wreq->start, bsize);
+	n = round_up(wreq->start + wreq->len, bsize) - pos;
+	for (; n > 0; n -= bsize, pos += bsize) {
+		ret = netfs_xarray_to_sglist(&wreq->mapping->i_pages, pos, bsize,
+					     source_sg, ARRAY_SIZE(source_sg));
+		if (ret < 0)
+			goto error;
+		n_source = ret;
+
+		ret = netfs_xarray_to_sglist(&wreq->buffer, pos, bsize,
+					     dest_sg, ARRAY_SIZE(dest_sg));
+		if (ret < 0)
+			goto error;
+		n_dest = ret;
+
+		ret = ctx->ops->encrypt_block(wreq, pos, bsize,
+					      source_sg, n_source, dest_sg, n_dest);
+		if (ret < 0)
+			goto error;
+	}
+
+	iov_iter_xarray(&wreq->source, WRITE, &wreq->buffer, wreq->start, wreq->len);
+	kleave(" = t");
+	return true;
+
+error:
+	wreq->error = ret;
+	kleave(" = f [%d]", ret);
+	return false;
+}
+
+/*
+ * Prepare a write request for writing.  All the pages in the bounding box have
+ * had a ref taken on them and those covering the dirty region have been marked
+ * as being written back and their dirty bits provisionally cleared.
+ */
+bool netfs_prepare_wreq(struct netfs_write_request *wreq)
+{
+	struct netfs_i_context *ctx = netfs_i_context(wreq->inode);
+
+	if (test_bit(NETFS_ICTX_ENCRYPTED, &ctx->flags))
+		return netfs_prepare_encrypt(wreq);
+	return true;
+}
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9d50c2933863..6acf3fb170c3 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/uio.h>
 
+struct scatterlist;
 enum netfs_wreq_trace;
 
 /*
@@ -177,12 +178,14 @@ struct netfs_i_context {
 #endif
 	unsigned long		flags;
 #define NETFS_ICTX_NEW_CONTENT	0		/* Set if file has new content (create/trunc-0) */
+#define NETFS_ICTX_ENCRYPTED	1		/* The file contents are encrypted */
 	spinlock_t		lock;
 	unsigned int		rsize;		/* Maximum read size */
 	unsigned int		wsize;		/* Maximum write size */
 	unsigned int		bsize;		/* Min block size for bounding box */
 	unsigned int		inval_counter;	/* Number of invalidations made */
 	unsigned char		n_wstreams;	/* Number of write streams to allocate */
+	unsigned char		crypto_bsize;	/* log2 of crypto block size */
 };
 
 /*
@@ -358,6 +361,9 @@ struct netfs_request_ops {
 	void (*init_wreq)(struct netfs_write_request *wreq);
 	void (*add_write_streams)(struct netfs_write_request *wreq);
 	void (*invalidate_cache)(struct netfs_write_request *wreq);
+	bool (*encrypt_block)(struct netfs_write_request *wreq, loff_t pos,  size_t len,
+			      struct scatterlist *source_sg, unsigned int n_source,
+			      struct scatterlist *dest_sg, unsigned int n_dest);
 };
 
 /*




* [RFC PATCH 11/12] netfs: Put a list of regions in /proc/fs/netfs/regions
  2021-07-21 13:44 David Howells
                   ` (9 preceding siblings ...)
  2021-07-21 13:46 ` [RFC PATCH 10/12] netfs: Do encryption in write preparatory phase David Howells
@ 2021-07-21 13:47 ` David Howells
  2021-07-21 13:47 ` [RFC PATCH 12/12] netfs: Export some read-request ref functions David Howells
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:47 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel


---

 fs/netfs/Makefile       |    1 
 fs/netfs/internal.h     |   24 +++++++++++
 fs/netfs/main.c         |  104 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/netfs/objects.c      |    6 ++-
 fs/netfs/write_helper.c |    4 ++
 include/linux/netfs.h   |    1 
 6 files changed, 139 insertions(+), 1 deletion(-)
 create mode 100644 fs/netfs/main.c

diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index a7c3a9173ac0..62dad3d7bea0 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
 netfs-y := \
+	main.o \
 	objects.o \
 	read_helper.o \
 	write_back.o \
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 381ca64062eb..a9ec6591f90a 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -22,6 +22,30 @@
 ssize_t netfs_file_direct_write(struct netfs_dirty_region *region,
 				struct kiocb *iocb, struct iov_iter *from);
 
+/*
+ * main.c
+ */
+extern struct list_head netfs_regions;
+extern spinlock_t netfs_regions_lock;
+
+#ifdef CONFIG_PROC_FS
+static inline void netfs_proc_add_region(struct netfs_dirty_region *region)
+{
+	spin_lock(&netfs_regions_lock);
+	list_add_tail_rcu(&region->proc_link, &netfs_regions);
+	spin_unlock(&netfs_regions_lock);
+}
+static inline void netfs_proc_del_region(struct netfs_dirty_region *region)
+{
+	spin_lock(&netfs_regions_lock);
+	list_del_rcu(&region->proc_link);
+	spin_unlock(&netfs_regions_lock);
+}
+#else
+static inline void netfs_proc_add_region(struct netfs_dirty_region *region) {}
+static inline void netfs_proc_del_region(struct netfs_dirty_region *region) {}
+#endif
+
 /*
  * objects.c
  */
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
new file mode 100644
index 000000000000..125b570efefd
--- /dev/null
+++ b/fs/netfs/main.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Network filesystem library.
+ *
+ * Copyright (C) 2021 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/module.h>
+#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/proc_fs.h>
+#include "internal.h"
+
+#ifdef CONFIG_PROC_FS
+LIST_HEAD(netfs_regions);
+DEFINE_SPINLOCK(netfs_regions_lock);
+
+static const char netfs_proc_region_states[] = "PRADFC";
+static const char *netfs_proc_region_types[] = {
+	[NETFS_REGION_ORDINARY]		= "ORD ",
+	[NETFS_REGION_DIO]		= "DIOW",
+	[NETFS_REGION_DSYNC]		= "DSYN",
+};
+
+/*
+ * Generate a list of regions in /proc/fs/netfs/regions
+ */
+static int netfs_regions_seq_show(struct seq_file *m, void *v)
+{
+	struct netfs_dirty_region *region;
+
+	if (v == &netfs_regions) {
+		seq_puts(m,
+			 "REGION   REF TYPE S FL DEV   INODE    DIRTY, BOUNDS, RESV\n"
+			 "======== === ==== = == ===== ======== ==============================\n"
+			 );
+		return 0;
+	}
+
+	region = list_entry(v, struct netfs_dirty_region, proc_link);
+	seq_printf(m,
+		   "%08x %3d %s %c %2lx %02x:%02x %8x %04llx-%04llx %04llx-%04llx %04llx-%04llx\n",
+		   region->debug_id,
+		   refcount_read(&region->ref),
+		   netfs_proc_region_types[region->type],
+		   netfs_proc_region_states[region->state],
+		   region->flags,
+		   0, 0, 0,
+		   region->dirty.start, region->dirty.end,
+		   region->bounds.start, region->bounds.end,
+		   region->reserved.start, region->reserved.end);
+	return 0;
+}
+
+static void *netfs_regions_seq_start(struct seq_file *m, loff_t *_pos)
+	__acquires(rcu)
+{
+	rcu_read_lock();
+	return seq_list_start_head(&netfs_regions, *_pos);
+}
+
+static void *netfs_regions_seq_next(struct seq_file *m, void *v, loff_t *_pos)
+{
+	return seq_list_next(v, &netfs_regions, _pos);
+}
+
+static void netfs_regions_seq_stop(struct seq_file *m, void *v)
+	__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+const struct seq_operations netfs_regions_seq_ops = {
+	.start  = netfs_regions_seq_start,
+	.next   = netfs_regions_seq_next,
+	.stop   = netfs_regions_seq_stop,
+	.show   = netfs_regions_seq_show,
+};
+#endif /* CONFIG_PROC_FS */
+
+static int __init netfs_init(void)
+{
+	if (!proc_mkdir("fs/netfs", NULL))
+		goto error;
+
+	if (!proc_create_seq("fs/netfs/regions", S_IFREG | 0444, NULL,
+			     &netfs_regions_seq_ops))
+		goto error_proc;
+
+	return 0;
+
+error_proc:
+	remove_proc_entry("fs/netfs", NULL);
+error:
+	return -ENOMEM;
+}
+fs_initcall(netfs_init);
+
+static void __exit netfs_exit(void)
+{
+	remove_proc_entry("fs/netfs", NULL);
+}
+module_exit(netfs_exit);
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 8926b4230d91..1149f12ca8c9 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -60,8 +60,10 @@ struct netfs_dirty_region *netfs_alloc_dirty_region(void)
 	struct netfs_dirty_region *region;
 
 	region = kzalloc(sizeof(struct netfs_dirty_region), GFP_KERNEL);
-	if (region)
+	if (region) {
+		INIT_LIST_HEAD(&region->proc_link);
 		netfs_stat(&netfs_n_wh_region);
+	}
 	return region;
 }
 
@@ -81,6 +83,8 @@ void netfs_free_dirty_region(struct netfs_i_context *ctx,
 {
 	if (region) {
 		trace_netfs_ref_region(region->debug_id, 0, netfs_region_trace_free);
+		if (!list_empty(&region->proc_link))
+			netfs_proc_del_region(region);
 		if (ctx->ops->free_dirty_region)
 			ctx->ops->free_dirty_region(region);
 		netfs_put_flush_group(region->group);
diff --git a/fs/netfs/write_helper.c b/fs/netfs/write_helper.c
index fa048e3882ea..b1fe2d4c0df6 100644
--- a/fs/netfs/write_helper.c
+++ b/fs/netfs/write_helper.c
@@ -86,10 +86,13 @@ static void netfs_init_dirty_region(struct netfs_dirty_region *region,
 		group = list_last_entry(&ctx->flush_groups,
 					struct netfs_flush_group, group_link);
 		region->group = netfs_get_flush_group(group);
+		spin_lock(&ctx->lock);
 		list_add_tail(&region->flush_link, &group->region_list);
+		spin_unlock(&ctx->lock);
 	}
 	trace_netfs_ref_region(region->debug_id, 1, netfs_region_trace_new);
 	trace_netfs_dirty(ctx, region, NULL, netfs_dirty_trace_new);
+	netfs_proc_add_region(region);
 }
 
 /*
@@ -198,6 +201,7 @@ static struct netfs_dirty_region *netfs_split_dirty_region(
 	list_add(&tail->dirty_link, &region->dirty_link);
 	list_add(&tail->flush_link, &region->flush_link);
 	trace_netfs_dirty(ctx, tail, region, netfs_dirty_trace_split);
+	netfs_proc_add_region(tail);
 	return tail;
 }
 
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 6acf3fb170c3..43d195badb0d 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -228,6 +228,7 @@ enum netfs_region_type {
  */
 struct netfs_dirty_region {
 	struct netfs_flush_group *group;
+	struct list_head	proc_link;	/* Link in /proc/fs/netfs/regions */
 	struct list_head	active_link;	/* Link in i_context->pending/active_writes */
 	struct list_head	dirty_link;	/* Link in i_context->dirty_regions */
 	struct list_head	flush_link;	/* Link in group->region_list or
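
For illustration, given the seq_printf() format in netfs_regions_seq_show()
above, a line of /proc/fs/netfs/regions output would look something like this
(the values are made up, and the DEV and INODE columns are currently hardwired
to zero):

	REGION   REF TYPE S FL DEV   INODE    DIRTY, BOUNDS, RESV
	======== === ==== = == ===== ======== ==============================
	00000004   2 ORD  D  0 00:00        0 0000-0fff 0000-3fff 0000-0000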




* [RFC PATCH 12/12] netfs: Export some read-request ref functions
  2021-07-21 13:44 David Howells
                   ` (10 preceding siblings ...)
  2021-07-21 13:47 ` [RFC PATCH 11/12] netfs: Put a list of regions in /proc/fs/netfs/regions David Howells
@ 2021-07-21 13:47 ` David Howells
  2021-07-21 14:00 ` [RFC PATCH 00/12] netfs: Experimental write helpers, fscrypt and compression David Howells
  2021-07-21 18:42 ` [RFC PATCH 13/12] netfs: Do copy-to-cache-on-read through VM writeback David Howells
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 13:47 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Export some functions for getting/putting read-request structures for use
in later patches.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/netfs/internal.h    |   10 ++++++++++
 fs/netfs/read_helper.c |   15 +++------------
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index a9ec6591f90a..6ae1eb55093a 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -78,9 +78,19 @@ static inline void netfs_see_write_request(struct netfs_write_request *wreq,
  */
 extern unsigned int netfs_debug;
 
+void __netfs_put_subrequest(struct netfs_read_subrequest *subreq, bool was_async);
+void netfs_put_read_request(struct netfs_read_request *rreq, bool was_async);
+void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async);
 int netfs_prefetch_for_write(struct file *file, struct page *page, loff_t pos, size_t len,
 			     bool always_fill);
 
+static inline void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
+					bool was_async)
+{
+	if (refcount_dec_and_test(&subreq->usage))
+		__netfs_put_subrequest(subreq, was_async);
+}
+
 /*
  * write_helper.c
  */
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 0b771f2f5449..e5c636acc756 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -28,14 +28,6 @@ MODULE_PARM_DESC(netfs_debug, "Netfs support debugging mask");
 
 static void netfs_rreq_work(struct work_struct *);
 static void netfs_rreq_clear_buffer(struct netfs_read_request *);
-static void __netfs_put_subrequest(struct netfs_read_subrequest *, bool);
-
-static void netfs_put_subrequest(struct netfs_read_subrequest *subreq,
-				 bool was_async)
-{
-	if (refcount_dec_and_test(&subreq->usage))
-		__netfs_put_subrequest(subreq, was_async);
-}
 
 static struct netfs_read_request *netfs_alloc_read_request(struct address_space *mapping,
 							   struct file *file)
@@ -97,7 +89,7 @@ static void netfs_free_read_request(struct work_struct *work)
 	netfs_stat_d(&netfs_n_rh_rreq);
 }
 
-static void netfs_put_read_request(struct netfs_read_request *rreq, bool was_async)
+void netfs_put_read_request(struct netfs_read_request *rreq, bool was_async)
 {
 	if (refcount_dec_and_test(&rreq->usage)) {
 		if (was_async) {
@@ -135,8 +127,7 @@ static void netfs_get_read_subrequest(struct netfs_read_subrequest *subreq)
 	refcount_inc(&subreq->usage);
 }
 
-static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq,
-				   bool was_async)
+void __netfs_put_subrequest(struct netfs_read_subrequest *subreq, bool was_async)
 {
 	struct netfs_read_request *rreq = subreq->rreq;
 
@@ -214,7 +205,7 @@ static void netfs_read_from_server(struct netfs_read_request *rreq,
 /*
  * Release those waiting.
  */
-static void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async)
+void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async)
 {
 	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
 	netfs_rreq_clear_subreqs(rreq, was_async);




* Re: [RFC PATCH 00/12] netfs: Experimental write helpers, fscrypt and compression
  2021-07-21 13:44 David Howells
                   ` (11 preceding siblings ...)
  2021-07-21 13:47 ` [RFC PATCH 12/12] netfs: Export some read-request ref functions David Howells
@ 2021-07-21 14:00 ` David Howells
  2021-07-21 18:42 ` [RFC PATCH 13/12] netfs: Do copy-to-cache-on-read through VM writeback David Howells
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 14:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Apologies...  The subject line in the cover letter got word-wrapped by my
editor and lost.

David



* Re: [RFC PATCH 01/12] afs: Sort out symlink reading
  2021-07-21 13:44 ` [RFC PATCH 01/12] afs: Sort out symlink reading David Howells
@ 2021-07-21 16:20   ` Jeff Layton
  2021-07-26  9:44   ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2021-07-21 16:20 UTC (permalink / raw)
  To: David Howells, linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

On Wed, 2021-07-21 at 14:44 +0100, David Howells wrote:
> afs_readpage() doesn't get a file pointer when called for a symlink, so
> separate it from regular file pointer handling.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/afs/file.c     |   14 +++++++++-----
>  fs/afs/inode.c    |    6 +++---
>  fs/afs/internal.h |    3 ++-
>  3 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index ca0d993add65..c9c21ad0e7c9 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -19,6 +19,7 @@
>  
>  static int afs_file_mmap(struct file *file, struct vm_area_struct *vma);
>  static int afs_readpage(struct file *file, struct page *page);
> +static int afs_symlink_readpage(struct file *file, struct page *page);
>  static void afs_invalidatepage(struct page *page, unsigned int offset,
>  			       unsigned int length);
>  static int afs_releasepage(struct page *page, gfp_t gfp_flags);
> @@ -46,7 +47,7 @@ const struct inode_operations afs_file_inode_operations = {
>  	.permission	= afs_permission,
>  };
>  
> -const struct address_space_operations afs_fs_aops = {
> +const struct address_space_operations afs_file_aops = {
>  	.readpage	= afs_readpage,
>  	.readahead	= afs_readahead,
>  	.set_page_dirty	= afs_set_page_dirty,
> @@ -60,6 +61,12 @@ const struct address_space_operations afs_fs_aops = {
>  	.writepages	= afs_writepages,
>  };
>  
> +const struct address_space_operations afs_symlink_aops = {
> +	.readpage	= afs_symlink_readpage,
> +	.releasepage	= afs_releasepage,
> +	.invalidatepage	= afs_invalidatepage,
> +};
> +
>  static const struct vm_operations_struct afs_vm_ops = {
>  	.fault		= filemap_fault,
>  	.map_pages	= filemap_map_pages,
> @@ -321,7 +328,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>  	afs_fetch_data(fsreq->vnode, fsreq);
>  }
>  
> -static int afs_symlink_readpage(struct page *page)
> +static int afs_symlink_readpage(struct file *file, struct page *page)
>  {
>  	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
>  	struct afs_read *fsreq;


I wonder... would you be better served here by not using page_readlink
for symlinks, and instead using simple_get_link and rolling your own
readlink operation?  It seems a bit more direct, and AFS seems to be the
only caller of page_readlink.

> @@ -386,9 +393,6 @@ const struct netfs_read_request_ops afs_req_ops = {
>  
>  static int afs_readpage(struct file *file, struct page *page)
>  {
> -	if (!file)
> -		return afs_symlink_readpage(page);
> -
>  	return netfs_readpage(file, page, &afs_req_ops, NULL);
>  }
>  
> diff --git a/fs/afs/inode.c b/fs/afs/inode.c
> index bef6f5ccfb09..cf7b66957c6f 100644
> --- a/fs/afs/inode.c
> +++ b/fs/afs/inode.c
> @@ -105,7 +105,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
>  		inode->i_mode	= S_IFREG | (status->mode & S_IALLUGO);
>  		inode->i_op	= &afs_file_inode_operations;
>  		inode->i_fop	= &afs_file_operations;
> -		inode->i_mapping->a_ops	= &afs_fs_aops;
> +		inode->i_mapping->a_ops	= &afs_file_aops;
>  		break;
>  	case AFS_FTYPE_DIR:
>  		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);
> @@ -123,11 +123,11 @@ static int afs_inode_init_from_status(struct afs_operation *op,
>  			inode->i_mode	= S_IFDIR | 0555;
>  			inode->i_op	= &afs_mntpt_inode_operations;
>  			inode->i_fop	= &afs_mntpt_file_operations;
> -			inode->i_mapping->a_ops	= &afs_fs_aops;
> +			inode->i_mapping->a_ops	= &afs_symlink_aops;
>  		} else {
>  			inode->i_mode	= S_IFLNK | status->mode;
>  			inode->i_op	= &afs_symlink_inode_operations;
> -			inode->i_mapping->a_ops	= &afs_fs_aops;
> +			inode->i_mapping->a_ops	= &afs_symlink_aops;
>  		}
>  		inode_nohighmem(inode);
>  		break;
> diff --git a/fs/afs/internal.h b/fs/afs/internal.h
> index 791cf02e5696..ccdde00ada8a 100644
> --- a/fs/afs/internal.h
> +++ b/fs/afs/internal.h
> @@ -1050,7 +1050,8 @@ extern void afs_dynroot_depopulate(struct super_block *);
>  /*
>   * file.c
>   */
> -extern const struct address_space_operations afs_fs_aops;
> +extern const struct address_space_operations afs_file_aops;
> +extern const struct address_space_operations afs_symlink_aops;
>  extern const struct inode_operations afs_file_inode_operations;
>  extern const struct file_operations afs_file_operations;
>  extern const struct netfs_read_request_ops afs_req_ops;
> 
> 

Regardless, this is more reasonable than what's there now.

Reviewed-by: Jeff Layton <jlayton@redhat.com>



* Re: [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use
  2021-07-21 13:44 ` [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use David Howells
@ 2021-07-21 17:16   ` Jeff Layton
  2021-07-21 17:20   ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2021-07-21 17:16 UTC (permalink / raw)
  To: David Howells, linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

On Wed, 2021-07-21 at 14:44 +0100, David Howells wrote:
> Add an iov_iter to the read subrequest and set it up to define the
> destination buffer to write into.  This will allow future patches to point
> to a bounce buffer instead for purposes of handling oversize writes,
> decryption (where we want to save the encrypted data to the cache) and
> decompression.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/afs/file.c          |    6 +-----
>  fs/netfs/read_helper.c |    5 ++++-
>  include/linux/netfs.h  |    2 ++
>  3 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index c9c21ad0e7c9..ca529f23515a 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -319,11 +319,7 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>  	fsreq->len	= subreq->len   - subreq->transferred;
>  	fsreq->key	= subreq->rreq->netfs_priv;
>  	fsreq->vnode	= vnode;
> -	fsreq->iter	= &fsreq->def_iter;
> -
> -	iov_iter_xarray(&fsreq->def_iter, READ,
> -			&fsreq->vnode->vfs_inode.i_mapping->i_pages,
> -			fsreq->pos, fsreq->len);
> +	fsreq->iter	= &subreq->iter;
>  
>  	afs_fetch_data(fsreq->vnode, fsreq);
>  }
> diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
> index 0b6cd3b8734c..715f3e9c380d 100644
> --- a/fs/netfs/read_helper.c
> +++ b/fs/netfs/read_helper.c
> @@ -150,7 +150,7 @@ static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
>  {
>  	struct iov_iter iter;
>  
> -	iov_iter_xarray(&iter, WRITE, &subreq->rreq->mapping->i_pages,
> +	iov_iter_xarray(&iter, READ, &subreq->rreq->mapping->i_pages,

What's up with the WRITE -> READ change here? Was that a preexisting
bug?

>  			subreq->start + subreq->transferred,
>  			subreq->len   - subreq->transferred);
>  	iov_iter_zero(iov_iter_count(&iter), &iter);
> @@ -745,6 +745,9 @@ netfs_rreq_prepare_read(struct netfs_read_request *rreq,
>  	if (WARN_ON(subreq->len == 0))
>  		source = NETFS_INVALID_READ;
>  
> +	iov_iter_xarray(&subreq->iter, READ, &rreq->mapping->i_pages,
> +			subreq->start, subreq->len);
> +
>  out:
>  	subreq->source = source;
>  	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index fe9887768292..5e4fafcc9480 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/fs.h>
>  #include <linux/pagemap.h>
> +#include <linux/uio.h>
>  
>  /*
>   * Overload PG_private_2 to give us PG_fscache - this is used to indicate that
> @@ -112,6 +113,7 @@ struct netfs_cache_resources {
>  struct netfs_read_subrequest {
>  	struct netfs_read_request *rreq;	/* Supervising read request */
>  	struct list_head	rreq_link;	/* Link in rreq->subrequests */
> +	struct iov_iter		iter;		/* Iterator for this subrequest */
>  	loff_t			start;		/* Where to start the I/O */
>  	size_t			len;		/* Size of the I/O */
>  	size_t			transferred;	/* Amount of data transferred */
> 
> 

-- 
Jeff Layton <jlayton@redhat.com>



* Re: [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use
  2021-07-21 13:44 ` [RFC PATCH 02/12] netfs: Add an iov_iter to the read subreq for the network fs/cache to use David Howells
  2021-07-21 17:16   ` Jeff Layton
@ 2021-07-21 17:20   ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 17:20 UTC (permalink / raw)
  To: Jeff Layton
  Cc: dhowells, linux-fsdevel, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Jeff Layton <jlayton@redhat.com> wrote:

> > -	iov_iter_xarray(&iter, WRITE, &subreq->rreq->mapping->i_pages,
> > +	iov_iter_xarray(&iter, READ, &subreq->rreq->mapping->i_pages,
> 
> What's up with the WRITE -> READ change here? Was that a preexisting
> bug?

Actually, yes - I need to split that out and send it to Linus.

David



* Re: [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred
  2021-07-21 13:45 ` [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred David Howells
@ 2021-07-21 17:43   ` Jeff Layton
  2021-07-21 18:54   ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2021-07-21 17:43 UTC (permalink / raw)
  To: David Howells, linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

On Wed, 2021-07-21 at 14:45 +0100, David Howells wrote:
> Remove netfs_read_subrequest::transferred as it's redundant as the count on
> the iterator added to the subrequest can be used instead.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/afs/file.c                |    4 ++--
>  fs/netfs/read_helper.c       |   26 ++++----------------------
>  include/linux/netfs.h        |    1 -
>  include/trace/events/netfs.h |   12 ++++++------
>  4 files changed, 12 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index ca529f23515a..82e945dbe379 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -315,8 +315,8 @@ static void afs_req_issue_op(struct netfs_read_subrequest *subreq)
>  		return netfs_subreq_terminated(subreq, -ENOMEM, false);
>  
>  	fsreq->subreq	= subreq;
> -	fsreq->pos	= subreq->start + subreq->transferred;
> -	fsreq->len	= subreq->len   - subreq->transferred;
> +	fsreq->pos	= subreq->start + subreq->len - iov_iter_count(&subreq->iter);
> +	fsreq->len	= iov_iter_count(&subreq->iter);
>  	fsreq->key	= subreq->rreq->netfs_priv;
>  	fsreq->vnode	= vnode;
>  	fsreq->iter	= &subreq->iter;
> diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
> index 715f3e9c380d..5e1a9be48130 100644
> --- a/fs/netfs/read_helper.c
> +++ b/fs/netfs/read_helper.c
> @@ -148,12 +148,7 @@ static void __netfs_put_subrequest(struct netfs_read_subrequest *subreq,
>   */
>  static void netfs_clear_unread(struct netfs_read_subrequest *subreq)
>  {
> -	struct iov_iter iter;
> -
> -	iov_iter_xarray(&iter, READ, &subreq->rreq->mapping->i_pages,
> -			subreq->start + subreq->transferred,
> -			subreq->len   - subreq->transferred);
> -	iov_iter_zero(iov_iter_count(&iter), &iter);
> +	iov_iter_zero(iov_iter_count(&subreq->iter), &subreq->iter);
>  }
>  
>  static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error,
> @@ -173,14 +168,9 @@ static void netfs_read_from_cache(struct netfs_read_request *rreq,
>  				  bool seek_data)
>  {
>  	struct netfs_cache_resources *cres = &rreq->cache_resources;
> -	struct iov_iter iter;
>  
>  	netfs_stat(&netfs_n_rh_read);
> -	iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages,
> -			subreq->start + subreq->transferred,
> -			subreq->len   - subreq->transferred);
> -
> -	cres->ops->read(cres, subreq->start, &iter, seek_data,
> +	cres->ops->read(cres, subreq->start, &subreq->iter, seek_data,
>  			netfs_cache_read_terminated, subreq);
>  }
>  

The above two deltas seem like they should have been in patch #2.

> @@ -419,7 +409,7 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
>  			if (pgend < iopos + subreq->len)
>  				break;
>  
> -			account += subreq->transferred;
> +			account += subreq->len - iov_iter_count(&subreq->iter);
>  			iopos += subreq->len;
>  			if (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
>  				subreq = list_next_entry(subreq, rreq_link);
> @@ -635,15 +625,8 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
>  		goto failed;
>  	}
>  
> -	if (WARN(transferred_or_error > subreq->len - subreq->transferred,
> -		 "Subreq overread: R%x[%x] %zd > %zu - %zu",
> -		 rreq->debug_id, subreq->debug_index,
> -		 transferred_or_error, subreq->len, subreq->transferred))
> -		transferred_or_error = subreq->len - subreq->transferred;
> -
>  	subreq->error = 0;
> -	subreq->transferred += transferred_or_error;
> -	if (subreq->transferred < subreq->len)
> +	if (iov_iter_count(&subreq->iter))
>  		goto incomplete;
>  

I must be missing it, but where does subreq->iter get advanced to the
end of the current read? If you're getting rid of subreq->transferred
then I think that has to happen above, no?

>  complete:
> @@ -667,7 +650,6 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
>  incomplete:
>  	if (test_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags)) {
>  		netfs_clear_unread(subreq);
> -		subreq->transferred = subreq->len;
>  		goto complete;
>  	}
>  
> diff --git a/include/linux/netfs.h b/include/linux/netfs.h
> index 5e4fafcc9480..45d40c622205 100644
> --- a/include/linux/netfs.h
> +++ b/include/linux/netfs.h
> @@ -116,7 +116,6 @@ struct netfs_read_subrequest {
>  	struct iov_iter		iter;		/* Iterator for this subrequest */
>  	loff_t			start;		/* Where to start the I/O */
>  	size_t			len;		/* Size of the I/O */
> -	size_t			transferred;	/* Amount of data transferred */
>  	refcount_t		usage;
>  	short			error;		/* 0 or error that occurred */
>  	unsigned short		debug_index;	/* Index in list (for debugging output) */
> diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
> index 4d470bffd9f1..04ac29fc700f 100644
> --- a/include/trace/events/netfs.h
> +++ b/include/trace/events/netfs.h
> @@ -190,7 +190,7 @@ TRACE_EVENT(netfs_sreq,
>  		    __field(enum netfs_read_source,	source		)
>  		    __field(enum netfs_sreq_trace,	what		)
>  		    __field(size_t,			len		)
> -		    __field(size_t,			transferred	)
> +		    __field(size_t,			remain		)
>  		    __field(loff_t,			start		)
>  			     ),
>  
> @@ -202,7 +202,7 @@ TRACE_EVENT(netfs_sreq,
>  		    __entry->source	= sreq->source;
>  		    __entry->what	= what;
>  		    __entry->len	= sreq->len;
> -		    __entry->transferred = sreq->transferred;
> +		    __entry->remain	= iov_iter_count(&sreq->iter);
>  		    __entry->start	= sreq->start;
>  			   ),
>  
> @@ -211,7 +211,7 @@ TRACE_EVENT(netfs_sreq,
>  		      __print_symbolic(__entry->what, netfs_sreq_traces),
>  		      __print_symbolic(__entry->source, netfs_sreq_sources),
>  		      __entry->flags,
> -		      __entry->start, __entry->transferred, __entry->len,
> +		      __entry->start, __entry->len - __entry->remain, __entry->len,
>  		      __entry->error)
>  	    );
>  
> @@ -230,7 +230,7 @@ TRACE_EVENT(netfs_failure,
>  		    __field(enum netfs_read_source,	source		)
>  		    __field(enum netfs_failure,		what		)
>  		    __field(size_t,			len		)
> -		    __field(size_t,			transferred	)
> +		    __field(size_t,			remain		)
>  		    __field(loff_t,			start		)
>  			     ),
>  
> @@ -242,7 +242,7 @@ TRACE_EVENT(netfs_failure,
>  		    __entry->source	= sreq ? sreq->source : NETFS_INVALID_READ;
>  		    __entry->what	= what;
>  		    __entry->len	= sreq ? sreq->len : 0;
> -		    __entry->transferred = sreq ? sreq->transferred : 0;
> +		    __entry->remain	= sreq ? iov_iter_count(&sreq->iter) : 0;
>  		    __entry->start	= sreq ? sreq->start : 0;
>  			   ),
>  
> @@ -250,7 +250,7 @@ TRACE_EVENT(netfs_failure,
>  		      __entry->rreq, __entry->index,
>  		      __print_symbolic(__entry->source, netfs_sreq_sources),
>  		      __entry->flags,
> -		      __entry->start, __entry->transferred, __entry->len,
> +		      __entry->start, __entry->len - __entry->remain, __entry->len,
>  		      __print_symbolic(__entry->what, netfs_failures),
>  		      __entry->error)
>  	    );
> 
> 

-- 
Jeff Layton <jlayton@redhat.com>



* [RFC PATCH 13/12] netfs: Do copy-to-cache-on-read through VM writeback
  2021-07-21 13:44 David Howells
                   ` (12 preceding siblings ...)
  2021-07-21 14:00 ` [RFC PATCH 00/12] netfs: Experimental write helpers, fscrypt and compression David Howells
@ 2021-07-21 18:42 ` David Howells
  13 siblings, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-21 18:42 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: dhowells, Jeff Layton, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

When data is read from the server and intended to be copied to the cache,
offload the cache write to the VM writeback mechanism rather than
scheduling it immediately.  This allows the downloaded data to be
superseded by local changes before it is written to the cache and means
that we no longer need to use the PG_fscache flag.

This is done by the following means:

 (1) The pages just downloaded into are marked dirty in
     netfs_rreq_unlock().

 (2) A region of NETFS_REGION_CACHE_COPY type is added to the dirty region
     list.

 (3) If a region-to-be-modified overlaps the cache-copy region, the
     modifications supersede the download, moving the end marker over in
     netfs_merge_dirty_region().

 (4) We don't really want to supersede in the middle of a region, so we may
     split a pristine region so that we can supersede forwards only.

 (5) We mark regions we're going to supersede with NETFS_REGION_SUPERSEDED
     to prevent them getting merged whilst we're superseding them.  This
     flag is cleared when we're done and we may merge afterwards.

 (6) Adjacent download regions are potentially mergeable.

 (7) When being flushed, CACHE_COPY regions are intended only to be written
     to the cache, not the server, though they may contribute data to a
     cross-page chunk that has to be encrypted or compressed and sent to
     the server.

Signed-off-by: David Howells <dhowells@redhat.com>
---
 fs/netfs/internal.h          |    4 --
 fs/netfs/main.c              |    1 
 fs/netfs/read_helper.c       |  126 ++--------------------------------------------------------------
 fs/netfs/stats.c             |    7 ---
 fs/netfs/write_back.c        |    3 +
 fs/netfs/write_helper.c      |  112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/netfs.h        |    2 -
 include/trace/events/netfs.h |    3 +
 mm/filemap.c                 |    4 +-
 9 files changed, 125 insertions(+), 137 deletions(-)


diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 6ae1eb55093a..ee83b81e4682 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -98,6 +98,7 @@ void netfs_writeback_worker(struct work_struct *work);
 void netfs_flush_region(struct netfs_i_context *ctx,
 			struct netfs_dirty_region *region,
 			enum netfs_dirty_trace why);
+void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq);
 
 /*
  * write_prep.c
@@ -121,10 +122,7 @@ extern atomic_t netfs_n_rh_read_done;
 extern atomic_t netfs_n_rh_read_failed;
 extern atomic_t netfs_n_rh_zero;
 extern atomic_t netfs_n_rh_short_read;
-extern atomic_t netfs_n_rh_write;
 extern atomic_t netfs_n_rh_write_begin;
-extern atomic_t netfs_n_rh_write_done;
-extern atomic_t netfs_n_rh_write_failed;
 extern atomic_t netfs_n_rh_write_zskip;
 extern atomic_t netfs_n_wh_region;
 extern atomic_t netfs_n_wh_flush_group;
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
index 125b570efefd..ad204dcbb5f7 100644
--- a/fs/netfs/main.c
+++ b/fs/netfs/main.c
@@ -21,6 +21,7 @@ static const char *netfs_proc_region_types[] = {
 	[NETFS_REGION_ORDINARY]		= "ORD ",
 	[NETFS_REGION_DIO]		= "DIOW",
 	[NETFS_REGION_DSYNC]		= "DSYN",
+	[NETFS_REGION_CACHE_COPY]	= "CCPY",
 };
 
 /*
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index e5c636acc756..7fa677d4c9ca 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -212,124 +212,6 @@ void netfs_rreq_completed(struct netfs_read_request *rreq, bool was_async)
 	netfs_put_read_request(rreq, was_async);
 }
 
-/*
- * Deal with the completion of writing the data to the cache.  We have to clear
- * the PG_fscache bits on the pages involved and release the caller's ref.
- *
- * May be called in softirq mode and we inherit a ref from the caller.
- */
-static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq,
-					  bool was_async)
-{
-	struct netfs_read_subrequest *subreq;
-	struct page *page;
-	pgoff_t unlocked = 0;
-	bool have_unlocked = false;
-
-	rcu_read_lock();
-
-	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
-		XA_STATE(xas, &rreq->mapping->i_pages, subreq->start / PAGE_SIZE);
-
-		xas_for_each(&xas, page, (subreq->start + subreq->len - 1) / PAGE_SIZE) {
-			/* We might have multiple writes from the same huge
-			 * page, but we mustn't unlock a page more than once.
-			 */
-			if (have_unlocked && page->index <= unlocked)
-				continue;
-			unlocked = page->index;
-			end_page_fscache(page);
-			have_unlocked = true;
-		}
-	}
-
-	rcu_read_unlock();
-	netfs_rreq_completed(rreq, was_async);
-}
-
-static void netfs_rreq_copy_terminated(void *priv, ssize_t transferred_or_error,
-				       bool was_async)
-{
-	struct netfs_read_subrequest *subreq = priv;
-	struct netfs_read_request *rreq = subreq->rreq;
-
-	if (IS_ERR_VALUE(transferred_or_error)) {
-		netfs_stat(&netfs_n_rh_write_failed);
-		trace_netfs_failure(rreq, subreq, transferred_or_error,
-				    netfs_fail_copy_to_cache);
-	} else {
-		netfs_stat(&netfs_n_rh_write_done);
-	}
-
-	trace_netfs_sreq(subreq, netfs_sreq_trace_write_term);
-
-	/* If we decrement nr_wr_ops to 0, the ref belongs to us. */
-	if (atomic_dec_and_test(&rreq->nr_wr_ops))
-		netfs_rreq_unmark_after_write(rreq, was_async);
-
-	netfs_put_subrequest(subreq, was_async);
-}
-
-/*
- * Perform any outstanding writes to the cache.  We inherit a ref from the
- * caller.
- */
-static void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
-{
-	struct netfs_cache_resources *cres = &rreq->cache_resources;
-	struct netfs_read_subrequest *subreq, *next, *p;
-	struct iov_iter iter;
-	int ret;
-
-	trace_netfs_rreq(rreq, netfs_rreq_trace_write);
-
-	/* We don't want terminating writes trying to wake us up whilst we're
-	 * still going through the list.
-	 */
-	atomic_inc(&rreq->nr_wr_ops);
-
-	list_for_each_entry_safe(subreq, p, &rreq->subrequests, rreq_link) {
-		if (!test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
-			list_del_init(&subreq->rreq_link);
-			netfs_put_subrequest(subreq, false);
-		}
-	}
-
-	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
-		/* Amalgamate adjacent writes */
-		while (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
-			next = list_next_entry(subreq, rreq_link);
-			if (next->start != subreq->start + subreq->len)
-				break;
-			subreq->len += next->len;
-			list_del_init(&next->rreq_link);
-			netfs_put_subrequest(next, false);
-		}
-
-		ret = cres->ops->prepare_write(cres, &subreq->start, &subreq->len,
-					       rreq->i_size);
-		if (ret < 0) {
-			trace_netfs_failure(rreq, subreq, ret, netfs_fail_prepare_write);
-			trace_netfs_sreq(subreq, netfs_sreq_trace_write_skip);
-			continue;
-		}
-
-		iov_iter_xarray(&iter, WRITE, &rreq->mapping->i_pages,
-				subreq->start, subreq->len);
-
-		atomic_inc(&rreq->nr_wr_ops);
-		netfs_stat(&netfs_n_rh_write);
-		netfs_get_read_subrequest(subreq);
-		trace_netfs_sreq(subreq, netfs_sreq_trace_write);
-		cres->ops->write(cres, subreq->start, &iter,
-				 netfs_rreq_copy_terminated, subreq);
-	}
-
-	/* If we decrement nr_wr_ops to 0, the usage ref belongs to us. */
-	if (atomic_dec_and_test(&rreq->nr_wr_ops))
-		netfs_rreq_unmark_after_write(rreq, false);
-}
-
 static void netfs_rreq_write_to_cache_work(struct work_struct *work)
 {
 	struct netfs_read_request *rreq =
@@ -390,19 +272,19 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
 	xas_for_each(&xas, page, last_page) {
 		unsigned int pgpos = (page->index - start_page) * PAGE_SIZE;
 		unsigned int pgend = pgpos + thp_size(page);
-		bool pg_failed = false;
+		bool pg_failed = false, caching;
 
 		for (;;) {
 			if (!subreq) {
 				pg_failed = true;
 				break;
 			}
-			if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags))
-				set_page_fscache(page);
 			pg_failed |= subreq_failed;
 			if (pgend < iopos + subreq->len)
 				break;
 
+			caching = test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags);
+
 			account += subreq->len - iov_iter_count(&subreq->iter);
 			iopos += subreq->len;
 			if (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
@@ -420,6 +302,8 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
 			for (i = 0; i < thp_nr_pages(page); i++)
 				flush_dcache_page(page);
 			SetPageUptodate(page);
+			if (caching)
+				set_page_dirty(page);
 		}
 
 		if (!test_bit(NETFS_RREQ_DONT_UNLOCK_PAGES, &rreq->flags)) {
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index a02d95bba158..414c2fca6b23 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -22,10 +22,7 @@ atomic_t netfs_n_rh_read_done;
 atomic_t netfs_n_rh_read_failed;
 atomic_t netfs_n_rh_zero;
 atomic_t netfs_n_rh_short_read;
-atomic_t netfs_n_rh_write;
 atomic_t netfs_n_rh_write_begin;
-atomic_t netfs_n_rh_write_done;
-atomic_t netfs_n_rh_write_failed;
 atomic_t netfs_n_rh_write_zskip;
 atomic_t netfs_n_wh_region;
 atomic_t netfs_n_wh_flush_group;
@@ -59,10 +56,6 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_rh_read),
 		   atomic_read(&netfs_n_rh_read_done),
 		   atomic_read(&netfs_n_rh_read_failed));
-	seq_printf(m, "RdHelp : WR=%u ws=%u wf=%u\n",
-		   atomic_read(&netfs_n_rh_write),
-		   atomic_read(&netfs_n_rh_write_done),
-		   atomic_read(&netfs_n_rh_write_failed));
 	seq_printf(m, "WrHelp : R=%u F=%u wr=%u\n",
 		   atomic_read(&netfs_n_wh_region),
 		   atomic_read(&netfs_n_wh_flush_group),
diff --git a/fs/netfs/write_back.c b/fs/netfs/write_back.c
index 7363c3324602..4433c3121435 100644
--- a/fs/netfs/write_back.c
+++ b/fs/netfs/write_back.c
@@ -263,7 +263,8 @@ static void netfs_writeback(struct netfs_write_request *wreq)
 
 	if (test_bit(NETFS_WREQ_WRITE_TO_CACHE, &wreq->flags))
 		netfs_set_up_write_to_cache(wreq);
-	ctx->ops->add_write_streams(wreq);
+	if (wreq->region->type != NETFS_REGION_CACHE_COPY)
+		ctx->ops->add_write_streams(wreq);
 
 out:
 	if (atomic_dec_and_test(&wreq->outstanding))
diff --git a/fs/netfs/write_helper.c b/fs/netfs/write_helper.c
index b1fe2d4c0df6..5e50b01527fb 100644
--- a/fs/netfs/write_helper.c
+++ b/fs/netfs/write_helper.c
@@ -80,6 +80,11 @@ static void netfs_init_dirty_region(struct netfs_dirty_region *region,
 	INIT_LIST_HEAD(&region->flush_link);
 	refcount_set(&region->ref, 1);
 	spin_lock_init(&region->lock);
+	if (type == NETFS_REGION_CACHE_COPY) {
+		region->state = NETFS_REGION_IS_DIRTY;
+		region->dirty.end = end;
+	}
+
 	if (file && ctx->ops->init_dirty_region)
 		ctx->ops->init_dirty_region(region, file);
 	if (!region->group) {
@@ -160,6 +165,19 @@ static enum netfs_write_compatibility netfs_write_compatibility(
 		return NETFS_WRITES_INCOMPATIBLE;
 	}
 
+	/* Pending writes to the cache alone (ie. copy from a read) can be
+	 * merged or superseded by a modification that will require writing to
+	 * the server too.
+	 */
+	if (old->type == NETFS_REGION_CACHE_COPY) {
+		if (candidate->type == NETFS_REGION_CACHE_COPY) {
+			kleave(" = COMPT [ccopy]");
+			return NETFS_WRITES_COMPATIBLE;
+		}
+		kleave(" = SUPER [ccopy]");
+		return NETFS_WRITES_SUPERSEDE;
+	}
+
 	if (!ctx->ops->is_write_compatible) {
 		if (candidate->type == NETFS_REGION_DSYNC) {
 			kleave(" = SUPER [dsync]");
@@ -220,8 +238,11 @@ static void netfs_queue_write(struct netfs_i_context *ctx,
 		if (overlaps(&candidate->bounds, &r->bounds)) {
 			if (overlaps(&candidate->reserved, &r->reserved) ||
 			    netfs_write_compatibility(ctx, r, candidate) ==
-			    NETFS_WRITES_INCOMPATIBLE)
+			    NETFS_WRITES_INCOMPATIBLE) {
+				kdebug("conflict %x with pend %x",
+				       candidate->debug_id, r->debug_id);
 				goto add_to_pending_queue;
+			}
 		}
 	}
 
@@ -238,8 +259,11 @@ static void netfs_queue_write(struct netfs_i_context *ctx,
 		if (overlaps(&candidate->bounds, &r->bounds)) {
 			if (overlaps(&candidate->reserved, &r->reserved) ||
 			    netfs_write_compatibility(ctx, r, candidate) ==
-			    NETFS_WRITES_INCOMPATIBLE)
+			    NETFS_WRITES_INCOMPATIBLE) {
+				kdebug("conflict %x with actv %x",
+				       candidate->debug_id, r->debug_id);
 				goto add_to_pending_queue;
+			}
 		}
 	}
 
@@ -451,6 +475,9 @@ static void netfs_merge_dirty_region(struct netfs_i_context *ctx,
 			goto discard;
 		}
 		goto scan_backwards;
+
+	case NETFS_REGION_CACHE_COPY:
+		goto scan_backwards;
 	}
 
 scan_backwards:
@@ -922,3 +949,84 @@ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	goto out;
 }
 EXPORT_SYMBOL(netfs_file_write_iter);
+
+/*
+ * Add a region that's just been read as a region on the dirty list to
+ * schedule a write to the cache.
+ */
+static bool netfs_copy_to_cache(struct netfs_read_request *rreq,
+				struct netfs_read_subrequest *subreq)
+{
+	struct netfs_dirty_region *candidate, *r;
+	struct netfs_i_context *ctx = netfs_i_context(rreq->inode);
+	struct list_head *p;
+	loff_t end = subreq->start + subreq->len;
+	int ret;
+
+	ret = netfs_require_flush_group(rreq->inode);
+	if (ret < 0)
+		return false;
+
+	candidate = netfs_alloc_dirty_region();
+	if (!candidate)
+		return false;
+
+	netfs_init_dirty_region(candidate, rreq->inode, NULL,
+				NETFS_REGION_CACHE_COPY, 0, subreq->start, end);
+
+	spin_lock(&ctx->lock);
+
+	/* Find a place to insert.  There can't be any dirty regions
+	 * overlapping with the region we're adding.
+	 */
+	list_for_each(p, &ctx->dirty_regions) {
+		r = list_entry(p, struct netfs_dirty_region, dirty_link);
+		if (r->bounds.end <= candidate->bounds.start)
+			continue;
+		if (r->bounds.start >= candidate->bounds.end)
+			break;
+	}
+
+	list_add_tail(&candidate->dirty_link, p);
+	netfs_merge_dirty_region(ctx, candidate);
+
+	spin_unlock(&ctx->lock);
+	return true;
+}
+
+/*
+ * If we downloaded some data and it now needs writing to the cache, we add it
+ * to the dirty region list and let that flush it.  This way it can get merged
+ * with writes.
+ *
+ * We inherit a ref from the caller.
+ */
+void netfs_rreq_do_write_to_cache(struct netfs_read_request *rreq)
+{
+	struct netfs_read_subrequest *subreq, *next, *p;
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_write);
+
+	list_for_each_entry_safe(subreq, p, &rreq->subrequests, rreq_link) {
+		if (!test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
+			list_del_init(&subreq->rreq_link);
+			netfs_put_subrequest(subreq, false);
+		}
+	}
+
+	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+		/* Amalgamate adjacent writes */
+		while (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
+			next = list_next_entry(subreq, rreq_link);
+			if (next->start != subreq->start + subreq->len)
+				break;
+			subreq->len += next->len;
+			list_del_init(&next->rreq_link);
+			netfs_put_subrequest(next, false);
+		}
+
+		netfs_copy_to_cache(rreq, subreq);
+	}
+
+	netfs_rreq_completed(rreq, false);
+}
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 43d195badb0d..527f08eb4898 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -145,7 +145,6 @@ struct netfs_read_request {
 	void			*netfs_priv;	/* Private data for the netfs */
 	unsigned int		debug_id;
 	atomic_t		nr_rd_ops;	/* Number of read ops in progress */
-	atomic_t		nr_wr_ops;	/* Number of write ops in progress */
 	size_t			submitted;	/* Amount submitted for I/O so far */
 	size_t			len;		/* Length of the request */
 	short			error;		/* 0 or error that occurred */
@@ -218,6 +217,7 @@ enum netfs_region_type {
 	NETFS_REGION_ORDINARY,		/* Ordinary write */
 	NETFS_REGION_DIO,		/* Direct I/O write */
 	NETFS_REGION_DSYNC,		/* O_DSYNC/RWF_DSYNC write */
+	NETFS_REGION_CACHE_COPY,	/* Data to be written to cache only */
 } __attribute__((mode(byte)));
 
 /*
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index aa002725b209..136cc42263f9 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -156,7 +156,8 @@ enum netfs_write_stream_trace {
 #define netfs_region_types					\
 	EM(NETFS_REGION_ORDINARY,		"ORD")		\
 	EM(NETFS_REGION_DIO,			"DIO")		\
-	E_(NETFS_REGION_DSYNC,			"DSY")
+	EM(NETFS_REGION_DSYNC,			"DSY")		\
+	E_(NETFS_REGION_CACHE_COPY,		"CCP")
 
 #define netfs_region_states					\
 	EM(NETFS_REGION_IS_PENDING,		"pend")		\
diff --git a/mm/filemap.c b/mm/filemap.c
index d1458ecf2f51..442cd767a047 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1545,8 +1545,10 @@ void end_page_writeback(struct page *page)
 	 * reused before the wake_up_page().
 	 */
 	get_page(page);
-	if (!test_clear_page_writeback(page))
+	if (!test_clear_page_writeback(page)) {
+		pr_err("Page %lx doesn't have wb set\n", page->index);
 		BUG();
+	}
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);



* Re: [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred
  2021-07-21 13:45 ` [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred David Howells
  2021-07-21 17:43   ` Jeff Layton
@ 2021-07-21 18:54   ` David Howells
  2021-07-21 19:00     ` Jeff Layton
  1 sibling, 1 reply; 22+ messages in thread
From: David Howells @ 2021-07-21 18:54 UTC (permalink / raw)
  To: Jeff Layton
  Cc: dhowells, linux-fsdevel, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Jeff Layton <jlayton@redhat.com> wrote:

> The above two deltas seem like they should have been in patch #2.

Yeah.  Looks like at least partially so.

> > @@ -635,15 +625,8 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
> >  		goto failed;
> >  	}
> >  
> > -	if (WARN(transferred_or_error > subreq->len - subreq->transferred,
> > -		 "Subreq overread: R%x[%x] %zd > %zu - %zu",
> > -		 rreq->debug_id, subreq->debug_index,
> > -		 transferred_or_error, subreq->len, subreq->transferred))
> > -		transferred_or_error = subreq->len - subreq->transferred;
> > -
> >  	subreq->error = 0;
> > -	subreq->transferred += transferred_or_error;
> > -	if (subreq->transferred < subreq->len)
> > +	if (iov_iter_count(&subreq->iter))
> >  		goto incomplete;
> >  
> 
> I must be missing it, but where does subreq->iter get advanced to the
> end of the current read? If you're getting rid of subreq->transferred
> then I think that has to happen above, no?

For afs, afs_req_issue_op() points fsreq->iter at the subrequest iterator and
calls afs_fetch_data().  Thereafter, we wend our way to
afs_deliver_fs_fetch_data() or yfs_deliver_fs_fetch_data() which set
call->iter to point to that iterator and then call afs_extract_data() which
passes it to rxrpc_kernel_recv_data(), which eventually passes it to
skb_copy_datagram_iter(), which advances the iterator.

For the cache, the subrequest iterator is passed to the cache backend by
netfs_read_from_cache().  This would be cachefiles_read() which calls
vfs_iocb_iter_read() which I thought advances the iterator (leastways,
filemap_read() keeps going until iov_iter_count() reaches 0 or some other stop
condition occurs and doesn't thereafter call iov_iter_revert()).
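
In other words, something like this (an illustrative sketch only, not code
from this series; the helper and its arguments are made up):

#include <linux/netfs.h>
#include <linux/uio.h>

static void example_fill_subreq(struct netfs_read_subrequest *subreq,
				const void *data, size_t len)
{
	/* copy_to_iter() advances subreq->iter as a side effect... */
	size_t copied = copy_to_iter(data, len, &subreq->iter);

	/* ...so this is what used to be subreq->len - subreq->transferred. */
	size_t remaining = iov_iter_count(&subreq->iter);

	pr_debug("copied=%zu remaining=%zu\n", copied, remaining);
}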

David



* Re: [RFC PATCH 03/12] netfs: Remove netfs_read_subrequest::transferred
  2021-07-21 18:54   ` David Howells
@ 2021-07-21 19:00     ` Jeff Layton
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2021-07-21 19:00 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

On Wed, 2021-07-21 at 19:54 +0100, David Howells wrote:
> Jeff Layton <jlayton@redhat.com> wrote:
> 
> > The above two deltas seem like they should have been in patch #2.
> 
> Yeah.  Looks like at least partially so.
> 
> > > @@ -635,15 +625,8 @@ void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
> > >  		goto failed;
> > >  	}
> > >  
> > > -	if (WARN(transferred_or_error > subreq->len - subreq->transferred,
> > > -		 "Subreq overread: R%x[%x] %zd > %zu - %zu",
> > > -		 rreq->debug_id, subreq->debug_index,
> > > -		 transferred_or_error, subreq->len, subreq->transferred))
> > > -		transferred_or_error = subreq->len - subreq->transferred;
> > > -
> > >  	subreq->error = 0;
> > > -	subreq->transferred += transferred_or_error;
> > > -	if (subreq->transferred < subreq->len)
> > > +	if (iov_iter_count(&subreq->iter))
> > >  		goto incomplete;
> > >  
> > 
> > I must be missing it, but where does subreq->iter get advanced to the
> > end of the current read? If you're getting rid of subreq->transferred
> > then I think that has to happen above, no?
> 
> For afs, afs_req_issue_op() points fsreq->iter at the subrequest iterator and
> calls afs_fetch_data().  Thereafter, we wend our way to
> afs_deliver_fs_fetch_data() or yfs_deliver_fs_fetch_data() which set
> call->iter to point to that iterator and then call afs_extract_data() which
> passes it to rxrpc_kernel_recv_data(), which eventually passes it to
> skb_copy_datagram_iter(), which advances the iterator.
> 
> For the cache, the subrequest iterator is passed to the cache backend by
> netfs_read_from_cache().  This would be cachefiles_read() which calls
> vfs_iocb_iter_read() which I thought advances the iterator (leastways,
> filemap_read() keeps going until iov_iter_count() reaches 0 or some other stop
> condition occurs and doesn't thereafter call iov_iter_revert()).
> 

Ok, this will probably regress ceph then.  We don't currently do anything
with subreq->iter at this point, and this patch doesn't change that.  If
you're going to make this change, it'd be cleaner to also fix up
ceph_netfs_issue_op() to advance the iter at the same time.
-- 
Jeff Layton <jlayton@redhat.com>



* Re: [RFC PATCH 01/12] afs: Sort out symlink reading
  2021-07-21 13:44 ` [RFC PATCH 01/12] afs: Sort out symlink reading David Howells
  2021-07-21 16:20   ` Jeff Layton
@ 2021-07-26  9:44   ` David Howells
  1 sibling, 0 replies; 22+ messages in thread
From: David Howells @ 2021-07-26  9:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: dhowells, linux-fsdevel, Matthew Wilcox (Oracle),
	Anna Schumaker, Steve French, Dominique Martinet, Mike Marshall,
	David Wysochanski, Shyam Prasad N, Miklos Szeredi,
	Linus Torvalds, linux-cachefs, linux-afs, linux-nfs, linux-cifs,
	ceph-devel, v9fs-developer, devel, linux-mm, linux-kernel

Jeff Layton <jlayton@redhat.com> wrote:

> > -static int afs_symlink_readpage(struct page *page)
> > +static int afs_symlink_readpage(struct file *file, struct page *page)
> >  {
> >  	struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
> >  	struct afs_read *fsreq;
> 
> 
> I wonder...would you be better served here by not using page_readlink
> for symlinks and instead use simple_get_link and roll your own readlink
> operation. It seems a bit more direct, and AFS seems to be the only
> caller of page_readlink.

Maybe.  At some point it will need to go through netfs_readpage() so that it
will get cached and maybe encrypted.  Possibly there should be a
netfs_readlink().  AFS directories too will at some point need to go through
netfs_readahead() or similar.

David


