* [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
@ 2024-03-06 22:39 David Howells
  2024-03-07  4:45 ` Matthew Wilcox
                   ` (3 more replies)
From: David Howells @ 2024-03-06 22:39 UTC
  To: Matthew Wilcox
  Cc: dhowells, Christoph Hellwig, Andrew Morton, Alexander Viro,
	Christian Brauner, Jeff Layton, linux-mm, linux-fsdevel, netfs,
	v9fs, linux-afs, ceph-devel, linux-cifs, linux-nfs, devel,
	linux-kernel

Here's a patch to have a go at getting rid of ->launder_folio().  Since it's
allowed to fail and cannot guarantee that the folios in the range are removed,
I've tried to replace laundering with just flush-and-wait, dropping the folio
lock around the I/O.

Are there any tests for launder_folio?  Triggering it is tricky.
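
(For want of a real test, here's a minimal userspace sketch of one way that
might exercise it: one thread keeps redirtying a folio through a shared mmap
while another does O_DIRECT writes over the same range, so the pre-DIO
invalidation can find the folio dirty again after the flush.  The path and
sizes are made up for illustration:)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static volatile char *map;

	static void *dirtier(void *arg)
	{
		(void)arg;
		for (;;)
			map[0]++;	/* keep redirtying the first folio */
		return NULL;
	}

	int main(void)
	{
		int fd = open("/mnt/test/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
		void *buf = NULL;
		pthread_t t;

		if (fd < 0 || ftruncate(fd, 1 << 20) < 0)
			return 1;
		map = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED || posix_memalign(&buf, 4096, 4096) != 0)
			return 1;
		memset(buf, 0xaa, 4096);
		pthread_create(&t, NULL, dirtier, NULL);
		for (int i = 0; i < 100000; i++)
			pwrite(fd, buf, 4096, 0); /* each DIO write invalidates the range first */
		return 0;
	}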

David
---
mm: Replace ->launder_folio() with flush and wait

invalidate_inode_pages2_range() and its wrappers are frequently used to
invalidate overlapping folios prior to and after doing direct I/O.  This
calls ->launder_folio() to flush dirty folios out to the backing store,
keeping the folio lock across the I/O - presumably to prevent the folio
from being redirtied and thereby prevent it from being removed.

However...  If we're doing this prior to doing DIO on a file, there may be
nothing preventing an mmapped write from recreating and redirtying the
folio the moment it is removed from the mapping - and we can't prevent
that, lest the kernel deadlock on doing DIO to/from a buffer mmapped from
the target file.

Further, invalidate_inode_pages2_range() is permitted to fail - and half
the callers don't even check to see if it *did* fail, probably not
unreasonably.

In which case, there's no need to keep the folio lock and no need for a
special op.  Instead, drop the lock and flush the range from the folio to
the end of the range being invalidated (with a flag set in the wbc to say
what we're doing), then wait for the write to complete.

That said, there are other places that use invalidate_inode_pages2_range(),
and most of the filesystems involved don't implement ->launder_folio(), so
add a mapping flag and only do this for the filesystems that did implement
it: 9p, afs, cifs, fuse, nfs and orangefs.
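
A filesystem opts in when it sets up the mapping, as the per-fs hunks below
do; schematically (myfs_addr_operations is a made-up name):

	inode->i_mapping->a_ops = &myfs_addr_operations;
	mapping_set_launder_folios(inode->i_mapping);	/* opt in to flush-and-wait */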

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: devel@lists.orangefs.org
---
 Documentation/filesystems/locking.rst |    8 ---
 Documentation/filesystems/vfs.rst     |    6 --
 fs/9p/vfs_addr.c                      |    1 
 fs/9p/vfs_inode.c                     |    1 
 fs/afs/file.c                         |    1 
 fs/afs/inode.c                        |    1 
 fs/fuse/dir.c                         |    2 
 fs/fuse/file.c                        |   17 -------
 fs/netfs/buffered_write.c             |   76 ----------------------------------
 fs/nfs/file.c                         |   23 ----------
 fs/nfs/inode.c                        |    1 
 fs/nfs/nfstrace.h                     |    1 
 fs/orangefs/inode.c                   |    2 
 fs/smb/client/file.c                  |   23 ----------
 fs/smb/client/inode.c                 |    1 
 include/linux/fs.h                    |    1 
 include/linux/netfs.h                 |    1 
 include/linux/pagemap.h               |    6 ++
 include/linux/writeback.h             |    1 
 mm/truncate.c                         |   35 +++++++++------
 20 files changed, 36 insertions(+), 172 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d5bf4b6b7509..139554d1ab51 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -262,7 +262,6 @@ prototypes::
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 	int (*migrate_folio)(struct address_space *, struct folio *dst,
 			struct folio *src, enum migrate_mode);
-	int (*launder_folio)(struct folio *);
 	bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
 	int (*error_remove_folio)(struct address_space *, struct folio *);
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
@@ -288,7 +287,6 @@ release_folio:		yes
 free_folio:		yes
 direct_IO:
 migrate_folio:		yes (both)
-launder_folio:		yes
 is_partially_uptodate:	yes
 error_remove_folio:	yes
 swap_activate:		no
@@ -394,12 +392,6 @@ try_to_free_buffers().
 ->free_folio() is called when the kernel has dropped the folio
 from the page cache.
 
-->launder_folio() may be called prior to releasing a folio if
-it is still found to be dirty. It returns zero if the folio was successfully
-cleaned, or an error value if not. Note that in order to prevent the folio
-getting mapped back in and redirtied, it needs to be kept locked
-across the entire operation.
-
 ->swap_activate() will be called to prepare the given file for swap.  It
 should perform any validation and preparation necessary to ensure that
 writes can be performed with minimal memory allocation.  It should call
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index eebcc0f9e2bc..b2af9ee6515a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -818,7 +818,6 @@ cache in your filesystem.  The following members are defined:
 		ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 		int (*migrate_folio)(struct mapping *, struct folio *dst,
 				struct folio *src, enum migrate_mode);
-		int (*launder_folio) (struct folio *);
 
 		bool (*is_partially_uptodate) (struct folio *, size_t from,
 					       size_t count);
@@ -1012,11 +1011,6 @@ cache in your filesystem.  The following members are defined:
 	folio to this function.  migrate_folio should transfer any private
 	data across and update any references that it has to the folio.
 
-``launder_folio``
-	Called before freeing a folio - it writes back the dirty folio.
-	To prevent redirtying the folio, it is kept locked during the
-	whole operation.
-
 ``is_partially_uptodate``
 	Called by the VM when reading a file through the pagecache when
 	the underlying blocksize is smaller than the size of the folio.
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 047855033d32..967c81b7aa73 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -141,7 +141,6 @@ const struct address_space_operations v9fs_addr_operations = {
 	.dirty_folio		= netfs_dirty_folio,
 	.release_folio		= netfs_release_folio,
 	.invalidate_folio	= netfs_invalidate_folio,
-	.launder_folio		= netfs_launder_folio,
 	.direct_IO		= noop_direct_IO,
 	.writepages		= netfs_writepages,
 };
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 32572982f72e..e085e520cb12 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -263,6 +263,7 @@ int v9fs_init_inode(struct v9fs_session_info *v9ses,
 	simple_inode_init_ts(inode);
 	inode->i_mapping->a_ops = &v9fs_addr_operations;
 	inode->i_private = NULL;
+	mapping_set_launder_folios(inode->i_mapping);
 
 	switch (mode & S_IFMT) {
 	case S_IFIFO:
diff --git a/fs/afs/file.c b/fs/afs/file.c
index ef2cc8f565d2..dfd8f60f5e1f 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -54,7 +54,6 @@ const struct address_space_operations afs_file_aops = {
 	.read_folio	= netfs_read_folio,
 	.readahead	= netfs_readahead,
 	.dirty_folio	= netfs_dirty_folio,
-	.launder_folio	= netfs_launder_folio,
 	.release_folio	= netfs_release_folio,
 	.invalidate_folio = netfs_invalidate_folio,
 	.migrate_folio	= filemap_migrate_folio,
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 94fc049aff58..4041d9a4dae8 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -104,6 +104,7 @@ static int afs_inode_init_from_status(struct afs_operation *op,
 		inode->i_fop	= &afs_file_operations;
 		inode->i_mapping->a_ops	= &afs_file_aops;
 		mapping_set_large_folios(inode->i_mapping);
+		mapping_set_launder_folios(inode->i_mapping);
 		break;
 	case AFS_FTYPE_DIR:
 		inode->i_mode	= S_IFDIR |  (status->mode & S_IALLUGO);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index d19cbf34c634..d87ee76bc578 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1978,7 +1978,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	/*
 	 * Only call invalidate_inode_pages2() after removing
-	 * FUSE_NOWRITE, otherwise fuse_launder_folio() would deadlock.
+	 * FUSE_NOWRITE, otherwise writeback would deadlock.
 	 */
 	if ((is_truncate || !is_wb) &&
 	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 148a71b8b4d0..4933fe0b3269 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2393,21 +2393,6 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-static int fuse_launder_folio(struct folio *folio)
-{
-	int err = 0;
-	if (folio_clear_dirty_for_io(folio)) {
-		struct inode *inode = folio->mapping->host;
-
-		/* Serialize with pending writeback for the same page */
-		fuse_wait_on_page_writeback(inode, folio->index);
-		err = fuse_writepage_locked(&folio->page);
-		if (!err)
-			fuse_wait_on_page_writeback(inode, folio->index);
-	}
-	return err;
-}
-
 /*
  * Write back dirty data/metadata now (there may not be any suitable
  * open files later for data)
@@ -3227,7 +3212,6 @@ static const struct address_space_operations fuse_file_aops  = {
 	.readahead	= fuse_readahead,
 	.writepage	= fuse_writepage,
 	.writepages	= fuse_writepages,
-	.launder_folio	= fuse_launder_folio,
 	.dirty_folio	= filemap_dirty_folio,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
@@ -3241,6 +3225,7 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
+	mapping_set_launder_folios(inode->i_mapping);
 
 	INIT_LIST_HEAD(&fi->write_files);
 	INIT_LIST_HEAD(&fi->queued_writes);
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 9a0d32e4b422..56aeb7e70bc8 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -908,7 +908,7 @@ static ssize_t netfs_write_back_from_locked_folio(struct address_space *mapping,
 	_enter(",%lx,%llx-%llx,%u", folio->index, start, end, caching);
 
 	wreq = netfs_alloc_request(mapping, NULL, start, folio_size(folio),
-				   NETFS_WRITEBACK);
+				   wbc->for_launder ? NETFS_LAUNDER_WRITE : NETFS_WRITEBACK);
 	if (IS_ERR(wreq)) {
 		folio_unlock(folio);
 		return PTR_ERR(wreq);
@@ -1181,77 +1181,3 @@ int netfs_writepages(struct address_space *mapping,
 	return ret;
 }
 EXPORT_SYMBOL(netfs_writepages);
-
-/*
- * Deal with the disposition of a laundered folio.
- */
-static void netfs_cleanup_launder_folio(struct netfs_io_request *wreq)
-{
-	if (wreq->error) {
-		pr_notice("R=%08x Laundering error %d\n", wreq->debug_id, wreq->error);
-		mapping_set_error(wreq->mapping, wreq->error);
-	}
-}
-
-/**
- * netfs_launder_folio - Clean up a dirty folio that's being invalidated
- * @folio: The folio to clean
- *
- * This is called to write back a folio that's being invalidated when an inode
- * is getting torn down.  Ideally, writepages would be used instead.
- */
-int netfs_launder_folio(struct folio *folio)
-{
-	struct netfs_io_request *wreq;
-	struct address_space *mapping = folio->mapping;
-	struct netfs_folio *finfo = netfs_folio_info(folio);
-	struct netfs_group *group = netfs_folio_group(folio);
-	struct bio_vec bvec;
-	unsigned long long i_size = i_size_read(mapping->host);
-	unsigned long long start = folio_pos(folio);
-	size_t offset = 0, len;
-	int ret = 0;
-
-	if (finfo) {
-		offset = finfo->dirty_offset;
-		start += offset;
-		len = finfo->dirty_len;
-	} else {
-		len = folio_size(folio);
-	}
-	len = min_t(unsigned long long, len, i_size - start);
-
-	wreq = netfs_alloc_request(mapping, NULL, start, len, NETFS_LAUNDER_WRITE);
-	if (IS_ERR(wreq)) {
-		ret = PTR_ERR(wreq);
-		goto out;
-	}
-
-	if (!folio_clear_dirty_for_io(folio))
-		goto out_put;
-
-	trace_netfs_folio(folio, netfs_folio_trace_launder);
-
-	_debug("launder %llx-%llx", start, start + len - 1);
-
-	/* Speculatively write to the cache.  We have to fix this up later if
-	 * the store fails.
-	 */
-	wreq->cleanup = netfs_cleanup_launder_folio;
-
-	bvec_set_folio(&bvec, folio, len, offset);
-	iov_iter_bvec(&wreq->iter, ITER_SOURCE, &bvec, 1, len);
-	__set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
-	ret = netfs_begin_write(wreq, true, netfs_write_trace_launder);
-
-out_put:
-	folio_detach_private(folio);
-	netfs_put_group(group);
-	kfree(finfo);
-	netfs_put_request(wreq, false, netfs_rreq_trace_put_return);
-out:
-	folio_wait_fscache(folio);
-	_leave(" = %d", ret);
-	return ret;
-}
-EXPORT_SYMBOL(netfs_launder_folio);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8577ccf621f5..6efe0af3ba80 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -484,28 +484,6 @@ static void nfs_check_dirty_writeback(struct folio *folio,
 		*dirty = true;
 }
 
-/*
- * Attempt to clear the private state associated with a page when an error
- * occurs that requires the cached contents of an inode to be written back or
- * destroyed
- * - Called if either PG_private or fscache is set on the page
- * - Caller holds page lock
- * - Return 0 if successful, -error otherwise
- */
-static int nfs_launder_folio(struct folio *folio)
-{
-	struct inode *inode = folio->mapping->host;
-	int ret;
-
-	dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n",
-		inode->i_ino, folio_pos(folio));
-
-	folio_wait_fscache(folio);
-	ret = nfs_wb_folio(inode, folio);
-	trace_nfs_launder_folio_done(inode, folio, ret);
-	return ret;
-}
-
 static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 						sector_t *span)
 {
@@ -564,7 +542,6 @@ const struct address_space_operations nfs_file_aops = {
 	.invalidate_folio = nfs_invalidate_folio,
 	.release_folio = nfs_release_folio,
 	.migrate_folio = nfs_migrate_folio,
-	.launder_folio = nfs_launder_folio,
 	.is_dirty_writeback = nfs_check_dirty_writeback,
 	.error_remove_folio = generic_error_remove_folio,
 	.swap_activate = nfs_swap_activate,
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index ebb8d60e1152..f2f8cf00a442 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -491,6 +491,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 			inode->i_fop = NFS_SB(sb)->nfs_client->rpc_ops->file_ops;
 			inode->i_data.a_ops = &nfs_file_aops;
 			nfs_inode_init_regular(nfsi);
+			mapping_set_launder_folios(inode->i_mapping);
 		} else if (S_ISDIR(inode->i_mode)) {
 			inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
 			inode->i_fop = &nfs_dir_operations;
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index afedb449b54f..f0e8c0fb9447 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1039,7 +1039,6 @@ DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
 DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
 
 DEFINE_NFS_FOLIO_EVENT(nfs_invalidate_folio);
-DEFINE_NFS_FOLIO_EVENT_DONE(nfs_launder_folio_done);
 
 TRACE_EVENT(nfs_aop_readahead,
 		TP_PROTO(
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 085912268442..a9ea40261832 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -626,7 +626,6 @@ static const struct address_space_operations orangefs_address_operations = {
 	.invalidate_folio = orangefs_invalidate_folio,
 	.release_folio = orangefs_release_folio,
 	.free_folio = orangefs_free_folio,
-	.launder_folio = orangefs_launder_folio,
 	.direct_IO = orangefs_direct_IO,
 };
 
@@ -980,6 +979,7 @@ static const struct inode_operations orangefs_file_inode_operations = {
 static int orangefs_init_iops(struct inode *inode)
 {
 	inode->i_mapping->a_ops = &orangefs_address_operations;
+	mapping_set_launder_folios(inode->i_mapping);
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index f391c9b803d8..4fd871ba00f9 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -4858,27 +4858,6 @@ static void cifs_invalidate_folio(struct folio *folio, size_t offset,
 	folio_wait_fscache(folio);
 }
 
-static int cifs_launder_folio(struct folio *folio)
-{
-	int rc = 0;
-	loff_t range_start = folio_pos(folio);
-	loff_t range_end = range_start + folio_size(folio);
-	struct writeback_control wbc = {
-		.sync_mode = WB_SYNC_ALL,
-		.nr_to_write = 0,
-		.range_start = range_start,
-		.range_end = range_end,
-	};
-
-	cifs_dbg(FYI, "Launder page: %lu\n", folio->index);
-
-	if (folio_clear_dirty_for_io(folio))
-		rc = cifs_writepage_locked(&folio->page, &wbc);
-
-	folio_wait_fscache(folio);
-	return rc;
-}
-
 void cifs_oplock_break(struct work_struct *work)
 {
 	struct cifsFileInfo *cfile = container_of(work, struct cifsFileInfo,
@@ -5057,7 +5036,6 @@ const struct address_space_operations cifs_addr_ops = {
 	.release_folio = cifs_release_folio,
 	.direct_IO = cifs_direct_io,
 	.invalidate_folio = cifs_invalidate_folio,
-	.launder_folio = cifs_launder_folio,
 	.migrate_folio = filemap_migrate_folio,
 	/*
 	 * TODO: investigate and if useful we could add an is_dirty_writeback
@@ -5080,6 +5058,5 @@ const struct address_space_operations cifs_addr_ops_smallbuf = {
 	.dirty_folio = netfs_dirty_folio,
 	.release_folio = cifs_release_folio,
 	.invalidate_folio = cifs_invalidate_folio,
-	.launder_folio = cifs_launder_folio,
 	.migrate_folio = filemap_migrate_folio,
 };
diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c
index d02f8ba29cb5..24c1c2610e04 100644
--- a/fs/smb/client/inode.c
+++ b/fs/smb/client/inode.c
@@ -56,6 +56,7 @@ static void cifs_set_ops(struct inode *inode)
 			inode->i_data.a_ops = &cifs_addr_ops_smallbuf;
 		else
 			inode->i_data.a_ops = &cifs_addr_ops;
+		mapping_set_launder_folios(inode->i_mapping);
 		break;
 	case S_IFDIR:
 		if (IS_AUTOMOUNT(inode)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fbc72c5f112..ded54555ab30 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -432,7 +432,6 @@ struct address_space_operations {
 	 */
 	int (*migrate_folio)(struct address_space *, struct folio *dst,
 			struct folio *src, enum migrate_mode);
-	int (*launder_folio)(struct folio *);
 	bool (*is_partially_uptodate) (struct folio *, size_t from,
 			size_t count);
 	void (*is_dirty_writeback) (struct folio *, bool *dirty, bool *wb);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 100cbb261269..e566d31afb04 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -410,7 +410,6 @@ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
 void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
 void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
 bool netfs_release_folio(struct folio *folio, gfp_t gfp);
-int netfs_launder_folio(struct folio *folio);
 
 /* VMA operations API. */
 vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..27ea8b9139a3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -207,6 +207,7 @@ enum mapping_flags {
 	AS_STABLE_WRITES,	/* must wait for writeback before modifying
 				   folio contents */
 	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
+	AS_LAUNDER_FOLIOS,	/* Use laundering in invalidate_inode_pages2*() */
 };
 
 /**
@@ -323,6 +324,11 @@ static inline bool mapping_unmovable(struct address_space *mapping)
 	return test_bit(AS_UNMOVABLE, &mapping->flags);
 }
 
+static inline void mapping_set_launder_folios(struct address_space *mapping)
+{
+	set_bit(AS_LAUNDER_FOLIOS, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9845cb62e40b..7a1b79b344fa 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
+	unsigned for_launder:1;		/* Called from invalidate_inode_pages2_range() */
 	unsigned unpinned_netfs_wb:1;	/* Cleared I_PINNING_NETFS_WB */
 
 	/*
diff --git a/mm/truncate.c b/mm/truncate.c
index 725b150e47ac..cb92759c4d8d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -586,15 +586,6 @@ static int invalidate_complete_folio2(struct address_space *mapping,
 	return 0;
 }
 
-static int folio_launder(struct address_space *mapping, struct folio *folio)
-{
-	if (!folio_test_dirty(folio))
-		return 0;
-	if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL)
-		return 0;
-	return mapping->a_ops->launder_folio(folio);
-}
-
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -657,13 +648,29 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				unmap_mapping_folio(folio);
 			BUG_ON(folio_mapped(folio));
 
-			ret2 = folio_launder(mapping, folio);
-			if (ret2 == 0) {
-				if (!invalidate_complete_folio2(mapping, folio))
+			if (folio_test_dirty(folio) &&
+			    test_bit(AS_LAUNDER_FOLIOS, &mapping->flags)) {
+				struct writeback_control wbc = {
+					.sync_mode   = WB_SYNC_ALL,
+					.nr_to_write = LLONG_MAX,
+					.range_start = folio_pos(folio),
+					.range_end   = (end + 1ULL) * PAGE_SIZE - 1,
+					.for_launder = true,
+				};
+
+				folio_unlock(folio);
+				ret2 = filemap_fdatawrite_wbc(mapping, &wbc);
+				folio_lock(folio);
+				if (ret2 < 0)
+					ret = ret2;
+				else if (!invalidate_complete_folio2(mapping, folio) &&
+					 ret == 0)
+					ret = -EBUSY;
+			} else {
+				if (!invalidate_complete_folio2(mapping, folio) &&
+				    ret == 0)
 					ret2 = -EBUSY;
 			}
-			if (ret2 < 0)
-				ret = ret2;
 			folio_unlock(folio);
 		}
 		folio_batch_remove_exceptionals(&fbatch);



* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-06 22:39 [RFC PATCH] mm: Replace ->launder_folio() with flush and wait David Howells
@ 2024-03-07  4:45 ` Matthew Wilcox
  2024-03-07  8:26 ` David Howells
                   ` (2 subsequent siblings)
From: Matthew Wilcox @ 2024-03-07  4:45 UTC
  To: David Howells
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro,
	Christian Brauner, Jeff Layton, linux-mm, linux-fsdevel, netfs,
	v9fs, linux-afs, ceph-devel, linux-cifs, linux-nfs, devel,
	linux-kernel

On Wed, Mar 06, 2024 at 10:39:37PM +0000, David Howells wrote:
> Here's a patch to have a go at getting rid of ->launder_folio().  Since it's
> allowed to fail and cannot guarantee that the folios in the range are removed,
> I've tried to replace laundering with just flush-and-wait, dropping the folio
> lock around the I/O.

My sense is that ->launder_folio doesn't actually need to be replaced.

commit e3db7691e9f3dff3289f64e3d98583e28afe03db
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Wed Jan 10 23:15:39 2007 -0800

    [PATCH] NFS: Fix race in nfs_release_page()

        NFS: Fix race in nfs_release_page()

        invalidate_inode_pages2() may find the dirty bit has been set on a page
        owing to the fact that the page may still be mapped after it was locked.
        Only after the call to unmap_mapping_range() are we sure that the page
        can no longer be dirtied.
        In order to fix this, NFS has hooked the releasepage() method and tries
        to write the page out between the call to unmap_mapping_range() and the
        call to remove_mapping(). This, however leads to deadlocks in the page
        reclaim code, where the page may be locked without holding a reference
        to the inode or dentry.

        Fix is to add a new address_space_operation, launder_page(), which will
        attempt to write out a dirty page without releasing the page lock.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

I don't understand why this couldn't've been solved by page_mkwrite.
NFS did later add nfs_vm_page_mkwrite in July 2007, and maybe it's just
not needed any more?  I haven't looked into it enough to make sure,
but my belief is that we should be able to get rid of it.
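
(For illustration, a ->page_mkwrite() handler along the lines of the generic
filemap_page_mkwrite() can serialise redirtying against writeback and
invalidation, because a write has to refault the folio writable through it.
The myfs_* name is a placeholder and this is only a sketch:)

	static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
	{
		struct folio *folio = page_folio(vmf->page);
		struct inode *inode = file_inode(vmf->vma->vm_file);

		sb_start_pagefault(inode->i_sb);
		folio_lock(folio);
		if (folio->mapping != inode->i_mapping) {
			/* Raced with truncation/invalidation; retry the fault */
			folio_unlock(folio);
			sb_end_pagefault(inode->i_sb);
			return VM_FAULT_NOPAGE;
		}
		folio_wait_writeback(folio);	/* serialise with any flush in flight */
		folio_mark_dirty(folio);
		sb_end_pagefault(inode->i_sb);
		return VM_FAULT_LOCKED;		/* folio handed back still locked */
	}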


* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-06 22:39 [RFC PATCH] mm: Replace ->launder_folio() with flush and wait David Howells
  2024-03-07  4:45 ` Matthew Wilcox
@ 2024-03-07  8:26 ` David Howells
  2024-03-07 10:36 ` David Howells
  2024-03-07 16:15 ` [RFC PATCH v2] mm: Kill ->launder_folio() David Howells
From: David Howells @ 2024-03-07  8:26 UTC
  To: Matthew Wilcox, Trond Myklebust
  Cc: dhowells, Christoph Hellwig, Andrew Morton, Alexander Viro,
	Christian Brauner, Jeff Layton, linux-mm, linux-fsdevel, netfs,
	v9fs, linux-afs, ceph-devel, linux-cifs, linux-nfs, devel,
	linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> commit e3db7691e9f3dff3289f64e3d98583e28afe03db
> Author: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date:   Wed Jan 10 23:15:39 2007 -0800
> 
>     [PATCH] NFS: Fix race in nfs_release_page()
>... 
>       invalidate_inode_pages2() may find the dirty bit has been set on a page
>       owing to the fact that the page may still be mapped after it was locked.
>       Only after the call to unmap_mapping_range() are we sure that the page
>       can no longer be dirtied.

Is that last sentence even true?  unmap_mapping_range() evicts folios from
the TLB and/or pagetables, but it doesn't actually remove any mmap that was
made - in which case, userspace is perfectly at liberty to regenerate and
dirty the folio the moment the folio is removed from the page tables.  It
has to be that way, otherwise DIO to/from a buffer mmapped from the same
file would deadlock.

> but my belief is that we should be able to get rid of it.

I think you're probably correct.  The best we can do, I think, is to preface
any call to invalidate_inode_pages2() with a flush-and-wait over the same
range.
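
Something like this (sketch only; names are illustrative):

	/* Flush and wait first, then try to zap, accepting that a racing
	 * mmap write may still defeat the invalidation.
	 */
	error = filemap_write_and_wait_range(mapping, start, end);
	if (error == 0)
		error = invalidate_inode_pages2_range(mapping,
						      start >> PAGE_SHIFT,
						      end >> PAGE_SHIFT);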

David



* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-06 22:39 [RFC PATCH] mm: Replace ->launder_folio() with flush and wait David Howells
  2024-03-07  4:45 ` Matthew Wilcox
  2024-03-07  8:26 ` David Howells
@ 2024-03-07 10:36 ` David Howells
  2024-03-07 14:54   ` Christoph Hellwig
                     ` (2 more replies)
  2024-03-07 16:15 ` [RFC PATCH v2] mm: Kill ->launder_folio() David Howells
From: David Howells @ 2024-03-07 10:36 UTC
  To: Matthew Wilcox, Trond Myklebust, Miklos Szeredi, Christoph Hellwig
  Cc: dhowells, Andrew Morton, Alexander Viro, Christian Brauner,
	Jeff Layton, linux-mm, linux-fsdevel, netfs, v9fs, linux-afs,
	ceph-devel, linux-cifs, linux-nfs, devel, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Mar 06, 2024 at 10:39:37PM +0000, David Howells wrote:
> > Here's a patch to have a go at getting rid of ->launder_folio().  Since it's
> > allowed to fail and cannot guarantee that the folios in the range are removed,
> > I've tried to replace laundering with just flush-and-wait, dropping the folio
> > lock around the I/O.
> 
> My sense is that ->launder_folio doesn't actually need to be replaced.
> 
> commit e3db7691e9f3dff3289f64e3d98583e28afe03db
> Author: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date:   Wed Jan 10 23:15:39 2007 -0800
> 
>     [PATCH] NFS: Fix race in nfs_release_page()
> 
>         NFS: Fix race in nfs_release_page()
> 
>         invalidate_inode_pages2() may find the dirty bit has been set on a page
>         owing to the fact that the page may still be mapped after it was locked.
>         Only after the call to unmap_mapping_range() are we sure that the page
>         can no longer be dirtied.
>         In order to fix this, NFS has hooked the releasepage() method and tries
>         to write the page out between the call to unmap_mapping_range() and the
>         call to remove_mapping(). This, however leads to deadlocks in the page
>         reclaim code, where the page may be locked without holding a reference
>         to the inode or dentry.
> 
>         Fix is to add a new address_space_operation, launder_page(), which will
>         attempt to write out a dirty page without releasing the page lock.
> 
>     Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> 
> I don't understand why this couldn't've been solved by page_mkwrite.
> NFS did later add nfs_vm_page_mkwrite in July 2007, and maybe it's just
> not needed any more?  I haven't looked into it enough to make sure,
> but my belief is that we should be able to get rid of it.

Okay, it's slightly more complex than I thought - and I'm not sure all callers
are actually using it correctly.  There are some additional interesting cases
I've found, beyond the pre-/post-DIO case:

 (1) NFS relies on it to retry the write before stripping the pages in the
     case where a writeback error occurs.  I think this can probably be dealt
     with by sticking a filemap_fdatawrite() call before the invalidation.
     I'm not sure if this would incur the deadlock with the page reclaim code
     of which Trond speaks.

 (2) invalidate_inode_pages2() is used in some places to effect invalidation
     of the pagecache in the case where the server tells us that a third party
     modified the server copy of a file.  What the right behaviour should be
     here, I'm not sure, but at the moment, any dirty data will get laundered
     back to the server.  Possibly it should simply be invalidated locally, or
     the user should be asked how they want to handle the divergence.

     Some filesystems use invalidate_remote_inode() instead which seems to
     leave the dirty pages in place locally.

     If it is desirable to save the dirty data, then filemap_fdatawrite()
     could be deployed before invalidating the pages.

 (3) Fuse uses invalidate_inode_pages2() in fuse_do_setattr() to strip all the
     pages from an inode that has had its size changed, laundering any page
     that's still dirty.  This would seem to be excessive, but maybe Miklos
     had a reason for doing it that way.

There are some places that should perhaps be using kiocb_invalidate_pages()
and kiocb_invalidate_post_direct_write() instead - assuming Christoph has no
objection to the latter function being exported.
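
For reference, the DIO-write shape those helpers support is roughly this
(signatures as I understand them from the generic direct-write path;
do_direct_write() is a placeholder for the fs-specific I/O):

	ssize_t ret;

	ret = kiocb_invalidate_pages(iocb, count);	/* flush + zap the range pre-DIO */
	if (ret)
		return ret;				/* e.g. -EAGAIN for IOCB_NOWAIT */

	ret = do_direct_write(iocb, iter);

	if (ret > 0)
		kiocb_invalidate_post_direct_write(iocb, ret);	/* best-effort re-zap */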

David



* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-07 10:36 ` David Howells
@ 2024-03-07 14:54   ` Christoph Hellwig
  2024-03-08  7:52   ` Miklos Szeredi
  2024-03-19 14:14   ` David Howells
From: Christoph Hellwig @ 2024-03-07 14:54 UTC
  To: David Howells
  Cc: Matthew Wilcox, Trond Myklebust, Miklos Szeredi,
	Christoph Hellwig, Andrew Morton, Alexander Viro,
	Christian Brauner, Jeff Layton, linux-mm, linux-fsdevel, netfs,
	v9fs, linux-afs, ceph-devel, linux-cifs, linux-nfs, devel,
	linux-kernel

On Thu, Mar 07, 2024 at 10:36:28AM +0000, David Howells wrote:
> There are some places that should perhaps be using kiocb_invalidate_pages()
> and kiocb_invalidate_post_direct_write() instead - assuming Christoph has no
> objection to the latter function being exported.

These are intended as helpers for direct I/O implementation, so I
very much expected them to eventually get exported.



* [RFC PATCH v2] mm: Kill ->launder_folio()
  2024-03-06 22:39 [RFC PATCH] mm: Replace ->launder_folio() with flush and wait David Howells
                   ` (2 preceding siblings ...)
  2024-03-07 10:36 ` David Howells
@ 2024-03-07 16:15 ` David Howells
From: David Howells @ 2024-03-07 16:15 UTC
  To: Matthew Wilcox, Trond Myklebust
  Cc: dhowells, Christoph Hellwig, Andrew Morton, Alexander Viro,
	Christian Brauner, Jeff Layton, linux-mm, linux-fsdevel, netfs,
	v9fs, linux-afs, ceph-devel, linux-cifs, linux-nfs, devel,
	linux-kernel

invalidate_inode_pages2_range() and its wrappers are frequently used to
invalidate overlapping folios prior to and after doing direct I/O.  This
calls ->launder_folio() to flush dirty folios out to the backing store,
keeping the folio lock across the I/O - presumably to prevent the folio
from being redirtied and thereby prevent it from being removed.

However...  If we're doing this prior to doing DIO on a file, there may be
nothing preventing an mmapped write from recreating and redirtying the
folio the moment it is removed from the mapping - and we can't prevent
that, lest the kernel deadlock on doing DIO to/from a buffer mmapped from
the target file.

Further, invalidate_inode_pages2_range() is permitted to fail - and half
the callers don't even check to see if it *did* fail, probably not
unreasonably.

In which case, there's no point doing the laundry there; better to call
something like filemap_fdatawrite() beforehand.  If mmap is going to
interfere, we can't stop it.
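
I.e. roughly:

	filemap_fdatawrite(mapping);		/* push out what we can first */
	invalidate_inode_pages2(mapping);	/* then invalidate; may still fail */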

There are some other cases in which this is used:

 (1) In fuse_do_setattr(), when the size of a file is changed.  Calling
     invalidate_inode_pages2() here is probably the wrong thing to do as
     the preceding truncate_pagecache() should do the appropriate page
     trimming and this would just seem to reduce performance for no good
     reason.

 (2) In some network filesystems, when the server informs the client of a
     third-party modification to a file, the local pagecache is zapped with
     invalidate_inode_pages2() rather than invalidate_remote_inode().  The
     former writes back the dirty data whereas the latter retains it plus
     surrounding obsolete data in the same folio.  Maybe this should be
     done by filemap_fdatawrite() followed by invalidate_inode_pages2().

     Possibly, ->page_mkwrite() could be used to hold off mmap writes until
     remote invalidation-induced writeback is achieved.

 (3) In NFS, this is used to attempt to save the data when some sort of
     fatal error occurs.  It may be sufficient to do a filemap_fdatawrite()
     before calling invalidate_inode_pages2().  nfs_writepages() can
     observe the error state and do the laundering thing.  Again, maybe,
     ->page_mkwrite() could be used to hold off mmap writes until the
     pagecache has been invalidated.

Note that this only affects 9p, afs, cifs, fuse, nfs and orangefs.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: devel@lists.orangefs.org
---
 Documentation/filesystems/locking.rst |    8 --
 Documentation/filesystems/vfs.rst     |    6 -
 fs/9p/vfs_addr.c                      |    2 
 fs/afs/file.c                         |    1 
 fs/afs/internal.h                     |    1 
 fs/afs/write.c                        |   10 --
 fs/ceph/file.c                        |    6 -
 fs/fuse/file.c                        |   16 ----
 fs/netfs/buffered_write.c             |   74 --------------------
 fs/netfs/main.c                       |    1 
 fs/nfs/file.c                         |   23 ------
 fs/nfs/inode.c                        |    4 -
 fs/nfs/nfstrace.h                     |    1 
 fs/orangefs/inode.c                   |    1 
 fs/smb/client/file.c                  |  122 ----------------------------------
 include/linux/fs.h                    |    1 
 include/linux/netfs.h                 |    2 
 include/trace/events/netfs.h          |    3 
 mm/truncate.c                         |   25 +-----
 19 files changed, 14 insertions(+), 293 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d5bf4b6b7509..139554d1ab51 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -262,7 +262,6 @@ prototypes::
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 	int (*migrate_folio)(struct address_space *, struct folio *dst,
 			struct folio *src, enum migrate_mode);
-	int (*launder_folio)(struct folio *);
 	bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
 	int (*error_remove_folio)(struct address_space *, struct folio *);
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
@@ -288,7 +287,6 @@ release_folio:		yes
 free_folio:		yes
 direct_IO:
 migrate_folio:		yes (both)
-launder_folio:		yes
 is_partially_uptodate:	yes
 error_remove_folio:	yes
 swap_activate:		no
@@ -394,12 +392,6 @@ try_to_free_buffers().
 ->free_folio() is called when the kernel has dropped the folio
 from the page cache.
 
-->launder_folio() may be called prior to releasing a folio if
-it is still found to be dirty. It returns zero if the folio was successfully
-cleaned, or an error value if not. Note that in order to prevent the folio
-getting mapped back in and redirtied, it needs to be kept locked
-across the entire operation.
-
 ->swap_activate() will be called to prepare the given file for swap.  It
 should perform any validation and preparation necessary to ensure that
 writes can be performed with minimal memory allocation.  It should call
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index eebcc0f9e2bc..b2af9ee6515a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -818,7 +818,6 @@ cache in your filesystem.  The following members are defined:
 		ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 		int (*migrate_folio)(struct mapping *, struct folio *dst,
 				struct folio *src, enum migrate_mode);
-		int (*launder_folio) (struct folio *);
 
 		bool (*is_partially_uptodate) (struct folio *, size_t from,
 					       size_t count);
@@ -1012,11 +1011,6 @@ cache in your filesystem.  The following members are defined:
 	folio to this function.  migrate_folio should transfer any private
 	data across and update any references that it has to the folio.
 
-``launder_folio``
-	Called before freeing a folio - it writes back the dirty folio.
-	To prevent redirtying the folio, it is kept locked during the
-	whole operation.
-
 ``is_partially_uptodate``
 	Called by the VM when reading a file through the pagecache when
 	the underlying blocksize is smaller than the size of the folio.
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 047855033d32..5a943c122d83 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -89,7 +89,6 @@ static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file)
 	bool writing = (rreq->origin == NETFS_READ_FOR_WRITE ||
 			rreq->origin == NETFS_WRITEBACK ||
 			rreq->origin == NETFS_WRITETHROUGH ||
-			rreq->origin == NETFS_LAUNDER_WRITE ||
 			rreq->origin == NETFS_UNBUFFERED_WRITE ||
 			rreq->origin == NETFS_DIO_WRITE);
 
@@ -141,7 +140,6 @@ const struct address_space_operations v9fs_addr_operations = {
 	.dirty_folio		= netfs_dirty_folio,
 	.release_folio		= netfs_release_folio,
 	.invalidate_folio	= netfs_invalidate_folio,
-	.launder_folio		= netfs_launder_folio,
 	.direct_IO		= noop_direct_IO,
 	.writepages		= netfs_writepages,
 };
diff --git a/fs/afs/file.c b/fs/afs/file.c
index ef2cc8f565d2..dfd8f60f5e1f 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -54,7 +54,6 @@ const struct address_space_operations afs_file_aops = {
 	.read_folio	= netfs_read_folio,
 	.readahead	= netfs_readahead,
 	.dirty_folio	= netfs_dirty_folio,
-	.launder_folio	= netfs_launder_folio,
 	.release_folio	= netfs_release_folio,
 	.invalidate_folio = netfs_invalidate_folio,
 	.migrate_folio	= filemap_migrate_folio,
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 6ce5a612937c..b93aa026daa4 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -916,7 +916,6 @@ struct afs_operation {
 			loff_t	pos;
 			loff_t	size;
 			loff_t	i_size;
-			bool	laundering;	/* Laundering page, PG_writeback not set */
 		} store;
 		struct {
 			struct iattr	*attr;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 74402d95a884..1bc26466eb72 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -75,8 +75,7 @@ static void afs_store_data_success(struct afs_operation *op)
 	op->ctime = op->file[0].scb.status.mtime_client;
 	afs_vnode_commit_status(op, &op->file[0]);
 	if (!afs_op_error(op)) {
-		if (!op->store.laundering)
-			afs_pages_written_back(vnode, op->store.pos, op->store.size);
+		afs_pages_written_back(vnode, op->store.pos, op->store.size);
 		afs_stat_v(vnode, n_stores);
 		atomic_long_add(op->store.size, &afs_v2net(vnode)->n_store_bytes);
 	}
@@ -91,8 +90,7 @@ static const struct afs_operation_ops afs_store_data_operation = {
 /*
  * write to a file
  */
-static int afs_store_data(struct afs_vnode *vnode, struct iov_iter *iter, loff_t pos,
-			  bool laundering)
+static int afs_store_data(struct afs_vnode *vnode, struct iov_iter *iter, loff_t pos)
 {
 	struct afs_operation *op;
 	struct afs_wb_key *wbk = NULL;
@@ -123,7 +121,6 @@ static int afs_store_data(struct afs_vnode *vnode, struct iov_iter *iter, loff_t
 	op->file[0].modification = true;
 	op->store.pos = pos;
 	op->store.size = size;
-	op->store.laundering = laundering;
 	op->flags |= AFS_OPERATION_UNINTR;
 	op->ops = &afs_store_data_operation;
 
@@ -168,8 +165,7 @@ static void afs_upload_to_server(struct netfs_io_subrequest *subreq)
 	       subreq->rreq->debug_id, subreq->debug_index, subreq->io_iter.count);
 
 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-	ret = afs_store_data(vnode, &subreq->io_iter, subreq->start,
-			     subreq->rreq->origin == NETFS_LAUNDER_WRITE);
+	ret = afs_store_data(vnode, &subreq->io_iter, subreq->start);
 	netfs_write_subrequest_terminated(subreq, ret < 0 ? ret : subreq->len,
 					  false);
 }
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index abe8028d95bf..6af96c154f81 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1450,11 +1450,9 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 
 		ceph_fscache_invalidate(inode, true);
 
-		ret2 = invalidate_inode_pages2_range(inode->i_mapping,
-					pos >> PAGE_SHIFT,
-					(pos + count - 1) >> PAGE_SHIFT);
+		ret2 = kiocb_invalidate_pages(iocb, count);
 		if (ret2 < 0)
-			doutc(cl, "invalidate_inode_pages2_range returned %d\n",
+			doutc(cl, "kiocb_invalidate_pages returned %d\n",
 			      ret2);
 
 		flags = /* CEPH_OSD_FLAG_ORDERSNAP | */ CEPH_OSD_FLAG_WRITE;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 148a71b8b4d0..43411f0d3ce1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2393,21 +2393,6 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-static int fuse_launder_folio(struct folio *folio)
-{
-	int err = 0;
-	if (folio_clear_dirty_for_io(folio)) {
-		struct inode *inode = folio->mapping->host;
-
-		/* Serialize with pending writeback for the same page */
-		fuse_wait_on_page_writeback(inode, folio->index);
-		err = fuse_writepage_locked(&folio->page);
-		if (!err)
-			fuse_wait_on_page_writeback(inode, folio->index);
-	}
-	return err;
-}
-
 /*
  * Write back dirty data/metadata now (there may not be any suitable
  * open files later for data)
@@ -3227,7 +3212,6 @@ static const struct address_space_operations fuse_file_aops  = {
 	.readahead	= fuse_readahead,
 	.writepage	= fuse_writepage,
 	.writepages	= fuse_writepages,
-	.launder_folio	= fuse_launder_folio,
 	.dirty_folio	= filemap_dirty_folio,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 9a0d32e4b422..4bdd427035de 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -1181,77 +1181,3 @@ int netfs_writepages(struct address_space *mapping,
 	return ret;
 }
 EXPORT_SYMBOL(netfs_writepages);
-
-/*
- * Deal with the disposition of a laundered folio.
- */
-static void netfs_cleanup_launder_folio(struct netfs_io_request *wreq)
-{
-	if (wreq->error) {
-		pr_notice("R=%08x Laundering error %d\n", wreq->debug_id, wreq->error);
-		mapping_set_error(wreq->mapping, wreq->error);
-	}
-}
-
-/**
- * netfs_launder_folio - Clean up a dirty folio that's being invalidated
- * @folio: The folio to clean
- *
- * This is called to write back a folio that's being invalidated when an inode
- * is getting torn down.  Ideally, writepages would be used instead.
- */
-int netfs_launder_folio(struct folio *folio)
-{
-	struct netfs_io_request *wreq;
-	struct address_space *mapping = folio->mapping;
-	struct netfs_folio *finfo = netfs_folio_info(folio);
-	struct netfs_group *group = netfs_folio_group(folio);
-	struct bio_vec bvec;
-	unsigned long long i_size = i_size_read(mapping->host);
-	unsigned long long start = folio_pos(folio);
-	size_t offset = 0, len;
-	int ret = 0;
-
-	if (finfo) {
-		offset = finfo->dirty_offset;
-		start += offset;
-		len = finfo->dirty_len;
-	} else {
-		len = folio_size(folio);
-	}
-	len = min_t(unsigned long long, len, i_size - start);
-
-	wreq = netfs_alloc_request(mapping, NULL, start, len, NETFS_LAUNDER_WRITE);
-	if (IS_ERR(wreq)) {
-		ret = PTR_ERR(wreq);
-		goto out;
-	}
-
-	if (!folio_clear_dirty_for_io(folio))
-		goto out_put;
-
-	trace_netfs_folio(folio, netfs_folio_trace_launder);
-
-	_debug("launder %llx-%llx", start, start + len - 1);
-
-	/* Speculatively write to the cache.  We have to fix this up later if
-	 * the store fails.
-	 */
-	wreq->cleanup = netfs_cleanup_launder_folio;
-
-	bvec_set_folio(&bvec, folio, len, offset);
-	iov_iter_bvec(&wreq->iter, ITER_SOURCE, &bvec, 1, len);
-	__set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
-	ret = netfs_begin_write(wreq, true, netfs_write_trace_launder);
-
-out_put:
-	folio_detach_private(folio);
-	netfs_put_group(group);
-	kfree(finfo);
-	netfs_put_request(wreq, false, netfs_rreq_trace_put_return);
-out:
-	folio_wait_fscache(folio);
-	_leave(" = %d", ret);
-	return ret;
-}
-EXPORT_SYMBOL(netfs_launder_folio);
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
index 5e77618a7940..dd4bbdc1a4d0 100644
--- a/fs/netfs/main.c
+++ b/fs/netfs/main.c
@@ -33,7 +33,6 @@ static const char *netfs_origins[nr__netfs_io_origin] = {
 	[NETFS_READ_FOR_WRITE]		= "RW",
 	[NETFS_WRITEBACK]		= "WB",
 	[NETFS_WRITETHROUGH]		= "WT",
-	[NETFS_LAUNDER_WRITE]		= "LW",
 	[NETFS_UNBUFFERED_WRITE]	= "UW",
 	[NETFS_DIO_READ]		= "DR",
 	[NETFS_DIO_WRITE]		= "DW",
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8577ccf621f5..6efe0af3ba80 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -484,28 +484,6 @@ static void nfs_check_dirty_writeback(struct folio *folio,
 		*dirty = true;
 }
 
-/*
- * Attempt to clear the private state associated with a page when an error
- * occurs that requires the cached contents of an inode to be written back or
- * destroyed
- * - Called if either PG_private or fscache is set on the page
- * - Caller holds page lock
- * - Return 0 if successful, -error otherwise
- */
-static int nfs_launder_folio(struct folio *folio)
-{
-	struct inode *inode = folio->mapping->host;
-	int ret;
-
-	dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n",
-		inode->i_ino, folio_pos(folio));
-
-	folio_wait_fscache(folio);
-	ret = nfs_wb_folio(inode, folio);
-	trace_nfs_launder_folio_done(inode, folio, ret);
-	return ret;
-}
-
 static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 						sector_t *span)
 {
@@ -564,7 +542,6 @@ const struct address_space_operations nfs_file_aops = {
 	.invalidate_folio = nfs_invalidate_folio,
 	.release_folio = nfs_release_folio,
 	.migrate_folio = nfs_migrate_folio,
-	.launder_folio = nfs_launder_folio,
 	.is_dirty_writeback = nfs_check_dirty_writeback,
 	.error_remove_folio = generic_error_remove_folio,
 	.swap_activate = nfs_swap_activate,
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index ebb8d60e1152..898f65784fae 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1162,8 +1162,10 @@ void nfs_file_clear_open_context(struct file *filp)
 		 * We fatal error on write before. Try to writeback
 		 * every page again.
 		 */
-		if (ctx->error < 0)
+		if (ctx->error < 0) {
+			filemap_fdatawrite(inode->i_mapping);
 			invalidate_inode_pages2(inode->i_mapping);
+		}
 		filp->private_data = NULL;
 		put_nfs_open_context_sync(ctx);
 	}
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index afedb449b54f..f0e8c0fb9447 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1039,7 +1039,6 @@ DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
 DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
 
 DEFINE_NFS_FOLIO_EVENT(nfs_invalidate_folio);
-DEFINE_NFS_FOLIO_EVENT_DONE(nfs_launder_folio_done);
 
 TRACE_EVENT(nfs_aop_readahead,
 		TP_PROTO(
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 085912268442..d78876d08175 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -626,7 +626,6 @@ static const struct address_space_operations orangefs_address_operations = {
 	.invalidate_folio = orangefs_invalidate_folio,
 	.release_folio = orangefs_release_folio,
 	.free_folio = orangefs_free_folio,
-	.launder_folio = orangefs_launder_folio,
 	.direct_IO = orangefs_direct_IO,
 };
 
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index f391c9b803d8..f5d5efeba5e9 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2561,64 +2561,6 @@ struct cifs_writedata *cifs_writedata_alloc(work_func_t complete)
 	return wdata;
 }
 
-static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
-{
-	struct address_space *mapping = page->mapping;
-	loff_t offset = (loff_t)page->index << PAGE_SHIFT;
-	char *write_data;
-	int rc = -EFAULT;
-	int bytes_written = 0;
-	struct inode *inode;
-	struct cifsFileInfo *open_file;
-
-	if (!mapping || !mapping->host)
-		return -EFAULT;
-
-	inode = page->mapping->host;
-
-	offset += (loff_t)from;
-	write_data = kmap(page);
-	write_data += from;
-
-	if ((to > PAGE_SIZE) || (from > to)) {
-		kunmap(page);
-		return -EIO;
-	}
-
-	/* racing with truncate? */
-	if (offset > mapping->host->i_size) {
-		kunmap(page);
-		return 0; /* don't care */
-	}
-
-	/* check to make sure that we are not extending the file */
-	if (mapping->host->i_size - offset < (loff_t)to)
-		to = (unsigned)(mapping->host->i_size - offset);
-
-	rc = cifs_get_writable_file(CIFS_I(mapping->host), FIND_WR_ANY,
-				    &open_file);
-	if (!rc) {
-		bytes_written = cifs_write(open_file, open_file->pid,
-					   write_data, to - from, &offset);
-		cifsFileInfo_put(open_file);
-		/* Does mm or vfs already set times? */
-		simple_inode_init_ts(inode);
-		if ((bytes_written > 0) && (offset))
-			rc = 0;
-		else if (bytes_written < 0)
-			rc = bytes_written;
-		else
-			rc = -EFAULT;
-	} else {
-		cifs_dbg(FYI, "No writable handle for write page rc=%d\n", rc);
-		if (!is_retryable_error(rc))
-			rc = -EIO;
-	}
-
-	kunmap(page);
-	return rc;
-}
-
 /*
  * Extend the region to be written back to include subsequent contiguously
  * dirty pages if possible, but don't sleep while doing so.
@@ -3001,47 +2943,6 @@ static int cifs_writepages(struct address_space *mapping,
 	return ret;
 }
 
-static int
-cifs_writepage_locked(struct page *page, struct writeback_control *wbc)
-{
-	int rc;
-	unsigned int xid;
-
-	xid = get_xid();
-/* BB add check for wbc flags */
-	get_page(page);
-	if (!PageUptodate(page))
-		cifs_dbg(FYI, "ppw - page not up to date\n");
-
-	/*
-	 * Set the "writeback" flag, and clear "dirty" in the radix tree.
-	 *
-	 * A writepage() implementation always needs to do either this,
-	 * or re-dirty the page with "redirty_page_for_writepage()" in
-	 * the case of a failure.
-	 *
-	 * Just unlocking the page will cause the radix tree tag-bits
-	 * to fail to update with the state of the page correctly.
-	 */
-	set_page_writeback(page);
-retry_write:
-	rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
-	if (is_retryable_error(rc)) {
-		if (wbc->sync_mode == WB_SYNC_ALL && rc == -EAGAIN)
-			goto retry_write;
-		redirty_page_for_writepage(wbc, page);
-	} else if (rc != 0) {
-		SetPageError(page);
-		mapping_set_error(page->mapping, rc);
-	} else {
-		SetPageUptodate(page);
-	}
-	end_page_writeback(page);
-	put_page(page);
-	free_xid(xid);
-	return rc;
-}
-
 static int cifs_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
@@ -4858,27 +4759,6 @@ static void cifs_invalidate_folio(struct folio *folio, size_t offset,
 	folio_wait_fscache(folio);
 }
 
-static int cifs_launder_folio(struct folio *folio)
-{
-	int rc = 0;
-	loff_t range_start = folio_pos(folio);
-	loff_t range_end = range_start + folio_size(folio);
-	struct writeback_control wbc = {
-		.sync_mode = WB_SYNC_ALL,
-		.nr_to_write = 0,
-		.range_start = range_start,
-		.range_end = range_end,
-	};
-
-	cifs_dbg(FYI, "Launder page: %lu\n", folio->index);
-
-	if (folio_clear_dirty_for_io(folio))
-		rc = cifs_writepage_locked(&folio->page, &wbc);
-
-	folio_wait_fscache(folio);
-	return rc;
-}
-
 void cifs_oplock_break(struct work_struct *work)
 {
 	struct cifsFileInfo *cfile = container_of(work, struct cifsFileInfo,
@@ -5057,7 +4937,6 @@ const struct address_space_operations cifs_addr_ops = {
 	.release_folio = cifs_release_folio,
 	.direct_IO = cifs_direct_io,
 	.invalidate_folio = cifs_invalidate_folio,
-	.launder_folio = cifs_launder_folio,
 	.migrate_folio = filemap_migrate_folio,
 	/*
 	 * TODO: investigate and if useful we could add an is_dirty_writeback
@@ -5080,6 +4959,5 @@ const struct address_space_operations cifs_addr_ops_smallbuf = {
 	.dirty_folio = netfs_dirty_folio,
 	.release_folio = cifs_release_folio,
 	.invalidate_folio = cifs_invalidate_folio,
-	.launder_folio = cifs_launder_folio,
 	.migrate_folio = filemap_migrate_folio,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fbc72c5f112..ded54555ab30 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -432,7 +432,6 @@ struct address_space_operations {
 	 */
 	int (*migrate_folio)(struct address_space *, struct folio *dst,
 			struct folio *src, enum migrate_mode);
-	int (*launder_folio)(struct folio *);
 	bool (*is_partially_uptodate) (struct folio *, size_t from,
 			size_t count);
 	void (*is_dirty_writeback) (struct folio *, bool *dirty, bool *wb);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 100cbb261269..7bd2aeccb299 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -232,7 +232,6 @@ enum netfs_io_origin {
 	NETFS_READ_FOR_WRITE,		/* This read is to prepare a write */
 	NETFS_WRITEBACK,		/* This write was triggered by writepages */
 	NETFS_WRITETHROUGH,		/* This write was made by netfs_perform_write() */
-	NETFS_LAUNDER_WRITE,		/* This is triggered by ->launder_folio() */
 	NETFS_UNBUFFERED_WRITE,		/* This is an unbuffered write */
 	NETFS_DIO_READ,			/* This is a direct I/O read */
 	NETFS_DIO_WRITE,		/* This is a direct I/O write */
@@ -410,7 +409,6 @@ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
 void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
 void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
 bool netfs_release_folio(struct folio *folio, gfp_t gfp);
-int netfs_launder_folio(struct folio *folio);
 
 /* VMA operations API. */
 vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 447a8c21cf57..57ed767f0230 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -25,7 +25,6 @@
 
 #define netfs_write_traces					\
 	EM(netfs_write_trace_dio_write,		"DIO-WRITE")	\
-	EM(netfs_write_trace_launder,		"LAUNDER  ")	\
 	EM(netfs_write_trace_unbuffered_write,	"UNB-WRITE")	\
 	EM(netfs_write_trace_writeback,		"WRITEBACK")	\
 	E_(netfs_write_trace_writethrough,	"WRITETHRU")
@@ -36,7 +35,6 @@
 	EM(NETFS_READ_FOR_WRITE,		"RW")		\
 	EM(NETFS_WRITEBACK,			"WB")		\
 	EM(NETFS_WRITETHROUGH,			"WT")		\
-	EM(NETFS_LAUNDER_WRITE,			"LW")		\
 	EM(NETFS_UNBUFFERED_WRITE,		"UW")		\
 	EM(NETFS_DIO_READ,			"DR")		\
 	E_(NETFS_DIO_WRITE,			"DW")
@@ -131,7 +129,6 @@
 	EM(netfs_folio_trace_end_copy,		"end-copy")	\
 	EM(netfs_folio_trace_filled_gaps,	"filled-gaps")	\
 	EM(netfs_folio_trace_kill,		"kill")		\
-	EM(netfs_folio_trace_launder,		"launder")	\
 	EM(netfs_folio_trace_mkwrite,		"mkwrite")	\
 	EM(netfs_folio_trace_mkwrite_plus,	"mkwrite+")	\
 	EM(netfs_folio_trace_read_gaps,		"read-gaps")	\
diff --git a/mm/truncate.c b/mm/truncate.c
index 725b150e47ac..dab17b926991 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -561,10 +561,10 @@ static int invalidate_complete_folio2(struct address_space *mapping,
 					struct folio *folio)
 {
 	if (folio->mapping != mapping)
-		return 0;
+		return -EBUSY;
 
 	if (!filemap_release_folio(folio, GFP_KERNEL))
-		return 0;
+		return -EBUSY;
 
 	spin_lock(&mapping->host->i_lock);
 	xa_lock_irq(&mapping->i_pages);
@@ -579,20 +579,11 @@ static int invalidate_complete_folio2(struct address_space *mapping,
 	spin_unlock(&mapping->host->i_lock);
 
 	filemap_free_folio(mapping, folio);
-	return 1;
+	return 0;
 failed:
 	xa_unlock_irq(&mapping->i_pages);
 	spin_unlock(&mapping->host->i_lock);
-	return 0;
-}
-
-static int folio_launder(struct address_space *mapping, struct folio *folio)
-{
-	if (!folio_test_dirty(folio))
-		return 0;
-	if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL)
-		return 0;
-	return mapping->a_ops->launder_folio(folio);
+	return -EBUSY;
 }
 
 /**
@@ -657,12 +648,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				unmap_mapping_folio(folio);
 			BUG_ON(folio_mapped(folio));
 
-			ret2 = folio_launder(mapping, folio);
-			if (ret2 == 0) {
-				if (!invalidate_complete_folio2(mapping, folio))
-					ret2 = -EBUSY;
-			}
-			if (ret2 < 0)
+			ret2 = invalidate_complete_folio2(mapping, folio);
+			if (ret2)
 				ret = ret2;
 			folio_unlock(folio);
 		}
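
To make the replacement concrete, a filesystem that previously relied on
->launder_folio() would now flush and wait over the affected range before
invalidating, along these lines (a sketch only, with locking and error
handling elided; "mapping", "start" and "end" stand for the byte range
being invalidated):

	/* Write back any dirty folios in the range and wait for the I/O
	 * to complete, rather than laundering folio by folio under the
	 * folio lock. */
	ret = filemap_write_and_wait_range(mapping, start, end);
	if (ret)
		return ret;

	/* Invalidation no longer launders: it just drops clean folios
	 * and reports -EBUSY for anything it cannot remove. */
	ret = invalidate_inode_pages2_range(mapping,
					    start >> PAGE_SHIFT,
					    end >> PAGE_SHIFT);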



* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-07 10:36 ` David Howells
  2024-03-07 14:54   ` Christoph Hellwig
@ 2024-03-08  7:52   ` Miklos Szeredi
  2024-03-19 14:14   ` David Howells
  2 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2024-03-08  7:52 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Trond Myklebust, Christoph Hellwig,
	Andrew Morton, Alexander Viro, Christian Brauner, Jeff Layton,
	linux-mm, linux-fsdevel, netfs, v9fs, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, devel, linux-kernel

On Thu, 7 Mar 2024 at 11:36, David Howells <dhowells@redhat.com> wrote:

>  (2) invalidate_inode_pages2() is used in some places to effect invalidation
>      of the pagecache in the case where the server tells us that a third party
>      modified the server copy of a file.  What the right behaviour should be
>      here, I'm not sure, but at the moment, any dirty data will get laundered
>      back to the server.  Possibly it should be simply invalidated locally or
>      the user asked how they want to handle the divergence.

Skipping ->launder_page will mean there's a window where the data
*will* be lost, AFAICS.

Of course concurrent cached writes on different hosts against the same
region (the size of which depends on how the caching is done) will
conflict.

But if concurrent writes are to different regions, then they shouldn't
be lost, no?  Without the current ->launder_page thing I don't see how
that could be guaranteed.

Thanks,
Miklos


* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-07 10:36 ` David Howells
  2024-03-07 14:54   ` Christoph Hellwig
  2024-03-08  7:52   ` Miklos Szeredi
@ 2024-03-19 14:14   ` David Howells
  2024-03-19 16:13     ` Miklos Szeredi
  2 siblings, 1 reply; 11+ messages in thread
From: David Howells @ 2024-03-19 14:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Matthew Wilcox, Trond Myklebust, Christoph Hellwig,
	Andrew Morton, Alexander Viro, Christian Brauner, Jeff Layton,
	linux-mm, linux-fsdevel, netfs, v9fs, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, devel, linux-kernel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> >  (2) invalidate_inode_pages2() is used in some places to effect
> >      invalidation of the pagecache in the case where the server tells us
> >      that a third party modified the server copy of a file.  What the
> >      right behaviour should be here, I'm not sure, but at the moment, any
> >      dirty data will get laundered back to the server.  Possibly it should
> >      be simply invalidated locally or the user asked how they want to
> >      handle the divergence.
> 
> Skipping ->launder_page will mean there's a window where the data
> *will* be lost, AFAICS.
> 
> Of course concurrent cached writes on different hosts against the same
> region (the size of which depends on how the caching is done) will
> conflict.

Indeed.  Depending on when and under what circumstances you're using
invalidate_inode_pages2() and co., you *are* going to suffer data loss.

For instance, if you have dirty data on the local host and get an invalidation
notification from the server: if you write just your dirty data back, you may
corrupt the file on the server, losing the third party changes; if you write
back your entire copy of the file, you might avoid corrupting the file, but
completely obliterate the third party changes; if you discard your changes,
you lose those instead, but save the third party changes.

I'm working towards supporting disconnected operation, for which I'll need to
add some sort of user interaction mechanism to let the user say how they want
to handle this.

> But if concurrent writes are to different regions, then they shouldn't
> be lost, no?  Without the current ->launder_page thing I don't see how
> that could be guaranteed.

Define "different regions".  If they're not on the same folios, then why would
they be lost by simply flushing the data before doing the invalidation?  If
they are on different parts of the same folio, all of the above still applies
when you flush the whole folio.

Now, you can mitigate the latter case by keeping track of which bytes changed,
but that still allows you to corrupt the file by writing back just your
particular changes.

And then there's the joker in the deck: mmap.  The main advantage of
invalidate_inode_pages2() is that it forcibly unmaps the page before
laundering it.  However, this doesn't prevent you from then corrupting the
upstream copy by writing the changes back.

What particular usage case of invalidate_inode_pages2() are you thinking of?

DIO read/write can only be best effort: flush, invalidate, then do the DIO,
which may bring the buffers back in because they're mmapped.  In that case,
doing a flush and a non-laundering invalidate that leaves dirty pages in place
ought to be fine.
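
Concretely, that prologue would look something like this (a sketch; iocb
and iter are the usual kiocb/iov_iter pair, and do_the_dio() is a
stand-in for the filesystem's actual DIO path):

	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t start = iocb->ki_pos;
	loff_t end = start + iov_iter_count(iter) - 1;
	int ret;

	/* Flush any dirty pagecache over the range and wait for it. */
	ret = filemap_write_and_wait_range(mapping, start, end);
	if (ret)
		return ret;

	/* Best effort only: an mmapped write can repopulate the range
	 * the moment a folio is dropped, so failure here isn't fatal. */
	invalidate_inode_pages2_range(mapping, start >> PAGE_SHIFT,
				      end >> PAGE_SHIFT);

	return do_the_dio(iocb, iter);	/* hypothetical */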

David



* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-19 14:14   ` David Howells
@ 2024-03-19 16:13     ` Miklos Szeredi
  2024-03-19 16:40       ` Miklos Szeredi
  0 siblings, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2024-03-19 16:13 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Trond Myklebust, Christoph Hellwig,
	Andrew Morton, Alexander Viro, Christian Brauner, Jeff Layton,
	linux-mm, linux-fsdevel, netfs, v9fs, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, devel, linux-kernel

On Tue, 19 Mar 2024 at 15:15, David Howells <dhowells@redhat.com> wrote:

> What particular usage case of invalidate_inode_pages2() are you thinking of?

FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
to clean up the cache.
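
For reference, that path boils down to roughly the following sketch, with
the inode lookup, attribute invalidation and the whole-file "len <= 0"
case elided:

	pgoff_t first = offset >> PAGE_SHIFT;
	pgoff_t last = (offset + len - 1) >> PAGE_SHIFT;

	invalidate_inode_pages2_range(inode->i_mapping, first, last);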

The server is free to discard writes resulting from this invalidation
and delay reads in the region until the invalidation finishes.  This
would no longer work with your change, since the mapping could
silently be reinstated between the writeback and the removal from the
cache due to the page being unlocked/relocked.

I'm not saying such a filesystem actually exists, but it's a
theoretical possibility.  And maybe there are cases which I haven't
thought of.

Thanks,
Miklos


* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-19 16:13     ` Miklos Szeredi
@ 2024-03-19 16:40       ` Miklos Szeredi
  2024-03-19 16:59         ` Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2024-03-19 16:40 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Trond Myklebust, Christoph Hellwig,
	Andrew Morton, Alexander Viro, Christian Brauner, Jeff Layton,
	linux-mm, linux-fsdevel, netfs, v9fs, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, linux-kernel

On Tue, 19 Mar 2024 at 17:13, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 19 Mar 2024 at 15:15, David Howells <dhowells@redhat.com> wrote:
>
> > What particular usage case of invalidate_inode_pages2() are you thinking of?
>
> FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
> to clean up the cache.
>
> The server is free to discard writes resulting from this invalidation
> and delay reads in the region until the invalidation finishes.  This
> would no longer work with your change, since the mapping could
> silently be reinstated between the writeback and the removal from the
> cache due to the page being unlocked/relocked.

This would also matter if a distributed filesystem wanted to implement
coherence even in the presence of mmaps.  I.e. a client could get exclusive
access to a region by issuing FUSE_NOTIFY_INVAL_INODE on all other
clients and blocking reads.  With your change this would fail.

Again, this is purely theoretical, and without a way to differentiate
between the read-only and write cases it has limited usefulness.
Adding leases to fuse (which I plan to do) would make this much more
useful.

Thanks,
Miklos


* Re: [RFC PATCH] mm: Replace ->launder_folio() with flush and wait
  2024-03-19 16:40       ` Miklos Szeredi
@ 2024-03-19 16:59         ` Bernd Schubert
  0 siblings, 0 replies; 11+ messages in thread
From: Bernd Schubert @ 2024-03-19 16:59 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: Matthew Wilcox, Trond Myklebust, Christoph Hellwig,
	Andrew Morton, Alexander Viro, Christian Brauner, Jeff Layton,
	linux-mm, linux-fsdevel, netfs, v9fs, linux-afs, ceph-devel,
	linux-cifs, linux-nfs, linux-kernel



On 3/19/24 17:40, Miklos Szeredi wrote:
> On Tue, 19 Mar 2024 at 17:13, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>
>> On Tue, 19 Mar 2024 at 15:15, David Howells <dhowells@redhat.com> wrote:
>>
>>> What particular usage case of invalidate_inode_pages2() are you thinking of?
>>
>> FUSE_NOTIFY_INVAL_INODE will trigger invalidate_inode_pages2_range()
>> to clean up the cache.
>>
>> The server is free to discard writes resulting from this invalidation
>> and delay reads in the region until the invalidation finishes.  This
>> would no longer work with your change, since the mapping could
>> silently be reinstated between the writeback and the removal from the
>> cache due to the page being unlocked/relocked.
> 
> This would also matter if a distributed filesystem wanted to implement
> coherence even if there are mmaps.   I.e. a client could get exclusive
> access to a region by issuing FUSE_NOTIFY_INVAL_INODE on all other
> clients and blocking reads.  With your change this would fail.
> 
> Again, this is purely theoretical, and without a way to differentiate
> between the read-only and write cases it has limited usefulness.
> Adding leases to fuse (which I plan to do) would make this much more
> useful.

Thanks, Miklos!  FYI, we are actually planning to extend fuse
notifications from inode to page ranges.


Thanks,
Bernd



Thread overview: 11+ messages
2024-03-06 22:39 [RFC PATCH] mm: Replace ->launder_folio() with flush and wait David Howells
2024-03-07  4:45 ` Matthew Wilcox
2024-03-07  8:26 ` David Howells
2024-03-07 10:36 ` David Howells
2024-03-07 14:54   ` Christoph Hellwig
2024-03-08  7:52   ` Miklos Szeredi
2024-03-19 14:14   ` David Howells
2024-03-19 16:13     ` Miklos Szeredi
2024-03-19 16:40       ` Miklos Szeredi
2024-03-19 16:59         ` Bernd Schubert
2024-03-07 16:15 ` [RFC PATCH v2] mm: Kill ->launder_folio() David Howells
