* [PATCH 0/3] remove dependence on inode_congested()
@ 2022-01-31  4:03 NeilBrown
  2022-01-31  4:03 ` [PATCH 1/3] fuse: remove reliance on bdi congestion NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: NeilBrown @ 2022-01-31  4:03 UTC (permalink / raw)
  To: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker
  Cc: linux-mm, linux-nfs, linux-fsdevel, ceph-devel, linux-kernel

Miklos pointed out that the filesystems which set the bdi congestion
flags do gain some value from that, and simply removing the code is not
appropriate.

Specifically, readahead and/or writeback are skipped when the congestion
flags are set.

We can mostly move this skipping into the filesystem.
->readahead can do nothing if reads are congested.
->writepage and ->writepages can do nothing for WB_SYNC_NONE if writes
  are congested.
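
For illustration, the common shape of the change is roughly the sketch
below (the example_* names are made up; each filesystem uses its own
private congestion test rather than a shared one):

static int example_writepages(struct address_space *mapping,
                              struct writeback_control *wbc)
{
        struct example_fs_info *fsi = mapping->host->i_sb->s_fs_info;

        /* Best-effort writeback: skip entirely while congested. */
        if (wbc->sync_mode == WB_SYNC_NONE && fsi->write_congested)
                return 0;

        /* WB_SYNC_ALL (fsync, sync, umount) always goes ahead. */
        return example_do_writepages(mapping, wbc);
}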

Currently only *some* WB_SYNC_NONE writes are skipped due to congestion.
Those from sync_file_range() and those used for page migration are not.
Also, shrink_page_list() will now cause PageActive to be set if
->writepage skips due to congestion.

I don't expect these changes to be a problem, but I have no experience
to base that on.

Review/comments most welcome,

Thanks,
NeilBrown



---

NeilBrown (3):
      fuse: remove reliance on bdi congestion
      nfs: remove reliance on bdi congestion
      ceph: remove reliance on bdi congestion


 fs/ceph/addr.c            | 22 +++++++++++++---------
 fs/ceph/super.c           |  1 +
 fs/ceph/super.h           |  1 +
 fs/nfs/write.c            | 12 ++++++++++--
 include/linux/nfs_fs_sb.h |  1 +
 5 files changed, 26 insertions(+), 11 deletions(-)

--
Signature



* [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31  4:03 [PATCH 0/3] remove dependence on inode_congested() NeilBrown
@ 2022-01-31  4:03 ` NeilBrown
  2022-01-31  4:28   ` Matthew Wilcox
  2022-01-31  4:03 ` [PATCH 3/3] ceph: " NeilBrown
  2022-01-31  4:03 ` [PATCH 2/3] nfs: " NeilBrown
  2 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2022-01-31  4:03 UTC (permalink / raw)
  To: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker
  Cc: linux-mm, linux-nfs, linux-fsdevel, ceph-devel, linux-kernel

The bdi congestion tracking is not widely used and will be removed.

Fuse is one of a small number of filesystems that use it, setting both
the sync (read) and async (write) congestion flags at what it determines
are appropriate times.

The only remaining effect of the sync flag is to cause read-ahead to be
skipped.
The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flags, change:
 - .readahead to do nothing if the flag would be set
 - .writepages to do nothing if WB_SYNC_NONE and the flag would be set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag would be set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.
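
For reference, the path in question, heavily abridged from
mm/vmscan.c:pageout() as I read it (error handling and accounting
omitted):

        /* reclaim always calls ->writepage with WB_SYNC_NONE */
        struct writeback_control wbc = {
                .sync_mode = WB_SYNC_NONE,
                .nr_to_write = SWAP_CLUSTER_MAX,
        };
        int res = mapping->a_ops->writepage(page, &wbc);

        if (res == AOP_WRITEPAGE_ACTIVATE) {
                ClearPageReclaim(page);
                /* shrink_page_list() turns PAGE_ACTIVATE into
                 * SetPageActive() on the page. */
                return PAGE_ACTIVATE;
        }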

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/fuse/control.c |   17 -----------------
 fs/fuse/dax.c     |    3 +++
 fs/fuse/dev.c     |    8 --------
 fs/fuse/file.c    |   11 +++++++++++
 4 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 000d2e5627e9..7cede9a3bc96 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -164,7 +164,6 @@ static ssize_t fuse_conn_congestion_threshold_write(struct file *file,
 {
 	unsigned val;
 	struct fuse_conn *fc;
-	struct fuse_mount *fm;
 	ssize_t ret;
 
 	ret = fuse_conn_limit_write(file, buf, count, ppos, &val,
@@ -178,22 +177,6 @@ static ssize_t fuse_conn_congestion_threshold_write(struct file *file,
 	down_read(&fc->killsb);
 	spin_lock(&fc->bg_lock);
 	fc->congestion_threshold = val;
-
-	/*
-	 * Get any fuse_mount belonging to this fuse_conn; s_bdi is
-	 * shared between all of them
-	 */
-
-	if (!list_empty(&fc->mounts)) {
-		fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
-		if (fc->num_background < fc->congestion_threshold) {
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		} else {
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
-	}
 	spin_unlock(&fc->bg_lock);
 	up_read(&fc->killsb);
 	fuse_conn_put(fc);
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 182b24a14804..5f74e2585f50 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -781,6 +781,9 @@ static int fuse_dax_writepages(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return 0;
 	return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);
 }
 
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd54a529460d..e1b4a846c90d 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -315,10 +315,6 @@ void fuse_request_end(struct fuse_req *req)
 				wake_up(&fc->blocked_waitq);
 		}
 
-		if (fc->num_background == fc->congestion_threshold && fm->sb) {
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			clear_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
 		fc->num_background--;
 		fc->active_background--;
 		flush_bg_queue(fc);
@@ -540,10 +536,6 @@ static bool fuse_request_queue_background(struct fuse_req *req)
 		fc->num_background++;
 		if (fc->num_background == fc->max_background)
 			fc->blocked = 1;
-		if (fc->num_background == fc->congestion_threshold && fm->sb) {
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_SYNC);
-			set_bdi_congested(fm->sb->s_bdi, BLK_RW_ASYNC);
-		}
 		list_add_tail(&req->list, &fc->bg_queue);
 		flush_bg_queue(fc);
 		queued = true;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 829094451774..b22a948be422 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
 
 	if (fuse_is_bad(inode))
 		return;
+	if (fc->num_background >= fc->congestion_threshold)
+		return;
 
 	max_pages = min_t(unsigned int, fc->max_pages,
 			fc->max_read / PAGE_SIZE);
@@ -1958,6 +1960,7 @@ static int fuse_writepage_locked(struct page *page)
 
 static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct fuse_conn *fc = get_fuse_conn(page->mapping->host);
 	int err;
 
 	if (fuse_page_is_writeback(page->mapping->host, page->index)) {
@@ -1973,6 +1976,10 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 		return 0;
 	}
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	err = fuse_writepage_locked(page);
 	unlock_page(page);
 
@@ -2226,6 +2233,10 @@ static int fuse_writepages(struct address_space *mapping,
 	if (fuse_is_bad(inode))
 		goto out;
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fc->num_background >= fc->congestion_threshold)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	data.inode = inode;
 	data.wpa = NULL;
 	data.ff = NULL;




* [PATCH 2/3] nfs: remove reliance on bdi congestion
  2022-01-31  4:03 [PATCH 0/3] remove dependence on inode_congested() NeilBrown
  2022-01-31  4:03 ` [PATCH 1/3] fuse: remove reliance on bdi congestion NeilBrown
  2022-01-31  4:03 ` [PATCH 3/3] ceph: " NeilBrown
@ 2022-01-31  4:03 ` NeilBrown
  2022-01-31  4:22   ` Matthew Wilcox
  2 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2022-01-31  4:03 UTC (permalink / raw)
  To: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker
  Cc: linux-mm, linux-nfs, linux-fsdevel, ceph-devel, linux-kernel

The bdi congestion tracking is not widely used and will be removed.

NFS is one of a small number of filesystems that use it, setting just
the async (write) congestion flag at what it determines are appropriate
times.

The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flag, set an internal flag and change:
 - .writepages to do nothing if WB_SYNC_NONE and the flag is set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag is set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.
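
Note that NFS_CONGESTION_ON_THRESH is higher than
NFS_CONGESTION_OFF_THRESH (if I read fs/nfs/write.c correctly, OFF is a
quarter below ON), so the flag has hysteresis and will not flap when the
writeback count hovers near the boundary.  A stand-alone illustration
with invented numbers:

#include <stdbool.h>
#include <stdio.h>

#define ON_THRESH  64           /* invented; the kernel derives its */
#define OFF_THRESH 48           /* thresholds from nfs_congestion_kb */

int main(void)
{
        long writeback = 0;     /* pages currently under writeback */
        bool congested = false;
        int deltas[] = { +70, -10, -20, +5, -40 };
        unsigned int i;

        for (i = 0; i < sizeof(deltas) / sizeof(deltas[0]); i++) {
                writeback += deltas[i];
                if (writeback > ON_THRESH)
                        congested = true;
                else if (writeback < OFF_THRESH)
                        congested = false;
                /* at writeback=60 the flag stays set: hysteresis */
                printf("writeback=%ld congested=%d\n", writeback, congested);
        }
        return 0;
}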

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/write.c            |   12 ++++++++++--
 include/linux/nfs_fs_sb.h |    1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 987a187bd39a..b7c6721dd36d 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -417,7 +417,7 @@ static void nfs_set_page_writeback(struct page *page)
 
 	if (atomic_long_inc_return(&nfss->writeback) >
 			NFS_CONGESTION_ON_THRESH)
-		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		nfss->write_congested = 1;
 }
 
 static void nfs_end_page_writeback(struct nfs_page *req)
@@ -433,7 +433,7 @@ static void nfs_end_page_writeback(struct nfs_page *req)
 
 	end_page_writeback(req->wb_page);
 	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		nfss->write_congested = 0;
 }
 
 /*
@@ -672,6 +672,10 @@ static int nfs_writepage_locked(struct page *page,
 	struct inode *inode = page_file_mapping(page)->host;
 	int err;
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    NFS_SERVER(inode)->write_congested)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
 	nfs_pageio_init_write(&pgio, inode, 0,
 				false, &nfs_async_write_completion_ops);
@@ -719,6 +723,10 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	int priority = 0;
 	int err;
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    NFS_SERVER(inode)->write_congested)
+		return 0;
+
 	nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
 
 	if (!(mntflags & NFS_MOUNT_WRITE_EAGER) || wbc->for_kupdate ||
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index ca0959e51e81..6aa2a200676a 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -138,6 +138,7 @@ struct nfs_server {
 	struct nlm_host		*nlm_host;	/* NLM client handle */
 	struct nfs_iostats __percpu *io_stats;	/* I/O statistics */
 	atomic_long_t		writeback;	/* number of writeback pages */
+	unsigned int		write_congested;/* flag set when writeback gets too high */
 	unsigned int		flags;		/* various flags */
 
 /* The following are for internal use only. Also see uapi/linux/nfs_mount.h */




* [PATCH 3/3] ceph: remove reliance on bdi congestion
  2022-01-31  4:03 [PATCH 0/3] remove dependence on inode_congested() NeilBrown
  2022-01-31  4:03 ` [PATCH 1/3] fuse: remove reliance on bdi congestion NeilBrown
@ 2022-01-31  4:03 ` NeilBrown
  2022-01-31  4:03 ` [PATCH 2/3] nfs: " NeilBrown
  2 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2022-01-31  4:03 UTC (permalink / raw)
  To: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker
  Cc: linux-mm, linux-nfs, linux-fsdevel, ceph-devel, linux-kernel

The bdi congestion tracking is not widely used and will be removed.

CEPHfs is one of a small number of filesystems that use it, setting
just the async (write) congestion flag at what it determines are
appropriate times.

The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flag, set an internal flag and change:
 - .writepages to do nothing if WB_SYNC_NONE and the flag is set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag is set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ceph/addr.c  |   22 +++++++++++++---------
 fs/ceph/super.c |    1 +
 fs/ceph/super.h |    1 +
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index c98e5238a1b6..dc7af34640dd 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -563,7 +563,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 
 	if (atomic_long_inc_return(&fsc->writeback_count) >
 	    CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb))
-		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		fsc->write_congested = true;
 
 	req = ceph_osdc_new_request(osdc, &ci->i_layout, ceph_vino(inode), page_off, &len, 0, 1,
 				    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE, snapc,
@@ -623,7 +623,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 
 	if (atomic_long_dec_return(&fsc->writeback_count) <
 	    CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb))
-		clear_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
+		fsc->write_congested = false;
 
 	return err;
 }
@@ -635,6 +635,10 @@ static int ceph_writepage(struct page *page, struct writeback_control *wbc)
 	BUG_ON(!inode);
 	ihold(inode);
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    ceph_inode_to_client(inode)->write_congested)
+		return AOP_WRITEPAGE_ACTIVATE;
+
 	wait_on_page_fscache(page);
 
 	err = writepage_nounlock(page, wbc);
@@ -707,8 +711,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 			if (atomic_long_dec_return(&fsc->writeback_count) <
 			     CONGESTION_OFF_THRESH(
 					fsc->mount_options->congestion_kb))
-				clear_bdi_congested(inode_to_bdi(inode),
-						    BLK_RW_ASYNC);
+				fsc->write_congested = false;
 
 			ceph_put_snap_context(detach_page_private(page));
 			end_page_writeback(page);
@@ -760,6 +763,10 @@ static int ceph_writepages_start(struct address_space *mapping,
 	bool done = false;
 	bool caching = ceph_is_cache_enabled(inode);
 
+	if (wbc->sync_mode == WB_SYNC_NONE &&
+	    fsc->write_congested)
+		return 0;
+
 	dout("writepages_start %p (mode=%s)\n", inode,
 	     wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
 	     (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
@@ -954,11 +961,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 			if (atomic_long_inc_return(&fsc->writeback_count) >
 			    CONGESTION_ON_THRESH(
-				    fsc->mount_options->congestion_kb)) {
-				set_bdi_congested(inode_to_bdi(inode),
-						  BLK_RW_ASYNC);
-			}
-
+				    fsc->mount_options->congestion_kb))
+				fsc->write_congested = true;
 
 			pages[locked_pages++] = page;
 			pvec.pages[i] = NULL;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index bf79f369aec6..4a3b77d049c7 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -802,6 +802,7 @@ static struct ceph_fs_client *create_fs_client(struct ceph_mount_options *fsopt,
 	fsc->have_copy_from2 = true;
 
 	atomic_long_set(&fsc->writeback_count, 0);
+	fsc->write_congested = false;
 
 	err = -ENOMEM;
 	/*
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 67f145e1ae7a..0bd97aea2319 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -121,6 +121,7 @@ struct ceph_fs_client {
 	struct ceph_mds_client *mdsc;
 
 	atomic_long_t writeback_count;
+	bool write_congested;
 
 	struct workqueue_struct *inode_wq;
 	struct workqueue_struct *cap_wq;




* Re: [PATCH 2/3] nfs: remove reliance on bdi congestion
  2022-01-31  4:03 ` [PATCH 2/3] nfs: " NeilBrown
@ 2022-01-31  4:22   ` Matthew Wilcox
  2022-01-31  4:55     ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-01-31  4:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
>  - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
>     and the flag is set.

Is this actually useful?  I ask because Dave Chinner believes
the call to ->writepage in vmscan to be essentially unused.
See commit 21b4ee7029c9, and I had a followup discussion with him
on IRC:

<willy> dchinner: did you gather any stats on how often ->writepage was
	being called by pageout() before "xfs: drop ->writepage completely"
	was added?
<dchinner> willy: Never saw it on XFS in 3 years in my test environment...
<dchinner> I don't ever recall seeing the memory reclaim guards we put on
	->writepage in XFS ever firing - IIRC they'd been there for the best
	part of a decade.
<willy> not so much the WARN_ON firing but the case where it actually calls
	iomap_writepage
<dchinner> willy: I mean both - I was running with a local patch that warned
	on writepage for a long time, regardless of where it was called from.

I can believe things are different for a network filesystem, or maybe
XFS does background writeback better than other filesystems, but it
would be intriguing to be able to get rid of ->writepage altogether
(or at least from pageout(); migrate.c may be a thornier proposition).


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31  4:03 ` [PATCH 1/3] fuse: remove reliance on bdi congestion NeilBrown
@ 2022-01-31  4:28   ` Matthew Wilcox
  2022-01-31  4:47     ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-01-31  4:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index 182b24a14804..5f74e2585f50 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -781,6 +781,9 @@ static int fuse_dax_writepages(struct address_space *mapping,
>  	struct inode *inode = mapping->host;
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  
> +	if (wbc->sync_mode == WB_SYNC_NONE &&
> +	    fc->num_background >= fc->congestion_threshold)
> +		return 0;
>  	return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);

This makes no sense.  Doing writeback for DAX means flushing the
CPU cache (in a terribly inefficient way), but it's not going to
be doing anything in the background; it's a sync operation.

> +++ b/fs/fuse/file.c
> @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
>  
>  	if (fuse_is_bad(inode))
>  		return;
> +	if (fc->num_background >= fc->congestion_threshold)
> +		return;

This seems like a bad idea to me.  If we don't even start reads on
readahead pages, they'll get ->readpage called on them one at a time
and the reading thread will block.  It's going to lead to some nasty
performance problems, exactly when you don't want them.  Better to
queue the reads internally and wait for congestion to ease before
submitting the read.



* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31  4:28   ` Matthew Wilcox
@ 2022-01-31  4:47     ` NeilBrown
  2022-01-31 10:21       ` Miklos Szeredi
  2022-01-31 13:12       ` Matthew Wilcox
  0 siblings, 2 replies; 17+ messages in thread
From: NeilBrown @ 2022-01-31  4:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> > diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> > index 182b24a14804..5f74e2585f50 100644
> > --- a/fs/fuse/dax.c
> > +++ b/fs/fuse/dax.c
> > @@ -781,6 +781,9 @@ static int fuse_dax_writepages(struct address_space *mapping,
> >  	struct inode *inode = mapping->host;
> >  	struct fuse_conn *fc = get_fuse_conn(inode);
> >  
> > +	if (wbc->sync_mode == WB_SYNC_NONE &&
> > +	    fc->num_background >= fc->congestion_threshold)
> > +		return 0;
> >  	return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);
> 
> This makes no sense.  Doing writeback for DAX means flushing the
> CPU cache (in a terribly inefficient way), but it's not going to
> be doing anything in the background; it's a sync operation.

Fair enough ...  I was just being consistent.  I didn't wonder if dax
might be a bit special, but figured the change couldn't hurt.


> 
> > +++ b/fs/fuse/file.c
> > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> >  
> >  	if (fuse_is_bad(inode))
> >  		return;
> > +	if (fc->num_background >= fc->congestion_threshold)
> > +		return;
> 
> This seems like a bad idea to me.  If we don't even start reads on
> readahead pages, they'll get ->readpage called on them one at a time
> and the reading thread will block.  It's going to lead to some nasty
> performance problems, exactly when you don't want them.  Better to
> queue the reads internally and wait for congestion to ease before
> submitting the read.
> 

Isn't that exactly what happens now? page_cache_async_ra() sees that
inode_read_congested() returns true, so it doesn't start readahead.
???

NeilBrown


* Re: [PATCH 2/3] nfs: remove reliance on bdi congestion
  2022-01-31  4:22   ` Matthew Wilcox
@ 2022-01-31  4:55     ` NeilBrown
  2022-01-31 13:15       ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2022-01-31  4:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> >  - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
> >     and the flag is set.
> 
> Is this actually useful?  I ask because Dave Chinner believes
> the call to ->writepage in vmscan to be essentially unused.

He would be wrong ...  unless "essentially" means "mostly" rather than
"totally".
swap-out to NFS results in that ->writepage call.

Of course swap_writepage ignores sync_mode, so this might not be
entirely relevant.

But the main point of the patch is not to return AOP_WRITEPAGE_ACTIVATE
to vmscan.  It is to avoid writing at all when WB_SYNC_NONE and
congested.  e.g. for POSIX_FADV_DONTNEED
It is also to allow the removal of congestion tracking with minimal
changes to behaviour.

If I end up changing some dead code into different dead code, I really
don't care.  I'm not here to clean up all dead code - only the dead code
specifically related to congestion.

NeilBrown


> See commit 21b4ee7029c9, and I had a followup discussion with him
> on IRC:
> 
> <willy> dchinner: did you gather any stats on how often ->writepage was
> 	being called by pageout() before "xfs: drop ->writepage completely"
> 	was added?
> <dchinner> willy: Never saw it on XFS in 3 years in my test environment...
> <dchinner> I don't ever recall seeing the memory reclaim guards we put on
> 	->writepage in XFS ever firing - IIRC they'd been there for the best
> 	part of a decade.
> <willy> not so much the WARN_ON firing but the case where it actually calls
> 	iomap_writepage
> <dchinner> willy: I mean both - I was running with a local patch that warned
> 	on writepage for a long time, regardless of where it was called from.
> 
> I can believe things are different for a network filesystem, or maybe
> XFS does background writeback better than other filesystems, but it
> would be intriguing to be able to get rid of ->writepage altogether
> (or at least from pageout(); migrate.c may be a thornier proposition).
> 
> 


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31  4:47     ` NeilBrown
@ 2022-01-31 10:21       ` Miklos Szeredi
  2022-01-31 13:12       ` Matthew Wilcox
  1 sibling, 0 replies; 17+ messages in thread
From: Miklos Szeredi @ 2022-01-31 10:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Matthew Wilcox, Andrew Morton, Jeff Layton, Ilya Dryomov,
	Trond Myklebust, Anna Schumaker, linux-mm, Linux NFS list,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, 31 Jan 2022 at 05:47, NeilBrown <neilb@suse.de> wrote:

> > > +++ b/fs/fuse/file.c
> > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > >
> > >     if (fuse_is_bad(inode))
> > >             return;
> > > +   if (fc->num_background >= fc->congestion_threshold)
> > > +           return;
> >
> > This seems like a bad idea to me.  If we don't even start reads on
> > readahead pages, they'll get ->readpage called on them one at a time
> > and the reading thread will block.  It's going to lead to some nasty
> > performance problems, exactly when you don't want them.  Better to
> > queue the reads internally and wait for congestion to ease before
> > submitting the read.
> >
>
> Isn't that exactly what happens now? page_cache_async_ra() sees that
> inode_read_congested() returns true, so it doesn't start readahead.
> ???

I agree.

Fuse throttles async requests even before allocating them, which
precludes placing them on any queue.  I guess it was done to limit the
amount of kernel memory pinned by a task (sync requests allow just one
request per task).

This has worked well, and I haven't heard complaints about performance
loss due to readahead throttling.

Thanks,
Miklos


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31  4:47     ` NeilBrown
  2022-01-31 10:21       ` Miklos Szeredi
@ 2022-01-31 13:12       ` Matthew Wilcox
  2022-01-31 23:00         ` NeilBrown
  1 sibling, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-01-31 13:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > +++ b/fs/fuse/file.c
> > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > >  
> > >  	if (fuse_is_bad(inode))
> > >  		return;
> > > +	if (fc->num_background >= fc->congestion_threshold)
> > > +		return;
> > 
> > This seems like a bad idea to me.  If we don't even start reads on
> > readahead pages, they'll get ->readpage called on them one at a time
> > and the reading thread will block.  It's going to lead to some nasty
> > performance problems, exactly when you don't want them.  Better to
> > queue the reads internally and wait for congestion to ease before
> > submitting the read.
> > 
> 
> Isn't that exactly what happens now? page_cache_async_ra() sees that
> inode_read_congested() returns true, so it doesn't start readahead.
> ???

It's rather different.  Imagine the readahead window has expanded to
256kB (64 pages).  Today, we see congestion and don't do anything.
That means we miss the async readahead opportunity, find a missing
page and end up calling into page_cache_sync_ra(), by which time
we may or may not be congested.

If the inode_read_congested() in page_cache_async_ra() is removed and
the patch above is added to replace it, we'll allocate those 64 pages and
add them to the page cache.  But then we'll return without starting IO.
When we hit one of those !uptodate pages, we'll call ->readpage on it,
but we won't do anything to the other 63 pages.  So we'll go through a
protracted slow period of sending 64 reads, one at a time, whether or
not congestion has eased.  Then we'll hit a missing page and proceed
to the sync ra case as above.

(I'm assuming this is a workload which does a linear scan and so
readahead is actually effective)
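
(For concreteness, the fallback path I'm describing looks roughly like
the sketch below; the real code in mm/filemap.c is folio-based and far
more careful about locking and refcounts:)

static struct page *read_one_page(struct address_space *mapping,
                                  struct file_ra_state *ra,
                                  struct file *file, pgoff_t index)
{
        struct page *page = find_get_page(mapping, index);

        if (!page) {
                /* cache miss: kick off synchronous readahead */
                page_cache_sync_readahead(mapping, ra, file, index, 1);
                page = find_get_page(mapping, index);
                /* NULL handling omitted; real code allocates here */
        } else if (PageReadahead(page)) {
                /* hit the PG_readahead marker: async readahead */
                page_cache_async_readahead(mapping, ra, file, page,
                                           index, 1);
        }
        if (page && !PageUptodate(page)) {
                lock_page(page);
                if (!PageUptodate(page))
                        /* readahead skipped or failed: read just this
                         * one page; ->readpage unlocks on completion */
                        mapping->a_ops->readpage(file, page);
                else
                        unlock_page(page);
                wait_on_page_locked(page);
        }
        return page;
}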


* Re: [PATCH 2/3] nfs: remove reliance on bdi congestion
  2022-01-31  4:55     ` NeilBrown
@ 2022-01-31 13:15       ` Matthew Wilcox
  2022-01-31 21:38         ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-01-31 13:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Mon, Jan 31, 2022 at 03:55:22PM +1100, NeilBrown wrote:
> On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> > >  - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
> > >     and the flag is set.
> > 
> > Is this actually useful?  I ask because Dave Chinner believes
> > the call to ->writepage in vmscan to be essentially unused.
> 
> He would be wrong ...  unless "essentially" means "mostly" rather than
> "totally".
> swap-out to NFS results in that ->writepage call.

For writes, SWP_FS_OPS uses ->direct_IO, not ->writepage.  Confused.

> Of course swap_writepage ignores sync_mode, so this might not be
> entirely relevant.
> 
> But the main point of the patch is not to return AOP_WRITEPAGE_ACTIVATE
> to vmscan.  It is to avoid writing at all when WB_SYNC_NONE and
> congested.  e.g. for POSIX_FADV_DONTNEED
> It is also to allow the removal of congestion tracking with minimal
> changes to behaviour.
> 
> If I end up changing some dead code into different dead code, I really
> don't care.  I'm not here to clean up all dead code - only the dead code
> specifically related to congestion.
> 
> NeilBrown
> 
> 
> > See commit 21b4ee7029c9, and I had a followup discussion with him
> > on IRC:
> > 
> > <willy> dchinner: did you gather any stats on how often ->writepage was
> > 	being called by pageout() before "xfs: drop ->writepage completely"
> > 	was added?
> > <dchinner> willy: Never saw it on XFS in 3 years in my test environment...
> > <dchinner> I don't ever recall seeing the memory reclaim guards we put on
> > 	->writepage in XFS ever firing - IIRC they'd been there for the best
> > 	part of a decade.
> > <willy> not so much the WARN_ON firing but the case where it actually calls
> > 	iomap_writepage
> > <dchinner> willy: I mean both - I was running with a local patch that warned
> > 	on writepage for a long time, regardless of where it was called from.
> > 
> > I can believe things are different for a network filesystem, or maybe
> > XFS does background writeback better than other filesystems, but it
> > would be intriguing to be able to get rid of ->writepage altogether
> > (or at least from pageout(); migrate.c may be a thornier proposition).
> > 
> > 


* Re: [PATCH 2/3] nfs: remove reliance on bdi congestion
  2022-01-31 13:15       ` Matthew Wilcox
@ 2022-01-31 21:38         ` NeilBrown
  0 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2022-01-31 21:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:55:22PM +1100, NeilBrown wrote:
> > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> > > >  - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
> > > >     and the flag is set.
> > > 
> > > Is this actually useful?  I ask because Dave Chinner believes
> > > the call to ->writepage in vmscan to be essentially unused.
> > 
> > He would be wrong ...  unless "essentially" means "mostly" rather than
> > "totally".
> > swap-out to NFS results in that ->writepage call.
> 
> For writes, SWP_FS_OPS uses ->direct_IO, not ->writepage.  Confused.
> 

I shouldn't have mentioned NFS - that is an irrelevant distraction.

The "call to ->writepage in vmscan" is used, at least for swap.
For swapout it is the ->writepage from swap_aops, not the ->writepage of
any filesystem.  This is swap_writepage(), and for SWP_FS_OPS that maps
to a ->direct_IO call.

Dave may well be right that the ->writepage in vmscan never calls
xfs_writepage or many others.

To get to the ->writepage of a filesystem it would need to be called
from kswapd.  You would need to have no swap configured, and 90% of
memory consumed with anon pages so that the dirty_background_ratio
of 10% didn't kick off writeback.  Then I would expect kswapd to
write out to a filesystem before writeback would do it.

Nonetheless, without clear evidence to the contrary, I think it is
safest to add this test to the ->writepage function for any filesystem
which currently sets the bdi async congested flag.

Thanks,
NeilBrown


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31 13:12       ` Matthew Wilcox
@ 2022-01-31 23:00         ` NeilBrown
  2022-02-01  2:01           ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2022-01-31 23:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > +++ b/fs/fuse/file.c
> > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > >  
> > > >  	if (fuse_is_bad(inode))
> > > >  		return;
> > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > +		return;
> > > 
> > > This seems like a bad idea to me.  If we don't even start reads on
> > > readahead pages, they'll get ->readpage called on them one at a time
> > > and the reading thread will block.  It's going to lead to some nasty
> > > performance problems, exactly when you don't want them.  Better to
> > > queue the reads internally and wait for congestion to ease before
> > > submitting the read.
> > > 
> > 
> > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > inode_read_congested() returns true, so it doesn't start readahead.
> > ???
> 
> It's rather different.  Imagine the readahead window has expanded to
> 256kB (64 pages).  Today, we see congestion and don't do anything.
> That means we miss the async readahead opportunity, find a missing
> page and end up calling into page_cache_sync_ra(), by which time
> we may or may not be congested.
> 
> If the inode_read_congested() in page_cache_async_ra() is removed and
> the patch above is added to replace it, we'll allocate those 64 pages and
> add them to the page cache.  But then we'll return without starting IO.
> When we hit one of those !uptodate pages, we'll call ->readpage on it,
> but we won't do anything to the other 63 pages.  So we'll go through a
> protracted slow period of sending 64 reads, one at a time, whether or
> not congestion has eased.  Then we'll hit a missing page and proceed
> to the sync ra case as above.

Hmmm... where is all this documented?
The entry for readahead in vfs.rst says:

    If the filesystem decides to stop attempting I/O before reaching the
    end of the readahead window, it can simply return.

but you are saying that if it simply returns, it'll most likely just get
called again.  So maybe it shouldn't say that?

What do other filesystems do?
ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
requests. btrfs seems much the same.
xfs uses iomap_readahead ..  which again sets REQ_RAHEAD but otherwise
just does a normal read.

The effect of REQ_RAHEAD seems to be primarily to avoid retries on
failure.

So it seems that core read-ahead code is not set up to expect readahead
to fail, though it is (begrudgingly) permitted.

The current inode_read_congested() test in page_cache_async_ra() seems
to be just delaying the inevitable (and in fairness, the comment does
say "Defer....").  Maybe just blocking on the congestion is an equally
good way to delay it...

I note that ->readahead isn't told if the read-ahead is async or not, so
my patch will drop sync read-ahead on congestion, which the current code
doesn't do.

So maybe this congestion tracking really is useful, and we really want
to keep it.

I really would like to see that high-level documentation!!

Thanks,
NeilBrown

 


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-01-31 23:00         ` NeilBrown
@ 2022-02-01  2:01           ` Matthew Wilcox
  2022-02-01  3:28             ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-02-01  2:01 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > +++ b/fs/fuse/file.c
> > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > >  
> > > > >  	if (fuse_is_bad(inode))
> > > > >  		return;
> > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > +		return;
> > > > 
> > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > and the reading thread will block.  It's going to lead to some nasty
> > > > performance problems, exactly when you don't want them.  Better to
> > > > queue the reads internally and wait for congestion to ease before
> > > > submitting the read.
> > > > 
> > > 
> > > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > > inode_read_congested() returns true, so it doesn't start readahead.
> > > ???
> > 
> > It's rather different.  Imagine the readahead window has expanded to
> > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > That means we miss the async readahead opportunity, find a missing
> > page and end up calling into page_cache_sync_ra(), by which time
> > we may or may not be congested.
> > 
> > If the inode_read_congested() in page_cache_async_ra() is removed and
> > the patch above is added to replace it, we'll allocate those 64 pages and
> > add them to the page cache.  But then we'll return without starting IO.
> > When we hit one of those !uptodate pages, we'll call ->readpage on it,
> > but we won't do anything to the other 63 pages.  So we'll go through a
> > protracted slow period of sending 64 reads, one at a time, whether or
> > not congestion has eased.  Then we'll hit a missing page and proceed
> > to the sync ra case as above.
> 
> Hmmm... where is all this documented?
> The entry for readahead in vfs.rst says:
> 
>     If the filesystem decides to stop attempting I/O before reaching the
>     end of the readahead window, it can simply return.
> 
> but you are saying that if it simply returns, it'll most likely just get
> called again.  So maybe it shouldn't say that?

That's not what I'm saying at all.  I'm saying that if ->readahead fails
to read the page, ->readpage will be called to read the page (if it's
actually accessed).

> What do other filesystems do?
> ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
> requests. btrfs seems much the same.
> > xfs uses iomap_readahead ..  which again sets REQ_RAHEAD but otherwise
> just does a normal read.
> 
> The effect of REQ_RAHEAD seems to be primarily to avoid retries on
> failure.
> 
> So it seems that core read-ahead code is not set up to expect readahead
> to fail, though it is (begrudgingly) permitted.

Well, yes.  The vast majority of reads don't fail.

> The current inode_read_congested() test in page_cache_async_ra() seems
> to be just delaying the inevitable (and in fairness, the comment does
> say "Defer....").  Maybe just blocking on the congestion is an equally
> good way to delay it...

I don't think we should _block_ for an async read request.  We're in the
context of a process which has read a different page.  Maybe what we
need is a readahead_abandon() call that removes the just-added pages
from the page cache, so we fall back to doing a sync readahead?
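
(Entirely hypothetical sketch of what I mean - no such helper exists
today:)

static void readahead_abandon(struct readahead_control *rac)
{
        struct page *page;

        /* Drop the just-added pages we never started I/O on, so a
         * later sync readahead can retry the whole range. */
        while ((page = readahead_page(rac))) {
                delete_from_page_cache(page);
                unlock_page(page);
                put_page(page);
        }
}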

> I note that ->readahead isn't told if the read-ahead is async or not, so
> my patch will drop sync read-ahead on congestion, which the current code
> doesn't do.

Now that we have a readahead_control, it's simple to add that
information to it.

> So maybe this congestion tracking really is useful, and we really want
> to keep it.
> 
> I really would like to see that high-level documentation!!

I've done my best to add documentation.  There's more than before
I started.


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-02-01  2:01           ` Matthew Wilcox
@ 2022-02-01  3:28             ` NeilBrown
  2022-02-01  4:06               ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2022-02-01  3:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> > On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > > +++ b/fs/fuse/file.c
> > > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > > >  
> > > > > >  	if (fuse_is_bad(inode))
> > > > > >  		return;
> > > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > > +		return;
> > > > > 
> > > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > > and the reading thread will block.  It's going to lead to some nasty
> > > > > performance problems, exactly when you don't want them.  Better to
> > > > > queue the reads internally and wait for congestion to ease before
> > > > > submitting the read.
> > > > > 
> > > > 
> > > > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > > > inode_read_congested() returns true, so it doesn't start readahead.
> > > > ???
> > > 
> > > It's rather different.  Imagine the readahead window has expanded to
> > > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > > That means we miss the async readahead opportunity, find a missing
> > > page and end up calling into page_cache_sync_ra(), by which time
> > > we may or may not be congested.
> > > 
> > > If the inode_read_congested() in page_cache_async_ra() is removed and
> > > the patch above is added to replace it, we'll allocate those 64 pages and
> > > add them to the page cache.  But then we'll return without starting IO.
> > > When we hit one of those !uptodate pages, we'll call ->readpage on it,
> > > but we won't do anything to the other 63 pages.  So we'll go through a
> > > protracted slow period of sending 64 reads, one at a time, whether or
> > > not congestion has eased.  Then we'll hit a missing page and proceed
> > > to the sync ra case as above.
> > 
> > Hmmm... where is all this documented?
> > The entry for readahead in vfs.rst says:
> > 
> >     If the filesystem decides to stop attempting I/O before reaching the
> >     end of the readahead window, it can simply return.
> > 
> > but you are saying that if it simply returns, it'll most likely just get
> > called again.  So maybe it shouldn't say that?
> 
> That's not what I'm saying at all.  I'm saying that if ->readahead fails
> to read the page, ->readpage will be called to read the page (if it's
> actually accessed).

Yes, I see that now - thanks.

But looking at the first part of what you wrote - currently if
congestion means we skip page_cache_async_ra() (and it is the
WB_sync_congested (not async!) which causes us to skip that) then we end
up in page_cache_sync_ra() - which also calls ->readahead but without
the 'congested' protection.

Presumably the sync readahead asks for fewer pages or something?  What is
the logic there?

> 
> > What do other filesystems do?
> > ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
> > requests. btrfs seems much the same.
> > > xfs uses iomap_readahead ..  which again sets REQ_RAHEAD but otherwise
> > just does a normal read.
> > 
> > The effect of REQ_RAHEAD seems to be primarily to avoid retries on
> > failure.
> > 
> > > So it seems that core read-ahead code is not set up to expect readahead
> > to fail, though it is (begrudgingly) permitted.
> 
> Well, yes.  The vast majority of reads don't fail.

Which makes one wonder why we have the special-case handling.  The code
that tests REQ_RAHEAD has probably never been tested.  Fortunately it is
quite simple code....

> 
> > The current inode_read_congested() test in page_cache_async_ra() seems
> > to be just delaying the inevitable (and in fairness, the comment does
> > say "Defer....").  Maybe just blocking on the congestion is an equally
> > good way to delay it...
> 
> I don't think we should _block_ for an async read request.  We're in the
> context of a process which has read a different page.  Maybe what we
> need is a readahead_abandon() call that removes the just-added pages
> from the page cache, so we fall back to doing a sync readahead?

Well, we do potentially block - when allocating a bio or other similar
structure, and when reading an index block to know where to read from.
But we don't block waiting for the read, and we don't block waiting to
allocate the page to read-ahead.  Just how much blocking is acceptable,
I wonder.  Maybe we should punt readahead to a workqueue and let it do
the small-time waiting.

Why does the presence of an unlocked non-uptodate page cause readahead
to be skipped?  Is this somehow related to the PG_readahead flag?  If we
set PG_readahead on the first page that we decided to skip in
->readahead, would that help?

> 
> > I note that ->readahead isn't told if the read-ahead is async or not, so
> > my patch will drop sync read-ahead on congestion, which the current code
> > doesn't do.
> 
> Now that we have a readahead_control, it's simple to add that
> information to it.

True.

> 
> > So maybe this congestion tracking really is useful, and we really want
> > to keep it.
> > 
> > I really would like to see that high-level documentation!!
> 
> I've done my best to add documentation.  There's more than before
> I started.

I guess it's my turn then - if I can manage to understand it.

Thanks,
NeilBrown


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-02-01  3:28             ` NeilBrown
@ 2022-02-01  4:06               ` Matthew Wilcox
  2022-02-07  0:47                 ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-02-01  4:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Tue, Feb 01, 2022 at 02:28:32PM +1100, NeilBrown wrote:
> On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> > > On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > > > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > > > +++ b/fs/fuse/file.c
> > > > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > > > >  
> > > > > > >  	if (fuse_is_bad(inode))
> > > > > > >  		return;
> > > > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > > > +		return;
> > > > > > 
> > > > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > > > and the reading thread will block.  It's going to lead to some nasty
> > > > > > performance problems, exactly when you don't want them.  Better to
> > > > > > queue the reads internally and wait for congestion to ease before
> > > > > > submitting the read.
> > > > > > 
> > > > > 
> > > > > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > > > > inode_read_congested() returns true, so it doesn't start readahead.
> > > > > ???
> > > > 
> > > > It's rather different.  Imagine the readahead window has expanded to
> > > > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > > > That means we miss the async readahead opportunity, find a missing
> > > > page and end up calling into page_cache_sync_ra(), by which time
> > > > we may or may not be congested.
> > > > 
> > > > If the inode_read_congested() in page_cache_async_ra() is removed and
> > > > the patch above is added to replace it, we'll allocate those 64 pages and
> > > > add them to the page cache.  But then we'll return without starting IO.
> > > > When we hit one of those !uptodate pages, we'll call ->readpage on it,
> > > > but we won't do anything to the other 63 pages.  So we'll go through a
> > > > protracted slow period of sending 64 reads, one at a time, whether or
> > > > not congestion has eased.  Then we'll hit a missing page and proceed
> > > > to the sync ra case as above.
> > > 
> > > Hmmm... where is all this documented?
> > > The entry for readahead in vfs.rst says:
> > > 
> > >     If the filesystem decides to stop attempting I/O before reaching the
> > >     end of the readahead window, it can simply return.
> > > 
> > > but you are saying that if it simply returns, it'll most likely just get
> > > called again.  So maybe it shouldn't say that?
> > 
> > That's not what I'm saying at all.  I'm saying that if ->readahead fails
> > to read the page, ->readpage will be called to read the page (if it's
> > actually accessed).
> 
> Yes, I see that now - thanks.
> 
> But looking at the first part of what you wrote - currently if
> congestion means we skip page_cache_async_ra() (and it is the
> WB_sync_congested (not async!) which causes us to skip that) then we end
> up in page_cache_sync_ra() - which also calls ->readahead but without
> the 'congested' protection.
> 
> Presumably the sync readahead asks for fewer pages or something?  What is
> the logic there?

Assuming you open() the file and read() one byte at a time sequentially,
a sufficiently large file will work like this:

 - No page at index 0
   - Sync readahead pages 0-15
   - Set the readahead marker on page 8
   - Wait for page 0 to come uptodate (assume readahead succeeds)
 - Read pages 1-7
 - Notice the readahead mark on page 8
   - Async readahead pages 16-47
   - Set the readahead marker on page 32
 - Read pages 8-15
 - Hopefully the async readahead for page 16 already finished; if not
   wait for it
 - Read pages 17-31
 - Notice the readahead mark on page 32
   - Async readahead pages 48-111
   - Set the readahead marker on page 80
...

The sync readahead is "We need to read this page now, we may as well
start the read for other pages at the same time".  Async readahead is
"We predict we'll need these pages in the future".  Readpage only
gets used if readahead has failed.
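
(The window growth in that walkthrough comes from get_next_ra_size()
in mm/readahead.c, which as I read it is roughly:)

static unsigned long get_next_ra_size(struct file_ra_state *ra,
                                      unsigned long max)
{
        unsigned long cur = ra->size;

        if (cur < max / 16)
                return 4 * cur;         /* grow fast while small */
        if (cur <= max / 2)
                return 2 * cur;         /* then keep doubling */
        return max;                     /* cap at the ra maximum */
}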

> > > So it seems that core read-ahead code it not set up to expect readahead
> > > to fail, though it is (begrudgingly) permitted.
> > 
> > Well, yes.  The vast majority of reads don't fail.
> 
> Which makes one wonder why we have the special-case handling.  The code
> that tests REQ_RAHEAD has probably never been tested.  Fortunately it is
> quite simple code....

I actually did a set of tests while developing folios that failed every
readahead or some proportion of readaheads.  Found some interesting bugs
that way.  Might be a good idea to upstream an error injection so that
people can keep testing it.

> > > The current inode_read_congested() test in page_cache_async_ra() seems
> > > to be just delaying the inevitable (and in fairness, the comment does
> > > say "Defer....").  Maybe just blocking on the congestion is an equally
> > > good way to delay it...
> > 
> > I don't think we should _block_ for an async read request.  We're in the
> > context of a process which has read a different page.  Maybe what we
> > need is a readahead_abandon() call that removes the just-added pages
> > from the page cache, so we fall back to doing a sync readahead?
> 
> Well, we do potentially block - when allocating a bio or other similar
> structure, and when reading an index block to know where to read from.
> But we don't block waiting for the read, and we don't block waiting to
> allocate the page to read-ahead.  Just how much blocking is acceptable,
> I wonder.  Maybe we should punt readahead to a workqueue and let it do
> the small-time waiting.

Right, I meant "block on I/O" rather than "block trying to free memory".
We are under GFP_NOFS during the readahead path, so while we can block
for a previously-submitted I/O to finish, we can't start a new I/O.

> Why does the presence of an unlocked non-uptodate page cause readahead
> to be skipped?  Is this somehow related to the PG_readahead flag?  If we
> set PG_readahead on the first page that we decided to skip in
> ->readahead, would that help?

To go back to the example above, let's suppose the first async read hits
congestion.  Before your patches, we don't even allocate pages 16-47.
So we see a missing page, and the ondemand algorithm will submit a sync
ra for pages 16-31.

After your patches, we allocate pages 16-47, add them to the page cache
and then leave them there !uptodate.  Now each time our read() hits
a !uptodate page, we call ->readpage on it.  We have no idea that the
remaining pages in that readahead batch were also abandoned and could
be profitably read.  I think we'll submit another async readahead
for 48-112, but I'd have to check on that.

> > > I really would like to see that high-level documentation!!
> > 
> > I've done my best to add documentation.  There's more than before
> > I started.
> 
> I guess it's my turn then - if I can manage to understand it.

It always works out better when two people are interested in the
documentation.


* Re: [PATCH 1/3] fuse: remove reliance on bdi congestion
  2022-02-01  4:06               ` Matthew Wilcox
@ 2022-02-07  0:47                 ` NeilBrown
  0 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2022-02-07  0:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Layton, Ilya Dryomov, Miklos Szeredi,
	Trond Myklebust, Anna Schumaker, linux-mm, linux-nfs,
	linux-fsdevel, ceph-devel, linux-kernel

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Tue, Feb 01, 2022 at 02:28:32PM +1100, NeilBrown wrote:
> > On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > > On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> > > > I really would like to see that high-level documentation!!
> > > 
> > > I've done my best to add documentation.  There's more than before
> > > I started.
> > 
> > I guess it's my turn then - if I can manage to understand it.
> 
> It always works out better when two people are interested in the
> documentation.
> 
> 

Please review...

From: NeilBrown <neilb@suse.de>
Subject: [PATCH] MM: document and polish read-ahead code.

Add some "big-picture" documentation for read-ahead and polish the code
to make it fit this documentation.

The meaning of ->async_size is clarified to match its name,
i.e. any request to ->readahead() has a sync part and an async part.
The caller will wait for the sync pages to complete, but will not wait
for the async pages.  The first async page is still marked PG_readahead.
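
To make the convention concrete, a hypothetical state (made-up
numbers):

    ra->start      = 100;   /* window covers pages 100-131 */
    ra->size       = 32;
    ra->async_size = 16;    /* async tail: pages 116-131 */

Pages 100-115 are the sync part the caller will wait for; page 116
(start + size - async_size) gets PG_readahead set so that reaching it
triggers the next readahead.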

- When ->readahead does not consume all pages, any remaining async pages
  are now discarded with delete_from_page_cache().  This makes it
  possible for the filesystem to delay readahead, e.g. due to congestion.
- in try_context_readahead(), ->async_size is set correctly rather
  than being set to 1 (see the worked numbers after this list).  Prior
  to commit 2cad40180197 ("readahead: make context readahead more
  conservative") it was set to ra->size, which is not correct (that
  implies no sync component).  As this was too high and caused
  problems, it was reduced to 1 - again incorrect, but less
  problematic.  The setting provided with this patch does not restore
  those problems, and is no longer arbitrary.
- The calculation of ->async_size in the initial_readahead section of
  ondemand_readahead() now makes sense - it is zero if the chosen
  size does not exceed the requested size.  This means that we will not
  set the PG_readahead flag in this case, but as the requested size
  has not been satisfied, we can expect a subsequent readahead request
  anyway.
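
As worked numbers for the try_context_readahead() change (values
invented for illustration): with req_size = 16, an estimated
previous-read size of 16, and max = 64, the new code gives

	ra->size       = min(16 + 16, 64)    = 32
	ra->async_size = ra->size - req_size = 16

so exactly the pages beyond the explicit request form the async tail,
rather than a fixed tail of 1.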

Note that the current function names page_cache_sync_ra() and
page_cache_async_ra() are misleading.  All ra requests are partly sync
and partly async, so either part can be empty.
A page_cache_sync_ra() request will usually set ->async_size non-zero,
implying it is not all synchronous.
When a non-zero req_count is passed to page_cache_async_ra(), the
implication is that some prefix of the request is synchronous, though
the calculation made there is incorrect - I haven't tried to fix it.
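
As a worked example of the clarified semantics (again with invented
numbers), a sequential stream in steady state might see

	ra->start      = 64	/* first page not yet in the page cache */
	ra->size       = 32	/* total pages in this readahead */
	ra->async_size = 32	/* the caller waits for none of them */

where the whole request is async and PG_readahead is set on page
ra->start + ra->size - ra->async_size == 64, the first page of the
async section, to trigger the next readahead.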

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/readahead.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index cf0dcf89eb69..5676f5c1aa39 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -8,6 +8,105 @@
  *		Initial version.
  */
 
+/**
+ * Readahead is used to read content into the page cache before it is
+ * explicitly requested by the application.  Readahead only ever
+ * attempts to read pages which are not yet in the page cache.  If a
+ * page is present but not up-to-date, readahead will not try to read
+ * it. In that case a simple ->readpage() will be requested.
+ *
+ * Readahead is triggered when an application read request (whether a
+ * system call or a page fault) finds that the requested page is not in
+ * the page cache, or that it is in the page cache and has the
+ * PG_readahead flag set.  This flag indicates that the page was loaded
+ * as part of a previous read-ahead request and now that it has been
+ * accessed, it is time for the next read-ahead.
+ *
+ * Each readahead request is partly synchronous read, and partly async
+ * read-ahead.  This is reflected in the struct file_ra_state which
+ * contains ->size being the total number of pages, and ->async_size
+ * which is the number of pages in the async section.  The first page in
+ * this async section will have PG_readahead set as a trigger for a
+ * subsequent read ahead.  Once a series of sequential reads has been
+ * established, there should be no need for a synchronous component and
+ * all readahead requests will be fully asynchronous.
+ *
+ * When either of the triggers causes a readahead, three numbers need to
+ * be determined: the start of the region, the size of the region, and
+ * the size of the async tail.
+ *
+ * The start of the region is simply the first page address at or after
+ * the accessed address which is not currently populated in the page
+ * cache.  This is found with a simple search in the page cache.
+ *
+ * The size of the async tail is determined by subtracting the size that
+ * was explicitly requested from the determined readahead size, unless
+ * this would be less than zero - then zero is used.  NOTE THIS
+ * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
+ * PAGE.
+ *
+ * The size of the region is normally determined from the size of the
+ * previous readahead which loaded the preceding pages.  This may be
+ * discovered from the struct file_ra_state for simple sequential reads,
+ * or from examining the state of the page cache when multiple
+ * sequential reads are interleaved.  Specifically: where the readahead
+ * was triggered by the PG_readahead flag, the size of the previous
+ * readahead is assumed to be the number of pages from the triggering
+ * page to the start of the new readahead.  In these cases, the size of
+ * the previous readahead is scaled, often doubled, for the new
+ * readahead, though see get_next_ra_size() for details.
+ *
+ * If the size of the previous read cannot be determined, the number of
+ * preceding pages in the page cache is used to estimate the size of
+ * a previous read.  This estimate could easily be misled by random
+ * reads being coincidentally adjacent, so it is ignored unless it is
+ * larger than the current request, and it is not scaled up, unless it
+ * is at the start of the file.
+ *
+ * In general, readahead is accelerated at the start of the file, as
+ * reads from there are often sequential.  There are other minor
+ * adjustments to the read ahead size in various special cases and these
+ * are best discovered by reading the code.
+ *
+ * The above calculations determine the readahead, to which any requested
+ * read size may be added.
+ *
+ * Readahead requests are sent to the filesystem using the ->readahead
+ * address space operation, for which mpage_readahead() is a canonical
+ * implementation.  ->readahead() should normally initiate reads on all
+ * pages, but may fail to read any or all pages without causing an IO
+ * error.  The page cache reading code will issue a ->readpage() request
+ * for any page which ->readahead() does not provide, and only an error
+ * from this will be final.
+ *
+ * ->readahead will generally call readahead_page() repeatedly to get
+ * each page from those prepared for read ahead.  It may fail to read a
+ * page by:
+ *  - not calling readahead_page() sufficiently many times, effectively
+ *    ignoring some pages, as might be appropriate if the path to
+ *    storage is congested.
+ *  - failing to actually submit a read request for a given page,
+ *    possibly due to insufficient resources, or
+ *  - getting an error during subsequent processing of a request.
+ * In the last two cases, the page should be unlocked to indicate that
+ * the read attempt has failed.  In the first case the page will be
+ * unlocked by the caller.
+ *
+ * Those pages not in the final ``async_size`` of the request should be
+ * considered to be important and ->readahead() should not fail them due
+ * to congestion or temporary resource unavailability, but should wait
+ * for necessary resources (e.g.  memory or indexing information) to
+ * become available.  Pages in the final ``async_size`` may be
+ * considered less urgent and failure to read them is more acceptable.
+ * In this case it is best to use delete_from_page_cache() to remove the
+ * pages from the page cache as is automatically done for pages that
+ * were not fetched with readahead_page().  This will allow a
+ * subsequent synchronous read ahead request to try them again.  If they
+ * are left in the page cache, then they will be read individually using
+ * ->readpage().
+ *
+ */
+
 #include <linux/kernel.h>
 #include <linux/dax.h>
 #include <linux/gfp.h>
@@ -129,6 +228,8 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 		aops->readahead(rac);
 		/* Clean up the remaining pages */
 		while ((page = readahead_page(rac))) {
+			if (rac->ra->async_size >= readahead_count(rac))
+				delete_from_page_cache(page);
 			unlock_page(page);
 			put_page(page);
 		}
@@ -426,7 +527,7 @@ static int try_context_readahead(struct address_space *mapping,
 
 	ra->start = index;
 	ra->size = min(size + req_size, max);
-	ra->async_size = 1;
+	ra->async_size = ra->size - req_size;
 
 	return 1;
 }
@@ -527,7 +628,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 initial_readahead:
 	ra->start = index;
 	ra->size = get_init_ra_size(req_size, max_pages);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	ra->async_size = ra->size > req_size ? ra->size - req_size : 0;
 
 readit:
 	/*
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-02-07  0:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-31  4:03 [PATCH 0/3] remove dependence of inode_congested() NeilBrown
2022-01-31  4:03 ` [PATCH 1/3] fuse: remove reliance on bdi congestion NeilBrown
2022-01-31  4:28   ` Matthew Wilcox
2022-01-31  4:47     ` NeilBrown
2022-01-31 10:21       ` Miklos Szeredi
2022-01-31 13:12       ` Matthew Wilcox
2022-01-31 23:00         ` NeilBrown
2022-02-01  2:01           ` Matthew Wilcox
2022-02-01  3:28             ` NeilBrown
2022-02-01  4:06               ` Matthew Wilcox
2022-02-07  0:47                 ` NeilBrown
2022-01-31  4:03 ` [PATCH 3/3] ceph: " NeilBrown
2022-01-31  4:03 ` [PATCH 2/3] nfs: " NeilBrown
2022-01-31  4:22   ` Matthew Wilcox
2022-01-31  4:55     ` NeilBrown
2022-01-31 13:15       ` Matthew Wilcox
2022-01-31 21:38         ` NeilBrown
