Re: [PATCH 1/9] Remove inode_congested()

From: "NeilBrown" <neilb@suse.de>
To: "Miklos Szeredi" <miklos@szeredi.hu>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Jaegeuk Kim" <jaegeuk@kernel.org>, "Chao Yu" <chao@kernel.org>,
	"Jeff Layton" <jlayton@kernel.org>,
	"Ilya Dryomov" <idryomov@gmail.com>,
	"Trond Myklebust" <trond.myklebust@hammerspace.com>,
	"Anna Schumaker" <anna.schumaker@netapp.com>,
	"Ryusuke Konishi" <konishi.ryusuke@gmail.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Philipp Reisner" <philipp.reisner@linbit.com>,
	"Lars Ellenberg" <lars.ellenberg@linbit.com>,
	"Paolo Valente" <paolo.valente@linaro.org>,
	"Jens Axboe" <axboe@kernel.dk>, "linux-mm" <linux-mm@kvack.org>,
	linux-nilfs@vger.kernel.org,
	"Linux NFS list" <linux-nfs@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net,
	"Ext4" <linux-ext4@vger.kernel.org>,
	ceph-devel@vger.kernel.org, drbd-dev@lists.linbit.com,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: [PATCH 1/9] Remove inode_congested()
Date: Sat, 29 Jan 2022 08:36:02 +1100
Message-ID: <164340576289.5493.5784848964540459557@noble.neil.brown.name>
In-Reply-To: <CAJfpegt-igF8HqsDUcMzfU0jYv8WpofLy0Uv0YnXLzsfx=tkGg@mail.gmail.com>

On Fri, 28 Jan 2022, Miklos Szeredi wrote:
> On Thu, 27 Jan 2022 at 03:47, NeilBrown <neilb@suse.de> wrote:
> >
> > inode_congested() reports if the backing-device for the inode is
> > congested.  Few bdis report congestion any more; only ceph, fuse, and
> > nfs do.  Having support just for those is unlikely to be useful.
> >
> > The places which test inode_congested() or its variants (like
> > inode_write_congested()) avoid initiating IO if congestion is present.
> > We now have to rely on other places in the stack to back off or abort
> > requests; we already do so for everything except these 3 filesystems.
> >
> > So remove inode_congested() and related functions, and remove the call
> > sites, assuming that inode_congested() always returns 'false'.
> 
> Looks to me this is going to "break" fuse; e.g. readahead path will go
> ahead and try to submit more requests, even if the queue is getting
> congested.   In this case the readahead submission will eventually
> block, which is counterproductive.
> 
> I think we should *first* make sure all call sites are substituted
> with appropriate mechanisms in the affected filesystems and as a last
> step remove the superfluous bdi congestion mechanism.
> 
> You are saying that all fs except these three already have such
> mechanisms in place, right?  Can you elaborate on that?

Not much.  I haven't looked into how other filesystems cope; I just know
that they must, because no other filesystem ever has a congested bdi
(with one or two minor exceptions, like filesystems over drbd).

Surely read-ahead should never block.  If it hits congestion, the
read-ahead request should simply fail.  Block-based filesystems seem to
set REQ_RAHEAD, which might get mapped to REQ_FAILFAST_MASK, though I
don't know how that is ultimately used.
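A minimal userspace sketch of the "fail rather than block" behaviour for read-ahead, assuming a simple bounded queue stands in for the fuse/nfs/ceph connection.  struct req_queue, submit_readahead_page() and readahead_window() are hypothetical names, not kernel APIs, and the sketch does not model how REQ_RAHEAD or REQ_FAILFAST_MASK are actually handled by the block layer:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-connection request queue; illustrative only, this
 * does not correspond to any kernel structure. */
struct req_queue {
	unsigned int in_flight;      /* requests currently outstanding */
	unsigned int max_in_flight;  /* at or above this, treat as congested */
};

static bool queue_congested(const struct req_queue *q)
{
	return q->in_flight >= q->max_in_flight;
}

/* Queue one read-ahead page.  Read-ahead is speculative, so instead of
 * sleeping until a slot frees up it fails at once with -EAGAIN. */
static int submit_readahead_page(struct req_queue *q)
{
	if (queue_congested(q))
		return -EAGAIN;
	q->in_flight++;
	assert(q->in_flight <= q->max_in_flight);
	return 0;
}

/* Try to read ahead nr_pages; stop at the first rejection and return
 * how many pages were actually queued.  The caller never blocks. */
static unsigned int readahead_window(struct req_queue *q, unsigned int nr_pages)
{
	unsigned int done = 0;

	while (done < nr_pages && submit_readahead_page(q) == 0)
		done++;
	return done;
}
```

With in_flight = 1 and max_in_flight = 4, asking for 8 pages queues 3 and returns immediately; the rest of the read-ahead is dropped rather than waited for.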

Maybe fuse and others should continue to track 'congestion' and reject
read-ahead requests when congested.
Maybe also skip WB_SYNC_NONE writes.
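Roughly what filesystem-private congestion tracking might look like, as a sketch only: struct fs_conn, its fields, and may_start_readahead()/may_start_writeback() are hypothetical, and the simple threshold policy is an assumption, not what fuse currently does.  The two writeback modes are redeclared locally so the example builds in userspace; the kernel's enum writeback_sync_modes uses the same names:

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the writeback modes (same names as the kernel's
 * enum writeback_sync_modes). */
enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Hypothetical congestion state kept by the filesystem itself rather
 * than in the bdi. */
struct fs_conn {
	unsigned int num_background;        /* background requests outstanding */
	unsigned int congestion_threshold;  /* congested at or above this */
};

static bool fs_congested(const struct fs_conn *fc)
{
	assert(fc->congestion_threshold > 0);
	return fc->num_background >= fc->congestion_threshold;
}

/* Read-ahead is optional, so reject it outright while congested. */
static bool may_start_readahead(const struct fs_conn *fc)
{
	return !fs_congested(fc);
}

/* WB_SYNC_NONE writeback is opportunistic and may be skipped while
 * congested; WB_SYNC_ALL is data-integrity writeback and must proceed. */
static bool may_start_writeback(const struct fs_conn *fc, enum wb_sync_mode mode)
{
	if (mode == WB_SYNC_ALL)
		return true;
	return !fs_congested(fc);
}
```

The important asymmetry is that only opportunistic WB_SYNC_NONE writeback is ever skipped; WB_SYNC_ALL writeback always goes ahead regardless of congestion.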

Or maybe this doesn't really matter in practice...  I wonder if we can
measure the usefulness of congestion.

Thanks,
NeilBrown