From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BCA5BC433FE for ; Tue, 1 Feb 2022 02:02:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232943AbiBACC5 (ORCPT ); Mon, 31 Jan 2022 21:02:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54966 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233118AbiBACBa (ORCPT ); Mon, 31 Jan 2022 21:01:30 -0500 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 134FBC061741; Mon, 31 Jan 2022 18:01:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=Sm7Yc1CtQBIVZ/GRlT8gqFP2ojSk7fSCO2XvuJwXk+w=; b=aLmub9E/3RVA1+xFe6GB5IgwFa KonTAXz5S6wqGq94yXuiVfawT+IlIZKim6M+ABrd3VJMZAYL6q2I1HShpV2MCjDcBHomp1cU10R7t 44i1b0YiK/p8Whr8CpU7+kHI8CMB8jW1qMCSlLKrFBlKbZx0ggC+mLwWOoA8aWPH1rMghEQtCOtj0 FyMMeZrqM1Jzy77Qhx13Uybrr7yZJJfFJINGoqCRfceMUsqi9x5huOhgbAg7Tnb1bRRoiD7CgJa4d R+dr5qTA+tFPH1/vEboBzowgBYxSw62CD53N+coERAPBHeLHW/uuAvM7YlIPaKL4LkY2BUyZufqSf 29wVeJKQ==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1nEiTo-00BAtw-Pj; Tue, 01 Feb 2022 02:01:12 +0000 Date: Tue, 1 Feb 2022 02:01:12 +0000 From: Matthew Wilcox To: NeilBrown Cc: Andrew Morton , Jeff Layton , Ilya Dryomov , Miklos Szeredi , Trond Myklebust , Anna Schumaker , linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/3] fuse: remove reliance on bdi congestion Message-ID: References: <164360127045.4233.2606812444285122570.stgit@noble.brown> <164360183348.4233.761031466326833349.stgit@noble.brown> <164360446180.18996.6767388833611575467@noble.neil.brown.name> <164367002370.18996.7242801209611375112@noble.neil.brown.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <164367002370.18996.7242801209611375112@noble.neil.brown.name> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote: > On Tue, 01 Feb 2022, Matthew Wilcox wrote: > > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote: > > > On Mon, 31 Jan 2022, Matthew Wilcox wrote: > > > > > +++ b/fs/fuse/file.c > > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac) > > > > > > > > > > if (fuse_is_bad(inode)) > > > > > return; > > > > > + if (fc->num_background >= fc->congestion_threshold) > > > > > + return; > > > > > > > > This seems like a bad idea to me. If we don't even start reads on > > > > readahead pages, they'll get ->readpage called on them one at a time > > > > and the reading thread will block. It's going to lead to some nasty > > > > performance problems, exactly when you don't want them. Better to > > > > queue the reads internally and wait for congestion to ease before > > > > submitting the read. > > > > > > > > > > Isn't that exactly what happens now? page_cache_async_ra() sees that > > > inode_read_congested() returns true, so it doesn't start readahead. > > > ??? > > > > It's rather different. Imagine the readahead window has expanded to > > 256kB (64 pages). Today, we see congestion and don't do anything. > > That means we miss the async readahed opportunity, find a missing > > page and end up calling into page_cache_sync_ra(), by which time > > we may or may not be congested. > > > > If the inode_read_congested() in page_cache_async_ra() is removed and > > the patch above is added to replace it, we'll allocate those 64 pages and > > add them to the page cache. But then we'll return without starting IO. > > When we hit one of those !uptodate pages, we'll call ->readpage on it, > > but we won't do anything to the other 63 pages. So we'll go through a > > protracted slow period of sending 64 reads, one at a time, whether or > > not congestion has eased. Then we'll hit a missing page and proceed > > to the sync ra case as above. > > Hmmm... where is all this documented? > The entry for readahead in vfs.rst says: > > If the filesystem decides to stop attempting I/O before reaching the > end of the readahead window, it can simply return. > > but you are saying that if it simply returns, it'll most likely just get > called again. So maybe it shouldn't say that? That's not what I'm saying at all. I'm saying that if ->readahead fails to read the page, ->readpage will be called to read the page (if it's actually accessed). > What do other filesystems do? > ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all > requests. btrfs seems much the same. > xfs uses iomp_readahead .. which again sets REQ_RAHEAD but otherwise > just does a normal read. > > The effect of REQ_RAHEAD seems to be primarily to avoid retries on > failure. > > So it seems that core read-ahead code it not set up to expect readahead > to fail, though it is (begrudgingly) permitted. Well, yes. The vast majority of reads don't fail. > The current inode_read_congested() test in page_cache_async_ra() seems > to be just delaying the inevitable (and in fairness, the comment does > say "Defer...."). Maybe just blocking on the congestion is an equally > good way to delay it... I don't think we should _block_ for an async read request. We're in the context of a process which has read a different page. Maybe what we need is a readahead_abandon() call that removes the just-added pages from the page cache, so we fall back to doing a sync readahead? > I note that ->readahead isn't told if the read-ahead is async or not, so > my patch will drop sync read-ahead on congestion, which the current code > doesn't do. Now that we have a readahead_control, it's simple to add that information to it. > So maybe this congestion tracking really is useful, and we really want > to keep it. > > I really would like to see that high-level documentation!! I've done my best to add documentation. There's more than before I started.