Date: Wed, 8 Mar 2023 18:59:52 +1100
From: Dave Chinner <david@fromorbit.com>
To: Luis Chamberlain
Cc: Matthew Wilcox, "Darrick J. Wong",
Wong" , James Bottomley , Keith Busch , Theodore Ts'o , Pankaj Raghav , Daniel Gomez , lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations Message-ID: <20230308075952.GU2825702@dread.disaster.area> References: <2600732b9ed0ddabfda5831aff22fd7e4270e3be.camel@HansenPartnership.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Tue, Mar 07, 2023 at 10:11:43PM -0800, Luis Chamberlain wrote: > On Sun, Mar 05, 2023 at 05:02:43AM +0000, Matthew Wilcox wrote: > > On Sat, Mar 04, 2023 at 08:15:50PM -0800, Luis Chamberlain wrote: > > > On Sat, Mar 04, 2023 at 04:39:02PM +0000, Matthew Wilcox wrote: > > > > XFS already works with arbitrary-order folios. > > > > > > But block sizes > PAGE_SIZE is work which is still not merged. It > > > *can* be with time. That would allow one to muck with larger block > > > sizes than 4k on x86-64 for instance. Without this, you can't play > > > ball. > > > > Do you mean that XFS is checking that fs block size <= PAGE_SIZE and > > that check needs to be dropped? If so, I don't see where that happens. > > None of that. Back in 2018 Chinner had prototyped XFS support with > larger block size > PAGE_SIZE: > > https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/ Having a working BS > PS implementation on XFS based on variable page order support in the page cache goes back over a decade before that. Christoph Lameter did the page cache work, and I added support for XFS back in 2007. THe total change to XFS required can be seen in this simple patch: https://lore.kernel.org/linux-mm/20070423093152.GI32602149@melbourne.sgi.com/ That was when the howls of anguish about high order allocations Willy mentioned started.... > I just did a quick attempt to rebased it and most of the left over work > is actually on IOMAP for writeback and zero / writes requiring a new > zero-around functionality. All bugs on the rebase are my own, only compile > tested so far, and not happy with some of the changes I had to make so > likely could use tons more love: > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230307-larger-bs-then-ps-xfs On a current kernel, that patchset is fundamentally broken as we have multi-page folio support in XFS and iomap - the patchset is inherently PAGE_SIZE based and it will do the the wrong thing with PAGE_SIZE based zero-around. IOWs, IOMAP_F_ZERO_AROUND does not need to exist any more, nor should any of the custom hooks it triggered in different operations for zero-around. That's because we should now be using the same approach to BS > PS as we first used back in 2007. We already support multi-page folios in the page cache, so all the zero-around and partial folio uptodate tracking we need is already in place. Hence, like Willy said, all we need to do is have filemap_get_folio(FGP_CREAT) always allocate at least filesystem block sized and aligned folio and insert them into the mapping tree. Multi-page folios will always need to be sized as an integer multiple of the filesystem block size, but once we ensure size and alignment of folios in the page cache, we get everything else for free. /me cues the howls of anguish over memory fragmentation.... > But it should give you an idea of what type of things filesystems need to do. Not really. 
/me cues the howls of anguish over memory fragmentation....

> But it should give you an idea of what type of things filesystems need
> to do.

Not really. It gives you an idea of what filesystems needed to do 5
years ago to support BS > PS. We're living in the age of folios now,
not pages.

Willy starting work on folios was why I dropped that patch set: firstly
because it was going to make the iomap conversion to folios harder, and
secondly because we realised that none of it was necessary if folios
supported multi-page constructs in the page cache natively.

IOWs, multi-page folios in the page cache should make BS > PS mostly
trivial to support for any filesystem or block device that doesn't have
some other dependency on PAGE_SIZE objects in the page cache (e.g.
bufferheads).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com