From: Amir Goldstein
Date: Wed, 25 May 2022 10:48:09 +0300
Subject: Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
To: Dave Chinner
Cc: Chandan Babu R, linux-xfs, "Darrick J . Wong", "Darrick J. Wong",
    Christoph Hellwig, Allison Henderson, "Luis R. Rodriguez", Theodore Tso
In-Reply-To: <20220525073339.GF1098723@dread.disaster.area>
References: <20210110160720.3922965-1-chandanrlinux@gmail.com>
    <20220523224352.GT1098723@dread.disaster.area>
    <20220525073339.GF1098723@dread.disaster.area>
X-Mailing-List: linux-xfs@vger.kernel.org
Rodriguez" , Theodore Tso Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Wed, May 25, 2022 at 10:33 AM Dave Chinner wrote: > > On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote: > > On Tue, May 24, 2022 at 1:43 AM Dave Chinner wrote: > > > > > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote: > > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R wrote: > > > > > > > > > > XFS does not check for possible overflow of per-inode extent counter > > > > > fields when adding extents to either data or attr fork. > > > > > > > > > > For e.g. > > > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and > > > > > then delete 50% of them in an alternating manner. > > > > > > > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511 > > > > > extents to be created in the attr fork of the inode. > > > > > > > > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131 > > > > > > > > > > 3. The incore inode fork extent counter is a signed 32-bit > > > > > quantity. However, the on-disk extent counter is an unsigned 16-bit > > > > > quantity and hence cannot hold 98511 extents. > > > > > > > > > > 4. The following incorrect value is stored in the xattr extent counter, > > > > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0 > > > > > core.naextents = -32561 > > > > > > > > > > This patchset adds a new helper function > > > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the > > > > > per-inode data and xattr extent counters and invokes it before > > > > > starting an fs operation (e.g. creating a new directory entry). With > > > > > this patchset applied, XFS detects counter overflows and returns with > > > > > an error rather than causing a silent corruption. > > > > > > > > > > The patchset has been tested by executing xfstests with the following > > > > > mkfs.xfs options, > > > > > 1. -m crc=0 -b size=1k > > > > > 2. -m crc=0 -b size=4k > > > > > 3. -m crc=0 -b size=512 > > > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > > > 5. -m rmapbt=1,reflink=1 -b size=4k > > > > > > > > > > The patches can also be obtained from > > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14. > > > > > > > > > > I have two patches that define the newly introduced error injection > > > > > tags in xfsprogs > > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/). > > > > > > > > > > I have also written tests > > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests) > > > > > for verifying the checks introduced in the kernel. > > > > > > > > > > > > > Hi Chandan and XFS folks, > > > > > > > > As you may have heard, I am working on producing a series of > > > > xfs patches for stable v5.10.y. > > > > > > > > My patch selection is documented at [1]. > > > > I am in the process of testing the backport patches against the 5.10.y > > > > baseline using Luis' kdevops [2] fstests runner. > > > > > > > > The configurations that we are testing are: > > > > 1. -m rmbat=0,reflink=1 -b size=4k (default) > > > > 2. -m crc=0 -b size=4k > > > > 3. -m crc=0 -b size=512 > > > > 4. -m rmapbt=1,reflink=1 -b size=1k > > > > 5. 
> > > > >
> > > > > The patchset has been tested by executing xfstests with the following
> > > > > mkfs.xfs options:
> > > > > 1. -m crc=0 -b size=1k
> > > > > 2. -m crc=0 -b size=4k
> > > > > 3. -m crc=0 -b size=512
> > > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > > >
> > > > > The patches can also be obtained from
> > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > > > >
> > > > > I have two patches that define the newly introduced error injection
> > > > > tags in xfsprogs
> > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> > > > >
> > > > > I have also written tests
> > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > > > for verifying the checks introduced in the kernel.
> > > > >
> > > >
> > > > Hi Chandan and XFS folks,
> > > >
> > > > As you may have heard, I am working on producing a series of
> > > > xfs patches for stable v5.10.y.
> > > >
> > > > My patch selection is documented at [1].
> > > > I am in the process of testing the backport patches against the 5.10.y
> > > > baseline using Luis' kdevops [2] fstests runner.
> > > >
> > > > The configurations that we are testing are:
> > > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > > 2. -m crc=0 -b size=4k
> > > > 3. -m crc=0 -b size=512
> > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > >
> > > > This patch set is the only largish series that I selected, because:
> > > > - It applies cleanly to 5.10.y
> > > > - I evaluated it as low risk and high value
> > >
> > > What value does it provide LTS users?
> >
> > Cloud providers deploy a large number of VMs/containers
> > and they may use reflink. So I think this could be an issue.
>
> Cloud providers are not deploying multi-TB VM images on XFS without
> also using some mechanism for avoiding worst-case fragmentation.
> They know all about the problems that manifest when extent
> counts get into the tens of millions, let alone billions....
>
> e.g. first access to a file pulls the entire extent list into
> memory, so for a file with 4 billion extents this will take hours to
> pull into memory (single threaded, synchronous read IO of millions
> of filesystem blocks) and consume >100GB of RAM for the
> in-memory extent list. Having VM startup get delayed by hours and
> put a massive load on the cloud storage infrastructure for that
> entire length of time isn't desirable behaviour...
>
> For multi-TB VM image deployment - especially with reflink on the
> image file - extent size hints are needed to mitigate worst case
> fragmentation. Reflink copies can run at up to about 100,000
> extents/s, so if you reflink a file with 4 billion extents in it,
> not only do you need another 100GB of RAM, you also need to wait
> several hours for the reflink to run. And while that reflink is
> running, nothing else has access to the data in that VM image: your VM
> is *down* for *hours* while you snapshot it.
>
> Typical mitigation is extent size hints in the MB ranges to reduce
> worst case fragmentation by two orders of magnitude (i.e. limit to
> tens of millions of extents, not billions), which brings snapshot
> times down to a minute or two.
>
> IOWs, it's obviously not practical to scale VM images out to
> billions of extents, even though we support extent counts in the
> billions.
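
As a side note for anyone following along: the extent size hint Dave
describes can be set with xfs_io, e.g. xfs_io -c "extsize 1m" vm.img,
which boils down to the generic FS_IOC_FSSETXATTR ioctl. A minimal
sketch of doing it directly, assuming an image file that has not been
written to yet (illustration only, not part of this series):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FSSETXATTR, FS_XFLAG_EXTSIZE */

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror(argv[1]);
		return 1;
	}
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;     /* enable the per-file hint */
	fsx.fsx_extsize = 1024 * 1024;          /* 1 MiB allocation granularity, in bytes */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}
	return 0;
}

The hint has to be a multiple of the filesystem block size and only
influences allocations made after it is set, which is why it is applied
before the image is populated.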
>
> > > This series adds almost no value to normal users - extent count
> > > overflows are just something that doesn't happen in production
> > > systems at this point in time. The largest data extent count I've
> > > ever seen is still an order of magnitude away from
> > > overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> > > and nobody is using the attribute fork sufficiently hard to overflow
> > > 65536 extents (typically a couple of million xattrs per inode).
> > >
> > > i.e. this series is ground work for upcoming internal filesystem
> > > functionality that requires much larger attribute forks (parent
> > > pointers and fsverity merkle tree storage) to be supported, and
> > > allows scope for much larger, massively fragmented VM image files
> > > (beyond 16TB on a 4kB block size fs for worst case
> > > fragmentation/reflink).
> >
> > I am not sure I follow this argument.
> > Users can create large attributes, can they not?
>
> Sure. But *nobody does*, and there are good reasons we don't see
> people doing this.
>
> The reality is that apps don't use xattrs heavily because
> filesystems are traditionally very bad at storing even moderate
> numbers of xattrs. XFS is the exception to the rule. Hence nobody is
> trying to use a few million xattrs per inode right now, and it's
> unlikely anyone will unless they specifically target XFS. In which
> case, they are going to want the large extent count stuff that just
> got merged into the for-next tree, and this whole discussion is
> moot....

With all the barriers to large extent counts that you mentioned,
I wonder how the large extent counters feature mitigates them,
but that is irrelevant to the question at hand.

> > And users can create massive fragmented/reflinked images, can they not?
>
> Yes, and they will hit scalability problems long before they get
> anywhere near 4 billion extents.
>
> > If we have learned anything, it is that if users can do something (i.e. on
> > stable), they will do it, so it may still be worth protecting this workflow?
>
> If I have learned anything, it's that huge extent counts are highly
> impractical for most workloads for one reason or another. We are a
> long way from enabling practical use of extent counts in the
> billions. Demand paging the extent list is the bare minimum we need,
> but then there's the sheer scale of modifications reflink and unlink
> need to make (billions of transactions to share/free billions of
> individual extents), and there's no magic solution to that.
>
> > I argue that the reason you did not see those constructs in the wild yet
> > is the time it takes until users format new xfs filesystems with mkfs.
>
> It really has nothing to do with filesystem formats and everything
> to do with the *cost* of creating, accessing, indexing and managing
> billions of extents.
>
> Have you ever tried to create a file with 4 billion extents in it?
> Even using fallocate to do it as fast as possible (no data IO!), I
> ran out of RAM on my 128GB test machine after 6 days of doing
> nothing but running fallocate() on a single inode. The kernel died a
> horrible OOM killer death at around 2.5 billion extents because the
> extent list cannot be reclaimed from memory while the inode is in
> use, and the kernel ran out of all other memory it could reclaim as
> the extent list grew.
>
> The only way to fix that is to make the extent lists reclaimable
> (i.e. demand paging of the in-memory extent list), and that's a big
> chunk of work that isn't on anyone's radar right now.
>
> > Given your inputs, I am not sure that the fix has high value, but I must
> > say I didn't fully understand your argument.
> > It sounded like
> > "We don't need the fix because we have not seen the problem yet",
> > but I may have misunderstood you.
>
> I hope you now realise that there are much bigger practical
> scalability limitations with extent lists and reflink that will
> manifest in production systems long before we get anywhere near
> billions of extents per inode.
>

I do! And I *really* appreciate the time that you took to explain
it to me (and to everyone).

I'm dropping this series from my xfs-5.10.y queue.

Thanks,
Amir.
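
P.S. For readers curious what "using fallocate to do it as fast as
possible" can look like, below is a rough sketch of that kind of
worst-case fragmentation generator. It is my own illustration, not
Dave's actual reproducer: preallocating every second filesystem block
keeps each allocation discontiguous in the file, so the extent count
grows by roughly one per fallocate() call without any data IO.

#define _GNU_SOURCE     /* fallocate() */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	const off_t blksz = 4096;   /* assume 4k filesystem blocks */
	long i;
	int fd = open("fragtest.img", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Allocate every second block: each call lands at a file offset
	 * that is not contiguous with the previous one, so it becomes its
	 * own extent record.  100,000 iterations -> ~100,000 extents. */
	for (i = 0; i < 100000; i++) {
		if (fallocate(fd, 0, (off_t)i * 2 * blksz, blksz) < 0) {
			perror("fallocate");
			return 1;
		}
	}
	return 0;
}

Scale the loop count toward billions and you get the unreclaimable
in-memory extent list and the eventual OOM that Dave describes.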