From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D40DC004C9 for ; Tue, 7 May 2019 11:37:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id EA473205ED for ; Tue, 7 May 2019 11:37:02 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="fh4sjnSt" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726572AbfEGLhB (ORCPT ); Tue, 7 May 2019 07:37:01 -0400 Received: from mail-ua1-f68.google.com ([209.85.222.68]:35759 "EHLO mail-ua1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725859AbfEGLhB (ORCPT ); Tue, 7 May 2019 07:37:01 -0400 Received: by mail-ua1-f68.google.com with SMTP id g16so5879186uad.2 for ; Tue, 07 May 2019 04:37:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to:cc:content-transfer-encoding; bh=iO6xdgL+NmTi6/tMHQhaTaXs/P7Hs1Vqj1E+VqZbgbI=; b=fh4sjnStHVHdw0rENa3MijivdGS2RR4oiCxM9eI7D94xHy6+4E+w04khCZzaCnqfA4 ZRziLEFOaT2Q8DQ0FlbkxXVyghXNRkYE2MsHvrN/eMdrd8Vn0uq8JVbMXlDCyU/p8KGZ KZLVZi1Gp1nbvTjpzJX3kAovjGyu++eqnpEL6nLD0tG8cObKWHTaY91lpQ64IcIp/pVL dTlvekmrGmrHyIx4OjofodOSzU2pQmkk3eQY91sMCXBPC6fJuaozvEDer1nXv4PtQgrM tspBPxzFCgleBF79o7X5xe6E0OUf8ObX7k+1yUYxPZXGLCzMNbbH0uAURk63jZrXeZWk hELw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc:content-transfer-encoding; bh=iO6xdgL+NmTi6/tMHQhaTaXs/P7Hs1Vqj1E+VqZbgbI=; b=s/ExBv2cJgBhKqBnoSTRcPwgN2aSjOM7naI5SjYDSiJqARORPE73idALJX58AOF5hA DfKHDDhTiZp7fAQCE58MB9cQM4uneYgxsJFYzIEwVMoJ0NOp7zuPBLYKt9HVtAEMOpHi GWAN1SctYaIYM4mOljhiZ6iyaozF9H4aBEiPNzqVD//a3+6R4E7pxuhprLlvjT0kSzVa FvRcD6EbYazXgaw8XQxaU2lZijhBA2361gb1dQN/wgVZiP0hUvyKTmjWRD2oiSGb3ZqH 5aoRxQHCvelylZ8nCM1P+sEBUzcBPneqCNyaVUbGdk4Bu0dywHPGoE/H6V4368HEuopS chEA== X-Gm-Message-State: APjAAAVWjU/0dCH7lo0dffAEH5sC72yCS8IhUK6b1okrIXg+L/1oryZ7 py8ZikvjpiRzvqqPoFCYiIo5pGcUK0nsULrmGn8A/Fgv X-Google-Smtp-Source: APXvYqyTpbF5/3wNv/WkEG7PoUmnmjBYPAOtK+Kks6qKmZCnna68krqxU57GgxjAlDruwhzCvFljQIivBxnWVllSiBQ= X-Received: by 2002:ab0:3351:: with SMTP id h17mr9606095uap.123.1557229019457; Tue, 07 May 2019 04:36:59 -0700 (PDT) MIME-Version: 1.0 References: <20190503010852.10342-1-wqu@suse.com> <1e36e9e2-dbd3-3ab0-b908-25cfdf1d310d@gmx.com> <028e58f3-99a6-f4bc-51f0-12eafee92c76@gmx.com> In-Reply-To: <028e58f3-99a6-f4bc-51f0-12eafee92c76@gmx.com> Reply-To: fdmanana@gmail.com From: Filipe Manana Date: Tue, 7 May 2019 12:36:47 +0100 Message-ID: Subject: Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation To: Qu Wenruo Cc: Qu Wenruo , linux-btrfs Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Tue, May 7, 2019 at 12:13 PM Qu Wenruo wrote: > > > > On 2019/5/7 =E4=B8=8B=E5=8D=884:56, Filipe Manana wrote: > > On Mon, May 6, 2019 at 3:04 AM Qu Wenruo wrote= : > >> > >> [snip] > >>>> > >>>> For data writeback, it should only cause sync related failure. > >>> > >>> Ok, so please remove the transaction abort comments for next iteratio= n. > >>> By sync related failure, you mean running dealloc fails with ENOSPC, > >>> since when it tries to reserve a data extent it fails as it can't fin= d > >>> any free extent. > >> > >> Well, btrfs has some hidden way to fix such problem by itself already. > >> > >> At buffered write time, we have the following call chain: > >> btrfs_buffered_write() > >> |- btrfs_check_data_free_space() > >> |- btrfs_alloc_data_chunk_ondemand() > >> |- need_commit =3D 2; /* We have 2 chance to retry, 1 to flush *= / > >> |- do_chunk_alloc() /* we have no data space available */ > >> |- if (ret < 0 && ret =3D=3D -ENOSPC) > >> |- goto commit_trans; /* try to free some space */ > >> |- commit_trans: > >> |- need_commit--; > >> |- if (need_commit > 0) { > >> |- btrfs_start_delalloc_roots(); > >> |- btrfs_wait_ordered_roots(); > >> |- } > >> > >> This means, as long as we hit ENOSPC for data space, we will start wri= te > >> back, thus NODATACOW data will still hit disk as NODATACOW. > >> > >> Such hidden behavior along with always-reserve-data-space solves the > >> problem pretty well. > > > > It doesn't solve the problem you reported in the rfc patch. > > You're right, it doesn't solve the problem at all. > In fact, another bug caused my test script to pass even with some dirty > pages unable to be flushed back. > > But it at least make sure all other pages reach disk as NODATACOW except > the last page. > > > > > What happens: > > > > 1) We have a file with a prealloc extent, that isn't shared > > > > 2) We have 0 bytes of available data space (or any amount less then > > that of the buffered write size) > > > > 3) A buffered write happens that falls within a subrange of the preallo= c extent. > > We can't reserve space, we do all those things at > > btrfs_alloc_data_chunk_ondemand(), but we can't get any data space > > released, since it's all allocated. > > At that time, we're already flushing all previously buffered write data. > > E.g. if we're writing into one 1M preallocated extent. > The first 4K, we have no data space reserved, dirtied the page, prepare > all delalloc. > > Then the 2nd 4K, we call btrfs_check_data_free_space(), as we're at low > data free space already, we flush all inodes, including the previous 4K > we just dirtied. > Then the first 4K get written to disk NODATACOW, as expected. > > This loop happens until we reach the last page. We fragment a buffered write into several chunks depending on its size, the size of a page, the size of a pointer, etc. We will hardly iterate for each page alone. If your point is that btrfs_check_data_free_space() can be called multiple times for a single buffered write, yes, that's true if it's large than a certain threshold, but that will hardly be for each individual page, typically it's much more than that. > > > Therefore we fall back to nodatacow mode. We dirty the pages, mark > > the range as dealloc, etc. > > > > 4) The reflink happens, for a subrange of the prealloc extent that > > does not overlap the range of the buffered write. > > Just before the reflink, we only have 1 dirty page (the last page of > that buffered write) doesn't reach disk yet. > > For the final page, we have no choice but do COW, and it fails with -ENOS= PC. Yes, isn't that what I said? > > However due to some other problem, the -ENOSPC doesn't reach user space > at all. Yes, isn't that what I said? Yes... Same as the snapshotting case Robbie fixed. No news here. Same as many other silent data loss issues I fixed with fsync a few times sometime ago as well. > > > > > > 5) Some time after the reflink, writeback starts for the inode. > > During the writeback we fallback to COW mode, because the prealloc > > extent is shared, even if the subrange of the buffered write does not > > overlap the reflinked subrange. > > Now the write silently fails with -ENOSPC, and a user doesn't know > > about it unless it does an fsync after that writeback, which will > > report the error via filemap_check_wb_err(). > > > >> We either: > >> - reserve data space > >> Then no matter how it ends, we're OK, although it may end as CoW. > >> > >> - Failed to reserve data space > >> Writeback will be triggered anyway, no way to screw things around. > >> > >> Thus this workaround has nothing to fix, but only make certain NODATAC= OW > >> reach disk as NODATACOW. > >> > >> It makes some NODATACOW behaves more correctly but won't fix any obvio= us > >> bug. > >> > >> My personal take is to fix any strange behavior even it won't cause an= y > >> problem, but the full inode writeback can be performance heavy. > >> > >> So my question is, do we really need this anyway? > > > > Do we need what? Your patch, that logic at > > btrfs_alloc_data_chunk_ondemand(), something else? > > I meant the patch, but the deeper I dig into the problem, more problem I > found. > > The patch is still needed, but there is a more important bug, that > btrfs_run_delalloc_range() failure won't be reported in sync. How could it report a failure? sync calls sync(2), which returns void, meaning it always succeeds and has no way to report errors. See https://linux.die.net/man/2/sync Error tracking and reporting was precisely one of main goals of filemap_check_wb_err() and all the related work Jeff Layton did sometime ago. > > The script here I'm using is: > ------ > #!/bin/bash > > dev=3D/dev/test/test > mnt=3D/mnt/btrfs > > #mkfs.btrfs -f $dev -b 1G > /dev/null > #mount $dev $mnt -o nospace_cache > > umount $mnt &> /dev/null > umount $dev &> /dev/null > > dmesg -C > mkfs.btrfs -f $dev -b 512M > /dev/null > > mount $dev $mnt -o nospace_cache > > xfs_io -f -c "falloc 8k 64m" $mnt/file1 > xfs_io -f -c "pwrite 0 -b 4k 370M" $mnt/padding > > sync > btrfs fi df $mnt --raw > > xfs_io -c "pwrite 1m 16m" $mnt/file1 > echo "nodatacow write finished" >> /dev/kmsg > xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1 > echo "reflink finished" >> /dev/kmsg > sync > echo "sync finished ret=3D$?" >> /dev/kmsg > umount $dev > ------ > > As describe, the last write at 17821696 (17M - 4K) will fail due to ENOSP= C. > But the sync succeeded without reporting any problem. > > Thanks, > Qu > > > > > Thanks. > > > >> > >> Thanks, > >> Qu > >> > >>> > >>>> > >>>>> I don't recall starting transactions when running dealloc, and fail= ed > >>>>> to see where after a quick glance to cow_file_range() > >>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' mean= s > >>>>> when starting writeback. > >>>>> > >>>>>> > >>>>>> [CAUSE] > >>>>>> This is due to the fact that btrfs can only do extent level share = check. > >>>>>> > >>>>>> Btrfs can only tell if an extent is shared, no matter if only part= of the > >>>>>> extent is shared or not. > >>>>>> > >>>>>> So for above script we have: > >>>>>> - fallocate > >>>>>> - buffered write > >>>>>> If we don't have enough data space, we fall back to NOCOW check. > >>>>>> At this timming, the extent is not shared, we can skip data > >>>>>> reservation. > >>>>> > >>>>> But in the above example we don't fall to nocow mode when doing the > >>>>> buffered write, as there's plenty of data space available (1Gb - > >>>>> 24Kb). > >>>>> You need to update the example. > >>>> I have to admit that the core part is mostly based on the worst case > >>>> *assumption*. > >>>> > >>>> I'll try to make the case convincing by making it fail directly. > >>> > >>> Great, thanks. > >>> > >>>> > >>>>> > >>>>> > >>>>>> - reflink > >>>>>> Now part of the large preallocated extent is shared. > >>>>>> - delalloc kicks in > >>>>> > >>>>> writeback kicks in > >>>>> > >>>>>> For the NOCOW range, as the preallocated extent is shared, we ne= ed > >>>>>> to fall back to COW. > >>>>>> > >>>>>> [WORKAROUND] > >>>>>> The workaround is to ensure any buffered write in the related exte= nts > >>>>>> (not the reflink source range) get flushed before reflink. > >>>>> > >>>>> not the reflink source range -> not just the reflink source range > >>>>> > >>>>>> > >>>>>> However it's pretty expensive to do a comprehensive check. > >>>>>> In the reproducer, the reflink source is just a part of a larger > >>>>> > >>>>> Again, the reproducer needs to be fixed (yes, I tested it even if i= t's > >>>>> clear by looking at it that it doesn't trigger the nocow case). > >>>>> > >>>>>> preallocated extent, we need to flush all buffered write of that e= xtent > >>>>>> before reflink. > >>>>>> Such backward search can be complex and we may not get much benefi= t from > >>>>>> it. > >>>>>> > >>>>>> So this patch will just try to flush the whole inode before reflin= k. > >>>>> > >>>>> > >>>>>> > >>>>>> Signed-off-by: Qu Wenruo > >>>>>> --- > >>>>>> Reason for RFC: > >>>>>> Flushing an inode just because it's a reflink source is definitely > >>>>>> overkilling, but I don't have any better way to handle it. > >>>>>> > >>>>>> Any comment on this is welcomed. > >>>>>> --- > >>>>>> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ > >>>>>> 1 file changed, 22 insertions(+) > >>>>>> > >>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > >>>>>> index 7755b503b348..8caa0edb6fbf 100644 > >>>>>> --- a/fs/btrfs/ioctl.c > >>>>>> +++ b/fs/btrfs/ioctl.c > >>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struc= t file *file, struct file *file_src, > >>>>>> return ret; > >>>>>> } > >>>>>> > >>>>>> + /* > >>>>>> + * Workaround to make sure NOCOW buffered write reach disk= as NOCOW. > >>>>>> + * > >>>>>> + * Due to the limit of btrfs extent tree design, we can on= ly have > >>>>>> + * extent level share view. Any part of an extent is share= d then the > >>>>> > >>>>> Any -> If any > >>>>> > >>>>>> + * whole extent is shared and any write into that extent n= eeds to fall > >>>>> > >>>>> is -> is considered > >>>>> > >>>>>> + * back to COW. > >>>>> > >>>>> I would add, something like: > >>>>> > >>>>> That is, btrfs' back references do not have a block level granulari= ty, > >>>>> they work at the whole extent level. > >>>>> > >>>>>> + * > >>>>>> + * NOCOW buffered write without data space reserved could = to lead to > >>>>>> + * either data space bytes_may_use underflow (kernel warni= ng) or ENOSPC > >>>>>> + * at delalloc time (transaction abort). > >>>>> > >>>>> I would omit the warning and transaction abort parts, that can chan= ge > >>>>> any time. And we have that information in the changelog, so it's no= t > >>>>> lost. > >>>>> > >>>>>> + * > >>>>>> + * Here we take a shortcut by flush the whole inode. We co= uld do better > >>>>>> + * by finding all extents in that range and flush the spac= e referring > >>>>>> + * all those extents. > >>>>>> + * But that's too complex for such corner case. > >>>>>> + */ > >>>>>> + filemap_flush(src->i_mapping); > >>>>>> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, > >>>>>> + &BTRFS_I(src)->runtime_flags)) > >>>>>> + filemap_flush(src->i_mapping); > >>>>> > >>>>> So a few comments here: > >>>>> > >>>>> - why just in the clone part? The dedupe side has the same problem,= doesn't it? > >>>> > >>>> Right. > >>>> > >>>>> > >>>>> - I would move such flushing to btrfs_remap_file_range_prep - this = is > >>>>> where we do the source and target range flush and wait. > >>>>> > >>>>> Can you turn the reproducer into an fstests case? > >>>> > >>>> Sure. > >>>> > >>>> Thanks for the info and all the comment, > >>>> Qu > >>>> > >>>>> > >>>>> Thanks. > >>>>> > >>>>>> + > >>>>>> /* > >>>>>> * Lock destination range to serialize with concurrent rea= dpages() and > >>>>>> * source range to serialize with relocation. > >>>>>> -- > >>>>>> 2.21.0 > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > --=20 Filipe David Manana, =E2=80=9CWhether you think you can, or you think you can't =E2=80=94 you're= right.=E2=80=9D