From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2DE0BC004C9 for ; Tue, 7 May 2019 08:56:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id DBCCF204FD for ; Tue, 7 May 2019 08:56:53 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="CN8D/PtC" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726455AbfEGI4w (ORCPT ); Tue, 7 May 2019 04:56:52 -0400 Received: from mail-ua1-f66.google.com ([209.85.222.66]:45187 "EHLO mail-ua1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726236AbfEGI4w (ORCPT ); Tue, 7 May 2019 04:56:52 -0400 Received: by mail-ua1-f66.google.com with SMTP id o33so5709888uae.12 for ; Tue, 07 May 2019 01:56:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to:cc:content-transfer-encoding; bh=VQApjmHwpaaQNOquTIXrY4x7QLkmnlWjYKGf6Uo7rRM=; b=CN8D/PtCDxwzkjWRwradfMKQDbv3snb4TDYXPWAy2X0SyBfdfUaUkmzEG5LG+1Ftxx eNvP98kU+KYmTJYc6BLZ/fFeV8ucqVd1amQxa6Mxo/KxuRhadr2jws/0B1OXMa3/E84w lmhdeLd+8dUM+ZiD6xQW9rN/xwHiwyBg7Hh/sC0CeU4aM8hmTpe5yxhe35xnHHBDFom/ bPs/uAKmxXEPx15sReorZ97WP0s5rvLcNk/4Bu3Ey8HDGiAvsrOVBAcd0NmXJoRYplQH 0ooUf+Bo29dNjO7PEVFdCqAdvu59izHmu7y5oTQDgHpgHOdPXslRIGkRrTEPimCVl5uU QsZg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc:content-transfer-encoding; bh=VQApjmHwpaaQNOquTIXrY4x7QLkmnlWjYKGf6Uo7rRM=; b=qhG2nH/lwct01SpAV+TZLgspxDWzmulswtpaxveNeT4ammBJLxTHEP3scci4DGPwUl JMJRl/IuAQ9vpgpSOsQz9NjgLXG1oADauVYvcS3trehL3DlppYsrL/dW3NiwovDEdVqi w85phcy03XzO9ziyo6SBhKHQcymNlhk4J0VcXD30scZdMm4oINfrwpKCDT8iiU/oAtl2 hsewJZywgIbhkbUQQ1n2AC3MiWu1we/spit5aQTzLeZRLhukqpg0QcP5hZWWxy9i2Seu CGlYlexv7v+wDvsqvEc43LuQwHAGzXT1dqOF6FotRmVGlpiB2GibQx8TVjNbwNjT5Z9m 1AjQ== X-Gm-Message-State: APjAAAVA1Euv5JEyA4But4g6LO5KXVFZER0IpQ8rBINYxqQyAeawAJqH iOoCO6be9VygBPI8C31gAezYQmL4+c2q5m9yPxU= X-Google-Smtp-Source: APXvYqwTtkdCcI+YMqEvmbUh+uJwDEZZvIvWotqxF677Poeqfj/4QaPwZTAtaAVUnq2IYKqVmEc+XMhgqfJ/7auwVtE= X-Received: by 2002:ab0:3351:: with SMTP id h17mr9236098uap.123.1557219411042; Tue, 07 May 2019 01:56:51 -0700 (PDT) MIME-Version: 1.0 References: <20190503010852.10342-1-wqu@suse.com> <1e36e9e2-dbd3-3ab0-b908-25cfdf1d310d@gmx.com> In-Reply-To: Reply-To: fdmanana@gmail.com From: Filipe Manana Date: Tue, 7 May 2019 09:56:40 +0100 Message-ID: Subject: Re: [PATCH RFC] btrfs: reflink: Flush before reflink any extent to prevent NOCOW write falling back to CoW without data reservation To: Qu Wenruo Cc: Qu Wenruo , linux-btrfs Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Mon, May 6, 2019 at 3:04 AM Qu Wenruo wrote: > > [snip] > >> > >> For data writeback, it should only cause sync related failure. > > > > Ok, so please remove the transaction abort comments for next iteration. > > By sync related failure, you mean running dealloc fails with ENOSPC, > > since when it tries to reserve a data extent it fails as it can't find > > any free extent. > > Well, btrfs has some hidden way to fix such problem by itself already. > > At buffered write time, we have the following call chain: > btrfs_buffered_write() > |- btrfs_check_data_free_space() > |- btrfs_alloc_data_chunk_ondemand() > |- need_commit =3D 2; /* We have 2 chance to retry, 1 to flush */ > |- do_chunk_alloc() /* we have no data space available */ > |- if (ret < 0 && ret =3D=3D -ENOSPC) > |- goto commit_trans; /* try to free some space */ > |- commit_trans: > |- need_commit--; > |- if (need_commit > 0) { > |- btrfs_start_delalloc_roots(); > |- btrfs_wait_ordered_roots(); > |- } > > This means, as long as we hit ENOSPC for data space, we will start write > back, thus NODATACOW data will still hit disk as NODATACOW. > > Such hidden behavior along with always-reserve-data-space solves the > problem pretty well. It doesn't solve the problem you reported in the rfc patch. What happens: 1) We have a file with a prealloc extent, that isn't shared 2) We have 0 bytes of available data space (or any amount less then that of the buffered write size) 3) A buffered write happens that falls within a subrange of the prealloc ex= tent. We can't reserve space, we do all those things at btrfs_alloc_data_chunk_ondemand(), but we can't get any data space released, since it's all allocated. Therefore we fall back to nodatacow mode. We dirty the pages, mark the range as dealloc, etc. 4) The reflink happens, for a subrange of the prealloc extent that does not overlap the range of the buffered write. 5) Some time after the reflink, writeback starts for the inode. During the writeback we fallback to COW mode, because the prealloc extent is shared, even if the subrange of the buffered write does not overlap the reflinked subrange. Now the write silently fails with -ENOSPC, and a user doesn't know about it unless it does an fsync after that writeback, which will report the error via filemap_check_wb_err(). > We either: > - reserve data space > Then no matter how it ends, we're OK, although it may end as CoW. > > - Failed to reserve data space > Writeback will be triggered anyway, no way to screw things around. > > Thus this workaround has nothing to fix, but only make certain NODATACOW > reach disk as NODATACOW. > > It makes some NODATACOW behaves more correctly but won't fix any obvious > bug. > > My personal take is to fix any strange behavior even it won't cause any > problem, but the full inode writeback can be performance heavy. > > So my question is, do we really need this anyway? Do we need what? Your patch, that logic at btrfs_alloc_data_chunk_ondemand(), something else? Thanks. > > Thanks, > Qu > > > > >> > >>> I don't recall starting transactions when running dealloc, and failed > >>> to see where after a quick glance to cow_file_range() > >>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means > >>> when starting writeback. > >>> > >>>> > >>>> [CAUSE] > >>>> This is due to the fact that btrfs can only do extent level share ch= eck. > >>>> > >>>> Btrfs can only tell if an extent is shared, no matter if only part o= f the > >>>> extent is shared or not. > >>>> > >>>> So for above script we have: > >>>> - fallocate > >>>> - buffered write > >>>> If we don't have enough data space, we fall back to NOCOW check. > >>>> At this timming, the extent is not shared, we can skip data > >>>> reservation. > >>> > >>> But in the above example we don't fall to nocow mode when doing the > >>> buffered write, as there's plenty of data space available (1Gb - > >>> 24Kb). > >>> You need to update the example. > >> I have to admit that the core part is mostly based on the worst case > >> *assumption*. > >> > >> I'll try to make the case convincing by making it fail directly. > > > > Great, thanks. > > > >> > >>> > >>> > >>>> - reflink > >>>> Now part of the large preallocated extent is shared. > >>>> - delalloc kicks in > >>> > >>> writeback kicks in > >>> > >>>> For the NOCOW range, as the preallocated extent is shared, we need > >>>> to fall back to COW. > >>>> > >>>> [WORKAROUND] > >>>> The workaround is to ensure any buffered write in the related extent= s > >>>> (not the reflink source range) get flushed before reflink. > >>> > >>> not the reflink source range -> not just the reflink source range > >>> > >>>> > >>>> However it's pretty expensive to do a comprehensive check. > >>>> In the reproducer, the reflink source is just a part of a larger > >>> > >>> Again, the reproducer needs to be fixed (yes, I tested it even if it'= s > >>> clear by looking at it that it doesn't trigger the nocow case). > >>> > >>>> preallocated extent, we need to flush all buffered write of that ext= ent > >>>> before reflink. > >>>> Such backward search can be complex and we may not get much benefit = from > >>>> it. > >>>> > >>>> So this patch will just try to flush the whole inode before reflink. > >>> > >>> > >>>> > >>>> Signed-off-by: Qu Wenruo > >>>> --- > >>>> Reason for RFC: > >>>> Flushing an inode just because it's a reflink source is definitely > >>>> overkilling, but I don't have any better way to handle it. > >>>> > >>>> Any comment on this is welcomed. > >>>> --- > >>>> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ > >>>> 1 file changed, 22 insertions(+) > >>>> > >>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > >>>> index 7755b503b348..8caa0edb6fbf 100644 > >>>> --- a/fs/btrfs/ioctl.c > >>>> +++ b/fs/btrfs/ioctl.c > >>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct = file *file, struct file *file_src, > >>>> return ret; > >>>> } > >>>> > >>>> + /* > >>>> + * Workaround to make sure NOCOW buffered write reach disk a= s NOCOW. > >>>> + * > >>>> + * Due to the limit of btrfs extent tree design, we can only= have > >>>> + * extent level share view. Any part of an extent is shared = then the > >>> > >>> Any -> If any > >>> > >>>> + * whole extent is shared and any write into that extent nee= ds to fall > >>> > >>> is -> is considered > >>> > >>>> + * back to COW. > >>> > >>> I would add, something like: > >>> > >>> That is, btrfs' back references do not have a block level granularity= , > >>> they work at the whole extent level. > >>> > >>>> + * > >>>> + * NOCOW buffered write without data space reserved could to= lead to > >>>> + * either data space bytes_may_use underflow (kernel warning= ) or ENOSPC > >>>> + * at delalloc time (transaction abort). > >>> > >>> I would omit the warning and transaction abort parts, that can change > >>> any time. And we have that information in the changelog, so it's not > >>> lost. > >>> > >>>> + * > >>>> + * Here we take a shortcut by flush the whole inode. We coul= d do better > >>>> + * by finding all extents in that range and flush the space = referring > >>>> + * all those extents. > >>>> + * But that's too complex for such corner case. > >>>> + */ > >>>> + filemap_flush(src->i_mapping); > >>>> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, > >>>> + &BTRFS_I(src)->runtime_flags)) > >>>> + filemap_flush(src->i_mapping); > >>> > >>> So a few comments here: > >>> > >>> - why just in the clone part? The dedupe side has the same problem, d= oesn't it? > >> > >> Right. > >> > >>> > >>> - I would move such flushing to btrfs_remap_file_range_prep - this is > >>> where we do the source and target range flush and wait. > >>> > >>> Can you turn the reproducer into an fstests case? > >> > >> Sure. > >> > >> Thanks for the info and all the comment, > >> Qu > >> > >>> > >>> Thanks. > >>> > >>>> + > >>>> /* > >>>> * Lock destination range to serialize with concurrent readp= ages() and > >>>> * source range to serialize with relocation. > >>>> -- > >>>> 2.21.0 > >>>> > >>> > >>> > >> > > > > > --=20 Filipe David Manana, =E2=80=9CWhether you think you can, or you think you can't =E2=80=94 you're= right.=E2=80=9D