Date: Tue, 17 May 2022 11:46:28 +0100
From: Filipe Manana
To: Qu Wenruo
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 2/2] btrfs: send: avoid trashing the page cache
Message-ID: <20220517104628.GA2611561@falcondesktop>
References: <41782eb393b3a3ba47f4a7fce1cbb33433c3f994.1651770555.git.fdmanana@suse.com>
 <50994099-b70e-8307-dd41-ac88784e552c@gmx.com>
 <8ded3794-1a2e-605e-e01a-6c2ecbab1b7e@gmx.com>
 <20220517093906.GA2606208@falcondesktop>
 <4174cbc9-6321-4055-6824-a43f7d222846@gmx.com>
In-Reply-To: <4174cbc9-6321-4055-6824-a43f7d222846@gmx.com>

On Tue, May 17, 2022 at 06:36:57PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/5/17 17:39, Filipe Manana wrote:
> > On Tue, May 17, 2022 at 03:26:14PM +0800, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2022/5/17 14:35, Qu Wenruo wrote:
> > > > 
> > > > 
> > > > On 2022/5/6 01:16, fdmanana@kernel.org wrote:
> > > > > From: Filipe Manana
> > > > > 
> > > > > A send operation reads extent data using the buffered IO path to get
> > > > > the extent data to send in write commands. This is both because it's
> > > > > simple and because it makes use of the generic readahead
> > > > > infrastructure, which results in a massive speedup.
> > > > > 
> > > > > However this fills the page cache with data that, most of the time,
> > > > > is really only used by the send operation - once the write commands
> > > > > are sent, it's not useful to have the data in the page cache anymore.
> > > > > For large snapshots, bringing all data into the page cache
> > > > > eventually leads to the need to evict other data from the page
> > > > > cache that may be more useful for applications (and kernel
> > > > > subsystems).
> > > > > 
> > > > > Even if extents are shared with the subvolume on which a snapshot is
> > > > > based and the data is currently in the page cache due to being read
> > > > > through the subvolume, attempting to read the data through the
> > > > > snapshot will always result in bringing a new copy of the data into
> > > > > another location in the page cache (there's currently no shared
> > > > > memory for shared extents).
> > > > > 
> > > > > So make send evict the data it has read if, when it first opened the
> > > > > inode, its mapping had no pages currently loaded: that is, when
> > > > > inode->i_mapping->nrpages has a value of 0. Do this instead of
> > > > > deciding based on the return value of filemap_range_has_page()
> > > > > before reading an extent, because the generic readahead mechanism
> > > > > may read pages beyond the range we request (and it very often does),
> > > > > which means a call to filemap_range_has_page() would return true due
> > > > > to the readahead that was triggered when processing a previous
> > > > > extent - we don't have a simple way to distinguish this case from
> > > > > the case where the data was brought into the page cache by someone
> > > > > else. So checking for the mapping's number of pages being 0 when we
> > > > > first open the inode is simple, cheap and it generally accomplishes
> > > > > the goal of not trashing the page cache. The only exception is if
> > > > > part of the data was previously loaded into the page cache through
> > > > > the snapshot by some other process; in that case we end up not
> > > > > evicting any data send brings into the page cache, just like before
> > > > > this change - but that is not the common case.
> > > > > 
> > > > > Example scenario, on a box with 32G of RAM:
> > > > > 
> > > > >   $ btrfs subvolume create /mnt/sv1
> > > > >   $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
> > > > > 
> > > > >   $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
> > > > > 
> > > > >   $ free -m
> > > > >                  total        used        free      shared  buff/cache   available
> > > > >   Mem:           31937         186       26866           0        4883       31297
> > > > >   Swap:           8188           0        8188
> > > > > 
> > > > >   # After this we have about 4G less of free memory.
> > > > >   $ btrfs send /mnt/snap1 >/dev/null
> > > > > 
> > > > >   $ free -m
> > > > >                  total        used        free      shared  buff/cache   available
> > > > >   Mem:           31937         186       22814           0        8935       31297
> > > > >   Swap:           8188           0        8188
> > > > > 
> > > > > The same, obviously, applies to an incremental send.
> > > > > 
> > > > > Signed-off-by: Filipe Manana
> > > > 
> > > > Unfortunately, this patch seems to cause subpage cases to fail test
> > > > case btrfs/007. The reproducibility is around 50%, so it's better to
> > > > use "-I 8" to be extra safe.
> > > > 
> > > > And I believe it also causes other send related failures for subpage
> > > > cases.
> > > > 
> > > > I guess it's truncate_inode_pages_range() only truncating full pages,
> > > > and for the subpage case, since one sector is smaller than one page,
> > > > it doesn't work as expected?
> > > 
> > > It looks like that's the case.
> > > 
> > > The following diff fixed the send related bugs here:
> > > (mail client seems to screw up the indent)
> > > 
> > > https://paste.opensuse.org/95871661
> > 
> > That may seem to work often, but it's not really correct, because
> > prev_extent_end is already aligned to the page size, except possibly
> > for the first extent processed.
> 
> You're right, my snippet is really just a dirty hack.
> 
> > 
> > For the case of the first processed extent not starting at an offset
> > that is page size aligned, we also need to round down that offset so
> > that we evict the page (and not simply zero out part of its content).
> > 
> > Can you try this instead: https://pastebin.com/raw/kRPZKYxG ?
> 
> Tested with btrfs/007 and it passed without problems (8 runs).
> 
> Also tested it with the full send group, no regression found here.
> 
> 
> And with this fix, the following subpage failures now all pass:
> 
> - btrfs/007
> - btrfs/016
> - btrfs/025
> - btrfs/034
> - btrfs/097
> - btrfs/149
> - btrfs/252

Thanks for testing and the report!

I'll send a v2 with that patch incorporated (and a few comment fixes).
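For anyone following along, the idea in that change is roughly the
following (an untested sketch written against this patch; the
page_cache_clear_start field name is illustrative only, v2 may end up
looking different):

    /*
     * Sketch only: track the start of the range to evict separately
     * from prev_extent_end and round it down to the page size. That
     * way, a first extent starting at an offset that is not page
     * aligned gets its whole first page evicted, instead of having
     * part of that page zeroed by truncate_inode_pages_range() -
     * which is what broke the subpage (sectorsize < PAGE_SIZE) case.
     */
    if (sctx->cur_inode == NULL) {
        ...
        sctx->clean_page_cache = (sctx->cur_inode->i_mapping->nrpages == 0);
        sctx->page_cache_clear_start = round_down(offset, PAGE_SIZE);
    }
    ...
    if (sctx->clean_page_cache) {
        const u64 end = round_up(offset + len, PAGE_SIZE);

        truncate_inode_pages_range(&sctx->cur_inode->i_data,
                                   sctx->page_cache_clear_start,
                                   end - 1);
        sctx->page_cache_clear_start = end;
    }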
> 
> Thanks,
> Qu
> 
> > 
> > Thanks.
> > 
> > > 
> > > Thanks,
> > > Qu
> > > > 
> > > > If needed, I can provide you access to my aarch64 VM for debugging.
> > > > 
> > > > Thanks,
> > > > Qu
> > > > 
> > > > > ---
> > > > >  fs/btrfs/send.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++--
> > > > >  1 file changed, 77 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> > > > > index 55275ba90cb4..d899049dea53 100644
> > > > > --- a/fs/btrfs/send.c
> > > > > +++ b/fs/btrfs/send.c
> > > > > @@ -137,6 +137,8 @@ struct send_ctx {
> > > > >       */
> > > > >      struct inode *cur_inode;
> > > > >      struct file_ra_state ra;
> > > > > +    u64 prev_extent_end;
> > > > > +    bool clean_page_cache;
> > > > > 
> > > > >      /*
> > > > >       * We process inodes by their increasing order, so if before an
> > > > > @@ -5157,6 +5159,28 @@ static int send_extent_data(struct send_ctx *sctx,
> > > > >          }
> > > > >          memset(&sctx->ra, 0, sizeof(struct file_ra_state));
> > > > >          file_ra_state_init(&sctx->ra, sctx->cur_inode->i_mapping);
> > > > > +
> > > > > +        /*
> > > > > +         * It's very likely there are no pages from this inode in the page
> > > > > +         * cache, so after reading extents and sending their data, we clean
> > > > > +         * the page cache to avoid trashing the page cache (adding pressure
> > > > > +         * to the page cache and forcing eviction of other data more useful
> > > > > +         * for applications).
> > > > > +         *
> > > > > +         * We decide if we should clean the page cache simply by checking
> > > > > +         * if the inode's mapping nrpages is 0 when we first open it, and
> > > > > +         * not by using something like filemap_range_has_page() before
> > > > > +         * reading an extent, because when we ask the readahead code to
> > > > > +         * read a given file range, it may (and almost always does) read
> > > > > +         * pages from beyond that range (see the documentation for
> > > > > +         * page_cache_sync_readahead()), so it would not be reliable:
> > > > > +         * after reading the first extent, future calls to
> > > > > +         * filemap_range_has_page() would return true because the readahead
> > > > > +         * on the previous extent resulted in reading pages of the current
> > > > > +         * extent as well.
> > > > > +         */
> > > > > +        sctx->clean_page_cache = (sctx->cur_inode->i_mapping->nrpages == 0);
> > > > > +        sctx->prev_extent_end = offset;
> > > > >      }
> > > > > 
> > > > >      while (sent < len) {
> > > > > @@ -5168,6 +5192,33 @@ static int send_extent_data(struct send_ctx *sctx,
> > > > >              return ret;
> > > > >          sent += size;
> > > > >      }
> > > > > +
> > > > > +    if (sctx->clean_page_cache) {
> > > > > +        const u64 end = round_up(offset + len, PAGE_SIZE);
> > > > > +
> > > > > +        /*
> > > > > +         * Always start from the end offset of the last processed extent.
> > > > > +         * This is because the readahead code may (and very often does)
> > > > > +         * read pages beyond the range we request for readahead. So if
> > > > > +         * we have an extent layout like this:
> > > > > +         *
> > > > > +         *            [ extent A ] [ extent B ] [ extent C ]
> > > > > +         *
> > > > > +         * When we ask page_cache_sync_readahead() to read extent A, it
> > > > > +         * may also trigger reads for pages of extent B. If we are doing
> > > > > +         * an incremental send and extent B has not changed between the
> > > > > +         * parent and send snapshots, some or all of its pages may end
> > > > > +         * up being read and placed in the page cache. So when truncating
> > > > > +         * the page cache we always start from the end offset of the
> > > > > +         * previously processed extent up to the end of the current
> > > > > +         * extent.
> > > > > +         */
> > > > > +        truncate_inode_pages_range(&sctx->cur_inode->i_data,
> > > > > +                       sctx->prev_extent_end,
> > > > > +                       end - 1);
> > > > > +        sctx->prev_extent_end = end;
> > > > > +    }
> > > > > +
> > > > >      return 0;
> > > > >  }
> > > > > 
> > > > > @@ -6172,6 +6223,30 @@ static int btrfs_unlink_all_paths(struct send_ctx *sctx)
> > > > >      return ret;
> > > > >  }
> > > > > 
> > > > > +static void close_current_inode(struct send_ctx *sctx)
> > > > > +{
> > > > > +    u64 i_size;
> > > > > +
> > > > > +    if (sctx->cur_inode == NULL)
> > > > > +        return;
> > > > > +
> > > > > +    i_size = i_size_read(sctx->cur_inode);
> > > > > +
> > > > > +    /*
> > > > > +     * If we are doing an incremental send, we may have extents between the
> > > > > +     * last processed extent and the i_size that have not been processed
> > > > > +     * because they haven't changed but we may have read some of their pages
> > > > > +     * through readahead, see the comments at send_extent_data().
> > > > > +     */
> > > > > +    if (sctx->clean_page_cache && sctx->prev_extent_end < i_size)
> > > > > +        truncate_inode_pages_range(&sctx->cur_inode->i_data,
> > > > > +                       sctx->prev_extent_end,
> > > > > +                       round_up(i_size, PAGE_SIZE) - 1);
> > > > > +
> > > > > +    iput(sctx->cur_inode);
> > > > > +    sctx->cur_inode = NULL;
> > > > > +}
> > > > > +
> > > > >  static int changed_inode(struct send_ctx *sctx,
> > > > >               enum btrfs_compare_tree_result result)
> > > > >  {
> > > > > @@ -6182,8 +6257,7 @@ static int changed_inode(struct send_ctx *sctx,
> > > > >      u64 left_gen = 0;
> > > > >      u64 right_gen = 0;
> > > > > 
> > > > > -    iput(sctx->cur_inode);
> > > > > -    sctx->cur_inode = NULL;
> > > > > +    close_current_inode(sctx);
> > > > > 
> > > > >      sctx->cur_ino = key->objectid;
> > > > >      sctx->cur_inode_new_gen = 0;
> > > > > @@ -7671,7 +7745,7 @@ long btrfs_ioctl_send(struct inode *inode, struct btrfs_ioctl_send_args *arg)
> > > > > 
> > > > >          name_cache_free(sctx);
> > > > > 
> > > > > -        iput(sctx->cur_inode);
> > > > > +        close_current_inode(sctx);
> > > > > 
> > > > >          kfree(sctx);
> > > > >      }
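For reference, one way to reproduce the subpage failures listed above
(illustrative only; assumes an fstests checkout already configured for
btrfs, on a kernel whose page size is larger than the filesystem sector
size, e.g. 64K pages with a 4K sector size):

    # Run btrfs/007 up to 8 times, stopping at the first failure,
    # since the failure only reproduces about 50% of the time.
    $ cd xfstests-dev
    $ ./check -I 8 btrfs/007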