From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.3 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF1C0C07E95 for ; Fri, 16 Jul 2021 12:42:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B5051613D3 for ; Fri, 16 Jul 2021 12:42:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234357AbhGPMpG (ORCPT ); Fri, 16 Jul 2021 08:45:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46448 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232757AbhGPMpG (ORCPT ); Fri, 16 Jul 2021 08:45:06 -0400 Received: from mail-pl1-x629.google.com (mail-pl1-x629.google.com [IPv6:2607:f8b0:4864:20::629]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F0956C06175F for ; Fri, 16 Jul 2021 05:42:08 -0700 (PDT) Received: by mail-pl1-x629.google.com with SMTP id c15so5245402pls.13 for ; Fri, 16 Jul 2021 05:42:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:references:from:cc:message-id:date:user-agent :mime-version:in-reply-to:content-transfer-encoding:content-language; bh=Y+hwvs8SJAd1L4djBW0tqYsDBgCOOMJ34MaJFGvi2HM=; b=uhvtxlyxNchiF5WzdMu5HCypdJVcmn32sy6ArIJB2cy0jsWdayWZIwak6B4PTxpprl TGb3l2QbQL97nLwiiqy9DdEdXp2Z77M+pOZVU9TM0zBop4pSRn2BrG7dP8ochNCudjvs 4ZTVzSgPkGCLmXVIG1nEtyMmgH9P3SVIi6vuJV/rLkOgfkecmFHEjzr28WF1fWsDYJeC jXua/Ig+QbIVQSJHEY/qMBLTUOoZtuMpmlilGMg4CK2/5QgpvGSZcxYDxJC7Uf3nTfV8 7D5nipKrk9FAqJp7suf3WI6/3+qmJtlpseE943mf43cRHhXenRW9faehVulQmocQ5K0W 2Cyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:references:from:cc:message-id:date :user-agent:mime-version:in-reply-to:content-transfer-encoding :content-language; bh=Y+hwvs8SJAd1L4djBW0tqYsDBgCOOMJ34MaJFGvi2HM=; b=dErC9Wla6ikrLXXulh5Y6U85b8L3I/wqOqwJhgrYUddWgBguYOkH4J5/LVRQw/LxYI 2gGxSHTzOgSK5BWP6hjXVuQwrf3XK4ebF/7SNThtZfcMSEsMh4lyeyeJUpPbHfSm84Ld a/uBgE37c2ORa0G4KDcF1pM9CMuBIg8vdMYmu5LMOwyuZIbvsQIcKdIPd/atu+TBaGbN MEYpJx6NBj7ABMykEq9vz2+psyMAGd/hLDmtav0MVpby/344OTGqamah0ZRW058wk4Pq YnSlfmN/zImpKT5cDNk5VkJINZmVBDz6rCi2pL0to9qguG7sPKFIuSVUXQEs/7uCUFjm yDFg== X-Gm-Message-State: AOAM533uC1Fb3CrJ1X3f5cdLb01lzisvMK9outPmkfZw4/Nntxrm1A9B A0k+Kovb7X43GPmUjkgg9g4= X-Google-Smtp-Source: ABdhPJxI+f3IWBDWIJ+qmKivna71dQu0DSnHOp+OSuQ8XQAMUFJ2BspnvSWQyN+mXN/5VHtlMOip0g== X-Received: by 2002:a17:90b:787:: with SMTP id l7mr15466206pjz.162.1626439328274; Fri, 16 Jul 2021 05:42:08 -0700 (PDT) Received: from [192.168.178.53] (60-242-36-91.tpgi.com.au. [60.242.36.91]) by smtp.gmail.com with ESMTPSA id k189sm11259739pgk.14.2021.07.16.05.42.06 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 16 Jul 2021 05:42:07 -0700 (PDT) Subject: Re: migrating to space_cache=2 and btrfs userspace commands To: Joshua , linux-btrfs@vger.kernel.org References: <63396688-0dc7-17c5-a830-5893b030a30f@gmail.com> <86f0624a-cba4-58a3-0a80-460d3f12e8b3@gmx.com> <0b4cf70fc883e28c97d893a3b2f81b11@mailmag.net> From: DanglingPointer Cc: danglingpointerexception@gmail.com Message-ID: <1e09cc8e-7100-a084-9542-e2f734cb33fa@gmail.com> Date: Fri, 16 Jul 2021 22:42:04 +1000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0 MIME-Version: 1.0 In-Reply-To: <0b4cf70fc883e28c97d893a3b2f81b11@mailmag.net> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-AU Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Hi Joshua, on that system where you tried to run the "--clear-space-cache v1", when you gave up, did you continue using "space_cache=v2" on it? Here are some more questions to add for anyone who can help educate us:... Why would someone want to clear the space_cache v1? What's the value of clearing the previous version space_cache before using the new version? Why "clear" and not just "delete"?  Won't "deleting" the whole previous space_cache files, blocks, whatever in the filesystem be faster then doing whatever "clear" does? Am I missing out on something by not attempting to clear the previous version space_cache before using the new v2 version? On 16/7/21 3:51 am, Joshua wrote: > Just as a point of data, I have a 96 TB array with RAID1 data, and RAID1C3 metadata. > > I made the switch to space_cache=v2 some time ago, and I remember it made a huge difference when I did so! > (It was RAID1 metadata at the time, as RAID1C3 was not available at the time.) > > > However, I also tried a check with '--clear-space-cache v1' at the time, and after waiting a literal whole day without it completing, I gave up, canceled it, and put it back into production. Is a --clear-space-cache v1 operation expected to take so long on such a large file system? > > Thanks! > --Joshua Villwock > > > > July 15, 2021 9:40 AM, "DanglingPointer" wrote: > >> Hi Qu, >> >> Just updating here that setting the mount option "space_cache=v2" and "noatime" completely SOLVED >> the performance problem! >> Basically like night and day! >> >> These are my full fstab mount options... >> >> btrfs defaults,autodefrag,space_cache=v2,noatime 0 2 >> >> Perhaps defaulting the space_cache=v2 should be considered? Why default to v1, what's the value of >> v1? >> >> So for conclusion, for large multi-terrabyte arrays (in my case RAID5s), setting space_cache=v2 and >> noatime massively increases performance and eliminates the large long pauses in frequent intervals >> by "btrfs-transacti" blocking all IO. >> >> Thanks Qu for your help! >> >> On 14/7/21 5:45 pm, Qu Wenruo wrote: >> >>> On 2021/7/14 下午3:18, DanglingPointer wrote: >>>> a) "echo l > /proc/sysrq-trigger" >>>> >>>> The backup finished today already unfortunately and we are unlikely to >>>> run it again until we get an outage to remount the array with the >>>> space_cache=v2 and noatime mount options. >>>> Thanks for the command, we'll definitely use it if/when it happens again >>>> on the next large migration of data. >>> Just to avoid confusion, after that command, "dmesg" output is still >>> needed, as that's where sysrq put its output. >>>> b) "sudo btrfs qgroup show -prce" ........ >>>> >>>> $ ERROR: can't list qgroups: quotas not enabled >>>> >>>> So looks like it isn't enabled. >>> One less thing to bother. >>>> File sizes are between: 1,048,576 bytes and 16,777,216 bytes (Duplicacy >>>> backup defaults) >>> Between 1~16MiB, thus tons of small files. >>> >>> Btrfs is not really good at handling tons of small files, as they >>> generate a lot of metadata. >>> >>> That may contribute to the hang. >>> >>>> What classifies as a transaction? >>> It's a little complex. >>> >>> Technically it's a check point where before the checkpoint, all you see >>> is old data, after the checkpoint, all you see is new data. >>> >>> To end users, any data and metadata write will be included into one >>> transaction (with proper dependency handled). >>> >>> One way to finish (or commit) current transaction is to sync the fs, >>> using "sync" command (sync all filesystems). >>> >>>> Any/All writes done in a 30sec >>>> interval? >>> This the default commit interval. Almost all fses will try to commit its >>> data/metadata to disk after a configurable interval. >>> >>> The default one is 30s. That's also one way to commit current > transaction. >>> >>>> If 100 unique files were written in 30secs, is that 1 >>>> transaction or 100 transactions? >>> It depends. As things like syncfs() and subvolume/snapshot creation may >>> try to commit transaction. >>> >>> But without those special operations, just writing 100 unique files >>> using buffered write, it would only start one transaction, and when the >>> 30s interval get hit, the transaction will be committed to disk. >>> >>>> Millions of files of the size range >>>> above were backed up. >>> The amount of files may not force a transaction commit, if it doesn't >>> trigger enough memory pressure, or free space pressure. >>> >>> Anyway, the "echo l" sysrq would help us to locate what's taking so long >>> time. >>> >>>> c) "Just mount with "space_cache=v2"" >>>> >>>> Ok so no need to "clear_cache" the v1 cache, right? >>> Yes, and "clear_cache" won't really remove all the v1 cache anyway. >>> >>> Thus it doesn't help much. >>> >>> The only way to fully clear v1 cache is by using "btrfs check >>> --clear-space-cache v1" on a *unmounted* btrfs. >>> >>>> I wrote this in the fstab but hadn't remounted yet until I can get an >>>> outage.... >>> IMHO if you really want to test if v2 would help, you can just remount, >>> no need to wait for a break. >>> >>> Thanks, >>> Qu >>>> ..."btrfs defaults,autodefrag,clear_cache,space_cache=v2,noatime 0 2 > >>>> Thanks again for your help Qu! >>>> >>>> On 14/7/21 2:59 pm, Qu Wenruo wrote: >>> On 2021/7/13 下午11:38, DanglingPointer wrote: >>> We're currently considering switching to "space_cache=v2" with noatime >>> mount options for my lab server-workstations running RAID5. >>> >>> Btrfs RAID5 is unsafe due to its write-hole problem. >>> >>> * One has 13TB of data/metadata in a bunch of 6TB and 2TB disks >>> totalling 26TB. >>> * Another has about 12TB data/metadata in uniformly sized 6TB disks >>> totalling 24TB. >>> * Both of the arrays are on individually luks encrypted disks with >>> btrfs on top of the luks. >>> * Both have "defaults,autodefrag" turned on in fstab. >>> >>> We're starting to see large pauses during constant backups of millions >>> of chunk files (using duplicacy backup) in the 24TB array. >>> >>> Pauses sometimes take up to 20+ seconds in frequencies after every >>> ~30secs of the end of the last pause. "btrfs-transacti" process >>> consistently shows up as the blocking process/thread locking up >>> filesystem IO. IO gets into the RAID5 array via nfsd. There are no >>>> disk >>> or btrfs errors recorded. scrub last finished yesterday successfully. >>> >>> Please provide the "echo l > /proc/sysrq-trigger" output when such >>> pause >>> happens. >>> >>> If you're using qgroup (may be enabled by things like snapper), it may >>> be the cause, as qgroup does its accounting when committing >>> transaction. >>> >>> If one transaction is super large, it can cause such problem. >>> >>> You can test if qgroup is enabled by: >>> >>> # btrfs qgroup show -prce >>> >>> After doing some research around the internet, we've come to the >>> consideration above as described. Unfortunately the official >>> documentation isn't clear on the following. >>> >>> Official documentation URL - >>> https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5) >>> >>> 1. How to migrate from default space_cache=v1 to space_cache=v2? It >>> talks about the reverse, from v2 to v1! >>> >>> Just mount with "space_cache=v2". >>> >>> 2. If we use space_cache=v2, is it indeed still the case that the >>> "btrfs" command will NOT work with the filesystem? >>> >>> Why would you think "btrfs" won't work on a btrfs? >>> >>> Thanks, >>> Qu >>> >>> So will our >>> "btrfs scrub start /mount/point/..." cron jobs FAIL? I'm guessing >>> the btrfs command comes from btrfs-progs which is currently >>>> v5.4.1-2 >>> amd64, is that correct? >>> 3. Any other ideas on how we can get rid of those annoying pauses with >>> large backups into the array? >>> >>> Thanks in advance! >>> >>> DP