Date: Mon, 13 Apr 2015 16:06:39 +0200
From: Gian-Carlo Pascutto
To: linux-btrfs@vger.kernel.org
Subject: Re: Big disk space usage difference, even after defrag, on identical data

On 13-04-15 07:06, Duncan wrote:
>> So what can explain this? Where did the 66G go?
>
> Out of curiosity, does a balance on the actively used btrfs help?
>
> You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t
> (minimum size file) options. Does adding -f -t1 help?

Unfortunately I can no longer try this - see the other reply for why. But
the problem turned out to be some 1G-sized files, written using 3-5
extents, that for whatever reason defrag was not touching.

> You aren't doing btrfs snapshots of either subvolume, are you?

No :-) I should've mentioned that.

> Defrag should force the rewrite of entire files and take care of this,
> but obviously it's not returning to "clean" state. I forgot what the
> default minimum file size is if -t isn't set, maybe 128 MiB? But a -t1
> will force it to defrag even small files, and I recall at least one
> thread here where the poster said it made all the difference for him, so
> try that. And the -f should force a filesystem sync afterward, so you
> know the numbers from any report you run afterward match the final state.

Reading the corresponding manual page, the -t explanation says that "any
extent bigger than this size will be considered already defragged". So I
guess setting -t1 might've fixed the problem too... but after checking the
source, I'm not so sure.

I didn't find the -t default in the manpages. After browsing through the
source, the default turns out to be in the kernel:

https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L1268

(Not sure what units those are.)

I wonder if this is relevant:

https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L2572

This seems to reset the -t threshold if compress (-c) is set? That looks a
bit fishy to me.

> Meanwhile, you may consider using the nocow attribute on those database
> files. It will disable compression on them,

I'm using btrfs specifically to get compression, so this isn't an option.

> While initial usage will be higher due to the lack of compression,
> as you've discovered, over time, on an actively updated database,
> compression isn't all that effective anyway.

I don't see why. If you're referring to the additional overhead of
continuously compressing and decompressing everything - yes, of course.
But in my case I have a mostly-append workload to a huge amount of fairly
compressible data sitting on magnetic storage, so compression is a win in
disk space and perhaps even in performance.
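For reference, spelled out as commands - what I had been running versus
what Duncan is suggesting (/srv/db is just a stand-in for the actual mount
point):

  # what I ran originally
  btrfs filesystem defragment -v -r -clzo /srv/db

  # the suggestion: add -f (flush) and -t1
  # (though, as noted above, it's not clear to me from the manpage and the
  #  source whether a 1-byte threshold actually widens or narrows what
  #  gets rewritten)
  btrfs filesystem defragment -v -r -f -t1 -clzo /srv/db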
I'm well aware of the many caveats of using btrfs for databases - they're
well documented, and although I much appreciate your extended explanation,
it wasn't new to me. It turns out that if your dataset isn't update-heavy
(so it doesn't fragment much to begin with), or has to be queried via
indexed access (i.e. mostly via random seeks), the fragmentation doesn't
matter much anyway. Conversely, btrfs appears to have better sync
performance with multiple threads, and it allows one to disable part of
the partial-page-write protection logic in the database
(full_page_writes=off for PostgreSQL), because btrfs is already doing the
COW that ensures torn pages can't actually happen [1]. The net result is a
*boost* from about 40 tps (ext4) to 55 tps (btrfs), which certainly is
contrary to popular wisdom. Maybe btrfs would fall off eventually as
fragmentation sets in gradually, but given that there's an offline
defragmentation tool that can run in the background, I don't care.

[1] I wouldn't be too surprised if database COW - journal-writing a copy
of the data out of band, then rewriting it again in its original place -
is functionally equivalent to disabling COW in the database and running
btrfs + defrag. Obviously you shouldn't keep COW enabled in btrfs *AND*
the DB, requiring all data to be copied around at least 3 times... which
I'm afraid almost everyone does, because it's the default...
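For completeness, the setup described above boils down to roughly the
following. The device, mount point and the exact defrag invocation are
placeholders here, not my actual configuration:

  # btrfs volume holding the database, mounted with lzo compression
  mount -o compress=lzo /dev/sdX /srv/pgdata

  # in postgresql.conf, rely on btrfs COW instead of full-page images:
  #   full_page_writes = off

  # occasional background recompress/defrag to keep fragmentation in check
  btrfs filesystem defragment -r -clzo /srv/pgdata

--
GCP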