Subject: Re: Add device while rebalancing
From: Juan Alberto Cirez
To: "Austin S. Hemmelgarn"
Cc: linux-btrfs
Date: Tue, 26 Apr 2016 04:50:47 -0600

Thank you all so very kindly for your help and for taking the time to
answer my question. I have been reading the wiki and online use cases,
and otherwise delving deeper into the btrfs architecture.

I am managing a 520TB storage pool spread across 16 server pods and
have tried several methods of distributed storage. The last attempt
used ZFS as the base for the physical bricks and GlusterFS as the glue
to string the storage pool together. I was not satisfied with the
results (mainly ZFS).

Once I have run btrfs for a while on the test server (32TB, 8x 4TB HDD
RAID10), I will try btrfs/Ceph.
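Duncan's point below about putting snapshot deletion on hold during a
balance or scrub is something I would like to automate on the test box.
Here is a rough, untested sketch of the thinning job I have in mind,
driving the btrfs command-line tools from Python. The mount point,
snapshot directory, retention count, and the status strings it greps
for are only assumptions for illustration and would need adjusting for
the actual setup and btrfs-progs version:

#!/usr/bin/env python3
# Rough sketch (untested): keep only the newest KEEP snapshots under
# SNAP_DIR, but skip the deletion pass entirely while a balance or
# scrub is running on the filesystem, per Duncan's advice below.
# MOUNTPOINT, SNAP_DIR and KEEP are made-up example values.
import os
import subprocess

MOUNTPOINT = "/mnt/pool"            # example mount point
SNAP_DIR = "/mnt/pool/.snapshots"   # example snapshot directory
KEEP = 48                           # example retention count

def run(cmd):
    # Helper: run a command, capturing stdout+stderr as text.
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          universal_newlines=True).stdout

def maintenance_running(mnt):
    # True if a balance or scrub is in progress. This just greps the
    # human-readable btrfs-progs output, so the strings may need to be
    # adjusted for other versions.
    if "No balance found" not in run(["btrfs", "balance", "status", mnt]):
        return True
    return "running" in run(["btrfs", "scrub", "status", mnt])

def thin_snapshots():
    if maintenance_running(MOUNTPOINT):
        print("balance/scrub in progress; skipping snapshot deletion")
        return
    # Snapshot names are assumed to sort chronologically (timestamps).
    snaps = sorted(os.listdir(SNAP_DIR))
    for name in snaps[:-KEEP]:
        path = os.path.join(SNAP_DIR, name)
        subprocess.run(["btrfs", "subvolume", "delete", path], check=True)

if __name__ == "__main__":
    thin_snapshots()

This only handles the deletion side; snapshot creation can stay on its
own schedule since, as discussed below, creating one is comparatively
cheap.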
On Mon, Apr 25, 2016 at 7:02 AM, Austin S. Hemmelgarn wrote:
> On 2016-04-25 08:43, Duncan wrote:
>> Austin S. Hemmelgarn posted on Mon, 25 Apr 2016 07:18:10 -0400 as
>> excerpted:
>>
>>> On 2016-04-23 01:38, Duncan wrote:
>>>>
>>>> And again with snapshotting operations. Making a snapshot is
>>>> normally nearly instantaneous, but there is a scaling issue if you
>>>> have too many per filesystem (try to keep it under 2000 snapshots
>>>> per filesystem total, if possible, and definitely keep it under
>>>> 10K, or some operations will slow down substantially), and deleting
>>>> snapshots is more work, so while you should ordinarily thin down
>>>> snapshots automatically if you are making them quite frequently
>>>> (say daily or more often), you may want to put the snapshot
>>>> deletion, at least, on hold while you scrub or balance or device
>>>> delete or replace.
>>>
>>> I would actually recommend putting all snapshot operations on hold,
>>> as well as most writes to the filesystem, while doing a balance or
>>> device deletion. The more writes you have while doing those, the
>>> longer they take, and the less likely it is that you end up with a
>>> good on-disk layout of the data.
>>
>> The thing with snapshot writing is that all snapshot creation
>> effectively does is a bit of metadata writing. What snapshots
>> primarily do is lock existing extents in place (down within their
>> chunk, with the higher chunk level being the scope at which balance
>> works), extents that would otherwise be COWed elsewhere with the
>> existing extent deleted on change, or simply deleted on file delete.
>> A snapshot simply adds a reference to the current version, so that
>> deletion, either directly or from the COW, never happens, and doing
>> that requires only a relatively small metadata write.
>
> Unless I'm mistaken about the internals of BTRFS (which might be the
> case), creating a snapshot has to update reference counts on every
> single extent in every single file in the snapshot. For something
> small this isn't much, but if you are snapshotting something big (say,
> snapshotting an entire system with all the data in one subvolume), it
> can amount to multiple MB of writes, and it gets even worse if you
> have no shared extents to begin with (which is still pretty typical).
> On some of the systems I work with, snapshotting a terabyte of data
> can end up resulting in 10-20 MB of writes to disk (in this case, that
> figure came from a partition containing mostly small files that were
> just big enough that they didn't fit in-line in the metadata blocks).
>
> This is of course still significantly faster than copying everything,
> but it's not free either.
>
>> So while I agree in general that more writes mean balances taking
>> longer, snapshot creation writes are pretty tiny in the scheme of
>> things, and won't affect the balance much compared to the larger
>> writes you'll very possibly still be doing unless you really do
>> suspend pretty much all write operations to that filesystem during
>> the balance.
>
> In general, yes, except that there's the case of running with mostly
> full metadata chunks, where it might result in a further chunk
> allocation, which in turn can throw off the balanced layout. Balance
> always allocates new chunks, and doesn't write into existing ones, so
> if you're writing enough to allocate a new chunk while a balance is
> happening:
> 1. That chunk may or may not get considered by the balance code (I'm
> not 100% certain about this, but I believe it will be ignored by any
> balance running at the time it gets allocated).
> 2. You run the risk of ending up with a chunk with almost nothing in
> it, which could have been packed into another existing chunk.
> Snapshots are not likely to trigger this, but it is still possible,
> especially if you're taking lots of snapshots in a short period of
> time.
>
>> But as I said, snapshot deletions are an entirely different story, as
>> then all those previously locked-in-place extents are potentially
>> freed, and the filesystem must do a lot of work to figure out which
>> ones it can actually free and free them, vs. ones that still have
>> other references and therefore cannot yet be freed.
>
> Most of the issue here with balance is that you end up potentially
> doing an amount of unnecessary work that is unquantifiable before it's
> done.
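Regarding the point above about concurrent writes forcing extra chunk
allocations while a balance runs: when I start the balance tests on the
8-drive array I will probably run something like the crude monitor
below alongside it, just to see whether the allocated totals grow
during the run. It is only a sketch and entirely untested; it polls
"btrfs filesystem df" and diffs the "total=" figures, so the mount
point, the interval, and the output parsing are assumptions that may
need adjusting for a given btrfs-progs version.

#!/usr/bin/env python3
# Crude sketch (untested): poll "btrfs filesystem df" while a balance is
# running and report whenever the allocated ("total=") size of any block
# group type changes, as a rough sign that concurrent writes forced new
# chunk allocations. MOUNTPOINT and INTERVAL are example values only.
import re
import subprocess
import time

MOUNTPOINT = "/mnt/pool"   # example mount point
INTERVAL = 30              # seconds between samples (example)

# Matches lines such as "Data, RAID10: total=1.00TiB, used=512.00GiB"
LINE_RE = re.compile(r"^(\S.*?): total=(\S+), used=(\S+)")

def sample(mnt):
    out = subprocess.run(["btrfs", "filesystem", "df", mnt],
                         stdout=subprocess.PIPE,
                         universal_newlines=True, check=True).stdout
    totals = {}
    for line in out.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            totals[m.group(1)] = m.group(2)  # e.g. "Data, RAID10" -> "1.00TiB"
    return totals

def main():
    prev = sample(MOUNTPOINT)
    while True:
        time.sleep(INTERVAL)
        cur = sample(MOUNTPOINT)
        for kind, total in sorted(cur.items()):
            if prev.get(kind) != total:
                print("%s allocation changed: %s -> %s"
                      % (kind, prev.get(kind), total))
        prev = cur

if __name__ == "__main__":
    main()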