Linux-BTRFS Archive on lore.kernel.org
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Adam Borowski <kilobyte@angband.pl>
Cc: Eric Wong <e@80x24.org>, linux-btrfs@vger.kernel.org
Subject: Re: raid1 with several old drives and a big new one
Date: Fri, 31 Jul 2020 23:40:00 -0400
Message-ID: <20200801033959.GL5890@hungrycats.org>
In-Reply-To: <20200731161307.GA31148@angband.pl>

On Fri, Jul 31, 2020 at 06:13:07PM +0200, Adam Borowski wrote:
> On Fri, Jul 31, 2020 at 12:16:52AM +0000, Eric Wong wrote:
> > Say I have three ancient 2TB HDDs and one new 6TB HDD, is there
> > a way I can ensure one raid1 copy of the data stays on the new
> > 6TB HDD?
> > 
> > I expect the 2TB HDDs to fail sooner than the 6TB HDD given
> > their age (>5 years).

It might be a good idea to run 'btrfs replace' on one of the three 2TB
disks instead of 'device add'.  That will move one disk's copy of the
data very quickly to the new disk.  You then resize the new disk to 6TB
(or 'max'), then add the replaced 2TB disk back into the array with
'btrfs dev add'.  This will leave you with two partly-full 2TB disks,
one empty 2TB disk, and a 6TB disk with about 2TB of data on it.
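The bookkeeping for those three steps can be sketched with a toy model
(device names, sizes, and usage numbers are all assumptions, not real
btrfs output):

```python
# Toy bookkeeping for the replace -> resize -> add sequence.
# Device names, sizes (GiB), and usage numbers are assumptions;
# this tracks per-device usage only, not real btrfs chunk state.
disks = {
    "sda": {"size": 2048, "used": 2000},  # three nearly-full old 2TB disks
    "sdb": {"size": 2048, "used": 2000},
    "sdc": {"size": 2048, "used": 2000},
}

# Step 1: 'btrfs replace start /dev/sda /dev/sdd /mnt' copies sda's
# chunks to the new disk, which starts out limited to sda's size.
disks["sdd"] = {"size": 2048, "used": disks.pop("sda")["used"]}

# Step 2: 'btrfs filesystem resize <devid>:max /mnt' exposes all 6TB.
disks["sdd"]["size"] = 6144

# Step 3: 'btrfs device add /dev/sda /mnt' brings the old disk back, empty.
disks["sda"] = {"size": 2048, "used": 0}

print([d for d, s in disks.items() if s["used"] == 0])  # ['sda']
```

The point of the model is just that 'replace' moves an existing copy
while 'device add' only contributes empty space, so the new disk ends
up holding data immediately.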

In that case you don't even need to balance--the empty 2TB drive will
fill up with block groups that have one chunk on that drive and one on
the 6TB drive, since the allocator picks the two emptiest drives first.
Everything will be mirrored on the 6TB drive (probably, see below).
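That "two emptiest drives" behaviour can be simulated with a toy
allocator (1 GiB chunks and idealized device sizes are assumptions;
real btrfs allocation has more corner cases, as described below):

```python
# Toy raid1 chunk allocator: each 1 GiB block group puts one copy on
# each of the two devices with the most free space.  Sizes in GiB.
free = {"sda": 2048, "sdb": 2048, "sdc": 2048, "sdd": 6144}

block_groups = []
while True:
    a, b = sorted(free, key=free.get, reverse=True)[:2]  # two emptiest
    if free[a] < 1 or free[b] < 1:
        break
    free[a] -= 1
    free[b] -= 1
    block_groups.append((a, b))

print(len(block_groups))                        # 6144: all space usable
print(all("sdd" in bg for bg in block_groups))  # True: 6TB disk in every BG
```

In this idealized 2+2+2+6 case the 6TB disk always has the most free
space, so every block group gets one chunk on it and the full 6TB of
raid1 capacity is reachable without wasted space.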

The variation in write load might also shift the date when the drives
eventually do fail, so they'll be less likely to fail at the same time.

> While there's no good way to do so in general, in your case, there's no way
> for any new block group to be allocated without the big disk.
> 
> Btrfs' allocation algorithm is: always pick the disk with most free space
> left.  Besides being simple, this guarantees optimally utilizing available
> space.

That is the theory; however, practice is a little different.

Sometimes btrfs just doesn't follow its own rules.  I've filled big
raid1 arrays with lopsided disks like this, and ended up with one block
group out of every few thousand having a chunk on each of the two
smaller disks.  I guess it's a race condition, possibly triggered by
scrub or balance marking block groups read-only, but I've never fully
investigated.  When the larger disk is _exactly_ the same size as the
two smaller disks combined, having one block group in the wrong place
can be annoying, as it reduces usable capacity.

If two disks fail, btrfs will check which disks each block group lives
on and say "nope, can't mount this degraded raid1, sorry" if even one
block group in the filesystem has both of its chunks on the failed
disks.
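That failure mode reduces to a simple per-block-group check (the chunk
map below is made up for illustration; real btrfs keeps this in the
chunk tree):

```python
# Raid1 data survives a degraded mount only if every block group still
# has at least one copy on a surviving disk.
def can_mount_degraded(block_groups, failed):
    return all(any(dev not in failed for dev in bg) for bg in block_groups)

# Thousands of well-placed block groups, plus one stray block group
# that landed on the two small disks (the race described above):
bgs = [("sda", "sdd")] * 3000 + [("sdb", "sdd")] * 3000 + [("sda", "sdb")]

print(can_mount_degraded(bgs, {"sda"}))         # True: one disk lost
print(can_mount_degraded(bgs, {"sda", "sdb"}))  # False: stray BG lost both copies
```

One misplaced block group out of six thousand is enough to flip the
answer when the wrong pair of disks dies.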

In any case, the behavior isn't strictly guaranteed here--btrfs *can*
allocate a block group across the two smaller disks, even though it
normally would not, so there's a small risk it does so unexpectedly.

Contrast with combining the 2TB disks (e.g. with mdadm raid0 or
linear, or LVM), where btrfs is presented with exactly two devices and
has exactly one way to place the two mirror copies of each block
group.
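The contrast can be counted directly (hypothetical device names; this
is a sketch of the placement arithmetic, not a recommendation of one
md/LVM layout over another):

```python
from itertools import combinations

# Four separate devices vs. the 2TB disks combined into one 'md0'
# device, so btrfs sees only two devices of equal size.
separate = ["sda", "sdb", "sdc", "sdd"]
combined = ["md0", "sdd"]

# raid1 places each block group on two distinct devices, so the number
# of possible placements is C(n, 2):
print(len(list(combinations(separate, 2))))  # 6 placements, 3 without sdd
print(len(list(combinations(combined, 2))))  # 1 placement: md0 + sdd
```

With only one possible placement, a misplaced block group simply
cannot happen.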

> And, for 2+2+2+6, no scheme that doesn't waste space could possibly place
> raid1 copies without having one on the biggest disk.
> 
> Thus, all you need is to balance once.
> 
> > The devid balance filter only affects data which already exists
> > on the device, so that isn't suitable for this, right?
> 
> Yeah, balance affects existing data, but doesn't have a lingering effect on
> new allocations.
> 
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁
> ⢿⡄⠘⠷⠚⠋⠀ It's time to migrate your Imaginary Protocol from version 4i to 6i.
> ⠈⠳⣄⠀⠀⠀⠀

Thread overview: 9+ messages
2020-07-31  0:16 Eric Wong
2020-07-31  2:57 ` Chris Murphy
2020-07-31  3:22   ` Eric Wong
2020-07-31  3:35     ` Chris Murphy
2020-08-01  9:05   ` Roman Mamedov
2020-07-31  8:29 ` Alberto Bursi
2020-07-31 10:06   ` Eric Wong
2020-07-31 16:13 ` Adam Borowski
2020-08-01  3:40   ` Zygo Blaxell [this message]
