linux-btrfs.vger.kernel.org archive mirror
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Graham Cobb <g.btrfs@cobb.uk.net>
Cc: kreijack@inwind.it, Hans van Kranenburg <hans@knorrie.org>,
	linux-btrfs@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.cz>,
	Sinnamohideen Shafeeq <shafeeqs@panasas.com>
Subject: Re: [PATCH 4/4] btrfs: add allocator_hint mode
Date: Sat, 18 Dec 2021 21:30:59 -0500
Message-ID: <Yb6ZY3BPtbXn44gX@hungrycats.org>
In-Reply-To: <4e18eff2-fca1-bde2-b942-159f89569f0f@cobb.uk.net>

On Sun, Dec 19, 2021 at 12:03:32AM +0000, Graham Cobb wrote:
> On 18/12/2021 22:48, Zygo Blaxell wrote:
> > On Sat, Dec 18, 2021 at 10:07:18AM +0100, Goffredo Baroncelli wrote:
> >> On 12/17/21 20:41, Zygo Blaxell wrote:
> >>> On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
> >>>> On 12/17/21 16:58, Hans van Kranenburg wrote:
> >> [...]
> >>>> -----------------------------
> >>>> The chunk allocation policy is modified as follows.
> >>>>
> >>>> Each disk may have one of the following tags:
> >>>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> >>>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> >>>
> >>> Is it too late to rename these?  The order of the words is inconsistent
> >>> and the English usage is a bit odd.
> >>>
> >>> I'd much rather have:
> >>>
> >>>> - BTRFS_DEV_ALLOCATION_PREFER_METADATA
> >>>> - BTRFS_DEV_ALLOCATION_ONLY_METADATA
> >>>> - BTRFS_DEV_ALLOCATION_ONLY_DATA
> >>>> - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)
> >>>
> >>> English speakers would say "[I/we/you] prefer X" or "X [is] preferred".
> >>>
> >>> or
> >>>
> >>>> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> >>>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> >>>
> >>> I keep typing "data_preferred" and "only_data" when it's really
> >>> "preferred_data" and "data_only" because they're not consistent.
> >>>
> >>
> >> Sorry but it is unclear to me the last sentence :-)
> >>
> >> Anyway I prefer
> >> BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> >> BTRFS_DEV_ALLOCATION_METADATA_ONLY
> >> [...]
> >>
> >> Because it seems to me more consistent
> > 
> > Sounds good.
> > 
> >>> There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
> >>> with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
> >>> METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
> >>> respectively.  But this use case is not a very good match for what the
> >>> implementation does--we'd want to separate device selection ("can I use
> >>> this device for metadata, ever?") from ordering ("which devices should
> >>> I use for metadata first?").
> >>>
> >>> To keep things simple I'd say that use case is out of scope, and recommend
> >>> not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
> >>> allocate everything with _ONLY, or mark every device _PREFERRED one way
> >>> or the other, but don't use both _ONLY and _PREFERRED at the same time
> >>> unless you really know what you're doing.
> >>
> >> In what METADATA_ONLY + DATA_PREFERRED would be more dangerous than
> >> METADATA_ONLY + DATA_ONLY ?
> > 
> > If capacity is our first priority, we use METADATA_PREFERRED
> > and DATA_PREFERRED (everything can be allocated everywhere, we try
> > the highest performance but fall back).
> > 
> > If performance is our first priority, we use METADATA_ONLY and DATA_ONLY
> > (so we never have to balance which would reduce performance) or
> > METADATA_PREFERRED and DATA_ONLY (so we have more capacity, but get
> > lower performance because we must balance data in some cases, but not
> > as low as any combination of options with DATA_PREFERRED).
> 
> I think it would be a mistake to think that your performance and
> capacity use cases are the only ones others will care about.
> 
> Your analysis misses a third option for priority: resilience. I have a
> nearline backup server. It stores a lot of data but it is almost
> entirely write-only. My priority is to be able to get at most of the
> data quickly if I need it sometime - it isn't critical for any specific
> piece of data as I have additional, slower backups, but I want to be
> able to restore as much as possible from this server for speed and
> convenience. To keep as much nearline backup as possible, I keep data in
> SINGLE and metadata in RAID1. Fine - I can do that today.

We have a lot of servers built this way.  None with just two disks though.
Maybe that gives me a blind spot for the 2-disk corner cases...

> However, in normal use the main activity is btrfs receive of many mostly
> unchanged subvolumes every day. So, what I do today is have a large data
> disk and a second small disk for the RAID1 copy of the metadata. I want
> to keep data off that second disk. With this patch, I expect to set the
> metadata disk as METADATA_ONLY and the data disk as DATA_PREFERRED.

Thanks, that's exactly the missing use case I was looking for.

In this case, we are starting off by prioritizing performance over
capacity (otherwise we'd use METADATA_PREFERRED and DATA_PREFERRED),
so we would normally use METADATA_ONLY and DATA_ONLY.  If we had enough
disks for metadata, we'd be done and we'd stop there.

We can't use DATA_ONLY here, because we would then have only one disk
for raid1 metadata, and that doesn't work.  We could have DATA_ONLY
with dup metadata for best performance, but then we lose resilience
because SSD failure would wipe out the whole filesystem.  We can
add more data disks and then make some of them DATA_ONLY, but we'd
always need at least two devices to have one of the three allocation
preferences that allow metadata, or raid1 metadata won't work at all.
We could use the SSD for a block-layer writethrough cache instead,
but while the resiliency properties are similar (SSD is expendable, HDD
takes all the data with it when it fails), it's not completely identical:
it's prone to data evicting the metadata which makes it less useful,
and it's a separate chunk of code with exposure to more kernel bugs.
So every other option is worse than METADATA_ONLY and DATA_PREFERRED.
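
A rough sketch of the eligibility argument above, in Python rather than
kernel C.  The names (Hint, ORDER, candidates, can_allocate) are made up
for illustration and are not from the patch; the priority order within
each list is my assumption, while which hints exclude which chunk type
follows the discussion in this thread.

```python
from enum import Enum


class Hint(Enum):
    METADATA_ONLY = "metadata_only"
    METADATA_PREFERRED = "metadata_preferred"
    DATA_PREFERRED = "data_preferred"
    DATA_ONLY = "data_only"


# Assumed priority order per chunk type, best first.
# A hint missing from a list means "never use this device for that type".
ORDER = {
    "metadata": [Hint.METADATA_ONLY, Hint.METADATA_PREFERRED, Hint.DATA_PREFERRED],
    "data": [Hint.DATA_ONLY, Hint.DATA_PREFERRED, Hint.METADATA_PREFERRED],
}


def candidates(devices, chunk_type):
    """Return the names of devices eligible for chunk_type, best first."""
    order = ORDER[chunk_type]
    eligible = [(order.index(hint), name)
                for name, hint in devices.items() if hint in order]
    return [name for _, name in sorted(eligible)]


def can_allocate(devices, chunk_type, min_devices):
    """True if enough eligible devices exist (raid1 needs min_devices=2)."""
    return len(candidates(devices, chunk_type)) >= min_devices


# Graham's two-disk setup: small SSD for metadata, large HDD for data.
mixed = {"ssd": Hint.METADATA_ONLY, "hdd": Hint.DATA_PREFERRED}
strict = {"ssd": Hint.METADATA_ONLY, "hdd": Hint.DATA_ONLY}

print(can_allocate(mixed, "metadata", 2))   # True: both disks may hold metadata
print(can_allocate(strict, "metadata", 2))  # False: only the SSD may hold metadata
print(candidates(mixed, "data"))            # ['hdd']: data stays off the SSD
```

The sketch ignores free space and fallback behaviour entirely; it only
shows why DATA_ONLY on the HDD leaves raid1 metadata with one device.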

With the read_policy patches also applied, and allocation preferences
set to METADATA_ONLY and DATA_PREFERRED, we'd have maximum performance
on reads (they would exclusively use SSD) and minimum performance on
writes (they would all necessarily use HDD) and get the best possible
result given the combination of performance, resilience, and available
device constraints.
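
To illustrate the read/write asymmetry for raid1 metadata, here is a toy
model.  The function names and the latency figures are invented for the
example and are not measurements or part of any patch; the only thing it
relies on is that a raid1 write has to reach every mirror while a read
needs just one copy.

```python
def raid1_write_latency(mirror_latency_ms):
    """A raid1 write completes only when every mirror has it,
    so the slowest device sets the latency."""
    return max(mirror_latency_ms.values())


def raid1_read_latency(mirror_latency_ms, preferred):
    """A read needs only one copy; a read policy can pin it to one mirror."""
    return mirror_latency_ms[preferred]


# Made-up per-I/O latencies, for illustration only.
latency = {"ssd": 0.1, "hdd": 8.0}

print(raid1_read_latency(latency, "ssd"))  # read served by the SSD copy
print(raid1_write_latency(latency))        # write waits for the HDD copy
```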

> Of course I would *also* like to be able to get btrfs to mostly read the
> RAID1 copy from the fast metadata disk for reading metadata. This patch
> does not address that, but I hope one day there will be a separate
> option for that.

That's the read_policy patch set.  I've been running that for a while too,
but I dropped it when I started having problems with kernels after 5.10.
I'm still having problems without the read_policy patch set, so it was
probably fine?

> I think the proposed settings are a useful step and will allow some
> experimentation and learning with different scenarios. They certainly
> aren't the answer to all allocation problems but I would like to see
> them available as soon as possible,
> 
> Graham
> 

