linux-btrfs.vger.kernel.org archive mirror
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: kreijack@inwind.it
Cc: Hans van Kranenburg <hans@knorrie.org>,
	linux-btrfs@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.cz>,
	Sinnamohideen Shafeeq <shafeeqs@panasas.com>
Subject: Re: [PATCH 4/4] btrfs: add allocator_hint mode
Date: Sat, 18 Dec 2021 17:48:41 -0500	[thread overview]
Message-ID: <Yb5lSevjq3eURuYB@hungrycats.org> (raw)
In-Reply-To: <5afe9f17-d171-c4e5-84f0-24f9a7fa250f@libero.it>

On Sat, Dec 18, 2021 at 10:07:18AM +0100, Goffredo Baroncelli wrote:
> On 12/17/21 20:41, Zygo Blaxell wrote:
> > On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
> > > On 12/17/21 16:58, Hans van Kranenburg wrote:
> [...]
> > > -----------------------------
> > > The chunk allocation policy is modified as follows.
> > > 
> > > Each disk may have one of the following tags:
> > > - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> > > - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> > 
> > Is it too late to rename these?  The order of the words is inconsistent
> > and the English usage is a bit odd.
> > 
> > I'd much rather have:
> > 
> > > - BTRFS_DEV_ALLOCATION_PREFER_METADATA
> > > - BTRFS_DEV_ALLOCATION_ONLY_METADATA
> > > - BTRFS_DEV_ALLOCATION_ONLY_DATA
> > > - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)
> > 
> > English speakers would say "[I/we/you] prefer X" or "X [is] preferred".
> > 
> > or
> > 
> > > - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> > > - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> > 
> > I keep typing "data_preferred" and "only_data" when it's really
> > "preferred_data" and "data_only" because they're not consistent.
> > 
> 
> Sorry, but the last sentence is unclear to me :-)
> 
> Anyway I prefer
> BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> BTRFS_DEV_ALLOCATION_METADATA_ONLY
> [...]
> 
> Because it seems more consistent to me.

Sounds good.

> > There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
> > with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
> > METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
> > respectively.  But this use case is not a very good match for what the
> > implementation does--we'd want to separate device selection ("can I use
> > this device for metadata, ever?") from ordering ("which devices should
> > I use for metadata first?").
> > 
> > To keep things simple I'd say that use case is out of scope, and recommend
> > not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
> > allocate everything with _ONLY, or mark every device _PREFERRED one way
> > or the other, but don't use both _ONLY and _PREFERRED at the same time
> > unless you really know what you're doing.
> 
> In what way would METADATA_ONLY + DATA_PREFERRED be more dangerous than
> METADATA_ONLY + DATA_ONLY?

If capacity is our first priority, we use METADATA_PREFERRED
and DATA_PREFERRED (everything can be allocated everywhere, we try
the highest performance but fall back).

If performance is our first priority, we use METADATA_ONLY and DATA_ONLY
(so we never have to balance, which would reduce performance) or
METADATA_PREFERRED and DATA_ONLY (so we have more capacity, but get
lower performance because we must balance data in some cases, but not
as low as any combination of options with DATA_PREFERRED).

If we have a complicated setup with 3 or more drives we might use 3 or
4 options at once to create multiple performance and capacity tiers.
But never exactly those two options (METADATA_ONLY + DATA_PREFERRED).

What is METADATA_ONLY + DATA_PREFERRED for?

METADATA_ONLY + DATA_PREFERRED allows metadata to be allocated on data
drives, causing a performance crash.  It doesn't remove the need for
data balances since metadata and data can compete for space on the
DATA_PREFERRED devices.  It reduces data capacity (no data on the
METADATA_ONLY device) but doesn't guarantee a performance benefit
(metadata is allowed on the DATA_PREFERRED device).  Of all the
combinations of options, why would a user choose this one?
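To make the difference concrete, here is a toy model of where metadata
chunks land under the two combinations.  It is only a sketch, not the
patch's code: the priority order for metadata (METADATA_ONLY, then
METADATA_PREFERRED, then DATA_PREFERRED, never DATA_ONLY) is my assumption
about the intended semantics, and the device dicts, the chunk counts and
the function names are all hypothetical.

```python
# Simplified model of the tag semantics discussed above; NOT the patch's code.
# Assumption: metadata tries METADATA_ONLY devices first, then
# METADATA_PREFERRED, then DATA_PREFERRED, and never uses DATA_ONLY devices.
METADATA_ORDER = ["METADATA_ONLY", "METADATA_PREFERRED", "DATA_PREFERRED"]


def place_metadata(devices):
    """Return the name of the device that receives the next metadata chunk.

    devices: list of dicts with hypothetical keys "name", "tag" and "free"
    (free space counted in chunks).  Raises OSError, standing in for ENOSPC,
    when no device that may hold metadata has room left.
    """
    for tag in METADATA_ORDER:
        for dev in devices:
            if dev["tag"] == tag and dev["free"] > 0:
                dev["free"] -= 1
                return dev["name"]
    raise OSError("ENOSPC: no device allowed to hold metadata has free space")


def fill_metadata(devices, chunks):
    """Allocate up to `chunks` metadata chunks and record where each landed."""
    placements = []
    for _ in range(chunks):
        try:
            placements.append(place_metadata(devices))
        except OSError:
            placements.append("ENOSPC")
            break
    return placements


# METADATA_ONLY + DATA_PREFERRED: once the NVMe is full, metadata leaks
# onto the HDD.
mixed = [
    {"name": "nvme", "tag": "METADATA_ONLY", "free": 2},
    {"name": "hdd", "tag": "DATA_PREFERRED", "free": 100},
]
mixed_result = fill_metadata(mixed, 3)

# METADATA_ONLY + DATA_ONLY: once the NVMe is full, allocation fails
# instead of leaking.
strict = [
    {"name": "nvme", "tag": "METADATA_ONLY", "free": 2},
    {"name": "hdd", "tag": "DATA_ONLY", "free": 100},
]
strict_result = fill_metadata(strict, 3)

print(mixed_result)
print(strict_result)
```

In this model the mixed setup silently puts the third metadata chunk on
the HDD, while the strict setup reports ENOSPC and keeps all metadata on
the NVMe.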

> In fact I see two main, different use cases:
> - I want to put my metadata on an SSD for performance reasons:
> 	METADATA_PREFERRED + DATA_PREFERRED
>    as the most conservative approach

If you're using METADATA_PREFERRED, your first priority can only be
capacity, as performance will fail when some of the disks fill up.

We can't prioritize performance and capacity at the same time (at least
not with this code).  The user must choose which is most important
in cases when both are not available.

I see now why you keep missing this use case--it is because you are
thinking that PREFERRED is a valid option for performance use cases,
and therefore serves a superset of ONLY use cases.  ONLY and PREFERRED
serve use cases with opposite requirements, so one cannot be used to
serve the needs of the other.

PREFERRED is for capacity use cases, not for performance use cases.
PREFERRED only improves performance when over 96%(*) of the metadata
accesses hit the SSD; otherwise, the 4% of metadata on spinning devices
will be so slow that it will dominate the metadata access time.  When it
is forced to make a decision, PREFERRED will choose capacity, and drop
performance to pre-allocation-preference levels very quickly.  If the
user can tolerate the worst-case performance, then PREFERRED can provide
average performance above the worst case, but below the best case.
Only worst-case performance is guaranteed (within the limits of this
patch and the current btrfs allocators).

ONLY is for performance, not capacity (though it can also be used for one
of its side-effects on metadata allocation).  ONLY will not sacrifice
performance because one of the disks filled up.  ONLY will give us
ENOSPC immediately to tell us that we have no more space available with
acceptable performance.  ONLY guarantees best-case performance all the
time, since it doesn't allow any other case to arise.

(*) 24:1 is the raw ratio of access time between HDD data drives and NVMe
drives, but btrfs typically does 2-4x as many metadata IOPS as data IOPS,
which makes the effect on performance even worse when metadata leaks
onto HDD.  Also the most recently allocated block groups are most active,
which weights the HDD overflow even more in the average access time.
We typically get "node timeout" alarms immediately after the first
metadata block group on a big filesystem is misallocated on a spinning
disk, which is an effective ratio of 250:1 or 99.6%.  As a result,
we have zero tolerance for metadata on HDD.
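The percentages above follow from the ratios if "dominate" is read as the
break-even point where time spent on the HDD equals time spent on the SSD.
That reading is my assumption, as are the function names below; the sketch
only checks the arithmetic and ignores the iops weighting and the
hot-block-group effect mentioned in the footnote.

```python
def break_even_ssd_fraction(hdd_to_ssd_ratio):
    """SSD hit fraction at which HDD access time equals SSD access time.

    With one SSD access costing 1 unit and one HDD access costing `ratio`
    units, solving f * 1 == (1 - f) * ratio gives f = ratio / (ratio + 1).
    """
    return hdd_to_ssd_ratio / (hdd_to_ssd_ratio + 1.0)


def hdd_share_of_time(ssd_fraction, hdd_to_ssd_ratio):
    """Fraction of total metadata access time that is spent on the HDD."""
    ssd_time = ssd_fraction * 1.0
    hdd_time = (1.0 - ssd_fraction) * hdd_to_ssd_ratio
    return hdd_time / (ssd_time + hdd_time)


raw = break_even_ssd_fraction(24)          # raw 24:1 ratio
effective = break_even_ssd_fraction(250)   # effective 250:1 ratio
share = hdd_share_of_time(0.96, 24)        # HDD share at 96% SSD hits

print(f"24:1  -> {raw:.1%} of accesses must hit the SSD")
print(f"250:1 -> {effective:.1%} of accesses must hit the SSD")
print(f"HDD share of access time at 96% SSD hits, 24:1: {share:.0%}")
```

At 24:1 the break-even is 24/25, and at 250:1 it is 250/251, which rounds
to 99.6%.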

> - I want to protect the metadata BG space from exhaustion (assuming that
>   a "today standard" disk is far larger than the total BG metadata)
> 	METADATA_ONLY + X
>   is a valid approach

Even if today's disks are too small, you can always add more of them.
It might even improve performance to make a raid10 metadata array.

> [...]
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 
