From: Dave Chinner <david@fromorbit.com>
To: Ilya Dryomov <idryomov@gmail.com>
Cc: Eric Sandeen <sandeen@sandeen.net>,
	xfs <linux-xfs@vger.kernel.org>, Mark Nelson <mnelson@redhat.com>,
	Eric Sandeen <sandeen@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>
Subject: Re: [PATCH] mkfs.xfs: don't go into multidisk mode if there is only one stripe
Date: Sun, 7 Oct 2018 10:20:37 +1100
Message-ID: <20181006232037.GB18095@dastard>
In-Reply-To: <CAOi1vP-Z7xZk1YvWHYWQGOHfyWccFCkyn8je0HBvGhGuUrXmaQ@mail.gmail.com>

On Sat, Oct 06, 2018 at 02:17:54PM +0200, Ilya Dryomov wrote:
> On Sat, Oct 6, 2018 at 1:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Oct 05, 2018 at 08:51:59AM -0500, Eric Sandeen wrote:
> > > On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> > > > On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >>
> > > >> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
> > > >>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
> > > >>>> rbd devices report the following geometry:
> > > >>>>
> > > >>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
> > > >>>>   512
> > > >>>>   512
> > > >>>>   4194304
> > > >>>>   4194304
> > > >>
> > > >> dm-thinp does this as well. This is from the thinp device created
> > > >> by tests/generic/459:
> > > >>
> > > >> 512
> > > >> 4096
> > > >> 65536
> > > >> 65536
> > > >
> > > > (adding Mike)
> > > >
> > > > ... and that 300M filesystem ends up with 8 AGs, when normally you get
> > > > 4 AGs for anything less than 4T.  Is that really intended?
> > >
> > > Well, yes.  Multi-disk mode gives you more AGs; how many more is scaled
> > > by fs size.
> > >
> > >         /*
> > >          * For the multidisk configs we choose an AG count based on the number
> > >          * of data blocks available, trying to keep the number of AGs higher
> > >          * than the single disk configurations. This makes the assumption that
> > >          * larger filesystems have more parallelism available to them.
> > >          */
> > >
> > > For really tiny filesystems we cut down the number of AGs, but in general
> > > if the storage "told" us it has parallelism, mkfs uses it by default.
> >
> > We only keep the number of AGs down on single disks because of the
> > seek penalty it causes spinning disks. It's a trade off between
> > parallelism and seek time.
> 
> If it's primarily about seek times, why aren't you looking at rotational
> attribute for that?

Historically speaking, "rotational" hasn't been a reliable indicator
of device seek behaviour or alignment requirements. It was a nasty
hack for people wanting to optimise for SSDs, and most of those
optimisations were things we could already do with sunit/swidth
(such as aligning to internal SSD page sizes and/or erase blocks).

> > > > AFAIK dm-thinp reports these values for the same exact reason as rbd:
> > > > we are passing up the information about the efficient I/O size.  In the
> > > > case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> > > > top of a RAID array, I suspect it would pass up the array's preferred
> > > > sizes, as long as they are a proper factor of the thinp block size.
> >
> > dm-thinp is passing up its allocation chunk size, not the
> > underlying device geometry. dm-thinp might be tuning its chunk size
> > to match the underlying storage, but that's irrelevant to XFS.
> 
> I think the thinp chunk size is more about whether you just want thin
> provisioning or plan to do a lot of snapshotting, etc.  dm-thinp passes
> up the underlying device geometry if it's more demanding than the thinp
> chunk size.  Here is dm-thinp with 64K chunk size on top of mdraid:
> 
>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/mapper/vg1-thin1
>   512
>   512
>   524288
>   1048576

That's how iomin/ioopt are supposed to be propagated on layered
devices. i.e. the layer with the largest values bubbles to the top,
and the filesystem aligns to that.
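
To illustrate the general idea (just a rough sketch, not the
kernel's actual limit stacking code - the names and structure here
are made up), the most demanding limits survive the stacking:

/*
 * Rough sketch of stacking iomin/ioopt across two layers: take the
 * largest minimum and an optimal size that satisfies both layers.
 */
struct io_limits {
	unsigned int io_min;	/* preferred minimum IO size (iomin) */
	unsigned int io_opt;	/* optimal IO size (ioopt) */
};

static unsigned int lcm(unsigned int a, unsigned int b)
{
	unsigned int x = a, y = b;

	if (!a || !b)
		return a | b;
	while (y) {			/* Euclid: x ends up as gcd(a, b) */
		unsigned int t = x % y;
		x = y;
		y = t;
	}
	return (a / x) * b;
}

static struct io_limits stack_limits(struct io_limits top,
				     struct io_limits bottom)
{
	struct io_limits t;

	t.io_min = top.io_min > bottom.io_min ? top.io_min : bottom.io_min;
	t.io_opt = lcm(top.io_opt, bottom.io_opt);
	return t;
}

Feed it a 64k/64k thinp layer on top of 512k/1M mdraid limits and
you get 512k/1M at the top - matching the blockdev output above.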

That doesn't change the fact that thinp and other COW-based block
devices fundamentally isolate the filesystem from the physical
storage properties. The filesystem sees the result of the COW
behaviour in the block device and its allocation algorithm, not the
physical block device properties.

Last time I looked, dm-thinp did first-free allocation, which means
it fills the block device from one end to the other regardless of
how many widely spaced IOs are in progress from the filesystem.
That means all the new writes end up being issued sequentially by
dm-thinp rather than causing seek storms, even though the filesystem
is writing to 32 different locations across the block device address
space. IOWs, a properly
implemented COW-based thinp device should be able to handle much
higher random write IO workloads than if the filesystem was placed
directly on the same block device.
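
As a toy illustration of the concept (nothing to do with the actual
dm-thinp implementation - the names and structure are invented),
first-free allocation looks something like this:

#include <stdint.h>

#define NO_MAPPING	UINT64_MAX

struct thin_pool {
	uint64_t *map;		/* virtual chunk -> physical chunk,	*/
				/* pre-initialised to NO_MAPPING	*/
	uint64_t nr_chunks;	/* pool size in chunks			*/
	uint64_t next_free;	/* next unallocated physical chunk	*/
};

/* Map @vchunk to a physical chunk, allocating on first write. */
static uint64_t thin_map_chunk(struct thin_pool *p, uint64_t vchunk)
{
	if (p->map[vchunk] == NO_MAPPING) {
		if (p->next_free == p->nr_chunks)
			return NO_MAPPING;	/* pool exhausted */
		p->map[vchunk] = p->next_free++;
	}
	return p->map[vchunk];
}

No matter how scattered the virtual chunk numbers of the incoming
writes are, the physical chunks come from next_free in order, so the
device underneath sees a sequential write stream.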

IOWs, dm-thinp does not behave how one expects a rotational device
to behave even when it is placed on a rotational device. We have to
optimise filesystem behaviour differently for dm-thinp.

> > That's because dm-thinp is a virtual mapping device in the same way
> > the OS provides virtually mapped memory to users. That is, there is
> > no relationship between the block device address space index and the
> > location on disk. Hence the seek times between different regions of
> > the block device address space are not linear or predictable.
> >
> > Hence dm-thinp completely changes the parallelism vs seek time
> > trade-off the filesystem layout makes.  We can't optimise for
> > minimal seek time anymore because we don't know the physical layout
> > of the storage, so all we care about is alignment to the block
> > device chunk size.
> >
> > i.e. what we want to do is give dm-thinp IO that is optimal (e.g.
> > large aligned writes for streaming IO) and we don't want to leave
> > lots of little unused holes in the dmthinp mapping that waste space.
> > To do this, we need to ensure minimal allocator contention occurs,
> > and hence we allow more concurrency in allocation by increasing the
> > AG count, knowing that we can't make the seek time problem any worse
> > by doing this.
> 
> And yet dm-thinp presents itself as rotational if (at least one of) the
> underlying disk(s) is marked as rotational.

Which, as per above, means rotational devices don't all behave like
you'd expect a spinning spindle to behave. i.e. It's not an
indication of a specific, consistent device model that we can
optimise for.

> As it is, we get the nomultidisk trade-parallelism-for-seek-times
> behaviour on bare SSD devices, but dm-thinp on top of a single HDD
> device is regarded up to 8 (XFS_MULTIDISK_AGLOG - XFS_NOMULTIDISK_AGLOG)
> times more parallel...

Yes, that's expected. The single SSD case has to cover everything
from really slow, cheap SSDs that aren't much better than spinning
disks right through to high end NVMe drives.

It's easy to drown a slow SSD, just like it's easy to drown a single
spindle. But there are /very few/ applications that can drive a high
end NVMe SSD to be allocation bound on a 4 AG XFS filesystem because
of how fast the IO is. As such, I've yet to hear reports of XFS
allocation concurrency bottlenecks in production workloads on NVMe
SSDs.

Defaults are a trade off.  There is no "one size fits all" solution,
so we end up with defaults that are a compromise of "doesn't suck
for the majority of use cases". That means there might be some
unexpected default behaviours, but that doesn't mean they are wrong.

> > These are /generic/ alignment characteristics. While they were
> > originally derived from RAID characteristics, they have far wider
> > scope of use than just for configuring RAID devices. e.g. thinp,
> > exposing image file extent size hints as filesystem allocation
> > alignments similar to thinp, selecting what aspect of a multi-level
> > stacked RAID made up of hundreds of disks the filesystem should
> > align to, aligning to internal SSD structures (be it raid, erase
> > page sizes, etc), optimising for OSD block sizes, remote replication
> > block size constraints, helping DAX align allocations to huge page
> > sizes, etc.
> 
> Exactly, they are generic data alignment characteristics useful for
> both physical and virtual devices.  However, mkfs.xfs uses a heuristic
> that conflates them with agcount through the physics of the underlying
> device which it can't really reason about, especially in the virtual
> or network case.

Yet it's a heuristic that has served us well for 20 years. Yes,
we've been madly conflating allocation concurrency with storage that
requires alignment since long before XFS was ported to Linux.

The defaults are appropriate for the vast majority of installations
and use cases. The defaults are not ideal for everyone, but there's
years of thought, observation, problem solving and knowledge behind
them.

If you don't like the defaults, then override them on the command
line, post your benchmarked improvements and make the argument why
this particular workload and tuning is better to everyone.
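
e.g. for the rbd device in your original example, something like
this (numbers purely for illustration) pins both the AG count and
the alignment rather than letting mkfs infer them from the device:

  # mkfs.xfs -d agcount=4,su=4m,sw=1 /dev/rbd0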

> > My point is that just looking at sunit/swidth as "the number of data
> > disks" completely ignores the many other uses we've found for it
> > over the last 20 years. In that time, it's almost always been the
> > case that devices requiring alignment have not been bound by the
> > seek time constraints of a single spinning spindle, and the default
> > behaviour reflects that.
> >
> > > Dave, do you have any problem with changing the behavior to only go into
> > > multidisk if swidth > sunit?  The more I think about it, the more it makes
> > > sense to me.
> >
> > Changing the existing behaviour doesn't make much sense to me. :)
> 
> The existing behaviour is to create 4 AGs on both spinning rust and
> e.g. Intel DC P3700.

That's a really bad example.  The P3700 has internal RAID with a
128k page size that it doesn't expose to iomin/ioopt. It has
*really* bad IO throughput for IO smaller than 128k or not 128k
aligned (think 100x slower, not just a little). It's a device that absolutely
should be exposing preferred alignment characteristics to the
filesystem...
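
(For illustration only: if it did, blockdev --getiomin --getioopt on
that device would report 131072 for both, and mkfs would pick up the
alignment automatically.)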

> If I then put dm-thinp on top of that spinner,
> it's suddenly deemed worthy of 32 AGs.  The issue here is that unlike
> other filesystems, XFS is inherently parallel and perfectly capable of
> subjecting it to 32 concurrent write streams.  This is pretty silly.

COW algorithms linearise and serialise concurrent write streams -
that's exactly what they are designed to do and why they perform so
well on random write workloads.  Optimising the filesystem layout
and characteristics to take advantage of COW algorithms in the
storage layer is not "pretty silly" - it's the smart thing to do
because the dm-thinp COW algorithms are only as good as the garbage
they are fed.

> You agreed that broken RAID controllers that expose "sunit == swidth"
> are their vendor's or administrator's problem. 

No I didn't - I said that raid controllers that only advertise sunit
or swidth are broken. Advertising sunit == swidth is a valid thing
to do - we really only need a single alignment value for hardware
RAID w/ NVRAM caches: the IO size/alignment needed to avoid RMW
cycles.
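
e.g. for a hypothetical NVRAM-backed 6+2 RAID6 with a 128k chunk
size, RMW is only avoided by full stripe writes of 6 x 128k = 768k,
so advertising iomin = ioopt = 768k tells the filesystem the one
thing that matters.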

> The vast majority of
> SSD devices in wide use either expose nothing or lie.  The information
> about internal page size or erase block size is either hard to get or
> not public.

Hence, like the broken RAID controller case, we don't try to
optimise for them.  If they expose those things (and the P3700 case
demonstrates that they should!) then we'll automatically optimise
the filesystem for their physical characteristics.

> Can you give an example of a use case that would be negatively affected
> if this heuristic was switched from "sunit" to "sunit < swidth"?

Any time you only know a single alignment characteristic of the
underlying multi-disk storage. e.g. hardware RAID0/5/6 that sets
iomin = ioopt, multi-level RAID constructs where only the largest
alignment requirement is exposed, RAID1 devices exposing their chunk
size, remote replication chunk alignment (because remote rep. is
slow and so we need more concurrency to keep the pipeline full),
etc.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
