From: Eric Sandeen <sandeen@sandeen.net>
To: Ilya Dryomov <idryomov@gmail.com>, Dave Chinner <david@fromorbit.com>
Cc: xfs <linux-xfs@vger.kernel.org>, Mark Nelson <mnelson@redhat.com>,
	Eric Sandeen <sandeen@redhat.com>,
	Mike Snitzer <snitzer@redhat.com>
Subject: Re: [PATCH] mkfs.xfs: don't go into multidisk mode if there is only one stripe
Date: Fri, 5 Oct 2018 08:51:59 -0500
Message-ID: <67627995-714c-5c38-a796-32b503de7d13@sandeen.net>
In-Reply-To: <CAOi1vP_U-QpABgs+a9oYpYvLWs4D2qVmff=-JEikn7S_=eCAXQ@mail.gmail.com>

On 10/5/18 6:27 AM, Ilya Dryomov wrote:
> On Fri, Oct 5, 2018 at 12:29 AM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Thu, Oct 04, 2018 at 01:33:12PM -0500, Eric Sandeen wrote:
>>> On 10/4/18 12:58 PM, Ilya Dryomov wrote:
>>>> rbd devices report the following geometry:
>>>>
>>>>   $ blockdev --getss --getpbsz --getiomin --getioopt /dev/rbd0
>>>>   512
>>>>   512
>>>>   4194304
>>>>   4194304
>>
>> dm-thinp does this as well. This is from the thinp device created
>> by tests/generic/459:
>>
>> 512
>> 4096
>> 65536
>> 65536
> 
> (adding Mike)
> 
> ... and that 300M filesystem ends up with 8 AGs, when normally you get
> 4 AGs for anything less than 4T.  Is that really intended?

Well, yes.  Multi-disk mode gives you more AGs; how many more scales
with filesystem size.

        /*
         * For the multidisk configs we choose an AG count based on the number
         * of data blocks available, trying to keep the number of AGs higher
         * than the single disk configurations. This makes the assumption that
         * larger filesystems have more parallelism available to them.
         */

For really tiny filesystems we cut down the number of AGs, but in general
if the storage "told" us it has parallelism, mkfs uses it by default.
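To make that concrete, here's a rough sketch of the scaling (illustrative
thresholds only, not the actual calc_default_ag_geometry() code from
xfsprogs; the helper name and cutoffs are made up for the example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of the AG-count heuristic discussed above, NOT
 * the real xfsprogs code: very small filesystems get their AG count
 * cut down, single-disk configs default to 4 AGs, and "multidisk"
 * configs start higher and scale up with filesystem size.
 */
static int default_agcount(uint64_t fs_bytes, int multidisk)
{
	if (fs_bytes <= 64ULL << 20)	/* very small fs: single AG */
		return 1;
	if (!multidisk)
		return 4;		/* single-disk default */
	/* multidisk: start at 8 AGs, double as the fs gets bigger */
	int agcount = 8;
	for (uint64_t sz = 8ULL << 30; sz < fs_bytes && agcount < 128; sz *= 8)
		agcount *= 2;
	return agcount;
}
```

With numbers like these, a 300M filesystem comes out at 8 AGs when the
storage looks "multidisk" and 4 when it doesn't, which matches the
behavior Ilya is seeing.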

> AFAIK dm-thinp reports these values for the same exact reason as rbd:
> we are passing up the information about the efficient I/O size.  In the
> case of dm-thinp, this is the thinp block size.  If you put dm-thinp on
> top of a RAID array, I suspect it would pass up the array's preferred
> sizes, as long as they are a proper factor of the thinp block size.
> 
> The high agcount on dm-thinp has come up before and you suggested that
> dm-thinp should report iomin == ioopt (i.e. sunit == swidth).  If that
> was the right fix back in 2014, mkfs.xfs must have regressed:
> 
>   https://marc.info/?l=linux-xfs&m=137783388617206&w=2

Well, that question started out as being about higher AG counts.  And
it did come from the advertised stripe geometry, but I don't think
Dave's suggestion was intended to keep the AG count lower; it was just
explaining that the values seemed wrong.  It was a bit of an odd
situation: the existence of stripe geometry bumped up the AG count,
but then mkfs decided that a 512-byte "stripe unit" wasn't legitimate
and set it back to zero, while keeping the higher AG count.  So there
were a few related issues there.
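The "not legit" part boils down to something like this (a sketch of the
idea, not the actual mkfs.xfs validation code; the helper name is made
up): a stripe unit smaller than the fs block size can't be expressed in
filesystem blocks, so it gets dropped, even though the AG count had
already been bumped on the strength of it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: convert an advertised stripe unit (bytes) to fs blocks.
 * A value smaller than, or not a multiple of, the block size is not
 * expressible, so it is treated as "no stripe geometry" (returns 0).
 */
static uint32_t sunit_in_blocks(uint32_t sunit_bytes, uint32_t blocksize)
{
	if (sunit_bytes < blocksize || sunit_bytes % blocksize != 0)
		return 0;
	return sunit_bytes / blocksize;
}
```

So a 512-byte "stripe unit" against a 4096-byte block size comes back as
zero, but nothing un-bumped the AG count.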

>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fdfb4c8c1a9fc8dd8cf8eeb4e3ed83573b375285

So that sets them to the same size, but I don't think that change was
ever expected to change the AG count heuristic behavior, which has been
there for probably a decade or more.

>>
>> And I've also seen some hardware raid controllers do this, too,
>> because they only expose the stripe width in their enquiry page
>> rather than stripe unit and stripe width.

(which should be considered semi-broken hardware, no?)

>> IOWs, this behaviour isn't really specific to Ceph's rbd device, and
>> it does occur on multi-disk devices that have something layered over
>> the top (dm-thinp, hardware raid, etc). As such, I don't think
>> there's a "one size fits all" solution and so someone is going to
>> have to tweak mkfs settings to have it do the right thing for their
>> storage subsystem....
> 
> FWIW I was surprised to see that calc_default_ag_geometry() doesn't
> look at swidth and just assumes that there will be "more parallelism
> available".  I expected it to be based on swidth to sunit ratio (i.e.
> sw).  sw is supposed to be the multiplier equal to the number of
> data-bearing disks, so it's the first thing that comes to mind for
> a parallelism estimate.
> 
> I'd argue that hardware RAID administrators are much more likely to pay
> attention to the output of mkfs.xfs and be able to tweak the settings
> to work around broken controllers that only expose stripe width.

Yeah, this starts to get a little philosophical.  We don't want to
second-guess geometry or try to figure out what the RAID array "really
meant" if it's sending weird numbers. [1]

But at the end of the day, it seems reasonable to always apply the
"swidth/sunit = number of data disks" rule (which we apply in reverse
when we tell people how to manually figure out stripe widths) and stop
treating sunit == swidth as any indication of parallelism.
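In other words, the proposed gate would look something like this (a
minimal sketch of the rule under discussion, not actual mkfs.xfs code;
the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the proposed heuristic: treat the device as "multidisk"
 * (i.e. assume extra parallelism and bump the AG count) only when the
 * optimal I/O size is a true multiple of the minimum, so that
 * sw = ioopt / iomin (swidth / sunit) estimates the number of
 * data-bearing disks.  iomin == ioopt gives sw == 1: no parallelism.
 */
static int is_multidisk(uint32_t iomin, uint32_t ioopt)
{
	return iomin != 0 && ioopt > iomin && ioopt % iomin == 0;
}
```

Under that rule, both rbd (4194304/4194304) and the generic/459 thinp
device (65536/65536) stop triggering multidisk mode, while a RAID
reporting, say, iomin=65536 ioopt=524288 still would.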

Dave, do you have any problem with changing the behavior to only go into
multidisk if swidth > sunit?  The more I think about it, the more it makes
sense to me.

-Eric


[1] for example,
    99a3f52 mkfs: avoid divide-by-zero when hardware reports optimal i/o size as 0


Thread overview: 27+ messages
2018-10-04 17:58 [PATCH] mkfs.xfs: don't go into multidisk mode if there is only one stripe Ilya Dryomov
2018-10-04 18:33 ` Eric Sandeen
2018-10-04 18:56   ` Ilya Dryomov
2018-10-04 22:29   ` Dave Chinner
2018-10-05 11:27     ` Ilya Dryomov
2018-10-05 13:51       ` Eric Sandeen [this message]
2018-10-05 23:27         ` Dave Chinner
2018-10-06 12:17           ` Ilya Dryomov
2018-10-06 23:20             ` Dave Chinner
2018-10-07  0:14               ` Eric Sandeen
2018-11-29 13:53                 ` Ric Wheeler
2018-11-29 21:48                   ` Dave Chinner
2018-11-29 23:53                     ` Ric Wheeler
2018-11-30  2:25                       ` Dave Chinner
2018-11-30 18:00                         ` block layer API for file system creation - when to use multidisk mode Ric Wheeler
2018-11-30 18:00                           ` Ric Wheeler
2018-11-30 18:05                           ` Mark Nelson
2018-11-30 18:05                             ` Mark Nelson
2018-12-01  4:35                           ` Dave Chinner
2018-12-01  4:35                             ` Dave Chinner
2018-12-01 20:52                             ` Ric Wheeler
2018-12-01 20:52                               ` Ric Wheeler
2018-10-07 13:54               ` [PATCH] mkfs.xfs: don't go into multidisk mode if there is only one stripe Ilya Dryomov
2018-10-10  0:28                 ` Dave Chinner
2018-10-05 14:50       ` Mike Snitzer
2018-10-05 14:55         ` Eric Sandeen
2018-10-05 17:21           ` Ilya Dryomov
