From: Dave Chinner <david@fromorbit.com>
To: "Richard W.M. Jones" <rjones@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
Date: Tue, 4 Sep 2018 19:12:30 +1000
Message-ID: <20180904091230.GU5631@dastard>
In-Reply-To: <20180904082332.GS5631@dastard>

On Tue, Sep 04, 2018 at 06:23:32PM +1000, Dave Chinner wrote:
> On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > [This is silly and has no real purpose except to explore the limits.
> > > If that offends you, don't read the rest of this email.]
> > 
> > We do this quite frequently ourselves, even if it is just to remind
> > ourselves how long it takes to wait for millions of IOs to be done.
> > 
> > > I am trying to create an XFS filesystem in a partition of approx
> > > 2^63 - 1 bytes to see what happens.
> > 
> > Should just work. You might find problems with the underlying
> > storage, but the XFS side of things should just work.
> 
> > I'm trying to reproduce it here:
> > 
> > $ grep vdd /proc/partitions 
> >  253       48 9007199254739968 vdd
> > $ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
> > meta-data=/dev/vdd               isize=512    agcount=8388609, agsize=268435455 blks
> >          =                       sectsz=1024  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=2251799813684887, imaxpct=1
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=1024  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > 
> > And it is running now without the "-N" and I have to wait for tens
> > of millions of IOs to be issued. The write rate is currently about
> > 13,000 IOPS, so I'm guessing it'll take at least an hour to do
> > this. Next time I'll run it on the machine with faster SSDs.
> > 
> > I haven't seen any error after 20 minutes, though.
> 
> I killed it after two and a half hours, and started looking at why it
> was taking that long. That's the above.

Or the below. Stand on your head if you're confused.

-Dave.

> But it's not fast. This is the first time I've looked at whether we
> perturbed the IO patterns in the recent mkfs.xfs refactoring. I'm
> not sure we made them any worse (the algorithms are the same), but
> it's now much more obvious how we can improve them drastically with
> a few small mods.
> 
> Firstly, there's the force overwrite algorithm that zeros the old
> filesystem signature. On an 8EB device with an existing 8EB
> filesystem, that's 8+ million single sector IOs right there.
> So for the moment, zero the first 1MB of the device to whack the
> old superblock and you can avoid this step. I've got a fix for that
> now:
> 
> 	Time to mkfs a 1TB filesystem on a big device after it held another
> 	larger filesystem:
> 
> 	previous FS size	10PB	100PB	 1EB
> 	old mkfs time		1.95s	8.9s	81.3s
> 	patched			0.95s	1.2s	 1.2s
> 
> 
> Second, use -K to avoid discard (which you already know).
> 
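> For reference, the manual workaround boils down to zeroing the start
> of the device before mkfs and then running mkfs with -K. The zeroing
> step is just a 1MiB write of zeros at offset 0 - dd if=/dev/zero
> of=/dev/vdd bs=1M count=1, or in C something like the untested
> sketch below (device path is only an example):
> 
> /*
>  * Zero the first 1MiB of the device so mkfs.xfs doesn't find the
>  * old superblock and fall into the slow signature-zeroing path.
>  */
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> #define ZERO_LEN	(1024 * 1024)		/* first 1MiB */
> 
> int main(int argc, char **argv)
> {
> 	const char *dev = argc > 1 ? argv[1] : "/dev/vdd";
> 	char *buf;
> 	int fd;
> 
> 	fd = open(dev, O_WRONLY);
> 	if (fd < 0) {
> 		perror("open");
> 		return 1;
> 	}
> 
> 	buf = calloc(1, ZERO_LEN);	/* 1MiB of zeros */
> 	if (!buf)
> 		return 1;
> 
> 	if (pwrite(fd, buf, ZERO_LEN, 0) != ZERO_LEN) {
> 		perror("pwrite");
> 		return 1;
> 	}
> 	fsync(fd);			/* make sure it hits the media */
> 	close(fd);
> 	free(buf);
> 	return 0;
> }
> 
> After that, a normal mkfs.xfs run with -K added skips both the old
> signature zeroing and the discards.
> 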
> Third, we do two passes over the AG headers to initialise them.
> Unfortunately, with a large number of AGs, they don't stay in the
> buffer cache and so the second pass involves RMW cycles. This means
> we do at least 5 more read and 5 more write IOs per AG than we
> need to. I've got a fix for this, too:
> 
> 	Time to make a filesystem from scratch, using a zeroed device so the
> 	force overwrite algorithms are not triggered and -K to avoid
> 	discards:
> 
> 	FS size         10PB    100PB    1EB
> 	current mkfs    26.9s   214.8s  2484s
> 	patched         11.3s    70.3s	 709s
> 
> From that projection, the 8EB mkfs would have taken somewhere around
> 7-8 hours to complete. The new code should only take a couple of
> hours. Still not all that good....
> 
> .... and I think that's because we are using direct IO. That means
> the IO we issue is effectively synchronous, even though we're sorta
> doing delayed writeback. The problem is that mkfs is not threaded, so
> writeback happens when the cache fills up and we run out of buffers
> on the free list. Basically it's "direct delayed writeback" at that
> point.
> 
> Worse, because it's synchronous, we don't drive more than one IO at
> a time and so we don't get adjacent sector merging, even though most
> of the AG header writes are to adjacent sectors. That would cut the
> number of IOs from ~10 per AG down to 2 for sectorsize < blocksize
> filesystems and 1 for sectorsize = blocksize filesystems.
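> 
> To make that concrete, what we effectively do today looks like the
> untested sketch below - synchronous O_DIRECT writes, strictly one at
> a time, so the block layer never sees two adjacent writes together
> and can't merge them. (Illustrative only, not the libxfs buffer
> cache code; the file name is a placeholder.)
> 
> /*
>  * Roughly the IO pattern mkfs has today: synchronous direct IO with
>  * an effective queue depth of 1.  Adjacent writes are issued back to
>  * back but never overlap in time, so they can't be merged.
>  */
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #define IO_SIZE		4096	/* keeps O_DIRECT alignment simple */
> #define NWRITES		16
> 
> int main(int argc, char **argv)
> {
> 	const char *path = argc > 1 ? argv[1] : "syncio-demo.img";
> 	void *buf;
> 	int fd, i;
> 
> 	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
> 	if (fd < 0) {
> 		perror("open");
> 		return 1;
> 	}
> 	if (posix_memalign(&buf, 4096, IO_SIZE))
> 		return 1;
> 	memset(buf, 0, IO_SIZE);
> 
> 	for (i = 0; i < NWRITES; i++) {
> 		/* The next IO is not issued until this one completes. */
> 		if (pwrite(fd, buf, IO_SIZE, (off_t)i * IO_SIZE) != IO_SIZE) {
> 			perror("pwrite");
> 			return 1;
> 		}
> 	}
> 	close(fd);
> 	free(buf);
> 	return 0;
> }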
> 
> This isn't so easy to fix. I either need to:
> 
> 	1) thread the libxfs buffer cache so we can do this
> 	   writeback in the background;
> 	2) thread mkfs so it can process multiple AGs at once; or
> 	3) libxfs needs to use AIO via delayed write infrastructure
> 	   similar to what we have in the kernel (buffer lists).
> 
> Approach 1) does not solve the queue depth = 1 issue, so
> it's of limited value. Might be quick, but doesn't really get us
> much improvement.
> 
> Approach 2) drives deeper queues, but it doesn't solve the adjacent
> sector IO merging problem because each thread only has a queue depth
> of one. So we'll be able to do more IO, but IO efficiency won't
> improve. And, realistically, this isn't a good idea because
> out-of-order AG processing doesn't work on spinning rust - it just
> causes seek storms and things go slower. To make things faster on
> spinning rust, we need single threaded, in-order dispatch,
> asynchronous writeback.
> Which is almost what 1) is, except it's not asynchronous.
> 
> That's what 3) solves - single threaded, in-order, async writeback,
> controlled by the context creating the dirty buffers in a limited
> AIO context.  I'll have to think about this a bit more....
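> 
> To illustrate the idea (this is not what the final libxfs code would
> look like), an untested sketch of 3) using Linux native AIO via
> libaio: buffers are still dispatched single threaded and in
> ascending offset order, but up to QUEUE_DEPTH of them are in flight
> at once, so the block layer gets the chance to merge adjacent sector
> writes. The "dirty buffers" are fake stand-ins for the real buffer
> cache; needs libaio (build with gcc -laio).
> 
> /*
>  * Option 3 sketch: single threaded, in-order, bounded-depth async
>  * writeback via libaio.
>  */
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <libaio.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #define QUEUE_DEPTH	64		/* bound on in-flight IOs */
> #define NBUFS		1024		/* fake dirty buffers to flush */
> #define BUF_SIZE	4096
> 
> int main(int argc, char **argv)
> {
> 	const char *path = argc > 1 ? argv[1] : "aio-demo.img";
> 	struct iocb iocbs[QUEUE_DEPTH], *iocbp[QUEUE_DEPTH];
> 	struct io_event events[QUEUE_DEPTH];
> 	void *bufs[QUEUE_DEPTH];
> 	io_context_t ctx;
> 	int fd, i, ret, done, batch;
> 
> 	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
> 	if (fd < 0) {
> 		perror("open");
> 		return 1;
> 	}
> 
> 	memset(&ctx, 0, sizeof(ctx));	/* must be zeroed for io_setup() */
> 	ret = io_setup(QUEUE_DEPTH, &ctx);
> 	if (ret < 0) {
> 		fprintf(stderr, "io_setup: %s\n", strerror(-ret));
> 		return 1;
> 	}
> 
> 	for (i = 0; i < QUEUE_DEPTH; i++) {
> 		if (posix_memalign(&bufs[i], 4096, BUF_SIZE))
> 			return 1;
> 		memset(bufs[i], 0, BUF_SIZE);
> 	}
> 
> 	/*
> 	 * Dispatch in ascending offset order, QUEUE_DEPTH IOs at a time.
> 	 * With many adjacent writes in flight together, the block layer
> 	 * can merge them, unlike the one-at-a-time synchronous pattern.
> 	 */
> 	for (done = 0; done < NBUFS; done += batch) {
> 		batch = NBUFS - done;
> 		if (batch > QUEUE_DEPTH)
> 			batch = QUEUE_DEPTH;
> 
> 		for (i = 0; i < batch; i++) {
> 			io_prep_pwrite(&iocbs[i], fd, bufs[i], BUF_SIZE,
> 				       (long long)(done + i) * BUF_SIZE);
> 			iocbp[i] = &iocbs[i];
> 		}
> 		ret = io_submit(ctx, batch, iocbp);
> 		if (ret != batch) {	/* treat a partial submit as fatal */
> 			fprintf(stderr, "io_submit: %d\n", ret);
> 			return 1;
> 		}
> 		/* Reap the whole batch before reusing its buffers. */
> 		ret = io_getevents(ctx, batch, batch, events, NULL);
> 		if (ret != batch) {
> 			fprintf(stderr, "io_getevents: %d\n", ret);
> 			return 1;
> 		}
> 	}
> 
> 	io_destroy(ctx);
> 	close(fd);
> 	return 0;
> }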
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com
