* mkfs.xfs options suitable for creating absurdly large XFS filesystems?
@ 2018-09-03 22:49 Richard W.M. Jones
  2018-09-04  0:49 ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Richard W.M. Jones @ 2018-09-03 22:49 UTC (permalink / raw)
  To: linux-xfs

[This is silly and has no real purpose except to explore the limits.
If that offends you, don't read the rest of this email.]

I am trying to create an XFS filesystem in a partition of approx
2^63 - 1 bytes to see what happens.

This creates a 2^63 - 1 byte virtual disk and partitions it:

  # nbdkit memory size=9223372036854775807

  # modprobe nbd
  # nbd-client localhost /dev/nbd0
  # blockdev --getsize64 /dev/nbd0
  9223372036854774784
  # gdisk /dev/nbd0
  [...]
  Command (? for help): n
  Partition number (1-128, default 1):
  First sector (18-9007199254740973, default = 1024) or {+-}size{KMGTP}:
  Last sector (1024-9007199254740973, default = 9007199254740973) or {+-}size{KMGTP}:
  Current type is 'Linux filesystem'
  Hex code or GUID (L to show codes, Enter = 8300):
  Changed type of partition to 'Linux filesystem'
  Command (? for help): w

The first problem was that the standard mkfs.xfs command will
try to trim the disk in 4 GB chunks (I believe this is a limit
imposed by the kernel APIs).  For an 8 EB image that takes forever.

However I can use the -K option to get around that:

  # mkfs.xfs -K /dev/nbd0p1
  meta-data=/dev/nbd0p1            isize=512    agcount=8388609, agsize=268435455 blks
           =                       sectsz=1024  attr=2, projid32bit=1
           =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
  data     =                       bsize=4096   blocks=2251799813684987, imaxpct=1
           =                       sunit=0      swidth=0 blks
  naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
  log      =internal log           bsize=4096   blocks=521728, version=2
           =                       sectsz=1024  sunit=1 blks, lazy-count=1
  realtime =none                   extsz=4096   blocks=0, rtextents=0
  mkfs.xfs: read failed: Invalid argument

I guess this indicates a real bug in mkfs.xfs.  I've not tracked down
exactly why this syscall fails yet but will see if I can find it
later.

But first I wanted to ask a broader question about whether there are
other mkfs options (apart from -K) which are suitable when creating
especially large XFS filesystems?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-03 22:49 mkfs.xfs options suitable for creating absurdly large XFS filesystems? Richard W.M. Jones
@ 2018-09-04  0:49 ` Dave Chinner
  2018-09-04  8:23   ` Dave Chinner
                     ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Dave Chinner @ 2018-09-04  0:49 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: linux-xfs

On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> [This is silly and has no real purpose except to explore the limits.
> If that offends you, don't read the rest of this email.]

We do this quite frequently ourselves, even if it is just to remind
ourselves how long it takes to wait for millions of IOs to be done.

> I am trying to create an XFS filesystem in a partition of approx
> 2^63 - 1 bytes to see what happens.

Should just work. You might find problems with the underlying
storage, but the XFS side of things should just work.

> This creates a 2^63 - 1 byte virtual disk and partitions it:
> 
>   # nbdkit memory size=9223372036854775807
> 
>   # modprobe nbd
>   # nbd-client localhost /dev/nbd0
>   # blockdev --getsize64 /dev/nbd0
>   9223372036854774784

$ echo $((2**63 - 1))
9223372036854775807

So the block device size is (2**63 - 1024) bytes.

>   # gdisk /dev/nbd0
>   [...]
>   Command (? for help): n
>   Partition number (1-128, default 1):
>   First sector (18-9007199254740973, default = 1024) or {+-}size{KMGTP}:
>   Last sector (1024-9007199254740973, default = 9007199254740973) or {+-}size{KMGTP}:

What's the sector size of your device? This seems to imply that it is
1024 bytes, not the normal 512 or 4096 bytes we see in most devices.

>   Current type is 'Linux filesystem'
>   Hex code or GUID (L to show codes, Enter = 8300):
>   Changed type of partition to 'Linux filesystem'
>   Command (? for help): w
> 
> The first problem was that the standard mkfs.xfs command will
> try to trim the disk in 4 GB chunks (I believe this is a limit
> imposed by the kernel APIs).  For an 8 EB image that takes forever.

Not a mkfs bug. XFS does a single BLKDISCARD call for the entire
block device range (the ioctl takes a u64 {start, length} range). This gets
passed down as 64 bit ranges to __blkdev_issue_discard(), which then
slices and dices the large range to the granularity advertised by
the underlying block device.
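
For illustration, that whole-device discard boils down to one ioctl
taking a {start, length} pair of u64 byte values; a minimal standalone
sketch (device path and error handling are illustrative, not the mkfs
code):

/* Issue a single BLKDISCARD over an entire block device.  The kernel
 * splits the range internally according to the limits the device
 * advertises (discard_max_bytes etc). */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKDISCARD, BLKGETSIZE64 */

int main(void)
{
	int fd = open("/dev/nbd0p1", O_WRONLY);	/* illustrative path */
	uint64_t size, range[2];

	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &size) < 0) {
		perror("open/BLKGETSIZE64");
		return 1;
	}
	range[0] = 0;		/* start offset, in bytes */
	range[1] = size;	/* length, in bytes: the whole device */
	if (ioctl(fd, BLKDISCARD, &range) < 0)
		perror("BLKDISCARD");
	close(fd);
	return 0;
}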

Check /sys/block/<nbd-dev>/queue/discard_max_[hw_]bytes. The local
NVMe drives I have on this machine advertise:

$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

which is (2^41 - 512) bytes, more commonly known as (2^32 - 1)
sectors. Which, IIRC, is the maximum IO size that a single bio and
therefore a single discard request to the driver can support.

Hence if you are seeing 4GB discards on the NBD side, then the NBD
device must be advertising 4GB to the block layer as the
discard_max_bytes. i.e. this, at first blush, looks purely like a
NBD issue.

> However I can use the -K option to get around that:
> 
>   # mkfs.xfs -K /dev/nbd0p1
>   meta-data=/dev/nbd0p1            isize=512    agcount=8388609, agsize=268435455 blks
>            =                       sectsz=1024  attr=2, projid32bit=1

Oh, yeah, 1kB sectors. How weird is that - I've never seen a block
device with a 1kB sector before.

>            =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
>   data     =                       bsize=4096   blocks=2251799813684987, imaxpct=1
>            =                       sunit=0      swidth=0 blks
>   naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
>   log      =internal log           bsize=4096   blocks=521728, version=2
>            =                       sectsz=1024  sunit=1 blks, lazy-count=1
>   realtime =none                   extsz=4096   blocks=0, rtextents=0
>   mkfs.xfs: read failed: Invalid argument
> 
> I guess this indicates a real bug in mkfs.xfs.

Did it fail straight away? Or after a long time?  Can you trap this
in gdb and post a back trace so we know where it is coming from?

As it is:

$ man 2 read
....
	EINVAL
		fd  is  attached  to an object which is unsuitable
		for reading; or the file was opened with the
		O_DIRECT flag, and either the address specified in
		buf, the value specified in count, or the file
		offset is not suitably aligned.

mkfs.xfs uses direct IO on block devices, so this implies that the
underlying block device rejected the IO for alignment reasons.
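
To make the alignment rule concrete, here is a small standalone
sketch, assuming the failure is the O_DIRECT case: a 512-byte read
against a 1024-byte-sector device returns EINVAL, while a sector-sized
read is fine (the device path is illustrative):

/* With O_DIRECT the buffer address, the transfer size and the file
 * offset must all be multiples of the logical sector size, otherwise
 * the read fails with EINVAL. */
#define _GNU_SOURCE		/* O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKSSZGET */

int main(void)
{
	int fd = open("/dev/nbd0p1", O_RDONLY | O_DIRECT);
	int ssz = 0;
	void *buf;

	if (fd < 0 || ioctl(fd, BLKSSZGET, &ssz) < 0) {
		perror("open/BLKSSZGET");
		return 1;
	}
	printf("logical sector size: %d\n", ssz);

	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	/* too small for a 1024-byte-sector device: fails with EINVAL */
	if (pread(fd, buf, 512, 0) < 0)
		perror("pread 512");

	/* sector-sized and sector-aligned: should succeed */
	if (pread(fd, buf, ssz, 0) < 0)
		perror("pread sector");

	free(buf);
	close(fd);
	return 0;
}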

I'm trying to reproduce it here:

$ grep vdd /proc/partitions 
 253       48 9007199254739968 vdd
$ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
meta-data=/dev/vdd               isize=512    agcount=8388609, agsize=268435455 blks
         =                       sectsz=1024  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2251799813684887, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=1024  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


And it is running now without the "-N" and I have to wait for tens
of millions of IOs to be issued. The write rate is currently about
13,000 IOPS, so I'm guessing it'll take at least an hour to do
this. Next time I'll run it on the machine with faster SSDs.

I haven't seen any error after 20 minutes, though.

> I've not tracked down
> exactly why this syscall fails yet but will see if I can find it
> later.
> 
> But first I wanted to ask a broader question about whether there are
> other mkfs options (apart from -K) which are suitable when creating
> especially large XFS filesystems?

Use the defaults - there's nothing you can "optimise" to make
testing like this go faster because all the time is in
reading/writing AG headers. There's millions of them, and there are
cases where they may have to all be read at mount time, too. Be
prepared to wait a long time for simple things to happen...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  0:49 ` Dave Chinner
@ 2018-09-04  8:23   ` Dave Chinner
  2018-09-04  9:12     ` Dave Chinner
  2018-09-04  8:26   ` Richard W.M. Jones
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2018-09-04  8:23 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: linux-xfs

On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > [This is silly and has no real purpose except to explore the limits.
> > If that offends you, don't read the rest of this email.]
> 
> We do this quite frequently ourselves, even if it is just to remind
> ourselves how long it takes to wait for millions of IOs to be done.
> 
> > I am trying to create an XFS filesystem in a partition of approx
> > 2^63 - 1 bytes to see what happens.
> 
> Should just work. You might find problems with the underlying
> storage, but the XFS side of things should just work.

> I'm trying to reproduce it here:
> 
> $ grep vdd /proc/partitions 
>  253       48 9007199254739968 vdd
> $ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
> meta-data=/dev/vdd               isize=512    agcount=8388609, agsize=268435455 blks
>          =                       sectsz=1024  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2251799813684887, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=1024  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> And it is running now without the "-N" and I have to wait for tens
> of millions of IOs to be issued. The write rate is currently about
> 13,000 IOPS, so I'm guessing it'll take at least an hour to do
> this. Next time I'll run it on the machine with faster SSDs.
> 
> I haven't seen any error after 20 minutes, though.

I killed it after 2 and a half hours, and started looking at why it
was taking that long. That's the above.

But it's not fast. This is the first time I've looked at whether we
perturbed the IO patterns in the recent mkfs.xfs refactoring. I'm
not sure we made them any worse (the algorithms are the same), but
it's now much more obvious how we can improve them drastically with
a few small mods.

Firstly, there's the force overwrite algorithm that zeros the old
filesystem signature. On an 8EB device with an existing 8EB
filesystem, there's 8+ million single sector IOs right there.
So for the moment, zero the first 1MB of the device to whack the
old superblock and you can avoid this step. I've got a fix for that
now:

	Time to mkfs a 1TB filesystem on a big device after it held another
	larger filesystem:

	previous FS size	10PB	100PB	 1EB
	old mkfs time		1.95s	8.9s	81.3s
	patched			0.95s	1.2s	 1.2s
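
For the interim workaround above (zeroing the start of the device by
hand), a minimal sketch; dd if=/dev/zero of=<device> bs=1M count=1
does the same job (the device path below is illustrative):

/* Zero the first 1MB of the device so no old filesystem signature is
 * found and the force overwrite pass described above isn't triggered. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	static char zeros[1024 * 1024];		/* zero-initialised */
	int fd = open("/dev/nbd0p1", O_WRONLY);	/* illustrative path */

	if (fd < 0 ||
	    pwrite(fd, zeros, sizeof(zeros), 0) != sizeof(zeros)) {
		perror("zeroing");
		return 1;
	}
	fsync(fd);
	close(fd);
	return 0;
}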


Second, use -K to avoid discard (which you already know).

Third, we do two passes over the AG headers to initialise them.
Unfortunately, with a large number of AGs, they don't stay in the
buffer cache and so the second pass involves RMW cycles. This means
we do at least 5 more read and 5 more write IOs per AG than we
need to. I've got a fix for this, too:

	Time to make a filesystem from scratch, using a zeroed device so the
	force overwrite algorithms are not triggered and -K to avoid
	discards:

	FS size         10PB    100PB    1EB
	current mkfs    26.9s   214.8s  2484s
	patched         11.3s    70.3s	 709s

From that projection, the 8EB mkfs would have taken somewhere around
7-8 hours to complete. The new code should only take a couple of
hours. Still not all that good....

.... and I think that's because we are using direct IO. That means
the IO we issue is effectively synchronous, even though we're sort of
doing delayed writeback. The problem is that mkfs is not threaded so
writeback happens when the cache fills up and we run out of buffers
on the free list. Basically it's "direct delayed writeback" at that
point.

Worse, because it's synchronous, we don't drive more than one IO at
a time and so we don't get adjacent sector merging, even though most
of the AG header writes are to adjacent sectors. That would cut the
number of IOs from ~10 per AG down to 2 for sectorsize < blocksize
filesystems and 1 for sectorsize = blocksize filesystems.

This isn't so easy to fix. I either need to:

	1) thread the libxfs buffer cache so we can do this
	  writeback in the background.
	2) thread mkfs so it can process multiple AGs at once; or
	3) libxfs needs to use AIO via delayed write infrastructure
	similar to what we have in the kernel (buffer lists)

Approach 1) does not solve the queue depth = 1 issue, so
it's of limited value. Might be quick, but doesn't really get us
much improvement.

Approach 2) drives deeper queues, but it doesn't solve the adjacent
sector IO merging problem because each thread only has a queue depth
of one. So we'll be able to do more IO, but IO efficiency won't
improve. And, realistically, this isn't a good idea because OOO AG
processing doesn't work on spinning rust - it just causes seek
storms and things go slower. To make things faster on spinning rust,
we need single threaded, in order dispatch, asynchronous writeback.
Which is almost what 1) is, except it's not asynchronous.

That's what 3) solves - single threaded, in-order, async writeback,
controlled by the context creating the dirty buffers in a limited
AIO context.  I'll have to think about this a bit more....
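
To make option 3) a bit more concrete, here's a rough, purely
illustrative sketch of that shape using libaio: dirty buffers are
queued on a list and then flushed in order through a bounded AIO
context. All names, structures and sizes below are made up; this is
not the eventual libxfs code.

/* Sketch: flush a delayed-write list through a small AIO context so
 * many writes are in flight at once while dispatch stays single
 * threaded and in AG order.  Build with -laio. */
#include <libaio.h>
#include <stdlib.h>
#include <sys/types.h>

#define QDEPTH	128

struct dbuf {
	void		*data;		/* sector aligned for O_DIRECT */
	size_t		len;
	off_t		offset;
	struct dbuf	*next;		/* delwri list link */
};

int delwri_flush(io_context_t ctx, int fd, struct dbuf *list)
{
	struct iocb cbs[QDEPTH], *cbp[QDEPTH];
	struct io_event events[QDEPTH];

	while (list) {
		int n = 0, done = 0;

		/* fill one batch, preserving list (AG) order */
		while (list && n < QDEPTH) {
			io_prep_pwrite(&cbs[n], fd, list->data,
				       list->len, list->offset);
			cbp[n] = &cbs[n];
			n++;
			list = list->next;
		}
		if (io_submit(ctx, n, cbp) != n)
			return -1;

		/* reap the whole batch before refilling */
		while (done < n) {
			int r = io_getevents(ctx, 1, n - done,
					     events, NULL);
			if (r < 0)
				return -1;
			done += r;
		}
	}
	return 0;
}

A caller would do io_setup(QDEPTH, &ctx) once up front and
io_destroy(ctx) at the end; a real version would keep the queue
topped up rather than draining each batch, but dispatch stays single
threaded and in order either way.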

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  0:49 ` Dave Chinner
  2018-09-04  8:23   ` Dave Chinner
@ 2018-09-04  8:26   ` Richard W.M. Jones
  2018-09-04  9:11     ` Dave Chinner
  2018-09-04 15:36   ` Martin Steigerwald
  2018-09-05  9:05   ` Richard W.M. Jones
  3 siblings, 1 reply; 12+ messages in thread
From: Richard W.M. Jones @ 2018-09-04  8:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > [This is silly and has no real purpose except to explore the limits.
> > If that offends you, don't read the rest of this email.]
> 
> We do this quite frequently ourselves, even if it is just to remind
> ourselves how long it takes to wait for millions of IOs to be done.
>
> > I am trying to create an XFS filesystem in a partition of approx
> > 2^63 - 1 bytes to see what happens.
> 
> Should just work. You might find problems with the underlying
> storage, but the XFS side of things should just work.

Great!  How do you test this normally?  I'm assuming you must use a
virtual device and don't have actual 2^6x storage systems around?

[...]
> What's the sector size of your device? This seems to imply that it is
> 1024 bytes, not the normal 512 or 4096 bytes we see in most devices.

This led me to wonder how the sector size is chosen.  NBD itself is
agnostic about sectors (it deals entirely with byte offsets).  It
seems as if the Linux kernel NBD driver chooses this, I think here:

https://github.com/torvalds/linux/blob/60c1f89241d49bacf71035470684a8d7b4bb46ea/drivers/block/nbd.c#L1320

It seems an odd choice.

> Hence if you are seeing 4GB discards on the NBD side, then the NBD
> device must be advertising 4GB to the block layer as the
> discard_max_bytes. i.e. this, at first blush, looks purely like a
> NBD issue.

The 4 GB discard limit is indeed entirely a limit in the NBD protocol
(it uses 32-bit count sizes for various things like zeroing and
trimming, where it would make more sense to use a wider type because we
aren't sending data over the wire).  I will take this up with the
upstream community and see if we can get an extension added.
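
  For reference, the limit is right there in the request header: the
  offset field is 64 bits wide but the length is only 32 bits, so a
  single trim/zero request tops out just below 4 GiB. A rough sketch
  of the layout per the protocol description (field names here are
  illustrative):

  #include <stdint.h>

  /* Approximate shape of an NBD request header; all fields are
   * big-endian on the wire. */
  struct nbd_request_sketch {
          uint32_t magic;         /* request magic */
          uint16_t flags;         /* command flags */
          uint16_t type;          /* NBD_CMD_TRIM, NBD_CMD_WRITE_ZEROES, ... */
          uint64_t cookie;        /* echoed back in the reply */
          uint64_t offset;        /* 64-bit byte offset */
          uint32_t length;        /* 32-bit byte count: the 4 GB cap */
  } __attribute__((packed));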

> > However I can use the -K option to get around that:
> > 
> >   # mkfs.xfs -K /dev/nbd0p1
> >   meta-data=/dev/nbd0p1            isize=512    agcount=8388609, agsize=268435455 blks
> >            =                       sectsz=1024  attr=2, projid32bit=1
> 
> Oh, yeah, 1kB sectors. How weird is that - I've never seen a block
> device with a 1kB sector before.
> 
> >            =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
> >   data     =                       bsize=4096   blocks=2251799813684987, imaxpct=1
> >            =                       sunit=0      swidth=0 blks
> >   naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> >   log      =internal log           bsize=4096   blocks=521728, version=2
> >            =                       sectsz=1024  sunit=1 blks, lazy-count=1
> >   realtime =none                   extsz=4096   blocks=0, rtextents=0
> >   mkfs.xfs: read failed: Invalid argument
> > 
> > I guess this indicates a real bug in mkfs.xfs.
> 
> Did it fail straight away? Or after a long time?  Can you trap this
> in gdb and post a back trace so we know where it is coming from?

Yes I think I was far too hasty declaring this a problem with mkfs.xfs
last night.  It turns out that NBD on the wire can only describe a few
different errors and maps any other error to -EINVAL, which is likely
what is happening here.  I'll get the NBD server to log errors to find
out what's really going on.

[...]
> > But first I wanted to ask a broader question about whether there are
> > other mkfs options (apart from -K) which are suitable when creating
> > especially large XFS filesystems?
> 
> Use the defaults - there's nothing you can "optimise" to make
> testing like this go faster because all the time is in
> reading/writing AG headers. There's millions of them, and there are
> cases where they may have to all be read at mount time, too. Be
> prepared to wait a long time for simple things to happen...

OK this is really good to know, thanks.  I'll keep testing.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  8:26   ` Richard W.M. Jones
@ 2018-09-04  9:11     ` Dave Chinner
  2018-09-04  9:45       ` Richard W.M. Jones
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2018-09-04  9:11 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: linux-xfs

On Tue, Sep 04, 2018 at 09:26:00AM +0100, Richard W.M. Jones wrote:
> On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > [This is silly and has no real purpose except to explore the limits.
> > > If that offends you, don't read the rest of this email.]
> > 
> > We do this quite frequently ourselves, even if it is just to remind
> > ourselves how long it takes to wait for millions of IOs to be done.
> >
> > > I am trying to create an XFS filesystem in a partition of approx
> > > 2^63 - 1 bytes to see what happens.
> > 
> > Should just work. You might find problems with the underlying
> > storage, but the XFS side of things should just work.
> 
> Great!  How do you test this normally?

The usual: it's turtles all the way down.

> I'm assuming you must use a
> virtual device and don't have actual 2^6x storage systems around?

Right. I use XFS-on-XFS configurations, i.e. XFS is the storage pool
on physical storage (SSDs in RAID0 in this case). The disk images
are sparse files w/ extent size hints to minimise fragmentation and
allocation overhead. And the QEMU config uses AIO/DIO so it can do
concurrent, deeply queued async read/write IO from the guest to the
host - the guest block device behaves exactly like it is hosted on
real disks.

Apart from reflink and extent size hints, I'm using the defaults for
everything.
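
For anyone recreating that host side setup, the extent size hint can
be set on the sparse image file with the FS_IOC_FSSETXATTR ioctl
(xfs_io -c "extsize 1m" <file> does the same); a small sketch with
illustrative paths and sizes:

/* Create a sparse image file and set an XFS extent size hint on it. */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* struct fsxattr, FS_XFLAG_EXTSIZE */

int main(void)
{
	int fd = open("/images/disk0.img", O_RDWR | O_CREAT, 0644);
	struct fsxattr fsx;

	if (fd < 0 || ftruncate(fd, 1ULL << 40) < 0) {	/* sparse 1TiB */
		perror("open/ftruncate");
		return 1;
	}
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == 0) {
		fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
		fsx.fsx_extsize = 1024 * 1024;	/* 1MiB hint, in bytes */
		if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
			perror("FS_IOC_FSSETXATTR");
	}
	close(fd);
	return 0;
}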

> > > I guess this indicates a real bug in mkfs.xfs.
> > 
> > Did it fail straight away? Or after a long time?  Can you trap this
> > in gdb and post a back trace so we know where it is coming from?
> 
> Yes I think I was far too hasty declaring this a problem with mkfs.xfs
> last night.  It turns out that NBD on the wire can only describe a few
> different errors and maps any other error to -EINVAL, which is likely

Urk. It should map them to -EIO, because then we know it's come from
the IO layer and isn't a problem related to userspace passing the
kernel invalid parameters.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  8:23   ` Dave Chinner
@ 2018-09-04  9:12     ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2018-09-04  9:12 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: linux-xfs

On Tue, Sep 04, 2018 at 06:23:32PM +1000, Dave Chinner wrote:
> On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > [This is silly and has no real purpose except to explore the limits.
> > > If that offends you, don't read the rest of this email.]
> > 
> > We do this quite frequently ourselves, even if it is just to remind
> > ourselves how long it takes to wait for millions of IOs to be done.
> > 
> > > I am trying to create an XFS filesystem in a partition of approx
> > > 2^63 - 1 bytes to see what happens.
> > 
> > Should just work. You might find problems with the underlying
> > storage, but the XFS side of things should just work.
> 
> > I'm trying to reproduce it here:
> > 
> > $ grep vdd /proc/partitions 
> >  253       48 9007199254739968 vdd
> > $ sudo mkfs.xfs -f -s size=1024 -d size=2251799813684887b -N /dev/vdd
> > meta-data=/dev/vdd               isize=512    agcount=8388609, agsize=268435455 blks
> >          =                       sectsz=1024  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=2251799813684887, imaxpct=1
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=1024  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > 
> > And it is running now without the "-N" and I have to wait for tens
> > of millions of IOs to be issued. The write rate is currently about
> > 13,000 IOPS, so I'm guessing it'll take at least an hour to do
> > this. Next time I'll run it on the machine with faster SSDs.
> > 
> > I haven't seen any error after 20 minutes, though.
> 
> I killed it after 2 and a half hours, and started looking at why it
> was taking that long. That's the above.

Or the below. Stand on your head if you're confused.

-Dave.

> But it's not fast. This is the first time I've looked at whether we
> perturbed the IO patterns in the recent mkfs.xfs refactoring. I'm
> not sure we made them any worse (the algorithms are the same), but
> it's now much more obvious how we can improve them drastically with
> a few small mods.
> 
> Firstly, there's the force overwrite algorithm that zeros the old
> filesystem signature. On an 8EB device with an existing 8EB
> filesystem, there's 8+ million single sector IOs right there.
> So for the moment, zero the first 1MB of the device to whack the
> old superblock and you can avoid this step. I've got a fix for that
> now:
> 
> 	Time to mkfs a 1TB filesystem on a big device after it held another
> 	larger filesystem:
> 
> 	previous FS size	10PB	100PB	 1EB
> 	old mkfs time		1.95s	8.9s	81.3s
> 	patched			0.95s	1.2s	 1.2s
> 
> 
> Second, use -K to avoid discard (which you already know).
> 
> Third, we do two passes over the AG headers to initialise them.
> Unfortunately, with a large number of AGs, they don't stay in the
> buffer cache and so the second pass involves RMW cycles. This means
> we do at least 5 more read and 5 more write IOs per AG than we
> need to. I've got a fix for this, too:
> 
> 	Time to make a filesystem from scratch, using a zeroed device so the
> 	force overwrite algorithms are not triggered and -K to avoid
> 	discards:
> 
> 	FS size         10PB    100PB    1EB
> 	current mkfs    26.9s   214.8s  2484s
> 	patched         11.3s    70.3s	 709s
> 
> From that projection, the 8EB mkfs would have taken somewhere around
> 7-8 hours to complete. The new code should only take a couple of
> hours. Still not all that good....
> 
> .... and I think that's because we are using direct IO. That means
> the IO we issue is effectively synchronous, even though we're sort of
> doing delayed writeback. The problem is that mkfs is not threaded so
> writeback happens when the cache fills up and we run out of buffers
> on the free list. Basically it's "direct delayed writeback" at that
> point.
> 
> Worse, because it's synchronous, we don't drive more than one IO at
> a time and so we don't get adjacent sector merging, even though most
> of the AG header writes are to adjacent sectors. That would cut the
> number of IOs from ~10 per AG down to 2 for sectorsize < blocksize
> filesystems and 1 for sectorsize = blocksize filesystems.
> 
> This isn't so easy to fix. I either need to:
> 
> 	1) thread the libxfs buffer cache so we can do this
> 	  writeback in the background.
> 	2) thread mkfs so it can process multiple AGs at once; or
> 	3) libxfs needs to use AIO via delayed write infrastructure
> 	similar to what we have in the kernel (buffer lists)
> 
> Approach 1) does not solve the queue depth = 1 issue, so
> it's of limited value. Might be quick, but doesn't really get us
> much improvement.
> 
> Approach 2) drives deeper queues, but it doesn't solve the adjacent
> sector IO merging problem because each thread only has a queue depth
> of one. So we'll be able to do more IO, but IO efficiency won't
> improve. And, realistically, this isn't a good idea because OOO AG
> processing doesn't work on spinning rust - it just causes seek
> storms and things go slower. To make things faster on spinning rust,
> we need single threaded, in order dispatch, asynchronous writeback.
> Which is almost what 1) is, except it's not asynchronous.
> 
> That's what 3) solves - single threaded, in-order, async writeback,
> controlled by the context creating the dirty buffers in a limited
> AIO context.  I'll have to think about this a bit more....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  9:11     ` Dave Chinner
@ 2018-09-04  9:45       ` Richard W.M. Jones
  0 siblings, 0 replies; 12+ messages in thread
From: Richard W.M. Jones @ 2018-09-04  9:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Sep 04, 2018 at 07:11:12PM +1000, Dave Chinner wrote:
> On Tue, Sep 04, 2018 at 09:26:00AM +0100, Richard W.M. Jones wrote:
> > Yes I think I was far too hasty declaring this a problem with mkfs.xfs
> > last night.  It turns out that NBD on the wire can only describe a few
> > different errors and maps any other error to -EINVAL, which is likely
> 
> Urk. It should map them to -EIO, because then we know it's come from
> the IO layer and isn't a problem related to userspace passing the
> kernel invalid parameters.

Actually EIO is one of the errors that NBD does understand :-)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  0:49 ` Dave Chinner
  2018-09-04  8:23   ` Dave Chinner
  2018-09-04  8:26   ` Richard W.M. Jones
@ 2018-09-04 15:36   ` Martin Steigerwald
  2018-09-04 22:23     ` Dave Chinner
  2018-09-05  9:05   ` Richard W.M. Jones
  3 siblings, 1 reply; 12+ messages in thread
From: Martin Steigerwald @ 2018-09-04 15:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Richard W.M. Jones, linux-xfs

Dave Chinner - 04.09.18, 02:49:
> On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > [This is silly and has no real purpose except to explore the limits.
> > If that offends you, don't read the rest of this email.]
> 
> We do this quite frequently ourselves, even if it is just to remind
> ourselves how long it takes to wait for millions of IOs to be done.

Just for the fun of it, during a Linux Performance analysis & tuning
course I held, I created a 1 EiB XFS filesystem on a sparse file on
another XFS filesystem on the SSD of a ThinkPad T520. It took several
hours to create, but then it was there and mountable. AFAIR the sparse
file was a bit less than 20 GiB.

Back then, trying to write more data to it than the parent filesystem
could hold resulted in "lost buffer writes" or something like that in
kernel.log, but no visible error message to the process that wrote the
data.

Thanks,
-- 
Martin


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04 15:36   ` Martin Steigerwald
@ 2018-09-04 22:23     ` Dave Chinner
  2018-09-05  7:09       ` Martin Steigerwald
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2018-09-04 22:23 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Richard W.M. Jones, linux-xfs

On Tue, Sep 04, 2018 at 05:36:43PM +0200, Martin Steigerwald wrote:
> Dave Chinner - 04.09.18, 02:49:
> > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > [This is silly and has no real purpose except to explore the limits.
> > > If that offends you, don't read the rest of this email.]
> > 
> > We do this quite frequently ourselves, even if it is just to remind
> > ourselves how long it takes to wait for millions of IOs to be done.
> 
> Just for the fun of it, during a Linux Performance analysis & tuning
> course I held, I created a 1 EiB XFS filesystem on a sparse file on
> another XFS filesystem on the SSD of a ThinkPad T520. It took several
> hours to create, but then it was there and mountable. AFAIR the sparse
> file was a bit less than 20 GiB.

Yup, 20GB of single sector IOs takes a long time.

> Back then, trying to write more data to it than the parent filesystem
> could hold resulted in "lost buffer writes" or something like that in
> kernel.log, but no visible error message to the process that wrote the
> data.

That should mostly be fixed by now with all the error handling work
that went into the generic writeback path a few kernel releases ago.
Also, remember that the only guaranteed way to determine that there
was a writeback error is to run fsync on the file, and most
applications don't do that after writing their data.
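
A minimal sketch of that point: the write itself can appear to
succeed, and the deferred writeback error only shows up when fsync is
checked (the path is illustrative):

/* The only reliable place to catch a delayed writeback error is the
 * return value of fsync (or a subsequent fsync/close). */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/mnt/huge/file", O_WRONLY | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != sizeof(buf))
		perror("write");	/* may well "succeed" anyway... */
	if (fsync(fd) < 0)
		perror("fsync");	/* ...and the error turns up here */
	close(fd);
	return 0;
}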

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04 22:23     ` Dave Chinner
@ 2018-09-05  7:09       ` Martin Steigerwald
  2018-09-05  7:43         ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Steigerwald @ 2018-09-05  7:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Richard W.M. Jones, linux-xfs

Dave Chinner - 05.09.18, 00:23:
> On Tue, Sep 04, 2018 at 05:36:43PM +0200, Martin Steigerwald wrote:
> > Dave Chinner - 04.09.18, 02:49:
> > > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > > [This is silly and has no real purpose except to explore the
> > > > limits.  If that offends you, don't read the rest of this email.]
> > > 
> > > We do this quite frequently ourselves, even if it is just to remind
> > > ourselves how long it takes to wait for millions of IOs to be done.
> > 
> > Just for the fun of it, during a Linux Performance analysis & tuning
> > course I held, I created a 1 EiB XFS filesystem on a sparse file on
> > another XFS filesystem on the SSD of a ThinkPad T520. It took
> > several hours to create, but then it was there and mountable. AFAIR
> > the sparse file was a bit less than 20 GiB.
> 
> Yup, 20GB of single sector IOs takes a long time.

Yeah. It was interesting to see that neither the CPU nor the SSD was 
fully utilized during that time though.

> > Back then, trying to write more data to it than the parent filesystem
> > could hold resulted in "lost buffer writes" or something like that
> > in kernel.log, but no visible error message to the process that
> > wrote the data.
> 
> That should mostly be fixed by now with all the error handling work
> that went into the generic writeback path a few kernel releases ago.
> Also, remember that the only guaranteed way to determine that there
> was a writeback error is to run fsync on the file, and most
> applications don't do that after writing their data.

Great. I saw the recent writeback error discussion as well.

Thanks,
-- 
Martin


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-05  7:09       ` Martin Steigerwald
@ 2018-09-05  7:43         ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2018-09-05  7:43 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Richard W.M. Jones, linux-xfs

On Wed, Sep 05, 2018 at 09:09:28AM +0200, Martin Steigerwald wrote:
> Dave Chinner - 05.09.18, 00:23:
> > On Tue, Sep 04, 2018 at 05:36:43PM +0200, Martin Steigerwald wrote:
> > > Dave Chinner - 04.09.18, 02:49:
> > > > On Mon, Sep 03, 2018 at 11:49:19PM +0100, Richard W.M. Jones wrote:
> > > > > [This is silly and has no real purpose except to explore the
> > > > > limits.  If that offends you, don't read the rest of this email.]
> > > > 
> > > > We do this quite frequently ourselves, even if it is just to remind
> > > > ourselves how long it takes to wait for millions of IOs to be done.
> > > 
> > > Just for the fun of it, during a Linux Performance analysis & tuning
> > > course I held, I created a 1 EiB XFS filesystem on a sparse file on
> > > another XFS filesystem on the SSD of a ThinkPad T520. It took
> > > several hours to create, but then it was there and mountable. AFAIR
> > > the sparse file was a bit less than 20 GiB.
> > 
> > Yup, 20GB of single sector IOs takes a long time.
> 
> Yeah. It was interesting to see that neither the CPU nor the SSD was 
> fully utilized during that time though.

Right - it's not CPU bound because it's always waiting on a single
IO, and it's not IO bound because it's only issuing a single IO at a
time.

Speaking of which, I just hacked a delayed write buffer list
construct similar to the kernel code into mkfs/libxfs to batch
writeback. Then I added a hacky AIO ring to allow it to drive deep
IO queues. I'm seeing
sustained request queue depths of ~100 and the SSDs are about 80%
busy at 100,000 write IOPS. But mkfs is only consuming about 60% of
a single CPU.

Which means that, instead of 7-8 hours to make an 8EB filesystem, we
can get it down to:

$ time sudo ~/packages/mkfs.xfs -K  -d size=8191p /dev/vdd
meta-data=/dev/vdd               isize=512    agcount=8387585, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096 blocks=2251524935778304, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

real    15m18.090s
user    5m54.162s
sys     3m49.518s

Around 15 minutes on a couple of cheap consumer NVMe SSDs.

xfs_repair is going to need some help to scale up to this many AGs,
though - phase 1 is doing a huge amount of IO just to verify the
primary superblock...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: mkfs.xfs options suitable for creating absurdly large XFS filesystems?
  2018-09-04  0:49 ` Dave Chinner
                     ` (2 preceding siblings ...)
  2018-09-04 15:36   ` Martin Steigerwald
@ 2018-09-05  9:05   ` Richard W.M. Jones
  3 siblings, 0 replies; 12+ messages in thread
From: Richard W.M. Jones @ 2018-09-05  9:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Sep 04, 2018 at 10:49:40AM +1000, Dave Chinner wrote:
> What's the sector size of your device? This seems to imply that it is
> 1024 bytes, not the normal 512 or 4096 bytes we see in most devices.

It turns out that the sector size is selectable using the nbd-client
-b parameter, which I didn't notice before:

  # nbd-client -b 512 localhost /dev/nbd0

This actually turns out to be essential when mounting MBR partitioned
disks because the MBR partition table uses sector numbers, and if you
use the default (1k) sector size then everything is messed up.

Compare with 1k sectors:

  # fdisk -l /dev/nbd0
  Disk /dev/nbd0: 10 GiB, 10737418240 bytes, 10485760 sectors
  Units: sectors of 1 * 1024 = 1024 bytes
  Sector size (logical/physical): 1024 bytes / 1024 bytes
  I/O size (minimum/optimal): 1024 bytes / 1024 bytes
  Disklabel type: dos
  Disk identifier: 0x000127ae
  
  Device      Boot Start      End  Sectors Size Id Type
  /dev/nbd0p1       2048 20971519 20969472  20G 83 Linux
                                            ~~~

to the correct output with 512b sectors:

  # fdisk -l /dev/nbd0
  Disk /dev/nbd0: 10 GiB, 10737418240 bytes, 20971520 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 1024 bytes / 1024 bytes
  Disklabel type: dos
  Disk identifier: 0x000127ae
  
  Device      Boot Start      End  Sectors Size Id Type
  /dev/nbd0p1       2048 20971519 20969472  10G 83 Linux

So that's a nasty little "gotcha" in the nbd-client tool.
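
  One way to spot this kind of gotcha before partitioning is to ask
  the block layer what it thinks the geometry is, either with
  blockdev --getss / --getsize64 or the equivalent ioctls; a small
  sketch (device path illustrative):

  /* Print the logical sector size and byte size the kernel reports
   * for a block device. */
  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>           /* BLKSSZGET, BLKGETSIZE64 */

  int main(void)
  {
          int fd = open("/dev/nbd0", O_RDONLY);
          int ssz = 0;
          uint64_t bytes = 0;

          if (fd < 0 || ioctl(fd, BLKSSZGET, &ssz) < 0 ||
              ioctl(fd, BLKGETSIZE64, &bytes) < 0) {
                  perror("BLKSSZGET/BLKGETSIZE64");
                  return 1;
          }
          printf("sector size %d, %llu bytes\n",
                 ssz, (unsigned long long)bytes);
          close(fd);
          return 0;
  }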

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/


Thread overview: 12+ messages
2018-09-03 22:49 mkfs.xfs options suitable for creating absurdly large XFS filesystems? Richard W.M. Jones
2018-09-04  0:49 ` Dave Chinner
2018-09-04  8:23   ` Dave Chinner
2018-09-04  9:12     ` Dave Chinner
2018-09-04  8:26   ` Richard W.M. Jones
2018-09-04  9:11     ` Dave Chinner
2018-09-04  9:45       ` Richard W.M. Jones
2018-09-04 15:36   ` Martin Steigerwald
2018-09-04 22:23     ` Dave Chinner
2018-09-05  7:09       ` Martin Steigerwald
2018-09-05  7:43         ` Dave Chinner
2018-09-05  9:05   ` Richard W.M. Jones
