linux-xfs.vger.kernel.org archive mirror
* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
       [not found]           ` <20200819175300.GA141399@bfoster>
@ 2020-08-20 20:03             ` Alberto Garcia
  2020-08-20 21:58               ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Alberto Garcia @ 2020-08-20 20:03 UTC (permalink / raw)
  To: Brian Foster, Kevin Wolf
  Cc: qemu-devel, qemu-block, Max Reitz, Vladimir Sementsov-Ogievskiy,
	linux-xfs

Cc: linux-xfs

On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> In any event, if you're seeing unclear or unexpected performance
> deltas between certain XFS configurations or other fs', I think the
> best thing to do is post a more complete description of the workload,
> filesystem/storage setup, and test results to the linux-xfs mailing
> list (feel free to cc me as well). As it is, aside from the questions
> above, it's not really clear to me what the storage stack looks like
> for this test, if/how qcow2 is involved, what the various
> 'preallocation=' modes actually mean, etc.

(see [1] for a bit of context)

I repeated the tests with a larger (125GB) filesystem. Things are a bit
faster but not radically different; here are the new numbers:

|----------------------+-------+-------|
| preallocation mode   |   xfs |  ext4 |
|----------------------+-------+-------|
| off                  |  8139 | 11688 |
| off (w/o ZERO_RANGE) |  2965 |  2780 |
| metadata             |  7768 |  9132 |
| falloc               |  7742 | 13108 |
| full                 | 41389 | 16351 |
|----------------------+-------+-------|

The numbers are I/O operations per second as reported by fio, running
inside a VM.

The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
2.16-1. I'm using QEMU 5.1.0.

fio is sending random 4KB write requests to a 25GB virtual drive; this
is the full command line:

fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
    --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
    --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
  
The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
the host (on an xfs or ext4 filesystem as the table above shows), and
it is attached to QEMU using a virtio-blk-pci device:

   -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M

cache=none means that the image is opened with O_DIRECT, and
l2-cache-size is large enough that QEMU can cache all the relevant
qcow2 metadata in memory.

The host is running Linux 4.19.132 and has an SSD drive.

About the preallocation modes: a qcow2 file is divided into clusters
of the same size (64KB in this case). That is the minimum unit of
allocation, so when writing 4KB to an unallocated cluster QEMU needs
to fill the other 60KB with zeroes. So here's what happens with the
different modes:

1) off: for every write request QEMU initializes the cluster (64KB)
        with fallocate(ZERO_RANGE) and then writes the 4KB of data.

2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
        of the cluster with zeroes.

3) metadata: all clusters were allocated when the image was created
        but they are sparse, QEMU only writes the 4KB of data.

4) falloc: all clusters were allocated with fallocate() when the image
        was created, QEMU only writes 4KB of data.

5) full: all clusters were allocated by writing zeroes to all of them
        when the image was created, QEMU only writes 4KB of data.

As I said in a previous message I'm not familiar with xfs, but the
parts that I don't understand are

   - Why is (4) slower than (1)?
   - Why is (5) so much faster than everything else?

I hope I didn't forget anything; tell me if you have questions.

Berto

[1] https://lists.gnu.org/archive/html/qemu-block/2020-08/msg00481.html


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-20 20:03             ` [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster Alberto Garcia
@ 2020-08-20 21:58               ` Dave Chinner
  2020-08-21 11:05                 ` Brian Foster
  2020-08-21 16:09                 ` Alberto Garcia
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2020-08-20 21:58 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Brian Foster, Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> Cc: linux-xfs
> 
> On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > In any event, if you're seeing unclear or unexpected performance
> > deltas between certain XFS configurations or other fs', I think the
> > best thing to do is post a more complete description of the workload,
> > filesystem/storage setup, and test results to the linux-xfs mailing
> > list (feel free to cc me as well). As it is, aside from the questions
> > above, it's not really clear to me what the storage stack looks like
> > for this test, if/how qcow2 is involved, what the various
> > 'preallocation=' modes actually mean, etc.
> 
> (see [1] for a bit of context)
> 
> I repeated the tests with a larger (125GB) filesystem. Things are a bit
> faster but not radically different, here are the new numbers:
> 
> |----------------------+-------+-------|
> | preallocation mode   |   xfs |  ext4 |
> |----------------------+-------+-------|
> | off                  |  8139 | 11688 |
> | off (w/o ZERO_RANGE) |  2965 |  2780 |
> | metadata             |  7768 |  9132 |
> | falloc               |  7742 | 13108 |
> | full                 | 41389 | 16351 |
> |----------------------+-------+-------|
> 
> The numbers are I/O operations per second as reported by fio, running
> inside a VM.
> 
> The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> 2.16-1. I'm using QEMU 5.1.0.
> 
> fio is sending random 4KB write requests to a 25GB virtual drive, this
> is the full command line:
> 
> fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
>     --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
>     --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
>   
> The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> the host (on an xfs or ext4 filesystem as the table above shows), and
> it is attached to QEMU using a virtio-blk-pci device:
> 
>    -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M

You're not using AIO on this image file, so it can't do
concurrent IO. What happens when you add "aio=native" to this?

> cache=none means that the image is opened with O_DIRECT and
> l2-cache-size is large enough so QEMU is able to cache all the
> relevant qcow2 metadata in memory.

What happens when you just use a sparse file (i.e. a raw image) with
aio=native instead of using qcow2? XFS, ext4, btrfs, etc. all support
sparse files, so using qcow2 to provide sparse image file support is
largely an unnecessary layer of indirection and overhead...

And with XFS, you don't need qcow2 for snapshots either because you
can use reflink copies to take an atomic copy-on-write snapshot of
the raw image file... (assuming you made the xfs filesystem with
reflink support (which is the TOT default now)).

I've been using raw sparse files on XFS for all my VMs for over a
decade now, and using reflink to create COW copies of golden
image files when deploying new VMs for a couple of years now...

> The host is running Linux 4.19.132 and has an SSD drive.
> 
> About the preallocation modes: a qcow2 file is divided into clusters
> of the same size (64KB in this case). That is the minimum unit of
> allocation, so when writing 4KB to an unallocated cluster QEMU needs
> to fill the other 60KB with zeroes. So here's what happens with the
> different modes:

Which is something that sparse files on filesystems do not need to
do. If, on XFS, you really want 64kB allocation clusters, use an
extent size hint of 64kB. Though for image files, I highly recommend
using 1MB or larger extent size hints.
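
For reference, here is a minimal C sketch of setting such a hint
programmatically via the FS_IOC_FSSETXATTR ioctl (the command-line
equivalent is `xfs_io -c "extsize 1m" <file>`). This is illustrative
only, not taken from QEMU, and the hint has to be set while the image
file is still empty:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS*XATTR, FS_XFLAG_EXTSIZE */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <empty-image-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fsxattr fsx;
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSGETXATTR");
        return 1;
    }

    fsx.fsx_extsize = 1024 * 1024;      /* 1MB hint, in bytes */
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;

    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSSETXATTR");
        return 1;
    }

    close(fd);
    return 0;
}

With the hint in place, each write into a previously unwritten region
allocates a 1MB unwritten extent around it without any extra syscall
from the application.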


> 1) off: for every write request QEMU initializes the cluster (64KB)
>         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> 
> 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>         of the cluster with zeroes.
> 
> 3) metadata: all clusters were allocated when the image was created
>         but they are sparse, QEMU only writes the 4KB of data.
> 
> 4) falloc: all clusters were allocated with fallocate() when the image
>         was created, QEMU only writes 4KB of data.
> 
> 5) full: all clusters were allocated by writing zeroes to all of them
>         when the image was created, QEMU only writes 4KB of data.
> 
> As I said in a previous message I'm not familiar with xfs, but the
> parts that I don't understand are
> 
>    - Why is (4) slower than (1)?

Because fallocate() is a full IO serialisation barrier at the
filesystem level. If you do:

fallocate(whole file)
<IO>
<IO>
<IO>
.....

The IO can run concurrently and does not serialise against anything in
the filesystem except unwritten extent conversions at IO completion
(see the answer to the next question!)

However, if you just use (4) you get:

falloc(64k)
  <wait for inflight IO to complete>
  <allocates 64k as unwritten>
<4k io>
  ....
falloc(64k)
  <wait for inflight IO to complete>
  ....
  <4k IO completes, converts 4k to written>
  <allocates 64k as unwritten>
<4k io>
falloc(64k)
  <wait for inflight IO to complete>
  ....
  <4k IO completes, converts 4k to written>
  <allocates 64k as unwritten>
<4k io>
  ....

until all the clusters in the qcow2 file are initialised. IOWs, each
fallocate() call serialises all IO in flight. Compare that to using
extent size hints on a raw sparse image file for the same thing:

<set 64k extent size hint>
<4k IO>
  <allocates 64k as unwritten>
  ....
<4k IO>
  <allocates 64k as unwritten>
  ....
<4k IO>
  <allocates 64k as unwritten>
  ....
...
  <4k IO completes, converts 4k to written>
  <4k IO completes, converts 4k to written>
  <4k IO completes, converts 4k to written>
....

See the difference in IO pipelining here? You get the same "64kB
cluster initialised at a time" behaviour as qcow2, but you don't get
the IO pipeline stalls caused by fallocate() having to drain all the
IO in flight before it does the allocation.

>    - Why is (5) so much faster than everything else?

The full file allocation in (5) means the IO doesn't have to modify
the extent map, hence all extent mapping uses shared locking and
the entire IO path can run concurrently without serialisation at
all.

Thing is, once your writes into sparse image files regularly start
hitting written extents, the performance of (1), (2) and (4) will
trend towards (5) as writes hit already allocated ranges of the file
and the serialisation of extent mapping changes goes away. This
occurs with guest filesystems that perform overwrite in place (such
as XFS) and hence overwrites of existing data will hit allocated
space in the image file and not require further allocation.

IOWs, typical "write once" benchmark testing indicates the *worst*
performance you are going to see. As the guest filesystem ages and
initialises more of the underlying image file, it will get faster,
not slower.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-20 21:58               ` Dave Chinner
@ 2020-08-21 11:05                 ` Brian Foster
  2020-08-21 11:42                   ` Alberto Garcia
  2020-08-21 16:09                 ` Alberto Garcia
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2020-08-21 11:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alberto Garcia, Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Fri, Aug 21, 2020 at 07:58:11AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> > Cc: linux-xfs
> > 
> > On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > > In any event, if you're seeing unclear or unexpected performance
> > > deltas between certain XFS configurations or other fs', I think the
> > > best thing to do is post a more complete description of the workload,
> > > filesystem/storage setup, and test results to the linux-xfs mailing
> > > list (feel free to cc me as well). As it is, aside from the questions
> > > above, it's not really clear to me what the storage stack looks like
> > > for this test, if/how qcow2 is involved, what the various
> > > 'preallocation=' modes actually mean, etc.
> > 
> > (see [1] for a bit of context)
> > 
> > I repeated the tests with a larger (125GB) filesystem. Things are a bit
> > faster but not radically different, here are the new numbers:
> > 
> > |----------------------+-------+-------|
> > | preallocation mode   |   xfs |  ext4 |
> > |----------------------+-------+-------|
> > | off                  |  8139 | 11688 |
> > | off (w/o ZERO_RANGE) |  2965 |  2780 |
> > | metadata             |  7768 |  9132 |
> > | falloc               |  7742 | 13108 |
> > | full                 | 41389 | 16351 |
> > |----------------------+-------+-------|
> > 
> > The numbers are I/O operations per second as reported by fio, running
> > inside a VM.
> > 
> > The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> > 2.16-1. I'm using QEMU 5.1.0.
> > 
> > fio is sending random 4KB write requests to a 25GB virtual drive, this
> > is the full command line:
> > 
> > fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> >     --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> >     --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
> >   
> > The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> > the host (on an xfs or ext4 filesystem as the table above shows), and
> > it is attached to QEMU using a virtio-blk-pci device:
> > 
> >    -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
> 
> You're not using AIO on this image file, so it can't do
> concurrent IO? what happens when you add "aio=native" to this?
> 
> > cache=none means that the image is opened with O_DIRECT and
> > l2-cache-size is large enough so QEMU is able to cache all the
> > relevant qcow2 metadata in memory.
> 
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
> 
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of
> the raw image file... (assuming you made the xfs filesystem with
> reflink support (which is the TOT default now)).
> 
> I've been using raw sprase files on XFS for all my VMs for over a
> decade now, and using reflink to create COW copies of golden
> image files iwhen deploying new VMs for a couple of years now...
> 
> > The host is running Linux 4.19.132 and has an SSD drive.
> > 
> > About the preallocation modes: a qcow2 file is divided into clusters
> > of the same size (64KB in this case). That is the minimum unit of
> > allocation, so when writing 4KB to an unallocated cluster QEMU needs
> > to fill the other 60KB with zeroes. So here's what happens with the
> > different modes:
> 
> Which is something that sparse files on filesystems do not need to
> do. If, on XFS, you really want 64kB allocation clusters, use an
> extent size hint of 64kB. Though for image files, I highly recommend
> using 1MB or larger extent size hints.
> 
> 
> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> > 
> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >         of the cluster with zeroes.
> > 
> > 3) metadata: all clusters were allocated when the image was created
> >         but they are sparse, QEMU only writes the 4KB of data.
> > 
> > 4) falloc: all clusters were allocated with fallocate() when the image
> >         was created, QEMU only writes 4KB of data.
> > 
> > 5) full: all clusters were allocated by writing zeroes to all of them
> >         when the image was created, QEMU only writes 4KB of data.
> > 
> > As I said in a previous message I'm not familiar with xfs, but the
> > parts that I don't understand are
> > 
> >    - Why is (4) slower than (1)?
> 
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
> 
> fallocate(whole file)
> <IO>
> <IO>
> <IO>
> .....
> 
> The IO can run concurrent and does not serialise against anything in
> the filesysetm except unwritten extent conversions at IO completion
> (see answer to next question!)
> 
> However, if you just use (4) you get:
> 
> falloc(64k)
>   <wait for inflight IO to complete>
>   <allocates 64k as unwritten>
> <4k io>
>   ....
> falloc(64k)
>   <wait for inflight IO to complete>
>   ....
>   <4k IO completes, converts 4k to written>
>   <allocates 64k as unwritten>
> <4k io>
> falloc(64k)
>   <wait for inflight IO to complete>
>   ....
>   <4k IO completes, converts 4k to written>
>   <allocates 64k as unwritten>
> <4k io>
>   ....
> 

Option 4 is described above as initial file preallocation whereas option
1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto is
reporting that the initial file preallocation mode is slower than the
per cluster prealloc mode. Berto, am I following that right?

Brian

> until all the clusters in the qcow2 file are intialised. IOWs, each
> fallocate() call serialises all IO in flight. Compare that to using
> extent size hints on a raw sparse image file for the same thing:
> 
> <set 64k extent size hint>
> <4k IO>
>   <allocates 64k as unwritten>
>   ....
> <4k IO>
>   <allocates 64k as unwritten>
>   ....
> <4k IO>
>   <allocates 64k as unwritten>
>   ....
> ...
>   <4k IO completes, converts 4k to written>
>   <4k IO completes, converts 4k to written>
>   <4k IO completes, converts 4k to written>
> ....
> 
> See the difference in IO pipelining here? You get the same "64kB
> cluster initialised at a time" behaviour as qcow2, but you don't get
> the IO pipeline stalls caused by fallocate() having to drain all the
> IO in flight before it does the allocation.
> 
> >    - Why is (5) so much faster than everything else?
> 
> The full file allocation in (5) means the IO doesn't have to modify
> the extent map hence all extent mapping is uses shared locking and
> the entire IO path can run concurrently without serialisation at
> all.
> 
> Thing is, once your writes into sprase image files regularly start
> hitting written extents, the performance of (1), (2) and (4) will
> trend towards (5) as writes hit already allocated ranges of the file
> and the serialisation of extent mapping changes goes away. This
> occurs with guest filesystems that perform overwrite in place (such
> as XFS) and hence overwrites of existing data will hit allocated
> space in the image file and not require further allocation.
> 
> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesytsem ages and
> initialises more of the underlying image file, it will get faster,
> not slower.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 11:05                 ` Brian Foster
@ 2020-08-21 11:42                   ` Alberto Garcia
  2020-08-21 12:12                     ` Alberto Garcia
  2020-08-21 12:59                     ` Brian Foster
  0 siblings, 2 replies; 18+ messages in thread
From: Alberto Garcia @ 2020-08-21 11:42 UTC (permalink / raw)
  To: Brian Foster, Dave Chinner
  Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
>> > 1) off: for every write request QEMU initializes the cluster (64KB)
>> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>> > 
>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>> >         of the cluster with zeroes.
>> > 
>> > 3) metadata: all clusters were allocated when the image was created
>> >         but they are sparse, QEMU only writes the 4KB of data.
>> > 
>> > 4) falloc: all clusters were allocated with fallocate() when the image
>> >         was created, QEMU only writes 4KB of data.
>> > 
>> > 5) full: all clusters were allocated by writing zeroes to all of them
>> >         when the image was created, QEMU only writes 4KB of data.
>> > 
>> > As I said in a previous message I'm not familiar with xfs, but the
>> > parts that I don't understand are
>> > 
>> >    - Why is (4) slower than (1)?
>> 
>> Because fallocate() is a full IO serialisation barrier at the
>> filesystem level. If you do:
>> 
>> fallocate(whole file)
>> <IO>
>> <IO>
>> <IO>
>> .....
>> 
>> The IO can run concurrent and does not serialise against anything in
>> the filesysetm except unwritten extent conversions at IO completion
>> (see answer to next question!)
>> 
>> However, if you just use (4) you get:
>> 
>> falloc(64k)
>>   <wait for inflight IO to complete>
>>   <allocates 64k as unwritten>
>> <4k io>
>>   ....
>> falloc(64k)
>>   <wait for inflight IO to complete>
>>   ....
>>   <4k IO completes, converts 4k to written>
>>   <allocates 64k as unwritten>
>> <4k io>
>> falloc(64k)
>>   <wait for inflight IO to complete>
>>   ....
>>   <4k IO completes, converts 4k to written>
>>   <allocates 64k as unwritten>
>> <4k io>
>>   ....
>> 
>
> Option 4 is described above as initial file preallocation whereas
> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> is reporting that the initial file preallocation mode is slower than
> the per cluster prealloc mode. Berto, am I following that right?

Option (1) means that no qcow2 cluster is allocated at the beginning of
the test so, apart from updating the relevant qcow2 metadata, each write
request clears the cluster first (with fallocate(ZERO_RANGE)) then
writes the requested 4KB of data. Further writes to the same cluster
don't need changes on the qcow2 metadata so they go directly to the area
that was cleared with fallocate().

Option (4) means that all clusters are allocated when the image is
created and they are initialized with fallocate() (actually with
posix_fallocate() now that I read the code; I suppose it's the same for
xfs?). Only after that does the test start. All write requests are
simply forwarded to the disk; there is no need to touch any qcow2
metadata or do anything else.

And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
more IOPS.

I just ran the tests with aio=native and with a raw image instead of
qcow2; here are the results:

qcow2:
|----------------------+-------------+------------|
| preallocation        | aio=threads | aio=native |
|----------------------+-------------+------------|
| off                  |        8139 |       7649 |
| off (w/o ZERO_RANGE) |        2965 |       2779 |
| metadata             |        7768 |       8265 |
| falloc               |        7742 |       7956 |
| full                 |       41389 |      56668 |
|----------------------+-------------+------------|

raw:
|---------------+-------------+------------|
| preallocation | aio=threads | aio=native |
|---------------+-------------+------------|
| off           |        7647 |       7928 |
| falloc        |        7662 |       7856 |
| full          |       45224 |      58627 |
|---------------+-------------+------------|

A qcow2 file with preallocation=metadata is more or less similar to a
sparse raw file (and the numbers are indeed similar).

preallocation=off on qcow2 does not have an equivalent on raw files.

Berto


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 11:42                   ` Alberto Garcia
@ 2020-08-21 12:12                     ` Alberto Garcia
  2020-08-21 17:02                       ` Brian Foster
  2020-08-23 21:59                       ` Dave Chinner
  2020-08-21 12:59                     ` Brian Foster
  1 sibling, 2 replies; 18+ messages in thread
From: Alberto Garcia @ 2020-08-21 12:12 UTC (permalink / raw)
  To: Brian Foster, Dave Chinner
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel,
	Max Reitz, linux-xfs

On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
>>> > 1) off: for every write request QEMU initializes the cluster (64KB)
>>> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>>> > 
>>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>>> >         of the cluster with zeroes.
>>> > 
>>> > 3) metadata: all clusters were allocated when the image was created
>>> >         but they are sparse, QEMU only writes the 4KB of data.
>>> > 
>>> > 4) falloc: all clusters were allocated with fallocate() when the image
>>> >         was created, QEMU only writes 4KB of data.
>>> > 
>>> > 5) full: all clusters were allocated by writing zeroes to all of them
>>> >         when the image was created, QEMU only writes 4KB of data.
>>> > 
>>> > As I said in a previous message I'm not familiar with xfs, but the
>>> > parts that I don't understand are
>>> > 
>>> >    - Why is (4) slower than (1)?
>>> 
>>> Because fallocate() is a full IO serialisation barrier at the
>>> filesystem level. If you do:
>>> 
>>> fallocate(whole file)
>>> <IO>
>>> <IO>
>>> <IO>
>>> .....
>>> 
>>> The IO can run concurrent and does not serialise against anything in
>>> the filesysetm except unwritten extent conversions at IO completion
>>> (see answer to next question!)
>>> 
>>> However, if you just use (4) you get:
>>> 
>>> falloc(64k)
>>>   <wait for inflight IO to complete>
>>>   <allocates 64k as unwritten>
>>> <4k io>
>>>   ....
>>> falloc(64k)
>>>   <wait for inflight IO to complete>
>>>   ....
>>>   <4k IO completes, converts 4k to written>
>>>   <allocates 64k as unwritten>
>>> <4k io>
>>> falloc(64k)
>>>   <wait for inflight IO to complete>
>>>   ....
>>>   <4k IO completes, converts 4k to written>
>>>   <allocates 64k as unwritten>
>>> <4k io>
>>>   ....
>>> 
>>
>> Option 4 is described above as initial file preallocation whereas
>> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> is reporting that the initial file preallocation mode is slower than
>> the per cluster prealloc mode. Berto, am I following that right?

After looking more closely at the data I can see that there is a peak of
~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
~7K for the rest of the test.

I was running fio with --ramp_time=5, which ignores the first 5 seconds
of data in order to let performance settle, but if I remove that I can
see the effect more clearly. I can observe it with raw files (in 'off'
and 'prealloc' modes) and with qcow2 files in 'prealloc' mode. With
qcow2 and preallocation=off the performance is stable during the whole
test.

Berto


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 11:42                   ` Alberto Garcia
  2020-08-21 12:12                     ` Alberto Garcia
@ 2020-08-21 12:59                     ` Brian Foster
  2020-08-21 15:51                       ` Alberto Garcia
  2020-08-23 22:16                       ` Dave Chinner
  1 sibling, 2 replies; 18+ messages in thread
From: Brian Foster @ 2020-08-21 12:59 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Dave Chinner, Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> >> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >> > 
> >> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >> >         of the cluster with zeroes.
> >> > 
> >> > 3) metadata: all clusters were allocated when the image was created
> >> >         but they are sparse, QEMU only writes the 4KB of data.
> >> > 
> >> > 4) falloc: all clusters were allocated with fallocate() when the image
> >> >         was created, QEMU only writes 4KB of data.
> >> > 
> >> > 5) full: all clusters were allocated by writing zeroes to all of them
> >> >         when the image was created, QEMU only writes 4KB of data.
> >> > 
> >> > As I said in a previous message I'm not familiar with xfs, but the
> >> > parts that I don't understand are
> >> > 
> >> >    - Why is (4) slower than (1)?
> >> 
> >> Because fallocate() is a full IO serialisation barrier at the
> >> filesystem level. If you do:
> >> 
> >> fallocate(whole file)
> >> <IO>
> >> <IO>
> >> <IO>
> >> .....
> >> 
> >> The IO can run concurrent and does not serialise against anything in
> >> the filesysetm except unwritten extent conversions at IO completion
> >> (see answer to next question!)
> >> 
> >> However, if you just use (4) you get:
> >> 
> >> falloc(64k)
> >>   <wait for inflight IO to complete>
> >>   <allocates 64k as unwritten>
> >> <4k io>
> >>   ....
> >> falloc(64k)
> >>   <wait for inflight IO to complete>
> >>   ....
> >>   <4k IO completes, converts 4k to written>
> >>   <allocates 64k as unwritten>
> >> <4k io>
> >> falloc(64k)
> >>   <wait for inflight IO to complete>
> >>   ....
> >>   <4k IO completes, converts 4k to written>
> >>   <allocates 64k as unwritten>
> >> <4k io>
> >>   ....
> >> 
> >
> > Option 4 is described above as initial file preallocation whereas
> > option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> > is reporting that the initial file preallocation mode is slower than
> > the per cluster prealloc mode. Berto, am I following that right?
> 
> Option (1) means that no qcow2 cluster is allocated at the beginning of
> the test so, apart from updating the relevant qcow2 metadata, each write
> request clears the cluster first (with fallocate(ZERO_RANGE)) then
> writes the requested 4KB of data. Further writes to the same cluster
> don't need changes on the qcow2 metadata so they go directly to the area
> that was cleared with fallocate().
> 
> Option (4) means that all clusters are allocated when the image is
> created and they are initialized with fallocate() (actually with
> posix_fallocate() now that I read the code, I suppose it's the same for
> xfs?). Only after that the test starts. All write requests are simply
> forwarded to the disk, there is no need to touch any qcow2 metadata nor
> do anything else.
> 

Ok, I think that's consistent with what I described above (sorry, I find
the preallocation mode names rather confusing so I was trying to avoid
using them). Have you confirmed that posix_fallocate() in this case
translates directly to fallocate()? I suppose that's most likely the
case, otherwise you'd see numbers more like with preallocation=full
(file preallocated via writing zeroes).

> And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> more IOPS.
> 
> I just ran the tests with aio=native and with a raw image instead of
> qcow2, here are the results:
> 
> qcow2:
> |----------------------+-------------+------------|
> | preallocation        | aio=threads | aio=native |
> |----------------------+-------------+------------|
> | off                  |        8139 |       7649 |
> | off (w/o ZERO_RANGE) |        2965 |       2779 |
> | metadata             |        7768 |       8265 |
> | falloc               |        7742 |       7956 |
> | full                 |       41389 |      56668 |
> |----------------------+-------------+------------|
> 

So it seems like Dave's suggestion to use native aio produced more
predictable results, with full file prealloc being a bit faster than per
cluster prealloc. Not sure why that isn't the case with aio=threads. I
was wondering if perhaps the threading affects something indirectly,
like the qcow2 metadata allocation itself, but I guess that would be
inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
the previous ext4 numbers were with aio=threads).

> raw:
> |---------------+-------------+------------|
> | preallocation | aio=threads | aio=native |
> |---------------+-------------+------------|
> | off           |        7647 |       7928 |
> | falloc        |        7662 |       7856 |
> | full          |       45224 |      58627 |
> |---------------+-------------+------------|
> 
> A qcow2 file with preallocation=metadata is more or less similar to a
> sparse raw file (and the numbers are indeed similar).
> 
> preallocation=off on qcow2 does not have an equivalent on raw files.
> 

It sounds like preallocation=off for qcow2 would be roughly equivalent
to a raw file with a 64k extent size hint (on XFS).

Brian

> Berto
> 



* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 12:59                     ` Brian Foster
@ 2020-08-21 15:51                       ` Alberto Garcia
  2020-08-23 22:16                       ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Alberto Garcia @ 2020-08-21 15:51 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Fri 21 Aug 2020 02:59:44 PM CEST, Brian Foster wrote:
>> > Option 4 is described above as initial file preallocation whereas
>> > option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> > is reporting that the initial file preallocation mode is slower than
>> > the per cluster prealloc mode. Berto, am I following that right?
>> 
>> Option (1) means that no qcow2 cluster is allocated at the beginning of
>> the test so, apart from updating the relevant qcow2 metadata, each write
>> request clears the cluster first (with fallocate(ZERO_RANGE)) then
>> writes the requested 4KB of data. Further writes to the same cluster
>> don't need changes on the qcow2 metadata so they go directly to the area
>> that was cleared with fallocate().
>> 
>> Option (4) means that all clusters are allocated when the image is
>> created and they are initialized with fallocate() (actually with
>> posix_fallocate() now that I read the code, I suppose it's the same for
>> xfs?). Only after that the test starts. All write requests are simply
>> forwarded to the disk, there is no need to touch any qcow2 metadata nor
>> do anything else.
>> 
>
> Ok, I think that's consistent with what I described above (sorry, I find
> the preallocation mode names rather confusing so I was trying to avoid
> using them). Have you confirmed that posix_fallocate() in this case
> translates directly to fallocate()? I suppose that's most likely the
> case, otherwise you'd see numbers more like with preallocation=full
> (file preallocated via writing zeroes).

Yes, it seems to be:

   https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/posix_fallocate.c;h=7238b000383af2f3878a9daf8528819645b6aa31;hb=HEAD

And that's also what the posix_fallocate() manual page says.
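
For the record, the strategy boils down to something like this
simplified C sketch (not the actual glibc code; the function name is
made up):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* fallocate() */

/* Simplified sketch of glibc's approach: try the fallocate() syscall
 * first, and only fall back to the slow write-based emulation when the
 * filesystem doesn't support it. On xfs/ext4 the first branch is taken,
 * so preallocation=falloc is effectively one big fallocate(fd, 0, 0, size)
 * allocating the whole file as unwritten extents. */
static int posix_fallocate_sketch(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;

    if (errno != EOPNOTSUPP)
        return errno;   /* posix_fallocate() reports errors via return value */

    /* Fallback for filesystems without fallocate() support: glibc writes
     * to every block, which would behave like preallocation=full instead.
     * (Emulation elided in this sketch.) */
    return EOPNOTSUPP;
}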

>> And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
>> more IOPS.
>> 
>> I just ran the tests with aio=native and with a raw image instead of
>> qcow2, here are the results:
>> 
>> qcow2:
>> |----------------------+-------------+------------|
>> | preallocation        | aio=threads | aio=native |
>> |----------------------+-------------+------------|
>> | off                  |        8139 |       7649 |
>> | off (w/o ZERO_RANGE) |        2965 |       2779 |
>> | metadata             |        7768 |       8265 |
>> | falloc               |        7742 |       7956 |
>> | full                 |       41389 |      56668 |
>> |----------------------+-------------+------------|
>> 
>
> So this seems like Dave's suggestion to use native aio produced more
> predictable results with full file prealloc being a bit faster than per
> cluster prealloc. Not sure why that isn't the case with aio=threads. I
> was wondering if perhaps the threading affects something indirectly like
> the qcow2 metadata allocation itself, but I guess that would be
> inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
> the previous ext4 numbers were with aio=threads).

Yes, I took the ext4 numbers with aio=threads.

>> raw:
>> |---------------+-------------+------------|
>> | preallocation | aio=threads | aio=native |
>> |---------------+-------------+------------|
>> | off           |        7647 |       7928 |
>> | falloc        |        7662 |       7856 |
>> | full          |       45224 |      58627 |
>> |---------------+-------------+------------|
>> 
>> A qcow2 file with preallocation=metadata is more or less similar to a
>> sparse raw file (and the numbers are indeed similar).
>> 
>> preallocation=off on qcow2 does not have an equivalent on raw files.
>
> It sounds like preallocation=off for qcow2 would be roughly equivalent
> to a raw file with a 64k extent size hint (on XFS).

There's the overhead of handling the qcow2 metadata, but QEMU keeps a
cache of it in memory so it should not be too big.

Berto


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-20 21:58               ` Dave Chinner
  2020-08-21 11:05                 ` Brian Foster
@ 2020-08-21 16:09                 ` Alberto Garcia
  1 sibling, 0 replies; 18+ messages in thread
From: Alberto Garcia @ 2020-08-21 16:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Thu 20 Aug 2020 11:58:11 PM CEST, Dave Chinner wrote:
>> The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
>> the host (on an xfs or ext4 filesystem as the table above shows), and
>> it is attached to QEMU using a virtio-blk-pci device:
>> 
>>    -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
>
> You're not using AIO on this image file, so it can't do
> concurrent IO? what happens when you add "aio=native" to this?

I sent the results on a reply to Brian.

>> cache=none means that the image is opened with O_DIRECT and
>> l2-cache-size is large enough so QEMU is able to cache all the
>> relevant qcow2 metadata in memory.
>
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
>
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of the
> raw image file... (assuming you made the xfs filesystem with reflink
> support (which is the TOT default now)).

To be clear, I'm not trying to advocate for or against qcow2 on xfs; we
were just analyzing different allocation strategies for qcow2 and we
came across these results, which we don't quite understand.

>> 1) off: for every write request QEMU initializes the cluster (64KB)
>>         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>> 
>> 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>>         of the cluster with zeroes.
>> 
>> 3) metadata: all clusters were allocated when the image was created
>>         but they are sparse, QEMU only writes the 4KB of data.
>> 
>> 4) falloc: all clusters were allocated with fallocate() when the image
>>         was created, QEMU only writes 4KB of data.
>> 
>> 5) full: all clusters were allocated by writing zeroes to all of them
>>         when the image was created, QEMU only writes 4KB of data.
>> 
>> As I said in a previous message I'm not familiar with xfs, but the
>> parts that I don't understand are
>> 
>>    - Why is (4) slower than (1)?
>
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
>
> fallocate(whole file)
> <IO>
> <IO>
> <IO>
> .....
>
> The IO can run concurrent and does not serialise against anything in
> the filesysetm except unwritten extent conversions at IO completion
> (see answer to next question!)
>
> However, if you just use (4) you get:
>
> falloc(64k)
>   <wait for inflight IO to complete>
>   <allocates 64k as unwritten>
> <4k io>
>   ....
> falloc(64k)
>   <wait for inflight IO to complete>
>   ....
>   <4k IO completes, converts 4k to written>
>   <allocates 64k as unwritten>
> <4k io>

I think Brian pointed it out already, but scenario (4) is rather
falloc(25GB), then QEMU is launched and the actual 4k IO requests start
to happen.

So I would expect that after falloc(25GB) all clusters are initialized
and the end result would be closer to a full preallocation (i.e. writing
25GB worth of zeroes to disk).

> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesytsem ages and
> initialises more of the underlying image file, it will get faster, not
> slower.

Yes, that's clear: once everything is allocated then it is fast (and
really much faster in the case of xfs vs ext4). What we try to optimize
in qcow2 is precisely the allocation of new clusters.

Berto


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 12:12                     ` Alberto Garcia
@ 2020-08-21 17:02                       ` Brian Foster
  2020-08-25 12:24                         ` Alberto Garcia
  2020-08-23 21:59                       ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2020-08-21 17:02 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> > 
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> >         of the cluster with zeroes.
> >>> > 
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> >         but they are sparse, QEMU only writes the 4KB of data.
> >>> > 
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> >         was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> >         when the image was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> > 
> >>> >    - Why is (4) slower than (1)?
> >>> 
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>> 
> >>> fallocate(whole file)
> >>> <IO>
> >>> <IO>
> >>> <IO>
> >>> .....
> >>> 
> >>> The IO can run concurrent and does not serialise against anything in
> >>> the filesysetm except unwritten extent conversions at IO completion
> >>> (see answer to next question!)
> >>> 
> >>> However, if you just use (4) you get:
> >>> 
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>>   ....
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   ....
> >>>   <4k IO completes, converts 4k to written>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   ....
> >>>   <4k IO completes, converts 4k to written>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>>   ....
> >>> 
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
> 
> After looking more closely at the data I can see that there is a peak of
> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
> ~7K for the rest of the test.
> 
> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.
> 

That's interesting. I ran your fio command (without --ramp_time and with
--runtime=5m) against a file on XFS (so no qcow2, no zero_range), once
with a sparse file with a 64k extent size hint and again with a fully
preallocated 25GB file, and I saw similar results in terms of the delta.
This was just against an SSD-backed vdisk in my local dev VM, but I saw
~5800 iops for the full preallocation test and ~6200 iops with the
extent size hint.

I do notice an initial iops burst as described for both tests, so I
switched to use a 60s ramp time and 60s runtime. With that longer ramp
up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
with the full 25GB prealloc. Perhaps the unexpected performance delta
with qcow2 is similarly transient towards the start of the test and the
runtime is short enough that it skews the final results...?

Brian

> Berto
> 



* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 12:12                     ` Alberto Garcia
  2020-08-21 17:02                       ` Brian Foster
@ 2020-08-23 21:59                       ` Dave Chinner
  2020-08-24 20:14                         ` Alberto Garcia
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-08-23 21:59 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Brian Foster, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> > 
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> >         of the cluster with zeroes.
> >>> > 
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> >         but they are sparse, QEMU only writes the 4KB of data.
> >>> > 
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> >         was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> >         when the image was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> > 
> >>> >    - Why is (4) slower than (1)?
> >>> 
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>> 
> >>> fallocate(whole file)
> >>> <IO>
> >>> <IO>
> >>> <IO>
> >>> .....
> >>> 
> >>> The IO can run concurrent and does not serialise against anything in
> >>> the filesysetm except unwritten extent conversions at IO completion
> >>> (see answer to next question!)
> >>> 
> >>> However, if you just use (4) you get:
> >>> 
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>>   ....
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   ....
> >>>   <4k IO completes, converts 4k to written>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   ....
> >>>   <4k IO completes, converts 4k to written>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>>   ....
> >>> 
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
> 
> After looking more closely at the data I can see that there is a peak of
> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
> ~7K for the rest of the test.

How big is the filesystem, how big is the log? (xfs_info output,
please!)

In general, there are three typical causes of this. The first is
typical of the initial burst of allocations running on an empty
journal, then allocation transactions getting throttled back to the
speed at which metadata can be flushed once the journal fills up. If
you have a small filesystem and a default sized log, this is quite
likely to happen.

The second is that you have large logs and you are running on hardware
where device cache flushes and FUA writes hammer overall device
performance. Hence when the CIL initially fills up and starts
flushing (journal writes are pre-flush + FUA so do both) device
performance goes way down because now it has to write its cached
data to physical media rather than just cache it in volatile device
RAM. IOWs, journal writes end up forcing all volatile data to stable
media and so that can slow the device down. Also, cache flushes
might not be queued commands, hence journal writes will also create IO
pipeline stalls...

The third is the hardware capability.  Consumer hardware is designed
to have extremely fast bursty behaviour, but then steady state
performance is much lower (think "SLC" burst caches in TLC SSDs). I
have some consumer SSDs here that can sustain 400MB/s random 4kB
write for about 10-15s, then they drop to about 50MB/s once the
burst buffer is full. OTOH, I have enterprise SSDs that will sustain
a _much_ higher rate of random 4kB writes indefinitely than the
consumer SSDs burst at.  However, most consumer workloads don't move
this sort of data around, so this sort of design tradeoff is fine
for that market (Benchmarketing 101 stuff :).

IOWs, this behaviour could be filesystem config, it could be cache
flush behaviour, or it could simply be storage device design
capability. Or it could be a combination of all three things.
Watching a set of fast sampling metrics that tell you what the
device and filesystem are doing in real time (e.g. I use PCP for this
and visualise the behaviour in real time via pmchart) gives a lot
of insight into exactly what is changing during transient workload
changes like starting a benchmark...

> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.

What does "preallocation=off" mean again? Is that using
fallocate(ZERO_RANGE) prior to the data write rather than
preallocating the metadata/entire file? If so, I would expect the
limiting factor is the rate at which IO can be issued because of the
fallocate() triggered pipeline bubbles. That leaves idle device time
so you're not pushing the limits of the hardware and hence none of
the behaviours above will be evident...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 12:59                     ` Brian Foster
  2020-08-21 15:51                       ` Alberto Garcia
@ 2020-08-23 22:16                       ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2020-08-23 22:16 UTC (permalink / raw)
  To: Brian Foster
  Cc: Alberto Garcia, Kevin Wolf, qemu-devel, qemu-block, Max Reitz,
	Vladimir Sementsov-Ogievskiy, linux-xfs

On Fri, Aug 21, 2020 at 08:59:44AM -0400, Brian Foster wrote:
> On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> > And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> > more IOPS.
> > 
> > I just ran the tests with aio=native and with a raw image instead of
> > qcow2, here are the results:
> > 
> > qcow2:
> > |----------------------+-------------+------------|
> > | preallocation        | aio=threads | aio=native |
> > |----------------------+-------------+------------|
> > | off                  |        8139 |       7649 |
> > | off (w/o ZERO_RANGE) |        2965 |       2779 |
> > | metadata             |        7768 |       8265 |
> > | falloc               |        7742 |       7956 |
> > | full                 |       41389 |      56668 |
> > |----------------------+-------------+------------|
> > 
> 
> So this seems like Dave's suggestion to use native aio produced more
> predictable results with full file prealloc being a bit faster than per
> cluster prealloc. Not sure why that isn't the case with aio=threads. I

That will be the context switch overhead with aio=threads becoming a
performance limiting factor at higher IOPS. The "full" workload
there is probably running at 80-120k context switches/s while the
aio=native one is probably under 10k ctxsw/s because it doesn't switch
threads for every IO that has to be submitted/completed.

For all the other results, I'd consider the difference to be noise -
it's just not significant enough to draw any conclusions from at
all.

FWIW, the other thing that aio=native gives us is plugging across
batch IO submission. This allows bio merging before dispatch and
that can greatly increase performance of AIO when the IO being
submitted has some mergable submissions. That's not the case for
pure random IO like this, but there are relatively few pure random
IO workloads out there... :P

> was wondering if perhaps the threading affects something indirectly like
> the qcow2 metadata allocation itself, but I guess that would be
> inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
> the previous ext4 numbers were with aio=threads).

> > raw:
> > |---------------+-------------+------------|
> > | preallocation | aio=threads | aio=native |
> > |---------------+-------------+------------|
> > | off           |        7647 |       7928 |
> > | falloc        |        7662 |       7856 |
> > | full          |       45224 |      58627 |
> > |---------------+-------------+------------|
> > 
> > A qcow2 file with preallocation=metadata is more or less similar to a
> > sparse raw file (and the numbers are indeed similar).
> > 
> > preallocation=off on qcow2 does not have an equivalent on raw files.
> > 
> 
> It sounds like preallocation=off for qcow2 would be roughly equivalent
> to a raw file with a 64k extent size hint (on XFS).

Yes, the effect should be close to identical; the only difference is
that qcow2 adds new clusters to the end of the file (i.e. the file
itself is not sparse), while the extent size hint will just add 64kB
extents into the file around the write offset. That demonstrates the
other behavioural advantage of extent size hints: they avoid needing
to extend the file, which is yet another way to serialise concurrent
IO and create IO pipeline stalls...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-23 21:59                       ` Dave Chinner
@ 2020-08-24 20:14                         ` Alberto Garcia
  0 siblings, 0 replies; 18+ messages in thread
From: Alberto Garcia @ 2020-08-24 20:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Sun 23 Aug 2020 11:59:07 PM CEST, Dave Chinner wrote:
>> >> Option 4 is described above as initial file preallocation whereas
>> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> >> is reporting that the initial file preallocation mode is slower than
>> >> the per cluster prealloc mode. Berto, am I following that right?
>> 
>> After looking more closely at the data I can see that there is a peak of
>> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
>> ~7K for the rest of the test.
>
> How big is the filesystem, how big is the log? (xfs_info output,
> please!)

The size of the filesystem is 126GB and here's the output of xfs_info:

meta-data=/dev/vg/test           isize=512    agcount=4, agsize=8248576 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=32994304, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16110, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

>> I was running fio with --ramp_time=5 which ignores the first 5 seconds
>> of data in order to let performance settle, but if I remove that I can
>> see the effect more clearly. I can observe it with raw files (in 'off'
>> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
>> preallocation=off the performance is stable during the whole test.
>
> What does "preallocation=off" mean again? Is that using
> fallocate(ZERO_RANGE) prior to the data write rather than
> preallocating the metadata/entire file?

Exactly, it means that: one fallocate() call before each data write
(unless the area has already been allocated by a previous write).
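
As a rough sketch of that per-cluster pattern outside QEMU (the path
is just an example; xfs_io's fzero should map to
fallocate(FALLOC_FL_ZERO_RANGE)):

  # zero a 64k cluster, then write 4k of data into it with O_DIRECT
  xfs_io -f -d -c "fzero 0 64k" -c "pwrite 0 4k" /mnt/test/cluster-test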

> If so, I would expect the limiting factor to be the rate at which IO
> can be issued because of the fallocate()-triggered pipeline bubbles.
> That leaves idle device time, so you're not pushing the limits of the
> hardware and hence none of the behaviours above will be evident...

The thing is that with raw (i.e. non-qcow2) images the number of IOPS is
similar, but in that case there are no fallocate() calls, only the data
writes.

Berto

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-21 17:02                       ` Brian Foster
@ 2020-08-25 12:24                         ` Alberto Garcia
  2020-08-25 16:54                           ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Alberto Garcia @ 2020-08-25 12:24 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Fri 21 Aug 2020 07:02:32 PM CEST, Brian Foster wrote:
>> I was running fio with --ramp_time=5 which ignores the first 5 seconds
>> of data in order to let performance settle, but if I remove that I can
>> see the effect more clearly. I can observe it with raw files (in 'off'
>> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
>> preallocation=off the performance is stable during the whole test.
>
> That's interesting. I ran your fio command (without --ramp_time and
> with --runtime=5m) against a file on XFS (so no qcow2, no zero_range)
> once with a sparse file with a 64k extent size hint and again with a
> fully preallocated 25GB file and I saw similar results in terms of the
> delta.  This was just against an SSD backed vdisk in my local dev VM,
> but I saw ~5800 iops for the full preallocation test and ~6200 iops
> with the extent size hint.
>
> I do notice an initial iops burst as described for both tests, so I
> switched to use a 60s ramp time and 60s runtime. With that longer ramp
> up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
> with the full 25GB prealloc. Perhaps the unexpected performance delta
> with qcow2 is similarly transient towards the start of the test and
> the runtime is short enough that it skews the final results..?

I also tried running directly against a file on xfs (no qcow2, no VMs)
but it doesn't really matter whether I use --ramp_time=5 or 60.

Here are the results:

|---------------+-------+-------|
| preallocation |   xfs |  ext4 |
|---------------+-------+-------|
| off           |  7277 | 43260 |
| fallocate     |  7299 | 42810 |
| full          | 88404 | 83197 |
|---------------+-------+-------|

I ran the first case (no preallocation) for 5 minutes and, as I said,
there's a peak during the first 5 seconds, but then the number remains
under 10k IOPS for the rest of the 5 minutes.

Berto

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-25 12:24                         ` Alberto Garcia
@ 2020-08-25 16:54                           ` Brian Foster
  2020-08-25 17:18                             ` Alberto Garcia
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2020-08-25 16:54 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Tue, Aug 25, 2020 at 02:24:58PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 07:02:32 PM CEST, Brian Foster wrote:
> >> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> >> of data in order to let performance settle, but if I remove that I can
> >> see the effect more clearly. I can observe it with raw files (in 'off'
> >> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> >> preallocation=off the performance is stable during the whole test.
> >
> > That's interesting. I ran your fio command (without --ramp_time and
> > with --runtime=5m) against a file on XFS (so no qcow2, no zero_range)
> > once with a sparse file with a 64k extent size hint and again with a
> > fully preallocated 25GB file and I saw similar results in terms of the
> > delta.  This was just against an SSD backed vdisk in my local dev VM,
> > but I saw ~5800 iops for the full preallocation test and ~6200 iops
> > with the extent size hint.
> >
> > I do notice an initial iops burst as described for both tests, so I
> > switched to use a 60s ramp time and 60s runtime. With that longer ramp
> > up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
> > with the full 25GB prealloc. Perhaps the unexpected performance delta
> > with qcow2 is similarly transient towards the start of the test and
> > the runtime is short enough that it skews the final results..?
> 
> I also tried running directly against a file on xfs (no qcow2, no VMs)
> but it doesn't really matter whether I use --ramp_time=5 or 60.
> 
> Here are the results:
> 
> |---------------+-------+-------|
> | preallocation |   xfs |  ext4 |
> |---------------+-------+-------|
> | off           |  7277 | 43260 |
> | fallocate     |  7299 | 42810 |
> | full          | 88404 | 83197 |
> |---------------+-------+-------|
> 
> I ran the first case (no preallocation) for 5 minutes and, as I said,
> there's a peak during the first 5 seconds, but then the number remains
> under 10k IOPS for the rest of the 5 minutes.
> 

I don't think we're talking about the same thing. I was referring to the
difference between full file preallocation and the extent size hint in
XFS, and how the latter was faster with the shorter ramp time but that
swapped around when the test ramped up for longer. Here, it looks like
you're comparing XFS to ext4 writing direct to a file..

If I compare this 5m fio test between XFS and ext4 on a couple of my
systems (with either no prealloc or full file prealloc), I end up seeing
ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
Either way, I don't see that huge disparity where ext4 is 5-6 times
faster than XFS. Can you describe the test, filesystem and storage in
detail where you observe such a discrepancy?

Brian


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-25 16:54                           ` Brian Foster
@ 2020-08-25 17:18                             ` Alberto Garcia
  2020-08-25 19:47                               ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Alberto Garcia @ 2020-08-25 17:18 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Tue 25 Aug 2020 06:54:15 PM CEST, Brian Foster wrote:
> If I compare this 5m fio test between XFS and ext4 on a couple of my
> systems (with either no prealloc or full file prealloc), I end up seeing
> ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
> Either way, I don't see that huge disparity where ext4 is 5-6 times
> faster than XFS. Can you describe the test, filesystem and storage in
> detail where you observe such a discrepancy?

Here's the test:

fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
    --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
    --name=test --size=25G --io_limit=25G --ramp_time=0 \
    --rw=randwrite --bs=4k --runtime=300 --time_based=1

The size of the XFS filesystem is 126 GB and it's almost empty, here's
the xfs_info output:

meta-data=/dev/vg/test           isize=512    agcount=4, agsize=8248576 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=32994304, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16110, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The size of the ext4 filesystem is 99GB, of which 49GB are free (that
is, without the file used in this test). The filesystem uses 4KB
blocks, a 128M journal and these features:

Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
                          filetype needs_recovery extent flex_bg
                          sparse_super large_file huge_file uninit_bg
                          dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
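
(A summary like the one above can be read from the superblock with
dumpe2fs; the device path is just a placeholder:)

  dumpe2fs -h /dev/vg/ext4-test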

In both cases I'm using LVM on top of LUKS and the hard drive is a
Samsung SSD 850 PRO 1TB.

The Linux version is 4.19.132-1 from Debian.

Berto

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-25 17:18                             ` Alberto Garcia
@ 2020-08-25 19:47                               ` Brian Foster
  2020-08-26 18:34                                 ` Alberto Garcia
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2020-08-25 19:47 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Tue, Aug 25, 2020 at 07:18:19PM +0200, Alberto Garcia wrote:
> On Tue 25 Aug 2020 06:54:15 PM CEST, Brian Foster wrote:
> > If I compare this 5m fio test between XFS and ext4 on a couple of my
> > systems (with either no prealloc or full file prealloc), I end up seeing
> > ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
> > Either way, I don't see that huge disparity where ext4 is 5-6 times
> > faster than XFS. Can you describe the test, filesystem and storage in
> > detail where you observe such a discrepancy?
> 
> Here's the test:
> 
> fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
>     --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
>     --name=test --size=25G --io_limit=25G --ramp_time=0 \
>     --rw=randwrite --bs=4k --runtime=300 --time_based=1
> 

My fio fallocates the entire file by default with this command. Is that
the intent of this particular test? I added --fallocate=none to my test
runs to incorporate the allocation cost in the I/Os.
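
In other words, something like this (your command from above with only
the fallocate behaviour changed):

fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
    --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
    --name=test --size=25G --io_limit=25G --ramp_time=0 \
    --rw=randwrite --bs=4k --runtime=300 --time_based=1 \
    --fallocate=none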

> The size of the XFS filesystem is 126 GB and it's almost empty, here's
> the xfs_info output:
> 
> meta-data=/dev/vg/test           isize=512    agcount=4, agsize=8248576 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=32994304, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=16110, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The size of the ext4 filesystem is 99GB, of which 49GB are free (that
> is, without the file used in this test). The filesystem uses 4KB
> blocks, a 128M journal and these features:
> 
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index
>                           filetype needs_recovery extent flex_bg
>                           sparse_super large_file huge_file uninit_bg
>                           dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> 
> In both cases I'm using LVM on top of LUKS and the hard drive is a
> Samsung SSD 850 PRO 1TB.
> 
> The Linux version is 4.19.132-1 from Debian.
> 

Thanks. I don't have LUKS in the mix on my box, but I was running on a
more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and saw
a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The same
test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
increase the size of the LVM volume from 126G to >1TB, ext4 runs at
roughly the same rate and XFS closes the gap to around ~19k iops as
well. I'm not sure what might have changed since v4.19, but care to see
if this is still an issue on a more recent kernel?

Brian

> Berto
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-25 19:47                               ` Brian Foster
@ 2020-08-26 18:34                                 ` Alberto Garcia
  2020-08-27 16:47                                   ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Alberto Garcia @ 2020-08-26 18:34 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Tue 25 Aug 2020 09:47:24 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> My fio fallocates the entire file by default with this command. Is that
> the intent of this particular test? I added --fallocate=none to my test
> runs to incorporate the allocation cost in the I/Os.

That wasn't intentional, you're right, it should use --fallocate=none (I
don't see a big difference in my test anyway).

>> The Linux version is 4.19.132-1 from Debian.
>
> Thanks. I don't have LUKS in the mix on my box, but I was running on a
> more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and
> saw a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The
> same test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
> increase the size of the LVM volume from 126G to >1TB, ext4 runs at
> roughly the same rate and XFS closes the gap to around ~19k iops as
> well. I'm not sure what might have changed since v4.19, but care to
> see if this is still an issue on a more recent kernel?

Ok, I gave 5.7.10-1 a try but I still get similar numbers.

Perhaps with a larger filesystem there would be a difference? I don't
know.

Berto

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
  2020-08-26 18:34                                 ` Alberto Garcia
@ 2020-08-27 16:47                                   ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2020-08-27 16:47 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Dave Chinner, Kevin Wolf, Vladimir Sementsov-Ogievskiy,
	qemu-block, qemu-devel, Max Reitz, linux-xfs

On Wed, Aug 26, 2020 at 08:34:32PM +0200, Alberto Garcia wrote:
> On Tue 25 Aug 2020 09:47:24 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> > My fio fallocates the entire file by default with this command. Is that
> > the intent of this particular test? I added --fallocate=none to my test
> > runs to incorporate the allocation cost in the I/Os.
> 
> That wasn't intentional, you're right, it should use --fallocate=none (I
> don't see a big difference in my test anyway).
> 
> >> The Linux version is 4.19.132-1 from Debian.
> >
> > Thanks. I don't have LUKS in the mix on my box, but I was running on a
> > more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and
> > saw a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The
> > same test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
> > increase the size of the LVM volume from 126G to >1TB, ext4 runs at
> > roughly the same rate and XFS closes the gap to around ~19k iops as
> > well. I'm not sure what might have changed since v4.19, but care to
> > see if this is still an issue on a more recent kernel?
> 
> Ok, I gave 5.7.10-1 a try but I still get similar numbers.
> 

Strange.

> Perhaps with a larger filesystem there would be a difference? I don't
> know.
> 

Perhaps. I believe Dave mentioned earlier how log size might affect
things.

I created a 125GB lvm volume and see slight deltas in iops going from
testing directly on the block device, to a fully allocated file on
XFS/ext4 and then to a preallocated file on XFS/ext4. In both cases the
numbers are comparable between XFS and ext4. On XFS, I can reproduce a
serious drop in iops if I reduce the default ~64MB log down to 8MB.
Perhaps you could try increasing your log ('-lsize=...' at mkfs time)
and see if that changes anything?
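
Something along these lines; note that this recreates the filesystem
and destroys its contents, and the 128m value is only an example:

  # recreate the test filesystem with a larger internal log
  mkfs.xfs -f -l size=128m /dev/vg/test

  # after mounting, confirm the log size (the "log ... blocks=" line;
  # the mount point is a placeholder)
  xfs_info /mnt/test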

Beyond that, I'd probably try to normalize and simplify your storage
stack if you wanted to narrow it down further. E.g., clean format the
same bdev for XFS and ext4 and pull out things like LUKS just to rule
out any poor interactions.

Brian

> Berto
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-08-27 16:47 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1597416317.git.berto@igalia.com>
     [not found] ` <20200817101019.GD11402@linux.fritz.box>
     [not found]   ` <w518sedz3td.fsf@maestria.local.igalia.com>
     [not found]     ` <20200817155307.GS11402@linux.fritz.box>
     [not found]       ` <w51pn7memr7.fsf@maestria.local.igalia.com>
     [not found]         ` <20200819150711.GE10272@linux.fritz.box>
     [not found]           ` <20200819175300.GA141399@bfoster>
2020-08-20 20:03             ` [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster Alberto Garcia
2020-08-20 21:58               ` Dave Chinner
2020-08-21 11:05                 ` Brian Foster
2020-08-21 11:42                   ` Alberto Garcia
2020-08-21 12:12                     ` Alberto Garcia
2020-08-21 17:02                       ` Brian Foster
2020-08-25 12:24                         ` Alberto Garcia
2020-08-25 16:54                           ` Brian Foster
2020-08-25 17:18                             ` Alberto Garcia
2020-08-25 19:47                               ` Brian Foster
2020-08-26 18:34                                 ` Alberto Garcia
2020-08-27 16:47                                   ` Brian Foster
2020-08-23 21:59                       ` Dave Chinner
2020-08-24 20:14                         ` Alberto Garcia
2020-08-21 12:59                     ` Brian Foster
2020-08-21 15:51                       ` Alberto Garcia
2020-08-23 22:16                       ` Dave Chinner
2020-08-21 16:09                 ` Alberto Garcia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).