* XFS fragmentation on file append
@ 2014-04-07 22:53 Keyur Govande
  2014-04-08  1:50   ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Keyur Govande @ 2014-04-07 22:53 UTC (permalink / raw)
  To: linux-fsdevel

Hello,

I'm currently investigating a MySQL performance degradation on XFS due
to file fragmentation.

The box has a 16 drive RAID 10 array with a 1GB battery backed cache
running on a 12 core box.

xfs_info shows:
meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=576599552, imaxpct=5
         =                       sunit=16     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=281552, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
The partition is 2TB in size and 40% full to simulate production.

Here's a test program that appends 512KB like MySQL does (write and
then fsync). To exacerbate the issue, it loops a bunch of times:
https://gist.github.com/keyurdg/961c19175b81c73fdaa3
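
(For reference, a minimal sketch of that write pattern; this is an
illustration rather than the gist itself, and the file name, fill byte
and loop count are assumptions:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (512 * 1024)

int main(void)
{
        char *buf = malloc(CHUNK);
        off_t off = 0;
        int i, fd;

        if (!buf)
                return 1;
        memset(buf, 'a', CHUNK);

        fd = open("plain_pwrite.werr", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < 20000; i++) {
                /* append 512KB at the current EOF, then force it to disk */
                if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
                        perror("pwrite");
                        return 1;
                }
                fsync(fd);
                off += CHUNK;
        }
        close(fd);
        return 0;
}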

When run, this creates ~9500 extents most of length 1024. cat'ing the
file to /dev/null after dropping the caches reads at an average of 75
MBps, way less than the hardware is capable of.

When I add a posix_fallocate before calling pwrite() as shown here
https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
fragments an order of magnitude less (~30 extents), and cat'ing to
/dev/null proceeds at ~1GBps.
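
(The only change in the fallocate variant is a preallocation call before
each write; roughly, reusing fd, buf and off from the sketch above, and
noting that posix_fallocate() returns an error number rather than
setting errno:)

        for (i = 0; i < 20000; i++) {
                /* preallocate the 512KB region about to be written */
                int err = posix_fallocate(fd, off, CHUNK);

                if (err != 0) {
                        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
                        return 1;
                }
                if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
                        perror("pwrite");
                        return 1;
                }
                fsync(fd);
                off += CHUNK;
        }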

The same behavior is seen even when the allocsize option is removed
and the partition remounted.

This is somewhat unexpected. I'm working on a patch to add fallocate
to MySQL, but wanted to check in here in case I'm missing anything
obvious.

Cheers,
Keyur.

* Re: XFS fragmentation on file append
  2014-04-07 22:53 XFS fragmentation on file append Keyur Govande
@ 2014-04-08  1:50   ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-04-08  1:50 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, xfs

[cc the XFS mailing list <xfs@oss.sgi.com>]

On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
> Hello,
> 
> I'm currently investigating a MySQL performance degradation on XFS due
> to file fragmentation.
> 
> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
> running on a 12 core box.
> 
> xfs_info shows:
> meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=576599552, imaxpct=5
>          =                       sunit=16     swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=281552, version=2
>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
> The partition is 2TB in size and 40% full to simulate production.
> 
> Here's a test program that appends 512KB like MySQL does (write and
> then fsync). To exacerbate the issue, it loops a bunch of times:
> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
> 
> When run, this creates ~9500 extents most of length 1024.

1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
the size of your writes.

Could you post the output of the xfs_bmap commands you are using to
get this information?

> cat'ing the
> file to /dev/null after dropping the caches reads at an average of 75
> MBps, way less than the hardware is capable of.

What you are doing is "open-seekend-write-fsync-close".  You haven't
told the filesystem you are doing append writes (O_APPEND, or the
append inode flag) so it can't optimise for them.

You are also cleaning the file before closing it, so you are
defeating the current heuristics that XFS uses to determine whether
to remove speculative preallocation on close() - if the inode is
dirty at close(), then it won't be removed. Hence speculative
preallocation does nothing for your IO pattern (i.e. the allocsize
mount option is completely useless). Remove the fsync and you'll
see your fragmentation problem go away completely.

> When I add a posix_fallocate before calling pwrite() as shown here
> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
> fragments an order of magnitude less (~30 extents), and cat'ing to
> /dev/null proceeds at ~1GBps.

That should make no difference on XFS as you are only preallocating
the 512KB region beyond EOF that you are about to write into and
hence both delayed allocation and preallocation have the same
allocation target (the current EOF block). Hence in both cases the
allocation patterns should be identical if the freespace extent they
are being allocated out of are identical.

Did you remove the previous test files and sync the filesystem
between test runs so that the available freespace was identical for
the different test runs? If you didn't then the filesystem allocated
the files out of different free space extents and hence you'll get
different allocation patterns...

> The same behavior is seen even when the allocsize option is removed
> and the partition remounted.

See above.

> This is somewhat unexpected. I'm working on a patch to add fallocate
> to MySQL, but wanted to check in here in case I'm missing anything
> obvious.

fallocate() of 512KB sized regions will not prevent fragmentation
into 512KB sized extents with the write pattern you are using.

If you use the inode APPEND attribute for your log files, this lets
the filesystem optimise its block management for append IO. In the
case of XFS, it then will not remove preallocation beyond EOF when
the fd is closed because the next write will be at EOF where the
speculative preallocation already exists. Then allocsize=128M will
actually work for your log files....

Alternatively, set an extent size hint on the log files to define
the minimum sized allocation (e.g. 32MB) and this will limit
fragmentation without you having to modify the MySQL code at all...
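
(For what it's worth, the extent size hint can be set from the shell
with xfs_io -c 'extsize 32m' <file>, and the append-only flag with
chattr +a <file> (which needs CAP_LINUX_IMMUTABLE). Below is a sketch of
setting the hint programmatically via the XFS ioctls; the helper name is
made up, the header comes from xfsprogs-devel, and on XFS the hint
generally has to be set before the file has any data extents, or be
inherited from the directory via the EXTSZINHERIT flag, otherwise the
ioctl can fail:)

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs_fs.h>         /* struct fsxattr, XFS_IOC_FSGETXATTR, ... */

/* hypothetical helper: set a fixed extent size hint on an XFS file */
int set_extsize_hint(const char *path, unsigned int extsize_bytes)
{
        struct fsxattr fsx;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
                goto fail;
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;    /* honour fsx_extsize */
        fsx.fsx_extsize = extsize_bytes;        /* e.g. 32 * 1024 * 1024 */
        if (ioctl(fd, XFS_IOC_FSSETXATTR, &fsx) < 0)
                goto fail;
        close(fd);
        return 0;
fail:
        close(fd);
        return -1;
}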

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS fragmentation on file append
  2014-04-08  1:50   ` Dave Chinner
@ 2014-04-08  3:42   ` Keyur Govande
  2014-04-08  5:31       ` Dave Chinner
  -1 siblings, 1 reply; 20+ messages in thread
From: Keyur Govande @ 2014-04-08  3:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> [cc the XFS mailing list <xfs@oss.sgi.com>]
>
> On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
>> Hello,
>>
>> I'm currently investigating a MySQL performance degradation on XFS due
>> to file fragmentation.
>>
>> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
>> running on a 12 core box.
>>
>> xfs_info shows:
>> meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
>>          =                       sectsz=512   attr=2, projid32bit=0
>> data     =                       bsize=4096   blocks=576599552, imaxpct=5
>>          =                       sunit=16     swidth=512 blks
>> naming   =version 2              bsize=4096   ascii-ci=0
>> log      =internal               bsize=4096   blocks=281552, version=2
>>          =                       sectsz=512   sunit=16 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
>> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
>> The partition is 2TB in size and 40% full to simulate production.
>>
>> Here's a test program that appends 512KB like MySQL does (write and
>> then fsync). To exacerbate the issue, it loops a bunch of times:
>> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
>>
>> When run, this creates ~9500 extents most of length 1024.
>
> 1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
> the size of your writes.

Yeah, 1024 basic blocks of 512 bytes each.

>
> Could you post the output of the xfs_bmap commands you are using to
> get this information?

I'm getting the extent information via xfs_bmap -v <file name>. Here's
a sample: https://gist.github.com/keyurdg/291b2a429f03c9a649ad

>
>> cat'ing the
>> file to /dev/null after dropping the caches reads at an average of 75
>> MBps, way less than the hardware is capable of.
>
> What you are doing is "open-seekend-write-fsync-close".  You haven't
> told the filesystem you are doing append writes (O_APPEND, or the
> append inode flag) so it can't optimise for them.

I tried this; adding O_APPEND to the open() in the pathological
pwrite.c makes no difference to the extent allocation and hence the
read performance.

>
> You are also cleaning the file before closing it, so you are
> defeating the current heuristics that XFS uses to determine whether
> to remove speculative preallocation on close() - if the inode is
> dirty at close(), then it won't be removed. Hence speculative
> preallocation does nothing for your IO pattern (i.e. the allocsize
> mount option is completely useless). Remove the fsync and you'll
> see your fragmentation problem go away completely.

I agree, but the MySQL data files (*.ibd) on our production cluster
are appended to in bursts and they have thousands of tiny (512KB)
extents. Getting rid of fsync is not possible given the use case.

Arguably, MySQL does not close the files, but it writes out
infrequently enough that I couldn't make a good and small test case
for it. But the output of xfs_bmap is exactly the same as that of
pwrite.c

>
>> When I add a posix_fallocate before calling pwrite() as shown here
>> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
>> fragments an order of magnitude less (~30 extents), and cat'ing to
>> /dev/null proceeds at ~1GBps.
>
> That should make no difference on XFS as you are only preallocating
> the 512KB region beyond EOF that you are about to write into and
> hence both delayed allocation and preallocation have the same
> allocation target (the current EOF block). Hence in both cases the
> allocation patterns should be identical if the freespace extent they
> are being allocated out of are identical.
>
> Did you remove the previous test files and sync the filesystem
> between test runs so that the available freespace was identical for
> the different test runs? If you didn't then the filesystem allocated
> the files out of different free space extents and hence you'll get
> different allocation patterns...

I do clear everything and sync the FS before every run, and this is
reproducible across multiple machines in our cluster. I've re-run the
programs at least a 1000 times now, and every time get the same
results. For some reason even the tiny 512KB fallocate() seems to be
triggering some form of extent "merging" and placement.

I tried this on ext4 as well: with and without fallocate perform
exactly the same (~450 MBps), but XFS with fallocate is 2X faster (~1
GBps).

>
>> The same behavior is seen even when the allocsize option is removed
>> and the partition remounted.
>
> See above.
>
>> This is somewhat unexpected. I'm working on a patch to add fallocate
>> to MySQL, but wanted to check in here in case I'm missing anything
>> obvious.
>
> fallocate() of 512KB sized regions will not prevent fragmentation
> into 512KB sized extents with the write pattern you are using.
>
> If you use the inode APPEND attribute for your log files, this lets
> the filesystem optimise its block management for append IO. In the
> case of XFS, it then will not remove preallocation beyond EOF when
> the fd is closed because the next write will be at EOF where the
> speculative preallocation already exists. Then allocsize=128M will
> actually work for your log files....
>
> Alternatively, set an extent size hint on the log files to define
> the minimum sized allocation (e.g. 32MB) and this will limit
> fragmentation without you having to modify the MySQL code at all...
>

I tried enabling extsize to 32MB, but it seems to make no difference.

[kgovande@host]# xfs_io -c "extsize" /var/lib/mysql/xfs/plain_pwrite.werr
[33554432] /var/lib/mysql/xfs/plain_pwrite.werr
[kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr  | wc -l
20001
[kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv
plain_pwrite.werr > /dev/null
9.77GB 0:02:41 [61.7MB/s] [========================================>] 100%

# With fallocate
[kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr  | wc -l
[kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv
falloc_pwrite.werr > /dev/null
9.77GB 0:00:09 [1.03GB/s] [========================================>] 100%

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

* Re: XFS fragmentation on file append
  2014-04-08  3:42   ` Keyur Govande
@ 2014-04-08  5:31       ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-04-08  5:31 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, xfs

On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> > [cc the XFS mailing list <xfs@oss.sgi.com>]
> >
> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
> >> Hello,
> >>
> >> I'm currently investigating a MySQL performance degradation on XFS due
> >> to file fragmentation.
> >>
> >> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
> >> running on a 12 core box.
> >>
> >> xfs_info shows:
> >> meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
> >>          =                       sectsz=512   attr=2, projid32bit=0
> >> data     =                       bsize=4096   blocks=576599552, imaxpct=5
> >>          =                       sunit=16     swidth=512 blks
> >> naming   =version 2              bsize=4096   ascii-ci=0
> >> log      =internal               bsize=4096   blocks=281552, version=2
> >>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> >> realtime =none                   extsz=4096   blocks=0, rtextents=0
> >>
> >> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
> >> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
> >> The partition is 2TB in size and 40% full to simulate production.
> >>
> >> Here's a test program that appends 512KB like MySQL does (write and
> >> then fsync). To exacerbate the issue, it loops a bunch of times:
> >> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
> >>
> >> When run, this creates ~9500 extents most of length 1024.
> >
> > 1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
> > the size of your writes.
> 
> Yeah, 1024 basic blocks of 512 bytes each.
> 
> >
> > Could you post the output of the xfs_bmap commands you are using to
> > get this information?
> 
> I'm getting the extent information via xfs_bmap -v <file name>. Here's
> a sample: https://gist.github.com/keyurdg/291b2a429f03c9a649ad

Yup, looks like fragmented free space so it's only finding islands
of 512kb of freespace near to the inode to allocate out of.

Can you post the output of /proc/mounts so I can check what
allocator behaviour is being used?

> >> cat'ing the
> >> file to /dev/null after dropping the caches reads at an average of 75
> >> MBps, way less than the hardware is capable of.
> >
> > What you are doing is "open-seekend-write-fsync-close".  You haven't
> > told the filesystem you are doing append writes (O_APPEND, or the
> > append inode flag) so it can't optimise for them.
> 
> I tried this; adding O_APPEND to the open() in the pathological
> pwrite.c makes no difference to the extent allocation and hence the
> read performance.

Yeah, I had a look at what XFS does and in the close path it doesn't
know that the FD was O_APPEND because that state is not available to the
->release path.

> > You are also cleaning the file before closing it, so you are
> > defeating the current heuristics that XFS uses to determine whether
> > to remove speculative preallocation on close() - if the inode is
> > dirty at close(), then it won't be removed. Hence speculative
> > preallocation does nothing for your IO pattern (i.e. the allocsize
> > mount option is completely useless). Remove the fsync and you'll
> > see your fragmentation problem go away completely.
> 
> I agree, but the MySQL data files (*.ibd) on our production cluster
> are appended to in bursts and they have thousands of tiny (512KB)
> extents. Getting rid of fsync is not possible given the use case.

Sure - just demonstrating that it's the fsync that is causing the
problems. i.e. it's application driven behaviour that the filesystem
can't easily detect and optimise...

> Arguably, MySQL does not close the files, but it writes out
> infrequently enough that I couldn't make a good and small test case
> for it. But the output of xfs_bmap is exactly the same as that of
> pwrite.c

Once you've fragmented free space, the only way to defrag it is to
remove whatever is using the space between the small freespace
extents. Usually the condition occurs when you intermix long lived
files with short lived files - removing the short lived files
results in fragmented free space that cannot be made contiguous
until both the short lived and long lived data has been removed.

If you want an idea of whether you've fragmented free space, use
the xfs_db freespace command. To see what each ag looks like
(change it to iterate all the ags in your fs):

$ for i in 0 1 2 3; do echo "*** AG $i:" ; sudo xfs_db -c "freesp -a $i -s" /dev/vda; done
*** AG 0:
   from      to extents  blocks    pct
      1       1     129     129   0.02
      2       3     119     283   0.05
      4       7     125     641   0.11
      8      15      93     944   0.16
     16      31      64    1368   0.23
     32      63      53    2300   0.39
     64     127      21    1942   0.33
    128     255      16    3145   0.53
    256     511       6    1678   0.28
    512    1023       1     680   0.11
  16384   32767       1   23032   3.87
 524288 1048576       1  558825  93.93
total free extents 629
total free blocks 594967
average free extent size 945.893
*** AG 1:
   from      to extents  blocks    pct
      1       1     123     123   0.01
      2       3     125     305   0.04
      4       7      79     418   0.05
......

And that will tell us what state your filesystem is in w.r.t.
freespace fragmentation...

> >> When I add a posix_fallocate before calling pwrite() as shown here
> >> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
> >> fragments an order of magnitude less (~30 extents), and cat'ing to
> >> /dev/null proceeds at ~1GBps.
> >
> > That should make no difference on XFS as you are only preallocating
> > the 512KB region beyond EOF that you are about to write into and
> > hence both delayed allocation and preallocation have the same
> > allocation target (the current EOF block). Hence in both cases the
> > allocation patterns should be identical if the freespace extent they
> > are being allocated out of are identical.
> >
> > Did you remove the previous test files and sync the filesystem
> > between test runs so that the available freespace was identical for
> > the different test runs? If you didn't then the filesystem allocated
> > the files out of different free space extents and hence you'll get
> > different allocation patterns...
> 
> I do clear everything and sync the FS before every run, and this is
> reproducible across multiple machines in our cluster.

Which indicates that you've probably already completely fragmented
free space in the filesystems.

> I've re-run the
> programs at least a 1000 times now, and every time get the same
> results. For some reason even the tiny 512KB fallocate() seems to be
> triggering some form of extent "merging" and placement.

Both methods of allocation should be doing the same thing - they use
exactly the same algorithm to select the next extent to allocate.
Can you tell me the:

	a) inode number of each of the target files that show
	different output
	b) the xfs_bmap output of the different files.

> > Alternatively, set an extent size hint on the log files to define
> > the minimum sized allocation (e.g. 32MB) and this will limit
> > fragmentation without you having to modify the MySQL code at all...
> >
> 
> I tried enabling extsize to 32MB, but it seems to make no difference.
> [kgovande@host]# xfs_io -c "extsize" /var/lib/mysql/xfs/plain_pwrite.werr
> [33554432] /var/lib/mysql/xfs/plain_pwrite.werr
> [kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr  | wc -l
> 20001
> [kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv
> plain_pwrite.werr > /dev/null
> 9.77GB 0:02:41 [61.7MB/s] [========================================>] 100%

Ah, extent size hints are not being considered in
xfs_can_free_eofblocks(). I suspect they should be, and that would
fix the problem.

Can you add this to xfs_can_free_eofblocks() in your kernel and see
what happens?


 	/* prealloc/delalloc exists only on regular files */
 	if (!S_ISREG(ip->i_d.di_mode))
 		return false;
 
+	if (xfs_get_extsz_hint(ip))
+		return false;
+
 	/*
 	 * Zero sized files with no cached pages and delalloc blocks will not
 	 * have speculative prealloc/delalloc blocks to remove.
 	 */

If that solves the problem, then I suspect that we might need to
modify this code to take into account the allocsize mount option as
well...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS fragmentation on file append
  2014-04-08  5:31       ` Dave Chinner
@ 2014-04-22 23:35         ` Keyur Govande
  -1 siblings, 0 replies; 20+ messages in thread
From: Keyur Govande @ 2014-04-22 23:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Tue, Apr 8, 2014 at 1:31 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
>> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > [cc the XFS mailing list <xfs@oss.sgi.com>]
>> >
>> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
>> >> Hello,
>> >>
>> >> I'm currently investigating a MySQL performance degradation on XFS due
>> >> to file fragmentation.
>> >>
>> >> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
>> >> running on a 12 core box.
>> >>
>> >> xfs_info shows:
>> >> meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
>> >>          =                       sectsz=512   attr=2, projid32bit=0
>> >> data     =                       bsize=4096   blocks=576599552, imaxpct=5
>> >>          =                       sunit=16     swidth=512 blks
>> >> naming   =version 2              bsize=4096   ascii-ci=0
>> >> log      =internal               bsize=4096   blocks=281552, version=2
>> >>          =                       sectsz=512   sunit=16 blks, lazy-count=1
>> >> realtime =none                   extsz=4096   blocks=0, rtextents=0
>> >>
>> >> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
>> >> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
>> >> The partition is 2TB in size and 40% full to simulate production.
>> >>
>> >> Here's a test program that appends 512KB like MySQL does (write and
>> >> then fsync). To exacerbate the issue, it loops a bunch of times:
>> >> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
>> >>
>> >> When run, this creates ~9500 extents most of length 1024.
>> >
>> > 1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
>> > the size of your writes.
>>
>> Yeah, 1024 basic blocks of 512 bytes each.
>>
>> >
>> > Could you post the output of the xfs_bmap commands you are using to
>> > get this information?
>>
>> I'm getting the extent information via xfs_bmap -v <file name>. Here's
>> a sample: https://gist.github.com/keyurdg/291b2a429f03c9a649ad
>
> Yup, looks like fragmented free space so it's only finding islands
> of 512kb of freespace near to the inode to allocate out of.
>
> Can you post the output of /proc/mounts so I can check what
> allocator behaviour is being used?
>
>> >> cat'ing the
>> >> file to /dev/null after dropping the caches reads at an average of 75
>> >> MBps, way less than the hardware is capable of.
>> >
>> > What you are doing is "open-seekend-write-fsync-close".  You haven't
>> > told the filesystem you are doing append writes (O_APPEND, or the
>> > append inode flag) so it can't optimise for them.
>>
>> I tried this; adding O_APPEND to the open() in the pathological
>> pwrite.c makes no difference to the extent allocation and hence the
>> read performance.
>
> Yeah, I had a look at what XFS does and in the close path it doesn't
> know that the FD was O_APPEND because that state is not available to the
> ->release path.
>
>> > You are also cleaning the file before closing it, so you are
>> > defeating the current heuristics that XFS uses to determine whether
>> > to remove speculative preallocation on close() - if the inode is
>> > dirty at close(), then it won't be removed. Hence speculative
>> > preallocation does nothing for your IO pattern (i.e. the allocsize
>> > mount option is completely useless). Remove the fsync and you'll
>> > see your fragmentation problem go away completely.
>>
>> I agree, but the MySQL data files (*.ibd) on our production cluster
>> are appended to in bursts and they have thousands of tiny (512KB)
>> extents. Getting rid of fsync is not possible given the use case.
>
> Sure - just demonstrating that it's the fsync that is causing the
> problems. i.e. it's application driven behaviour that the filesystem
> can't easily detect and optimise...
>
>> Arguably, MySQL does not close the files, but it writes out
>> infrequently enough that I couldn't make a good and small test case
>> for it. But the output of xfs_bmap is exactly the same as that of
>> pwrite.c
>
> Once you've fragmented free space, the only way to defrag it is to
> remove whatever is using the space between the small freespace
> extents. Usually the condition occurs when you intermix long lived
> files with short lived files - removing the short lived files
> results in fragmented free space that cannot be made contiguous
> until both the short lived and long lived data has been removed.
>
> If you want an idea of whether you've fragmented free space, use
> the xfs_db freespace command. To see what each ag looks like
> (change it to iterate all the ags in your fs):
>
> $ for i in 0 1 2 3; do echo "*** AG $i:" ; sudo xfs_db -c "freesp -a $i -s" /dev/vda; done
> *** AG 0:
>    from      to extents  blocks    pct
>       1       1     129     129   0.02
>       2       3     119     283   0.05
>       4       7     125     641   0.11
>       8      15      93     944   0.16
>      16      31      64    1368   0.23
>      32      63      53    2300   0.39
>      64     127      21    1942   0.33
>     128     255      16    3145   0.53
>     256     511       6    1678   0.28
>     512    1023       1     680   0.11
>   16384   32767       1   23032   3.87
>  524288 1048576       1  558825  93.93
> total free extents 629
> total free blocks 594967
> average free extent size 945.893
> *** AG 1:
>    from      to extents  blocks    pct
>       1       1     123     123   0.01
>       2       3     125     305   0.04
>       4       7      79     418   0.05
> ......
>
> And that will tell us what state your filesystem is in w.r.t.
> freespace fragmentation...
>
>> >> When I add a posix_fallocate before calling pwrite() as shown here
>> >> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
>> >> fragments an order of magnitude less (~30 extents), and cat'ing to
>> >> /dev/null proceeds at ~1GBps.
>> >
>> > That should make no difference on XFS as you are only preallocating
>> > the 512KB region beyond EOF that you are about to write into and
>> > hence both delayed allocation and preallocation have the same
>> > allocation target (the current EOF block). Hence in both cases the
>> > allocation patterns should be identical if the freespace extent they
>> > are being allocated out of are identical.
>> >
>> > Did you remove the previous test files and sync the filesystem
>> > between test runs so that the available freespace was identical for
>> > the different test runs? If you didn't then the filesystem allocated
>> > the files out of different free space extents and hence you'll get
>> > different allocation patterns...
>>
>> I do clear everything and sync the FS before every run, and this is
>> reproducible across multiple machines in our cluster.
>
> Which indicates that you've probably already completely fragmented
> free space in the filesystems.
>
>> I've re-run the
>> programs at least a 1000 times now, and every time get the same
>> results. For some reason even the tiny 512KB fallocate() seems to be
>> triggering some form of extent "merging" and placement.
>
> Both methods of allocation should be doing the same thing - they use
> exactly the same algorithm to select the next extent to allocate.
> Can you tell me the:
>
>         a) inode number of each of the target files that show
>         different output
>         b) the xfs_bmap output of the different files.
>
>> > Alternatively, set an extent size hint on the log files to define
>> > the minimum sized allocation (e.g. 32MB) and this will limit
>> > fragmentation without you having to modify the MySQL code at all...
>> >
>>
>> I tried enabling extsize to 32MB, but it seems to make no difference.
>> [kgovande@host]# xfs_io -c "extsize" /var/lib/mysql/xfs/plain_pwrite.werr
>> [33554432] /var/lib/mysql/xfs/plain_pwrite.werr
>> [kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr  | wc -l
>> 20001
>> [kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv
>> plain_pwrite.werr > /dev/null
>> 9.77GB 0:02:41 [61.7MB/s] [========================================>] 100%
>
> Ah, extent size hints are not being considered in
> xfs_can_free_eofblocks(). I suspect they should be, and that would
> fix the problem.
>
> Can you add this to xfs_can_free_eofblocks() in your kernel and see
> what happens?
>
>
>         /* prealloc/delalloc exists only on regular files */
>         if (!S_ISREG(ip->i_d.di_mode))
>                 return false;
>
> +       if (xfs_get_extsz_hint(ip))
> +               return false;
> +
>         /*
>          * Zero sized files with no cached pages and delalloc blocks will not
>          * have speculative prealloc/delalloc blocks to remove.
>          */
>
> If that solves the problem, then I suspect that we might need to
> modify this code to take into account the allocsize mount option as
> well...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Hey Dave,

I spent some more time figuring out the MySQL write semantics; it
doesn't open/close files often, so my initial test script was incorrect.

It uses O_DIRECT and appends to the file; I modified my test binary to
take this into account here:
https://gist.github.com/keyurdg/54e0613e27dbe7946035
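
(A minimal sketch of that O_DIRECT append pattern, for reference; it is
an illustration rather than the gist itself, and the file name,
alignment, loop count and the fsync after each write are assumptions.
O_DIRECT requires the buffer, offset and length to be suitably aligned,
hence posix_memalign():)

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (512 * 1024)
#define ALIGN   4096

int main(void)
{
        void *buf;
        off_t off = 0;
        int i, fd;

        if (posix_memalign(&buf, ALIGN, CHUNK))
                return 1;
        memset(buf, 'a', CHUNK);

        fd = open("direct_pwrite.werr",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < 4096; i++) {    /* 4096 x 512KB, i.e. a 2GB file */
                /* write each chunk at the current EOF, then flush it */
                if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
                        perror("pwrite");
                        return 1;
                }
                fsync(fd);
                off += CHUNK;
        }
        close(fd);
        return 0;
}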

I've been testing on the 3.10 kernel. The setup is an empty 2 TB XFS partition.
[root@dbtest09 linux-3.10.37]# xfs_info /dev/sda4
meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=576599552, imaxpct=5
         =                       sunit=16     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=281552, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@dbtest09 linux-3.10.37]# cat /proc/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs
rw,relatime,size=49573060k,nr_inodes=12393265,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/sda2 / ext3
rw,noatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
/dev/sda1 /boot ext3
rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda4 /var/lib/mysql xfs
rw,noatime,swalloc,attr2,inode64,logbsize=64k,sunit=128,swidth=4096,noquota
0 0

[root@dbtest09 linux-3.10.37]# xfs_io -c "extsize " /var/lib/mysql/xfs/
[0] /var/lib/mysql/xfs/

Here's how the first 3 AG's look like:
https://gist.github.com/keyurdg/82b955fb96b003930e4f

After a run of the dpwrite program, here's what the bmap looks like:
https://gist.github.com/keyurdg/11196897

The files have interleaved nicely with each other, mostly in
XFS_IEXT_BUFSZ-sized extents. The average read speed is 724 MBps. After
defragmenting the file to 1 extent, the speed improves to 1.09
GBps.

I noticed that XFS chooses the AG based on the parent directory's AG,
and only moves to the next sequential one if there's no space
available. A small patch that chooses the AG randomly fixes the
fragmentation issue very nicely. All of the MySQL data files are in a
single directory, and we see this in production: the parent inode's AG
is filled, then the sequentially next one, and so on.

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index c8f5ae1..7841509 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -517,7 +517,7 @@ xfs_ialloc_ag_select(
         * to mean that blocks must be allocated for them,
         * if none are currently free.
         */
-       agno = pagno;
+       agno = ((xfs_agnumber_t) prandom_u32()) % agcount;
        flags = XFS_ALLOC_FLAG_TRYLOCK;
        for (;;) {
                pag = xfs_perag_get(mp, agno);

I couldn't find guidance on the internet on how many allocation groups
to use for a 2 TB partition. This random selection won't scale to many
hundreds of concurrently written files, but for a few heavily
written-to files it works nicely.

I noticed that for non-DIRECT_IO writes with every write fsync'd, XFS
would cleverly keep doubling the allocation size as the file kept
growing.

The "extsize" option seems to me a bit too static because the size of
tables we use varies widely and large new tables come and go.

Could the same doubling logic be applied for DIRECT_IO writes as well?
I tried out this extremely rough patch based on the delayed write
code; if you think this is reasonable I can try to make it more
acceptable. It provides very nice performance indeed; for a 2GB file,
here's what the bmap looks like:
https://gist.github.com/keyurdg/ac6ed8536f864c8fffc8

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8f8aaee..2682f53 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -118,6 +118,16 @@ xfs_alert_fsblock_zero(
        return EFSCORRUPTED;
 }

+STATIC int
+xfs_iomap_eof_want_preallocate(
+       xfs_mount_t     *mp,
+       xfs_inode_t     *ip,
+       xfs_off_t       offset,
+       size_t          count,
+       xfs_bmbt_irec_t *imap,
+       int             nimaps,
+       int             *prealloc);
+
 int
 xfs_iomap_write_direct(
        xfs_inode_t     *ip,
@@ -152,7 +162,32 @@ xfs_iomap_write_direct(
        offset_fsb = XFS_B_TO_FSBT(mp, offset);
        last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
        if ((offset + count) > XFS_ISIZE(ip)) {
-               error = xfs_iomap_eof_align_last_fsb(mp, ip, extsz, &last_fsb);
+               xfs_extlen_t    new_extsz = extsz;
+
+               if (!extsz) {
+                       int prealloc;
+                       xfs_bmbt_irec_t prealloc_imap[XFS_WRITE_IMAPS];
+
+                       error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
+                                               prealloc_imap, XFS_WRITE_IMAPS, &prealloc);
+
+                       if (prealloc) {
+                               xfs_fileoff_t   temp_start_fsb;
+                               int             temp_imaps = 1;
+
+                               temp_start_fsb = XFS_B_TO_FSB(mp, offset);
+                               if (temp_start_fsb)
+                                       temp_start_fsb--;
+
+                               error = xfs_bmapi_read(ip, temp_start_fsb, 1, prealloc_imap, &temp_imaps, XFS_BMAPI_ENTIRE);
+                               if (error)
+                                       return XFS_ERROR(error);
+
+                               new_extsz = prealloc_imap[0].br_blockcount << 1;
+                       }
+               }
+
+               error = xfs_iomap_eof_align_last_fsb(mp, ip, new_extsz, &last_fsb);
                if (error)
                        return XFS_ERROR(error);
        } else {

Cheers,
Keyur.

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
@ 2014-04-22 23:35         ` Keyur Govande
  0 siblings, 0 replies; 20+ messages in thread
From: Keyur Govande @ 2014-04-22 23:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Tue, Apr 8, 2014 at 1:31 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
>> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > [cc the XFS mailing list <xfs@oss.sgi.com>]
>> >
>> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
>> >> Hello,
>> >>
>> >> I'm currently investigating a MySQL performance degradation on XFS due
>> >> to file fragmentation.
>> >>
>> >> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
>> >> running on a 12 core box.
>> >>
>> >> xfs_info shows:
>> >> meta-data=/dev/sda4    isize=256    agcount=24, agsize=24024992 blks
>> >>                =                 sectsz=512   attr=2, projid32bit=0
>> >> data         =                 bsize=4096   blocks=576599552, imaxpct=5
>> >>                =                 sunit=16     swidth=512 blks
>> >> naming   = version 2     bsize=4096   ascii-ci=0
>> >> log         = internal       bsize=4096   blocks=281552, version=2
>> >>              =                   sectsz=512   sunit=16 blks, lazy-count=1
>> >> realtime = none            extsz=4096   blocks=0, rtextents=0
>> >>
>> >> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
>> >> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
>> >> The partition is 2TB in size and 40% full to simulate production.
>> >>
>> >> Here's a test program that appends 512KB like MySQL does (write and
>> >> then fsync). To exacerbate the issue, it loops a bunch of times:
>> >> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
>> >>
>> >> When run, this creates ~9500 extents most of length 1024.
>> >
>> > 1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
>> > the size of your writes.
>>
>> Yeah, 1024 basic blocks of 512 bytes each.
>>
>> >
>> > Could you post the output of the xfs_bmap commands you are using to
>> > get this information?
>>
>> I'm getting the extent information via xfs_bmap -v <file name>. Here's
>> a sample: https://gist.github.com/keyurdg/291b2a429f03c9a649ad
>
> Yup, looks like fragmented free space so it's only finding islands
> of 512kb of freespace near to the inode to allocate out of.
>
> Can you post the output of /proc/mounts so I can check what the
> allocator behaviour is being used?
>
>> >> cat'ing the
>> >> file to /dev/null after dropping the caches reads at an average of 75
>> >> MBps, way less than the hardware is capable of.
>> >
>> > What you are doing is "open-seekend-write-fsync-close".  You haven't
>> > told the filesystem you are doing append writes (O_APPEND, or the
>> > append inode flag) so it can't optimise for them.
>>
>> I tried this; adding O_APPEND the the open() in the pathological
>> pwrite.c makes no difference to the extent allocation and hence the
>> read performance.
>
> Yeah, I had a look at what XFS does and in the close path it doesn't
> know that the FD was O_APPEND because that state is available to the
> ->release path.
>
>> > You are also cleaning the file before closing it, so you are
>> > defeating the current heuristics that XFS uses to determine whether
>> > to remove speculative preallocation on close() - if the inode is
>> > dirty at close(), then it won't be removed. Hence speculative
>> > preallocation does nothing for your IO pattern (i.e. the allocsize
>> > mount option is completely useless). Remove the fsync and you'll
>> > see your fragmentation problem go away completely.
>>
>> I agree, but the MySQL data files (*.ibd) on our production cluster
>> are appended to in bursts and they have thousands of tiny (512KB)
>> extents. Getting rid of fsync is not possible given the use case.
>
> Sure - just demonstrating that it's the fsync that is causing the
> problems. i.e. it's application driven behaviour that the filesystem
> can't easily detect and optimise...
>
>> Arguably, MySQL does not close the files, but it writes out
>> infrequently enough that I couldn't make a good and small test case
>> for it. But the output of xfs_bmap is exactly the same as that of
>> pwrite.c
>
> Once you've fragmented free space, the only way to defrag it is to
> remove whatever is using the space between the small freespace
> extents. Usually the condition occurs when you intermix long lived
> files with short lived files - removing the short lived files
> results in fragmented free space that cannot be made contiguous
> until both the short lived and long lived data has been removed.
>
> If you want an idea of whether you've fragmented free space, use
> the xfs_db freespace command. To see what each ag looks like
> (change it to iterate all the ags in your fs):
>
> $ for i in 0 1 2 3; do echo "*** AG $i:" ; sudo xfs_db -c "freesp -a $i -s" /dev/vda; done
> *** AG 0:
>    from      to extents  blocks    pct
>       1       1     129     129   0.02
>       2       3     119     283   0.05
>       4       7     125     641   0.11
>       8      15      93     944   0.16
>      16      31      64    1368   0.23
>      32      63      53    2300   0.39
>      64     127      21    1942   0.33
>     128     255      16    3145   0.53
>     256     511       6    1678   0.28
>     512    1023       1     680   0.11
>   16384   32767       1   23032   3.87
>  524288 1048576       1  558825  93.93
> total free extents 629
> total free blocks 594967
> average free extent size 945.893
> *** AG 1:
>    from      to extents  blocks    pct
>       1       1     123     123   0.01
>       2       3     125     305   0.04
>       4       7      79     418   0.05
> ......
>
> And that will tell us what state your filesystem is in w.r.t.
> freespace fragmentation...
>
>> >> When I add a posix_fallocate before calling pwrite() as shown here
>> >> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
>> >> fragments an order of magnitude less (~30 extents), and cat'ing to
>> >> /dev/null proceeds at ~1GBps.
>> >
>> > That should make no difference on XFS as you are only preallocating
>> > the 512KB region beyond EOF that you are about to write into and
>> > hence both delayed allocation and preallocation have the same
>> > allocation target (the current EOF block). Hence in both cases the
>> > allocation patterns should be identical if the freespace extent they
>> > are being allocated out of are identical.
>> >
>> > Did you remove the previous test files and sync the filesystem
>> > between test runs so that the available freespace was identical for
>> > the different test runs? If you didn't then the filesystem allocated
>> > the files out of different free space extents and hence you'll get
>> > different allocation patterns...
>>
>> I do clear everything and sync the FS before every run, and this is
>> reproducible across multiple machines in our cluster.
>
> Which indicates that you've probably already completely fragmented
> free space in the filesystems.
>
>> I've re-run the
>> programs at least a 1000 times now, and every time get the same
>> results. For some reason even the tiny 512KB fallocate() seems to be
>> triggering some form of extent "merging" and placement.
>
> Both methods of allocation should be doing the same thing - they use
> exactly the same algorithm to select the next extent to allocate.
> Can you tell me the:
>
>         a) inode number of each of the target files that show
>         different output
>         b) the xfs_bmap output of the different files.
>
>> > Alternatively, set an extent size hint on the log files to define
>> > the minimum sized allocation (e.g. 32MB) and this will limit
>> > fragmentation without you having to modify the MySQL code at all...
>> >
>>
>> I tried enabling extsize to 32MB, but it seems to make no difference.
>> [kgovande@host]# xfs_io -c "extsize" /var/lib/mysql/xfs/plain_pwrite.werr
>> [33554432] /var/lib/mysql/xfs/plain_pwrite.werr
>> [kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr  | wc -l
>> 20001
>> [kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv plain_pwrite.werr > /dev/null
>> 9.77GB 0:02:41 [61.7MB/s] [========================================>] 100%
>
> Ah, extent size hints are not being considered in
> xfs_can_free_eofblocks(). I suspect they should be, and that would
> fix the problem.
>
> Can you add this to xfs_can_free_eofblocks() in your kernel and see
> what happens?
>
>
>         /* prealloc/delalloc exists only on regular files */
>         if (!S_ISREG(ip->i_d.di_mode))
>                 return false;
>
> +       if (xfs_get_extsz_hint(ip))
> +               return false;
> +
>         /*
>          * Zero sized files with no cached pages and delalloc blocks will not
>          * have speculative prealloc/delalloc blocks to remove.
>          */
>
> If that solves the problem, then I suspect that we might need to
> modify this code to take into account the allocsize mount option as
> well...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Hey Dave,

I spent some more time figuring out the MySQL write semantics; it
doesn't open/close files often, and my initial test script was incorrect.

It uses O_DIRECT and appends to the file; I modified my test binary to
take this into account here:
https://gist.github.com/keyurdg/54e0613e27dbe7946035
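
Roughly, it boils down to a loop like this (an illustrative sketch,
not the exact code in the gist): allocate a 4KB-aligned buffer, open
the file with O_DIRECT, and append 512KB chunks at the tracked EOF.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (512 * 1024)

int main(int argc, char **argv)
{
        /* file name and iteration count are illustrative */
        int fd = open(argc > 1 ? argv[1] : "dpwrite.tst",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        void *buf;
        off_t off = 0;
        int i;

        if (fd < 0 || posix_memalign(&buf, 4096, CHUNK))
                return 1;
        memset(buf, 'a', CHUNK);

        /* O_DIRECT needs an aligned buffer, offset and length */
        for (i = 0; i < 4096; i++) {
                if (pwrite(fd, buf, CHUNK, off) != CHUNK)
                        return 1;
                off += CHUNK;
        }
        close(fd);
        return 0;
}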

I've been testing on the 3.10 kernel. The setup is an empty 2 TB XFS partition.
[root@dbtest09 linux-3.10.37]# xfs_info /dev/sda4
meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=576599552, imaxpct=5
         =                       sunit=16     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=281552, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@dbtest09 linux-3.10.37]# cat /proc/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=49573060k,nr_inodes=12393265,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/sda2 / ext3 rw,noatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
/dev/sda1 /boot ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda4 /var/lib/mysql xfs rw,noatime,swalloc,attr2,inode64,logbsize=64k,sunit=128,swidth=4096,noquota 0 0

[root@dbtest09 linux-3.10.37]# xfs_io -c "extsize " /var/lib/mysql/xfs/
[0] /var/lib/mysql/xfs/

Here's what the first 3 AGs look like:
https://gist.github.com/keyurdg/82b955fb96b003930e4f

After a run of the dpwrite program, here's what the bmap looks like:
https://gist.github.com/keyurdg/11196897

The files have nicely interleaved with each other, mostly
XFS_IEXT_BUFSZ size extents. The average read speed is 724 MBps. After
defragmenting the file to 1 extent, the speed improves 30% to 1.09
GBps.
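
(For reference, an individual file can be defragmented in place with
xfs_fsr; the path below is just illustrative.)

[root@dbtest09 linux-3.10.37]# xfs_fsr -v /var/lib/mysql/xfs/dpwrite.0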

I noticed that XFS chooses the AG based on the parent directory's AG
and only the next sequential one if there's no space available. A
small patch that chooses the AG randomly fixes the fragmentation issue
very nicely. All of the MySQL data files are in a single directory and
we see this in Production where a parent inode AG is filled, then the
sequential next, and so on.

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index c8f5ae1..7841509 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -517,7 +517,7 @@ xfs_ialloc_ag_select(
         * to mean that blocks must be allocated for them,
         * if none are currently free.
         */
-       agno = pagno;
+       agno = ((xfs_agnumber_t) prandom_u32()) % agcount;
        flags = XFS_ALLOC_FLAG_TRYLOCK;
        for (;;) {
                pag = xfs_perag_get(mp, agno);

I couldn't find guidance on the internet on how many allocation groups
to use for a 2 TB partition, but this random selection won't scale for
many hundreds of concurrently written files, but for a few heavily
written-to files it works nicely.

I noticed that for non-DIRECT_IO + every write fsync'd, XFS would
cleverly keep doubling the allocation block size as the file kept
growing.

The "extsize" option seems to me a bit too static because the size of
tables we use varies widely and large new tables come and go.

Could the same doubling logic be applied for DIRECT_IO writes as well?
I tried out this extremely rough patch based on the delayed write
code; if you think this is reasonable I can try to make it more
acceptable. It provides very nice performance indeed; for a 2GB file,
here's what the bmap looks like:
https://gist.github.com/keyurdg/ac6ed8536f864c8fffc8

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8f8aaee..2682f53 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -118,6 +118,16 @@ xfs_alert_fsblock_zero(
        return EFSCORRUPTED;
 }

+STATIC int
+xfs_iomap_eof_want_preallocate(
+       xfs_mount_t     *mp,
+       xfs_inode_t     *ip,
+       xfs_off_t       offset,
+       size_t          count,
+       xfs_bmbt_irec_t *imap,
+       int             nimaps,
+       int             *prealloc);
+
 int
 xfs_iomap_write_direct(
        xfs_inode_t     *ip,
@@ -152,7 +162,32 @@ xfs_iomap_write_direct(
        offset_fsb = XFS_B_TO_FSBT(mp, offset);
        last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
        if ((offset + count) > XFS_ISIZE(ip)) {
-               error = xfs_iomap_eof_align_last_fsb(mp, ip, extsz, &last_fsb);
+               xfs_extlen_t    new_extsz = extsz;
+
+               if (!extsz) {
+                       int prealloc;
+                       xfs_bmbt_irec_t prealloc_imap[XFS_WRITE_IMAPS];
+
+                       error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
+                                               prealloc_imap, XFS_WRITE_IMAPS, &prealloc);
+
+                       if (prealloc) {
+                               xfs_fileoff_t   temp_start_fsb;
+                               int             temp_imaps = 1;
+
+                               temp_start_fsb = XFS_B_TO_FSB(mp, offset);
+                               if (temp_start_fsb)
+                                       temp_start_fsb--;
+
+                               error = xfs_bmapi_read(ip, temp_start_fsb, 1, prealloc_imap, &temp_imaps, XFS_BMAPI_ENTIRE);
+                               if (error)
+                                       return XFS_ERROR(error);
+
+                               new_extsz = prealloc_imap[0].br_blockcount << 1;
+                       }
+               }
+
+               error = xfs_iomap_eof_align_last_fsb(mp, ip, new_extsz, &last_fsb);
                if (error)
                        return XFS_ERROR(error);
        } else {

Cheers,
Keyur.

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-22 23:35         ` Keyur Govande
@ 2014-04-23  5:47           ` Dave Chinner
  -1 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-04-23  5:47 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, xfs

On Tue, Apr 22, 2014 at 07:35:34PM -0400, Keyur Govande wrote:
> On Tue, Apr 8, 2014 at 1:31 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
> >> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > [cc the XFS mailing list <xfs@oss.sgi.com>]
> >> >
> >> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
> >> >> Hello,
> >> >>
> >> >> I'm currently investigating a MySQL performance degradation on XFS due
> >> >> to file fragmentation.
.....
> >> > Alternatively, set an extent size hint on the log files to define
> >> > the minimum sized allocation (e.g. 32MB) and this will limit
> >> > fragmentation without you having to modify the MySQL code at all...
.....
> I spent some more time figuring out the MySQL write semantics and it
> doesn't open/close files often and initial test script was incorrect.
> 
> It uses O_DIRECT and appends to the file; I modified my test binary to
.....
> [root@dbtest09 linux-3.10.37]# xfs_io -c "extsize " /var/lib/mysql/xfs/
> [0] /var/lib/mysql/xfs/

So you aren't using extent size hints....

> 
> Here's how the first 3 AG's look like:
> https://gist.github.com/keyurdg/82b955fb96b003930e4f
> 
> After a run of the dpwrite program, here's how the bmap looks like:
> https://gist.github.com/keyurdg/11196897
> 
> The files have nicely interleaved with each other, mostly
> XFS_IEXT_BUFSZ size extents.

XFS_IEXT_BUFSZ has nothing to do with the size of allocations. It's
the size of the in-memory array buffer used to hold extent records.

What you are seeing is allocation interleaving according to the
pattern and size of the direct IOs being done by the application.
Which happen to be 512KB (1024 basic blocks) and the file being
written to is randomly selected.

> The average read speed is 724 MBps. After
> defragmenting the file to 1 extent, the speed improves 30% to 1.09
> GBps.

Sure. Now set an extent size hint of 32MB and try again.

> I noticed that XFS chooses the AG based on the parent directory's AG
> and only the next sequential one if there's no space available.

Yes, that's what the inode64 allocator does. It tries to keep files
in the same directory close together.

> A
> small patch that chooses the AG randomly fixes the fragmentation issue
> very nicely. All of the MySQL data files are in a single directory and
> we see this in Production where a parent inode AG is filled, then the
> sequential next, and so on.
> 
> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
> index c8f5ae1..7841509 100644
> --- a/fs/xfs/xfs_ialloc.c
> +++ b/fs/xfs/xfs_ialloc.c
> @@ -517,7 +517,7 @@ xfs_ialloc_ag_select(
>          * to mean that blocks must be allocated for them,
>          * if none are currently free.
>          */
> -       agno = pagno;
> +       agno = ((xfs_agnumber_t) prandom_u32()) % agcount;
>         flags = XFS_ALLOC_FLAG_TRYLOCK;
>         for (;;) {
>                 pag = xfs_perag_get(mp, agno);

Ugh. That might fix the interleaving, but it randomly distributes
related files over the entire filesystem. Hence if you have random
access to the files (like a database does) you now have random seeks
across the entire filesystem rather than within AGs. You basically
destroy any concept of data locality that the filesystem has.

> I couldn't find guidance on the internet on how many allocation groups
> to use for a 2 TB partition,

I've already given guidance on that. Choose to ignore it if you
will...

> but this random selection won't scale for
> many hundreds of concurrently written files, but for a few heavily
> written-to files it works nicely.
> 
> I noticed that for non-DIRECT_IO + every write fsync'd, XFS would
> cleverly keep doubling the allocation block size as the file kept
> growing.

That's the behaviour of delayed allocation.  By using buffered IO,
the application has delegated all responsibility for optimal layout
of the file to the filesystem, and this is the method XFS uses to
minimise fragmentation in that case.

Direct IO does not have delayed allocation - it allocates for the
current IO according to the bounds given by the IO, inode extent size
hints and alignment characteristics of the filesystem. It does not do
speculative allocation at all.

The principle of direct IO is to do exactly what the application asked,
not to second-guess what the application *might* need. Either the
application delegates everything to the filesystem (i.e. buffered
IO) or it assumes full responsibility for allocation behaviour and
IO coherency (i.e. direct IO).

IOWs, if you need to preallocate space beyond EOF that doubles in
size as the file grows to prevent fragmentation, then the
application should be calling fallocate(FALLOC_FL_KEEP_SIZE) at
the appropriate times or using extent size hints to define the
minimum allocation sizes for the direct IO.
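
An untested sketch of what that could look like on the application
side (the helper, the static counter and the 32MB floor are all
illustrative, not anything MySQL currently does):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Reserve space beyond EOF without changing the file size, doubling
 * the reservation whenever the file outgrows it.  When the file is
 * finished with, the unused tail can be trimmed back to the final
 * size with ftruncate().
 */
static off_t reserved;

static int ensure_space(int fd, off_t eof, off_t count)
{
        off_t want = eof + count;

        if (want <= reserved)
                return 0;
        reserved = reserved ? reserved * 2 : 32 * 1024 * 1024;
        if (reserved < want)
                reserved = want;
        /* KEEP_SIZE: allocate blocks past EOF, leave i_size alone */
        return fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, reserved);
}

The application would call it with the current EOF and IO size before
each append.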

> The "extsize" option seems to me a bit too static because the size of
> tables we use varies widely and large new tables come and go.

You can set the extsize per file at create time, but really, you
only need to set the extent size just large enough to obtain maximal
read speeds.
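
e.g. something like this (the 32MB value and the paths are just
examples) - set the hint on the directory so new files inherit it, or
on a file right after it is created and before it has any data:

# xfs_io -c "extsize 32m" /var/lib/mysql/xfs
# xfs_io -c "extsize 32m" /var/lib/mysql/xfs/new_table.ibd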

> Could the same doubling logic be applied for DIRECT_IO writes as well?

I don't think so. It would break many carefully tuned production
systems out there that rely directly  on the fact that XFS does
exactly what the application asks it to do when using direct IO.

IOWs, I think you are trying to optimise the wrong layer - put your
effort into making fallocate() do what the application needs to
prevent fragmentation rather than trying to hack the filesystem to do it
for you.  Not only will that improve performance on XFS, but it will
also improve performance on ext4 and any other filesystem that
supports fallocate and direct IO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-23  5:47           ` Dave Chinner
@ 2014-04-23  8:11             ` Dave Chinner
  -1 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-04-23  8:11 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, xfs

On Wed, Apr 23, 2014 at 03:47:19PM +1000, Dave Chinner wrote:
> On Tue, Apr 22, 2014 at 07:35:34PM -0400, Keyur Govande wrote:
> > I noticed that XFS chooses the AG based on the parent directory's AG
> > and only the next sequential one if there's no space available.
> 
> Yes, that's what the inode64 allocator does. It tries to keep files
> in the same directory close together.
> > @@ -517,7 +517,7 @@ xfs_ialloc_ag_select(
> >          * to mean that blocks must be allocated for them,
> >          * if none are currently free.
> >          */
> > -       agno = pagno;
> > +       agno = ((xfs_agnumber_t) prandom_u32()) % agcount;
> >         flags = XFS_ALLOC_FLAG_TRYLOCK;
> >         for (;;) {
> >                 pag = xfs_perag_get(mp, agno);
> 
> Ugh. That might fix the interleaving, but it randomly distributes
> related files over the entire filesystem. Hence if you have random
> access to the files (like a database does) you now have random seeks
> across the entire filesystem rather than within AGs. You basically
> destroy any concept of data locality that the filesystem has.

BTW, the inode32 allocator (it's a mount option) does this. It's no
longer the default because a) it's always had terrible behaviour for
general workloads compared to inode64 and b) we don't care enough
about 32 bit applications failing to use stat64() anymore to stay
with inode32 by default...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-22 23:35         ` Keyur Govande
  (?)
  (?)
@ 2014-04-23 11:48         ` Stewart Smith
  -1 siblings, 0 replies; 20+ messages in thread
From: Stewart Smith @ 2014-04-23 11:48 UTC (permalink / raw)
  To: Keyur Govande, Dave Chinner; +Cc: linux-fsdevel, xfs


Keyur Govande <keyurgovande@gmail.com> writes:
> I spent some more time figuring out the MySQL write semantics and it
> doesn't open/close files often and initial test script was incorrect.

MySQL will open/close files depending on some configuration parameters
and the number of tables that exist/are open/are in the working set.

If InnoDB tables, it's innodb_max_open_files (IIRC it's named that, or
something similar). If you have less than that number of tables and
innodb_file_per_table=true, then you'll never close. If you have the max
set to 10 times the working set of active tables, you're going to be
opening and closing files a lot - it's basically a LRU of unused tables
(open files).

> It uses O_DIRECT and appends to the file; I modified my test binary to
> take this into account here:
> https://gist.github.com/keyurdg/54e0613e27dbe7946035

You can also make it not use O_DIRECT, but that's generally a bad idea :)

-- 
Stewart Smith

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-23  5:47           ` Dave Chinner
  (?)
  (?)
@ 2014-04-23 19:05           ` Keyur Govande
  2014-04-23 22:52               ` Dave Chinner
  -1 siblings, 1 reply; 20+ messages in thread
From: Keyur Govande @ 2014-04-23 19:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

< re-sending to the distribution list for future reference >

On Wed, Apr 23, 2014 at 1:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Apr 22, 2014 at 07:35:34PM -0400, Keyur Govande wrote:
>> On Tue, Apr 8, 2014 at 1:31 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
>> >> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > [cc the XFS mailing list <xfs@oss.sgi.com>]
>> >> >
>> >> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
>> >> >> Hello,
>> >> >>
>> >> >> I'm currently investigating a MySQL performance degradation on XFS due
>> >> >> to file fragmentation.
> .....
>> >> > Alternatively, set an extent size hint on the log files to define
>> >> > the minimum sized allocation (e.g. 32MB) and this will limit
>> >> > fragmentation without you having to modify the MySQL code at all...
> .....
>> I spent some more time figuring out the MySQL write semantics and it
>> doesn't open/close files often and initial test script was incorrect.
>>
>> It uses O_DIRECT and appends to the file; I modified my test binary to
> .....
>> [root@dbtest09 linux-3.10.37]# xfs_io -c "extsize " /var/lib/mysql/xfs/
>> [0] /var/lib/mysql/xfs/
>
> So you aren't using extent size hints....
>
>>
>> Here's how the first 3 AG's look like:
>> https://gist.github.com/keyurdg/82b955fb96b003930e4f
>>
>> After a run of the dpwrite program, here's how the bmap looks like:
>> https://gist.github.com/keyurdg/11196897
>>
>> The files have nicely interleaved with each other, mostly
>> XFS_IEXT_BUFSZ size extents.
>
> XFS_IEXT_BUFSZ Has nothing to do with the size of allocations. It's
> the size of the in memory array buffer used to hold extent records.
>
> What you are seeing is allocation interleaving according to the
> pattern and size of the direct IOs being done by the application.
> Which happen to be 512KB (1024 basic blocks) and the file being
> written to is randomly selected.
>

I misspoke; I meant to say XFS_IEXT_BUFSZ (4096) blocks per extent. As
long as each pwrite is less than 2 MB, the extents do lay out in 4096
blocks every time.

>> The average read speed is 724 MBps. After
>> defragmenting the file to 1 extent, the speed improves 30% to 1.09
>> GBps.
>
> Sure. Now set an extent size hint of 32MB and try again.

I did these runs as well going by your last email suggestion, but I
was more interested in what you thought about the other ideas so
didn't include the results.

32MB gives 850 MBps and 64MB hits 980MBps. The peak read rate from the
hardware for a contiguous file is 1.45 GBps. I could keep on
increasing it until I hit a number I like, but I was looking to see if
it could be globally optimized.

>
>> I noticed that XFS chooses the AG based on the parent directory's AG
>> and only the next sequential one if there's no space available.
>
> Yes, that's what the inode64 allocator does. It tries to keep files
> in the same directory close together.
>
>> A
>> small patch that chooses the AG randomly fixes the fragmentation issue
>> very nicely. All of the MySQL data files are in a single directory and
>> we see this in Production where a parent inode AG is filled, then the
>> sequential next, and so on.
>>
>> diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
>> index c8f5ae1..7841509 100644
>> --- a/fs/xfs/xfs_ialloc.c
>> +++ b/fs/xfs/xfs_ialloc.c
>> @@ -517,7 +517,7 @@ xfs_ialloc_ag_select(
>>          * to mean that blocks must be allocated for them,
>>          * if none are currently free.
>>          */
>> -       agno = pagno;
>> +       agno = ((xfs_agnumber_t) prandom_u32()) % agcount;
>>         flags = XFS_ALLOC_FLAG_TRYLOCK;
>>         for (;;) {
>>                 pag = xfs_perag_get(mp, agno);
>
> Ugh. That might fix the interleaving, but it randomly distributes
> related files over the entire filesystem. Hence if you have random
> access to the files (like a database does) you now have random seeks
> across the entire filesystem rather than within AGs. You basically
> destroy any concept of data locality that the filesystem has.

I realize this is terrible for small files like a source code tree,
but for a database which usually has a many large files in the same
directory the seek cost is amortized by the benefit from a large
contiguous read. Would it be terrible to have this modifiable as a
setting (like extsize is) with the default being the inode64 behavior?

>
>> I couldn't find guidance on the internet on how many allocation groups
>> to use for a 2 TB partition,
>
> I've already given guidance on that. Choose to ignore it if you
> will...
>

Could you repeat it or post a link? The only relevant info I found via
Google is using as many AGs as hardware threads
(http://blog.tsunanet.net/2011/08/mkfsxfs-raid10-optimal-performance.html).

>> but this random selection won't scale for
>> many hundreds of concurrently written files, but for a few heavily
>> written-to files it works nicely.
>>
>> I noticed that for non-DIRECT_IO + every write fsync'd, XFS would
>> cleverly keep doubling the allocation block size as the file kept
>> growing.
>
> That's the behaviour of delayed allocation.  By using buffered IO,
> the application has delegated all responsibility for optimal layout
> of the file to the filesystem, and this is the method XFS uses to
> minimise fragmentation in that case.
>
> Direct IO does not have delayed allocation - it allocates for the
> current IO according to the bounds given by the IO, inode extent size
> hints and alignment characteristic of the filesystem. It does not do
> speculative allocation at all.
>
> The principle of direct IO is to do exactly what the application asked,
> not to second guess what the application *might* need. Either the
> application delegates everything to the filesystem (i.e. buffered
> IO) or it assumes full responsibility for allocation behaviour and
> IO coherency (i.e. direct IO).
>
> IOWs, If you need to preallocate  space beyond EOF that doubles in
> size as the file grows to prevent fragmentation, then the
> application should be calling fallocate(FALLOC_FL_KEEP_SIZE) at
> the appropriate times or using extent size hints to define the
> minimum allocation sizes for the direct IO.
>
>> The "extsize" option seems to me a bit too static because the size of
>> tables we use varies widely and large new tables come and go.
>
> You can set the extsize per file at create time, but really, you
> only need to set the extent size just large enough to obtain maximal
> read speeds.
>
>> Could the same doubling logic be applied for DIRECT_IO writes as well?
>
> I don't think so. It would break many carefully tuned production
> systems out there that rely directly  on the fact that XFS does
> exactly what the application asks it to do when using direct IO.
>
> IOWs, I think you are trying to optimise the wrong layer - put your
> effort into making fallocate() do what the application needs to
> prevent fragmentation rather trying to hack the filesystem to do it
> for you.  Not only will that improve performance on XFS, but it will
> also improve performance on ext4 and any other filesystem that
> supports fallocate and direct IO.
>

I've been experimenting with patches to MySQL to use fallocate with
FALLOC_FL_KEEP_SIZE and measuring the performance and fragmentation.

I also poked at the kernel because I assumed other DBs may also
benefit from the heuristic (speculative) allocation. Point taken about
doing the optimization in the application layer.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-23 19:05           ` Keyur Govande
@ 2014-04-23 22:52               ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-04-23 22:52 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, xfs

On Wed, Apr 23, 2014 at 03:05:00PM -0400, Keyur Govande wrote:
> On Wed, Apr 23, 2014 at 1:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Apr 22, 2014 at 07:35:34PM -0400, Keyur Govande wrote:
> >> Here's how the first 3 AG's look like:
> >> https://gist.github.com/keyurdg/82b955fb96b003930e4f
> >>
> >> After a run of the dpwrite program, here's how the bmap looks like:
> >> https://gist.github.com/keyurdg/11196897
> >>
> >> The files have nicely interleaved with each other, mostly
> >> XFS_IEXT_BUFSZ size extents.
> >
> > XFS_IEXT_BUFSZ Has nothing to do with the size of allocations. It's
> > the size of the in memory array buffer used to hold extent records.
> >
> > What you are seeing is allocation interleaving according to the
> > pattern and size of the direct IOs being done by the application.
> > Which happen to be 512KB (1024 basic blocks) and the file being
> > written to is randomly selected.
> 
> I misspoke; I meant to say XFS_IEXT_BUFSZ (4096) blocks per extent. As
> long as each pwrite is less than 2 MB, the extents do lay out in 4096
> blocks every time.

Sure, 4096 basic blocks per extent, but that has nothing to do with
XFS_IEXT_BUFSZ. All you've done is pick a random #define out of the
source code that matches the number you are seeing from xfs_bmap.
They are *completely unrelated*.

If your extents are laying out in 2MB chunks, then perhaps that's
because of allocation alignment being driven by stripe unit/stripe
width configuration, or maybe freespace is simply fragmented into
chunks that size.

> >> The average read speed is 724 MBps. After
> >> defragmenting the file to 1 extent, the speed improves 30% to 1.09
> >> GBps.
> >
> > Sure. Now set an extent size hint of 32MB and try again.
> 
> I did these runs as well going by your last email suggestion, but I
> was more interested in what you thought about the other ideas so
> didn't include the results.
> 
> 32MB gives 850 MBps and 64MB hits 980MBps. The peak read rate from the
> hardware for a contiguous file is 1.45 GBps. I could keep on
> increasing it until I hit a number I like, but I was looking to see if
> it could be globally optimized.

IOWs, if you hit the RAID controller readahead cache, it does
1.45GB/s. If you don't hit it, you see sustainable, real world disk
speeds you can get from the array.

> I realize this is terrible for small files like a source code tree,
> but for a database which usually has many large files in the same
> directory the seek cost is amortized by the benefit from a large
> contiguous read. Would it be terrible to have this modifiable as a
> setting (like extsize is) with the default being the inode64 behavior?

We do have that behaviour configurable. Like I said, use the inode32
allocator (mount option).
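
e.g., using the device and mountpoint from your /proc/mounts output:

# mount -o inode32 /dev/sda4 /var/lib/mysql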

> >> I couldn't find guidance on the internet on how many allocation groups
> >> to use for a 2 TB partition,
> >
> > I've already given guidance on that. Choose to ignore it if you
> > will...
> 
> Could you repeat it or post a link? The only relevant info I found via
> Google is using as many AGs as hardware threads

Sorry, I mixed you up with someone else asking about XFS
optimisation for database workloads a couple of days ago.

http://oss.sgi.com/archives/xfs/2014-04/msg00384.html

> (http://blog.tsunanet.net/2011/08/mkfsxfs-raid10-optimal-performance.html).

That's one of the better blog posts I've seen, but it's still got
quite a few subtle errors in it.

As it is, this shows why google is a *terrible source* of technical
information - google considers crappy blog posts to be more
authoritative than the mailing list posts written by subject matter
experts....

Indeed, this is where I'm trying to document all this sort of stuff
in a semi-official manner so as to avoid this "he said, she said"
sort of problem:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs-documentation.git;a=blob_plain;f=admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc;hb=HEAD

> >> Could the same doubling logic be applied for DIRECT_IO writes as well?
> >
> > I don't think so. It would break many carefully tuned production
> > systems out there that rely directly  on the fact that XFS does
> > exactly what the application asks it to do when using direct IO.
> >
> > IOWs, I think you are trying to optimise the wrong layer - put your
> > effort into making fallocate() do what the application needs to
> > prevent fragmentation rather trying to hack the filesystem to do it
> > for you.  Not only will that improve performance on XFS, but it will
> > also improve performance on ext4 and any other filesystem that
> > supports fallocate and direct IO.
> 
> I've been experimenting with patches to MySQL to use fallocate with
> FALLOC_FL_KEEP_SIZE and measuring the performance and fragmentation.

Great - I'm interested to know what your results are :)

> I also poked at the kernel because I assumed other DBs may also
> benefit from the heuristic (speculative) allocation. Point taken about
> doing the optimization in the application layer.

Postgres uses buffered IO (so it can make use of speculative allocation
for its files). However, that has its own share of problems
that direct IO doesn't have, so they aren't really in a better
position as a result of making that choice...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-22 23:35         ` Keyur Govande
                           ` (2 preceding siblings ...)
  (?)
@ 2014-04-24  6:54         ` Stefan Ring
  2014-04-24 21:49           ` Keyur Govande
  -1 siblings, 1 reply; 20+ messages in thread
From: Stefan Ring @ 2014-04-24  6:54 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, Linux fs XFS

I've become interested in this topic, as I'm also running MySQL with
O_DIRECT and innodb_file_per_table. Out of curiosity, I immediately
ran xfs_bmap on a moderately sized table space (34GB). It listed
around 30000 fragments, on average one for every MB.

I want to report what happened then: A flurry of activity started on
both disks (root/swap lives on one of them, the data volume containing
the MySQL files on another) and lasted for about two minutes.
Afterwards, all memory previously allocated to the file cache has
become free, and also everything XFS seems to keep buffered internally
(I think it's called SReclaimable) was released. Swap usage increased
only slightly. dmesg was silent during that time.

This is a 2.6.32-358.2.1.el6.x86_64 kernel with xfsprogs 3.1.1 (CentOS
6.4). The machine has 64GB of RAM (2 NUMA nodes) and 24 (virtual)
cores. Is this known behavior of xfs_bmap?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-24  6:54         ` Stefan Ring
@ 2014-04-24 21:49           ` Keyur Govande
  2014-05-15 20:25               ` Stefan Ring
  0 siblings, 1 reply; 20+ messages in thread
From: Keyur Govande @ 2014-04-24 21:49 UTC (permalink / raw)
  To: Stefan Ring; +Cc: linux-fsdevel, Linux fs XFS

On Thu, Apr 24, 2014 at 2:54 AM, Stefan Ring <stefanrin@gmail.com> wrote:
> I've become interested in this topic, as I'm also running MySQL with
> O_DIRECT and innodb_file_per_table. Out of curiosity, I immediately
> ran xfs_bmap on a moderately sized table space (34GB). It listed
> around 30000 fragments, on average one for every MB.
>
> I want to report what happened then: A flurry of activity started on
> both disks (root/swap lives on one of them, the data volume containing
> the MySQL files on another) and lasted for about two minutes.
> Afterwards, all memory previously allocated to the file cache has
> become free, and also everything XFS seems to keep buffered internally
> (I think it's called SReclaimable) was released. Swap usage increased
> only slightly. dmesg was silent during that time.
>
> This is a 2.6.32-358.2.1.el6.x86_64 kernel with xfsprogs 3.1.1 (CentOS
> 6.4). The machine has 64GB of RAM (2 NUMA nodes) and 24 (virtual)
> cores. Is this known behavior of xfs_bmap?

Interesting...it looks like your box flushed all of the OS buffer
cache. I am unable to reproduce this behavior on my test box with the
3.10.37 kernel. I also tried with 2.6.32-358.18.1.el6.x86_64 and
didn't hit the issue, but obviously our access patterns differ wildly.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: XFS fragmentation on file append
  2014-04-24 21:49           ` Keyur Govande
@ 2014-05-15 20:25               ` Stefan Ring
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Ring @ 2014-05-15 20:25 UTC (permalink / raw)
  To: Keyur Govande; +Cc: linux-fsdevel, Linux fs XFS

On Thu, Apr 24, 2014 at 11:49 PM, Keyur Govande <keyurgovande@gmail.com> wrote:
> On Thu, Apr 24, 2014 at 2:54 AM, Stefan Ring <stefanrin@gmail.com> wrote:
>> I've become interested in this topic, as I'm also running MySQL with
>> O_DIRECT and innodb_file_per_table. Out of curiosity, I immediately
>> ran xfs_bmap on a moderately sized table space (34GB). It listed
>> around 30000 fragments, on average one for every MB.
>>
>> I want to report what happened then: A flurry of activity started on
>> both disks (root/swap lives on one of them, the data volume containing
>> the MySQL files on another) and lasted for about two minutes.
>> Afterwards, all memory previously allocated to the file cache has
>> become free, and also everything XFS seems to keep buffered internally
>> (I think it's called SReclaimable) was released. Swap usage increased
>> only slightly. dmesg was silent during that time.
>>
>> This is a 2.6.32-358.2.1.el6.x86_64 kernel with xfsprogs 3.1.1 (CentOS
>> 6.4). The machine has 64GB of RAM (2 NUMA nodes) and 24 (virtual)
>> cores. Is this known behavior of xfs_bmap?
>
> Interesting...it looks like your box flushed all of the OS buffer
> cache. I am unable to reproduce this behavior on my test box with the
> 3.10.37 kernel. I also tried with 2.6.32-358.18.1.el6.x86_64 and
> didn't hit the issue, but obviously our access patterns differ wildly.

I tried it again, logging a few files in /proc periodically:
https://dl.dropboxusercontent.com/u/5338701/dev/xfs/memdump.tar.xz

Inside the archive, "memdump" is the simplistic script used to create
the other files. A few seconds in, I invoked xfs_bmap on the same file
again (this time weighing in at 81GB), and it spit out 36000
fragments. It took only a few seconds to completely drain 40 GB of
buffer memory.

cciss/c0d0 is the device where the XFS filesystem lives, while sda
contains root and swap.

If somebody could gain some insight from this, I'd be happy.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-05-15 20:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-07 22:53 XFS fragmentation on file append Keyur Govande
2014-04-08  1:50 ` Dave Chinner
2014-04-08  1:50   ` Dave Chinner
2014-04-08  3:42   ` Keyur Govande
2014-04-08  5:31     ` Dave Chinner
2014-04-08  5:31       ` Dave Chinner
2014-04-22 23:35       ` Keyur Govande
2014-04-22 23:35         ` Keyur Govande
2014-04-23  5:47         ` Dave Chinner
2014-04-23  5:47           ` Dave Chinner
2014-04-23  8:11           ` Dave Chinner
2014-04-23  8:11             ` Dave Chinner
2014-04-23 19:05           ` Keyur Govande
2014-04-23 22:52             ` Dave Chinner
2014-04-23 22:52               ` Dave Chinner
2014-04-23 11:48         ` Stewart Smith
2014-04-24  6:54         ` Stefan Ring
2014-04-24 21:49           ` Keyur Govande
2014-05-15 20:25             ` Stefan Ring
2014-05-15 20:25               ` Stefan Ring
