* Pathological allocation pattern with direct IO
@ 2013-03-06 20:22 Jan Kara
  2013-03-06 22:01 ` Ben Myers
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Jan Kara @ 2013-03-06 20:22 UTC (permalink / raw)
  To: xfs

  Hello,

  one of our customers has an application that writes large (tens of GB)
files using direct IO in 16 MB chunks. They keep the fs around 80% full,
deleting the oldest files when they need to store new ones. Usually the
file can be stored in under 10 extents, but from time to time a
pathological case is triggered and the file ends up with a few thousand
extents (which naturally has an impact on performance). The customer
actually uses a 2.6.32-based kernel, but I reproduced the issue with a
3.8.2 kernel as well.
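
For concreteness, the write pattern boils down to roughly the following
sketch (the path, size and error handling are illustrative only):

/* Append a multi-GB file sequentially in 16 MB direct IO chunks. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (16 * 1024 * 1024UL)

int main(void)
{
        void *buf;
        long i, chunks = 2000;          /* ~31 GB of data */
        int fd = open("/raw_data/output.wb",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);

        posix_memalign(&buf, 4096, CHUNK);  /* O_DIRECT needs aligned buffers */
        memset(buf, 0, CHUNK);              /* file contents don't matter here */
        for (i = 0; i < chunks; i++)
                if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
                        return 1;
        close(fd);
        free(buf);
        return 0;
}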

I was analyzing why this happens and the filefrag for the file looks like:
Filesystem type is: 58465342
File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
(45481984 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0       13          4550656
   1 4550656 188136807  4550668 12562432
   2 17113088 200699240 200699238 622592
   3 17735680 182046055 201321831   4096
   4 17739776 182041959 182050150   4096
   5 17743872 182037863 182046054   4096
   6 17747968 182033767 182041958   4096
   7 17752064 182029671 182037862   4096
...
6757 45400064 154381644 154389835   4096
6758 45404160 154377548 154385739   4096
6759 45408256 252951571 154381643  73728 eof
/raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found

So we see that at some point the allocator starts giving us 16 MB chunks
backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. I
was able to track down the logic for two cases:

1) We start allocating blocks for a file. We want to allocate in the same
AG as the inode. First we try an exact allocation, which fails, so we try
an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
extent before the inode. So we start allocating 16 MB chunks from the end
of that free extent. From this moment on we are basically bound to keep
allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
exhaust the whole free extent.

2) A similar situation happens when we cannot grow the current extent any
further but there is a large free space somewhere before this extent in
the AG.

So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
that it outweighs pathological cases like the above? Or should it perhaps
be disabled for larger files or for direct IO?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: Pathological allocation pattern with direct IO
  2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara
@ 2013-03-06 22:01 ` Ben Myers
  2013-03-06 22:31 ` Peter Grandi
  2013-03-07  5:03 ` Dave Chinner
  2 siblings, 0 replies; 9+ messages in thread
From: Ben Myers @ 2013-03-06 22:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs

Hi Jan,

On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote:
>   one of our customers has an application that writes large (tens of GB)
> files using direct IO in 16 MB chunks. They keep the fs around 80% full,
> deleting the oldest files when they need to store new ones. Usually the
> file can be stored in under 10 extents, but from time to time a
> pathological case is triggered and the file ends up with a few thousand
> extents (which naturally has an impact on performance). The customer
> actually uses a 2.6.32-based kernel, but I reproduced the issue with a
> 3.8.2 kernel as well.
> 
> I was analyzing why this happens and the filefrag for the file looks like:
> Filesystem type is: 58465342
> File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
> (45481984 blocks, blocksize 4096)
>  ext logical physical expected length flags
>    0       0       13          4550656
>    1 4550656 188136807  4550668 12562432
>    2 17113088 200699240 200699238 622592
>    3 17735680 182046055 201321831   4096
>    4 17739776 182041959 182050150   4096
>    5 17743872 182037863 182046054   4096
>    6 17747968 182033767 182041958   4096
>    7 17752064 182029671 182037862   4096
> ...
> 6757 45400064 154381644 154389835   4096
> 6758 45404160 154377548 154385739   4096
> 6759 45408256 252951571 154381643  73728 eof
> /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found
> 
> So we see that at some point the allocator starts giving us 16 MB chunks
> backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. I
> was able to track down the logic for two cases:
> 
> 1) We start allocating blocks for a file. We want to allocate in the same
> AG as the inode. First we try an exact allocation, which fails, so we try
> an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
> extent before the inode. So we start allocating 16 MB chunks from the end
> of that free extent. From this moment on we are basically bound to keep
> allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
> exhaust the whole free extent.
> 
> 2) A similar situation happens when we cannot grow the current extent any
> further but there is a large free space somewhere before this extent in
> the AG.
> 
> So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
> that it outweighs pathological cases like the above? Or should it perhaps
> be disabled for larger files or for direct IO?

I believe we've seen something similar to #2 before:

# xfs_bmap -v /data/dbench.dat
/data/dbench.dat:
  EXT: FILE-OFFSET               BLOCK-RANGE              AG AG-OFFSET                  TOTAL FLAGS
    0: [0..150994943]:           2343559168..2494554111    5 (2048..150996991)      150994944 00011
    1: [150994944..468582399]:   2494556160..2812143615    5 (150999040..468586495) 317587456 00011
    2: [468582400..670957567]:   3078479872..3280855039    6 (266211328..468586495) 202375168 00011
    3: [670957568..671088639]:   3078346752..3078477823    6 (266078208..266209279)    131072 00011
    4: [671088640..671219711]:   3078215680..3078346751    6 (265947136..266078207)    131072 00011
    5: [671219712..671350783]:   3078084608..3078215679    6 (265816064..265947135)    131072 00011
    6: [671350784..671481855]:   3077953536..3078084607    6 (265684992..265816063)    131072 00011
    7: [671481856..671612927]:   3077822464..3077953535    6 (265553920..265684991)    131072 00011
    8: [671612928..671743999]:   3077691392..3077822463    6 (265422848..265553919)    131072 00011
    9: [671744000..671875071]:   3077560320..3077691391    6 (265291776..265422847)    131072 00011
...
2040: [4216979456..4502192127]: 6562093056..6847305727   14 (133120..285345791)    285212672 00011
2041: [4502192128..4685430783]: 6847307776..7030546431   14 (285347840..468586495) 183238656 00011
2042: [4685430784..4876402687]: 9183129600..9374101503   19 (277612544..468584447) 190971904 00011
2043: [4876402688..5344985087]: 9374230528..9842812927   20 (2048..468584447)      468582400 00011
2044: [5344985088..5813567487]: 9842941952..10311524351  21 (2048..468584447)      468582400 00011
2045: [5813567488..6282149887]: 10311653376..10780235775 22 (2048..468584447)      468582400 00011
2046: [6282149888..6750732287]: 10780364800..11248947199 23 (2048..468584447)      468582400 00011
2047: [6750732288..6767501311]: 11249076224..11265845247 24 (2048..16771071)        16769024 00011
2048: [6767501312..7219314687]: 11265845248..11717658623 24 (16771072..468584447)  451813376 00011
2049: [7219314688..7687766015]: 11717918720..12186370047 25 (133120..468584447)    468451328
2050: [7687766016..8156348415]: 12186499072..12655081471 26 (2048..468584447)      468582400 00011
2051: [8156348416..8449425407]: 12655210496..12948287487 27 (2048..293079039)      293076992 00011

In this case, the allocation in AG 6 starts near the middle of the AG and runs
through to the end.  At that point it begins to march backward through the AG
until the AG is exhausted.  Not ideal.  Maybe it would be better if
XFS_ALLOCTYPE_NEAR_BNO moved on to the next AG once it reached the end of
the current one.  We need to be careful though.  What is good for this workload
may have unintended consequences for another.

Could you post geometry information for the filesystem in question?
xfs_growfs -n /dev/sda

Thanks,
Ben

* Re: Pathological allocation pattern with direct IO
  2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara
  2013-03-06 22:01 ` Ben Myers
@ 2013-03-06 22:31 ` Peter Grandi
  2013-03-07  5:03 ` Dave Chinner
  2 siblings, 0 replies; 9+ messages in thread
From: Peter Grandi @ 2013-03-06 22:31 UTC (permalink / raw)
  To: Linux fs XFS

> [ ... ] an application that writes large (tens of GB) files
> using direct IO in 16 MB chunks. They keep the fs around 80%
> full, deleting the oldest files when they need to store new
> ones. Usually the file can be stored in under 10 extents but
> from time to time a pathological case is triggered and the
> file ends up with a few thousand extents (which naturally has
> an impact on performance). [ ... ]

They just want *guaranteed* (not just nearly-always) contiguity,
without preallocating, with incremental direct IO writes, while
keeping the filesystem at 80% full.

And a pony! :-)

* Re: Pathological allocation pattern with direct IO
  2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara
  2013-03-06 22:01 ` Ben Myers
  2013-03-06 22:31 ` Peter Grandi
@ 2013-03-07  5:03 ` Dave Chinner
  2013-03-07 10:24   ` Jan Kara
  2 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-03-07  5:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs

On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote:
>   Hello,
> 
>   one of our customers has an application that writes large (tens of GB)
> files using direct IO in 16 MB chunks. They keep the fs around 80% full,
> deleting the oldest files when they need to store new ones. Usually the
> file can be stored in under 10 extents, but from time to time a
> pathological case is triggered and the file ends up with a few thousand
> extents (which naturally has an impact on performance). The customer
> actually uses a 2.6.32-based kernel, but I reproduced the issue with a
> 3.8.2 kernel as well.
> 
> I was analyzing why this happens and the filefrag for the file looks like:
> Filesystem type is: 58465342
> File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
> (45481984 blocks, blocksize 4096)
>  ext logical physical expected length flags
>    0       0       13          4550656
>    1 4550656 188136807  4550668 12562432
>    2 17113088 200699240 200699238 622592
>    3 17735680 182046055 201321831   4096
>    4 17739776 182041959 182050150   4096
>    5 17743872 182037863 182046054   4096
>    6 17747968 182033767 182041958   4096
>    7 17752064 182029671 182037862   4096
> ...
> 6757 45400064 154381644 154389835   4096
> 6758 45404160 154377548 154385739   4096
> 6759 45408256 252951571 154381643  73728 eof
> /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found
> 
> So we see that at some point the allocator starts giving us 16 MB chunks
> backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. I
> was able to track down the logic for two cases:
> 
> 1) We start allocating blocks for a file. We want to allocate in the same
> AG as the inode. First we try an exact allocation, which fails, so we try
> an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
> extent before the inode. So we start allocating 16 MB chunks from the end
> of that free extent. From this moment on we are basically bound to keep
> allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
> exhaust the whole free extent.
> 
> 2) A similar situation happens when we cannot grow the current extent any
> further but there is a large free space somewhere before this extent in
> the AG.
> 
> So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
> that it outweighs pathological cases like the above? Or should it perhaps
> be disabled for larger files or for direct IO?

A well-known issue, first diagnosed about 15 years ago, IIRC. Simple
solution: use extent size hints.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Pathological allocation pattern with direct IO
  2013-03-07  5:03 ` Dave Chinner
@ 2013-03-07 10:24   ` Jan Kara
  2013-03-07 13:58     ` Mark Tinguely
  2013-03-08  1:35     ` Dave Chinner
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2013-03-07 10:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, xfs

On Thu 07-03-13 16:03:25, Dave Chinner wrote:
> On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote:
> >   Hello,
> > 
> >   one of our customers has an application that writes large (tens of GB)
> > files using direct IO in 16 MB chunks. They keep the fs around 80% full,
> > deleting the oldest files when they need to store new ones. Usually the
> > file can be stored in under 10 extents, but from time to time a
> > pathological case is triggered and the file ends up with a few thousand
> > extents (which naturally has an impact on performance). The customer
> > actually uses a 2.6.32-based kernel, but I reproduced the issue with a
> > 3.8.2 kernel as well.
> > 
> > I was analyzing why this happens and the filefrag for the file looks like:
> > Filesystem type is: 58465342
> > File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
> > (45481984 blocks, blocksize 4096)
> >  ext logical physical expected length flags
> >    0       0       13          4550656
> >    1 4550656 188136807  4550668 12562432
> >    2 17113088 200699240 200699238 622592
> >    3 17735680 182046055 201321831   4096
> >    4 17739776 182041959 182050150   4096
> >    5 17743872 182037863 182046054   4096
> >    6 17747968 182033767 182041958   4096
> >    7 17752064 182029671 182037862   4096
> > ...
> > 6757 45400064 154381644 154389835   4096
> > 6758 45404160 154377548 154385739   4096
> > 6759 45408256 252951571 154381643  73728 eof
> > /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found
> > 
> > So we see that at some point the allocator starts giving us 16 MB chunks
> > backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. I
> > was able to track down the logic for two cases:
> > 
> > 1) We start allocating blocks for a file. We want to allocate in the same
> > AG as the inode. First we try an exact allocation, which fails, so we try
> > an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
> > extent before the inode. So we start allocating 16 MB chunks from the end
> > of that free extent. From this moment on we are basically bound to keep
> > allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
> > exhaust the whole free extent.
> > 
> > 2) A similar situation happens when we cannot grow the current extent any
> > further but there is a large free space somewhere before this extent in
> > the AG.
> > 
> > So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
> > that it outweighs pathological cases like the above? Or should it perhaps
> > be disabled for larger files or for direct IO?
> 
> Well known issue, first diagnosed about 15 years ago, IIRC. Simple
> solution: use extent size hints.
  I thought someone must have hit it before. But I wasn't successful in
googling... I suggested using fallocate to the customer since they have a
good idea of the final file size in advance and in testing it gave better
results than extent size hints (plus it works for other filesystems as
well).
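
The preallocation itself is tiny - roughly the following sketch (the
helper name, path handling and error handling are just for illustration;
posix_fallocate() is the portable variant if other filesystems matter):

/* Preallocate the expected final size in one request before the 16 MB
 * direct IO writes start, so the allocator can hand out one large
 * contiguous range up front. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int create_preallocated(const char *path, off_t expected_size)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0)
                return -1;
        if (fallocate(fd, 0, 0, expected_size) < 0) {
                close(fd);
                return -1;
        }
        return fd;              /* ready for the sequential 16 MB writes */
}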

But really I was wondering about the usefulness of the
XFS_ALLOCTYPE_NEAR_BNO heuristic. Sure, the seek time depends on the
distance, so if we are speaking about allocating a single extent then
XFS_ALLOCTYPE_NEAR_BNO is useful, but once that strategy has allocated
two or three consecutive extents you've lost all the benefit and you
would have been better off allocating from the start of the free space.
Obviously we don't know the future in advance, but this resembles a
classical problem from approximation algorithms theory (the rent-or-buy
problem, where renting corresponds to allocating from the end of the
free space and taking the smaller cost, while buying corresponds to
allocating from the beginning, taking the higher cost but expecting you
won't have to pay anything in the future). And the theory of
approximation algorithms tells us that once we have paid as much in rent
as buying would cost, it becomes advantageous to buy at that moment, and
that gives you a 2-approximation algorithm (you can do even better - a
factor 1.58 approximation - if you use randomization, but I don't think
we want to go that way). So from this I'd say that switching off
XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents
backwards would work out better on average.
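
To make that concrete, the heuristic I have in mind is nothing more than
the following (purely illustrative pseudo-logic with invented names, not
actual XFS code):

/* Rent-or-buy applied to extent selection: keep taking the cheap
 * "nearby" allocation until the extents allocated backwards add up to
 * the one-off cost of jumping to the start of the free space, then
 * switch strategy. */
enum alloc_policy { ALLOC_NEAR_TAIL, ALLOC_FREE_EXTENT_START };

struct alloc_cursor {
        int backwards_runs;     /* consecutive backwards allocations seen */
};

enum alloc_policy choose_policy(struct alloc_cursor *cur, int was_backwards)
{
        if (!was_backwards) {
                cur->backwards_runs = 0;        /* forward progress - keep "renting" */
                return ALLOC_NEAR_TAIL;
        }
        /* After 2-3 backwards extents the accumulated "rent" matches the
         * price of one long jump, so "buy": allocate from the head of the
         * free extent from now on. */
        if (++cur->backwards_runs >= 2)
                return ALLOC_FREE_EXTENT_START;
        return ALLOC_NEAR_TAIL;
}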

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: Pathological allocation pattern with direct IO
  2013-03-07 10:24   ` Jan Kara
@ 2013-03-07 13:58     ` Mark Tinguely
  2013-03-08  1:35     ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Mark Tinguely @ 2013-03-07 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs

On 03/07/13 04:24, Jan Kara wrote:
> On Thu 07-03-13 16:03:25, Dave Chinner wrote:
>> On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote:
>>>    Hello,
>>>
>>>    one of our customers has an application that writes large (tens of GB)
>>> files using direct IO in 16 MB chunks. They keep the fs around 80% full,
>>> deleting the oldest files when they need to store new ones. Usually the
>>> file can be stored in under 10 extents, but from time to time a
>>> pathological case is triggered and the file ends up with a few thousand
>>> extents (which naturally has an impact on performance). The customer
>>> actually uses a 2.6.32-based kernel, but I reproduced the issue with a
>>> 3.8.2 kernel as well.
>>>
>>> I was analyzing why this happens and the filefrag for the file looks like:
>>> Filesystem type is: 58465342
>>> File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
>>> (45481984 blocks, blocksize 4096)
>>>   ext logical physical expected length flags
>>>     0       0       13          4550656
>>>     1 4550656 188136807  4550668 12562432
>>>     2 17113088 200699240 200699238 622592
>>>     3 17735680 182046055 201321831   4096
>>>     4 17739776 182041959 182050150   4096
>>>     5 17743872 182037863 182046054   4096
>>>     6 17747968 182033767 182041958   4096
>>>     7 17752064 182029671 182037862   4096
>>> ...
>>> 6757 45400064 154381644 154389835   4096
>>> 6758 45404160 154377548 154385739   4096
>>> 6759 45408256 252951571 154381643  73728 eof
>>> /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found
>>>
>>> So we see that at some point the allocator starts giving us 16 MB chunks
>>> backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. I
>>> was able to track down the logic for two cases:
>>>
>>> 1) We start allocating blocks for a file. We want to allocate in the same
>>> AG as the inode. First we try an exact allocation, which fails, so we try
>>> an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
>>> extent before the inode. So we start allocating 16 MB chunks from the end
>>> of that free extent. From this moment on we are basically bound to keep
>>> allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
>>> exhaust the whole free extent.
>>>
>>> 2) A similar situation happens when we cannot grow the current extent any
>>> further but there is a large free space somewhere before this extent in
>>> the AG.
>>>
>>> So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
>>> that it outweighs pathological cases like the above? Or should it perhaps
>>> be disabled for larger files or for direct IO?
>>
>> Well known issue, first diagnosed about 15 years ago, IIRC. Simple
>> solution: use extent size hints.
>    I thought someone must have hit it before. But I wasn't successful in
> googling... I suggested using fallocate to the customer since they have a
> good idea of the final file size in advance and in testing it gave better
> results than extent size hints (plus it works for other filesystems as
> well).
>
> But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO
> heuristic. Sure the seek time depends on the distance so if we are speaking
> about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but
> once that strategy would allocate two or three consecutive extents you've
> lost all the benefit and you would be better off if you started allocating
> from the start of the free space. Obviously we don't know the future in
> advance but this resembles a classical problem from approximation
> algorithms theory (rent-or-buy problem where renting corresponds to
> allocating from the end of free space and taking the smaller cost while
> buying corresponds to allocation from the beginning, taking the higher
> cost, but expecting you won't have to pay anything in future). And the
> theory of approximation algorithms tells us that once we pay for renting as
> much as buying will cost us, then at that moment it is advantageous to buy
> and that gives you 2-approximation algorithm (you can do even better -
> factor 1.58 approximation - if you use randomization but I don't think we
> want to go that way). So from this I'd say that switching off
> XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents
> backwards would work out better on average.
>
> 								Honza

Sounds like a candidate for a dynamic allocation policy,

  http://oss.sgi.com/archives/xfs/2013-01/msg00611.html

--Mark.

* Re: Pathological allocation pattern with direct IO
  2013-03-07 10:24   ` Jan Kara
  2013-03-07 13:58     ` Mark Tinguely
@ 2013-03-08  1:35     ` Dave Chinner
  2013-03-12 11:01       ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-03-08  1:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs

On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote:
> On Thu 07-03-13 16:03:25, Dave Chinner wrote:
> > On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote:
> > >   Hello,
> > > 
> > >   one of our customers has an application that writes large (tens of GB)
> > > files using direct IO in 16 MB chunks. They keep the fs around 80% full,
> > > deleting the oldest files when they need to store new ones. Usually the
> > > file can be stored in under 10 extents, but from time to time a
> > > pathological case is triggered and the file ends up with a few thousand
> > > extents (which naturally has an impact on performance). The customer
> > > actually uses a 2.6.32-based kernel, but I reproduced the issue with a
> > > 3.8.2 kernel as well.
.....
> > > 1) We start allocating blocks for a file. We want to allocate in the same
> > > AG as the inode. First we try an exact allocation, which fails, so we try
> > > an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
> > > extent before the inode. So we start allocating 16 MB chunks from the end
> > > of that free extent. From this moment on we are basically bound to keep
> > > allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
> > > exhaust the whole free extent.
> > > 
> > > 2) A similar situation happens when we cannot grow the current extent any
> > > further but there is a large free space somewhere before this extent in
> > > the AG.
> > > 
> > > So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
> > > that it outweighs pathological cases like the above? Or should it perhaps
> > > be disabled for larger files or for direct IO?
> > 
> > Well known issue, first diagnosed about 15 years ago, IIRC. Simple
> > solution: use extent size hints.
>   I thought someone must have hit it before. But I wasn't successful in
> googling... I suggested using fallocate to the customer since they have a
> good idea of the final file size in advance and in testing it gave better
> results than extent size hints (plus it works for other filesystems as
> well).

Extent size hints have the advantage of being effective when the
file size is not known ahead of time, or the application cannot be
modified to do preallocation. And they can be applied immediately to
any existing filesystem without needing downtime or code changes, so
the problem can be mitigated immediately without impacting ongoing
production....
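
For completeness: the hint can be applied administratively (e.g.
something like "xfs_io -c 'extsize 16m' <file>", or set on the parent
directory so that new files inherit it), or from the application via
the fsxattr ioctls. A minimal sketch of the latter - the helper is just
an illustration, header names vary between xfsprogs versions, and the
hint needs to be set before the file has any extents allocated:

/* Ask XFS to do allocations for this file in units of extsize_bytes,
 * e.g. 16 * 1024 * 1024. */
#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* struct fsxattr, XFS_IOC_FSGETXATTR/FSSETXATTR */

int set_extsize_hint(int fd, unsigned int extsize_bytes)
{
        struct fsxattr fsx;

        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
                return -1;
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;    /* honour fsx_extsize for this file */
        fsx.fsx_extsize = extsize_bytes;
        return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
}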

> But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO
> heuristic. Sure the seek time depends on the distance so if we are speaking
> about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but
> once that strategy would allocate two or three consecutive extents you've
> lost all the benefit and you would be better off if you started allocating
> from the start of the free space.

Sorry, I gave a quick answer because I didn't have time yesterday to
answer in depth.

The thing is, this backwards allocation is a low-level allocator
primitive, and its job is to find the nearest free extent to a target
block number. What you are seeing is exactly the expected behaviour of
XFS_ALLOCTYPE_NEAR_BNO. And because it is a low-level primitive, it is
not designed to take into account high-level allocation requirements
or do pattern detection. It just finds the nearest free extent to the
target block number, and that happens to be at the end of a free
extent when it is at a lower block number than the target being asked
for.

This allocator behaviour is desirable for metadata in btree format,
because it means that the btree clusters locally around both sides
of the parent object rather than spreading out over an entire AG
with blocks getting further and further apart as free space on the
ascending side of the inode gets used up.

In this case, it's far better to keep seek distances short by
maintaining locality to the parent object by allocating on the
descending side of the object.  And backwards seeks are handled
pretty well by the XFS metadata code - it uses readahead in the
btree/dirent traversals to allow the elevator to optimise the
resultant IO when walking the metadata, and hence backwards seeks
are not really a problem at all.

For data, random write workloads into sparse files really want this
behaviour as well, as it results in physical locality of extents. The
backwards allocation simply doesn't matter because the extents are
being randomly accessed anyway, and physical locality ultimately
determines random IO performance.

Further, high end storage arrays are capable of detecting backwards
read patterns and issue readahead for them, so this backwards
allocation doesn't always result in measurable performance
problems...

Hence repeated backwards allocation is, in many cases, desirable
behaviour, and in others it doesn't make any difference. And with it
being easy to work around for the few cases where it is tripped over
on aging filesystems, there hasn't been a great deal of need to
change the behaviour.

> Obviously we don't know the future in
> advance but this resembles a classical problem from approximation
> algorithms theory (rent-or-buy problem where renting corresponds to
> allocating from the end of free space and taking the smaller cost while
> buying corresponds to allocation from the beginning, taking the higher
> cost, but expecting you won't have to pay anything in future). And the
> theory of approximation algorithms tells us that once we pay for renting as
> much as buying will cost us, then at that moment it is advantageous to buy
> and that gives you 2-approximation algorithm (you can do even better -
> factor 1.58 approximation - if you use randomization but I don't think we
> want to go that way).

Way too mathematical for me, because...

> So from this I'd say that switching off
> XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents
> backwards would work out better on average.

.... common sense says that this is true.

I haven't looked at solving this problem in the past 5 years because
extent size hints are such a simple way of mitigating it.
However, given that I know how this allocation code works a whole lot
better than I did 5 years ago, I think I can see an easy way to
avoid this repeated backwards allocation pattern when extending
files.

What we have right now is this used space/free space layout
with the allocation context prev/want when extending the file:


            free extent				  free extent
	+-----------------+ prev         ...... +--------------+
	                  +------+
			         +------+
				   want

This is what we'll ask for as an XFS_ALLOCTYPE_THIS_BNO allocation,
but that is going to fail because there is no free space at wantbno.
This then falls back to XFS_ALLOCTYPE_NEAR_BNO with the same wantbno
target, and it selects the free extent before prev.

If we then look at what is happening in xfs_alloc_compute_diff(), we
have wantbno > freebno, and wantbno > freeend, so we end up in
either of these two cases:

        } else if (alignment > 1) {
                newbno1 = roundup(freeend - wantlen, alignment);
                if (newbno1 > freeend - wantlen &&
                    newbno1 - alignment >= freebno)
                        newbno1 -= alignment;
                else if (newbno1 >= freeend)
                        newbno1 = NULLAGBLOCK;
        } else
                newbno1 = freeend - wantlen;


Either way, the resultant extent that is cut out of the free extent
is this:

            free extent
	+-----------------+ prev
	                  +------+
			         +------+
                   +------+	   want
		     got

And hence we end up in a backwards allocation pattern. After
repeated iterations of this, we end up with a pattern like this:

            free extent
	+-----------------+ prev  prev-1 prev-2
	                  +------+------+------+
			         +------+
                   +------+	   want
		     got

and so it is likely to lead to an extended run of backwards allocation
for sequential ascending offsets. So, we would need to detect this
pattern in xfs_bmap_adjacent, where the wantbno is determined for
EOF allocation. That's a bit complex - we'll need to do extent tree
lookups because we need more information than we've got from the
initial lookup.

I do not want to get this information from the initial lookup and
pass it up to xfs_bmap_adjacent() because we are already using lots
of stack through this path and adding another 70-odd bytes for
another couple of extent maps is not going to fly. So to detect
backwards allocation, we'd have to take the CPU hit of doing another
lookup and then analysing the results for backwards allocation. That
can be done, but I'm wondering if we even need to detect this
case...

I suspect we don't even need to detect the backwards allocation
pattern at all - if we are falling back from an
XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be
allocated and so we *might* end up doing a backwards allocation.
Hence I don't think it will hurt at all to simply say "select
the start of the next candidate free extent" for file data
allocations in this case. That would completely avoid needing to
detect repeated backwards allocations and ensure that subsequent
allocations after the backwards jump to the start of the large free
extent then go forwards as intended...

And because we only do XFS_ALLOCTYPE_THIS_BNO allocations on data
allocations when extending a file, adding such a flag won't affect
random write or metadata allocation patterns....

So, once we've decided that backward allocation is possible, all we
need to do is inform XFS_ALLOCTYPE_NEAR_BNO/xfs_alloc_compute_diff()
to select the start of the free extent it finds rather than select
the tail of it. That will mean that rather than consume the entire
free extent in small chunks via backwards allocation, we'll do one
large backwards step and then consume the free extent in the
forwards direction instead.
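
In code terms the change boils down to something like this (a rough
sketch only, with invented names - not a patch against
xfs_alloc_compute_diff()):

/* Given a candidate free extent [freebno, freeend) and a wanted length,
 * pick where to allocate: the tail (current NEAR_BNO behaviour, closest
 * to the target) or, when a THIS_BNO data allocation at EOF has already
 * failed, the head, so later allocations run forwards through the
 * extent. */
typedef unsigned long long agblock_t;   /* stand-in block number type */

agblock_t pick_alloc_bno(agblock_t freebno, agblock_t freeend,
                         agblock_t wantlen, int prefer_extent_start)
{
        if (prefer_extent_start)
                return freebno;                 /* one backwards jump, then forwards */
        return freeend - wantlen;               /* today's backwards creep */
}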

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Pathological allocation pattern with direct IO
  2013-03-08  1:35     ` Dave Chinner
@ 2013-03-12 11:01       ` Jan Kara
  2013-03-14 23:36         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2013-03-12 11:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, xfs

On Fri 08-03-13 12:35:25, Dave Chinner wrote:
> On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote:
> > But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO
> > heuristic. Sure the seek time depends on the distance so if we are speaking
> > about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but
> > once that strategy would allocate two or three consecutive extents you've
> > lost all the benefit and you would be better off if you started allocating
> > from the start of the free space.
  <snip>
> Hence repeated backwards allocation is, in many cases, desirable
> behaviour, and in others it doesn't make any difference. And with it
> being easy to work around for the few cases where it is tripped over
> on aging filesystems, there hasn't been a great deal of need to
> change the behaviour.
  Thanks for the explanation!

> > Obviously we don't know the future in
> > advance but this resembles a classical problem from approximation
> > algorithms theory (rent-or-buy problem where renting corresponds to
> > allocating from the end of free space and taking the smaller cost while
> > buying corresponds to allocation from the beginning, taking the higher
> > cost, but expecting you won't have to pay anything in future). And the
> > theory of approximation algorithms tells us that once we pay for renting as
> > much as buying will cost us, then at that moment it is advantageous to buy
> > and that gives you 2-approximation algorithm (you can do even better -
> > factor 1.58 approximation - if you use randomization but I don't think we
> > want to go that way).
> 
> Way too mathematical for me, because...
> 
> > So from this I'd say that switching off
> > XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents
> > backwards would work out better on average.
> 
> .... common sense says that this is true.
> 
> I haven't looked at solving this problem in the past 5 years because
> extent size hints are such a simple way of mitigating the problem.
> However, given that I know how this allocation code works a whole lot
> better than I did 5 years ago, I think I can see an easy way to
> avoid this repeated backwards allocation pattern when extending
> files.
> 
> What we have right now is this used space/free space layout
> with the allocation context prev/want when extending the file:
> 
> 
>             free extent				  free extent
> 	+-----------------+ prev         ...... +--------------+
> 	                  +------+
> 			         +------+
> 				   want
> 
> This is what we'll ask for as an XFS_ALLOCTYPE_THIS_BNO allocation,
> but that is going to fail because there is no free space at wantbno.
> This then falls back to XFS_ALLOCTYPE_NEAR_BNO with the same wantbno
> target, and it selects the free extent before prev.
> 
> If we then look at what is happening in xfs_alloc_compute_diff(), we
> have wantbno > freebno, and wantbno > freeend, so we end up in
> either of these two cases:
> 
>         } else if (alignment > 1) {
>                 newbno1 = roundup(freeend - wantlen, alignment);
>                 if (newbno1 > freeend - wantlen &&
>                     newbno1 - alignment >= freebno)
>                         newbno1 -= alignment;
>                 else if (newbno1 >= freeend)
>                         newbno1 = NULLAGBLOCK;
>         } else
>                 newbno1 = freeend - wantlen;
> 
> 
> Either way, the resultant extent that is cut out of the free extent
> is this:
> 
>             free extent
> 	+-----------------+ prev
> 	                  +------+
> 			         +------+
>                    +------+	   want
> 		     got
> 
> And hence we end up in a backwards allocation pattern. After
> repeated iterations of this, we end up with a pattern like this:
> 
>             free extent
> 	+-----------------+ prev  prev-1 prev-2
> 	                  +------+------+------+
> 			         +------+
>                    +------+	   want
> 		     got
> 
> and so it is likely to lead to an extended run of backwards allocation
> for sequential ascending offsets. So, we would need to detect this
  Yup, that's what I understood in the code and saw in my experiments...

> pattern in xfs_bmap_adjacent, where the wantbno is determined for
> EOF allocation. That's a bit complex - we'll need to do extent tree
> lookups because we need more information than we've got from the
> initial lookup.
> 
> I do not want to get this information from the initial lookup and
> pass it up to xfs_bmap_adjacent() because we are already using lots
> of stack through this path and adding another 70-odd bytes for
> another couple of extent maps is not going to fly. So to detect
> backwards allocation, we'd have to take the CPU hit of doing another
> lookup and then analysing the results for backwards allocation. That
> can be done, but I'm wondering if we even need to detect this
> case...
> 
> I suspect we don't even need to detect the backwards allocation
> pattern at all - if we are falling back from an
> XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be
> allocated and so we *might* end up doing a backwards allocation.
> Hence I don't think it will hurt at all to simply say "select
> the start of the next candidate free extent" for file data
> allocations in this case. That would completely avoid needing to
> detect repeated backwards allocations and ensure that subsequent
> allocations after the backwards jump to the start of the large free
> extent then go forwards as intended...
  Yes, I was wondering whether this strategy would be good enough as
well. But there's one case I'm afraid of: suppose you have a small file
that gets written out in one request and fits in a single extent of
blocks. Then it is somewhat advantageous to allocate from the end of the
free extent because that is closer to the inode and thus reduces the
inode-to-data seek time. That's why I was suggesting maybe we could use
this logic only for larger files or something like that. Hmm, maybe if we
didn't use this strategy for the initial file extent, it would be enough.
That's already handled specially in the allocator so it shouldn't be a
problem to detect. What do you think?

> And because we only do XFS_ALLOCTYPE_THIS_BNO allocations on data
> allocations when extending file, adding such a flag won't affect
> random write or metadata allocation patterns....
> 
> So, once we've decided that backward allocation is possible, all we
> need to do is inform XFS_ALLOCTYPE_NEAR_BNO/xfs_alloc_compute_diff()
> to select the start of the free extent it finds rather than select
> the tail of it. That will mean that rather than consume the entire
> free extent in small chunks via backwards allocation, we'll do one
> large backwards step and then consume the free extent in the
> forwards direction instead.
  Yes, this would work great for me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: Pathological allocation pattern with direct IO
  2013-03-12 11:01       ` Jan Kara
@ 2013-03-14 23:36         ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2013-03-14 23:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs

On Tue, Mar 12, 2013 at 12:01:00PM +0100, Jan Kara wrote:
> On Fri 08-03-13 12:35:25, Dave Chinner wrote:
> > On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote:
> > > But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO
> > > heuristic. Sure the seek time depends on the distance so if we are speaking
> > > about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but
> > > once that strategy would allocate two or three consecutive extents you've
> > > lost all the benefit and you would be better off if you started allocating
> > > from the start of the free space.
>   <snip>
> > Hence repeated backwards allocation is, in many cases, desirable
> > behaviour, and in others it doesn't make any difference. And with it
> > being easy to work around for the few cases where it is tripped over
> > on aging filesystems, there hasn't been a great deal of need to
> > change the behaviour.
>   Thanks for the explanation!

No problems - there's a lot of stuff in my head that simply isn't
documented anywhere, and it would take months of work to document
it so I've never written it down...

> > I suspect we don't even need to detect the backwards allocation
> > pattern at all - if we are falling back from an
> > XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be
> > allocated and so we *might* end up doing a backwards allocation.
> > Hence I don't think it will hurt at all to simply say "select
> > the start of the next candidate free extent" for file data
> > allocations in this case. That would completely avoid needing to
> > detect repeated backwards allocations and ensure that subsequent
> > allocations after the backwards jump to the start of the large free
> > extent then go forwards as intended...
>   Yes, I was wondering about whether this strategy won't be good enough as
> well. But there's one case I'm afraid of: Suppose you have a small file so
> that it gets written out in one request and fits in a single extent of
> blocks. Then it is somewhat advantageous to allocate from the end of the
> free extent because that is closer to the inode and thus reduces
> inode-to-data seek time. That's why I was suggesting maybe we could use
> this logic only for larger files or something like that. Hum, maybe if we
> didn't use this strategy for initial file extent, it would be enough.
> That's already handled specially in the allocator so it shouldn't be a
> problem to detect. What do you think?

It took me a couple of readings to work out what you meant here.
You are wondering if selecting the start of a free extent for the
initial data for small files is the right thing to do as it may not
be "closest", right?

The thing is, we don't know if the inode is going to be written to
again, and so there is no context from which to determine whether we
will need to do an exact THIS_BNO allocation for a subsequent
allocation. Because XFS is designed for large files and
high bandwidth IO, it is assumed that contiguous allocation is the
most important factor in determining where to allocate a block.
Hence even for a small allocation, the assumption is that there will
be another contiguous allocation request and another and another....

Hence XFS always tries to place new files at the start of a large
freespace extent, not at the end, because it is assumed that files
will always grow in the short term.  The thing is, this is all
*very* complicated, and there's a lot of context that needs to be
understood that, as I said above, isn't really documented anywhere.
So, more context:

We are searching left/right because the extent size we want to
allocate resulted in a by-size btree lookup returning a btree block
that is not at the end of the tree, i.e. we've hit a btree block
that contains free space extents of roughly the same size as the
extent we want to allocate rather than free space extents much
larger than the size we need.

If we hit the last block, we have large contiguous free space
extents and so we do a linear search to select the freespace extent
that is *closest* to the target block, and allocate from the front
of it. Hence subsequent THIS_BNO allocations will work just fine as
there's plenty of contiguous free space at the end of the
allocation.

However, if we don't hit the last block it means there is some
amount of free space fragmentation present in the filesystem. This
means we can't use a by-size lookup to determine the closest free
extent that would fit the allocation. If freespace fragmentation is
bad, there might be tens of thousands of free extents of that size
in the tree, and selecting the closest would require a linear search
through all the blocks. It's prohibitively expensive from both a CPU
and IO point of view, so that isn't done.

[ Freespace fragmentation of this style will occur from having an
application that does concurrent, identically sized allocations that
are small compared to the eventual file size, e.g. using direct IO
in 16MB chunks to write files that are tens of GB in length. Over
time, extents will get interleaved and then when you remove some
files you end up with smaller freespace extents, which then get
interleaved into more files, and when some of them get removed you
get even smaller freespace extents. ]

It is when we see this sort of freespace fragmentation that the
NEAR_BNO allocation falls back to the second algorithm and searches
left and right from the target block in the by-block freespace tree.
This *may* find a large extent that is in the last block of the by-size
tree and this is the trigger for backwards allocation to occur. If
we were to have found the last by-size btree block in the initial
by-size lookup, this is the extent that would have been selected by
that algorithm, and we would have selected the front of the
freespace extent, not the end of it.

So for data allocation, it is quite clear that there is a
significant difference between the way the free space extent is
treated when we carve a chunk out of it when we compare the two
algorithms.  Hence if we simply change the way we select the best
part of the extent to use for data allocation to match the method
used when we select a large free space from the by-size tree, we'll
stop this repeated backwards allocation problem from occurring and
have similar extent selection behaviour for data allocation no
matter which algorithm finds the closest large freespace
extent....
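
Restating the two paths as a sketch (invented helper name, purely a
summary of the above - not the structure of the actual allocator code):

/* How a NEAR_BNO allocation picks its search strategy, conceptually. */
enum near_strategy { BY_SIZE_LAST_BLOCK, BY_BNO_LEFT_RIGHT };

enum near_strategy near_bno_strategy(int hit_last_by_size_block)
{
        /* Plenty of large free extents: take the closest one from the
         * last by-size btree block and allocate from its *front*, so
         * later THIS_BNO allocations can extend the file contiguously. */
        if (hit_last_by_size_block)
                return BY_SIZE_LAST_BLOCK;
        /* Free space is fragmented: search left/right in the by-block
         * btree from the target.  This can land on the *tail* of a large
         * extent at a lower block number, which is what starts the
         * repeated backwards allocation for file data. */
        return BY_BNO_LEFT_RIGHT;
}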

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 9 messages
2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara
2013-03-06 22:01 ` Ben Myers
2013-03-06 22:31 ` Peter Grandi
2013-03-07  5:03 ` Dave Chinner
2013-03-07 10:24   ` Jan Kara
2013-03-07 13:58     ` Mark Tinguely
2013-03-08  1:35     ` Dave Chinner
2013-03-12 11:01       ` Jan Kara
2013-03-14 23:36         ` Dave Chinner
