* Limits to growth
From: Chris Dunlop @ 2022-04-14  4:00 UTC (permalink / raw)
  To: linux-xfs

Hi,

I have a nearly full 30T xfs filesystem that I need to grow significantly, 
e.g. to, say, 256T, and potentially further in future, e.g. up to, say, 
1PB. Alternatively at some point I'll need to copy a LOT of data from the 
existing fs to a newly-provisioned much larger fs. If I'm going to need to 
copy data around I guess it's better to do it now, before there's a whole 
lot more data to copy.

According to Dave Chinner:

   https://www.spinics.net/lists/linux-xfs/msg20084.html
   Rule of thumb we've stated every time it's been asked in the past 10-15 
   years is "try not to grow by more than 10x the original size".

That thread also explains that the issue is the number of AGs.

Is it ONLY the number of AGs that's a concern when growing a fs?

E.g. for a fs starting in the 10s of TB that may need to grow 
substantially (e.g. >=10x), is it advisable to simply create it with the 
maximum available agsize, so it can then be grown to whatever multiple 
without worrying about XFS getting ornery?
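
To make that concrete, I mean something like this at mkfs time (device 
path hypothetical; size suffixes per mkfs.xfs(8)):

   # pin AGs at the 1TiB maximum up front, instead of letting mkfs
   # pick a smaller agsize based on the initial device size
   mkfs.xfs -d agsize=1t /dev/mapper/bigvol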

Of course, as Dave explains further in the thread, it would probably be 
better to just start with XFS on a large thin-provisioned volume in the 
first place, but that's not where I am currently. Sigh.

Looking at my fs and just considering the number of AGs (agcount)...

My original fs has:

meta-data=xxxx           isize=512    agcount=32, agsize=244184192 blks
          =               sectsz=4096  attr=2, projid32bit=1
          =               crc=1        finobt=1, sparse=1, rmapbt=1
          =               reflink=1    bigtime=0 inobtcount=0
data     =               bsize=4096   blocks=7813893120, imaxpct=5
          =               sunit=128    swidth=512 blks
naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
log      =internal log   bsize=4096   blocks=521728, version=2
          =               sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none           extsz=4096   blocks=0, rtextents=0

If I do a test xfs_growfs to 256T, it shows:

meta-data=xxxxx          isize=512    agcount=282, agsize=244184192 blks

Creating a new fs on 256T, I get:

meta-data=xxxxx          isize=512    agcount=257, agsize=268435328 blks

So growing the fs from 30T to 256T I end up with an agcount ~10% larger 
(and agsize ~10% smaller) than creating a 256T fs from scratch.
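
For anyone wanting to reproduce the test, a sparse file on a loop 
device should do it without needing real storage (paths hypothetical):

   truncate -s 30T /var/tmp/img       # sparse backing file
   mkfs.xfs -q /var/tmp/img
   DEV=$(losetup --find --show /var/tmp/img)
   mount "$DEV" /mnt/test
   truncate -s 256T /var/tmp/img      # enlarge the backing file
   losetup --set-capacity "$DEV"      # let the loop device see it
   xfs_growfs /mnt/test
   xfs_info /mnt/test                 # agsize stays at its mkfs value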

Just for the exercise, creating a new FS on 1P (i.e. 33x the current fs) 
gives:

meta-data=xxxxx          isize=512    agcount=1025, agsize=268435328 blks

I.e. it looks like for this case the max agsize is 268435328 blocks. So 
even if the current fs were to grow to a 1P or more, e.g. 30x - 60x 
original, I'm still only going to be ~10% worse off in terms of agcount 
than creating a large fs from scratch and copying all the data over.
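
Sanity-checking the arithmetic in shell - agcount is just the block 
count divided by agsize, rounded up:

   blocks=$(( 1024 * (1024 ** 4) / 4096 ))     # 1PiB of 4k blocks
   agsize=268435328
   echo $(( (blocks + agsize - 1) / agsize ))  # => 1025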

Is that really going to make a significant difference?

Cheers,

Chris


* Re: Limits to growth
From: Dave Chinner @ 2022-04-14  5:18 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 02:00:24PM +1000, Chris Dunlop wrote:
> Hi,
> 
> I have a nearly full 30T xfs filesystem that I need to grow significantly,
> e.g. to, say, 256T, and potentially further in future, e.g. up to, say, 1PB.

That'll be fun. :)

> Alternatively at some point I'll need to copy a LOT of data from the
> existing fs to a newly-provisioned much larger fs. If I'm going to need to
> copy data around I guess it's better to do it now, before there's a whole
> lot more data to copy.
> 
> According to Dave Chinner:
> 
>   https://www.spinics.net/lists/linux-xfs/msg20084.html
>   Rule of thumb we've stated every time it's been asked in the past 10-15
>   years is "try not to grow by more than 10x the original size".
> 
> That thread also explains that the issue is the number of AGs.
> 
> Is it ONLY the number of AGs that's a concern when growing a fs?

No.

> E.g. for a fs starting in the 10s of TB that may need to grow substantially
> (e.g. >=10x), is it advisable to simply create it with the maximum available
> agsize, so it can then be grown to whatever multiple without worrying
> about XFS getting ornery?

If you start with anything greater than 4-32TB, there's a good chance
you've already got maximally sized AGs....

> Looking at my fs and just considering the number of AGs (agcount)...
> 
> My original fs has:
> 
> meta-data=xxxx           isize=512    agcount=32, agsize=244184192 blks

Which is just short of maximally sized AGs. There's nothing to be
gained by reformatting to larger AGs here.

>          =               sectsz=4096  attr=2, projid32bit=1
>          =               crc=1        finobt=1, sparse=1, rmapbt=1
>          =               reflink=1    bigtime=0 inobtcount=0
> data     =               bsize=4096   blocks=7813893120, imaxpct=5
>          =               sunit=128    swidth=512 blks
> naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
> log      =internal log   bsize=4096   blocks=521728, version=2
>          =               sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none           extsz=4096   blocks=0, rtextents=0
> 
> If I do a test xfs_growfs to 256T, it shows:
> 
> meta-data=xxxxx          isize=512    agcount=282, agsize=244184192 blks
> 
> Creating a new fs on 256T, I get:
> 
> meta-data=xxxxx          isize=512    agcount=257, agsize=268435328 blks

Yup.

> So growing the fs from 30T to 256T I end up with an agcount ~10% larger (and
> agsize ~10% smaller) than creating a 256T fs from scratch.

Yup.

> Just for the exercise, creating a new FS on 1P (i.e. 33x the current fs)
> gives:
> 
> meta-data=xxxxx          isize=512    agcount=1025, agsize=268435328 blks

Yup.

> I.e. it looks like for this case the max agsize is 268435328 blocks.

Yup.

> So even
> if the current fs were to grow to a 1P or more, e.g. 30x - 60x original, I'm
> still only going to be ~10% worse off in terms of agcount than creating a
> large fs from scratch and copying all the data over.

Yup.

> Is that really going to make a significant difference?

No.

But there will be significant differences. e.g. think of the data
layout and free space distribution of a 1PB filesystem that is
90% full and has its data evenly distributed throughout its
capacity. Now consider the free space distribution of a 100TB
filesystem that has been filled to 99% and then grown by 100TB nine
times to a capacity of 90% @ 1PB. Where is all the free space?

That's right - the free space is only in the region that was
appended in the last 100TB grow operation. IOWs, 90% of the AGs are
completely full, and the newly added 10% are completely empty.

However, the allocation algorithms do linear target increments and
linear scans over *all AGs* trying to distribute the allocation
across the entire filesystem and to find the best available free
space for allocations.  When you have hundreds of AGs and only 10%
of them have usable free space, this becomes a problem.  e.g. if the
locality algorithm targets low-numbered AGs that are full (and it
will, because the target increments and wraps in a linear fashion),
then it might scan hundreds of AGs before it finds one of the
recently added high-numbered AGs with enough free space to
allocate from.
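
To put rough numbers on that (back-of-envelope, not the actual kernel
algorithm): the grown 1PB fs above has ~1025 maximally-sized AGs, and
only the ~100 added by the last grow have any free space:

   agcount=1025   # a 1PB fs with ~1TiB AGs
   empty=103      # AGs added by the last 100TB grow
   echo $(( agcount - empty ))   # => 922 full AGs scanned from AG 0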

Then consider that it is not unreasonable for the filesystem to hit
this case for thousands of consecutive allocations at a time (e.g.
untarring a tarball full of small files such as a kernel source tree
will trigger this); it may even occur for every single allocation
over a time span of minutes or even hours.

IOWs, the scanning algorithms don't really scale to large numbers of
AGs when most of the AGs are full and cannot be allocated from, and
repeatedly growing full filesystems pushes the algorithms into
highly undesirable corner cases much, much faster than filesystems
that started off with that capacity...

IOWs, growing by more than 10x really starts to push the limits of
the algorithms regardless of the AG count it results in.  It's not a
capacity thing - it's a reflection of the number of AGs with usable
free space in them and the algorithms used to find free space in
those AGs.

The algorithms can be fixed, but it's not been an important issue to
solve because so few people are using grow[*] in this manner - growing
once or twice is generally as much as occurs over the life of a
typical production filesystem...

Cheers,

Dave.

[*] Now, if you have a 2GB filesystem and you grow it to several TB
(that's a nasty antipattern we see quite frequently in cloud
deployments) then having 10,000+ tiny AGs has these linear scan
problems as well as all sorts of other scalability issues related to
the sheer number of AGs, but that's a different set of large-AG-count
problems....

-- 
Dave Chinner
david@fromorbit.com


* Re: Limits to growth
From: Chris Dunlop @ 2022-04-14  6:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On Thu, Apr 14, 2022 at 03:18:01PM +1000, Dave Chinner wrote:
> On Thu, Apr 14, 2022 at 02:00:24PM +1000, Chris Dunlop wrote:
>> Hi,
>>
>> I have a nearly full 30T xfs filesystem that I need to grow significantly,
>> e.g. to, say, 256T, and potentially further in future, e.g. up to, say, 1PB.
>
> That'll be fun. :)

Yeah, good problem to have.

>> So even if the current fs were to grow to a 1P or more, e.g. 30x - 60x 
>> original, I'm still only going to be ~10% worse off in terms of agcount 
>> than creating a large fs from scratch and copying all the data over.
>
> Yup.
>
>> Is that really going to make a significant difference?
>
> No.
>
> But there will be significant differences. e.g. think of the data
> layout and free space distribution of a 1PB filesystem that is
> 90% full and has its data evenly distributed throughout its
> capacity. Now consider the free space distribution of a 100TB
> filesystem that has been filled to 99% and then grown by 100TB nine
> times to a capacity of 90% @ 1PB. Where is all the free space?
>
> That's right - the free space is only in the region that was
> appended in the last 100TB grow operation. IOWs, 90% of the AGs are
> completely full, and the newly added 10% are completely empty.

Yep. But growing from 30T (@ 89% full) to 256T, then possibly to 512T 
and then 1P, shouldn't suffer from this issue as much - perhaps assisted 
by growing when it reaches, say, 60-70% full rather than waiting till 
90%.

In this case I might be in a decent position: the data is generally 
large-ish backup files and the older ones get deleted as they age out, so 
that should free up a significant amount of space in the 
currently-near-full AGs over time. It sounds like this, in conjunction 
with the balancing of the allocation algorithms, should end up 
homogenising the usage over all the AGs.
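
I can presumably watch that happening per-AG as it goes, something like 
this (assuming xfs_spaceman from a recent xfsprogs; mount point 
hypothetical):

   xfs_spaceman -c 'freesp -s -a 0' /backup     # an old, near-full AG
   xfs_spaceman -c 'freesp -s -a 281' /backup   # last AG after 256T grow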

Or, if there's a chance it'll get to 1P, would it be better to grow it out 
that far now (on thin storage - ceph rbd in this case)?
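
I.e. something like this, with nothing actually allocated until it's 
used (pool/image and mount point hypothetical):

   rbd resize --size 1024T rbd/backupvol   # thin: no space used up front
   xfs_growfs /backup                      # grow straight to ~1P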

On the other hand, I may want newer XFS feature goodies before I need 1P 
in this thing.

> IOWs, growing by more than 10x really starts to push the limits of
> the algorithms regardless of the AG count it results in.  It's not a
> capacity thing - it's a reflection of the number of AGs with usable
> free space in them and the algorithms used to find free space in
> those AGs.

I'm not sure if 89% full has already put me behind, but outside of that, 
doing a grow at, say, 60-70% full rather than 90+%, in conjunction with a 
decent amount of data turnover, should reduce or remove this problem, no?

I.e. would you say growing by more (possibly a lot more) than 10x would 
probably be ok *IF* you're starting with (near) maximally sized AGs, 
growing the fs when it reaches, say, 60-70% full, and with data that 
has a reasonable turnover cycle?

Cheers,

Chris

