* Limits on agi->agi_level (and other btree depths?)
@ 2014-02-12 23:54 Eric Sandeen
  2014-02-13  1:01 ` Shaun Gosse
  2014-02-13  1:03 ` Dave Chinner
  0 siblings, 2 replies; 5+ messages in thread
From: Eric Sandeen @ 2014-02-12 23:54 UTC (permalink / raw)
  To: xfs-oss

If agi->agi_level exceeds XFS_BTREE_MAXLEVELS (8), bad things happen.  For example, in xfs_inobt_init_cursor() we read it directly off disk into a btree cursor:

xfs_inobt_init_cursor()
	cur->bc_nlevels = be32_to_cpu(agi->agi_level);

and then when it's time to tear it down we'll index into bc_bufs[] by whatever it said:

xfs_btree_del_cursor()
        for (i = 0; i < cur->bc_nlevels; i++) {
                if (cur->bc_bufs[i])
                        xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);

but bc_bufs[] in the xfs_btree_cur is of fixed size:

        struct xfs_buf  *bc_bufs[XFS_BTREE_MAXLEVELS];  /* buf ptr per level */

where

#define XFS_BTREE_MAXLEVELS     8       /* max of all btrees */

(which means this limits any btree depth, not just agi, right...)

...

So I ran across this on an intentionally corrupted image, but I don't know what stops us from going past XFS_BTREE_MAXLEVELS in normal operations (unless we just hit filesystem limits before then?)

i.e. xfs_btree_new_root() does:

        /* Set the root in the holding structure  increasing the level by 1. */
        cur->bc_ops->set_root(cur, &lptr, 1);

and ->set_root / xfs_inobt_set_root() will happily increase agi_level; I don't see anything limiting it to XFS_BTREE_MAXLEVELS.
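From memory, that path boils down to roughly this (paraphrased, not a verbatim copy of xfs_inobt_set_root()):

	/* paraphrase of xfs_inobt_set_root(); not verbatim */
	agi->agi_root = nptr->s;
	be32_add_cpu(&agi->agi_level, inc);	/* no upper bound checked */
	xfs_ialloc_log_agi(cur->bc_tp, agbp, XFS_AGI_ROOT | XFS_AGI_LEVEL);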

I guess XFS_BTREE_MAXLEVELS is just an arbitrary in-memory limit, not a limit of the underlying disk structures, but as it stands, we should be sure that we don't exceed it, right?

I was going to put that limit into xfs_agi_verify, but realized that I wasn't sure if we could actually exceed that depth in normal operations.
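What I had in mind was something along these lines (rough sketch only, untested):

	/* rough sketch for xfs_agi_verify(); untested */
	if (be32_to_cpu(agi->agi_level) < 1 ||
	    be32_to_cpu(agi->agi_level) > XFS_BTREE_MAXLEVELS)
		return false;	/* treat it as corruption */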

(cue dchinner working out that 9 levels is 59 bazillion jillion items, and will never be hit?)

Thanks,
-Eric


* RE: Limits on agi->agi_level (and other btree depths?)
  2014-02-12 23:54 Limits on agi->agi_level (and other btree depths?) Eric Sandeen
@ 2014-02-13  1:01 ` Shaun Gosse
  2014-02-13  1:03 ` Dave Chinner
  1 sibling, 0 replies; 5+ messages in thread
From: Shaun Gosse @ 2014-02-13  1:01 UTC (permalink / raw)
  To: Eric Sandeen, xfs-oss

>  (cue dchinner working out that 9 levels is 59 bazillion jillion items, and will never be hit?)

I don't know the XFS code or how it implements B-trees, so this is a very rough estimate that would apply to any sort of tree. Since we want to know when we're forced to exceed a depth of 8, assume a full tree for simplicity, and say the branching factor of the whole tree equals the maximum depth. (If it's higher, you have more room, of course.) If it is equal, then you have 8**8 leaf nodes (calling the root node depth zero) before you have to add a ninth level, or else increase the branching factor, which would be the way to go if the depth limit could actually be hit.

That gives you about 16.8 million leaf nodes to work with (8**8 = 16,777,216). Of course, if you're storing multiple entries per node, you get a constant factor on top of that.

In any event, it sounds like it may be worth considering whether the limit could ever be reached, and how to extend it if so. Given the discussion that's occurred about stack size, I could see recursing on ever-deeper trees becoming problematic, although perhaps tail recursion could be used for anything that doesn't need to scan the full tree (if that's not already done).

Given that XFS has a theoretical maximum filesystem size of 18 exabytes, a hypothetical filesystem composed of 1k-sized files (just to be maximally difficult; worst-case analysis) would have about 2 x 10**16 inodes to store. I'm sure a lot of other things would break first, but it would also blow past the number of leaves you can have in a tree of maximum height 8, unless your branching factor goes above 50. And I doubt more than 100-1000 entries are being squeezed into a leaf, if that.
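If anyone wants to play with the numbers, the arithmetic is trivial; here's a throwaway standalone sketch (nothing XFS-specific, and the per-leaf entry counts are just my guesses from above):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double inodes = 2e16;		/* ~18EB of 1k files */
		int per_leaf[] = { 100, 1000 };	/* guessed entries per leaf */

		for (int i = 0; i < 2; i++) {
			double leaves = inodes / per_leaf[i];
			/* need branching^8 >= leaves for a height-8 tree */
			printf("%d entries/leaf -> branching factor >= %.0f\n",
			       per_leaf[i], ceil(pow(leaves, 1.0 / 8.0)));
		}
		return 0;
	}

(Build with -lm; it prints 62 and 46, which brackets the "above 50" ballpark above.)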

Sorry, I don't normally jump into the discussions here, just follow along, since I'm not very familiar with the XFS code, but this seemed like an interesting question that could be partially addressed without any familiarity with the internals.

I will be curious to read what XFS is actually doing in this area in greater detail. And I look forward to the xfstests case (and rig) that can manage to exercise the limit... Is there any known limit to the number of files? http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s04.html just lists maximum file size and filesystem size. If nothing else, it would be worth determining the limit and documenting it.

Cheers,
-Shaun


* Re: Limits on agi->agi_level (and other btree depths?)
  2014-02-12 23:54 Limits on agi->agi_level (and other btree depths?) Eric Sandeen
  2014-02-13  1:01 ` Shaun Gosse
@ 2014-02-13  1:03 ` Dave Chinner
  2014-02-13  1:16   ` Shaun Gosse
  2014-02-13  1:53   ` Eric Sandeen
  1 sibling, 2 replies; 5+ messages in thread
From: Dave Chinner @ 2014-02-13  1:03 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

On Wed, Feb 12, 2014 at 05:54:19PM -0600, Eric Sandeen wrote:
> If agi->agi_level exceeds XFS_BTREE_MAXLEVELS (8), bad things
> happen.  For example in xfs_inobt_init_cursor() we read it
> directly off disk into a btree cursor:
> 
> xfs_inobt_init_cursor()
> 	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
> 
> and then when it's time to tear it down we'll index into bc_bufs[]
> by whatever it said:
> 
> xfs_btree_del_cursor()
>         for (i = 0; i < cur->bc_nlevels; i++) {
>                 if (cur->bc_bufs[i])
>                         xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> 
> but bc_bufs[] in the xfs_btree_cur is of fixed size:
> 
>         struct xfs_buf  *bc_bufs[XFS_BTREE_MAXLEVELS];  /* buf ptr per level */
> 
> where
> 
> #define XFS_BTREE_MAXLEVELS     8       /* max of all btrees */
> 
> (which means this limits any btree depth, not just agi, right...)
> 
> ...
> So I ran across this on an intentionally corrupted image, but I
> don't know what stops us from going past XFS_BTREE_MAXLEVELS in
> normal operations (unless we just hit filesystem limits before
> then?)

Right, we hit filesystem limits before we get deeper than 8 levels.

For an AGI btree, ptr/key pairs in a node use 8 bytes, while records
use 16 bytes. Hence worst case is a 512 byte blocksize filesystem
where we get roughly 30 records/leaf and 60 ptr/key pairs per node.
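(Roughly (512 - header) / 16 ~= 30 records and (512 - header) / 8 ~=
60 key/ptr pairs per block; the exact header size doesn't change the
order of magnitude.)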

So, the number of extents we can reference at different levels are:

level	number
0	30
1	60 * 30
2	60^2 * 30
....
n	60^n * 30

In a 1TB AG, worst case freespace is alternating single block free
extents, so that's 1TB/blocksize/2 = (2^31*2^9) / 2^9 / 2^1 = 2^30
extent records for a 512 byte blocksize filesystem.

2^30  :  1,073,741,824 records
n = 5 :	23,328,000,000 records
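(If anyone wants to check that arithmetic, a throwaway loop does it;
nothing XFS-specific:)

	/* smallest n such that 60^n * 30 covers 2^30 records */
	unsigned long long recs = 30, target = 1ULL << 30;
	int n = 0;

	while (recs < target) {
		recs *= 60;
		n++;
	}
	/* ends with n = 5, recs = 23,328,000,000 */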

So the maximum possible number of levels for an AGI btree is 6 (5
node levels + the leaf level). The AGF btrees are the same (32 bit
keys/ptrs, 16 byte records).

The AGF freespace trees are more dense - the records are only 8
bytes so there's 60/leaf. It still needs 6 levels though.

The extent btree (bmbt) can index 54 bits of file offset, so the
worst case is single block fragments, i.e. 2^54 extents. Records are
16 bytes, keys and pointers are 8 bytes each, hence 30/30 are the
numbers for a 512 byte block size fs. At level n, the extents are
30^n * 30 = 30^(n+1). So, solving 2^54 <= 30^(n+1) gives n = 11.

So in theory we could overflow XFS_BTREE_MAXLEVELS here, but in
practice this worst case requires 30^8 extents in memory, and that
requires this much RAM:

	656,100,000,000 * sizeof(struct xfs_bmbt_irec) bytes
	= 656,100,000,000 * 32 bytes
	~= 19TiB

And requires reading in from disk 512 bytes at a time.  Nothing in
XFS^WLinux scales to indexing 19TiB of extent metadata with any
efficiency in this manner. And let's face it, if you have a 300TiB
file in single 512 byte block fragments, you've got bigger problems.
The least of which being that you should be using a larger block
size...

Back in reality, if we take a 4k block size, the bmbt tree has a
240/240 breakdown, which means that the equation is actually 2^54 <=
240^(n+1), and in that case n = 6, so we don't overflow
XFS_BTREE_MAXLEVELS at all for the normal mkfs cases.
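(As a quick sanity check on both solutions: n + 1 = ceil(54 /
log2(fanout)), which gives 12 for a fanout of 30 and 7 for a fanout
of 240, i.e. n = 11 and n = 6 as above.)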

> i.e. xfs_btree_new_root() does:
> 
>         /* Set the root in the holding structure  increasing the level by 1. */
>         cur->bc_ops->set_root(cur, &lptr, 1);
> 

> and ->set_root / xfs_inobt_set_root() will happily increase
> agi_level; I don't see anything limiting it to
> XFS_BTREE_MAXLEVELS.

Physical limits of the AGI due to the 1TB size of the AG.

> I guess XFS_BTREE_MAXLEVELS is just an arbitrary in-memory limit,
> not a limit of the underlying disk structures, but as it stands,
> we should be sure that we don't exceed it, right?

If you really want to enforce XFS_BTREE_MAXLEVELS, add checks
into xfs_alloc_compute_maxlevels(), xfs_ialloc_compute_maxlevels()
and xfs_bmap_compute_maxlevels() to constrain the limits in the
struct xfs_mount and validate the on-disk values based on the
values in the struct xfs_mount.
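Roughly this shape (sketch only, untested, field names from memory):

	/* e.g. in xfs_ialloc_compute_maxlevels(), after filling it in */
	ASSERT(mp->m_in_maxlevels <= XFS_BTREE_MAXLEVELS);

	/* and in xfs_agi_verify(), check the on-disk value against it */
	if (be32_to_cpu(agi->agi_level) > mp->m_in_maxlevels)
		return false;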

> I was going to put that limit into xfs_agi_verify, but realized
> that I wasn't sure if we could actually exceed that depth in
> normal operations.
> 
> (cue dchinner working out that 9 levels is 59 bazillion jillion
> items, and will never be hit?)

Yep, done that ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: Limits on agi->agi_level (and other btree depths?)
  2014-02-13  1:03 ` Dave Chinner
@ 2014-02-13  1:16   ` Shaun Gosse
  2014-02-13  1:53   ` Eric Sandeen
  1 sibling, 0 replies; 5+ messages in thread
From: Shaun Gosse @ 2014-02-13  1:16 UTC (permalink / raw)
  To: Dave Chinner, Eric Sandeen; +Cc: xfs-oss

Dave,

Great read; thanks! It makes sense that the 1TB AG limit gets hit far sooner than the depth limit. Given those numbers, XFS does look feasible for billions of files, at least as far as this scaling is concerned.

And as you point out, the arbitrary worst-case scenario hits other performance and logic issues anyhow (pathological allocation patterns essentially).

Is there any xfstest which tries to see what happens when you fill up an AG? 1TB wouldn't be an entirely impossible amount of data to write, although generating it would of course take a long time. It might double as an overall stress test for a worst-case type of scenario. This is an XFS 101 question, but does / can it split the AG once it approaches that limit? (In that case the B-tree would be reduced as well, so it wouldn't cause further problems; I'm just curious how a filling AG is handled.)

Cheers,
-Shaun

-----Original Message-----
From: xfs-bounces@oss.sgi.com [mailto:xfs-bounces@oss.sgi.com] On Behalf Of Dave Chinner
Sent: Wednesday, February 12, 2014 7:03 PM
To: Eric Sandeen
Cc: xfs-oss
Subject: Re: Limits on agi->agi_level (and other btree depths?)

On Wed, Feb 12, 2014 at 05:54:19PM -0600, Eric Sandeen wrote:
> If agi->agi_level exceeds XFS_BTREE_MAXLEVELS (8), bad things happen.  
> For example in xfs_inobt_init_cursor() we read it directly off disk 
> into a btree cursor:
> 
> xfs_inobt_init_cursor()
> 	cur->bc_nlevels = be32_to_cpu(agi->agi_level);
> 
> and then when it's time to tear it down we'll index into bc_bufs[] by
> whatever it said:
> 
> xfs_btree_del_cursor()
>         for (i = 0; i < cur->bc_nlevels; i++) {
>                 if (cur->bc_bufs[i])
>                         xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> 
> but bc_bufs[] in the xfs_btree_cur is of fixed size:
> 
>         struct xfs_buf  *bc_bufs[XFS_BTREE_MAXLEVELS];  /* buf ptr per 
> level */
> 
> where
> 
> #define XFS_BTREE_MAXLEVELS     8       /* max of all btrees */
> 
> (which means this limits any btree depth, not just agi, right...)
> 
> ...
> So I ran across this on an intentionally corrupted image, but I don't 
> know what stops us from going past XFS_BTREE_MAXLEVELS in normal 
> operations (unless we just hit filesystem limits before
> then?)

Right, we hit filesystem limits before we get deeper than 8 levels.

For an AGI btree, ptr/key pairs in a node use 8 bytes, while records use 16 bytes. Hence worst case is a 512 byte blocksize filesystem where we get roughly 30 records/leaf and 60 ptr/key pairs per node.

So, the number of extents we can reference at different levels are:

level	number
0	30
1	60 * 30
2	60^2 * 30
....
n	60^n * 30

In 1TB AG, worst case freespace is alternate single block freespace, so that's 1TB/blocksize/2 = (2^31*2^9) / 2^9 / 2^1 = 2^30 extent records for a 512 byte blocksize filesystem.

2^30  :  1,073,741,824 records
n = 5 :	23,328,000,000 records

So the maximum possible number of levels for an AGI btree is 6 (5 node levels + leaf level). The AGF btrees are the same (32 bit key/ptrs, 16 byte records)

The AGF freespace trees are more dense - the records are only 8 bytes so there's 60/leaf. It still needs 6 levels though.

For the extent btree (bmbt) it can index 54 bits of file offset, so worst case is single block fragments so 2^54 extents. Records are 16 bytes, key and pointers are 8 bytes each. Hence 30/30 are the numbers for a 512 byte block size fs. At level n, the extents are 30^n * 30 = 30^(n+1). So, solving 2^54 <= 30^(n+1) gives n = 11.

So in theory we could overflow XFS_BTREE_MAXLEVELS here, but in practice this worst case requires 30^8 extents in memory, and that requires this much RAM:

	656,100,000,000 * sizeof(struct xfs_bmbt_irec) bytes
	= 656,100,000,000 * 32 bytes
	~= 19TiB

And requires reading in from disk 512 bytes at a time.  Nothing in XFS^WLinux scales to indexing 19TiB of extent metadata with any efficiency in this manner. And let's face it, if you have a 300TiB file in single 512 byte block fragments, you've got bigger problems.
The least of which being that you should be using a larger block size...

Back in reality, if we take a 4k block size, the bmbt tree has a
240/240 breakdown, which means that the equation is actually 2^54 <= 240^(n+1), and in that case n = 6, so we don't overflow XFS_BTREE_MAXLEVELS at all for the normal mkfs cases.

> i.e. xfs_btree_new_root() does:
> 
>         /* Set the root in the holding structure  increasing the level by 1. */
>         cur->bc_ops->set_root(cur, &lptr, 1);
> 

> and ->set_root / xfs_inobt_set_root() will happily increase agi_level; 
> I don't see anything limiting it to XFS_BTREE_MAXLEVELS.

Physical limits of the AGI due to the 1TB size of the AG.

> I guess XFS_BTREE_MAXLEVELS is just an arbitrary in-memory limit, not 
> a limit of the underlying disk structures, but as it stands, we should 
> be sure that we don't exceed it, right?

If you really want to enforce XFS_BTREE_MAXLEVELS, add checks into xfs_alloc_compute_maxlevels(), xfs_ialloc_compute_maxlevels() and xfs_bmap_compute_maxlevels() to constrain the limits in the struct xfs_mount and validate the on-disk values based on the values in the struct xfs_mount.

> I was going to put that limit into xfs_agi_verify, but realized that I 
> wasn't sure if we could actually exceed that depth in normal 
> operations.
> 
> (cue dchinner working out that 9 levels is 59 bazillion jillion items, 
> and will never be hit?)

Yep, done that ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com


* Re: Limits on agi->agi_level (and other btree depths?)
  2014-02-13  1:03 ` Dave Chinner
  2014-02-13  1:16   ` Shaun Gosse
@ 2014-02-13  1:53   ` Eric Sandeen
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Sandeen @ 2014-02-13  1:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-oss

On 2/12/14, 7:03 PM, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 05:54:19PM -0600, Eric Sandeen wrote:

...

>> I guess XFS_BTREE_MAXLEVELS is just an arbitrary in-memory limit,
>> not a limit of the underlying disk structures, but as it stands,
>> we should be sure that we don't exceed it, right?
> 
> If you really want to enforce XFS_BTREE_MAXLEVELS, add checks
> into xfs_alloc_compute_maxlevels(), xfs_ialloc_compute_maxlevels()
> and xfs_bmap_compute_maxlevels() to constrain the limits in the
> struct xfs_mount and validate the on-disk values based on the
> values in the struct xfs_mount.

Thanks, I had forgotten those existed - that looks perfect.

>> I was going to put that limit into xfs_agi_verify, but realized
>> that I wasn't sure if we could actually exceed that depth in
>> normal operations.
>>
>> (cue dchinner working out that 9 levels is 59 bazillion jillion
>> items, and will never be hit?)
> 
> Yep, done that ;)

I knew you'd enjoy it.  ;)

-Eric

> Cheers,
> 
> Dave.
> 


Thread overview: 5+ messages
2014-02-12 23:54 Limits on agi->agi_level (and other btree depths?) Eric Sandeen
2014-02-13  1:01 ` Shaun Gosse
2014-02-13  1:03 ` Dave Chinner
2014-02-13  1:16   ` Shaun Gosse
2014-02-13  1:53   ` Eric Sandeen
