* mkfs.btrfs/balance small-btrfs chunk size RFC
@ 2017-01-10  3:55 Duncan
  2017-01-10  5:34 ` Qu Wenruo
  2017-01-10 14:57 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 11+ messages in thread
From: Duncan @ 2017-01-10  3:55 UTC (permalink / raw)
  To: linux-btrfs

This post is triggered by a balance problem due to oversized chunks that 
I have currently.

Proposal 1: Ensure maximum chunk sizes are less than 1/8 the size of the 
filesystem (down to where they can't be any smaller, at least).

Proposal 2: Drastically reduce default system chunk size on small btrfs.

Here's the real-life scenario:  My /boot is 256 MiB mixed-bg-mode DUP.

Unfortunately, mkfs.btrfs apparently creates the first mixed chunk as 64
MiB, which makes it unbalanceable.  64 MiB duped due to dup mode is 128
MiB, exactly half the btrfs size.  But there's also a 16 MiB system
chunk, duped to 32 MiB.  So even with a still-empty fs immediately after
creation, I can't balance that first chunk.  (It apparently isn't
entirely empty, in order to keep it from being erased by the kernel
auto-clean or a balance, which would leave no record of the chunk mode.)
The chunk is 1/4 the btrfs and dups to 1/2 the btrfs, and with the
system chunk as well, there isn't half the btrfs left in order to create
a second chunk along with its dup to balance into.
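
The space accounting above can be checked with a short sketch.  It uses a
simplified model, which is an assumption and not the kernel's actual
allocator logic: balancing a chunk needs enough unallocated space for a
fresh chunk of the same size, times the number of copies.  The function
name is made up for illustration.

```python
MIB = 1024 * 1024

def can_balance_largest_chunk(fs_size, chunk_sizes, copies=2):
    """Return True if unallocated space can hold a new copy of the largest chunk.

    Simplified model (an assumption): relocating a chunk requires room for
    one new chunk of the same size, multiplied by the number of copies
    (2 for DUP).
    """
    allocated = sum(size * copies for size in chunk_sizes)
    unallocated = fs_size - allocated
    needed = max(chunk_sizes) * copies
    return unallocated >= needed

# 256 MiB fs with one 64 MiB mixed chunk and one 16 MiB system chunk, both DUP:
# 128 + 32 = 160 MiB allocated, 96 MiB free, but 128 MiB would be needed.
print(can_balance_largest_chunk(256 * MIB, [64 * MIB, 16 * MIB]))  # False
```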

But if I fill the btrfs enough to create another mixed chunk, it's only 
16 MiB in size, duped to 32 MiB, and btrfs usage shows it going from 64 
MiB to 80 MiB (16 MiB change, the additional chunk size), with the 
resulting duped size going from 128 MiB to 160 MiB (32 MiB change, the 
additional chunk duped size).

Now if those first chunks were 32 MiB or even the 16 MiB of the second, 
there'd obviously be more of them used for the same file content, but as 
long as I kept enough unallocated space on the btrfs to handle twice the 
size (due to dup) of the largest chunk, I could still balance all chunks, 
something that's flat impossible when the first mixed chunk dups to half 
the btrfs, and there has to be room for the system chunk as well.

So if the maximum created chunk size was limited to 1/8 the btrfs size, 
it would dup to 1/4 the size, and balances should actually be possible.
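
As a sketch of what such a clamp might look like (illustration only; the
function name and the 1 MiB minimum are hypothetical, since the real
minimum chunk size isn't established in this post):

```python
MIB = 1024 * 1024

def clamp_chunk_size(requested, fs_size, min_chunk, fraction=8):
    """Limit a chunk to fs_size / fraction, but never go below min_chunk."""
    limit = fs_size // fraction
    return max(min_chunk, min(requested, limit))

# A 64 MiB request on a 256 MiB fs is cut to 32 MiB, which dups to 64 MiB,
# i.e. 1/4 of the filesystem.  min_chunk=1 MiB is an assumed placeholder.
size = clamp_chunk_size(64 * MIB, 256 * MIB, min_chunk=1 * MIB)
print(size // MIB)      # 32
print(2 * size // MIB)  # 64
```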

As for proposal 2...

The system chunk size is 16 MiB, duped to 32 MiB, despite only a single 4
KiB block actually being used.  Locking up 16 MiB, duped to 32 MiB and
thus 1/8 of the entire 256 MiB btrfs, for a single 4 KiB block, duped to
8 KiB, is ridiculous on a sub-GiB btrfs.  If my math is correct, that's
only about 1/40th of 1 percent of the system chunk actually used (4 KiB
out of 16 MiB is 1/4096).
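
Working the ratios through, using only the sizes given above (the
variable names are just for illustration):

```python
KIB = 1024
MIB = 1024 * KIB

fs_size = 256 * MIB
system_chunk = 16 * MIB
used = 4 * KIB
copies = 2  # DUP

# Share of the whole filesystem locked up by the duped system chunk.
share_of_fs = (system_chunk * copies) / fs_size
# Share of the system chunk actually in use (the same with or without DUP).
share_used = (used * copies) / (system_chunk * copies)

print(f"system chunk takes {share_of_fs:.3f} of the fs")  # 0.125
print(f"in use: {share_used:.4%} of the system chunk")    # 0.0244%
```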

I don't know what the minimum chunk size actually is, but something like 
1 MiB system chunk size, if possible, would be far more reasonable in the 
sub-GiB btrfs context.  Otherwise 2 or even 4 MiB, the latter of which 
would dup to 8 MiB, would be tolerable, but a 16 MiB system chunk for a 
single 4 KiB block... and then dup /that/... just ridiculous.

It wouldn't be quite so bad if the global reserve (reported at 16 MiB) 
came from the system chunk instead of metadata (mixed-chunk here), and 
putting that in the system chunk would make sense since it's effectively 
system-reserved space, but of course it doesn't work that way, and I'd 
guess changing that would be a hairy nightmare, far worse than simply 
clamping down on created chunk sizes a bit, and likely practically 
impossible to implement at this stage.


But I'd expect clamping down on created chunk size, simply adding a check 
to ensure it's under 1/8 the full btrfs size (down to the minimum allowed 
chunk size, of course), to be quite practical and reasonably easy to 
implement.  Similarly, although I'm less sure of how small the minimum system
chunk size can be, I expect maximum system chunk size can reasonably be 
limited to say 4 MiB, if not 1 or 2 MiB, on sub-GiB btrfs.

So RFC: how realistic and simple does this look to the devs actually
doing the code?  Is it a small enough job that it could qualify as a bug
fix, be tested, and make it into released code within, say, five kernel
cycles, a year's time?  (It arguably is a bug, given that the btrfs is
/created/ with chunks that are impossible to balance, at present, or at
least it was around 4.8 time, as I believe that's about when I created
the btrfs.)  Obviously I'm hoping so. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10  3:55 mkfs.btrfs/balance small-btrfs chunk size RFC Duncan
@ 2017-01-10  5:34 ` Qu Wenruo
  2017-01-10 14:57 ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2017-01-10  5:34 UTC (permalink / raw)
  To: Duncan, linux-btrfs



At 01/10/2017 11:55 AM, Duncan wrote:
> This post is triggered by a balance problem due to oversized chunks that
> I have currently.
>
> Proposal 1: Ensure maximum chunk sizes are less than 1/8 the size of the
> filesystem (down to where they can't be any smaller, at least).

In fact, new chunks created by the kernel are guaranteed to be less than
1/10 of the total rw bytes of the fs.
That limit applies only to the chunk size itself, so a chunk may still
take up to 1/5 of the fs when using DUP/RAID1.
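
A rough sketch of that rule, for illustration only.  This is not the
kernel code; the function names and the 1 GiB requested size are
assumptions:

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def limited_chunk_size(total_rw_bytes, requested):
    """Cap a new chunk at 1/10 of the filesystem's writable bytes."""
    return min(requested, total_rw_bytes // 10)

def raw_space_used(chunk_size, copies):
    """Raw device space consumed once the chunk is duplicated."""
    return chunk_size * copies

fs = 256 * MIB
chunk = limited_chunk_size(fs, requested=1 * GIB)  # requested size is hypothetical
raw = raw_space_used(chunk, copies=2)              # DUP or two-device RAID1
print(f"chunk: {chunk / MIB:.1f} MiB, raw with DUP: {raw / MIB:.1f} MiB "
      f"({raw / fs:.0%} of the fs)")
# chunk: 25.6 MiB, raw with DUP: 51.2 MiB (20% of the fs)
```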

>
> Proposal 2: Drastically reduce default system chunk size on small btrfs.

btrfs_alloc_chunk() has a similar 10% limit, but it isn't working in
your case.

I'm not sure what's going wrong but I'll look into it after fixing the 
dev-replace bug.

Thanks,
Qu


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10  3:55 mkfs.btrfs/balance small-btrfs chunk size RFC Duncan
  2017-01-10  5:34 ` Qu Wenruo
@ 2017-01-10 14:57 ` Austin S. Hemmelgarn
  2017-01-10 15:29   ` Hugo Mills
  2017-01-11 19:25   ` Duncan
  1 sibling, 2 replies; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-10 14:57 UTC (permalink / raw)
  To: linux-btrfs

I can't personally comment on the code itself right now (I've actually 
never looked at the mkfs code, or any of the stuff that deals with the 
System chunk), but I can make a few general comments on this:
1. This behavior is still the case as of a Git build from yesterday (I 
just verified this myself with the locally built copy of btrfs-progs on 
my laptop).
2. Given the implications of snapshotting and typical usage, I'd say
it's not likely that the System chunk will need to be much more than a
single filesystem block on such a small FS.  I don't use more than a
single snapshot at a time, but I do have lots of subvolumes on my
laptop's root filesystem, and its System chunk usage is still only one
FS block (16kB in my case).  That said, ISTR reading somewhere that the
System chunk is functionally fixed-size (can't be more than one chunk,
and BTRFS can't resize an existing chunk).
3. In theory, it shouldn't be hard to get mkfs to use different sizes 
when creating the FS.  For at least the System chunk though, it may face 
limitations due to kernel expectations.

Given the above three points, I'd like to make a slightly different 
proposal:
Add options to mkfs to specify the size of the initial Data, Metadata, 
and System chunk in the filesystem, and document clearly some reasonable 
numbers based on FS size and intended usage.



* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 14:57 ` Austin S. Hemmelgarn
@ 2017-01-10 15:29   ` Hugo Mills
  2017-01-10 15:42     ` Austin S. Hemmelgarn
  2017-01-11 19:25   ` Duncan
  1 sibling, 1 reply; 11+ messages in thread
From: Hugo Mills @ 2017-01-10 15:29 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

On Tue, Jan 10, 2017 at 09:57:52AM -0500, Austin S. Hemmelgarn wrote:
> I can't personally comment on the code itself right now (I've
> actually never looked at the mkfs code, or any of the stuff that
> deals with the System chunk), but I can make a few general comments
> on this:
> 1. This behavior is still the case as of a Git build from yesterday
> (I just verified this myself with the locally built copy of
> btrfs-progs on my laptop).
> 2. Given the implications of snapshotting and typical usage, I'd say
> it's not likely that the System chunk will need to be much more than
> a single filesystem block on such a small FS.

   The System chunk has nothing to do with snapshotting. It's where
the chunk tree lives. (And the reason it's separate from all the other
metadata is so that the FS can bootstrap the physical-virtual mapping
easily).

   A chunk record is 48 bytes, plus 17 bytes for the key, plus a
32 byte record appended to it for each stripe.

>  I don't use more than
> a single snapshot at a time, but I do have lots of subvolumes on my
> laptop's root filesystem, and its System chunk usage is still only
> one FS block (16kB in my case).  That said, ISTR reading somewhere
> that the System chunk is functionally fixed-size (can't be more than
> one chunk, and BTRFS can't resize an existing chunk).

   I don't recall seeing anything about that as a limitation. The
superblock structure appears to be able to support multiple system
chunks -- there's a list of records pointing to the physical location
of each system chunk appended to the end of the superblock. A comment
in ctree.h remarks that "this gives us enough room to translate 14
chunks with 3 stripes each", so there's clearly the expectation that
there may be more than one.

   The largest filesystem I think I've seen anyone mention in the wild
was on the scale of 100 TB, which is going to have about 100k chunks,
which comes out as about 10 MiB of metadata in the chunk tree. So you
definitely wouldn't want to use a 1 MiB chunk size for the system tree
in general. I don't see a problem with shrinking it for small
filesystems, though.
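
   As a back-of-the-envelope check of that estimate, using the record
sizes quoted earlier in this message.  This is only a lower bound: it
counts item payload and ignores tree node overhead, and the helper names
are made up for illustration.

```python
MIB = 1024 * 1024

# Sizes quoted in this message: chunk record, key, and per-stripe record.
CHUNK_RECORD_BYTES = 48
KEY_BYTES = 17
STRIPE_BYTES = 32

def chunk_item_bytes(stripes):
    """Bytes for one chunk item with the given number of stripes."""
    return CHUNK_RECORD_BYTES + KEY_BYTES + STRIPE_BYTES * stripes

def chunk_tree_bytes(num_chunks, stripes):
    """Lower bound: item payload only, ignoring tree node overhead."""
    return num_chunks * chunk_item_bytes(stripes)

# About 100k chunks, as in the 100 TB example, for 1 to 3 stripes per chunk.
for stripes in (1, 2, 3):
    size = chunk_tree_bytes(100_000, stripes)
    print(stripes, chunk_item_bytes(stripes), round(size / MIB, 1))
```

   That comes out at roughly 9 to 15 MiB depending on the stripe count,
the same order of magnitude as the figure above.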

   The only thing I can see might be an issue is where a small FS is
created with, say, a 128 KiB system chunk, and then it's grown into a
large filesystem. You'd have to ensure that any subsequent system
chunk allocations are much larger, otherwise you're going to break the
14(ish) chunk limit in the superblock.

   Hugo.

> 3. In theory, it shouldn't be hard to get mkfs to use different
> sizes when creating the FS.  For at least the System chunk though,
> it may face limitations due to kernel expectations.
> 
> Given the above three points, I'd like to make a slightly different
> proposal:
> Add options to mkfs to specify the size of the initial Data,
> Metadata, and System chunk in the filesystem, and document clearly
> some reasonable numbers based on FS size and intended usage.
> 

-- 
Hugo Mills             | I'm always right. But I might be wrong about that.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 15:29   ` Hugo Mills
@ 2017-01-10 15:42     ` Austin S. Hemmelgarn
  2017-01-10 15:47       ` Hugo Mills
  2017-01-10 17:17       ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-10 15:42 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

On 2017-01-10 10:29, Hugo Mills wrote:
> On Tue, Jan 10, 2017 at 09:57:52AM -0500, Austin S. Hemmelgarn wrote:
>> I can't personally comment on the code itself right now (I've
>> actually never looked at the mkfs code, or any of the stuff that
>> deals with the System chunk), but I can make a few general comments
>> on this:
>> 1. This behavior is still the case as of a Git build from yesterday
>> (I just verified this myself with the locally built copy of
>> btrfs-progs on my laptop).
>> 2. Given the implications of snapshotting and typical usage, I'd say
>> it's not likely that the System chunk will need to be much more than
>> a single filesystem block on such a small FS.
>
>    The System chunk has nothing to do with snapshotting. It's where
> the chunk tree lives. (And the reason it's separate from all the other
> metadata is so that the FS can bootstrap the physical-virtual mapping
> easily).
>
>    A chunk record is 48 bytes, plus 17 bytes for the key, plus a
> 32 byte record appended to it for each stripe.
>
>>  I don't use more than
>> a single snapshot at a time, but I do have lots of subvolumes on my
>> laptop's root filesystem, and its System chunk usage is still only
>> one FS block (16kB in my case).  That said, ISTR reading somewhere
>> that the System chunk is functionally fixed-size (can't be more than
>> one chunk, and BTRFS can't resize an existing chunk).
>
>    I don't recall seeing anything about that as a limitation. The
> superblock structure appears to be able to support multiple system
> chunks -- there's a list of records pointing to the physical location
> of each system chunk appended to the end of the superblock. A comment
> in ctree.h remarks that "this gives us enough room to translate 14
> chunks with 3 stripes each", so there's clearly the expectation that
> there may be more than one.
>
>    The largest filesystem I think I've seen anyone mention in the wild
> was on the scale of 100 TB, which is going to have about 100k chunks,
> which comes out as about 10 MiB of metadata in the chunk tree. So you
> definitely wouldn't want to use a 1 MiB chunk size for the system tree
> in general. I don't see a problem with shrinking it for small
> filesystems, though.
>
>    The only thing I can see might be an issue is where a small FS is
> created with, say, a 128 KiB system chunk, and then it's grown into a
> large filesystem. You'd have to ensure that any subsequent system
> chunk allocations are much larger, otherwise you're going to break the
> 14(ish) chunk limit in the superblock.
Most of the issue in this case is with the size of the initial chunk. 
That said, I've got quite a few reasonably sized filesystems (I think 
the largest is 200GB) with moderate usage (max 90GB of data), and none 
of them are using more than the first 16kB block in the System chunk. 
While I'm not necessarily a typical user, I'd be willing to bet based on 
this that in general, most people who aren't storing very large amounts 
of data or taking huge numbers of snapshots aren't going to need a 
system chunk much bigger than 1MB.  Perhaps making the initial system 
chunk 1MB for every GB of space (rounded up to a full MB) in the 
filesystem up to 16GB would be reasonable (and then keep the 16MB 
default for larger filesystems)?
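
To make that rule concrete, here is a minimal sketch of the sizing
heuristic.  It is an illustration only, not existing mkfs behavior, and
the function name is hypothetical:

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def initial_system_chunk_size(fs_size):
    """1 MiB per GiB of filesystem, rounded up, capped at the 16 MiB default."""
    if fs_size >= 16 * GIB:
        return 16 * MIB
    gib_rounded_up = -(-fs_size // GIB)  # integer ceiling division
    return gib_rounded_up * MIB

print(initial_system_chunk_size(256 * MIB) // MIB)  # 1
print(initial_system_chunk_size(200 * GIB) // MIB)  # 16
```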

That said, it's not just the System chunk in this case.  There is
absolutely zero reason that a filesystem of reasonable size (and 156MB
is an absolutely reasonable size for a FS) should be impossible to
balance the moment it's created.  mkfs is currently creating filesystems
like that, not just because of the System chunk, but because the default
sizes for regular chunks are too large once you get that small.  They're
scaled down, but based on some experimentation I think it won't create a
chunk smaller than 64MB unless the FS is so small that it can't exist
without smaller chunks.  Making that default saner would be nice, but it
would be even better IMO to provide the option to just override the
initial chunk size calculation and provide a specific size (with
reasonable bounds on what you can have it create).

>> 3. In theory, it shouldn't be hard to get mkfs to use different
>> sizes when creating the FS.  For at least the System chunk though,
>> it may face limitations due to kernel expectations.
>>
>> Given the above three points, I'd like to make a slightly different
>> proposal:
>> Add options to mkfs to specify the size of the initial Data,
>> Metadata, and System chunk in the filesystem, and document clearly
>> some reasonable numbers based on FS size and intended usage.
>>
>


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 15:42     ` Austin S. Hemmelgarn
@ 2017-01-10 15:47       ` Hugo Mills
  2017-01-10 16:05         ` Austin S. Hemmelgarn
  2017-01-11 19:00         ` Duncan
  2017-01-10 17:17       ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 11+ messages in thread
From: Hugo Mills @ 2017-01-10 15:47 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

On Tue, Jan 10, 2017 at 10:42:51AM -0500, Austin S. Hemmelgarn wrote:
> Most of the issue in this case is with the size of the initial
> chunk. That said, I've got quite a few reasonably sized filesystems
> (I think the largest is 200GB) with moderate usage (max 90GB of
> data), and none of them are using more than the first 16kB block in
> the System chunk. While I'm not necessarily a typical user, I'd be
> willing to bet based on this that in general, most people who aren't
> storing very large amounts of data or taking huge numbers of
> snapshots aren't going to need a system chunk much bigger than 1MB.

   Again, the system chunk has *nothing* to do with snapshots.

   Agreed with everything else, though.

   Hugo.

-- 
Hugo Mills             | I'm always right. But I might be wrong about that.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 15:47       ` Hugo Mills
@ 2017-01-10 16:05         ` Austin S. Hemmelgarn
  2017-01-10 16:10           ` Hugo Mills
  2017-01-11 19:00         ` Duncan
  1 sibling, 1 reply; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-10 16:05 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

On 2017-01-10 10:47, Hugo Mills wrote:
> On Tue, Jan 10, 2017 at 10:42:51AM -0500, Austin S. Hemmelgarn wrote:
>> Most of the issue in this case is with the size of the initial
>> chunk. That said, I've got quite a few reasonably sized filesystems
>> (I think the largest is 200GB) with moderate usage (max 90GB of
>> data), and none of them are using more than the first 16kB block in
>> the System chunk. While I'm not necessarily a typical user, I'd be
>> willing to bet based on this that in general, most people who aren't
>> storing very large amounts of data or taking huge numbers of
>> snapshots aren't going to need a system chunk much bigger than 1MB.
>
>    Again, the system chunk has *nothing* to do with snapshots.
Apologies, I somehow completely missed the part of your first reply 
where you commented on that...
>
>    Agreed with everything else, though.
>
>    Hugo.
>


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 16:05         ` Austin S. Hemmelgarn
@ 2017-01-10 16:10           ` Hugo Mills
  0 siblings, 0 replies; 11+ messages in thread
From: Hugo Mills @ 2017-01-10 16:10 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

On Tue, Jan 10, 2017 at 11:05:17AM -0500, Austin S. Hemmelgarn wrote:
> On 2017-01-10 10:47, Hugo Mills wrote:
> >On Tue, Jan 10, 2017 at 10:42:51AM -0500, Austin S. Hemmelgarn wrote:
> >>Most of the issue in this case is with the size of the initial
> >>chunk. That said, I've got quite a few reasonably sized filesystems
> >>(I think the largest is 200GB) with moderate usage (max 90GB of
> >>data), and none of them are using more than the first 16kB block in
> >>the System chunk. While I'm not necessarily a typical user, I'd be
> >>willing to bet based on this that in general, most people who aren't
> >>storing very large amounts of data or taking huge numbers of
> >>snapshots aren't going to need a system chunk much bigger than 1MB.
> >
> >   Again, the system chunk has *nothing* to do with snapshots.
> Apologies, I somehow completely missed the part of your first reply
> where you commented on that...

   There was quite a lot of it. :)

   Hugo.

-- 
Hugo Mills             | Anyone who claims their cryptographic protocol is
hugo@... carfax.org.uk | secure is either a genius or a fool. Given the
http://carfax.org.uk/  | genius/fool ratio for our species, the odds aren't
PGP: E2AB1DE4          | good.                                  Bruce Schneier


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 15:42     ` Austin S. Hemmelgarn
  2017-01-10 15:47       ` Hugo Mills
@ 2017-01-10 17:17       ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-10 17:17 UTC (permalink / raw)
  To: linux-btrfs

On 2017-01-10 10:42, Austin S. Hemmelgarn wrote:
> Most of the issue in this case is with the size of the initial
> chunk. That said, I've got quite a few reasonably sized filesystems
> (I think the largest is 200GB) with moderate usage (max 90GB of
> data), and none of them are using more than the first 16kB block in
> the System chunk. While I'm not necessarily a typical user, I'd be
> willing to bet based on this that in general, most people who aren't
> storing very large amounts of data or taking huge numbers of
> snapshots aren't going to need a system chunk much bigger than 1MB.
> Perhaps making the initial system chunk 1MB for every GB of space
> (rounded up to a full MB) in the filesystem up to 16GB would be
> reasonable (and then keep the 16MB default for larger filesystems)?
I'm a bit bored, so I just ran the numbers on this.  The math assumes a
RAID1 profile system with 2 identically sized devices, and the total
sizes given are for usable space.

Given an entry size of 97 bytes (48 for the main record plus 17 for the
key, plus 32 for the second stripe), a 16kiB block can handle 675
entries, and (assuming that the entries can't cross a block boundary),
with a 16kiB node size, a 1MiB System chunk can hold 10800 entries.
Assuming a typical mixed usage filesystem with 1 metadata chunk for
every 5.5 data chunks (this is roughly the ratio I see on most of the
filesystems I've worked with), that gives about 561 data chunks and 102
metadata chunks for each 16kiB block in the System chunk, giving a total
FS size of 586.5GiB per 16kiB block in the System Chunk, or 37536GiB for
a 1MiB System chunk.  So, for a 16kiB node size and accounting for the
scaling in chunk sizes on large filesystems, a 1MiB System chunk can
easily handle a 10TB filesystem, and could probably handle a 35-40TB
filesystem provided it's kept in good condition (regular balancing and
such).  This overall means that provided ideal conditions with a 16kiB
node size, the default 32MB system chunk could easily handle a 1PB
filesystem without needing to allocate another System chunk.  The math
actually works out roughly the same for a 4kiB node size (it's at most a
few percent smaller, probably less than 1% difference).  This in turn
means that given the room for 14 system chunks, with the default system
chunk size, the practical max filesystem size is at least 14PB, and
likely not more than 60-70PB, just based on the number of possible extents.
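
For anyone who wants to re-run the arithmetic above, here is a small Python sketch. The helper names are hypothetical, block headers are ignored, and the 1 GiB data / 256 MiB metadata chunk sizes are an assumption (they are the sizes that reproduce the 586.5GiB figure). Note that integer division gives 168 entries for a 16 KiB block and 675 for a 64 KiB block, so the per-block figures above appear to correspond to a 64 KiB block, and the filesystem sizes derived from them may be optimistic by roughly a factor of four for 16 KiB nodes.

```python
KIB = 1024
ENTRY_BYTES = 48 + 17 + 32  # main record + key + second stripe, as stated above

def entries_per_block(block_bytes: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Whole entries that fit in one block, ignoring any block header (assumption)."""
    return block_bytes // entry_bytes

def capacity_gib(data_chunks: int, metadata_chunks: int,
                 data_gib: float = 1.0, metadata_gib: float = 0.25) -> float:
    """Space covered by the given chunk counts; both chunk sizes are assumed."""
    return data_chunks * data_gib + metadata_chunks * metadata_gib

print(entries_per_block(16 * KIB))        # entries in a 16 KiB block
print(entries_per_block(64 * KIB))        # entries in a 64 KiB block
print(entries_per_block(16 * KIB) * 64)   # entries in 1 MiB of 16 KiB blocks
print(capacity_gib(561, 102))             # per-block capacity used above, in GiB
print(capacity_gib(561, 102) * 64)        # the same figure scaled to 64 blocks
```

The per-MiB entry count comes out close to the 10800 quoted above either way; it is the per-block capacity that depends on which block size is assumed.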

Now, going a bit further, the theoretical max FS size based on
addressing constraints is (IIRC) something around 16EB (sum total of all
device sizes, actual usable space would be 8EB).  Assuming a worst case
usage scenario with only metadata chunks and no chunk scaling, that
requires 2199023255552 chunks.  To be able to handle that within the 14
System chunks we currently support, we would need each System chunk to
be more than 14TB, which leads to the interesting conclusion that our
addressing is actually much more grandiose than we could realistically
support (because the locking at that point is going to be absolute hell,
not because that's an unreasonable amount of metadata).
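
The per-System-chunk figure above can be checked with a few lines of Python. This is only a sketch: it reuses the 97-byte entry size from earlier in this message, assumes entries are packed with no per-block overhead, and the function name is made up for the example.

```python
ENTRY_BYTES = 97               # per-chunk entry size used earlier in this message
TOTAL_CHUNKS = 2199023255552   # worst-case chunk count quoted above (2**41)
SYSTEM_CHUNKS = 14             # System chunks the superblock has room for

def system_chunk_bytes_needed(chunks: int = TOTAL_CHUNKS,
                              entry_bytes: int = ENTRY_BYTES,
                              system_chunks: int = SYSTEM_CHUNKS) -> float:
    """Bytes each System chunk must hold if the entries are spread evenly."""
    return chunks * entry_bytes / system_chunks

per_chunk = system_chunk_bytes_needed()
print(f"{per_chunk / 2**40:.2f} TiB per System chunk")
```

That works out to a little under 14 TiB (a bit over 15 TB in decimal units) per System chunk, in line with the figure given above.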

For those who care, this overall means that the idealized overhead of
the chunk tree relative to filesystem size is at worst 0.001% assuming
maximally efficient use of System chunks, and probably roughly 0.05% for
a realistic filesystem if the system chunk were exactly the size needed
to fit all the chunk entries in the FS.

Based on all of this, I would propose the following heuristic be used to
determine the size of each System chunk:
1. If the filesystem is less than 1GB accounting for replication
profiles, make the system chunk 256kiB. This gives a max size of more
than 0.5TB before a new System chunk is needed, and the likelihood of
someone expanding a filesystem that started at less than 1GB to be that
large is near enough to zero to be statistically impossible.  The 
practical max size using 256kiB System chunks would be around 8TB.
2. If the filesystem is less than 100GB accounting for replication
profiles, make the system chunk 1MiB.  This gives a roughly 35TB max
size before a new System chunk is needed, which is again well beyond
what's statistically reasonable. Practical max size for 1MiB would be 
about 490TB.
3. If the filesystem is less than 10TB accounting for replication
profiles, make the system chunk 16MiB.  This gives a roughly 560TB max
size before a new System chunk is needed, and roughly 8PB practical max 
size.
4. If the filesystem is less than 1PB accounting for replication
profiles, make the system chunk 256MiB.  This gives a rough max size of
something around 9PB before a new System chunk is needed, and roughly 
126PB practical max size.
5. Otherwise, use a 1GB system chunk.  That gives (for a single System
chunk) a realistic max size of at least 500PB, which is much larger
scale than anyone is likely to need on a single non-distributed
filesystem for the foreseeable future, and is probably far beyond the
point at which the locking requirements on the extent tree far outweigh
any other overhead.
6. Beyond this, add an option to override this selection heuristic and
specify an exact size for the initial system chunk (which should
probably be restricted to power-of-2 multiples of the node size, with a
minimum of the next size down from what would be selected automatically).
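
The heuristic above can be sketched as a simple lookup. This is a minimal Python illustration, not proposed mkfs code: the names are hypothetical, binary units are assumed for every threshold, and the validation of the override described in step 6 (power-of-2 multiples of the node size, minimum size) is left out.

```python
KIB, MIB, GIB, TIB, PIB = (1024 ** n for n in range(1, 6))

# (filesystem size limit, System chunk size) pairs from steps 1-4 above.
TIERS = [
    (1 * GIB, 256 * KIB),
    (100 * GIB, 1 * MIB),
    (10 * TIB, 16 * MIB),
    (1 * PIB, 256 * MIB),
]
LARGEST = 1 * GIB  # step 5: everything at or above 1PB

def pick_system_chunk_size(fs_bytes, override=None):
    """Return the System chunk size in bytes for a filesystem of fs_bytes.

    fs_bytes is the size after accounting for the replication profile.
    override models step 6; bounds checking is deliberately omitted here.
    """
    if override is not None:
        return override
    for limit, size in TIERS:
        if fs_bytes < limit:
            return size
    return LARGEST

print(pick_system_chunk_size(256 * MIB) // KIB, "KiB")  # small /boot-sized filesystem
print(pick_system_chunk_size(5 * TIB) // MIB, "MiB")    # mid-sized array
```

Keeping the tiers in a table rather than a chain of if statements would make it easy to adjust the thresholds if different numbers are agreed on.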

This also brings up a rather interesting secondary question which is 
currently functionally impossible to test without special work involved:
What error does userspace get when it does something which would create 
a chunk and it's not possible to create a new chunk because the chunk 
tree has hit max size, and what's logged by the kernel when this happens?

* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 15:47       ` Hugo Mills
  2017-01-10 16:05         ` Austin S. Hemmelgarn
@ 2017-01-11 19:00         ` Duncan
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2017-01-11 19:00 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Tue, 10 Jan 2017 15:47:53 +0000 as excerpted:

> On Tue, Jan 10, 2017 at 10:42:51AM -0500, Austin S. Hemmelgarn wrote:
>> Most of the issue in this case is with the size of the initial chunk.
>> That said, I've got quite a few reasonably sized filesystems (I think
>> the largest is 200GB) with moderate usage (max 90GB of data), and none
>> of them are using more than the first 16kB block in the System chunk.
>> While I'm not necessarily a typical user, I'd be willing to bet based
>> on this that in general, most people who aren't storing very large
>> amounts of data or taking huge numbers of snapshots aren't going to
>> need a system chunk much bigger than 1MB.
> 
>    Again, the system chunk has *nothing* to do with snapshots.

Given your explanation of the system chunk containing the chunk tree but 
not being (directly) related to snapshots, I took that as...

Many snapshots, some being old snapshots of now changed data, thus 
potentially multiplying the working copy data several times and of course 
requiring more chunks in order to contain all that archived data.

So while snapshots aren't directly related to the system chunk, the fact 
that they're snapshotting /something/ that's presumably changing or 
there'd be no need for snapshots, and the snapshot-archived versions of 
that /something/ presumably takes additional chunks, makes snapshots 
indirectly related to the required size of the system chunk(s), in 
order to contain the chunk tree supporting all the other chunks, 
necessary due not to live data, but due to the snapshots.

Is that a correct read, or is (somehow) that indirect dependency not 
there either, despite the system chunk(s) containing the chunk tree?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: mkfs.btrfs/balance small-btrfs chunk size RFC
  2017-01-10 14:57 ` Austin S. Hemmelgarn
  2017-01-10 15:29   ` Hugo Mills
@ 2017-01-11 19:25   ` Duncan
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2017-01-11 19:25 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 10 Jan 2017 09:57:52 -0500 as
excerpted:

> I can't personally comment on the code itself right now (I've actually
> never looked at the mkfs code, or any of the stuff that deals with the
> System chunk), but I can make a few general comments on this:
> 1. This behavior is still the case as of a Git build from yesterday (I
> just verified this myself with the locally built copy of btrfs-progs on
> my laptop).

Thanks.  After posting and seeing Qu W's response I was thinking I needed 
to test current behavior, and you just saved me the trouble (tho I do 
need to freshen my backup /boot one of these days, likely testing 
mkfs.btrfs on this in the process, but that can wait until 4.10).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Thread overview: 11+ messages
2017-01-10  3:55 mkfs.btrfs/balance small-btrfs chunk size RFC Duncan
2017-01-10  5:34 ` Qu Wenruo
2017-01-10 14:57 ` Austin S. Hemmelgarn
2017-01-10 15:29   ` Hugo Mills
2017-01-10 15:42     ` Austin S. Hemmelgarn
2017-01-10 15:47       ` Hugo Mills
2017-01-10 16:05         ` Austin S. Hemmelgarn
2017-01-10 16:10           ` Hugo Mills
2017-01-11 19:00         ` Duncan
2017-01-10 17:17       ` Austin S. Hemmelgarn
2017-01-11 19:25   ` Duncan
