* Maildir quickly hitting max htree
@ 2021-11-12 19:52 Mark Hills
       [not found] ` <36FABD31-B636-4D94-B14D-93F3D2B4C148@dilger.ca>
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Hills @ 2021-11-12 19:52 UTC (permalink / raw)
  To: linux-ext4

Surprised to hit a limit when handling a modest Maildir case; does this 
reflect a bug?

rsync'ing to a new mail server, after fewer than 100,000 files there are 
intermittent failures:

  rsync: [receiver] open "/home/mark/Maildir/.robot/cur/1633731549.M990187P7732.yello.[redacted],S=69473,W=70413:2," failed: No space left on device (28)
  rsync: [receiver] rename "/home/mark/Maildir/.robot/cur/.1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,.oBphKA" -> ".robot/cur/1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,": No space left on device (28)

The kernel:

  EXT4-fs warning (device dm-4): ext4_dx_add_entry:2351: Directory (ino: 225811) index full, reach max htree level :2
  EXT4-fs warning (device dm-4): ext4_dx_add_entry:2355: Large directory feature is not enabled on this filesystem

Reaching for 'large_dir' seems premature, as this feature is reported as 
being useful for 10M+ files, but this directory is far smaller than that.

A 'bad' filename will fail consistently. Assuming the 10M+ absolute limit, 
is the tree grossly imbalanced?

Intuitively, 'htree level :2' does not sound particularly deep.

The source folder is 195,000 files -- large, but not crazy. rsync 
eventually hit a ceiling having written 177,482 of the files. I can still 
create new ones on the command line with non-Maildir names.

Ruled out quotas by disabling them with "tune2fs -O ^quota" and 
remounting.

See below for additional info.

-- 
Mark


$ uname -a
Linux floyd 5.10.78-0-virt #1-Alpine SMP Thu, 11 Nov 2021 14:31:09 +0000 x86_64 GNU/Linux

$ mke2fs -q -t ext4 /dev/vg0/home

$ rsync -va --exclude 'dovecot*' yello:Maildir/. $HOME/Maildir

$ ls | head -15
1605139205.M487508P91922.yello.[redacted],S=7625,W=7775:2,
1605139440.M413280P92363.yello.[redacted],S=7632,W=7782:2,
1605139466.M699663P92402.yello.[redacted],S=7560,W=7710:2,
1605139479.M651510P92421.yello.[redacted],S=7474,W=7623:2,
1605139508.M934351P92514.yello.[redacted],S=7626,W=7776:2,
1605139596.M459228P92713.yello.[redacted],S=7559,W=7709:2,
1605139645.M57446P92736.yello.[redacted],S=7632,W=7782:2,
1605139670.M964535P92758.yello.[redacted],S=7628,W=7778:2,
1605139697.M273694P92807.yello.[redacted],S=7632,W=7782:2,
1605139748.M607989P92853.yello.[redacted],S=7560,W=7710:2,
1605139759.M655635P92868.yello.[redacted],S=5912,W=6018:2,
1605139808.M338286P93071.yello.[redacted],S=7628,W=7778:2,
1605139961.M915501P93235.yello.[redacted],S=7625,W=7775:2,
1605140303.M219848P93591.yello.[redacted],S=6898,W=7023:2,
1605140580.M166212P93921.yello.[redacted],S=6896,W=7021:2,

$ touch abc
[success]

$ touch 1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,
touch: cannot touch '1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,': No space left on device

$ dumpe2fs /dev/vg0/home
Filesystem volume name:   <none>
Last mounted on:          /home
Filesystem UUID:          ad26c968-d057-4d44-bef9-1e2df347580e
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              5225472
Block count:              21229568
Reserved block count:     851459
Overhead clusters:        22361
Free blocks:              8058180
Free inodes:              4799979
First block:              1
Block size:               1024
Fragment size:            1024
Group descriptor size:    64
Reserved GDT blocks:      96
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2016
Inode blocks per group:   504
Flex block group size:    16
Filesystem created:       Mon Nov  8 13:14:56 2021
Last mount time:          Fri Nov 12 18:43:14 2021
Last write time:          Fri Nov 12 18:43:14 2021
Mount count:              27
Maximum mount count:      -1
Last checked:             Mon Nov  8 13:14:56 2021
Check interval:           0 (<none>)
Lifetime writes:          14 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      839d2871-b97e-456d-9724-096db15931b8
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x5974a8b1
Journal features:         journal_incompat_revoke journal_64bit 
journal_checksum_v3
Total journal size:       4096k
Total journal blocks:     4096
Max transaction length:   4096
Fast commit length:       0
Journal sequence:         0x00000a2a
Journal start:            702
Journal checksum type:    crc32c
Journal checksum:         0x4d693e79




* Re: Maildir quickly hitting max htree
       [not found] ` <36FABD31-B636-4D94-B14D-93F3D2B4C148@dilger.ca>
@ 2021-11-13 12:05   ` Mark Hills
  2021-11-13 17:19     ` Andreas Dilger
  2021-11-14 17:44     ` Theodore Ts'o
  0 siblings, 2 replies; 8+ messages in thread
From: Mark Hills @ 2021-11-13 12:05 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4


Andreas, thanks for such a prompt reply.

On Fri, 12 Nov 2021, Andreas Dilger wrote:

> On Nov 12, 2021, at 11:37, Mark Hills <mark@xwax.org> wrote:
> > 
> > Surprised to hit a limit when handling a modest Maildir case; does 
> > this reflect a bug?
> > 
> > rsync'ing to a new mail server, after fewer than 100,000 files there 
> > are intermittent failures:
> 
> This is probably because you are using 1KB blocksize instead of 4KB, 
> which reduces the size of each tree level by the cube of the ratio, so 
> 64x. I guess that was selected because of very small files in the 
> maildir?

Interesting! The 1Kb block size was not explicitly chosen. There was no 
plan other than using the defaults.

However, I did forget that this is a VM installed from a base image. The 
root cause is likely to be that the /home partition has been enlarged from 
a small size to 32Gb.

Is block size the only factor? If so, a patch like the one below 
(untested) could make it clear that it's relevant, and would have saved 
the question in this case.
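
For my own understanding, a rough back-of-envelope estimate (assuming 
~8-byte htree index entries, and leaf blocks that end up only partly full 
after splits) does seem to line up with where rsync stopped:

  1KB blocks:  ~128 index entries per block; two levels -> ~128 * 128
               = ~16K leaf blocks; perhaps 10-14 of these long Maildir
               names per 1KB leaf -> very roughly 150K-200K entries

  4KB blocks:  ~512 * 512 = ~260K leaf blocks, each holding ~4x as many
               entries -> comfortably into the 10M+ region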

[...]
> If you have a relatively recent kernel, you can enable the "large_dir" 
> feature to allow 3-level htree, which would be enough for another factor 
> of 1024/8 = 128 more entries than now (~12M).
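
Presumably that would be something along these lines (untested here, and I 
gather it needs a new enough kernel and e2fsprogs):

  $ tune2fs -O large_dir /dev/vg0/home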

The system is not yet in use, so I think it's better we reformat here, and 
get a block size chosen by the experts :)

These days I think VMs make it more common to enlarge a filesystem from a 
small size. We could have picked this up earlier with a warning from 
resize2fs; eg. if the block size will no longer match the one that would 
be chosen by default. That would pick it up before anyone puts 1Kb block 
size into production.

Thanks for identifying the issue.

-- 
Mark


From 8604c50be77a4bc56a91099598c409d5a3c1fdbe Mon Sep 17 00:00:00 2001
From: Mark Hills <mark@xwax.org>
Date: Sat, 13 Nov 2021 11:46:50 +0000
Subject: [PATCH] Block size has an effect on the index size

---
 fs/ext4/namei.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f3bbcd4efb56..8965bed4d7ff 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2454,8 +2454,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 		}
 		if (add_level && levels == ext4_dir_htree_level(sb)) {
 			ext4_warning(sb, "Directory (ino: %lu) index full, "
-					 "reach max htree level :%d",
-					 dir->i_ino, levels);
+					 "reach max htree level :%d "
+					 "with block size %lu",
+					 dir->i_ino, levels, sb->s_blocksize);
 			if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
 				ext4_warning(sb, "Large directory feature is "
 						 "not enabled on this "
-- 
2.33.1


* Re: Maildir quickly hitting max htree
  2021-11-13 12:05   ` Mark Hills
@ 2021-11-13 17:19     ` Andreas Dilger
  2021-11-16 17:52       ` Mark Hills
  2021-11-14 17:44     ` Theodore Ts'o
  1 sibling, 1 reply; 8+ messages in thread
From: Andreas Dilger @ 2021-11-13 17:19 UTC (permalink / raw)
  To: Mark Hills; +Cc: Ext4 Developers List

On Nov 13, 2021, at 04:05, Mark Hills <mark@xwax.org> wrote:
> 
> Andreas, thanks for such a prompt reply.
> 
>> On Fri, 12 Nov 2021, Andreas Dilger wrote:
>> 
>>> On Nov 12, 2021, at 11:37, Mark Hills <mark@xwax.org> wrote:
>>> 
>>> Surprised to hit a limit when handling a modest Maildir case; does 
>>> this reflect a bug?
>>> 
>>> rsync'ing to a new mail server, after fewer than 100,000 files there 
>>> are intermittent failures:
>> 
>> This is probably because you are using 1KB blocksize instead of 4KB, 
>> which reduces the size of each tree level by the cube of the ratio, so 
>> 64x. I guess that was selected because of very small files in the 
>> maildir?
> 
> Interesting! The 1Kb block size was not explicitly chosen. There was no 
> plan other than using the defaults.
> 
> However I did forget that this is a VM installed from a base image. The 
> root cause is likely to be that the /home partition has been enlarged from 
> a small size to 32Gb.
> 
> Is block size the only factor? If so, a patch like below (untested) could 
> make it clear it's relevant, and saved the question in this case.

The patch looks reasonable, but should be submitted separately with
[patch] in the subject so that it will not be lost.  

You can also add on your patch:

Reviewed-by: Andreas Dilger <adilger@dilger.ca>


Cheers, Andreas

> 
> [...]
>> If you have a relatively recent kernel, you can enable the "large_dir" 
>> feature to allow 3-level htree, which would be enough for another factor 
>> of 1024/8 = 128 more entries than now (~12M).
> 
> The system is not yet in use, so I think it's better we reformat here, and 
> get a block size chosen by the experts :)
> 
> These days I think VMs make it more common to enlarge a filesystem from a 
> small size. We could have picked this up earlier with a warning from 
> resize2fs; eg. if the block size will no longer match the one that would 
> be chosen by default. That would pick it up before anyone puts 1Kb block 
> size into production.
> 
> Thanks for identifying the issue.
> 
> -- 
> Mark
> 
> 
> From 8604c50be77a4bc56a91099598c409d5a3c1fdbe Mon Sep 17 00:00:00 2001
> From: Mark Hills <mark@xwax.org>
> Date: Sat, 13 Nov 2021 11:46:50 +0000
> Subject: [PATCH] Block size has an effect on the index size
> 
> ---
> fs/ext4/namei.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index f3bbcd4efb56..8965bed4d7ff 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2454,8 +2454,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>        }
>        if (add_level && levels == ext4_dir_htree_level(sb)) {
>            ext4_warning(sb, "Directory (ino: %lu) index full, "
> -                     "reach max htree level :%d",
> -                     dir->i_ino, levels);
> +                     "reach max htree level :%d "
> +                     "with block size %lu",
> +                     dir->i_ino, levels, sb->s_blocksize);
>            if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
>                ext4_warning(sb, "Large directory feature is "
>                         "not enabled on this "
> -- 
> 2.33.1


* Re: Maildir quickly hitting max htree
  2021-11-13 12:05   ` Mark Hills
  2021-11-13 17:19     ` Andreas Dilger
@ 2021-11-14 17:44     ` Theodore Ts'o
  2021-11-16 19:31       ` Mark Hills
  1 sibling, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2021-11-14 17:44 UTC (permalink / raw)
  To: Mark Hills; +Cc: Andreas Dilger, linux-ext4

On Sat, Nov 13, 2021 at 12:05:07PM +0000, Mark Hills wrote:
> 
> Interesting! The 1Kb block size was not explicitly chosen. There was no 
> plan other than using the defaults.
> 
> However I did forget that this is a VM installed from a base image. The 
> root cause is likely to be that the /home partition has been enlarged from 
> a small size to 32Gb.

How small was the base image?  As documented in the man page for
mke2fs.conf, for file systems that are smaller than 3mb, mke2fs uses
the parameters in /etc/mke2fs.conf for type "floppy" (back when 3.5
inch floppies were either 1.44MB or 2.88MB).  So it must have been a
really tiny base image to begin with.

> These days I think VMs make it more common to enlarge a filesystem from a 
> small size. We could have picked this up earlier with a warning from 
> resize2fs; eg. if the block size will no longer match the one that would 
> be chosen by default. That would pick it up before anyone puts 1Kb block 
> size into production.

It would be a bit tricky for resize2fs to do that, since it doesn't
know what might have been in the mke2fs.conf file at the time the file
system was created.  Distributions or individual system administrators
are free to modify that config file.

It is a good idea for resize2fs to give a warning, though.  What I'm
thinking might make sense is that if resize2fs is expanding the file
system by more than, say, a factor of 10x (e.g., expanding a file
system from 10mb to 100mb, or 3mb to 20gb), it should give a warning
that inflating file systems is an anti-pattern that will not
necessarily result in the best file system performance.  Even if the
blocksize isn't 1k, when a file system is shrunk to a very small size
and then expanded to a very large size, the file system will not be
optimal.

For example, the default size of the journal is based on the file
system size.  Like the block size, it can be overridden on the
command-line, but it's unlikely that most people preparing the file
system image will remember to consider this.
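
For example, something like this at mkfs time (the sizes and device here
are purely illustrative, not recommendations):

  mke2fs -t ext4 -b 4096 -J size=64 /dev/vg0/home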

More importantly, when a file system is shrunk, the data blocks are
moved without a whole lot of optimization, and then when the file
system is expanded, files that were pre-loaded into the image and were
located in the parts of the file system that had to be evacuated as
part of the shrinking process remain in whatever fragmented form they
were left in after the shrink operation.

The way things work for Amazon's and Google's cloud is that the image
is created with a size of 8GB or 10GB, and best practice would be to
create a separate EBS volume for the data partition.  This allows easy
upgrade or replacement of the root file system; for example, after you
check your project keys into a public repo (or fail to apply a
security upgrade to an actively exploited zero-day) and your system
gets rooted to a fare-thee-well, it's much simpler to completely throw
away the root image and reinstall a fresh system image, without having
to separate your data files from the system image.

						- Ted


* Re: Maildir quickly hitting max htree
  2021-11-13 17:19     ` Andreas Dilger
@ 2021-11-16 17:52       ` Mark Hills
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Hills @ 2021-11-16 17:52 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Ext4 Developers List


On Sat, 13 Nov 2021, Andreas Dilger wrote:

> >>> On Nov 12, 2021, at 11:37, Mark Hills <mark@xwax.org> wrote:
> >>> 
> >>> Surprised to hit a limit when handling a modest Maildir case; does 
> >>> this reflect a bug?
> >>> 
> >>> rsync'ing to a new mail server, after fewer than 100,000 files there 
> >>> are intermittent failures:
> >> 
> >> This is probably because you are using 1KB blocksize instead of 4KB, 
[...]
> > Is block size the only factor? If so, a patch like below (untested) could 
> > make it clear it's relevant, and saved the question in this case.
> 
> The patch looks reasonable, but should be submitted separately with
> [patch] in the subject so that it will not be lost.  
> 
> You can also add on your patch:
> 
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Thanks. When I get a moment I'll aim to test the patch and submit 
properly.

-- 
Mark


* Re: Maildir quickly hitting max htree
  2021-11-14 17:44     ` Theodore Ts'o
@ 2021-11-16 19:31       ` Mark Hills
  2021-11-17  5:20         ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Hills @ 2021-11-16 19:31 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

On Sun, 14 Nov 2021, Theodore Ts'o wrote:

> On Sat, Nov 13, 2021 at 12:05:07PM +0000, Mark Hills wrote:
> > 
> > Interesting! The 1Kb block size was not explicitly chosen. There was no 
> > plan other than using the defaults.
> > 
> > However I did forget that this is a VM installed from a base image. The 
> > root cause is likely to be that the /home partition has been enlarged from 
> > a small size to 32Gb.
> 
> How small was the base image?

/home was created with 256Mb, never shrunk.

> As documented in the man page for mke2fs.conf, for file systems that are 
> smaller than 3mb, mke2fs uses the parameters in /etc/mke2fs.conf for type 
> "floppy" (back when 3.5 inch floppies were either 1.44MB or 2.88MB).  
> So it must have been a really tiny base image to begin with.

Small, but not microscopic :)

I see a definition in mke2fs.conf for "small" which uses 1024 blocksize, 
and I assume it originated there and not "floppy".
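
For reference, the stanza here looks like this (which I believe matches 
the stock e2fsprogs defaults rather than anything Alpine has changed):

  small = {
          blocksize = 1024
          inode_ratio = 4096
  }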

> > These days I think VMs make it more common to enlarge a filesystem from a 
> > small size. We could have picked this up earlier with a warning from 
> > resize2fs; eg. if the block size will no longer match the one that would 
> > be chosen by default. That would pick it up before anyone puts 1Kb block 
> > size into production.
> 
> It would be a bit tricky for resize2fs to do that, since it doesn't
> know what might have been in the mke2fs.conf file at the time the file
> system was created.  Distributions or individual system
> administrators are free to modify that config file.

No need to time travel back -- it's complicated, and actually less 
relevant?

I haven't looked at resize2fs code, so this comes just from a user's 
point-of-view but... if it is already reading mke2fs.conf, it could make 
comparisons using an equivalent new filesystem as benchmark.

In the spirit of eg. "your resized filesystem will have a block size of 
1024, but a new filesystem of this size would use 4096"
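
Roughly what one can already do by hand today, something like (mke2fs -n 
being a dry run that only prints what it would do):

  $ dumpe2fs -h /dev/vg0/home | grep 'Block size'
  $ mke2fs -n -t ext4 /dev/vg0/home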

Then you can compare any absolute metric of the filesystem that way.

The advantage being...

> It is a good idea for resize2fs to give a warning, though.  What I'm 
> thinking might make sense is that if resize2fs is expanding the file 
> system by more than, say, a factor of 10x (e.g., expanding a file system 
> from 10mb to 100mb, or 3mb to 20gb)

... that the benchmark gives you a comparison that won't drift. eg. if you 
resize by +90% several times.

And reflects any desires that may be in the configuration.

> to give a warning that inflating file systems is an anti-pattern that 
> will not necessarily result in the best file system performance.

I imagine it's not a panacea, but it would be good to be more concrete on 
what the gotchas are; "bad performance" is vague, and since the tool 
exists it must be possible to use it properly.

I'll need to consult the docs, but so far have been made aware of:

* block size
  (which has a knock-on effect on file limits per directory)

* journal size
  (not in configuration file -- can this be adjusted?)

* files get fragmented when shrinking a filesystem
  (but this is similar to any full file system?)

These are all things I'm generally aware of, along with their 
implications; they're just easy to miss when you're busy and focused on 
other aspects (it completely escaped me that the filesystem had been 
enlarged when I began this thread!)

That's why the patch in the other thread is not a bad idea; just reminding 
that block size is relevant.

For info, our use case here is a base image used to deploy persistent 
VMs which use very different disk sizes. The base image is built using 
packer+QEMU, managed as code. It is then written using "dd", and the 
LVM partitions are expanded without needing to go single-user or take 
the system offline. This method is appealing because it allows us to 
pre-populate /home with some small amount of data; SSH keys etc.
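
In outline it is little more than this (names and sizes illustrative):

  $ dd if=home.img of=/dev/vg0/home bs=4M
  $ lvextend -L 32G /dev/vg0/home
  $ resize2fs /dev/vg0/home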

For the case that started this thread, we just wiped the filesystem and 
made a new one at the target size of 32Gb.

> Even if the blocksize isn't 1k, when a file system is shrunk
[...more on shrinking]

Many thanks,

-- 
Mark


* Re: Maildir quickly hitting max htree
  2021-11-16 19:31       ` Mark Hills
@ 2021-11-17  5:20         ` Theodore Ts'o
  2021-11-17 13:13           ` Mark Hills
  0 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2021-11-17  5:20 UTC (permalink / raw)
  To: Mark Hills; +Cc: Andreas Dilger, linux-ext4

On Tue, Nov 16, 2021 at 07:31:10PM +0000, Mark Hills wrote:
> 
> I see a definition in mke2fs.conf for "small" which uses 1024 blocksize, 
> and I assume it originated there and not "floppy".

Ah, yes, I forgot that we also had the "small" config for file systems
less than 512 mb.


There are a bunch of anti-patterns that I've seen with users using
VMs.  And I'm trying to understand them, so we can better document
why folks shouldn't be doing things like that.  For example, one of
the anti-patterns that I see on Cloud systems (e.g., at Amazon, Google
Cloud, Azure, etc.) is people who start with a super-tiny file system,
say, 10GB, and then let it grow until it's 99% full, and then they grow
it by another GB, so it's now 11GB, and then they fill it until it's
99% full, and then they grow it by another GB.  That's because (for
example) Google's PD Standard is 4 cents per GB per month, and they
are trying to cheap out by not wanting to spend that extra 4 cents per
month until they absolutely, positively have to.  Unfortunately, that
leaves the file system horribly fragmented, and performance is
terrible.  (BTW, this is true no matter what file system they use:
ext4, xfs, etc.)

File systems were originally engineered assuming that resizing would
be done in fairly big chunks.  For example, you might have a 50 TB
disk array, and you add another 10TB disk to the array, and you grow
the file system by 10TB.  You can grow it in smaller chunks, but
nothing comes for free: rather than trying to save 4 cents per month,
growing a file system from, say, 10GB to 20GB on Google Cloud and
paying an extra, princely *forty* cents (USD) per month will probably
result in far better performance, which you'll more than make up for
when you consider the cost of the CPU and memory of said VM....

> I haven't looked at resize2fs code, so this comes just from a user's 
> point-of-view but... if it is already reading mke2fs.conf, it could make 
> comparisons using an equivalent new filesystem as benchmark.

Resize2fs doesn't read mke2fs.conf, and my point was that the system
where resize2fs is run is not necessarily the same as the system where
mke2fs is run, especially when it comes to cloud images for the root
file system.

> I imagine it's not a panacea, but it would be good to be more concrete on 
> what the gotchas are; "bad performance" is vague, and since the tool 
> exists it must be possible to use it properly.

Well, we can document the issues in much greater detail in a man page
or an LWN article, but it's a bit complicated to explain it all in
warning messages built into resize2fs.  There's the failure mode of
starting with a 100MB file system containing a root file system,
dropping it on a 10TB disk, or even worse, a 100TB RAID array, and
trying to blow the 100MB file system up to 100TB.  There's the
failure mode of waiting until the file system is 99% full, and then
expanding it one GB at a time, repeatedly, until it's several hundred
GB or TB, and then users wonder why performance is crap.

There are so many different ways one can shoot one's foot off, and
until I understand why people are designing their own particular
foot-guns, it's hard to write a man page warning about all of the
particular bad ways one can be a system administrator.  Unfortunately,
my imagination is not necessarily up to the task of figuring them all
out.  For example...


> For info, our use case here is the base image used to deploy persistent 
> VMs which use very different disk sizes. The base image is built using 
> packer+QEMU managed as code. Then written using "dd" and LVM partitions 
> expanded without needing to go single-user or take the system offline. 
> This method is appealing because it allows us to pre-populate /home with 
> some small amount of data; SSH keys etc.

May I suggest using a tar.gz file instead and unpacking it onto a
freshly created file system?  It's easier to inspect and update the
contents of the tarball, and it's actually going to be smaller than
using a file system image and then trying to expand it using
resize2fs....
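
Something like this, roughly (device and file names purely illustrative):

  mke2fs -t ext4 /dev/vg0/home
  mount /dev/vg0/home /home
  tar -C /home -xzpf home-skel.tar.gz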

To be honest, that particular use case didn't even *occur* to me,
since there are so many more efficient ways it can be done.  I take it
that you're trying to do this before the VM is launched, as opposed to
unpacking it as part of the VM boot process?

If you're using qemu/KVM, perhaps you could drop the tar.gz file in a
directory on the host, and launch the VM using a virtio-9p.  This can
be done by launching qemu with arguments like this:

qemu ... \
     -fsdev local,id=v_tmp,path=/tmp/kvm-xfstests-tytso,security_model=none \
     -device virtio-9p-pci,fsdev=v_tmp,mount_tag=v_tmp

and then in the guest's /etc/fstab, you might have an entry like this:

v_tmp           /vtmp   9p      trans=virtio,version=9p2000.L,msize=262144,nofail,x-systemd.device-timeout=1       0       0

This will result in everything in /tmp/kvm-xfstests-tytso on the host
system being visible as /vtmp in the guest.  A worked example of this
can be found at:

https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/kvm-xfstests#L115
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/kvm-xfstests#L175
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/etc/fstab#L7

If you are using Google Cloud Platform or AWS, you could use Google
Cloud Storage or Amazon S3, respectively, and then just copy the
tar.gz file into /run and unpack it.  An example of this might get
done can be found here for Google Cloud Storage:

https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/usr/local/lib/gce-load-kernel#L65

Cheers,

					- Ted


* Re: Maildir quickly hitting max htree
  2021-11-17  5:20         ` Theodore Ts'o
@ 2021-11-17 13:13           ` Mark Hills
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Hills @ 2021-11-17 13:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

Ted, I can't speak for everyone, but perhaps I can give some insight:

> There are a bunch of anti-patterns that I've seen with users using VM's.  
> And I'm trying to understand them, so we can better document why folks 
> shouldn't be doing things like that.

I wouldn't be so negative :) My default position is 'cloud sceptic', but 
dynamic resources change the trade-offs between layers of abstraction. I 
wouldn't expect to always be correct about someone's overall business when 
saying "don't do that".

> For example, one of the anti-patterns that I see on Cloud systems (e.g., 
> at Amazon, Google Cloud, Azure, etc.) is people who start with a 
> super-tiny file system, say, 10GB,

It sounds interesting that hosting VMs has renewed interest in smaller 
file systems -- more, smaller hosts. 10GB seems a lot to me ;)

> and then let it grow until it's 99% full, and then they grow it by 
> another GB, so it's now 11GB, and then they fill it until it's 99% full, 
> and then they grow it by another GB.  That's because (for example) 
> Google's PD Standard is 4 cents per GB per month, and they are trying to 
> cheap out by not wanting to spend that extra 4 cents per month until 
> they absolutely, positively have to.

One reason: people are rarely working with single hosts. Small costs are 
multiplied up. 4 cents is not a lot per host, but everything is a % 
increase to overall costs.

Cloud providers sold for many years on a "use only what you need" basis, 
so it would not be surprising for people to be tightly optimising that.

[...] 
> > I haven't looked at resize2fs code, so this comes just from a user's 
> > point-of-view but... if it is already reading mke2fs.conf, it could make 
> > comparisons using an equivalent new filesystem as benchmark.
> 
> Resize2fs doesn't read mke2fs.conf, and my point was that the system
> where resize2fs is run is not necessarily the same as the system where
> mke2fs is run, especially when it comes to cloud images for the root
> file system.

Totally understood your point, and sounds like maybe I wasn't clear in 
mine (which got trimmed):

There's no need to worry about the state of mke2fs.conf on the host which 
originally created the filesystem -- that system is no longer relevant.

At the point of expanding you have all the relevant information for an 
appropriate warning.

> > I imagine it's not a panacea, but it would be good to be more concrete 
> > on what the gotchas are; "bad performance" is vague, and since the 
> > tool exists it must be possible to use it properly.
> 
> Well, we can document the issues in much greater detail in a man page, 
> or an LWN article, but it's a bit complicated to explain it all in 
> warning messages built into resize2fs.  There's the failure mode of 
> starting with a 100MB file system containing a root file system, 
> dropping it on a 10TB disk, or even worse, a 100TB RAID array, and 
> trying to blow the 100MB file system up to 100TB.  There's the 
> failure mode of waiting until the file system is 99% full, and then 
> expanding it one GB at a time, repeatedly, until it's several hundred GB 
> or TB, and then users wonder why performance is crap.
> 
> There are so many different ways one can shoot one's foot off, and
> until I understand why people are designing their own particular
> foot-guns, it's hard to write a man page warning about all of the
> particular bad ways one can be a system administrator.  Unfortunately,
> my imagination is not necessarily up to the task of figuring them all
> out

Neither your imagination, nor mine, nor anyone else's :) People will do 
cranky things.

But also nobody reasonable is expecting you to do their job for them.

You haven't really said what the major underlying properties of the 
filesystem are that are inflexible when resizing, so I'm keen for more 
detail.

It's the sort of information that makes for better reasoning; at least a 
hint at the appropriate trade-offs; and it saves having to keep explaining 
on mailing lists ;)

But saying "performance is crap" or "failure mode" is vague and so I can 
understand why people persevere (either knowingly or unknowingly) -- in my 
experience, many people work on the basis that something 'seems' to work 
ok.

Trying to be tangible here, I made a start on a section for the resize2fs 
man page. Would it be worthwhile to flesh this out and would you consider 
helping to do that?

  CAVEATS

    Re-sizing a filesystem should not be assumed to result in a filesystem 
    with an identical specification to one created at the new size. More 
    often than not, this is not the case.

    Specifically, enlarging or shrinking a filesystem does not resize 
    these resources:

        * block size, which impacts directory indexes and the upper limit 
          on number of files in a directory;

        * journal size which affects [write performance?]

        * files which have become fragmented due to space constraints;
          see e2freefrag(8) and e4defrag(8)

        * [any more? or is this even comprehensive? Only
           the major contributors needed to begin with]

It really doesn't need to be very long; just a signpost in the right 
direction.

In my case, I have some knowledge of filesystem internals (much less than 
you, but probably more than most) but had completely forgotten this was a 
resized filesystem (as well as resized more than the original image ever 
intended). It just takes a nudge/reminder in the right direction, not much more.

The tiny patch to dmesg (elsewhere in the thread) would have indicated the 
relevance of the block size, reminding me of the resize; and a tiny 
addition to the man page would help decide what action to take -- 
reformat, or not.

> For example...
> 
> > For info, our use case here is the base image used to deploy persistent 
> > VMs which use very different disk sizes. The base image is built using 
> > packer+QEMU managed as code. Then written using "dd" and LVM partitions 
> > expanded without needing to go single-user or take the system offline. 
> > This method is appealing because it allows us to pre-populate /home with 
> > some small amount of data; SSH keys etc.
> 
> May I suggest using a tar.gz file instead and unpacking it onto a
> freshly created file system?  It's easier to inspect and update the
> contents of the tarball, and it's actually going to be smaller than
> using a file system image and then trying to expand it using
> resize2fs....
> 
> To be honest, that particular use case didn't even *occur* to me,
> since there are so many more efficient ways it can be done.

But this considers 'efficiency' in the context of filesystem performance 
only.

Unpacking a tar.gz requires custom scripting, is slow, and the extra 
'one off' steps on boot-up introduce complexity. These are the 
'installers' that everyone hates :)

Also I presume there is COW at the image level on some infrastructure.

And none of this covers changing the size of an online system.

> I take it that you're trying to do this before the VM is launched, as 
> opposed to unpacking it as part of the VM boot process?

Yes; and I think that's really the spirit of an "image", right?
 
> If you're using qemu/KVM, perhaps you could drop the tar.gz file in a
> directory on the host, and launch the VM using a virtio-9p.  This can
> be done by launching qemu with arguments like this:
> 
> qemu ... \
[...]

We use qemu+Makefile to build the images, but for running on their 
infrastructure most cloud providers are limited; controls like this are 
not available.

> If you are using Google Cloud Platform or AWS, you could use Google 
> Cloud Storage or Amazon S3, respectively, and then just copy the tar.gz 
> file into /run and unpack it.

We're not using Google or AWS. In general I can envisage extra work to 
construct a secure side-channel to distribute supplementary .tar.gz 
files.

I don't think you should be disheartened by people resizing in these ways. 
I'm not an advocate, and understand your points.

But it sounds like it is allowing people to achieve things, and it _is_ a 
positive sign that the abstractions are well designed -- leaving users 
unaware of the caveats, which are hidden from them until they bite.

A rhetorical question, asked with no prior knowledge: if there are 
benefits to these extreme resizes then, rather than say "don't do that", 
could it be possible to generalise the maintenance of any filesystem as 
creating a 'zero-sized' one and resizing it upwards? ie. the real code 
exists in resize, not in creation.

Thanks

-- 
Mark

