* Maildir quickly hitting max htree
@ 2021-11-12 19:52 Mark Hills
  [not found] ` <36FABD31-B636-4D94-B14D-93F3D2B4C148@dilger.ca>
  0 siblings, 1 reply; 8+ messages in thread

From: Mark Hills @ 2021-11-12 19:52 UTC (permalink / raw)
To: linux-ext4

Surprised to hit a limit when handling a modest Maildir case; does this
reflect a bug?

rsync'ing to a new mail server, after fewer than 100,000 files there are
intermittent failures:

  rsync: [receiver] open "/home/mark/Maildir/.robot/cur/1633731549.M990187P7732.yello.[redacted],S=69473,W=70413:2," failed: No space left on device (28)
  rsync: [receiver] rename "/home/mark/Maildir/.robot/cur/.1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,.oBphKA" -> ".robot/cur/1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,": No space left on device (28)

The kernel:

  EXT4-fs warning (device dm-4): ext4_dx_add_entry:2351: Directory (ino: 225811) index full, reach max htree level :2
  EXT4-fs warning (device dm-4): ext4_dx_add_entry:2355: Large directory feature is not enabled on this filesystem

Reaching for 'large_dir' seems premature, as this feature is reported as
useful for 10M+ files, but this is much lower. A 'bad' filename will fail
consistently. Assuming the 10M+ absolute limit, is the tree grossly
imbalanced? Intuitively, 'htree level :2' does not sound particularly
deep.

The source folder is 195,000 files -- large, but not crazy. rsync
eventually hit a ceiling having written 177,482 of the files. I can still
create new ones on the command line with non-Maildir names.

Ruled out quotas by disabling them with "tune2fs -O ^quota" and
remounting.

See below for additional info.

--
Mark

$ uname -a
Linux floyd 5.10.78-0-virt #1-Alpine SMP Thu, 11 Nov 2021 14:31:09 +0000 x86_64 GNU/Linux

$ mke2fs -q -t ext4 /dev/vg0/home

$ rsync -va --exclude 'dovecot*' yello:Maildir/. $HOME/Maildir

$ ls | head -15
1605139205.M487508P91922.yello.[redacted],S=7625,W=7775:2,
1605139440.M413280P92363.yello.[redacted],S=7632,W=7782:2,
1605139466.M699663P92402.yello.[redacted],S=7560,W=7710:2,
1605139479.M651510P92421.yello.[redacted],S=7474,W=7623:2,
1605139508.M934351P92514.yello.[redacted],S=7626,W=7776:2,
1605139596.M459228P92713.yello.[redacted],S=7559,W=7709:2,
1605139645.M57446P92736.yello.[redacted],S=7632,W=7782:2,
1605139670.M964535P92758.yello.[redacted],S=7628,W=7778:2,
1605139697.M273694P92807.yello.[redacted],S=7632,W=7782:2,
1605139748.M607989P92853.yello.[redacted],S=7560,W=7710:2,
1605139759.M655635P92868.yello.[redacted],S=5912,W=6018:2,
1605139808.M338286P93071.yello.[redacted],S=7628,W=7778:2,
1605139961.M915501P93235.yello.[redacted],S=7625,W=7775:2,
1605140303.M219848P93591.yello.[redacted],S=6898,W=7023:2,
1605140580.M166212P93921.yello.[redacted],S=6896,W=7021:2,

$ touch abc
[success]

$ touch 1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,
touch: cannot touch '1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,': No space left on device

$ dumpe2fs /dev/vg0/home
Filesystem volume name:   <none>
Last mounted on:          /home
Filesystem UUID:          ad26c968-d057-4d44-bef9-1e2df347580e
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              5225472
Block count:              21229568
Reserved block count:     851459
Overhead clusters:        22361
Free blocks:              8058180
Free inodes:              4799979
First block:              1
Block size:               1024
Fragment size:            1024
Group descriptor size:    64
Reserved GDT blocks:      96
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2016
Inode blocks per group:   504
Flex block group size:    16
Filesystem created:       Mon Nov  8 13:14:56 2021
Last mount time:          Fri Nov 12 18:43:14 2021
Last write time:          Fri Nov 12 18:43:14 2021
Mount count:              27
Maximum mount count:      -1
Last checked:             Mon Nov  8 13:14:56 2021
Check interval:           0 (<none>)
Lifetime writes:          14 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      839d2871-b97e-456d-9724-096db15931b8
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x5974a8b1
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Total journal size:       4096k
Total journal blocks:     4096
Max transaction length:   4096
Fast commit length:       0
Journal sequence:         0x00000a2a
Journal start:            702
Journal checksum type:    crc32c
Journal checksum:         0x4d693e79

^ permalink raw reply	[flat|nested] 8+ messages in thread
[parent not found: <36FABD31-B636-4D94-B14D-93F3D2B4C148@dilger.ca>]
* Re: Maildir quickly hitting max htree
  [not found] ` <36FABD31-B636-4D94-B14D-93F3D2B4C148@dilger.ca>
@ 2021-11-13 12:05   ` Mark Hills
  2021-11-13 17:19     ` Andreas Dilger
  2021-11-14 17:44     ` Theodore Ts'o
  0 siblings, 2 replies; 8+ messages in thread

From: Mark Hills @ 2021-11-13 12:05 UTC (permalink / raw)
To: Andreas Dilger; +Cc: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 2733 bytes --]

Andreas, thanks for such a prompt reply.

On Fri, 12 Nov 2021, Andreas Dilger wrote:

> On Nov 12, 2021, at 11:37, Mark Hills <mark@xwax.org> wrote:
> >
> > Surprised to hit a limit when handling a modest Maildir case; does
> > this reflect a bug?
> >
> > rsync'ing to a new mail server, after fewer than 100,000 files there
> > are intermittent failures:
>
> This is probably because you are using 1KB blocksize instead of 4KB,
> which reduces the size of each tree level by the cube of the ratio, so
> 64x. I guess that was selected because of very small files in the
> maildir?

Interesting! The 1KB block size was not explicitly chosen. There was no
plan other than using the defaults.

However, I did forget that this is a VM installed from a base image. The
root cause is likely to be that the /home partition has been enlarged
from a small size to 32GB.

Is block size the only factor? If so, a patch like the one below
(untested) could make it clear that it is relevant, and would have saved
the question in this case.

[...]

> If you have a relatively recent kernel, you can enable the "large_dir"
> feature to allow 3-level htree, which would be enough for another
> factor of 1024/8 = 128 more entries than now (~12M).

The system is not yet in use, so I think it's better we reformat here,
and get a block size chosen by the experts :)

These days I think VMs make it more common to enlarge a filesystem from
a small size. We could have picked this up earlier with a warning from
resize2fs; eg. if the block size will no longer match the one that would
be chosen by default. That would pick it up before anyone puts 1KB block
size into production.

Thanks for identifying the issue.

--
Mark

From 8604c50be77a4bc56a91099598c409d5a3c1fdbe Mon Sep 17 00:00:00 2001
From: Mark Hills <mark@xwax.org>
Date: Sat, 13 Nov 2021 11:46:50 +0000
Subject: [PATCH] Block size has an effect on the index size

---
 fs/ext4/namei.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f3bbcd4efb56..8965bed4d7ff 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2454,8 +2454,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
 	}
 	if (add_level && levels == ext4_dir_htree_level(sb)) {
 		ext4_warning(sb, "Directory (ino: %lu) index full, "
-				 "reach max htree level :%d",
-				 dir->i_ino, levels);
+				 "reach max htree level :%d "
+				 "with block size %lu",
+				 dir->i_ino, levels, sb->s_blocksize);
 		if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
 			ext4_warning(sb, "Large directory feature is "
 					 "not enabled on this "
-- 
2.33.1
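Andreas's 64x figure and the ceiling observed above can be cross-checked with rough arithmetic. The sketch below is illustrative only: the 8-byte index entry and 4-byte name padding are assumed constants, and block headers, tails and hash collisions are ignored, so treat the results as order-of-magnitude estimates rather than exact ext4 limits.

```python
# Rough capacity estimate for an ext4 htree directory index.
# Assumptions (not read from any real filesystem): each interior
# index entry is 8 bytes; block headers and hash collisions are
# ignored, so this is order-of-magnitude only.

def htree_capacity(block_size, avg_name_len, index_levels=2):
    fanout = block_size // 8                    # index entries per block
    # a leaf entry: 8-byte header plus the name, padded to 4 bytes
    entry_size = 8 + (avg_name_len + 3) // 4 * 4
    leaf_entries = block_size // entry_size
    return fanout ** index_levels * leaf_entries

# ~70-character Maildir filenames, as in the listing above
print(htree_capacity(1024, 70))   # 1KB blocks: ~196K entries
print(htree_capacity(4096, 70))   # 4KB blocks: ~13M entries
```

With 1KB blocks the estimate lands in the same region as the 177,482 files rsync managed to write before failing; with 4KB blocks the two-level limit moves past ten million, which is why the 10M+ figure usually quoted alongside large_dir did not apply here.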
* Re: Maildir quickly hitting max htree
  2021-11-13 12:05 ` Mark Hills
@ 2021-11-13 17:19   ` Andreas Dilger
  2021-11-16 17:52     ` Mark Hills
  2021-11-14 17:44   ` Theodore Ts'o
  1 sibling, 1 reply; 8+ messages in thread

From: Andreas Dilger @ 2021-11-13 17:19 UTC (permalink / raw)
To: Mark Hills; +Cc: Ext4 Developers List

On Nov 13, 2021, at 04:05, Mark Hills <mark@xwax.org> wrote:
>
> Andreas, thanks for such a prompt reply.
>
>> On Fri, 12 Nov 2021, Andreas Dilger wrote:
>>
>>> On Nov 12, 2021, at 11:37, Mark Hills <mark@xwax.org> wrote:
>>>
>>> Surprised to hit a limit when handling a modest Maildir case; does
>>> this reflect a bug?
>>>
>>> rsync'ing to a new mail server, after fewer than 100,000 files there
>>> are intermittent failures:
>>
>> This is probably because you are using 1KB blocksize instead of 4KB,
>> which reduces the size of each tree level by the cube of the ratio, so
>> 64x. I guess that was selected because of very small files in the
>> maildir?
>
> Interesting! The 1KB block size was not explicitly chosen. There was no
> plan other than using the defaults.
>
> However, I did forget that this is a VM installed from a base image.
> The root cause is likely to be that the /home partition has been
> enlarged from a small size to 32GB.
>
> Is block size the only factor? If so, a patch like the one below
> (untested) could make it clear that it is relevant, and would have
> saved the question in this case.

The patch looks reasonable, but should be submitted separately with
[PATCH] in the subject so that it will not be lost.

You can also add on your patch:

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Cheers, Andreas

> [...]
>> If you have a relatively recent kernel, you can enable the "large_dir"
>> feature to allow 3-level htree, which would be enough for another
>> factor of 1024/8 = 128 more entries than now (~12M).
>
> The system is not yet in use, so I think it's better we reformat here,
> and get a block size chosen by the experts :)
>
> These days I think VMs make it more common to enlarge a filesystem
> from a small size. We could have picked this up earlier with a warning
> from resize2fs; eg. if the block size will no longer match the one
> that would be chosen by default. That would pick it up before anyone
> puts 1KB block size into production.
>
> Thanks for identifying the issue.
>
> --
> Mark
>
> From 8604c50be77a4bc56a91099598c409d5a3c1fdbe Mon Sep 17 00:00:00 2001
> From: Mark Hills <mark@xwax.org>
> Date: Sat, 13 Nov 2021 11:46:50 +0000
> Subject: [PATCH] Block size has an effect on the index size
>
> ---
> fs/ext4/namei.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index f3bbcd4efb56..8965bed4d7ff 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2454,8 +2454,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
> 	}
> 	if (add_level && levels == ext4_dir_htree_level(sb)) {
> 		ext4_warning(sb, "Directory (ino: %lu) index full, "
> -				 "reach max htree level :%d",
> -				 dir->i_ino, levels);
> +				 "reach max htree level :%d "
> +				 "with block size %lu",
> +				 dir->i_ino, levels, sb->s_blocksize);
> 		if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
> 			ext4_warning(sb, "Large directory feature is "
> 					 "not enabled on this "
> --
> 2.33.1
* Re: Maildir quickly hitting max htree
  2021-11-13 17:19 ` Andreas Dilger
@ 2021-11-16 17:52   ` Mark Hills
  0 siblings, 0 replies; 8+ messages in thread

From: Mark Hills @ 2021-11-16 17:52 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 905 bytes --]

On Sat, 13 Nov 2021, Andreas Dilger wrote:

> >>> On Nov 12, 2021, at 11:37, Mark Hills <mark@xwax.org> wrote:
> >>>
> >>> Surprised to hit a limit when handling a modest Maildir case; does
> >>> this reflect a bug?
> >>>
> >>> rsync'ing to a new mail server, after fewer than 100,000 files there
> >>> are intermittent failures:
> >>
> >> This is probably because you are using 1KB blocksize instead of 4KB,
[...]
> > Is block size the only factor? If so, a patch like the one below
> > (untested) could make it clear that it is relevant, and would have
> > saved the question in this case.
>
> The patch looks reasonable, but should be submitted separately with
> [PATCH] in the subject so that it will not be lost.
>
> You can also add on your patch:
>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Thanks. When I get a moment I'll aim to test the patch and submit it
properly.

--
Mark
* Re: Maildir quickly hitting max htree
  2021-11-13 12:05 ` Mark Hills
  2021-11-13 17:19   ` Andreas Dilger
@ 2021-11-14 17:44   ` Theodore Ts'o
  2021-11-16 19:31     ` Mark Hills
  1 sibling, 1 reply; 8+ messages in thread

From: Theodore Ts'o @ 2021-11-14 17:44 UTC (permalink / raw)
To: Mark Hills; +Cc: Andreas Dilger, linux-ext4

On Sat, Nov 13, 2021 at 12:05:07PM +0000, Mark Hills wrote:
>
> Interesting! The 1KB block size was not explicitly chosen. There was no
> plan other than using the defaults.
>
> However, I did forget that this is a VM installed from a base image.
> The root cause is likely to be that the /home partition has been
> enlarged from a small size to 32GB.

How small was the base image? As documented in the man page for
mke2fs.conf, for file systems that are smaller than 3MB, mke2fs uses the
parameters in /etc/mke2fs.conf for type "floppy" (back when 3.5 inch
floppies were either 1.44MB or 2.88MB). So it must have been a really
tiny base image to begin with.

> These days I think VMs make it more common to enlarge a filesystem
> from a small size. We could have picked this up earlier with a warning
> from resize2fs; eg. if the block size will no longer match the one
> that would be chosen by default. That would pick it up before anyone
> puts 1KB block size into production.

It would be a bit tricky for resize2fs to do that, since it doesn't know
what might have been in the mke2fs.conf file at the time the file system
was created. Distributions or individual system administrators are free
to modify that config file.

It is a good idea for resize2fs to give a warning, though. What I'm
thinking might make sense is: if resize2fs is expanding the file system
by more than, say, a factor of 10x (e.g., expanding a file system from
10MB to 100MB, or 3MB to 20GB), it could give a warning that inflating
file systems is an anti-pattern that will not necessarily result in the
best file system performance.

Even if the blocksize isn't 1k, when a file system is shrunk to a very
small size, and then expanded to a very large size, the file system will
not be optimal. For example, the default size of the journal is based on
the file system size. Like the block size, it can be overridden on the
command line, but it's unlikely that most people preparing the file
image will remember to consider this.

More importantly, when a file system is shrunk, the data blocks are
moved without a whole lot of optimization, and then when the file system
is expanded, files that were pre-loaded into the image, and were located
in the parts of the file system that had to be evacuated as part of the
shrinking process, remain in whatever fragmented form they were left in
after the shrink operation.

The way things work for Amazon's and Google's cloud is that the image is
created with a size of 8GB or 10GB, and best practice would be to create
a separate EBS volume for the data partition. This allows the easy
upgrade or replacement of the root file system. For example, after you
check your project keys into a public repo (or you fail to apply a
security upgrade for an actively exploited zero-day) and your system
gets rooted to a fare-thee-well, it's much simpler to completely throw
away the root image and reinstall a fresh system image, without having
to separate your data files from the system image.

					- Ted
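The 10x heuristic suggested above could be sketched as follows. This is hypothetical code: resize2fs contains no such check, and the function name and threshold here are illustrative only, taken from the suggestion in the text.

```python
# Hypothetical warning heuristic for resize2fs, sketching the
# suggestion above. Neither the function nor the threshold exists
# in the real tool; the 10x factor is the one proposed in the text.

GROWTH_WARN_FACTOR = 10

def should_warn_on_resize(old_blocks, new_blocks):
    """True if growth is large enough that the block size, journal
    size and layout chosen for the old size may now be a poor fit."""
    return new_blocks > old_blocks * GROWTH_WARN_FACTOR

# e.g. a 256MB image grown to 32GB, counted in 1KB blocks (128x)
print(should_warn_on_resize(256 * 1024, 32 * 1024 * 1024))   # True
```

The check needs only the old and new sizes, both of which resize2fs already knows, so it avoids the mke2fs.conf portability problem entirely.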
* Re: Maildir quickly hitting max htree
  2021-11-14 17:44 ` Theodore Ts'o
@ 2021-11-16 19:31   ` Mark Hills
  2021-11-17  5:20     ` Theodore Ts'o
  0 siblings, 1 reply; 8+ messages in thread

From: Mark Hills @ 2021-11-16 19:31 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

On Sun, 14 Nov 2021, Theodore Ts'o wrote:

> On Sat, Nov 13, 2021 at 12:05:07PM +0000, Mark Hills wrote:
> >
> > Interesting! The 1KB block size was not explicitly chosen. There was
> > no plan other than using the defaults.
> >
> > However, I did forget that this is a VM installed from a base image.
> > The root cause is likely to be that the /home partition has been
> > enlarged from a small size to 32GB.
>
> How small was the base image?

/home was created with 256MB, never shrunk.

> As documented in the man page for mke2fs.conf, for file systems that
> are smaller than 3MB, mke2fs uses the parameters in /etc/mke2fs.conf
> for type "floppy" (back when 3.5 inch floppies were either 1.44MB or
> 2.88MB). So it must have been a really tiny base image to begin with.

Small, but not microscopic :) I see a definition in mke2fs.conf for
"small" which uses 1024 blocksize, and I assume it originated there and
not "floppy".

> > These days I think VMs make it more common to enlarge a filesystem
> > from a small size. We could have picked this up earlier with a
> > warning from resize2fs; eg. if the block size will no longer match
> > the one that would be chosen by default. That would pick it up before
> > anyone puts 1KB block size into production.
>
> It would be a bit tricky for resize2fs to do that, since it doesn't
> know what might have been in the mke2fs.conf file at the time the file
> system was created. Distributions or individual system administrators
> are free to modify that config file.

No need to time travel back -- it's complicated, and actually less
relevant? I haven't looked at the resize2fs code, so this comes just
from a user's point of view, but if it is already reading mke2fs.conf it
could make comparisons using an equivalent new filesystem as a
benchmark. In the spirit of eg.

  "your resized filesystem will have a block size of 1024, but a new
  filesystem of this size would use 4096"

Then you can compare any absolute metric of the filesystem that way. The
advantage being...

> It is a good idea for resize2fs to give a warning, though. What I'm
> thinking might make sense is: if resize2fs is expanding the file
> system by more than, say, a factor of 10x (e.g., expanding a file
> system from 10MB to 100MB, or 3MB to 20GB)

... that the benchmark gives you a comparison that won't drift, eg. if
you resize by +90% several times. And it reflects any desires that may
be in the configuration.

> it could give a warning that inflating file systems is an anti-pattern
> that will not necessarily result in the best file system performance.

I imagine it's not a panacea, but it would be good to be more concrete
on what the gotchas are; "bad performance" is vague, and since the tool
exists it must be possible to use it properly.

I'll need to consult the docs, but so far I have been made aware of:

* block size (which has a knock-on effect on file limits per directory)

* journal size (not in the configuration file -- can this be adjusted?)

* files get fragmented when shrinking a filesystem (but this is similar
  to any full file system?)

These are all things I'm generally aware of, along with their
implications; they are just easy to miss when you're busy and focused on
other aspects (it completely escaped me that the filesystem had been
enlarged when I began this thread!) That's why the patch in the other
thread is not a bad idea; it's just a reminder that block size is
relevant.

For info, our use case here is a base image used to deploy persistent
VMs which use very different disk sizes. The base image is built using
packer+QEMU, managed as code. It is then written using "dd" and the LVM
partitions expanded, without needing to go single-user or take the
system offline. This method is appealing because it allows us to
pre-populate /home with some small amount of data; SSH keys etc.

For the case that started this thread, we just wiped the filesystem and
made a new one at the target size of 32GB.

> Even if the blocksize isn't 1k, when a file system is shrunk

[...more on shrinking]

Many thanks,

--
Mark
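The "compare against an equivalent new filesystem" idea above could be sketched like this. The size thresholds are taken from a stock mke2fs.conf as discussed in this thread ("floppy" and "small" types, both under 512MiB, default to 1KiB blocks; larger filesystems get 4KiB); distributions may ship different defaults, so these numbers are illustrative, not authoritative.

```python
# The comparison resize2fs could make, sketched with the size
# thresholds of a stock mke2fs.conf. Hypothetical helper names;
# no such check exists in the real tool.

MIB = 1024 * 1024

def default_block_size(fs_bytes):
    # "floppy" (<3MiB) and "small" (<512MiB) both use 1KiB blocks
    return 1024 if fs_bytes < 512 * MIB else 4096

def resize_block_size_mismatch(current_block_size, new_fs_bytes):
    """True if a filesystem freshly created at the new size would get
    a different default block size than the one being resized -- the
    case where a warning would have helped in this thread."""
    return default_block_size(new_fs_bytes) != current_block_size

# the 256MiB image from this thread, grown to 32GiB
print(resize_block_size_mismatch(1024, 32 * 1024 * MIB))   # True
```

Because the comparison is made against the target size at resize time, it does not drift across repeated small resizes, which is the advantage argued for above.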
* Re: Maildir quickly hitting max htree
  2021-11-16 19:31 ` Mark Hills
@ 2021-11-17  5:20   ` Theodore Ts'o
  2021-11-17 13:13     ` Mark Hills
  0 siblings, 1 reply; 8+ messages in thread

From: Theodore Ts'o @ 2021-11-17  5:20 UTC (permalink / raw)
To: Mark Hills; +Cc: Andreas Dilger, linux-ext4

On Tue, Nov 16, 2021 at 07:31:10PM +0000, Mark Hills wrote:
>
> I see a definition in mke2fs.conf for "small" which uses 1024
> blocksize, and I assume it originated there and not "floppy".

Ah, yes, I forgot that we also had the "small" config for file systems
less than 512MB.

There are a bunch of anti-patterns that I've seen from users using VMs,
and I'm trying to understand them so we can better document why folks
shouldn't be doing things like that.

For example, one of the anti-patterns that I see on cloud systems (e.g.,
at Amazon, Google Cloud, Azure, etc.) is people who start with a
super-tiny file system, say, 10GB, and then let it grow until it's 99%
full, and then they grow it by another GB, so it's now 11GB, and then
they fill it until it's 99% full, and then they grow it by another GB.
That's because (for example) Google's PD Standard is 4 cents per GB per
month, and they are trying to cheap out by not wanting to spend that
extra 4 cents per month until they absolutely, positively have to.
Unfortunately, that leaves the file system horribly fragmented, and
performance is terrible. (BTW, this is true no matter what file system
they use: ext4, xfs, etc.)

File systems were originally engineered assuming that resizing would be
done in fairly big chunks. For example, you might have a 50TB disk
array, and you add another 10TB disk to the array, and you grow the file
system by 10TB. You can grow it in smaller chunks, but nothing comes for
free, and trying to save 4 cents per month -- as opposed to growing a
file system from, say, 10GB to 20GB on Google Cloud and paying an extra,
princely *forty* cents (USD) per month -- will probably result in far
better performance, which you'll more than make up when you consider the
cost of the CPU and memory of said VM....

> I haven't looked at the resize2fs code, so this comes just from a
> user's point of view, but if it is already reading mke2fs.conf, it
> could make comparisons using an equivalent new filesystem as a
> benchmark.

Resize2fs doesn't read mke2fs.conf, and my point was that the system
where resize2fs is run is not necessarily the same as the system where
mke2fs was run, especially when it comes to cloud images for the root
file system.

> I imagine it's not a panacea, but it would be good to be more concrete
> on what the gotchas are; "bad performance" is vague, and since the
> tool exists it must be possible to use it properly.

Well, we can document the issues in much greater detail in a man page or
an LWN article, but it's a bit complicated to explain it all in warning
messages built into resize2fs. There's the failure mode of starting with
a 100MB file system containing a root file system, dropping it on a 10TB
disk, or even worse, a 100TB raid array, and trying to blow the 100MB
file system up to 100TB. There's the failure mode of waiting until the
file system is 99% full, and then expanding it one GB at a time,
repeatedly, until it's several hundred GB or TB, and then users wonder
why performance is crap.

There are so many different ways one can shoot one's foot off, and until
I understand why people are designing their own particular foot-guns,
it's hard to write a man page warning about all of the particular bad
ways one can be a system administrator. Unfortunately, my imagination is
not necessarily up to the task of figuring them all out. For example...
> For info, our use case here is a base image used to deploy persistent
> VMs which use very different disk sizes. The base image is built using
> packer+QEMU, managed as code. It is then written using "dd" and the
> LVM partitions expanded, without needing to go single-user or take the
> system offline. This method is appealing because it allows us to
> pre-populate /home with some small amount of data; SSH keys etc.

May I suggest using a tar.gz file instead, and unpacking it onto a
freshly created file system? It's easier to inspect and update the
contents of the tarball, and it's actually going to be smaller than
using a file system image and then trying to expand it using
resize2fs....

To be honest, that particular use case didn't even *occur* to me, since
there are so many more efficient ways it can be done. I take it that
you're trying to do this before the VM is launched, as opposed to
unpacking it as part of the VM boot process?

If you're using qemu/KVM, perhaps you could drop the tar.gz file in a
directory on the host, and launch the VM using virtio-9p. This can be
done by launching qemu with arguments like this:

    qemu ... \
       -fsdev local,id=v_tmp,path=/tmp/kvm-xfstests-tytso,security_model=none \
       -device virtio-9p-pci,fsdev=v_tmp,mount_tag=v_tmp

and then in the guest's /etc/fstab, you might have an entry like this:

    v_tmp /vtmp 9p trans=virtio,version=9p2000.L,msize=262144,nofail,x-systemd.device-timeout=1 0 0

This will result in everything in /tmp/kvm-xfstests-tytso on the host
system being visible as /vtmp in the guest.

A worked example of this can be found at:

https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/kvm-xfstests#L115
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/kvm-xfstests#L175
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/etc/fstab#L7

If you are using Google Cloud Platform or AWS, you could use Google
Cloud Storage or Amazon S3, respectively, and then just copy the tar.gz
file into /run and unpack it. An example of how this might be done can
be found here for Google Cloud Storage:

https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/usr/local/lib/gce-load-kernel#L65

Cheers,

					- Ted
* Re: Maildir quickly hitting max htree
  2021-11-17  5:20 ` Theodore Ts'o
@ 2021-11-17 13:13   ` Mark Hills
  0 siblings, 0 replies; 8+ messages in thread

From: Mark Hills @ 2021-11-17 13:13 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

Ted, I can't speak for everyone, but perhaps I can give some insight:

> There are a bunch of anti-patterns that I've seen from users using VMs,
> and I'm trying to understand them so we can better document why folks
> shouldn't be doing things like that.

I wouldn't be so negative :) My default position is 'cloud sceptic', but
dynamic resources change the trade-offs between layers of abstraction. I
wouldn't expect to always be correct about someone's overall business
when saying "don't do that".

> For example, one of the anti-patterns that I see on cloud systems
> (e.g., at Amazon, Google Cloud, Azure, etc.) is people who start with
> a super-tiny file system, say, 10GB,

It sounds interesting that hosting VMs has renewed interest in smaller
file systems -- more, smaller hosts. 10GB seems a lot to me ;)

> and then let it grow until it's 99% full, and then they grow it by
> another GB, so it's now 11GB, and then they fill it until it's 99%
> full, and then they grow it by another GB. That's because (for
> example) Google's PD Standard is 4 cents per GB per month, and they
> are trying to cheap out by not wanting to spend that extra 4 cents per
> month until they absolutely, positively have to.

One reason: people are rarely working with single hosts. Small costs are
multiplied up. 4 cents is not a lot per host, but everything is a %
increase to overall costs. Cloud providers sold for many years on "use
only what you need", so it would not be surprising for people to be
tightly optimising for that.

[...]

> > I haven't looked at the resize2fs code, so this comes just from a
> > user's point of view, but if it is already reading mke2fs.conf, it
> > could make comparisons using an equivalent new filesystem as a
> > benchmark.
>
> Resize2fs doesn't read mke2fs.conf, and my point was that the system
> where resize2fs is run is not necessarily the same as the system where
> mke2fs was run, especially when it comes to cloud images for the root
> file system.

Totally understood your point, and it sounds like maybe I wasn't clear
in mine (which got trimmed): there's no need to worry about the state of
mke2fs.conf on the host which originally created the filesystem -- that
system is no longer relevant. At the point of expanding, you have all
the relevant information for an appropriate warning.

> > I imagine it's not a panacea, but it would be good to be more
> > concrete on what the gotchas are; "bad performance" is vague, and
> > since the tool exists it must be possible to use it properly.
>
> Well, we can document the issues in much greater detail in a man page
> or an LWN article, but it's a bit complicated to explain it all in
> warning messages built into resize2fs. There's the failure mode of
> starting with a 100MB file system containing a root file system,
> dropping it on a 10TB disk, or even worse, a 100TB raid array, and
> trying to blow the 100MB file system up to 100TB. There's the failure
> mode of waiting until the file system is 99% full, and then expanding
> it one GB at a time, repeatedly, until it's several hundred GB or TB,
> and then users wonder why performance is crap.
>
> There are so many different ways one can shoot one's foot off, and
> until I understand why people are designing their own particular
> foot-guns, it's hard to write a man page warning about all of the
> particular bad ways one can be a system administrator. Unfortunately,
> my imagination is not necessarily up to the task of figuring them all
> out.

Neither your imagination, nor mine, nor anyone else's :) People will do
cranky things. But also, nobody reasonable is expecting you to do their
job for them.
You haven't really said what the major underlying properties of the
filesystem are that are inflexible when re-sizing, so I'm keen for more
detail. It's the sort of information that makes for better reasoning; at
least a hint of the appropriate trade-offs; and not having to keep
explaining on mailing lists ;) But saying "performance is crap" or
"failure mode" is vague, and so I can understand why people persevere
(either knowingly or unknowingly) -- in my experience, many people work
on the basis that something 'seems' to work ok.

Trying to be tangible here, I made a start on a section for the
resize2fs man page. Would it be worthwhile to flesh this out, and would
you consider helping to do that?

CAVEATS
    Re-sizing a filesystem should not be assumed to result in a
    filesystem with an identical specification to one created at the
    new size. More often than not, it will differ. Specifically,
    enlarging or shrinking a filesystem does not resize these
    resources:

    * block size, which impacts directory indexes and the upper limit
      on the number of files in a directory;

    * journal size, which affects [write performance?];

    * files which have become fragmented due to space constraints; see
      e2freefrag(8) and e4defrag(8)

    * [any more? or is this even comprehensive? Only the major
      contributors needed to begin with]

It really doesn't need to be very long; just a signpost in the right
direction.

In my case, I have some knowledge of filesystem internals (much less
than you, but probably more than most) but had completely forgotten this
was a resized filesystem (as well as resized more than the original
image ever intended). It just takes a nudge/reminder in that direction,
not much more. The tiny patch to dmesg (elsewhere in the thread) would
have indicated the relevance of the block size, reminding me of the
resize; and a tiny addition to the man page would help decide what
action to take -- reformat, or not.

> For example...
>
> > For info, our use case here is a base image used to deploy
> > persistent VMs which use very different disk sizes. The base image
> > is built using packer+QEMU, managed as code. It is then written
> > using "dd" and the LVM partitions expanded, without needing to go
> > single-user or take the system offline. This method is appealing
> > because it allows us to pre-populate /home with some small amount of
> > data; SSH keys etc.
>
> May I suggest using a tar.gz file instead, and unpacking it onto a
> freshly created file system? It's easier to inspect and update the
> contents of the tarball, and it's actually going to be smaller than
> using a file system image and then trying to expand it using
> resize2fs....
>
> To be honest, that particular use case didn't even *occur* to me,
> since there are so many more efficient ways it can be done.

But this considers 'efficiency' in the context of the filesystem
performance only. Unpacking a tar.gz requires custom scripting, is slow,
and extra 'one off' steps on boot-up introduce complexity. These are the
'installers' that everyone hates :) Also, I presume there is COW at the
image level on some infrastructure. And none of this covers changing the
size of an online system.

> I take it that you're trying to do this before the VM is launched, as
> opposed to unpacking it as part of the VM boot process?

Yes; and I think that's really the spirit of an "image", right?

> If you're using qemu/KVM, perhaps you could drop the tar.gz file in a
> directory on the host, and launch the VM using virtio-9p. This can be
> done by launching qemu with arguments like this:
>
> qemu ... \
[...]

We use qemu+Makefile to build the images, but for running on
infrastructure most cloud providers are limited; controls like this are
not available.

> If you are using Google Cloud Platform or AWS, you could use Google
> Cloud Storage or Amazon S3, respectively, and then just copy the
> tar.gz file into /run and unpack it.

We're not using Google nor AWS. In general, I can envisage extra work to
construct the secured side-channel to distribute supplementary .tar.gz
files.

I don't think you should be disheartened by people resizing in these
ways. I'm not an advocate, and I understand your points. But it sounds
like it is allowing people to achieve things, and it _is_ a positive
sign that the abstractions are well designed -- leaving users unaware of
the caveats, which are hidden from them until they bite.

A rhetorical question, and with no prior knowledge, but: if there are
benefits to these extreme resizes, then rather than say "don't do that",
could it be possible to generalise the maintenance of any filesystem as
creating a 'zero-sized' one and resizing it upwards? ie. the real code
exists in resize, not in creation.

Thanks

--
Mark