From mboxrd@z Thu Jan 1 00:00:00 1970
From: Theodore Ts'o
Subject: Re: kernel bug at fs/ext4/resize.c:409
Date: Fri, 14 Feb 2014 18:46:31 -0500
Message-ID: <20140214234631.GC1748@thunk.org>
References: <20140203182634.GA28811@shaniqua>
	<20140203185633.GA22856@thunk.org>
	<20140206210844.GA4335@helmut>
	<87sirnp2m3.fsf@openvz.org>
	<20140213145323.GA6296@helmut>
	<20140213211831.GA11480@thunk.org>
	<20140214201905.GA26292@helmut>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Dmitry Monakhov , linux-ext4@vger.kernel.org
To: Jon Bernard
Return-path: 
Received: from imap.thunk.org ([74.207.234.97]:58148 "EHLO imap.thunk.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751788AbaBNXqg (ORCPT );
	Fri, 14 Feb 2014 18:46:36 -0500
Content-Disposition: inline
In-Reply-To: <20140214201905.GA26292@helmut>
Sender: linux-ext4-owner@vger.kernel.org
List-ID: 

On Fri, Feb 14, 2014 at 03:19:05PM -0500, Jon Bernard wrote:
> Ahh, I see.  Here's where this comes from: the particular use case is
> provisioning of new cloud instances whose root volume is of unknown
> size.  The filesystem and its contents are created and bundled
> before-hand into the smallest filesystem possible.  The instance is PXE
> booted for provisioning and the root filesystem is then copied onto the
> disk - and then resized to take advantage of the total amount of space.
>
> In order to support very large partitions, the filesystem is created
> with an abnormally large inode table so that large resizes would be
> possible.  I traced it to this commit as best I can tell:
>
> https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b3ed026aabe07
>
> I assumed that additional inodes would be allocated along with block
> groups during an online resize, but that commit contradicts my current
> understanding.

Additional inodes *are* allocated as the file system is grown; whoever
thought otherwise was wrong.  What happens is that there is a fixed
number of inodes per block group.  When the file system is resized,
either by growing or shrinking it, block groups are added to or removed
from the file system, and the inodes belonging to those block groups
are added or removed along with them.  (For example, a file system
formatted with 8,192 inodes per block group gains 8,192 inodes for
every block group added by the resize.)

> I suggested that the filesystem be created during the time of
> provisioning to allow a more optimal on-disk layout, and I believe this
> is being considered now.

What causes the most damage in terms of a non-optimal data block layout
is installing the files on a large file system, and then shrinking the
file system to its minimum size using resize2fs -M.  There is also some
non-optimality that occurs as the file system gets filled beyond about
90% full, but it's not nearly as bad as shrinking the file system ---
which you should avoid at all costs.

From a performance point of view, the only time you should try to do an
off-line resize2fs shrink is if you are shrinking the file system by a
handful of blocks as part of converting a file system in place to use
LVM or LUKS encryption, and you need to make room for some metadata
blocks at the end of the partition.

The other thing to note is that if you are using a format such as
qcow2, or something like the device mapper's thin-provisioning (thinp)
scheme, or if you are willing to deal with sparse files, one approach
is to not resize the file system at all.  You could just use a tool
like zerofree[1] to zero out all of the unused blocks in the file
system, and then use "/bin/cp --sparse=always" to cause all zero blocks
to be treated as sparse blocks in the destination file.
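As a rough sketch, the whole flow might look something like this
(this assumes zerofree has been built from the zerofree.c listed in
[1] and takes the unmounted image or device as its argument; the file
names here are purely illustrative):

    # Zero every unused block in the (unmounted) file system image
    zerofree rootfs.img

    # Re-copy the image; --sparse=always makes cp store runs of
    # zeroes as holes, so the zeroed free space occupies no disk
    # space in the destination file
    /bin/cp --sparse=always rootfs.img images/rootfs.img

The same idea should work with a block device as the source, so long
as the file system on it is not mounted read-write while zerofree is
running.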
[1] http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/util/zerofree.c

This is part of how I maintain the root filesystem that I use in a VM
for testing ext4 changes upstream.  After I pull in the latest Debian
unstable package updates and install the latest updates from the
xfstests and e2fsprogs git repositories, I run the following script,
which uses the zerofree.c program to compress the qcow2 root file
system image that I use with kvm:

http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/compress-rootfs

Also, starting with e2fsprogs 1.42.10, there's another way you can
efficiently deploy a large file system image by copying only the
blocks which are in use, using a command like this:

    e2image -rap src_fs dest_fs

(See also the -c flag as described in e2image's man page if you want
to use this technique to do incremental image-based backups onto a
flash-based backup medium; I was using this for a while to keep the
root file systems on two laptop SSDs in sync with one another.)

So there are lots of ways that you can do what you need, all without
playing games with resize2fs.  Perhaps some of them would actually be
better for your use case.

> If it turns out to be not terribly complicated and there is not an
> immediate time constraint, I would love to try to help with this or at
> least test patches.

I will hopefully have a bug fix in the next week or two.

Cheers,

						- Ted