From mboxrd@z Thu Jan 1 00:00:00 1970
From: Theodore Ts'o
Subject: Re: 64bit + resize2fs... this is Not Good.
Date: Wed, 14 Nov 2012 18:38:54 -0500
Message-ID: <20121114233854.GD24980@thunk.org>
References: <20121114203942.GA23511@thunk.org>
 <20121114232633.18292.qmail@science.horizon.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-ext4@vger.kernel.org
To: George Spelvin
Return-path:
Received: from li9-11.members.linode.com ([67.18.176.11]:32929 "EHLO
 imap.thunk.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S932216Ab2KNXjD (ORCPT ); Wed, 14 Nov 2012 18:39:03 -0500
Content-Disposition: inline
In-Reply-To: <20121114232633.18292.qmail@science.horizon.com>
Sender: linux-ext4-owner@vger.kernel.org
List-ID: 

On Wed, Nov 14, 2012 at 06:26:33PM -0500, George Spelvin wrote:
> > The reason why you lost so badly when you did an off-line resize was
> > because you explicitly changed the resize limit default, via the -E
> > resize=NNN option.  (Can you explain to me your thinking about why you
> > specified this, just out of curiosity?)
> 
> Because I figured there was *some* overhead, and the documented
> preallocation of 1000x initial FS size was preposterously large.

It's actually not always 1000x.  It's 1000x, up to a maximum of 1024
current and reserved gdt blocks (which is the absolute maximum that can
be supported using the resize_inode feature).  Contrary to what you had
expected, it's simply not possible to have 2048 or 4096 reserved gdt
blocks using the resize_inode scheme.  That's because it stores the
reserved gdt blocks using an indirect/direct block scheme, and that's
all the space that we have.  (With a 4k block size and 4 bytes per block
number --- the resize_inode scheme simply doesn't work at all above 16TB
since it uses 4 byte block numbers --- we get 4k/4 = 1024 group
descriptor blocks.  There's a small worked example of this arithmetic
after the quoted text below.)

> > Normally the default is 1000 times the size of the original file system,
> > or for a file system larger than 16GB, enough so that the file system
> > can be resized to the maximum amount that can be supported via the
> > resize_inode scheme, which is 16TB.  In practice, for pretty much any
> > large file system, including pretty much any raid arrays, the default
> > allows us to do resizes without needing to move any inode table blocks.
> 
> Um, but on a 64bit file system, flex_bg *can* grow past 16 TB.

Currently, only by using the online resizing patches which are in
3.7-rc1 and newer kernels --- and that's using the meta_bg scheme,
*not* the resize_inode resizing scheme.

> One source of the problem is that I asked for a 64 TiB grow limit
> *and didn't get it*.  Even if mke2fs had done its math wrong and only
> preallocated 32 * (64 TiB/128 MiB) = 16 MiB (4096 blocks) of GDT space,
> I would have been fine.

It can't preallocate that many GDT blocks.  There simply isn't room in
the indirect block scheme.  I agree it should have given an error
message in that case.  Unfortunately the extended resize=NNN option is
not one that has gotten much attention or use, and so there are bugs
hiding there that have been around for years and years.  :-(

> Of course, fixing the resizer to handle flex_bg probably isn't that hard.
> Assuming standard layout, all the blocks of a given type are contiguous,
> and the only difference between a flex_bg group and simply a larger
> block group is the layout of the inode bitmaps.
> 
> Just treat it as a block group with (in this case) 128 pages of inodes,
> 8 pages of block bitmap, and 8 pages of inode bitmap, and bingo.
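Here's the worked example promised above: a quick back-of-the-envelope
sketch in shell arithmetic.  The constants (4k blocks, 32768 blocks per
group, 32-byte group descriptors) are illustrative assumptions, not
values read out of a real superblock:

    block_size=4096
    blocks_per_group=32768
    desc_size=32
    max_gdt_blocks=$(( block_size / 4 ))           # one 4k indirect block of 4-byte pointers: 1024
    descs_per_block=$(( block_size / desc_size ))  # 128 group descriptors per GDT block
    max_groups=$(( max_gdt_blocks * descs_per_block ))
    max_bytes=$(( max_groups * blocks_per_group * block_size ))
    echo "max current+reserved GDT blocks: $max_gdt_blocks"    # 1024
    echo "max file system size: $(( max_bytes >> 40 )) TiB"    # 16 TiB

Note that the 16 TiB ceiling is the same one imposed by the 4-byte block
numbers themselves (2^32 blocks * 4k), which is why resize_inode and
32-bit block numbers run out at the same point.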
Except that with the flex_bg feature, the inode table blocks can be
anywhere.  Usually they are contiguous, but they don't have to be ---
if there are bad blocks, or depending on how the file system had been
previously resized, it's possible that the inode tables for the block
groups might not be adjacent.  Probably the best thing to do is to just
find some new space for the block group's inode table, and not try to
keep it contiguous.

Ultimately, the best thing to do is to just let mke2fs use the defaults,
resize up to 16TB without needing to move any inode table blocks, and
then switch over to the meta_bg scheme, which means adding support for
meta_bg resizing to resize2fs.  I simply haven't had time to work on
this.  :-(

> > # lvcreate -L 32g -n bigscratch /dev/closure
> > # mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
> > # lvresize -L 64g /dev/closure/bigscratch
> > # e2fsck -f /dev/closure/bigscratch
> > # resize2fs -p /dev/closure/bigscratch
> > # e2fsck -fy /dev/closure/bigscratch
> 
> Yeah, I see how that would cause problems, as you ask for 51.5G of
> resize range.  What pisses me off is that I asked for 64 TiB!
> (-E resize=17179869184)

Yes, mke2fs should have issued an error message to let you know there's
no way it could honor your request.

Again, I'm really sorry; you were exploring some of the less well-tested
code paths in e2fsprogs/resize2fs.  :-(

						- Ted
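P.S.  To spell out why the -E resize=17179869184 request could never
have been honored by the resize_inode scheme, here are the numbers under
the same illustrative assumptions as the sketch above (4k blocks, 32768
blocks per group, 128 descriptors per GDT block, 1024 GDT blocks max):

    requested_blocks=17179869184                   # -E resize= takes a count of fs blocks
    echo "requested limit: $(( requested_blocks * 4096 >> 40 )) TiB"   # 64 TiB
    groups_needed=$(( requested_blocks / 32768 ))  # 524288 block groups
    gdt_needed=$(( groups_needed / 128 ))          # 4096 GDT blocks
    echo "GDT blocks needed: $gdt_needed, vs. the resize_inode maximum of 1024"

That 4096 is exactly the figure quoted above, and four times what
resize_inode can ever address.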