* 64bit + resize2fs... this is Not Good.
From: George Spelvin @ 2012-11-14 3:51 UTC
To: linux-ext4; +Cc: linux

As people might know from my recent postings, I've been expanding a RAID
with an ext4 file system. This has uncovered Some Issues.

Because the final size exceeded 16 TiB, I had to use the 64bit support,
which is relatively recent.

But now carrying out the resize has produced some problems...

# resize2fs -p /dev/md1
resize2fs 1.43-WIP (22-Sep-2012)
Filesystem at /dev/md1 is mounted on /data; on-line resizing required
old_desc_blocks = 932, new_desc_blocks = 2562
resize2fs: Not enough reserved gdt blocks for resizing

# /etc/init.d/smbd stop
# umount /data
# resize2fs -p /dev/md1
resize2fs 1.43-WIP (22-Sep-2012)
Please run 'e2fsck -f /dev/md1' first.

# e2fsck -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
data: clean, 2012464/7630464 files, 1727380558/1953383296 blocks

# e2fsck -f -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
Pass 1: Checking inodes, blocks, and sizes
eh_magic = 0000 != f30a
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

 2012464 inodes used (26.37%, out of 7630464)
    9604 non-contiguous files (0.5%)
    1374 non-contiguous directories (0.1%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 2009443/3000
1727380558 blocks used (88.43%, out of 1953383296)
       0 bad blocks
     392 large files

 1096215 regular files
  916227 directories
       0 character device files
       0 block device files
       0 fifos
 4346198 links
      12 symbolic links (12 fast symbolic links)
       1 socket
------------
 6358653 files

# resize2fs -p /dev/md1
resize2fs 1.43-WIP (22-Sep-2012)
Resizing the filesystem on /dev/md1 to 5371804064 (4k) blocks.
Begin pass 1 (max = 104322)
Extending the inode table     XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 2 (max = 12201)
Relocating blocks             XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 59613)
Scanning inode table          XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 1)
Moving inode table            XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/md1 is now 5371804064 blocks long.

# e2fsck -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Group 1's inode table at 1997 conflicts with some other fs block.
Relocate<y>? ^C

[This bit lost to scroll. Something like...]
data: e2fsck canceled.
data: ***** FILE SYSTEM WAS MODIFIED *****

[Then I ran e2fsck -n once, it scrolled too much, and I ran it again
while capturing the output.]

# e2fsck -n -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Group 1's inode table at 1997 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 1998 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 1999 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 2000 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 2001 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 2002 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 2003 conflicts with some other fs block.  Relocate? no
Group 1's inode table at 2004 conflicts with some other fs block.  Relocate? no
Group 1's block bitmap at 1958 conflicts with some other fs block.  Relocate? no

This is tripleplusungood.
Any recovery suggestions eagerly received. I'm poking around with
debugfs -n right now...
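The old_desc_blocks = 932 / new_desc_blocks = 2562 figures reported by resize2fs above follow directly from the device sizes and the filesystem geometry reported later in the thread. A quick sanity check (my own sketch, not code from e2fsprogs; assumes 4 KiB blocks, 32768 blocks per group, and 64-byte descriptors from the 64bit feature):

```python
# Re-derive resize2fs's old_desc_blocks / new_desc_blocks figures.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768
DESC_SIZE = 64  # 64 bytes per group descriptor with the 64bit feature

def desc_blocks(device_bytes: int) -> int:
    blocks = device_bytes // BLOCK_SIZE
    groups = -(-blocks // BLOCKS_PER_GROUP)        # ceiling division
    return -(-(groups * DESC_SIZE) // BLOCK_SIZE)  # ceiling division

print(desc_blocks(8001057980416))   # original 8 TB array -> 932
print(desc_blocks(22002909446144))  # grown 22 TB array   -> 2562
```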
* Re: 64bit + resize2fs... this is Not Good.
From: Theodore Ts'o @ 2012-11-14 5:43 UTC
To: George Spelvin; +Cc: linux-ext4

On Tue, Nov 13, 2012 at 10:51:01PM -0500, George Spelvin wrote:
> As people might know from my recent postings, I've been expanding a RAID
> with an ext4 file system. This has uncovered Some Issues.
>
> Because the final size exceeded 16 TiB, I had to use the 64bit support
> which is relatively recent.
>
> But now carrying out the resize has produced some problems...
>
> # resize2fs -p /dev/md1
> resize2fs 1.43-WIP (22-Sep-2012)
> Filesystem at /dev/md1 is mounted on /data; on-line resizing required
> old_desc_blocks = 932, new_desc_blocks = 2562
> resize2fs: Not enough reserved gdt blocks for resizing

OK, based on your description, you started with a device which was
8001057980416 bytes, and then grew it to 22002909446144 bytes. So I
tried exactly the same thing using a file located on an xfs partition
(so I could make it that big):

# mkfs.xfs /dev/closure/bigscratch
# mount /dev/closure/bigscratch /mnt
# touch /mnt/foo.img
# truncate --size 8001057980416 /mnt/foo.img
# mke2fs -F -t ext4 -O 64bit /mnt/foo.img
# truncate --size 22002909446144 /mnt/foo.img
# mount /mnt/foo.img /u2
# resize2fs /dev/loop0

This succeeded for me:

# resize2fs /dev/loop0
resize2fs 1.43-WIP (21-Sep-2012)
Filesystem at /dev/loop0 is mounted on /u2; on-line resizing required
old_desc_blocks = 932, new_desc_blocks = 2562
The filesystem on /dev/loop0 is now 5371804064 blocks long.

What version of resize2fs were you using --- I know it says 1.43-WIP,
but what git commit specifically were you using? And how did you
compile it, and how did you install it? Also, which kernel version were
you using?
OK, let's try an off-line resize:

# truncate --size 8001057980416 /mnt/foo.img
# mke2fs -F -t ext4 -O 64bit /mnt/foo.img
# truncate --size 22002909446144 /mnt/foo.img
# e2fsck -fy /mnt/foo.img
# resize2fs -p /mnt/foo.img

So first of all, I don't see this line when running e2fsck:

> eh_magic = 0000 != f30a

# e2fsck -fy /mnt/foo.img
e2fsck 1.43-WIP (21-Sep-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/mnt/foo.img: 11/244174848 files (0.0% non-contiguous), 15457939/1953383296 blocks

.... and the resize2fs output is a little different:

# resize2fs -p /mnt/foo.img
resize2fs 1.43-WIP (21-Sep-2012)
Resizing the filesystem on /mnt/foo.img to 5371804064 (4k) blocks.
Begin pass 5 (max = 1)
Moving inode table            XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /mnt/foo.img is now 5371804064 blocks long.

However, when I try doing an e2fsck on the resulting file system, I do
see a very similar set of errors:

e2fsck 1.43-WIP (21-Sep-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Pass 1: Checking inodes, blocks, and sizes
Group 1's inode table at 2245 conflicts with some other fs block.
Relocate? yes
Group 1's block bitmap at 1958 conflicts with some other fs block.
Relocate? yes
Group 1's inode bitmap at 1974 conflicts with some other fs block.
Relocate? yes
....

Given that I was primarily focused on making resize2fs work using
on-line resizing, this doesn't completely surprise me, but it is
definitely a bug with resize2fs that needs fixing --- we need to make
off-line resizing work, and if there are bugs related to it, we need to
simply make resize2fs refuse to do the off-line resize.
So the first question is figuring out why the on-line resizing didn't
work for you, since that is what I've spent most of my time trying to
fix up. The secondary question then is trying to figure out what
happened with the off-line resize, and to fix that bug in e2fsprogs.

Regards,

- Ted
* Re: 64bit + resize2fs... this is Not Good.
From: George Spelvin @ 2012-11-14 6:42 UTC
To: linux, tytso; +Cc: linux-ext4

First of all, thanks a lot, Ted, for the middle-of-the-night tech
support. I just fired off my discovery diary, which I wrote before
seeing your e-mail. Here are the basics:

I have a newer (Oct 14) version compiled but not installed, but git
reflog shows the version I installed (and used for this) was

commit cf3c2ccea647c7d0db20ced920b68e98761dcd16
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Sep 22 22:29:34 2012 -0400

    Update for e2fsprogs-1.43-WIP-2012-09-22

I compiled a Debian package using "make clean ; debian/rules binary" in
the git directory, and installed that. The compiler is cc
(Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2.

The system is mostly Ubuntu 12.04 LTS, but I am running an unmodified
v3.6.5 Linux kernel. (Compiled using the Ubuntu kernel packaging tools
to linux-image-3.6.5_3.6.5-10.00.Custom_amd64.deb.) Note that I
currently DO have the "superblock checksum is corrupt while mounted"
bug in this kernel.

I have some faint hope that the inodes are intact and just the group
descriptors are wrong, and I'm trying to follow that up. Because BG#0's
inodes *did* get relocated correctly.

One strange thing I did was I supplied both "-O 64bit" and
"-E resize=17179869184" when creating the file system. To do that, I
used e2fsprogs as of 41bf599391faaf6523c9997eb467a86888542339 (Oct 14,
"debugfs: teach the htree and ls commands to show directory checksums")
with a local patch described in an earlier e-mail to the list. That may
have caused some odd block group layouts to start with.

Can you tell me, in your test, where the various bitmaps and inode
tables are for the first 16 block groups, both before and after the
resize?
My resize appeared to not only move them, but *reorder* them, and I'd
like to see what it's "supposed" to do.

Thank you very much!
* Re: 64bit + resize2fs... this is Not Good.
From: George Spelvin @ 2012-11-14 7:12 UTC
To: linux, tytso; +Cc: linux-ext4

> # mke2fs -F -t ext4 -O 64bit /mnt/foo.img

I can't find the mkfs command in my history logs, but I'm pretty sure
it was:

mke2fs -F -t ext4 -i 1048576 -O 64bit,metadata_csum,^huge_file \
    -E stride=32,stripe_width=352,resize=17179869184 -L data /mnt/foo.img

... since some of those options affect layout.
* Re: 64bit + resize2fs... this is Not Good.
From: George Spelvin @ 2012-11-14 7:20 UTC
To: linux, tytso; +Cc: linux-ext4

> So the first question is figuring out why the on-line resizing didn't
> work for you, since that is what I've spent most of my time trying to
> fix up. The secondary question then is trying to figure out what
> happened with the off-line resize, and to fix that bug in e2fsprogs.

If you don't mind, *my* primary question is "what can I salvage from
this rubble?", since I didn't happen to have 8 TB of backup space
available to me when I did the resize, and there's some stuff on the FS
I'd rather not lose...

So my big question is "where the F did inodes 129 through 2048 get
copied to?", since the root directory contains a lot of inodes in that
range, and every one I can recover saves a lot of pawing through
lost+found later...

In hindsight, I wish to hell I had turned on -d 14 and logged the
results... If you happen to want to rerun your test with -d8 and tell
me what happened there, I'd definitely appreciate it.
* Re: 64bit + resize2fs... this is Not Good.
From: Theodore Ts'o @ 2012-11-14 20:39 UTC
To: George Spelvin; +Cc: linux-ext4

On Wed, Nov 14, 2012 at 02:20:21AM -0500, George Spelvin wrote:
>
> If you don't mind, *my* primary question is "what can I salvage from
> this rubble?", since I didn't happen to have 8 TB of backup space
> available to me when I did the resize, and there's some stuff on the
> FS I'd rather not lose...

Sigh... ok, unfortunately that's not something I can answer right away.
I *can* tell you what happened, though. Figuring out the best way to
help you recover while minimizing data loss is going to require more
thought....

First of all, the support for 64-bit online-resizing didn't hit the 3.6
kernel, and since it's a new feature, it's unlikely the stable kernel
gatekeepers will accept it unless and until at least one distribution
who is using a stable kernel is willing to backport the patches to
their distro kernel. So in order to use it, you'll need a 3.7-rc1 or
newer kernel. That explains why the on-line resizing didn't work.

The reason why you lost so badly when you did an off-line resize was
because you explicitly changed the resize limit default, via the
-E resize=NNN option. (Can you explain to me your thinking about why
you specified this, just out of curiosity?) Normally the default is
1000 times the size of the original file system, or for a file system
larger than 1.6TB, enough so that the file system can be resized to the
maximum amount that can be supported via the resize_inode scheme, which
is 16TB. In practice, for pretty much any large file system, including
pretty much any raid arrays, the default allows us to do resizes
without needing to move any inode table blocks.
So the way things would have worked with a default file system is that
resize2fs would (in off-line mode) resize the file system up to the
maximum 16TB, and then stop. Using online resizing, a sufficiently new
kernel would use the resize_inode up to the number of reserved gdt
blocks (which would by default take you to the 16TB limit) and then
switch over to the meta_bg scheme for doing on-line resizing, which has
no limits.

Unfortunately resize2fs in off-line resizing mode (a) does not yet know
how to use the meta_bg scheme for resizing, and (b) doesn't deal well
with the case where you (1) have multiple inode tables in the same
block group, as is the case when flex_bg is enabled, as it is with ext4
file systems, and (2) when it needs to move inode tables.

We protect against this by disallowing growing file systems using
off-line resizing in the case where the file system has flex_bg but
does not have the resize_inode feature enabled. *However*, if the file
system does have a resize_inode, but there is not a sufficient number
of gdt blocks (because of an explicitly specified -E resize=NNN option
to mke2fs), then this case isn't caught, and as a result resize2fs will
corrupt the file system.

Sigh. Unfortunately, you fell into this corner case, which I failed to
foresee and protect against.

It is relatively easy to fix resize2fs so it detects this case and
handles it appropriately. What is harder is how to fix a file system
which has been scrambled by resize2fs after the fact. There will
definitely be some parts of the inode table which will have gotten
overwritten, because resize2fs doesn't handle the flex_bg layout
correctly when moving inode table blocks. The question is what's the
best way of undoing the damage going forward, and that's going to
require some more thought and probably some experimentation and
development work.
If you don't need the information right away, I'd advise you not to
touch the file system, since any attempt to fix it is likely to make
the problem worse (and will cause me to have to try to replicate the
attempted fix to see what happened as a result). I'm guessing that
you've already tried running e2fsck -fy, which aborted midway through
the run?

- Ted

P.S. This doesn't exactly replicate what you did, but it's a simple
repro case of the failure which you hit. The key to triggering the
failure is the specification of the -E resize=NNN option. If you
remove this, resize2fs will not corrupt the file system:

# lvcreate -L 32g -n bigscratch /dev/closure
# mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
# lvresize -L 64g /dev/closure/bigscratch
# e2fsck -f /dev/closure/bigscratch
# resize2fs -p /dev/closure/bigscratch
# e2fsck -fy /dev/closure/bigscratch
* Re: 64bit + resize2fs... this is Not Good.
From: Theodore Ts'o @ 2012-11-14 21:04 UTC
To: George Spelvin; +Cc: linux-ext4

On Wed, Nov 14, 2012 at 03:39:42PM -0500, Theodore Ts'o wrote:
> The reason why you lost so badly when you did an off-line resize was
> because you explicitly changed the resize limit default, via the -E
> resize=NNN option. (Can you explain to me your thinking about why you
> specified this, just out of curiosity?) Normally the default is 1000
> times the size of the original file system, or for a file system
> larger than 1.6TB, enough so that the file system can be resized to
> the maximum amount that can be supported via the resize_inode scheme,
> which is 16TB.

Correction: for any file system larger than 16GB....

- Ted
* Re: 64bit + resize2fs... this is Not Good.
From: George Spelvin @ 2012-11-14 23:26 UTC
To: linux, tytso; +Cc: linux-ext4

>> If you don't mind, *my* primary question is "what can I salvage from
>> this rubble?", since I didn't happen to have 8 TB of backup space
>> available to me when I did the resize, and there's some stuff on the
>> FS I'd rather not lose...

> Sigh... ok, unfortunately that's not something I can answer right
> away. I *can* tell you what happened, though. Figuring out the best
> way to help you recover while minimizing data loss is going to
> require more thought....

Well, I scanned BGs 0-15 (and a bit past) for anything that looked like
an inode block (and didn't find anything), or a directory block (and
found pretty much everything).

After talking to the owner of the machine, we decided to give up on
inodes 129-2048 and let e2fsck have at it, and then use the directory
information to rename everything in lost+found back to the right names.

Unfortunately, that ended up with about 1.2 TB of data loss. Some is
backups of other machines that can simply be re-backed-up, but most is
media that can be re-ripped.

There was an excellent chance of finding even large files written to
contiguous sectors, given that the FS was created immediately before
the resize by "cp -a" from a 32bit file system, but it would take some
file-type-specific tools to see if a given sector range "looked like" a
correct file, and the *rest* of the FS would be inaccessible until I
got that working. We decided that a sooner recovery was better than a
more complete one.

Fortunately, the important unreproducible files are mostly intact, and
what isn't there is mostly duplicated on people's laptops. There are
irreplaceable losses, but they're a very small fraction of that 1.2 TB.
It's just going to be a PITA finding it all from scattered sources.

> First of all, the support for 64-bit online-resizing didn't hit the
> 3.6 kernel, and since it's a new feature, it's unlikely the stable
> kernel gatekeepers will accept it unless and until at least one
> distribution who is using a stable kernel is willing to backport the
> patches to their distro kernel. So in order to use it, you'll need a
> 3.7-rc1 or newer kernel. That explains why the on-line resizing
> didn't work.

Ah, and I wasn't about to let an -rc kernel near this machine. I keep
reading docs and thinking features I want to use are more stable than
they are.

> The reason why you lost so badly when you did an off-line resize was
> because you explicitly changed the resize limit default, via the -E
> resize=NNN option. (Can you explain to me your thinking about why you
> specified this, just out of curiosity?)

Because I figured there was *some* overhead, and the documented
preallocation of 1000x the initial FS size was preposterously large. I
figured I'd explicitly set it to a reasonable value that was as large
as I could imagine the FS growing in future, and definitely large
enough for my planned resize. Especially as I was informed (in the
discussion when I reported the "mke2fs -O 64bit -E resize=xxxx" bug)
that it was actually *not* a hard limit and resize2fs could move
sectors to grow past that limit.

> Normally the default is 1000 times the size of the original file
> system, or for a file system larger than 16GB, enough so that the
> file system can be resized to the maximum amount that can be
> supported via the resize_inode scheme, which is 16TB. In practice,
> for pretty much any large file system, including pretty much any raid
> arrays, the default allows us to do resizes without needing to move
> any inode table blocks.

Um, but on a 64bit file system, flex_bg *can* grow past 16 TB.
The limit is basically when BG#0 no longer has room for any data
blocks, and the fixed overhead runs into the backup superblock in block
#32768. With -i 1Mi and a flex_bg group size of 16, there are
1 + 16 * (1 + 1 + 8) = 161 blocks of other metadata (superblock, plus
16 x (block bitmap + inode bitmap + inode table)), leaving 32607 blocks
available for group descriptors. At 4096/64 = 64 descriptors per block,
that's 2086848 block groups, 260856 GiB = 254.7 TiB = 280 TB (decimal).

One source of the problem is that I asked for a 64 TiB grow limit *and
didn't get it*. Even if mke2fs had done its math wrong and only
preallocated 32 * (64 TiB/128 MiB) = 16 MiB (4096 blocks) of GDT space,
I would have been fine. If I had got the full 8192 blocks of
preallocated GDT space that I asked for, there definitely would not
have been a problem.

It appears that it preallocated 1955 blocks of GDT space (enough for
15.27 TiB of array) and things went funny... I tried resize=2^32 and
2^32-1 blocks when reporting the 64bit bug to see if that affected the
divide by 0, but even if I'd cut & pasted those improperly, I should
have had 2048 blocks of preallocated GDT.

> Sigh. Unfortunately, you fell into this corner case, which I failed
> to foresee and protect against.

And by a bug in mke2fs, which failed to allocate the resize space I
requested. If it had computed the space correctly for the 64bit case,
there would have been enough and all this would have been avoided.

Of course, fixing the resizer to handle flex_bg probably isn't that
hard. Assuming standard layout, all the blocks of a given type are
contiguous, and the only difference between a flex_bg group and simply
a larger block group is the layout of the inode bitmaps.

Just treat it as a block group with (in this case) 128 pages of inodes,
8 pages of block bitmap, and 8 pages of inode bitmap, and bingo. You
just need to verify the preconditions first and set up the 16 GDT
entries to point to bits of it later.
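The capacity arithmetic in the first paragraph above checks out; here is the same back-of-the-envelope calculation as a script (my own sketch, using this filesystem's geometry: 4 KiB blocks, 32768 blocks/group, 8 inode-table blocks per group from -i 1048576, flex_bg size 16, 64-byte descriptors):

```python
# Re-derive the flex_bg descriptor-capacity limit described above.
BLOCK = 4096
BLOCKS_PER_GROUP = 32768
DESC_SIZE = 64
FLEX = 16
INODE_BLOCKS_PER_GROUP = 8   # 128 inodes/group * 256 bytes / 4096

# Fixed metadata at the front of BG#0: superblock, plus one flex
# group's worth of block bitmaps, inode bitmaps, and inode tables.
overhead = 1 + FLEX * (1 + 1 + INODE_BLOCKS_PER_GROUP)   # 161 blocks

# Everything else before the backup superblock at block 32768 can hold
# group descriptors, 64 to a block.
gdt_blocks = BLOCKS_PER_GROUP - overhead                 # 32607
groups = gdt_blocks * (BLOCK // DESC_SIZE)
tib = groups * BLOCKS_PER_GROUP * BLOCK / 2**40

print(groups)         # 2086848 block groups
print(round(tib, 1))  # 254.7 TiB
```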
> It is relatively easy to fix resize2fs so it detects this case and
> handles it appropriately. What is harder is how to fix a file system
> which has been scrambled by resize2fs after the fact. There will
> definitely be some parts of the inode table which will have gotten
> overwritten, because resize2fs doesn't handle the flex_bg layout
> correctly when moving inode table blocks. The question is what's the
> best way of undoing the damage going forward, and that's going to
> require some more thought and probably some experimentation and
> development work.
>
> If you don't need the information right away, I'd advise you not to
> touch the file system, since any attempt to fix it is likely to make
> the problem worse (and will cause me to have to try to replicate the
> attempted fix to see what happened as a result). I'm guessing that
> you've already tried running e2fsck -fy, which aborted midway through
> the run?

I tried running e2fsck, which asked questions, and I hit ^C rather than
say anything. Then I tried e2fsck -n, which aborted. This afternoon, I
ran e2fsck -y and did the reconstruction previously described. Without
the inode block maps (which you imply got overwritten), putting the
files back together is very hard.

> P.S. This doesn't exactly replicate what you did, but it's a simple
> repro case of the failure which you hit. The key to triggering the
> failure is the specification of the -E resize=NNN option. If you
> remove this, resize2fs will not corrupt the file system:
>
> # lvcreate -L 32g -n bigscratch /dev/closure
> # mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
> # lvresize -L 64g /dev/closure/bigscratch
> # e2fsck -f /dev/closure/bigscratch
> # resize2fs -p /dev/closure/bigscratch
> # e2fsck -fy /dev/closure/bigscratch

Yeah, I see how that would cause problems, as you ask for 51.5G of
resize range. What pisses me off is that I asked for 64 TiB!
(-E resize=17179869184)
* Re: 64bit + resize2fs... this is Not Good.
From: Theodore Ts'o @ 2012-11-14 23:38 UTC
To: George Spelvin; +Cc: linux-ext4

On Wed, Nov 14, 2012 at 06:26:33PM -0500, George Spelvin wrote:
>
> > The reason why you lost so badly when you did an off-line resize
> > was because you explicitly changed the resize limit default, via
> > the -E resize=NNN option. (Can you explain to me your thinking
> > about why you specified this, just out of curiosity?)
>
> Because I figured there was *some* overhead, and the documented
> preallocation of 1000x initial FS size was preposterously large.

It's actually not 1000x. It's 1000x up to a maximum of 1024 current and
reserved gdt blocks (which is the absolute maximum which can be
supported using the resize_inode feature). Contrary to what you had
expected, it's simply not possible to have 2048 or 4096 reserved gdt
blocks using the resize_inode scheme. That's because it stores the
reserved gdt blocks using an indirect/direct scheme, and that's all the
space that we have. (With a 4k block, and 4 bytes per block --- the
resize_inode scheme simply completely doesn't work above 16TB since it
uses 4 byte block numbers --- 4k/4 = 1024 block group descriptors.)

> > Normally the default is 1000 times the size of the original file
> > system, or for a file system larger than 16GB, enough so that the
> > file system can be resized to the maximum amount that can be
> > supported via the resize_inode scheme, which is 16TB. In practice,
> > for pretty much any large file system, including pretty much any
> > raid arrays, the default allows us to do resizes without needing to
> > move any inode table blocks.
>
> Um, but on a 64bit file system, flex_bg *can* grow past 16 TB.
Currently, only using the online resizing patches which are in 3.7-rc1
and newer kernels --- and that's using the meta_bg scheme, *not* the
resize_inode resizing scheme.

> One source of the problem is that I asked for a 64 TiB grow limit
> *and didn't get it*. Even if mke2fs had done its math wrong and only
> preallocated 32 * (64 TiB/128 MiB) = 16 MiB (4096 blocks) of GDT
> space, I would have been fine.

It can't preallocate that many GDT blocks. There simply isn't room in
the indirect block scheme. I agree it should have given an error
message in that case. Unfortunately the extended resize=NNN option is
not one that has gotten much attention or use, and so there are bugs
hiding there that have been around for years and years. :-(

> Of course, fixing the resizer to handle flex_bg probably isn't that
> hard. Assuming standard layout, all the blocks of a given type are
> contiguous, and the only difference between a flex_bg group and
> simply a larger block group is the layout of the inode bitmaps.
>
> Just treat it as a block group with (in this case) 128 pages of
> inodes, 8 pages of block bitmap, and 8 pages of inode bitmap, and
> bingo.

Except that with the flex_bg feature, the inode table blocks can be
anywhere. Usually they are contiguous, but they don't have to be ---
if there are bad blocks, or depending on how the file system had been
previously resized, it's possible that the inode tables for the block
groups might not be adjacent. Probably the best thing to do is to just
find some new space for the block group's inode table, and not try to
keep it contiguous.

Ultimately, the best thing to do is to just let mke2fs use the
defaults, resize up to 16TB without needing to move any inode table
blocks, and then switch over to the meta_bg scheme, and add support for
meta_bg resizing in resize2fs. I just simply haven't had time to work
on this.
:-(

> > # lvcreate -L 32g -n bigscratch /dev/closure
> > # mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
> > # lvresize -L 64g /dev/closure/bigscratch
> > # e2fsck -f /dev/closure/bigscratch
> > # resize2fs -p /dev/closure/bigscratch
> > # e2fsck -fy /dev/closure/bigscratch
>
> Yeah, I see how that would cause problems, as you ask for 51.5G of
> resize range. What pisses me off is that I asked for 64 TiB!
> (-E resize=17179869184)

Yes, mke2fs should have issued an error message to let you know there's
no way it could honor your request. Again, I'm really sorry; you were
exploring some of the less well tested code paths in
e2fsprogs/resize2fs. :-(

- Ted
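Ted's 4k/4 = 1024 ceiling and George's observed 1955 gdt blocks (≈15.27 TiB) are mutually consistent; a quick check of both numbers (my own sketch, assuming this thread's geometry of 4 KiB blocks, 32768 blocks per group, 64-byte descriptors, and one indirect block of 4-byte pointers in the resize inode):

```python
# Check the resize_inode ceiling against the numbers in this thread.
BLOCK = 4096
BLOCKS_PER_GROUP = 32768
DESC_SIZE = 64

# The resize inode's single indirect block holds blocksize/4 pointers,
# so at most 1024 gdt blocks can ever be reserved.
max_reserved_gdt = BLOCK // 4          # 1024
descs_per_block = BLOCK // DESC_SIZE   # 64 group descriptors per block

# George reported 1955 gdt blocks (current + reserved) actually set up;
# the array capacity that buys under the resize_inode scheme:
groups = 1955 * descs_per_block
tib = groups * BLOCKS_PER_GROUP * BLOCK / 2**40

print(max_reserved_gdt)  # 1024
print(round(tib, 2))     # 15.27 (TiB)
```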
* Re: 64bit + resize2fs... this is Not Good. 2012-11-14 23:38 ` Theodore Ts'o @ 2012-11-15 3:43 ` George Spelvin 0 siblings, 0 replies; 11+ messages in thread From: George Spelvin @ 2012-11-15 3:43 UTC (permalink / raw) To: linux, tytso; +Cc: linux-ext4 > It's actually not 1000x times. It's a 1000x times up to a maximum of > 1024 current and reserved gdt blocks (which is the absolute maxmimum > which can be supported using resize_inode feature). Contrary to what > you had expected, it's simply not possible to have 2048 or 4096 > reserved gdt blocks using the resize_inode scheme. That's because it > stores in the reserved gdt blocks using an indirect/direct scheme, and > that's all the sapce that we have. (With a 4k block, and 4 bytes per > blocks --- the resize_inode scheme simply completely doesn't work if > above 16TB since it uses 4 byte block numbers --- 4k/4 = 1024 block group > descriptors.) Er... you can't use extents? The blocks *are* all contiguous. >> Yeah, I see how that would cause problems, as you ask for 51.5G of >> resize range. What pisses me off is that I asked for 64 TiB! >> (-E resize=17179869184) > Yes, mke2fs should have issued an error message to let you know > there's no way it could honor your request. As long as I get to be at least a *little* bit grumpy that *both* mke2fs and resize2fs, when asked to do something they couldn't to, failed to produce any sort of error message, but silently f***ed it up. > Again, I'm really sorry; you were exploring some of the less well > tested code paths in e2fsprogs/resize2fs. :-( I seem to be developing a knack for that this last couple of months. :-( I *thought* I was doing the obvious thing. All I set out to do was expand a 10 TB RAID to 22 TB. Really, everything I did I *thought* I chose the *safest* possible option. 1. Restripe RAID 2. Try to resize FS, hit 16 TB limit. 3. Restripe RAID back down. 4. Create new 8 TB RAID from new drives 5. 
Format with 64-bit ext4, telling mke2fs that I will be resizing later.
5a. Fight with bug in mke2fs while doing so.
6. Copy over files from 32-bit FS
7. Destroy old RAID, and add drives to new RAID
8. Restripe up to 22 TB (again!)
9. Resize file system.  Personally, an off-line technique "feels safer"
   than on-line, so I went with that.
10. Kablooie!

Other than skipping the first 3 steps, what was I supposed to have done
differently?
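[For reference, Ted's resize_inode arithmetic quoted earlier in the
thread can be sanity-checked in a few lines.  This is just a sketch:
the constants are the standard values for a 4 KiB block size, and the
32-bit block-number limit happens to give the same 16 TiB ceiling as
the reserved-GDT-block limit.]

```python
# Why resize_inode tops out at 16 TiB with 4 KiB blocks (sketch).
BLOCK_SIZE = 4096                   # bytes
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE   # one bit per block in a 1-block bitmap = 32768
DESC_SIZE = 32                      # bytes per descriptor, classic (non-64bit) layout

max_gdt_blocks = BLOCK_SIZE // 4            # 4-byte block numbers: 4k/4 = 1024
descs_per_block = BLOCK_SIZE // DESC_SIZE   # 128 group descriptors per GDT block
max_groups = max_gdt_blocks * descs_per_block
max_blocks = max_groups * BLOCKS_PER_GROUP  # 2**32, also the 32-bit block limit

print(max_blocks * BLOCK_SIZE // 2**40, "TiB")  # -> 16 TiB
```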
* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  3:51 64bit + resize2fs... this is Not Good George Spelvin
  2012-11-14  5:43 ` Theodore Ts'o
@ 2012-11-14  6:27 ` George Spelvin
  1 sibling, 0 replies; 11+ messages in thread
From: George Spelvin @ 2012-11-14  6:27 UTC (permalink / raw)
  To: linux-ext4, linux

I'm studying the file system to see what I can salvage from the
wreckage that resize2fs left me with, and thought I'd keep a diary of
my discoveries here so someone can correct my errors.  I'm leaning
heavily on https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout

debugfs stats tells me the following:

Filesystem features:      has_journal ext_attr resize_inode dir_index
	filetype extent 64bit flex_bg sparse_super large_file dir_nlink
	extra_isize metadata_csum
Inode count:              20983680
Block count:              5371804064
First block:              0
Block size:               4096
Blocks per group:         32768
Inodes per group:         128	// 128 * 256 = 32 Kbytes
Inode blocks per group:   8	// 8 * 4096 = 32 Kbytes, check!
Flex block group size:    16

So, I have 5371804064 = 32768 * 163934 + 14752 blocks, meaning I have
163935 block groups (the last one partial).  Since the descriptors are
64 bytes each, that's 4096 * 2561 + 1984 bytes, or 2562 blocks per
copy of the group descriptor table.

Since sparse_super is set, backup copies of the superblock and block
group descriptors are present only in groups 0, 1, 3, 5, 7, 9, 25, 27,
49, 81, 125, ...

Since flex_bg is set, block groups are basically "ganged" into groups
of 16 (512 Ki blocks each).  So only groups 0, 16, 32, ... actually
have bitmaps & inode tables in them.  Thus, except for group 0, it's
an either/or thing.

Checking a backup copy with dumpe2fs -o superblock=$((9*32768)) /dev/md1
gives consistent results.
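[The geometry above can be double-checked with a few lines of Python.
A sketch only: the constants come from the debugfs output, and the
sparse_super rule (backups in groups 0, 1, and powers of 3, 5, and 7)
is the standard one.]

```python
# Recompute the block-group geometry described above (sketch).
BLOCKS = 5371804064
BLOCKS_PER_GROUP = 32768
DESC_SIZE = 64          # 64-byte descriptors with the 64bit feature
BLOCK_SIZE = 4096

groups = -(-BLOCKS // BLOCKS_PER_GROUP)             # ceiling division
desc_blocks = -(-(groups * DESC_SIZE) // BLOCK_SIZE)
print(groups, desc_blocks)                          # -> 163935 2562

# With sparse_super, backups live in groups 0, 1, and powers of 3, 5, 7.
def has_backup(g):
    if g in (0, 1):
        return True
    for p in (3, 5, 7):
        n = p
        while n < g:
            n *= p
        if n == g:
            return True
    return False

print([g for g in range(130) if has_backup(g)])
# -> [0, 1, 3, 5, 7, 9, 25, 27, 49, 81, 125]
```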
The differences are:

--- /tmp/0	2012-11-14 00:35:21.061916443 -0500
+++ /tmp/9	2012-11-14 00:35:23.985270766 -0500
@@ -9 +9 @@
-Filesystem state:         not clean with errors
+Filesystem state:         not clean
@@ -15 +15 @@
-Free blocks:              4643122829
+Free blocks:              348155533
@@ -29 +29 @@
-Last write time:          Tue Nov 13 22:35:53 2012
+Last write time:          Tue Nov 13 22:34:31 2012
@@ -45,11 +44,0 @@
-FS Error count:           779
-First error time:         Tue Nov 13 22:56:42 2012
-First error function:     ext4_iget
-First error line #:       3832
-First error inode #:      771
-First error block #:      0
-Last error time:          Tue Nov 13 23:10:01 2012
-Last error function:      ext4_iget
-Last error line #:        3832
-Last error inode #:       771
-Last error block #:       0
@@ -57 +46 @@
-Checksum:                 0x1a66baa3
+Checksum:                 0x921b7125

(Interesting that resize2fs updates the block count, but leaves the
old free blocks value alone.)

Anyway, looking at the output of dumpe2fs, I notice the first real
oddity.  (Note that I'm not quite sure what ITABLE_ZEROED means, and
the nomenclature makes me VERY nervous.  Does it mean
ITABLE_INITIALIZED?)
Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0xbf1d, unused inodes 0
  Primary superblock at 0, Group descriptors at 1-2562
  Block bitmap at 2571 (+2571), csum 0xba8922d8
  Inode bitmap at 2572 (+2572), csum 0x7d86a677
  Inode table at 2563-2570 (+2563)
  17993 free blocks, 0 free inodes, 4 directories
Group 1: (Blocks 32768-65535) [ITABLE_ZEROED]
  Checksum 0xca14, unused inodes 0
  Backup superblock at 32768, Group descriptors at 32769-35330
  Block bitmap at 1958 (bg #0 + 1958), csum 0xef1e59c7
  Inode bitmap at 1974 (bg #0 + 1974), csum 0x7d86a677
  Inode table at 1997-2004 (bg #0 + 1997)
  30205 free blocks, 0 free inodes, 2 directories
Group 2: (Blocks 65536-98303) [ITABLE_ZEROED]
  Checksum 0x8abe, unused inodes 0
  Block bitmap at 1959 (bg #0 + 1959), csum 0x17817d16
  Inode bitmap at 1975 (bg #0 + 1975), csum 0x7d86a677
  Inode table at 2005-2012 (bg #0 + 2005)
  32768 free blocks, 0 free inodes, 2 directories
Group 3: (Blocks 98304-131071) [ITABLE_ZEROED]
  Checksum 0x478d, unused inodes 0
  Backup superblock at 98304, Group descriptors at 98305-100866
  Block bitmap at 1960 (bg #0 + 1960), csum 0xef1e59c7
  Inode bitmap at 1976 (bg #0 + 1976), csum 0x7d86a677
  Inode table at 2013-2020 (bg #0 + 2013)
  30205 free blocks, 0 free inodes, 0 directories
Group 4: (Blocks 131072-163839) [ITABLE_ZEROED]
  Checksum 0xfbc4, unused inodes 0
  Block bitmap at 1961 (bg #0 + 1961), csum 0x17817d16
  Inode bitmap at 1977 (bg #0 + 1977), csum 0x7d86a677
  Inode table at 2021-2028 (bg #0 + 2021)
  32768 free blocks, 0 free inodes, 3 directories
Group 5: (Blocks 163840-196607) [ITABLE_ZEROED]
  Checksum 0x5c5f, unused inodes 0
  Backup superblock at 163840, Group descriptors at 163841-166402
  Block bitmap at 1962 (bg #0 + 1962), csum 0xef1e59c7
  Inode bitmap at 1978 (bg #0 + 1978), csum 0x7d86a677
  Inode table at 2029-2036 (bg #0 + 2029)
  30205 free blocks, 0 free inodes, 0 directories
Group 6: (Blocks 196608-229375) [ITABLE_ZEROED]
  Checksum 0x80c5, unused inodes 0
  Block bitmap at 1963 (bg #0 + 1963), csum 0x17817d16
  Inode bitmap at 1979 (bg #0 + 1979), csum 0x7d86a677
  Inode table at 2037-2044 (bg #0 + 2037)
  32768 free blocks, 0 free inodes, 12 directories
Group 7: (Blocks 229376-262143) [ITABLE_ZEROED]
  Checksum 0xe21a, unused inodes 0
  Backup superblock at 229376, Group descriptors at 229377-231938
  Block bitmap at 1964 (bg #0 + 1964), csum 0xef1e59c7
  Inode bitmap at 1980 (bg #0 + 1980), csum 0x7d86a677
  Inode table at 2045-2052 (bg #0 + 2045)
  30205 free blocks, 0 free inodes, 7 directories
Group 8: (Blocks 262144-294911) [ITABLE_ZEROED]
  Checksum 0xb69a, unused inodes 0
  Block bitmap at 1965 (bg #0 + 1965), csum 0x17817d16
  Inode bitmap at 1981 (bg #0 + 1981), csum 0x7d86a677
  Inode table at 2053-2060 (bg #0 + 2053)
  32768 free blocks, 0 free inodes, 10 directories
Group 9: (Blocks 294912-327679) [ITABLE_ZEROED]
  Checksum 0x360d, unused inodes 0
  Backup superblock at 294912, Group descriptors at 294913-297474
  Block bitmap at 1966 (bg #0 + 1966), csum 0xef1e59c7
  Inode bitmap at 1982 (bg #0 + 1982), csum 0x7d86a677
  Inode table at 2061-2068 (bg #0 + 2061)
  30205 free blocks, 0 free inodes, 10 directories
Group 10: (Blocks 327680-360447) [ITABLE_ZEROED]
  Checksum 0x6ba3, unused inodes 0
  Block bitmap at 1967 (bg #0 + 1967), csum 0x17817d16
  Inode bitmap at 1983 (bg #0 + 1983), csum 0x7d86a677
  Inode table at 2069-2076 (bg #0 + 2069)
  32768 free blocks, 0 free inodes, 11 directories
Group 11: (Blocks 360448-393215) [ITABLE_ZEROED]
  Checksum 0xabd0, unused inodes 0
  Block bitmap at 1968 (bg #0 + 1968), csum 0x17817d16
  Inode bitmap at 1984 (bg #0 + 1984), csum 0x7d86a677
  Inode table at 2077-2084 (bg #0 + 2077)
  32768 free blocks, 0 free inodes, 5 directories
Group 12: (Blocks 393216-425983) [ITABLE_ZEROED]
  Checksum 0x47de, unused inodes 0
  Block bitmap at 1969 (bg #0 + 1969), csum 0x17817d16
  Inode bitmap at 1985 (bg #0 + 1985), csum 0x7d86a677
  Inode table at 2085-2092 (bg #0 + 2085)
  32768 free blocks, 0 free inodes, 7 directories
Group 13: (Blocks 425984-458751) [ITABLE_ZEROED]
  Checksum 0x5822, unused inodes 0
  Block bitmap at 1970 (bg #0 + 1970), csum 0x17817d16
  Inode bitmap at 1986 (bg #0 + 1986), csum 0x7d86a677
  Inode table at 2093-2100 (bg #0 + 2093)
  32768 free blocks, 0 free inodes, 13 directories
Group 14: (Blocks 458752-491519) [ITABLE_ZEROED]
  Checksum 0x21db, unused inodes 0
  Block bitmap at 1971 (bg #0 + 1971), csum 0x17817d16
  Inode bitmap at 1987 (bg #0 + 1987), csum 0x7d86a677
  Inode table at 2101-2108 (bg #0 + 2101)
  32768 free blocks, 0 free inodes, 9 directories
Group 15: (Blocks 491520-524287) [ITABLE_ZEROED]
  Checksum 0xdd66, unused inodes 0
  Block bitmap at 1972 (bg #0 + 1972), csum 0x17817d16
  Inode bitmap at 1988 (bg #0 + 1988), csum 0x7561ee79
  Inode table at 2109-2116 (bg #0 + 2109)
  32768 free blocks, 92 free inodes, 12 directories
Group 16: (Blocks 524288-557055) [ITABLE_ZEROED]
  Checksum 0x5ee8, unused inodes 0
  Block bitmap at 524288 (+0), csum 0x316efbb2
  Inode bitmap at 524304 (+16), csum 0x7d86a677
  Inode table at 524320-524327 (+32)
  30816 free blocks, 0 free inodes, 2 directories
Group 17: (Blocks 557056-589823) [ITABLE_ZEROED]
  Checksum 0x0f0a, unused inodes 0
  Block bitmap at 524289 (bg #16 + 1), csum 0x17817d16
  Inode bitmap at 524305 (bg #16 + 17), csum 0x7d86a677
  Inode table at 524328-524335 (bg #16 + 40)
  32768 free blocks, 0 free inodes, 1 directories

Notice that Group 0's inode table starts at block 2563, immediately
after the 1+2562 blocks of superblock + block descriptors.  But groups
1..15 have their inode tables in the middle of the block descriptor
array!  WTF?

Also, they're all in different orders.  Group 0 is inode table, block
bitmap, inode bitmap.  Groups 1..15 have all their block bitmaps
consecutive (blocks 1957-1972, if we extrapolate where group 0's
bitmaps would go), followed by the inode bitmaps and then the inode
tables.

Are they just unchanged from the pre-resize version?  The old file
system size was 1953383296 blocks, requiring 59613 block groups, and
932 blocks to hold their descriptors.  So that doesn't quite make
sense...
Oh!  Is that the "reserved GDT block" space?  Not quite sure how it
was computed, but maybe...

Is there a (forlorn hope!) chance that resize2fs actually relocated
the inodes properly, and just failed to update the block group
descriptors to point to them?  So all the data is actually safe, just
misplaced?

Or perhaps only the first 16*128 = 2048 inodes are trashed?  Or
perhaps even inodes 128-2047?  I notice that I can read inode 2 (the
root directory) and inode 11 (lost+found); I cannot read inodes
513-775; inodes 2047 and 2048 look corrupted, but inode 2049 looks
fine.

Oh, that's right, inode 0 is invalid, so inodes are 1-based.  So
inodes 128 and 2049 might be right, but inodes 129-2048 are toast.

I'm going to fire this off into the ether now and start the somewhat
time-consuming process of seeing if I can find the lost inodes
somewhere in block group 0.
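[George's inode arithmetic above (1-based inode numbers, 128 inodes
per group, so the damaged inode tables of groups 1-15 cover inodes
129-2048) can be sketched as a hypothetical helper; the function name
and the geometry constant are taken from the debugfs output, not from
any e2fsprogs API.]

```python
# Map an ext4 inode number to (block group, slot in that group's
# inode table), using the geometry from the debugfs output above.
INODES_PER_GROUP = 128

def inode_location(ino):
    """Inode numbers are 1-based; inode 0 is invalid."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    return (ino - 1) // INODES_PER_GROUP, (ino - 1) % INODES_PER_GROUP

print(inode_location(2))     # root directory      -> (0, 1)
print(inode_location(129))   # first in group 1    -> (1, 0)
print(inode_location(2048))  # last in group 15    -> (15, 127)
print(inode_location(2049))  # first in group 16   -> (16, 0)
```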
end of thread, other threads:[~2012-11-15  3:43 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-14  3:51 64bit + resize2fs... this is Not Good George Spelvin
2012-11-14  5:43 ` Theodore Ts'o
2012-11-14  6:42   ` George Spelvin
2012-11-14  7:12     ` George Spelvin
2012-11-14  7:20       ` George Spelvin
2012-11-14 20:39         ` Theodore Ts'o
2012-11-14 21:04       ` Theodore Ts'o
2012-11-14 23:26         ` George Spelvin
2012-11-14 23:38           ` Theodore Ts'o
2012-11-15  3:43             ` George Spelvin
2012-11-14  6:27 ` George Spelvin