* 64bit + resize2fs... this is Not Good.
@ 2012-11-14  3:51 George Spelvin
  2012-11-14  5:43 ` Theodore Ts'o
  2012-11-14  6:27 ` George Spelvin
  0 siblings, 2 replies; 11+ messages in thread
From: George Spelvin @ 2012-11-14  3:51 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux

As people might know from my recent postings, I've been expanding a RAID
with an ext4 file system.  This has uncovered Some Issues.

Because the final size exceeded 16 TiB, I had to use the 64bit support
which is relatively recent.

But now carrying out the resize has produced some problems...

# resize2fs -p /dev/md1
resize2fs 1.43-WIP (22-Sep-2012)
Filesystem at /dev/md1 is mounted on /data; on-line resizing required
old_desc_blocks = 932, new_desc_blocks = 2562
resize2fs: Not enough reserved gdt blocks for resizing

# /etc/init.d/smbd stop

# umount /data

# resize2fs -p /dev/md1
resize2fs 1.43-WIP (22-Sep-2012)
Please run 'e2fsck -f /dev/md1' first.

# e2fsck -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
data: clean, 2012464/7630464 files, 1727380558/1953383296 blocks

# e2fsck -f -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
Pass 1: Checking inodes, blocks, and sizes
eh_magic = 0000 != f30a
Pass 2: Checking directory structure                                           
Pass 3: Checking directory connectivity                                        
Pass 4: Checking reference counts                                              
Pass 5: Checking group summary information                                     
                                                                               
     2012464 inodes used (26.37%, out of 7630464)
        9604 non-contiguous files (0.5%)
        1374 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 2009443/3000
  1727380558 blocks used (88.43%, out of 1953383296)
           0 bad blocks
         392 large files

     1096215 regular files
      916227 directories
           0 character device files
           0 block device files
           0 fifos
     4346198 links
          12 symbolic links (12 fast symbolic links)
           1 socket
------------
     6358653 files

# resize2fs -p /dev/md1
resize2fs 1.43-WIP (22-Sep-2012)
Resizing the filesystem on /dev/md1 to 5371804064 (4k) blocks.
Begin pass 1 (max = 104322)
Extending the inode table     XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 2 (max = 12201)
Relocating blocks             XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 59613)
Scanning inode table          XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 1)
Moving inode table            XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/md1 is now 5371804064 blocks long.

# e2fsck -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Group 1's inode table at 1997 conflicts with some other fs block.
Relocate<y>? ^C

	[This bit was lost to scrolling.  Something like...]

data: e2fsck canceled.                                                           

data: ***** FILE SYSTEM WAS MODIFIED *****

	[Then I ran e2fsck -n once, it scrolled too much, and I ran it
	again while capturing the output.]

# e2fsck -n -v -C0 /dev/md1
e2fsck 1.43-WIP (22-Sep-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
data was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Group 1's inode table at 1997 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 1998 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 1999 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 2000 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 2001 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 2002 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 2003 conflicts with some other fs block.
Relocate? no

Group 1's inode table at 2004 conflicts with some other fs block.
Relocate? no

Group 1's block bitmap at 1958 conflicts with some other fs block.
Relocate? no


This is tripleplusungood.  Any recovery suggestions eagerly received.
I'm poking around with debugfs -n right now...

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  3:51 64bit + resize2fs... this is Not Good George Spelvin
@ 2012-11-14  5:43 ` Theodore Ts'o
  2012-11-14  6:42   ` George Spelvin
                     ` (2 more replies)
  2012-11-14  6:27 ` George Spelvin
  1 sibling, 3 replies; 11+ messages in thread
From: Theodore Ts'o @ 2012-11-14  5:43 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Tue, Nov 13, 2012 at 10:51:01PM -0500, George Spelvin wrote:
> As people might know from my recent postings, I've been expanding a RAID
> with an ext4 file system.  This has uncovered Some Issues.
> 
> Because the final size exceeded 16 TiB, I had to use the 64bit support
> which is relatively recent.
> 
> But now carrying out the resize has produced some problems...
> 
> # resize2fs -p /dev/md1
> resize2fs 1.43-WIP (22-Sep-2012)
> Filesystem at /dev/md1 is mounted on /data; on-line resizing required
> old_desc_blocks = 932, new_desc_blocks = 2562
> resize2fs: Not enough reserved gdt blocks for resizing

OK, based on your description, you started with a device which was
8001057980416 bytes, and then grew it to 22002909446144 bytes.  So I
tried exactly the same thing using a file located on an xfs
partition (so I could make it that big):

# mkfs.xfs /dev/closure/bigscratch
# mount /dev/closure/bigscratch /mnt
# touch /mnt/foo.img
# truncate --size 8001057980416 /mnt/foo.img
# mke2fs -F -t ext4 -O 64bit /mnt/foo.img
# truncate --size 22002909446144 /mnt/foo.img
# mount /mnt/foo.img /u2
# resize2fs /dev/loop0

This succeeded for me:

# resize2fs /dev/loop0
resize2fs 1.43-WIP (21-Sep-2012)
Filesystem at /dev/loop0 is mounted on /u2; on-line resizing required
old_desc_blocks = 932, new_desc_blocks = 2562
The filesystem on /dev/loop0 is now 5371804064 blocks long.

What version of resize2fs were you using --- I know it says 1.43-WIP,
but what git commit version specifically were you using?  And how did
you compile it, and how did you install it?

Also, which kernel version were you using?

OK, let's try an off-line resize:

# truncate --size 8001057980416 /mnt/foo.img
# mke2fs -F -t ext4 -O 64bit /mnt/foo.img
# truncate --size 22002909446144 /mnt/foo.img
# e2fsck -fy /mnt/foo.img
# resize2fs -p /mnt/foo.img

So first of all, I don't see this line when running e2fsck:

> eh_magic = 0000 != f30a

# e2fsck -fy /mnt/foo.img
e2fsck 1.43-WIP (21-Sep-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/mnt/foo.img: 11/244174848 files (0.0% non-contiguous), 15457939/1953383296 blocks

.... and the resize2fs output is a little different:

# resize2fs -p /mnt/foo.img
resize2fs 1.43-WIP (21-Sep-2012)
Resizing the filesystem on /mnt/foo.img to 5371804064 (4k) blocks.

Begin pass 5 (max = 1)
Moving inode table            XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /mnt/foo.img is now 5371804064 blocks long.

However, when I try doing an e2fsck on the resulting file system, I do
see a very similar set of errors:

e2fsck 1.43-WIP (21-Sep-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Pass 1: Checking inodes, blocks, and sizes
Group 1's inode table at 2245 conflicts with some other fs block.
Relocate? yes

Group 1's block bitmap at 1958 conflicts with some other fs block.
Relocate? yes

Group 1's inode bitmap at 1974 conflicts with some other fs block.
Relocate? yes

....

Given that I was primarily focused on making resize2fs work using
on-line resizing, this doesn't completely surprise me, but it is
definitely a bug with resize2fs that needs fixing --- we need to make
off-line resizing work, and if there are bugs related to it, we need
to simply make resize2fs refuse to do the off-line resize.

So the first question is figuring out why the on-line resizing didn't
work for you, since that is what I've spent most of my time trying to
fix up.  The secondary question then is trying to figure out what happened
with the off-line resize, and to fix that bug in e2fsprogs.

Regards,

						- Ted

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  3:51 64bit + resize2fs... this is Not Good George Spelvin
  2012-11-14  5:43 ` Theodore Ts'o
@ 2012-11-14  6:27 ` George Spelvin
  1 sibling, 0 replies; 11+ messages in thread
From: George Spelvin @ 2012-11-14  6:27 UTC (permalink / raw)
  To: linux-ext4, linux

I'm studying the file system to see what I can salvage from the
wreckage that resize2fs left me with, and thought I'd keep a diary of
my discoveries here so someone can correct my errors.

I'm leaning heavily on
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout

debugfs stats tells me the following:

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file dir_nlink extra_isize metadata_csum
Inode count:              20983680
Block count:              5371804064
First block:              0
Block size:               4096
Blocks per group:         32768
Inodes per group:         128		// 128 * 256 = 32 Kbytes
Inode blocks per group:   8		// 8 * 4096  = 32 kbytes check!
Flex block group size:    16

So, I have 5371804064 = 32768 * 163934 + 14752 blocks, meaning I have 163935
block groups (the last one partial).

Since descriptors are 64 bytes each, that's 4096 * 2561 + 1984 bytes, or 2562
blocks per copy of the group descriptor table.
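
(Sanity-checking that arithmetic with some throwaway Python; the figures come
straight from the debugfs output above, nothing from the e2fsprogs source:)

blocks = 5371804064
blocks_per_group = 32768
desc_size = 64          # 64-byte group descriptors, because of the 64bit feature
block_size = 4096

groups = -(-blocks // blocks_per_group)               # ceiling division -> 163935
desc_blocks = -(-(groups * desc_size) // block_size)  # -> 2562
print(groups, desc_blocks)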

Since sparse_super is set, backup copies of the superblock and block group descriptors
are present only in groups 0, 1, 3, 5, 7, 9, 25, 27, 49, 81, 125, ...
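
(A throwaway Python loop reproduces that list; as I understand sparse_super,
backups live only in groups 0, 1 and powers of 3, 5 and 7:)

groups = 163935
backups = {0, 1}
for base in (3, 5, 7):
    g = base
    while g < groups:
        backups.add(g)
        g *= base
print(sorted(backups)[:12])
# [0, 1, 3, 5, 7, 9, 25, 27, 49, 81, 125, 243]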

Since flex_bg is set, block groups are basically "ganged" into groups
of 512Ki blocks.  So only groups 0, 16, 32, ... actually have bitmaps &
inode tables in them.  Thus, except for group 0, it's an either/or thing.

Checking a backup copy with
dumpe2fs -o superblock=$((9*32768)) /dev/md1
gives consistent results.  The differences are:

--- /tmp/0      2012-11-14 00:35:21.061916443 -0500
+++ /tmp/9      2012-11-14 00:35:23.985270766 -0500
@@ -9 +9 @@
-Filesystem state:         not clean with errors
+Filesystem state:         not clean
@@ -15 +15 @@
-Free blocks:              4643122829
+Free blocks:              348155533
@@ -29 +29 @@
-Last write time:          Tue Nov 13 22:35:53 2012
+Last write time:          Tue Nov 13 22:34:31 2012
@@ -45,11 +44,0 @@
-FS Error count:           779
-First error time:         Tue Nov 13 22:56:42 2012
-First error function:     ext4_iget
-First error line #:       3832
-First error inode #:      771
-First error block #:      0
-Last error time:          Tue Nov 13 23:10:01 2012
-Last error function:      ext4_iget
-Last error line #:        3832
-Last error inode #:       771
-Last error block #:       0
@@ -57 +46 @@
-Checksum:                 0x1a66baa3
+Checksum:                 0x921b7125

(Interesting that resize2fs updates the block count, but leaves the old free blocks
value alone.)


Anyway, looking at the output of dumpe2fs, I notice the first real oddity:

(Note that I'm not quite sure what ITABLE_ZEROED means, and the
nomenclature makes me VERY nervous.  Does it mean ITABLE_INITIALIZED?)

Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0xbf1d, unused inodes 0
  Primary superblock at 0, Group descriptors at 1-2562
  Block bitmap at 2571 (+2571), csum 0xba8922d8, Inode bitmap at 2572 (+2572), csum 0x7d86a677
  Inode table at 2563-2570 (+2563)
  17993 free blocks, 0 free inodes, 4 directories
Group 1: (Blocks 32768-65535) [ITABLE_ZEROED]
  Checksum 0xca14, unused inodes 0
  Backup superblock at 32768, Group descriptors at 32769-35330
  Block bitmap at 1958 (bg #0 + 1958), csum 0xef1e59c7, Inode bitmap at 1974 (bg #0 + 1974), csum 0x7d86a677
  Inode table at 1997-2004 (bg #0 + 1997)
  30205 free blocks, 0 free inodes, 2 directories
Group 2: (Blocks 65536-98303) [ITABLE_ZEROED]
  Checksum 0x8abe, unused inodes 0
  Block bitmap at 1959 (bg #0 + 1959), csum 0x17817d16, Inode bitmap at 1975 (bg #0 + 1975), csum 0x7d86a677
  Inode table at 2005-2012 (bg #0 + 2005)
  32768 free blocks, 0 free inodes, 2 directories
Group 3: (Blocks 98304-131071) [ITABLE_ZEROED]
  Checksum 0x478d, unused inodes 0
  Backup superblock at 98304, Group descriptors at 98305-100866
  Block bitmap at 1960 (bg #0 + 1960), csum 0xef1e59c7, Inode bitmap at 1976 (bg #0 + 1976), csum 0x7d86a677
  Inode table at 2013-2020 (bg #0 + 2013)
  30205 free blocks, 0 free inodes, 0 directories
Group 4: (Blocks 131072-163839) [ITABLE_ZEROED]
  Checksum 0xfbc4, unused inodes 0
  Block bitmap at 1961 (bg #0 + 1961), csum 0x17817d16, Inode bitmap at 1977 (bg #0 + 1977), csum 0x7d86a677
  Inode table at 2021-2028 (bg #0 + 2021)
  32768 free blocks, 0 free inodes, 3 directories
Group 5: (Blocks 163840-196607) [ITABLE_ZEROED]
  Checksum 0x5c5f, unused inodes 0
  Backup superblock at 163840, Group descriptors at 163841-166402
  Block bitmap at 1962 (bg #0 + 1962), csum 0xef1e59c7, Inode bitmap at 1978 (bg #0 + 1978), csum 0x7d86a677
  Inode table at 2029-2036 (bg #0 + 2029)
  30205 free blocks, 0 free inodes, 0 directories
Group 6: (Blocks 196608-229375) [ITABLE_ZEROED]
  Checksum 0x80c5, unused inodes 0
  Block bitmap at 1963 (bg #0 + 1963), csum 0x17817d16, Inode bitmap at 1979 (bg #0 + 1979), csum 0x7d86a677
  Inode table at 2037-2044 (bg #0 + 2037)
  32768 free blocks, 0 free inodes, 12 directories
Group 7: (Blocks 229376-262143) [ITABLE_ZEROED]
  Checksum 0xe21a, unused inodes 0
  Backup superblock at 229376, Group descriptors at 229377-231938
  Block bitmap at 1964 (bg #0 + 1964), csum 0xef1e59c7, Inode bitmap at 1980 (bg #0 + 1980), csum 0x7d86a677
  Inode table at 2045-2052 (bg #0 + 2045)
  30205 free blocks, 0 free inodes, 7 directories
Group 8: (Blocks 262144-294911) [ITABLE_ZEROED]
  Checksum 0xb69a, unused inodes 0
  Block bitmap at 1965 (bg #0 + 1965), csum 0x17817d16, Inode bitmap at 1981 (bg #0 + 1981), csum 0x7d86a677
  Inode table at 2053-2060 (bg #0 + 2053)
  32768 free blocks, 0 free inodes, 10 directories
Group 9: (Blocks 294912-327679) [ITABLE_ZEROED]
  Checksum 0x360d, unused inodes 0
  Backup superblock at 294912, Group descriptors at 294913-297474
  Block bitmap at 1966 (bg #0 + 1966), csum 0xef1e59c7, Inode bitmap at 1982 (bg #0 + 1982), csum 0x7d86a677
  Inode table at 2061-2068 (bg #0 + 2061)
  30205 free blocks, 0 free inodes, 10 directories
Group 10: (Blocks 327680-360447) [ITABLE_ZEROED]
  Checksum 0x6ba3, unused inodes 0
  Block bitmap at 1967 (bg #0 + 1967), csum 0x17817d16, Inode bitmap at 1983 (bg #0 + 1983), csum 0x7d86a677
  Inode table at 2069-2076 (bg #0 + 2069)
  32768 free blocks, 0 free inodes, 11 directories
Group 11: (Blocks 360448-393215) [ITABLE_ZEROED]
  Checksum 0xabd0, unused inodes 0
  Block bitmap at 1968 (bg #0 + 1968), csum 0x17817d16, Inode bitmap at 1984 (bg #0 + 1984), csum 0x7d86a677
  Inode table at 2077-2084 (bg #0 + 2077)
  32768 free blocks, 0 free inodes, 5 directories
Group 12: (Blocks 393216-425983) [ITABLE_ZEROED]
  Checksum 0x47de, unused inodes 0
  Block bitmap at 1969 (bg #0 + 1969), csum 0x17817d16, Inode bitmap at 1985 (bg #0 + 1985), csum 0x7d86a677
  Inode table at 2085-2092 (bg #0 + 2085)
  32768 free blocks, 0 free inodes, 7 directories
Group 13: (Blocks 425984-458751) [ITABLE_ZEROED]
  Checksum 0x5822, unused inodes 0
  Block bitmap at 1970 (bg #0 + 1970), csum 0x17817d16, Inode bitmap at 1986 (bg #0 + 1986), csum 0x7d86a677
  Inode table at 2093-2100 (bg #0 + 2093)
  32768 free blocks, 0 free inodes, 13 directories
Group 14: (Blocks 458752-491519) [ITABLE_ZEROED]
  Checksum 0x21db, unused inodes 0
  Block bitmap at 1971 (bg #0 + 1971), csum 0x17817d16, Inode bitmap at 1987 (bg #0 + 1987), csum 0x7d86a677
  Inode table at 2101-2108 (bg #0 + 2101)
  32768 free blocks, 0 free inodes, 9 directories
Group 15: (Blocks 491520-524287) [ITABLE_ZEROED]
  Checksum 0xdd66, unused inodes 0
  Block bitmap at 1972 (bg #0 + 1972), csum 0x17817d16, Inode bitmap at 1988 (bg #0 + 1988), csum 0x7561ee79
  Inode table at 2109-2116 (bg #0 + 2109)
  32768 free blocks, 92 free inodes, 12 directories
Group 16: (Blocks 524288-557055) [ITABLE_ZEROED]
  Checksum 0x5ee8, unused inodes 0
  Block bitmap at 524288 (+0), csum 0x316efbb2, Inode bitmap at 524304 (+16), csum 0x7d86a677
  Inode table at 524320-524327 (+32)
  30816 free blocks, 0 free inodes, 2 directories
Group 17: (Blocks 557056-589823) [ITABLE_ZEROED]
  Checksum 0x0f0a, unused inodes 0
  Block bitmap at 524289 (bg #16 + 1), csum 0x17817d16, Inode bitmap at 524305 (bg #16 + 17), csum 0x7d86a677
  Inode table at 524328-524335 (bg #16 + 40)
  32768 free blocks, 0 free inodes, 1 directories

Notice that Group 0's inode table starts at block 2563, immediately after the 1+2562 blocks of
superblock + block descriptors.

But groups 1..15 have their inode tables in the middle of the block descriptor array!  WTF?
Also, they're all in different orders.

Group 0 is inode table, block bitmap, inode bitmap.

Groups 1..15 have all their block bitmaps consecutive (blocks 1957-1972,
if we extrapolate where group 0's bitmaps would go), followed by inode
bitmaps and then inode tables.

Are they just unchanged from the pre-resize version?  The old file system
size was 1953383296 blocks, requiring 59613 block groups, and 932 blocks
to hold their descriptors.  So that doesn't quite make sense...

Oh!  Is that the "reserved GDT block" space?  Not quite sure how
it was computed, but maybe...
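
(The old geometry itself checks out against the resize2fs output above,
anyway; again this is just arithmetic, not anything from the tools:)

old_blocks = 1953383296
old_groups = -(-old_blocks // 32768)             # 59613 block groups
old_desc_blocks = -(-(old_groups * 64) // 4096)  # 932, matching old_desc_blocks
print(old_groups, old_desc_blocks)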


Is there a (forlorn hope!) chance that resize2fs actually relocated the
inodes properly, and just failed to update the block group descriptors
to point to them?  So all the data is actually safe, just misplaced?

Or perhaps only the first 16*128 = 2048 inodes are trashed?
Or perhaps even inodes 128-2047?

I notice that I can read inode 2 (the root directory) and inode 11
(lost+found); I cannot read inodes 513-775; inodes 2047 and 2048
look corrupted, but inode 2049 looks fine.

Oh, that's right, inode 0 is invalid, so inodes are 1-based.
So inodes 128 and 2049 might be right, but inodes 129-2048
are toast.
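
(Making the damage boundary explicit with a scrap of Python; 128 inodes per
group per the debugfs output, and inode_group() is just my throwaway helper:)

inodes_per_group = 128

def inode_group(ino):
    return (ino - 1) // inodes_per_group   # inode numbers are 1-based

print([inode_group(i) for i in (128, 129, 2048, 2049)])
# [0, 1, 15, 16] -- inodes 129-2048 live in groups 1-15, exactly the groups
# whose inode tables now point into the middle of the group descriptor area.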


I'm going to fire this off into the ether now and start the somewhat
time-consuming process of seeing if I can find the lost inodes somewhere
in block group 0.

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  5:43 ` Theodore Ts'o
@ 2012-11-14  6:42   ` George Spelvin
  2012-11-14  7:12   ` George Spelvin
  2012-11-14  7:20   ` George Spelvin
  2 siblings, 0 replies; 11+ messages in thread
From: George Spelvin @ 2012-11-14  6:42 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

First of all, thanks a lot, Ted, for the middle-of-the-night tech support.
I just fired off my discovery diary which I wrote before seeing your
e-mail.

Here are the basics:

I have a newer (Oct 14) version compiled but not installed, but git
reflog shows the version I installed (and used for this) was

commit cf3c2ccea647c7d0db20ced920b68e98761dcd16
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Sep 22 22:29:34 2012 -0400

    Update for e2fsprogs-1.43-WIP-2012-09-22


I compiled a Debian package using "make clean ; debian/rules binary"
in the git directory, and installed that.  The compiler is
cc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2

The system is mostly Ubuntu 12.04 LTS, but I am running an unmodified
v3.6.5 Linux kernel.  (Compiled using the Ubuntu kernel packaging tools
to linux-image-3.6.5_3.6.5-10.00.Custom_amd64.deb.)

Note that I currently DO have the "superblock checksum is
corrupt while mounted" bug in this kernel.


I have some faint hope that the inodes are intact, just the group
descriptors are wrong, and I'm trying to follow that up, because BG#0's
inodes *did* get relocated correctly.


One strange thing I did was I supplied both "-O 64bit" and "-E
resize=17179869184" when creating the file system.  To do that,
I used e2fsprogs as of 41bf599391faaf6523c9997eb467a86888542339
(Oct 14, "debugfs: teach the htree and ls commands to show directory checksums")
with a local patch described in an earlier e-mail to the list.

That may have caused some odd block group layouts to start with.


Can you tell me, in your test, where the various bitmaps and
inode tables are for the first 16 block groups, both before and
after the resize?  My resize appeared to not only move them, but
*reorder* them, and I'd like to see what it's "supposed" to do.


Thank you very much!

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  5:43 ` Theodore Ts'o
  2012-11-14  6:42   ` George Spelvin
@ 2012-11-14  7:12   ` George Spelvin
  2012-11-14  7:20   ` George Spelvin
  2 siblings, 0 replies; 11+ messages in thread
From: George Spelvin @ 2012-11-14  7:12 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> # mke2fs -F -t ext4 -O 64bit /mnt/foo.img

I can't find the mkfs command in my history logs, but
I'm pretty sure it was:

mke2fs -F -t ext4 -i 1048576 -O 64bit,metadata_csum,^huge_file -E stride=32,stripe_width=352,resize=17179869184 -L data /mnt/foo.img

... since some of those options affect layout.

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  5:43 ` Theodore Ts'o
  2012-11-14  6:42   ` George Spelvin
  2012-11-14  7:12   ` George Spelvin
@ 2012-11-14  7:20   ` George Spelvin
  2012-11-14 20:39     ` Theodore Ts'o
  2 siblings, 1 reply; 11+ messages in thread
From: George Spelvin @ 2012-11-14  7:20 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> So the first question is figuring out why the on-line resizing didn't
> work for you, since that is what I've spent most of my time trying to
> fix up.  The secondary question then is trying to figure out what happened
> with the off-line resize, and to fix that bug in e2fsprogs.

If you don't mind, *my* primary question is "what can I salvage from this
rubble?", since I didn't happen to have 8 TB of backup space available
to me when I did the resize, and there's soem stuff on the FS I'd
rather not lose...

So my big question is "where the F did inodes 129 through 2048
get copied to?", since the root directory contains a lot of inodes
in that range, and every one I can recover saves a lot of
pawing through lost+found later...


In hindsight, I wish to hell I had turned on -d 14 and logged the
results...


If you happen to want to rerun your test with -d8 and tell me what
happened there, I'd definitely appreciate it.

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14  7:20   ` George Spelvin
@ 2012-11-14 20:39     ` Theodore Ts'o
  2012-11-14 21:04       ` Theodore Ts'o
  2012-11-14 23:26       ` George Spelvin
  0 siblings, 2 replies; 11+ messages in thread
From: Theodore Ts'o @ 2012-11-14 20:39 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Wed, Nov 14, 2012 at 02:20:21AM -0500, George Spelvin wrote:
> 
> If you don't mind, *my* primary question is "what can I salvage from this
> rubble?", since I didn't happen to have 8 TB of backup space available
> to me when I did the resize, and there's some stuff on the FS I'd
> rather not lose...

Sigh...  ok, unfortunately that's not something I can answer right
away.  I *can* tell you what happened, though.  Figuring out the best
way to help you recover while minimizing data loss is going to require
more thought....

First of all, the support for 64-bit online-resizing didn't hit the
3.6 kernel, and since it's a new feature, it's unlikely the stable
kernel gatekeepers will accept it unless and until at least one
distribution who is using a stable kernel is willing to backport the
patches to their distro kernel.  So in order to use it, you'll need a
3.7-rc1 or newer kernel.  That explains why the on-line resizing
didn't work.

The reason why you lost so badly when you did an off-line resize was
because you explicitly changed the resize limit default, via the -E
resize=NNN option.  (Can you explain to me your thinking about why you
specified this, just out of curiosity?)  Normally the default is 1000
times the size of the original file system, or for a file system
larger than 1.6TB, enough so that the file system can be resized to
the maximum amount that can be supported via the resize_inode scheme,
which is 16TB.  In practice, for pretty much any large file system,
including pretty much any raid arrays, the default allows us to do
resizes without needing to move any inode table blocks.

So the way things would have worked with a default file system is that
resize2fs would (in off-line mode) resize the file system up to the
maximum 16TB, and then stop.  Using online resizing, a sufficiently
new kernel would use the resize_inode scheme up to the number of
reserved gdt blocks (which would by default take you to the 16TB
limit) and then switch over to the meta_bg scheme for doing on-line
resizing, which has no limits.

Unfortunately resize2fs in off-line resizing mode (a) does not yet
know how to use the meta_bg scheme for resizing, and (b) doesn't deal
well with the case where you (1) have multiple inode tables in the
same block group, as is the case when flex_bg is enabled (as it is
with ext4 file systems), and (2) need to move inode tables.
We protect against this by disallowing growing filesystems using
off-line resizing in the case where the file system has flex_bg but
does not have the resize_inode feature enabled.  *However*, if the
file system does have a resize_inode, but there is not a sufficient
number of gdt blocks (because of an explicitly specified -E resize=NNN
option to mke2fs), then this case isn't caught, and as a result
resize2fs will corrupt the file system.

Sigh.  Unfortunately, you fell into this corner case, which I failed
to foresee and protect against.

It is relatively easy to fix resize2fs so it detects this case and
handles it appropriately.  What is harder is how to fix a file system
which has been scrambled by resize2fs after the fact.  There will
definitely be some parts of the inode table which will have gotten
overwritten, because resize2fs doesn't handle the flex_bg layout
correctly when moving inode table blocks.  The question is what's the
best way of undoing the damage going forward, and that's going to have
to require some more thought and probably some experimentation and
development work.

If you don't need the information right away, I'd advise you to not
touch the file system, since any attempt to try to fix is likely to
make the problem worse (and will cause me to have to try to
replicate the attempted fix to see what happened as a result).  I'm
guessing that you've already tried running e2fsck -fy, which aborted
midway through the run?

						- Ted

P.S.  This doesn't exactly replicate what you did, but it's a simple
repro case of the failure which you hit.  The key to triggering the
failure is the specification of the -E resize=NNN option.  If you
remove this, resize2fs will not corrupt the file system:

# lvcreate -L 32g -n bigscratch /dev/closure
# mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
# lvresize -L 64g /dev/closure/bigscratch
# e2fsck -f /dev/closure/bigscratch
# resize2fs -p /dev/closure/bigscratch
# e2fsck -fy /dev/closure/bigscratch


* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14 20:39     ` Theodore Ts'o
@ 2012-11-14 21:04       ` Theodore Ts'o
  2012-11-14 23:26       ` George Spelvin
  1 sibling, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2012-11-14 21:04 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Wed, Nov 14, 2012 at 03:39:42PM -0500, Theodore Ts'o wrote:
> The reason why you lost so badly when you did an off-line resize was
> because you explicitly changed the resize limit default, via the -E
> resize=NNN option.  (Can you explain to me your thinking about why you
> specified this, just out of curiosity?)  Normally the default is 1000
> times the size of the original file system, or for a file system
> larger than 1.6TB, enough so that the file system can be resized to
> the maximum amount that can be supported via the resize_inode scheme,
> which is 16TB.

Correction: for any file system larger than 16GB....

	    	    	 	       - Ted

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14 20:39     ` Theodore Ts'o
  2012-11-14 21:04       ` Theodore Ts'o
@ 2012-11-14 23:26       ` George Spelvin
  2012-11-14 23:38         ` Theodore Ts'o
  1 sibling, 1 reply; 11+ messages in thread
From: George Spelvin @ 2012-11-14 23:26 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

>> If you don't mind, *my* primary question is "what can I salvage from
>> this rubble?", since I didn't happen to have 8 TB of backup space
>> available to me when I did the resize, and there's some stuff on the
>> FS I'd rather not lose...

> Sigh...  ok, unfortunately that's not something I can answer right
> away.  I *can* tell you what happened, though.  Figuring out the best
> way to help you recover while minimizing data loss is going to require
> more thought....

Well, I scanned BGs 0-15 (and a bit past) for anything that looked like
an inode block (and didn't find anything), or a directory block (and found
pretty much everything).

After talking to the owner of the machine, we decided to give up on
inodes 129-2048 and let e2fsck have at it, then use the directory
information to rename everything in lost+found back to the right names.

Unfortunately, that ended up with about 1.2 TB of data loss.  Some
is backups of other machines that can simply be re-backed-up, but most
is media that can be re-ripped.

There was an excellent chance of finding even large files written to
contiguous sectors, given that the FS was created immediately before
resize by "cp -a" from a 32bit file system, but it would take some
file-type-specific tools to see if a given sector range "looked like"
a correct file, and the *rest* of the FS would be inaccessible until
I got that working.

We decided that a faster recovery was better than a more complete one.


Fortunately, the important unreproducible files are mostly intact,
and what isn't there is mostly duplicated on people's laptops.

There are irreplaceable losses, but they're a very small fraction of
that 1.2 TB.  It's just going to be a PITA finding it all from scattered
sources.

> First of all, the support for 64-bit online-resizing didn't hit the
> 3.6 kernel, and since it's a new feature, it's unlikely the stable
> kernel gatekeepers will accept it unless and until at leastr one
> distribution who is using a stable kernel is willing to backport the
> patches to their distro kernel.  So in order to use it, you'll need a
> 3.7-rc1 or newer kernel.  That explains why the on-line resizing
> didn't work.

Ah, and I wasn't about to let an -rc kernel near this machine.

I keep reading docs and thinking features I want to use are more stable
than they are.

> The reason why you lost so badly when you did an off-line resize was
> because you explicitly changed the resize limit default, via the -E
> resize=NNN option.  (Can you explain to me your thinking about why you
> specified this, just out of curiosity?)

Because I figured there was *some* overhead, and the documented
preallocation of 1000x initial FS size was preposterously large.

I figured I'd explicitly set it to a reasonable value that was as large
as I could imagine the FS growing in future, and definitely large enough
for my planned resize.

Especially as I was informed (in the discussion when I reported the
"mke2fs -O 64bit -E resize=xxxx" bug) that it was actually *not* a hard
limit and resize2fs could move sectors to grow past that limit.

> Normally the default is 1000 times the size of the original file system,
> or for a file system larger than 16GB, enough so that the file system
> can be resized to the maximum amount that can be supported via the
> resize_inode scheme, which is 16TB.  In practice, for pretty much any
> large file system, including pretty much any raid arrays, the default
> allows us to do resizes without needing to move any inode table blocks.

Um, but on a 64bit file system, flex_bg *can* grow past 16 TB.

The limit is basically when BG#0 no longer has room for any data blocks,
and the fixed overhead runs into the backup superblock in block #32768.

With -i 1Mi and a flex_bg group size of 16, there are 1 + 16 * (1 +
1 + 8) = 161 blocks of other metadata (the superblock, plus 16 x (block
bitmap + inode bitmap + inode table)), leaving 32607 blocks available
for group descriptors.

At 4096/64 = 64 descriptors per block, that's 2086848 block groups,
260856 GiB = 254.7 TiB = 280 TB (decimal).
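
(Same arithmetic as a Python snippet, for anyone checking my numbers; this is
back-of-the-envelope only, not how mke2fs actually computes it:)

blocks_per_group = 32768
flex = 16
metadata = 1 + flex * (1 + 1 + 8)          # superblock + 16 x (bitmaps + inode table)
gdt_blocks = blocks_per_group - metadata   # 32607
groups = gdt_blocks * (4096 // 64)         # 64-byte descriptors -> 2086848 groups
print(groups * 128 / (1024 * 1024))        # 128 MiB per group -> ~254.7 TiB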


One source of the problem is that I asked for a 64 TiB grow limit
*and didn't get it*.  Even if mke2fs had done its math wrong and only
preallocated 32 * (64 TiB/128 MiB) = 16 MiB (4096 blocks) of GDT space,
I would have been fine.

If I had got the full 8192 blocks of preallocated GDT space that I asked
for, there definitely would not have been a problem.

It appears that it preallocated 1955 blocks of GDT space (enough for
15.27 TiB of array) and things went funny...

I tried resize=2^32 and 2^32-1 blocks when reporting the 64bit bug
to see if that affected the divide by 0, but even if I'd cut & pasted
those improperly, I should have had 2048 blocks of preallocated GDT.
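
(For the record, the arithmetic behind those figures; gdt_blocks_for() is just
my own throwaway helper, not anything out of mke2fs:)

def gdt_blocks_for(fs_blocks, desc_size=64, block_size=4096, blocks_per_group=32768):
    groups = -(-fs_blocks // blocks_per_group)         # ceiling division
    return -(-(groups * desc_size) // block_size)

print(gdt_blocks_for(2**32))              # 2048 -- what resize=2^32 works out to
print(1955 * (4096 // 64) * 128 / 2**20)  # ~15.27 -- TiB covered by 1955 GDT blocks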

> Sigh.  Unfortunately, you fell into this corner case, which I failed
> to forsee and protect against.

And by a bug in mke2fs, which failed to allocate the resize space I
requested.  If it had computed the space correctly for the 64bit case,
there would have been enough and all this would have been avoided.


Of course, fixing the resizer to handle flex_bg probably isn't that hard.
Assuming standard layout, all the blocks of a given type are contiguous,
and the only difference between a flex_bg group and simply a larger
block group is the layout of the inode bitmaps.

Just treat it as a block group with (in this case) 128 blocks of inode
table, 16 blocks of block bitmaps, and 16 blocks of inode bitmaps, and bingo.

You just need to verify the preconditions first and set up the 16 GDT
entries to point to bits of it later.

> It is relatively easy to fix resize2fs so it detects this case and
> handles it appropriately.  What is harder is how to fix a file system
> which has been scrambled by resize2fs after the fact.  There will
> definitely be some parts of the inode table which will have gotten
> overwritten, because resize2fs doesn't handle the flex_bg layout
> correctly when moving inode table blocks.  The question is what's the
> best way of undoing the damage going forward, and that's going to have
> to require some more thought and probably some experimentation and
> development work.
> 
> If you don't need the information right away, I'd advise you to not
> touch the file system, since any attempt to try to fix is likely to
> make the problem worse (and will cause me to have to try to
> replicate the attempted fix to see what happened as a result).  I'm
> guessing that you've already tried running e2fsck -fy, which aborted
> midway through the run?

I tried running e2fsck, which asked questions, and I hit ^C rather than
say anything.  Then I tried e2fsck -n, which aborted.

This afternoon, I ran e2fsck -y and did the reconstruction previously
described.  Without the inode block maps (which you imply got
overwritten), putting the files back together is very hard.

> P.S.  This doesn't exactly replicate what you did, but it's a simple
> repro case of the failure which you hit.  The key to triggering the
> failure is the specification of the -E resize=NNN option.  If you
> remove this, resize2fs will not corrupt the file system:
> 
> # lvcreate -L 32g -n bigscratch /dev/closure
> # mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
> # lvresize -L 64g /dev/closure/bigscratch
> # e2fsck -f /dev/closure/bigscratch
> # resize2fs -p /dev/closure/bigscratch
> # e2fsck -fy /dev/closure/bigscratch

Yeah, I see how that would cause problems, as you ask for 51.5G of
resize range.  What pisses me off is that I asked for 64 TiB!
(-E resize=17179869184)

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14 23:26       ` George Spelvin
@ 2012-11-14 23:38         ` Theodore Ts'o
  2012-11-15  3:43           ` George Spelvin
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2012-11-14 23:38 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Wed, Nov 14, 2012 at 06:26:33PM -0500, George Spelvin wrote:
> 
> > The reason why you lost so badly when you did an off-line resize was
> > because you explicitly changed the resize limit default, via the -E
> > resize=NNN option.  (Can you explain to me your thinking about why you
> > specified this, just out of curiosity?)
> 
> Because I figured there was *some* overhead, and the documented
> preallocation of 1000x initial FS size was preposterously large.

It's actually not 1000x.  It's 1000x up to a maximum of 1024 current
and reserved gdt blocks (which is the absolute maximum which can be
supported using the resize_inode feature).  Contrary to what you had
expected, it's simply not possible to have 2048 or 4096 reserved gdt
blocks using the resize_inode scheme.  That's because it stores the
reserved gdt blocks using an indirect/direct scheme, and that's all the
space that we have.  (With a 4k block and 4 bytes per block --- the
resize_inode scheme simply doesn't work above 16TB, since it uses 4-byte
block numbers --- 4k/4 = 1024 reserved gdt blocks.)
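
To spell out that arithmetic (back-of-the-envelope Python using the classic
32-byte descriptors that the resize_inode scheme was designed around, not
actual e2fsprogs code):

ptrs_per_block = 4096 // 4          # 4-byte block numbers in the resize inode
descs_per_gdt_block = 4096 // 32    # 32-byte descriptors, pre-64bit
bytes_per_group = 32768 * 4096      # 128 MiB per block group

print(ptrs_per_block * descs_per_gdt_block * bytes_per_group / 2**40)
# 16.0 -- i.e. 16 TiB is as far as the resize_inode scheme can ever take you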

> > Normally the default is 1000 times the size of the original file system,
> > or for a file system larger than 16GB, enough so that the file system
> > can be resized to the maximum amount that can be supported via the
> > resize_inode scheme, which is 16TB.  In practice, for pretty much any
> > large file system, including pretty much any raid arrays, the default
> > allows us to do resizes without needing to move any inode table blocks.
> 
> Um, but on a 64bit file system, flex_bg *can* grow past 16 TB.

Currently, only using the online resizing patches which are in 3.7-rc1
and newer kernels --- and that's using the meta_bg scheme, *not* the
resize_inode resizing scheme.

> One source of the problem is that I asked for a 64 TiB grow limit
> *and didn't get it*.  Even if mke2fs had done its math wrong and only
> preallocated 32 * (64 TiB/128 MiB) = 16 MiB (4096 blocks) of GDT space,
> I would have been fine.

It can't preallocate that many GDT blocks.  There simply isn't room in
the indirect block scheme.  I agree it should have given an error
message in that case.  Unfortunately the extended resize=NNN option
is not one that has gotten much attention or use, and so there
are bugs hiding there that have been around for years and years.  :-(

> Of course, fixing the resizer to handle flex_bg probably isn't that hard.
> Assuming standard layout, all the blocks of a given type are contiguous,
> and the only difference between a flex_bg group and simply a larger
> block group is the layout of the inode bitmaps.
> 
> Just treat it as a block group with (in this case) 128 pages of inodes,
> 8 pages of block bitmap, and 8 pages of inode bitmap, and bingo.

Except that with the flex_bg feature, the inode table blocks can be
anywhere.  Usually they are contiguous, but they don't have to be ---
if there are bad blocks, or depending on how the file system had been
previously resized, it's possible that the inode tables for the block
groups might not be adjacent.

Probably the best thing to do is to just find some new space for the
block group's inode table, and not try to keep it contiguous.
Ultimately, the best thing to do is to just let mke2fs use the
defaults, resize up to 16TB without needing to move any inode table
blocks, and then switch over to the meta_bg scheme, and add support
for meta_bg resizing in resize2fs.

I just simply haven't had time to work on this.  :-(

> > # lvcreate -L 32g -n bigscratch /dev/closure
> > # mke2fs -t ext4 -E resize=12582912 /dev/closure/bigscratch
> > # lvresize -L 64g /dev/closure/bigscratch
> > # e2fsck -f /dev/closure/bigscratch
> > # resize2fs -p /dev/closure/bigscratch
> > # e2fsck -fy /dev/closure/bigscratch
> 
> Yeah, I see how that would cause problems, as you ask for 51.5G of
> resize range.  What pisses me off is that I asked for 64 TiB!
> (-E resize=17179869184)

Yes, mke2fs should have issued an error message to let you know
there's no way it could honor your request.

Again, I'm really sorry; you were exploring some of the less well
tested code paths in e2fsprogs/resize2fs.  :-(

     	     	       	     	   	- Ted

* Re: 64bit + resize2fs... this is Not Good.
  2012-11-14 23:38         ` Theodore Ts'o
@ 2012-11-15  3:43           ` George Spelvin
  0 siblings, 0 replies; 11+ messages in thread
From: George Spelvin @ 2012-11-15  3:43 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> It's actually not 1000x.  It's 1000x up to a maximum of 1024 current
> and reserved gdt blocks (which is the absolute maximum which can be
> supported using the resize_inode feature).  Contrary to what you had
> expected, it's simply not possible to have 2048 or 4096 reserved gdt
> blocks using the resize_inode scheme.  That's because it stores the
> reserved gdt blocks using an indirect/direct scheme, and that's all the
> space that we have.  (With a 4k block and 4 bytes per block --- the
> resize_inode scheme simply doesn't work above 16TB, since it uses 4-byte
> block numbers --- 4k/4 = 1024 reserved gdt blocks.)

Er... you can't use extents?  The blocks *are* all contiguous.

>> Yeah, I see how that would cause problems, as you ask for 51.5G of
>> resize range.  What pisses me off is that I asked for 64 TiB!
>> (-E resize=17179869184)

> Yes, mke2fs should have issued an error message to let you know
> there's no way it could honor your request.

As long as I get to be at least a *little* bit grumpy that *both*
mke2fs and resize2fs, when asked to do something they couldn't do,
failed to produce any sort of error message, but silently f***ed it up.

> Again, I'm really sorry; you were exploring some of the less well
> tested code paths in e2fsprogs/resize2fs.  :-(

I seem to be developing a knack for that this last couple of months. :-(

I *thought* I was doing the obvious thing.

All I set out to do was expand a 10 TB RAID to 22 TB.
Really, at every step I *thought* I chose the *safest*
possible option.

1. Restripe RAID
2. Try to resize FS, hit 16 TB limit.
3. Restripe RAID back down.
4. Create new 8 TB RAID from new drives
5. Format with 64-bit ext4, telling mke2fs that I will be resizing later.
5a. Fight with bug in mke2fs while doing so.

6. Copy over files from 32-bit FS
7. Destroy old RAID, and add drives to new RAID
8. Restripe up to 22 TB (again!)
9. Resize file system.  Personally, an off-line technique
  "feels safer" than on-line, so I went with that.

10. Kablooie!

Other than skipping the first 3 steps, what was I supposed to
do differently?
