* ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps [not found] <CAHk-=wj-H5BYCU_kKiOK=B9sN3BtRzL4pnne2AJPyf54nQ+d=w@mail.gmail.com> @ 2020-10-05 8:14 ` Josh Triplett 2020-10-05 9:46 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-05 8:14 UTC (permalink / raw) To: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara Cc: Linux Kernel Mailing List, linux-ext4 Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems with intentionally overlapping bitmap blocks. On an always-read-only filesystem explicitly marked with EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to point all the block and inode bitmaps to a single block of all 1s, because a read-only filesystem will never allocate or free any blocks or inodes. However, after that commit, the block validity check rejects such filesystems with -EUCLEAN and "failed to initialize system zone (-117)". This causes systems that previously worked correctly to fail when upgrading to v5.9-rc2 or later. This was obviously a bugfix, and I'm not suggesting that it should be reverted; it looks like this effectively worked by accident before, because the block_validity check wasn't fully functional. However, this does break real systems, and I'd like to get some kind of regression fix in before 5.9 final if possible. I think it would suffice to make block_validity default to false if and only if EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. Does that seem like a reasonable fix? Here's a quick sketch of a patch, which I've tested and confirmed to work: ----- 8< ----- Subject: [PATCH] Fix ext4 regression in v5.9-rc2 on ro fs with overlapped bitmaps Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems with intentionally overlapping bitmap blocks. On an always-read-only filesystem explicitly marked with EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to point all the block and inode bitmaps to a single block of all 1s, because a read-only filesystem will never allocate or free any blocks or inodes. However, after that commit, the block validity check rejects such filesystems with -EUCLEAN and "failed to initialize system zone (-117)". This causes systems that previously worked correctly to fail when upgrading to v5.9-rc2 or later. Fix this by defaulting block_validity to off when EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Fixes: e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in ext4_setup_system_zone()") --- fs/ext4/ext4.h | 2 ++ fs/ext4/super.c | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 523e00d7b392..7874028fa864 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1834,6 +1834,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode) #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400 #define EXT4_FEATURE_RO_COMPAT_READONLY 0x1000 #define EXT4_FEATURE_RO_COMPAT_PROJECT 0x2000 +#define EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS 0x4000 #define EXT4_FEATURE_RO_COMPAT_VERITY 0x8000 #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001 @@ -1930,6 +1931,7 @@ EXT4_FEATURE_RO_COMPAT_FUNCS(bigalloc, BIGALLOC) EXT4_FEATURE_RO_COMPAT_FUNCS(metadata_csum, METADATA_CSUM) EXT4_FEATURE_RO_COMPAT_FUNCS(readonly, READONLY) EXT4_FEATURE_RO_COMPAT_FUNCS(project, PROJECT) +EXT4_FEATURE_RO_COMPAT_FUNCS(shared_blocks, SHARED_BLOCKS) EXT4_FEATURE_RO_COMPAT_FUNCS(verity, VERITY) EXT4_FEATURE_INCOMPAT_FUNCS(compression, COMPRESSION) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index ea425b49b345..f57a7e966e44 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3954,7 +3954,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) else set_opt(sb, ERRORS_RO); /* block_validity enabled by default; disable with noblock_validity */ - set_opt(sb, BLOCK_VALIDITY); + if (!ext4_has_feature_shared_blocks(sb)) + set_opt(sb, BLOCK_VALIDITY); if (def_mount_opts & EXT4_DEFM_DISCARD) set_opt(sb, DISCARD); ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 8:14 ` ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps Josh Triplett @ 2020-10-05 9:46 ` Jan Kara 2020-10-05 10:16 ` Josh Triplett 2020-10-05 16:20 ` Jan Kara 2020-10-05 17:36 ` Darrick J. Wong 2 siblings, 1 reply; 34+ messages in thread From: Jan Kara @ 2020-10-05 9:46 UTC (permalink / raw) To: Josh Triplett Cc: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 Hi Josh! On Mon 05-10-20 01:14:54, Josh Triplett wrote: > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > with intentionally overlapping bitmap blocks. > > On an always-read-only filesystem explicitly marked with > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > point all the block and inode bitmaps to a single block of all 1s, > because a read-only filesystem will never allocate or free any blocks or > inodes. > However, after that commit, the block validity check rejects such > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > This causes systems that previously worked correctly to fail when > upgrading to v5.9-rc2 or later. > > This was obviously a bugfix, and I'm not suggesting that it should be > reverted; it looks like this effectively worked by accident before, > because the block_validity check wasn't fully functional. However, this > does break real systems, and I'd like to get some kind of regression fix > in before 5.9 final if possible. I think it would suffice to make > block_validity default to false if and only if > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > Does that seem like a reasonable fix? Well, but EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is your internal feature that's not present in current upstream kernel AFAICS. So if you have out of tree (even disk format incompatible) filesystem feature and upstream kernel changes in a way that breaks you, I'd say it's your problem to keep the pieces together? So IMO you need to change implementation of your EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS feature to work with more paranoid block validity checking. Whether you'll do that by disabling block validity or by properly teaching block validity code about that feature is up to you but I don't think that belongs to the upstream kernel since that is correct as is... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 9:46 ` Jan Kara @ 2020-10-05 10:16 ` Josh Triplett 2020-10-05 16:19 ` Jan Kara 0 siblings, 1 reply; 34+ messages in thread From: Josh Triplett @ 2020-10-05 10:16 UTC (permalink / raw) To: Jan Kara Cc: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 11:46:01AM +0200, Jan Kara wrote: > On Mon 05-10-20 01:14:54, Josh Triplett wrote: > > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > > with intentionally overlapping bitmap blocks. > > > > On an always-read-only filesystem explicitly marked with > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > > point all the block and inode bitmaps to a single block of all 1s, > > because a read-only filesystem will never allocate or free any blocks or > > inodes. > > However, after that commit, the block validity check rejects such > > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > > This causes systems that previously worked correctly to fail when > > upgrading to v5.9-rc2 or later. > > > > This was obviously a bugfix, and I'm not suggesting that it should be > > reverted; it looks like this effectively worked by accident before, > > because the block_validity check wasn't fully functional. However, this > > does break real systems, and I'd like to get some kind of regression fix > > in before 5.9 final if possible. I think it would suffice to make > > block_validity default to false if and only if > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > > > Does that seem like a reasonable fix? > > Well, but EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is your internal feature > that's not present in current upstream kernel AFAICS. It isn't "my" feature; the value for EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is defined in the headers in the e2fsprogs tree. The kernel currently does absolutely nothing with it, nor did it previously need to; it's just an RO_COMPAT feature which ensures that the kernel can only mount the filesystem read-only. The point is that an always-read-only filesystem will never change the block or inode bitmaps, so ensuring they don't overlap is unnecessary (and harmful). I only added EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS to the header to generate the corresponding ext4_has_feature_shared_blocks function. I have filesystems that previous kernels mounted and worked with just fine, and new kernels reject. That seems like a regression to me. I'm suggesting the simplest possible way I can see to fix that regression. Another approach would be to default block_validity to false for any read-only filesystem mount (since it won't be written to), but that seemed like it'd be more invasive; I was going for a more minimal change. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 10:16 ` Josh Triplett @ 2020-10-05 16:19 ` Jan Kara 0 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2020-10-05 16:19 UTC (permalink / raw) To: Josh Triplett Cc: Jan Kara, Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon 05-10-20 03:16:41, Josh Triplett wrote: > On Mon, Oct 05, 2020 at 11:46:01AM +0200, Jan Kara wrote: > > On Mon 05-10-20 01:14:54, Josh Triplett wrote: > > > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > > > > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > > > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > > > with intentionally overlapping bitmap blocks. > > > > > > On an always-read-only filesystem explicitly marked with > > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > > > point all the block and inode bitmaps to a single block of all 1s, > > > because a read-only filesystem will never allocate or free any blocks or > > > inodes. > > > However, after that commit, the block validity check rejects such > > > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > > > This causes systems that previously worked correctly to fail when > > > upgrading to v5.9-rc2 or later. > > > > > > This was obviously a bugfix, and I'm not suggesting that it should be > > > reverted; it looks like this effectively worked by accident before, > > > because the block_validity check wasn't fully functional. However, this > > > does break real systems, and I'd like to get some kind of regression fix > > > in before 5.9 final if possible. I think it would suffice to make > > > block_validity default to false if and only if > > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > > > > > Does that seem like a reasonable fix? > > > > Well, but EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is your internal feature > > that's not present in current upstream kernel AFAICS. > > It isn't "my" feature; the value for > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is defined in the headers in the > e2fsprogs tree. The kernel currently does absolutely nothing with it, > nor did it previously need to; it's just an RO_COMPAT feature which > ensures that the kernel can only mount the filesystem read-only. The > point is that an always-read-only filesystem will never change the block > or inode bitmaps, so ensuring they don't overlap is unnecessary (and > harmful). Ah, I see. I missed EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is actually defined in e2fsprogs. Then what you suggests makes sense I guess and it's good the headers are synced up again... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 8:14 ` ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps Josh Triplett 2020-10-05 9:46 ` Jan Kara @ 2020-10-05 16:20 ` Jan Kara 2020-10-05 17:36 ` Darrick J. Wong 2 siblings, 0 replies; 34+ messages in thread From: Jan Kara @ 2020-10-05 16:20 UTC (permalink / raw) To: Josh Triplett Cc: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon 05-10-20 01:14:54, Josh Triplett wrote: > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > with intentionally overlapping bitmap blocks. > > On an always-read-only filesystem explicitly marked with > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > point all the block and inode bitmaps to a single block of all 1s, > because a read-only filesystem will never allocate or free any blocks or > inodes. > > However, after that commit, the block validity check rejects such > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > This causes systems that previously worked correctly to fail when > upgrading to v5.9-rc2 or later. > > This was obviously a bugfix, and I'm not suggesting that it should be > reverted; it looks like this effectively worked by accident before, > because the block_validity check wasn't fully functional. However, this > does break real systems, and I'd like to get some kind of regression fix > in before 5.9 final if possible. I think it would suffice to make > block_validity default to false if and only if > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > Does that seem like a reasonable fix? > > Here's a quick sketch of a patch, which I've tested and confirmed to > work: > > ----- 8< ----- > Subject: [PATCH] Fix ext4 regression in v5.9-rc2 on ro fs with overlapped bitmaps > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > with intentionally overlapping bitmap blocks. > > On an always-read-only filesystem explicitly marked with > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > point all the block and inode bitmaps to a single block of all 1s, > because a read-only filesystem will never allocate or free any blocks or > inodes. > > However, after that commit, the block validity check rejects such > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > This causes systems that previously worked correctly to fail when > upgrading to v5.9-rc2 or later. > > Fix this by defaulting block_validity to off when > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > Signed-off-by: Josh Triplett <josh@joshtriplett.org> > Fixes: e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in ext4_setup_system_zone()") The patch looks fine to me. Thanks for fixing this and for educating me about the feature :) You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/ext4/ext4.h | 2 ++ > fs/ext4/super.c | 3 ++- > 2 files changed, 4 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 523e00d7b392..7874028fa864 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -1834,6 +1834,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode) > #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400 > #define EXT4_FEATURE_RO_COMPAT_READONLY 0x1000 > #define EXT4_FEATURE_RO_COMPAT_PROJECT 0x2000 > +#define EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS 0x4000 > #define EXT4_FEATURE_RO_COMPAT_VERITY 0x8000 > > #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001 > @@ -1930,6 +1931,7 @@ EXT4_FEATURE_RO_COMPAT_FUNCS(bigalloc, BIGALLOC) > EXT4_FEATURE_RO_COMPAT_FUNCS(metadata_csum, METADATA_CSUM) > EXT4_FEATURE_RO_COMPAT_FUNCS(readonly, READONLY) > EXT4_FEATURE_RO_COMPAT_FUNCS(project, PROJECT) > +EXT4_FEATURE_RO_COMPAT_FUNCS(shared_blocks, SHARED_BLOCKS) > EXT4_FEATURE_RO_COMPAT_FUNCS(verity, VERITY) > > EXT4_FEATURE_INCOMPAT_FUNCS(compression, COMPRESSION) > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index ea425b49b345..f57a7e966e44 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -3954,7 +3954,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) > else > set_opt(sb, ERRORS_RO); > /* block_validity enabled by default; disable with noblock_validity */ > - set_opt(sb, BLOCK_VALIDITY); > + if (!ext4_has_feature_shared_blocks(sb)) > + set_opt(sb, BLOCK_VALIDITY); > if (def_mount_opts & EXT4_DEFM_DISCARD) > set_opt(sb, DISCARD); > > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 8:14 ` ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps Josh Triplett 2020-10-05 9:46 ` Jan Kara 2020-10-05 16:20 ` Jan Kara @ 2020-10-05 17:36 ` Darrick J. Wong 2020-10-06 0:04 ` Theodore Y. Ts'o 2020-10-06 0:32 ` Josh Triplett 2 siblings, 2 replies; 34+ messages in thread From: Darrick J. Wong @ 2020-10-05 17:36 UTC (permalink / raw) To: Josh Triplett Cc: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 01:14:54AM -0700, Josh Triplett wrote: > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > with intentionally overlapping bitmap blocks. > > On an always-read-only filesystem explicitly marked with > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > point all the block and inode bitmaps to a single block LOL, WHAT? I didn't know shared blocks applied to fs metadata. I thought that "shared" only applied to file extent maps being able to share physical blocks. Could /somebody/ please document the ondisk format changes that are associated with this feature? > of all 1s, > because a read-only filesystem will never allocate or free any blocks or > inodes. All 1s? So the inode bitmap says that every inode table slot is in use, even if the inode record itself says it isn't? What does e2fsck -n think about that kind of metadata inconsistency? Hmm, let's try. $ truncate -s 300m /tmp/a.img $ mke2fs -T ext4 -O shared_blocks /tmp/a.img -d /tmp/ -F mke2fs 1.46~WIP-2020-10-04 (4-Oct-2020) Invalid filesystem option set: shared_blocks Oookay. So that's not how you create these shared block ext4s, apparently... $ mke2fs -T ext4 /tmp/a.img -F mke2fs 1.46~WIP-2020-10-04 (4-Oct-2020) Discarding device blocks: done Creating filesystem with 76800 4k blocks and 19200 inodes Filesystem UUID: 0a763191-89ca-49bc-9dc6-bf2986009ad9 Superblock backups stored on blocks: 32768 Allocating group tables: done Writing inode tables: done Creating journal (4096 blocks): done Writing superblocks and filesystem accounting information: done $ debugfs -w /tmp/a.img debugfs 1.45.6 (20-Mar-2020) debugfs: features shared_blocks Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum shared_blocks debugfs: set_bg 1 inode_bitmap 42 debugfs: set_bg 1 block_bitmap 39 debugfs: stats Group 0: block bitmap at 39, inode bitmap at 42, inode table at 45 31517 free blocks, 6389 free inodes, 2 used directories, 6389 unused inodes [Checksum 0xda06] Group 1: block bitmap at 39, inode bitmap at 42, inode table at 445 28633 free blocks, 6400 free inodes, 0 used directories, 6400 unused inodes [Inode not init, Checksum 0x2e69] $ xfs_io -c "pwrite -S 0xFF $((39 * 4096)) 4096" /tmp/a.img $ xfs_io -c "pwrite -S 0xFF $((42 * 4096)) 4096" /tmp/a.img Ok, now we have a shared blocks fs where BG 0 and BG 1 share bitmaps, and the bitmaps are set to 1. $ e2fsck -n /tmp/a.img e2fsck 1.45.6 (20-Mar-2020) ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap e2fsck: Group descriptors look bad... trying backup blocks... /tmp/a.img was not cleanly unmounted, check forced. Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -(1251--32767) Fix? no Free blocks count wrong for group #0 (31517, counted=0). Fix? no Free blocks count wrong (71414, counted=39897). Fix? no Inode bitmap differences: -(12--6400) Fix? no Free inodes count wrong for group #0 (6389, counted=0). Fix? no Free inodes count wrong (19189, counted=12800). Fix? no Padding at end of inode bitmap is not set. Fix? no Inode bitmap differences: Group 0 inode bitmap does not match checksum. IGNORED. Group 1 inode bitmap does not match checksum. IGNORED. Group 2 inode bitmap does not match checksum. IGNORED. Block bitmap differences: Group 0 block bitmap does not match checksum. IGNORED. /tmp/a.img: ********** WARNING: Filesystem still has errors ********** /tmp/a.img: 11/19200 files (0.0% non-contiguous), 5386/76800 blocks Sooooo... are you shipping ext4 images with an undocumented ondisk format variation that looks like inconsistency to the standard tools? --D > > However, after that commit, the block validity check rejects such > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > This causes systems that previously worked correctly to fail when > upgrading to v5.9-rc2 or later. > > This was obviously a bugfix, and I'm not suggesting that it should be > reverted; it looks like this effectively worked by accident before, > because the block_validity check wasn't fully functional. However, this > does break real systems, and I'd like to get some kind of regression fix > in before 5.9 final if possible. I think it would suffice to make > block_validity default to false if and only if > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > Does that seem like a reasonable fix? > > Here's a quick sketch of a patch, which I've tested and confirmed to > work: > > ----- 8< ----- > Subject: [PATCH] Fix ext4 regression in v5.9-rc2 on ro fs with overlapped bitmaps > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > with intentionally overlapping bitmap blocks. > > On an always-read-only filesystem explicitly marked with > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > point all the block and inode bitmaps to a single block of all 1s, > because a read-only filesystem will never allocate or free any blocks or > inodes. > > However, after that commit, the block validity check rejects such > filesystems with -EUCLEAN and "failed to initialize system zone (-117)". > This causes systems that previously worked correctly to fail when > upgrading to v5.9-rc2 or later. > > Fix this by defaulting block_validity to off when > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS is set. > > Signed-off-by: Josh Triplett <josh@joshtriplett.org> > Fixes: e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in ext4_setup_system_zone()") > --- > fs/ext4/ext4.h | 2 ++ > fs/ext4/super.c | 3 ++- > 2 files changed, 4 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 523e00d7b392..7874028fa864 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -1834,6 +1834,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode) > #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400 > #define EXT4_FEATURE_RO_COMPAT_READONLY 0x1000 > #define EXT4_FEATURE_RO_COMPAT_PROJECT 0x2000 > +#define EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS 0x4000 > #define EXT4_FEATURE_RO_COMPAT_VERITY 0x8000 > > #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001 > @@ -1930,6 +1931,7 @@ EXT4_FEATURE_RO_COMPAT_FUNCS(bigalloc, BIGALLOC) > EXT4_FEATURE_RO_COMPAT_FUNCS(metadata_csum, METADATA_CSUM) > EXT4_FEATURE_RO_COMPAT_FUNCS(readonly, READONLY) > EXT4_FEATURE_RO_COMPAT_FUNCS(project, PROJECT) > +EXT4_FEATURE_RO_COMPAT_FUNCS(shared_blocks, SHARED_BLOCKS) > EXT4_FEATURE_RO_COMPAT_FUNCS(verity, VERITY) > > EXT4_FEATURE_INCOMPAT_FUNCS(compression, COMPRESSION) > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index ea425b49b345..f57a7e966e44 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -3954,7 +3954,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) > else > set_opt(sb, ERRORS_RO); > /* block_validity enabled by default; disable with noblock_validity */ > - set_opt(sb, BLOCK_VALIDITY); > + if (!ext4_has_feature_shared_blocks(sb)) > + set_opt(sb, BLOCK_VALIDITY); > if (def_mount_opts & EXT4_DEFM_DISCARD) > set_opt(sb, DISCARD); > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 17:36 ` Darrick J. Wong @ 2020-10-06 0:04 ` Theodore Y. Ts'o 2020-10-06 0:32 ` Josh Triplett 1 sibling, 0 replies; 34+ messages in thread From: Theodore Y. Ts'o @ 2020-10-06 0:04 UTC (permalink / raw) To: Darrick J. Wong Cc: Josh Triplett, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote: > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > > with intentionally overlapping bitmap blocks. > > > > On an always-read-only filesystem explicitly marked with > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > > point all the block and inode bitmaps to a single block > > LOL, WHAT? > > I didn't know shared blocks applied to fs metadata. I thought that > "shared" only applied to file extent maps being able to share physical > bloctks. My understanding matches Darrick's. I was going to track down the Google engineer who has most recently (as far as I know) enhanced e2fsprogs's support of the shared block feature (see the commits returned by "git log --author dvander@google.com contrib/android") but he's apparently out of the office today. Hopefully I'll be able to track him down and ask about this tomorrow. > Oookay. So that's not how you create these shared block ext4s, > apparently... Yeah, they are created by the e2fsdroid program. See sources in contrib/e2fsdroid. I took a quick look, and I don't see anything there which is frobbing with with the bitmaps; but perhaps I'm missing something, which is why I'd ask David to see if he knows anything about this. More to the point, if we are have someone who is trying to dedup or otherwise frob with bitmaps, I suspect this will break "e2fsck -E unshare_blocks /dev/XXX", which is a way that you can take a root file system which is using shared_blocks, and turn it into something that can actually be mount read/write. This is something that I believe was being used by AOSP "debug" or "userdebug" (I'm a bit fuzzy on the details) so that Android developers couldn't actually modify the root file system. (Of course, you have to also disable dm-verity in order for this to work....) Unfortunately, e2fsdroid is currently not buildable under the standard Linux compilation environment. For the reason why, see: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928551#75 The first step would be to teach e2fsprogs's configure to check for libsparse, and to link against it if it's available. But before we could enable this by default for Linux distribution, we need to link against libsparse using dlopen(), since most distro release engineers would be.... cranky.... if mke2fs were to drag in some random Android libraries that have to be installed as shared libraries in their installers. Which is the point of comment #75 in the above bug. Since the only use of shared_blocks is for Android, since very few other projects want a completely read-only root file system, and where dedup is actually significantly helpful, we've never tried to make this work outside of the Android context. At least in theory, though, it might be useful if we could create shared_block file systems using "mke2fs -O shared_blocks -d /path/to/embedded-root-fs system.img 1G". Patches gratefully accepted.... Cheers, - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-05 17:36 ` Darrick J. Wong 2020-10-06 0:04 ` Theodore Y. Ts'o @ 2020-10-06 0:32 ` Josh Triplett 2020-10-06 2:51 ` Darrick J. Wong 1 sibling, 1 reply; 34+ messages in thread From: Josh Triplett @ 2020-10-06 0:32 UTC (permalink / raw) To: Darrick J. Wong Cc: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote: > On Mon, Oct 05, 2020 at 01:14:54AM -0700, Josh Triplett wrote: > > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > > with intentionally overlapping bitmap blocks. > > > > On an always-read-only filesystem explicitly marked with > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > > point all the block and inode bitmaps to a single block > > LOL, WHAT? > > I didn't know shared blocks applied to fs metadata. I thought that > "shared" only applied to file extent maps being able to share physical > blocks. The flag isn't documented very well yet, but since it specifically forces the filesystem to always be mounted read-only, the bitmaps really shouldn't matter at all. (In an ideal world, a permanently read-only filesystem should be able to omit all the bitmaps entirely. Pointing them all to a single disk block is the next best thing.) > Could /somebody/ please document the ondisk format changes that are > associated with this feature? I pretty much had to sort it out by looking at a combination of e2fsprogs and the kernel, and a lot of experimentation, until I ended up with something that the kernel was completely happy with without a single complaint. I'd be happy to write up a summary of the format. I'd still really like to see this patch applied for 5.9, to avoid having filesystems that an old kernel can mount but a new one can't. This still seems like a regression to me. > > of all 1s, > > because a read-only filesystem will never allocate or free any blocks or > > inodes. > > All 1s? So the inode bitmap says that every inode table slot is in use, > even if the inode record itself says it isn't? Yes. > What does e2fsck -n > think about that kind of metadata inconsistency? If you set up the rest of the metadata consistently with it (for instance, 0 free blocks and 0 free inodes), you'll only get a single complaint, from the e2fsck equivalent of block_validity. See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on that; with that fixed, e2fsck wouldn't complain at all. The kernel, prior to 5.9-rc2, doesn't have a single complaint, whether at mount, unmount, or read of every file and directory on the filesystem. The errors you got in your e2fsck run came because you just overrode the bitmaps, but didn't make the rest of the metadata consistent with that. I can provide a sample filesystem if that would help. - Josh ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-06 0:32 ` Josh Triplett @ 2020-10-06 2:51 ` Darrick J. Wong 2020-10-06 3:18 ` Theodore Y. Ts'o 0 siblings, 1 reply; 34+ messages in thread From: Darrick J. Wong @ 2020-10-06 2:51 UTC (permalink / raw) To: Josh Triplett Cc: Linus Torvalds, Theodore Ts'o, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 05:32:16PM -0700, Josh Triplett wrote: > On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote: > > On Mon, Oct 05, 2020 at 01:14:54AM -0700, Josh Triplett wrote: > > > Ran into an ext4 regression when testing upgrades to 5.9-rc kernels: > > > > > > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in > > > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems > > > with intentionally overlapping bitmap blocks. > > > > > > On an always-read-only filesystem explicitly marked with > > > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to > > > point all the block and inode bitmaps to a single block > > > > LOL, WHAT? > > > > I didn't know shared blocks applied to fs metadata. I thought that > > "shared" only applied to file extent maps being able to share physical > > blocks. > > The flag isn't documented very well yet, but since it specifically > forces the filesystem to always be mounted read-only, the bitmaps really > shouldn't matter at all. (In an ideal world, a permanently read-only > filesystem should be able to omit all the bitmaps entirely. Pointing > them all to a single disk block is the next best thing.) I disagree; creating undocumented forks of an existing ondisk format (especially one that presents as inconsistent metadata to regular tools) is creating a ton of pain for future users and maintainers when the incompat forks collide with the canonical implementation(s). (Granted, I don't know if you /created/ this new interpretation of the feature flag or if you've merely been stuck with it...) I don't say that as a theoretical argument -- you've just crashed right into this, because nobody knew that the in-kernel block validity doesn't know how to deal with this other than to error out. > > Could /somebody/ please document the ondisk format changes that are > > associated with this feature? > > I pretty much had to sort it out by looking at a combination of > e2fsprogs and the kernel, and a lot of experimentation, until I ended up > with something that the kernel was completely happy with without a > single complaint. > > I'd be happy to write up a summary of the format. Seems like a good idea, particularly since you're asking for a format change that requires kernel support and the ondisk format documentation lives under Documentation/. That said... > I'd still really like to see this patch applied for 5.9, to avoid having > filesystems that an old kernel can mount but a new one can't. This still > seems like a regression to me. > > > > of all 1s, > > > because a read-only filesystem will never allocate or free any blocks or > > > inodes. > > > > All 1s? So the inode bitmap says that every inode table slot is in use, > > even if the inode record itself says it isn't? > > Yes. > > > What does e2fsck -n > > think about that kind of metadata inconsistency? > > If you set up the rest of the metadata consistently with it (for > instance, 0 free blocks and 0 free inodes), you'll only get a single > complaint, from the e2fsck equivalent of block_validity. See > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on > that; ...Ted shot down this whole thing six months ago. The Debian bug database is /not/ the designated forum to discuss changes to the ondisk format; linux-ext4 is. --D > with that fixed, e2fsck wouldn't complain at all. The kernel, > prior to 5.9-rc2, doesn't have a single complaint, whether at mount, > unmount, or read of every file and directory on the filesystem. > > The errors you got in your e2fsck run came because you just overrode the > bitmaps, but didn't make the rest of the metadata consistent with that. > I can provide a sample filesystem if that would help. > > - Josh ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-06 2:51 ` Darrick J. Wong @ 2020-10-06 3:18 ` Theodore Y. Ts'o 2020-10-06 5:03 ` Josh Triplett 0 siblings, 1 reply; 34+ messages in thread From: Theodore Y. Ts'o @ 2020-10-06 3:18 UTC (permalink / raw) To: Darrick J. Wong Cc: Josh Triplett, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 07:51:10PM -0700, Darrick J. Wong wrote: > > > Could /somebody/ please document the ondisk format changes that are > > > associated with this feature? > > > > I pretty much had to sort it out by looking at a combination of > > e2fsprogs and the kernel, and a lot of experimentation, until I ended up > > with something that the kernel was completely happy with without a > > single complaint. > > > > I'd be happy to write up a summary of the format. > > Seems like a good idea, particularly since you're asking for a format > change that requires kernel support and the ondisk format documentation > lives under Documentation/. That said... > > If you set up the rest of the metadata consistently with it (for > > instance, 0 free blocks and 0 free inodes), you'll only get a single > > complaint, from the e2fsck equivalent of block_validity. See > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on > > that; > > ...Ted shot down this whole thing six months ago. > > The Debian bug database is /not/ the designated forum to discuss changes > to the ondisk format; linux-ext4 is. What Josh is proposing I'm pretty sure would also break "e2fsck -E unshare_blocks", so that's another reason not to accept this as a valid format change. As far as I'm concerned, contrib/e2fsdroid is the canonical definition of how to create valid file systems with shared_blocks. - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-06 3:18 ` Theodore Y. Ts'o @ 2020-10-06 5:03 ` Josh Triplett 2020-10-06 6:03 ` Josh Triplett 2020-10-06 13:35 ` Theodore Y. Ts'o 0 siblings, 2 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-06 5:03 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 11:18:34PM -0400, Theodore Y. Ts'o wrote: > What Josh is proposing I'm pretty sure would also break "e2fsck -E > unshare_blocks", so that's another reason not to accept this as a > valid format change. The kernel already accepted this as a valid mountable filesystem format, without a single error or warning of any kind, and has done so stably for years. > As far as I'm concerned, contrib/e2fsdroid is the canonical definition > of how to create valid file systems with shared_blocks. I'm not trying to create a problem here; I'm trying to address a whole family of problems. I was generally under the impression that mounting existing root filesystems fell under the scope of the kernel<->userspace or kernel<->existing-system boundary, as defined by what the kernel accepts and existing userspace has used successfully, and that upgrading the kernel should work with existing userspace and systems. If there's some other rule that applies for filesystems, I'm not aware of that. (I'm also not trying to suggest that every random corner case of what the kernel *could* accept needs to be the format definition, but rather, cases that correspond to existing userspace.) It wouldn't be *impossible* to work around this, this time; it may be possible to adapt the existing userspace to work on the new and old kernels. My concern is, if a filesystem format accepted by previous kernels can be rejected by future kernels, what stops a future kernel from further changing the format definition or its strictness (co-evolving with one specific userspace) and causing further regressions? I don't *want* to rely on what apparently turned out to be an undocumented bug in the kernel's validator. That's why I was trying to fix the issue in what seemed like the right way, by detecting the situation and turning off the validator. That seemed like it would fully address the issue. If it would help, I could also supply a tiny filesystem image for regression testing. I'm trying to figure out what solution you'd like to see here, as long as it isn't "any userspace that isn't e2fsdroid can be broken at will". I'd be willing to work to adapt the userspace bits I have to work around the regression, but I'd like to get this on the radar so this doesn't happen again. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-06 5:03 ` Josh Triplett @ 2020-10-06 6:03 ` Josh Triplett 2020-10-06 13:35 ` Theodore Y. Ts'o 1 sibling, 0 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-06 6:03 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 10:03:13PM -0700, Josh Triplett wrote: > On Mon, Oct 05, 2020 at 11:18:34PM -0400, Theodore Y. Ts'o wrote: > > What Josh is proposing I'm pretty sure would also break "e2fsck -E > > unshare_blocks", so that's another reason not to accept this as a > > valid format change. > > The kernel already accepted this as a valid mountable filesystem format, > without a single error or warning of any kind, and has done so stably > for years. > > > As far as I'm concerned, contrib/e2fsdroid is the canonical definition > > of how to create valid file systems with shared_blocks. > > I'm not trying to create a problem here; I'm trying to address a whole > family of problems. I was generally under the impression that mounting > existing root filesystems fell under the scope of the kernel<->userspace > or kernel<->existing-system boundary, as defined by what the kernel > accepts and existing userspace has used successfully, and that upgrading > the kernel should work with existing userspace and systems. If there's > some other rule that applies for filesystems, I'm not aware of that. > (I'm also not trying to suggest that every random corner case of what > the kernel *could* accept needs to be the format definition, but rather, > cases that correspond to existing userspace.) > > It wouldn't be *impossible* to work around this, this time; it may be > possible to adapt the existing userspace to work on the new and old > kernels. My concern is, if a filesystem format accepted by previous > kernels can be rejected by future kernels, what stops a future kernel > from further changing the format definition or its strictness > (co-evolving with one specific userspace) and causing further > regressions? > > I don't *want* to rely on what apparently turned out to be an > undocumented bug in the kernel's validator. That's why I was trying to > fix the issue in what seemed like the right way, by detecting the > situation and turning off the validator. That seemed like it would fully > address the issue. If it would help, I could also supply a tiny filesystem > image for regression testing. > > I'm trying to figure out what solution you'd like to see here, as long > as it isn't "any userspace that isn't e2fsdroid can be broken at will". > I'd be willing to work to adapt the userspace bits I have to work around > the regression, but I'd like to get this on the radar so this doesn't > happen again. To clarify something further: I'm genuinely not looking to push hard on the limits or corners of the kernel/userspace boundary here, nor do I want to create an imposition on development. I'm happy to attempt to be a little more flexible than most userspace. I'm trying to make substantial, non-trivial use of the userspace side of a kernel/userspace boundary, and within reason, I need to rely on the kernel's stability guarantees. I'm relying on the combination of Documentation/filesystems/ext4 and fs/ext4 as the format documentation. The first time I discovered this issue was in doing some "there's about to be a new kernel release" regression testing for 5.9, in which it created a debugging adventure to track down what the problem was. I'd like to find a good way to report and handle this kind of thing going forward, if another issue like this arises. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-06 5:03 ` Josh Triplett 2020-10-06 6:03 ` Josh Triplett @ 2020-10-06 13:35 ` Theodore Y. Ts'o 2020-10-07 8:03 ` Josh Triplett 1 sibling, 1 reply; 34+ messages in thread From: Theodore Y. Ts'o @ 2020-10-06 13:35 UTC (permalink / raw) To: Josh Triplett Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Oct 05, 2020 at 10:03:06PM -0700, Josh Triplett wrote: > > I'm not trying to create a problem here; I'm trying to address a whole > family of problems. I was generally under the impression that mounting > existing root filesystems fell under the scope of the kernel<->userspace > or kernel<->existing-system boundary, as defined by what the kernel > accepts and existing userspace has used successfully, and that upgrading > the kernel should work with existing userspace and systems. If there's > some other rule that applies for filesystems, I'm not aware of that. > (I'm also not trying to suggest that every random corner case of what > the kernel *could* accept needs to be the format definition, but rather, > cases that correspond to existing userspace.) I'm not opposed to the kernel side change; it's *this time*. I'm more interested in killing off the tool that generated the malformed file system in the first place. As I keep pointing out, things aren't going to go well if "e2fsck -E unshare_blocks" is applied to it. So users who use this unofficial tool to create this file system is can run into at least this corner case, if not others, and that will result in, as the UI designers like to say, "a poor user experience". We had a similar issue with Android. Many years ago, Andy Rubin was originally quite allergic to the GPL, and had tried to promulgate the rule, "no GPL in Android Userspace". This is why bionic is used as libc, and this resulted in Android engineers (I think before the Google acquisition, but I'm not 100% sure), creating an unofficial, "unauthorized" make_ext4fs which was a BSD-licensed version of mke2fs. Unfortuantely, it created file systems which the kernel would never complain about, but which, 50% of the time, would result in a file system which under some circumstances, would get corrupted (even more) when e2fsck attempted to repair the file system. So if a user had a bit flip caused by an eMMC hiccup, e2fsck could end up making things worse. Worse, make_ext4fs had over time, grown extra functionality, such as pre-setting the SELinux xattrs, such that you couldn't just replace it with mke2fs. It took *years* to fix the problem, and that's why contrib/e2fsdroid exists today. We finally, a few years ago, were able to retire make_ext4fs and replace it with the combination of mke2fs and e2fsdroid. So that's why I really don't like it when there are "unauthorized", unofficial tools creating file systems out there which we are now obliged to support. Even if it's OK as far as the kernel is concerned, unless you're planning on forking and/or reimplementing all of e2fsprogs, and doing so correctly, that way is going to cause headaches for file system developers. As far as I'm concerned, it's not just about on-disk file system format, it's also about the official user space tools. If you create a file system which the kernel is happy with, but which wasn't created using the official user space tools, file systems are so full of state and permutations of how things should be done that the opportunities for mischief are huge. And what's especially aggravating is when it's done for petty reasons --- whether it's trying to sae an extra 0.0003% of storage, or because some VP was allergic to the GPL, it's stupid stuff. > I don't *want* to rely on what apparently turned out to be an > undocumented bug in the kernel's validator. That's why I was trying to > fix the issue in what seemed like the right way, by detecting the > situation and turning off the validator. That seemed like it would fully > address the issue. If it would help, I could also supply a tiny filesystem > image for regression testing. I'm OK with working around the problem, and we're lucky that it's this simple.... this time around. But can we *please* take your custom tool out back and shoot it in the head? Like make_ext4fs, it's just going to cause more headaches down the line. And perhaps we need to make a policy that makes it clear that for file systems, it's not just about "whatever the kernel happens to accept". It should also be, "was it generated using an official version of the userspace tools", at least as a consideration. Yes, we can try to make the kernel more strict, and that's a good thing to do, but inevitably, as we make the kernel more strict, we can potentially break other unffocial tools out there, and it's going to make it a lot harder to be able to do backwards compatible format enhancements to the file system. - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-06 13:35 ` Theodore Y. Ts'o @ 2020-10-07 8:03 ` Josh Triplett 2020-10-07 14:32 ` Theodore Y. Ts'o 0 siblings, 1 reply; 34+ messages in thread From: Josh Triplett @ 2020-10-07 8:03 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Tue, Oct 06, 2020 at 09:35:33AM -0400, Theodore Y. Ts'o wrote: > On Mon, Oct 05, 2020 at 10:03:06PM -0700, Josh Triplett wrote: > > I'm not trying to create a problem here; I'm trying to address a whole > > family of problems. I was generally under the impression that mounting > > existing root filesystems fell under the scope of the kernel<->userspace > > or kernel<->existing-system boundary, as defined by what the kernel > > accepts and existing userspace has used successfully, and that upgrading > > the kernel should work with existing userspace and systems. If there's > > some other rule that applies for filesystems, I'm not aware of that. > > (I'm also not trying to suggest that every random corner case of what > > the kernel *could* accept needs to be the format definition, but rather, > > cases that correspond to existing userspace.) > > I'm not opposed to the kernel side change; it's *this time*. I appreciate that. (Does that mean you wouldn't object to this patch going into 5.9, with Jens' Reviewed-by and your ack?) > I'm more interested in killing off the tool that generated the > malformed file system in the first place. It was developed for a reason, and it's not intended for general-purpose use. It serves its purpose, which has nothing to do with e2fsprogs or any component of e2fsprogs. It does not simply create filesystem images, in the way that mke2fs or e2fsdroid does. If I'd found any existing code that would have worked, or could have been adapted to work, I would have happily reused it; I don't like reinventing the wheel. Also, see below regarding e2fsck. > As I keep pointing out, things aren't going to go well if "e2fsck -E > unshare_blocks" is applied to it. These aren't intended to *ever* become writable. I've ensured that they're marked with EXT4_FEATURE_RO_COMPAT_READONLY as well, which I would hope implies that. Would that be sufficient to convince e2fsck that it should never attempt to apply unshare_blocks to it? Also, it seems like most of block_validity is trying to carefully avoid metadata blocks intersecting with the journal, which could cause all manner of issues; the filesystems I'm working with have no journal (because they're permanently read-only). With the sole exception of the shared bitmap block, I have it as a requirement to pass e2fsck with zero complaints. If you'd be willing to take a patch to e2fsck along the same lines as this kernel patch, then I'd be happy to add it to the regression test suite: no filesystem images that e2fsck is unhappy with. Would that help? As mentioned before, I'd also be happy to supply some representative filesystem images. > We had a similar issue with Android. Many years ago, Andy Rubin was > originally quite allergic to the GPL, and had tried to promulgate the > rule, "no GPL in Android Userspace". This is why bionic is used as > libc, and this resulted in Android engineers (I think before the > Google acquisition, but I'm not 100% sure), creating an unofficial, > "unauthorized" make_ext4fs which was a BSD-licensed version of mke2fs. That is indeed a sad reason to develop anything. It's also particularly sad to develop a different tool to serve exactly the same purpose as an existing tool. As opposed to, in this case, developing a different piece of software to serve different purposes, that happens to make use of ext4 as one component. > Unfortuantely, it created file systems which the kernel would never > complain about, but which, 50% of the time, would result in a file > system which under some circumstances, would get corrupted (even more) > when e2fsck attempted to repair the file system. So if a user had a > bit flip caused by an eMMC hiccup, e2fsck could end up making things > worse. I'd be interested in reading more about that, if you could suggest a link or the right search terms. I did find https://bugzilla.kernel.org/show_bug.cgi?id=195561 , which references the issues in the same way you've done here, but I didn't find any of the specific details you're mentioning here about what made the images it created fragile. I did see you mention, there, the advice of making sure to check filesystems with e2fsck. That's something I've done from day one: I didn't stop until e2fsck was happy, not just the kernel. > So that's why I really don't like it when there are "unauthorized", > unofficial tools creating file systems out there which we are now > obliged to support. ext4 is an incredibly powerful and performant filesystem, with a reasonably well-documented format and many useful properties. Not all software that works with ext4 is going to live in e2fsprogs, nor should it need to. This would be analogous to the notion that (say) the FreeBSD kernel belongs in the e2fsprogs repository because it has an ext4 driver. The tail would be wagging the dog. > Even if it's OK as far as the kernel is > concerned, unless you're planning on forking and/or reimplementing all > of e2fsprogs, and doing so correctly, that way is going to cause > headaches for file system developers. I haven't recreated or reimplemented e2fsprogs, and I have no desire to do so. It seems like, as a result of the make_ext4fs debacle, you're seeing any other software working with ext4 as an instance of reinventing the wheel (and failing to make that wheel round, because hey, it still rolls). That's not the case here. > > I don't *want* to rely on what apparently turned out to be an > > undocumented bug in the kernel's validator. That's why I was trying to > > fix the issue in what seemed like the right way, by detecting the > > situation and turning off the validator. That seemed like it would fully > > address the issue. If it would help, I could also supply a tiny filesystem > > image for regression testing. > > I'm OK with working around the problem, and we're lucky that it's this > simple.... this time around. > > But can we *please* take your custom tool out back and shoot it in the > head? Nope. As mentioned, this isn't about creating ext4 filesystem images, and it isn't even remotely similar to mke2fs. > And perhaps we need to make a policy that makes it clear that for file > systems, it's not just about "whatever the kernel happens to accept". > It should also be, "was it generated using an official version of the > userspace tools", at least as a consideration. I don't think that's a reasonable policy at all, no. "Should pass e2fsck" would be a relatively reasonable policy, assuming that e2fsck doesn't ratchet up strictness in a way that breaks existing filesystems. I think it'd be reasonable to recommend against trying to push the boundaries of what the kernel accepts but e2fsck doesn't, without a concrete plan to improve e2fsck and the filesystem documentation. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-07 8:03 ` Josh Triplett @ 2020-10-07 14:32 ` Theodore Y. Ts'o 2020-10-07 20:14 ` Josh Triplett 0 siblings, 1 reply; 34+ messages in thread From: Theodore Y. Ts'o @ 2020-10-07 14:32 UTC (permalink / raw) To: Josh Triplett Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Wed, Oct 07, 2020 at 01:03:04AM -0700, Josh Triplett wrote: > > But can we *please* take your custom tool out back and shoot it in the > > head? > > Nope. As mentioned, this isn't about creating ext4 filesystem images, > and it isn't even remotely similar to mke2fs. Can you please tell us what your tool is for, then? Why are you doing this? Why are you inflicting this on us? And if the answer is nope, I can't, it's propietary, sorry.... then I think we should treat this the same way we treat proprietary kernel modules. - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-07 14:32 ` Theodore Y. Ts'o @ 2020-10-07 20:14 ` Josh Triplett 2020-10-08 2:10 ` Theodore Y. Ts'o 2020-10-08 2:57 ` Andreas Dilger 0 siblings, 2 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-07 20:14 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Wed, Oct 07, 2020 at 10:32:11AM -0400, Theodore Y. Ts'o wrote: > On Wed, Oct 07, 2020 at 01:03:04AM -0700, Josh Triplett wrote: > > > But can we *please* take your custom tool out back and shoot it in the > > > head? > > > > Nope. As mentioned, this isn't about creating ext4 filesystem images, > > and it isn't even remotely similar to mke2fs. > > Can you please tell us what your tool is for, then? Why are you doing > this? Why are you inflicting this on us? That sounds like a conversation that would have been a lot more interesting and enjoyable if it hadn't started with "can we shoot it in the head", and continued with the notion that anything other than e2fsprogs making something to be mounted by mount(2) and handled by fs/ext4 is being "inflicted", and if the goal didn't still seem to be "how do we make it go away so that only e2fsprogs and the kernel ever touch ext4". I started this thread because I'd written some userspace code, a new version of the kernel made that userspace code stop working, so I wanted to report that the moment I'd discovered that, along with a potential way to address it with as little disruption to ext4 as possible. I'm not looking to be an alternative userspace, or an alternative anything; that implies serving more-or-less the same function differently. I have no interest in supplanting mke2fs or any other part of e2fsprogs; I'm using those for many of the purposes they already serve. The short version is that I needed a library to rapidly turn dynamically-obtained data into a set of disk blocks to be served on-the-fly as a software-defined disk, and then mounted on the other side of that interface by the Linux kernel. Turns out that's *many orders of magnitude* faster than any kind of network filesystem like NFS. It's slightly similar to a vvfat for ext4. The less blocks it can generate and account for and cache, the faster it can run, and microseconds matter. ext4 has *incredible* compatibility guarantees. I'd already understood the whole compat/ro_compat mechanism when I read through the on-disk format documentation and the code. RO_COMPAT_SHARED_BLOCKS *seemed* like the right semantic description of "don't ever try to write to this filesystem because there are deduplicated blocks", and RO_COMPAT_READONLY seemed like the right semantic description for "don't ever try to write this, it's permanently read-only for unspecified reasons". If those aren't the right way to express that, I could potentially adapt. I had a similar such conversation on linux-ext4 already (about inline data with 128-bit inodes), which led to me choosing to abandon 128-byte inodes rather than try to get ext4 to support what I wanted with them, because I didn't want to be disruptive to ext4 for a niche use case. In the particular case that motivated this thread, what I was doing already worked in previous kernels, and it seemed reasonable to ask for it to continue to work in new kernels, while preserving the newly added checks in the new kernels. If the response here had been more along the lines of "could we create and use a *different* compat flag for shared metadata and have RO_COMPAT_SHARED_BLOCKS only mean shared data blocks", I'd be fine with that. If at some point I'm looking to make ext4 support more than it already does (e.g. a way to omit bitmaps entirely, or a way to express contiguous files with smaller extent maps, or other enhancements for read-only filesystems), I'd be coming with architectural discussions first, patches second, and at no point would I have the expectation that ext4 folks need to do extra work on my behalf. I'm happy to do the work. The *only* thing I'm asking, here, is "don't break things that worked". And after this particular item, I'd be happy to narrow that to "don't break things that e2fsck was previously happy with". - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-07 20:14 ` Josh Triplett @ 2020-10-08 2:10 ` Theodore Y. Ts'o 2020-10-08 17:54 ` Darrick J. Wong 2020-10-08 22:22 ` Josh Triplett 2020-10-08 2:57 ` Andreas Dilger 1 sibling, 2 replies; 34+ messages in thread From: Theodore Y. Ts'o @ 2020-10-08 2:10 UTC (permalink / raw) To: Josh Triplett Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote: > > That sounds like a conversation that would have been a lot more > interesting and enjoyable if it hadn't started with "can we shoot it in > the head", and continued with the notion that anything other than > e2fsprogs making something to be mounted by mount(2) and handled by > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > "how do we make it go away so that only e2fsprogs and the kernel ever > touch ext4". I started this thread because I'd written some userspace > code, a new version of the kernel made that userspace code stop working, > so I wanted to report that the moment I'd discovered that, along with a > potential way to address it with as little disrupton to ext4 as > possible. What is really getting my dander up is your attempt to claim that the on-disk file system format is like the userspace/kernel interface, where if we break any file system that file system that was "previously accepted by an older kernel", this is a bug that must be reverted or otherwise fixed to allow file systems that had previously worked, to continue to work. And this is true even if the file system is ***invalid***. And the problem with this is that there have been any number of commits where file systems which were previously invalid, but which could be caused to trigger a syzbot whine, which was fixed by tightening up the validity tests in the kernel. In some cases, I had to also had to fix up e2fsck to detect the invalid file system which was generated by the file system fuzzer. Yes, it's unfortunate that we didn't have these checks earlier, but a file system has a huge amount of state. The principle you've articulated would make it impossible for me to fix these bugs, unless I can prove that the failure to check a particular invalid file system corruption could lead to a security vulnerability. (Would it be OK for me to make the kernel more strict and reject an invalid file system if it triggers a WARN_ON, so I get the syzbot complaint, but it doesn't actually cause a security issue?) So this conversation would have been a lot more pleasant for *me* if you hadn't tried to elevate your request to a general principle, where if someone is deliberately generating an invalid file system, I'm not allowed to make the kernel more strict to detect said invalidity and to reject the invalid / corrupted / fuzzed file system. And note that sometimes the security problem happens when there are multiple file system corruptions that are chained together. So enabling block validity *can* sometimes prevent the fuzzed file system from proceeding further. Granted, this is less likely in the case of a read-only file system, but it really worries me when there are proprietary programs (maybe your library isn't proprietary, but I note you haven't send me a link to your git repo, but instead have offered sending sample file systems) which insist on generating their own file systems, which might or might not be valid, and then expecting them to receive first class support as part of an iron-bound contract where I'm not even allowed to add stronger sanity checks which might reject said invalid file system in the future. > The short version is that I needed a library to rapidly turn > dynamically-obtained data into a set of disk blocks to be served > on-the-fly as a software-defined disk, and then mounted on the other > side of that interface by the Linux kernel. Turns out that's *many > orders of magnitude* faster than any kind of network filesystem like > NFS. It's slightly similar to a vvfat for ext4. The less blocks it can > generate and account for and cache, the faster it can run, and > microseconds matter. So are you actually trying to dedup data blocks, or are you just trying to avoid needing to track the block allocation bitmaps? And are you just writing a single file, or multiple files? Do you know what the maximum size of the file or files will be? Do you need a complex directory structure, or just a single root directory? Can the file system be sparse? So for example, you can do something like this, which puts all of the metadata at beginning of the file system, and then you could write to contiguous data blocks. Add the following in mke2fs.conf: [fs_types] hugefile = { features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2 cluster_size = 32768 hash_alg = half_md4 reserved_ratio = 0.0 num_backup_sb = 0 packed_meta_blocks = 1 make_hugefiles = 1 inode_ratio = 4194304 hugefiles_dir = /storage hugefiles_name = huge-file hugefiles_digits = 0 hugefiles_size = 0 hugefiles_align = 256M hugefiles_align_disk = true num_hugefiles = 1 zero_hugefiles = false inode_size = 128 } hugefiles = { features = extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2 hash_alg = half_md4 reserved_ratio = 0.0 num_backup_sb = 0 packed_meta_blocks = 1 make_hugefiles = 1 inode_ratio = 4194304 hugefiles_dir = /storage hugefiles_name = chunk- hugefiles_digits = 5 hugefiles_size = 4G hugefiles_align = 256M hugefiles_align_disk = true zero_hugefiles = false flex_bg_size = 262144 inode_size = 128 } ... and then run "mke2fs -T hugefile /tmp/image 1T" or "mke2fs -T hugefiles /tmp/image 1T", and see what you get. In the case of hugefile, you'll see a single file which covers the entire storage device. Because we are using bigalloc with a large cluster size, this minimizes the number of bitmap blocks. With hugefiles, it will create a set of 4G files to fill the size of the disk, again, aligned to 256 MiB zones at the beginning of the disk. In both cases, the file or files are aligned to 256 MiB relative to beginning of the disk, which can be handy if you are creating the file system, on, say, a 14T SMR disk. And this is a niche use case if there ever was one! :-) So if you had come to the ext4 list with a set of requirements, it could have been that we could have come up with something which uses the existing file system features, or come up with something which would have been more specific --- and more importantly, we'd know what the semantics were of various on-disk file system formats that people are depending upon. > If at some point I'm looking to make ext4 support more than it already > does (e.g. a way to omit bitmaps entirely, or a way to express > contiguous files with smaller extent maps, or other enhancements for > read-only filesystems), See above for a way to significantly reduce the number of bitmaps. Adding a way to omit bitmaps entirely would require an INCOMPAT flag, so it might not be worth it. The way to express contiguous files with smaller extent files would be to extend the kernel to allow file systems with block_size > page_size read-only. This would allow you to create a file system with a block size of 64k, which will reduce the size of the extent maps by a factor of 16, and it wouldn't be all that hard to teach ext4 to support these file systems. (The reason why it would be hard for us to support file systems with block sizes > page size is dealing with page cache when writing files while allocating blocks, especially when doing random writes into a sparse file. Read-only would be much easier to support.) So please, talk to us, and *tell* us what it is you're trying to do before you try to do it. Don't rely on some implementation detail where we're not being sufficiently strict in checking for an invalid file system, especially without telling us in advance and then trying to hold us to the lack of checking forever because it's "breaking things that used to work". Cheers, - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 2:10 ` Theodore Y. Ts'o @ 2020-10-08 17:54 ` Darrick J. Wong 2020-10-08 22:38 ` Josh Triplett 2020-10-08 22:22 ` Josh Triplett 1 sibling, 1 reply; 34+ messages in thread From: Darrick J. Wong @ 2020-10-08 17:54 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Josh Triplett, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Wed, Oct 07, 2020 at 10:10:17PM -0400, Theodore Y. Ts'o wrote: > On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote: > > > > That sounds like a conversation that would have been a lot more > > interesting and enjoyable if it hadn't started with "can we shoot it in This whole conversation would have been a lot less tense and defensive had I not started by thinking "Did Ted quietly slip more meaning into SHARED_BLOCKS (i.e. shared metadata blocks) than I had previously known about?" and then "No, e2fsprogs is still the same as it always was" and finally "OH, Josh wrote some weird userspace tool that we've never heard of which makes design assumptions that he's being cagey about". To reiterate what Ted said: General purpose filesystems are massively complex beasts. The kernel can do simple spot checks (checksums, record validation) but at least with XFS and ext4, we defer the larger relational checks between data structures (if this block is marked free, is it unclaimed by the inode table and all file block maps?) to userspace fsck. IMO, the prominent free software filesystem projects offer (at least) four layers of social structures to keep everyone on the same page, including people writing competing implementations. The first is "Does everything we write still work with the kernel", which I guess you clearly did since you're complaining about this change in 5.9-rc2. The second layer is "Does it pass fsck?" which, given that you're asking for changes to e2fsck elsewhere in this thread and I couldn't figure out how to generate a shared-bitmaps ext4 fs that didn't also cause e2fsck to complain about the overlap doesn't make me confident that you went beyond the first step before shipping something. The third layer is consulting the ondisk format documentation to see if it points out anything that the kernel or fsck didn't notice. That one's kind of on Ted for not updating Documentation/ to spell out what SHARED_BLOCKS actually means, but that just leads me to the fourth thing. The fourth layer is emailing the list to ask "Hey, I was thinking of ___, does anyone see a problem with my interpretation?". That clearly hasn't been done for shared bitmaps until now, which is all the more strange since you /did/ ask linux-ext4 about the inline data topic months ago. ext* and XFS have been around for 25 years. The software validation of both is imperfect and incomplete because the complexity of the ondisk formats is vast and we haven't caught up with the technical debt that resulted from not writing rigorous checks that would have been very difficult with mid-90s hardware. I know, because I've been writing online checking for XFS and have seen the wide the gap between what the existing software looks for and what the ondisk format documentation allows. The only chance that we as a community have at managing the complexity of a filesystem is to use those four tools I outlined above to try to keep the mental models of everyone who participates in close enough alignment. That's how we usually avoid discussions that end in rancor and confusion. That was a very long way of reiterating "If you're going to interpret aspects of the software, please come talk to us before you start writing code". That is key to ext4's long history of compatibility, because a project cannot maintain continuity when actors develop conflicting code in the shadows. Look at all the mutations FAT and UFS that have appeared over the years-- the codebases became a mess as a result. > > the head", and continued with the notion that anything other than > > e2fsprogs making something to be mounted by mount(2) and handled by > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > > "how do we make it go away so that only e2fsprogs and the kernel ever > > touch ext4". I started this thread because I'd written some userspace > > code, a new version of the kernel made that userspace code stop working, > > so I wanted to report that the moment I'd discovered that, along with a > > potential way to address it with as little disrupton to ext4 as > > possible. Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block validity checking by default. You could someday enable users to write to block-sharing files by teaching ext4 how to COW, at which point you'd need correct bitmaps and block validity to prevent accidental overwrite of critical metadata. At that point you'd either be stuck with the precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up breaking Josh's images again. Frankly, Josh, if you're not going to show us /your/ code and/or creating an incompat flag for shared-bitmaps, then just set "noblock_validity" in the superblock mount options field of the images you create. --D > What is really getting my dander up is your attempt to claim that the > on-disk file system format is like the userspace/kernel interface, > where if we break any file system that file system that was > "previously accepted by an older kernel", this is a bug that must be > reverted or otherwise fixed to allow file systems that had previously > worked, to continue to work. And this is true even if the file system > is ***invalid***. > > And the problem with this is that there have been any number of > commits where file systems which were previously invalid, but which > could be caused to trigger a syzbot whine, which was fixed by > tightening up the validity tests in the kernel. In some cases, I had > to also had to fix up e2fsck to detect the invalid file system which > was generated by the file system fuzzer. Yes, it's unfortunate that > we didn't have these checks earlier, but a file system has a huge > amount of state. > > The principle you've articulated would make it impossible for me to > fix these bugs, unless I can prove that the failure to check a > particular invalid file system corruption could lead to a security > vulnerability. (Would it be OK for me to make the kernel more strict > and reject an invalid file system if it triggers a WARN_ON, so I get > the syzbot complaint, but it doesn't actually cause a security issue?) > > So this conversation would have been a lot more pleasant for *me* if > you hadn't tried to elevate your request to a general principle, where > if someone is deliberately generating an invalid file system, I'm not > allowed to make the kernel more strict to detect said invalidity and > to reject the invalid / corrupted / fuzzed file system. > > And note that sometimes the security problem happens when there are > multiple file system corruptions that are chained together. So > enabling block validity *can* sometimes prevent the fuzzed file system > from proceeding further. Granted, this is less likely in the case of > a read-only file system, but it really worries me when there are > proprietary programs (maybe your library isn't proprietary, but I note > you haven't send me a link to your git repo, but instead have offered > sending sample file systems) which insist on generating their own file > systems, which might or might not be valid, and then expecting them to > receive first class support as part of an iron-bound contract where > I'm not even allowed to add stronger sanity checks which might reject > said invalid file system in the future. > > > The short version is that I needed a library to rapidly turn > > dynamically-obtained data into a set of disk blocks to be served > > on-the-fly as a software-defined disk, and then mounted on the other > > side of that interface by the Linux kernel. Turns out that's *many > > orders of magnitude* faster than any kind of network filesystem like > > NFS. It's slightly similar to a vvfat for ext4. The less blocks it can > > generate and account for and cache, the faster it can run, and > > microseconds matter. > > So are you actually trying to dedup data blocks, or are you just > trying to avoid needing to track the block allocation bitmaps? And > are you just writing a single file, or multiple files? Do you know > what the maximum size of the file or files will be? Do you need a > complex directory structure, or just a single root directory? Can the > file system be sparse? > > So for example, you can do something like this, which puts all of the > metadata at beginning of the file system, and then you could write to > contiguous data blocks. Add the following in mke2fs.conf: > > [fs_types] > hugefile = { > features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2 > cluster_size = 32768 > hash_alg = half_md4 > reserved_ratio = 0.0 > num_backup_sb = 0 > packed_meta_blocks = 1 > make_hugefiles = 1 > inode_ratio = 4194304 > hugefiles_dir = /storage > hugefiles_name = huge-file > hugefiles_digits = 0 > hugefiles_size = 0 > hugefiles_align = 256M > hugefiles_align_disk = true > num_hugefiles = 1 > zero_hugefiles = false > inode_size = 128 > } > > hugefiles = { > features = extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2 > hash_alg = half_md4 > reserved_ratio = 0.0 > num_backup_sb = 0 > packed_meta_blocks = 1 > make_hugefiles = 1 > inode_ratio = 4194304 > hugefiles_dir = /storage > hugefiles_name = chunk- > hugefiles_digits = 5 > hugefiles_size = 4G > hugefiles_align = 256M > hugefiles_align_disk = true > zero_hugefiles = false > flex_bg_size = 262144 > inode_size = 128 > } > > ... and then run "mke2fs -T hugefile /tmp/image 1T" or "mke2fs -T > hugefiles /tmp/image 1T", and see what you get. In the case of > hugefile, you'll see a single file which covers the entire storage > device. Because we are using bigalloc with a large cluster size, this > minimizes the number of bitmap blocks. > > With hugefiles, it will create a set of 4G files to fill the size of > the disk, again, aligned to 256 MiB zones at the beginning of the > disk. In both cases, the file or files are aligned to 256 MiB > relative to beginning of the disk, which can be handy if you are > creating the file system, on, say, a 14T SMR disk. And this is a > niche use case if there ever was one! :-) > > So if you had come to the ext4 list with a set of requirements, it > could have been that we could have come up with something which uses > the existing file system features, or come up with something which > would have been more specific --- and more importantly, we'd know what > the semantics were of various on-disk file system formats that people > are depending upon. > > > If at some point I'm looking to make ext4 support more than it already > > does (e.g. a way to omit bitmaps entirely, or a way to express > > contiguous files with smaller extent maps, or other enhancements for > > read-only filesystems), > > See above for a way to significantly reduce the number of bitmaps. > Adding a way to omit bitmaps entirely would require an INCOMPAT flag, > so it might not be worth it. > > The way to express contiguous files with smaller extent files would be > to extend the kernel to allow file systems with block_size > page_size > read-only. This would allow you to create a file system with a block > size of 64k, which will reduce the size of the extent maps by a factor > of 16, and it wouldn't be all that hard to teach ext4 to support these > file systems. (The reason why it would be hard for us to support file > systems with block sizes > page size is dealing with page cache when > writing files while allocating blocks, especially when doing random > writes into a sparse file. Read-only would be much easier to > support.) > > So please, talk to us, and *tell* us what it is you're trying to do > before you try to do it. Don't rely on some implementation detail > where we're not being sufficiently strict in checking for an invalid > file system, especially without telling us in advance and then trying > to hold us to the lack of checking forever because it's "breaking > things that used to work". > > Cheers, > > - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 17:54 ` Darrick J. Wong @ 2020-10-08 22:38 ` Josh Triplett 2020-10-09 2:54 ` Darrick J. Wong 0 siblings, 1 reply; 34+ messages in thread From: Josh Triplett @ 2020-10-08 22:38 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Y. Ts'o, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Thu, Oct 08, 2020 at 10:54:48AM -0700, Darrick J. Wong wrote: > IMO, the prominent free software filesystem projects offer (at least) > four layers of social structures to keep everyone on the same page, > including people writing competing implementations. (I certainly hope that this isn't a *competing* implementation. It's more that it serves a different purpose than the existing tools.) > The first is "Does > everything we write still work with the kernel", which I guess you > clearly did since you're complaining about this change in 5.9-rc2. Right. > The second layer is "Does it pass fsck?" which, given that you're asking > for changes to e2fsck elsewhere in this thread and I couldn't figure out > how to generate a shared-bitmaps ext4 fs that didn't also cause e2fsck > to complain about the overlap doesn't make me confident that you went > beyond the first step before shipping something. I did ensure that, other than the one top-level complaint that the bitmaps overlapped, I got no complaints from e2fsck. I also confirmed that with a patch to that one item, e2fsck passed with no issues. > The third layer is consulting the ondisk format documentation to see if > it points out anything that the kernel or fsck didn't notice. That > one's kind of on Ted for not updating Documentation/ to spell out what > SHARED_BLOCKS actually means, but that just leads me to the fourth thing. I've been making *extensive* use of Documentation/filesystems/ext4 by the way, and I greatly appreciate it. I know you've done a ton of work in that area. > The fourth layer is emailing the list to ask "Hey, I was thinking of > ___, does anyone see a problem with my interpretation?". That clearly > hasn't been done for shared bitmaps until now, which is all the more > strange since you /did/ ask linux-ext4 about the inline data topic > months ago. That one was on me, you're right. Because Ted is the maintainer of e2fsprogs in Debian, and conversations about ext4 often happen in the Debian BTS, reporting a bug on e2fsprogs there felt like starting an upstream conversation. That was my mistake; in the future, I'll make sure that things make it to linux-ext4. I already did so for https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971014 , and I *should* have gone back and done so for the shared bitmap blocks issue. > ext* and XFS have been around for 25 years. The software validation of > both is imperfect and incomplete because the complexity of the ondisk > formats is vast and we haven't caught up with the technical debt that > resulted from not writing rigorous checks that would have been very > difficult with mid-90s hardware. I know, because I've been writing > online checking for XFS and have seen the wide the gap between what the > existing software looks for and what the ondisk format documentation > allows. > > The only chance that we as a community have at managing the complexity > of a filesystem is to use those four tools I outlined above to try to > keep the mental models of everyone who participates in close enough > alignment. That's how we usually avoid discussions that end in rancor > and confusion. > > That was a very long way of reiterating "If you're going to interpret > aspects of the software, please come talk to us before you start writing > code". That is key to ext4's long history of compatibility, because a > project cannot maintain continuity when actors develop conflicting code > in the shadows. Look at all the mutations FAT and UFS that have > appeared over the years-- the codebases became a mess as a result. Understood, agreed, and acknowledged. I'll make sure that any future potentially "adventurous" filesystem experiments get discussed on linux-ext4 first, not just in a distro bugtracker with a limited audience. > > > the head", and continued with the notion that anything other than > > > e2fsprogs making something to be mounted by mount(2) and handled by > > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > > > "how do we make it go away so that only e2fsprogs and the kernel ever > > > touch ext4". I started this thread because I'd written some userspace > > > code, a new version of the kernel made that userspace code stop working, > > > so I wanted to report that the moment I'd discovered that, along with a > > > potential way to address it with as little disrupton to ext4 as > > > possible. > > Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block > validity checking by default. You could someday enable users to write > to block-sharing files by teaching ext4 how to COW, at which point you'd > need correct bitmaps and block validity to prevent accidental overwrite > of critical metadata. At that point you'd either be stuck with the > precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up > breaking Josh's images again. That's a fair point (though I think a writable COW version of SHARED_BLOCKS would need a different flag). It'd make more sense to key this logic off of RO_COMPAT_READONLY or similar. But even better: > "noblock_validity" in the superblock mount options field of the images > you create. Yeah, I can do that. That's a much better solution, thank you. It would have been problematic to have to change the userspace that mounts the filesystem to pass new mount options ("noblock_validity") for the new kernel. But if I can embed it in the filesystem, that'll work. I'll do that, and please feel free to drop the original proposed patch as it's no longer needed. Thanks, Darrick. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 22:38 ` Josh Triplett @ 2020-10-09 2:54 ` Darrick J. Wong 2020-10-09 19:08 ` Josh Triplett 0 siblings, 1 reply; 34+ messages in thread From: Darrick J. Wong @ 2020-10-09 2:54 UTC (permalink / raw) To: Josh Triplett Cc: Theodore Y. Ts'o, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Thu, Oct 08, 2020 at 03:38:58PM -0700, Josh Triplett wrote: > On Thu, Oct 08, 2020 at 10:54:48AM -0700, Darrick J. Wong wrote: > > IMO, the prominent free software filesystem projects offer (at least) > > four layers of social structures to keep everyone on the same page, > > including people writing competing implementations. > > (I certainly hope that this isn't a *competing* implementation. It's > more that it serves a different purpose than the existing tools.) I hope so too, but external implementations tend to have staying power once people starting using them. You'd think e2fsdroid would have been merged into mke2fs by now, but no... > > The first is "Does > > everything we write still work with the kernel", which I guess you > > clearly did since you're complaining about this change in 5.9-rc2. > > Right. > > > The second layer is "Does it pass fsck?" which, given that you're asking > > for changes to e2fsck elsewhere in this thread and I couldn't figure out > > how to generate a shared-bitmaps ext4 fs that didn't also cause e2fsck > > to complain about the overlap doesn't make me confident that you went > > beyond the first step before shipping something. > > I did ensure that, other than the one top-level complaint that the > bitmaps overlapped, I got no complaints from e2fsck. I also confirmed > that with a patch to that one item, e2fsck passed with no issues. <cough> even a single top level complaint still means it doesn't pass. > > The third layer is consulting the ondisk format documentation to see if > > it points out anything that the kernel or fsck didn't notice. That > > one's kind of on Ted for not updating Documentation/ to spell out what > > SHARED_BLOCKS actually means, but that just leads me to the fourth thing. > > I've been making *extensive* use of Documentation/filesystems/ext4 by > the way, and I greatly appreciate it. I know you've done a ton of work > in that area. > > > The fourth layer is emailing the list to ask "Hey, I was thinking of > > ___, does anyone see a problem with my interpretation?". That clearly > > hasn't been done for shared bitmaps until now, which is all the more > > strange since you /did/ ask linux-ext4 about the inline data topic > > months ago. > > That one was on me, you're right. Because Ted is the maintainer of > e2fsprogs in Debian, and conversations about ext4 often happen in the > Debian BTS, reporting a bug on e2fsprogs there felt like starting an > upstream conversation. That was my mistake; in the future, I'll make > sure that things make it to linux-ext4. I already did so for > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971014 , and I > *should* have gone back and done so for the shared bitmap blocks issue. <nod> Thanks. > > ext* and XFS have been around for 25 years. The software validation of > > both is imperfect and incomplete because the complexity of the ondisk > > formats is vast and we haven't caught up with the technical debt that > > resulted from not writing rigorous checks that would have been very > > difficult with mid-90s hardware. I know, because I've been writing > > online checking for XFS and have seen the wide the gap between what the > > existing software looks for and what the ondisk format documentation > > allows. > > > > The only chance that we as a community have at managing the complexity > > of a filesystem is to use those four tools I outlined above to try to > > keep the mental models of everyone who participates in close enough > > alignment. That's how we usually avoid discussions that end in rancor > > and confusion. > > > > That was a very long way of reiterating "If you're going to interpret > > aspects of the software, please come talk to us before you start writing > > code". That is key to ext4's long history of compatibility, because a > > project cannot maintain continuity when actors develop conflicting code > > in the shadows. Look at all the mutations FAT and UFS that have > > appeared over the years-- the codebases became a mess as a result. > > Understood, agreed, and acknowledged. I'll make sure that any future > potentially "adventurous" filesystem experiments get discussed on > linux-ext4 first, not just in a distro bugtracker with a limited > audience. > > > > > the head", and continued with the notion that anything other than > > > > e2fsprogs making something to be mounted by mount(2) and handled by > > > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > > > > "how do we make it go away so that only e2fsprogs and the kernel ever > > > > touch ext4". I started this thread because I'd written some userspace > > > > code, a new version of the kernel made that userspace code stop working, > > > > so I wanted to report that the moment I'd discovered that, along with a > > > > potential way to address it with as little disrupton to ext4 as > > > > possible. > > > > Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block > > validity checking by default. You could someday enable users to write > > to block-sharing files by teaching ext4 how to COW, at which point you'd > > need correct bitmaps and block validity to prevent accidental overwrite > > of critical metadata. At that point you'd either be stuck with the > > precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up > > breaking Josh's images again. > > That's a fair point (though I think a writable COW version of > SHARED_BLOCKS would need a different flag). It'd make more sense to key > this logic off of RO_COMPAT_READONLY or similar. But even better: It wouldn't require a new flag -- "rocompat" features bits mean that "it's safe to allow userspace to read files off the disk if software doesn't recognize this feature bit". We're (ab)using the "doesn't recognize" part of that sentence, since the kernel doesn't recognize RO_COMPAT_READONLY (or SHARED_BLOCKS) when it compares the superblock ro compat field to EXT4_FEATURE_RO_COMPAT_SUPP. It therefore only allows reading files. If someone /did/ add the code to write to files safely in the presence of shared blocks, all they'd have to do to light up the functionality is add it to the _SUPP define. Also, it's strange that the kernel source doesn't even know of SHARED_BLOCKS, but that's also on Ted... > > "noblock_validity" in the superblock mount options field of the images > > you create. > > Yeah, I can do that. That's a much better solution, thank you. It would > have been problematic to have to change the userspace that mounts the > filesystem to pass new mount options ("noblock_validity") for the new > kernel. But if I can embed it in the filesystem, that'll work. > > I'll do that, and please feel free to drop the original proposed patch > as it's no longer needed. Thanks, Darrick. NP. --D > > - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-09 2:54 ` Darrick J. Wong @ 2020-10-09 19:08 ` Josh Triplett 0 siblings, 0 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-09 19:08 UTC (permalink / raw) To: Darrick J. Wong Cc: Theodore Y. Ts'o, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Thu, Oct 08, 2020 at 07:54:09PM -0700, Darrick J. Wong wrote: > On Thu, Oct 08, 2020 at 03:38:58PM -0700, Josh Triplett wrote: > > On Thu, Oct 08, 2020 at 10:54:48AM -0700, Darrick J. Wong wrote: > > > > > the head", and continued with the notion that anything other than > > > > > e2fsprogs making something to be mounted by mount(2) and handled by > > > > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > > > > > "how do we make it go away so that only e2fsprogs and the kernel ever > > > > > touch ext4". I started this thread because I'd written some userspace > > > > > code, a new version of the kernel made that userspace code stop working, > > > > > so I wanted to report that the moment I'd discovered that, along with a > > > > > potential way to address it with as little disrupton to ext4 as > > > > > possible. > > > > > > Ted: I don't think it's a good idea to make SHARED_BLOCKS disable block > > > validity checking by default. You could someday enable users to write > > > to block-sharing files by teaching ext4 how to COW, at which point you'd > > > need correct bitmaps and block validity to prevent accidental overwrite > > > of critical metadata. At that point you'd either be stuck with the > > > precedent that SHARED_BLOCKS implies noblock_validity, or you'd end up > > > breaking Josh's images again. > > > > That's a fair point (though I think a writable COW version of > > SHARED_BLOCKS would need a different flag). It'd make more sense to key > > this logic off of RO_COMPAT_READONLY or similar. But even better: > > It wouldn't require a new flag -- "rocompat" features bits mean that > "it's safe to allow userspace to read files off the disk if software > doesn't recognize this feature bit". Sure; I was more thinking that if that involved adding some data structures to track shared blocks and the need to COW, whatever mechanism that used might potentially need an incompat flag. > If someone /did/ add the code to write to files safely in the presence > of shared blocks, all they'd have to do to light up the functionality is > add it to the _SUPP define. Also, it's strange that the kernel source > doesn't even know of SHARED_BLOCKS, but that's also on Ted... It would be nice if the kernel's ext4.h header and e2fsprogs were fully in sync, yes. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 2:10 ` Theodore Y. Ts'o 2020-10-08 17:54 ` Darrick J. Wong @ 2020-10-08 22:22 ` Josh Triplett 2020-10-09 14:37 ` Theodore Y. Ts'o 1 sibling, 1 reply; 34+ messages in thread From: Josh Triplett @ 2020-10-08 22:22 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Wed, Oct 07, 2020 at 10:10:17PM -0400, Theodore Y. Ts'o wrote: > On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote: > > That sounds like a conversation that would have been a lot more > > interesting and enjoyable if it hadn't started with "can we shoot it in > > the head", and continued with the notion that anything other than > > e2fsprogs making something to be mounted by mount(2) and handled by > > fs/ext4 is being "inflicted", and if the goal didn't still seem to be > > "how do we make it go away so that only e2fsprogs and the kernel ever > > touch ext4". I started this thread because I'd written some userspace > > code, a new version of the kernel made that userspace code stop working, > > so I wanted to report that the moment I'd discovered that, along with a > > potential way to address it with as little disrupton to ext4 as > > possible. > > What is really getting my dander up is your attempt to claim that the > on-disk file system format is like the userspace/kernel interface, > where if we break any file system that file system that was > "previously accepted by an older kernel", this is a bug that must be > reverted or otherwise fixed to allow file systems that had previously > worked, to continue to work. And this is true even if the file system > is ***invalid***. > > And the problem with this is that there have been any number of > commits where file systems which were previously invalid, but which > could be caused to trigger a syzbot whine, which was fixed by > tightening up the validity tests in the kernel. In some cases, I had > to also had to fix up e2fsck to detect the invalid file system which > was generated by the file system fuzzer. Yes, it's unfortunate that > we didn't have these checks earlier, but a file system has a huge > amount of state. > > The principle you've articulated would make it impossible for me to > fix these bugs, unless I can prove that the failure to check a > particular invalid file system corruption could lead to a security > vulnerability. (Would it be OK for me to make the kernel more strict > and reject an invalid file system if it triggers a WARN_ON, so I get > the syzbot complaint, but it doesn't actually cause a security issue?) > > So this conversation would have been a lot more pleasant for *me* if > you hadn't tried to elevate your request to a general principle, where > if someone is deliberately generating an invalid file system, I'm not > allowed to make the kernel more strict to detect said invalidity and > to reject the invalid / corrupted / fuzzed file system. I appreciate you providing this explanation; this makes the nature and severity of your concern *much* more clear. I also now understand why I was feeling confused, and why it felt several times like we were talking past each other; see below. For my part, I had no desire to have generated this level of concern. The restrictions you're describing above are far, far past anything I had possibly envisioned or desired. I want to clarify a couple of things. I wasn't trying to make a *new* general principle or policy. I was under the impression that this *was* the policy, because it never occurred to me that it could be otherwise. It seemed like a natural aspect of the kernel/userspace boundary, to the point that the idea of it *not* being part of the kernel's stability guarantees didn't cross my mind. Thus, when you were suggesting a policy of "only filesystems generated by the e2fsprogs userspace tools are acceptable", that seems like you were suggesting a massive loosening of existing policy. And when I suggested an alternative of "only filesystems that pass e2fsck are acceptable", I was also trying to suggest a loosening of the existing policy, just a different, milder one. I would have come to this discussion with a very different perspective if it had occurred to me that this might not be the policy, or at least that it simply hadn't come up in this way before. I would be quite happy to have a discussion about where that stability boundary should be. I also have no desire to be inflexible about that boundary or about the evolution of the software I'm working on; this is not some enterprise "thou shalt not touch" workload. On the contrary, within reason it's *relatively* straightforward for me to evolve filesystem generation code to deal with changes in the kernel or e2fsck. I was not looking to push some new general principle on stability; I only wished to ensure that it was reasonable to negotiate about the best way to solve problems if and when they arise, and I hope they do not arise often. (Most of the time, I'd expect that when they arise I'll end up wanting to change my software rather than the kernel.) I also don't think that a (milder) stability guarantee would mean that the kernel or fsck can never become stricter. On the contrary, they both *should* absolutely try to detect more issues over time. Much like other instances of the userspace/kernel boundary, changes that could hypothetically break some userspace software aren't automatically disqualified, only changes that *actually* break userspace in practice. I regularly see the kernel making a change that would technically affect some userspace software, but either no such software appears to exist (and the question could be revisited if it did), or any such software that does exist is not affected in a harmful way (e.g. the software copes with any error produced), or the software can be relatively easily patched to work around the new requirements and the nature of the software is such that people won't be running older versions, or any number of other practical considerations. Stability is a practical boundary rather than a theoretical concern. It's one that can be discussed and negotiated based on the nature of a change, and its practical impact. I suggested a kernel patch because that seemed like a straightforward approach with minimal impact to the kernel's validity checking. Darrick's suggestion in the next mail is an even better approach, and one I'm likely to use, so I'd have no problem with the patch that started this thread *not* being applied after all (though I appreciate the willingness to consider it). Finally, I think there's also some clarification needed in the role of what some of the incompat and ro_compat flags mean. For instance, "RO_COMPAT_READONLY" is documented as: > - Read-only filesystem image; the kernel will not mount this image > read-write and most tools will refuse to write to the image. Is it reasonable to interpret this as "this can never, ever become writable", such that no kernel should ever "understand" that flag in ro_compat? I'd assumed so, but this discussion is definitely leading me to want to confirm any such assumptions. Is this a flag that e2fsck could potentially use to determine that it shouldn't check read-write-specific data structures, or should that be a different flag? > > The short version is that I needed a library to rapidly turn > > dynamically-obtained data into a set of disk blocks to be served > > on-the-fly as a software-defined disk, and then mounted on the other > > side of that interface by the Linux kernel. Turns out that's *many > > orders of magnitude* faster than any kind of network filesystem like > > NFS. It's slightly similar to a vvfat for ext4. The less blocks it can > > generate and account for and cache, the faster it can run, and > > microseconds matter. > > So are you actually trying to dedup data blocks, or are you just > trying to avoid needing to track the block allocation bitmaps? Both. I do actually dedup some of the data blocks, when it's easy to do so. The less data that needs to get generated and go over the wire, the better. > And are you just writing a single file, or multiple files? Do you > know what the maximum size of the file or files will be? Do you need > a complex directory structure, or just a single root directory? It's an arbitrary filesystem hierarchy, including directories, files of various sizes (hence using inline_data), and permissions. The problem isn't to get data from point A to point B; the problem is (in part) to turn a representation of a filesystem into an actual mounted filesystem as efficiently as possible, live-serving individual blocks on demand rather than generating the whole image in advance. > Can the file system be sparse? Files almost certainly won't be (because in practice, it'd be rare enough that it's not worth the time and logic to check). The block device is in a way; unused blocks aren't transmitted. > So for example, you can do something like this, which puts all of the > metadata at beginning of the file system, and then you could write to > contiguous data blocks. Add the following in mke2fs.conf: (Note that the main reason mke2fs wouldn't help is that it generates filesystems in advance, rather than on the fly.) > [fs_types] > hugefile = { > features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2 Other than uninit_bg, that's pretty much the list of feature flags I'm setting. I'm also enabling 64bit, inline_data, largedir, and filetype. > cluster_size = 32768 Many files are potentially small, so 4k blocks make sense. (I've even considered 1k blocks, but that seemed like a less-well-trodden path and I didn't want to be less efficient for larger files. 4k seems to be a good balance, and has the advantage of matching the page size.) > So if you had come to the ext4 list with a set of requirements, it > could have been that we could have come up with something which uses > the existing file system features, or come up with something which > would have been more specific --- and more importantly, we'd know what > the semantics were of various on-disk file system formats that people > are depending upon. I've mentioned specific aspects of what I'm doing at least twice: once in the thread about 128-bit inodes and inline data (which got a good response, and ended up convincing me not to go that route even though it'd be theoretically possible), and once in the feature request asking for support in e2fsck for shared bitmap blocks (which didn't get any further followups after I'd provided additional information). > > If at some point I'm looking to make ext4 support more than it already > > does (e.g. a way to omit bitmaps entirely, or a way to express > > contiguous files with smaller extent maps, or other enhancements for > > read-only filesystems), > > See above for a way to significantly reduce the number of bitmaps. > Adding a way to omit bitmaps entirely would require an INCOMPAT flag, > so it might not be worth it. I agree that it'd be of limited value, compared to the amount of change required. > The way to express contiguous files with smaller extent files would be > to extend the kernel to allow file systems with block_size > page_size > read-only. This would allow you to create a file system with a block > size of 64k, which will reduce the size of the extent maps by a factor > of 16, and it wouldn't be all that hard to teach ext4 to support these > file systems. (The reason why it would be hard for us to support file > systems with block sizes > page size is dealing with page cache when > writing files while allocating blocks, especially when doing random > writes into a sparse file. Read-only would be much easier to > support.) That seems like it would work well for a filesystem that mostly contained larger files, but not a mix of small files and occasional large files. (Imagine, for one possible instance, a normal Linux system's root filesystem with everything on it, plus a single 200GB file.) It's more that I'd like to avoid generating and serving (and having the kernel parse and recurse through) a full set of extent tree blocks; it would be nice if an extent record could represent more than 32k blocks. It'd be even nicer if *inline* extents could represent an arbitrarily long file as long as the file was contiguous. Would it be reasonable, particularly on a read-only filesystem (to avoid the possibility of having to break it into a massive extent tree on the fly), to add a way to represent a completely contiguous or mostly contiguous file using only the 60-byte in-inode region, and no auxiliary extents? Having extents with a 48-bit number of blocks would be helpful, and in theory a contiguous file doesn't exactly need ee_block. > So please, talk to us, and *tell* us what it is you're trying to do > before you try to do it. Don't rely on some implementation detail > where we're not being sufficiently strict in checking for an invalid > file system, especially without telling us in advance and then trying > to hold us to the lack of checking forever because it's "breaking > things that used to work". I'd be happy to start more such conversations in the future. And as mentioned above, I am *not* looking for that level of guarantee of never checking *anything* that used to be checked; I'm looking to collaboratively figure out what the best solution is. I don't want the default assumption to be that the kernel needs to remain perfectly accepting of everything it did in the past, nor do I want the default assumption to be that userspace should always change to meet the new kernel requirements. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 22:22 ` Josh Triplett @ 2020-10-09 14:37 ` Theodore Y. Ts'o 2020-10-09 20:30 ` Josh Triplett 2021-01-10 18:41 ` Malicious fs images was " Pavel Machek 0 siblings, 2 replies; 34+ messages in thread From: Theodore Y. Ts'o @ 2020-10-09 14:37 UTC (permalink / raw) To: Josh Triplett Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote: > > I wasn't trying to make a *new* general principle or policy. I was under > the impression that this *was* the policy, because it never occurred to > me that it could be otherwise. It seemed like a natural aspect of the > kernel/userspace boundary, to the point that the idea of it *not* being > part of the kernel's stability guarantees didn't cross my mind. From our perspective (and Darrick and I discussed this on this week's ext4 video conference, so it represents the ext4 and xfs maintainer's position) is that the file system format is different. First, the on-disk format is not an ABI, and it is several orders more complex than a system call interface. Second, we make no guarantees about what the file system created by malicious tools will do. For example, XFS developers reject bug reports from file system fuzzers, because the v5 format has CRC checks, so randomly corrupted file systems won't crash the kernel. Yes, this doesn't protect against maliciously created file systems where the attacker makes sure the checksums are valid, but only crazy people who think containers are just as secure as VM's and that unprivileged users should be allowed to make the kernel mount potentially maliciously created file systems would be exposing the kernel to such maliciously created images. > Finally, I think there's also some clarification needed in the role of > what some of the incompat and ro_compat flags mean. For instance, > "RO_COMPAT_READONLY" is documented as: > > - Read-only filesystem image; the kernel will not mount this image > > read-write and most tools will refuse to write to the image. > Is it reasonable to interpret this as "this can never, ever become > writable", such that no kernel should ever "understand" that flag in > ro_compat? Yes. However... > I'd assumed so, but this discussion is definitely leading me > to want to confirm any such assumptions. Is this a flag that e2fsck > could potentially use to determine that it shouldn't check > read-write-specific data structures, or should that be a different flag? Just because it won't be modifiable, shouldn't mean that e2fsck won't check to make sure that such structures are valid. "Won't be changed" and "valid" are two different concepts. And certainly, today we *do* check to make sure the bitmaps are valid and don't overlap, and we can't go back in time to change that. That being said, on the ext4 weekly video chat, we did discuss other uses of an incompat feature flag that would allow the allocation bitmap blocks and inode table block fields in the superblock to be zero, which would mean that they are unallocated. This would allow us to dynamically grow the inode table by adding an extra block group descriptor. In fact, I'd probably use this as an opportunity to make some other changes, such using inodes to store locations of the block group descriptors, inode tables, and allocation bitmaps at the same time. Those details can be discussed later, but the point is that this is why it's good to discuss format changes from a requirements perspective, so that if we do need to make an incompat change, we can kill multiple birds with a single stone. > It's an arbitrary filesystem hierarchy, including directories, files of > various sizes (hence using inline_data), and permissions. The problem > isn't to get data from point A to point B; the problem is (in part) to > turn a representation of a filesystem into an actual mounted filesystem > as efficiently as possible, live-serving individual blocks on demand > rather than generating the whole image in advance. Ah, so you want to be able to let the other side "look at" the file system in parallel with it being generated on demand? The cache coherency problems would seem to be... huge. For example, how can you add a file to directory after the reader has looked at the directory inode and directory blocks? Or for that matter, looked at a portion of the inode table block? Or are you using 4k inodes so there is only one inode per block? What about the fact that we sometimes do readahead of inode table blocks? I can think of all sorts of implementation level changes in terms of caching, readahead behavior, etc., that we might make in the future that might break you if you are doing some quite as outré are that. Again, the fact that you're being cagey about what you are doing, and potentially complaining about changes we might make that would break you, is ***really*** scaring me now. Can you go into more details here? I'm sorry if you're working for some startup who might want to patent these ideas, but if you want to guarantee future compatibility, I'm really going to have to insist. - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-09 14:37 ` Theodore Y. Ts'o @ 2020-10-09 20:30 ` Josh Triplett 2021-01-10 18:41 ` Malicious fs images was " Pavel Machek 1 sibling, 0 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-09 20:30 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Fri, Oct 09, 2020 at 10:37:32AM -0400, Theodore Y. Ts'o wrote: > That being said, on the ext4 weekly video chat, we did discuss other > uses of an incompat feature flag that would allow the allocation > bitmap blocks and inode table block fields in the superblock to be > zero, which would mean that they are unallocated. What do you mean by "allocation bitmap blocks and inode table block fields in the superblock"? Those are in the group descriptor, not the superblock. Or am I missing something there? > This would allow us > to dynamically grow the inode table by adding an extra block group > descriptor. In fact, I'd probably use this as an opportunity to make > some other changes, such using inodes to store locations of the block > group descriptors, inode tables, and allocation bitmaps at the same > time. Those details can be discussed later, but the point is that > this is why it's good to discuss format changes from a requirements > perspective, so that if we do need to make an incompat change, we can > kill multiple birds with a single stone. I would be quite interested in that. > On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote: > > It's an arbitrary filesystem hierarchy, including directories, files of > > various sizes (hence using inline_data), and permissions. The problem > > isn't to get data from point A to point B; the problem is (in part) to > > turn a representation of a filesystem into an actual mounted filesystem > > as efficiently as possible, live-serving individual blocks on demand > > rather than generating the whole image in advance. > > Ah, so you want to be able to let the other side "look at" the file > system in parallel with it being generated on demand? The cache > coherency problems would seem to be... huge. For example, how can you > add a file to directory after the reader has looked at the directory > inode and directory blocks? I don't. While the data is computed on demand for performance reasons, the nature and size of all the data is fully known in advance. I never add data to a filesystem, only create new filesystem images. The kernel *is* looking at the filesystem in parallel with it being generated, insofar as blocks aren't constructed until the kernel asks for them (which is a massive performance win, especially since the kernel may only want a small subset of the filesystem). But the block contents are fixed in advance, even if they haven't been generated yet. So the kernel can read ahead and cache any blocks it wants to, and they'll be valid. (Excessive readahead might be a performance problem, but not a correctness one.) I briefly considered ideas around adding new data after the filesystem was mounted, and dismissed those ideas just as quickly, for exactly these reasons (disk caching, filesystem caching, readahead). That would require a filesystem with at least some subset of cluster features. I don't plan to go there if I don't have to. If I do end up needing that, I'd consider proposing an ext4 change along the lines of making the root directory into a "super-root" directory under which multiple filesystem roots could live, and supporting a way to re-read that root to discover new filesystem roots and new block groups; then, it'd be possible to add a new root whose contents are *mostly* references to existing inodes (that the kernel would already have cached), and any modified directories or new/modified files would have new inodes added. That would isolate the "don't read ahead and cache" problem to the new inodes, which could potentially be isolated to new block groups for simplicity, and the "discover new roots" mechanism could also discover the newly added block groups and inode tables. But again, I'm not curently looking to do that, and *if* I were, I'd bring it up for architectural discussion and design on linux-ext4 first. There's also a balance there between a simple version that'd work for an append-only read-only filesystem, and a complex version that'd work for a writable filesystem, and I'd hope more for the former than the latter, but that'd be a topic for discussion. - Josh Triplett ^ permalink raw reply [flat|nested] 34+ messages in thread
* Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-09 14:37 ` Theodore Y. Ts'o 2020-10-09 20:30 ` Josh Triplett @ 2021-01-10 18:41 ` Pavel Machek 2021-01-11 18:51 ` Darrick J. Wong 2021-01-12 21:43 ` Theodore Ts'o 1 sibling, 2 replies; 34+ messages in thread From: Pavel Machek @ 2021-01-10 18:41 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Josh Triplett, Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 2071 bytes --] Hi! On Fri 2020-10-09 10:37:32, Theodore Y. Ts'o wrote: > On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote: > > > > I wasn't trying to make a *new* general principle or policy. I was under > > the impression that this *was* the policy, because it never occurred to > > me that it could be otherwise. It seemed like a natural aspect of the > > kernel/userspace boundary, to the point that the idea of it *not* being > > part of the kernel's stability guarantees didn't cross my mind. > > >From our perspective (and Darrick and I discussed this on this week's > ext4 video conference, so it represents the ext4 and xfs maintainer's > position) is that the file system format is different. First, the > on-disk format is not an ABI, and it is several orders more complex > than a system call interface. Second, we make no guarantees about > what the file system created by malicious tools will do. For example, > XFS developers reject bug reports from file system fuzzers, because > the v5 format has CRC checks, so randomly corrupted file systems won't > crash the kernel. Yes, this doesn't protect against maliciously > created file systems where the attacker makes sure the checksums are > valid, but only crazy people who think containers are just as secure Well, it is not just containers. It is also USB sticks. And people who believe secure boot is good idea and try to protect kernel against root. And crazy people who encrypt pointers in dmesg. And... People want to use USB sticks from time to time. And while I understand XFS is so complex it is unsuitable for such use, I'd still expect bugs to be fixed there. I hope VFAT to be safe to mount, because that is very common on USB. I also hope ext2/3/4 is safe in that regard. Anyway it would be nice to have documentation explaining this. If I'm wrong about VFAT being safe, it would be good to know, and I guess many will be surprised that XFS is using different rules. Best regards, Pavel -- http://www.livejournal.com/~pavelmachek [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2021-01-10 18:41 ` Malicious fs images was " Pavel Machek @ 2021-01-11 18:51 ` Darrick J. Wong 2021-01-11 19:39 ` Eric Biggers 2021-01-12 21:43 ` Theodore Ts'o 1 sibling, 1 reply; 34+ messages in thread From: Darrick J. Wong @ 2021-01-11 18:51 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Y. Ts'o, Josh Triplett, Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Sun, Jan 10, 2021 at 07:41:02PM +0100, Pavel Machek wrote: > Hi! > > On Fri 2020-10-09 10:37:32, Theodore Y. Ts'o wrote: > > On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote: > > > > > > I wasn't trying to make a *new* general principle or policy. I was under > > > the impression that this *was* the policy, because it never occurred to > > > me that it could be otherwise. It seemed like a natural aspect of the > > > kernel/userspace boundary, to the point that the idea of it *not* being > > > part of the kernel's stability guarantees didn't cross my mind. > > > > >From our perspective (and Darrick and I discussed this on this week's > > ext4 video conference, so it represents the ext4 and xfs maintainer's > > position) is that the file system format is different. First, the > > on-disk format is not an ABI, and it is several orders more complex > > than a system call interface. Second, we make no guarantees about > > what the file system created by malicious tools will do. For example, > > XFS developers reject bug reports from file system fuzzers, because My recollection of this is quite different -- sybot was sending multiple zeroday exploits per week to the public xfs list, and nobody at Google was helping us to /fix/ those bugs. Each report took hours of developer time to extract the malicious image (because Dmitry couldn't figure out how to add that ability to syzbot) and syscall sequence from the reproducer program, plus whatever time it took to craft a patch, test it, and push it through review. Dave and I complained to Dmitry about how the community had zero input into the rate at which syzbot was allowed to look for xfs bugs. Nobody at Google would commit to helping fix any of the XFS bugs, and Dmitry would not give us better choices than (a) Google AI continuing to create zerodays and leaving the community to clean up the mess, or (b) shutting off syzbot entirely. At the time I said I would accept letting syzbot run against xfs until it finds something, and turning it off until we resolve the issue. That wasn't acceptable, because (I guess) nobody at Google wants to put /any/ staff time into XFS at all. TLDR: XFS /does/ accept bug reports about fuzzed and broken images. What we don't want is make-work Google AIs spraying zeroday code in public places and the community developers having to clean up the mess. > > the v5 format has CRC checks, so randomly corrupted file systems > > won't > > crash the kernel. Yes, this doesn't protect against maliciously > > created file systems where the attacker makes sure the checksums are > > valid, but only crazy people who think containers are just as secure > > Well, it is not just containers. It is also USB sticks. And people who > believe secure boot is good idea and try to protect kernel against > root. And crazy people who encrypt pointers in dmesg. And... > > People want to use USB sticks from time to time. And while I > understand XFS is so complex it is unsuitable for such use, I'd still > expect bugs to be fixed there. > > I hope VFAT to be safe to mount, because that is very common on USB. > > I also hope ext2/3/4 is safe in that regard. None of them are, that's why you can't mount local filesystems as an unprivileged user by default. At a bare minimum you'd have to audit the fs driver code to make sure it's robust against unexpected metadata; and since filesystems store complex relational data across multiple indices, you end up needing to hoist a full fsck into the mount process. Or retain the current behavior, which is to delegate the trust decision to the sysadmin. It's not just an XFS thing. --D > Anyway it would be nice to have documentation explaining this. If I'm > wrong about VFAT being safe, it would be good to know, and I guess > many will be surprised that XFS is using different rules. > > Best regards, > Pavel > -- > http://www.livejournal.com/~pavelmachek ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2021-01-11 18:51 ` Darrick J. Wong @ 2021-01-11 19:39 ` Eric Biggers 0 siblings, 0 replies; 34+ messages in thread From: Eric Biggers @ 2021-01-11 19:39 UTC (permalink / raw) To: Darrick J. Wong Cc: Pavel Machek, Theodore Y. Ts'o, Josh Triplett, Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Mon, Jan 11, 2021 at 10:51:20AM -0800, Darrick J. Wong wrote: > On Sun, Jan 10, 2021 at 07:41:02PM +0100, Pavel Machek wrote: > > Hi! > > > > On Fri 2020-10-09 10:37:32, Theodore Y. Ts'o wrote: > > > On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote: > > > > > > > > I wasn't trying to make a *new* general principle or policy. I was under > > > > the impression that this *was* the policy, because it never occurred to > > > > me that it could be otherwise. It seemed like a natural aspect of the > > > > kernel/userspace boundary, to the point that the idea of it *not* being > > > > part of the kernel's stability guarantees didn't cross my mind. > > > > > > >From our perspective (and Darrick and I discussed this on this week's > > > ext4 video conference, so it represents the ext4 and xfs maintainer's > > > position) is that the file system format is different. First, the > > > on-disk format is not an ABI, and it is several orders more complex > > > than a system call interface. Second, we make no guarantees about > > > what the file system created by malicious tools will do. For example, > > > XFS developers reject bug reports from file system fuzzers, because > > My recollection of this is quite different -- sybot was sending multiple > zeroday exploits per week to the public xfs list, and nobody at Google > was helping us to /fix/ those bugs.Each report took hours of developer > time to extract the malicious image (because Dmitry couldn't figure out > how to add that ability to syzbot) and syscall sequence from the > reproducer program, plus whatever time it took to craft a patch, test > it, and push it through review. > > Dave and I complained to Dmitry about how the community had zero input > into the rate at which syzbot was allowed to look for xfs bugs. Nobody > at Google would commit to helping fix any of the XFS bugs, and Dmitry > would not give us better choices than (a) Google AI continuing to create > zerodays and leaving the community to clean up the mess, or (b) shutting > off syzbot entirely. At the time I said I would accept letting syzbot > run against xfs until it finds something, and turning it off until we > resolve the issue. That wasn't acceptable, because (I guess) nobody at > Google wants to put /any/ staff time into XFS at all. > > TLDR: XFS /does/ accept bug reports about fuzzed and broken images. > What we don't want is make-work Google AIs spraying zeroday code in > public places and the community developers having to clean up the mess. syzkaller is an open source project that implements a coverage-guided fuzzer for multiple operating system kernels; it's not "Google AI". Anyone can run syzkaller (either by itself, or as part of a syzbot instance) and find the same bugs. - Eric ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2021-01-10 18:41 ` Malicious fs images was " Pavel Machek 2021-01-11 18:51 ` Darrick J. Wong @ 2021-01-12 21:43 ` Theodore Ts'o 2021-01-12 22:28 ` Pavel Machek 1 sibling, 1 reply; 34+ messages in thread From: Theodore Ts'o @ 2021-01-12 21:43 UTC (permalink / raw) To: Pavel Machek Cc: Josh Triplett, Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Sun, Jan 10, 2021 at 07:41:02PM +0100, Pavel Machek wrote: > > >From our perspective (and Darrick and I discussed this on this week's > > ext4 video conference, so it represents the ext4 and xfs maintainer's > > position) is that the file system format is different. First, the > > on-disk format is not an ABI, and it is several orders more complex > > than a system call interface. Second, we make no guarantees about > > what the file system created by malicious tools will do. For example, > > XFS developers reject bug reports from file system fuzzers, because > > the v5 format has CRC checks, so randomly corrupted file systems won't > > crash the kernel. Yes, this doesn't protect against maliciously > > created file systems where the attacker makes sure the checksums are > > valid, but only crazy people who think containers are just as secure > > Well, it is not just containers. It is also USB sticks. And people who > believe secure boot is good idea and try to protect kernel against > root. And crazy people who encrypt pointers in dmesg. And... > > People want to use USB sticks from time to time. And while I > understand XFS is so complex it is unsuitable for such use, I'd still > expect bugs to be fixed there. > > I hope VFAT to be safe to mount, because that is very common on USB. > > I also hope ext2/3/4 is safe in that regard. Ext4 will fix file system fuzzing attack bugs on a best efforts basis. That is, when I have time, I've been known to stay up late to bugs reported by fuzzers. I hope ext4 is safe, but I'm not going to make any guarantees that it is Bug-Free(tm). If you want to trust it in that way, you do so at your risk. As far as VFS is concerned, I'm not aware of anyone who has been working on fuzz-proofing VFAT, and looking at the Vault 2016 for "American Fuzzy Lop"[1] while VFAT wasn't specifically tested, for the vast majority of file systems, the "time to first bug" typically ranged from seconds to minutes, with the exception of XFS and ext4 (where it was roughly 2 hours). The specific bugs which triggered in the 2016 AFL presentation have been fixed, at least for the file systems which get regular maintainer attention, but this is why I try to caution people not to count on file systems being proof against maliciously formatted images. [1] https://events.static.linuxfound.org/sites/events/files/slides/AFL%20filesystem%20fuzzing,%20Vault%202016_0.pdf > Anyway it would be nice to have documentation explaining this. If I'm > wrong about VFAT being safe, it would be good to know, and I guess > many will be surprised that XFS is using different rules. Using USB sticks is fine, so long as you trust the provenance of the drive. If you take a random USB stick that is handed to you by someone whom you don't trust implicitly, or worse, that you picked up abandoned on the sidewalk, there have been plenty of articles which describe why this is a REALLY BAD IDEA, and even if you ignore OS-level vuleranbilities, there are also firwmare and hardware based vulerabilities that would put your computer at risk. See [2] and [3] for more details; there's a reason why I've visited at least one financial institution where they put epoxy in USB ports to prevent clueless workers from potentially compromising the bank's computers. [2] https://www.redteamsecure.com/blog/usb-drop-attacks-the-danger-of-lost-and-found-thumb-drives/ [3] https://www.bleepingcomputer.com/news/security/heres-a-list-of-29-different-types-of-usb-attacks/ As far as documentation is concerned, how far should we go? Should there be a warning in the execve(2) system call man page that you shouldn't download random binaries from the network and execute them? :-) Cheers, - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2021-01-12 21:43 ` Theodore Ts'o @ 2021-01-12 22:28 ` Pavel Machek 2021-01-13 5:09 ` Theodore Ts'o 0 siblings, 1 reply; 34+ messages in thread From: Pavel Machek @ 2021-01-12 22:28 UTC (permalink / raw) To: Theodore Ts'o Cc: Josh Triplett, Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 3129 bytes --] Hi! > > People want to use USB sticks from time to time. And while I > > understand XFS is so complex it is unsuitable for such use, I'd still > > expect bugs to be fixed there. > > > > I hope VFAT to be safe to mount, because that is very common on USB. > > > > I also hope ext2/3/4 is safe in that regard. > > Ext4 will fix file system fuzzing attack bugs on a best efforts basis. > That is, when I have time, I've been known to stay up late to bugs > reported by fuzzers. I hope ext4 is safe, but I'm not going to make > any guarantees that it is Bug-Free(tm). If you want to trust it in > that way, you do so at your risk. Good. > > Anyway it would be nice to have documentation explaining this. If I'm > > wrong about VFAT being safe, it would be good to know, and I guess > > many will be surprised that XFS is using different rules. > > Using USB sticks is fine, so long as you trust the provenance of the > drive. If you take a random USB stick that is handed to you by Well... That makes passing data between Windows and Linux machines using USB stick "interesting", right? > someone whom you don't trust implicitly, or worse, that you picked up > abandoned on the sidewalk, there have been plenty of articles which > describe why this is a REALLY BAD IDEA, and even if you ignore > OS-level vuleranbilities, there are also firwmare and hardware based > vulerabilities that would put your computer at risk. See [2] and > [3] I know, but bear with me. > As far as documentation is concerned, how far should we go? Should > there be a warning in the execve(2) system call man page that you > shouldn't download random binaries from the network and execute them? :-) No need to pull straw men for me. This thread suggested that kernel is _not_ supposed to be robust against corrupt filesystems (because fsck is not integrated in kernel). Which was news to me (and I'm not the person that needs warning in execve documentation). I'd certainly like to hear that VFAT and EXT4 _is_ supposed to be robust in that way. And if we have filesystems where corrupt image is known to allow arbitrary code execution, we need to a) document that. b) disable them when secure boot is enabled. Because with secure boot, we are supposed to be secure against attacks from root, and root can prepare malicious filesystems. ("The problem, simply put, is this: the objective of secure boot is to prevent the system from running any unsigned code in a privileged mode. So, if one boots a Linux system that, in turn, gives access to the machine to untrusted code, the entire purpose has been defeated. The consequences could hurt both locally (bad code could take control of the machine) and globally (the signing key used to boot Linux could be revoked), so it is an outcome that is worth avoiding. Doing so, however, requires placing limitations in the kernel so that not even root can circumvent the secure boot chain of trust." -- https://lwn.net/Articles/514985/ ). Best regards, Pavel -- http://www.livejournal.com/~pavelmachek [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2021-01-12 22:28 ` Pavel Machek @ 2021-01-13 5:09 ` Theodore Ts'o 0 siblings, 0 replies; 34+ messages in thread From: Theodore Ts'o @ 2021-01-13 5:09 UTC (permalink / raw) To: Pavel Machek Cc: Josh Triplett, Darrick J. Wong, Linus Torvalds, Andreas Dilger, Jan Kara, Linux Kernel Mailing List, linux-ext4 On Tue, Jan 12, 2021 at 11:28:40PM +0100, Pavel Machek wrote: > > This thread suggested that kernel is _not_ supposed to be robust > against corrupt filesystems (because fsck is not integrated in > kernel). Which was news to me (and I'm not the person that needs > warning in execve documentation). Define "supposed to be". In the ideal world, the kernel should be robust against corrupt file systems. In the ideal world, hard drives would never die, and memory bits would never get flipped due to cosmic rays, and so Intel would be correct that consumers don't need ECC memory. In the ideal world, drivers would never make mistakes, and so seat belts would be completely unnecessasry. Unfortunately, we live in the real world. And so there is risk associated with using various technologies, and that risk is not a binary 0% vs 100%. In my mind, assuming that file systems are robust against maliciously created images is much like driving around without a seat belt. There are plenty of people who drive without seat belts for years, and they haven't been killed yet. But it's not something *I* would do. But hey, I also spent extra money to buy a personal desktop computer with ECC memory, and I don't go bicycling without a helment, either. You're free to decide what *you* want to do. But I wouldn't take a random file system image from the Net and mount it without running fsck on the darned thing first. As far as secure boot is concerned, do you know how many zero days are out there with Windows machines? I'm pretty sure there are vast numbers. There are even more systems where the system adminsitrators haven't bothered to install latest updates, as well. > And if we have filesystems where corrupt image is known to allow > arbitrary code execution, we need to Define *known*. If you look at the syzbot dashboard, there are hundreds of reproducers where root is able to crash a Linux server. Some number of them will almost certainly be exploitable. The problem is it takes a huge amount of time to analyze them, and Syzbot's file system fuzzers are positively developer-hostile to root cause. So usually I find and fix ext4 file system fuzzing reports via reports from other fuzzers, and then eventually syzbot realizes that the reproducer no longer works, and it gets closed out. I'd certainly be willing to bet a beer or two that there are vulnerabilities in VFAT, but do I *know* that to be the case? No, because I don't have the time to take a syzbot report relating to a file system and root cause it. The time to extract out the image, and then figure out how to get a simple reproducer (as opposed to the mess that a syzbot reproducer that might be randomly toggling a network bridge interface on one thread while messing with a file system image on another) is easily 10-20 times the effort to actually *fix* the bug once we're able to come up with a trivial reproducer with a file system image which is a separate file so we can analyze it using file system debugging tools. I'm also *quite* confident that the NSA, KGB, and other state intelligence agencies have dozens of zero days for Windows, Linux, etc. We just don't know in what subsystem they are in, so we can't just "disable them when secure boot is enabled". - Ted ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-07 20:14 ` Josh Triplett 2020-10-08 2:10 ` Theodore Y. Ts'o @ 2020-10-08 2:57 ` Andreas Dilger 2020-10-08 19:12 ` Josh Triplett 1 sibling, 1 reply; 34+ messages in thread From: Andreas Dilger @ 2020-10-08 2:57 UTC (permalink / raw) To: Josh Triplett Cc: Theodore Y. Ts'o, Darrick J. Wong, Jan Kara, Linux Kernel Mailing List, Ext4 Developers List [-- Attachment #1: Type: text/plain, Size: 1578 bytes --] On Oct 7, 2020, at 2:14 PM, Josh Triplett <josh@joshtriplett.org> wrote: > If those aren't the right way to express that, I could potentially > adapt. I had a similar such conversation on linux-ext4 already (about > inline data with 128-bit inodes), which led to me choosing to abandon > 128-byte inodes rather than try to get ext4 to support what I wanted > with them, because I didn't want to be disruptive to ext4 for a niche > use case. In the particular case that motivated this thread, what I was > doing already worked in previous kernels, and it seemed reasonable to > ask for it to continue to work in new kernels, while preserving the > newly added checks in the new kernels. This was discussed in the "Inline data with 128-byte inodes?" thread back in May. While Jan was not necessarily in favour of this, I was actually OK with improving the ext4 code to handle this case better, since it would (at minimum) clean up ext4 to make a clear separation of how it is detecting data in the i_block[] array and the system.data xattr, and I don't think it added any complexity to the code. I even posted a WIP patch to that effect, but didn't get a response back: https://marc.info/?l=linux-ext4&m=158863275019187 I *do* think that inline_data is an under-appreciated feature that I would be happy to see some improvements with. I don't think that small files are a niche use case, and if we can clean up the inline_data code to work with 128-byte inodes I'm not against that, even though I'm not going to use that combination of features myself. Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 873 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 2:57 ` Andreas Dilger @ 2020-10-08 19:12 ` Josh Triplett 2020-10-08 19:25 ` Andreas Dilger 0 siblings, 1 reply; 34+ messages in thread From: Josh Triplett @ 2020-10-08 19:12 UTC (permalink / raw) To: Andreas Dilger Cc: Theodore Y. Ts'o, Darrick J. Wong, Jan Kara, Linux Kernel Mailing List, Ext4 Developers List On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote: > On Oct 7, 2020, at 2:14 PM, Josh Triplett <josh@joshtriplett.org> wrote: > > If those aren't the right way to express that, I could potentially > > adapt. I had a similar such conversation on linux-ext4 already (about > > inline data with 128-bit inodes), which led to me choosing to abandon > > 128-byte inodes rather than try to get ext4 to support what I wanted > > with them, because I didn't want to be disruptive to ext4 for a niche > > use case. In the particular case that motivated this thread, what I was > > doing already worked in previous kernels, and it seemed reasonable to > > ask for it to continue to work in new kernels, while preserving the > > newly added checks in the new kernels. > > This was discussed in the "Inline data with 128-byte inodes?" thread > back in May. While Jan was not necessarily in favour of this, I was > actually OK with improving the ext4 code to handle this case better, > since it would (at minimum) clean up ext4 to make a clear separation > of how it is detecting data in the i_block[] array and the system.data > xattr, and I don't think it added any complexity to the code. > > I even posted a WIP patch to that effect, but didn't get a response back: > https://marc.info/?l=linux-ext4&m=158863275019187 My apologies, I thought I responded to that. It looks promising to me, though I wouldn't have the bandwidth to take it to completion anytime soon. > I *do* think that inline_data is an under-appreciated feature that I > would be happy to see some improvements with. I don't think that small > files are a niche use case, and if we can clean up the inline_data code > to work with 128-byte inodes I'm not against that, even though I'm not > going to use that combination of features myself. I'd love to see that happen. At the time, it seemed like too large of a change to block on, which is why I ended up deciding to switch to 256-byte inodes. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 19:12 ` Josh Triplett @ 2020-10-08 19:25 ` Andreas Dilger 2020-10-08 22:28 ` Josh Triplett 0 siblings, 1 reply; 34+ messages in thread From: Andreas Dilger @ 2020-10-08 19:25 UTC (permalink / raw) To: Josh Triplett Cc: Theodore Y. Ts'o, Darrick J. Wong, Jan Kara, Linux Kernel Mailing List, Ext4 Developers List [-- Attachment #1: Type: text/plain, Size: 2550 bytes --] On Oct 8, 2020, at 1:12 PM, Josh Triplett <josh@joshtriplett.org> wrote: > > On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote: >> On Oct 7, 2020, at 2:14 PM, Josh Triplett <josh@joshtriplett.org> wrote: >>> If those aren't the right way to express that, I could potentially >>> adapt. I had a similar such conversation on linux-ext4 already (about >>> inline data with 128-bit inodes), which led to me choosing to abandon >>> 128-byte inodes rather than try to get ext4 to support what I wanted >>> with them, because I didn't want to be disruptive to ext4 for a niche >>> use case. In the particular case that motivated this thread, what I was >>> doing already worked in previous kernels, and it seemed reasonable to >>> ask for it to continue to work in new kernels, while preserving the >>> newly added checks in the new kernels. >> >> This was discussed in the "Inline data with 128-byte inodes?" thread >> back in May. While Jan was not necessarily in favour of this, I was >> actually OK with improving the ext4 code to handle this case better, >> since it would (at minimum) clean up ext4 to make a clear separation >> of how it is detecting data in the i_block[] array and the system.data >> xattr, and I don't think it added any complexity to the code. >> >> I even posted a WIP patch to that effect, but didn't get a response back: >> https://marc.info/?l=linux-ext4&m=158863275019187 > > My apologies, I thought I responded to that. It looks promising to me, > though I wouldn't have the bandwidth to take it to completion anytime > soon. NP, I don't have bandwidth to work on it right now either. >> I *do* think that inline_data is an under-appreciated feature that I >> would be happy to see some improvements with. I don't think that small >> files are a niche use case, and if we can clean up the inline_data code >> to work with 128-byte inodes I'm not against that, even though I'm not >> going to use that combination of features myself. > > I'd love to see that happen. At the time, it seemed like too large of a > change to block on, which is why I ended up deciding to switch to > 256-byte inodes. Does that mean you are using inline_data with 256-byte inodes? That would also be good to know, since there haven't been any well-known users of this feature so far (AFAIK). Since you are using this in a read-only manner, you won't hit the one know issue when an inline_data inode is extended to use an external block that may temporarily leave the inode in an inconsistent state. Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 873 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps 2020-10-08 19:25 ` Andreas Dilger @ 2020-10-08 22:28 ` Josh Triplett 0 siblings, 0 replies; 34+ messages in thread From: Josh Triplett @ 2020-10-08 22:28 UTC (permalink / raw) To: Andreas Dilger Cc: Theodore Y. Ts'o, Darrick J. Wong, Jan Kara, Linux Kernel Mailing List, Ext4 Developers List On Thu, Oct 08, 2020 at 01:25:57PM -0600, Andreas Dilger wrote: > On Oct 8, 2020, at 1:12 PM, Josh Triplett <josh@joshtriplett.org> wrote: > > On Wed, Oct 07, 2020 at 08:57:12PM -0600, Andreas Dilger wrote: > >> I *do* think that inline_data is an under-appreciated feature that I > >> would be happy to see some improvements with. I don't think that small > >> files are a niche use case, and if we can clean up the inline_data code > >> to work with 128-byte inodes I'm not against that, even though I'm not > >> going to use that combination of features myself. > > > > I'd love to see that happen. At the time, it seemed like too large of a > > change to block on, which is why I ended up deciding to switch to > > 256-byte inodes. > > Does that mean you are using inline_data with 256-byte inodes? I am, yes, and it mostly works great. I've hit zero issues with it in the filesystems I'm generating. > That would also be good to know, since there haven't been any > well-known users of this feature so far (AFAIK). Since you are using > this in a read-only manner, you won't hit the one know issue when an > inline_data inode is extended to use an external block that may > temporarily leave the inode in an inconsistent state. I've run into a few other issues with it in other tools, as well. mke2fs with inline_data generates invalid files given xattrs: https://lore.kernel.org/linux-ext4/20200926102512.GA11386@localhost/T/#u And extlinux doesn't like inline_data at all: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=971002 I'll report any other issues I run into using inline_data. I agree that it's deeply underappreciated. - Josh ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2021-01-13 5:10 UTC | newest] Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <CAHk-=wj-H5BYCU_kKiOK=B9sN3BtRzL4pnne2AJPyf54nQ+d=w@mail.gmail.com> 2020-10-05 8:14 ` ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps Josh Triplett 2020-10-05 9:46 ` Jan Kara 2020-10-05 10:16 ` Josh Triplett 2020-10-05 16:19 ` Jan Kara 2020-10-05 16:20 ` Jan Kara 2020-10-05 17:36 ` Darrick J. Wong 2020-10-06 0:04 ` Theodore Y. Ts'o 2020-10-06 0:32 ` Josh Triplett 2020-10-06 2:51 ` Darrick J. Wong 2020-10-06 3:18 ` Theodore Y. Ts'o 2020-10-06 5:03 ` Josh Triplett 2020-10-06 6:03 ` Josh Triplett 2020-10-06 13:35 ` Theodore Y. Ts'o 2020-10-07 8:03 ` Josh Triplett 2020-10-07 14:32 ` Theodore Y. Ts'o 2020-10-07 20:14 ` Josh Triplett 2020-10-08 2:10 ` Theodore Y. Ts'o 2020-10-08 17:54 ` Darrick J. Wong 2020-10-08 22:38 ` Josh Triplett 2020-10-09 2:54 ` Darrick J. Wong 2020-10-09 19:08 ` Josh Triplett 2020-10-08 22:22 ` Josh Triplett 2020-10-09 14:37 ` Theodore Y. Ts'o 2020-10-09 20:30 ` Josh Triplett 2021-01-10 18:41 ` Malicious fs images was " Pavel Machek 2021-01-11 18:51 ` Darrick J. Wong 2021-01-11 19:39 ` Eric Biggers 2021-01-12 21:43 ` Theodore Ts'o 2021-01-12 22:28 ` Pavel Machek 2021-01-13 5:09 ` Theodore Ts'o 2020-10-08 2:57 ` Andreas Dilger 2020-10-08 19:12 ` Josh Triplett 2020-10-08 19:25 ` Andreas Dilger 2020-10-08 22:28 ` Josh Triplett
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).