From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 7 Oct 2020 22:10:17 -0400
From: "Theodore Y. Ts'o"
To: Josh Triplett
Cc: "Darrick J. Wong", Linus Torvalds, Andreas Dilger, Jan Kara,
    Linux Kernel Mailing List, linux-ext4@vger.kernel.org
Subject: Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps
Message-ID: <20201008021017.GD235506@mit.edu>
In-Reply-To: <20201007201424.GB15049@localhost>
References: <20201005081454.GA493107@localhost> <20201005173639.GA2311765@magnolia>
    <20201006003216.GB6553@localhost> <20201006025110.GJ49559@magnolia>
    <20201006031834.GA5797@mit.edu> <20201006050306.GA8098@localhost>
    <20201006133533.GC5797@mit.edu> <20201007080304.GB1112@localhost>
    <20201007143211.GA235506@mit.edu>
X-Mailing-List: linux-ext4@vger.kernel.org

On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
> That sounds like a conversation that would have been a lot more
> interesting and enjoyable if it hadn't started with "can we shoot it in
> the head", and continued with the notion that anything other than
> e2fsprogs making something to be mounted by mount(2) and handled by
> fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> "how do we make it go away so that only e2fsprogs and the kernel ever
> touch ext4". I started this thread because I'd written some userspace
> code, a new version of the kernel made that userspace code stop working,
> so I wanted to report that the moment I'd discovered that, along with a
> potential way to address it with as little disruption to ext4 as
> possible.

What is really getting my dander up is your attempt to claim that the on-disk file system format is like the userspace/kernel interface, where if we break any file system that was "previously accepted by an older kernel", that is a bug which must be reverted or otherwise fixed so that file systems which had previously worked continue to work.
And this is true even if the file system is ***invalid***. The problem with this is that there have been any number of commits fixing cases where a previously accepted but invalid file system could be made to trigger a syzbot whine, and the fix was to tighten up the validity tests in the kernel. In some cases, I also had to fix up e2fsck to detect the invalid file system which was generated by the file system fuzzer. Yes, it's unfortunate that we didn't have these checks earlier, but a file system has a huge amount of state.

The principle you've articulated would make it impossible for me to fix these bugs, unless I could prove that the failure to check a particular invalid file system corruption could lead to a security vulnerability. (Would it be OK for me to make the kernel more strict and reject an invalid file system if it triggers a WARN_ON, so I get the syzbot complaint, even when it doesn't actually cause a security issue?)

So this conversation would have been a lot more pleasant for *me* if you hadn't tried to elevate your request to a general principle, under which, if someone deliberately generates an invalid file system, I'm not allowed to make the kernel more strict so that it detects said invalidity and rejects the invalid / corrupted / fuzzed file system.

And note that sometimes the security problem only happens when multiple file system corruptions are chained together. So enabling block validity *can* sometimes prevent the fuzzed file system from proceeding further.
Granted, this is less likely in the case of a read-only file system, but it really worries me when there are proprietary programs (maybe your library isn't proprietary, but I note you haven't sent me a link to your git repo, and instead have offered to send sample file systems) which insist on generating their own file systems, which might or might not be valid, and which then expect to receive first-class support as part of an iron-bound contract under which I'm not even allowed to add stronger sanity checks that might reject said invalid file systems in the future.

> The short version is that I needed a library to rapidly turn
> dynamically-obtained data into a set of disk blocks to be served
> on-the-fly as a software-defined disk, and then mounted on the other
> side of that interface by the Linux kernel. Turns out that's *many
> orders of magnitude* faster than any kind of network filesystem like
> NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
> generate and account for and cache, the faster it can run, and
> microseconds matter.

So are you actually trying to dedup data blocks, or are you just trying to avoid needing to track the block allocation bitmaps? And are you writing a single file, or multiple files? Do you know what the maximum size of the file or files will be? Do you need a complex directory structure, or just a single root directory? Can the file system be sparse?

For example, you can do something like this, which puts all of the metadata at the beginning of the file system, so that you can then write to contiguous data blocks.
Add the following to mke2fs.conf:

	[fs_types]
		hugefile = {
			features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
			cluster_size = 32768
			hash_alg = half_md4
			reserved_ratio = 0.0
			num_backup_sb = 0
			packed_meta_blocks = 1
			make_hugefiles = 1
			inode_ratio = 4194304
			hugefiles_dir = /storage
			hugefiles_name = huge-file
			hugefiles_digits = 0
			hugefiles_size = 0
			hugefiles_align = 256M
			hugefiles_align_disk = true
			num_hugefiles = 1
			zero_hugefiles = false
			inode_size = 128
		}
		hugefiles = {
			features = extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
			hash_alg = half_md4
			reserved_ratio = 0.0
			num_backup_sb = 0
			packed_meta_blocks = 1
			make_hugefiles = 1
			inode_ratio = 4194304
			hugefiles_dir = /storage
			hugefiles_name = chunk-
			hugefiles_digits = 5
			hugefiles_size = 4G
			hugefiles_align = 256M
			hugefiles_align_disk = true
			zero_hugefiles = false
			flex_bg_size = 262144
			inode_size = 128
		}

... and then run "mke2fs -T hugefile /tmp/image 1T" or "mke2fs -T hugefiles /tmp/image 1T", and see what you get.

In the case of hugefile, you'll see a single file which covers the entire storage device. Because we are using bigalloc with a large cluster size, this minimizes the number of bitmap blocks. With hugefiles, it will create a set of 4G files to fill the size of the disk, again aligned to 256 MiB zones at the beginning of the disk. In both cases, the file or files are aligned to 256 MiB relative to the beginning of the disk, which can be handy if you are creating the file system on, say, a 14T SMR disk. And this is a niche use case if there ever was one! :-)

So if you had come to the ext4 list with a set of requirements, we might have been able to come up with something which uses the existing file system features, or with something more specific --- and, more importantly, we'd know what the semantics were of the various on-disk file system formats that people are depending upon.
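To put a rough number on the bitmap savings from the bigalloc configuration above, here is a back-of-the-envelope sketch in Python. The layout constants (4 KiB blocks, one bit per allocation unit, one bitmap block holding block_size * 8 bits) are the standard ext4 conventions, but treat the numbers as an illustration rather than the output of mke2fs or dumpe2fs:

```python
# Estimate allocation-bitmap overhead for a 1 TiB ext4 image, comparing
# a plain 4 KiB block size against bigalloc with 32 KiB clusters.
# Illustrative arithmetic only, not output from any ext4 tool.

FS_SIZE = 1 << 40        # 1 TiB image
BLOCK_SIZE = 4096        # ext4 block size
CLUSTER_SIZE = 32768     # bigalloc cluster_size from the config above

def bitmap_blocks(fs_size, block_size, alloc_unit):
    """One bitmap block holds block_size * 8 bits; each bit tracks one
    allocation unit (a block, or a whole cluster with bigalloc)."""
    bits_per_bitmap = block_size * 8
    units = fs_size // alloc_unit
    return -(-units // bits_per_bitmap)   # ceiling division

plain = bitmap_blocks(FS_SIZE, BLOCK_SIZE, BLOCK_SIZE)
bigalloc = bitmap_blocks(FS_SIZE, BLOCK_SIZE, CLUSTER_SIZE)

print(f"4 KiB blocks:    {plain} block bitmap blocks")      # 8192
print(f"32 KiB clusters: {bigalloc} cluster bitmap blocks") # 1024
```

With packed_meta_blocks = 1, those bitmap blocks all land at the start of the image as well, which is what leaves the rest of the device as contiguous data blocks.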
> If at some point I'm looking to make ext4 support more than it already
> does (e.g. a way to omit bitmaps entirely, or a way to express
> contiguous files with smaller extent maps, or other enhancements for
> read-only filesystems),

See above for a way to significantly reduce the number of bitmaps. Adding a way to omit bitmaps entirely would require an INCOMPAT flag, so it might not be worth it.

The way to express contiguous files with smaller extent maps would be to extend the kernel to allow file systems with block_size > page_size to be mounted read-only. This would allow you to create a file system with a block size of 64k, which would reduce the size of the extent maps by a factor of 16, and it wouldn't be all that hard to teach ext4 to support such file systems. (The reason why it would be hard for us to support block sizes > page size in general is dealing with the page cache when writing files while allocating blocks, especially when doing random writes into a sparse file. Read-only would be much easier to support.)

So please, talk to us, and *tell* us what it is you're trying to do before you try to do it. Don't rely on some implementation detail where we're not being sufficiently strict in checking for an invalid file system, especially without telling us in advance, and then try to hold us to the lack of checking forever because it's "breaking things that used to work".

Cheers,

						- Ted
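P.S. A quick back-of-the-envelope on that factor of 16, sketched in Python. It rests on one ext4 constant: an extent's length field is 15 bits, so a single extent maps at most 32768 logical blocks, and the bytes one extent can cover scale directly with the block size. The numbers below are illustrative arithmetic, not the output of any ext4 tool:

```python
# Why a 64 KiB block size shrinks ext4 extent maps by a factor of 16.
# A single ext4 extent maps at most 32768 logical blocks (15-bit
# length field), so larger blocks mean more bytes per extent.

MAX_BLOCKS_PER_EXTENT = 32768   # ext4 extent length limit

def min_extents(file_size, block_size):
    """Minimum number of extents needed to map a fully contiguous file."""
    max_bytes_per_extent = MAX_BLOCKS_PER_EXTENT * block_size
    return -(-file_size // max_bytes_per_extent)   # ceiling division

ONE_TIB = 1 << 40
print(min_extents(ONE_TIB, 4096))    # 8192 extents with 4 KiB blocks
print(min_extents(ONE_TIB, 65536))   # 512 extents with 64 KiB blocks
```

An extent covers at most 128 MiB with 4 KiB blocks but 2 GiB with 64 KiB blocks, hence 16x fewer extents (and correspondingly fewer extent tree blocks) for the same contiguous file.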