From: "Björn Töpel" <bjorn@kernel.org> To: Christian Brauner <brauner@kernel.org> Cc: Andreas Dilger <adilger@dilger.ca>, Al Viro <viro@zeniv.linux.org.uk>, Nam Cao <namcao@linutronix.de>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Jan Kara <jack@suse.cz>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, linux-riscv@lists.infradead.org, Theodore Ts'o <tytso@mit.edu>, Ext4 Developers List <linux-ext4@vger.kernel.org>, Conor Dooley <conor@kernel.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Anders Roxell <anders.roxell@linaro.org> Subject: Re: riscv32 EXT4 splat, 6.8 regression? Date: Mon, 15 Apr 2024 18:04:50 +0200 [thread overview] Message-ID: <87le5e393x.fsf@all.your.base.are.belong.to.us> (raw) In-Reply-To: <20240415-festland-unattraktiv-2b5953a6dbc9@brauner> Christian Brauner <brauner@kernel.org> writes: > On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote: >> Andreas Dilger <adilger@dilger.ca> writes: >> >> > On Apr 13, 2024, at 8:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> >> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote: >> >> >> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is >> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely >> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider >> >>> reserving this page address from allocation to avoid similar issues in >> >>> other parts of the code, as is done with the NULL/0 page address. >> >> >> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit >> >> box. Recall what ERR_PTR() is... >> >> >> >> On any architecture the virtual addresses in range (unsigned long)-512.. >> >> (unsigned long)-1 must never resolve to valid kernel objects. >> >> In other words, any kind of wraparound here is asking for an oops on >> >> attempts to access the elements of buffer - kernel dereference of >> >> (char *)0xfffff000 on a 32bit box is already a bug. >> >> >> >> It might be getting an invalid pointer, but arithmetical overflows >> >> are irrelevant. >> > >> > The original bug report stated that search_buf = 0xfffff000 on entry, >> > and I'd quoted that at the start of my email: >> > >> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <bjorn@kernel.org> wrote: >> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at >> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000 >> >> is a sane address. >> > >> > Now that you mention ERR_PTR() it definitely makes sense that this last >> > page HAS to be excluded. >> > >> > So some other bug is passing the bad pointer to this code before this >> > error, or the arch is not correctly excluding this page from allocation. >> >> Yeah, something is off for sure. >> >> (FWIW, I manage to hit this for Linus' master as well.) >> >> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for >> this BT: >> >> [<c01e8b34>] __filemap_add_folio+0x322/0x508 >> [<c01e8d6e>] filemap_add_folio+0x54/0xce >> [<c01ea076>] __filemap_get_folio+0x156/0x2aa >> [<c02df346>] __getblk_slow+0xcc/0x302 >> [<c02df5f2>] bdev_getblk+0x76/0x7a >> [<c03519da>] ext4_getblk+0xbc/0x2c4 >> [<c0351cc2>] ext4_bread_batch+0x56/0x186 >> [<c036bcaa>] __ext4_find_entry+0x156/0x578 >> [<c036c152>] ext4_lookup+0x86/0x1f4 >> [<c02a3252>] __lookup_slow+0x8e/0x142 >> [<c02a6d70>] walk_component+0x104/0x174 >> [<c02a793c>] path_lookupat+0x78/0x182 >> [<c02a8c7c>] filename_lookup+0x96/0x158 >> [<c02a8d76>] kern_path+0x38/0x56 >> [<c0c1cb7a>] init_mount+0x5c/0xac >> [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a >> [<c0c01cce>] prepare_namespace+0x226/0x27c >> [<c0c011c6>] kernel_init_freeable+0x286/0x2a8 >> [<c0b97ab8>] kernel_init+0x2a/0x156 >> [<c0ba22ca>] ret_from_fork+0xe/0x20 >> >> I get a folio where folio_address(folio) == 0xfffff000 (which is >> broken). >> >> Need to go into the weeds here... > > I don't see anything obvious that could explain this right away. Did you > manage to reproduce this on any other architecture and/or filesystem? > > Fwiw, iirc there were a bunch of fs/buffer.c changes that came in > through the mm/ layer between v6.7 and v6.8 that might also be > interesting. But really I'm poking in the dark currently. Thanks for getting back! Spent some more time one it today. It seems that the buddy allocator *can* return a page with a VA that can wrap (0xfffff000 -- pointed out by Nam and myself). Further, it seems like riscv32 indeed inserts a page like that to the buddy allocator, when the memblock is free'd: | [<c024961c>] __free_one_page+0x2a4/0x3ea | [<c024a448>] __free_pages_ok+0x158/0x3cc | [<c024b1a4>] __free_pages_core+0xe8/0x12c | [<c0c1435a>] memblock_free_pages+0x1a/0x22 | [<c0c17676>] memblock_free_all+0x1ee/0x278 | [<c0c050b0>] mem_init+0x10/0xa4 | [<c0c1447c>] mm_core_init+0x11a/0x2da | [<c0c00bb6>] start_kernel+0x3c4/0x6de Here, a page with VA 0xfffff000 is a added to the freelist. We were just lucky (unlucky?) that page was used for the page cache. A nasty patch like: --8<-- diff --git a/mm/mm_init.c b/mm/mm_init.c index 549e76af8f82..a6a6abbe71b0 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve) void __init memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order) { + if ((long)page_address(page) == 0xfffff000L) { + return; // leak it + } if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) { int nid = early_pfn_to_nid(pfn); --8<-- ...and it's gone. I need to think more about what a proper fix is. Regardless; Christian, Al, and Ted can all relax. ;-) Björn
WARNING: multiple messages have this Message-ID (diff)
From: "Björn Töpel" <bjorn@kernel.org> To: Christian Brauner <brauner@kernel.org> Cc: Andreas Dilger <adilger@dilger.ca>, Al Viro <viro@zeniv.linux.org.uk>, Nam Cao <namcao@linutronix.de>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Jan Kara <jack@suse.cz>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, linux-riscv@lists.infradead.org, Theodore Ts'o <tytso@mit.edu>, Ext4 Developers List <linux-ext4@vger.kernel.org>, Conor Dooley <conor@kernel.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Anders Roxell <anders.roxell@linaro.org> Subject: Re: riscv32 EXT4 splat, 6.8 regression? Date: Mon, 15 Apr 2024 18:04:50 +0200 [thread overview] Message-ID: <87le5e393x.fsf@all.your.base.are.belong.to.us> (raw) In-Reply-To: <20240415-festland-unattraktiv-2b5953a6dbc9@brauner> Christian Brauner <brauner@kernel.org> writes: > On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote: >> Andreas Dilger <adilger@dilger.ca> writes: >> >> > On Apr 13, 2024, at 8:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> >> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote: >> >> >> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is >> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely >> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider >> >>> reserving this page address from allocation to avoid similar issues in >> >>> other parts of the code, as is done with the NULL/0 page address. >> >> >> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit >> >> box. Recall what ERR_PTR() is... >> >> >> >> On any architecture the virtual addresses in range (unsigned long)-512.. >> >> (unsigned long)-1 must never resolve to valid kernel objects. >> >> In other words, any kind of wraparound here is asking for an oops on >> >> attempts to access the elements of buffer - kernel dereference of >> >> (char *)0xfffff000 on a 32bit box is already a bug. >> >> >> >> It might be getting an invalid pointer, but arithmetical overflows >> >> are irrelevant. >> > >> > The original bug report stated that search_buf = 0xfffff000 on entry, >> > and I'd quoted that at the start of my email: >> > >> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <bjorn@kernel.org> wrote: >> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at >> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000 >> >> is a sane address. >> > >> > Now that you mention ERR_PTR() it definitely makes sense that this last >> > page HAS to be excluded. >> > >> > So some other bug is passing the bad pointer to this code before this >> > error, or the arch is not correctly excluding this page from allocation. >> >> Yeah, something is off for sure. >> >> (FWIW, I manage to hit this for Linus' master as well.) >> >> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for >> this BT: >> >> [<c01e8b34>] __filemap_add_folio+0x322/0x508 >> [<c01e8d6e>] filemap_add_folio+0x54/0xce >> [<c01ea076>] __filemap_get_folio+0x156/0x2aa >> [<c02df346>] __getblk_slow+0xcc/0x302 >> [<c02df5f2>] bdev_getblk+0x76/0x7a >> [<c03519da>] ext4_getblk+0xbc/0x2c4 >> [<c0351cc2>] ext4_bread_batch+0x56/0x186 >> [<c036bcaa>] __ext4_find_entry+0x156/0x578 >> [<c036c152>] ext4_lookup+0x86/0x1f4 >> [<c02a3252>] __lookup_slow+0x8e/0x142 >> [<c02a6d70>] walk_component+0x104/0x174 >> [<c02a793c>] path_lookupat+0x78/0x182 >> [<c02a8c7c>] filename_lookup+0x96/0x158 >> [<c02a8d76>] kern_path+0x38/0x56 >> [<c0c1cb7a>] init_mount+0x5c/0xac >> [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a >> [<c0c01cce>] prepare_namespace+0x226/0x27c >> [<c0c011c6>] kernel_init_freeable+0x286/0x2a8 >> [<c0b97ab8>] kernel_init+0x2a/0x156 >> [<c0ba22ca>] ret_from_fork+0xe/0x20 >> >> I get a folio where folio_address(folio) == 0xfffff000 (which is >> broken). >> >> Need to go into the weeds here... > > I don't see anything obvious that could explain this right away. Did you > manage to reproduce this on any other architecture and/or filesystem? > > Fwiw, iirc there were a bunch of fs/buffer.c changes that came in > through the mm/ layer between v6.7 and v6.8 that might also be > interesting. But really I'm poking in the dark currently. Thanks for getting back! Spent some more time one it today. It seems that the buddy allocator *can* return a page with a VA that can wrap (0xfffff000 -- pointed out by Nam and myself). Further, it seems like riscv32 indeed inserts a page like that to the buddy allocator, when the memblock is free'd: | [<c024961c>] __free_one_page+0x2a4/0x3ea | [<c024a448>] __free_pages_ok+0x158/0x3cc | [<c024b1a4>] __free_pages_core+0xe8/0x12c | [<c0c1435a>] memblock_free_pages+0x1a/0x22 | [<c0c17676>] memblock_free_all+0x1ee/0x278 | [<c0c050b0>] mem_init+0x10/0xa4 | [<c0c1447c>] mm_core_init+0x11a/0x2da | [<c0c00bb6>] start_kernel+0x3c4/0x6de Here, a page with VA 0xfffff000 is a added to the freelist. We were just lucky (unlucky?) that page was used for the page cache. A nasty patch like: --8<-- diff --git a/mm/mm_init.c b/mm/mm_init.c index 549e76af8f82..a6a6abbe71b0 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve) void __init memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order) { + if ((long)page_address(page) == 0xfffff000L) { + return; // leak it + } if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) { int nid = early_pfn_to_nid(pfn); --8<-- ...and it's gone. I need to think more about what a proper fix is. Regardless; Christian, Al, and Ted can all relax. ;-) Björn _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv
next prev parent reply other threads:[~2024-04-15 16:04 UTC|newest] Thread overview: 74+ messages / expand[flat|nested] mbox.gz Atom feed top 2024-04-12 14:57 riscv32 EXT4 splat, 6.8 regression? Björn Töpel 2024-04-12 14:57 ` Björn Töpel 2024-04-12 15:43 ` Theodore Ts'o 2024-04-12 15:43 ` Theodore Ts'o 2024-04-12 16:59 ` Björn Töpel 2024-04-12 16:59 ` Björn Töpel 2024-04-13 4:35 ` Theodore Ts'o 2024-04-13 4:35 ` Theodore Ts'o 2024-04-13 10:01 ` Conor Dooley 2024-04-13 10:01 ` Conor Dooley 2024-04-13 14:43 ` Nam Cao 2024-04-13 14:43 ` Nam Cao 2024-04-14 0:24 ` Theodore Ts'o 2024-04-14 0:24 ` Theodore Ts'o 2024-04-14 1:46 ` Andreas Dilger 2024-04-14 1:46 ` Andreas Dilger 2024-04-14 2:04 ` Theodore Ts'o 2024-04-14 2:04 ` Theodore Ts'o 2024-04-14 2:18 ` Al Viro 2024-04-14 2:18 ` Al Viro 2024-04-14 2:15 ` Al Viro 2024-04-14 2:15 ` Al Viro 2024-04-14 4:16 ` Andreas Dilger 2024-04-14 4:16 ` Andreas Dilger 2024-04-14 14:08 ` Björn Töpel 2024-04-14 14:08 ` Björn Töpel 2024-04-15 13:04 ` Christian Brauner 2024-04-15 13:04 ` Christian Brauner 2024-04-15 16:04 ` Björn Töpel [this message] 2024-04-15 16:04 ` Björn Töpel 2024-04-16 6:44 ` Nam Cao 2024-04-16 6:44 ` Nam Cao 2024-04-16 8:25 ` Christian Brauner 2024-04-16 8:25 ` Christian Brauner 2024-04-16 11:02 ` Björn Töpel 2024-04-16 11:02 ` Björn Töpel 2024-04-16 14:24 ` Mike Rapoport 2024-04-16 14:24 ` Mike Rapoport 2024-04-16 15:17 ` Nam Cao 2024-04-16 15:17 ` Nam Cao 2024-04-16 15:30 ` Nam Cao 2024-04-16 15:30 ` Nam Cao 2024-04-16 15:56 ` Björn Töpel 2024-04-16 15:56 ` Björn Töpel 2024-04-16 16:19 ` Nam Cao 2024-04-16 16:19 ` Nam Cao 2024-04-16 16:31 ` Mike Rapoport 2024-04-16 16:31 ` Mike Rapoport 2024-04-16 17:00 ` Matthew Wilcox 2024-04-16 17:00 ` Matthew Wilcox 2024-04-16 18:34 ` Mike Rapoport 2024-04-16 18:34 ` Mike Rapoport 2024-04-16 22:36 ` Nam Cao 2024-04-16 22:36 ` Nam Cao 2024-04-17 15:31 ` Theodore Ts'o 2024-04-17 15:31 ` Theodore Ts'o 2024-04-17 18:06 ` Nam Cao 2024-04-17 18:06 ` Nam Cao 2024-04-17 19:34 ` Mike Rapoport 2024-04-17 19:34 ` Mike Rapoport 2024-04-17 22:09 ` Andreas Dilger 2024-04-17 22:09 ` Andreas Dilger 2024-04-18 9:17 ` Nam Cao 2024-04-18 9:17 ` Nam Cao 2024-04-16 18:05 ` Björn Töpel 2024-04-16 18:05 ` Björn Töpel 2024-04-16 18:09 ` Nam Cao 2024-04-16 18:09 ` Nam Cao 2024-04-16 16:19 ` Mike Rapoport 2024-04-16 16:19 ` Mike Rapoport 2024-04-16 16:31 ` Matthew Wilcox 2024-04-16 16:31 ` Matthew Wilcox 2024-04-16 18:18 ` Mike Rapoport 2024-04-16 18:18 ` Mike Rapoport
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=87le5e393x.fsf@all.your.base.are.belong.to.us \ --to=bjorn@kernel.org \ --cc=adilger@dilger.ca \ --cc=anders.roxell@linaro.org \ --cc=brauner@kernel.org \ --cc=conor@kernel.org \ --cc=jack@suse.cz \ --cc=linux-ext4@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-riscv@lists.infradead.org \ --cc=namcao@linutronix.de \ --cc=tytso@mit.edu \ --cc=viro@zeniv.linux.org.uk \ --cc=willy@infradead.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.