All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Björn Töpel" <bjorn@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Andreas Dilger <adilger@dilger.ca>,
	Al Viro <viro@zeniv.linux.org.uk>, Nam Cao <namcao@linutronix.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-riscv@lists.infradead.org, Theodore Ts'o <tytso@mit.edu>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	Conor Dooley <conor@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Anders Roxell <anders.roxell@linaro.org>
Subject: Re: riscv32 EXT4 splat, 6.8 regression?
Date: Mon, 15 Apr 2024 18:04:50 +0200	[thread overview]
Message-ID: <87le5e393x.fsf@all.your.base.are.belong.to.us> (raw)
In-Reply-To: <20240415-festland-unattraktiv-2b5953a6dbc9@brauner>

Christian Brauner <brauner@kernel.org> writes:

> On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
>> Andreas Dilger <adilger@dilger.ca> writes:
>> 
>> > On Apr 13, 2024, at 8:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >> 
>> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>> >> 
>> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>> >>> to have 4GB+ of RAM and want to use it all.  The riscv32 might consider
>> >>> reserving this page address from allocation to avoid similar issues in
>> >>> other parts of the code, as is done with the NULL/0 page address.
>> >> 
>> >> Not a chance.  *Any* page mapped there is a serious bug on any 32bit
>> >> box.  Recall what ERR_PTR() is...
>> >> 
>> >> On any architecture the virtual addresses in range (unsigned long)-512..
>> >> (unsigned long)-1 must never resolve to valid kernel objects.
>> >> In other words, any kind of wraparound here is asking for an oops on
>> >> attempts to access the elements of buffer - kernel dereference of
>> >> (char *)0xfffff000 on a 32bit box is already a bug.
>> >> 
>> >> It might be getting an invalid pointer, but arithmetical overflows
>> >> are irrelevant.
>> >
>> > The original bug report stated that search_buf = 0xfffff000 on entry,
>> > and I'd quoted that at the start of my email:
>> >
>> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <bjorn@kernel.org> wrote:
>> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
>> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
>> >> is a sane address.
>> >
>> > Now that you mention ERR_PTR() it definitely makes sense that this last
>> > page HAS to be excluded.
>> >
>> > So some other bug is passing the bad pointer to this code before this
>> > error, or the arch is not correctly excluding this page from allocation.
>> 
>> Yeah, something is off for sure.
>> 
>> (FWIW, I manage to hit this for Linus' master as well.)
>> 
>> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
>> this BT:
>> 
>>   [<c01e8b34>] __filemap_add_folio+0x322/0x508
>>   [<c01e8d6e>] filemap_add_folio+0x54/0xce
>>   [<c01ea076>] __filemap_get_folio+0x156/0x2aa
>>   [<c02df346>] __getblk_slow+0xcc/0x302
>>   [<c02df5f2>] bdev_getblk+0x76/0x7a
>>   [<c03519da>] ext4_getblk+0xbc/0x2c4
>>   [<c0351cc2>] ext4_bread_batch+0x56/0x186
>>   [<c036bcaa>] __ext4_find_entry+0x156/0x578
>>   [<c036c152>] ext4_lookup+0x86/0x1f4
>>   [<c02a3252>] __lookup_slow+0x8e/0x142
>>   [<c02a6d70>] walk_component+0x104/0x174
>>   [<c02a793c>] path_lookupat+0x78/0x182
>>   [<c02a8c7c>] filename_lookup+0x96/0x158
>>   [<c02a8d76>] kern_path+0x38/0x56
>>   [<c0c1cb7a>] init_mount+0x5c/0xac
>>   [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
>>   [<c0c01cce>] prepare_namespace+0x226/0x27c
>>   [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
>>   [<c0b97ab8>] kernel_init+0x2a/0x156
>>   [<c0ba22ca>] ret_from_fork+0xe/0x20
>> 
>> I get a folio where folio_address(folio) == 0xfffff000 (which is
>> broken).
>> 
>> Need to go into the weeds here...
>
> I don't see anything obvious that could explain this right away. Did you
> manage to reproduce this on any other architecture and/or filesystem?
>
> Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> through the mm/ layer between v6.7 and v6.8 that might also be
> interesting. But really I'm poking in the dark currently.

Thanks for getting back! Spent some more time one it today.

It seems that the buddy allocator *can* return a page with a VA that can
wrap (0xfffff000 -- pointed out by Nam and myself).

Further, it seems like riscv32 indeed inserts a page like that to the
buddy allocator, when the memblock is free'd:

  | [<c024961c>] __free_one_page+0x2a4/0x3ea
  | [<c024a448>] __free_pages_ok+0x158/0x3cc
  | [<c024b1a4>] __free_pages_core+0xe8/0x12c
  | [<c0c1435a>] memblock_free_pages+0x1a/0x22
  | [<c0c17676>] memblock_free_all+0x1ee/0x278
  | [<c0c050b0>] mem_init+0x10/0xa4
  | [<c0c1447c>] mm_core_init+0x11a/0x2da
  | [<c0c00bb6>] start_kernel+0x3c4/0x6de

Here, a page with VA 0xfffff000 is a added to the freelist. We were just
lucky (unlucky?) that page was used for the page cache.

A nasty patch like:
--8<--
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..a6a6abbe71b0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
 							unsigned int order)
 {
+	if ((long)page_address(page) == 0xfffff000L) {
+		return; // leak it
+	}
 
 	if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
 		int nid = early_pfn_to_nid(pfn);
--8<--

...and it's gone.

I need to think more about what a proper fix is. Regardless; Christian,
Al, and Ted can all relax. ;-)


Björn

WARNING: multiple messages have this Message-ID (diff)
From: "Björn Töpel" <bjorn@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Andreas Dilger <adilger@dilger.ca>,
	Al Viro <viro@zeniv.linux.org.uk>, Nam Cao <namcao@linutronix.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-riscv@lists.infradead.org, Theodore Ts'o <tytso@mit.edu>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	Conor Dooley <conor@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Anders Roxell <anders.roxell@linaro.org>
Subject: Re: riscv32 EXT4 splat, 6.8 regression?
Date: Mon, 15 Apr 2024 18:04:50 +0200	[thread overview]
Message-ID: <87le5e393x.fsf@all.your.base.are.belong.to.us> (raw)
In-Reply-To: <20240415-festland-unattraktiv-2b5953a6dbc9@brauner>

Christian Brauner <brauner@kernel.org> writes:

> On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
>> Andreas Dilger <adilger@dilger.ca> writes:
>> 
>> > On Apr 13, 2024, at 8:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >> 
>> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>> >> 
>> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>> >>> to have 4GB+ of RAM and want to use it all.  The riscv32 might consider
>> >>> reserving this page address from allocation to avoid similar issues in
>> >>> other parts of the code, as is done with the NULL/0 page address.
>> >> 
>> >> Not a chance.  *Any* page mapped there is a serious bug on any 32bit
>> >> box.  Recall what ERR_PTR() is...
>> >> 
>> >> On any architecture the virtual addresses in range (unsigned long)-512..
>> >> (unsigned long)-1 must never resolve to valid kernel objects.
>> >> In other words, any kind of wraparound here is asking for an oops on
>> >> attempts to access the elements of buffer - kernel dereference of
>> >> (char *)0xfffff000 on a 32bit box is already a bug.
>> >> 
>> >> It might be getting an invalid pointer, but arithmetical overflows
>> >> are irrelevant.
>> >
>> > The original bug report stated that search_buf = 0xfffff000 on entry,
>> > and I'd quoted that at the start of my email:
>> >
>> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <bjorn@kernel.org> wrote:
>> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
>> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
>> >> is a sane address.
>> >
>> > Now that you mention ERR_PTR() it definitely makes sense that this last
>> > page HAS to be excluded.
>> >
>> > So some other bug is passing the bad pointer to this code before this
>> > error, or the arch is not correctly excluding this page from allocation.
>> 
>> Yeah, something is off for sure.
>> 
>> (FWIW, I manage to hit this for Linus' master as well.)
>> 
>> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
>> this BT:
>> 
>>   [<c01e8b34>] __filemap_add_folio+0x322/0x508
>>   [<c01e8d6e>] filemap_add_folio+0x54/0xce
>>   [<c01ea076>] __filemap_get_folio+0x156/0x2aa
>>   [<c02df346>] __getblk_slow+0xcc/0x302
>>   [<c02df5f2>] bdev_getblk+0x76/0x7a
>>   [<c03519da>] ext4_getblk+0xbc/0x2c4
>>   [<c0351cc2>] ext4_bread_batch+0x56/0x186
>>   [<c036bcaa>] __ext4_find_entry+0x156/0x578
>>   [<c036c152>] ext4_lookup+0x86/0x1f4
>>   [<c02a3252>] __lookup_slow+0x8e/0x142
>>   [<c02a6d70>] walk_component+0x104/0x174
>>   [<c02a793c>] path_lookupat+0x78/0x182
>>   [<c02a8c7c>] filename_lookup+0x96/0x158
>>   [<c02a8d76>] kern_path+0x38/0x56
>>   [<c0c1cb7a>] init_mount+0x5c/0xac
>>   [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
>>   [<c0c01cce>] prepare_namespace+0x226/0x27c
>>   [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
>>   [<c0b97ab8>] kernel_init+0x2a/0x156
>>   [<c0ba22ca>] ret_from_fork+0xe/0x20
>> 
>> I get a folio where folio_address(folio) == 0xfffff000 (which is
>> broken).
>> 
>> Need to go into the weeds here...
>
> I don't see anything obvious that could explain this right away. Did you
> manage to reproduce this on any other architecture and/or filesystem?
>
> Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> through the mm/ layer between v6.7 and v6.8 that might also be
> interesting. But really I'm poking in the dark currently.

Thanks for getting back! Spent some more time one it today.

It seems that the buddy allocator *can* return a page with a VA that can
wrap (0xfffff000 -- pointed out by Nam and myself).

Further, it seems like riscv32 indeed inserts a page like that to the
buddy allocator, when the memblock is free'd:

  | [<c024961c>] __free_one_page+0x2a4/0x3ea
  | [<c024a448>] __free_pages_ok+0x158/0x3cc
  | [<c024b1a4>] __free_pages_core+0xe8/0x12c
  | [<c0c1435a>] memblock_free_pages+0x1a/0x22
  | [<c0c17676>] memblock_free_all+0x1ee/0x278
  | [<c0c050b0>] mem_init+0x10/0xa4
  | [<c0c1447c>] mm_core_init+0x11a/0x2da
  | [<c0c00bb6>] start_kernel+0x3c4/0x6de

Here, a page with VA 0xfffff000 is a added to the freelist. We were just
lucky (unlucky?) that page was used for the page cache.

A nasty patch like:
--8<--
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..a6a6abbe71b0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
 							unsigned int order)
 {
+	if ((long)page_address(page) == 0xfffff000L) {
+		return; // leak it
+	}
 
 	if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
 		int nid = early_pfn_to_nid(pfn);
--8<--

...and it's gone.

I need to think more about what a proper fix is. Regardless; Christian,
Al, and Ted can all relax. ;-)


Björn

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

  reply	other threads:[~2024-04-15 16:04 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-12 14:57 riscv32 EXT4 splat, 6.8 regression? Björn Töpel
2024-04-12 14:57 ` Björn Töpel
2024-04-12 15:43 ` Theodore Ts'o
2024-04-12 15:43   ` Theodore Ts'o
2024-04-12 16:59   ` Björn Töpel
2024-04-12 16:59     ` Björn Töpel
2024-04-13  4:35     ` Theodore Ts'o
2024-04-13  4:35       ` Theodore Ts'o
2024-04-13 10:01       ` Conor Dooley
2024-04-13 10:01         ` Conor Dooley
2024-04-13 14:43 ` Nam Cao
2024-04-13 14:43   ` Nam Cao
2024-04-14  0:24   ` Theodore Ts'o
2024-04-14  0:24     ` Theodore Ts'o
2024-04-14  1:46   ` Andreas Dilger
2024-04-14  1:46     ` Andreas Dilger
2024-04-14  2:04     ` Theodore Ts'o
2024-04-14  2:04       ` Theodore Ts'o
2024-04-14  2:18       ` Al Viro
2024-04-14  2:18         ` Al Viro
2024-04-14  2:15     ` Al Viro
2024-04-14  2:15       ` Al Viro
2024-04-14  4:16       ` Andreas Dilger
2024-04-14  4:16         ` Andreas Dilger
2024-04-14 14:08         ` Björn Töpel
2024-04-14 14:08           ` Björn Töpel
2024-04-15 13:04           ` Christian Brauner
2024-04-15 13:04             ` Christian Brauner
2024-04-15 16:04             ` Björn Töpel [this message]
2024-04-15 16:04               ` Björn Töpel
2024-04-16  6:44               ` Nam Cao
2024-04-16  6:44                 ` Nam Cao
2024-04-16  8:25                 ` Christian Brauner
2024-04-16  8:25                   ` Christian Brauner
2024-04-16 11:02                   ` Björn Töpel
2024-04-16 11:02                     ` Björn Töpel
2024-04-16 14:24                     ` Mike Rapoport
2024-04-16 14:24                       ` Mike Rapoport
2024-04-16 15:17                       ` Nam Cao
2024-04-16 15:17                         ` Nam Cao
2024-04-16 15:30                         ` Nam Cao
2024-04-16 15:30                           ` Nam Cao
2024-04-16 15:56                           ` Björn Töpel
2024-04-16 15:56                             ` Björn Töpel
2024-04-16 16:19                             ` Nam Cao
2024-04-16 16:19                               ` Nam Cao
2024-04-16 16:31                               ` Mike Rapoport
2024-04-16 16:31                                 ` Mike Rapoport
2024-04-16 17:00                                 ` Matthew Wilcox
2024-04-16 17:00                                   ` Matthew Wilcox
2024-04-16 18:34                                   ` Mike Rapoport
2024-04-16 18:34                                     ` Mike Rapoport
2024-04-16 22:36                                     ` Nam Cao
2024-04-16 22:36                                       ` Nam Cao
2024-04-17 15:31                                       ` Theodore Ts'o
2024-04-17 15:31                                         ` Theodore Ts'o
2024-04-17 18:06                                         ` Nam Cao
2024-04-17 18:06                                           ` Nam Cao
2024-04-17 19:34                                       ` Mike Rapoport
2024-04-17 19:34                                         ` Mike Rapoport
2024-04-17 22:09                                       ` Andreas Dilger
2024-04-17 22:09                                         ` Andreas Dilger
2024-04-18  9:17                                         ` Nam Cao
2024-04-18  9:17                                           ` Nam Cao
2024-04-16 18:05                               ` Björn Töpel
2024-04-16 18:05                                 ` Björn Töpel
2024-04-16 18:09                                 ` Nam Cao
2024-04-16 18:09                                   ` Nam Cao
2024-04-16 16:19                         ` Mike Rapoport
2024-04-16 16:19                           ` Mike Rapoport
2024-04-16 16:31                           ` Matthew Wilcox
2024-04-16 16:31                             ` Matthew Wilcox
2024-04-16 18:18                             ` Mike Rapoport
2024-04-16 18:18                               ` Mike Rapoport

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87le5e393x.fsf@all.your.base.are.belong.to.us \
    --to=bjorn@kernel.org \
    --cc=adilger@dilger.ca \
    --cc=anders.roxell@linaro.org \
    --cc=brauner@kernel.org \
    --cc=conor@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=namcao@linutronix.de \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.