linux-kernel.vger.kernel.org archive mirror
From: Vlastimil Babka <vbabka@suse.cz>
To: Oliver Sang <oliver.sang@intel.com>, Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
	Mike Rapoport <rppt@linux.ibm.com>,
	Christoph Lameter <cl@linux.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h
Date: Fri, 6 Jan 2023 11:13:15 +0100
Message-ID: <3f7fa3b3-9623-5c4c-94b1-a41dea6eaaf2@suse.cz>
In-Reply-To: <Y7Yr3kEkDEd51xns@xsang-OptiPlex-9020>

On 1/5/23 02:46, Oliver Sang wrote:
> hi, Hyeonggon, hi, Vlastimil,
> 
> On Wed, Jan 04, 2023 at 06:04:20PM +0900, Hyeonggon Yoo wrote:
>> On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
>> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
>> > > So the events leading up to this could be something like:
>> > > 
>> > > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
>> > > - despite being on partial list, it's freed ???
>> > > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
>> > > then frees it
>> > > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
>> > > - we try to allocate from the slab page 0x2daee and trip on the PageTail
>> > > 
>> > > Except, the freeing of order-2 page would have reset the PageTail and
>> > > compound_head in 0x2daec, so this is even more complicated or involves some
>> > > extra race?
>> > 
>> > FYI, we ran tests more up to 500 times, then saw different issues but rate is
>> > actually low
>> > 
>> > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
>> > ---------------- ---------------------------
>> >        fail:runs  %reproduction    fail:runs
>> >            |             |             |
>> >            :500         12%          61:500   dmesg.invalid_opcode:#[##]
>> >            :500          3%          14:500   dmesg.kernel_BUG_at_include/linux/mm.h
>> >            :500          3%          17:500   dmesg.kernel_BUG_at_include/linux/page-flags.h
>> >            :500          5%          26:500   dmesg.kernel_BUG_at_lib/list_debug.c
>> >            :500          0%           2:500   dmesg.kernel_BUG_at_mm/page_alloc.c
>> >            :500          0%           2:500   dmesg.kernel_BUG_at_mm/usercopy.c
>> > 
> 
> hi Vlastimil,
> 
> as you mentioned
>> Hm even if rate is low, the different kinds of reports could be useful to
>> see, if all of that is caused by the commit.
> 
> we tried to run tests even more times, but with the config which enable
>     CONFIG_DEBUG_PAGEALLOC
>     CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> (config is attached as
>     config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> the only diff with previous config is
> @@ -5601,7 +5601,8 @@ CONFIG_HAVE_KCSAN_COMPILER=y
>  # Memory Debugging
>  #
>  CONFIG_PAGE_EXTENSION=y
> -# CONFIG_DEBUG_PAGEALLOC is not set
> +CONFIG_DEBUG_PAGEALLOC=y
> +CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
>  CONFIG_PAGE_OWNER=y
>  # CONFIG_PAGE_POISONING is not set
>  CONFIG_DEBUG_PAGE_REF=y
> )
> 
> what we found now is some issues are also reproduced on parent now (still by
> rcutorture tests here), though seems lower rate on parent.
> 
> =========================================================================================
> compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
>   gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
> 
> 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>           8:985         19%         199:990   dmesg.invalid_opcode:#[##]
>            :985          5%          51:990   dmesg.kernel_BUG_at_include/linux/mm.h
>           3:985          4%          41:990   dmesg.kernel_BUG_at_include/linux/page-flags.h
>           4:985         10%         102:990   dmesg.kernel_BUG_at_lib/list_debug.c
>            :985          0%           2:990   dmesg.kernel_BUG_at_mm/page_alloc.c
>           1:985          0%           3:990   dmesg.kernel_BUG_at_mm/usercopy.c
> 
> however, we noticed dmesg.kernel_BUG_at_include/linux/mm.h still have
> relatively high rate on this commit but keeps clean on parent.

Well, that's interesting. If any of these bugs also happen in the parent, it
could mean the commit we suspect only changes the circumstances and creates
conditions that make the bug more likely to trigger - e.g. because it causes
slab pages to always be freed immediately when the last object is freed.

So I would be curious what some of the reports from the parent look like in
detail, and whether the failure rate at the parent (has it increased thanks to
DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
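
To make "sufficient to bisect" a bit more concrete, here is a rough sketch of
how many clean runs one would want before calling a bisect step "good". It
assumes runs are independent and that each run fails with a fixed probability,
which may well not hold for a race like this. The helper names are
hypothetical, and using the parent's 8:985 dmesg.invalid_opcode count as a
stand-in for the overall failure rate is also an assumption.

```python
import math


def failure_rate(fails: int, runs: int) -> float:
    """Observed failure rate from a 'fail:runs' pair in the report table."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return fails / runs


def runs_for_confidence(rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that (1 - rate)**n <= 1 - confidence.

    In other words: if a kernel really fails with probability `rate` per run,
    n clean runs in a row would be seen with probability at most
    1 - confidence. Assumes independent runs with a constant failure rate.
    """
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


if __name__ == "__main__":
    # Parent commit, DEBUG_PAGEALLOC config: 8 failures in 985 runs
    # (dmesg.invalid_opcode row), used here only as an assumed proxy.
    parent_rate = failure_rate(8, 985)
    needed = runs_for_confidence(parent_rate)
    print(f"parent rate ~{parent_rate:.2%}, clean runs wanted per step: {needed}")
```

With a rate under 1%, this comes out at a few hundred runs per bisect step, so
the bisection would be feasible but slow unless the rate can be raised further.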

