From: Vlastimil Babka <vbabka@suse.cz>
To: Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>
Cc: linux-rt-users@vger.kernel.org,
	Mel Gorman <mgorman@techsingularity.net>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [rfc/patch] mm/slub: restore/expand unfreeze_partials() local exclusion scope
Date: Tue, 20 Jul 2021 10:56:48 +0200
Message-ID: <f9935c4c-078c-4b52-5297-64ee22272664@suse.cz>
In-Reply-To: <7431ceb9761c566cf2d1f6f263247acd8d38c4b5.camel@gmx.de>

On 7/15/21 6:34 PM, Mike Galbraith wrote:
> Greetings crickets,
> 
> Methinks the problem is the hole these patches opened only for RT.
> 
> static void put_cpu_partial(struct kmem_cache *s, struct page *page,
> int drain)
> {
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> 	struct page *oldpage;
> 	int pages;
> 	int pobjects;
> 
> 	slub_get_cpu_ptr(s->cpu_slab);
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> That is an assertion that the stuff under it is preempt safe for RT and
> ONLY RT.  My box says the RT portion of that assertion is horse pookey.
> What it does is let kmem_cache_free()/kfree() blast straight through
> ___slab_alloc() critical sections, swapping out ->partial underneath
> them.  Sprinkling percpu fingerprint powder about, I saw lots of
> ___slab_alloc() preempting put_cpu_partial()... and vice versa.  I don't
> think it's a coincidence that ___slab_alloc() and __unfreeze_partials()
> both explode trying to access page->freelist.  You've seen the former;
> here's one of the latter.
> 
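For reference, the RT side of that helper only disables migration, not
preemption; roughly this (going from memory of the series, so treat it
as a sketch rather than the exact tree):

#ifdef CONFIG_PREEMPT_RT
/* RT: pins the task to this CPU's data, but leaves preemption enabled */
#define slub_get_cpu_ptr(var)		\
({					\
	migrate_disable();		\
	this_cpu_ptr(var);		\
})
#define slub_put_cpu_ptr(var)		\
do {					\
	(void)(var);			\
	migrate_enable();		\
} while (0)
#else
/* !RT: get_cpu_ptr() disables preemption for the whole section */
#define slub_get_cpu_ptr(var)	get_cpu_ptr(var)
#define slub_put_cpu_ptr(var)	put_cpu_ptr(var)
#endif
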
> crash> bt -sx
> PID: 18761  TASK: ffff88812fff0000  CPU: 0   COMMAND: "hackbench"
>  #0 [ffff88818f8ff980] machine_kexec+0x14f at ffffffff81051c8f
>  #1 [ffff88818f8ff9c8] __crash_kexec+0xd2 at ffffffff8111ef72
>  #2 [ffff88818f8ffa88] crash_kexec+0x30 at ffffffff8111fd10
>  #3 [ffff88818f8ffa98] oops_end+0xd3 at ffffffff810267e3
>  #4 [ffff88818f8ffab8] exc_general_protection+0x195 at ffffffff8179fdb5
>  #5 [ffff88818f8ffb50] asm_exc_general_protection+0x1e at ffffffff81800a0e
>     [exception RIP: __unfreeze_partials+156]

Hm, going back to this report: if that's a direct result of __slab_alloc()
preempting (interrupting) put_cpu_partial(), then I think that's
something that could happen even on !RT, as all we do in
put_cpu_partial() there is disable preemption, and the allocation could
come in an irq handler.
It would mean the whole idea of "mm, slub: detach percpu partial list in
unfreeze_partials() using this_cpu_cmpxchg()" isn't safe. I'll have to
ponder it.
But we haven't seen crashes on !RT yet. So could it be that it was still
put_cpu_partial() preempting __slab_alloc(), messing up the partial list,
but for some reason the put_cpu_partial() side crashed this time?
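
To make the suspected window concrete, here is a sketch of one
direction of the interleaving Mike describes (the reverse direction is
analogous); this is my reading of the report, not verified against this
exact tree:

/*
 * CPU0, task A:  ___slab_alloc()
 *   works on a page it expects to stay on c->partial
 *     ...preempted on RT (or interrupted by an irq allocation on !RT)...
 *
 * CPU0, task B (or irq):  kfree() -> put_cpu_partial()
 *   oldpage = this_cpu_read(s->cpu_slab->partial);
 *   page->next = oldpage;
 *   this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page);
 *     -> c->partial swapped out underneath task A
 *
 * task A resumes; its page may meanwhile have been detached by
 * unfreeze_partials()' this_cpu_cmpxchg() and freed, so the
 * cmpxchg_double() on page->freelist hits a stale struct page.
 */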

>     RIP: ffffffff81248bac  RSP: ffff88818f8ffc00  RFLAGS: 00010202
>     RAX: 0000000000200002  RBX: 0000000000200002  RCX: 000000017fffffff
>     RDX: 00000001ffffffff  RSI: 0000000000000202  RDI: ffff888100040b80
>     RBP: ffff88818f8ffca0   R8: ffff88818f9cba30   R9: 0000000000000001
>     R10: ffff88818f8ffcc0  R11: 0000000000000000  R12: ffff888100043700
>     R13: ffff888100040b80  R14: 00000001ffffffff  R15: ffffea00046c0808
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff88818f8ffcb8] put_cpu_partial+0x8e at ffffffff81248ece
>  #7 [ffff88818f8ffcd8] consume_skb+0x2b at ffffffff815eddeb
>  #8 [ffff88818f8ffce8] unix_stream_read_generic+0x788 at ffffffff8170b238
>  #9 [ffff88818f8ffdc0] unix_stream_recvmsg+0x43 at ffffffff8170b433
> #10 [ffff88818f8ffdf8] sock_read_iter+0x104 at ffffffff815dd554
> #11 [ffff88818f8ffe68] new_sync_read+0x16a at ffffffff812674fa
> #12 [ffff88818f8ffee8] vfs_read+0x1ae at ffffffff81269c8e
> #13 [ffff88818f8fff00] ksys_read+0x40 at ffffffff8126a070
> #14 [ffff88818f8fff38] do_syscall_64+0x37 at ffffffff8179f5e7
> #15 [ffff88818f8fff50] entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8180007c
>     RIP: 00007fbda4438f2e  RSP: 00007fff3bf9d798  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: 00007fff3bf9e7a0  RCX: 00007fbda4438f2e
>     RDX: 0000000000001000  RSI: 00007fff3bf9d7a0  RDI: 0000000000000007
>     RBP: 00007fff3bf9e800   R8: 00007fff3bf9e6e0   R9: 00007fbda4641010
>     R10: 00007fbda4428028  R11: 0000000000000246  R12: 0000000000001000
> crash> dis __unfreeze_partials+156
> 0xffffffff81248bac <__unfreeze_partials+156>:   lock cmpxchg16b 0x20(%r15)
> crash> gdb list *__unfreeze_partials+156
> 0xffffffff81248bac is in __unfreeze_partials (mm/slub.c:475).
> 470             if (!disable_irqs)
> 471                     lockdep_assert_irqs_disabled();
> 472     #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
> 473         defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> 474             if (s->flags & __CMPXCHG_DOUBLE) {
> 475                     if (cmpxchg_double(&page->freelist, &page->counters,
> 476                                        freelist_old, counters_old,
> 477                                        freelist_new, counters_new))
> 478                             return true;
> 479             } else
> crash> kmem ffffea00046c0808
>       PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
> ffffea00104b3000 412cc0000                0        c  2 8000000000003000 reserved,private
> crash>
> 
> Regarding $subject, lockdep thinks my hole-plugging skills suck rocks
> (shrug, I've given it plentiful cause), but that's ok, I sometimes
> think the same of its bug reporting skills :)
> 
> [    2.459456] ============================================
> [    2.459456] WARNING: possible recursive locking detected
> [    2.459457] 5.14.0.g79e92006-tip-rt #83 Tainted: G            E
> [    2.459458] --------------------------------------------
> [    2.459458] rcuc/7/56 is trying to acquire lock:
> [    2.459459] ffff88840edf01c0 (&(&c->lock)->lock){+.+.}-{0:0}, at: kfree+0x280/0x670
> [    2.459466]
>                but task is already holding lock:
> [    2.459466] ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600
> [    2.459469]
> 
> live kernel snooping....
> 
> crash> ps | grep UN
> crash> bt -sx 56
> PID: 56     TASK: ffff888100c19a40  CPU: 7   COMMAND: "rcuc/7"
>  #0 [ffff888100c63e40] __schedule+0x2eb at ffffffff818969fb
>  #1 [ffff888100c63ec8] schedule+0x3b at ffffffff8189723b
>  #2 [ffff888100c63ee0] smpboot_thread_fn+0x18c at ffffffff810a90dc
>  #3 [ffff888100c63f18] kthread+0x1af at ffffffff810a27bf
>  #4 [ffff888100c63f50] ret_from_fork+0x1f at ffffffff81001cbf
> crash> task_struct ffff888100c19a40 | grep __state
>   __state = 1,
> crash> gdb list *__slab_free+0x4d6
> 0xffffffff812923c6 is in __slab_free (mm/slub.c:2568).
> 2563                                    /*
> 2564                                     * partial array is full. Move the existing
> 2565                                     * set to the per node partial list.
> 2566                                     */
> 2567                                    local_lock(&s->cpu_slab->lock);
> 2568                                    unfreeze_partials(s);
> 2569                                    local_unlock(&s->cpu_slab->lock);
> 2570                                    oldpage = NULL;
> 2571                                    pobjects = 0;
> 2572                                    pages = 0;
> crash> gdb list *kfree+0x280
> 0xffffffff81292770 is in kfree (mm/slub.c:3442).
> 3437                     * __slab_free() as that wouldn't use the cpu freelist at all.
> 3438                     */
> 3439                    void **freelist;
> 3440
> 3441                    local_lock(&s->cpu_slab->lock);
> 3442                    c = this_cpu_ptr(s->cpu_slab);
> 3443                    if (unlikely(page != c->page)) {
> 3444                            local_unlock(&s->cpu_slab->lock);
> 3445                            goto redo;
> 3446                    }
> crash>
> 
> Check, those are what's in the stacktrace below... but the allegedly
> deadlocked-for-real task is very much alive, as is the rest of the box.
> 
>                other info that might help us debug this:
> [    2.459470]  Possible unsafe locking scenario:
> 
> [    2.459470]        CPU0
> [    2.459470]        ----
> [    2.459471]   lock(&(&c->lock)->lock);
> [    2.459471]   lock(&(&c->lock)->lock);
> [    2.459472]
>                 *** DEADLOCK ***
> 
> [    2.459472]  May be due to missing lock nesting notation
> 
> [    2.459472] 6 locks held by rcuc/7/56:
> [    2.459473]  #0: ffff88840edd9820 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0xc3/0x190
> [    2.459479]  #1: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0
> [    2.459484]  #2: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0xa0/0x190
> [    2.459487]  #3: ffffffff82382ac0 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x195/0x520
> [    2.459490]  #4: ffff88840edf4d60 (&(&c->lock)->lock){+.+.}-{0:0}, at: __slab_free+0x4d6/0x600
> [    2.459493]  #5: ffffffff82382b80 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xd0
> [    2.459496]
>                stack backtrace:
> [    2.459497] CPU: 7 PID: 56 Comm: rcuc/7 Tainted: G            E     5.14.0.g79e92006-tip-rt #83
> [    2.459498] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [    2.459499] Call Trace:
> [    2.459501]  dump_stack_lvl+0x44/0x57
> [    2.459504]  __lock_acquire+0xb7f/0x1ab0
> [    2.459508]  lock_acquire+0x2a6/0x340
> [    2.459510]  ? kfree+0x280/0x670
> [    2.459513]  ? __free_slab+0xa4/0x1b0
> [    2.459515]  rt_spin_lock+0x2a/0xd0
> [    2.459516]  ? kfree+0x280/0x670                   somewhere in the multiverse lies a dead rcuc/7
> [    2.459518]  kfree+0x280/0x670                 <== local_lock(&s->cpu_slab->lock); #2
> [    2.459521]  __free_slab+0xa4/0x1b0            ==> kfree(page_objcgs(page))
> [    2.459523]  __unfreeze_partials+0x1d8/0x330   ==> discard_slab(s, page);
> [    2.459526]  ? _raw_spin_unlock_irqrestore+0x30/0x80
> [    2.459530]  ? __slab_free+0x4de/0x600
> [    2.459532]  __slab_free+0x4de/0x600           <== local_lock(&s->cpu_slab->lock); #1
> [    2.459534]  ? find_held_lock+0x2d/0x90
> [    2.459536]  ? kmem_cache_free+0x276/0x630
> [    2.459539]  ? rcu_do_batch+0x1c3/0x520
> [    2.459540]  ? kmem_cache_free+0x364/0x630
> [    2.459542]  kmem_cache_free+0x364/0x630
> [    2.459544]  ? rcu_do_batch+0x195/0x520
> [    2.459546]  rcu_do_batch+0x1c3/0x520
> [    2.459547]  ? rcu_do_batch+0x195/0x520
> [    2.459550]  ? rcu_cpu_kthread+0x7e/0x4b0
> [    2.459552]  ? rcu_cpu_kthread+0x57/0x4b0
> [    2.459553]  rcu_core+0x2c3/0x590
> [    2.459555]  ? rcu_cpu_kthread+0x78/0x4b0
> [    2.459557]  ? rcu_cpu_kthread+0x7e/0x4b0
> [    2.459558]  ? rcu_cpu_kthread+0x57/0x4b0
> [    2.459560]  rcu_cpu_kthread+0xc2/0x4b0
> [    2.459562]  ? smpboot_thread_fn+0x23/0x300
> [    2.459565]  smpboot_thread_fn+0x249/0x300
> [    2.459567]  ? smpboot_register_percpu_thread+0xe0/0xe0
> [    2.459569]  kthread+0x1af/0x1d0
> [    2.459571]  ? set_kthread_struct+0x40/0x40
> [    2.459574]  ret_from_fork+0x1f/0x30
> 
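The lockdep report itself looks like a class-level complaint rather
than a real deadlock: the two acquisitions are different percpu
instances of the same lock class (note the addresses: ffff88840edf4d60
held vs. ffff88840edf01c0 being acquired). The chain, as your
annotations already show, would be roughly:

/*
 * __slab_free()                           // cache A
 *   local_lock(&s->cpu_slab->lock);       // instance ffff88840edf4d60
 *   unfreeze_partials()
 *     __unfreeze_partials() -> discard_slab() -> __free_slab()
 *       kfree(page_objcgs(page))          // cache B, presumably a kmalloc-* cache
 *         local_lock(&s->cpu_slab->lock); // instance ffff88840edf01c0
 */

Same class nested without a nesting annotation is exactly lockdep's
"May be due to missing lock nesting notation" case, which would fit the
box staying alive.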


