* Soft lockup in __slab_free (SLUB)
@ 2016-09-26  7:46 Nikolay Borisov
  2016-09-28  5:31 ` Joonsoo Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-09-26  7:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Linux MM, brouer, js1304

Hello, 

On the 4.4.14 stable kernel I observed the following soft lockup. I also
checked that the code is the same in 4.8-rc, so the problem is
present there as well:

[434575.862377] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [swapper/13:0]
[434575.866352] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           O    4.4.14-clouder5 #2
[434575.866643] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
[434575.866932] task: ffff8803714aadc0 ti: ffff8803714c4000 task.ti: ffff8803714c4000
[434575.867221] RIP: 0010:[<ffffffff81613f4c>]  [<ffffffff81613f4c>] _raw_spin_unlock_irqrestore+0x1c/0x30
[434575.867566] RSP: 0018:ffff880373ce3dc0  EFLAGS: 00000203
[434575.867736] RAX: ffff88066e0c9a40 RBX: 0000000000000203 RCX: 0000000000000000
[434575.868023] RDX: 0000000000000008 RSI: 0000000000000203 RDI: ffff88066e0c9a40
[434575.868311] RBP: ffff880373ce3dc8 R08: ffff8803e5c1d118 R09: ffff8803e5c1d538
[434575.868609] R10: 0000000000000000 R11: ffffea000f970600 R12: ffff88066e0c9a40
[434575.868895] R13: ffffea000f970600 R14: 000000000046cf3b R15: ffff88036f8e3200
[434575.869183] FS:  0000000000000000(0000) GS:ffff880373ce0000(0000) knlGS:0000000000000000
[434575.869472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[434575.869643] CR2: ffffffffff600400 CR3: 0000000367201000 CR4: 00000000001406e0
[434575.869931] Stack:
[434575.870095]  ffff88066e0c9a40 ffff880373ce3e78 ffffffff8117ea8a ffff880373ce3e08
[434575.870567]  000000000046bd03 0000000100170017 ffff8803e5c1d118 ffff8803e5c1d118
[434575.871037]  00ff000100000000 0000000000000203 0000000000000000 ffffffff8123d9ac
[434575.874253] Call Trace:
[434575.874418]  <IRQ> 
[434575.874473]  [<ffffffff8117ea8a>] __slab_free+0xca/0x290
[434575.874806]  [<ffffffff8123d9ac>] ? ext4_i_callback+0x1c/0x20
[434575.874978]  [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
[434575.875149]  [<ffffffff8123d9ac>] ext4_i_callback+0x1c/0x20
[434575.875325]  [<ffffffff810ad09b>] rcu_process_callbacks+0x21b/0x620
[434575.875506]  [<ffffffff81057337>] __do_softirq+0x147/0x310
[434575.875680]  [<ffffffff8105764f>] irq_exit+0x5f/0x70
[434575.875851]  [<ffffffff81616a82>] smp_apic_timer_interrupt+0x42/0x50
[434575.876025]  [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
[434575.876197]  <EOI> 
[434575.876250]  [<ffffffff81510601>] ? cpuidle_enter_state+0x141/0x2c0
[434575.876583]  [<ffffffff815105f6>] ? cpuidle_enter_state+0x136/0x2c0
[434575.876755]  [<ffffffff815107b7>] cpuidle_enter+0x17/0x20
[434575.876929]  [<ffffffff810949fc>] cpu_startup_entry+0x2fc/0x360
[434575.877105]  [<ffffffff810330e3>] start_secondary+0xf3/0x100

The IP in __slab_free points to this piece of code (in mm/slub.c):

if (unlikely(n)) {
	spin_unlock_irqrestore(&n->list_lock, flags);
	n = NULL;
}

I think it's pure chance that the spin_unlock_irqrestore is the instruction shown
in this trace. Do you think a cond_resched is needed in this unlikely if clause?
Apparently there are cases where this loop can take a considerable amount of time.
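
For reference, the enclosing retry loop looks roughly like this (my paraphrase
of the 4.4-era mm/slub.c, with most of the body and several branches elided,
so don't read it as the verbatim source):

do {
	if (unlikely(n)) {
		/*
		 * A previous iteration took the node's list_lock but the
		 * cmpxchg below failed; drop the lock before retrying.
		 */
		spin_unlock_irqrestore(&n->list_lock, flags);
		n = NULL;
	}
	prior = page->freelist;
	counters = page->counters;
	/*
	 * ... compute the new freelist and counters, taking
	 * n->list_lock when list manipulation will be needed ...
	 */
} while (!cmpxchg_double_slab(s, page,
			      prior, counters,
			      head, new.counters,
			      "__slab_free"));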

How about this patch: 

diff --git a/mm/slub.c b/mm/slub.c
index 65d5f92d51d2..daa20f38770a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2654,6 +2654,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
                if (unlikely(n)) {
                        spin_unlock_irqrestore(&n->list_lock, flags);
                        n = NULL;
+                       cond_resched();
                }
                prior = page->freelist;
                counters = page->counters;

Regards,
Nikolay 



* Re: Soft lockup in __slab_free (SLUB)
  2016-09-26  7:46 Soft lockup in __slab_free (SLUB) Nikolay Borisov
@ 2016-09-28  5:31 ` Joonsoo Kim
  2016-09-28  7:15   ` Nikolay Borisov
  2016-10-04 14:58   ` Nikolay Borisov
  0 siblings, 2 replies; 17+ messages in thread
From: Joonsoo Kim @ 2016-09-28  5:31 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Christoph Lameter, Linux MM, brouer, Paul E. McKenney

Hello,

Ccing Paul, because it looks like an RCU problem.

On Mon, Sep 26, 2016 at 10:46:56AM +0300, Nikolay Borisov wrote:
> [SNIP]
>
> The IP in __slab_free points to this piece of code (in mm/slub.c):
>
> if (unlikely(n)) {
> 	spin_unlock_irqrestore(&n->list_lock, flags);
> 	n = NULL;
> }
>
> I think it's pure chance that the spin_unlock_irqrestore is the instruction shown
> in this trace. Do you think a cond_resched is needed in this unlikely if clause?
> Apparently there are cases where this loop can take a considerable amount of time.

I think that __slab_free() doesn't take too long even if there is
lock contention. And is cond_resched() even valid in softirq context?

I think the problem could be that too many RCU callbacks are
executed without scheduling. Paul?

Thanks.

> [SNIP]

* Re: Soft lockup in __slab_free (SLUB)
  2016-09-28  5:31 ` Joonsoo Kim
@ 2016-09-28  7:15   ` Nikolay Borisov
  2016-09-28 11:12     ` Paul E. McKenney
  2016-09-29  1:40     ` Joonsoo Kim
  2016-10-04 14:58   ` Nikolay Borisov
  1 sibling, 2 replies; 17+ messages in thread
From: Nikolay Borisov @ 2016-09-28  7:15 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Christoph Lameter, Linux MM, brouer, Paul E. McKenney



On 09/28/2016 08:31 AM, Joonsoo Kim wrote:
> Hello,
> 
> Ccing Paul, because it looks like an RCU problem.
> 
> On Mon, Sep 26, 2016 at 10:46:56AM +0300, Nikolay Borisov wrote:
>> [SNIP]
> 
> I think that __slab_free() doesn't take too long even if there is
> lock contention. And is cond_resched() even valid in softirq context?

Now that I think of it - it's not valid, since it might sleep and softirq
is an atomic context. So my suggestion is actually invalid, too bad.

> 
> I think the problem could be that too many RCU callbacks are
> executed without scheduling. Paul?

I don't think it's an RCU problem per se, since ext4_i_callback is being
called from RCU due to the way inodes are being freed.

On a slightly different note - on a different physical server, running
zfsonlinux, I experienced a very similar issue but without being in an
RCU context. Here is what the stack trace looked like there:

 Call Trace:
  [<ffffffff8117ea8a>] __slab_free+0xca/0x290
  [<ffffffff8117ee3a>] ? kmem_cache_free+0x1ea/0x200
  [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
  [<ffffffffa0245c97>] spl_kmem_cache_free+0x147/0x1d0 [spl]
  [<ffffffffa02cbecc>] dnode_destroy+0x1dc/0x230 [zfs]
  [<ffffffffa02cbf64>] dnode_buf_pageout+0x44/0xc0 [zfs]
  [<ffffffffa0248301>] taskq_thread+0x291/0x4e0 [spl]
  [<ffffffff8107ccd0>] ? wake_up_q+0x70/0x70
  [<ffffffffa0248070>] ? taskq_thread_spawn+0x50/0x50 [spl]
  [<ffffffff810715bf>] kthread+0xef/0x110
  [<ffffffff810714d0>] ? kthread_park+0x60/0x60
  [<ffffffff8161483f>] ret_from_fork+0x3f/0x70
  [<ffffffff810714d0>] ? kthread_park+0x60/0x60

It's hard to believe this is a coincidence. I've inspected the call paths
in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't
seem to be a loop apart from the one in __slab_free.

[SNIP]


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-28  7:15   ` Nikolay Borisov
@ 2016-09-28 11:12     ` Paul E. McKenney
  2016-09-29  1:40     ` Joonsoo Kim
  1 sibling, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2016-09-28 11:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Joonsoo Kim, Christoph Lameter, Linux MM, brouer

On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:
> [SNIP]
>
> > I think the problem could be that too many RCU callbacks are
> > executed without scheduling. Paul?
>
> I don't think it's an RCU problem per se, since ext4_i_callback is being
> called from RCU due to the way inodes are being freed.
> 
> On a slightly different note - on a different physical server, running
> zfsonlinux, I experienced a very similar issue but without being in an
> RCU context. Here is what the stack trace looked like there:
>
> [SNIP]
>
> It's hard to believe this is a coincidence. I've inspected the call paths
> in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't
> seem to be a loop apart from the one in __slab_free.

The RCU stall-warning message will repeat after a minute or two if the
CPU remains stuck, so one trick would be to let the system run for a while
after the first message and compare the two stack traces.

							Thanx, Paul


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-28  7:15   ` Nikolay Borisov
  2016-09-28 11:12     ` Paul E. McKenney
@ 2016-09-29  1:40     ` Joonsoo Kim
  2016-09-29  2:11       ` Paul E. McKenney
  1 sibling, 1 reply; 17+ messages in thread
From: Joonsoo Kim @ 2016-09-29  1:40 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Christoph Lameter, Linux MM, brouer, Paul E. McKenney

On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:
> 
> 
> On 09/28/2016 08:31 AM, Joonsoo Kim wrote:
> > [SNIP]
> >
> > I think that __slab_free() doesn't take too long even if there is
> > lock contention. And is cond_resched() even valid in softirq context?
> 
> Now that I think of it - it's not valid, since it might sleep and softirq
> is an atomic context. So my suggestion is actually invalid, too bad.
> 
> > 
> > I think the problem could be that too many RCU callbacks are
> > executed without scheduling. Paul?
> 
> I don't think it's an RCU problem per se, since ext4_i_callback is being
> called from RCU due to the way inodes are being freed.

That doesn't mean that RCU has no problem. IIUC, the fact is that RCU
has no scheduling point in rcu_process_callbacks(), and that could be
problematic. It just depends on the workload.

> 
> On a slightly different note - on a different physical server, running
> zfsonlinux, I experienced a very similar issue but without being in an
> RCU context. Here is what the stack trace looked like there:
>
> [SNIP]
>
> It's hard to believe this is a coincidence. I've inspected the call paths
> in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't
> seem to be a loop apart from the one in __slab_free.

I checked zfsonlinux and found that there are loops.
First, taskq_thread() doesn't call schedule() if there is a pending
work item. Second, dnode_buf_pageout() is just a simple loop.

__slab_free() has a loop around cmpxchg_double, but retries are normally
limited by the number of objects in a slab. After all objects are
freed, it cannot fail. So that's not too many iterations even in the worst case.

Thanks.


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  1:40     ` Joonsoo Kim
@ 2016-09-29  2:11       ` Paul E. McKenney
  2016-09-29  2:30         ` Nicholas Piggin
  2016-09-29  2:55         ` Joonsoo Kim
  0 siblings, 2 replies; 17+ messages in thread
From: Paul E. McKenney @ 2016-09-29  2:11 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Nikolay Borisov, Christoph Lameter, Linux MM, brouer

On Thu, Sep 29, 2016 at 10:40:24AM +0900, Joonsoo Kim wrote:
> On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:
> > [SNIP]
> >
> > > I think the problem could be that too many RCU callbacks are
> > > executed without scheduling. Paul?
> >
> > I don't think it's an RCU problem per se, since ext4_i_callback is being
> > called from RCU due to the way inodes are being freed.
> 
> That doesn't mean that RCU has no problem. IIUC, the fact is that RCU
> has no scheduling point in rcu_process_callbacks(), and that could be
> problematic. It just depends on the workload.

You mean rcu_do_batch()?  It does limit the callbacks invoked per call
to rcu_do_batch() under normal conditions, see the "++count >= bl" check.

Now, if you dump a huge number of callbacks down call_rcu()'s throat,
it will stop being Mr. Nice Guy and will start executing the callbacks
as fast as it can for potentially quite some time.  But a huge number
will be in the millions.  Per CPU.  In which case I just might have a
few questions about exactly what you are trying to do.

Nevertheless, it is entirely possible that RCU's callback-invocation
throttling strategy needs improvement.  So why not turn on ftrace and
have the machine tell you exactly where it is spending all of its time?
That would allow us to work out what the right fix should be.
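
For concreteness, the batch limiting referred to above looks roughly like this
(my paraphrase of the 4.4-era rcu_do_batch() in kernel/rcu/tree.c; setup,
tracing, and accounting elided, so not the verbatim source):

bl = rdp->blimit;	/* normally a small batch limit */
/* ... detach the ready-to-invoke callbacks into 'list' ... */
while (list) {
	next = list->next;
	debug_rcu_head_unqueue(list);
	__rcu_reclaim(rsp->name, list);	/* invoke one callback */
	list = next;
	/* Stop only if the limit is reached and the CPU has other work. */
	if (++count >= bl &&
	    (need_resched() ||
	     (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
		break;
}

Note that on an idle CPU neither need_resched() nor the !is_idle_task() test
fires, so the loop can legitimately run far past bl - which may be relevant
here, given that the soft lockup was reported on swapper/13.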

> > On a slightly different note - on a different physical server, running
> > zfsonlinux, I experienced a very similar issue but without being in an
> > RCU context. Here is what the stack trace looked like there:
> >
> > [SNIP]
> >
> > It's hard to believe this is a coincidence. I've inspected the call paths
> > in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't
> > seem to be a loop apart from the one in __slab_free.
> 
> I checked zfsonlinux and found that there are loops.
> First, taskq_thread() doesn't call schedule() if there is a pending
> work item. Second, dnode_buf_pageout() is just a simple loop.

Then shouldn't these be fixed?

> __slab_free() has a loop around cmpxchg_double, but retries are normally
> limited by the number of objects in a slab. After all objects are
> freed, it cannot fail. So that's not too many iterations even in the worst case.

							Thanx, Paul


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  2:11       ` Paul E. McKenney
@ 2016-09-29  2:30         ` Nicholas Piggin
  2016-09-29  2:47           ` Paul E. McKenney
  2016-09-29  2:55         ` Joonsoo Kim
  1 sibling, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2016-09-29  2:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joonsoo Kim, Nikolay Borisov, Christoph Lameter, Linux MM, brouer

On Wed, 28 Sep 2016 19:11:00 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Sep 29, 2016 at 10:40:24AM +0900, Joonsoo Kim wrote:
> > On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:  
> > > 
> > > I don't think it's an RCU problem per se, since ext4_i_callback is being
> > > called from RCU due to the way inodes are being freed.
> > 
> > That doesn't mean that RCU has no problem. IIUC, the fact is that RCU
> > has no scheduling point in rcu_process_callbacks(), and that could be
> > problematic. It just depends on the workload.
> 
> You mean rcu_do_batch()?  It does limit the callbacks invoked per call
> to rcu_do_batch() under normal conditions, see the "++count >= bl" check.
> 
> Now, if you dump a huge number of callbacks down call_rcu()'s throat,
> it will stop being Mr. Nice Guy and will start executing the callbacks
> as fast as it can for potentially quite some time.  But a huge number
> will be in the millions.  Per CPU.  In which case I just might have a
> few questions about exactly what you are trying to do.
> 
> Nevertheless, it is entirely possible that RCU's callback-invocation
> throttling strategy needs improvement.

Would it be useful to have a call_rcu variant that may sleep? Callers would
use it preferentially if they can. The implementation might be exactly the same
for now, but it would give you more flexibility with throttling strategies
in the future.

Thanks,
Nick


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  2:30         ` Nicholas Piggin
@ 2016-09-29  2:47           ` Paul E. McKenney
  2016-09-29  3:13             ` Nicholas Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2016-09-29  2:47 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Joonsoo Kim, Nikolay Borisov, Christoph Lameter, Linux MM, brouer

On Thu, Sep 29, 2016 at 12:30:07PM +1000, Nicholas Piggin wrote:
> [SNIP]
> 
> Would it be useful to have a call_rcu variant that may sleep? Callers would
> use it preferentially if they can. The implementation might be exactly the same
> for now, but it would give you more flexibility with throttling strategies
> in the future.

You can specify callback offloading at build and boot time, which will have
each CPU's callbacks processed by a kernel thread:

CONFIG_RCU_NOCB_CPU
CONFIG_RCU_NOCB_CPU_{NONE,ZERO,ALL}
rcu_nocbs=
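
(For example, booting a CONFIG_RCU_NOCB_CPU=y kernel with rcu_nocbs=1-15
should offload callback invocation for CPUs 1-15 to their rcuo kthreads;
the exact cpulist here is only an illustration.)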

However, this still executes the individual callbacks with bh disabled.
If you want the actual callbacks themselves to be able to sleep, make
the callback hand off to a workqueue, wake up a kthread, or some such.
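
A minimal sketch of that handoff (illustrative names, not from any in-tree
driver): the RCU callback runs with BH disabled, so it only queues work,
and the part that may sleep runs later in process context:

#include <linux/rcupdate.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
	struct work_struct work;
};

static void foo_workfn(struct work_struct *work)
{
	struct foo *f = container_of(work, struct foo, work);

	/* ... may sleep here ... */
	kfree(f);
}

static void foo_rcu_cb(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu);

	INIT_WORK(&f->work, foo_workfn);
	schedule_work(&f->work);	/* defer to process context */
}

/* Callers would free with: call_rcu(&f->rcu, foo_rcu_cb); */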

But yes, if enough people were just having the RCU callback immediately
invoke a workqueue, that could easily be special cased, just as
kfree_rcu() is now.

Or am I missing your point?

							Thanx, Paul


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  2:11       ` Paul E. McKenney
  2016-09-29  2:30         ` Nicholas Piggin
@ 2016-09-29  2:55         ` Joonsoo Kim
  2016-09-29  7:11           ` Nikolay Borisov
  1 sibling, 1 reply; 17+ messages in thread
From: Joonsoo Kim @ 2016-09-29  2:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Nikolay Borisov, Christoph Lameter, Linux MM, brouer

On Wed, Sep 28, 2016 at 07:11:00PM -0700, Paul E. McKenney wrote:
> On Thu, Sep 29, 2016 at 10:40:24AM +0900, Joonsoo Kim wrote:
> > On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:
> > > [SNIP]
> >
> > That doesn't mean that RCU has no problem. IIUC, the fact is that RCU
> > has no scheduling point in rcu_process_callbacks(), and that could be
> > problematic. It just depends on the workload.
> 
> You mean rcu_do_batch()?  It does limit the callbacks invoked per call
> to rcu_do_batch() under normal conditions, see the "++count >= bl" check.

Sorry, in fact, I didn't check anything deeply. What I checked is that
rcu_process_callbacks() has for_each_rcu_flavor(), which is a loop.
But now I see that it is a small loop, just over the RCU flavors. Sorry about
my careless mention.

Now I have checked that rcu_do_batch() has a finish point, as you mentioned.

Hmm... I still think that there is no problem in __slab_free().
Maybe something is corrupting memory, and that could cause looping in
the slab code.

Nikolay, could you share your .config?

> 
> Now, if you dump a huge number of callbacks down call_rcu()'s throat,
> it will stop being Mr. Nice Guy and will start executing the callbacks
> as fast as it can for potentially quite some time.  But a huge number
> will be in the millions.  Per CPU.  In which case I just might have a
> few questions about exactly what you are trying to do.
> 
> Nevertheless, it is entirely possible that RCU's callback-invocation
> throttling strategy needs improvement.  So why not turn on ftrace and
> have the machine tell you exactly where it is spending all of its time?
> That would allow us to work out what the right fix should be.

Enabling ftrace looks like a good idea!

> > > On a slightly different note - on a different physical server, running
> > > zfsonlinux, I experienced a very similar issue but without being in an
> > > RCU context. Here is what the stack trace looked like there:
> > >
> > > [SNIP]
> > >
> > > It's hard to believe this is a coincidence. I've inspected the call paths
> > > in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't
> > > seem to be a loop apart from the one in __slab_free.
> > 
> > I checked zfsonlinux and found that there are loops.
> > First, taskq_thread() doesn't call schedule() if there is a pending
> > work item. Second, dnode_buf_pageout() is just a simple loop.
> 
> Then shouldn't these be fixed?

These are out-of-tree modules and I didn't look at them deeply. My
analysis could be incorrect.

Thanks.


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  2:47           ` Paul E. McKenney
@ 2016-09-29  3:13             ` Nicholas Piggin
  2016-09-29 10:30               ` Paul E. McKenney
  0 siblings, 1 reply; 17+ messages in thread
From: Nicholas Piggin @ 2016-09-29  3:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joonsoo Kim, Nikolay Borisov, Christoph Lameter, Linux MM, brouer

On Wed, 28 Sep 2016 19:47:05 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Sep 29, 2016 at 12:30:07PM +1000, Nicholas Piggin wrote:
> > [SNIP]
> > 
> > Would it be useful to have a call_rcu variant that may sleep? Callers would
> > use it preferentially if they can. The implementation might be exactly the same
> > for now, but it would give you more flexibility with throttling strategies
> > in the future.
> 
> You can specify callback offloading at build and boot time, which will have
> each CPU's callbacks processed by a kernel thread:
> 
> CONFIG_RCU_NOCB_CPU
> CONFIG_RCU_NOCB_CPU_{NONE,ZERO,ALL}
> rcu_nocbs=
> 
> However, this still executes the individual callbacks with bh disabled.
> If you want the actual callbacks themselves to be able to sleep, make
> the callback hand off to a workqueue, wake up a kthread, or some such.
> 
> But yes, if enough people were just having the RCU callback immediately
> invoke a workqueue, that could easily be special cased, just as
> kfree_rcu() is now.
> 
> Or am I missing your point?

I just meant one where the call_rcu() caller can sleep. RCU could block
there to throttle callback production if necessary.

Thanks,
Nick


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  2:55         ` Joonsoo Kim
@ 2016-09-29  7:11           ` Nikolay Borisov
  2016-09-29 10:27             ` Paul E. McKenney
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-09-29  7:11 UTC (permalink / raw)
  To: Joonsoo Kim, Paul E. McKenney; +Cc: Christoph Lameter, Linux MM, brouer



On 09/29/2016 05:55 AM, Joonsoo Kim wrote:
> On Wed, Sep 28, 2016 at 07:11:00PM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 29, 2016 at 10:40:24AM +0900, Joonsoo Kim wrote:
>>> On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:
>>>> [SNIP]
>>>
>>> That doesn't mean that RCU has no problem. IIUC, the fact is that RCU
>>> has no scheduling point in rcu_process_callbacks(), and that could be
>>> problematic. It just depends on the workload.
>>
>> You mean rcu_do_batch()?  It does limit the callbacks invoked per call
>> to rcu_do_batch() under normal conditions, see the "++count >= bl" check.
> 
> Sorry, in fact, I didn't check anything deeply. What I checked is that
> rcu_process_callbacks() has for_each_rcu_flavor(), which is a loop.
> But now I see that it is a small loop, just over the RCU flavors. Sorry about
> my careless mention.
>
> Now I have checked that rcu_do_batch() has a finish point, as you mentioned.
>
> Hmm... I still think that there is no problem in __slab_free().
> Maybe something is corrupting memory, and that could cause looping in
> the slab code.
> 
> Nikolay, could you share your .config?

Here it is: http://sprunge.us/INHJ

> 
>>
>> Now, if you dump a huge number of callbacks down call_rcu()'s throat,
>> it will stop being Mr. Nice Guy and will start executing the callbacks
>> as fast as it can for potentially quite some time.  But a huge number
>> will be in the millions.  Per CPU.  In which case I just might have a
>> few questions about exactly what you are trying to do.
>>
>> Nevertheless, it is entirely possible that RCU's callback-invocation
>> throttling strategy needs improvement.  So why not turn on ftrace and
>> have the machine tell you exactly where it is spending all of its time?
>> That would allow us to work out what the right fix should be.
> 
> Enabling ftrace looks like a good idea!

What in particular should I be looking for in ftrace? Tracing the stacks
on the stuck CPU?

> 
>>>> On a slightly different note - on a different physical server, running
>>>> zfsonlinux, I experienced a very similar issue, but without being in an
>>>> RCU context. Here is what the stack trace looked like there:
>>>>
>>>>  Call Trace:
>>>>   [<ffffffff8117ea8a>] __slab_free+0xca/0x290
>>>>   [<ffffffff8117ee3a>] ? kmem_cache_free+0x1ea/0x200
>>>>   [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
>>>>   [<ffffffffa0245c97>] spl_kmem_cache_free+0x147/0x1d0 [spl]
>>>>   [<ffffffffa02cbecc>] dnode_destroy+0x1dc/0x230 [zfs]
>>>>   [<ffffffffa02cbf64>] dnode_buf_pageout+0x44/0xc0 [zfs]
>>>>   [<ffffffffa0248301>] taskq_thread+0x291/0x4e0 [spl]
>>>>   [<ffffffff8107ccd0>] ? wake_up_q+0x70/0x70
>>>>   [<ffffffffa0248070>] ? taskq_thread_spawn+0x50/0x50 [spl]
>>>>   [<ffffffff810715bf>] kthread+0xef/0x110
>>>>   [<ffffffff810714d0>] ? kthread_park+0x60/0x60
>>>>   [<ffffffff8161483f>] ret_from_fork+0x3f/0x70
>>>>   [<ffffffff810714d0>] ? kthread_park+0x60/0x60
>>>>
>>>> It's hard to believe this is a coincidence. I've inspected the call paths
>>>> in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't seem
>>>> to be a loop apart from the one in __slab_free.
>>>
>>> I checked zfsonlinux and found that there are loops.
>>> First, taskq_thread() doesn't call schedule() if there is a pending
>>> work item. Second, dnode_buf_pageout() is just a simple loop.
>>
>> Then shouldn't these be fixed?
> 
> These are out-of-tree modules and I didn't look at them deeply. My
> analysis could be incorrect.

So I just gave ZFS as an example; I'm well aware it's an out-of-tree
module and wouldn't pay much attention to questions about its code. I
just used it to illustrate that the ext4 issue is not an isolated case.
Also, the simple 'for' loop in dnode_buf_pageout would actually run for
at most 32 iterations, and at the time the soft lockup happened it was
on the 6th iteration. More info here (if you really care):
https://github.com/zfsonlinux/zfs/issues/5177

I just find it strange that 2 different file systems on 2 different
servers would hang in the exact same spot in __slab_free.
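
For reference, the loop in question is roughly this (abridged from
mm/slub.c in 4.4 - the elided line just rebuilds the freelist
bookkeeping):

do {
	if (unlikely(n)) {
		spin_unlock_irqrestore(&n->list_lock, flags);
		n = NULL;
	}
	prior = page->freelist;
	counters = page->counters;
	/* ... compute new.counters and the new freelist from prior ... */
} while (!cmpxchg_double_slab(s, page,
				prior, counters,
				object, new.counters,
				"__slab_free"));

As far as I can tell, the only way to spin here for seconds is for the
cmpxchg to keep failing, i.e. page->freelist/page->counters changing
under us at a very high rate - or being corrupted.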

> 
> Thanks.
> 


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  7:11           ` Nikolay Borisov
@ 2016-09-29 10:27             ` Paul E. McKenney
  2016-09-29 10:50               ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2016-09-29 10:27 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Joonsoo Kim, Christoph Lameter, Linux MM, brouer

On Thu, Sep 29, 2016 at 10:11:09AM +0300, Nikolay Borisov wrote:
> 
> 
> On 09/29/2016 05:55 AM, Joonsoo Kim wrote:
> > On Wed, Sep 28, 2016 at 07:11:00PM -0700, Paul E. McKenney wrote:
> >> On Thu, Sep 29, 2016 at 10:40:24AM +0900, Joonsoo Kim wrote:
> >>> On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:
> >>>>
> >>>>
> >>>> On 09/28/2016 08:31 AM, Joonsoo Kim wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Ccing Paul, because it looks like RCU problem.
> >>>>>
> >>>>> On Mon, Sep 26, 2016 at 10:46:56AM +0300, Nikolay Borisov wrote:
> >>>>>> Hello, 
> >>>>>>
> >>>>>> On the 4.4.14 stable kernel I observed the following soft lockup;
> >>>>>> however, I also checked that the code is the same in 4.8-rc, so the
> >>>>>> problem is present there as well:
> >>>>>>
> >>>>>> [434575.862377] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [swapper/13:0]
> >>>>>> [434575.866352] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           O    4.4.14-clouder5 #2
> >>>>>> [434575.866643] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
> >>>>>> [434575.866932] task: ffff8803714aadc0 ti: ffff8803714c4000 task.ti: ffff8803714c4000
> >>>>>> [434575.867221] RIP: 0010:[<ffffffff81613f4c>]  [<ffffffff81613f4c>] _raw_spin_unlock_irqrestore+0x1c/0x30
> >>>>>> [434575.867566] RSP: 0018:ffff880373ce3dc0  EFLAGS: 00000203
> >>>>>> [434575.867736] RAX: ffff88066e0c9a40 RBX: 0000000000000203 RCX: 0000000000000000
> >>>>>> [434575.868023] RDX: 0000000000000008 RSI: 0000000000000203 RDI: ffff88066e0c9a40
> >>>>>> [434575.868311] RBP: ffff880373ce3dc8 R08: ffff8803e5c1d118 R09: ffff8803e5c1d538
> >>>>>> [434575.868609] R10: 0000000000000000 R11: ffffea000f970600 R12: ffff88066e0c9a40
> >>>>>> [434575.868895] R13: ffffea000f970600 R14: 000000000046cf3b R15: ffff88036f8e3200
> >>>>>> [434575.869183] FS:  0000000000000000(0000) GS:ffff880373ce0000(0000) knlGS:0000000000000000
> >>>>>> [434575.869472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> [434575.869643] CR2: ffffffffff600400 CR3: 0000000367201000 CR4: 00000000001406e0
> >>>>>> [434575.869931] Stack:
> >>>>>> [434575.870095]  ffff88066e0c9a40 ffff880373ce3e78 ffffffff8117ea8a ffff880373ce3e08
> >>>>>> [434575.870567]  000000000046bd03 0000000100170017 ffff8803e5c1d118 ffff8803e5c1d118
> >>>>>> [434575.871037]  00ff000100000000 0000000000000203 0000000000000000 ffffffff8123d9ac
> >>>>>> [434575.874253] Call Trace:
> >>>>>> [434575.874418]  <IRQ> 
> >>>>>> [434575.874473]  [<ffffffff8117ea8a>] __slab_free+0xca/0x290
> >>>>>> [434575.874806]  [<ffffffff8123d9ac>] ? ext4_i_callback+0x1c/0x20
> >>>>>> [434575.874978]  [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
> >>>>>> [434575.875149]  [<ffffffff8123d9ac>] ext4_i_callback+0x1c/0x20
> >>>>>> [434575.875325]  [<ffffffff810ad09b>] rcu_process_callbacks+0x21b/0x620
> >>>>>> [434575.875506]  [<ffffffff81057337>] __do_softirq+0x147/0x310
> >>>>>> [434575.875680]  [<ffffffff8105764f>] irq_exit+0x5f/0x70
> >>>>>> [434575.875851]  [<ffffffff81616a82>] smp_apic_timer_interrupt+0x42/0x50
> >>>>>> [434575.876025]  [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
> >>>>>> [434575.876197]  <EOI> 
> >>>>>> [434575.876250]  [<ffffffff81510601>] ? cpuidle_enter_state+0x141/0x2c0
> >>>>>> [434575.876583]  [<ffffffff815105f6>] ? cpuidle_enter_state+0x136/0x2c0
> >>>>>> [434575.876755]  [<ffffffff815107b7>] cpuidle_enter+0x17/0x20
> >>>>>> [434575.876929]  [<ffffffff810949fc>] cpu_startup_entry+0x2fc/0x360
> >>>>>> [434575.877105]  [<ffffffff810330e3>] start_secondary+0xf3/0x100
> >>>>>>
> >>>>>> The ip in __slab_free points to this piece of code (in mm/slub.c): 
> >>>>>>
> >>>>>> if (unlikely(n)) {
> >>>>>> 	spin_unlock_irqrestore(&n->list_lock, flags);
> >>>>>> 	n = NULL;
> >>>>>> }
> >>>>>>
> >>>>>> I think it's pure chance that the spin_unlock_irqrestore is what shows up
> >>>>>> in this trace. Do you think a cond_resched() is needed in this unlikely
> >>>>>> if clause? Apparently there are cases where this loop can take a
> >>>>>> considerable amount of time.
> >>>>>
> >>>>> I don't think __slab_free() takes too long even if there is lock
> >>>>> contention. And is cond_resched() even valid in softirq context?
> >>>>
> >>>> Now that I think of it - it's not valid, since it might sleep and softirq
> >>>> is an atomic context. So my suggestion is actually invalid, too bad.
> >>>>
> >>>>>
> >>>>> I think the problem could be that too many RCU callbacks are executed
> >>>>> without scheduling. Paul?
> >>>>
> >>>> I don't think it's an RCU problem per se, since ext4_i_callback is being
> >>>> called from RCU due to the way inodes are freed.
> >>>
> >>> That doesn't mean that RCU has no problem. IIUC, RCU has no scheduling
> >>> point in rcu_process_callbacks(), and that could be problematic. It just
> >>> depends on the workload.
> >>
> >> You mean rcu_do_batch()?  It does limit the callbacks invoked per call
> >> to rcu_do_batch() under normal conditions, see the "++count >= bl" check.
> > 
> > Sorry, in fact I didn't check anything deeply. What I checked is that
> > rcu_process_callbacks() has for_each_rcu_flavor(), which is a loop.
> > But now I see that it is a small loop, just over the RCU flavors. Sorry
> > for the careless remark.
> > 
> > Now I have checked that rcu_do_batch() has a finish point, as you mentioned.
> > 
> > Hmm... I still think that there is no problem in __slab_free(). Maybe
> > something is corrupting memory, and that would cause looping in the
> > slab code.
> > 
> > Nikolay, could you share your .config?
> 
> Here it is: http://sprunge.us/INHJ

OK, you do have CONFIG_PREEMPT_NONE=y, which means that cond_resched(),
cond_resched_rcu_qs(), and friends are especially important.
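
For illustration, the usual shape of a long-running kernel loop that
stays safe on such a config is the following (a minimal sketch, not
taken from any particular call site; process_one() is a hypothetical
stand-in for the real work):

while (!done) {
	done = process_one();
	cond_resched_rcu_qs();	/* resched point + RCU quiescent state */
}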

> >> Now, if you dump a huge number of callbacks down call_rcu()'s throat,
> >> it will stop being Mr. Nice Guy and will start executing the callbacks
> >> as fast as it can for potentially quite some time.  But a huge number
> >> will be in the millions.  Per CPU.  In which case I just might have a
> >> few questions about exactly what you are trying to do.
> >>
> >> Nevertheless, it is entirely possible that RCU's callback-invocation
> >> throttling strategy needs improvement.  So why not turn on ftrace and
> >> have the machine tell you exactly where it is spending all of its time?
> >> That would allow us to work out what the right fix should be.
> > 
> > Enabling ftrace looks like a good idea!
> 
> What in particular should I be looking for in ftrace? Tracing the stacks
> on the stuck CPU?

To start with, how about the sequence of functions that the stuck
CPU is executing?
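
For example, something along these lines (a sketch against the stock
tracefs/debugfs interface; the mask assumes the stuck CPU is 13, as in
your report, since tracing_cpumask takes a hex mask and 1 << 13 is
0x2000):

echo 2000 > /sys/kernel/debug/tracing/tracing_cpumask
echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe > cpu13-functions.log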

> >>>> On a slightly different note - on a different physical server, running
> >>>> zfsonlinux, I experienced a very similar issue, but without being in an
> >>>> RCU context. Here is what the stack trace looked like there:
> >>>>
> >>>>  Call Trace:
> >>>>   [<ffffffff8117ea8a>] __slab_free+0xca/0x290
> >>>>   [<ffffffff8117ee3a>] ? kmem_cache_free+0x1ea/0x200
> >>>>   [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
> >>>>   [<ffffffffa0245c97>] spl_kmem_cache_free+0x147/0x1d0 [spl]
> >>>>   [<ffffffffa02cbecc>] dnode_destroy+0x1dc/0x230 [zfs]
> >>>>   [<ffffffffa02cbf64>] dnode_buf_pageout+0x44/0xc0 [zfs]
> >>>>   [<ffffffffa0248301>] taskq_thread+0x291/0x4e0 [spl]
> >>>>   [<ffffffff8107ccd0>] ? wake_up_q+0x70/0x70
> >>>>   [<ffffffffa0248070>] ? taskq_thread_spawn+0x50/0x50 [spl]
> >>>>   [<ffffffff810715bf>] kthread+0xef/0x110
> >>>>   [<ffffffff810714d0>] ? kthread_park+0x60/0x60
> >>>>   [<ffffffff8161483f>] ret_from_fork+0x3f/0x70
> >>>>   [<ffffffff810714d0>] ? kthread_park+0x60/0x60
> >>>>
> >>>> It's hard to believe this is a coincidence. I've inspected the call paths
> >>>> in taskq_thread/dnode_buf_pageout/dnode_destroy and there doesn't seem
> >>>> to be a loop apart from the one in __slab_free.
> >>>
> >>> I checked zfsonlinux and found that there are loops.
> >>> First, taskq_thread() doesn't call schedule() if there is a pending
> >>> work item. Second, dnode_buf_pageout() is just a simple loop.
> >>
> >> Then shouldn't these be fixed?
> > 
> > These are out-of-tree modules and I didn't look at them deeply. My
> > analysis could be incorrect.
> 
> So I just gave ZFS as an example; I'm well aware it's an out-of-tree
> module and wouldn't pay much attention to questions about its code. I
> just used it to illustrate that the ext4 issue is not an isolated case.
> Also, the simple 'for' loop in dnode_buf_pageout would actually run for
> at most 32 iterations, and at the time the soft lockup happened it was
> on the 6th iteration. More info here (if you really care):
> https://github.com/zfsonlinux/zfs/issues/5177
> 
> I just find it strange that 2 different file systems on 2 different
> servers would hang in the exact same spot in __slab_free.

It is interesting.  It would be even more interesting to learn why.  ;-)

							Thanx, Paul


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29  3:13             ` Nicholas Piggin
@ 2016-09-29 10:30               ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2016-09-29 10:30 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Joonsoo Kim, Nikolay Borisov, Christoph Lameter, Linux MM, brouer

On Thu, Sep 29, 2016 at 01:13:29PM +1000, Nicholas Piggin wrote:
> On Wed, 28 Sep 2016 19:47:05 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Sep 29, 2016 at 12:30:07PM +1000, Nicholas Piggin wrote:
> > > On Wed, 28 Sep 2016 19:11:00 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:  
> > > > On Thu, Sep 29, 2016 at 10:40:24AM +0900, Joonsoo Kim wrote:  
> > > > > On Wed, Sep 28, 2016 at 10:15:01AM +0300, Nikolay Borisov wrote:    
> > > > > > 
> > > > > > I don't think it's an RCU problem per se, since ext4_i_callback is
> > > > > > being called from RCU due to the way inodes are freed.
> > > > > 
> > > > > That doesn't mean that RCU has no problem. IIUC, RCU has no scheduling
> > > > > point in rcu_process_callbacks(), and that could be problematic. It
> > > > > just depends on the workload.
> > > > 
> > > > You mean rcu_do_batch()?  It does limit the callbacks invoked per call
> > > > to rcu_do_batch() under normal conditions, see the "++count >= bl" check.
> > > > 
> > > > Now, if you dump a huge number of callbacks down call_rcu()'s throat,
> > > > it will stop being Mr. Nice Guy and will start executing the callbacks
> > > > as fast as it can for potentially quite some time.  But a huge number
> > > > will be in the millions.  Per CPU.  In which case I just might have a
> > > > few questions about exactly what you are trying to do.
> > > > 
> > > > Nevertheless, it is entirely possible that RCU's callback-invocation
> > > > throttling strategy needs improvement.  
> > > 
> > > Would it be useful to have a call_rcu() variant that may sleep? Callers
> > > would use it preferentially if they can. The implementation might be
> > > exactly the same for now, but it would give you more flexibility with
> > > throttling strategies in the future.
> > 
> > You can specify callback-offloading at build and boot time, which will have
> > each CPU's callbacks being processed by a kernel thread:
> > 
> > CONFIG_RCU_NOCB_CPU
> > CONFIG_RCU_NOCB_CPU_{NONE,ZERO,ALL}
> > rcu_nocbs=
> > 
> > However, this still executes the individual callbacks with bh disabled.
> > If you want the actual callbacks themselves to be able to sleep, make
> > the callback hand off to a workqueue, wake up a kthread, or some such.
> > 
> > But yes, if enough people were just having the RCU callback immediately
> > invoke a workqueue, that could easily be special cased, just as
> > kfree_rcu() is now.
> > 
> > Or am I missing your point?
> 
> I just meant a variant where the call_rcu() caller can sleep. RCU could
> block there to throttle production if necessary.

Good point!  Yes, if the problem is RCU getting flooded with callbacks,
this would be one possible resolution.  I am hoping that ftrace will
better identify the actual problem.  Naive of me, I know!  ;-)
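
In other words, something like this purely hypothetical sketch - neither
call_rcu_sleepable() nor the pressure check exists today; the names are
invented for illustration:

/* A call_rcu() variant for callers that may sleep: if too many
 * callbacks are already queued, throttle the producer by waiting
 * synchronously for a grace period before queueing more. */
void call_rcu_sleepable(struct rcu_head *head, rcu_callback_t func)
{
	might_sleep();
	if (rcu_cbs_over_limit())	/* hypothetical pressure check */
		synchronize_rcu();	/* sleeps, throttling the producer */
	call_rcu(head, func);
}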

							Thanx, Paul


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29 10:27             ` Paul E. McKenney
@ 2016-09-29 10:50               ` Nikolay Borisov
  2016-09-29 11:10                 ` Paul E. McKenney
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-09-29 10:50 UTC (permalink / raw)
  To: paulmck; +Cc: Joonsoo Kim, Christoph Lameter, Linux MM, brouer



On 09/29/2016 01:27 PM, Paul E. McKenney wrote:
> On Thu, Sep 29, 2016 at 10:11:09AM +0300, Nikolay Borisov wrote:
[SNIP]

>> What in particular should I be looking for in ftrace? Tracing the stacks
>> on the stuck CPU?
> 
> To start with, how about the sequence of functions that the stuck
> CPU is executing?

Unfortunately I do not know how to reproduce the issue, but it is being
reproduced by our production load - which is creating backups in this
case. They are created by rsyncing files to loop-back attached files,
which are then unmounted and unmapped. From this crash it is evident
that the hang occurs while a volume is being unmounted.
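
In rough shell terms the job looks like this (paths and device names are
hypothetical, just to make the workload concrete):

losetup /dev/loop0 /backups/volume.img   # loop-back attach the image
mount /dev/loop0 /mnt/backup
rsync -a /data/ /mnt/backup/             # the backup itself
umount /mnt/backup                       # the hang window seems to be here
losetup -d /dev/loop0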

But the call stack is in my hang report, no? I have the crashdump with
me, so if you are interested in anything in particular I can go look for
it. I believe an inode eviction was requested, since destroy_inode,
which utilizes ext4_i_callback, is called in the eviction + some error
paths. And this eviction is executed on this particular CPU. What in
particular are you looking for?

Unfortunately it's impossible for me to run:

trace-cmd record -p function_graph -F <command that causes the issue>

[SNIP]


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-29 10:50               ` Nikolay Borisov
@ 2016-09-29 11:10                 ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2016-09-29 11:10 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Joonsoo Kim, Christoph Lameter, Linux MM, brouer

On Thu, Sep 29, 2016 at 01:50:34PM +0300, Nikolay Borisov wrote:
> 
> 
> On 09/29/2016 01:27 PM, Paul E. McKenney wrote:
> > On Thu, Sep 29, 2016 at 10:11:09AM +0300, Nikolay Borisov wrote:
> [SNIP]
> 
> >> What in particular should I be looking for in ftrace? Tracing the stacks
> >> on the stuck CPU?
> > 
> > To start with, how about the sequence of functions that the stuck
> > CPU is executing?
> 
> Unfortunately I do not know how to reproduce the issue, but it is being
> reproduced by our production load - which is creating backups in this
> case. They are created by rsyncing files to loop-back attached files,
> which are then unmounted and unmapped. From this crash it is evident
> that the hang occurs while a volume is being unmounted.
> 
> But the call stack is in my hang report, no? I have the crashdump with
> me, so if you are interested in anything in particular I can go look for
> it. I believe an inode eviction was requested, since destroy_inode,
> which utilizes ext4_i_callback, is called in the eviction + some error
> paths. And this eviction is executed on this particular CPU. What in
> particular are you looking for?
> 
> Unfortunately it's impossible for me to run:
> 
> trace-cmd record -p function_graph -F <command that causes the issue>

Given that the hang appears to involve a few common functions, is it
possible to turn ftrace on for those functions (and related ones) at boot
time?  Or using sysfs just after boot time?

Given that you cannot reproduce at will, your earlier suggestion of enabling
tracing of stacks might also make a lot of sense.
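
Concretely, something like this just after boot might do (a sketch; the
filter list is simply the common functions from your lockup stacks):

cd /sys/kernel/debug/tracing
echo __slab_free kmem_cache_free rcu_process_callbacks > set_ftrace_filter
echo 1 > options/func_stack_trace	# record a stack for each filtered hit
echo function > current_tracer
echo 1 > tracing_on

Or, roughly equivalently, at boot time with the ftrace=function and
ftrace_filter=__slab_free,kmem_cache_free,rcu_process_callbacks kernel
parameters.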

							Thanx, Paul

> [SNIP]


* Re: Soft lockup in __slab_free (SLUB)
  2016-09-28  5:31 ` Joonsoo Kim
  2016-09-28  7:15   ` Nikolay Borisov
@ 2016-10-04 14:58   ` Nikolay Borisov
  2016-10-04 15:36     ` Paul E. McKenney
  1 sibling, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-10-04 14:58 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Linux MM, Paul E. McKenney



On 09/28/2016 08:31 AM, Joonsoo Kim wrote:
> Hello,
> 
> Ccing Paul, because it looks like RCU problem.
> 
> On Mon, Sep 26, 2016 at 10:46:56AM +0300, Nikolay Borisov wrote:
>> Hello, 
>>
>> On the 4.4.14 stable kernel I observed the following soft lockup;
>> however, I also checked that the code is the same in 4.8-rc, so the
>> problem is present there as well:
>>
>> [434575.862377] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [swapper/13:0]
>> [434575.866352] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           O    4.4.14-clouder5 #2
>> [434575.866643] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
>> [434575.866932] task: ffff8803714aadc0 ti: ffff8803714c4000 task.ti: ffff8803714c4000
>> [434575.867221] RIP: 0010:[<ffffffff81613f4c>]  [<ffffffff81613f4c>] _raw_spin_unlock_irqrestore+0x1c/0x30
>> [434575.867566] RSP: 0018:ffff880373ce3dc0  EFLAGS: 00000203
>> [434575.867736] RAX: ffff88066e0c9a40 RBX: 0000000000000203 RCX: 0000000000000000
>> [434575.868023] RDX: 0000000000000008 RSI: 0000000000000203 RDI: ffff88066e0c9a40
>> [434575.868311] RBP: ffff880373ce3dc8 R08: ffff8803e5c1d118 R09: ffff8803e5c1d538
>> [434575.868609] R10: 0000000000000000 R11: ffffea000f970600 R12: ffff88066e0c9a40
>> [434575.868895] R13: ffffea000f970600 R14: 000000000046cf3b R15: ffff88036f8e3200
>> [434575.869183] FS:  0000000000000000(0000) GS:ffff880373ce0000(0000) knlGS:0000000000000000
>> [434575.869472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [434575.869643] CR2: ffffffffff600400 CR3: 0000000367201000 CR4: 00000000001406e0
>> [434575.869931] Stack:
>> [434575.870095]  ffff88066e0c9a40 ffff880373ce3e78 ffffffff8117ea8a ffff880373ce3e08
>> [434575.870567]  000000000046bd03 0000000100170017 ffff8803e5c1d118 ffff8803e5c1d118
>> [434575.871037]  00ff000100000000 0000000000000203 0000000000000000 ffffffff8123d9ac
>> [434575.874253] Call Trace:
>> [434575.874418]  <IRQ> 
>> [434575.874473]  [<ffffffff8117ea8a>] __slab_free+0xca/0x290
>> [434575.874806]  [<ffffffff8123d9ac>] ? ext4_i_callback+0x1c/0x20
>> [434575.874978]  [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
>> [434575.875149]  [<ffffffff8123d9ac>] ext4_i_callback+0x1c/0x20
>> [434575.875325]  [<ffffffff810ad09b>] rcu_process_callbacks+0x21b/0x620
>> [434575.875506]  [<ffffffff81057337>] __do_softirq+0x147/0x310
>> [434575.875680]  [<ffffffff8105764f>] irq_exit+0x5f/0x70
>> [434575.875851]  [<ffffffff81616a82>] smp_apic_timer_interrupt+0x42/0x50
>> [434575.876025]  [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
>> [434575.876197]  <EOI> 
>> [434575.876250]  [<ffffffff81510601>] ? cpuidle_enter_state+0x141/0x2c0
>> [434575.876583]  [<ffffffff815105f6>] ? cpuidle_enter_state+0x136/0x2c0
>> [434575.876755]  [<ffffffff815107b7>] cpuidle_enter+0x17/0x20
>> [434575.876929]  [<ffffffff810949fc>] cpu_startup_entry+0x2fc/0x360
>> [434575.877105]  [<ffffffff810330e3>] start_secondary+0xf3/0x100
>>
>> The ip in __slab_free points to this piece of code (in mm/slub.c): 
>>
>> if (unlikely(n)) {
>> 	spin_unlock_irqrestore(&n->list_lock, flags);
>> 	n = NULL;
>> }
>>
>> I think it's pure chance that the spin_unlock_irqrestore is what shows up
>> in this trace. Do you think a cond_resched() is needed in this unlikely
>> if clause? Apparently there are cases where this loop can take a
>> considerable amount of time.
> 
> I don't think __slab_free() takes too long even if there is lock
> contention. And is cond_resched() even valid in softirq context?
> 
> I think the problem could be that too many RCU callbacks are executed
> without scheduling. Paul?
> 
> Thanks.

So this problem manifested itself again, with the exact same call stack.
This actually leads me to believe that your hypothesis about RCU being
the main culprit might be correct. I will have to play with ftrace to
see how to acquire useful information which might point me at the
culprit. Do you have any ideas off the top of your head?



* Re: Soft lockup in __slab_free (SLUB)
  2016-10-04 14:58   ` Nikolay Borisov
@ 2016-10-04 15:36     ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2016-10-04 15:36 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Joonsoo Kim, Linux MM

On Tue, Oct 04, 2016 at 05:58:20PM +0300, Nikolay Borisov wrote:
> 
> 
> On 09/28/2016 08:31 AM, Joonsoo Kim wrote:
> > Hello,
> > 
> > Ccing Paul, because it looks like RCU problem.
> > 
> > On Mon, Sep 26, 2016 at 10:46:56AM +0300, Nikolay Borisov wrote:
> >> Hello, 
> >>
> >> On the 4.4.14 stable kernel I observed the following soft lockup;
> >> however, I also checked that the code is the same in 4.8-rc, so the
> >> problem is present there as well:
> >>
> >> [434575.862377] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [swapper/13:0]
> >> [434575.866352] CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           O    4.4.14-clouder5 #2
> >> [434575.866643] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
> >> [434575.866932] task: ffff8803714aadc0 ti: ffff8803714c4000 task.ti: ffff8803714c4000
> >> [434575.867221] RIP: 0010:[<ffffffff81613f4c>]  [<ffffffff81613f4c>] _raw_spin_unlock_irqrestore+0x1c/0x30
> >> [434575.867566] RSP: 0018:ffff880373ce3dc0  EFLAGS: 00000203
> >> [434575.867736] RAX: ffff88066e0c9a40 RBX: 0000000000000203 RCX: 0000000000000000
> >> [434575.868023] RDX: 0000000000000008 RSI: 0000000000000203 RDI: ffff88066e0c9a40
> >> [434575.868311] RBP: ffff880373ce3dc8 R08: ffff8803e5c1d118 R09: ffff8803e5c1d538
> >> [434575.868609] R10: 0000000000000000 R11: ffffea000f970600 R12: ffff88066e0c9a40
> >> [434575.868895] R13: ffffea000f970600 R14: 000000000046cf3b R15: ffff88036f8e3200
> >> [434575.869183] FS:  0000000000000000(0000) GS:ffff880373ce0000(0000) knlGS:0000000000000000
> >> [434575.869472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [434575.869643] CR2: ffffffffff600400 CR3: 0000000367201000 CR4: 00000000001406e0
> >> [434575.869931] Stack:
> >> [434575.870095]  ffff88066e0c9a40 ffff880373ce3e78 ffffffff8117ea8a ffff880373ce3e08
> >> [434575.870567]  000000000046bd03 0000000100170017 ffff8803e5c1d118 ffff8803e5c1d118
> >> [434575.871037]  00ff000100000000 0000000000000203 0000000000000000 ffffffff8123d9ac
> >> [434575.874253] Call Trace:
> >> [434575.874418]  <IRQ> 
> >> [434575.874473]  [<ffffffff8117ea8a>] __slab_free+0xca/0x290
> >> [434575.874806]  [<ffffffff8123d9ac>] ? ext4_i_callback+0x1c/0x20
> >> [434575.874978]  [<ffffffff8117ee3a>] kmem_cache_free+0x1ea/0x200
> >> [434575.875149]  [<ffffffff8123d9ac>] ext4_i_callback+0x1c/0x20
> >> [434575.875325]  [<ffffffff810ad09b>] rcu_process_callbacks+0x21b/0x620
> >> [434575.875506]  [<ffffffff81057337>] __do_softirq+0x147/0x310
> >> [434575.875680]  [<ffffffff8105764f>] irq_exit+0x5f/0x70
> >> [434575.875851]  [<ffffffff81616a82>] smp_apic_timer_interrupt+0x42/0x50
> >> [434575.876025]  [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
> >> [434575.876197]  <EOI> 
> >> [434575.876250]  [<ffffffff81510601>] ? cpuidle_enter_state+0x141/0x2c0
> >> [434575.876583]  [<ffffffff815105f6>] ? cpuidle_enter_state+0x136/0x2c0
> >> [434575.876755]  [<ffffffff815107b7>] cpuidle_enter+0x17/0x20
> >> [434575.876929]  [<ffffffff810949fc>] cpu_startup_entry+0x2fc/0x360
> >> [434575.877105]  [<ffffffff810330e3>] start_secondary+0xf3/0x100
> >>
> >> The ip in __slab_free points to this piece of code (in mm/slub.c): 
> >>
> >> if (unlikely(n)) {
> >> 	spin_unlock_irqrestore(&n->list_lock, flags);
> >> 	n = NULL;
> >> }
> >>
> >> I think it's pure chance that the spin_unlock_irqrestore is what shows up
> >> in this trace. Do you think a cond_resched() is needed in this unlikely
> >> if clause? Apparently there are cases where this loop can take a
> >> considerable amount of time.
> > 
> > I don't think __slab_free() takes too long even if there is lock
> > contention. And is cond_resched() even valid in softirq context?
> > 
> > I think the problem could be that too many RCU callbacks are executed
> > without scheduling. Paul?
> > 
> > Thanks.
> 
> So this problem manifested itself again, with the exact same call stack.
> This actually leads me to believe that your hypothesis about RCU being
> the main culprit might be correct. I will have to play with ftrace to
> see how to acquire useful information which might point me at the
> culprit. Do you have any ideas off the top of your head?

I suggest enabling the rcu_utilization, rcu_batch_start,
rcu_invoke_callback, rcu_invoke_kfree_callback, and rcu_batch_end
event traces.  The last four will require that you build with
CONFIG_RCU_TRACE=y.  If you are indeed seeing too many RCU callbacks
being executed, you will see large numbers of rcu_invoke_callback and/or
rcu_invoke_kfree_callback trace events.  If you are stuck in a
particular callback, you will instead see one of these two events
(probably rcu_invoke_callback) with a large gap after it.

Probably additional events will be needed, but that should be a good
start.
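
For instance, assuming the standard tracefs event layout (all five
events live under events/rcu/):

cd /sys/kernel/debug/tracing
for e in rcu_utilization rcu_batch_start rcu_invoke_callback \
	 rcu_invoke_kfree_callback rcu_batch_end; do
	echo 1 > events/rcu/$e/enable
done
echo 1 > tracing_on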

Note that RCU will in some cases automatically dump the trace buffer
for you.  If you don't want it to do that, make rcu_ftrace_dump() be
a no-op.

							Thanx, Paul


Thread overview: 17+ messages
2016-09-26  7:46 Soft lockup in __slab_free (SLUB) Nikolay Borisov
2016-09-28  5:31 ` Joonsoo Kim
2016-09-28  7:15   ` Nikolay Borisov
2016-09-28 11:12     ` Paul E. McKenney
2016-09-29  1:40     ` Joonsoo Kim
2016-09-29  2:11       ` Paul E. McKenney
2016-09-29  2:30         ` Nicholas Piggin
2016-09-29  2:47           ` Paul E. McKenney
2016-09-29  3:13             ` Nicholas Piggin
2016-09-29 10:30               ` Paul E. McKenney
2016-09-29  2:55         ` Joonsoo Kim
2016-09-29  7:11           ` Nikolay Borisov
2016-09-29 10:27             ` Paul E. McKenney
2016-09-29 10:50               ` Nikolay Borisov
2016-09-29 11:10                 ` Paul E. McKenney
2016-10-04 14:58   ` Nikolay Borisov
2016-10-04 15:36     ` Paul E. McKenney
