linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* syzbot rcu/debugobjects warning
@ 2018-03-23  2:51 Joel Fernandes
  2018-03-23  9:11 ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Fernandes @ 2018-03-23  2:51 UTC (permalink / raw)
  To: Paul McKenney, Thomas Gleixner; +Cc: LKML, Todd Poynor

Hi Paul, Thomas,

I received a crash report from syzbot on the android 4.9 kernel and I
am looking into it, it seems the debugobjects subsystem is warning
that a certain RCU structure is not allocated on the stack, but is
annotated to be.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:300
debug_object_is_on_stack lib/debugobjects.c:300 [inline]
WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:300
__debug_object_init+0x526/0xc40 lib/debugobjects.c:326
[...]
 <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
lib/dump_stack.c:51
 [<ffffffff8142fbd1>] panic+0x1bc/0x3a8 kernel/panic.c:179
 [<ffffffff81131894>] __warn+0x1c4/0x1e0 kernel/panic.c:542
 [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
 [<ffffffff81dfefb6>] debug_object_is_on_stack lib/debugobjects.c:300 [inline]
 [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40 lib/debugobjects.c:326
 [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
lib/debugobjects.c:378
 [<ffffffff81287a93>] init_rcu_head_on_stack kernel/rcu/update.c:403 [inline]
 [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0 kernel/rcu/update.c:358
 [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
kernel/rcu/tree_plugin.h:678
 [<ffffffff81290287>] synchronize_rcu+0x27/0x90 kernel/rcu/tree_plugin.h:679
 [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
net/l2tp/l2tp_core.c:1792

The full report is here:
https://syzkaller.appspot.com/bug?extid=e6a19b585ab2dba3eee8

It seems as per the code that the structure is on the stack so its
weird why debugobjects thinks its not.
The object in question is allocated on the stack by the __wait_rcu_gp
macro when its called from synchronize_rcu:

#define _wait_rcu_gp(checktiny, ...) \
do {                                                                    \
        call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };               \
        struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];    \
        __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),              \
                        __crcu_array, __rs_array);                      \
} while (0)


Any debug ideas or thoughts about it?

thanks for any help,

- Joel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: syzbot rcu/debugobjects warning
  2018-03-23  2:51 syzbot rcu/debugobjects warning Joel Fernandes
@ 2018-03-23  9:11 ` Thomas Gleixner
  2018-03-23 16:24   ` Joel Fernandes
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2018-03-23  9:11 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Paul McKenney, LKML, Todd Poynor

On Thu, 22 Mar 2018, Joel Fernandes wrote:

> Hi Paul, Thomas,
> 
> I received a crash report from syzbot on the android 4.9 kernel and I
> am looking into it, it seems the debugobjects subsystem is warning
> that a certain RCU structure is not allocated on the stack, but is
> annotated to be.
> 
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:300
> debug_object_is_on_stack lib/debugobjects.c:300 [inline]
> WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:300
> __debug_object_init+0x526/0xc40 lib/debugobjects.c:326
> [...]
>  <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
> lib/dump_stack.c:51
>  [<ffffffff8142fbd1>] panic+0x1bc/0x3a8 kernel/panic.c:179
>  [<ffffffff81131894>] __warn+0x1c4/0x1e0 kernel/panic.c:542
>  [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
>  [<ffffffff81dfefb6>] debug_object_is_on_stack lib/debugobjects.c:300 [inline]
>  [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40 lib/debugobjects.c:326
>  [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
> lib/debugobjects.c:378
>  [<ffffffff81287a93>] init_rcu_head_on_stack kernel/rcu/update.c:403 [inline]
>  [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0 kernel/rcu/update.c:358
>  [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
> kernel/rcu/tree_plugin.h:678
>  [<ffffffff81290287>] synchronize_rcu+0x27/0x90 kernel/rcu/tree_plugin.h:679
>  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
> net/l2tp/l2tp_core.c:1792
> 
> The full report is here:
> https://syzkaller.appspot.com/bug?extid=e6a19b585ab2dba3eee8

This is beyond useless. That brings me to a google 'Sign in' page. Please
use accessible storage. That information is hardly secrit.

> It seems as per the code that the structure is on the stack so its
> weird why debugobjects thinks its not.
> The object in question is allocated on the stack by the __wait_rcu_gp
> macro when its called from synchronize_rcu:
> 
> #define _wait_rcu_gp(checktiny, ...) \
> do {                                                                    \
>         call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };               \
>         struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];    \
>         __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),              \
>                         __crcu_array, __rs_array);                      \
> } while (0)
> 
> 
> Any debug ideas or thoughts about it?

I assume it emitted:
     pr_warn("object is not on stack, but annotated\n");

before dumping the WARN_ON(). Right? If so, then you might have run into a
stack corruption. But hard to tell. Please add something like this:

     pr_warn("object %p is not on stack %p, but annotated\n", obj,
     	     task_stack_page(current));
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: syzbot rcu/debugobjects warning
  2018-03-23  9:11 ` Thomas Gleixner
@ 2018-03-23 16:24   ` Joel Fernandes
  2018-03-23 20:41     ` Thomas Gleixner
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Fernandes @ 2018-03-23 16:24 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Paul McKenney, LKML, Todd Poynor

On Fri, Mar 23, 2018 at 2:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Thu, 22 Mar 2018, Joel Fernandes wrote:
>
>> Hi Paul, Thomas,
>>
>> I received a crash report from syzbot on the android 4.9 kernel and I
>> am looking into it, it seems the debugobjects subsystem is warning
>> that a certain RCU structure is not allocated on the stack, but is
>> annotated to be.
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:300
>> debug_object_is_on_stack lib/debugobjects.c:300 [inline]
>> WARNING: CPU: 1 PID: 0 at lib/debugobjects.c:300
>> __debug_object_init+0x526/0xc40 lib/debugobjects.c:326
>> [...]
>>  <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
>> lib/dump_stack.c:51
>>  [<ffffffff8142fbd1>] panic+0x1bc/0x3a8 kernel/panic.c:179
>>  [<ffffffff81131894>] __warn+0x1c4/0x1e0 kernel/panic.c:542
>>  [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
>>  [<ffffffff81dfefb6>] debug_object_is_on_stack lib/debugobjects.c:300 [inline]
>>  [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40 lib/debugobjects.c:326
>>  [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
>> lib/debugobjects.c:378
>>  [<ffffffff81287a93>] init_rcu_head_on_stack kernel/rcu/update.c:403 [inline]
>>  [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0 kernel/rcu/update.c:358
>>  [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
>> kernel/rcu/tree_plugin.h:678
>>  [<ffffffff81290287>] synchronize_rcu+0x27/0x90 kernel/rcu/tree_plugin.h:679
>>  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
>> net/l2tp/l2tp_core.c:1792
>>
>> The full report is here:
>> https://syzkaller.appspot.com/bug?extid=e6a19b585ab2dba3eee8
>
> This is beyond useless. That brings me to a google 'Sign in' page. Please
> use accessible storage. That information is hardly secrit.

Sorry. Here is the raw crash log: https://pastebin.com/raw/puvh0cXE
(The kernel logs are toward the end with the above).

>
>> It seems as per the code that the structure is on the stack so its
>> weird why debugobjects thinks its not.
>> The object in question is allocated on the stack by the __wait_rcu_gp
>> macro when its called from synchronize_rcu:
>>
>> #define _wait_rcu_gp(checktiny, ...) \
>> do {                                                                    \
>>         call_rcu_func_t __crcu_array[] = { __VA_ARGS__ };               \
>>         struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)];    \
>>         __wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array),              \
>>                         __crcu_array, __rs_array);                      \
>> } while (0)
>>
>>
>> Any debug ideas or thoughts about it?
>
> I assume it emitted:
>      pr_warn("object is not on stack, but annotated\n");
>

Yes that is the warning.

> before dumping the WARN_ON(). Right? If so, then you might have run into a
> stack corruption. But hard to tell. Please add something like this:
>
>      pr_warn("object %p is not on stack %p, but annotated\n", obj,
>              task_stack_page(current));

Sounds good, I will make this change into android-4.9 kernel for debug
purpose and hopefully the bot hits it soon (that's likely since its
been reported many times). Will let you know once we hit it.

Thanks a lot for the ideas,

- Joel

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: syzbot rcu/debugobjects warning
  2018-03-23 16:24   ` Joel Fernandes
@ 2018-03-23 20:41     ` Thomas Gleixner
  2018-03-25  6:29       ` Joel Fernandes
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2018-03-23 20:41 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Paul McKenney, LKML, Todd Poynor, netdev

On Fri, 23 Mar 2018, Joel Fernandes wrote:
> On Fri, Mar 23, 2018 at 2:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Thu, 22 Mar 2018, Joel Fernandes wrote:
> Sorry. Here is the raw crash log: https://pastebin.com/raw/puvh0cXE
> (The kernel logs are toward the end with the above).

And that is interesting:

[  150.629667]  <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
[  150.637051]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
[  150.643431]  [<ffffffff8142fbd1>] panic+0x1bc/0x3a8
[  150.648416]  [<ffffffff8142fa15>] ? percpu_up_read_preempt_enable.constprop.53+0xd7/0xd7
[  150.656611]  [<ffffffff81430835>] ? load_image_and_restore+0xf9/0xf9
[  150.663070]  [<ffffffff81269efd>] ? vprintk_default+0x1d/0x30
[  150.668925]  [<ffffffff81131879>] ? __warn+0x1a9/0x1e0
[  150.674170]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
[  150.680543]  [<ffffffff81131894>] __warn+0x1c4/0x1e0
[  150.685614]  [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40
[  150.691972]  [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40
[  150.698174]  [<ffffffff81dfea90>] ? debug_object_fixup+0x30/0x30
[  150.704283]  [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
[  150.710917]  [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0
[  150.716508]  [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
[  150.723054]  [<ffffffff81290150>] ? rcu_pm_notify+0xc0/0xc0
[  150.728735]  [<ffffffff81292bc0>] ? __call_rcu.constprop.72+0x910/0x910
[  150.735459]  [<ffffffff81235221>] ? __lock_is_held+0xa1/0xf0
[  150.741223]  [<ffffffff81290287>] synchronize_rcu+0x27/0x90

So this calls synchronize_rcu from a rcu callback. That's a nono. This is
on the back of an interrupt in softirq context and __wait_rcu_gp() can
sleep, which is obviously a bad idea in softirq context....

Cc'ed netdev ....

And that also explains the debug object splat because this is not running
on the task stack. It's running on the softirq stack ....

[  150.746908]  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
[  150.753281]  [<ffffffff8358891f>] ? __l2tp_session_unhash+0x1bf/0x550
[  150.759828]  [<ffffffff8114596a>] ? __local_bh_enable_ip+0x6a/0xd0
[  150.766123]  [<ffffffff8358ddb0>] ? l2tp_udp_encap_recv+0xd90/0xd90
[  150.772497]  [<ffffffff83588e97>] l2tp_tunnel_closeall+0x1e7/0x3a0
[  150.778782]  [<ffffffff835897be>] l2tp_tunnel_destruct+0x30e/0x5a0
[  150.785067]  [<ffffffff8358965a>] ? l2tp_tunnel_destruct+0x1aa/0x5a0
[  150.791537]  [<ffffffff835894b0>] ? l2tp_tunnel_del_work+0x460/0x460
[  150.797997]  [<ffffffff82ee8053>] __sk_destruct+0x53/0x570
[  150.803588]  [<ffffffff81293918>] rcu_process_callbacks+0x898/0x1300
[  150.810048]  [<ffffffff812939f7>] ? rcu_process_callbacks+0x977/0x1300
[  150.816684]  [<ffffffff82ee8000>] ? __sk_dst_check+0x240/0x240
[  150.822625]  [<ffffffff838be5d6>] __do_softirq+0x206/0x951
[  150.828223]  [<ffffffff81147315>] irq_exit+0x165/0x190
[  150.833557]  [<ffffffff838bd1eb>] smp_apic_timer_interrupt+0x7b/0xa0
[  150.840018]  [<ffffffff838b9470>] apic_timer_interrupt+0xa0/0xb0
[  150.846132]  <EOI> [  150.848166]  [<ffffffff838b6756>] ? native_safe_halt+0x6/0x10
[  150.854036]  [<ffffffff8123bf2d>] ? trace_hardirqs_on+0xd/0x10
[  150.859973]  [<ffffffff838b5d85>] default_idle+0x55/0x360
[  150.865478]  [<ffffffff8106be0a>] arch_cpu_idle+0xa/0x10
[  150.870896]  [<ffffffff838b6b96>] default_idle_call+0x36/0x60
[  150.876751]  [<ffffffff81226cb0>] cpu_startup_entry+0x2b0/0x380
[  150.882787]  [<ffffffff81226a00>] ? cpu_in_idle+0x20/0x20
[  150.888291]  [<ffffffff812d2343>] ? clockevents_register_device+0x123/0x200
[  150.895358]  [<ffffffff810b0693>] start_secondary+0x303/0x3e0
[  150.901209]  [<ffffffff810b0390>] ? set_cpu_sibling_map+0x11f0/0x11f0

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: syzbot rcu/debugobjects warning
  2018-03-23 20:41     ` Thomas Gleixner
@ 2018-03-25  6:29       ` Joel Fernandes
  2018-03-26 17:38         ` Guillaume Nault
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Fernandes @ 2018-03-25  6:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paul McKenney, LKML, Todd Poynor,
	open list:BPF (Safe dynamic programs and tools),
	Ben Hutchings, Greg Kroah-Hartman, Guillaume Nault

On Fri, Mar 23, 2018 at 1:41 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 23 Mar 2018, Joel Fernandes wrote:
>> On Fri, Mar 23, 2018 at 2:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Thu, 22 Mar 2018, Joel Fernandes wrote:
>> Sorry. Here is the raw crash log: https://pastebin.com/raw/puvh0cXE
>> (The kernel logs are toward the end with the above).
>
> And that is interesting:
>
> [  150.629667]  <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
> [  150.637051]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
> [  150.643431]  [<ffffffff8142fbd1>] panic+0x1bc/0x3a8
> [  150.648416]  [<ffffffff8142fa15>] ? percpu_up_read_preempt_enable.constprop.53+0xd7/0xd7
> [  150.656611]  [<ffffffff81430835>] ? load_image_and_restore+0xf9/0xf9
> [  150.663070]  [<ffffffff81269efd>] ? vprintk_default+0x1d/0x30
> [  150.668925]  [<ffffffff81131879>] ? __warn+0x1a9/0x1e0
> [  150.674170]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
> [  150.680543]  [<ffffffff81131894>] __warn+0x1c4/0x1e0
> [  150.685614]  [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40
> [  150.691972]  [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40
> [  150.698174]  [<ffffffff81dfea90>] ? debug_object_fixup+0x30/0x30
> [  150.704283]  [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
> [  150.710917]  [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0
> [  150.716508]  [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
> [  150.723054]  [<ffffffff81290150>] ? rcu_pm_notify+0xc0/0xc0
> [  150.728735]  [<ffffffff81292bc0>] ? __call_rcu.constprop.72+0x910/0x910
> [  150.735459]  [<ffffffff81235221>] ? __lock_is_held+0xa1/0xf0
> [  150.741223]  [<ffffffff81290287>] synchronize_rcu+0x27/0x90
>
> So this calls synchronize_rcu from a rcu callback. That's a nono. This is
> on the back of an interrupt in softirq context and __wait_rcu_gp() can
> sleep, which is obviously a bad idea in softirq context....
>
> Cc'ed netdev ....
>
> And that also explains the debug object splat because this is not running
> on the task stack. It's running on the softirq stack ....
>
> [  150.746908]  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
> [  150.753281]  [<ffffffff8358891f>] ? __l2tp_session_unhash+0x1bf/0x550
> [  150.759828]  [<ffffffff8114596a>] ? __local_bh_enable_ip+0x6a/0xd0
> [  150.766123]  [<ffffffff8358ddb0>] ? l2tp_udp_encap_recv+0xd90/0xd90
> [  150.772497]  [<ffffffff83588e97>] l2tp_tunnel_closeall+0x1e7/0x3a0
> [  150.778782]  [<ffffffff835897be>] l2tp_tunnel_destruct+0x30e/0x5a0
> [  150.785067]  [<ffffffff8358965a>] ? l2tp_tunnel_destruct+0x1aa/0x5a0
> [  150.791537]  [<ffffffff835894b0>] ? l2tp_tunnel_del_work+0x460/0x460
> [  150.797997]  [<ffffffff82ee8053>] __sk_destruct+0x53/0x570
> [  150.803588]  [<ffffffff81293918>] rcu_process_callbacks+0x898/0x1300
> [  150.810048]  [<ffffffff812939f7>] ? rcu_process_callbacks+0x977/0x1300
> [  150.816684]  [<ffffffff82ee8000>] ? __sk_dst_check+0x240/0x240
> [  150.822625]  [<ffffffff838be5d6>] __do_softirq+0x206/0x951
> [  150.828223]  [<ffffffff81147315>] irq_exit+0x165/0x190
> [  150.833557]  [<ffffffff838bd1eb>] smp_apic_timer_interrupt+0x7b/0xa0
> [  150.840018]  [<ffffffff838b9470>] apic_timer_interrupt+0xa0/0xb0
> [  150.846132]  <EOI> [  150.848166]  [<ffffffff838b6756>] ? native_safe_halt+0x6/0x10
> [  150.854036]  [<ffffffff8123bf2d>] ? trace_hardirqs_on+0xd/0x10
> [  150.859973]  [<ffffffff838b5d85>] default_idle+0x55/0x360
> [  150.865478]  [<ffffffff8106be0a>] arch_cpu_idle+0xa/0x10
> [  150.870896]  [<ffffffff838b6b96>] default_idle_call+0x36/0x60
> [  150.876751]  [<ffffffff81226cb0>] cpu_startup_entry+0x2b0/0x380
> [  150.882787]  [<ffffffff81226a00>] ? cpu_in_idle+0x20/0x20
> [  150.888291]  [<ffffffff812d2343>] ? clockevents_register_device+0x123/0x200
> [  150.895358]  [<ffffffff810b0693>] start_secondary+0x303/0x3e0
> [  150.901209]  [<ffffffff810b0390>] ? set_cpu_sibling_map+0x11f0/0x11f0

Thomas, thanks a lot. It appears this issue will not happen on
mainline since from commit  765924e362d1  (subject "l2tp: don't close
sessions in l2tp_tunnel_destruct()"), l2tp_tunnel_closeall is no
longer called from l2tp_tunnel_destruct. From that commit message it
seems one of the motivations is to solve scheduling from atomic issue.

However for this change to be applied to android-4.9 and/or 4.9
stable, it depends on several other l2p patches and they aren't
straight forward cherry-picks from mainline (and I don't have much
background with this driver).

v3.16.56 stable seems to be further along with l2tp than v4.9.89, in
that it atleast has more of the upstream patches adapted for it, that
the above patch depends on. Since this also related to stable, I am
CC'ing Greg kh and Ben.

Here are some of the commits in 3.16 stable that I couldn't find
applied to v4.9 stable. The above fix quotes the below patches as
dependencies so they would need to be stable backported. Also CC'ing
Guillaume since he authored the above mentioned fix.

0c15ddabbcf l2tp: don't register sessions in l2tp_session_create()
a3c5d5b70f4e l2tp: fix race condition in l2tp_tunnel_delete
5b216e8dcda2 l2tp: prevent creation of sessions on terminated tunnels
76ff5e22f1e0 l2tp: hold tunnel while looking up sessions in l2tp_netlink
ceb8f6b23a38 l2tp: define parameters of l2tp_session_get*() as "const"
0295d020b63f l2tp: initialise session's refcount before making it reachable
29a77518927e l2tp: take reference on sessions being dumped
b301c9b7782f l2tp: take a reference on sessions used in genetlink handlers

By the way I think the reason why scheduling while atomic checks
didn't show up is because the debugobjects warning caused a panic
first, before that could happen.

- Joel

PS: There's also 12d656af4e3d2 ("l2tp: Avoid schedule while atomic in
exit_net") which was fixing a call to synchronize_rcu in the same
path/function, but the caller originated from l2tp_exit_net. But this
patch is already in the stable trees. I am just mentioning it here for
completeness.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: syzbot rcu/debugobjects warning
  2018-03-25  6:29       ` Joel Fernandes
@ 2018-03-26 17:38         ` Guillaume Nault
  0 siblings, 0 replies; 6+ messages in thread
From: Guillaume Nault @ 2018-03-26 17:38 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Thomas Gleixner, Paul McKenney, LKML, Todd Poynor,
	open list:BPF (Safe dynamic programs and tools),
	Ben Hutchings, Greg Kroah-Hartman

On Sat, Mar 24, 2018 at 11:29:42PM -0700, Joel Fernandes wrote:
> On Fri, Mar 23, 2018 at 1:41 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 23 Mar 2018, Joel Fernandes wrote:
> >> On Fri, Mar 23, 2018 at 2:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > On Thu, 22 Mar 2018, Joel Fernandes wrote:
> >> Sorry. Here is the raw crash log: https://pastebin.com/raw/puvh0cXE
> >> (The kernel logs are toward the end with the above).
> >
> > And that is interesting:
> >
> > [  150.629667]  <IRQ> [  150.631700]  [<ffffffff81d96069>] dump_stack+0xc1/0x128
> > [  150.637051]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
> > [  150.643431]  [<ffffffff8142fbd1>] panic+0x1bc/0x3a8
> > [  150.648416]  [<ffffffff8142fa15>] ? percpu_up_read_preempt_enable.constprop.53+0xd7/0xd7
> > [  150.656611]  [<ffffffff81430835>] ? load_image_and_restore+0xf9/0xf9
> > [  150.663070]  [<ffffffff81269efd>] ? vprintk_default+0x1d/0x30
> > [  150.668925]  [<ffffffff81131879>] ? __warn+0x1a9/0x1e0
> > [  150.674170]  [<ffffffff81dfefb6>] ? __debug_object_init+0x526/0xc40
> > [  150.680543]  [<ffffffff81131894>] __warn+0x1c4/0x1e0
> > [  150.685614]  [<ffffffff81131afc>] warn_slowpath_null+0x2c/0x40
> > [  150.691972]  [<ffffffff81dfefb6>] __debug_object_init+0x526/0xc40
> > [  150.698174]  [<ffffffff81dfea90>] ? debug_object_fixup+0x30/0x30
> > [  150.704283]  [<ffffffff81dff709>] debug_object_init_on_stack+0x19/0x20
> > [  150.710917]  [<ffffffff81287a93>] __wait_rcu_gp+0x93/0x1b0
> > [  150.716508]  [<ffffffff81290251>] synchronize_rcu.part.65+0x101/0x110
> > [  150.723054]  [<ffffffff81290150>] ? rcu_pm_notify+0xc0/0xc0
> > [  150.728735]  [<ffffffff81292bc0>] ? __call_rcu.constprop.72+0x910/0x910
> > [  150.735459]  [<ffffffff81235221>] ? __lock_is_held+0xa1/0xf0
> > [  150.741223]  [<ffffffff81290287>] synchronize_rcu+0x27/0x90
> >
> > So this calls synchronize_rcu from a rcu callback. That's a nono. This is
> > on the back of an interrupt in softirq context and __wait_rcu_gp() can
> > sleep, which is obviously a bad idea in softirq context....
> >
> > Cc'ed netdev ....
> >
> > And that also explains the debug object splat because this is not running
> > on the task stack. It's running on the softirq stack ....
> >
> > [  150.746908]  [<ffffffff83588b35>] __l2tp_session_unhash+0x3d5/0x550
> > [  150.753281]  [<ffffffff8358891f>] ? __l2tp_session_unhash+0x1bf/0x550
> > [  150.759828]  [<ffffffff8114596a>] ? __local_bh_enable_ip+0x6a/0xd0
> > [  150.766123]  [<ffffffff8358ddb0>] ? l2tp_udp_encap_recv+0xd90/0xd90
> > [  150.772497]  [<ffffffff83588e97>] l2tp_tunnel_closeall+0x1e7/0x3a0
> > [  150.778782]  [<ffffffff835897be>] l2tp_tunnel_destruct+0x30e/0x5a0
> > [  150.785067]  [<ffffffff8358965a>] ? l2tp_tunnel_destruct+0x1aa/0x5a0
> > [  150.791537]  [<ffffffff835894b0>] ? l2tp_tunnel_del_work+0x460/0x460
> > [  150.797997]  [<ffffffff82ee8053>] __sk_destruct+0x53/0x570
> > [  150.803588]  [<ffffffff81293918>] rcu_process_callbacks+0x898/0x1300
> > [  150.810048]  [<ffffffff812939f7>] ? rcu_process_callbacks+0x977/0x1300
> > [  150.816684]  [<ffffffff82ee8000>] ? __sk_dst_check+0x240/0x240
> > [  150.822625]  [<ffffffff838be5d6>] __do_softirq+0x206/0x951
> > [  150.828223]  [<ffffffff81147315>] irq_exit+0x165/0x190
> > [  150.833557]  [<ffffffff838bd1eb>] smp_apic_timer_interrupt+0x7b/0xa0
> > [  150.840018]  [<ffffffff838b9470>] apic_timer_interrupt+0xa0/0xb0
> > [  150.846132]  <EOI> [  150.848166]  [<ffffffff838b6756>] ? native_safe_halt+0x6/0x10
> > [  150.854036]  [<ffffffff8123bf2d>] ? trace_hardirqs_on+0xd/0x10
> > [  150.859973]  [<ffffffff838b5d85>] default_idle+0x55/0x360
> > [  150.865478]  [<ffffffff8106be0a>] arch_cpu_idle+0xa/0x10
> > [  150.870896]  [<ffffffff838b6b96>] default_idle_call+0x36/0x60
> > [  150.876751]  [<ffffffff81226cb0>] cpu_startup_entry+0x2b0/0x380
> > [  150.882787]  [<ffffffff81226a00>] ? cpu_in_idle+0x20/0x20
> > [  150.888291]  [<ffffffff812d2343>] ? clockevents_register_device+0x123/0x200
> > [  150.895358]  [<ffffffff810b0693>] start_secondary+0x303/0x3e0
> > [  150.901209]  [<ffffffff810b0390>] ? set_cpu_sibling_map+0x11f0/0x11f0
> 
> Thomas, thanks a lot. It appears this issue will not happen on
> mainline since from commit  765924e362d1  (subject "l2tp: don't close
> sessions in l2tp_tunnel_destruct()"), l2tp_tunnel_closeall is no
> longer called from l2tp_tunnel_destruct. From that commit message it
> seems one of the motivations is to solve scheduling from atomic issue.
> 
I agree that this patch should fix the above splat.

> However for this change to be applied to android-4.9 and/or 4.9
> stable, it depends on several other l2p patches and they aren't
> straight forward cherry-picks from mainline (and I don't have much
> background with this driver).
> 
> v3.16.56 stable seems to be further along with l2tp than v4.9.89, in
> that it atleast has more of the upstream patches adapted for it, that
> the above patch depends on. Since this also related to stable, I am
> CC'ing Greg kh and Ben.
> 
I generally review l2tp patches proposed for -stable trees (although
not in time). If a patch has been ported to 3.16.y and is missing in
another tree, then it should be safe to port it there too.

> Here are some of the commits in 3.16 stable that I couldn't find
> applied to v4.9 stable. The above fix quotes the below patches as
> dependencies so they would need to be stable backported. Also CC'ing
> Guillaume since he authored the above mentioned fix.
> 
> 0c15ddabbcf l2tp: don't register sessions in l2tp_session_create()
> a3c5d5b70f4e l2tp: fix race condition in l2tp_tunnel_delete
> 5b216e8dcda2 l2tp: prevent creation of sessions on terminated tunnels
> 76ff5e22f1e0 l2tp: hold tunnel while looking up sessions in l2tp_netlink
> ceb8f6b23a38 l2tp: define parameters of l2tp_session_get*() as "const"
> 0295d020b63f l2tp: initialise session's refcount before making it reachable
> 29a77518927e l2tp: take reference on sessions being dumped
> b301c9b7782f l2tp: take a reference on sessions used in genetlink handlers
> 
Yes, I think it'd make sense to port these patches.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-03-26 17:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-23  2:51 syzbot rcu/debugobjects warning Joel Fernandes
2018-03-23  9:11 ` Thomas Gleixner
2018-03-23 16:24   ` Joel Fernandes
2018-03-23 20:41     ` Thomas Gleixner
2018-03-25  6:29       ` Joel Fernandes
2018-03-26 17:38         ` Guillaume Nault

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).