linux-kernel.vger.kernel.org archive mirror
* WARNING: ODEBUG bug in tcindex_destroy_work (3)
@ 2020-02-25  5:43 syzbot
  2020-03-16 23:47 ` syzbot
  2020-03-21  4:49 ` syzbot
  0 siblings, 2 replies; 16+ messages in thread
From: syzbot @ 2020-02-25  5:43 UTC (permalink / raw)
  To: davem, jhs, jiri, kuba, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    d2eee258 Merge tag 'for-5.6-rc2-tag' of git://git.kernel.o..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17fd8931e00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=3e57a6b450fb9883
dashboard link: https://syzkaller.appspot.com/bug?extid=46f513c3033d592409d2
compiler:       clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com

------------[ cut here ]------------
ODEBUG: free active (active state 0) object type: work_struct hint: tcindex_destroy_rexts_work+0x0/0xd0 net/sched/cls_tcindex.c:57
WARNING: CPU: 1 PID: 21 at lib/debugobjects.c:488 debug_print_object lib/debugobjects.c:485 [inline]
WARNING: CPU: 1 PID: 21 at lib/debugobjects.c:488 __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
WARNING: CPU: 1 PID: 21 at lib/debugobjects.c:488 debug_check_no_obj_freed+0x468/0x620 lib/debugobjects.c:998
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 21 Comm: kworker/u4:1 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: tc_filter_workqueue tcindex_destroy_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1fb/0x318 lib/dump_stack.c:118
 panic+0x264/0x7a9 kernel/panic.c:221
 __warn+0x209/0x210 kernel/panic.c:582
 report_bug+0x1b6/0x2f0 lib/bug.c:195
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 do_error_trap+0xcf/0x1c0 arch/x86/kernel/traps.c:267
 do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:286
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:debug_print_object lib/debugobjects.c:485 [inline]
RIP: 0010:__debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
RIP: 0010:debug_check_no_obj_freed+0x468/0x620 lib/debugobjects.c:998
Code: 08 48 89 df e8 f9 27 0f fe 4c 8b 03 48 c7 c7 75 eb f0 88 48 c7 c6 8f dc ee 88 4c 89 e2 44 89 f9 4d 89 e9 31 c0 e8 28 c4 a3 fd <0f> 0b 48 ba 00 00 00 00 00 fc ff df 4c 8b 6d b0 ff 05 b6 df c4 05
RSP: 0018:ffffc90000dd7be0 EFLAGS: 00010046
RAX: a8ee08c59d207d00 RBX: ffffffff892d11c0 RCX: ffff8880a9bf6580
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffffc90000dd7c78 R08: ffffffff81600324 R09: ffffed1015d64592
R10: ffffed1015d64592 R11: 0000000000000000 R12: ffffffff88f4c8da
R13: ffffffff869b5000 R14: ffff88802cc7c000 R15: 0000000000000000
 kfree+0xff/0x220 mm/slab.c:3756
 tcindex_destroy_work+0x3e/0x70 net/sched/cls_tcindex.c:231
 process_one_work+0x7f5/0x10f0 kernel/workqueue.c:2264
 worker_thread+0xbbc/0x1630 kernel/workqueue.c:2410
 kthread+0x332/0x350 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-02-25  5:43 WARNING: ODEBUG bug in tcindex_destroy_work (3) syzbot
@ 2020-03-16 23:47 ` syzbot
  2020-03-21 10:19   ` Thomas Gleixner
  2020-03-21  4:49 ` syzbot
  1 sibling, 1 reply; 16+ messages in thread
From: syzbot @ 2020-03-16 23:47 UTC (permalink / raw)
  To: davem, jhs, jiri, kuba, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

syzbot has found a reproducer for the following crash on:

HEAD commit:    74522e7b net: sched: set the hw_stats_type in pedit loop
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14c85173e00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b5acf5ac38a50651
dashboard link: https://syzkaller.appspot.com/bug?extid=46f513c3033d592409d2
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17bfff65e00000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com

------------[ cut here ]------------
ODEBUG: free active (active state 0) object type: work_struct hint: tcindex_destroy_rexts_work+0x0/0x20 net/sched/cls_tcindex.c:143
WARNING: CPU: 1 PID: 7 at lib/debugobjects.c:485 debug_print_object+0x160/0x250 lib/debugobjects.c:485
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7 Comm: kworker/u4:0 Not tainted 5.6.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: tc_filter_workqueue tcindex_destroy_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 panic+0x2e3/0x75c kernel/panic.c:221
 __warn.cold+0x2f/0x35 kernel/panic.c:582
 report_bug+0x27b/0x2f0 lib/bug.c:195
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 fixup_bug arch/x86/kernel/traps.c:169 [inline]
 do_error_trap+0x12b/0x220 arch/x86/kernel/traps.c:267
 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:286
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:debug_print_object+0x160/0x250 lib/debugobjects.c:485
Code: dd c0 fa 51 88 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 bf 00 00 00 48 8b 14 dd c0 fa 51 88 48 c7 c7 20 f0 51 88 e8 a8 bd b1 fd <0f> 0b 83 05 8b cf d3 06 01 48 83 c4 20 5b 5d 41 5c 41 5d c3 48 89
RSP: 0018:ffffc90000cdfc40 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815bfe61 RDI: fffff5200019bf7a
RBP: 0000000000000001 R08: ffff8880a95de1c0 R09: ffffed1015ce45c9
R10: ffffed1015ce45c8 R11: ffff8880ae722e43 R12: ffffffff8977aba0
R13: ffffffff814a9360 R14: ffff88807e278f98 R15: ffff88809835c6c8
 __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
 debug_check_no_obj_freed+0x2e1/0x445 lib/debugobjects.c:998
 kfree+0xf6/0x2b0 mm/slab.c:3756
 tcindex_destroy_work+0x2e/0x70 net/sched/cls_tcindex.c:231
 process_one_work+0x94b/0x1690 kernel/workqueue.c:2266
 worker_thread+0x96/0xe20 kernel/workqueue.c:2412
 kthread+0x357/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Kernel Offset: disabled
Rebooting in 86400 seconds..



* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-02-25  5:43 WARNING: ODEBUG bug in tcindex_destroy_work (3) syzbot
  2020-03-16 23:47 ` syzbot
@ 2020-03-21  4:49 ` syzbot
  2020-03-21  5:42   ` Dominik Brodowski
  1 sibling, 1 reply; 16+ messages in thread
From: syzbot @ 2020-03-21  4:49 UTC (permalink / raw)
  To: adam.zerella, davem, gregkh, hdanton, jhs, jiri, kuba,
	linux-kernel, linux, netdev, syzkaller-bugs, tglx,
	xiyou.wangcong

syzbot has bisected this bug to:

commit 836e9494f4485127a5b505ae57e4387bea8b53c4
Author: Adam Zerella <adam.zerella@gmail.com>
Date:   Sun Aug 25 05:35:10 2019 +0000

    pcmcia/i82092: Refactored dprintk macro for dev_dbg().

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=175cffe3e00000
start commit:   74522e7b net: sched: set the hw_stats_type in pedit loop
git tree:       net-next
final crash:    https://syzkaller.appspot.com/x/report.txt?x=14dcffe3e00000
console output: https://syzkaller.appspot.com/x/log.txt?x=10dcffe3e00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b5acf5ac38a50651
dashboard link: https://syzkaller.appspot.com/bug?extid=46f513c3033d592409d2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17bfff65e00000

Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
Fixes: 836e9494f448 ("pcmcia/i82092: Refactored dprintk macro for dev_dbg().")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-21  4:49 ` syzbot
@ 2020-03-21  5:42   ` Dominik Brodowski
  0 siblings, 0 replies; 16+ messages in thread
From: Dominik Brodowski @ 2020-03-21  5:42 UTC (permalink / raw)
  To: syzbot
  Cc: adam.zerella, davem, gregkh, hdanton, jhs, jiri, kuba,
	linux-kernel, netdev, syzkaller-bugs, tglx, xiyou.wangcong

On Fri, Mar 20, 2020 at 09:49:03PM -0700, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 836e9494f4485127a5b505ae57e4387bea8b53c4
> Author: Adam Zerella <adam.zerella@gmail.com>
> Date:   Sun Aug 25 05:35:10 2019 +0000
> 
>     pcmcia/i82092: Refactored dprintk macro for dev_dbg().
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=175cffe3e00000
> start commit:   74522e7b net: sched: set the hw_stats_type in pedit loop
> git tree:       net-next
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=14dcffe3e00000
> console output: https://syzkaller.appspot.com/x/log.txt?x=10dcffe3e00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b5acf5ac38a50651
> dashboard link: https://syzkaller.appspot.com/bug?extid=46f513c3033d592409d2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17bfff65e00000
> 
> Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
> Fixes: 836e9494f448 ("pcmcia/i82092: Refactored dprintk macro for dev_dbg().")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

That bisect evidently can't be right.

Thanks,
	Dominik


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-16 23:47 ` syzbot
@ 2020-03-21 10:19   ` Thomas Gleixner
  2020-03-23 17:48     ` Cong Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2020-03-21 10:19 UTC (permalink / raw)
  To: syzbot, davem, jhs, jiri, kuba, linux-kernel, netdev,
	syzkaller-bugs, xiyou.wangcong

syzbot <syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com> writes:
> syzbot has found a reproducer for the following crash on:
>
> HEAD commit:    74522e7b net: sched: set the hw_stats_type in pedit loop
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=14c85173e00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b5acf5ac38a50651
> dashboard link: https://syzkaller.appspot.com/bug?extid=46f513c3033d592409d2
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17bfff65e00000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
>
> ------------[ cut here ]------------
> ODEBUG: free active (active state 0) object type: work_struct hint: tcindex_destroy_rexts_work+0x0/0x20 net/sched/cls_tcindex.c:143
...
>  __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
>  debug_check_no_obj_freed+0x2e1/0x445 lib/debugobjects.c:998
>  kfree+0xf6/0x2b0 mm/slab.c:3756
>  tcindex_destroy_work+0x2e/0x70 net/sched/cls_tcindex.c:231

So this is:

	kfree(p->perfect);

Looking at the place which queues that work:

tcindex_destroy()

   if (p->perfect) {
        if (tcf_exts_get_net(&r->exts))
            tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
        else
            __tcindex_destroy_rexts(r);
   }

   .....
   
   tcf_queue_work(&p->rwork, tcindex_destroy_work);
   
So obviously if tcindex_destroy_work() runs before
tcindex_destroy_rexts_work() then the above happens.
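
The reason ODEBUG complains is that r->rwork is embedded in the memory
which kfree(p->perfect) releases. Roughly, with the fields abridged
from net/sched/cls_tcindex.c (not the complete structs):

    struct tcindex_filter_result {
            struct tcf_exts         exts;
            struct tcf_result       res;
            struct rcu_work         rwork;  /* tcindex_destroy_rexts_work */
    };

    struct tcindex_data {
            struct tcindex_filter_result *perfect;  /* array of results */
            /* ... */
            struct rcu_work         rwork;          /* tcindex_destroy_work */
    };

So when tcindex_destroy_work() does kfree(p->perfect) while one of the
embedded rwork items is still pending, debugobjects reports exactly the
"free active object" warning above.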

Thanks,

        tglx


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-21 10:19   ` Thomas Gleixner
@ 2020-03-23 17:48     ` Cong Wang
  2020-03-23 21:14       ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Cong Wang @ 2020-03-23 17:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, LKML, Linux Kernel Network Developers,
	syzkaller-bugs

On Sat, Mar 21, 2020 at 3:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> syzbot <syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com> writes:
> > syzbot has found a reproducer for the following crash on:
> >
> > HEAD commit:    74522e7b net: sched: set the hw_stats_type in pedit loop
> > git tree:       net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14c85173e00000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=b5acf5ac38a50651
> > dashboard link: https://syzkaller.appspot.com/bug?extid=46f513c3033d592409d2
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17bfff65e00000
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > ODEBUG: free active (active state 0) object type: work_struct hint: tcindex_destroy_rexts_work+0x0/0x20 net/sched/cls_tcindex.c:143
> ...
> >  __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
> >  debug_check_no_obj_freed+0x2e1/0x445 lib/debugobjects.c:998
> >  kfree+0xf6/0x2b0 mm/slab.c:3756
> >  tcindex_destroy_work+0x2e/0x70 net/sched/cls_tcindex.c:231
>
> So this is:
>
>         kfree(p->perfect);
>
> Looking at the place which queues that work:
>
> tcindex_destroy()
>
>    if (p->perfect) {
>         if (tcf_exts_get_net(&r->exts))
>             tcf_queue_work(&r-rwork, tcindex_destroy_rexts_work);
>         else
>             __tcindex_destroy_rexts(r)
>    }
>
>    .....
>
>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
>
> So obviously if tcindex_destroy_work() runs before
> tcindex_destroy_rexts_work() then the above happens.

We use an ordered workqueue for tc filters, so these two
works are executed in the same order as they are queued.
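
For reference, that workqueue is set up and used roughly like this
(paraphrased from net/sched/cls_api.c, not the verbatim code):

    /* module init */
    tc_filter_wq = alloc_ordered_workqueue("tc_filter_workqueue", 0);

    bool tcf_queue_work(struct rcu_work *rwork, work_func_t func)
    {
            INIT_RCU_WORK(rwork, func);
            return queue_rcu_work(tc_filter_wq, rwork);
    }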

Thanks.


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-23 17:48     ` Cong Wang
@ 2020-03-23 21:14       ` Thomas Gleixner
  2020-03-23 23:14         ` Cong Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2020-03-23 21:14 UTC (permalink / raw)
  To: Cong Wang
  Cc: syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, LKML, Linux Kernel Network Developers,
	syzkaller-bugs

Cong Wang <xiyou.wangcong@gmail.com> writes:
> On Sat, Mar 21, 2020 at 3:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > ------------[ cut here ]------------
>> > ODEBUG: free active (active state 0) object type: work_struct hint: tcindex_destroy_rexts_work+0x0/0x20 net/sched/cls_tcindex.c:143
>> ...
>> >  __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
>> >  debug_check_no_obj_freed+0x2e1/0x445 lib/debugobjects.c:998
>> >  kfree+0xf6/0x2b0 mm/slab.c:3756
>> >  tcindex_destroy_work+0x2e/0x70 net/sched/cls_tcindex.c:231
>>
>> So this is:
>>
>>         kfree(p->perfect);
>>
>> Looking at the place which queues that work:
>>
>> tcindex_destroy()
>>
>>    if (p->perfect) {
>>         if (tcf_exts_get_net(&r->exts))
>>             tcf_queue_work(&r-rwork, tcindex_destroy_rexts_work);
>>         else
>>             __tcindex_destroy_rexts(r)
>>    }
>>
>>    .....
>>
>>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
>>
>> So obviously if tcindex_destroy_work() runs before
>> tcindex_destroy_rexts_work() then the above happens.
>
> We use an ordered workqueue for tc filters, so these two
> works are executed in the same order as they are queued.

The workqueue is ordered, but look how the work is queued on the work
queue:

tcf_queue_work()
  queue_rcu_work()
    call_rcu(&rwork->rcu, rcu_work_rcufn);

So after the grace period elapses rcu_work_rcufn() queues it in the
actual work queue.
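
Roughly, paraphrased from kernel/workqueue.c (not the verbatim code):

    bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
    {
            struct work_struct *work = &rwork->work;

            if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
                                  work_data_bits(work))) {
                    rwork->wq = wq;
                    /* workqueue ordering guarantees stop here */
                    call_rcu(&rwork->rcu, rcu_work_rcufn);
                    return true;
            }
            return false;
    }

    static void rcu_work_rcufn(struct rcu_head *rcu)
    {
            struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);

            /* only now is the work put on the (ordered) workqueue */
            local_irq_disable();
            __queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
            local_irq_enable();
    }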

Now tcindex_destroy() is invoked via tcf_proto_destroy(), which can be
invoked from preemptible context. Now assume the following:

CPU0
  tcf_queue_work()
    tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);

-> Migration

CPU1
   tcf_queue_work(&p->rwork, tcindex_destroy_work);

So your RCU callbacks can be placed on different CPUs which obviously
has no ordering guarantee at all. See also:

  https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/RCUCBordering.html

Disabling preemption would "fix" it today, but that documentation
explicitly says that this is an implementation detail and not
guaranteed by design.

Thanks,

        tglx



* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-23 21:14       ` Thomas Gleixner
@ 2020-03-23 23:14         ` Cong Wang
  2020-03-24  1:01           ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Cong Wang @ 2020-03-23 23:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, LKML, Linux Kernel Network Developers,
	syzkaller-bugs

On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
> > On Sat, Mar 21, 2020 at 3:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > ------------[ cut here ]------------
> >> > ODEBUG: free active (active state 0) object type: work_struct hint: tcindex_destroy_rexts_work+0x0/0x20 net/sched/cls_tcindex.c:143
> >> ...
> >> >  __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
> >> >  debug_check_no_obj_freed+0x2e1/0x445 lib/debugobjects.c:998
> >> >  kfree+0xf6/0x2b0 mm/slab.c:3756
> >> >  tcindex_destroy_work+0x2e/0x70 net/sched/cls_tcindex.c:231
> >>
> >> So this is:
> >>
> >>         kfree(p->perfect);
> >>
> >> Looking at the place which queues that work:
> >>
> >> tcindex_destroy()
> >>
> >>    if (p->perfect) {
> >>         if (tcf_exts_get_net(&r->exts))
> >>             tcf_queue_work(&r-rwork, tcindex_destroy_rexts_work);
> >>         else
> >>             __tcindex_destroy_rexts(r)
> >>    }
> >>
> >>    .....
> >>
> >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> >>
> >> So obviously if tcindex_destroy_work() runs before
> >> tcindex_destroy_rexts_work() then the above happens.
> >
> > We use an ordered workqueue for tc filters, so these two
> > works are executed in the same order as they are queued.
>
> The workqueue is ordered, but look how the work is queued on the work
> queue:
>
> tcf_queue_work()
>   queue_rcu_work()
>     call_rcu(&rwork->rcu, rcu_work_rcufn);
>
> So after the grace period elapses rcu_work_rcufn() queues it in the
> actual work queue.
>
> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> invoked from preemtible context. Now assume the following:
>
> CPU0
>   tcf_queue_work()
>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
>
> -> Migration
>
> CPU1
>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
>
> So your RCU callbacks can be placed on different CPUs which obviously
> has no ordering guarantee at all. See also:

Good catch!

I thought about this when I added this ordered workqueue, but it
seems I misinterpreted max_active: even with max_active == 1, more
than one work can still be queued here from different CPUs.

I don't know how to fix this properly. I think RCU work should
essentially be guaranteed the same ordering as regular work, but this
seems impossible unless RCU offers some API to achieve that.

Thanks.


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-23 23:14         ` Cong Wang
@ 2020-03-24  1:01           ` Thomas Gleixner
  2020-03-24  2:05             ` Paul E. McKenney
  2020-03-25 18:36             ` Cong Wang
  0 siblings, 2 replies; 16+ messages in thread
From: Thomas Gleixner @ 2020-03-24  1:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, LKML, Linux Kernel Network Developers,
	syzkaller-bugs, Paul E . McKenney

Cong Wang <xiyou.wangcong@gmail.com> writes:
> On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > We use an ordered workqueue for tc filters, so these two
>> > works are executed in the same order as they are queued.
>>
>> The workqueue is ordered, but look how the work is queued on the work
>> queue:
>>
>> tcf_queue_work()
>>   queue_rcu_work()
>>     call_rcu(&rwork->rcu, rcu_work_rcufn);
>>
>> So after the grace period elapses rcu_work_rcufn() queues it in the
>> actual work queue.
>>
>> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
>> invoked from preemtible context. Now assume the following:
>>
>> CPU0
>>   tcf_queue_work()
>>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
>>
>> -> Migration
>>
>> CPU1
>>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
>>
>> So your RCU callbacks can be placed on different CPUs which obviously
>> has no ordering guarantee at all. See also:
>
> Good catch!
>
> I thought about this when I added this ordered workqueue, but it
> seems I misinterpret max_active, so despite we have max_active==1,
> more than 1 work could still be queued on different CPU's here.

The workqueue is not the problem; it works perfectly fine. The way
the work gets queued is the issue.

> I don't know how to fix this properly, I think essentially RCU work
> should be guaranteed the same ordering with regular work. But this
> seems impossible unless RCU offers some API to achieve that.

I don't think that's possible w/o putting constraints on the flexibility
of RCU (Paul of course might disagree).

I assume that the filters which hang off tcindex_data::perfect and
tcindex_data::p must be freed before the tcindex_data itself, right?

Refcounting of tcindex_data should do the trick: any element which
you add to a tcindex_data instance takes a reference, and when that
element is destroyed the rcu/work callback drops the reference; once
the count reaches 0, the tcindex_data is freed.
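
A minimal sketch of that idea (hypothetical, not a tested patch; it
assumes tcindex_filter_result gains a backpointer r->p to its
tcindex_data):

    struct tcindex_data {
            /* ... */
            refcount_t refcnt;      /* 1 + one reference per pending rwork */
            struct rcu_work rwork;
    };

    static void tcindex_data_put(struct tcindex_data *p)
    {
            if (refcount_dec_and_test(&p->refcnt)) {
                    kfree(p->perfect);
                    kfree(p->h);
                    kfree(p);
            }
    }

tcindex_destroy_rexts_work() would then end with tcindex_data_put(r->p),
and tcindex_destroy_work() would drop the final reference instead of
calling kfree() directly, so whichever callback finishes last frees the
tcindex_data.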

Thanks,

        tglx


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-24  1:01           ` Thomas Gleixner
@ 2020-03-24  2:05             ` Paul E. McKenney
  2020-03-25 18:53               ` Cong Wang
  2020-03-25 18:36             ` Cong Wang
  1 sibling, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-03-24  2:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Cong Wang, syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, LKML, Linux Kernel Network Developers,
	syzkaller-bugs

On Tue, Mar 24, 2020 at 02:01:13AM +0100, Thomas Gleixner wrote:
> Cong Wang <xiyou.wangcong@gmail.com> writes:
> > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > We use an ordered workqueue for tc filters, so these two
> >> > works are executed in the same order as they are queued.
> >>
> >> The workqueue is ordered, but look how the work is queued on the work
> >> queue:
> >>
> >> tcf_queue_work()
> >>   queue_rcu_work()
> >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> >>
> >> So after the grace period elapses rcu_work_rcufn() queues it in the
> >> actual work queue.
> >>
> >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> >> invoked from preemtible context. Now assume the following:
> >>
> >> CPU0
> >>   tcf_queue_work()
> >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> >>
> >> -> Migration
> >>
> >> CPU1
> >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> >>
> >> So your RCU callbacks can be placed on different CPUs which obviously
> >> has no ordering guarantee at all. See also:
> >
> > Good catch!
> >
> > I thought about this when I added this ordered workqueue, but it
> > seems I misinterpret max_active, so despite we have max_active==1,
> > more than 1 work could still be queued on different CPU's here.
> 
> The workqueue is not the problem. it works perfectly fine. The way how
> the work gets queued is the issue.
> 
> > I don't know how to fix this properly, I think essentially RCU work
> > should be guaranteed the same ordering with regular work. But this
> > seems impossible unless RCU offers some API to achieve that.
> 
> I don't think that's possible w/o putting constraints on the flexibility
> of RCU (Paul of course might disagree).

It is possible, but it does not come for free.

From an RCU/workqueues perspective, if I understand the scenario, you
can do the following:

	tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);

	rcu_barrier(); // Wait for the RCU callback.
	flush_work(...); // Wait for the workqueue handler.
			 // But maybe for quite a few of them...

	// All the earlier handlers have completed.
	tcf_queue_work(&p->rwork, tcindex_destroy_work);

This of course introduces overhead and latency.  Maybe that is not a
problem at teardown time, or maybe the final tcf_queue_work() can itself
be dumped into a workqueue in order to get it off of the critical path.

However, depending on your constraints ...

> I assume that the filters which hang of tcindex_data::perfect and
> tcindex_data:p must be freed before tcindex_data, right?
> 
> Refcounting of tcindex_data should do the trick. I.e. any element which
> you add to a tcindex_data instance takes a refcount and when that is
> destroyed then the rcu/work callback drops a reference which once it
> reaches 0 triggers tcindex_data to be freed.

... reference counts might work much better for you.

							Thanx, Paul


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-24  1:01           ` Thomas Gleixner
  2020-03-24  2:05             ` Paul E. McKenney
@ 2020-03-25 18:36             ` Cong Wang
  2020-03-25 18:58               ` Paul E. McKenney
  1 sibling, 1 reply; 16+ messages in thread
From: Cong Wang @ 2020-03-25 18:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko,
	Jakub Kicinski, LKML, Linux Kernel Network Developers,
	syzkaller-bugs, Paul E . McKenney

On Mon, Mar 23, 2020 at 6:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
> > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > We use an ordered workqueue for tc filters, so these two
> >> > works are executed in the same order as they are queued.
> >>
> >> The workqueue is ordered, but look how the work is queued on the work
> >> queue:
> >>
> >> tcf_queue_work()
> >>   queue_rcu_work()
> >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> >>
> >> So after the grace period elapses rcu_work_rcufn() queues it in the
> >> actual work queue.
> >>
> >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> >> invoked from preemtible context. Now assume the following:
> >>
> >> CPU0
> >>   tcf_queue_work()
> >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> >>
> >> -> Migration
> >>
> >> CPU1
> >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> >>
> >> So your RCU callbacks can be placed on different CPUs which obviously
> >> has no ordering guarantee at all. See also:
> >
> > Good catch!
> >
> > I thought about this when I added this ordered workqueue, but it
> > seems I misinterpret max_active, so despite we have max_active==1,
> > more than 1 work could still be queued on different CPU's here.
>
> The workqueue is not the problem. it works perfectly fine. The way how
> the work gets queued is the issue.

Well, an RCU work is also a work, so from the user's perspective the
ordered workqueue should apply to RCU works too. Users should not
need to learn that queue_rcu_work() is actually a call_rcu(), which
does not preserve ordering even for an ordered workqueue.


> > I don't know how to fix this properly, I think essentially RCU work
> > should be guaranteed the same ordering with regular work. But this
> > seems impossible unless RCU offers some API to achieve that.
>
> I don't think that's possible w/o putting constraints on the flexibility
> of RCU (Paul of course might disagree).
>
> I assume that the filters which hang of tcindex_data::perfect and
> tcindex_data:p must be freed before tcindex_data, right?
>
> Refcounting of tcindex_data should do the trick. I.e. any element which
> you add to a tcindex_data instance takes a refcount and when that is
> destroyed then the rcu/work callback drops a reference which once it
> reaches 0 triggers tcindex_data to be freed.

Yeah, but the problem is more than just the tcindex filter; we have
many places that make the same assumption about ordering.

Thanks!


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-24  2:05             ` Paul E. McKenney
@ 2020-03-25 18:53               ` Cong Wang
  2020-03-25 19:07                 ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Cong Wang @ 2020-03-25 18:53 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: Thomas Gleixner, syzbot, David Miller, Jamal Hadi Salim,
	Jiri Pirko, Jakub Kicinski, LKML,
	Linux Kernel Network Developers, syzkaller-bugs

On Mon, Mar 23, 2020 at 7:05 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Mar 24, 2020 at 02:01:13AM +0100, Thomas Gleixner wrote:
> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >> > We use an ordered workqueue for tc filters, so these two
> > >> > works are executed in the same order as they are queued.
> > >>
> > >> The workqueue is ordered, but look how the work is queued on the work
> > >> queue:
> > >>
> > >> tcf_queue_work()
> > >>   queue_rcu_work()
> > >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> > >>
> > >> So after the grace period elapses rcu_work_rcufn() queues it in the
> > >> actual work queue.
> > >>
> > >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> > >> invoked from preemtible context. Now assume the following:
> > >>
> > >> CPU0
> > >>   tcf_queue_work()
> > >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > >>
> > >> -> Migration
> > >>
> > >> CPU1
> > >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > >>
> > >> So your RCU callbacks can be placed on different CPUs which obviously
> > >> has no ordering guarantee at all. See also:
> > >
> > > Good catch!
> > >
> > > I thought about this when I added this ordered workqueue, but it
> > > seems I misinterpret max_active, so despite we have max_active==1,
> > > more than 1 work could still be queued on different CPU's here.
> >
> > The workqueue is not the problem. it works perfectly fine. The way how
> > the work gets queued is the issue.
> >
> > > I don't know how to fix this properly, I think essentially RCU work
> > > should be guaranteed the same ordering with regular work. But this
> > > seems impossible unless RCU offers some API to achieve that.
> >
> > I don't think that's possible w/o putting constraints on the flexibility
> > of RCU (Paul of course might disagree).
>
> It is possible, but it does not come for free.
>
> From an RCU/workqueues perspective, if I understand the scenario, you
> can do the following:
>
>         tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
>
>         rcu_barrier(); // Wait for the RCU callback.
>         flush_work(...); // Wait for the workqueue handler.
>                          // But maybe for quite a few of them...
>
>         // All the earlier handlers have completed.
>         tcf_queue_work(&p->rwork, tcindex_destroy_work);
>
> This of course introduces overhead and latency.  Maybe that is not a
> problem at teardown time, or maybe the final tcf_queue_work() can itself
> be dumped into a workqueue in order to get it off of the critical path.

I personally agree, but nowadays NIC vendors care about tc filter
slow path performance as well. :-/


>
> However, depending on your constraints ...
>
> > I assume that the filters which hang of tcindex_data::perfect and
> > tcindex_data:p must be freed before tcindex_data, right?
> >
> > Refcounting of tcindex_data should do the trick. I.e. any element which
> > you add to a tcindex_data instance takes a refcount and when that is
> > destroyed then the rcu/work callback drops a reference which once it
> > reaches 0 triggers tcindex_data to be freed.
>
> ... reference counts might work much better for you.
>

I need to think about how much work is needed for refcounting, given
that other filters make the same assumption.

Thanks.


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-25 18:36             ` Cong Wang
@ 2020-03-25 18:58               ` Paul E. McKenney
  2020-03-28 19:53                 ` Cong Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-03-25 18:58 UTC (permalink / raw)
  To: Cong Wang
  Cc: Thomas Gleixner, syzbot, David Miller, Jamal Hadi Salim,
	Jiri Pirko, Jakub Kicinski, LKML,
	Linux Kernel Network Developers, syzkaller-bugs

On Wed, Mar 25, 2020 at 11:36:16AM -0700, Cong Wang wrote:
> On Mon, Mar 23, 2020 at 6:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >> > We use an ordered workqueue for tc filters, so these two
> > >> > works are executed in the same order as they are queued.
> > >>
> > >> The workqueue is ordered, but look how the work is queued on the work
> > >> queue:
> > >>
> > >> tcf_queue_work()
> > >>   queue_rcu_work()
> > >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> > >>
> > >> So after the grace period elapses rcu_work_rcufn() queues it in the
> > >> actual work queue.
> > >>
> > >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> > >> invoked from preemtible context. Now assume the following:
> > >>
> > >> CPU0
> > >>   tcf_queue_work()
> > >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > >>
> > >> -> Migration
> > >>
> > >> CPU1
> > >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > >>
> > >> So your RCU callbacks can be placed on different CPUs which obviously
> > >> has no ordering guarantee at all. See also:
> > >
> > > Good catch!
> > >
> > > I thought about this when I added this ordered workqueue, but it
> > > seems I misinterpret max_active, so despite we have max_active==1,
> > > more than 1 work could still be queued on different CPU's here.
> >
> > The workqueue is not the problem. it works perfectly fine. The way how
> > the work gets queued is the issue.
> 
> Well, a RCU work is also a work, so the ordered workqueue should
> apply to RCU works too, from users' perspective. Users should not
> need to learn queue_rcu_work() is actually a call_rcu() which does
> not guarantee the ordering for an ordered workqueue.

And the workqueues might well guarantee the ordering in cases where the
pair of RCU callbacks is invoked in a known order.  But that workqueue
ordering guarantee does not extend upstream to RCU, nor do I know of a
reasonable way to make this happen within the confines of RCU.

If you have ideas, please do not keep them a secret, but please also
understand that call_rcu() must meet some pretty severe performance and
scalability constraints.

I suppose that queue_rcu_work() could track outstanding call_rcu()
invocations, and (one way or another) defer the second queue_rcu_work()
if a first one is still pending from the current task, but that might not
make the common-case user of queue_rcu_work() all that happy.  But perhaps
there is a way to restrict these semantics to ordered workqueues.  In that
case, one could imagine the second and subsequent too-quick calls to
queue_rcu_work() using the rcu_head structure's ->next field to queue these
too-quick callbacks, and then having rcu_work_rcufn() check for queued
too-quick callbacks and queue the first one.

But I must defer to Tejun on this one.

And one additional caution...  This would meter out ordered
queue_rcu_work() requests at a rate of no faster than one per RCU
grace period.  The queue might build up, resulting in long delays.
Are you sure that your use case can live with this?

> > > I don't know how to fix this properly, I think essentially RCU work
> > > should be guaranteed the same ordering with regular work. But this
> > > seems impossible unless RCU offers some API to achieve that.
> >
> > I don't think that's possible w/o putting constraints on the flexibility
> > of RCU (Paul of course might disagree).
> >
> > I assume that the filters which hang of tcindex_data::perfect and
> > tcindex_data:p must be freed before tcindex_data, right?
> >
> > Refcounting of tcindex_data should do the trick. I.e. any element which
> > you add to a tcindex_data instance takes a refcount and when that is
> > destroyed then the rcu/work callback drops a reference which once it
> > reaches 0 triggers tcindex_data to be freed.
> 
> Yeah, but the problem is more than just tcindex filter, we have many
> places make the same assumption of ordering.

But don't you also have a situation where there might be a large group
of queue_rcu_work() invocations whose order doesn't matter, followed by a
single queue_rcu_work() invocation that must be ordered after the earlier
group?  If so, ordering -all- of these invocations might be overkill.

Or did I misread your code?

							Thanx, Paul


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-25 18:53               ` Cong Wang
@ 2020-03-25 19:07                 ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-03-25 19:07 UTC (permalink / raw)
  To: Cong Wang
  Cc: Thomas Gleixner, syzbot, David Miller, Jamal Hadi Salim,
	Jiri Pirko, Jakub Kicinski, LKML,
	Linux Kernel Network Developers, syzkaller-bugs

On Wed, Mar 25, 2020 at 11:53:51AM -0700, Cong Wang wrote:
> On Mon, Mar 23, 2020 at 7:05 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Mar 24, 2020 at 02:01:13AM +0100, Thomas Gleixner wrote:
> > > Cong Wang <xiyou.wangcong@gmail.com> writes:
> > > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >> > We use an ordered workqueue for tc filters, so these two
> > > >> > works are executed in the same order as they are queued.
> > > >>
> > > >> The workqueue is ordered, but look how the work is queued on the work
> > > >> queue:
> > > >>
> > > >> tcf_queue_work()
> > > >>   queue_rcu_work()
> > > >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> > > >>
> > > >> So after the grace period elapses rcu_work_rcufn() queues it in the
> > > >> actual work queue.
> > > >>
> > > >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> > > >> invoked from preemtible context. Now assume the following:
> > > >>
> > > >> CPU0
> > > >>   tcf_queue_work()
> > > >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > > >>
> > > >> -> Migration
> > > >>
> > > >> CPU1
> > > >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > > >>
> > > >> So your RCU callbacks can be placed on different CPUs which obviously
> > > >> has no ordering guarantee at all. See also:
> > > >
> > > > Good catch!
> > > >
> > > > I thought about this when I added this ordered workqueue, but it
> > > > seems I misinterpret max_active, so despite we have max_active==1,
> > > > more than 1 work could still be queued on different CPU's here.
> > >
> > > The workqueue is not the problem. it works perfectly fine. The way how
> > > the work gets queued is the issue.
> > >
> > > > I don't know how to fix this properly, I think essentially RCU work
> > > > should be guaranteed the same ordering with regular work. But this
> > > > seems impossible unless RCU offers some API to achieve that.
> > >
> > > I don't think that's possible w/o putting constraints on the flexibility
> > > of RCU (Paul of course might disagree).
> >
> > It is possible, but it does not come for free.
> >
> > From an RCU/workqueues perspective, if I understand the scenario, you
> > can do the following:
> >
> >         tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> >
> >         rcu_barrier(); // Wait for the RCU callback.
> >         flush_work(...); // Wait for the workqueue handler.
> >                          // But maybe for quite a few of them...
> >
> >         // All the earlier handlers have completed.
> >         tcf_queue_work(&p->rwork, tcindex_destroy_work);
> >
> > This of course introduces overhead and latency.  Maybe that is not a
> > problem at teardown time, or maybe the final tcf_queue_work() can itself
> > be dumped into a workqueue in order to get it off of the critical path.
> 
> I personally agree, but nowadays NIC vendors care about tc filter
> slow path performance as well. :-/

I bet that they do!  ;-)

But have you actually tried the rcu_barrier()/flush_work() approach?
It is reasonably simple to implement, even if you do have to use multiple
flush_work() calls, and who knows?  Maybe it is fast enough.

So why not try it?

Hmmm...  Another approach would be to associate a counter with this group
of requests, incrementing this count in tcf_queue_work() and capturing
the prior value of the count, then in tcindex_destroy_work() waiting
for the count to reach the captured value.  This would avoid needless
waiting in tcindex_destroy_rexts_work().  Such needless waiting would
be hard to avoid within the confines of either RCU or workqueues, and
such needless waiting might well slow down this slowpath, which again
might make the NIC vendors unhappy.

> > However, depending on your constraints ...
> >
> > > I assume that the filters which hang of tcindex_data::perfect and
> > > tcindex_data:p must be freed before tcindex_data, right?
> > >
> > > Refcounting of tcindex_data should do the trick. I.e. any element which
> > > you add to a tcindex_data instance takes a refcount and when that is
> > > destroyed then the rcu/work callback drops a reference which once it
> > > reaches 0 triggers tcindex_data to be freed.
> >
> > ... reference counts might work much better for you.
> 
> I need to think about how much work is needed for refcnting, given
> other filters have the same assumption.

Perhaps you can create some common code to handle this situation, which
can then be shared among those various filters.

							Thanx, Paul


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-25 18:58               ` Paul E. McKenney
@ 2020-03-28 19:53                 ` Cong Wang
  2020-03-30 13:37                   ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Cong Wang @ 2020-03-28 19:53 UTC (permalink / raw)
  To: Paul E . McKenney
  Cc: Thomas Gleixner, syzbot, David Miller, Jamal Hadi Salim,
	Jiri Pirko, Jakub Kicinski, LKML,
	Linux Kernel Network Developers, syzkaller-bugs

On Wed, Mar 25, 2020 at 11:58 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Mar 25, 2020 at 11:36:16AM -0700, Cong Wang wrote:
> > On Mon, Mar 23, 2020 at 6:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Cong Wang <xiyou.wangcong@gmail.com> writes:
> > > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >> > We use an ordered workqueue for tc filters, so these two
> > > >> > works are executed in the same order as they are queued.
> > > >>
> > > >> The workqueue is ordered, but look how the work is queued on the work
> > > >> queue:
> > > >>
> > > >> tcf_queue_work()
> > > >>   queue_rcu_work()
> > > >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> > > >>
> > > >> So after the grace period elapses rcu_work_rcufn() queues it in the
> > > >> actual work queue.
> > > >>
> > > >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> > > >> invoked from preemtible context. Now assume the following:
> > > >>
> > > >> CPU0
> > > >>   tcf_queue_work()
> > > >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > > >>
> > > >> -> Migration
> > > >>
> > > >> CPU1
> > > >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > > >>
> > > >> So your RCU callbacks can be placed on different CPUs which obviously
> > > >> has no ordering guarantee at all. See also:
> > > >
> > > > Good catch!
> > > >
> > > > I thought about this when I added this ordered workqueue, but it
> > > > seems I misinterpret max_active, so despite we have max_active==1,
> > > > more than 1 work could still be queued on different CPU's here.
> > >
> > > The workqueue is not the problem. it works perfectly fine. The way how
> > > the work gets queued is the issue.
> >
> > Well, a RCU work is also a work, so the ordered workqueue should
> > apply to RCU works too, from users' perspective. Users should not
> > need to learn queue_rcu_work() is actually a call_rcu() which does
> > not guarantee the ordering for an ordered workqueue.
>
> And the workqueues might well guarantee the ordering in cases where the
> pair of RCU callbacks are invoked in a known order.  But that workqueues
> ordering guarantee does not extend upstream to RCU, nor do I know of a
> reasonable way to make this happen within the confines of RCU.
>
> If you have ideas, please do not keep them a secret, but please also
> understand that call_rcu() must meet some pretty severe performance and
> scalability constraints.
>
> I suppose that queue_rcu_work() could track outstanding call_rcu()
> invocations, and (one way or another) defer the second queue_rcu_work()
> if a first one is still pending from the current task, but that might not
> make the common-case user of queue_rcu_work() all that happy.  But perhaps
> there is a way to restrict these semantics to ordered workqueues.  In that
> case, one could imagine the second and subsequent too-quick call to
> queue_rcu_work() using the rcu_head structure's ->next field to queue these
> too-quick callbacks, and then having rcu_work_rcufn() check for queued
> too-quick callbacks, queuing the first one.
>
> But I must defer to Tejun on this one.
>
> And one additional caution...  This would meter out ordered
> queue_rcu_work() requests at a rate of no faster than one per RCU
> grace period.  The queue might build up, resulting in long delays.
> Are you sure that your use case can live with this?

I don't know. I guess we might be able to add a call_rcu() variant that
takes a CPU as a parameter, so that all of these call_rcu() callbacks
are queued on the same CPU, which would guarantee the ordering. But of
course we need to figure out which CPU to use. :)
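
Something like this, purely hypothetical (no such API exists today):

    /* hypothetical, for illustration only -- not an existing RCU API */
    void call_rcu_on_cpu(int cpu, struct rcu_head *head, rcu_callback_t func);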

Just my two cents.


>
> > > > I don't know how to fix this properly, I think essentially RCU work
> > > > should be guaranteed the same ordering with regular work. But this
> > > > seems impossible unless RCU offers some API to achieve that.
> > >
> > > I don't think that's possible w/o putting constraints on the flexibility
> > > of RCU (Paul of course might disagree).
> > >
> > > I assume that the filters which hang of tcindex_data::perfect and
> > > tcindex_data:p must be freed before tcindex_data, right?
> > >
> > > Refcounting of tcindex_data should do the trick. I.e. any element which
> > > you add to a tcindex_data instance takes a refcount and when that is
> > > destroyed then the rcu/work callback drops a reference which once it
> > > reaches 0 triggers tcindex_data to be freed.
> >
> > Yeah, but the problem is more than just tcindex filter, we have many
> > places make the same assumption of ordering.
>
> But don't you also have a situation where there might be a large group
> of queue_rcu_work() invocations whose order doesn't matter, followed by a
> single queue_rcu_work() invocation that must be ordered after the earlier
> group?  If so, ordering -all- of these invocations might be overkill.
>
> Or did I misread your code?

You are right. Previously I thought all non-trivial tc filters would need
to address this ordering bug, but it turns out probably only tcindex
needs it, because most of them actually use linked lists. As long as
we remove the entry from the list before tcf_queue_work(), it is fine
to free the list head before each entry in the list.

I just sent out a minimal fix using the refcnt.

Thanks!


* Re: WARNING: ODEBUG bug in tcindex_destroy_work (3)
  2020-03-28 19:53                 ` Cong Wang
@ 2020-03-30 13:37                   ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-03-30 13:37 UTC (permalink / raw)
  To: Cong Wang
  Cc: Thomas Gleixner, syzbot, David Miller, Jamal Hadi Salim,
	Jiri Pirko, Jakub Kicinski, LKML,
	Linux Kernel Network Developers, syzkaller-bugs

On Sat, Mar 28, 2020 at 12:53:43PM -0700, Cong Wang wrote:
> On Wed, Mar 25, 2020 at 11:58 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Mar 25, 2020 at 11:36:16AM -0700, Cong Wang wrote:
> > > On Mon, Mar 23, 2020 at 6:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > Cong Wang <xiyou.wangcong@gmail.com> writes:
> > > > > On Mon, Mar 23, 2020 at 2:14 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >> > We use an ordered workqueue for tc filters, so these two
> > > > >> > works are executed in the same order as they are queued.
> > > > >>
> > > > >> The workqueue is ordered, but look how the work is queued on the work
> > > > >> queue:
> > > > >>
> > > > >> tcf_queue_work()
> > > > >>   queue_rcu_work()
> > > > >>     call_rcu(&rwork->rcu, rcu_work_rcufn);
> > > > >>
> > > > >> So after the grace period elapses rcu_work_rcufn() queues it in the
> > > > >> actual work queue.
> > > > >>
> > > > >> Now tcindex_destroy() is invoked via tcf_proto_destroy() which can be
> > > > >> invoked from preemtible context. Now assume the following:
> > > > >>
> > > > >> CPU0
> > > > >>   tcf_queue_work()
> > > > >>     tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);
> > > > >>
> > > > >> -> Migration
> > > > >>
> > > > >> CPU1
> > > > >>    tcf_queue_work(&p->rwork, tcindex_destroy_work);
> > > > >>
> > > > >> So your RCU callbacks can be placed on different CPUs which obviously
> > > > >> has no ordering guarantee at all. See also:
> > > > >
> > > > > Good catch!
> > > > >
> > > > > I thought about this when I added this ordered workqueue, but it
> > > > > seems I misinterpret max_active, so despite we have max_active==1,
> > > > > more than 1 work could still be queued on different CPU's here.
> > > >
> > > > The workqueue is not the problem. it works perfectly fine. The way how
> > > > the work gets queued is the issue.
> > >
> > > Well, a RCU work is also a work, so the ordered workqueue should
> > > apply to RCU works too, from users' perspective. Users should not
> > > need to learn queue_rcu_work() is actually a call_rcu() which does
> > > not guarantee the ordering for an ordered workqueue.
> >
> > And the workqueues might well guarantee the ordering in cases where the
> > pair of RCU callbacks are invoked in a known order.  But that workqueues
> > ordering guarantee does not extend upstream to RCU, nor do I know of a
> > reasonable way to make this happen within the confines of RCU.
> >
> > If you have ideas, please do not keep them a secret, but please also
> > understand that call_rcu() must meet some pretty severe performance and
> > scalability constraints.
> >
> > I suppose that queue_rcu_work() could track outstanding call_rcu()
> > invocations, and (one way or another) defer the second queue_rcu_work()
> > if a first one is still pending from the current task, but that might not
> > make the common-case user of queue_rcu_work() all that happy.  But perhaps
> > there is a way to restrict these semantics to ordered workqueues.  In that
> > case, one could imagine the second and subsequent too-quick call to
> > queue_rcu_work() using the rcu_head structure's ->next field to queue these
> > too-quick callbacks, and then having rcu_work_rcufn() check for queued
> > too-quick callbacks, queuing the first one.
> >
> > But I must defer to Tejun on this one.
> >
> > And one additional caution...  This would meter out ordered
> > queue_rcu_work() requests at a rate of no faster than one per RCU
> > grace period.  The queue might build up, resulting in long delays.
> > Are you sure that your use case can live with this?
> 
> I don't know, I guess we might be able to add a call_rcu() takes a cpu
> as a parameter so that all of these call_rcu() callbacks will be queued
> on a same CPU thusly guarantees the ordering. But of course we
> need to figure out which cpu to use. :)
> 
> Just my two cents.

CPUs can go offline.  Plus if current trends continue, I will eventually
be forced to concurrently execute callbacks originating from a single CPU.

Another approach would be to have an additional workqueue step in the
ordered case, so that a workqueue handler does rcu_barrier() and then
invokes call_rcu(), which finally spawns the eventual workqueue handler.
But I do not believe that this will work well in your case because
you only need the last workqueue handler to be ordered against all the
previous callbacks.  I suppose that a separate queue_rcu_work_ordered()
API (or similar) could be added that did this, but I suspect that Tejun
might want to see a few more use cases before adding something like this.
Especially if the rcu_barrier() is introducing too much delay for your
use case.

> > > > > I don't know how to fix this properly, I think essentially RCU work
> > > > > should be guaranteed the same ordering with regular work. But this
> > > > > seems impossible unless RCU offers some API to achieve that.
> > > >
> > > > I don't think that's possible w/o putting constraints on the flexibility
> > > > of RCU (Paul of course might disagree).
> > > >
> > > > I assume that the filters which hang of tcindex_data::perfect and
> > > > tcindex_data:p must be freed before tcindex_data, right?
> > > >
> > > > Refcounting of tcindex_data should do the trick. I.e. any element which
> > > > you add to a tcindex_data instance takes a refcount and when that is
> > > > destroyed then the rcu/work callback drops a reference which once it
> > > > reaches 0 triggers tcindex_data to be freed.
> > >
> > > Yeah, but the problem is more than just tcindex filter, we have many
> > > places make the same assumption of ordering.
> >
> > But don't you also have a situation where there might be a large group
> > of queue_rcu_work() invocations whose order doesn't matter, followed by a
> > single queue_rcu_work() invocation that must be ordered after the earlier
> > group?  If so, ordering -all- of these invocations might be overkill.
> >
> > Or did I misread your code?
> 
> You are right. Previously I thought all non-trivial tc filters would need
> to address this ordering bug, but it turns out probably only tcindex
> needs it, because most of them actually use linked lists. As long as
> we remove the entry from the list before tcf_queue_work(), it is fine
> to free the list head before each entry in the list.
> 
> I just sent out a minimal fix using the refcnt.

Sounds good.

							Thanx, Paul

