RCU Archive on lore.kernel.org
* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
       [not found] <000000000000604e8905944f211f@google.com>
@ 2019-10-07 10:04 ` Marco Elver
  2019-10-07 13:22   ` Paul E. McKenney
                     ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Marco Elver @ 2019-10-07 10:04 UTC (permalink / raw)
  To: syzbot, paulmck, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	Joel Fernandes, rcu
  Cc: a, b.a.t.m.a.n, davem, LKML, mareklindner, netdev, sw, syzkaller-bugs

+RCU maintainers
This might be a data-race in RCU itself.

On Mon, 7 Oct 2019 at 12:01, syzbot
<syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    b4bd9343 x86, kcsan: Enable KCSAN for x86
> git tree:       https://github.com/google/ktsan.git kcsan
> console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
>
> write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
>   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
>   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
>   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
>   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
>   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
>   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
>   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
>   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
>   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
>   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
>   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
>   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
>
> read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
>   _find_next_bit lib/find_bit.c:39 [inline]
>   find_next_bit+0x57/0xe0 lib/find_bit.c:70
>   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
>   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
>   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
>   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
>   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
>   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
>   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
>   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: rcu_gp wait_rcu_exp_gp
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-07 10:04 ` KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult Marco Elver
@ 2019-10-07 13:22   ` Paul E. McKenney
  2019-10-07 13:34   ` Steven Rostedt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2019-10-07 13:22 UTC (permalink / raw)
  To: Marco Elver
  Cc: syzbot, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	Joel Fernandes, rcu, a, b.a.t.m.a.n, davem, LKML, mareklindner,
	netdev, sw, syzkaller-bugs

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:
> +RCU maintainers
> This might be a data-race in RCU itself.

Quite possibly.  I will take a look, but there will be delays due to this
week being bootcamp and all.

							Thanx, Paul

> On Mon, 7 Oct 2019 at 12:01, syzbot
> <syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com> wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    b4bd9343 x86, kcsan: Enable KCSAN for x86
> > git tree:       https://github.com/google/ktsan.git kcsan
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> > dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> >
> > ==================================================================
> > BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> >
> > write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> >   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> >   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> >   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> >   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> >   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> >   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> >   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> >   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> >
> > read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> >   _find_next_bit lib/find_bit.c:39 [inline]
> >   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> >   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> >   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> >   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> >   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> >
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Workqueue: rcu_gp wait_rcu_exp_gp
> > ==================================================================
> >
> >
> > ---
> > This bug is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at syzkaller@googlegroups.com.
> >
> > syzbot will keep track of this bug report. See:
> > https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-07 10:04 ` KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult Marco Elver
  2019-10-07 13:22   ` Paul E. McKenney
@ 2019-10-07 13:34   ` Steven Rostedt
  2019-10-07 13:43   ` Boqun Feng
  2019-10-08  1:47   ` Paul E. McKenney
  3 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2019-10-07 13:34 UTC (permalink / raw)
  To: Marco Elver
  Cc: syzbot, paulmck, josh, mathieu.desnoyers, jiangshanlai,
	Joel Fernandes, rcu, a, b.a.t.m.a.n, davem, LKML, mareklindner,
	netdev, sw, syzkaller-bugs

On Mon, 7 Oct 2019 12:04:16 +0200
Marco Elver <elver@google.com> wrote:

> +RCU maintainers
> This might be a data-race in RCU itself.
> 
> >
> > write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> >   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244

Here we have:

	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	if (!(rnp->expmask & mask)) {
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		return;
	}
	rnp->expmask &= ~mask;
	__rcu_report_exp_rnp(rnp, wake, flags); /* Releases rnp->lock. */

> >
> > read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> >   _find_next_bit lib/find_bit.c:39 [inline]
> >   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> >   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375

and here we have:


	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

	/* IPI the remaining CPUs for expedited quiescent state. */
	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {


The write to rnp->expmask is done under the rnp->lock, but on the read
side, that lock is released before the for loop. Should we have
something like:

	unsigned long expmask;
	[...]

	expmask = rnp->expmask;
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

	/* IPI the remaining CPUs for expedited quiescent state. */
	for_each_leaf_node_cpu_mask(rnp, cpu, expmask) {

?

-- Steve


* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-07 10:04 ` KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult Marco Elver
  2019-10-07 13:22   ` Paul E. McKenney
  2019-10-07 13:34   ` Steven Rostedt
@ 2019-10-07 13:43   ` Boqun Feng
  2019-10-08  0:11     ` Joel Fernandes
  2019-10-08  1:47   ` Paul E. McKenney
  3 siblings, 1 reply; 10+ messages in thread
From: Boqun Feng @ 2019-10-07 13:43 UTC (permalink / raw)
  To: Marco Elver
  Cc: syzbot, paulmck, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	Joel Fernandes, rcu, a, b.a.t.m.a.n, davem, LKML, mareklindner,
	netdev, sw, syzkaller-bugs

Hi Marco,

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:
> +RCU maintainers
> This might be a data-race in RCU itself.
> 
> On Mon, 7 Oct 2019 at 12:01, syzbot
> <syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com> wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    b4bd9343 x86, kcsan: Enable KCSAN for x86
> > git tree:       https://github.com/google/ktsan.git kcsan
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> > dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> >
> > ==================================================================
> > BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> >
> > write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> >   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> >   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> >   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> >   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> >   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> >   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> >   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> >   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> >
> > read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> >   _find_next_bit lib/find_bit.c:39 [inline]
> >   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> >   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375

This is the second for_each_leaf_node_cpu_mask() loop in
sync_rcu_exp_select_node_cpus(). The first loop collects which CPUs
block the current grace period (IOW, which CPUs need to be sent an
IPI), and the second loop does the real work of sending the IPIs. The
first loop is protected by the proper lock (the rcu_node lock), so
there is no race there. But the second one can't hold the rcu_node
lock, because the IPI handler (rcu_exp_handler) needs to acquire the
same lock, so the rcu_node lock has to be dropped before the second
loop to avoid deadlock.

Now for the racy find_next_bit() on rnp->expmask:

1) if an extra bit appears: it's OK, since there is a check on whether
the bit exists in mask_ofl_ipi (the result of the first loop).

2) if a bit is missing: it will be problematic, because the second loop
will skip that CPU, the rest of the code will treat the CPU as offline
without having reported a quiescent state, and
rcu_report_exp_cpu_mult() will report the quiescent state on its
behalf, even though the CPU may currently be running inside an RCU
read-side critical section.

Note that both "appears" and "missing" refer to some intermediate state
of the plain (unmarked) clearing of expmask that the compiler is free
to introduce.

Please see below for a compile-test-only patch:

> >   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> >   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> >   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> >
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Workqueue: rcu_gp wait_rcu_exp_gp
> > ==================================================================
> >
> >

Regards,
Boqun

------------------->8
Subject: [PATCH] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

| BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
|
| write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
|   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
|   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
|   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
|   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
|   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
|   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
|   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
|   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
|   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
|   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
|   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
|   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
|
| read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
|   _find_next_bit lib/find_bit.c:39 [inline]
|   find_next_bit+0x57/0xe0 lib/find_bit.c:70
|   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
|   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
|   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
|   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
|   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
|   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
|   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
|   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

The root cause of this is that the second for_each_leaf_node_cpu_mask()
loop in sync_rcu_exp_select_node_cpus() accesses rcu_node::expmask
without holding the rcu_node's lock. This is by design: the second
loop may issue IPIs to other CPUs, and the IPI handler
(rcu_exp_handler) may acquire the same rcu_node's lock, so the lock
has to be dropped before the second loop.

The problem occurs when the normal clearing of rcu_node::expmask
results in some intermediate state (because it's a plain access) in
which an extra bit gets zeroed. The second loop will then skip the
corresponding CPU but treat it as offline and in a quiescent state.
This will cause trouble because that CPU may be in an RCU read-side
critical section.

To fix this, take a snapshot of mask_ofl_ipi and make the second loop
iterate over the snapshot's bits. As a result, the find_next_bit() in
the second loop doesn't access any variable that may be changed in
parallel, so the race is avoided.

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/rcu/tree_exp.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index af7e7b9c86af..7f3e19d0275e 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -335,6 +335,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 	unsigned long flags;
 	unsigned long mask_ofl_test;
 	unsigned long mask_ofl_ipi;
+	unsigned long mask_ofl_ipi_snap;
 	int ret;
 	struct rcu_exp_work *rewp =
 		container_of(wp, struct rcu_exp_work, rew_work);
@@ -371,13 +372,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 		rnp->exp_tasks = rnp->blkd_tasks.next;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
+	mask_ofl_ipi_snap = mask_ofl_ipi;
 	/* IPI the remaining CPUs for expedited quiescent state. */
-	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
+	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {
 		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 
-		if (!(mask_ofl_ipi & mask))
-			continue;
 retry_ipi:
 		if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
 			mask_ofl_test |= mask;
-- 
2.23.0



* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-07 13:43   ` Boqun Feng
@ 2019-10-08  0:11     ` Joel Fernandes
  2019-10-08  2:12       ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Joel Fernandes @ 2019-10-08  0:11 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Marco Elver, syzbot, paulmck, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, rcu, a, b.a.t.m.a.n, davem, LKML, mareklindner,
	netdev, sw, syzkaller-bugs

On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:
> Hi Marco,

Hi Boqun, Steve and Paul, fun times!

Marco, good catch ;-)

Some comments below:

> On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:
> > +RCU maintainers
> > This might be a data-race in RCU itself.
> > 
> > On Mon, 7 Oct 2019 at 12:01, syzbot
> > <syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com> wrote:
> > >
> > > Hello,
> > >
> > > syzbot found the following crash on:
> > >
> > > HEAD commit:    b4bd9343 x86, kcsan: Enable KCSAN for x86
> > > git tree:       https://github.com/google/ktsan.git kcsan
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0
> > > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > >
> > > Unfortunately, I don't have any reproducer for this crash yet.
> > >
> > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> > >
> > > ==================================================================
> > > BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> > >
> > > write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> > >   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> > >   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> > >   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> > >   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> > >   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> > >   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> > >   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> > >   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> > >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> > >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> > >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> > >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> > >
> > > read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> > >   _find_next_bit lib/find_bit.c:39 [inline]
> > >   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> > >   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> 
> This is the second for_each_leaf_node_cpu_mask() loop in
> sync_rcu_exp_select_node_cpus(). The first loop collects which CPUs
> block the current grace period (IOW, which CPUs need to be sent an
> IPI), and the second loop does the real work of sending the IPIs. The
> first loop is protected by the proper lock (the rcu_node lock), so
> there is no race there. But the second one can't hold the rcu_node
> lock, because the IPI handler (rcu_exp_handler) needs to acquire the
> same lock, so the rcu_node lock has to be dropped before the second
> loop to avoid deadlock.
> 
> Now for the racy find_next_bit() on rnp->expmask:
> 
> 1) if an extra bit appears: it's OK, since there is a check on whether
> the bit exists in mask_ofl_ipi (the result of the first loop).
> 
> 2) if a bit is missing: it will be problematic, because the second loop
> will skip that CPU, the rest of the code will treat the CPU as offline
> without having reported a quiescent state, and
> rcu_report_exp_cpu_mult() will report the quiescent state on its
> behalf, even though the CPU may currently be running inside an RCU
> read-side critical section.
> 
> Note that both "appears" and "missing" refer to some intermediate state
> of the plain (unmarked) clearing of expmask that the compiler is free
> to introduce.
> 
> Please see below for a compile-test-only patch:
> 
> > >   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> > >   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> > >   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> > >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> > >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> > >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> > >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> > >
> > > Reported by Kernel Concurrency Sanitizer on:
> > > CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > Google 01/01/2011
> > > Workqueue: rcu_gp wait_rcu_exp_gp
> > > ==================================================================
> > >
> > >
> 
> Regards,
> Boqun
> 
> ------------------->8
> Subject: [PATCH] rcu: exp: Avoid race on lockless rcu_node::expmask loop
> 
> KCSAN reported an issue:
> 
> | BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> |
> | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> |   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> |   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> |   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> |   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> |   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> |   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> |   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> |   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> |   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> |   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> |   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> |   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> |
> | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> |   _find_next_bit lib/find_bit.c:39 [inline]
> |   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> |   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> |   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> |   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> |   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> |   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> |   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> |   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> |   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> 
> The root cause of this is that the second for_each_leaf_node_cpu_mask()
> loop in sync_rcu_exp_select_node_cpus() accesses rcu_node::expmask
> without holding the rcu_node's lock. This is by design: the second
> loop may issue IPIs to other CPUs, and the IPI handler
> (rcu_exp_handler) may acquire the same rcu_node's lock, so the lock
> has to be dropped before the second loop.
> 
> The problem occurs when the normal clearing of rcu_node::expmask
> results in some intermediate state (because it's a plain access) in
> which an extra bit gets zeroed. The second loop will then skip the
> corresponding CPU but treat it as offline and in a quiescent state.
> This will cause trouble because that CPU may be in an RCU read-side
> critical section.
> 
> To fix this, take a snapshot of mask_ofl_ipi and make the second loop
> iterate over the snapshot's bits. As a result, the find_next_bit() in
> the second loop doesn't access any variable that may be changed in
> parallel, so the race is avoided.
> 
> Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  kernel/rcu/tree_exp.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index af7e7b9c86af..7f3e19d0275e 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -335,6 +335,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
>  	unsigned long flags;
>  	unsigned long mask_ofl_test;
>  	unsigned long mask_ofl_ipi;
> +	unsigned long mask_ofl_ipi_snap;
>  	int ret;
>  	struct rcu_exp_work *rewp =
>  		container_of(wp, struct rcu_exp_work, rew_work);
> @@ -371,13 +372,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
>  		rnp->exp_tasks = rnp->blkd_tasks.next;
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  
> +	mask_ofl_ipi_snap = mask_ofl_ipi;
>  	/* IPI the remaining CPUs for expedited quiescent state. */
> -	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> +	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {

This looks good to me. Just a nit: I would prefer the "IPI the
remaining CPUs" comment to come before the assignment to
mask_ofl_ipi_snap, since the new assignment is done for consumption by
the for_each..(..) loop itself.

Steve's patch looks good as well, and I was thinking along the same
lines, but Boqun's patch is slightly better because it doesn't need to
snapshot expmask inside the locked section.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel

>  		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
>  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
>  
> -		if (!(mask_ofl_ipi & mask))
> -			continue;
>  retry_ipi:
>  		if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
>  			mask_ofl_test |= mask;
> -- 
> 2.23.0
> 


* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-07 10:04 ` KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult Marco Elver
                     ` (2 preceding siblings ...)
  2019-10-07 13:43   ` Boqun Feng
@ 2019-10-08  1:47   ` Paul E. McKenney
  3 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2019-10-08  1:47 UTC (permalink / raw)
  To: Marco Elver
  Cc: syzbot, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	Joel Fernandes, rcu, a, b.a.t.m.a.n, davem, LKML, mareklindner,
	netdev, sw, syzkaller-bugs

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:
> +RCU maintainers
> This might be a data-race in RCU itself.
> 
> On Mon, 7 Oct 2019 at 12:01, syzbot
> <syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com> wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    b4bd9343 x86, kcsan: Enable KCSAN for x86
> > git tree:       https://github.com/google/ktsan.git kcsan
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> > dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> >
> > ==================================================================
> > BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> >
> > write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> >   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> >   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> >   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> >   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> >   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> >   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> >   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> >   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> >
> > read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> >   _find_next_bit lib/find_bit.c:39 [inline]
> >   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> >   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> >   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> >   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> >   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> >
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Workqueue: rcu_gp wait_rcu_exp_gp
> > ==================================================================
> >
> >
> > ---
> > This bug is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at syzkaller@googlegroups.com.
> >
> > syzbot will keep track of this bug report. See:
> > https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

And yes, ->expmask is accessed locklessly in a few places without
READ_ONCE().  So does the following (untested) patch help?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index d632cd0..bcd2a79 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -134,7 +134,7 @@ static void __maybe_unused sync_exp_reset_tree(void)
 	rcu_for_each_node_breadth_first(rnp) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		WARN_ON_ONCE(rnp->expmask);
-		rnp->expmask = rnp->expmaskinit;
+		WRITE_ONCE(rnp->expmask, rnp->expmaskinit);
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
 }
@@ -211,7 +211,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
 		rnp = rnp->parent;
 		raw_spin_lock_rcu_node(rnp); /* irqs already disabled */
 		WARN_ON_ONCE(!(rnp->expmask & mask));
-		rnp->expmask &= ~mask;
+		WRITE_ONCE(rnp->expmask, rnp->expmask & ~mask);
 	}
 }
 
@@ -241,7 +241,7 @@ static void rcu_report_exp_cpu_mult(struct rcu_node *rnp,
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 		return;
 	}
-	rnp->expmask &= ~mask;
+	WRITE_ONCE(rnp->expmask, rnp->expmask & ~mask);
 	__rcu_report_exp_rnp(rnp, wake, flags); /* Releases rnp->lock. */
 }
 
@@ -332,6 +332,7 @@ static bool exp_funnel_lock(unsigned long s)
 static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 {
 	int cpu;
+	unsigned long expmask_snap;
 	unsigned long flags;
 	unsigned long mask_ofl_test;
 	unsigned long mask_ofl_ipi;
@@ -369,10 +370,11 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 	 */
 	if (rcu_preempt_has_tasks(rnp))
 		rnp->exp_tasks = rnp->blkd_tasks.next;
+	expmask_snap = rnp->expmask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
 	/* IPI the remaining CPUs for expedited quiescent state. */
-	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
+	for_each_leaf_node_cpu_mask(rnp, cpu, expmask_snap) {
 		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 
@@ -491,7 +493,7 @@ static void synchronize_sched_expedited_wait(void)
 				struct rcu_data *rdp;
 
 				mask = leaf_node_cpu_bit(rnp, cpu);
-				if (!(rnp->expmask & mask))
+				if (!(READ_ONCE(rnp->expmask) & mask))
 					continue;
 				ndetected++;
 				rdp = per_cpu_ptr(&rcu_data, cpu);
@@ -503,7 +505,8 @@ static void synchronize_sched_expedited_wait(void)
 		}
 		pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
 			jiffies - jiffies_start, rcu_state.expedited_sequence,
-			rnp_root->expmask, ".T"[!!rnp_root->exp_tasks]);
+			READ_ONCE(rnp_root->expmask),
+			".T"[!!rnp_root->exp_tasks]);
 		if (ndetected) {
 			pr_err("blocking rcu_node structures:");
 			rcu_for_each_node_breadth_first(rnp) {
@@ -513,7 +516,7 @@ static void synchronize_sched_expedited_wait(void)
 					continue;
 				pr_cont(" l=%u:%d-%d:%#lx/%c",
 					rnp->level, rnp->grplo, rnp->grphi,
-					rnp->expmask,
+					READ_ONCE(rnp->expmask),
 					".T"[!!rnp->exp_tasks]);
 			}
 			pr_cont("\n");
@@ -521,7 +524,7 @@ static void synchronize_sched_expedited_wait(void)
 		rcu_for_each_leaf_node(rnp) {
 			for_each_leaf_node_possible_cpu(rnp, cpu) {
 				mask = leaf_node_cpu_bit(rnp, cpu);
-				if (!(rnp->expmask & mask))
+				if (!(READ_ONCE(rnp->expmask) & mask))
 					continue;
 				dump_cpu_task(cpu);
 			}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-08  0:11     ` Joel Fernandes
@ 2019-10-08  2:12       ` Paul E. McKenney
  2019-10-08  2:50         ` Boqun Feng
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2019-10-08  2:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Boqun Feng, Marco Elver, syzbot, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, rcu, a, b.a.t.m.a.n, davem,
	LKML, mareklindner, netdev, sw, syzkaller-bugs

On Mon, Oct 07, 2019 at 08:11:31PM -0400, Joel Fernandes wrote:
> On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:
> > Hi Marco,
> 
> Hi Boqun, Steve and Paul, fun times!
> 
> Marco, good catch ;-)

Indeed!  ;-)

> Some comments below:
> 
> > On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:
> > > +RCU maintainers
> > > This might be a data-race in RCU itself.
> > > 
> > > On Mon, 7 Oct 2019 at 12:01, syzbot
> > > <syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > syzbot found the following crash on:
> > > >
> > > > HEAD commit:    b4bd9343 x86, kcsan: Enable KCSAN for x86
> > > > git tree:       https://github.com/google/ktsan.git kcsan
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0
> > > > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > > >
> > > > Unfortunately, I don't have any reproducer for this crash yet.
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > > Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> > > >
> > > > ==================================================================
> > > > BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> > > >
> > > > write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> > > >   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> > > >   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> > > >   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> > > >   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> > > >   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> > > >   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> > > >   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> > > >   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> > > >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> > > >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> > > >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> > > >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> > > >
> > > > read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> > > >   _find_next_bit lib/find_bit.c:39 [inline]
> > > >   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> > > >   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> > 
> > This is the second for_each_leaf_node_cpu_mask() loop in
> > sync_rcu_exp_select_node_cpus(). The first loop collects the CPUs that
> > block the current grace period (IOW, the CPUs that need to be sent an
> > IPI), and the second loop does the real work of sending the IPIs. The
> > first loop is protected by a proper lock (the rcu_node lock), so there
> > is no race there. But the second one can't hold the rcu_node lock,
> > because the IPI handler (rcu_exp_handler) needs to acquire the same
> > lock, so the rcu_node lock has to be dropped before the second loop to
> > avoid deadlock.
> > 
> > Now for the racy find_next_bit() on rnp->expmask:
> > 
> > 1) if an extra bit appears: it's OK, since there is a check on whether
> > the bit exists in mask_ofl_ipi (the result of the first loop).
> > 
> > 2) if a bit is missing: it will be problematic, because the second loop
> > will skip the CPU, the rest of the code will treat that CPU as offline
> > and not yet having reported a quiescent state, and
> > rcu_report_exp_cpu_mult() will report the QS for it, even though the
> > CPU may currently be running inside an RCU read-side critical section.
> > 
> > Note that both "appears" and "missing" refer to some intermediate state
> > of the plain (non-atomic) clearing of expmask, contributed by compiler
> > magic.
> > 
> > Please see below for a compile-test-only patch:
> > 
> > > >   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> > > >   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> > > >   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> > > >   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> > > >   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> > > >   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> > > >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> > > >
> > > > Reported by Kernel Concurrency Sanitizer on:
> > > > CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > > Google 01/01/2011
> > > > Workqueue: rcu_gp wait_rcu_exp_gp
> > > > ==================================================================
> > > >
> > > >
> > 
> > Regards,
> > Boqun
> > 
> > ------------------->8
> > Subject: [PATCH] rcu: exp: Avoid race on lockless rcu_node::expmask loop
> > 
> > KCSAN reported an issue:
> > 
> > | BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> > |
> > | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> > |   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> > |   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> > |   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> > |   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> > |   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> > |   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> > |   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> > |   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> > |   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> > |   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> > |   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> > |   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> > |
> > | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> > |   _find_next_bit lib/find_bit.c:39 [inline]
> > |   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> > |   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> > |   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> > |   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> > |   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> > |   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> > |   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> > |   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> > |   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> > 
> > The root cause of this is that the second for_each_leaf_node_cpu_mask()
> > loop in sync_rcu_exp_select_node_cpus() accesses rcu_node::expmask
> > without holding the rcu_node's lock. This is by design, because the
> > second loop may issue IPIs to other CPUs, and the IPI handler
> > (rcu_exp_handler) may acquire the same rcu_node's lock. So the
> > rcu_node's lock has to be dropped before the second loop.
> > 
> > The problem occurs when the normal clearing of rcu_node::expmask
> > results in some intermediate state (because it is a plain access),
> > where an extra bit gets zeroed. The second loop will then skip the
> > corresponding CPU, but treat it as offline and in a quiescent state.
> > This causes trouble because that CPU may be in an RCU read-side
> > critical section.
> > 
> > To fix this, take a snapshot of mask_ofl_ipi, and make the second loop
> > iterate over the snapshot's bits. As a result, the find_next_bit() in
> > the second loop doesn't access any variables that may change in
> > parallel, so the race is avoided.
> > 
> > Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> >  kernel/rcu/tree_exp.h | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index af7e7b9c86af..7f3e19d0275e 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -335,6 +335,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> >  	unsigned long flags;
> >  	unsigned long mask_ofl_test;
> >  	unsigned long mask_ofl_ipi;
> > +	unsigned long mask_ofl_ipi_snap;
> >  	int ret;
> >  	struct rcu_exp_work *rewp =
> >  		container_of(wp, struct rcu_exp_work, rew_work);
> > @@ -371,13 +372,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> >  		rnp->exp_tasks = rnp->blkd_tasks.next;
> >  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >  
> > +	mask_ofl_ipi_snap = mask_ofl_ipi;
> >  	/* IPI the remaining CPUs for expedited quiescent state. */
> > -	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> > +	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {

Why can't we just use mask_ofl_ipi?  The bits removed are only those
bits just now looked at, right?  Also, the test of mask_ofl_ipi can be
dropped, since that branch will never be taken, correct?

> This looks good to me. Just a nit: I'd prefer that the comment about
> IPIing the remaining CPUs come before the assignment to
> mask_ofl_ipi_snap, since the new assignment is done for consumption by
> the for_each..() loop itself.
> 
> Steve's patch looks good as well, and I was thinking along the same
> lines, but Boqun's patch is slightly better because he doesn't need to
> snapshot exp_mask inside the locked section.

There are also similar lockless accesses to ->expmask in the stall-warning
code.

> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

But thank all three of you for looking this over!  My original patch
was overly ornate.  ;-)

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> >  		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
> >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> >  
> > -		if (!(mask_ofl_ipi & mask))
> > -			continue;
> >  retry_ipi:
> >  		if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
> >  			mask_ofl_test |= mask;
> > -- 
> > 2.23.0
> > 


* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-08  2:12       ` Paul E. McKenney
@ 2019-10-08  2:50         ` Boqun Feng
  2019-10-08  3:33           ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Boqun Feng @ 2019-10-08  2:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Marco Elver, syzbot, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, rcu, a, b.a.t.m.a.n, davem,
	LKML, mareklindner, netdev, sw, syzkaller-bugs

On Mon, Oct 07, 2019 at 07:12:33PM -0700, Paul E. McKenney wrote:
> On Mon, Oct 07, 2019 at 08:11:31PM -0400, Joel Fernandes wrote:
> > On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:
> > > Hi Marco,
> > 
> > Hi Boqun, Steve and Paul, fun times!
> > 
> > Marco, good catch ;-)
> 
> Indeed!  ;-)
> 
[...]
> > > +	mask_ofl_ipi_snap = mask_ofl_ipi;
> > >  	/* IPI the remaining CPUs for expedited quiescent state. */
> > > -	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> > > +	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {
> 
> Why can't we just use mask_ofl_ipi?  The bits removed are only those
> bits just now looked at, right?  Also, the test of mask_ofl_ipi can be
> dropped, since that branch will never be taken, correct?
> 

You're correct. But I think we can further simplify the code a little
bit so that we won't need to modify mask_ofl_ipi:

In the second loop:

1) if the target CPU is online and responds to the IPI, we do nothing.

2) if the target CPU is offline but doesn't block the current GP, we do
nothing.

3) otherwise, the target CPU is offline and blocks the current GP, so we
set the corresponding bit in mask_ofl_test.

Thoughts?

Please see the end of email for a patch.

> > This looks good to me. Just a nit: I'd prefer that the comment about
> > IPIing the remaining CPUs come before the assignment to
> > mask_ofl_ipi_snap, since the new assignment is done for consumption by
> > the for_each..() loop itself.
> > 
> > Steve's patch looks good as well, and I was thinking along the same
> > lines, but Boqun's patch is slightly better because he doesn't need to
> > snapshot exp_mask inside the locked section.
> 
> There are also similar lockless accesses to ->expmask in the stall-warning
> code.
> 
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> But thank all three of you for looking this over!  My original patch
> was overly ornate.  ;-)
> 
> 							Thanx, Paul
> 

-------------------->8
Subject: [PATCH v2] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

| BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
|
| write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
|   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
|   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
|   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
|   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
|   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
|   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
|   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
|   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
|   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
|   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
|   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
|   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
|
| read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
|   _find_next_bit lib/find_bit.c:39 [inline]
|   find_next_bit+0x57/0xe0 lib/find_bit.c:70
|   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
|   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
|   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
|   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
|   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
|   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
|   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
|   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

The root cause of this is that the second for_each_leaf_node_cpu_mask()
loop in sync_rcu_exp_select_node_cpus() accesses rcu_node::expmask
without holding the rcu_node's lock. This is by design, because the
second loop may issue IPIs to other CPUs, and the IPI handler
(rcu_exp_handler) may acquire the same rcu_node's lock. So the
rcu_node's lock has to be dropped before the second loop.

The problem occurs when the normal clearing of rcu_node::expmask
results in some intermediate state (because it is a plain access),
where an extra bit gets zeroed. The second loop will then skip the
corresponding CPU, but treat it as offline and in a quiescent state.
This causes trouble because that CPU may be in an RCU read-side
critical section.

To fix this, make the second loop iterate over mask_ofl_ipi. As a
result, the find_next_bit() in the second loop doesn't access any
variables that may change in parallel, so the race is avoided. While we
are at it, remove the clearing of mask_ofl_ipi to improve readability,
because we can always use mask_ofl_test to record which CPUs' quiescent
states should be reported.

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/rcu/tree_exp.h | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index af7e7b9c86af..fb51752ac9a6 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -372,12 +372,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
 	/* IPI the remaining CPUs for expedited quiescent state. */
-	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
+	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
 		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 
-		if (!(mask_ofl_ipi & mask))
-			continue;
 retry_ipi:
 		if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
 			mask_ofl_test |= mask;
@@ -389,10 +387,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 		}
 		ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
 		put_cpu();
-		if (!ret) {
-			mask_ofl_ipi &= ~mask;
+		/* The CPU responded to the IPI, and it will report the QS itself. */
+		if (!ret)
 			continue;
-		}
+
 		/* Failed, raced with CPU hotplug operation. */
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		if ((rnp->qsmaskinitnext & mask) &&
@@ -403,13 +401,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
 			schedule_timeout_uninterruptible(1);
 			goto retry_ipi;
 		}
-		/* CPU really is offline, so we can ignore it. */
-		if (!(rnp->expmask & mask))
-			mask_ofl_ipi &= ~mask;
+		/* CPU really is offline, and we need its QS. */
+		if (rnp->expmask & mask)
+			mask_ofl_test |= mask;
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	}
 	/* Report quiescent states for those that went offline. */
-	mask_ofl_test |= mask_ofl_ipi;
 	if (mask_ofl_test)
 		rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
 }
-- 
2.23.0



* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-08  2:50         ` Boqun Feng
@ 2019-10-08  3:33           ` Paul E. McKenney
  2019-10-08  3:58             ` Boqun Feng
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2019-10-08  3:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Joel Fernandes, Marco Elver, syzbot, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, rcu, a, b.a.t.m.a.n, davem,
	LKML, mareklindner, netdev, sw, syzkaller-bugs

On Tue, Oct 08, 2019 at 10:50:56AM +0800, Boqun Feng wrote:
> On Mon, Oct 07, 2019 at 07:12:33PM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 07, 2019 at 08:11:31PM -0400, Joel Fernandes wrote:
> > > On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:
> > > > Hi Marco,
> > > 
> > > Hi Boqun, Steve and Paul, fun times!
> > > 
> > > Marco, good catch ;-)
> > 
> > Indeed!  ;-)
> > 
> [...]
> > > > +	mask_ofl_ipi_snap = mask_ofl_ipi;
> > > >  	/* IPI the remaining CPUs for expedited quiescent state. */
> > > > -	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> > > > +	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {
> > 
> > Why can't we just use mask_ofl_ipi?  The bits removed are only those
> > bits just now looked at, right?  Also, the test of mask_ofl_ipi can be
> > dropped, since that branch will never be taken, correct?
> > 
> 
> You're correct. But I think we can further simplify the code a little
> bit so that we won't need to modify mask_ofl_ipi:
> 
> In the second loop:
> 
> 1) if the target CPU is online and responds to the IPI, we do nothing.
> 
> 2) if the target CPU is offline but doesn't block the current GP, we do
> nothing.
> 
> 3) otherwise, the target CPU is offline and blocks the current GP, so we
> set the corresponding bit in mask_ofl_test.
> 
> Thoughts?
> 
> Please see the end of email for a patch.
> 
> > > This looks good to me. Just a nit: I'd prefer that the comment about
> > > IPIing the remaining CPUs come before the assignment to
> > > mask_ofl_ipi_snap, since the new assignment is done for consumption by
> > > the for_each..() loop itself.
> > > 
> > > Steve's patch looks good as well, and I was thinking along the same
> > > lines, but Boqun's patch is slightly better because he doesn't need to
> > > snapshot exp_mask inside the locked section.
> > 
> > There are also similar lockless accesses to ->expmask in the stall-warning
> > code.
> > 
> > > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > But thank all three of you for looking this over!  My original patch
> > was overly ornate.  ;-)
> > 
> > 							Thanx, Paul
> > 
> 
> -------------------->8
> Subject: [PATCH v2] rcu: exp: Avoid race on lockless rcu_node::expmask loop
> 
> KCSAN reported an issue:
> 
> | BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
> |
> | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0:
> |   rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244
> |   rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254
> |   rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475
> |   rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
> |   __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
> |   rcu_read_unlock include/linux/rcupdate.h:645 [inline]
> |   batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline]
> |   batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718
> |   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> |   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> |   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> |   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> |
> | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1:
> |   _find_next_bit lib/find_bit.c:39 [inline]
> |   find_next_bit+0x57/0xe0 lib/find_bit.c:70
> |   sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375
> |   sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439
> |   rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline]
> |   wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589
> |   process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
> |   worker_thread+0xa0/0x800 kernel/workqueue.c:2415
> |   kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
> |   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
> 
> The root cause of this is that the second for_each_leaf_node_cpu_mask()
> loop in sync_rcu_exp_select_node_cpus() accesses rcu_node::expmask
> without holding the rcu_node's lock. This is by design, because the
> second loop may issue IPIs to other CPUs, and the IPI handler
> (rcu_exp_handler) may acquire the same rcu_node's lock. So the
> rcu_node's lock has to be dropped before the second loop.
> 
> The problem occurs when the normal clearing of rcu_node::expmask
> results in some intermediate state (because it is a plain access),
> where an extra bit gets zeroed. The second loop will then skip the
> corresponding CPU, but treat it as offline and in a quiescent state.
> This causes trouble because that CPU may be in an RCU read-side
> critical section.
> 
> To fix this, make the second loop iterate over mask_ofl_ipi. As a
> result, the find_next_bit() in the second loop doesn't access any
> variables that may change in parallel, so the race is avoided. While we
> are at it, remove the clearing of mask_ofl_ipi to improve readability,
> because we can always use mask_ofl_test to record which CPUs' quiescent
> states should be reported.

Good point on the second loop!

> Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  kernel/rcu/tree_exp.h | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index af7e7b9c86af..fb51752ac9a6 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -372,12 +372,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  
>  	/* IPI the remaining CPUs for expedited quiescent state. */
> -	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> +	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
>  		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
>  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
>  
> -		if (!(mask_ofl_ipi & mask))
> -			continue;
>  retry_ipi:
>  		if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
>  			mask_ofl_test |= mask;

This part I already have on the -rcu branch "dev".

> @@ -389,10 +387,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
>  		}
>  		ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
>  		put_cpu();
> -		if (!ret) {
> -			mask_ofl_ipi &= ~mask;
> > +		/* The CPU responded to the IPI, and it will report the QS itself. */
> +		if (!ret)
>  			continue;
> -		}
> +
>  		/* Failed, raced with CPU hotplug operation. */
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		if ((rnp->qsmaskinitnext & mask) &&
> @@ -403,13 +401,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
>  			schedule_timeout_uninterruptible(1);
>  			goto retry_ipi;
>  		}
> -		/* CPU really is offline, so we can ignore it. */
> -		if (!(rnp->expmask & mask))
> -			mask_ofl_ipi &= ~mask;
> +		/* CPU really is offline, and we need its QS. */
> +		if (rnp->expmask & mask)
> +			mask_ofl_test |= mask;
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  	}
>  	/* Report quiescent states for those that went offline. */
> -	mask_ofl_test |= mask_ofl_ipi;
>  	if (mask_ofl_test)
>  		rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
>  }

Would you be willing to port this optimization on top of current -rcu
branch "dev" with a suitably modified commit message?

							Thanx, Paul


* Re: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult
  2019-10-08  3:33           ` Paul E. McKenney
@ 2019-10-08  3:58             ` Boqun Feng
  0 siblings, 0 replies; 10+ messages in thread
From: Boqun Feng @ 2019-10-08  3:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Marco Elver, syzbot, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, rcu, a, b.a.t.m.a.n, davem,
	LKML, mareklindner, netdev, sw, syzkaller-bugs

On Mon, Oct 07, 2019 at 08:33:53PM -0700, Paul E. McKenney wrote:
[...]
> > ---
> >  kernel/rcu/tree_exp.h | 17 +++++++----------
> >  1 file changed, 7 insertions(+), 10 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index af7e7b9c86af..fb51752ac9a6 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -372,12 +372,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> >  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >  
> >  	/* IPI the remaining CPUs for expedited quiescent state. */
> > -	for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> > +	for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) {
> >  		unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
> >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> >  
> > -		if (!(mask_ofl_ipi & mask))
> > -			continue;
> >  retry_ipi:
> >  		if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) {
> >  			mask_ofl_test |= mask;
> 
> This part I already have on the -rcu branch "dev".
> 
> > @@ -389,10 +387,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> >  		}
> >  		ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0);
> >  		put_cpu();
> > -		if (!ret) {
> > -			mask_ofl_ipi &= ~mask;
> > +		/* The CPU responded to the IPI, and it will report the QS itself. */
> > +		if (!ret)
> >  			continue;
> > -		}
> > +
> >  		/* Failed, raced with CPU hotplug operation. */
> >  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> >  		if ((rnp->qsmaskinitnext & mask) &&
> > @@ -403,13 +401,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> >  			schedule_timeout_uninterruptible(1);
> >  			goto retry_ipi;
> >  		}
> > -		/* CPU really is offline, so we can ignore it. */
> > -		if (!(rnp->expmask & mask))
> > -			mask_ofl_ipi &= ~mask;
> > +		/* CPU really is offline, and we need its QS. */
> > +		if (rnp->expmask & mask)
> > +			mask_ofl_test |= mask;
> >  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >  	}
> >  	/* Report quiescent states for those that went offline. */
> > -	mask_ofl_test |= mask_ofl_ipi;
> >  	if (mask_ofl_test)
> >  		rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> >  }
> 
> Would you be willing to port this optimization on top of current -rcu
> branch "dev" with a suitably modified commit message?
> 

Sure, will do ;-)

Regards,
Boqun

> 							Thanx, Paul




Thread overview: 10+ messages
     [not found] <000000000000604e8905944f211f@google.com>
2019-10-07 10:04 ` KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult Marco Elver
2019-10-07 13:22   ` Paul E. McKenney
2019-10-07 13:34   ` Steven Rostedt
2019-10-07 13:43   ` Boqun Feng
2019-10-08  0:11     ` Joel Fernandes
2019-10-08  2:12       ` Paul E. McKenney
2019-10-08  2:50         ` Boqun Feng
2019-10-08  3:33           ` Paul E. McKenney
2019-10-08  3:58             ` Boqun Feng
2019-10-08  1:47   ` Paul E. McKenney
