Hello,

syzbot found the following crash on:

HEAD commit:    a78622932c27 bpf: sockmap, fix double-free
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1305ba77800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
dashboard link: https://syzkaller.appspot.com/bug?extid=774fddf07b7ab29a1e55
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com

WARNING: CPU: 0 PID: 11734 at kernel/tracepoint.c:210 tracepoint_add_func kernel/tracepoint.c:210 [inline]
WARNING: CPU: 0 PID: 11734 at kernel/tracepoint.c:210 tracepoint_probe_register_prio+0x3b4/0xa50 kernel/tracepoint.c:282
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 11734 Comm: syz-executor1 Not tainted 4.17.0-rc4+ #13
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 panic+0x22f/0x4de kernel/panic.c:184
 __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
 report_bug+0x252/0x2d0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline]
RIP: 0010:tracepoint_probe_register_prio+0x3b4/0xa50 kernel/tracepoint.c:282
RSP: 0018:ffff8801c7977438 EFLAGS: 00010216
RAX: 0000000000040000 RBX: ffff8801c7977518 RCX: ffffc900080f9000
RDX: 00000000000012a1 RSI: ffffffff817a9d34 RDI: ffff8801c29ddab0
RBP: ffff8801c7977540 R08: ffff880197448700 R09: fffffbfff11b803c
R10: ffff8801c7977428 R11: ffffffff88dc01e7 R12: 00000000ffffffef
R13: 00000000ffffffff R14: ffffffff81516c90 R15: 0000000000000001
 tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:303
 trace_event_reg+0x19a/0x350 kernel/trace/trace_events.c:305
 perf_trace_event_reg kernel/trace/trace_event_perf.c:123 [inline]
 perf_trace_event_init+0x4fe/0x990 kernel/trace/trace_event_perf.c:198
 perf_trace_init+0x186/0x250 kernel/trace/trace_event_perf.c:222
 perf_tp_event_init+0xa6/0x120 kernel/events/core.c:8337
 perf_try_init_event+0x137/0x2f0 kernel/events/core.c:9734
 perf_init_event kernel/events/core.c:9772 [inline]
 perf_event_alloc.part.91+0x1248/0x3090 kernel/events/core.c:10038
 perf_event_alloc kernel/events/core.c:10394 [inline]
 __do_sys_perf_event_open+0xa8a/0x2fa0 kernel/events/core.c:10495
 __se_sys_perf_event_open kernel/events/core.c:10384 [inline]
 __x64_sys_perf_event_open+0xbe/0x150 kernel/events/core.c:10384
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x455a09
RSP: 002b:00007f136b4f5c68 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
RAX: ffffffffffffffda RBX: 00007f136b4f66d4 RCX: 0000000000455a09
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000002001d000
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000014
R13: 000000000000050c R14: 00000000006fb9c0 R15: 0000000000000003
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.
syzbot has found a reproducer for the following crash on:

HEAD commit:    ecb9f80d net/mvpp2: Replace tasklet with softirq hrtimer
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=115730ac600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7
dashboard link: https://syzkaller.appspot.com/bug?extid=774fddf07b7ab29a1e55
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11b02a22600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com

WARNING: CPU: 0 PID: 8824 at kernel/tracepoint.c:243 tracepoint_add_func kernel/tracepoint.c:243 [inline]
WARNING: CPU: 0 PID: 8824 at kernel/tracepoint.c:243 tracepoint_probe_register_prio+0x216/0x790 kernel/tracepoint.c:315
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 8824 Comm: syz-executor.4 Not tainted 5.3.0-rc3+ #133
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 panic+0x2dc/0x755 kernel/panic.c:219
 __warn.cold+0x20/0x4c kernel/panic.c:576
 report_bug+0x263/0x2b0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:179 [inline]
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:tracepoint_add_func kernel/tracepoint.c:243 [inline]
RIP: 0010:tracepoint_probe_register_prio+0x216/0x790 kernel/tracepoint.c:315
Code: 48 89 f8 48 c1 e8 03 80 3c 08 00 0f 85 bf 04 00 00 48 8b 45 b8 49 3b 45 08 0f 85 21 ff ff ff 41 bd ef ff ff ff e8 aa 8c fe ff <0f> 0b e8 a3 8c fe ff 48 c7 c7 20 44 de 88 e8 d7 1d ca 05 44 89 e8
RSP: 0018:ffff88807b5a7498 EFLAGS: 00010293
RAX: ffff8880a87a85c0 RBX: ffffffff89a1eb00 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffffffff8173fcd6 RDI: ffff88808afc4698
RBP: ffff88807b5a74f0 R08: ffff8880a87a85c0 R09: fffffbfff11bc885
R10: ffff88807b5a7488 R11: ffffffff88de4427 R12: ffff88808afc4690
R13: 00000000ffffffef R14: 00000000ffffffff R15: ffffffff8177f710
 tracepoint_probe_register+0x2b/0x40 kernel/tracepoint.c:335
 register_trace_sched_wakeup include/trace/events/sched.h:96 [inline]
 tracing_sched_register kernel/trace/trace_sched_switch.c:54 [inline]
 tracing_start_sched_switch+0xa8/0x190 kernel/trace/trace_sched_switch.c:106
 tracing_start_cmdline_record+0x13/0x20 kernel/trace/trace_sched_switch.c:131
 trace_printk_init_buffers kernel/trace/trace.c:3050 [inline]
 trace_printk_init_buffers.cold+0xdf/0xe9 kernel/trace/trace.c:3013
 bpf_get_trace_printk_proto+0xe/0x20 kernel/trace/bpf_trace.c:338
 cgroup_base_func_proto kernel/bpf/cgroup.c:799 [inline]
 cgroup_base_func_proto.isra.0+0x10e/0x120 kernel/bpf/cgroup.c:776
 cg_sockopt_func_proto+0x53/0x70 kernel/bpf/cgroup.c:1411
 check_helper_call+0x141/0x32f0 kernel/bpf/verifier.c:3950
 do_check+0x6213/0x89f0 kernel/bpf/verifier.c:7707
 bpf_check+0x6f99/0x9948 kernel/bpf/verifier.c:9294
 bpf_prog_load+0xe68/0x1670 kernel/bpf/syscall.c:1698
 __do_sys_bpf+0xc43/0x3460 kernel/bpf/syscall.c:2849
 __se_sys_bpf kernel/bpf/syscall.c:2808 [inline]
 __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:2808
 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459829
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fc4bf1dec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000459829
RDX: 0000000000000070 RSI: 0000000020000180 RDI: 0000000000000005
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc4bf1df6d4
R13: 00000000004bfc7c R14: 00000000004d1938 R15: 00000000ffffffff
Kernel Offset: disabled
Rebooting in 86400 seconds..
syzbot has bisected this bug to:

commit ecb9f80db23a7ab09b46b298b404e41dd7aff6e6
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Aug 13 08:00:25 2019 +0000

    net/mvpp2: Replace tasklet with softirq hrtimer

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13ffb9ee600000
start commit:   ecb9f80d net/mvpp2: Replace tasklet with softirq hrtimer
git tree:       net-next
final crash:    https://syzkaller.appspot.com/x/report.txt?x=100079ee600000
console output: https://syzkaller.appspot.com/x/log.txt?x=17ffb9ee600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7
dashboard link: https://syzkaller.appspot.com/bug?extid=774fddf07b7ab29a1e55
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11b02a22600000

Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com
Fixes: ecb9f80db23a ("net/mvpp2: Replace tasklet with softirq hrtimer")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection
On 2019-08-16 05:32:00 [-0700], syzbot wrote:
> syzbot has bisected this bug to:
>
> commit ecb9f80db23a7ab09b46b298b404e41dd7aff6e6
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date: Tue Aug 13 08:00:25 2019 +0000
>
> net/mvpp2: Replace tasklet with softirq hrtimer
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13ffb9ee600000
That bisect is wrong. The warning triggered only once, and this commit
was simply the topmost one in net-next at the time…
Sebastian
Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).

Also, write and read to/from those variables should be done with
WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
probes without holding the sched_register_mutex.

[ Compile-tested only. I suspect it might fix the following syzbot
  report:

  syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Joel Fernandes (Google) <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul E. McKenney <paulmck@linux.ibm.com>
---
 kernel/trace/trace_sched_switch.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e288168661e1..902e8bf59aeb 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -26,8 +26,8 @@ probe_sched_switch(void *ignore, bool preempt,
 {
 	int flags;
 
-	flags = (RECORD_TGID * !!sched_tgid_ref) +
-		(RECORD_CMDLINE * !!sched_cmdline_ref);
+	flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
+		(RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
 
 	if (!flags)
 		return;
@@ -39,8 +39,8 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee)
 {
 	int flags;
 
-	flags = (RECORD_TGID * !!sched_tgid_ref) +
-		(RECORD_CMDLINE * !!sched_cmdline_ref);
+	flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
+		(RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
 
 	if (!flags)
 		return;
@@ -89,21 +89,28 @@ static void tracing_sched_unregister(void)
 
 static void tracing_start_sched_switch(int ops)
 {
-	bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
+	bool sched_register;
+
 	mutex_lock(&sched_register_mutex);
+	sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
 
 	switch (ops) {
 	case RECORD_CMDLINE:
-		sched_cmdline_ref++;
+		WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref + 1);
 		break;
 
 	case RECORD_TGID:
-		sched_tgid_ref++;
+		WRITE_ONCE(sched_tgid_ref, sched_tgid_ref + 1);
 		break;
+
+	default:
+		WARN_ONCE(1, "Unsupported tracing op: %d", ops);
+		goto end;
 	}
 
-	if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
+	if (sched_register)
 		tracing_sched_register();
+end:
 	mutex_unlock(&sched_register_mutex);
 }
 
@@ -113,16 +120,21 @@ static void tracing_stop_sched_switch(int ops)
 
 	switch (ops) {
 	case RECORD_CMDLINE:
-		sched_cmdline_ref--;
+		WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref - 1);
 		break;
 
 	case RECORD_TGID:
-		sched_tgid_ref--;
+		WRITE_ONCE(sched_tgid_ref, sched_tgid_ref - 1);
 		break;
+
+	default:
+		WARN_ONCE(1, "Unsupported tracing op: %d", ops);
+		goto end;
 	}
 
 	if (!sched_cmdline_ref && !sched_tgid_ref)
 		tracing_sched_unregister();
+end:
 	mutex_unlock(&sched_register_mutex);
 }
-- 
2.11.0
On Fri, 16 Aug 2019 10:26:43 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Reading the sched_cmdline_ref and sched_tgid_ref initial state within
> tracing_start_sched_switch without holding the sched_register_mutex is
> racy against concurrent updates, which can lead to tracepoint probes
> being registered more than once (and thus trigger warnings within
> tracepoint.c).
>
> Also, write and read to/from those variables should be done with
> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> probes without holding the sched_register_mutex.
>

I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
It's done while holding the mutex. It's not that critical of a path,
and makes the code look ugly.

-- Steve

> [ Compile-tested only. I suspect it might fix the following syzbot
>   report:
>
>   syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com ]
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> CC: Joel Fernandes (Google) <joel@joelfernandes.org>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Paul E. McKenney <paulmck@linux.ibm.com>
> ---
>  kernel/trace/trace_sched_switch.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> index e288168661e1..902e8bf59aeb 100644
> --- a/kernel/trace/trace_sched_switch.c
> +++ b/kernel/trace/trace_sched_switch.c
> @@ -26,8 +26,8 @@ probe_sched_switch(void *ignore, bool preempt,
>  {
>  	int flags;
>
> -	flags = (RECORD_TGID * !!sched_tgid_ref) +
> -		(RECORD_CMDLINE * !!sched_cmdline_ref);
> +	flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
> +		(RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>
>  	if (!flags)
>  		return;
> @@ -39,8 +39,8 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee)
>  {
>  	int flags;
>
> -	flags = (RECORD_TGID * !!sched_tgid_ref) +
> -		(RECORD_CMDLINE * !!sched_cmdline_ref);
> +	flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
> +		(RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>
>  	if (!flags)
>  		return;
> @@ -89,21 +89,28 @@ static void tracing_sched_unregister(void)
>
>  static void tracing_start_sched_switch(int ops)
>  {
> -	bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
> +	bool sched_register;
> +
>  	mutex_lock(&sched_register_mutex);
> +	sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>
>  	switch (ops) {
>  	case RECORD_CMDLINE:
> -		sched_cmdline_ref++;
> +		WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref + 1);
>  		break;
>
>  	case RECORD_TGID:
> -		sched_tgid_ref++;
> +		WRITE_ONCE(sched_tgid_ref, sched_tgid_ref + 1);
>  		break;
> +
> +	default:
> +		WARN_ONCE(1, "Unsupported tracing op: %d", ops);
> +		goto end;
>  	}
>
> -	if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
> +	if (sched_register)
>  		tracing_sched_register();
> +end:
>  	mutex_unlock(&sched_register_mutex);
>  }
>
> @@ -113,16 +120,21 @@ static void tracing_stop_sched_switch(int ops)
>
>  	switch (ops) {
>  	case RECORD_CMDLINE:
> -		sched_cmdline_ref--;
> +		WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref - 1);
>  		break;
>
>  	case RECORD_TGID:
> -		sched_tgid_ref--;
> +		WRITE_ONCE(sched_tgid_ref, sched_tgid_ref - 1);
>  		break;
> +
> +	default:
> +		WARN_ONCE(1, "Unsupported tracing op: %d", ops);
> +		goto end;
>  	}
>
>  	if (!sched_cmdline_ref && !sched_tgid_ref)
>  		tracing_sched_unregister();
> +end:
>  	mutex_unlock(&sched_register_mutex);
>  }
>
On 16/08/2019 17:25, Steven Rostedt wrote:
>> Also, write and read to/from those variables should be done with
>> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> probes without holding the sched_register_mutex.
>>
>
> I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> It's done while holding the mutex. It's not that critical of a path,
> and makes the code look ugly.
>
I seem to recall something like locking primitives don't protect you from
store tearing / invented stores, so if you can have concurrent readers
using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
if it's done in a critical section.
On Fri, 16 Aug 2019 17:48:59 +0100
Valentin Schneider <valentin.schneider@arm.com> wrote:
> On 16/08/2019 17:25, Steven Rostedt wrote:
> >> Also, write and read to/from those variables should be done with
> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> >> probes without holding the sched_register_mutex.
> >>
> >
> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> > It's done while holding the mutex. It's not that critical of a path,
> > and makes the code look ugly.
> >
>
> I seem to recall something like locking primitives don't protect you from
> store tearing / invented stores, so if you can have concurrent readers
> using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
> if it's done in a critical section.
But for this, it really doesn't matter. The READ_ONCE() is for going
from 0->1 or 1->0; any other change is the same as 1.
When we enable trace events, we start recording the tasks comms such
that we can possibly map them to the pids. When we disable trace
events, we stop recording the comms so that we don't overwrite the
cache when not needed. Note, if more than the max cache of tasks are
recorded during a session, we are likely to miss comms anyway.
Thinking about this more, the READ_ONCE() and WRITE_ONCE() are not even
needed, because this is just a best effort anyway.
The only real fix was to move the check into the mutex protect area,
because that can cause a real bug if there was a race.
{
- bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
+ bool sched_register;
+
mutex_lock(&sched_register_mutex);
+ sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
Thus, I'd like to see a v2 of this patch without the READ_ONCE() or
WRITE_ONCE() added.
-- Steve
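[ The "real bug" Steve points at can be shown deterministically in a userspace model. The names mirror the kernel code, but the interleaving is simulated by splitting the buggy function at the point where the race occurs, rather than by running real threads: ]

```c
/* Model of the bug: the "do we need to register the probes?" decision
 * is computed from the refcount *before* the mutex is taken, so two
 * racing callers can both see the count as zero and both register. */

static int sched_cmdline_ref;
static int register_calls;	/* times the probes were registered */

static void tracing_sched_register(void)
{
	/* Registering twice is what fires the tracepoint.c warning. */
	register_calls++;
}

/* Buggy version, split into the two halves that can interleave. */
static int buggy_precheck(void)
{
	return !sched_cmdline_ref;	/* read outside the mutex: racy */
}

static void buggy_locked_part(int sched_register)
{
	/* mutex_lock(&sched_register_mutex); */
	sched_cmdline_ref++;
	if (sched_register && sched_cmdline_ref)
		tracing_sched_register();
	/* mutex_unlock(&sched_register_mutex); */
}

/* Fixed version: the check moves inside the critical section, so no
 * other caller can update the refcount between check and increment. */
static void fixed_start(void)
{
	/* mutex_lock(&sched_register_mutex); */
	int sched_register = !sched_cmdline_ref;

	sched_cmdline_ref++;
	if (sched_register)
		tracing_sched_register();
	/* mutex_unlock(&sched_register_mutex); */
}
```

With the buggy split, two callers that both run the pre-lock check before either enters the critical section end up registering twice; with the fixed version a second caller sees the nonzero refcount and skips registration.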
----- On Aug 16, 2019, at 12:25 PM, rostedt rostedt@goodmis.org wrote:

> On Fri, 16 Aug 2019 10:26:43 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
[...]
>> Also, write and read to/from those variables should be done with
>> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> probes without holding the sched_register_mutex.
>>
>
> I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> It's done while holding the mutex. It's not that critical of a path,
> and makes the code look ugly.

The update is done while holding the mutex, but the read-side does not
hold that mutex, so it can observe the intermediate state caused by
store-tearing or invented stores which can be generated by the compiler
on the update-side.

Please refer to the following LWN article:

https://lwn.net/Articles/793253/

Sections:
- "Store tearing"
- "Invented stores"

Arguably, based on that article, store tearing is only observed in the
wild for constants (which is not the case here), and invented stores
seem to require specific code patterns. But I wonder why we would ever
want to pair a fragile non-volatile store with a READ_ONCE() ?
Considering the pain associated to reproduce and hunt down this kind of
issue in the wild, I would be tempted to enforce that any READ_ONCE()
operating on a variable would either need to be paired with WRITE_ONCE()
or with atomic operations, so those can eventually be validated by
static code checkers and code sanitizers.

If coding style is your only concern here, we may want to consider
introducing new macros in compiler.h:

WRITE_ONCE_INC(v)        /* v++ */
WRITE_ONCE_DEC(v)        /* v-- */
WRITE_ONCE_ADD(v, count) /* v += count */
WRITE_ONCE_SUB(v, count) /* v -= count */

Thanks,

Mathieu

> -- Steve
>
>> [ Compile-tested only. I suspect it might fix the following syzbot
>>   report:
>>
>>   syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com ]
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> CC: Joel Fernandes (Google) <joel@joelfernandes.org>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
>> CC: Thomas Gleixner <tglx@linutronix.de>
>> CC: Paul E. McKenney <paulmck@linux.ibm.com>
>> ---
>>  kernel/trace/trace_sched_switch.c | 32 ++++++++++++++++++++++----------
>>  1 file changed, 22 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/trace/trace_sched_switch.c
>> b/kernel/trace/trace_sched_switch.c
>> index e288168661e1..902e8bf59aeb 100644
>> --- a/kernel/trace/trace_sched_switch.c
>> +++ b/kernel/trace/trace_sched_switch.c
>> @@ -26,8 +26,8 @@ probe_sched_switch(void *ignore, bool preempt,
>>  {
>>  	int flags;
>>
>> -	flags = (RECORD_TGID * !!sched_tgid_ref) +
>> -		(RECORD_CMDLINE * !!sched_cmdline_ref);
>> +	flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
>> +		(RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>>
>>  	if (!flags)
>>  		return;
>> @@ -39,8 +39,8 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee)
>>  {
>>  	int flags;
>>
>> -	flags = (RECORD_TGID * !!sched_tgid_ref) +
>> -		(RECORD_CMDLINE * !!sched_cmdline_ref);
>> +	flags = (RECORD_TGID * !!READ_ONCE(sched_tgid_ref)) +
>> +		(RECORD_CMDLINE * !!READ_ONCE(sched_cmdline_ref));
>>
>>  	if (!flags)
>>  		return;
>> @@ -89,21 +89,28 @@ static void tracing_sched_unregister(void)
>>
>>  static void tracing_start_sched_switch(int ops)
>>  {
>> -	bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>> +	bool sched_register;
>> +
>>  	mutex_lock(&sched_register_mutex);
>> +	sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>>
>>  	switch (ops) {
>>  	case RECORD_CMDLINE:
>> -		sched_cmdline_ref++;
>> +		WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref + 1);
>>  		break;
>>
>>  	case RECORD_TGID:
>> -		sched_tgid_ref++;
>> +		WRITE_ONCE(sched_tgid_ref, sched_tgid_ref + 1);
>>  		break;
>> +
>> +	default:
>> +		WARN_ONCE(1, "Unsupported tracing op: %d", ops);
>> +		goto end;
>>  	}
>>
>> -	if (sched_register && (sched_cmdline_ref || sched_tgid_ref))
>> +	if (sched_register)
>>  		tracing_sched_register();
>> +end:
>>  	mutex_unlock(&sched_register_mutex);
>>  }
>>
>> @@ -113,16 +120,21 @@ static void tracing_stop_sched_switch(int ops)
>>
>>  	switch (ops) {
>>  	case RECORD_CMDLINE:
>> -		sched_cmdline_ref--;
>> +		WRITE_ONCE(sched_cmdline_ref, sched_cmdline_ref - 1);
>>  		break;
>>
>>  	case RECORD_TGID:
>> -		sched_tgid_ref--;
>> +		WRITE_ONCE(sched_tgid_ref, sched_tgid_ref - 1);
>>  		break;
>> +
>> +	default:
>> +		WARN_ONCE(1, "Unsupported tracing op: %d", ops);
>> +		goto end;
>>  	}
>>
>>  	if (!sched_cmdline_ref && !sched_tgid_ref)
>>  		tracing_sched_unregister();
>> +end:
>>  	mutex_unlock(&sched_register_mutex);
>>  }

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Aug 16, 2019, at 1:04 PM, rostedt rostedt@goodmis.org wrote:

> On Fri, 16 Aug 2019 17:48:59 +0100
> Valentin Schneider <valentin.schneider@arm.com> wrote:
>
>> On 16/08/2019 17:25, Steven Rostedt wrote:
>> >> Also, write and read to/from those variables should be done with
>> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
>> >> probes without holding the sched_register_mutex.
>> >>
>> >
>> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
>> > It's done while holding the mutex. It's not that critical of a path,
>> > and makes the code look ugly.
>> >
>>
>> I seem to recall something like locking primitives don't protect you from
>> store tearing / invented stores, so if you can have concurrent readers
>> using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
>> if it's done in a critical section.
>
> But for this, it really doesn't matter. The READ_ONCE() is for going
> from 0->1 or 1->0 any other change is the same as 1.

Let's consider this "invented store" scenario (even though as I noted in my
earlier email, I suspect this particular instance of the code does not appear
to fit the requirements to generate this in the wild with current compilers):

initial state:

sched_tgid_ref = 10;

Thread A                            Thread B

registration                        tracepoint probe

sched_tgid_ref++
- compiler decides to invent a
  store: sched_tgid_ref = 0
                                    READ_ONCE(sched_tgid_ref) ->
                                    observes 0, but should really
                                    see either 10 or 11.
- compiler stores 11 into
  sched_tgid_ref

This kind of scenario could cause spurious missing data in the middle
of a trace caused by another user trying to increment the reference
count, which is really unexpected.

A similar scenario can happen for "store tearing" if the compiler
decides to break the store into many stores close to the 16-bit
overflow value when updating a 32-bit reference count. Spurious
1 -> 0 -> 1 transitions could be observed by readers.

> When we enable trace events, we start recording the tasks comms such
> that we can possibly map them to the pids. When we disable trace
> events, we stop recording the comms so that we don't overwrite the
> cache when not needed. Note, if more than the max cache of tasks are
> recorded during a session, we are likely to miss comms anyway.
>
> Thinking about this more, the READ_ONCE() and WRITE_ONCE() are not even
> needed, because this is just a best effort anyway.

If you choose not to use READ_ONCE(), then the "load tearing" issue
can cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
overflow as described above. The "invented load" also becomes an
issue, because the compiler could use the loaded value for a branch,
and re-load that value between two branches which are expected to use
the same value, effectively generating a corrupted state.

I think we need a statement about whether READ_ONCE/WRITE_ONCE should
be used in this kind of situation, or if we are fine dealing with the
awkward compiler side-effects when they will occur.

Thanks,

Mathieu

> The only real fix was to move the check into the mutex protect area,
> because that can cause a real bug if there was a race.
>
> {
> -	bool sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
> +	bool sched_register;
> +
> 	mutex_lock(&sched_register_mutex);
> +	sched_register = (!sched_cmdline_ref && !sched_tgid_ref);
>
> Thus, I'd like to see a v2 of this patch without the READ_ONCE() or
> WRITE_ONCE() added.
>
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Fri, 16 Aug 2019 13:19:20 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Aug 16, 2019, at 12:25 PM, rostedt rostedt@goodmis.org wrote:
>
> > On Fri, 16 Aug 2019 10:26:43 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> [...]
> >>
> >> Also, write and read to/from those variables should be done with
> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> >> probes without holding the sched_register_mutex.
> >>
> >
> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> > It's done while holding the mutex. It's not that critical of a path,
> > and makes the code look ugly.
>
> The update is done while holding the mutex, but the read-side does not
> hold that mutex, so it can observe the intermediate state caused by
> store-tearing or invented stores which can be generated by the compiler
> on the update-side.
>
> Please refer to the following LWN article:
>
> https://lwn.net/Articles/793253/
>
> Sections:
> - "Store tearing"
> - "Invented stores"
>
> Arguably, based on that article, store tearing is only observed in the
> wild for constants (which is not the case here), and invented stores
> seem to require specific code patterns. But I wonder why we would ever
> want to pair a fragile non-volatile store with a READ_ONCE() ?
> Considering the pain associated to reproduce and hunt down this kind of
> issue in the wild, I would be tempted to enforce that any READ_ONCE()
> operating on a variable would either need to be paired with WRITE_ONCE()
> or with atomic operations, so those can eventually be validated by
> static code checkers and code sanitizers.

My issue is that this is just a case to decide if we should cache a
comm or not. It's a helper, nothing more. There's no guarantee that
something will get cached.

-- Steve

> If coding style is your only concern here, we may want to consider
> introducing new macros in compiler.h:
>
> WRITE_ONCE_INC(v)        /* v++ */
> WRITE_ONCE_DEC(v)        /* v-- */
> WRITE_ONCE_ADD(v, count) /* v += count */
> WRITE_ONCE_SUB(v, count) /* v -= count */
>
On Fri, 16 Aug 2019 13:41:59 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Aug 16, 2019, at 1:04 PM, rostedt rostedt@goodmis.org wrote:
>
> > On Fri, 16 Aug 2019 17:48:59 +0100
> > Valentin Schneider <valentin.schneider@arm.com> wrote:
> >
> >> On 16/08/2019 17:25, Steven Rostedt wrote:
> >> >> Also, write and read to/from those variables should be done with
> >> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing
> >> >> probes without holding the sched_register_mutex.
> >> >>
> >> >
> >> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary?
> >> > It's done while holding the mutex. It's not that critical of a path,
> >> > and makes the code look ugly.
> >> >
> >>
> >> I seem to recall something like locking primitives don't protect you from
> >> store tearing / invented stores, so if you can have concurrent readers
> >> using READ_ONCE(), there should be a WRITE_ONCE() on the writer side, even
> >> if it's done in a critical section.
> >
> > But for this, it really doesn't matter. The READ_ONCE() is for going
> > from 0->1 or 1->0 any other change is the same as 1.
>
> Let's consider this "invented store" scenario (even though as I noted in my
> earlier email, I suspect this particular instance of the code does not appear
> to fit the requirements to generate this in the wild with current compilers):
>
> initial state:
>
> sched_tgid_ref = 10;
>
> Thread A                            Thread B
>
> registration                        tracepoint probe
>
> sched_tgid_ref++
> - compiler decides to invent a
>   store: sched_tgid_ref = 0

This looks to me like it would cause more issues in other parts of the
code if a compiler ever decided to do this. But I still don't care for
this case. It's a cache, nothing more. No guarantee that anything will
be recorded.

>                                     READ_ONCE(sched_tgid_ref) ->
>                                     observes 0, but should really
>                                     see either 10 or 11.
> - compiler stores 11 into
>   sched_tgid_ref
>
> This kind of scenario could cause spurious missing data in the middle
> of a trace caused by another user trying to increment the reference
> count, which is really unexpected.
>
> A similar scenario can happen for "store tearing" if the compiler
> decides to break the store into many stores close to the 16-bit
> overflow value when updating a 32-bit reference count. Spurious
> 1 -> 0 -> 1 transitions could be observed by readers.
>
> > When we enable trace events, we start recording the tasks comms such
> > that we can possibly map them to the pids. When we disable trace
> > events, we stop recording the comms so that we don't overwrite the
> > cache when not needed. Note, if more than the max cache of tasks are
> > recorded during a session, we are likely to miss comms anyway.
> >
> > Thinking about this more, the READ_ONCE() and WRITE_ONCE() are not
> > even needed, because this is just a best effort anyway.
>
> If you choose not to use READ_ONCE(), then the "load tearing" issue
> can cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> overflow as described above. The "invented load" also becomes an
> issue, because the compiler could use the loaded value for a branch,
> and re-load that value between two branches which are expected to use
> the same value, effectively generating a corrupted state.
>
> I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> be used in this kind of situation, or if we are fine dealing with the
> awkward compiler side-effects when they will occur.
>

Let me put it this way. My biggest issue with this is that the
READ/WRITE_ONCE() has *nothing* to do with the bug you are trying to
fix. That bug is that we made the decision of starting the probes
outside the mutex. That is fixed by moving the decision into the mutex.

The READ/WRITE_ONCE() is just added noise.

-- Steve
On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
> If you choose not to use READ_ONCE(), then the "load tearing" issue can
> cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> overflow as described above. The "Invented load" also becomes an issue,
> because the compiler could use the loaded value for a branch, and re-load
> that value between two branches which are expected to use the same value,
> effectively generating a corrupted state.
>
> I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> be used in this kind of situation, or if we are fine dealing with the
> awkward compiler side-effects when they will occur.
The only real downside (apart from readability) of READ_ONCE and
WRITE_ONCE is that they prevent the compiler from optimizing accesses
to the location being read or written. But if you're just doing a
single access in each place, not multiple accesses, then there's
nothing to optimize anyway. So there's no real reason not to use
READ_ONCE or WRITE_ONCE.
Alan Stern
On Fri, Aug 16, 2019 at 3:19 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
>
> > If you choose not to use READ_ONCE(), then the "load tearing" issue can
> > cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> > overflow as described above. The "Invented load" also becomes an issue,
> > because the compiler could use the loaded value for a branch, and re-load
> > that value between two branches which are expected to use the same value,
> > effectively generating a corrupted state.
> >
> > I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> > be used in this kind of situation, or if we are fine dealing with the
> > awkward compiler side-effects when they will occur.
>
> The only real downside (apart from readability) of READ_ONCE and
> WRITE_ONCE is that they prevent the compiler from optimizing accesses
> to the location being read or written. But if you're just doing a
> single access in each place, not multiple accesses, then there's
> nothing to optimize anyway. So there's no real reason not to use
> READ_ONCE or WRITE_ONCE.
I am also more on the side of using *_ONCE. To me, by principle, I
would be willing to convert any concurrent plain access using _ONCE,
just so we don't have to worry about it now or in the future and also
documents the access.
Perhaps the commit message can be reworded to mention that the _ONCE
is an additional clean up for safe access.
thanks,
- Joel
On Fri, 16 Aug 2019, Joel Fernandes wrote:
> On Fri, Aug 16, 2019 at 3:19 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
> >
> > > If you choose not to use READ_ONCE(), then the "load tearing" issue can
> > > cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> > > overflow as described above. The "Invented load" also becomes an issue,
> > > because the compiler could use the loaded value for a branch, and re-load
> > > that value between two branches which are expected to use the same value,
> > > effectively generating a corrupted state.
> > >
> > > I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> > > be used in this kind of situation, or if we are fine dealing with the
> > > awkward compiler side-effects when they will occur.
> >
> > The only real downside (apart from readability) of READ_ONCE and
> > WRITE_ONCE is that they prevent the compiler from optimizing accesses
> > to the location being read or written. But if you're just doing a
> > single access in each place, not multiple accesses, then there's
> > nothing to optimize anyway. So there's no real reason not to use
> > READ_ONCE or WRITE_ONCE.
>
> I am also more on the side of using *_ONCE. To me, by principle, I
> would be willing to convert any concurrent plain access using _ONCE,
> just so we don't have to worry about it now or in the future and also
> documents the access.
By that argumentation we need to plaster half of the kernel with _ONCE()
and I'm so not looking forward to the insane amount of script kiddies
patches to do that.
Can we finally put a foot down and tell compiler and standard committee
people to stop this insanity?
Thanks,
tglx
On Fri, 16 Aug 2019 16:44:10 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:
> I am also more on the side of using *_ONCE. To me, by principle, I
> would be willing to convert any concurrent plain access using _ONCE,
> just so we don't have to worry about it now or in the future and also
> documents the access.
>
> Perhaps the commit message can be reworded to mention that the _ONCE
> is an additional clean up for safe access.
The most I'll take is two separate patches. One is going to be marked
for stable as it fixes a real bug. The other is more for cosmetic or
theoretical issues, that I will state clearly "NOT FOR STABLE", such
that the autosel doesn't take them.
-- Steve
On Fri, Aug 16, 2019 at 10:49:04PM +0200, Thomas Gleixner wrote:
> On Fri, 16 Aug 2019, Joel Fernandes wrote:
> > On Fri, Aug 16, 2019 at 3:19 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> > > On Fri, 16 Aug 2019, Mathieu Desnoyers wrote:
> > >
> > > > If you choose not to use READ_ONCE(), then the "load tearing" issue can
> > > > cause similar spurious 1 -> 0 -> 1 transitions near 16-bit counter
> > > > overflow as described above. The "Invented load" also becomes an issue,
> > > > because the compiler could use the loaded value for a branch, and re-load
> > > > that value between two branches which are expected to use the same value,
> > > > effectively generating a corrupted state.
> > > >
> > > > I think we need a statement about whether READ_ONCE/WRITE_ONCE should
> > > > be used in this kind of situation, or if we are fine dealing with the
> > > > awkward compiler side-effects when they will occur.
> > >
> > > The only real downside (apart from readability) of READ_ONCE and
> > > WRITE_ONCE is that they prevent the compiler from optimizing accesses
> > > to the location being read or written. But if you're just doing a
> > > single access in each place, not multiple accesses, then there's
> > > nothing to optimize anyway. So there's no real reason not to use
> > > READ_ONCE or WRITE_ONCE.
> >
> > I am also more on the side of using *_ONCE. To me, by principle, I
> > would be willing to convert any concurrent plain access using _ONCE,
> > just so we don't have to worry about it now or in the future and also
> > documents the access.
>
> By that argumentation we need to plaster half of the kernel with _ONCE()
> and I'm so not looking forward to the insane amount of script kiddies
> patches to do that.

Really? That is quite scary that you are saying half of the kernel has
issues with concurrent access or compiler optimizations. It scares me
that a concurrent access can tear a store/load apart and existing code
can just fail, if that is the case.

> Can we finally put a foot down and tell compiler and standard committee
> people to stop this insanity?

Sure, or could the compilers provide flags which prevent such
optimizations, similar to the -O* flags?

thanks,

- Joel
On Fri, Aug 16, 2019 at 04:49:12PM -0400, Steven Rostedt wrote:
> On Fri, 16 Aug 2019 16:44:10 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
> > I am also more on the side of using *_ONCE. To me, by principle, I
> > would be willing to convert any concurrent plain access using _ONCE,
> > just so we don't have to worry about it now or in the future and also
> > documents the access.
> >
> > Perhaps the commit message can be reworded to mention that the _ONCE
> > is an additional clean up for safe access.
>
> The most I'll take is two separate patches. One is going to be marked
> for stable as it fixes a real bug. The other is more for cosmetic or
> theoretical issues, that I will state clearly "NOT FOR STABLE", such
> that the autosel doesn't take them.
Makes sense to me!
thanks,
- Joel
On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Can we finally put a foot down and tell compiler and standard committee
> people to stop this insanity?
It's already effectively done.
Yes, values can be read from memory multiple times if they need
reloading. So "READ_ONCE()" when the value can change is a damn good
idea.
But it should only be used if the value *can* change. Inside a locked
region it is actively pointless and misleading.
Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
using it (notably if you're not holding a lock).
If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
the values from changing, they are only making the code illegible.
Don't do it.
But in the *absence* of locking, READ_ONCE/WRITE_ONCE is usually a
good thing. The READ_ONCE actually tends to matter, because even if
the value is used only once at a source level, the compiler *could*
decide to do something else.
The WRITE_ONCE() may or may not matter (afaik, thanks to concurrency,
modern C standard does not allow optimistic writes anyway, and we
wouldn't really accept such a compiler option if it did).
But if the write is done without locking, it's good practice just to
show you are aware of the whole "done without locking" part.
Linus
On 16/08/2019 21:57, Joel Fernandes wrote:
>> Can we finally put a foot down and tell compiler and standard committee
>> people to stop this insanity?
>
> Sure, or could the compilers provide flags which prevent such optimization
> similar to -O* flags?

How would you differentiate optimizations you want from those you don't
with just a flag? There's a reason we use volatile casts instead of
declaring everything volatile: we actually *want* those optimizations. It
just so happens that we don't want them *in some places*, and we have
tools to tag them as such.

The alternative is having a compiler that can magically correlate e.g.
locked writes with lock-free reads and properly handle them, but I don't
think there's a foolproof way of doing that.

> thanks,
>
> - Joel
On Fri, Aug 16, 2019 at 3:27 PM Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> How would you differentiate optimizations you want from those you don't with
> just a flag? There's a reason we use volatile casts instead of declaring
> everything volatile: we actually *want* those optimizations. It just so
> happens that we don't want them *in some places*, and we have tools to tag
> them as such.
We actually disable lots of "valid" (read: the standard allows them,
but they are completely wrong for the kernel) optimizations because
they are wrong.
The whole type-based alias thing is just wrong. The C standards body
was incompetent to allow that garbage. So we disable it.
If you can *prove* that no aliasing exists, go ahead and re-order
accesses. But no guesses based on random types.
Similarly, if some compiler decides that it's ok to make speculative
writes (knowing it will over-write it with the right data later) to
data that is possibly visible to other threads, then such an
"optimization" needs to just be disabled. It might help some
benchmark, and if you read the standard just the right way it might be
allowed - but that doesn't make it valid.
We already had situations like that, where compiler people thought it
would be ok (for example) to turn a narrow write into a wider
read-modify-write because it had already done the wider read for other
reasons.
Again, the original C standard "allows" that in theory, because the
original C standard doesn't take threading into account. In fact, the
alpha architecture made actively bad design decisions based on that
(incorrect) assumption.
It turns out that in that case, even non-kernel people rebelled, and
it's apparently thankfully not allowed in newer versions of the
standard, exactly because threading has become a thing. You can't
magically write back unrelated variables just because they might be
next-door neighbors and share a word.
So no, we do *not* in general just say that we want any random
optimizations. A compiler that turns a single write into something
else is almost certainly something that shouldn't be allowed near the
kernel.
We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
because of some theoretical "compiler is free to do garbage"
arguments. If such garbage happens, we need to fix the compiler, the
same way we already do with
-fno-strict-aliasing
-fno-delete-null-pointer-checks
-fno-strict-overflow
because all those "optimizations" are just fundamentally unsafe and wrong.
I really wish the compiler would never take advantage of "I can prove
this is undefined behavior" kind of things when it comes to the kernel
(or any other projects I am involved with, for that matter). If you
can prove that, then you shouldn't decide to generate random code
without a big warning. But that's what those optimizations that we
disable effectively all do.
I'd love to have a flag that says "all undefined behavior is treated
as implementation-defined". There's a somewhat subtle - but very
important - difference there.
And that's what some hypothetical speculative write optimizations do
too. I do not believe they are valid for the kernel. If the code says
        if (a)
                global_var = 1
        else
                global_var = 0
then the compiler had better not turn that into
        global_var = 0
        if (a)
                global_var = 1
even if there isn't a volatile there. But yes, we've had compiler
writers that say "if you read the specs, that's ok".
No, it's not ok. Because reality trumps any weasel-spec-reading.
And happily, I don't think we've ever really seen a compiler that we
use that actually does the above kind of speculative write thing (but
doing it for your own local variables that can't be seen by other
threads of execution - go wild).
So in general, we very much expect the compiler to do sane code
generation, and not (for example) do store tearing on normal
word-sized things or add writes that weren't there originally etc.
And yes, reads are different from writes. Reads don't have the same
kind of "other threads of execution can see them" effects, so a
compiler turning a single read into multiple reads is much more
realistic and not the same kind of "we need to expect a certain kind
of sanity from the compiler" issue.
Linus
----- On Aug 16, 2019, at 4:49 PM, rostedt rostedt@goodmis.org wrote:

> On Fri, 16 Aug 2019 16:44:10 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
>> I am also more on the side of using *_ONCE. To me, by principle, I
>> would be willing to convert any concurrent plain access using _ONCE,
>> just so we don't have to worry about it now or in the future and also
>> documents the access.
>>
>> Perhaps the commit message can be reworded to mention that the _ONCE
>> is an additional clean up for safe access.
>
> The most I'll take is two separate patches. One is going to be marked
> for stable as it fixes a real bug. The other is more for cosmetic or
> theoretical issues, that I will state clearly "NOT FOR STABLE", such
> that the autosel doesn't take them.

Splitting this into two separate patches makes perfect sense.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Aug 16, 2019, at 5:04 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Can we finally put a foot down and tell compiler and standard committee
>> people to stop this insanity?
>
> It's already effectively done.
>
> Yes, values can be read from memory multiple times if they need
> reloading. So "READ_ONCE()" when the value can change is a damn good
> idea.
>
> But it should only be used if the value *can* change. Inside a locked
> region it is actively pointless and misleading.
>
> Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
> using it (notably if you're not holding a lock).
>
> If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
> the values from changing, they are only making the code illegible.
> Don't do it.

I agree with your argument in the case where both read-side and write-side
are protected by the same lock: READ/WRITE_ONCE are useless then. However,
in the scenario we have here, only the write-side is protected by the lock
against concurrent updates, but reads don't take any lock.

If WRITE_ONCE has any use at all (protecting against store tearing and
invented stores), it should be used even with a lock held in this
scenario, because the lock does not prevent READ_ONCE() from observing
transient states caused by lack of WRITE_ONCE() for the update.

So why does WRITE_ONCE exist in the first place? Is it for documentation
purposes only, or are there actual situations where omitting it can cause
bugs with real-life compilers?

In terms of code change, should we favor only introducing WRITE_ONCE in
new code, or should existing code matching those conditions be moved to
WRITE_ONCE as bug fixes?

Thanks,

Mathieu

> But in the *absence* of locking, READ_ONCE/WRITE_ONCE is usually a
> good thing. The READ_ONCE actually tends to matter, because even if
> the value is used only once at a source level, the compiler *could*
> decide to do something else.
>
> The WRITE_ONCE() may or may not matter (afaik, thanks to concurrency,
> modern C standard does not allow optimistic writes anyway, and we
> wouldn't really accept such a compiler option if it did).
>
> But if the write is done without locking, it's good practice just to
> show you are aware of the whole "done without locking" part.
>
> Linus

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Aug 16, 2019, at 6:57 PM, Linus Torvalds torvalds@linux-foundation.org wrote:

> So in general, we very much expect the compiler to do sane code
> generation, and not (for example) do store tearing on normal
> word-sized things or add writes that weren't there originally etc.

My understanding of https://lwn.net/Articles/793253/ section "Store
tearing", which points at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028, seems to contradict
your expectation, at least when writing constants to a 64-bit word without
a volatile access.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Fri, 16 Aug 2019 21:36:49 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Aug 16, 2019, at 5:04 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
>
> > On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Can we finally put a foot down and tell compiler and standard committee
> >> people to stop this insanity?
> >
> > It's already effectively done.
> >
> > Yes, values can be read from memory multiple times if they need
> > reloading. So "READ_ONCE()" when the value can change is a damn good
> > idea.
> >
> > But it should only be used if the value *can* change. Inside a locked
> > region it is actively pointless and misleading.
> >
> > Similarly, WRITE_ONCE() should only be used if you have a _reason_ for
> > using it (notably if you're not holding a lock).
> >
> > If people use READ_ONCE/WRITE_ONCE when there are locks that prevent
> > the values from changing, they are only making the code illegible.
> > Don't do it.
>
> I agree with your argument in the case where both read-side and write-side
> are protected by the same lock: READ/WRITE_ONCE are useless then. However,
> in the scenario we have here, only the write-side is protected by the lock
> against concurrent updates, but reads don't take any lock.

And because reads are not protected by any lock or memory barrier, using
READ_ONCE() is pointless. The CPU could be doing a lot of hidden
manipulation of that variable too.

Again, this is just to enable caching of the comm. Nothing more. It's a
"best effort" algorithm. We get it, we then can map a pid to a name. If
not, we just have a pid and we map "<...>".

Next you'll be asking for the memory barriers to guarantee a real hit.
And honestly, this information is not worth any overhead. And most often
we enable this before we enable the tracepoint we want this information
from, which requires modification of the text area and will do a bunch of
syncs across CPUs. That alone will most likely keep any race from
happening.

The only real bug is that the check to know if we should add the probe or
not was done outside the lock, and when we hit the race, we could add a
probe twice, causing the kernel to spit out a warning. You fixed that,
and that's all that needs to be done.

I'm now even more against adding the READ_ONCE() or WRITE_ONCE().

-- Steve

> If WRITE_ONCE has any use at all (protecting against store tearing and
> invented stores), it should be used even with a lock held in this
> scenario, because the lock does not prevent READ_ONCE() from observing
> transient states caused by lack of WRITE_ONCE() for the update.
>
> So why does WRITE_ONCE exist in the first place? Is it for documentation
> purposes only, or are there actual situations where omitting it can cause
> bugs with real-life compilers?
>
> In terms of code change, should we favor only introducing WRITE_ONCE
> in new code, or should existing code matching those conditions be
> moved to WRITE_ONCE as bug fixes?
On Fri, Aug 16, 2019 at 03:57:43PM -0700, Linus Torvalds wrote:

[ . . . ]

> We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> because of some theoretical "compiler is free to do garbage"
> arguments. If such garbage happens, we need to fix the compiler, the
> same way we already do with
>
> -fno-strict-aliasing

Yeah, the compete-with-FORTRAN stuff. :-/

There is some work going on in the C committee on this, where the
theorists would like to restrict strict-alias based optimizations to
enable better analysis tooling. And no, although the theorists are
pushing in the direction we would like them to, as far as I can see they
are not pushing as far as we would like. But it might be that
-fno-strict-aliasing needs some upgrades as well. I expect to learn more
at the next meeting in a few months.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf

> -fno-delete-null-pointer-checks
> -fno-strict-overflow
>
> because all those "optimizations" are just fundamentally unsafe and wrong.

I was hoping that -fno-strict-overflow might go into the C++ (not C)
standard, and even thought that it had at one point, but what went into
the standard was that signed integers are twos complement, not that
overflowing them is well defined.

We are pushing to hopefully end, but at least to restrict, the current
pointer lifetime-end zap behavior in both C and C++, which is similar to
the pointer provenance/alias analysis that -fno-strict-aliasing at least
partially suppresses. The zapping invalidates loading, storing,
comparing, or doing arithmetic on a pointer to an object whose lifetime
has ended. (The WG14 PDF linked to below gives a non-exhaustive list of
problems that this causes.) The Linux kernel used to avoid this by
refusing to tell the compiler about kmalloc() and friends, but that
changed a few years ago. This zapping rules out a non-trivial class of
concurrent algorithms, but for once RCU is unaffected. Some committee
members complained that zapping has been part of the standard since about
1990, but Maged Michael found a writeup of one of the algorithms dating
back to 1973. ;-)

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2369.pdf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1726r0.pdf

> I really wish the compiler would never take advantage of "I can prove
> this is undefined behavior" kind of things when it comes to the kernel
> (or any other projects I am involved with, for that matter). If you
> can prove that, then you shouldn't decide to generate random code
> without a big warning. But that's what those optimizations that we
> disable effectively all do.
>
> I'd love to have a flag that says "all undefined behavior is treated
> as implementation-defined". There's a somewhat subtle - but very
> important - difference there.

It would also be nice to downgrade some of the undefined behavior in the
standard itself. Compiler writers usually hate this because it would
require them to document what their compiler does in each case. So they
would prefer "unspecified" if they could not have "undefined".

> And that's what some hypothetical speculative write optimizations do
> too. I do not believe they are valid for the kernel. If the code says
>
>         if (a)
>                 global_var = 1
>         else
>                 global_var = 0
>
> then the compiler had better not turn that into
>
>         global_var = 0
>         if (a)
>                 global_var = 1
>
> even if there isn't a volatile there. But yes, we've had compiler
> writers that say "if you read the specs, that's ok".
>
> No, it's not ok. Because reality trumps any weasel-spec-reading.
>
> And happily, I don't think we've ever really seen a compiler that we
> use that actually does the above kind of speculative write thing (but
> doing it for your own local variables that can't be seen by other
> threads of execution - go wild).

Me, I would still want to use WRITE_ONCE() in this case.

> So in general, we very much expect the compiler to do sane code
> generation, and not (for example) do store tearing on normal
> word-sized things or add writes that weren't there originally etc.
>
> And yes, reads are different from writes. Reads don't have the same
> kind of "other threads of execution can see them" effects, so a
> compiler turning a single read into multiple reads is much more
> realistic and not the same kind of "we need to expect a certain kind
> of sanity from the compiler" issue.

Though each of those compiler-generated multiple reads might return a
different value, right?

							Thanx, Paul
On Fri, Aug 16, 2019, 18:36 Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> If WRITE_ONCE has any use at all (protecting against store tearing and
> invented stores), it should be used even with a lock held in this
> scenario, because the lock does not prevent READ_ONCE() from observing
> transient states caused by lack of WRITE_ONCE() for the update.

The thing is, we really haven't required WRITE_ONCE() to protect against
store tearing. We have lots of traditional code that does stuff along the
lines of

        .. set up data structure ..
        smp_wmb();
        *ptr = newdata;

and we simply *depend* on the compiler doing the "*ptr" as a single
write. We've simply always done that.

Even on UP we've had the "interrupts will see old value or new value but
not some random half-way value", going all the way back to the original
kernel sources.

The data tearing issue is almost a non-issue. We're not going to add
WRITE_ONCE() to these kinds of places for no good reason.

> So why does WRITE_ONCE exist in the first place ? Is it for documentation
> purposes only or are there actual situations where omitting it can cause
> bugs with real-life compilers ?

WRITE_ONCE should be seen mainly as (a) documentation and (b) for new
code.

Although I suspect often you'd be better off using smp_load_acquire() and
smp_store_release() when you have code sequences where you have unlocked
READ_ONCE/WRITE_ONCE and memory ordering.

WRITE_ONCE() doesn't even protect against data tearing. If you do a
"WRITE_ONCE()" on a type larger than 8 bytes, it will fall back to
__builtin_memcpy(). So honestly, WRITE_ONCE() is often more documentation
than protecting against something, but overdoing it doesn't help document
anything, it just obscures the point.

Yeah, yeah, it will use a "volatile" access for the proper normal types,
but even then that won't protect against data tearing on 64-bit writes on
a 32-bit machine, for example. It doesn't even give ordering guarantees
for the sub-words.

So you still end up with almost the same basic rule: a natural byte/word
write will be atomic. But really, it's going to be so even without the
WRITE_ONCE(), so...

It does ostensibly protect against the compiler re-ordering the write
against other writes (note: *compiler*, not CPU), and it does make sure
the write only happens once. But it's really hard to see a valid compiler
optimization that wouldn't do that anyway.

Because of the compiler ordering of WRITE_ONCE against other WRITE_ONCE
cases, it could be used for irq-safe ordering on the local cpu, for
example. If that matters. Although I suspect any such case is practically
going to use per-cpu variables anyway.

End result: it's *mostly* about documentation.

Just do a grep for WRITE_ONCE() vs READ_ONCE(). You'll find a lot more
users of the latter. And quite frankly, I looked at some of the
WRITE_ONCE() users and a lot of them were kind of pointless.

Somebody tell me just exactly how they expect the WRITE_ONCE() cases in
net/tls/tls_sw.c to matter, for example (and that was literally a random
case I happened to look at). It's not clear what they do, since they
certainly don't imply any memory ordering. They do imply a certain amount
of compiler re-ordering due to the volatile, but that's pretty limited
too. Only wrt _other_ things with side effects, not the normal code
around them anyway.

In contrast, READ_ONCE() has very practical examples of mattering,
because unlike writes, compilers really can reasonably split reads. For
example, if you're testing multiple bits in the same word, and you want
to make sure they are coherent, you should very much do

        val = READ_ONCE(mem);
        .. test different bits in val ..

because without the READ_ONCE(), the compiler could end up just doing the
read multiple times.

Similarly, even if you only use a value once, this is actually something
a compiler can do:

        if (a) {
                lock();
                B();
                unlock();
        } else
                B();

and a compiler might end up generating that as

        if (a)
                lock();
        B();
        if (a)
                unlock();

instead. So doing "if (READ_ONCE(a))" actually makes a difference - it
guarantees that 'a' is only read once, even if that value was _literally_
only used once at a source level - the compiler really could have decided
"let's read it twice".

See how duplicating a read is fundamentally different from duplicating a
write? Duplicating or splitting a read is not visible to other threads.
Notice how nothing like the above is reasonable for a write.

That said, the most common case really is none of the above half-way
subtle cases. It's literally the whole "write pointer once". Making
existing code that does that use WRITE_ONCE() is completely pointless.
It's not going to really change or fix anything at all.

Linus
On Fri, Aug 16, 2019 at 9:52 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote: > > > I'd love to have a flag that says "all undefined behavior is treated > > as implementation-defined". There's a somewhat subtle - but very > > important - difference there. > > It would also be nice to downgrade some of the undefined behavior in > the standard itself. Compiler writers usually hate this because it > would require them to document what their compiler does in each case. > So they would prefer "unspecified" if the could not have "undefined". That certainly would sound very good to me. It's not the "documented behavior" that is important to me (since it is still going to be potentially different across different platforms), it's the "at least it's *some* consistent behavior for that platform". That's just _so_ much better than "undefined behavior" which basically says the compiler writer can do absolutely anything, whether it makes sense or not, and whether it might open a small bug up to absolutely catastrophic end results. > > if (a) > > global_var = 1 > > else > > global_var = 0 > > Me, I would still want to use WRITE_ONCE() in this case. I actually suspect that if we had this kind of code, and a real reason why readers would see it locklessly, then yes, WRITE_ONCE() makes sense. But most of the cases where people add WRITE_ONCE() aren't even the above kind of half-questionable cases. They are just literally WRITE_ONCE(flag, value); and since there is no real memory ordering implied by this, what's the actual value of that statement? What problem does it really solve wrt just writing it as flag = value; which is generally a lot easier to read. If the flag has semantic behavior wrt other things, and the write can race with a read (whether it's the read or the write that is unlocked is irrelevant), is still doesn't tend to make a lot of real difference. 
In the end, the most common reason for a WRITE_ONCE() is mostly just "to visually pair up with the non-synchronized read that uses READ_ONCE()".

Put another way: a WRITE_ONCE() without a paired READ_ONCE() is almost certainly pointless. But the reverse is not really true. All a READ_ONCE() says is "I want either the old or the new value", and it can get that _without_ being paired with a WRITE_ONCE().

See? They just aren't equally important.

> > And yes, reads are different from writes. Reads don't have the same
> > kind of "other threads of execution can see them" effects, so a
> > compiler turning a single read into multiple reads is much more
> > realistic and not the same kind of "we need to expect a certain kind
> > of sanity from the compiler" issue.
>
> Though each of those compiler-generated multiple reads might return a
> different value, right?

Right. See the examples I have in the email to Mathieu:

    unsigned int bits = some_global_value;
    ... test different bits in 'bits' ...

can easily cause multiple reads (particularly on a CPU that has a "test bits in memory" instruction and a lack of registers).

So then doing it as

    unsigned int bits = READ_ONCE(some_global_value);
    ... test different bits in 'bits' ...

makes a real and obvious semantic difference. In ways that changing a one-time

    ptr->flag = true;

to

    WRITE_ONCE(ptr->flag, true);

does _not_ make any obvious semantic difference whatsoever.

Caching reads is also something that makes sense and is common, in ways that caching writes does not. So doing

    while (in_progress_global) /* twiddle your thumbs */;

obviously trivially translates to an infinite loop with a single conditional in front of it, in ways that

    while (READ_ONCE(in_progress_global)) /* twiddle */;

does not.

So there are often _obvious_ reasons to use READ_ONCE(). I really do not find the same to be true of WRITE_ONCE(). There are valid uses, but they are definitely much less common, and much less obvious.

Linus
On Sat, Aug 17, 2019 at 1:28 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> unsigned int bits = some_global_value;
> ...test different bits in 'bits' ...
>
> can easily cause multiple reads (particularly on a CPU that has a
> "test bits in memory" instruction and a lack of registers).
>
> So then doing it as
>
> unsigned int bits = READ_ONCE(some_global_value);
> .. test different bits in 'bits'...
Side note: this is likely the best example of actual WRITE_ONCE() use
too: if you have that global value with multiple bits that actually
have some interdependencies, then doing
some_global_value = some_complex_expression();
might be reasonably compiled to do several rmw instructions to update
'some_global_value'
So then
WRITE_ONCE(some_global_value, some_complex_expression());
really can be a good thing - it clearly just writes things once, and
it also documents the whole "write one or the other" value, not some
mid-way one, when you then look at the READ_ONCE() thing.
But I'm seeing a lot of WRITE_ONCE(x, constantvalue) kind of things
and don't seem to find a lot of reason to think that they are any
inherently better than "x = constantvalue".
(In contrast, using "smp_store_release(&flag, true)" has inherent
value, because it actually implies a memory barrier wrt previous
writes, in ways that WRITE_ONCE() or a direct assignment does not.)
Ok, enough blathering. I think I've made my point.
Linus
----- On Aug 16, 2019, at 3:15 PM, rostedt rostedt@goodmis.org wrote: > On Fri, 16 Aug 2019 13:19:20 -0400 (EDT) > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> ----- On Aug 16, 2019, at 12:25 PM, rostedt rostedt@goodmis.org wrote: >> >> > On Fri, 16 Aug 2019 10:26:43 -0400 Mathieu Desnoyers >> > <mathieu.desnoyers@efficios.com> wrote: >> > >> [...] >> >> >> >> Also, write and read to/from those variables should be done with >> >> WRITE_ONCE() and READ_ONCE(), given that those are read within tracing >> >> probes without holding the sched_register_mutex. >> >> >> > >> > I understand the READ_ONCE() but is the WRITE_ONCE() truly necessary? >> > It's done while holding the mutex. It's not that critical of a path, >> > and makes the code look ugly. >> >> The update is done while holding the mutex, but the read-side does not >> hold that mutex, so it can observe the intermediate state caused by >> store-tearing or invented stores which can be generated by the compiler >> on the update-side. >> >> Please refer to the following LWN article: >> >> https://lwn.net/Articles/793253/ >> >> Sections: >> - "Store tearing" >> - "Invented stores" >> >> Arguably, based on that article, store tearing is only observed in the >> wild for constants (which is not the case here), and invented stores >> seem to require specific code patterns. But I wonder why we would ever want to >> pair a fragile non-volatile store with a READ_ONCE() ? Considering the pain >> associated to reproduce and hunt down this kind of issue in the wild, I would >> be tempted to enforce that any READ_ONCE() operating on a variable would either >> need to be paired with WRITE_ONCE() or with atomic operations, so those can >> eventually be validated by static code checkers and code sanitizers. > > My issue is that this is just a case to decide if we should cache a > comm or not. It's a helper, nothing more. There's no guarantee that > something will get cached. 
I get your point wrt WRITE_ONCE(): since it's a cache it should not have user-visible effects if a temporary incorrect value is observed. Well in reality, it's not a cache: if the lookup fails, it returns "<...>" instead, so cache lookup failure ends up not providing any useful data in the trace. Let's assume this is a known and documented tracer limitation. However, wrt READ_ONCE(), things are different. The variable read ends up being used to control various branches in the code, and the compiler could decide to re-fetch the variable (with a different state), and therefore cause _some_ of the branches to be inconsistent. See tracing_record_taskinfo_sched_switch() and tracing_record_taskinfo() @flags parameter. AFAIU the current code should not generate any out-of-bound writes in case of re-fetch, but no comment in there documents how fragile this is. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
----- On Aug 16, 2019, at 10:13 PM, rostedt rostedt@goodmis.org wrote: > On Fri, 16 Aug 2019 21:36:49 -0400 (EDT) > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> ----- On Aug 16, 2019, at 5:04 PM, Linus Torvalds torvalds@linux-foundation.org >> wrote: >> >> > On Fri, Aug 16, 2019 at 1:49 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> >> >> Can we finally put a foot down and tell compiler and standard committee >> >> people to stop this insanity? >> > >> > It's already effectively done. >> > >> > Yes, values can be read from memory multiple times if they need >> > reloading. So "READ_ONCE()" when the value can change is a damn good >> > idea. >> > >> > But it should only be used if the value *can* change. Inside a locked >> > region it is actively pointless and misleading. >> > >> > Similarly, WRITE_ONCE() should only be used if you have a _reason_ for >> > using it (notably if you're not holding a lock). >> > >> > If people use READ_ONCE/WRITE_ONCE when there are locks that prevent >> > the values from changing, they are only making the code illegible. >> > Don't do it. >> >> I agree with your argument in the case where both read-side and write-side >> are protected by the same lock: READ/WRITE_ONCE are useless then. However, >> in the scenario we have here, only the write-side is protected by the lock >> against concurrent updates, but reads don't take any lock. > > And because reads are not protected by any lock or memory barrier, > using READ_ONCE() is pointless. The CPU could be doing a lot of hidden > manipulation of that variable too. Please enlighten me by providing some details on what the CPU could do to this word-aligned, word-sized variable in the absence of lock and barrier that is relevant to this specific use-case ? I suspect most of the barriers you refer to here are taken care of by the tracepoint code which uses RCU to synchronize probe registration wrt probe callback execution. 
> > Again, this is just to enable caching of the comm. Nothing more. It's a > "best effort" algorithm. We get it, we then can map a pid to a name. If > not, we just have a pid and we map "<...>". > > Next you'll be asking for the memory barriers to guarantee a real hit. > And honestly, this information is not worth any overhead. No, that's not my intent to add overhead to guarantee trace data availability near trace beginning and end. However, considering that READ_ONCE() and WRITE_ONCE() can provide additional data availability guarantees in the middle of traces at no runtime cost, it seems like a good trade off. It's easier for an analysis to disregard missing information at the beginning and end of trace without generating false-positive reports than when it happens spuriously in the middle of traces. > > And most often we enable this before we enable the tracepoint we want > this information from, which requires modification of the text area and > will do a bunch of syncs across CPUs. That alone will most likely keep > any race from happening. Indeed the tracepoint use of RCU to synchronize registration vs probes should take care of those barriers. > > The only real bug is the check to know if we should add the probe or > not was done outside the lock, and when we hit the race, we could add a > probe twice, causing the kernel to spit out a warning. You fixed that, > and that's all that needs to be done. I just sent that fix in a separate patch. > I'm now even more against adding the READ_ONCE() or WRITE_ONCE(). I'm not convinced by your arguments. Thanks, Mathieu > > > -- Steve > > > >> >> If WRITE_ONCE has any use at all (protecting against store tearing and >> invented stores), it should be used even with a lock held in this >> scenario, because the lock does not prevent READ_ONCE() from observing >> transient states caused by lack of WRITE_ONCE() for the update. >> >> So why does WRITE_ONCE exist in the first place ? 
Is it for documentation >> purposes only or are there actual situations where omitting it can cause >> bugs with real-life compilers ? >> >> In terms of code change, should we favor only introducing WRITE_ONCE >> in new code, or should existing code matching those conditions be >> moved to WRITE_ONCE as bug fixes ? >> -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
----- On Aug 17, 2019, at 4:44 AM, Linus Torvalds torvalds@linux-foundation.org wrote: > > But I'm seeing a lot of WRITE_ONCE(x, constantvalue) kind of things > and don't seem to find a lot of reason to think that they are any > inherently better than "x = constantvalue". If the only states that "x" can take is 1 or 0, then indeed there seems to be no point in using a WRITE_ONCE() when paired with a READ_ONCE() other than for documentation purposes. However, if the state of "x" can be any pointer value, or a reference count value, then not using "WRITE_ONCE()" to store a constant leaves the compiler free to perform that store in more than one memory access. Based on [1], section "Store tearing", there are situations where this happens on x86 in the wild today when storing 64-bit constants: the compiler is then free to decide to use two 32-bit immediate store instructions. Thanks, Mathieu [1] https://lwn.net/Articles/793253/ -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
>
> I'm not convinced by your arguments.
Prove to me that there's an issue here beyond theoretical analysis,
then I'll consider that patch.
Show me a compiler used to compile the kernel that zeros out the
increment. Show me where the race actually occurs.
I think the READ/WRITE_ONCE() is more confusing than helpful. And
unneeded churn to the code. And really not needed for something that's
not critical to execution.
-- Steve
On Sat, 17 Aug 2019 10:27:39 -0400 (EDT) Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > I get your point wrt WRITE_ONCE(): since it's a cache it should not have > user-visible effects if a temporary incorrect value is observed. Well in > reality, it's not a cache: if the lookup fails, it returns "<...>" instead, > so cache lookup failure ends up not providing any useful data in the trace. > Let's assume this is a known and documented tracer limitation. Note, this is done at every sched switch, for both next and prev tasks. And the update is only done at the enabling of a tracepoint (very rare occurrence) If it missed it scheduling in, it has a really good chance of getting it while scheduling out. And 99.999% of my tracing that I do, the tasks scheduling in when enabling a tracepoint is not what I even care about, as I enable tracing then start what I want to trace. > > However, wrt READ_ONCE(), things are different. The variable read ends up > being used to control various branches in the code, and the compiler could > decide to re-fetch the variable (with a different state), and therefore > cause _some_ of the branches to be inconsistent. See > tracing_record_taskinfo_sched_switch() and tracing_record_taskinfo() @flags > parameter. I'm more OK with using a READ_ONCE() on the flags so it is consistent. But the WRITE_ONCE() is going a bit overboard. > > AFAIU the current code should not generate any out-of-bound writes in case of > re-fetch, but no comment in there documents how fragile this is. Which part of the code are you talking about here? -- Steve
----- On Aug 17, 2019, at 11:42 AM, rostedt rostedt@goodmis.org wrote: > On Sat, 17 Aug 2019 10:27:39 -0400 (EDT) > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> I get your point wrt WRITE_ONCE(): since it's a cache it should not have >> user-visible effects if a temporary incorrect value is observed. Well in >> reality, it's not a cache: if the lookup fails, it returns "<...>" instead, >> so cache lookup failure ends up not providing any useful data in the trace. >> Let's assume this is a known and documented tracer limitation. > > Note, this is done at every sched switch, for both next and prev tasks. > And the update is only done at the enabling of a tracepoint (very rare > occurrence) If it missed it scheduling in, it has a really good chance > of getting it while scheduling out. > > And 99.999% of my tracing that I do, the tasks scheduling in when > enabling a tracepoint is not what I even care about, as I enable > tracing then start what I want to trace. Since it's refcount based, my concern is about the side-effect of incrementing or decrementing that reference count without WRITE_ONCE which would lead to a transient corrupted value observed by _another_ active tracing user. For you use-case, it would lead to a missing comm when you are actively tracing what you want to trace, caused by another user of that refcount incrementing or decrementing it. I agree with you that missing tracing data at the beginning or end of a trace is not important. >> >> However, wrt READ_ONCE(), things are different. The variable read ends up >> being used to control various branches in the code, and the compiler could >> decide to re-fetch the variable (with a different state), and therefore >> cause _some_ of the branches to be inconsistent. See >> tracing_record_taskinfo_sched_switch() and tracing_record_taskinfo() @flags >> parameter. > > I'm more OK with using a READ_ONCE() on the flags so it is consistent. > But the WRITE_ONCE() is going a bit overboard. 
Hence my request for additional guidance on the usefulness of WRITE_ONCE(), whether it's mainly there for documentation purposes, or if we should consider that it takes care of real-life problems introduced by compiler optimizations in the wild. The LWN article seems to imply that it's not just a theoretical issue, but I'll have to let the article authors justify their conclusions, because I have limited time to investigate this myself. > >> >> AFAIU the current code should not generate any out-of-bound writes in case of >> re-fetch, but no comment in there documents how fragile this is. > > Which part of the code are you talking about here? kernel/trace/trace.c:tracing_record_taskinfo_sched_switch() kernel/trace/trace.c:tracing_record_taskinfo() where @flags is used to control a few branches. I don't think any of those would end up causing corruption if the flags is re-fetched between two branches, but it seems rather fragile. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
----- On Aug 17, 2019, at 11:26 AM, rostedt rostedt@goodmis.org wrote:

> On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
>>
>> I'm not convinced by your arguments.
>
> Prove to me that there's an issue here beyond theoretical analysis,
> then I'll consider that patch.
>
> Show me a compiler used to compile the kernel that zeros out the
> increment. Show me where the race actually occurs.
>
> I think the READ/WRITE_ONCE() is more confusing than helpful. And
> unneeded churn to the code. And really not needed for something that's
> not critical to execution.

I'll have to let the authors of the LWN article speak up on this, because I have limited time to replicate this investigation myself.

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Sat, 17 Aug 2019 11:55:17 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> ----- On Aug 17, 2019, at 11:26 AM, rostedt rostedt@goodmis.org wrote:
>
> > On Sat, 17 Aug 2019 10:40:31 -0400 (EDT)
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> >> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE().
> >>
> >> I'm not convinced by your arguments.
> >
> > Prove to me that there's an issue here beyond theoretical analysis,
> > then I'll consider that patch.
> >
> > Show me a compiler used to compile the kernel that zeros out the
> > increment. Show me where the race actually occurs.
> >
> > I think the READ/WRITE_ONCE() is more confusing than helpful. And
> > unneeded churn to the code. And really not needed for something that's
> > not critical to execution.
>
> I'll have to let the authors of the LWN article speak up on this, because
> I have limited time to replicate this investigation myself.
I'll let Paul McKenney convince me then, if he has any spare cycles ;-)
The one instance in that article is from a 2013 bug, which talks about
storing a 64 bit value on a 32 bit machine. But the ref count is an int
(32 bit), and I highly doubt any compiler will split it into 16 bit
stores for a simple increment. And I don't believe Linux even supports
any architecture that requires 16 bit stores anymore.
-- Steve
On Sat, 17 Aug 2019 11:53:41 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> kernel/trace/trace.c:tracing_record_taskinfo_sched_switch()
> kernel/trace/trace.c:tracing_record_taskinfo()
>
> where @flags is used to control a few branches. I don't think any of those
> would end up causing corruption if the flags is re-fetched between two
> branches, but it seems rather fragile.
There shouldn't be any corruption, as each flag test is not dependent
on any of the other tests. But I do agree, a READ_ONCE() would be
appropriate for just making a consistent state among the tests, even
though they are independent.
-- Steve
Apologies to Steve for continuing this thread when all he wanted was moving an operation inside a mutex...

On 17/08/2019 16:02, Mathieu Desnoyers wrote:
[...]
> However, if the state of "x" can be any pointer value, or a reference
> count value, then not using "WRITE_ONCE()" to store a constant leaves
> the compiler free to perform that store in more than one memory access.
> Based on [1], section "Store tearing", there are situations where this
> happens on x86 in the wild today when storing 64-bit constants: the
> compiler is then free to decide to use two 32-bit immediate store
> instructions.

That's also how I understand things, and it's also one of the points raised in the compiler barrier section of memory-barriers.txt.

Take this store tearing, or the invented stores - e.g. the branch optimization pointed out by Linus:

> if (a)
>   global_var = 1
> else
>   global_var = 0
>
> then the compiler had better not turn that into
>
> global_var = 0
> if (a)
>   global_var = 1

AFAICT nothing prevents this from happening inside a critical section (where the locking primitives provide the right barriers, but that's it). That's all fine when data is never accessed locklessly, but in the case of locked writes vs lockless reads, couldn't there be "leaks" of these transient states? In those cases we would want WRITE_ONCE() for the writes.

So going back to:

> But the reverse is not really true. All a READ_ONCE() says is "I want
> either the old or the new value", and it can get that _without_ being
> paired with a WRITE_ONCE().

AFAIU that's not always the case, since a lone READ_ONCE() could get transient values.

I'll be honest: it's not 100% clear to me when those optimizations can actually be done (maybe the branch thingy, but the others are dubious), and it's even less clear when compilers *actually* do it - only that they have been reported to do it (so it's not made up).

> Thanks,
>
> Mathieu
>
> [1] https://lwn.net/Articles/793253/
On Sat, Aug 17, 2019 at 12:40:40PM -0400, Steven Rostedt wrote: > On Sat, 17 Aug 2019 11:55:17 -0400 (EDT) > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > ----- On Aug 17, 2019, at 11:26 AM, rostedt rostedt@goodmis.org wrote: > > > > > On Sat, 17 Aug 2019 10:40:31 -0400 (EDT) > > > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > > > >> > I'm now even more against adding the READ_ONCE() or WRITE_ONCE(). > > >> > > >> I'm not convinced by your arguments. > > > > > > Prove to me that there's an issue here beyond theoretical analysis, > > > then I'll consider that patch. > > > > > > Show me a compiler used to compile the kernel that zeros out the > > > increment. Show me were the race actually occurs. > > > > > > I think the READ/WRITE_ONCE() is more confusing than helpful. And > > > unneeded churn to the code. And really not needed for something that's > > > not critical to execution. > > > > I'll have to let the authors of the LWN article speak up on this, because > > I have limited time to replicate this investigation myself. > > I'll let Paul McKenney convince me then, if he has any spare cycles ;-) You guys do manage to time these things sometimes. ;-) > The one instance in that article is from a 2013 bug, which talks about > storing a 64 bit value on a 32 bit machine. But the ref count is an int > (32 bit), and I highly doubt any compiler will split it into 16 bit > stores for a simple increment. And I don't believe Linux even supports > any architecture that requires 16 bit stores anymore. For a machine-sized and aligned increment, it is indeed hard to imagine, even for me. I would be more worried about stores of constants with lots of zero bits between non-zero bits on systems with small-sized store-immediate instructions. Thanx, Paul
On Sat, Aug 17, 2019 at 01:28:48AM -0700, Linus Torvalds wrote: [ . . . ] > Put another way: a WRITE_ONCE() without a paired READ_ONCE() is almost > certainly pointless. "Your honor, I have no further questions at this time, but I reserve the right to recall this witness." Outside of things like MMIO (where one could argue that the corresponding READ_ONCE() is in the device firmware), the use cases I can imagine for WRITE_ONCE() with no READ_ONCE() are quite strange. For example, doing the WRITE_ONCE()s while read-holding a given lock and doing plain reads while write-holding that same lock. While at the same time being worried about store tearing and similar. Perhaps I am suffering a failure of imagination, but I am not seeing a reasonable use for such things at the moment. > But the reverse is not really true. All a READ_ONCE() says is "I want > either the old or the new value", and it can get that _without_ being > paired with a WRITE_ONCE(). > > See? They just aren't equally important. > > > > And yes, reads are different from writes. Reads don't have the same > > > kind of "other threads of execution can see them" effects, so a > > > compiler turning a single read into multiple reads is much more > > > realistic and not the same kind of "we need to expect a certain kind > > > of sanity from the compiler" issue. > > > > Though each of those compiler-generated multiple reads might return a > > different value, right? > > Right. See the examples I have in the email to Mathieu: > > unsigned int bits = some_global_value; > ...test different bits in in 'bits' ... > > can easily cause multiple reads (particularly on a CPU that has a > "test bits in memory" instruction and a lack of registers. > > So then doing it as > > unsigned int bits = READ_ONCE(some_global_value); > .. test different bits in 'bits'... > > makes a real and obvious semantic difference. 
In ways that changing a one-time > > ptr->flag = true; > > to > > WRITE_ONCE(ptr->flag, true); > > does _not_ make any obvious semantic difference what-so-ever. Agreed, especially given that only one bit of ->flag is most likely ever changing. > Caching reads is also something that makes sense and is common, in > ways that caching writes does not. So doing > > while (in_progress_global) /* twiddle your thumbs */; > > obviously trivially translates to an infinite loop with a single > conditional in front of it, in ways that > > while (READ_ONCE(in_progress_global)) /* twiddle */; > > does not. > > So there are often _obvious_ reasons to use READ_ONCE(). > > I really do not find the same to be true of WRITE_ONCE(). There are > valid uses, but they are definitely much less common, and much less > obvious. Agreed, and I expect READ_ONCE() to continue to be used more heavily than is WRITE_ONCE(), even including the documentation-only WRITE_ONCE() usage. Thanx, Paul
On Sat, Aug 17, 2019 at 09:03:30PM +0100, Valentin Schneider wrote: > Apologies to Steve for continuing this thread when all he wanted was moving > an operation inside a mutex... > > On 17/08/2019 16:02, Mathieu Desnoyers wrote: > [...] > > However, if the state of "x" can be any pointer value, or a reference > > count value, then not using "WRITE_ONCE()" to store a constant leaves > > the compiler free to perform that store in more than one memory access. > > Based on [1], section "Store tearing", there are situations where this > > happens on x86 in the wild today when storing 64-bit constants: the > > compiler is then free to decide to use two 32-bit immediate store > > instructions. > > > > That's also how I understand things, and it's also one of the points raised > in the compiler barrier section of memory-barriers.txt > > Taking this store tearing, or the invented stores - e.g. the branch > optimization pointed out by Linus: > > > if (a) > > global_var = 1 > > else > > global_var = 0 > > > > then the compiler had better not turn that into > > > > global_var = 0 > > if (a) > > global_var = 1 > > AFAICT nothing prevents this from happening inside a critical section (where > the locking primitives provide the right barriers, but that's it). That's > all fine when data is never accessed locklessly, but in the case of locked > writes vs lockless reads, couldn't there be "leaks" of these transient > states? In those cases we would want WRITE_ONCE() for the writes. > > So going back to: > > > But the reverse is not really true. All a READ_ONCE() says is "I want > > either the old or the new value", and it can get that _without_ being > > paired with a WRITE_ONCE(). > > AFAIU it's not always the case, since a lone READ_ONCE() could get transient > values. Linus noted that he believes that compilers for architectures supporting Linux can be trusted to avoid store-to-load transformations, invented stores, and unnecessary store tearing. 
Should these appear, Linus would report a bug against the compiler and expect it to be fixed.

> I'll be honest, it's not 100% clear to me when those optimizations can
> actually be done (maybe the branch thingy but the others are dubious), and
> it's even less clear when compilers *actually* do it - only that they have
> been reported to do it (so it's not made up).

There is significant unclarity inherent in the situation. The standard says one thing, different compilers do other things, and developers often expect yet a third thing. And sometimes things change over time, for example, the ca. 2011 dictum against compilers inventing data races.

Hey, they didn't teach me this aspect of software development in school, either. ;-)

Thanx, Paul
Hi!

> The most I'll take is two separate patches. One is going to be marked
> for stable as it fixes a real bug. The other is more for cosmetic or
> theoretical issues, that I will state clearly "NOT FOR STABLE", such
> that the autosel doesn't take them.

Do we have a standardized way to mark "this is not for stable"? Because I often think "I'd really hate to see this in stable"...

On a related note, it would be nice to have a standardized way of marking the corresponding upstream commit. (Currently three methods are in use.) "Upstream: XXXX" in the signoff area would be nice, clearly marking who touched the patch before mainline and who after.

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 18/08/2019 00:00, Paul E. McKenney wrote:
[...]
> Linus noted that he believes that compilers for architectures supporting
> Linux can be trusted to avoid store-to-load transformations, invented
> stores, and unnecessary store tearing. Should these appear, Linus would
> report a bug against the compiler and expect it to be fixed.
>
>> I'll be honest, it's not 100% clear to me when those optimizations can
>> actually be done (maybe the branch thingy but the others are dubious), and
>> it's even less clear when compilers *actually* do it - only that they have
>> been reported to do it (so it's not made up).
>
> There is significant unclarity inherent in the situation. The standard
> says one thing, different compilers do other things, and developers
> often expect yet a third thing. And sometimes things change over time,
> for example, the ca. 2011 dictum against compilers inventing data races.
>
> Hey, they didn't teach me this aspect of software development in school,
> either. ;-)

Gotta keep things "interesting" somehow, eh... Thanks for the clarifications.
On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
> The data tearing issue is almost a non-issue. We're not going to add
> WRITE_ONCE() to these kinds of places for no good reason.
Paulmck actually has an example of that somewhere; ISTR that particular
case actually got fixed by GCC, but I'd really _love_ for some compiler
people (both GCC and LLVM) to state that their respective compilers will
not do load/store tearing for machine word sized load/stores.
Without this written guarantee (which supposedly was in older GCC
manuals but has since gone missing), I'm loath to rely on it.
Yes, it is very rare, but it is a massive royal pain to debug if/when it
does happen.
On Fri, Aug 16, 2019 at 09:52:17PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 16, 2019 at 03:57:43PM -0700, Linus Torvalds wrote:
>
> [ . . . ]
>
> > We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> > because of some theoretical "compiler is free to do garbage"
> > arguments. If such garbage happens, we need to fix the compiler, the
> > same way we already do with
> >
> > -fno-strict-aliasing
>
> Yeah, the compete-with-FORTRAN stuff. :-/
>
> There is some work going on in the C committee on this, where the
> theorists would like to restrict strict-alias based optimizations to
> enable better analysis tooling. And no, although the theorists are
> pushing in the direction we would like them to, as far as I can see
> they are not pushing as far as we would like. But it might be that
> -fno-strict-aliasing needs some upgrades as well. I expect to learn
> more at the next meeting in a few months.
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
We really should get the compiler folks to give us a
-fno-pointer-provenance. Waiting on the standards committee to get their
act together seems unlikely, esp. given that some people actually seem
to _want_ this nonsense :/
On Tue, Aug 20, 2019 at 03:56:12PM +0200, Peter Zijlstra wrote:
> On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
>
> > The data tearing issue is almost a non-issue. We're not going to add
> > WRITE_ONCE() to these kinds of places for no good reason.
>
> Paulmck actually has an example of that somewhere; ISTR that particular
> case actually got fixed by GCC, but I'd really _love_ for some compiler
> people (both GCC and LLVM) to state that their respective compilers will
> not do load/store tearing for machine word sized load/stores.

I do very much recall such an example, but I am now unable to either
find it or reproduce it.  :-/

If I cannot turn it up in a few days, I will ask the LWN editors to
make appropriate changes to the "Who is afraid" article.

> Without this written guarantee (which supposedly was in older GCC
> manuals but has since gone missing), I'm loath to rely on it.
>
> Yes, it is very rare, but it is a massive royal pain to debug if/when it
> does happen.

But from what I can see, Linus is OK with use of WRITE_ONCE() for data
races on any variable for which there is at least one READ_ONCE().
So we can still use WRITE_ONCE() as we would like in our own code.
Yes, you or I might be hit by someone else's omission of WRITE_ONCE(),
but it is better than the proverbial kick in the teeth.

Of course, if anyone knows of a compiler/architecture combination that
really does tear stores of 32-bit constants, please do not keep it
a secret!  After all, it would be good to get that addressed easily
starting now rather than after a difficult and painful series of
debugging sessions.

							Thanx, Paul
On Tue, Aug 20, 2019 at 04:01:16PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 16, 2019 at 09:52:17PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 16, 2019 at 03:57:43PM -0700, Linus Torvalds wrote:
> >
> > [ . . . ]
> >
> > > We add READ_ONCE and WRITE_ONCE annotations when they make sense. Not
> > > because of some theoretical "compiler is free to do garbage"
> > > arguments. If such garbage happens, we need to fix the compiler, the
> > > same way we already do with
> > >
> > > -fno-strict-aliasing
> >
> > Yeah, the compete-with-FORTRAN stuff. :-/
> >
> > There is some work going on in the C committee on this, where the
> > theorists would like to restrict strict-alias based optimizations to
> > enable better analysis tooling. And no, although the theorists are
> > pushing in the direction we would like them to, as far as I can see
> > they are not pushing as far as we would like. But it might be that
> > -fno-strict-aliasing needs some upgrades as well. I expect to learn
> > more at the next meeting in a few months.
> >
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
>
> We really should get the compiler folks to give us a
> -fno-pointer-provenance. Waiting on the standards committee to get their
> act together seems unlikely, esp. given that some people actually seem
> to _want_ this nonsense :/
The reason that they want it is to enable some significant optimizations
in numerical code on the one hand and in heavily templated C++ code on
the other. Neither of which has much bearing on kernel code.
Interested in coming to the next C standards committee meeting in October
to help me push for this? ;-)
Thanx, Paul
On Tue, Aug 20, 2019 at 01:31:35PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 20, 2019 at 04:01:16PM +0200, Peter Zijlstra wrote:
>
> > We really should get the compiler folks to give us a
> > -fno-pointer-provenance. Waiting on the standards committee to get their
> > act together seems unlikely, esp. given that some people actually seem
> > to _want_ this nonsense :/
>
> The reason that they want it is to enable some significant optimizations
> in numerical code on the one hand and in heavily templated C++ code on
> the other.  Neither of which has much bearing on kernel code.
>
> Interested in coming to the next C standards committee meeting in October
> to help me push for this?  ;-)

How about we try and get some compiler folks together at plumbers and
bribe them with beer?  Once we have our compiler knob, we happy :-)
On Tue, Aug 20, 2019 at 10:39:39PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 20, 2019 at 01:31:35PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 20, 2019 at 04:01:16PM +0200, Peter Zijlstra wrote:
> >
> > > We really should get the compiler folks to give us a
> > > -fno-pointer-provenance. Waiting on the standards committee to get their
> > > act together seems unlikely, esp. given that some people actually seem
> > > to _want_ this nonsense :/
> >
> > The reason that they want it is to enable some significant optimizations
> > in numerical code on the one hand and in heavily templated C++ code on
> > the other.  Neither of which has much bearing on kernel code.
> >
> > Interested in coming to the next C standards committee meeting in October
> > to help me push for this?  ;-)
>
> How about we try and get some compiler folks together at plumbers and
> bribe them with beer?  Once we have our compiler knob, we happy :-)

C'mon, Peter!  Where is your sense of self-destruction???  ;-)

But yes, if nothing else there is a Toolchains MC [1].  Which happens
to have a topic entitled "Potential impact/benefit/detriment of recently
developed GCC optimizations on the kernel", now that you mention it.

Looking forward to seeing you in Lisbon!

							Thanx, Paul

[1] https://linuxplumbersconf.org/event/4/sessions/45/#20190909
On Tue, Aug 20, 2019 at 01:29:32PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 20, 2019 at 03:56:12PM +0200, Peter Zijlstra wrote:
> > On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
> >
> > > The data tearing issue is almost a non-issue. We're not going to add
> > > WRITE_ONCE() to these kinds of places for no good reason.
> >
> > Paulmck actually has an example of that somewhere; ISTR that particular
> > case actually got fixed by GCC, but I'd really _love_ for some compiler
> > people (both GCC and LLVM) to state that their respective compilers will
> > not do load/store tearing for machine word sized load/stores.
>
> I do very much recall such an example, but I am now unable to either
> find it or reproduce it. :-/
>
> If I cannot turn it up in a few days, I will ask the LWN editors to
> make appropriate changes to the "Who is afraid" article.
>
> > Without this written guarantee (which supposedly was in older GCC
> > manuals but has since gone missing), I'm loath to rely on it.
> >
> > Yes, it is very rare, but it is a massive royal pain to debug if/when it
> > does happen.
>
> But from what I can see, Linus is OK with use of WRITE_ONCE() for data
> races on any variable for which there is at least one READ_ONCE().
> So we can still use WRITE_ONCE() as we would like in our own code.
> Yes, you or I might be hit by someone else's omission of WRITE_ONCE(),
> but it is better than the proverbial kick in the teeth.
>
> Of course, if anyone knows of a compiler/architecture combination that
> really does tear stores of 32-bit constants, please do not keep it
> a secret! After all, it would be good to get that addressed easily
> starting now rather than after a difficult and painful series of
> debugging sessions.
It's not quite what you asked for, but if you look at the following
silly code:
typedef unsigned long long u64;

struct data {
	u64 arr[1023];
	u64 flag;
};

void foo(struct data *x)
{
	int i;

	for (i = 0; i < 1023; ++i)
		x->arr[i] = 0;

	x->flag = 0;
}

void bar(u64 *x)
{
	*x = 0xabcdef10abcdef10;
}
Then arm64 clang (-O2) generates the following for foo:
foo:						// @foo
	stp	x29, x30, [sp, #-16]!		// 16-byte Folded Spill
	orr	w2, wzr, #0x2000
	mov	w1, wzr
	mov	x29, sp
	bl	memset
	ldp	x29, x30, [sp], #16		// 16-byte Folded Reload
	ret
and so the store to 'flag' has become part of the memset, which could
easily be bytewise in terms of atomicity (and this isn't unlikely given
we have a DC ZVA instruction which only guarantees bytewise atomicity).
GCC (also -O2) generates the following for bar:
bar:
	mov	w1, 61200
	movk	w1, 0xabcd, lsl 16
	stp	w1, w1, [x0]
	ret
and so it is using a store-pair instruction to reduce the complexity in
the immediate generation. Thus, the 64-bit store will only have 32-bit
atomicity. In fact, this is scary because if I change bar to:
void bar(u64 *x)
{
	*(volatile u64 *)x = 0xabcdef10abcdef10;
}
then I get:
bar:
	mov	w1, 61200
	movk	w1, 0xabcd, lsl 16
	str	w1, [x0]
	str	w1, [x0, 4]
	ret
so I'm not sure that WRITE_ONCE would even help :/
It's worth noting that:
void baz(atomic_long *x)
{
	atomic_store_explicit(x, 0xabcdef10abcdef10, memory_order_relaxed);
}
does the right thing:
baz:
	mov	x1, 61200
	movk	x1, 0xabcd, lsl 16
	movk	x1, 0xef10, lsl 32
	movk	x1, 0xabcd, lsl 48
	str	x1, [x0]
	ret
Whilst these examples may be contrived, I do think they illustrate that
we can't simply say "stores to aligned, word-sized pointers are atomic".
Will
On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> On Tue, Aug 20, 2019 at 01:29:32PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 20, 2019 at 03:56:12PM +0200, Peter Zijlstra wrote:
> > > On Sat, Aug 17, 2019 at 01:08:02AM -0700, Linus Torvalds wrote:
> > >
> > > > The data tearing issue is almost a non-issue. We're not going to add
> > > > WRITE_ONCE() to these kinds of places for no good reason.
> > >
> > > Paulmck actually has an example of that somewhere; ISTR that particular
> > > case actually got fixed by GCC, but I'd really _love_ for some compiler
> > > people (both GCC and LLVM) to state that their respective compilers will
> > > not do load/store tearing for machine word sized load/stores.
> >
> > I do very much recall such an example, but I am now unable to either
> > find it or reproduce it.  :-/
> >
> > If I cannot turn it up in a few days, I will ask the LWN editors to
> > make appropriate changes to the "Who is afraid" article.
> >
> > > Without this written guarantee (which supposedly was in older GCC
> > > manuals but has since gone missing), I'm loath to rely on it.
> > >
> > > Yes, it is very rare, but it is a massive royal pain to debug if/when it
> > > does happen.
> >
> > But from what I can see, Linus is OK with use of WRITE_ONCE() for data
> > races on any variable for which there is at least one READ_ONCE().
> > So we can still use WRITE_ONCE() as we would like in our own code.
> > Yes, you or I might be hit by someone else's omission of WRITE_ONCE(),
> > but it is better than the proverbial kick in the teeth.
> >
> > Of course, if anyone knows of a compiler/architecture combination that
> > really does tear stores of 32-bit constants, please do not keep it
> > a secret!  After all, it would be good to get that addressed easily
> > starting now rather than after a difficult and painful series of
> > debugging sessions.
>
> It's not quite what you asked for, but if you look at the following
> silly code:
>
> typedef unsigned long long u64;
>
> struct data {
> 	u64 arr[1023];
> 	u64 flag;
> };
>
> void foo(struct data *x)
> {
> 	int i;
>
> 	for (i = 0; i < 1023; ++i)
> 		x->arr[i] = 0;
>
> 	x->flag = 0;
> }
>
> void bar(u64 *x)
> {
> 	*x = 0xabcdef10abcdef10;
> }
>
> Then arm64 clang (-O2) generates the following for foo:
>
> foo:					// @foo
> 	stp	x29, x30, [sp, #-16]!	// 16-byte Folded Spill
> 	orr	w2, wzr, #0x2000
> 	mov	w1, wzr
> 	mov	x29, sp
> 	bl	memset
> 	ldp	x29, x30, [sp], #16	// 16-byte Folded Reload
> 	ret
>
> and so the store to 'flag' has become part of the memset, which could
> easily be bytewise in terms of atomicity (and this isn't unlikely given
> we have a DC ZVA instruction which only guarantees bytewise atomicity).
>
> GCC (also -O2) generates the following for bar:
>
> bar:
> 	mov	w1, 61200
> 	movk	w1, 0xabcd, lsl 16
> 	stp	w1, w1, [x0]
> 	ret
>
> and so it is using a store-pair instruction to reduce the complexity in
> the immediate generation. Thus, the 64-bit store will only have 32-bit
> atomicity. In fact, this is scary because if I change bar to:
>
> void bar(u64 *x)
> {
> 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> }
>
> then I get:
>
> bar:
> 	mov	w1, 61200
> 	movk	w1, 0xabcd, lsl 16
> 	str	w1, [x0]
> 	str	w1, [x0, 4]
> 	ret
>
> so I'm not sure that WRITE_ONCE would even help :/

Well, I can have the LWN article cite your email, then.  So thank you
very much!

Is generation of this code for a 64-bit volatile store considered a bug?
Or does ARMv8 exclude the possibility of 64-bit MMIO registers?  And I
would guess that Thomas and Linus would ask a similar bugginess question
for normal stores.  ;-)

> It's worth noting that:
>
> void baz(atomic_long *x)
> {
> 	atomic_store_explicit(x, 0xabcdef10abcdef10, memory_order_relaxed);
> }
>
> does the right thing:
>
> baz:
> 	mov	x1, 61200
> 	movk	x1, 0xabcd, lsl 16
> 	movk	x1, 0xef10, lsl 32
> 	movk	x1, 0xabcd, lsl 48
> 	str	x1, [x0]
> 	ret

OK, the C11 and C++11 guys should be happy with this.

> Whilst these examples may be contrived, I do think they illustrate that
> we can't simply say "stores to aligned, word-sized pointers are atomic".

And thank you again!

							Thanx, Paul
On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > void bar(u64 *x)
> > {
> > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> > }
> >
> > then I get:
> >
> > bar:
> > 	mov	w1, 61200
> > 	movk	w1, 0xabcd, lsl 16
> > 	str	w1, [x0]
> > 	str	w1, [x0, 4]
> > 	ret
> >
> > so I'm not sure that WRITE_ONCE would even help :/
>
> Well, I can have the LWN article cite your email, then.  So thank you
> very much!
>
> Is generation of this code for a 64-bit volatile store considered a bug?

I consider it a bug for the volatile case, and the one compiler person I've
spoken to also seems to reckon it's a bug, so hopefully it will be fixed.
I'm led to believe it's an optimisation in the AArch64 backend of GCC.

> Or does ARMv8 exclude the possibility of 64-bit MMIO registers?  And I
> would guess that Thomas and Linus would ask a similar bugginess question
> for normal stores.  ;-)

We use inline asm for MMIO, fwiw.

Will
On Wed, Aug 21, 2019 at 02:32:48PM +0100, Will Deacon wrote:
> On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > > void bar(u64 *x)
> > > {
> > > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> > > }
> > >
> > > then I get:
> > >
> > > bar:
> > > 	mov	w1, 61200
> > > 	movk	w1, 0xabcd, lsl 16
> > > 	str	w1, [x0]
> > > 	str	w1, [x0, 4]
> > > 	ret
> > >
> > > so I'm not sure that WRITE_ONCE would even help :/
> >
> > Well, I can have the LWN article cite your email, then.  So thank you
> > very much!
> >
> > Is generation of this code for a 64-bit volatile store considered a bug?
>
> I consider it a bug for the volatile case, and the one compiler person I've
> spoken to also seems to reckon it's a bug, so hopefully it will be fixed.
> I'm led to believe it's an optimisation in the AArch64 backend of GCC.

Here is hoping for the fix!

> > Or does ARMv8 exclude the possibility of 64-bit MMIO registers?  And I
> > would guess that Thomas and Linus would ask a similar bugginess question
> > for normal stores.  ;-)
>
> We use inline asm for MMIO, fwiw.

I should have remembered that, shouldn't I have?  ;-)

Is that also common practice across other embedded kernels these days?

							Thanx, Paul
On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > and so it is using a store-pair instruction to reduce the complexity in
> > the immediate generation. Thus, the 64-bit store will only have 32-bit
> > atomicity. In fact, this is scary because if I change bar to:
> >
> > void bar(u64 *x)
> > {
> > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> > }
> >
> > then I get:
> >
> > bar:
> > 	mov	w1, 61200
> > 	movk	w1, 0xabcd, lsl 16
> > 	str	w1, [x0]
> > 	str	w1, [x0, 4]
> > 	ret
> >
> > so I'm not sure that WRITE_ONCE would even help :/
>
> Well, I can have the LWN article cite your email, then.  So thank you
> very much!
>
> Is generation of this code for a 64-bit volatile store considered a bug?
> Or does ARMv8 exclude the possibility of 64-bit MMIO registers?  And I
> would guess that Thomas and Linus would ask a similar bugginess question
> for normal stores.  ;-)

I'm calling this a compiler bug; the way I understand volatile this is
very much against the intended use case.  That is, this is buggy even
on UP vs signals or MMIO.
----- On Aug 21, 2019, at 8:33 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
>
>> > and so it is using a store-pair instruction to reduce the complexity in
>> > the immediate generation. Thus, the 64-bit store will only have 32-bit
>> > atomicity. In fact, this is scary because if I change bar to:
>> >
>> > void bar(u64 *x)
>> > {
>> > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
>> > }
>> >
>> > then I get:
>> >
>> > bar:
>> > 	mov	w1, 61200
>> > 	movk	w1, 0xabcd, lsl 16
>> > 	str	w1, [x0]
>> > 	str	w1, [x0, 4]
>> > 	ret
>> >
>> > so I'm not sure that WRITE_ONCE would even help :/
>>
>> Well, I can have the LWN article cite your email, then.  So thank you
>> very much!
>>
>> Is generation of this code for a 64-bit volatile store considered a bug?
>> Or does ARMv8 exclude the possibility of 64-bit MMIO registers?  And I
>> would guess that Thomas and Linus would ask a similar bugginess question
>> for normal stores.  ;-)
>
> I'm calling this a compiler bug; the way I understand volatile this is
> very much against the intended use case.  That is, this is buggy even
> on UP vs signals or MMIO.

And here is a simpler reproducer on my gcc-8.3.0 (aarch64) compiled with -O2:

volatile unsigned long a;

void fct(void)
{
	a = 0x1234567812345678ULL;
}

void fct(void)
{
	a = 0x1234567812345678ULL;
   0:	90000000 	adrp	x0, 8 <fct+0x8>
   4:	528acf01 	mov	w1, #0x5678	// #22136
   8:	72a24681 	movk	w1, #0x1234, lsl #16
   c:	f9400000 	ldr	x0, [x0]
  10:	b9000001 	str	w1, [x0]
  14:	b9000401 	str	w1, [x0, #4]
}
  18:	d65f03c0 	ret

And the non-volatile case uses stp (is it a single store to memory?):

unsigned long a;

void fct(void)
{
	a = 0x1234567812345678ULL;
}

void fct(void)
{
	a = 0x1234567812345678ULL;
   0:	90000000 	adrp	x0, 8 <fct+0x8>
   4:	528acf01 	mov	w1, #0x5678	// #22136
   8:	72a24681 	movk	w1, #0x1234, lsl #16
   c:	f9400000 	ldr	x0, [x0]
  10:	29000401 	stp	w1, w1, [x0]
}
  14:	d65f03c0 	ret

It would probably be a good idea to audit other architectures, since this
is done by the compiler backend.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Aug 21, 2019 at 11:48:43AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 21, 2019, at 8:33 AM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> >> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> >
> >> > and so it is using a store-pair instruction to reduce the complexity in
> >> > the immediate generation. Thus, the 64-bit store will only have 32-bit
> >> > atomicity. In fact, this is scary because if I change bar to:
> >> >
> >> > void bar(u64 *x)
> >> > {
> >> > *(volatile u64 *)x = 0xabcdef10abcdef10;
> >> > }
> >> >
> >> > then I get:
> >> >
> >> > bar:
> >> > mov w1, 61200
> >> > movk w1, 0xabcd, lsl 16
> >> > str w1, [x0]
> >> > str w1, [x0, 4]
> >> > ret
> >> >
> >> > so I'm not sure that WRITE_ONCE would even help :/
> >>
> >> Well, I can have the LWN article cite your email, then. So thank you
> >> very much!
> >>
> >> Is generation of this code for a 64-bit volatile store considered a bug?
> >> Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> >> would guess that Thomas and Linus would ask a similar bugginess question
> >> for normal stores. ;-)
> >
> > I'm calling this a compiler bug; the way I understand volatile this is
> > very much against the intended use case. That is, this is buggy even
> > on UP vs signals or MMIO.
>
> And here is a simpler reproducer on my gcc-8.3.0 (aarch64) compiled with -O2:
>
> volatile unsigned long a;
>
> void fct(void)
> {
> 	a = 0x1234567812345678ULL;
> }
>
> void fct(void)
> {
> 	a = 0x1234567812345678ULL;
>    0:	90000000 	adrp	x0, 8 <fct+0x8>
>    4:	528acf01 	mov	w1, #0x5678	// #22136
>    8:	72a24681 	movk	w1, #0x1234, lsl #16
>    c:	f9400000 	ldr	x0, [x0]
>   10:	b9000001 	str	w1, [x0]
>   14:	b9000401 	str	w1, [x0, #4]
> }
>   18:	d65f03c0 	ret
>
> And the non-volatile case uses stp (is it a single store to memory?):
>
> unsigned long a;
>
> void fct(void)
> {
> 	a = 0x1234567812345678ULL;
> }
>
> void fct(void)
> {
> 	a = 0x1234567812345678ULL;
>    0:	90000000 	adrp	x0, 8 <fct+0x8>
>    4:	528acf01 	mov	w1, #0x5678	// #22136
>    8:	72a24681 	movk	w1, #0x1234, lsl #16
>    c:	f9400000 	ldr	x0, [x0]
>   10:	29000401 	stp	w1, w1, [x0]
> }
>   14:	d65f03c0 	ret
>
> It would probably be a good idea to audit other architectures, since this
> is done by the compiler backend.
That does seem like a very good idea!
Thanx, Paul
On Wed, Aug 21, 2019 at 06:56:10AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 02:32:48PM +0100, Will Deacon wrote:
> > On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> > > On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> > > > void bar(u64 *x)
> > > > {
> > > > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> > > > }
> > > >
> > > > then I get:
> > > >
> > > > bar:
> > > > 	mov	w1, 61200
> > > > 	movk	w1, 0xabcd, lsl 16
> > > > 	str	w1, [x0]
> > > > 	str	w1, [x0, 4]
> > > > 	ret
> > > >
> > > > so I'm not sure that WRITE_ONCE would even help :/
> > >
> > > Well, I can have the LWN article cite your email, then. So thank you
> > > very much!
> > >
> > > Is generation of this code for a 64-bit volatile store considered a bug?
> >
> > I consider it a bug for the volatile case, and the one compiler person I've
> > spoken to also seems to reckon it's a bug, so hopefully it will be fixed.
> > I'm led to believe it's an optimisation in the AArch64 backend of GCC.
>
> Here is hoping for the fix!
>
> > > Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> > > would guess that Thomas and Linus would ask a similar bugginess question
> > > for normal stores. ;-)
> >
> > We use inline asm for MMIO, fwiw.
>
> I should have remembered that, shouldn't I have? ;-)
>
> Is that also common practice across other embedded kernels these days?
I think so. Sometimes you care about things like the addressing mode being
used, so it's easier to roll it by hand.
Will
On Wed, Aug 21, 2019 at 11:48:43AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 21, 2019, at 8:33 AM, Peter Zijlstra peterz@infradead.org wrote:
>
> > On Wed, Aug 21, 2019 at 06:23:10AM -0700, Paul E. McKenney wrote:
> >> On Wed, Aug 21, 2019 at 11:32:01AM +0100, Will Deacon wrote:
> >
> >> > and so it is using a store-pair instruction to reduce the complexity in
> >> > the immediate generation. Thus, the 64-bit store will only have 32-bit
> >> > atomicity. In fact, this is scary because if I change bar to:
> >> >
> >> > void bar(u64 *x)
> >> > {
> >> > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> >> > }
> >> >
> >> > then I get:
> >> >
> >> > bar:
> >> > 	mov	w1, 61200
> >> > 	movk	w1, 0xabcd, lsl 16
> >> > 	str	w1, [x0]
> >> > 	str	w1, [x0, 4]
> >> > 	ret
> >> >
> >> > so I'm not sure that WRITE_ONCE would even help :/
> >>
> >> Well, I can have the LWN article cite your email, then. So thank you
> >> very much!
> >>
> >> Is generation of this code for a 64-bit volatile store considered a bug?
> >> Or does ARMv8 exclude the possibility of 64-bit MMIO registers? And I
> >> would guess that Thomas and Linus would ask a similar bugginess question
> >> for normal stores. ;-)
> >
> > I'm calling this a compiler bug; the way I understand volatile this is
> > very much against the intended use case. That is, this is buggy even
> > on UP vs signals or MMIO.
>
> And here is a simpler reproducer on my gcc-8.3.0 (aarch64) compiled with -O2:
>
> volatile unsigned long a;
>
> void fct(void)
> {
> 	a = 0x1234567812345678ULL;
> }
>
> void fct(void)
> {
> 	a = 0x1234567812345678ULL;
>    0:	90000000 	adrp	x0, 8 <fct+0x8>
>    4:	528acf01 	mov	w1, #0x5678	// #22136
>    8:	72a24681 	movk	w1, #0x1234, lsl #16
>    c:	f9400000 	ldr	x0, [x0]
>   10:	b9000001 	str	w1, [x0]
>   14:	b9000401 	str	w1, [x0, #4]
> }
>   18:	d65f03c0 	ret
FWIW, and interestingly, on clang v7.0.1-8 (aarch64) I get a proper 64-bit
str with the above example (even when not using volatile):
0000000000000000 <nonvol>:
0: d28acf08 mov x8, #0x5678 // #22136
4: f2a24688 movk x8, #0x1234, lsl #16
8: f2cacf08 movk x8, #0x5678, lsl #32
c: f2e24688 movk x8, #0x1234, lsl #48
10: 90000009 adrp x9, 8 <nonvol+0x8>
14: 91000129 add x9, x9, #0x0
18: f9000128 str x8, [x9]
1c: d65f03c0 ret
test1.o: file format elf64-littleaarch64
And even with -O2 it is a single store:
Disassembly of section .text:
0000000000000000 <nonvol>:
0: d28acf09 mov x9, #0x5678 // #22136
4: f2a24689 movk x9, #0x1234, lsl #16
8: f2cacf09 movk x9, #0x5678, lsl #32
c: 90000008 adrp x8, 8 <nonvol+0x8>
10: f2e24689 movk x9, #0x1234, lsl #48
14: f9000109 str x9, [x8]
18: d65f03c0 ret
thanks,
- Joel
[...]
Peter Zijlstra <peterz@infradead.org> wrote:
>
> Paulmck actually has an example of that somewhere; ISTR that particular
> case actually got fixed by GCC, but I'd really _love_ for some compiler
> people (both GCC and LLVM) to state that their respective compilers will
> not do load/store tearing for machine word sized load/stores.
>
> Without this written guarantee (which supposedly was in older GCC
> manuals but has since gone missing), I'm loath to rely on it.

IIRC in that case gcc actually broke atomic writes even with a volatile
keyword.  So even WRITE_ONCE wouldn't have saved us.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt