Re: [PATCH] locking/osq_lock: fix a data race in osq_wait_next

From: Marco Elver <elver@google.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>, Qian Cai <cai@lca.pw>,
	Will Deacon <will@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	kasan-dev <kasan-dev@googlegroups.com>
Subject: Re: [PATCH] locking/osq_lock: fix a data race in osq_wait_next
Date: Wed, 29 Jan 2020 16:29:43 +0100	[thread overview]
Message-ID: <CANpmjNN8J1oWtLPHTgCwbbtTuU_Js-8HD=cozW5cYkm8h-GTBg@mail.gmail.com> (raw)
In-Reply-To: <20200129002253.GT2935@paulmck-ThinkPad-P72>

On Wed, 29 Jan 2020 at 01:22, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Jan 28, 2020 at 05:56:55PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2020 at 12:46:26PM +0100, Marco Elver wrote:
> >
> > > > Marco, any thought on improving KCSAN for this to reduce the false
> > > > positives?
> > >
> > > Define 'false positive'.
> >
> > I'll use it where the code as written is correct while the tool
> > complains about it.
>
> I could be wrong, but I would guess that Marco is looking for something
> a little less subjective and a little more specific.  ;-)
>
> > > From what I can tell, all 'false positives' that have come up are data
> > > races where the consequences on the behaviour of the code is
> > > inconsequential. In other words, all of them would require
> > > understanding of the intended logic of the code, and understanding if
> > > the worst possible outcome of a data race changes the behaviour of the
> > > code in such a way that we may end up with an erroneously behaving
> > > system.
> > >
> > > As I have said before, KCSAN (or any data race detector) by definition
> > > only works at the language level. Any semantic analysis, beyond simple
> > > rules (such as ignore same-value stores) and annotations, is simply
> > > impossible since the tool can't know about the logic that the
> > > programmer intended.
> > >
> > > That being said, if there are simple rules (like ignore same-value
> > > stores) or other minimal annotations that can help reduce such 'false
> > > positives', more than happy to add them.
> >
> > OK, so KCSAN knows about same-value-stores? If so, that ->cpu =
> > smp_processor_id() case really doesn't need annotation, right?
>
> If smp_processor_id() returns the value already stored in ->cpu,
> I believe that the default KCSAN setup refrains from complaining.

Yes it won't report this with KCSAN_REPORT_VALUE_CHANGE_ONLY.  (There
was one case I missed, and just sent a patch to fix.)

> Which reminds me, I need to disable this in my RCU runs.  If I create a
> bug that causes me to unknowingly access something that is supposed to
> be CPU-private from the wrong CPU, I want to know about it.
>
> > > What to do about osq_lock here? If people agree that no further
> > > annotations are wanted, and the reasoning above concludes there are no
> > > bugs, we can blacklist the file. That would, however, miss new data
> > > races in future.
> >
> > I'm still hoping to convince you that the other case is one of those
> > 'simple-rules' too :-)
>
> On this I must defer to Marco.

On Tue, 28 Jan 2020 at 17:52, Peter Zijlstra <peterz@infradead.org> wrote:
> I'm claiming that in the first case, the only thing that's ever done
> with a racy load is comparing against 0, there is no possible bad
> outcome ever. While obviously if you let the load escape, or do anything
> other than compare against 0, there is.

It might sound like a simple rule, but implementing this is anything
but simple: This would require changing the compiler, which we said
we'd like to avoid as it introduces new problems. This particular rule
relies on semantic analysis that is beyond what the TSAN
instrumentation currently supports. Right now we support GCC and
Clang; changing the compiler probably means we'd end up with only one
(probably Clang), and many more years before the change has propagated
to the majority of used compiler versions. It'd be good if we can do
this purely as a change in the kernel's codebase.

Keeping the bigger picture in mind, how frequent is this case, and
what are we really trying to accomplish? Is it only to avoid a
READ_ONCE? Why is the READ_ONCE bad here? If there is a racing access,
why not be explicit about it?

I can see that maybe this particular file probably doesn't need more
hints that there is concurrency here (given its osq_lock), but as a
general rule for the remainder of the kernel it appears to add more
inconsistencies.

Thanks,
-- Marco