Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* Re: KCSAN: data-race in __alloc_file / __alloc_file
       [not found] <CAHk-=wjB61GNmqpX0BLA5tpL4tsjWV7akaTc2Roth7uGgax+mw@mail.gmail.com>
@ 2019-11-10 16:09 ` Alan Stern
  2019-11-10 19:10   ` Marco Elver
  2019-11-10 19:12   ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Alan Stern @ 2019-11-10 16:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Sat, 9 Nov 2019, Linus Torvalds wrote:

> On Sat, Nov 9, 2019, 15:08 Alan Stern <stern@rowland.harvard.edu> wrote:
> 
> > On Fri, 8 Nov 2019, Linus Torvalds wrote:
> > >
> > > Two writes to normal memory are *not* idempotent if they write
> > > different values. The ordering very much matters, and it's racy and a
> > > tool should complain.
> >
> > What you have written strongly indicates that either you think the word
> > "idempotent" means something different from its actual meaning or else
> > you are misusing the word in a very confusing way.
> >
> 
> "Idempotence is the property of certain operations in mathematics and
> computer science whereby they can be applied multiple times without
> changing the result beyond the initial application. "
> 
> This is (for example) commonly used when talking about nfs operations,
> where you can re-send the same nfs operation, and it's ok (even if it has
> side effects) because the server remembers that it already did the
> operation. If it's already been done, nothing changes.
> 
> It may not match your definition in some other area, but this is very much
> the accepted meaning of the word in computer science and operating systems.

Agreed.  My point was that you were using the word in a way which did
not match this definition.

Never mind that.  You did not respond to the question at the end of my 
previous email: Should the LKMM be changed so that two writes are not 
considered to race with each other if they store the same value?

That change would take care of the original issue of this email thread,
wouldn't it?  And it would render WRITE_IDEMPOTENT unnecessary.

Making that change would amount to formalizing your requirement that 
the compiler should not invent stores to shared variables.  In C11 such 
invented stores are allowed.  Given something like this:

	<A complex computation which does not involve x but does
	 require a register spill>
	x = 1234;

C11 allows the compiler to store an intermediate value in x rather than
allocating a slot on the stack for the register spill.  After all, x is
going to be overwritten anyway, and if any other thread accessed x
during the complex computation then it would race with the final store
and so the behavior would be undefined in any case.

If you want to specifically forbid the compiler from doing this, it
makes sense to change the memory model accordingly.

For those used to thinking in terms of litmus tests, consider this one:

C equivalent-writes

{}

P0(int *x)
{
	*x = 1;
}

P1(int *x)
{
	*x = 1;
}

exists (~x=1)

Should the LKMM say that this litmus test contains a race?

This suggests that we might also want to relax the notion of a write
racing with a read, although in that case I'm not at all sure what the
appropriate change to the memory model would be.  Something along the
lines of: If a write W races with a read R, but W stores the same value
that R would have read if W were not present, then it's not really a
race.  But of course this is far too vague to be useful.

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 16:09 ` KCSAN: data-race in __alloc_file / __alloc_file Alan Stern
@ 2019-11-10 19:10   ` Marco Elver
  2019-11-11 15:51     ` Alan Stern
  2019-11-10 19:12   ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-10 19:10 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linus Torvalds, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Sun, 10 Nov 2019 at 17:09, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Sat, 9 Nov 2019, Linus Torvalds wrote:
>
> > On Sat, Nov 9, 2019, 15:08 Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > > On Fri, 8 Nov 2019, Linus Torvalds wrote:
> > > >
> > > > Two writes to normal memory are *not* idempotent if they write
> > > > different values. The ordering very much matters, and it's racy and a
> > > > tool should complain.
> > >
> > > What you have written strongly indicates that either you think the word
> > > "idempotent" means something different from its actual meaning or else
> > > you are misusing the word in a very confusing way.
> > >
> >
> > "Idempotence is the property of certain operations in mathematics and
> > computer science whereby they can be applied multiple times without
> > changing the result beyond the initial application. "
> >
> > This is (for example) commonly used when talking about nfs operations,
> > where you can re-send the same nfs operation, and it's ok (even if it has
> > side effects) because the server remembers that it already did the
> > operation. If it's already been done, nothing changes.
> >
> > It may not match your definition in some other area, but this is very much
> > the accepted meaning of the word in computer science and operating systems.
>
> Agreed.  My point was that you were using the word in a way which did
> not match this definition.
>
> Never mind that.  You did not respond to the question at the end of my
> previous email: Should the LKMM be changed so that two writes are not
> considered to race with each other if they store the same value?
>
> That change would take care of the original issue of this email thread,
> wouldn't it?  And it would render WRITE_IDEMPOTENT unnecessary.
>
> Making that change would amount to formalizing your requirement that
> the compiler should not invent stores to shared variables.  In C11 such
> invented stores are allowed.  Given something like this:
>
>         <A complex computation which does not involve x but does
>          require a register spill>
>         x = 1234;
>
> C11 allows the compiler to store an intermediate value in x rather than
> allocating a slot on the stack for the register spill.  After all, x is
> going to be overwritten anyway, and if any other thread accessed x
> during the complex computation then it would race with the final store
> and so the behavior would be undefined in any case.
>
> If you want to specifically forbid the compiler from doing this, it
> makes sense to change the memory model accordingly.
>
> For those used to thinking in terms of litmus tests, consider this one:
>
> C equivalent-writes
>
> {}
>
> P0(int *x)
> {
>         *x = 1;
> }
>
> P1(int *x)
> {
>         *x = 1;
> }
>
> exists (~x=1)
>
> Should the LKMM say that this litmus test contains a race?
>
> This suggests that we might also want to relax the notion of a write
> racing with a read, although in that case I'm not at all sure what the
> appropriate change to the memory model would be.  Something along the
> lines of: If a write W races with a read R, but W stores the same value
> that R would have read if W were not present, then it's not really a
> race.  But of course this is far too vague to be useful.

What if you introduce to the above litmus test:

P2(int *x) { *x = 2; }

How can a developer, using the LKMM as a reference, hope to prove
their code is free from data races without having to enumerate all
possible values a variable could contain (in addition to all possible
interleavings)?

I view introducing data value dependencies, for the sake of allowing
more programs, to a language memory model as a slippery slope, and am
not aware of any precedent where this worked out. The additional
complexity in the memory model would put a burden on developers and
the compiler that is unlikely to be a real benefit (as you pointed
out, the compiler may even need to disable some transformations). From
a practical point of view, if the LKMM departs further and further
from C11's memory model, how do we ensure all compilers do the right
thing?

My vote would go to explicit annotation, not only because it reduces
hidden complexity, but also because it makes the code more
understandable, for developers and tooling. As an additional point, I
find the original suggestion to add WRITE_ONCE to be the least bad (or
some other better named WRITE_). Consider somebody changing the code,
changing the semantics and the values written to "non_rcu". With a
WRITE_ONCE, the developer would be clear about the fact that the write
can happen concurrently, and ensure new code is written with the
assumption that concurrent writes can happen.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 16:09 ` KCSAN: data-race in __alloc_file / __alloc_file Alan Stern
  2019-11-10 19:10   ` Marco Elver
@ 2019-11-10 19:12   ` Linus Torvalds
  2019-11-10 19:20     ` Linus Torvalds
  2019-11-12 19:14     ` Alan Stern
  1 sibling, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-10 19:12 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Sun, Nov 10, 2019 at 8:09 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Agreed.  My point was that you were using the word in a way which did
> not match this definition.

Whatever. I claim that my use was *exactly* that "certain writes are idempotent"

> Never mind that.  You did not respond to the question at the end of my
> previous email: Should the LKMM be changed so that two writes are not
> considered to race with each other if they store the same value?

No.

The whole point is that only *certain* writes are idempotent - the
ones where we stickily set a flag or clear a flag. So the field has
exactly two possible values: the initial state, and the "something did
a write to it" state.

This is why I suggested that WRITE_IDEMPOTENT() - which is us telling
the system that "I'm now doing that sticky write of a flag, and
ordering with other threads (or within this thread) on this field
doesn't matter".

One side effect of that "ordering doesn't matter" is that we could -
if it were to be shown to be worthwhile - turn it into a "did somebody
else already do this, then I won't bother".

But that's not necessarily true in _general_. We might write the same
value back without it being a true idempotent write. Some other write
_could_ race with it and be a data race.

For example, two threads doing

   variable++;

could race, and end up writing the same value _because_ of the race.
That would obviously be a data race, and neither of the two writes are
in any way idempotent.

Similarly, a "I added new data to a linked list, you should wake up
and handle it" write would always write the same value in that
particular location, but another location would obviously clear the
flag, so now that write that sets the "new data available" flag is
_not_ idempotent, and you could _not_ replace it with a "did somebody
else already set this flag" sequence. It might look on a local scope
like a "always write the same value", and yes, it might race with
others that also write the same value, but there are also threads that
write a different value, so now it's not ok to say "did it already
have that value, in which case I can skip the write".

See why I think idempotent writes are something somewhat special -
they aren't just about writing the same value. They are about only
_ever_ writing the same value (with the caveat obviously being "over
the lifetime of that data structure, and with the initial value being
different", of course).

> That change would take care of the original issue of this email thread,
> wouldn't it?  And it would render WRITE_IDEMPOTENT unnecessary.

So I do think LKMM should say "writes of the same value must obviously
result in the same value in memory afterwards", if it doesn't already.
That's a somewhat trivial case, it's just a special case of the
single-value atomicity issue. I thought the LKMM had that already: if
you have writes of 'x' and 'y' to a variable from two CPU's, all CPU's
are supposed to see _either_ 'x' or 'y', they can't ever see a mix of
the two.

And yes, we've depended on that single-value atomicity historically.

The 'x' and 'y' have the same value is just a special case of that
general issue - if two threads write the same value, no CPU can ever
see anything but that value (or the original one). So in that sense,
fundamentally the same value write cannot race with itself.

But that LKMM rule is separate from a rule about a statistical tool like KCSAN.

Should KCSAN then ignore writes of the same value?

Maybe.

Because while that "variable++" data race with the same value is real,
the likelihood of hitting it is small, so a statistical tool like
KCSAN might as well ignore it - the tool would show the data race when
the race _doesn't_ happen, which would be the normal case anyway, and
would be the reason why the race hadn't been noticed by a normal human
being.

So practically speaking, we might say "concurrent writes of the same
value aren't data races" for KCSAN, even though they _could_ be data
races.

And this is where WRITE_IDEMPOTENT would make a possible difference.
In particular, if we make the optimization to do the "read and only
write if changed", two CPU's doing this concurrently would do

   READ 0
   WRITE 1

(for a "flag goes from 0->1" transition) and from a tool perspective,
it would be very hard to know whether this is a race (two threads
doing "variable++") or not (two threads setting a sticky flag).

So WRITE_IDEMPOTENT would then disambiguate that choice. See what I'm saying?

At the same time, I suspect that it's just simpler to say "if all the
writes we see to this field have the same value, then we will assume
it has idempotent behavior".

Even then the "all writes" would have to know the difference between
initial values and subsequent updates, which apparently isn't obvious
in KCSAN, but I don't know how hacky that kind of logic would be.

> Making that change would amount to formalizing your requirement that
> the compiler should not invent stores to shared variables.  In C11 such
> invented stores are allowed.

I don't care one whit about C11. Made-up stores to shared data are not
acceptable. Ever. We will turn that off with a compiler switch if the
compiler thinks it can do them, the same way we turn off other
incorrect optimizations like the type-based aliasing or the insane
"signed integer arithmetic can have undefined behavior" stupidity that
the standards people allowed.

I thought that has always been clear. I have not exactly been
ambiguous about my dislike of silly pointless "the standard allows me
to do stupid things".

                 Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 19:12   ` Linus Torvalds
@ 2019-11-10 19:20     ` Linus Torvalds
  2019-11-10 20:44       ` Paul E. McKenney
  2019-11-12 19:14     ` Alan Stern
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-10 19:20 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Sun, Nov 10, 2019 at 11:12 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And this is where WRITE_IDEMPOTENT would make a possible difference.
> In particular, if we make the optimization to do the "read and only
> write if changed"

It might be useful for checking too. IOW, something like KCSAN could
actually check that if a field has an idempotent write to it, all
writes always have the same value.

Again, there's the issue with lifetime.

Part of that is "initialization is different". Those writes would not
be marked idempotent, of course, and they'd write another value.

There's also the issue of lifetime at the _end_ of the use, of course.
There _are_ interesting data races at the end of the lifetime, both
reads and writes.

In particular, if it's a sticky flag, in order for there to not be any
races, all the writes have to happen with a refcount held, and the
final read has to happen after the final refcount is dropped (and the
refcounts have to have atomicity and ordering, of course). I'm not
sure how easy something like that is model in KSAN. Maybe it already
does things like that for all the other refcount stuff we do.

But the lifetime can be problematic for other reasons too - in this
particular case we have a union for that sticky flag (which is used
under the refcount), and then when the final refcount is released we
read that value (thus no data race) but because of the union we will
now start using that field with *different* data. It becomes that RCU
list head instead.

That kind of "it used to be a sticky flag, but now the lifetime of the
flag is over, and it's something entirely different" might be a
nightmare for something like KCSAN. It sounds complicated to check
for, but I have no idea what KCSAN really considers complicated or
not.

                  Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 19:20     ` Linus Torvalds
@ 2019-11-10 20:44       ` Paul E. McKenney
  2019-11-10 21:10         ` Linus Torvalds
  2019-11-11 14:17         ` Marco Elver
  0 siblings, 2 replies; 67+ messages in thread
From: Paul E. McKenney @ 2019-11-10 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Sun, Nov 10, 2019 at 11:20:53AM -0800, Linus Torvalds wrote:
> On Sun, Nov 10, 2019 at 11:12 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And this is where WRITE_IDEMPOTENT would make a possible difference.
> > In particular, if we make the optimization to do the "read and only
> > write if changed"
> 
> It might be useful for checking too. IOW, something like KCSAN could
> actually check that if a field has an idempotent write to it, all
> writes always have the same value.
> 
> Again, there's the issue with lifetime.
> 
> Part of that is "initialization is different". Those writes would not
> be marked idempotent, of course, and they'd write another value.
> 
> There's also the issue of lifetime at the _end_ of the use, of course.
> There _are_ interesting data races at the end of the lifetime, both
> reads and writes.
> 
> In particular, if it's a sticky flag, in order for there to not be any
> races, all the writes have to happen with a refcount held, and the
> final read has to happen after the final refcount is dropped (and the
> refcounts have to have atomicity and ordering, of course). I'm not
> sure how easy something like that is model in KSAN. Maybe it already
> does things like that for all the other refcount stuff we do.
> 
> But the lifetime can be problematic for other reasons too - in this
> particular case we have a union for that sticky flag (which is used
> under the refcount), and then when the final refcount is released we
> read that value (thus no data race) but because of the union we will
> now start using that field with *different* data. It becomes that RCU
> list head instead.
> 
> That kind of "it used to be a sticky flag, but now the lifetime of the
> flag is over, and it's something entirely different" might be a
> nightmare for something like KCSAN. It sounds complicated to check
> for, but I have no idea what KCSAN really considers complicated or
> not.

But will "one size fits all" be practical and useful?

For my code, I would be happy to accept a significant "false positive"
rate to get even a probabilistic warning of other-task accesses to some
of RCU's fields.  Even if the accesses were perfect from a functional
viewpoint, they could be problematic from a performance and scalability
viewpoint.  And for something like RCU, real bugs, even those that are
very improbable, need to be fixed.

But other code (and thus other developers and maintainers) are going to
have different needs.  For all I know, some might have good reasons to
exclude their code from KCSAN analysis entirely.

Would it make sense for KCSAN to have per-file/subsystem/whatever flags
specifying the depth of the analysis?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 20:44       ` Paul E. McKenney
@ 2019-11-10 21:10         ` Linus Torvalds
  2019-11-10 21:31           ` Paul E. McKenney
  2019-11-11 14:17         ` Marco Elver
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-10 21:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, Marco Elver, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Sun, Nov 10, 2019 at 12:44 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> But will "one size fits all" be practical and useful?

Oh, I do agree that if KCSAN has some mode where it says "I'll ignore
repeated writes with the same value" (or whatever), it could/should
likely be behind some flag.

I don't think it should be a subsystem flag, though. More of a "I'm
willing to actually analyze and ignore false positives" flag. Because
I don't think it's so much about the code, as it is about the person
who looks at the results.

For example, we're already getting push-back from people on some of
the KCSAN-inspired patches. If we have people sending patches to add
READ_ONCE/WRITE_ONCE to random places to shut up KCSAN reports, I
don't think that's good.

But if we have people who _work_ on memory ordering issues etc, and
want to see a strict mode, knowing there are false positives and able
to handle them, that's a completely different thing..

No?

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 21:10         ` Linus Torvalds
@ 2019-11-10 21:31           ` Paul E. McKenney
  0 siblings, 0 replies; 67+ messages in thread
From: Paul E. McKenney @ 2019-11-10 21:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Sun, Nov 10, 2019 at 01:10:39PM -0800, Linus Torvalds wrote:
> On Sun, Nov 10, 2019 at 12:44 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > But will "one size fits all" be practical and useful?
> 
> Oh, I do agree that if KCSAN has some mode where it says "I'll ignore
> repeated writes with the same value" (or whatever), it could/should
> likely be behind some flag.
> 
> I don't think it should be a subsystem flag, though. More of a "I'm
> willing to actually analyze and ignore false positives" flag. Because
> I don't think it's so much about the code, as it is about the person
> who looks at the results.
> 
> For example, we're already getting push-back from people on some of
> the KCSAN-inspired patches. If we have people sending patches to add
> READ_ONCE/WRITE_ONCE to random places to shut up KCSAN reports, I
> don't think that's good.
> 
> But if we have people who _work_ on memory ordering issues etc, and
> want to see a strict mode, knowing there are false positives and able
> to handle them, that's a completely different thing..
> 
> No?

Understood on the pushback!  And I especially agree that it is bad to
automatically add *_ONCE() just to shut up KCSAN.  For one thing, doing
that inconveniences people later on who might want to take a closer look.

As long as I can get the full-up reports for RCU.  And as long as the
others who want the full-up reports can also get them.  ;-)

And agreed, if the results are adjusted based on who is processing them,
that should be good.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 20:44       ` Paul E. McKenney
  2019-11-10 21:10         ` Linus Torvalds
@ 2019-11-11 14:17         ` Marco Elver
  2019-11-11 14:31           ` Paul E. McKenney
  1 sibling, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-11 14:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Alan Stern, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Sun, 10 Nov 2019 at 21:44, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Sun, Nov 10, 2019 at 11:20:53AM -0800, Linus Torvalds wrote:
> > On Sun, Nov 10, 2019 at 11:12 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And this is where WRITE_IDEMPOTENT would make a possible difference.
> > > In particular, if we make the optimization to do the "read and only
> > > write if changed"
> >
> > It might be useful for checking too. IOW, something like KCSAN could
> > actually check that if a field has an idempotent write to it, all
> > writes always have the same value.
> >
> > Again, there's the issue with lifetime.
> >
> > Part of that is "initialization is different". Those writes would not
> > be marked idempotent, of course, and they'd write another value.
> >
> > There's also the issue of lifetime at the _end_ of the use, of course.
> > There _are_ interesting data races at the end of the lifetime, both
> > reads and writes.
> >
> > In particular, if it's a sticky flag, in order for there to not be any
> > races, all the writes have to happen with a refcount held, and the
> > final read has to happen after the final refcount is dropped (and the
> > refcounts have to have atomicity and ordering, of course). I'm not
> > sure how easy something like that is model in KSAN. Maybe it already
> > does things like that for all the other refcount stuff we do.
> >
> > But the lifetime can be problematic for other reasons too - in this
> > particular case we have a union for that sticky flag (which is used
> > under the refcount), and then when the final refcount is released we
> > read that value (thus no data race) but because of the union we will
> > now start using that field with *different* data. It becomes that RCU
> > list head instead.
> >
> > That kind of "it used to be a sticky flag, but now the lifetime of the
> > flag is over, and it's something entirely different" might be a
> > nightmare for something like KCSAN. It sounds complicated to check
> > for, but I have no idea what KCSAN really considers complicated or
> > not.
>
> But will "one size fits all" be practical and useful?
>
> For my code, I would be happy to accept a significant "false positive"
> rate to get even a probabilistic warning of other-task accesses to some
> of RCU's fields.  Even if the accesses were perfect from a functional
> viewpoint, they could be problematic from a performance and scalability
> viewpoint.  And for something like RCU, real bugs, even those that are
> very improbable, need to be fixed.
>
> But other code (and thus other developers and maintainers) are going to
> have different needs.  For all I know, some might have good reasons to
> exclude their code from KCSAN analysis entirely.
>
> Would it make sense for KCSAN to have per-file/subsystem/whatever flags
> specifying the depth of the analysis?

Just to answer this: we already have this, and disable certain files
already. So it's an option if required. Just need maintainers to add
KCSAN_SANITIZE := n, or KCSAN_SANITIZE_file.o := n to Makefiles, and
KCSAN will simply ignore those.

FWIW we now also have a config option to "ignore repeated writes with
the same value". It may be a little overaggressive/imprecise in
filtering data races, but anything else like the super precise
analysis involving tracking lifetimes and values (and whatever else
the rules would require) is simply too complex. So, the current
solution will avoid reporting cases like the original report here
(__alloc_file), but at the cost of maybe being a little imprecise.
It's probably a reasonable trade-off, given that we have too many data
races to deal with on syzbot anyway.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 14:17         ` Marco Elver
@ 2019-11-11 14:31           ` Paul E. McKenney
  2019-11-11 15:10             ` Marco Elver
  0 siblings, 1 reply; 67+ messages in thread
From: Paul E. McKenney @ 2019-11-11 14:31 UTC (permalink / raw)
  To: Marco Elver
  Cc: Linus Torvalds, Alan Stern, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 03:17:51PM +0100, Marco Elver wrote:
> On Sun, 10 Nov 2019 at 21:44, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Sun, Nov 10, 2019 at 11:20:53AM -0800, Linus Torvalds wrote:
> > > On Sun, Nov 10, 2019 at 11:12 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > And this is where WRITE_IDEMPOTENT would make a possible difference.
> > > > In particular, if we make the optimization to do the "read and only
> > > > write if changed"
> > >
> > > It might be useful for checking too. IOW, something like KCSAN could
> > > actually check that if a field has an idempotent write to it, all
> > > writes always have the same value.
> > >
> > > Again, there's the issue with lifetime.
> > >
> > > Part of that is "initialization is different". Those writes would not
> > > be marked idempotent, of course, and they'd write another value.
> > >
> > > There's also the issue of lifetime at the _end_ of the use, of course.
> > > There _are_ interesting data races at the end of the lifetime, both
> > > reads and writes.
> > >
> > > In particular, if it's a sticky flag, in order for there to not be any
> > > races, all the writes have to happen with a refcount held, and the
> > > final read has to happen after the final refcount is dropped (and the
> > > refcounts have to have atomicity and ordering, of course). I'm not
> > > sure how easy something like that is model in KSAN. Maybe it already
> > > does things like that for all the other refcount stuff we do.
> > >
> > > But the lifetime can be problematic for other reasons too - in this
> > > particular case we have a union for that sticky flag (which is used
> > > under the refcount), and then when the final refcount is released we
> > > read that value (thus no data race) but because of the union we will
> > > now start using that field with *different* data. It becomes that RCU
> > > list head instead.
> > >
> > > That kind of "it used to be a sticky flag, but now the lifetime of the
> > > flag is over, and it's something entirely different" might be a
> > > nightmare for something like KCSAN. It sounds complicated to check
> > > for, but I have no idea what KCSAN really considers complicated or
> > > not.
> >
> > But will "one size fits all" be practical and useful?
> >
> > For my code, I would be happy to accept a significant "false positive"
> > rate to get even a probabilistic warning of other-task accesses to some
> > of RCU's fields.  Even if the accesses were perfect from a functional
> > viewpoint, they could be problematic from a performance and scalability
> > viewpoint.  And for something like RCU, real bugs, even those that are
> > very improbable, need to be fixed.
> >
> > But other code (and thus other developers and maintainers) are going to
> > have different needs.  For all I know, some might have good reasons to
> > exclude their code from KCSAN analysis entirely.
> >
> > Would it make sense for KCSAN to have per-file/subsystem/whatever flags
> > specifying the depth of the analysis?
> 
> Just to answer this: we already have this, and disable certain files
> already. So it's an option if required. Just need maintainers to add
> KCSAN_SANITIZE := n, or KCSAN_SANITIZE_file.o := n to Makefiles, and
> KCSAN will simply ignore those.
> 
> FWIW we now also have a config option to "ignore repeated writes with
> the same value". It may be a little overaggressive/imprecise in
> filtering data races, but anything else like the super precise
> analysis involving tracking lifetimes and values (and whatever else
> the rules would require) is simply too complex. So, the current
> solution will avoid reporting cases like the original report here
> (__alloc_file), but at the cost of maybe being a little imprecise.
> It's probably a reasonable trade-off, given that we have too many data
> races to deal with on syzbot anyway.

Nice!

Is this added repeated-writes analysis something that can be disabled?
I would prefer that the analysis of RCU complain in this case as a
probabilistic cache-locality warning.  If it can be disabled, please
let me know if there is anything that I need to do to make this happen.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 14:31           ` Paul E. McKenney
@ 2019-11-11 15:10             ` Marco Elver
  2019-11-13  0:25               ` Paul E. McKenney
  0 siblings, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-11 15:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Alan Stern, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Mon, 11 Nov 2019 at 15:31, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Nov 11, 2019 at 03:17:51PM +0100, Marco Elver wrote:
> > On Sun, 10 Nov 2019 at 21:44, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Sun, Nov 10, 2019 at 11:20:53AM -0800, Linus Torvalds wrote:
> > > > On Sun, Nov 10, 2019 at 11:12 AM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > And this is where WRITE_IDEMPOTENT would make a possible difference.
> > > > > In particular, if we make the optimization to do the "read and only
> > > > > write if changed"
> > > >
> > > > It might be useful for checking too. IOW, something like KCSAN could
> > > > actually check that if a field has an idempotent write to it, all
> > > > writes always have the same value.
> > > >
> > > > Again, there's the issue with lifetime.
> > > >
> > > > Part of that is "initialization is different". Those writes would not
> > > > be marked idempotent, of course, and they'd write another value.
> > > >
> > > > There's also the issue of lifetime at the _end_ of the use, of course.
> > > > There _are_ interesting data races at the end of the lifetime, both
> > > > reads and writes.
> > > >
> > > > In particular, if it's a sticky flag, in order for there to not be any
> > > > races, all the writes have to happen with a refcount held, and the
> > > > final read has to happen after the final refcount is dropped (and the
> > > > refcounts have to have atomicity and ordering, of course). I'm not
> > > > sure how easy something like that is model in KSAN. Maybe it already
> > > > does things like that for all the other refcount stuff we do.
> > > >
> > > > But the lifetime can be problematic for other reasons too - in this
> > > > particular case we have a union for that sticky flag (which is used
> > > > under the refcount), and then when the final refcount is released we
> > > > read that value (thus no data race) but because of the union we will
> > > > now start using that field with *different* data. It becomes that RCU
> > > > list head instead.
> > > >
> > > > That kind of "it used to be a sticky flag, but now the lifetime of the
> > > > flag is over, and it's something entirely different" might be a
> > > > nightmare for something like KCSAN. It sounds complicated to check
> > > > for, but I have no idea what KCSAN really considers complicated or
> > > > not.
> > >
> > > But will "one size fits all" be practical and useful?
> > >
> > > For my code, I would be happy to accept a significant "false positive"
> > > rate to get even a probabilistic warning of other-task accesses to some
> > > of RCU's fields.  Even if the accesses were perfect from a functional
> > > viewpoint, they could be problematic from a performance and scalability
> > > viewpoint.  And for something like RCU, real bugs, even those that are
> > > very improbable, need to be fixed.
> > >
> > > But other code (and thus other developers and maintainers) are going to
> > > have different needs.  For all I know, some might have good reasons to
> > > exclude their code from KCSAN analysis entirely.
> > >
> > > Would it make sense for KCSAN to have per-file/subsystem/whatever flags
> > > specifying the depth of the analysis?
> >
> > Just to answer this: we already have this, and disable certain files
> > already. So it's an option if required. Just need maintainers to add
> > KCSAN_SANITIZE := n, or KCSAN_SANITIZE_file.o := n to Makefiles, and
> > KCSAN will simply ignore those.
> >
> > FWIW we now also have a config option to "ignore repeated writes with
> > the same value". It may be a little overaggressive/imprecise in
> > filtering data races, but anything else like the super precise
> > analysis involving tracking lifetimes and values (and whatever else
> > the rules would require) is simply too complex. So, the current
> > solution will avoid reporting cases like the original report here
> > (__alloc_file), but at the cost of maybe being a little imprecise.
> > It's probably a reasonable trade-off, given that we have too many data
> > races to deal with on syzbot anyway.
>
> Nice!
>
> Is this added repeated-writes analysis something that can be disabled?
> I would prefer that the analysis of RCU complain in this case as a
> probabilistic cache-locality warning.  If it can be disabled, please
> let me know if there is anything that I need to do to make this happen.

It's hidden behind a Kconfig config option, and actually disabled by
default. We can't enable/disable this on a per-file basis.

Right now, we'll just enable it on the public syzbot instance, which
will use the most conservative config.  Of course you can still run
your own fuzzer/stress test of choice with KCSAN and the option
disabled. Is that enough?

Otherwise I could also just say if the symbolized top stack frame
contains "rcu_", don't ignore -- which would be a little hacky and
imprecise though. What do you prefer?

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 19:10   ` Marco Elver
@ 2019-11-11 15:51     ` Alan Stern
  2019-11-11 16:51       ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Alan Stern @ 2019-11-11 15:51 UTC (permalink / raw)
  To: Marco Elver
  Cc: Linus Torvalds, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Sun, 10 Nov 2019, Marco Elver wrote:

> On Sun, 10 Nov 2019 at 17:09, Alan Stern <stern@rowland.harvard.edu> wrote:
...

> > For those used to thinking in terms of litmus tests, consider this one:
> >
> > C equivalent-writes
> >
> > {}
> >
> > P0(int *x)
> > {
> >         *x = 1;
> > }
> >
> > P1(int *x)
> > {
> >         *x = 1;
> > }
> >
> > exists (~x=1)
> >
> > Should the LKMM say that this litmus test contains a race?
> >
> > This suggests that we might also want to relax the notion of a write
> > racing with a read, although in that case I'm not at all sure what the
> > appropriate change to the memory model would be.  Something along the
> > lines of: If a write W races with a read R, but W stores the same value
> > that R would have read if W were not present, then it's not really a
> > race.  But of course this is far too vague to be useful.
> 
> What if you introduce to the above litmus test:
> 
> P2(int *x) { *x = 2; }

Then clearly the test _would_ contain a data race.

> How can a developer, using the LKMM as a reference, hope to prove
> their code is free from data races without having to enumerate all
> possible values a variable could contain (in addition to all possible
> interleavings)?

Well, for one thing the new rule doesn't say anything about all
possible values a variable could contain; it only talks about the
values of a pair of concurrent writes.  Of course, this would still
require you to be aware of (if not to fully enumerate) all possible
values a write could store, so perhaps it's not much of an improvement.

On the other hand, one way to prove your code is data-race-free under
the revised LKMM would be to show that it is data-race-free under the
original (i.e., current) LKMM.  Then you wouldn't have to enumerate any
lists of possible values.

> I view introducing data value dependencies, for the sake of allowing
> more programs, to a language memory model as a slippery slope, and am
> not aware of any precedent where this worked out. The additional
> complexity in the memory model would put a burden on developers and
> the compiler that is unlikely to be a real benefit (as you pointed
> out, the compiler may even need to disable some transformations).

This may be so.  But if the resulting model is a better match to the
way kernel developers think about their code, wouldn't it be
appropriate?

>  From
> a practical point of view, if the LKMM departs further and further
> from C11's memory model, how do we ensure all compilers do the right
> thing?

Linus has already committed to doing this.  To quote the latest example
(from a message he posted shortly after yours):

	I don't care one whit about C11. Made-up stores to shared data
	are not acceptable. Ever. We will turn that off with a compiler
	switch if the compiler thinks it can do them, the same way we
	turn off other incorrect optimizations like the type-based
	aliasing or the insane "signed integer arithmetic can have
	undefined behavior" stupidity that the standards people
	allowed.

Given that the kernel _requires_ compilers to behave this way, 
shouldn't the LKMM reflect this requirement?

> My vote would go to explicit annotation, not only because it reduces
> hidden complexity, but also because it makes the code more
> understandable, for developers and tooling. As an additional point, I
> find the original suggestion to add WRITE_ONCE to be the least bad (or
> some other better named WRITE_). Consider somebody changing the code,
> changing the semantics and the values written to "non_rcu". With a
> WRITE_ONCE, the developer would be clear about the fact that the write
> can happen concurrently, and ensure new code is written with the
> assumption that concurrent writes can happen.

I dislike the explicit annotation approach, because it shifts the
burden of proving correctness from the automatic verifier to the
programmer.  Let's take the litmus test above as example.  I could
annotate it to read:

P0(int *x) { WRITE_IDEMPOTENT(*x, 1); }
P1(int *x) { WRITE_IDEMPOTENT(*x, 1); }

and then KCSAN would take my word for it that the two writes don't race
with each other (or if they do race, it doesn't matter).  But now if
the code was changed by adding:

P2(int *x) { WRITE_IDEMPOTENT(*x, 2); }

then KCSAN would still believe there was no race, meaning it would be
up to me to audit all possible writes to x to make sure they store the
same value.  That is not how automated tooling should work.

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 15:51     ` Alan Stern
@ 2019-11-11 16:51       ` Linus Torvalds
  2019-11-11 17:52         ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 16:51 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 7:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> I dislike the explicit annotation approach, because it shifts the
> burden of proving correctness from the automatic verifier to the
> programmer.

Yes.

However, sometimes explicit annotations are very useful as
documentation and as showing of intent even if they might not change
behavior or code generation.

But they generally should never _replace_ checking - in fact, the
annotations themselves should hopefully be checked for correctness
too.

So a good annotation would implicitly document intent, but it should
also be something that we can check being true, so that we also have
the check that reality actually _matches_ the intent too. Because
misleading and wrong documentation is worse than no documentation at
all.

Side note: an example of a dangerous annotation is the one that Eric
pointed out, where a 64-bit read in percpu_counter_read_positive()
could be changed to READ_ONCE(), and we would compile it cleanly, but
on 32-bit it wouldn't actually be atomic.

We at one time tried to actually verify that READ/WRITE_ONCE() was
done only on types that could actually be accessed atomically (always
ignoring alpha because the pain is not worth it), but it showed too
many problems.

So now we silently accept things that aren't actually atomic. We do
access them "once" in the sense that we don't allow the compiler to
reload it, but it's not "once" in the LKMM sense of one single value.

That's ok for some cases. But it's actually a horrid horrid thing from
a documentation standpoint, and I hate it, and it's dangerous.

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 16:51       ` Linus Torvalds
@ 2019-11-11 17:52         ` Eric Dumazet
  2019-11-11 18:04           ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-11 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 8:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 11, 2019 at 7:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > I dislike the explicit annotation approach, because it shifts the
> > burden of proving correctness from the automatic verifier to the
> > programmer.
>
> Yes.
>
> However, sometimes explicit annotations are very useful as
> documentation and as showing of intent even if they might not change
> behavior or code generation.
>
> But they generally should never _replace_ checking - in fact, the
> annotations themselves should hopefully be checked for correctness
> too.
>
> So a good annotation would implicitly document intent, but it should
> also be something that we can check being true, so that we also have
> the check that reality actually _matches_ the intent too. Because
> misleading and wrong documentation is worse than no documentation at
> all.
>
> Side note: an example of a dangerous annotation is the one that Eric
> pointed out, where a 64-bit read in percpu_counter_read_positive()
> could be changed to READ_ONCE(), and we would compile it cleanly, but
> on 32-bit it wouldn't actually be atomic.
>
> We at one time tried to actually verify that READ/WRITE_ONCE() was
> done only on types that could actually be accessed atomically (always
> ignoring alpha because the pain is not worth it), but it showed too
> many problems.
>
> So now we silently accept things that aren't actually atomic. We do
> access them "once" in the sense that we don't allow the compiler to
> reload it, but it's not "once" in the LKMM sense of one single value.
>
> That's ok for some cases. But it's actually a horrid horrid thing from
> a documentation standpoint, and I hate it, and it's dangerous.
>
>                 Linus

I was hoping to cleanup the 'easy cases'  before looking at more serious issues.

But it looks like even the ' easy cases'  are not that easy.

Now I wonder what to do with the ~400 KCSAN reports sitting in
pre-moderation queue.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 17:52         ` Eric Dumazet
@ 2019-11-11 18:04           ` Linus Torvalds
  2019-11-11 18:31             ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 18:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 9:52 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Now I wonder what to do with the ~400 KCSAN reports sitting in
> pre-moderation queue.

So regular KASAN reports are fairly easy to deal with: they report
actual bugs. They may be hard to hit, but generally there's no
question about something like a use-after-free or whatever.

The problem with KCSAN is that it's not clear how many of the reports
have been actual real honest-to-goodness bugs that could cause
problems, and how many of them are "this isn't actually a bug, but an
annotation will shut up KCSAN".

My gut feeling would be that it would be best to ignore the ones that
are "an annotation will shut up KCSAN", and look at the ones that are
real bugs.

Is there a pattern to those real bugs? Is there perhaps a way to make
KCSAN notice _that_ pattern in particular, and suppress the ones that
are "we can shut these up with annotations that don't really change
the code"?

I think it would be much better for the kernel - and much better for
KCSAN - if the problem reports KCSAN reports are real problems that
can actually be triggered as problems, and that it behaves much more
like KASAN in that respect.

Yes, yes, then once the *real* problems have been handled, maybe we
can expand the search to be "stylistic issues" and "in theory, this
could cause problems with a compiler that did X" issues.

But I think the "just annotate" thing makes people more likely to
dismiss KCSAN issues, and I don't think it's healthy.

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 18:04           ` Linus Torvalds
@ 2019-11-11 18:31             ` Eric Dumazet
  2019-11-11 18:44               ` Eric Dumazet
  2019-11-11 18:50               ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Eric Dumazet @ 2019-11-11 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 10:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 11, 2019 at 9:52 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Now I wonder what to do with the ~400 KCSAN reports sitting in
> > pre-moderation queue.
>
> So regular KASAN reports are fairly easy to deal with: they report
> actual bugs. They may be hard to hit, but generally there's no
> question about something like a use-after-free or whatever.
>
> The problem with KCSAN is that it's not clear how many of the reports
> have been actual real honest-to-goodness bugs that could cause
> problems, and how many of them are "this isn't actually a bug, but an
> annotation will shut up KCSAN".
>
> My gut feeling would be that it would be best to ignore the ones that
> are "an annotation will shut up KCSAN", and look at the ones that are
> real bugs.
>
> Is there a pattern to those real bugs? Is there perhaps a way to make
> KCSAN notice _that_ pattern in particular, and suppress the ones that
> are "we can shut these up with annotations that don't really change
> the code"?
>
> I think it would be much better for the kernel - and much better for
> KCSAN - if the problem reports KCSAN reports are real problems that
> can actually be triggered as problems, and that it behaves much more
> like KASAN in that respect.
>
> Yes, yes, then once the *real* problems have been handled, maybe we
> can expand the search to be "stylistic issues" and "in theory, this
> could cause problems with a compiler that did X" issues.
>
> But I think the "just annotate" thing makes people more likely to
> dismiss KCSAN issues, and I don't think it's healthy.
>

Problem is that KASAN/KCSAN stops as soon as one issue is hit,
regardless of it being a false positive or not.

(Same happens with LOCKDEP seeing only one issue, then disabling itself)

If we do not annotate the false positive, the real issues might be
hidden for years.

There is no pattern really, only a lot of noise (small ' bugs'  that
have no real impact)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 18:31             ` Eric Dumazet
@ 2019-11-11 18:44               ` Eric Dumazet
  2019-11-11 19:00                 ` Linus Torvalds
  2019-11-11 18:50               ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-11 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 10:31 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Nov 11, 2019 at 10:05 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Mon, Nov 11, 2019 at 9:52 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > Now I wonder what to do with the ~400 KCSAN reports sitting in
> > > pre-moderation queue.
> >
> > So regular KASAN reports are fairly easy to deal with: they report
> > actual bugs. They may be hard to hit, but generally there's no
> > question about something like a use-after-free or whatever.
> >
> > The problem with KCSAN is that it's not clear how many of the reports
> > have been actual real honest-to-goodness bugs that could cause
> > problems, and how many of them are "this isn't actually a bug, but an
> > annotation will shut up KCSAN".
> >
> > My gut feeling would be that it would be best to ignore the ones that
> > are "an annotation will shut up KCSAN", and look at the ones that are
> > real bugs.
> >
> > Is there a pattern to those real bugs? Is there perhaps a way to make
> > KCSAN notice _that_ pattern in particular, and suppress the ones that
> > are "we can shut these up with annotations that don't really change
> > the code"?
> >
> > I think it would be much better for the kernel - and much better for
> > KCSAN - if the problem reports KCSAN reports are real problems that
> > can actually be triggered as problems, and that it behaves much more
> > like KASAN in that respect.
> >
> > Yes, yes, then once the *real* problems have been handled, maybe we
> > can expand the search to be "stylistic issues" and "in theory, this
> > could cause problems with a compiler that did X" issues.
> >
> > But I think the "just annotate" thing makes people more likely to
> > dismiss KCSAN issues, and I don't think it's healthy.
> >
>
> Problem is that KASAN/KCSAN stops as soon as one issue is hit,
> regardless of it being a false positive or not.
>
> (Same happens with LOCKDEP seeing only one issue, then disabling itself)
>
> If we do not annotate the false positive, the real issues might be
> hidden for years.
>
> There is no pattern really, only a lot of noise (small ' bugs'  that
> have no real impact)

An interesting case is the race in ksys_write()

if (ppos) {
     pos = *ppos; // data-race
     ppos = &pos;
}
ret = vfs_write(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
    f.file->f_pos = pos;  // data-race

BUG: KCSAN: data-race in ksys_write / ksys_write

write to 0xffff8880a9c29568 of 8 bytes by task 27477 on cpu 1:
 ksys_write+0x101/0x1b0 fs/read_write.c:613
 __do_sys_write fs/read_write.c:623 [inline]
 __se_sys_write fs/read_write.c:620 [inline]
 __x64_sys_write+0x4c/0x60 fs/read_write.c:620
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 18:31             ` Eric Dumazet
  2019-11-11 18:44               ` Eric Dumazet
@ 2019-11-11 18:50               ` Linus Torvalds
  2019-11-11 18:59                 ` Marco Elver
  2019-11-11 18:59                 ` Eric Dumazet
  1 sibling, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 18:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 10:31 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Problem is that KASAN/KCSAN stops as soon as one issue is hit,
> regardless of it being a false positive or not.

So mayb e that - together with the known huge base of false positives
- just means that KCSAN needs some more work before it can be used as
a basis for sending out patches.

Maybe the reporting needs to create a hash of the location, and report
once per location? Or something like that.

Maybe KCSAN needs a way to filter out known false positives on a KCSAN
side, without having to change the source for a tool that gives too
much noise?

> If we do not annotate the false positive, the real issues might be
> hidden for years.

I don't think "change the kernel source for a tool that isn't good
enough" is the solution.

> There is no pattern really, only a lot of noise (small ' bugs'  that
> have no real impact)

Yeah, if it hasn't shown any real bugs so far, that just strengthens
the "it needs much fewer false positives to be useful".

KASAN and lockdep can afford to stop after the first problem, because
the problems they report - and the additional annotations you might
want to add - are quality problems and annotations.

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 18:50               ` Linus Torvalds
@ 2019-11-11 18:59                 ` Marco Elver
  2019-11-11 18:59                 ` Eric Dumazet
  1 sibling, 0 replies; 67+ messages in thread
From: Marco Elver @ 2019-11-11 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Alan Stern, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, 11 Nov 2019 at 19:50, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 11, 2019 at 10:31 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Problem is that KASAN/KCSAN stops as soon as one issue is hit,
> > regardless of it being a false positive or not.
>
> So mayb e that - together with the known huge base of false positives
> - just means that KCSAN needs some more work before it can be used as
> a basis for sending out patches.
>
> Maybe the reporting needs to create a hash of the location, and report
> once per location? Or something like that.
>
> Maybe KCSAN needs a way to filter out known false positives on a KCSAN
> side, without having to change the source for a tool that gives too
> much noise?
>
> > If we do not annotate the false positive, the real issues might be
> > hidden for years.
>
> I don't think "change the kernel source for a tool that isn't good
> enough" is the solution.
>
> > There is no pattern really, only a lot of noise (small ' bugs'  that
> > have no real impact)
>
> Yeah, if it hasn't shown any real bugs so far, that just strengthens
> the "it needs much fewer false positives to be useful".
>
> KASAN and lockdep can afford to stop after the first problem, because
> the problems they report - and the additional annotations you might
> want to add - are quality problems and annotations.

There is a fundamental limitation here. As I understand, in an ideal
world we'd only find true logic bugs, *race conditions*, first, and
*data races* later. Some data races are also race conditions, which a
tool like KCSAN can report. However, not all race conditions are data
races and vice-versa.

The intent is to report data races according to the LKMM. KCSAN
currently does that. On syzbot, we do not even report all data races
because we use a very conservative config, to filter things like data
races between plain and ONCE/atomic accesses, etc. A data race
detector can only work at the memory model/language level.

Deeper analysis, to find only race conditions, requires conveying the
intended logic and patterns to a tool. This requires 1) the developer
either writing a spec or model of their code, and then 2) the tool
verifying that the implementation matches.  People have done this for
small bits of code using model checkers (and other formal methods),
but this just doesn't scale!

KCSAN finds real problems today. Maybe not all of them are race
conditions, but they are data races. We already have several options
to filter data races, and will keep adding more. However, a tool that
magically proves that there are no concurrency related logic bugs is
an entirely different beast.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 18:50               ` Linus Torvalds
  2019-11-11 18:59                 ` Marco Elver
@ 2019-11-11 18:59                 ` Eric Dumazet
  1 sibling, 0 replies; 67+ messages in thread
From: Eric Dumazet @ 2019-11-11 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 10:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> Yeah, if it hasn't shown any real bugs so far, that just strengthens
> the "it needs much fewer false positives to be useful".
>

It definitely has shown real bugs, some of them being still addressed
(not yet fixed)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 18:44               ` Eric Dumazet
@ 2019-11-11 19:00                 ` Linus Torvalds
  2019-11-11 19:13                   ` Eric Dumazet
  2019-11-11 23:51                   ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 19:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 10:44 AM Eric Dumazet <edumazet@google.com> wrote:
>
> An interesting case is the race in ksys_write()

Not really.

> if (ppos) {
>      pos = *ppos; // data-race

That code uses "fdget_pos().

Which does mutual exclusion _if_ the file is something we care about
pos for, and if it has more than one process using it.

Basically the rule there is that we don't care about the data race in
certain circumstances. We don't care about non-regular files, for
example, because those are what POSIX gives guarantees for.

(We have since moved towards FMODE_STREAM handling instead of the
older FMODE_ATOMIC_POS which does this better, and it's possible we
should get rid of the FMODE_ATOMIC_POS behavior in favor of
FMODE_STREAM entirely)

Again, that's pretty hard to tell something like KCSAN.

Of course, it's then questionable whether our rules for not caring are
necessarily the _right_ rules for not caring. For example, if you have
threads, the "more than one process opening it" doesn't trigger. It's
literally just atomicity across processes that we guarantee. That's
certainly a bit questionable. But that's a higher-level decision.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 19:00                 ` Linus Torvalds
@ 2019-11-11 19:13                   ` Eric Dumazet
  2019-11-11 20:43                     ` Linus Torvalds
  2019-11-11 23:51                   ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-11 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 11:01 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 11, 2019 at 10:44 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > An interesting case is the race in ksys_write()
>
> Not really.
>
> > if (ppos) {
> >      pos = *ppos; // data-race
>
> That code uses "fdget_pos().
>
> Which does mutual exclusion _if_ the file is something we care about
> pos for, and if it has more than one process using it.
>
> Basically the rule there is that we don't care about the data race in
> certain circumstances. We don't care about non-regular files, for
> example, because those are what POSIX gives guarantees for.
>
> (We have since moved towards FMODE_STREAM handling instead of the
> older FMODE_ATOMIC_POS which does this better, and it's possible we
> should get rid of the FMODE_ATOMIC_POS behavior in favor of
> FMODE_STREAM entirely)
>
> Again, that's pretty hard to tell something like KCSAN.

Well, this is hard to explain to humans... Probably less than 10 on
this planet could tell that.

What about this other one, it looks like multiple threads can
manipulate tsk->min_flt++; at the same time  in faultin_page()

Should we not care, or should we mirror min_flt with a second
atomic_long_t, or simply convert min_flt to atomic_long_t ?

BUG: KCSAN: data-race in __get_user_pages / __get_user_pages

read to 0xffff8880b0b8f650 of 8 bytes by task 11553 on cpu 1:
 faultin_page mm/gup.c:653 [inline]
 __get_user_pages+0x78f/0x1160 mm/gup.c:845
 __get_user_pages_locked mm/gup.c:1023 [inline]
 get_user_pages_remote+0x206/0x3e0 mm/gup.c:1163
 process_vm_rw_single_vec mm/process_vm_access.c:109 [inline]
 process_vm_rw_core.isra.0+0x3a4/0x8c0 mm/process_vm_access.c:216
 process_vm_rw+0x1c4/0x1e0 mm/process_vm_access.c:284
 __do_sys_process_vm_writev mm/process_vm_access.c:306 [inline]
 __se_sys_process_vm_writev mm/process_vm_access.c:301 [inline]
 __x64_sys_process_vm_writev+0x8b/0xb0 mm/process_vm_access.c:301
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff8880b0b8f650 of 8 bytes by task 11531 on cpu 0:
 faultin_page mm/gup.c:653 [inline]
 __get_user_pages+0x7b1/0x1160 mm/gup.c:845
 __get_user_pages_locked mm/gup.c:1023 [inline]
 get_user_pages_remote+0x206/0x3e0 mm/gup.c:1163
 process_vm_rw_single_vec mm/process_vm_access.c:109 [inline]
 process_vm_rw_core.isra.0+0x3a4/0x8c0 mm/process_vm_access.c:216
 process_vm_rw+0x1c4/0x1e0 mm/process_vm_access.c:284
 __do_sys_process_vm_writev mm/process_vm_access.c:306 [inline]
 __se_sys_process_vm_writev mm/process_vm_access.c:301 [inline]
 __x64_sys_process_vm_writev+0x8b/0xb0 mm/process_vm_access.c:301
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11531 Comm: syz-executor.4 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 19:13                   ` Eric Dumazet
@ 2019-11-11 20:43                     ` Linus Torvalds
  2019-11-11 20:46                       ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 20:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 11:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> What about this other one, it looks like multiple threads can
> manipulate tsk->min_flt++; at the same time  in faultin_page()

Yeah, maybe we could have some model for marking "this is statistics,
doesn't need to be exact".

> Should we not care, or should we mirror min_flt with a second
> atomic_long_t, or simply convert min_flt to atomic_long_t ?

Definitely not make it atomic. Those are expensive, and there's no point.

                    Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 20:43                     ` Linus Torvalds
@ 2019-11-11 20:46                       ` Linus Torvalds
  2019-11-11 21:53                         ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 20:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 12:43 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yeah, maybe we could have some model for marking "this is statistics,
> doesn't need to be exact".

Side note: that marking MUST NOT be "READ_ONCE + WRITE_ONCE", because
that makes gcc create horrible code, and only makes the race worse.

At least with a regular add, it might stay as a single r-m-w
instruction on architectures that have that, and makes the quality of
the statistics slightly better (no preemption etc).

So that's an excellent example of where changing code to use
WRITE_ONCE actually makes the code objectively worse in practice -
even if it might be the same in theory.

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 20:46                       ` Linus Torvalds
@ 2019-11-11 21:53                         ` Eric Dumazet
  0 siblings, 0 replies; 67+ messages in thread
From: Eric Dumazet @ 2019-11-11 21:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 12:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Nov 11, 2019 at 12:43 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Yeah, maybe we could have some model for marking "this is statistics,
> > doesn't need to be exact".
>
> Side note: that marking MUST NOT be "READ_ONCE + WRITE_ONCE", because
> that makes gcc create horrible code, and only makes the race worse.
>
> At least with a regular add, it might stay as a single r-m-w
> instruction on architectures that have that, and makes the quality of
> the statistics slightly better (no preemption etc).
>
> So that's an excellent example of where changing code to use
> WRITE_ONCE actually makes the code objectively worse in practice -
> even if it might be the same in theory.

Yes, I believe that was the rationale of the ADD_ONCE() thing I
mentioned earlier.

I do not believe we have a solution right now ?

We have similar non atomic increments in some virtual network drivers
doing "dev->stats.tx_errors++;"  in their error path.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 19:00                 ` Linus Torvalds
  2019-11-11 19:13                   ` Eric Dumazet
@ 2019-11-11 23:51                   ` Linus Torvalds
  2019-11-12 16:50                     ` Kirill Smelkov
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-11 23:51 UTC (permalink / raw)
  To: Eric Dumazet, Al Viro, Kirill Smelkov
  Cc: Alan Stern, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

On Mon, Nov 11, 2019 at 11:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> > if (ppos) {
> >      pos = *ppos; // data-race
>
> That code uses "fdget_pos().
>
> Which does mutual exclusion _if_ the file is something we care about
> pos for, and if it has more than one process using it.

That said, the more I look at that code, the less I like it.

I have this feeling we really should get rid of FMODE_ATOMIC_POS
entirely, now that we have the much nicer FMODE_STREAM to indicate
that 'pos' really doesn't matter.

Also, the test for "file_count(file) > 1" really is wrong, in that it
means that we protect against other processes, but not other threads.

So maybe we really should do the attached thing. Adding Al and Kirill
to the cc for comments. Kirill did some fairly in-depth review of the
whole locking on f_pos, it might be good to get his comments.

Al? Note the change from

-               if (file_count(file) > 1) {
+               if ((v & FDPUT_FPUT) || file_count(file) > 1) {

in __fdget_pos(). It basically says that the threaded case also does
the pos locking.

NOTE! This is entirely untested. It might be totally broken. It passes
my "LooksSuperficiallyFine(tm)" test, but that's all I'm going to say
about the patch.

            Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1928 bytes --]

 fs/file.c          | 4 ++--
 fs/open.c          | 6 +-----
 include/linux/fs.h | 2 --
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..708e5c2b7d65 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -795,8 +795,8 @@ unsigned long __fdget_pos(unsigned int fd)
 	unsigned long v = __fdget(fd);
 	struct file *file = (struct file *)(v & ~3);
 
-	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		if (file_count(file) > 1) {
+	if (file && !(file->f_mode & FMODE_STREAM)) {
+		if ((v & FDPUT_FPUT) || file_count(file) > 1) {
 			v |= FDPUT_POS_UNLOCK;
 			mutex_lock(&file->f_pos_lock);
 		}
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..5c68282ea79e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -771,10 +771,6 @@ static int do_dentry_open(struct file *f,
 		f->f_mode |= FMODE_WRITER;
 	}
 
-	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
-	if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
-		f->f_mode |= FMODE_ATOMIC_POS;
-
 	f->f_op = fops_get(inode->i_fop);
 	if (WARN_ON(!f->f_op)) {
 		error = -ENODEV;
@@ -1256,7 +1252,7 @@ EXPORT_SYMBOL(nonseekable_open);
  */
 int stream_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS);
+	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
 	filp->f_mode |= FMODE_STREAM;
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0d909d35763..a7c3f6dd5701 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -148,8 +148,6 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is opened with O_PATH; almost nothing can be done with it */
 #define FMODE_PATH		((__force fmode_t)0x4000)
 
-/* File needs atomic accesses to f_pos */
-#define FMODE_ATOMIC_POS	((__force fmode_t)0x8000)
 /* Write access to underlying fs */
 #define FMODE_WRITER		((__force fmode_t)0x10000)
 /* Has read method(s) */

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 23:51                   ` Linus Torvalds
@ 2019-11-12 16:50                     ` Kirill Smelkov
  2019-11-12 17:23                       ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill Smelkov @ 2019-11-12 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Al Viro, Alan Stern, Marco Elver, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 03:51:03PM -0800, Linus Torvalds wrote:
> On Mon, Nov 11, 2019 at 11:00 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > > if (ppos) {
> > >      pos = *ppos; // data-race
> >
> > That code uses "fdget_pos().
> >
> > Which does mutual exclusion _if_ the file is something we care about
> > pos for, and if it has more than one process using it.
> 
> That said, the more I look at that code, the less I like it.
> 
> I have this feeling we really should get rid of FMODE_ATOMIC_POS
> entirely, now that we have the much nicer FMODE_STREAM to indicate
> that 'pos' really doesn't matter.
> 
> Also, the test for "file_count(file) > 1" really is wrong, in that it
> means that we protect against other processes, but not other threads.
> 
> So maybe we really should do the attached thing. Adding Al and Kirill
> to the cc for comments. Kirill did some fairly in-depth review of the
> whole locking on f_pos, it might be good to get his comments.
> 
> Al? Note the change from
> 
> -               if (file_count(file) > 1) {
> +               if ((v & FDPUT_FPUT) || file_count(file) > 1) {
> 
> in __fdget_pos(). It basically says that the threaded case also does
> the pos locking.
> 
> NOTE! This is entirely untested. It might be totally broken. It passes
> my "LooksSuperficiallyFine(tm)" test, but that's all I'm going to say
> about the patch.

Linus, thanks for asking. Here is my feedback after quickly rechecking
this topic:

Your patch does two things:

1) switch file->f_pos locking enable from N(processes-using-the-file) > 1
   to N("threads"-using-the-file) > 1.

2) switch `if FMODE_ATOMIC_POS` to `if !FMODE_STREAM`.

About "1": I think it should be a good thing to do with one of the
reasons being security: not to allow a processes to pass-by range
based access restrictions. Let me quote your email on this topic from
April: (https://lore.kernel.org/linux-fsdevel/CAHk-=wg6Pn4M+7awA5WrHv2vVc5iLRL29ueG-NSwOCAyVT-OYQ@mail.gmail.com)

    """
    I do think that moving to a model where we wither have a
    (properly locked) file position or no pos pointer at all is the right
    model (ie I'd really like to get rid of the mixed case), but there
    might be some practical problem that makes it impractical.
    
    Because the *real* problem with the mixed case is not "insane people
    who do bad things might get odd results". No, the real problem with
    the mixed case is that it could be a security issue (ie: one process
    intentionally changes the file position just as another process is
    going a 'read' and then avoids some limit test because the limit test
    was done using the old 'pos' value but the actual IO was done using
    the new one).
    """

The same logic applies if it is not 2 processes, but 2 threads:
thread T2 adjusts file position racily to thread T1 while T1 is doing
read syscall with the end result that T1 read could access file range
that it should not be allowed to access.

So with that reasoning I think we should do "1" anyway now.

By the way on "1" topic I suspect there is a race of how
`N(file-users) > 1` check is done: file_count(file) is
atomic_long_read(&file->f_count), but let's think on how that atomic
read is positioned wrt another process creation: I did not studied in
detail, so I might be wrong here, but offhand it looks like there is no
synchronization. This way when another processes is being created,
things could align so that the check could be performed while
file->f_count=1 with second process being created right after that. The
end result is that the first process is doing sysread _without_ locking
f_pos_lock, while the second process could do syslseek/whatever in
parallel - it will be locking f_pos_lock, but since P1 did not took that
lock, P2 would be taking f_pos_lock without blocking. Yes, __fget_light
(called by fdget_pos) says that "You must not clone the current task in
between the calls to fget_light and fput_light", but if original process
is multithreaded, a second thread might call clone while the first
thread is in sysread. The task of the second thread is not task of T1,
but since, as I suspect, both T1 and T2 share file table, the invariant
of fdget_light will become broken with end result that the same
"race-to-bypass-range-based-access" scenario is possible.

Once again, maybe I'm wrong here and miss some detail and there is some
synchronization in between file_count() call and second process
creation, but for me it is not clear from just looking at __fdget_pos().

The situation a bit reminds me another race I fixed recently:
https://github.com/python/cpython/commit/c5abd63e94fc. There the code
was considered to be race-free for a long time, and original author
actually argued with someone defending it to be race-free. That code is
indeed race-free if all objects are created beforehand, but once objects
become allocated/deallocated dynamically (processes in our case) the
race starts to be there.

So talking about the kernel I would also review the possibility of
file_count wrt clone race once again.

----

About "2": I generally agree with the direction, but I think the kernel
is not yet ready for this switch. Let me quote myself:
(https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru)

    """
    And I appreciate if people could help at least somehow with "getting rid
    of mixed case entirely" (i.e. always lock f_pos_lock on !FMODE_STREAM),
    because this transition starts to diverge from my particular use-case
    too far. To me it makes sense to do that transition as follows:
    
    - convert nonseekable_open -> stream_open via stream_open.cocci;
    - audit other nonseekable_open calls and convert left users that truly
      don't depend on position to stream_open;
    - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
      will cover pipes and sockets), or maybe convert pipes and sockets to
      FMODE_STREAM manually;
    - extend stream_open.cocci to analyze file_operations that use no_llseek
      or noop_llseek, but do not use nonseekable_open or alloc_file_pseudo.
      This might find files that have stream semantic but are opened
      differently;
    - extend stream_open.cocci to analyze file_operations whose .read/.write
      do not use ppos at all (independently of how file was opened);
    - ...
    - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
      !FMODE_STREAM;
    - gather bug reports for deadlocked read/write and convert missed cases
      to FMODE_STREAM, probably extending stream_open.cocci along the road
      to catch similar cases.
    
    i.e. always take f_pos_lock unless a file is explicitly marked as being
    stream, and try to find and cover all files that are streams.
    """

Basically there are at least several non-regular files - like e.g.
pipes, fifos, sockets, etc, that are currently not marked with
FMODE_STREAM but on which f_pos_lock is currently _not_ locked as your
original commit 9c225f2655e3 (vfs: atomic f_pos accesses as per POSIX)
added FMODE_ATOMIC_POS only to regular files, not those. So if we just
start locking on `!(f->f_mode & FMODE_STREAM)` without changing other
parts of the kernel, those pipes, fifos, sockets, etc will start to take
f_pos_lock on read/write, which e.g. for pipes would be "only"
performance regression, but for sockets - due to them being duplex
channels - could lead to read/write deadlocks similar to the deadlocks
described in 10dce8af3422 (fs: stream_open - opener for stream-like
files so that read and write can run simultaneously without deadlock).

The plan should be thus that we should go through the kernel and first
mark all those non-regular files that have stream semantic with
FMODE_STREAM. And only then we should be able to safely switch
`if FMODE_ATOMIC_POS` to `if !FMODE_STREAM` in IO syscalls.

I apologize for being silent on this stream_open conversion topic.
I'm currently busy fixing another deadlock related to Python GIL in my
wendelin.core filesystem client, where one client thread, in cooperation
with filesystem server, is responsible for providing isolated filesystem
views corresponding to different database states, while the pagecache is
practically shared for all views. You can check e.g. here, in case you are
curious, what this is about:
https://pypi.org/project/pygolang/#id24
https://pypi.org/project/pygolang/#cython-nogil-api

Hope it helps a bit,
Kirill


>  fs/file.c          | 4 ++--
>  fs/open.c          | 6 +-----
>  include/linux/fs.h | 2 --
>  3 files changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 3da91a112bab..708e5c2b7d65 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -795,8 +795,8 @@ unsigned long __fdget_pos(unsigned int fd)
>  	unsigned long v = __fdget(fd);
>  	struct file *file = (struct file *)(v & ~3);
>  
> -	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> -		if (file_count(file) > 1) {
> +	if (file && !(file->f_mode & FMODE_STREAM)) {
> +		if ((v & FDPUT_FPUT) || file_count(file) > 1) {
>  			v |= FDPUT_POS_UNLOCK;
>  			mutex_lock(&file->f_pos_lock);
>  		}
> diff --git a/fs/open.c b/fs/open.c
> index b62f5c0923a8..5c68282ea79e 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -771,10 +771,6 @@ static int do_dentry_open(struct file *f,
>  		f->f_mode |= FMODE_WRITER;
>  	}
>  
> -	/* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
> -	if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
> -		f->f_mode |= FMODE_ATOMIC_POS;
> -
>  	f->f_op = fops_get(inode->i_fop);
>  	if (WARN_ON(!f->f_op)) {
>  		error = -ENODEV;
> @@ -1256,7 +1252,7 @@ EXPORT_SYMBOL(nonseekable_open);
>   */
>  int stream_open(struct inode *inode, struct file *filp)
>  {
> -	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS);
> +	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
>  	filp->f_mode |= FMODE_STREAM;
>  	return 0;
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..a7c3f6dd5701 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -148,8 +148,6 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File is opened with O_PATH; almost nothing can be done with it */
>  #define FMODE_PATH		((__force fmode_t)0x4000)
>  
> -/* File needs atomic accesses to f_pos */
> -#define FMODE_ATOMIC_POS	((__force fmode_t)0x8000)
>  /* Write access to underlying fs */
>  #define FMODE_WRITER		((__force fmode_t)0x10000)
>  /* Has read method(s) */

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 16:50                     ` Kirill Smelkov
@ 2019-11-12 17:23                       ` Linus Torvalds
  2019-11-12 17:36                         ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-12 17:23 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Eric Dumazet, Al Viro, Alan Stern, Marco Elver, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 8:50 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> The same logic applies if it is not 2 processes, but 2 threads:
> thread T2 adjusts file position racily to thread T1 while T1 is doing
> read syscall with the end result that T1 read could access file range
> that it should not be allowed to access.

Well, I think we actually always copy the file position before we pass
it down. So everybody always _uses_ their own private pointer, and the
race is only in the "read original value" vs "write new value back".

You had a patch that passed the address of file->f_pos down in your
original series iirc, but I NAK'ed that one. Exactly because it made
me nervous.

> By the way on "1" topic I suspect there is a race of how
> `N(file-users) > 1` check is done: file_count(file) is
> atomic_long_read(&file->f_count), but let's think on how that atomic
> read is positioned wrt another process creation: I did not studied in
> detail, so I might be wrong here, but offhand it looks like there is no
> synchronization.

Well, that's one reason to add the test for threads - it also gets rid
of that race. Because without threads, there's nothing else that could
access - or fork - a "N(file-users) == 1" file but us.

> So talking about the kernel I would also review the possibility of
> file_count wrt clone race once again.

See above. That goes away with the test for FDPUT_FPUT.

> About "2": I generally agree with the direction, but I think the kernel
> is not yet ready for this switch. Let me quote myself:

Hmm. I thought we already then applied all the patches that marked
things that didn't use f_pos as FMODE_STREAM. Including pipes and
sockets etc.

But if we didn't - and no, I didn't double-check now either - then
obviously that part of the patch can't be applied now.

             Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 17:23                       ` Linus Torvalds
@ 2019-11-12 17:36                         ` Linus Torvalds
  2019-11-17 18:56                           ` Kirill Smelkov
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-12 17:36 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Eric Dumazet, Al Viro, Alan Stern, Marco Elver, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 9:23 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. I thought we already then applied all the patches that marked
> things that didn't use f_pos as FMODE_STREAM. Including pipes and
> sockets etc.
>
> But if we didn't - and no, I didn't double-check now either - then
> obviously that part of the patch can't be applied now.

Ok, looking at it now.

Yeah, commit c5bf68fe0c86 ("*: convert stream-like files from
nonseekable_open -> stream_open") did the scripted thing, but it only
did it for nonseekable_open, not for the more complicated cases.

So yup, you're right - we'd need to at least do the pipe/socket case too.

What happens if the actual conversion part (nonseekable_open ->
stream_open) is removed from the cocci script, and it's used to only
find "read/write doesn't use f_pos" cases?

Or maybe trigger on '.llseek = no_llseek'?

                 Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-10 19:12   ` Linus Torvalds
  2019-11-10 19:20     ` Linus Torvalds
@ 2019-11-12 19:14     ` Alan Stern
  2019-11-12 19:47       ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Alan Stern @ 2019-11-12 19:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Sun, 10 Nov 2019, Linus Torvalds wrote:

> So I do think LKMM should say "writes of the same value must obviously
> result in the same value in memory afterwards", if it doesn't already.
> That's a somewhat trivial case, it's just a special case of the
> single-value atomicity issue. I thought the LKMM had that already: if
> you have writes of 'x' and 'y' to a variable from two CPU's, all CPU's
> are supposed to see _either_ 'x' or 'y', they can't ever see a mix of
> the two.

Not exactly -- the LKMM says if you have concurrent plain writes of 'x'
and 'y' to a variable from two CPUs then the writes race, and in the
presence of a data race the LKMM doesn't guarantee anything.  This may
not be the best approach, but it was all we could realistically come up
with.

Of course, if you have concurrent non-plain writes, including things
like WRITE_ONCE(), then indeed the LKMM does say that all CPUs using
non-plain reads will see either 'x' or 'y', nothing else.

> And yes, we've depended on that single-value atomicity historically.
> 
> The 'x' and 'y' have the same value is just a special case of that
> general issue - if two threads write the same value, no CPU can ever
> see anything but that value (or the original one). So in that sense,
> fundamentally the same value write cannot race with itself.

I'm starting to think that this issue is only one aspect of a whole 
larger discussion...

> But that LKMM rule is separate from a rule about a statistical tool like KCSAN.
> 
> Should KCSAN then ignore writes of the same value?
> 
> Maybe.
> 
> Because while that "variable++" data race with the same value is real,
> the likelihood of hitting it is small, so a statistical tool like
> KCSAN might as well ignore it - the tool would show the data race when
> the race _doesn't_ happen, which would be the normal case anyway, and
> would be the reason why the race hadn't been noticed by a normal human
> being.

That's not how KCSAN works, if I understand it correctly.  It never 
shows races that don't happen -- the only way it knows a race is 
present is by statistically waiting to see that one has occurred.

> So practically speaking, we might say "concurrent writes of the same
> value aren't data races" for KCSAN, even though they _could_ be data
> races.
> 
> And this is where WRITE_IDEMPOTENT would make a possible difference.
> In particular, if we make the optimization to do the "read and only
> write if changed", two CPU's doing this concurrently would do
> 
>    READ 0
>    WRITE 1
> 
> (for a "flag goes from 0->1" transition) and from a tool perspective,
> it would be very hard to know whether this is a race (two threads
> doing "variable++") or not (two threads setting a sticky flag).
> 
> So WRITE_IDEMPOTENT would then disambiguate that choice. See what I'm saying?

Let me broaden the discussion.  Concurrent writes of the same value are 
only one type of example; the kernel has plenty of other low-level 
races.

One could be the thing you brought up earlier: Suppose the compiler
decides to use the "write only if changed" transformation, so that the
code generated for the sticky write:

	x = 1;

ends up being what you would expect to see for:

	if (x != 1)
		x = 1;

Then two CPUs executing this code concurrently could still be flagged
as racing by KCSAN -- if not because of the writes then because the
read on one CPU would be perceived as racing with the write on the
other!

So changing the LKMM to say that concurrent writes of the same value 
don't race will only fix a part of the overall problem.

What we really need is a way to tell KCSAN that yes, we know certain 
accesses can race with each other at the hardware level, but at the 
source level we don't care (and we don't want KCSAN to complain about 
those accesses).  A good example is an approximate counter.  If we 
don't care whether the total sum is exactly correct then we don't mind 
if a few counts get lost because two CPUs executing

	x++;

happened to race with each other.

What we _do_ care about in these situations is:

	These accesses should be atomic (a 64-bit increment on a
	32-bit architecture really could run into trouble if there 
	was a race);

	The accesses should not write to memory outside the variable
	they affect (an architecture incapable of doing 16-bit writes
	should not do a 32-bit load/mask/store);

	The code should behave gracefully in the presence of hardware
	races (no C11 undefined behavior!);

	The object code generated by the compiler shouldn't stink.

And probably a few other things that escape me at the moment.  

READ_ONCE and WRITE_ONCE provide all of those guarantees except the
last one.  Normal reads and writes aren't suitable because of the
third requirement (and maybe the first).

But what about C11 relaxed atomic reads and writes?  They are what the
compiler writers _expect_ people to use in situations like this.  Do
you happen to know whether gcc is any good with them?  It might make
sense to define WRITE_RELAXED and READ_RELAXED analogously to
WRITE_ONCE and READ_ONCE but in terms of relaxed atomic accesses
instead of volatile accesses.

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 19:14     ` Alan Stern
@ 2019-11-12 19:47       ` Linus Torvalds
  2019-11-12 20:29         ` Alan Stern
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-12 19:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 11:14 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> One could be the thing you brought up earlier: Suppose the compiler
> decides to use the "write only if changed" transformation, so that the
> code generated for the sticky write:
>
>         x = 1;
>
> ends up being what you would expect to see for:
>
>         if (x != 1)
>                 x = 1;

That is exactly the kind of  crap that would make me go "use the flag
to disable that invalid optimization, or don't use the compiler".

We already do -param=allow-store-data-races=0

The C standards body sadly has a very bad track record on this kind of
thing, where they have allowed absolutely insane extensions of "that's
undefined" in the name of making C a much worse language (they say "to
compete with Fortran", but it's the same thing).

I have talked to some people who have tried to change that course, but
they are fed up with the standards body too, and it's fighting
windmills.

Which is why I don't even  bother. The C standard language-lawyering
is simply not interesting to me. Yes, there are too many people who do
it, and I don't care.

For the kernel, we basically do not accept "that's undefined behavior,
I might generate odd code".

If the compiler can statitcally give an error for it, then that's one
thing, and we'd be ok with that. But the kind of mindset where people
think it's ok to have the compiler read the standard cross-eyed and
change the obvious meaning of the code "because it's undefined
behavior" is to me a sign of a cn incompetent compiler writer, and I
am not at all interested in playing that game.

Seriously.

I wish somebody on the C standard had the back-bone to say "undefined
behavior is not acceptable", and just say that the proper
optimizations are ones where you transform the code the obvious
straightforward way, and then you only do optimizations that are based
on that code and you can prove do not change semantics.

You can't add reads that weren't there.

But you can look at code that did a read, and then wrote back what you
can prove is the same value, and say "that write is redundant, just
looking at the code".

See the difference?

One approach makes up shit. The other approach looks at the code AS
WRITTEN and can prove "that's stupid, I can do it better, and I can
show why it makes no difference".

So you can change "i++; i++;" to "i +=2", even if "i" is not a private
variable. Did that remove a write? Yes it did. But it really falls
under the "I just improved on the code".

But you can *not* do the insane things that type-based aliasing do
(lack the "prove it's the same" part).

Because when we notice that in the kernel, we turn it off. It's why we have

 -fno-strict-overflow
 -fno-merge-all-constants
 -fno-strict-aliasing
 -fno-delete-null-pointer-checks
 --param=allow-store-data-races=0

and probably others. Because the standard is simply wrong when you
care about reliability.

> But what about C11 relaxed atomic reads and writes?

Again, I'm not in the least interested in the C11 standard
language-lawyering, because it has shown itself to not be useful.

Stop bringing up the "what if" cases. They aren't interesting. If a
compiler turns a single write into some kind of conditional write, or
if the compiler creates dummy writes, the compiler is garbage. No
amount of "but but but C11" is at all relevant.

What a compiler can do is:

 - generate multiple (and speculative) reads

 - combine writes to the same location (non-speciulatively)

 - take advantage of actual reads in the source code to do
transformations that are obvious (ie "oh, you read value X, you tested
by Y was set, now you write it back again, but clearly the value
didn't change so I can avoid the write").

so yes, a compiler can remove a _redundant_ write, and if the SOURCE
CODE has the read in it and the compiler decides "Oh, I already know
it has that value" then that's one thing.

But no, the compiler can not add data races that weren't there in the
source code and say "but C11". We're not compiling to the standard.
We're compiling to the real world.

So if the compiler just adds its own reads, I don't want to play with
that compiler. It may be appropriate in situations where we don't have
threads, we don't have security issues, and we don't have various
system and kernel concerns, but it's not appropriate for a kernel.

It really is that simple.

This is in no way different from other language lawyering, ie the
whole "signed arithmetic overflows are undefined, so i can do
optimization X" or "I can silently remove the NULL pointer check
because you accessed it before and that invoced undefined behavior, so
now I can do anthing".

Those optimizations may be valid in other projects. They are not valid
for the kernel.

Stop bringing them up. They are irrelevant. We will keep adding the
options to tell the compiler "no, we're not your toy benchmark, we do
real work, and that optimization is dangerous".

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 19:47       ` Linus Torvalds
@ 2019-11-12 20:29         ` Alan Stern
  2019-11-12 20:58           ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Alan Stern @ 2019-11-12 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, 12 Nov 2019, Linus Torvalds wrote:

> On Tue, Nov 12, 2019 at 11:14 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > One could be the thing you brought up earlier: Suppose the compiler
> > decides to use the "write only if changed" transformation, so that the
> > code generated for the sticky write:
> >
> >         x = 1;
> >
> > ends up being what you would expect to see for:
> >
> >         if (x != 1)
> >                 x = 1;
> 
> That is exactly the kind of  crap that would make me go "use the flag
> to disable that invalid optimization, or don't use the compiler".
> 
> We already do -param=allow-store-data-races=0
> 
> The C standards body sadly has a very bad track record on this kind of
> thing, where they have allowed absolutely insane extensions of "that's
> undefined" in the name of making C a much worse language (they say "to
> compete with Fortran", but it's the same thing).
> 
> I have talked to some people who have tried to change that course, but
> they are fed up with the standards body too, and it's fighting
> windmills.
> 
> Which is why I don't even  bother. The C standard language-lawyering
> is simply not interesting to me. Yes, there are too many people who do
> it, and I don't care.
> 
> For the kernel, we basically do not accept "that's undefined behavior,
> I might generate odd code".
> 
> If the compiler can statitcally give an error for it, then that's one
> thing, and we'd be ok with that. But the kind of mindset where people
> think it's ok to have the compiler read the standard cross-eyed and
> change the obvious meaning of the code "because it's undefined
> behavior" is to me a sign of a cn incompetent compiler writer, and I
> am not at all interested in playing that game.
> 
> Seriously.
> 
> I wish somebody on the C standard had the back-bone to say "undefined
> behavior is not acceptable", and just say that the proper
> optimizations are ones where you transform the code the obvious
> straightforward way, and then you only do optimizations that are based
> on that code and you can prove do not change semantics.
> 
> You can't add reads that weren't there.
> 
> But you can look at code that did a read, and then wrote back what you
> can prove is the same value, and say "that write is redundant, just
> looking at the code".
> 
> See the difference?
> 
> One approach makes up shit. The other approach looks at the code AS
> WRITTEN and can prove "that's stupid, I can do it better, and I can
> show why it makes no difference".
> 
> So you can change "i++; i++;" to "i +=2", even if "i" is not a private
> variable. Did that remove a write? Yes it did. But it really falls
> under the "I just improved on the code".
> 
> But you can *not* do the insane things that type-based aliasing do
> (lack the "prove it's the same" part).
> 
> Because when we notice that in the kernel, we turn it off. It's why we have
> 
>  -fno-strict-overflow
>  -fno-merge-all-constants
>  -fno-strict-aliasing
>  -fno-delete-null-pointer-checks
>  --param=allow-store-data-races=0
> 
> and probably others. Because the standard is simply wrong when you
> care about reliability.
> 
> > But what about C11 relaxed atomic reads and writes?
> 
> Again, I'm not in the least interested in the C11 standard
> language-lawyering, because it has shown itself to not be useful.
> 
> Stop bringing up the "what if" cases. They aren't interesting. If a
> compiler turns a single write into some kind of conditional write, or
> if the compiler creates dummy writes, the compiler is garbage. No
> amount of "but but but C11" is at all relevant.
> 
> What a compiler can do is:
> 
>  - generate multiple (and speculative) reads
> 
>  - combine writes to the same location (non-speciulatively)
> 
>  - take advantage of actual reads in the source code to do
> transformations that are obvious (ie "oh, you read value X, you tested
> by Y was set, now you write it back again, but clearly the value
> didn't change so I can avoid the write").
> 
> so yes, a compiler can remove a _redundant_ write, and if the SOURCE
> CODE has the read in it and the compiler decides "Oh, I already know
> it has that value" then that's one thing.
> 
> But no, the compiler can not add data races that weren't there in the
> source code and say "but C11". We're not compiling to the standard.
> We're compiling to the real world.
> 
> So if the compiler just adds its own reads, I don't want to play with
> that compiler. It may be appropriate in situations where we don't have
> threads, we don't have security issues, and we don't have various
> system and kernel concerns, but it's not appropriate for a kernel.
> 
> It really is that simple.
> 
> This is in no way different from other language lawyering, ie the
> whole "signed arithmetic overflows are undefined, so i can do
> optimization X" or "I can silently remove the NULL pointer check
> because you accessed it before and that invoced undefined behavior, so
> now I can do anthing".
> 
> Those optimizations may be valid in other projects. They are not valid
> for the kernel.
> 
> Stop bringing them up. They are irrelevant. We will keep adding the
> options to tell the compiler "no, we're not your toy benchmark, we do
> real work, and that optimization is dangerous".

Linus, calm down and read what I actually wrote.  That optimization was 
a straw man.

I'm trying to solve a real problem: How to tell KCSAN and the compiler
that we don't care about certain access patterns which result in
hardware-level races, and how to guarantee that the object code will
still work correctly when those races occur.  Not telling the compiler 
anything is a head-in-the-sand approach that will be dangerous in the 
long run.

We could annotate all those accesses with READ_ONCE/WRITE_ONCE.  You 
don't like this approach, mainly because gcc produces lousy object code 
for volatile accesses.

My question was whether gcc does a better job with C11 relaxed atomic
accesses.  If it does we could define READ_RELAXED/WRITE_RELAXED
analogously to READ_ONCE/WRITE_ONCE, and do the annotations that way.  
The resulting object code certainly ought to be robust against races,
but I don't know what the quality would be like.

On the other hand, if the compiler generates lousy code even for C11 
relaxed atomic accesses, you've got a good case to go complain to the 
GCC maintainers about.  They can't say they don't want to support such 
things, because it's in the spec.

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 20:29         ` Alan Stern
@ 2019-11-12 20:58           ` Linus Torvalds
  2019-11-12 21:13             ` Linus Torvalds
  2019-11-12 21:48             ` Alan Stern
  0 siblings, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-12 20:58 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 12:29 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> I'm trying to solve a real problem: How to tell KCSAN and the compiler
> that we don't care about certain access patterns which result in
> hardware-level races, and how to guarantee that the object code will
> still work correctly when those races occur.  Not telling the compiler
> anything is a head-in-the-sand approach that will be dangerous in the
> long run.

I don't actually know how KCSAN ends up reading the annotations, but
since it's apparently not using the 'volatile' as a marker.

[ Goes off and fetches the thing ]

Ugh, that's just nasty.

Honestly, my preferred model would have been to just add a comment,
and have the reporting tool know to then just ignore it. So something
like

+               // Benign data-race on min_flt
                tsk->min_flt++;
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);

for the case that Eric mentioned - the tool would trigger on
"data-race", and the rest of the comment could/should be for humans.
Without making the code uglier, but giving the potential for a nice
leghibl.e explanation instead of a completely illegible "let's
randomly use WRITE_ONCE() here" or something like that.

Could the KCSAN code be taught to do something like that by simply not
instrumenting it? Or, as mentioned, just have the reporting logic
maybe have a list of those comments (easily generated with some
variation of "git grep -in data-race" or something) and logic to just
ignore any report that comes from a line below that kind of comment?

Because I do not see a pretty way to annotate random things like this
that actually makes the code more legible. The READ_ONCE/WRITE_ONCE
annotations have not imho improved the code quality.

                 Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 20:58           ` Linus Torvalds
@ 2019-11-12 21:13             ` Linus Torvalds
  2019-11-12 22:05               ` Marco Elver
  2019-11-12 21:48             ` Alan Stern
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-12 21:13 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 12:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Honestly, my preferred model would have been to just add a comment,
> and have the reporting tool know to then just ignore it. So something
> like
>
> +               // Benign data-race on min_flt
>                 tsk->min_flt++;
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
>
> for the case that Eric mentioned - the tool would trigger on
> "data-race", and the rest of the comment could/should be for humans.
> Without making the code uglier, but giving the potential for a nice
> leghibl.e explanation instead of a completely illegible "let's
> randomly use WRITE_ONCE() here" or something like that.

Hmm. Looking at the practicality of this, it actually doesn't look
*too* horrible.

I note that at least clang already has a "--blacklist" ability. I
didn't find a list of complete syntax for that, and it looks like it
might be just "whole functions" or "whole source files", but maybe the
clang people would be willing to add "file and line ranges" to the
blacklists?

Then you could generate the blacklist with that trivial grep before
you start the build, and -fsanitize=thread would automatically simply
not look at those lines.

For a simple first case, maybe the rule could be that the comment has
to be on the line. A bit less legible for humans, but it could be

-               tsk->min_flt++;
+               // Benign race min_flt - statistics only
+               tsk->min_flt++; // data-race

instead.

Wouldn't that be a much better annotation than having to add code?

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 20:58           ` Linus Torvalds
  2019-11-12 21:13             ` Linus Torvalds
@ 2019-11-12 21:48             ` Alan Stern
  2019-11-12 22:07               ` Eric Dumazet
  1 sibling, 1 reply; 67+ messages in thread
From: Alan Stern @ 2019-11-12 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, 12 Nov 2019, Linus Torvalds wrote:

> Honestly, my preferred model would have been to just add a comment,
> and have the reporting tool know to then just ignore it. So something
> like
> 
> +               // Benign data-race on min_flt
>                 tsk->min_flt++;
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
> 
> for the case that Eric mentioned - the tool would trigger on
> "data-race", and the rest of the comment could/should be for humans.
> Without making the code uglier, but giving the potential for a nice
> leghibl.e explanation instead of a completely illegible "let's
> randomly use WRITE_ONCE() here" or something like that.

Just to be perfectly clear, then:

Your feeling is that we don't need to tell the compiler anything at all 
about these races, because if a compiler generates code that is 
non-robust against such things then you don't want to use it for the 
kernel.

And as a corollary, the only changes you want to make to the source
code are things that tell KCSAN not to worry about these races when
they occur.

Right?

> +		// Benign data-race on min_flt
> 		tsk->min_flt++;
> 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);

I suggest grouping the accesses into classes somehow, and telling KCSAN
that races between accesses in the same class are okay but racing
accesses in different classes should trigger a warning.  That would
give the tool a better chance of finding genuine races.

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 21:13             ` Linus Torvalds
@ 2019-11-12 22:05               ` Marco Elver
  0 siblings, 0 replies; 67+ messages in thread
From: Marco Elver @ 2019-11-12 22:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Stern, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, 12 Nov 2019 at 22:13, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Nov 12, 2019 at 12:58 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Honestly, my preferred model would have been to just add a comment,
> > and have the reporting tool know to then just ignore it. So something
> > like
> >
> > +               // Benign data-race on min_flt
> >                 tsk->min_flt++;
> >                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
> >
> > for the case that Eric mentioned - the tool would trigger on
> > "data-race", and the rest of the comment could/should be for humans.
> > Without making the code uglier, but giving the potential for a nice
> > leghibl.e explanation instead of a completely illegible "let's
> > randomly use WRITE_ONCE() here" or something like that.
>
> Hmm. Looking at the practicality of this, it actually doesn't look
> *too* horrible.
>
> I note that at least clang already has a "--blacklist" ability. I
> didn't find a list of complete syntax for that, and it looks like it
> might be just "whole functions" or "whole source files", but maybe the
> clang people would be willing to add "file and line ranges" to the
> blacklists?
>
> Then you could generate the blacklist with that trivial grep before
> you start the build, and -fsanitize=thread would automatically simply
> not look at those lines.
>
> For a simple first case, maybe the rule could be that the comment has
> to be on the line. A bit less legible for humans, but it could be
>
> -               tsk->min_flt++;
> +               // Benign race min_flt - statistics only
> +               tsk->min_flt++; // data-race
>
> instead.
>
> Wouldn't that be a much better annotation than having to add code?

Thanks for the suggestion.

Right now I can't say what the most reliable way to do this for KCSAN
is. Doing this through the compiler doesn't seem possible today, but
is something to look into. An alternative is to preprocess the code
based on comments somehow.

How many variations of such comments could exist?

If it's only one or two, as a counter suggestion, would a macro not be
more reliable? A macro would provide a uniform way to document intent,
but could otherwise be a no-op. The tool would have no problems
understanding the macro. For example "APPROX(tsk->min_flt++)" or
something else that documents that the computation can be approximate
e.g. in the presence of races.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 21:48             ` Alan Stern
@ 2019-11-12 22:07               ` Eric Dumazet
  2019-11-12 22:44                 ` Alexei Starovoitov
  0 siblings, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-12 22:07 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linus Torvalds, Marco Elver, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 1:48 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Tue, 12 Nov 2019, Linus Torvalds wrote:
>
> > Honestly, my preferred model would have been to just add a comment,
> > and have the reporting tool know to then just ignore it. So something
> > like
> >
> > +               // Benign data-race on min_flt
> >                 tsk->min_flt++;
> >                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
> >
> > for the case that Eric mentioned - the tool would trigger on
> > "data-race", and the rest of the comment could/should be for humans.
> > Without making the code uglier, but giving the potential for a nice
> > leghibl.e explanation instead of a completely illegible "let's
> > randomly use WRITE_ONCE() here" or something like that.
>
> Just to be perfectly clear, then:
>
> Your feeling is that we don't need to tell the compiler anything at all
> about these races, because if a compiler generates code that is
> non-robust against such things then you don't want to use it for the
> kernel.
>

I would prefer some kind of explicit marking, instead of a comment.

Even if we prefer having a sane compiler, having these clearly
annotated can help
code readability quite a lot.

/*
 * To use when we are ok with minor races... bla bla bla
 */
static void inline add_relaxed(int *p, int x)
{
    x += __atomic_load_n(p, __ATOMIC_RELAXED);
    __atomic_store_n(p, x, __ATOMIC_RELAXED);
}

The actual implementation might depend on the compiler, and revert to something
without any constraint for old compilers  : *p += x;

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 22:07               ` Eric Dumazet
@ 2019-11-12 22:44                 ` Alexei Starovoitov
  2019-11-12 23:17                   ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2019-11-12 22:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alan Stern, Linus Torvalds, Marco Elver, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 02:07:03PM -0800, Eric Dumazet wrote:
> 
> I would prefer some kind of explicit marking, instead of a comment.
> 
> Even if we prefer having a sane compiler, having these clearly
> annotated can help
> code readability quite a lot.

Annotating every line where tsk->min_flt is used with a comment
or explicit macro seems like a lot of churn.
How about adding an attribute to a field ?
Or an attribute to a type?

clang attributes can be easily exteneded. We add bpf specific attributes
that are known to clang only when 'clang -target bpf' is used.
There could be x86 or generic attributes.
Then one can do:
typedef unsigned long __attribute__((ignore_data_race)) racy_u64;
struct task_struct { 
   racy_u64 min_flt;
};

Hopefully less churn and clear signal to clang.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 22:44                 ` Alexei Starovoitov
@ 2019-11-12 23:17                   ` Eric Dumazet
  2019-11-12 23:40                     ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-12 23:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alan Stern, Linus Torvalds, Marco Elver, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 2:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 02:07:03PM -0800, Eric Dumazet wrote:
> >
> > I would prefer some kind of explicit marking, instead of a comment.
> >
> > Even if we prefer having a sane compiler, having these clearly
> > annotated can help
> > code readability quite a lot.
>
> Annotating every line where tsk->min_flt is used with a comment
> or explicit macro seems like a lot of churn.
> How about adding an attribute to a field ?
> Or an attribute to a type?
>
> clang attributes can be easily exteneded. We add bpf specific attributes
> that are known to clang only when 'clang -target bpf' is used.
> There could be x86 or generic attributes.
> Then one can do:
> typedef unsigned long __attribute__((ignore_data_race)) racy_u64;
> struct task_struct {
>    racy_u64 min_flt;
> };
>
> Hopefully less churn and clear signal to clang.

Hmm we have the ' volatile'  attribute on jiffies, and it causes
confusion already :p

arch/x86/kernel/apic/apic.c:904:        jif_start = READ_ONCE(jiffies);
arch/x86/kernel/apic/apic.c:927:
unsigned long jif_now = READ_ONCE(jiffies);
kernel/sched/wait_bit.c:218:    unsigned long now = READ_ONCE(jiffies);
kernel/sched/wait_bit.c:232:    unsigned long now = READ_ONCE(jiffies);
kernel/time/timer.c:891:        jnow = READ_ONCE(jiffies);
kernel/time/timer.c:1681:       unsigned long now = READ_ONCE(jiffies);
net/rxrpc/conn_client.c:1111:           now = READ_ONCE(jiffies);

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 23:17                   ` Eric Dumazet
@ 2019-11-12 23:40                     ` Linus Torvalds
  2019-11-13 15:00                       ` Marco Elver
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-12 23:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Alan Stern, Marco Elver, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 3:18 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Hmm we have the ' volatile'  attribute on jiffies, and it causes
> confusion already :p

The jiffies case is partly historical, but partly also because it's
one of the very very few data structures where 99% of all uses are
unlocked and for convenience reasons we really don't want to force
those legacy cases to do anything special about it.

"jiffies" really is very special for those kinds of legacy reasons.
Look at the kinds of games we play with it on 32-bit architectures:
the "real" storage is "jiffies_64", but nobody actually wants to use
that, and because of load tearing issues it's actually hard to use
too. So what everybody _uses_ is just the low 32 bits, and 'jiffies'
isn't a real variable, it's done with linker tricks in our
vmlinux.lds.S files. So, for example, on sparc32, you find this:

    jiffies = jiffies_64 + 4;

in the vmlinux.lds.S file, because it's big-endian, and the lower 32
bits are at offset 4 from the real 64-bit variable.

Note that to actually read the "true" full 64-bit value, you have to
then call a function that does the proper sequence counter stuff etc.
But nobody really wants it, since what everybody actually _uses_ is
the "time_after(x,jiffies+10)" kind of thing, which is only done in
the wrapping "unsigned long" time base. So the odd "linker tricks with
the atomic low bits marked as a volatile data structure" is actually
exactly what we want, but people should realize that this is not
normal.

So 'jiffies' is really really special.

And absolutely nothing else should use 'volatile' on data structures
and be that special.  In the case of jiffies, the rule ends up being
that nobody should ever write to it (you write to the real
jiffies_64), and the real jiffies_64 thing gets the real locking, and
'jiffies' is a read-only volatile thing.

So "READ_ONCE()" is indeed unnecessary with jiffies, but it won't
hurt. It's not really "confusion" - there's nothing _wrong_ with using
READ_ONCE() on volatile data, but we just normally don't do volatile
data in the kernel (because our normal model is that data is never
really volatile in general - there may be locked and unlocked accesses
to it, so it's stable or volatile depending on context, and thus the
'volatile' goes on the _code_, not on the data structure)

But while jiffies READ_ONCE() accesses isn't _wrong_, it is also not
really paired with any WRITE_ONCE(), because the real update is to
technically not even to the same full data structure.

So don't look to jiffies for how to do things. It's an odd one-off.

That said, for "this might be used racily", if there are annotations
for clang to just make it shut up about one particular field in a
structure, than I think that would be ok. As long as it doesn't then
imply special code generation (outside of the checking, of course).

                 Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-11 15:10             ` Marco Elver
@ 2019-11-13  0:25               ` Paul E. McKenney
  0 siblings, 0 replies; 67+ messages in thread
From: Paul E. McKenney @ 2019-11-13  0:25 UTC (permalink / raw)
  To: Marco Elver
  Cc: Linus Torvalds, Alan Stern, Eric Dumazet, Eric Dumazet, syzbot,
	linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, LKMM Maintainers -- Akira Yokosawa

On Mon, Nov 11, 2019 at 04:10:20PM +0100, Marco Elver wrote:
> On Mon, 11 Nov 2019 at 15:31, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Nov 11, 2019 at 03:17:51PM +0100, Marco Elver wrote:
> > > On Sun, 10 Nov 2019 at 21:44, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Sun, Nov 10, 2019 at 11:20:53AM -0800, Linus Torvalds wrote:
> > > > > On Sun, Nov 10, 2019 at 11:12 AM Linus Torvalds
> > > > > <torvalds@linux-foundation.org> wrote:
> > > > > >
> > > > > > And this is where WRITE_IDEMPOTENT would make a possible difference.
> > > > > > In particular, if we make the optimization to do the "read and only
> > > > > > write if changed"
> > > > >
> > > > > It might be useful for checking too. IOW, something like KCSAN could
> > > > > actually check that if a field has an idempotent write to it, all
> > > > > writes always have the same value.
> > > > >
> > > > > Again, there's the issue with lifetime.
> > > > >
> > > > > Part of that is "initialization is different". Those writes would not
> > > > > be marked idempotent, of course, and they'd write another value.
> > > > >
> > > > > There's also the issue of lifetime at the _end_ of the use, of course.
> > > > > There _are_ interesting data races at the end of the lifetime, both
> > > > > reads and writes.
> > > > >
> > > > > In particular, if it's a sticky flag, in order for there to not be any
> > > > > races, all the writes have to happen with a refcount held, and the
> > > > > final read has to happen after the final refcount is dropped (and the
> > > > > refcounts have to have atomicity and ordering, of course). I'm not
> > > > > sure how easy something like that is model in KSAN. Maybe it already
> > > > > does things like that for all the other refcount stuff we do.
> > > > >
> > > > > But the lifetime can be problematic for other reasons too - in this
> > > > > particular case we have a union for that sticky flag (which is used
> > > > > under the refcount), and then when the final refcount is released we
> > > > > read that value (thus no data race) but because of the union we will
> > > > > now start using that field with *different* data. It becomes that RCU
> > > > > list head instead.
> > > > >
> > > > > That kind of "it used to be a sticky flag, but now the lifetime of the
> > > > > flag is over, and it's something entirely different" might be a
> > > > > nightmare for something like KCSAN. It sounds complicated to check
> > > > > for, but I have no idea what KCSAN really considers complicated or
> > > > > not.
> > > >
> > > > But will "one size fits all" be practical and useful?
> > > >
> > > > For my code, I would be happy to accept a significant "false positive"
> > > > rate to get even a probabilistic warning of other-task accesses to some
> > > > of RCU's fields.  Even if the accesses were perfect from a functional
> > > > viewpoint, they could be problematic from a performance and scalability
> > > > viewpoint.  And for something like RCU, real bugs, even those that are
> > > > very improbable, need to be fixed.
> > > >
> > > > But other code (and thus other developers and maintainers) are going to
> > > > have different needs.  For all I know, some might have good reasons to
> > > > exclude their code from KCSAN analysis entirely.
> > > >
> > > > Would it make sense for KCSAN to have per-file/subsystem/whatever flags
> > > > specifying the depth of the analysis?
> > >
> > > Just to answer this: we already have this, and disable certain files
> > > already. So it's an option if required. Just need maintainers to add
> > > KCSAN_SANITIZE := n, or KCSAN_SANITIZE_file.o := n to Makefiles, and
> > > KCSAN will simply ignore those.
> > >
> > > FWIW we now also have a config option to "ignore repeated writes with
> > > the same value". It may be a little overaggressive/imprecise in
> > > filtering data races, but anything else like the super precise
> > > analysis involving tracking lifetimes and values (and whatever else
> > > the rules would require) is simply too complex. So, the current
> > > solution will avoid reporting cases like the original report here
> > > (__alloc_file), but at the cost of maybe being a little imprecise.
> > > It's probably a reasonable trade-off, given that we have too many data
> > > races to deal with on syzbot anyway.
> >
> > Nice!
> >
> > Is this added repeated-writes analysis something that can be disabled?
> > I would prefer that the analysis of RCU complain in this case as a
> > probabilistic cache-locality warning.  If it can be disabled, please
> > let me know if there is anything that I need to do to make this happen.
> 
> It's hidden behind a Kconfig config option, and actually disabled by
> default. We can't enable/disable this on a per-file basis.
> 
> Right now, we'll just enable it on the public syzbot instance, which
> will use the most conservative config.  Of course you can still run
> your own fuzzer/stress test of choice with KCSAN and the option
> disabled. Is that enough?
> 
> Otherwise I could also just say if the symbolized top stack frame
> contains "rcu_", don't ignore -- which would be a little hacky and
> imprecise though. What do you prefer?

Tough question.  ;-)

We could try the "rcu_" trick, and if it doesn't cause problems for
others, why not?

						Thanx, Paul

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 23:40                     ` Linus Torvalds
@ 2019-11-13 15:00                       ` Marco Elver
  2019-11-13 16:57                         ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-13 15:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Alexei Starovoitov, Alan Stern, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Wed, 13 Nov 2019 at 00:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Nov 12, 2019 at 3:18 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Hmm we have the ' volatile'  attribute on jiffies, and it causes
> > confusion already :p
>
> The jiffies case is partly historical, but partly also because it's
> one of the very very few data structures where 99% of all uses are
> unlocked and for convenience reasons we really don't want to force
> those legacy cases to do anything special about it.
>
> "jiffies" really is very special for those kinds of legacy reasons.
> Look at the kinds of games we play with it on 32-bit architectures:
> the "real" storage is "jiffies_64", but nobody actually wants to use
> that, and because of load tearing issues it's actually hard to use
> too. So what everybody _uses_ is just the low 32 bits, and 'jiffies'
> isn't a real variable, it's done with linker tricks in our
> vmlinux.lds.S files. So, for example, on sparc32, you find this:
>
>     jiffies = jiffies_64 + 4;
>
> in the vmlinux.lds.S file, because it's big-endian, and the lower 32
> bits are at offset 4 from the real 64-bit variable.
>
> Note that to actually read the "true" full 64-bit value, you have to
> then call a function that does the proper sequence counter stuff etc.
> But nobody really wants it, since what everybody actually _uses_ is
> the "time_after(x,jiffies+10)" kind of thing, which is only done in
> the wrapping "unsigned long" time base. So the odd "linker tricks with
> the atomic low bits marked as a volatile data structure" is actually
> exactly what we want, but people should realize that this is not
> normal.
>
> So 'jiffies' is really really special.
>
> And absolutely nothing else should use 'volatile' on data structures
> and be that special.  In the case of jiffies, the rule ends up being
> that nobody should ever write to it (you write to the real
> jiffies_64), and the real jiffies_64 thing gets the real locking, and
> 'jiffies' is a read-only volatile thing.
>
> So "READ_ONCE()" is indeed unnecessary with jiffies, but it won't
> hurt. It's not really "confusion" - there's nothing _wrong_ with using
> READ_ONCE() on volatile data, but we just normally don't do volatile
> data in the kernel (because our normal model is that data is never
> really volatile in general - there may be locked and unlocked accesses
> to it, so it's stable or volatile depending on context, and thus the
> 'volatile' goes on the _code_, not on the data structure)
>
> But while jiffies READ_ONCE() accesses isn't _wrong_, it is also not
> really paired with any WRITE_ONCE(), because the real update is to
> technically not even to the same full data structure.
>
> So don't look to jiffies for how to do things. It's an odd one-off.
>
> That said, for "this might be used racily", if there are annotations
> for clang to just make it shut up about one particular field in a
> structure, than I think that would be ok. As long as it doesn't then
> imply special code generation (outside of the checking, of course).

There are annotations that work for globals only. This limitation is
because once you take a pointer of a variable, and just pass around
the address, the attribute will be lost. So sometimes you might do the
right thing, but other times you might not.

Just to summarize the options we've had so far:
1. Add a comment, and let the tool parse it somehow.
2. Add attribute to variables.
3. Add some new macro to use with expressions, which doesn't do
anything if the tool is disabled. E.g. "racy__(counter++)",
"lossy__(counter++);" or any suitable name.

1 and 3 are almost equivalent, except that 3 introduces "structured"
comments that are much easier to understand for tooling. Both 1 and 2
need compiler support.

Right now both gcc and clang support KCSAN, and doing 1 or 2 would
probably imply dropping one of them. Both 1 and 2 are also brittle: 1
because comments are not forced to be structured, and 2 because
attributes on variables do not propagate through pointers.

From our perspective, macro-based annotations would be most reliable.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-13 15:00                       ` Marco Elver
@ 2019-11-13 16:57                         ` Linus Torvalds
  2019-11-13 21:33                           ` Marco Elver
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-13 16:57 UTC (permalink / raw)
  To: Marco Elver
  Cc: Eric Dumazet, Alexei Starovoitov, Alan Stern, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Wed, Nov 13, 2019 at 7:00 AM Marco Elver <elver@google.com> wrote:
>
> Just to summarize the options we've had so far:
> 1. Add a comment, and let the tool parse it somehow.
> 2. Add attribute to variables.
> 3. Add some new macro to use with expressions, which doesn't do
> anything if the tool is disabled. E.g. "racy__(counter++)",
> "lossy__(counter++);" or any suitable name.

I guess I could live with "data_race(x)" or something simple like
that, assuming we really can just surround a whole expression with it,
and we don't have to make a hundred different versions for the
different cases ("racy plain assignment" vs "racy statistics update"
vs "racy u64 addition" etc etc).

I just want the source code to be very legible, which is one of the
problems with the ugly READ_ONCE() conversions.

Part of that "legible source code" implies no crazy double
underscores. But a plain "data_race(x)" might not look too bad, and
would be easy to grep for, and doesn't seem to exist in the current
kernel as anything else.

One question is if it would be a statement expression or an actual
expression. I think the expression would read much better, IOW you
could do

    val = data_race(p->field);

instead of having to write it as

    data_race(val = p->field);

to really point out the race. But at the same time, maybe you need to
surround several statements, ie

    // intentionally racy xchg because we don't care and it generates
better code
    data_race(a = p->field; p->field = b);

which all would work fine with a non-instrumented macro something like this:

    #define data_race(x) ({ x; })

which would hopefully give the proper syntax rules.

But that might be very very inconvenient for KCSAN, depending on how
you annotate the thing.

So I _suspect_ that what you actually want is to do it as a statement,
not as an expression. What's the actual underlying syntax for "ignore
this code for thread safety checking"?

                    Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-13 16:57                         ` Linus Torvalds
@ 2019-11-13 21:33                           ` Marco Elver
  2019-11-13 21:50                             ` Alan Stern
  0 siblings, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-13 21:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Alexei Starovoitov, Alan Stern, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Wed, 13 Nov 2019, Linus Torvalds wrote:

> On Wed, Nov 13, 2019 at 7:00 AM Marco Elver <elver@google.com> wrote:
> >
> > Just to summarize the options we've had so far:
> > 1. Add a comment, and let the tool parse it somehow.
> > 2. Add attribute to variables.
> > 3. Add some new macro to use with expressions, which doesn't do
> > anything if the tool is disabled. E.g. "racy__(counter++)",
> > "lossy__(counter++);" or any suitable name.
> 
> I guess I could live with "data_race(x)" or something simple like
> that, assuming we really can just surround a whole expression with it,
> and we don't have to make a hundred different versions for the
> different cases ("racy plain assignment" vs "racy statistics update"
> vs "racy u64 addition" etc etc).
> 
> I just want the source code to be very legible, which is one of the
> problems with the ugly READ_ONCE() conversions.
> 
> Part of that "legible source code" implies no crazy double
> underscores. But a plain "data_race(x)" might not look too bad, and
> would be easy to grep for, and doesn't seem to exist in the current
> kernel as anything else.
> 
> One question is if it would be a statement expression or an actual
> expression. I think the expression would read much better, IOW you
> could do
> 
>     val = data_race(p->field);
> 
> instead of having to write it as
> 
>     data_race(val = p->field);
> 
> to really point out the race. But at the same time, maybe you need to
> surround several statements, ie
> 
>     // intentionally racy xchg because we don't care and it generates
> better code
>     data_race(a = p->field; p->field = b);
> 
> which all would work fine with a non-instrumented macro something like this:
> 
>     #define data_race(x) ({ x; })
> 
> which would hopefully give the proper syntax rules.
> 
> But that might be very very inconvenient for KCSAN, depending on how
> you annotate the thing.
> 
> So I _suspect_ that what you actually want is to do it as a statement,
> not as an expression. What's the actual underlying syntax for "ignore
> this code for thread safety checking"?

An expression works fine. The below patch would work with KCSAN, and all
your above examples work.

Re name: would it make sense to more directly convey the intent?  I.e.
"this expression can race, and it's fine that the result is approximate
if it does"?

My vote would go to something like 'smp_lossy' or 'lossy_race' -- but
don't have a strong preference, and would also be fine with 'data_race'.
Whatever is most legible.  Comments?

Thanks,
-- Marco


diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 0b6506b9dd11..4d0597f89168 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -308,6 +308,29 @@ unsigned long read_word_at_a_time(const void *addr)
 	__u.__val;					\
 })
 
+#include <linux/kcsan.h>
+
+/*
+ * smp_lossy: macro that provides a way to document the intent that an
+ * expression may be lossy or approximate on an SMP system due to data races.
+ *
+ * This macro *does not* affect normal code generation, but is a hint to tooling
+ * that it is intentional that accesses in the expression may conflict with
+ * concurrent accesses.
+ */
+#ifdef __SANITIZE_THREAD__
+#define smp_lossy(expr)                                                        \
+	({                                                                     \
+		typeof(({ expr; })) __val;                                     \
+		kcsan_nestable_atomic_begin();                                 \
+		__val = ({ expr; });                                           \
+		kcsan_nestable_atomic_end();                                   \
+		__val;                                                         \
+	})
+#else
+#define smp_lossy(expr) ({ expr; })
+#endif
+
 #endif /* __KERNEL__ */
 
 /*

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-13 21:33                           ` Marco Elver
@ 2019-11-13 21:50                             ` Alan Stern
  2019-11-13 22:48                               ` Marco Elver
  0 siblings, 1 reply; 67+ messages in thread
From: Alan Stern @ 2019-11-13 21:50 UTC (permalink / raw)
  To: Marco Elver
  Cc: Linus Torvalds, Eric Dumazet, Alexei Starovoitov, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Wed, 13 Nov 2019, Marco Elver wrote:

> An expression works fine. The below patch would work with KCSAN, and all
> your above examples work.
> 
> Re name: would it make sense to more directly convey the intent?  I.e.
> "this expression can race, and it's fine that the result is approximate
> if it does"?
> 
> My vote would go to something like 'smp_lossy' or 'lossy_race' -- but
> don't have a strong preference, and would also be fine with 'data_race'.
> Whatever is most legible.  Comments?

Lossiness isn't really relevant.  Things like sticky writes work 
perfectly well with data races; they don't lose anything.

My preference would be for "data_race" or something very similar
("racy"? "race_ok"?).  That's the whole point -- we know the
operation can be part of a data race and we don't care.

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-13 21:50                             ` Alan Stern
@ 2019-11-13 22:48                               ` Marco Elver
  0 siblings, 0 replies; 67+ messages in thread
From: Marco Elver @ 2019-11-13 22:48 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linus Torvalds, Eric Dumazet, Alexei Starovoitov, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Al Viro, Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Wed, 13 Nov 2019, Alan Stern wrote:

> On Wed, 13 Nov 2019, Marco Elver wrote:
> 
> > An expression works fine. The below patch would work with KCSAN, and all
> > your above examples work.
> > 
> > Re name: would it make sense to more directly convey the intent?  I.e.
> > "this expression can race, and it's fine that the result is approximate
> > if it does"?
> > 
> > My vote would go to something like 'smp_lossy' or 'lossy_race' -- but
> > don't have a strong preference, and would also be fine with 'data_race'.
> > Whatever is most legible.  Comments?
> 
> Lossiness isn't really relevant.  Things like sticky writes work 
> perfectly well with data races; they don't lose anything.
> 
> My preference would be for "data_race" or something very similar
> ("racy"? "race_ok"?).  That's the whole point -- we know the
> operation can be part of a data race and we don't care.

Makes sense. Let's stick with 'data_race' then.

As Linus pointed out this won't yet work for void types and
if-statements. How frequent would that use be? Should it even be
encouraged?

I'll add this as a patch for the next version of the KCSAN patch series.

Thanks,
-- Marco

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 0b6506b9dd11..a97f323b61e3 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -308,6 +308,26 @@ unsigned long read_word_at_a_time(const void *addr)
 	__u.__val;					\
 })
 
+#include <linux/kcsan.h>
+
+/*
+ * data_race: macro to document that accesses in an expression may conflict with
+ * other concurrent accesses resulting in data races, but the resulting
+ * behaviour is deemed safe regardless.
+ *
+ * This macro *does not* affect normal code generation, but is a hint to tooling
+ * that data races here are intentional.
+ */
+#define data_race(expr)                                                      \
+	({                                                                     \
+		typeof(({ expr; })) __val;                                     \
+		kcsan_nestable_atomic_begin();                                 \
+		__val = ({ expr; });                                           \
+		kcsan_nestable_atomic_end();                                   \
+		__val;                                                         \
+	})
+#else
+
 #endif /* __KERNEL__ */
 
 /*

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-12 17:36                         ` Linus Torvalds
@ 2019-11-17 18:56                           ` Kirill Smelkov
  2019-11-17 19:20                             ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill Smelkov @ 2019-11-17 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Al Viro, Alan Stern, Marco Elver, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Tue, Nov 12, 2019 at 09:36:41AM -0800, Linus Torvalds wrote:
> On Tue, Nov 12, 2019 at 9:23 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. I thought we already then applied all the patches that marked
> > things that didn't use f_pos as FMODE_STREAM. Including pipes and
> > sockets etc.
> >
> > But if we didn't - and no, I didn't double-check now either - then
> > obviously that part of the patch can't be applied now.
> 
> Ok, looking at it now.
> 
> Yeah, commit c5bf68fe0c86 ("*: convert stream-like files from
> nonseekable_open -> stream_open") did the scripted thing, but it only
> did it for nonseekable_open, not for the more complicated cases.
> 
> So yup, you're right - we'd need to at least do the pipe/socket case too.
> 
> What happens if the actual conversion part (nonseekable_open ->
> stream_open) is removed from the cocci script, and it's used to only
> find "read/write doesn't use f_pos" cases?
> 
> Or maybe trigger on '.llseek = no_llseek'?

( just a quick update that I'm still pending on this. I've tried to
  quickly check the above this evening but offhand it does not give good
  results until stream_open.cocci is extended to understand
  read_iter/writer_iter and properly worked some more on it.
  Or maybe I'm just too sleepy...

  I'd like to take a time break for now.
  I will try to return to this topic after finishing my main work first.
  I apologize for the inconvenience. )

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-17 18:56                           ` Kirill Smelkov
@ 2019-11-17 19:20                             ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-17 19:20 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Eric Dumazet, Al Viro, Alan Stern, Marco Elver, Eric Dumazet,
	syzbot, linux-fsdevel, Linux Kernel Mailing List, syzkaller-bugs,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

[-- Attachment #1: Type: text/plain, Size: 429 bytes --]

On Sun, Nov 17, 2019 at 10:56 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
>   I'd like to take a time break for now.
>   I will try to return to this topic after finishing my main work first.
>   I apologize for the inconvenience. )

Sure, no problem, appreciate that you're looking at it.

I *think* the pipe and socket case should be fixed by something like
this, but it is entirely and utterly untested.

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1349 bytes --]

 fs/pipe.c    | 6 ++++--
 net/socket.c | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 8a2ab2f974bd..de6dee559d41 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -783,6 +783,7 @@ int create_pipe_files(struct file **res, int flags)
 	}
 
 	f->private_data = inode->i_pipe;
+	stream_open(inode, f);
 
 	res[0] = alloc_file_clone(f, O_RDONLY | (flags & O_NONBLOCK),
 				  &pipefifo_fops);
@@ -791,6 +792,7 @@ int create_pipe_files(struct file **res, int flags)
 		fput(f);
 		return PTR_ERR(res[0]);
 	}
+	stream_open(inode, f);
 	res[0]->private_data = inode->i_pipe;
 	res[1] = f;
 	return 0;
@@ -931,9 +933,9 @@ static int fifo_open(struct inode *inode, struct file *filp)
 	__pipe_lock(pipe);
 
 	/* We can only do regular read/write on fifos */
-	filp->f_mode &= (FMODE_READ | FMODE_WRITE);
+	stream_open(inode, filp);
 
-	switch (filp->f_mode) {
+	switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
 	case FMODE_READ:
 	/*
 	 *  O_RDONLY
diff --git a/net/socket.c b/net/socket.c
index 6a9ab7a8b1d2..3c6d60eadf7a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -404,6 +404,7 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
 
 	sock->file = file;
 	file->private_data = sock;
+	stream_open(SOCK_INODE(sock), file);
 	return file;
 }
 EXPORT_SYMBOL(sock_alloc_file);

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 22:06                       ` Linus Torvalds
@ 2019-11-09 23:08                         ` Alan Stern
  0 siblings, 0 replies; 67+ messages in thread
From: Alan Stern @ 2019-11-09 23:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Fri, 8 Nov 2019, Linus Torvalds wrote:

> On Fri, Nov 8, 2019 at 1:57 PM Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > Can we please agree to call these writes something other than
> > "idempotent"?  After all, any write to normal memory is idempotent in
> > the sense that doing it twice has the same effect as doing it once
> > (ignoring possible races, of course).
> 
> No!
> 
> You're completely missing the point.
> 
> Two writes to normal memory are *not* idempotent if they write
> different values. The ordering very much matters, and it's racy and a
> tool should complain.

What you have written strongly indicates that either you think the word
"idempotent" means something different from its actual meaning or else
you are misusing the word in a very confusing way.

Your text seems to say that two operations are idempotent if they have 
the same effect.  Compare that to the definition from Wikipedia (not 
the best in the world, perhaps, but adequate for this discussion):

	Idempotence is the property of certain operations in
	mathematics and computer science whereby they can be applied
	multiple times without changing the result beyond the initial
	application.

For example, writing 5 to x is idempotent because performing that write
multiple times will have the same result as performing it only once.
Likewise for any write to normal memory.

Therefore talking about idempotent writes as somehow being distinct 
from other writes does not make sense.  Nor does it make sense to say 
"Two writes to normal memory are *not* idempotent if they write
different values", because idempotence is a property of individual 
operations, not of pairs of operations.

You should use a different word, because what you mean is different 
from what "idempotent" actually means.

> But the point of WRITE_IDEMPOTENT() is that it *always* writes the
> same value, so it doesn't matter if two different writers race on it.
> 
> So it really is about being idempotent.

Okay, I agree that I did completely miss the point of what you
originally meant.  But what you meant wasn't "being idempotent".

> > A better name would be "write-if-different" or "write-if-changed"
> 
> No.
> 
> Again,  you're totally missing the point.
> 
> It's not about "write-if-different".
> 
> It's about idempotent writes.

Assuming you really are talking about writes (presumably in different
threads) that store the same value to the same location: Yes, two such
writes do not in practice race with each other -- even though they may
satisfy the formal definition of a data race.

On the other hand, they may each race with other accesses.

> But if you do know that all the possible racing writes are idempotent,
> then AS A POSSIBLE CACHE OPTIMIZATION, you can then say "only do this
> write if somebody else didn't already do it".

In fact, you can _always_ perform that possible optimization.  Whether
it will be worthwhile is a separate matter...

> But that's a side effect of being idempotent, not the basic rule! And
> it's not necessarily obviously an optimization at all, because the
> cacheline may already be dirty, and checking the old value and having
> a conditional on it may be much more expensive than just writing the
> new value./

Agreed.

> The point is that certain writes DO NOT CARE ABOUT ORDERING, because
> they may be setting a sticky flag (or stickily clearing a flag, as in
> this case).  The ordering doesn't matter, because the operation is
> idempotent.

Ah, here you use the word correctly.

> That's what "idempotent" means. You can do it once, or a hundred
> times, and the end result is the same (but is different from not doing
> it at all).

Precisely the point I make above.

> And no, not all writes are idempotent. That's just crazy talk. Writes
> have values.

By "writes have values", do you mean that writes store values?  Of
course they do.  But it is clear from what you wrote just above that
all writes _are_ idempotent, because doing a write once or a hundred
times will yield the same end result.

On a more productive note, do you think we should change the LKMM so 
that it won't consider two writes to race with each other if they store 
the same value?

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 21:57                     ` Alan Stern
@ 2019-11-08 22:06                       ` Linus Torvalds
  2019-11-09 23:08                         ` Alan Stern
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 22:06 UTC (permalink / raw)
  To: Alan Stern
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Fri, Nov 8, 2019 at 1:57 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Can we please agree to call these writes something other than
> "idempotent"?  After all, any write to normal memory is idempotent in
> the sense that doing it twice has the same effect as doing it once
> (ignoring possible races, of course).

No!

You're completely missing the point.

Two writes to normal memory are *not* idempotent if they write
different values. The ordering very much matters, and it's racy and a
tool should complain.

But the point of WRITE_IDEMPOTENT() is that it *always* writes the
same value, so it doesn't matter if two different writers race on it.

So it really is about being idempotent.

> A better name would be "write-if-different" or "write-if-changed"

No.

Again,  you're totally missing the point.

It's not about "write-if-different".

It's about idempotent writes.

But if you do know that all the possible racing writes are idempotent,
then AS A POSSIBLE CACHE OPTIMIZATION, you can then say "only do this
write if somebody else didn't already do it".

But that's a side effect of being idempotent, not the basic rule! And
it's not necessarily obviously an optimization at all, because the
cacheline may already be dirty, and checking the old value and having
a conditional on it may be much more expensive than just writing the
new value./

The point is that certain writes DO NOT CARE ABOUT ORDERING, because
they may be setting a sticky flag (or stickily clearing a flag, as in
this case).  The ordering doesn't matter, because the operation is
idempotent.

That's what "idempotent" means. You can do it once, or a hundred
times, and the end result is the same (but is different from not doing
it at all).

And no, not all writes are idempotent. That's just crazy talk. Writes
have values.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 20:26                   ` Linus Torvalds
@ 2019-11-08 21:57                     ` Alan Stern
  2019-11-08 22:06                       ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Alan Stern @ 2019-11-08 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marco Elver, Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Andrea Parri,
	Paul E. McKenney, LKMM Maintainers -- Akira Yokosawa

On Fri, 8 Nov 2019, Linus Torvalds wrote:

> On Fri, Nov 8, 2019 at 11:48 AM Marco Elver <elver@google.com> wrote:
> >
> > It's not explicitly aware of initialization or release. We rely on
> > compiler instrumentation for all memory accesses; KCSAN then sets up
> > "watchpoints" for sampled memory accesses, delaying execution, and
> > checking if a concurrent access is observed.
> 
> Ok.
> 
> > This same approach could be used to ignore "idempotent writes" where
> > we would otherwise report a data race; i.e. if there was a concurrent
> > write, but the data value did not change, do not report the race. I'm
> > happy to add this feature if this should always be ignored.
> 
> Hmm. I don't think it's valid in general, but it might be useful
> enough in practice, at least as an option to lower the false
> positives.

Minor point...

Can we please agree to call these writes something other than 
"idempotent"?  After all, any write to normal memory is idempotent in 
the sense that doing it twice has the same effect as doing it once 
(ignoring possible races, of course).

A better name would be "write-if-different" or "write-if-changed" (and
I bet people can come up with something even better if they try).  
This at least gets across the main idea, and using

	WRITE_IF_CHANGED(x, y);

to mean

	if (READ_ONCE(x) != y) WRITE_ONCE(x, y);

is a lot clearer than using WRITE_IDEMPOTENT(x, y).

Alan Stern


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 20:53               ` Eric Dumazet
@ 2019-11-08 21:36                 ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 21:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 12:53 PM Eric Dumazet <edumazet@google.com> wrote:
>
> per cpu SNMP counters mostly, with no IRQ safety requirements.
>
> Note that this could be implemented using local{64}_add() on arches like x86_64,
> while others might have to fallback to WRITE_ONCE(variable, variable + add)

raw_cpu_add()?

We already use those for vm_counters where we intentionally accept races.

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 20:30             ` Linus Torvalds
@ 2019-11-08 20:53               ` Eric Dumazet
  2019-11-08 21:36                 ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-08 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 12:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Nov 8, 2019 at 9:56 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > BTW, I would love an efficient ADD_ONCE(variable, value)
> >
> > Using WRITE_ONCE(variable, variable + value) is not good, since it can
> > not use the optimized instructions operating directly on memory.
>
> So I'm having a hard time seeing how this could possibly ever be valid.
>
> Is this a "writer is locked, readers are unlocked" case or something?

per cpu SNMP counters mostly, with no IRQ safety requirements.

Note that this could be implemented using local{64}_add() on arches like x86_64,
while others might have to fallback to WRITE_ONCE(variable, variable + add)

>
> Because we don't really have any sane way to do that any more
> efficiently, unless we'd have to add new architecture-specific
> functions for it (like we do have fo the percpu ops).
>
> Anyway, if you have a really hot case you care about, maybe you could
> convince the gcc people to just add it as a peephole optimization?
> Right now, gcc ends up doing some strange things with volatiles, and
> basically disables a lot of stuff over them. But with a test-case,
> maybe you can convince somebody that certain optimizations are still
> fine. A "read+add+write" really does the exact same accesses as an
> add-to-memory instruction, but gcc has some logic to disable that
> instruction fusion.
>
>           Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:55           ` Eric Dumazet
  2019-11-08 18:02             ` Eric Dumazet
@ 2019-11-08 20:30             ` Linus Torvalds
  2019-11-08 20:53               ` Eric Dumazet
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 20:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:56 AM Eric Dumazet <edumazet@google.com> wrote:
>
> BTW, I would love an efficient ADD_ONCE(variable, value)
>
> Using WRITE_ONCE(variable, variable + value) is not good, since it can
> not use the optimized instructions operating directly on memory.

So I'm having a hard time seeing how this could possibly ever be valid.

Is this a "writer is locked, readers are unlocked" case or something?

Because we don't really have any sane way to do that any more
efficiently, unless we'd have to add new architecture-specific
functions for it (like we do have fo the percpu ops).

Anyway, if you have a really hot case you care about, maybe you could
convince the gcc people to just add it as a peephole optimization?
Right now, gcc ends up doing some strange things with volatiles, and
basically disables a lot of stuff over them. But with a test-case,
maybe you can convince somebody that certain optimizations are still
fine. A "read+add+write" really does the exact same accesses as an
add-to-memory instruction, but gcc has some logic to disable that
instruction fusion.

          Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 19:48                 ` Marco Elver
@ 2019-11-08 20:26                   ` Linus Torvalds
  2019-11-08 21:57                     ` Alan Stern
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 20:26 UTC (permalink / raw)
  To: Marco Elver
  Cc: Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Alan Stern,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Fri, Nov 8, 2019 at 11:48 AM Marco Elver <elver@google.com> wrote:
>
> It's not explicitly aware of initialization or release. We rely on
> compiler instrumentation for all memory accesses; KCSAN then sets up
> "watchpoints" for sampled memory accesses, delaying execution, and
> checking if a concurrent access is observed.

Ok.

> This same approach could be used to ignore "idempotent writes" where
> we would otherwise report a data race; i.e. if there was a concurrent
> write, but the data value did not change, do not report the race. I'm
> happy to add this feature if this should always be ignored.

Hmm. I don't think it's valid in general, but it might be useful
enough in practice, at least as an option to lower the false
positives.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 18:40               ` Linus Torvalds
@ 2019-11-08 19:48                 ` Marco Elver
  2019-11-08 20:26                   ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-08 19:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Alan Stern,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Fri, 8 Nov 2019 at 19:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Nov 8, 2019 at 10:16 AM Marco Elver <elver@google.com> wrote:
> >
> > KCSAN does not use volatile to distinguish accesses. Right now
> > READ_ONCE, WRITE_ONCE, atomic bitops, atomic_t (+ some arch specific
> > primitives) are treated as marked atomic operations.
>
> Ok, so we'd have to do this in terms of ATOMIC_WRITE().
>
> One alternative might be KCSAN enhancement, where you notice the
> following pattern:
>
>  - a field is initialized before the data structure gets exposed (I
> presume KCSAN already must understand about this issue -
> initializations are different and not atomic)
>
>  - while the field is live, there are operations that write the same
> (let's call it "idempotent") value to the field under certain
> circumstances
>
>  - at release time, after all the reference counts are gone, the field
> is read for whether that situation happened. I'm assuming KCSAN
> already understands about this case too?

It's not explicitly aware of initialization or release. We rely on
compiler instrumentation for all memory accesses; KCSAN then sets up
"watchpoints" for sampled memory accesses, delaying execution, and
checking if a concurrent access is observed.

We already have an option (currently disabled on syzbot) where KCSAN
infers a data race not because another instrumented accesses happened
concurrently, but because the data value changed during a watchpoint's
lifetime (e.g. due to uninstrumented write, device write etc.).

This same approach could be used to ignore "idempotent writes" where
we would otherwise report a data race; i.e. if there was a concurrent
write, but the data value did not change, do not report the race. I'm
happy to add this feature if this should always be ignored.

> So it would only be the "idempotent writes" thing that would be
> something KCSAN would have to realize do not involve a race - because
> it simply doesn't matter if two writes of the same value race against
> each other.
>
> But I guess we could also just do
>
>    #define WRITE_IDEMPOTENT(x,y) WRITE_ONCE(x,y)
>
> and use that in the kernel to annotate these things. And if we have
> that kind of annotation, we could then possibly change it to
>
>   #define WRITE_IDEMPOTENT(x,y) \
>        if READ_ONCE(x)!=y WRITE_ONCE(x,y)
>
> if we have numbers that that actually helps (that macro written to be
> intentionally invalid C - it obviously needs statement protection and
> protection against evaluating the arguments multiple times etc).
>
>                 Linus

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 18:15             ` Marco Elver
@ 2019-11-08 18:40               ` Linus Torvalds
  2019-11-08 19:48                 ` Marco Elver
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 18:40 UTC (permalink / raw)
  To: Marco Elver
  Cc: Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Alan Stern,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Fri, Nov 8, 2019 at 10:16 AM Marco Elver <elver@google.com> wrote:
>
> KCSAN does not use volatile to distinguish accesses. Right now
> READ_ONCE, WRITE_ONCE, atomic bitops, atomic_t (+ some arch specific
> primitives) are treated as marked atomic operations.

Ok, so we'd have to do this in terms of ATOMIC_WRITE().

One alternative might be KCSAN enhancement, where you notice the
following pattern:

 - a field is initialized before the data structure gets exposed (I
presume KCSAN already must understand about this issue -
initializations are different and not atomic)

 - while the field is live, there are operations that write the same
(let's call it "idempotent") value to the field under certain
circumstances

 - at release time, after all the reference counts are gone, the field
is read for whether that situation happened. I'm assuming KCSAN
already understands about this case too?

So it would only be the "idempotent writes" thing that would be
something KCSAN would have to realize do not involve a race - because
it simply doesn't matter if two writes of the same value race against
each other.

But I guess we could also just do

   #define WRITE_IDEMPOTENT(x,y) WRITE_ONCE(x,y)

and use that in the kernel to annotate these things. And if we have
that kind of annotation, we could then possibly change it to

  #define WRITE_IDEMPOTENT(x,y) \
       if READ_ONCE(x)!=y WRITE_ONCE(x,y)

if we have numbers that that actually helps (that macro written to be
intentionally invalid C - it obviously needs statement protection and
protection against evaluating the arguments multiple times etc).

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 18:05           ` Linus Torvalds
@ 2019-11-08 18:15             ` Marco Elver
  2019-11-08 18:40               ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Marco Elver @ 2019-11-08 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Eric Dumazet, syzbot, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro, Alan Stern,
	Andrea Parri, Paul E. McKenney,
	LKMM Maintainers -- Akira Yokosawa

On Fri, 8 Nov 2019 at 19:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Nov 8, 2019 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > I personally like WRITE_ONCE() since it adds zero overhead on generated code,
> > and is the facto accessor we used for many years (before KCSAN was conceived)
>
> So I generally prefer WRITE_ONCE() over adding "volatile" to random
> data structure members.
>
> Because volatile *does* have potentially absolutely horrendous
> overhead on generated code. It just happens to be ok for the simple
> case of writing once to a variable.
>
> In fact, you bring that up yourself in your next email when you ask
> for "ADD_ONCE()". Exactly because gcc generates absolutely horrendous
> garbage for volatiles, for no actual good reason. Gcc *could* generate
> a single add-to-memory instruction. But no, that's not at all what gcc
> does.
>
> So for the kernel, we've generally had the rule to avoid 'volatile'
> data structures as much as humanly possible, because it actually does
> something much worse than it could do, and the source code _looks_
> simple when the volatile is hidden in the data structures.
>
> Which is why we have READ_ONCE/WRITE_ONCE - it puts the volatile in
> the code, and makes it clear not only what is going on, but also the
> impact it has on code generation.
>
> But at the same time, I don't love WRITE_ONCE() when it's not actually
> about writing once. It might be better to have another way to show
> "this variable is a flag that we set to a single value". Even if maybe
> the implementation is then the same (ie we use a 'volatile' assignment
> to make KCSAN happy).

(+some LKMM folks, in case I missed something on what the LKMM defines
as data race.)

KCSAN does not use volatile to distinguish accesses. Right now
READ_ONCE, WRITE_ONCE, atomic bitops, atomic_t (+ some arch specific
primitives) are treated as marked atomic operations.

The goal is to cover all primitives that the LKMM declares as
marked/atomic. A data race is then detected for concurrent conflicting
accesses where at least one is plain unmarked. In the end the LKMM
should decide what KCSAN determines as a data race. As far as I can
tell, none of the reported data races so far are false positives in
that sense.

Many thanks,
-- Marco

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 18:02             ` Eric Dumazet
@ 2019-11-08 18:12               ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 18:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 10:02 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Another interesting KCSAN report :
>
> static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
> {
>         s64 ret = fbc->count;   // data-race ....

Yeah, I think that's fundamentally broken on 32-bit. It might need a
sequence lock instead of that raw_spinlock_t to be sanely done
properly on 32-bit.

Or we just admit that 32-bit doesn't really matter any more in the
long run, and that this is not a problem in practice because the
32-bit overflow basically never happens on small machines anyway.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:53         ` Eric Dumazet
  2019-11-08 17:55           ` Eric Dumazet
@ 2019-11-08 18:05           ` Linus Torvalds
  2019-11-08 18:15             ` Marco Elver
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> I personally like WRITE_ONCE() since it adds zero overhead on generated code,
> and is the facto accessor we used for many years (before KCSAN was conceived)

So I generally prefer WRITE_ONCE() over adding "volatile" to random
data structure members.

Because volatile *does* have potentially absolutely horrendous
overhead on generated code. It just happens to be ok for the simple
case of writing once to a variable.

In fact, you bring that up yourself in your next email when you ask
for "ADD_ONCE()". Exactly because gcc generates absolutely horrendous
garbage for volatiles, for no actual good reason. Gcc *could* generate
a single add-to-memory instruction. But no, that's not at all what gcc
does.

So for the kernel, we've generally had the rule to avoid 'volatile'
data structures as much as humanly possible, because it actually does
something much worse than it could do, and the source code _looks_
simple when the volatile is hidden in the data structures.

Which is why we have READ_ONCE/WRITE_ONCE - it puts the volatile in
the code, and makes it clear not only what is going on, but also the
impact it has on code generation.

But at the same time, I don't love WRITE_ONCE() when it's not actually
about writing once. It might be better to have another way to show
"this variable is a flag that we set to a single value". Even if maybe
the implementation is then the same (ie we use a 'volatile' assignment
to make KCSAN happy).

> Hmm, which questionable optimization are you referring to?

The "avoid dirty cacheline" one by adding a read and a conditional.
Yes, it can be an optimization. And it might not be.

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:55           ` Eric Dumazet
@ 2019-11-08 18:02             ` Eric Dumazet
  2019-11-08 18:12               ` Linus Torvalds
  2019-11-08 20:30             ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-08 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:55 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 8, 2019 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 8, 2019 at 9:39 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >
> > > I'd hope that there is some way to mark the cases we know about where
> > > we just have a flag. I'm not sure what KCSAN uses right now - is it
> > > just the "volatile" that makes KCSAN ignore it, or are there other
> > > ways to do it?
> >
> > I dunno, Marco will comment on this.
> >
> > I personally like WRITE_ONCE() since it adds zero overhead on generated code,
> > and is the facto accessor we used for many years (before KCSAN was conceived)
> >
>
> BTW, I would love an efficient ADD_ONCE(variable, value)
>
> Using WRITE_ONCE(variable, variable + value) is not good, since it can
> not use the
> optimized instructions operating directly on memory.

Another interesting KCSAN report :

static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
        s64 ret = fbc->count;   // data-race ....

        barrier();              /* Prevent reloads of fbc->count */
        if (ret >= 0)
                return ret;
        return 0;
}

How was this code supposed to work at all on 32bit arches ???

Using READ_ONCE(fbc->count) alone will not help.


BUG: KCSAN: data-race in ip6_dst_gc / ip6_dst_gc

read to 0xffff88811dd6298c of 4 bytes by task 10060 on cpu 1:
 dst_entries_get_fast include/net/dst_ops.h:47 [inline]
 ip6_dst_gc+0xf6/0x220 net/ipv6/route.c:3167
 dst_alloc+0x104/0x149 net/core/dst.c:85
 ip6_dst_alloc+0x3d/0x80 net/ipv6/route.c:353
 ip6_rt_cache_alloc+0x8d/0x340 net/ipv6/route.c:1338
 ip6_pol_route+0x4ec/0x5c0 net/ipv6/route.c:2217
 ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
 fib6_rule_lookup+0x95/0x470 net/ipv6/fib6_rules.c:113
 ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
 ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
 ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1052
 ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1153
 rawv6_sendmsg+0x82c/0x21e0 net/ipv6/raw.c:928
 inet_sendmsg+0x6d/0x90 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 kernel_sendmsg+0x4d/0x70 net/socket.c:677
 sock_no_sendpage+0xda/0x110 net/core/sock.c:2742
 kernel_sendpage+0x7b/0xc0 net/socket.c:3682
 sock_sendpage+0x6c/0x90 net/socket.c:935
 pipe_to_sendpage+0x102/0x180 fs/splice.c:449
 splice_from_pipe_feed fs/splice.c:500 [inline]
 __splice_from_pipe+0x248/0x480 fs/splice.c:624
 splice_from_pipe+0xbb/0x100 fs/splice.c:659
 generic_splice_sendpage+0x45/0x60 fs/splice.c:829
 do_splice_from fs/splice.c:848 [inline]
 direct_splice_actor+0xa0/0xc0 fs/splice.c:1020
 splice_direct_to_actor+0x215/0x510 fs/splice.c:975
 do_splice_direct+0x161/0x1e0 fs/splice.c:1063
 do_sendfile+0x384/0x7f0 fs/read_write.c:1464
 __do_sys_sendfile64 fs/read_write.c:1525 [inline]
 __se_sys_sendfile64 fs/read_write.c:1511 [inline]
 __x64_sys_sendfile64+0x12a/0x140 fs/read_write.c:1511
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:53         ` Eric Dumazet
@ 2019-11-08 17:55           ` Eric Dumazet
  2019-11-08 18:02             ` Eric Dumazet
  2019-11-08 20:30             ` Linus Torvalds
  2019-11-08 18:05           ` Linus Torvalds
  1 sibling, 2 replies; 67+ messages in thread
From: Eric Dumazet @ 2019-11-08 17:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 8, 2019 at 9:39 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>
> > I'd hope that there is some way to mark the cases we know about where
> > we just have a flag. I'm not sure what KCSAN uses right now - is it
> > just the "volatile" that makes KCSAN ignore it, or are there other
> > ways to do it?
>
> I dunno, Marco will comment on this.
>
> I personally like WRITE_ONCE() since it adds zero overhead on generated code,
> and is the facto accessor we used for many years (before KCSAN was conceived)
>

BTW, I would love an efficient ADD_ONCE(variable, value)

Using WRITE_ONCE(variable, variable + value) is not good, since it can
not use the
optimized instructions operating directly on memory.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:38       ` Linus Torvalds
@ 2019-11-08 17:53         ` Eric Dumazet
  2019-11-08 17:55           ` Eric Dumazet
  2019-11-08 18:05           ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Eric Dumazet @ 2019-11-08 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:39 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:

> I'd hope that there is some way to mark the cases we know about where
> we just have a flag. I'm not sure what KCSAN uses right now - is it
> just the "volatile" that makes KCSAN ignore it, or are there other
> ways to do it?

I dunno, Marco will comment on this.

I personally like WRITE_ONCE() since it adds zero overhead on generated code,
and is the facto accessor we used for many years (before KCSAN was conceived)

>
> "volatile" has huge problems with code generation for gcc. It would
> probably be fine for "not_rcu" in this case, but I'd like to avoid it
> in general otherwise, which is why I wonder if there are other
> options.
>
> But worst comes to worst, I'd be ok with a WRITE_ONCE() and a comment
> about why (and the reason being KCSAN, not the questionable
> optimization).

Ok for a single WRITE_ONCE() with a comment.

Hmm, which questionable optimization are you referring to?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:22     ` Eric Dumazet
@ 2019-11-08 17:38       ` Linus Torvalds
  2019-11-08 17:53         ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 17:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:22 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Ok, so what do you suggest next ?
>
> Declare KCSAN useless because too many false positives ?

I'd hope that there is some way to mark the cases we know about where
we just have a flag. I'm not sure what KCSAN uses right now - is it
just the "volatile" that makes KCSAN ignore it, or are there other
ways to do it?

"volatile" has huge problems with code generation for gcc. It would
probably be fine for "not_rcu" in this case, but I'd like to avoid it
in general otherwise, which is why I wonder if there are other
options.

But worst comes to worst, I'd be ok with a WRITE_ONCE() and a comment
about why (and the reason being KCSAN, not the questionable
optimization).

           Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 17:01   ` Linus Torvalds
@ 2019-11-08 17:22     ` Eric Dumazet
  2019-11-08 17:38       ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-08 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, syzbot, Marco Elver, linux-fsdevel,
	Linux Kernel Mailing List, syzkaller-bugs, Al Viro

On Fri, Nov 8, 2019 at 9:01 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Nov 8, 2019 at 5:28 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Linus, what do you think of the following fix ?
>
> I think it's incredibly ugly.
>
> I realize that avoiding the cacheline dirtying might be worth it, but
> I'd like to see some indication that it actually matters and helps
> from a performance angle. We've already dirtied memory fairly close,
> even if it might not share a cacheline (that structure is randomized,
> we've touched - or will touch - 'cred->usage') too.
>
> Honestly, I don't think get_cred() is even in a hotpath. Most cred use
> just use the current cred that doesn't need the 'get'. So the
> optimization looks somewhat questionable - for all we know it just
> makes things worse.
>
> I also don't like using a "WRITE_ONCE()" without a reason for it. In
> this case, the only "reason" is that KCSAN special-cases that thing.
> I'd much rather have some other way to mark it.
>
> So it just looks hacky to me.
>
> I like that people are looking at KCSAN, but I get a very strong
> feeling that right now the workarounds for KCSAN false-positives are
> incredibly ugly, and not always appropriate.
>
> There is absolutely zero need for a WRITE_ONCE() in this case. The
> code would work fine if the compiler did the zero write fifty times,
> and re-ordered it wildly. We have a flag that starts out set, and we
> clear it.  There's really no "write-once" about it.
>

Ok, so what do you suggest next ?

Declare KCSAN useless because too many false positives ?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 13:28 ` Eric Dumazet
@ 2019-11-08 17:01   ` Linus Torvalds
  2019-11-08 17:22     ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2019-11-08 17:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: syzbot, elver, linux-fsdevel, Linux Kernel Mailing List,
	syzkaller-bugs, Al Viro, Eric Dumazet

On Fri, Nov 8, 2019 at 5:28 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Linus, what do you think of the following fix ?

I think it's incredibly ugly.

I realize that avoiding the cacheline dirtying might be worth it, but
I'd like to see some indication that it actually matters and helps
from a performance angle. We've already dirtied memory fairly close,
even if it might not share a cacheline (that structure is randomized,
we've touched - or will touch - 'cred->usage') too.

Honestly, I don't think get_cred() is even in a hotpath. Most cred use
just use the current cred that doesn't need the 'get'. So the
optimization looks somewhat questionable - for all we know it just
makes things worse.

I also don't like using a "WRITE_ONCE()" without a reason for it. In
this case, the only "reason" is that KCSAN special-cases that thing.
I'd much rather have some other way to mark it.

So it just looks hacky to me.

I like that people are looking at KCSAN, but I get a very strong
feeling that right now the workarounds for KCSAN false-positives are
incredibly ugly, and not always appropriate.

There is absolutely zero need for a WRITE_ONCE() in this case. The
code would work fine if the compiler did the zero write fifty times,
and re-ordered it wildly. We have a flag that starts out set, and we
clear it.  There's really no "write-once" about it.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: KCSAN: data-race in __alloc_file / __alloc_file
  2019-11-08 13:16 syzbot
@ 2019-11-08 13:28 ` Eric Dumazet
  2019-11-08 17:01   ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Eric Dumazet @ 2019-11-08 13:28 UTC (permalink / raw)
  To: syzbot, elver, linux-fsdevel, linux-kernel, syzkaller-bugs, viro,
	Linus Torvalds, Eric Dumazet



On 11/8/19 5:16 AM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    05f22368 x86, kcsan: Enable KCSAN for x86
> git tree:       https://github.com/google/ktsan.git kcsan
> console output: https://syzkaller.appspot.com/x/log.txt?x=10d7fd88e00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=87d111955f40591f
> dashboard link: https://syzkaller.appspot.com/bug?extid=3ef049d50587836c0606
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+3ef049d50587836c0606@syzkaller.appspotmail.com
> 
> ==================================================================
> BUG: KCSAN: data-race in __alloc_file / __alloc_file
> 
> write to 0xffff8880bb157398 of 4 bytes by task 10993 on cpu 0:
>  get_cred include/linux/cred.h:253 [inline]
>  __alloc_file+0x74/0x210 fs/file_table.c:105
>  alloc_empty_file+0x8f/0x180 fs/file_table.c:151
>  alloc_file+0x4e/0x2b0 fs/file_table.c:193
>  alloc_file_pseudo+0x11c/0x1b0 fs/file_table.c:232
>  anon_inode_getfile fs/anon_inodes.c:91 [inline]
>  anon_inode_getfile+0x103/0x1d0 fs/anon_inodes.c:74
>  __do_sys_perf_event_open+0xd32/0x1ac0 kernel/events/core.c:11100
>  __se_sys_perf_event_open kernel/events/core.c:10867 [inline]
>  __x64_sys_perf_event_open+0x70/0x90 kernel/events/core.c:10867
>  do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> write to 0xffff8880bb157398 of 4 bytes by task 11004 on cpu 1:
>  get_cred include/linux/cred.h:253 [inline]
>  __alloc_file+0x74/0x210 fs/file_table.c:105
>  alloc_empty_file+0x8f/0x180 fs/file_table.c:151
>  path_openat+0x74/0x36e0 fs/namei.c:3514
>  do_filp_open+0x11e/0x1b0 fs/namei.c:3555
>  do_sys_open+0x3b3/0x4f0 fs/open.c:1097
>  __do_sys_open fs/open.c:1115 [inline]
>  __se_sys_open fs/open.c:1110 [inline]
>  __x64_sys_open+0x55/0x70 fs/open.c:1110
>  do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 1 PID: 11004 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> ==================================================================
> 
> 
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Linus, what do you think of the following fix ?

I also took the opportunity avoiding dirtying a cache line if this was possible.

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263fbe79dfd5a36163c656dca5da220..01b5b7d4e054ddca0df676dc1ceb068e5d71a3f8 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -250,7 +250,14 @@ static inline const struct cred *get_cred(const struct cred *cred)
        if (!cred)
                return cred;
        validate_creds(cred);
-       nonconst_cred->non_rcu = 0;
+
+       /*
+        * Avoid dirtying one cache line. The WRITE_ONCE() also pairs
+        * with itself, since we run without protection of a lock.
+        */
+       if (READ_ONCE(nonconst_cred->non_rcu))
+               WRITE_ONCE(nonconst_cred->non_rcu, 0);
+
        return get_new_cred(nonconst_cred);
 }
 
@@ -262,7 +269,14 @@ static inline const struct cred *get_cred_rcu(const struct cred *cred)
        if (!atomic_inc_not_zero(&nonconst_cred->usage))
                return NULL;
        validate_creds(cred);
-       nonconst_cred->non_rcu = 0;
+
+       /*
+        * Avoid dirtying one cache line. The WRITE_ONCE() also pairs
+        * with itself, since we run without protection of a lock.
+        */
+       if (READ_ONCE(nonconst_cred->non_rcu))
+               WRITE_ONCE(nonconst_cred->non_rcu, 0);
+
        return cred;
 }
 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* KCSAN: data-race in __alloc_file / __alloc_file
@ 2019-11-08 13:16 syzbot
  2019-11-08 13:28 ` Eric Dumazet
  0 siblings, 1 reply; 67+ messages in thread
From: syzbot @ 2019-11-08 13:16 UTC (permalink / raw)
  To: elver, linux-fsdevel, linux-kernel, syzkaller-bugs, viro

Hello,

syzbot found the following crash on:

HEAD commit:    05f22368 x86, kcsan: Enable KCSAN for x86
git tree:       https://github.com/google/ktsan.git kcsan
console output: https://syzkaller.appspot.com/x/log.txt?x=10d7fd88e00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=87d111955f40591f
dashboard link: https://syzkaller.appspot.com/bug?extid=3ef049d50587836c0606
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3ef049d50587836c0606@syzkaller.appspotmail.com

==================================================================
BUG: KCSAN: data-race in __alloc_file / __alloc_file

write to 0xffff8880bb157398 of 4 bytes by task 10993 on cpu 0:
  get_cred include/linux/cred.h:253 [inline]
  __alloc_file+0x74/0x210 fs/file_table.c:105
  alloc_empty_file+0x8f/0x180 fs/file_table.c:151
  alloc_file+0x4e/0x2b0 fs/file_table.c:193
  alloc_file_pseudo+0x11c/0x1b0 fs/file_table.c:232
  anon_inode_getfile fs/anon_inodes.c:91 [inline]
  anon_inode_getfile+0x103/0x1d0 fs/anon_inodes.c:74
  __do_sys_perf_event_open+0xd32/0x1ac0 kernel/events/core.c:11100
  __se_sys_perf_event_open kernel/events/core.c:10867 [inline]
  __x64_sys_perf_event_open+0x70/0x90 kernel/events/core.c:10867
  do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff8880bb157398 of 4 bytes by task 11004 on cpu 1:
  get_cred include/linux/cred.h:253 [inline]
  __alloc_file+0x74/0x210 fs/file_table.c:105
  alloc_empty_file+0x8f/0x180 fs/file_table.c:151
  path_openat+0x74/0x36e0 fs/namei.c:3514
  do_filp_open+0x11e/0x1b0 fs/namei.c:3555
  do_sys_open+0x3b3/0x4f0 fs/open.c:1097
  __do_sys_open fs/open.c:1115 [inline]
  __se_sys_open fs/open.c:1110 [inline]
  __x64_sys_open+0x55/0x70 fs/open.c:1110
  do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 11004 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, back to index

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAHk-=wjB61GNmqpX0BLA5tpL4tsjWV7akaTc2Roth7uGgax+mw@mail.gmail.com>
2019-11-10 16:09 ` KCSAN: data-race in __alloc_file / __alloc_file Alan Stern
2019-11-10 19:10   ` Marco Elver
2019-11-11 15:51     ` Alan Stern
2019-11-11 16:51       ` Linus Torvalds
2019-11-11 17:52         ` Eric Dumazet
2019-11-11 18:04           ` Linus Torvalds
2019-11-11 18:31             ` Eric Dumazet
2019-11-11 18:44               ` Eric Dumazet
2019-11-11 19:00                 ` Linus Torvalds
2019-11-11 19:13                   ` Eric Dumazet
2019-11-11 20:43                     ` Linus Torvalds
2019-11-11 20:46                       ` Linus Torvalds
2019-11-11 21:53                         ` Eric Dumazet
2019-11-11 23:51                   ` Linus Torvalds
2019-11-12 16:50                     ` Kirill Smelkov
2019-11-12 17:23                       ` Linus Torvalds
2019-11-12 17:36                         ` Linus Torvalds
2019-11-17 18:56                           ` Kirill Smelkov
2019-11-17 19:20                             ` Linus Torvalds
2019-11-11 18:50               ` Linus Torvalds
2019-11-11 18:59                 ` Marco Elver
2019-11-11 18:59                 ` Eric Dumazet
2019-11-10 19:12   ` Linus Torvalds
2019-11-10 19:20     ` Linus Torvalds
2019-11-10 20:44       ` Paul E. McKenney
2019-11-10 21:10         ` Linus Torvalds
2019-11-10 21:31           ` Paul E. McKenney
2019-11-11 14:17         ` Marco Elver
2019-11-11 14:31           ` Paul E. McKenney
2019-11-11 15:10             ` Marco Elver
2019-11-13  0:25               ` Paul E. McKenney
2019-11-12 19:14     ` Alan Stern
2019-11-12 19:47       ` Linus Torvalds
2019-11-12 20:29         ` Alan Stern
2019-11-12 20:58           ` Linus Torvalds
2019-11-12 21:13             ` Linus Torvalds
2019-11-12 22:05               ` Marco Elver
2019-11-12 21:48             ` Alan Stern
2019-11-12 22:07               ` Eric Dumazet
2019-11-12 22:44                 ` Alexei Starovoitov
2019-11-12 23:17                   ` Eric Dumazet
2019-11-12 23:40                     ` Linus Torvalds
2019-11-13 15:00                       ` Marco Elver
2019-11-13 16:57                         ` Linus Torvalds
2019-11-13 21:33                           ` Marco Elver
2019-11-13 21:50                             ` Alan Stern
2019-11-13 22:48                               ` Marco Elver
2019-11-08 13:16 syzbot
2019-11-08 13:28 ` Eric Dumazet
2019-11-08 17:01   ` Linus Torvalds
2019-11-08 17:22     ` Eric Dumazet
2019-11-08 17:38       ` Linus Torvalds
2019-11-08 17:53         ` Eric Dumazet
2019-11-08 17:55           ` Eric Dumazet
2019-11-08 18:02             ` Eric Dumazet
2019-11-08 18:12               ` Linus Torvalds
2019-11-08 20:30             ` Linus Torvalds
2019-11-08 20:53               ` Eric Dumazet
2019-11-08 21:36                 ` Linus Torvalds
2019-11-08 18:05           ` Linus Torvalds
2019-11-08 18:15             ` Marco Elver
2019-11-08 18:40               ` Linus Torvalds
2019-11-08 19:48                 ` Marco Elver
2019-11-08 20:26                   ` Linus Torvalds
2019-11-08 21:57                     ` Alan Stern
2019-11-08 22:06                       ` Linus Torvalds
2019-11-09 23:08                         ` Alan Stern

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git