[v2] netlink: Replace rhash_portid with bound
diff mbox series

Message ID 20150922033856.GA7851@gondor.apana.org.au
State New, archived
Headers show
Series
  • [v2] netlink: Replace rhash_portid with bound
Related show

Commit Message

Herbert Xu Sept. 22, 2015, 3:38 a.m. UTC
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >  
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
> 
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >  
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
> 
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Tejun Heo Sept. 22, 2015, 4:10 p.m. UTC | #1
Hello, Herbert.

On Tue, Sep 22, 2015 at 11:38:56AM +0800, Herbert Xu wrote:
> On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
> > store_release and load_acquire are different from the usual memory
> > barriers and can't be paired this way.  You have to pair store_release
> > and load_acquire.  Besides, it isn't a particularly good idea to
> 
> OK I've decided to drop the acquire/release helpers as they don't
> help us at all and simply pessimises the code by using full memory
> barriers (on some architectures) where only a write or read barrier
> is needed.

That's a pentium pro era errata.  Virtually no working machine is
affected by that anymore and nobody builds kernel with that option.
In most cases, store_release and load_acquire are cheaper as they're
more specific.  On x86, store_release and load_acquire boil down to
compiler reordering barriers.  You're running in the opposite
direction.

> > There's no reason to be overly smart here.  This isn't a crazy hot
> > path, write barriers tend to be very cheap, store_release more so.
> > Please just do smp_store_release() and note what it's paired with.
> 
> It's not about being overly smart.  It's about actually understanding
> what's going on with the code.  I've seen too many instances of
> people simply sprinkling synchronisation primitives around without
> any knowledge of what is happening underneath, which is just a recipe
> for creating hard-to-debug races.

I mean, read this thread.  It's one subtle breakage after another, one
confusion after another.  The problematic usages of memory barriers
are usually of two types.

1. Misunderstand what barriers do and fail to, most frequently, pair
   the barriers correctly.  This leads to things like lone smp_wmb()s
   which don't do anything but provide false sense of security.

2. The usage itself is correct but not properly documented and it's
   not clear what's being synchronized.  Because there's nothing
   inherently pairing the matching barrier pairs, w/o proper
   documentation, it can be very challenging to track down what is
   being synchronized making it difficult to tell this case from 1.
   Note that this is another reason smp_store_release() and
   smp_load_acquire() are just better.  Their specificity not only
   makes them lighter but also makes it a lot easier to track down
   what's going on.

So, when using barriers, we want to first pick the most specific
barrier pairs that suit the use case and clearly identify the matching
pairs and describe any subtlties if there's any.  Here, it's a
straight-forward writer to reader interlocking.

Once we meet the above criteria, we do want to be as simple as the use
case allows it to be.  e.g. if there are multiples readers,
encapsulate them and document them centrally.  If there are different
classes of readers and it's worthwhile to apply different
synchronization schemes to them, use separate helpers or clearly mark
and document exceptions with rationale.

> > I don't think you can skip load_acquire here just because this is the
> > second deref of the variable.  That doesn't change anything.  Race
> > condition could still happen between the first and second tests and
> > skipping the second would lead to the same kind of bug.
> 
> The reason this one is OK is because we do not use nlk->portid or
> try to get nlk from the hash table before we return to user-space.

What happens if somebody later adds code below that which happens to
use portid?  You're creating a booby trap and the patch isn't even
properly documenting what's going on.

> ---8<---
> The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> Fix autobind race condition that leads to zero port ID") created
> some new races that can occur due to inconcsistencies between the
> two port IDs.
> 
> Tejun is right that a barrier is unavoidable.  Therefore I am
> reverting to the original patch that used a boolean to indicate
> that a user netlink socket has been bound.
> 
> Barriers have been added where necessary to ensure that a valid
> portid and the hashed socket is visible.
> 
> I have also changed netlink_insert to only return EBUSY if the
> socket is bound to a portid different to the requested one.  This
> combined with only reading nlk->bound once in netlink_bind fixes
> a race where two threads that bind the socket at the same time
> with different port IDs may both succeed.
> 
> Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

This is a pretty misguided use of barriers.

Nacked-by: Tejun Heo <tj@kernel.org>

Thanks.
Linus Torvalds Sept. 22, 2015, 6:42 p.m. UTC | #2
On Tue, Sep 22, 2015 at 9:10 AM, Tejun Heo <tj@kernel.org> wrote:
>
> That's a pentium pro era errata.  Virtually no working machine is
> affected by that anymore and nobody builds kernel with that option.
> In most cases, store_release and load_acquire are cheaper as they're
> more specific.  On x86, store_release and load_acquire boil down to
> compiler reordering barriers.  You're running in the opposite
> direction.

Well, to be fair, there are lots of machines where acquire/release is
actually quite expensive.

In general, the cheapest barrier there is (apart from the "no barrier
at all" or just "compiler barrier") is "smp_wmb()".  If an
architecture gets that one wrong, the architects were f*cking morons.
It should be a fundamentally cheap operation, since writes are
buffered and it should simply be a buffer barrier.

The acquire/release things are generally fairly cheap on modern
architectures. Not free (except on x86), but fairly low-cost. HOWEVER,
they are not at all free on some older architectures, including 32-bit
ARM.

smp_rmb() should generally be about the same cost as an acquire. It
can go either way.

So *if* the algorithm is amenable to smp_wmb()/smp_rmb() kind of
barriers, that's actually quite possibly better than acquire/release.

smp_mb() is expensive pretty much everywhere.

Looking forward, I suspect long-term acquire/release is what hardware
is going to be "reasonably good at", but as things are right now, you
can't necessarily rely on them being fast.

                   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Tejun Heo Sept. 22, 2015, 6:53 p.m. UTC | #3
Hello, Linus.

On Tue, Sep 22, 2015 at 11:42:33AM -0700, Linus Torvalds wrote:
...
> smp_rmb() should generally be about the same cost as an acquire. It
> can go either way.
> 
> So *if* the algorithm is amenable to smp_wmb()/smp_rmb() kind of
> barriers, that's actually quite possibly better than acquire/release.

I see.  The write path here is cold so the competition is between rmb
and acquire.  Unless some significant archs completely screwed it up,
acquire still seems like the better option.  It's essentially free on
x86 after all.

Thanks.
Linus Torvalds Sept. 22, 2015, 7:28 p.m. UTC | #4
On Tue, Sep 22, 2015 at 11:53 AM, Tejun Heo <tj@kernel.org> wrote:
>
> I see.  The write path here is cold so the competition is between rmb
> and acquire.  Unless some significant archs completely screwed it up,
> acquire still seems like the better option.  It's essentially free on
> x86 after all.

Both acquire and smp_rmb() are equally free on x86.

It appears that we don't do the X86_PPRO_FENCE bug handling for
acquire, but I guess we should.

Or possibly we should start deprecating the insane X86_PPRO_FENCE
entirely. It's very costly, for dubious advantages.

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Tejun Heo Sept. 22, 2015, 7:50 p.m. UTC | #5
Hello,

On Tue, Sep 22, 2015 at 12:28:45PM -0700, Linus Torvalds wrote:
> Both acquire and smp_rmb() are equally free on x86.

Was this always like this?  Ah, okay, from 2007.  Was thinking it's
still doing an actual rmb.  Talk about being confused.

> It appears that we don't do the X86_PPRO_FENCE bug handling for
> acquire, but I guess we should.

Hmmm?  We're doing that.  PPRO_FENCE makes both acquire and release
use smp_mb().

Thanks.
Linus Torvalds Sept. 22, 2015, 8:03 p.m. UTC | #6
On Tue, Sep 22, 2015 at 12:50 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Hmmm?  We're doing that.  PPRO_FENCE makes both acquire and release
> use smp_mb().

Oh, right you are. I just grepped for rmb, and missed it because as
you say, it's a full mb.

The PPRO_FENCE_BUG thing is rather questionable. I'm not sure it's
even documented, but I can't find the official PPro errata now. I
think I discussed it with Andy Glew back when Intel was starting to
document the memory ordering rules.

I'd be willing to get rid of it. It was apparently just an early
stepping or two.

Anybody with better google-fu than me who can find the official Intel
PPro errata?

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Bjørn Mork Sept. 22, 2015, 8:36 p.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> writes:

> The PPRO_FENCE_BUG thing is rather questionable. I'm not sure it's
> even documented, but I can't find the official PPro errata now. I
> think I discussed it with Andy Glew back when Intel was starting to
> document the memory ordering rules.
>
> I'd be willing to get rid of it. It was apparently just an early
> stepping or two.
>
> Anybody with better google-fu than me who can find the official Intel
> PPro errata?

http://download.intel.com/design/archives/processors/pro/docs/24268935.pdf

Says "NoFix" for erratas 66 and 92.


Bjørn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Linus Torvalds Sept. 22, 2015, 9:04 p.m. UTC | #8
On Tue, Sep 22, 2015 at 1:36 PM, Bjørn Mork <bjorn@mork.no> wrote:
>
> http://download.intel.com/design/archives/processors/pro/docs/24268935.pdf
>
> Says "NoFix" for erratas 66 and 92.

Yeah, 66 and 92 do look like they could cause the apparent ordering of
accesses to be violated. That said, both of them <i>seem</i> to be
"processor had exclusive access to line A, and gave it away but ended
up still reading now-stale data".

And that's not what we use "smp_wmb()" or "smp_rmb()" to protect
against. If we did a write and then wanted to do an ordered read, we'd
use smp_mb(), which always does that barrier.

So I don't know whether either of those really merit our PPRO
workaround. Cache coherency is hard.

There's also errata 41, which looks like it would be a bad situation.

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Herbert Xu Sept. 23, 2015, 6:13 a.m. UTC | #9
On Tue, Sep 22, 2015 at 12:10:56PM -0400, Tejun Heo wrote:
>
> That's a pentium pro era errata.  Virtually no working machine is
> affected by that anymore and nobody builds kernel with that option.
> In most cases, store_release and load_acquire are cheaper as they're
> more specific.  On x86, store_release and load_acquire boil down to
> compiler reordering barriers.  You're running in the opposite
> direction.

It doesn't matter on x86 but the semantics of smp_load_acquire
and smp_store_release explicitly includes a full barrier which
is something that we don't need.

> I mean, read this thread.  It's one subtle breakage after another, one
> confusion after another.  The problematic usages of memory barriers
> are usually of two types.

smp_load_acquire/store_release isn't some kind of magic dust
that you can just spread around to fix races.  Whoever is writing
the code had better understood what the code is doing or they will
end up creating more bugs.

Having said that, I very much appreciate the work you have done
in reviewing my patches and pointing out holes in them.  Please
continue to do so and I will look at any real issues that you
discover.

> 1. Misunderstand what barriers do and fail to, most frequently, pair
>    the barriers correctly.  This leads to things like lone smp_wmb()s
>    which don't do anything but provide false sense of security.

smp_store_release can give you exactly the same kind of false
sense of security.

> 2. The usage itself is correct but not properly documented and it's
>    not clear what's being synchronized.  Because there's nothing
>    inherently pairing the matching barrier pairs, w/o proper
>    documentation, it can be very challenging to track down what is
>    being synchronized making it difficult to tell this case from 1.
>    Note that this is another reason smp_store_release() and
>    smp_load_acquire() are just better.  Their specificity not only
>    makes them lighter but also makes it a lot easier to track down
>    what's going on.

Well if there is anything unclear in my patch please point them out
and I will rewrite and/or add comments where necessary.

> > The reason this one is OK is because we do not use nlk->portid or
> > try to get nlk from the hash table before we return to user-space.
> 
> What happens if somebody later adds code below that which happens to
> use portid?  You're creating a booby trap and the patch isn't even
> properly documenting what's going on.

The same thing can still happen even if you use smp_load_acquire.
Someone may add a read prior to it or add a second smp_load_acquire
and then screw up the logic as we have seen in netlink_bind.

As I said whoever is touching these lockless code paths need to
understand what is going on.  There is no easy way out.

Cheers,
Tejun Heo Sept. 23, 2015, 3:54 p.m. UTC | #10
Hello, Herbert.

On Wed, Sep 23, 2015 at 02:13:42PM +0800, Herbert Xu wrote:
> It doesn't matter on x86 but the semantics of smp_load_acquire
> and smp_store_release explicitly includes a full barrier which
> is something that we don't need.

Yeah, I was confused there.  I was thinking smp_rmb() was still doing
rmb() on x86.  smp_rmb/wmb() is the better pick here.

> > I mean, read this thread.  It's one subtle breakage after another, one
> > confusion after another.  The problematic usages of memory barriers
> > are usually of two types.
> 
> smp_load_acquire/store_release isn't some kind of magic dust
> that you can just spread around to fix races.  Whoever is writing
> the code had better understood what the code is doing or they will
> end up creating more bugs.

Hmm... lemme try again.  When using barriers or RCU, it's desirable to
establish certain invariants because it usually is extremely easy to
miss corner cases.  It is helpful to have an abstraction, even if just
conceptual, where people can go "this thing is barrier / RCU protected
to guarantee XYZ".  Going more towards RCU example, this is why we
annotate variables as RCU protected to detect incorrect usages.  There
sure are exceptions but most are of the sort "this is write path and
protected by something else which is annotated differently".  Doing
things this way makes it a lot easier to get right.

Now, going back to barrier example.  If we take a similar approach,
there can be two cases.

1. Read of the boolean is barrier protected.  It's known that the
   condition that the boolean indicates is visible as true once the
   test succeeds.

2. Directly test the boolean inside write locking.  As the whole thing
   is protected by the same lock, we know that the indicated condition
   always matches what's visible.

In a lot of cases, separating out 2 doesn't matter all that much and
we sometimes skip it.  If we do that, either the protection should be
fairly obvious or, if there are multiple of those and the code path
isn't quite trivial, a different helper with extra lockdep annotation
can be used.

No matter what we do, please realize that we keep the invariant that
once the test evaulates to true the visible state matches what's
expected from such result.

What you're proposing introduces a third case in the above scenario
where how the test is performed becomes dependent on not only the
context of the test (which often can be annotated) but also the
following code does with the result of the test.  There are
exceptional cases where this makes sense but it's inherently hairy and
we only do things like that only if it's absoultely necessary and with
a lot of caution.  That isn't the case here.

You said that barriers aren't magic bullets.  Exactly, they aren't
anything special and whatever we do with them should be a reasonable
trade-off.  We've had incorrect and/or incomprehensible barrier usages
but that's not because people weren't going to 11 with scrutinizing
each deref and mixing barriered and non-barriered accesses.  If
anything, doing things like that without good enough rationale and
documentation (both are lacking here) is likely to lead to code base
which is difficult to comprehend and easy to get wrong.

> Having said that, I very much appreciate the work you have done
> in reviewing my patches and pointing out holes in them.  Please
> continue to do so and I will look at any real issues that you
> discover.

It isn't an accident that so many attempts in this thread were
errorneous.  You are taking the wrong approach and it's gonna cause
the same kind of failures in the future and there is no actual benefit
we're getting out of it.

> The same thing can still happen even if you use smp_load_acquire.
> Someone may add a read prior to it or add a second smp_load_acquire
> and then screw up the logic as we have seen in netlink_bind.

So, you're worrying about the following?

	portid = nlk->portid;
	if (nlk_bound(nlk)) {
		// use portid
	}

How is that even comparable?

> As I said whoever is touching these lockless code paths need to
> understand what is going on.  There is no easy way out.

There is a pretty wide gap between "no easy way out" and "lemme make
this as painful as possible so that people including myself get it
wrong most of the time for no good reason".

Thanks.
Herbert Xu Sept. 24, 2015, 2:30 a.m. UTC | #11
On Wed, Sep 23, 2015 at 11:54:40AM -0400, Tejun Heo wrote:
> 
> Hmm... lemme try again.  When using barriers or RCU, it's desirable to
> establish certain invariants because it usually is extremely easy to
> miss corner cases.  It is helpful to have an abstraction, even if just
> conceptual, where people can go "this thing is barrier / RCU protected
> to guarantee XYZ".  Going more towards RCU example, this is why we
> annotate variables as RCU protected to detect incorrect usages.  There
> sure are exceptions but most are of the sort "this is write path and
> protected by something else which is annotated differently".  Doing
> things this way makes it a lot easier to get right.

Well if someone provided helpers which

1) uses smp_wmb and smp_rmb instead of full barriers;
2) provides raw variants for the cases that barriers aren't needed

then I'm more than happy to use them.

Having reviewed the situation again I'm even more convincend
now that smp_load_acquire/smp_store_release aren't the appropriate
primitives for us.  They are meant for situations that are similar
to spin lock/unlock where you need to prevent all reads/writes from
floating above or below the load/store, respectively.

For our situation we only need write or read ordering, so they are
literally the wrong tool for the job and will only cause confusion
in future when someone tries to do a major rewrite of the code and
they will be scratching their head as to why we needed locking-like
semantics here.

Cheers,
Tejun Heo Sept. 24, 2015, 2:46 a.m. UTC | #12
Hello, Herbert.

On Thu, Sep 24, 2015 at 10:30:11AM +0800, Herbert Xu wrote:
> Well if someone provided helpers which
> 
> 1) uses smp_wmb and smp_rmb instead of full barriers;

This part is fine.

> 2) provides raw variants for the cases that barriers aren't needed

Hmm... It looks like I'm failing at communicating.  Lemme try again.
There are two situations where we do this.

1. When there are different locking contexts.  In this case, the write
   path is.  It's already protected by the spinlock so the barrier
   isn't necessary.

2. When the path is hot enough for the cost of smp_rmb() to matter and
   the specifics of individual deref allows for micro optimization and
   justifies the added overhead in terms of increased fragility,
   complexity and need for documentation.

In both cases, we want to make reasonable trade-offs like any other
choices we make.  We don't go off and run to one extreme or the other
just because barriers are involved.  One good measure to use is
whether the extra documentation necessary is justifiable.  In this
case, on each unprotected derefs, we want to explain why the
unprotected deref is okay and justified.

> then I'm more than happy to use them.
> 
> Having reviewed the situation again I'm even more convincend
> now that smp_load_acquire/smp_store_release aren't the appropriate
> primitives for us.  They are meant for situations that are similar
> to spin lock/unlock where you need to prevent all reads/writes from
> floating above or below the load/store, respectively.
>
> For our situation we only need write or read ordering, so they are
> literally the wrong tool for the job and will only cause confusion
> in future when someone tries to do a major rewrite of the code and
> they will be scratching their head as to why we needed locking-like
> semantics here.

store_release/load_acquire vs. wmb/rmb is a separate issue.  I no
longer have objections against using wmb/rmb pairs here although I do
wanna note that eventually I think release/acquire are likely to be
more prevalent but that's a separate discussion.

Thanks.
Herbert Xu Sept. 24, 2015, 2:54 a.m. UTC | #13
On Wed, Sep 23, 2015 at 10:46:08PM -0400, Tejun Heo wrote:
> 
> Hmm... It looks like I'm failing at communicating.  Lemme try again.
> There are two situations where we do this.
> 
> 1. When there are different locking contexts.  In this case, the write
>    path is.  It's already protected by the spinlock so the barrier
>    isn't necessary.
> 
> 2. When the path is hot enough for the cost of smp_rmb() to matter and
>    the specifics of individual deref allows for micro optimization and
>    justifies the added overhead in terms of increased fragility,
>    complexity and need for documentation.
> 
> In both cases, we want to make reasonable trade-offs like any other
> choices we make.  We don't go off and run to one extreme or the other
> just because barriers are involved.  One good measure to use is
> whether the extra documentation necessary is justifiable.  In this
> case, on each unprotected derefs, we want to explain why the
> unprotected deref is okay and justified.

What I am concerned about is the next guy who comes along and
does a rewrite like the one that introduced the netlink_bind
bug.  That person needs to fully understand what each primitive
is protecting against.

Using primitives where they're not needed can lead to misunderstandings
which may end up causing bugs.

Honestly I don't care whether you have a barrier there or not as
I only use x86.  But you very much should add a comment at least
saying that the barrier isn't needed for the cases where I left it
out so that future developers don't get confused.

Cheers,
Tejun Heo Sept. 24, 2015, 3:06 a.m. UTC | #14
Hello,

On Thu, Sep 24, 2015 at 10:54:36AM +0800, Herbert Xu wrote:
> What I am concerned about is the next guy who comes along and
> does a rewrite like the one that introduced the netlink_bind
> bug.  That person needs to fully understand what each primitive
> is protecting against.
> 
> Using primitives where they're not needed can lead to misunderstandings
> which may end up causing bugs.

I think this is where we're not agreeing.  My point is that better
understanding and lower likelihood of bug doesn't equate specializing
each usage site.  That's a lot more likely to lead to unnecessary
cognition overhead and naturally errors.  There's no reason to require
such error-prone and specific understanding of each usage site when we
can have agreed-upon abstractions which yield invariants which are a
lot easier for people to wrap their heads around.

This isn't an isolated one-off barrier hack.  This is a
well-established pattern and sure there are cases we wanna deconstruct
that and make exceptions but that needs to be justifiable.  The
overhead gotta buy us something.  Here it just doesn't.

Thanks.
Herbert Xu Sept. 24, 2015, 3:21 a.m. UTC | #15
On Wed, Sep 23, 2015 at 11:06:09PM -0400, Tejun Heo wrote:
> 
> I think this is where we're not agreeing.  My point is that better
> understanding and lower likelihood of bug doesn't equate specializing
> each usage site.  That's a lot more likely to lead to unnecessary
> cognition overhead and naturally errors.  There's no reason to require
> such error-prone and specific understanding of each usage site when we
> can have agreed-upon abstractions which yield invariants which are a
> lot easier for people to wrap their heads around.

Well we'll have to agree to disagree on that one.  I have seen too
many instances over the years where people post patches that use
primitives such as RCU and think that they must be safe because
it compiles with no warnings (and probably even runs).

Cheers,
Tejun Heo Sept. 24, 2015, 3:29 a.m. UTC | #16
Hello,

On Thu, Sep 24, 2015 at 11:21:00AM +0800, Herbert Xu wrote:
> Well we'll have to agree to disagree on that one.  I have seen too
> many instances over the years where people post patches that use
> primitives such as RCU and think that they must be safe because
> it compiles with no warnings (and probably even runs).

So, while that also has been a common failure mode that we've been
seeing with barrier usages, what you're suggesting isn't the right
balance either.  It's error-prone in a different way as amply
exemplified in this very thread.  It ended up making what should have
been a straight-forward writer-reader interlocking into a maze in
which one can easily be lost.  I think you should be able to see that
after this thread.

Both misusages can be solved by understanding and sticking to
established patterns and making exceptions only when explicitly
justifiable and with ample explanation.

Thanks.
Herbert Xu Sept. 24, 2015, 3:31 a.m. UTC | #17
On Wed, Sep 23, 2015 at 11:29:28PM -0400, Tejun Heo wrote:
> 
> So, while that also has been a common failure mode that we've been
> seeing with barrier usages, what you're suggesting isn't the right
> balance either.  It's error-prone in a different way as amply
> exemplified in this very thread.  It ended up making what should have
> been a straight-forward writer-reader interlocking into a maze in
> which one can easily be lost.  I think you should be able to see that
> after this thread.

No this isn't what happened.  My error was trying to see if there
is a way to do it without barriers.  In the end there wasn't.  This
has nothing to do with using primitives.

Cheers,
Tejun Heo Sept. 24, 2015, 3:41 a.m. UTC | #18
On Thu, Sep 24, 2015 at 11:31:17AM +0800, Herbert Xu wrote:
> No this isn't what happened.  My error was trying to see if there
> is a way to do it without barriers.  In the end there wasn't.  This
> has nothing to do with using primitives.

Hmmm... yeah, you can say that, but it still was a failure to
recognize and apply the common pattern and what you're suggesting is
deviating for no good reason.  It demands a lot of cognitive overhead
for something which should be routine and makes the code a lot more
fragile as a result.  Things like this make barrier usages difficult
to understand and verify because it takes away a lot of ready-made
cognitive tools.  So, let's please stick to the known pattern.

Thanks.
Herbert Xu Sept. 24, 2015, 3:42 a.m. UTC | #19
On Wed, Sep 23, 2015 at 11:41:16PM -0400, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 11:31:17AM +0800, Herbert Xu wrote:
> > No this isn't what happened.  My error was trying to see if there
> > is a way to do it without barriers.  In the end there wasn't.  This
> > has nothing to do with using primitives.
> 
> Hmmm... yeah, you can say that, but it still was a failure to
> recognize and apply the common pattern and what you're suggesting is
> deviating for no good reason.  It demands a lot of cognitive overhead
> for something which should be routine and makes the code a lot more
> fragile as a result.  Things like this make barrier usages difficult
> to understand and verify because it takes away a lot of ready-made
> cognitive tools.  So, let's please stick to the known pattern.

Well I disagree so let's leave it at that.
Tejun Heo Sept. 24, 2015, 3:43 a.m. UTC | #20
On Thu, Sep 24, 2015 at 11:42:14AM +0800, Herbert Xu wrote:
> Well I disagree so let's leave it at that.

Leaving things disagreed is fine but there's still a patch to commit
here, so I get that you're still dead against just applying the
pattern?
Herbert Xu Sept. 24, 2015, 3:44 a.m. UTC | #21
On Wed, Sep 23, 2015 at 11:43:21PM -0400, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 11:42:14AM +0800, Herbert Xu wrote:
> > Well I disagree so let's leave it at that.
> 
> Leaving things disagreed is fine but there's still a patch to commit
> here, so I get that you're still dead against just applying the
> pattern?

Honestly I don't care anymore.  Feel free to do whatever you
want.
David Miller Sept. 24, 2015, 7:11 p.m. UTC | #22
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 22 Sep 2015 11:38:56 +0800

> The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> Fix autobind race condition that leads to zero port ID") created
> some new races that can occur due to inconcsistencies between the
> two port IDs.
> 
> Tejun is right that a barrier is unavoidable.  Therefore I am
> reverting to the original patch that used a boolean to indicate
> that a user netlink socket has been bound.
> 
> Barriers have been added where necessary to ensure that a valid
> portid and the hashed socket is visible.
> 
> I have also changed netlink_insert to only return EBUSY if the
> socket is bound to a portid different to the requested one.  This
> combined with only reading nlk->bound once in netlink_bind fixes
> a race where two threads that bind the socket at the same time
> with different port IDs may both succeed.
> 
> Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I've decided to apply this and queue it up for -stable.

Thanks everyone.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Tejun Heo Sept. 24, 2015, 8:05 p.m. UTC | #23
Hello, David.

On Thu, Sep 24, 2015 at 12:11:42PM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 22 Sep 2015 11:38:56 +0800
> 
> > The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> > Fix autobind race condition that leads to zero port ID") created
> > some new races that can occur due to inconcsistencies between the
> > two port IDs.
...
> I've decided to apply this and queue it up for -stable.

This is mostly correct; however, if there are or can be in-kernel
users which create the client side of netlink socket, it isn't.  Let's
say such in-kernel user does kernel_connect() and then query the
assigned port number by kernel_getsockname().  That can just return
zero.  Maybe such scenario is not possible for some combination of
reasons but why leak this level of synchronization detail to the users
in the first place?  This should be terminated from the site where
such synchronization scheme is implemented.  This expands the scope of
correctness verification to all possible users of these functions.

Thanks.

Patch
diff mbox series

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 303efb7..2c15fae 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1015,7 +1015,7 @@  static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->rhash_portid != x->portid ||
+	return nlk->portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1041,7 +1041,7 @@  static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1094,8 +1094,8 @@  static int netlink_insert(struct sock *sk, u32 portid)
 
 	lock_sock(sk);
 
-	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	err = nlk_sk(sk)->portid == portid ? 0 : -EBUSY;
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1103,7 +1103,7 @@  static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->rhash_portid = portid;
+	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1119,7 +1119,9 @@  static int netlink_insert(struct sock *sk, u32 portid)
 		goto err;
 	}
 
-	nlk_sk(sk)->portid = portid;
+	/* We need to ensure that the socket is hashed and visible. */
+	smp_wmb();
+	nlk_sk(sk)->bound = portid;
 
 err:
 	release_sock(sk);
@@ -1505,6 +1507,7 @@  static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
 	int err;
 	long unsigned int groups = nladdr->nl_groups;
+	bool bound;
 
 	if (addr_len < sizeof(struct sockaddr_nl))
 		return -EINVAL;
@@ -1521,9 +1524,14 @@  static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	bound = nlk->bound;
+	if (bound) {
+		/* Ensure nlk->portid is up-to-date. */
+		smp_rmb();
+
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
+	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1539,7 +1547,10 @@  static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1587,7 +1598,10 @@  static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2428,10 +2442,13 @@  static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
+	} else {
+		/* Ensure nlk is hashed and visible. */
+		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped
@@ -3257,7 +3274,7 @@  static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index c96dfa3..e6aae40 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,7 +25,6 @@  struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
-	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
@@ -36,6 +35,7 @@  struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;