* futex atomic vs ordering constraints
@ 2015-08-26 18:16 Peter Zijlstra
  2015-08-29  1:33 ` Davidlohr Bueso
  2015-09-01 16:31 ` Will Deacon
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-08-26 18:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Oleg Nesterov, Paul McKenney, Ingo Molnar,
	mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf, ddaney,
	Will Deacon, linux-kernel

Hi all,

I tried to keep this email short, but failed miserably at this. For
the TL;DR skip to the tail.

So the question of ordering constraints of futex atomic operations has
come up recently:

  http://marc.info/?l=linux-kernel&m=143894765931868

This email will attempt to describe the two primitives and start a
discussion on the constraints.


 * futex_atomic_op_inuser()


There is but a single callsite of this function: futex_wake_op().
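
For reference, roughly what drives it from userspace (illustrative only,
see futex(2); the helper name is made up):

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static long futex_wake_op_user(uint32_t *uaddr, unsigned int nr_wake,
			       uint32_t *uaddr2, unsigned int nr_wake2,
			       uint32_t encoded_op)
{
	/*
	 * The kernel atomically does
	 *	oldval = *uaddr2; *uaddr2 = op(oldval, oparg);
	 * -- that is futex_atomic_op_inuser() -- then wakes up to
	 * nr_wake waiters on uaddr and, if cmp(oldval, cmparg) holds,
	 * up to nr_wake2 waiters on uaddr2.  nr_wake2 travels in the
	 * timeout slot of the raw syscall.
	 */
	return syscall(SYS_futex, uaddr, FUTEX_WAKE_OP, nr_wake,
		       (unsigned long)nr_wake2, uaddr2, encoded_op);
}

/* e.g. store 0 into *uaddr2, wake a waiter on it if it was positive:
 *	futex_wake_op_user(uaddr, 1, uaddr2, 1,
 *			   FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 0));
 */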

It being part of a wake primitive seems to suggest a (RCsc) RELEASE is
the strongest required (the RCsc part because I don't think we want to
expose RCpc to userspace if we don't have to).

The immediate scenario where this is important is:

        CPU0            CPU1            CPU2

        futex_lock(); -> uncontended user acquire
        A = 1;
                        futex_lock(); -> kernel, set pending, sleep
        B = 1;
        futex_unlock();
          if pending
            <kernel>
              futex_wake_op
                spin_lock(bh->lock)
                RELEASE
                futex_atomic_op_inuser(); -> futex unlocked
                                        futex_lock() -> uncontended user steal
                                        load A;

In other words, the moment we perform the WAKE_OP userspace can observe
the 'lock' as unlocked and do a lock (steal) acquire of the 'lock'.

If userspace succeeds with this acquire, we need full serialization of
the locked (RCsc) variables (eg A and B in the above).
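
To make the steal concrete, a minimal sketch of the userspace acquire
path the diagram assumes (hypothetical 0/1/2 lock word encoding, names
are illustrative, not any particular libc):

#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* 0 = unlocked, 1 = locked, 2 = locked, waiters pending */
static void user_futex_lock(atomic_uint *lock)
{
	unsigned int expect = 0;

	/*
	 * CPU0's "uncontended user acquire" and CPU2's "user steal":
	 * the instant the kernel's futex_atomic_op_inuser() stores 0,
	 * this cmpxchg may succeed and the subsequent "load A" must
	 * observe CPU0's A = 1 and B = 1.
	 */
	if (atomic_compare_exchange_strong_explicit(lock, &expect, 1,
			memory_order_acquire, memory_order_relaxed))
		return;

	/* CPU1's path: mark the lock contended and sleep in the kernel */
	while (atomic_exchange_explicit(lock, 2, memory_order_acquire) != 0)
		syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
}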

Of course, if anything else prior to futex_atomic_op_inuser() implies an
(RCsc) RELEASE or stronger the primitive can do without providing
anything itself.

This turns out to be the case, a successful get_futex_key() implies a
full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
wakeups").

And since get_futex_key() is fundamental to doing _anything_ with a
futex, I think it's semi-sane to rely on this.

So we have two valid options:

 - RCsc RELEASE
 - no ordering at all


Current implementation:

alpha:	 MB ll/sc		RELEASE
arm64:	 ll/sc-release MB	FULL
arm:	 MB ll/sc		RELEASE
mips:	 ll/sc MB		ACQUIRE
powerpc: lwsync ll/sc sync	FULL



 * futex_atomic_cmpxchg_inatomic()

This is called from:

	lock_pi_update_atomic
	wake_futex_pi
	fixup_pi_state_owner
	futex_unlock_pi
	handle_futex_death


But I think we can form a position from just two of them:

  futex_unlock_pi() and lock_pi_update_atomic()

these end up being ACQUIRE and RELEASE, and a combination of these two
would give us a requirement for full serialization.
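
Roughly (hypothetical helpers, not the actual kernel/futex.c code; only
the cmpxchg and the ordering it needs matter here):

/* lock_pi_update_atomic()-ish: take the user word, 0 -> TID */
static int pi_word_acquire(u32 __user *uaddr, u32 tid)
{
	u32 cur;

	/* ACQUIRE: the critical section must not leak above this */
	if (futex_atomic_cmpxchg_inatomic(&cur, uaddr, 0, tid))
		return -EFAULT;
	return cur == 0;		/* 1: we now own it */
}

/* futex_unlock_pi()-ish: hand the user word over to the next owner */
static int pi_word_release(u32 __user *uaddr, u32 owner, u32 newval)
{
	u32 cur;

	/* RELEASE: the critical section must not leak below this */
	if (futex_atomic_cmpxchg_inatomic(&cur, uaddr, owner, newval))
		return -EFAULT;
	return cur == owner;		/* 1: hand-over done */
}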

And unlike the previous case, we cannot talk this one away. Even though
every futex op needs a get_futex_key(), which implies a full memory
barrier, and every get_futex_key() is paired with a put_futex_key(), the
latter does _NOT_ imply a full barrier.

So while we could relax the RELEASE semantics we cannot relax the
ACQUIRE semantics.


Then there is handle_futex_death(), which is difficult. I _think_ it
wants to be a RELEASE, but state is corrupted anyhow and I can well
imagine not wanting to play any games here and going fully serialized
like we're used to with cmpxchg.

Now the robust code doesn't use {get,put}_futex_key(), so no
implied barriers here.

Which leaves us all with a great big mess.

Current implementation:

alpha:	 MB ll/sc		RELEASE
arm64:	 ll/sc-release MB	FULL
arm:	 MB ll/sc MB		FULL
mips:	 ll/sc MB		ACQUIRE
powerpc: lwsync ll/sc sync	FULL



There are a few options:

 1) punt, mandate they're both fully ordered and stop thinking about it

 2) make them both fully relaxed, rely on implied barriers and employ
    smp_mb__{before,after}_atomic in key places

Given the current state of things and that I don't really think there is
a compelling performance argument to be made for 2, I would suggest we
go with 1.

* Re: futex atomic vs ordering constraints
  2015-08-26 18:16 futex atomic vs ordering constraints Peter Zijlstra
@ 2015-08-29  1:33 ` Davidlohr Bueso
  2015-09-01 16:38   ` Peter Zijlstra
  2015-09-01 16:31 ` Will Deacon
  1 sibling, 1 reply; 15+ messages in thread
From: Davidlohr Bueso @ 2015-08-29  1:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, Vineet.Gupta1, ralf, ddaney,
	Will Deacon, linux-kernel

On Wed, 2015-08-26 at 20:16 +0200, Peter Zijlstra wrote:
> Of course, if anything else prior to futex_atomic_op_inuser() implies an
> (RCsc) RELEASE or stronger the primitive can do without providing
> anything itself.
> 
> This turns out to be the case, a successful get_futex_key() implies a
> full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
> wakeups").

Hmm, while it is certainly true that get_futex_key() implies a full
barrier, I don't see why you're referring to the recent wake_q stuff,
where the futex "wakeup" is done well after futex_atomic_op_inuser(). Yes,
that too implies a barrier, but not wrt get_futex_key() -- which
fundamentally relies on get_futex_key_refs().

> 
> And since get_futex_key() is fundamental to doing _anything_ with a
> futex, I think its semi-sane to rely on this.

Right, and it wouldn't be the first thing that relies on get_futex_key()
implying a full barrier.


* Re: futex atomic vs ordering constraints
  2015-08-26 18:16 futex atomic vs ordering constraints Peter Zijlstra
  2015-08-29  1:33 ` Davidlohr Bueso
@ 2015-09-01 16:31 ` Will Deacon
  2015-09-01 16:42   ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Will Deacon @ 2015-09-01 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel

Hi Peter,

On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> I tried to keep this email short, but failed miserably at this. For
> the TL;DR skip to the tail.

[...]

> There are a few options:
> 
>  1) punt, mandate they're both fully ordered and stop thinking about it
> 
>  2) make them both fully relaxed, rely on implied barriers and employ
>     smp_mb__{before,after}_atomic in key places
> 
> Given the current state of things and that I don't really think there is
> a compelling performance argument to be made for 2, I would suggest we
> go with 1.

I'd also go for (1). Since there is a userspace side to this, I'd *really*
like to avoid a potential situation on arm64 where the kernel builds its
side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
as acquire) and userspace builds its side out of native acquire/release
instructions and the two end up interacting badly (for example, loss of
SC).

Will

* Re: futex atomic vs ordering constraints
  2015-08-29  1:33 ` Davidlohr Bueso
@ 2015-09-01 16:38   ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-01 16:38 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, Vineet.Gupta1, ralf, ddaney,
	Will Deacon, linux-kernel

On Fri, Aug 28, 2015 at 06:33:06PM -0700, Davidlohr Bueso wrote:
> On Wed, 2015-08-26 at 20:16 +0200, Peter Zijlstra wrote:
> > Of course, if anything else prior to futex_atomic_op_inuser() implies an
> > (RCsc) RELEASE or stronger the primitive can do without providing
> > anything itself.
> > 
> > This turns out to be the case, a successful get_futex_key() implies a
> > full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
> > wakeups").
> 
> Hmm while it is certainly true that get_futex_key() implies a full
> barrier, I don't see why you're referring to the recent wake_q stuff;

D'oh, because I'm a sheep or so. I meant:

  b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)

* Re: futex atomic vs ordering constraints
  2015-09-01 16:31 ` Will Deacon
@ 2015-09-01 16:42   ` Peter Zijlstra
  2015-09-01 16:47     ` Will Deacon
  2015-09-01 19:05     ` Thomas Gleixner
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-01 16:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel

On Tue, Sep 01, 2015 at 05:31:40PM +0100, Will Deacon wrote:
> Hi Peter,
> 
> On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> > I tried to keep this email short, but failed miserably at this. For
> > the TL;DR skip to the tail.
> 
> [...]
> 
> > There are a few options:
> > 
> >  1) punt, mandate they're both fully ordered and stop thinking about it
> > 
> >  2) make them both fully relaxed, rely on implied barriers and employ
> >     smp_mb__{before,after}_atomic in key places
> > 
> > Given the current state of things and that I don't really think there is
> > a compelling performance argument to be made for 2, I would suggest we
> > go with 1.
> 
> I'd also go for (1). Since there is a userspace side to this, I'd *really*
> like to avoid a potential situation on arm64 where the kernel builds its
> side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
> as acquire) and userspace builds its side out of native acquire/release
> instructions and the two end up interacting badly (for example, loss of
> SC).

I thought your native acquire/release were RCsc, or is it that in
combination with the 'fancy' 'full' barrier of stlxr + dmb-ish something
goes sideways?

But yes, unless Thomas has other plans, I'll go ahead and create some
patches to make sure everybody is fully ordered so we can forget about
it again.

* Re: futex atomic vs ordering constraints
  2015-09-01 16:42   ` Peter Zijlstra
@ 2015-09-01 16:47     ` Will Deacon
  2015-09-01 19:05     ` Thomas Gleixner
  1 sibling, 0 replies; 15+ messages in thread
From: Will Deacon @ 2015-09-01 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel

On Tue, Sep 01, 2015 at 05:42:47PM +0100, Peter Zijlstra wrote:
> On Tue, Sep 01, 2015 at 05:31:40PM +0100, Will Deacon wrote:
> > On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> > > I tried to keep this email short, but failed miserably at this. For
> > > the TL;DR skip to the tail.
> > 
> > [...]
> > 
> > > There are a few options:
> > > 
> > >  1) punt, mandate they're both fully ordered and stop thinking about it
> > > 
> > >  2) make them both fully relaxed, rely on implied barriers and employ
> > >     smp_mb__{before,after}_atomic in key places
> > > 
> > > Given the current state of things and that I don't really think there is
> > > a compelling performance argument to be made for 2, I would suggest we
> > > go with 1.
> > 
> > I'd also go for (1). Since there is a userspace side to this, I'd *really*
> > like to avoid a potential situation on arm64 where the kernel builds its
> > side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
> > as acquire) and userspace builds its side out of native acquire/release
> > instructions and the two end up interacting badly (for example, loss of
> > SC).
> 
> I thought your native acquire/release were RCsc, or is it that in
> combination with the 'fancy' 'full' barrier of stlxr + dmb-ish something
> goes sideways?

Yeah, they don't interact nicely because you can lose the multi-copy
atomicity guarantees you get from using either native acquire/release
everywhere or explicit barriers everywhere. IRIW shows the failure
if you use {DMB; STR} for the writers and LDAR for the readers.
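
For reference, the shape of the test (plain C illustration only; it does
not express the mixed {DMB; STR} vs LDAR mapping, just which outcome has
to stay forbidden):

#include <stdatomic.h>

/* IRIW: two independent writers, two readers reading in opposite order,
 * each pN on its own CPU. */
atomic_int X, Y;
int r0, r1, r2, r3;

void p0(void) { atomic_store(&X, 1); }
void p1(void) { atomic_store(&Y, 1); }
void p2(void) { r0 = atomic_load(&X); r1 = atomic_load(&Y); }
void p3(void) { r2 = atomic_load(&Y); r3 = atomic_load(&X); }

/*
 * Forbidden under multi-copy atomicity (and under SC):
 *	r0 == 1 && r1 == 0 && r2 == 1 && r3 == 0
 * i.e. the two readers disagree on the order of the independent writes.
 * Mixing barrier-based writers with acquire-based readers is what risks
 * letting that outcome through.
 */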

> But yes, unless Thomas has other plans, I'll go ahead and create some
> patches to make sure everybody is fully ordered so we can forget about
> it again.

Sounds good to me!

Will

* Re: futex atomic vs ordering constraints
  2015-09-01 16:42   ` Peter Zijlstra
  2015-09-01 16:47     ` Will Deacon
@ 2015-09-01 19:05     ` Thomas Gleixner
  2015-09-02 12:55       ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2015-09-01 19:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel

On Tue, 1 Sep 2015, Peter Zijlstra wrote:
> On Tue, Sep 01, 2015 at 05:31:40PM +0100, Will Deacon wrote:
> > Hi Peter,
> > 
> > On Wed, Aug 26, 2015 at 07:16:59PM +0100, Peter Zijlstra wrote:
> > > I tried to keep this email short, but failed miserably at this. For
> > > the TL;DR skip to the tail.
> > 
> > [...]
> > 
> > > There are a few options:
> > > 
> > >  1) punt, mandate they're both fully ordered and stop thinking about it
> > > 
> > >  2) make them both fully relaxed, rely on implied barriers and employ
> > >     smp_mb__{before,after}_atomic in key places
> > > 
> > > Given the current state of things and that I don't really think there is
> > > a compelling performance argument to be made for 2, I would suggest we
> > > go with 1.
> > 
> > I'd also go for (1). Since there is a userspace side to this, I'd *really*
> > like to avoid a potential situation on arm64 where the kernel builds its
> > side of the futex using barrier instructions (e.g. treat LDR + smp_mb()
> > as acquire) and userspace builds its side out of native acquire/release
> > instructions and the two end up interacting badly (for example, loss of
> > SC).
> 
> I thought your native acquire/release were RCsc, or is it that in
> combination with the 'fancy' 'full' barrier of stlxr + dmb-ish something
> goes sideways?
> 
> But yes, unless Thomas has other plans, I'll go ahead and create some
> patches to make sure everybody is fully ordered so we can forget about
> it again.

No, I don't. There are too many ways to screw that up, so unless
someone has a serious performance issue, we should keep it on the safe
side.

Thanks,

	tglx


* Re: futex atomic vs ordering constraints
  2015-09-01 19:05     ` Thomas Gleixner
@ 2015-09-02 12:55       ` Peter Zijlstra
  2015-09-02 16:10         ` Chris Metcalf
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-02 12:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Will Deacon, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel, linux, rth, cmetcalf


So here goes..

Chris, I'm awfully sorry, but I seem to be Tile challenged.

TileGX seems to define:

#define smp_mb__before_atomic()	smp_mb()
#define smp_mb__after_atomic()	smp_mb()

However, its atomic_add_return() implementation looks like:

static inline int atomic_add_return(int i, atomic_t *v)
{
	int val;
	smp_mb();  /* barrier for proper semantics */
	val = __insn_fetchadd4((void *)&v->counter, i) + i;
	barrier();  /* the "+ i" above will wait on memory */
	return val;
}

Which leaves me confused on smp_mb__after_atomic().

That said, your futex ops seem to lack any memory barrier, so naively
I'd add both, it's just that your add_return() confuses me.

---
Subject: futex, arch: Make futex atomic ops fully ordered

There was no clear ordering requirement for the two futex atomic
primitives, which led to various archs implementing different things.

Since there is currently no clear performance argument (it's the syscall
'slow' path), favouring mandated full ordering over added complexity
seems the right call.

This patch adds the new ordering requirements to the function comments
(which it places in a more visible location) and updates all arches that
were not already fully ordered (I checked those that have
smp_mb__{before,after}_atomic() definitions, as those are the only ones
that can actually be more relaxed).

In particular:

futex_atomic_op_inuser():

alpha:   MB ll/sc   (RELEASE) -> MB ll/sc MB
arm:     MB ll/sc   (RELEASE) -> MB ll/sc MB
mips:    ll/sc MB   (ACQUIRE) -> MB ll/sc MB

futex_atomic_cmpxchg_inatomic():

alpha:   MB ll/sc   (RELEASE) -> MB ll/sc MB
mips:    ll/sc MB   (ACQUIRE) -> MB ll/sc MB

XXX tile (again).

Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/alpha/include/asm/futex.h | 10 +++++++---
 arch/arm/include/asm/futex.h   |  3 ++-
 arch/mips/include/asm/futex.h  |  4 ++++
 include/asm-generic/futex.h    | 26 +-------------------------
 include/linux/futex.h          | 35 +++++++++++++++++++++++++++++++++++
 5 files changed, 49 insertions(+), 29 deletions(-)

diff --git a/arch/alpha/include/asm/futex.h b/arch/alpha/include/asm/futex.h
index f939794363ac..e0d5edad2cd4 100644
--- a/arch/alpha/include/asm/futex.h
+++ b/arch/alpha/include/asm/futex.h
@@ -9,8 +9,8 @@
 #include <asm/barrier.h>
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg)	\
+	smp_mb();						\
 	__asm__ __volatile__(					\
-		__ASM_SMP_MB					\
 	"1:	ldl_l	%0,0(%2)\n"				\
 		insn						\
 	"2:	stl_c	%1,0(%2)\n"				\
@@ -27,7 +27,8 @@
 	"	.previous\n"					\
 	:	"=&r" (oldval), "=&r"(ret)			\
 	:	"r" (uaddr), "r"(oparg)				\
-	:	"memory")
+	:	"memory");					\
+	smp_mb()
 
 static inline int futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 {
@@ -90,8 +91,9 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
+	smp_mb();
+
 	__asm__ __volatile__ (
-		__ASM_SMP_MB
 	"1:	ldl_l	%1,0(%3)\n"
 	"	cmpeq	%1,%4,%2\n"
 	"	beq	%2,3f\n"
@@ -111,6 +113,8 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	:	"r"(uaddr), "r"((long)(int)oldval), "r"(newval)
 	:	"memory");
 
+	smp_mb();
+
 	*uval = prev;
 	return ret;
 }
diff --git a/arch/arm/include/asm/futex.h b/arch/arm/include/asm/futex.h
index 5eed82809d82..1be3be8eb8c2 100644
--- a/arch/arm/include/asm/futex.h
+++ b/arch/arm/include/asm/futex.h
@@ -34,7 +34,8 @@
 	__futex_atomic_ex_table("%5")				\
 	: "=&r" (ret), "=&r" (oldval), "=&r" (tmp)		\
 	: "r" (uaddr), "r" (oparg), "Ir" (-EFAULT)		\
-	: "cc", "memory")
+	: "cc", "memory");					\
+	smp_mb()
 
 static inline int
 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
diff --git a/arch/mips/include/asm/futex.h b/arch/mips/include/asm/futex.h
index 1de190bdfb9c..2f38cc84fb04 100644
--- a/arch/mips/include/asm/futex.h
+++ b/arch/mips/include/asm/futex.h
@@ -20,6 +20,8 @@
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg)		\
 {									\
+	smp_mb__before_llsc();						\
+									\
 	if (cpu_has_llsc && R10000_LLSC_WAR) {				\
 		__asm__ __volatile__(					\
 		"	.set	push				\n"	\
@@ -149,6 +151,8 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
+	smp_mb__before_llsc();
+
 	if (cpu_has_llsc && R10000_LLSC_WAR) {
 		__asm__ __volatile__(
 		"# futex_atomic_cmpxchg_inatomic			\n"
diff --git a/include/asm-generic/futex.h b/include/asm-generic/futex.h
index e56272c919b5..e3c2a0cea438 100644
--- a/include/asm-generic/futex.h
+++ b/include/asm-generic/futex.h
@@ -12,18 +12,6 @@
  *
  */
 
-/**
- * futex_atomic_op_inuser() - Atomic arithmetic operation with constant
- *			  argument and comparison of the previous
- *			  futex value with another constant.
- *
- * @encoded_op:	encoded operation to execute
- * @uaddr:	pointer to user space address
- *
- * Return:
- * 0 - On success
- * <0 - On error
- */
 static inline int
 futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 {
@@ -88,19 +76,6 @@ futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
 	return ret;
 }
 
-/**
- * futex_atomic_cmpxchg_inatomic() - Compare and exchange the content of the
- *				uaddr with newval if the current value is
- *				oldval.
- * @uval:	pointer to store content of @uaddr
- * @uaddr:	pointer to user space address
- * @oldval:	old value
- * @newval:	new value to store to @uaddr
- *
- * Return:
- * 0 - On success
- * <0 - On error
- */
 static inline int
 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval)
@@ -121,6 +96,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 }
 
 #else
+
 static inline int
 futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
 {
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 6435f46d6e13..fadb46f38800 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -55,6 +55,41 @@ union futex_key {
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
+
+/**
+ * futex_atomic_op_inuser() - Atomic arithmetic operation with constant
+ *			  argument and comparison of the previous
+ *			  futex value with another constant.
+ *
+ * @encoded_op:	encoded operation to execute
+ * @uaddr:	pointer to user space address
+ *
+ * This operation is assumed to be fully ordered.
+ *
+ * Return:
+ * 0 - On success
+ * <0 - On error
+ */
+extern int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr);
+
+/**
+ * futex_atomic_cmpxchg_inatomic() - Compare and exchange the content of the
+ *				uaddr with newval if the current value is
+ *				oldval.
+ * @uval:	pointer to store content of @uaddr
+ * @uaddr:	pointer to user space address
+ * @oldval:	old value
+ * @newval:	new value to store to @uaddr
+ *
+ * This operation is assumed to be fully ordered.
+ *
+ * Return:
+ * 0 - On success
+ * <0 - On error
+ */
+extern int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
+					 u32 oldval, u32 newval);
+
 #ifdef CONFIG_HAVE_FUTEX_CMPXCHG
 #define futex_cmpxchg_enabled 1
 #else

* Re: futex atomic vs ordering constraints
  2015-09-02 12:55       ` Peter Zijlstra
@ 2015-09-02 16:10         ` Chris Metcalf
  2015-09-02 17:00           ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Metcalf @ 2015-09-02 16:10 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Will Deacon, Linus Torvalds, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel, linux, rth

On 09/02/2015 08:55 AM, Peter Zijlstra wrote:
> So here goes..
>
> Chris, I'm awfully sorry, but I seem to be Tile challenged.
>
> TileGX seems to define:
>
> #define smp_mb__before_atomic()	smp_mb()
> #define smp_mb__after_atomic()	smp_mb()
>
> However, its atomic_add_return() implementation looks like:
>
> static inline int atomic_add_return(int i, atomic_t *v)
> {
> 	int val;
> 	smp_mb();  /* barrier for proper semantics */
> 	val = __insn_fetchadd4((void *)&v->counter, i) + i;
> 	barrier();  /* the "+ i" above will wait on memory */
> 	return val;
> }
>
> Which leaves me confused on smp_mb__after_atomic().

Are you concerned about whether it has proper memory
barrier semantics already, i.e. full barriers before and after?
In fact we do have a full barrier before, but then because of the
"+ i" / "barrier()", we know that the only other operation since
the previous mb(), namely the read of v->counter, has
completed after the atomic operation.  As a result we can
omit explicitly having a second barrier.

It does seem like all the current memory-order semantics are
correct, unless I'm missing something!

> That said, your futex ops seem to lack any memory barrier, so naively
> I'd add both, its just that your add_return() confuses me.

So something like this?

diff --git a/arch/tile/include/asm/futex.h b/arch/tile/include/asm/futex.h
index 1a6ef1b69cb1..0a5501b11d02 100644
--- a/arch/tile/include/asm/futex.h
+++ b/arch/tile/include/asm/futex.h
@@ -39,6 +39,7 @@
  #ifdef __tilegx__
  
  #define __futex_asm(OP) \
+	smp_mb();						\
  	asm("1: {" #OP " %1, %3, %4; movei %0, 0 }\n"		\
  	    ".pushsection .fixup,\"ax\"\n"			\
  	    "0: { movei %0, %5; j 9f }\n"			\
@@ -48,7 +49,8 @@
  	    ".popsection\n"					\
  	    "9:"						\
  	    : "=r" (ret), "=r" (val), "+m" (*(uaddr))		\
-	    : "r" (uaddr), "r" (oparg), "i" (-EFAULT))
+	    : "r" (uaddr), "r" (oparg), "i" (-EFAULT)); 	\
+	smp_mb()
  
  #define __futex_set() __futex_asm(exch4)
  #define __futex_add() __futex_asm(fetchadd4)
@@ -75,7 +77,10 @@
  
  #define __futex_call(FN)						\
  	{								\
-		struct __get_user gu = FN((u32 __force *)uaddr, lock, oparg); \
+		struct __get_user gu;					\
+		smp_mb();						\
+		gu = FN((u32 __force *)uaddr, lock, oparg);		\
+		/* See smp_mb__after_atomic() */			\
  		val = gu.val;						\
  		ret = gu.err;						\
  	}

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


* Re: futex atomic vs ordering constraints
  2015-09-02 16:10         ` Chris Metcalf
@ 2015-09-02 17:00           ` Peter Zijlstra
  2015-09-02 17:25             ` Chris Metcalf
  2015-09-02 21:18             ` Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-02 17:00 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Thomas Gleixner, Will Deacon, Linus Torvalds, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages, dvhart, dave,
	Vineet.Gupta1, ralf, ddaney, linux-kernel, linux, rth

On Wed, Sep 02, 2015 at 12:10:58PM -0400, Chris Metcalf wrote:
> On 09/02/2015 08:55 AM, Peter Zijlstra wrote:
> >So here goes..
> >
> >Chris, I'm awfully sorry, but I seem to be Tile challenged.
> >
> >TileGX seems to define:
> >
> >#define smp_mb__before_atomic()	smp_mb()
> >#define smp_mb__after_atomic()	smp_mb()
> >
> >However, its atomic_add_return() implementation looks like:
> >
> >static inline int atomic_add_return(int i, atomic_t *v)
> >{
> >	int val;
> >	smp_mb();  /* barrier for proper semantics */
> >	val = __insn_fetchadd4((void *)&v->counter, i) + i;
> >	barrier();  /* the "+ i" above will wait on memory */
> >	return val;
> >}
> >
> >Which leaves me confused on smp_mb__after_atomic().
> 
> Are you concerned about whether it has proper memory
> barrier semantics already, i.e. full barriers before and after?
> In fact we do have a full barrier before, but then because of the
> "+ i" / "barrier()", we know that the only other operation since
> the previous mb(), namely the read of v->counter, has
> completed after the atomic operation.  As a result we can
> omit explicitly having a second barrier.
> 
> It does seem like all the current memory-order semantics are
> correct, unless I'm missing something!

So I'm reading that code like:

	MB
 [RmW]	ret = *val += i


So what is stopping later memory ops like:

   [R]	a = *foo
   [S]	*bar = b

From getting reordered with the RmW, like:

	MB

   [R]	a = *foo
   [S]	*bar = b

 [RmW]	ret = *val += i

Are you saying Tile does not reorder things like that? If so, why then
is smp_mb__after_atomic() a full mb(). If it does, I don't see how your
add_return is correct.

Alternatively I'm just confused..

* Re: futex atomic vs ordering constraints
  2015-09-02 17:00           ` Peter Zijlstra
@ 2015-09-02 17:25             ` Chris Metcalf
  2015-09-02 21:18             ` Linus Torvalds
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Metcalf @ 2015-09-02 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Will Deacon, Linus Torvalds, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages, dvhart, dave,
	Vineet.Gupta1, ralf, ddaney, linux-kernel, linux, rth

On 09/02/2015 01:00 PM, Peter Zijlstra wrote:
> On Wed, Sep 02, 2015 at 12:10:58PM -0400, Chris Metcalf wrote:
>> On 09/02/2015 08:55 AM, Peter Zijlstra wrote:
>>> So here goes..
>>>
>>> Chris, I'm awfully sorry, but I seem to be Tile challenged.
>>>
>>> TileGX seems to define:
>>>
>>> #define smp_mb__before_atomic()	smp_mb()
>>> #define smp_mb__after_atomic()	smp_mb()
>>>
>>> However, its atomic_add_return() implementation looks like:
>>>
>>> static inline int atomic_add_return(int i, atomic_t *v)
>>> {
>>> 	int val;
>>> 	smp_mb();  /* barrier for proper semantics */
>>> 	val = __insn_fetchadd4((void *)&v->counter, i) + i;
>>> 	barrier();  /* the "+ i" above will wait on memory */
>>> 	return val;
>>> }
>>>
>>> Which leaves me confused on smp_mb__after_atomic().
>> Are you concerned about whether it has proper memory
>> barrier semantics already, i.e. full barriers before and after?
>> In fact we do have a full barrier before, but then because of the
>> "+ i" / "barrier()", we know that the only other operation since
>> the previous mb(), namely the read of v->counter, has
>> completed after the atomic operation.  As a result we can
>> omit explicitly having a second barrier.
>>
>> It does seem like all the current memory-order semantics are
>> correct, unless I'm missing something!
> So I'm reading that code like:
>
> 	MB
>   [RmW]	ret = *val += i
>
>
> So what is stopping later memory ops like:
>
>     [R]	a = *foo
>     [S]	*bar = b
>
>  From getting reordered with the RmW, like:
>
> 	MB
>
>     [R]	a = *foo
>     [S]	*bar = b
>
>   [RmW]	ret = *val += i
>
> Are you saying Tile does not reorder things like that? If so, why then
> is smp_mb__after_atomic() a full mb(). If it does, I don't see how your
> add_return is correct.
>
> Alternatively I'm just confused..

Tile does not do out-of-order instruction issue, but it does have an
out-of-order memory subsystem, so in addition to stores becoming
unpredictably visible without a memory barrier, loads will also
potentially not read from memory predictably after issue.
As a result, later operations that use a register that was previously
loaded may stall instruction issue until the load value is available.
A memory fence instruction will cause the core to wait for all
stores to become visible and all load values to be available.

So [R] can't move up to before [RmW] due to the in-order issue
nature of the processor.  And smp_mb__after_atomic() has to
be a full mb() because that's the only barrier we have available
to guarantee that the load has read from memory.  (If the
value of the actual atomic was passed to smp_mb__after_atomic()
then we could just generate a fake use of the value, basically
generating something like "move r1, r1", which would cause
the instruction issue to halt until the value had been read.)
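
Something along these lines, presumably (hypothetical macro, not an
existing kernel interface; the "move" trick is the one described above):

/*
 * Stall issue until 'val' (the value returned by the atomic) has
 * actually been read back, by creating a fake register use of it.
 */
#define smp_mb__after_atomic_val(val)					\
	__asm__ __volatile__ ("move %0, %0" : "+r" (val) : : "memory")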

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


* Re: futex atomic vs ordering constraints
  2015-09-02 17:00           ` Peter Zijlstra
  2015-09-02 17:25             ` Chris Metcalf
@ 2015-09-02 21:18             ` Linus Torvalds
  2015-09-04 17:25               ` Chris Metcalf
  2015-09-05 17:53               ` Peter Zijlstra
  1 sibling, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2015-09-02 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Metcalf, Thomas Gleixner, Will Deacon, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages, dvhart, dave,
	Vineet.Gupta1, ralf, ddaney, linux-kernel,
	Russell King - ARM Linux, Richard Henderson

On Wed, Sep 2, 2015 at 10:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I'm reading that code like:
>
>         MB
>  [RmW]  ret = *val += i
>
>
> So what is stopping later memory ops like:
>
>    [R]  a = *foo
>    [S]  *bar = b
>
> From getting reordered with the RmW, like:
>
>         MB
>
>    [R]  a = *foo
>    [S]  *bar = b
>
>  [RmW]  ret = *val += i

So I do agree that for the atomic futex operation to be usable for
locking (which I think we all agree is a primary objective), the futex
write operations have to work both as acquire and release operations,
which basically means that it has to be both an acquire _and_ release
op.

That said, I'm not sure it really needs to be a full memory barrier on
both sides, and in particular, I'm not sure it needs to be a memory
barrier for the surrounding *kernel* operations.

So I do suspect that adding full smp_mb()'s on both sides is not
necessarily required for architectures, because there are often
serialization guarantees at kernel entry/exit.

So I think we could possibly relax the requirements (and document this
very clearly) to say that the futex operation must be totally ordered
wrt any other _user_space_ accesses by that thread. I suspect a lot of
architectures can then say "we may be very weakly ordered, but kernel
entry/exit implies enough synchronization that we do not need any
further memory barriers".

For example, on x86, the locked instructions are obviously already
sufficiently strong, but even if they weren't, kernel entry/exit is
documented to be a serializing instruction (which is something
insanely much stronger than just memory ordering). And I suspect there
are similar issues on a lot of architectures where the memory ordering
is done by the core, but the cache subsystem is strongly ordered (ie
sane, good SMP systems - so it sounds like tile needs the smp_mb()'s,
but I would almost suspect that POWER and ARM might *not* need them).

           Linus

* Re: futex atomic vs ordering constraints
  2015-09-02 21:18             ` Linus Torvalds
@ 2015-09-04 17:25               ` Chris Metcalf
  2015-09-05 17:53               ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Metcalf @ 2015-09-04 17:25 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra
  Cc: Thomas Gleixner, Will Deacon, Oleg Nesterov, Paul McKenney,
	Ingo Molnar, mtk.manpages, dvhart, dave, Vineet.Gupta1, ralf,
	ddaney, linux-kernel, Russell King - ARM Linux,
	Richard Henderson

On 09/02/2015 05:18 PM, Linus Torvalds wrote:
> For example, on x86, the locked instructions are obviously already
> sufficiently strong, but even if they weren't, kernel entry/exit is
> documented to be a serializing instruction (which is something
> insanely much stronger than just memory ordering). And I suspect there
> are similar issues on a lot of architectures where the memory ordering
> is done by the core, but the cache subsystem is strongly ordered (ie
> saen good SMP systems - so it sounds like tile needs the smp_mb()'s,
> but I would almost suspect that POWER and ARM might *not* need them).

Because POWER and ARM have serializing kernel entry/exit?
I think tile has relatively conventional cache/memory semantics,
but it's certainly true there is an implicit memory ordering guarantee
for kernel entry/exit.

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com


* Re: futex atomic vs ordering constraints
  2015-09-02 21:18             ` Linus Torvalds
  2015-09-04 17:25               ` Chris Metcalf
@ 2015-09-05 17:53               ` Peter Zijlstra
  2015-09-07  9:30                 ` Will Deacon
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2015-09-05 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Metcalf, Thomas Gleixner, Will Deacon, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages, dvhart, dave,
	Vineet.Gupta1, ralf, ddaney, linux-kernel,
	Russell King - ARM Linux, Richard Henderson

On Wed, Sep 02, 2015 at 02:18:53PM -0700, Linus Torvalds wrote:
> So I think we could possibly relax the requirements (and document this
> very clearly) to say that the futex operation must be totally ordered
> wrt any other _user_space_ accesses by that thread. I suspect a lot of
> architectures can then say "we may be very weakly ordered, but kernel
> entry/exit implies enough synchronization that we do not need any
> futher memory barriers".

Right, so before sending this email I actually spoke to Ralf about this
option, and he said that this is not actually well defined for MIPS.

But we could certainly document it such and let archs for which this is
well documented (I would expect this to be most) choose that
implementation.



* Re: futex atomic vs ordering constraints
  2015-09-05 17:53               ` Peter Zijlstra
@ 2015-09-07  9:30                 ` Will Deacon
  0 siblings, 0 replies; 15+ messages in thread
From: Will Deacon @ 2015-09-07  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Chris Metcalf, Thomas Gleixner, Oleg Nesterov,
	Paul McKenney, Ingo Molnar, mtk.manpages, dvhart, dave,
	Vineet.Gupta1, ralf, ddaney, linux-kernel,
	Russell King - ARM Linux, Richard Henderson

On Sat, Sep 05, 2015 at 06:53:02PM +0100, Peter Zijlstra wrote:
> On Wed, Sep 02, 2015 at 02:18:53PM -0700, Linus Torvalds wrote:
> > So I think we could possibly relax the requirements (and document this
> > very clearly) to say that the futex operation must be totally ordered
> > wrt any other _user_space_ accesses by that thread. I suspect a lot of
> > architectures can then say "we may be very weakly ordered, but kernel
> > entry/exit implies enough synchronization that we do not need any
> > futher memory barriers".
> 
> Right, so before sending this email I actually spoke to Ralf about this
> option, and he said that this is not actually well defined for MIPS.
> 
> But we could certainly document it such and let archs for which this is
> well documented (I would expect this to be most) choose that
> implementation.

Whilst a control-dependency + exception return forms a barrier of sorts
on arm/arm64, it's not required to be transitive [1], so I wouldn't be
comfortable making that relaxation on the futex path.
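
The shape of that test, for reference (plain C illustration; P1's control
dependency + ISB stands in for the exception return on the futex path):

#include <stdatomic.h>

/* ISA2: is the ordering P1 provides transitive, i.e. does it pass P0's
 * write to X through to P2?  Each pN on its own CPU. */
atomic_int X, Y, Z;
int r1, r2, r3;

void p0(void)
{
	atomic_store(&X, 1);
	/* dmb */
	atomic_store(&Y, 1);
}

void p1(void)
{
	r1 = atomic_load(&Y);
	if (r1)			/* ctrl + isb */
		atomic_store(&Z, 1);
}

void p2(void)
{
	r2 = atomic_load(&Z);
	/* dmb */
	r3 = atomic_load(&X);
}

/*
 * Forbidden only if P1's ordering is transitive:
 *	r1 == 1 && r2 == 1 && r3 == 0
 * The ctrl + isb orders P1's own accesses but is not guaranteed to make
 * P0's store to X visible to P2, hence the discomfort above.
 */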

Will

[1] See, for example, "ISA2+dmb+ctrlisb+dmb" at
    https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#ARM
