On Thu, Aug 29, 2013 at 12:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. I can see it, but it turns out that for normal pathname walking,
> one of the main stumbling blocks is the RCU case of complete_walk(),
> which cannot be done with the lockless lockref model.
> [... snip ...]

Ok, here's a patch for people to try out if they want to. It's tested,
and works for me, but it is - like the two preparatory patches I
already applied - not really based on Waiman's patches except from a
conceptual standpoint.

For architecture people (ie Ben, if you want to try this on ppc64),
the thing that it needs from an architecture:

 - the raw_spinlock_t and the "unsigned long" needs to fit in a u64.

   This is normally true anyway, but if your architecture code has a
big spinlock, you can't use it.

 - the architecture needs to support "cmpxchg()" on that "u64" type
(so x86-32 does *not* use this, although we could teach it to do so by
teaching x86-32 to use "cmpxchg8b" for the 8-byte case)

   A 64-bit architecture needs to support cmpxchg() on an u64 anyway,
so this is always true on 64-bit. It _can_ be true on 32-bit
architectures too.

 - the architecture needs to implement a simple
"arch_spin_value_unlocked()" macro, which takes a raw_spinlock_t value
and says whether it is unlocked or not.

   This is a new macro/function. It's generally a one-liner. For the
x86 ticket locks, for example, the test is simply "lock.tickets.head
== lock.tickets.tail".

 - the architecture code needs to let the generic code know that it
supports all this by doing a "select ARCH_USE_CMPXCHG_LOCKREF"

   Add it to your Kconfig file in the appropriate place. You do *not*
need to worry about LOCKDEP etc, you only need to worry about your own
architecture details above.

That's it. If it does that, the lockref code will use the cmpxchg
optimization when appropriate (ie spinlock debugging is not turned on
etc etc).

For Waiman: your patch had that adaptive waiting thing, and "wait for
unlocked" code, and I threw all that away. I didn't like it, and the
only reason for it existing was that the spinlock could be taken in a
hot path, which I felt was against the whole point of this "lockref"
thing.

So I fixed the VFS layer instead. With dput() and friends using
lockrefs, the only thing remaining in the hot RCU dentry lookup path
was the nasty __d_rcu_to_refcount() thing in complete_walk(). I
rewrote that to locklessly increment the refcount when it was nonzero,
and get the lock if it was zero, and that all seems to work fine.

And once the only case that is relevant for the fast-path is "d_lock
is unlocked", all your games with waiting for the spinlock to be
released are unnecessary. Making everything much simpler. If the
spinlock isn't unlocked, we always kick out to the fallback case (with
real locking).

NOTE! My test-case was very small and simple, so it may not trigger
other cases that might trigger d_lock in a hotpath. Anything that
kicks us out of rcu mode (like a symlink, for example) will trigger
"unlazy_walk()", and I didn't do the same thing there. So there's
still details like that to sort out, but I very much think this whole
"only an unlocked spinlock is a fastpath" is the right approach.

My simple testing shows that this has about the same best-case
performance, and the 15% _raw_spin_lock load I was able to trigger is
totally gone. That doesn't make things *faster* for me (because the
cost of the cmpxchg is pretty much comparable to the cost of the
spinlocks), but the big difference is the contended behavior where we
don't actually have to wait for the spinlock, we can just locklessly
increment the counter.

I can't trigger the CPU-eating contention case on my single-socket
system, which is why I'm sending out this patch for testing (despite
it not having that unlazy_walk() thing etc).

Also note that this is one single patch, not split up. Again, that's
because what I'm really hoping to get is just "does this fix the
contention-case on the 80-core monster machine that I don't have
access to?"

Side note: the whole cmpxchg() loop is written to basically generate
the perfect cmpxchg sequence on x86. The assembly results actually
look pretty good. I can't take advantage of the eflag setting of the
instruction, because gcc inline asms don't support that (even with
"asm goto" - I'd need to have output values for that, and "asm goto"
does now allow that). So there's one extra "cmpq" instruction, and gcc
makes a mess of "add 1 to structure member in high bytes of a 64-bit
stricture", but it's actually fairly readable and short assembly code,
which was *not* true of the original patches.

Waiman? Mind looking at this and testing?

                          Linus