On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Simplify, simplify, simplify.

I've now tried three different approaches (I stopped sending broken
patches after deciding the first one was unfixable), and they all
suck.

I was hoping for something lockless to avoid the whole issue with
latency over the long list, but anything that has the wait queue entry
allocated on the stack needs to synchronize the wakeup due to the
stack usage, so once you have lists that are thousands of entries,
either you hold the lock for long times (normal wait queues) or take
and release them constantly (the swait approach), or you batch things
up (Tim's wait-queue patches).

Of those three approaches, the batching does seem to be the best of
the lot. Allocating non-stack wait entries while waiting for pages
seems like a bad idea. We're probably waiting for IO in normal
circumstances, possibly because we're low on memory.

So I *am* ok with the batching (your patch #1), but I'm *not* ok with
the nasty horrible book-keeping to avoid new entries when you batch
and release the lock and that allows new entries on the list (your
patch #2).

That said, I have now stared *way* too much at the page wait code
after having unsuccessfully tried to replace that wait-queue with
other "clever" approaches (all of which ended up being crap).

And I'm starting to see a possible solution, or at least improvement.

Let's just assume I take the batching patch. It's not conceptually
ugly, it's fairly simple, and there are lots of independent arguments
for it (ie latency, but also possible performance from better
parallelism).

That patch doesn't make any data structures bigger or more
complicated, it's just that single "is this a bookmark entry" thing.
So that patch just looks fundamentally fine to me, and the only real
argument I ever had against it was that I would really really _really_
have wanted to root-cause the behavior.

So that leaves my fundamental dislike of your other patch.

And it turns out that all my looking at the page wait code wasn't
entirely unproductive. Yes, I went through three crap approaches
before I gave up on rewriting it, but in the meantime I did get way
too intimate with looking at that pile of crud.

And I think I found the reason why you needed that patch #2 in the first place.

Here's what is going on:

 - we're going to assume that the problem is all with a single page,
not due to hash collisions (as per your earlier reports)

 - we also know that the only bit that matters is the PG_locked bit

 - that means that the only way to get that "multiple concurrent
waker" situation that your patch #2 tries to handle better is because
you have multiple people unlocking the page - since that is what
causes the wakeups

 - that in turn means that you obviously had multiple threads *locking* it too.

 - and that means that among those new entries there are lockers
coming in in the middle and adding an exclusive entry.

 - the exclusive entry already stops the wakeup list walking

 - but we add non-exclusive entries TO THE BEGINNING of the page waiters list

And I really think that last thing is fundamentally wrong.

It's wrong for several reasons:

 - because it's unfair: threads that want to lock get put behind
threads that just want to see the unlocked state.

 - because it's stupid: our non-locking waiters will end up waiting
again if the page got locked, so waking up a locker *and* a lot of
non-locking waiters will just cause them to go back to sleep anyway

 - because it causes us to walk longer lists: we stop walking when we
wake up the exclusive waiter, but because we always put the
non-exclusive waiters in there first, we always walk the long list of
non-exclusive waiters even if we could just stop walking because we
woke up an exclusive one.

Now the reason we do this seems to be entirely random: for a *normal*
wait queue, you really want to always wake up all the non-exclusive
waiters, because exclusive waiters are only exclusive wrt each other.

But for a page lock, an exclusive waiter really is getting the lock,
and the other waiters are going to wait for the lock to clear anyway,
so the page wait thing is actually almost exactly the reverse
situation. We *could* put exclusive waiters at the beginning of the
list instead, and actively try to avoid walking the list at all if we
have pending lockers.

I'm not doing that, because I think the fair thing to do is to just do
things in the order they came in. Plus the code is actually simpler if
we just always add to the tail.

Now, the other thing to look at is how the wakeup function works. It
checks the aliasing information (page and bit number), but then it
*also* does:

        if (test_bit(key->bit_nr, &key->page->flags))
                return 0;

basically saying "continue walking if somebody else already got the bit".

That's *INSANE*. It's exactly the wrong thing to do. It's basically
saying "even if we had an exclusive waiter, let's not wake it up, but
do let us continue to walk the list even though the bit we're waiting
for to clear is set already".

What would be sane is to say "stop walking the list now".. So just do
that - by making "negative return from wake function" mean "stop
walking".

So how about just this fairly trivial patch?

In fact, I think this may help even *without* Tim's patch #1. So I
think this would be lovely to test on that problem load on its own,
and seeing if it makes the wait queues behave better.

It might not cut down on the total length of the wait-queue, but it
should hopefully cause it to walk it much less. We now hit the
exclusive waiters earlier and stop waking things up when we have a new
locker thread pending. And when the page ends up being locked again,
we stop walking the list entirely, so no unnecessarily traversal.

This patch is small and at least minimally works (I'm running it right now).

                               Linus