On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds wrote: > > Simplify, simplify, simplify. I've now tried three different approaches (I stopped sending broken patches after deciding the first one was unfixable), and they all suck. I was hoping for something lockless to avoid the whole issue with latency over the long list, but anything that has the wait queue entry allocated on the stack needs to synchronize the wakeup due to the stack usage, so once you have lists that are thousands of entries, either you hold the lock for long times (normal wait queues) or take and release them constantly (the swait approach), or you batch things up (Tim's wait-queue patches). Of those three approaches, the batching does seem to be the best of the lot. Allocating non-stack wait entries while waiting for pages seems like a bad idea. We're probably waiting for IO in normal circumstances, possibly because we're low on memory. So I *am* ok with the batching (your patch #1), but I'm *not* ok with the nasty horrible book-keeping to avoid new entries when you batch and release the lock and that allows new entries on the list (your patch #2). That said, I have now stared *way* too much at the page wait code after having unsuccessfully tried to replace that wait-queue with other "clever" approaches (all of which ended up being crap). And I'm starting to see a possible solution, or at least improvement. Let's just assume I take the batching patch. It's not conceptually ugly, it's fairly simple, and there are lots of independent arguments for it (ie latency, but also possible performance from better parallelism). That patch doesn't make any data structures bigger or more complicated, it's just that single "is this a bookmark entry" thing. So that patch just looks fundamentally fine to me, and the only real argument I ever had against it was that I would really really _really_ have wanted to root-cause the behavior. So that leaves my fundamental dislike of your other patch. And it turns out that all my looking at the page wait code wasn't entirely unproductive. Yes, I went through three crap approaches before I gave up on rewriting it, but in the meantime I did get way too intimate with looking at that pile of crud. And I think I found the reason why you needed that patch #2 in the first place. Here's what is going on: - we're going to assume that the problem is all with a single page, not due to hash collisions (as per your earlier reports) - we also know that the only bit that matters is the PG_locked bit - that means that the only way to get that "multiple concurrent waker" situation that your patch #2 tries to handle better is because you have multiple people unlocking the page - since that is what causes the wakeups - that in turn means that you obviously had multiple threads *locking* it too. - and that means that among those new entries there are lockers coming in in the middle and adding an exclusive entry. - the exclusive entry already stops the wakeup list walking - but we add non-exclusive entries TO THE BEGINNING of the page waiters list And I really think that last thing is fundamentally wrong. It's wrong for several reasons: - because it's unfair: threads that want to lock get put behind threads that just want to see the unlocked state. - because it's stupid: our non-locking waiters will end up waiting again if the page got locked, so waking up a locker *and* a lot of non-locking waiters will just cause them to go back to sleep anyway - because it causes us to walk longer lists: we stop walking when we wake up the exclusive waiter, but because we always put the non-exclusive waiters in there first, we always walk the long list of non-exclusive waiters even if we could just stop walking because we woke up an exclusive one. Now the reason we do this seems to be entirely random: for a *normal* wait queue, you really want to always wake up all the non-exclusive waiters, because exclusive waiters are only exclusive wrt each other. But for a page lock, an exclusive waiter really is getting the lock, and the other waiters are going to wait for the lock to clear anyway, so the page wait thing is actually almost exactly the reverse situation. We *could* put exclusive waiters at the beginning of the list instead, and actively try to avoid walking the list at all if we have pending lockers. I'm not doing that, because I think the fair thing to do is to just do things in the order they came in. Plus the code is actually simpler if we just always add to the tail. Now, the other thing to look at is how the wakeup function works. It checks the aliasing information (page and bit number), but then it *also* does: if (test_bit(key->bit_nr, &key->page->flags)) return 0; basically saying "continue walking if somebody else already got the bit". That's *INSANE*. It's exactly the wrong thing to do. It's basically saying "even if we had an exclusive waiter, let's not wake it up, but do let us continue to walk the list even though the bit we're waiting for to clear is set already". What would be sane is to say "stop walking the list now".. So just do that - by making "negative return from wake function" mean "stop walking". So how about just this fairly trivial patch? In fact, I think this may help even *without* Tim's patch #1. So I think this would be lovely to test on that problem load on its own, and seeing if it makes the wait queues behave better. It might not cut down on the total length of the wait-queue, but it should hopefully cause it to walk it much less. We now hit the exclusive waiters earlier and stop waking things up when we have a new locker thread pending. And when the page ends up being locked again, we stop walking the list entirely, so no unnecessarily traversal. This patch is small and at least minimally works (I'm running it right now). Linus