On Wed, Mar 17, 2021 at 3:23 PM Nicholas Piggin wrote:
>
> Might take some time to get a system and run tests. We actually had
> difficulty recreating it before this patch too, so it's kind of
> hard to say _that_ was the exact case that previously ran badly and
> is now fixed. We thought just the statistical nature of collisions
> and page / lock contention made things occasionally line up and
> tank.

Yeah, it probably depends a lot on the exact page allocation patterns and just bad luck.

Plus the actual spinlocks will end up with false sharing anyway, so you might need to have both contention on multiple pages on the same hash queue _and_ perhaps some action on other pages that just hash close by, so that they make the cache coherency protocol have to work harder..

And then 99% of the time it all just works out fine, because we only need to look at the hashed spinlocks when we have any contention on the page lock in the first place, and that generally should be something we should avoid for other reasons.

But it is perhaps also a sign that while 256 entries is "ridiculously small", it might be the case that it's not necessarily much of a real problem in the end, and I get the feeling that growing it by several orders of magnitude is overkill.

In fact, the problems we have had with the page wait queue have often been because we did too much page locking in the first place.

I actually have a couple of patches in my "still thinking about it" tree (they have been there for six months by now, so maybe I'm not thinking too _actively_ about it) which remove some overly stupid page locking.

For example, look at do_shared_fault(). Currently __do_fault() always locks the result page. Then, if you have a page_mkwrite() function, we immediately unlock it again. And then we lock it again in do_page_mkwrite().

Does that particular case matter? I really have no idea. But we basically lock it twice for what looks like absolutely zero reason other than calling conventions.

Bah.
I'll just attach my three "not very actively thinking about it" patches here in case anybody else wants to not actively think about them.

I've actually been running these on my own machine since October, since the kernel I boot on my main desktop always includes my "thinking about it" patches.

The first two patches should fix the double-lock thing I mentioned.

The third patch should be a no-op right now, but the thinking is outlined in the commit message: why do we even lock pages in filemap_fault? I'm not actually convinced we need to. I think we could return an uptodate page unlocked, and instead do the checking at pte install time (I had some discussions with Kirill about doing it at pte install time, depending entirely on memory ordering constraints wrt page truncation, which *does* take the page lock, and then does various other checks - including the page mapcount and taking the ptl lock - under that lock).

Anyway, I don't mind the "make the hash table larger" regardless of this, but I do want it to be documented a bit more.

               Linus