Virtual Scanning Considered Harmful

* Virtual Scanning Considered Harmful
@ 2023-10-24  3:34 Matthew Wilcox
  0 siblings, 0 replies; only message in thread
From: Matthew Wilcox @ 2023-10-24  3:34 UTC (permalink / raw)
  To: linux-mm

In spite of https://meyerweb.com/eric/comment/chech.html
I must pay homage to https://dl.acm.org/doi/10.1145/362929.362947

Modern CPUs are bad at walking linked lists.  When you can issue 6
instructions per clock tick and it takes 100ns for an L3 cache miss, the
ability for your CPU to predict your next memory reference and prefetch
it as far ahead of time as possible is critical.

I had originally planned to write a missive pleading for using a
better data structure than a linked list for storing pages in an LRU.
In my investigation, that turned out to be the wrong idea by a factor
of about 3-4x.  I now believe that _any_ virtual scanning is a mistake
and we should be scanning memory physically.  And I have the code to
back up that belief.

First, the code that simulates walking an LRU list vs an LRU array vs
walking the physical array directly:

https://www.infradead.org/~willy/linux/scan.c

If you're going to run it for yourself, I recommend running it with -v to
check how long it takes to run the physical scan; if it's too small, you
may get some unstable results.  I like to specify -c 10 (simulates 40GB of
memory) to get more stable numbers.  On my core i7-1165G7 @ 2.8GHz, I get

$ ./scan -c 10
Physical scan is 37.962013 faster than ListLRU
Physical scan is 3.847013 faster than ArrayLRU

That bounces around a bit, but generally, it's 37-38x faster to do
a physical scan than a ListLRU scan.

I asked a few people to try it on their machines.

<ziy> 17.2x with -c 50 on E5-2650 V4 @ 2.20GHz
<ziy> 31.6x with -c 50 on Threadripper 3960X
<ziy> 17.7x on cortex-A72. basically raspberrypi4.
<ljs> on a coffee lake i9-9900k I see ~20 - 25 with a fair bit of variance with no -c specified, with -c 50 around ~28 with less variance

The relative speed of memory & CPU, along with the effectiveness /
aggressiveness of the CPU prefetchers will affect the exact numbers here,
but you can see modern machines are all showing an advantage.

I had intended to investigate using a radix tree or B-tree to store
the LRU list, but a linear array of pointers to struct page should beat
any data structure that is essentially a segmented array of pointers to
struct page.  And that is still about 3.8x slower than a physical scan,
so I see no need to carry out the experiment of storing pages in an
XArray, Maple Tree, or any other data structure.

Admittedly, this is a very crude benchmark.  In no way does it model the
various tests done on the pages/folios that are scanned.  It doesn't
allocate large folios; they're all single pages.  There are probably
half a dozen other things wrong with it.  Nevertheless, a 4x speedup on
LRU scan would be amazing.  A 40x speedup would be stunning.

I don't have time to take on this project.  I need to work on folios
and other projects that I have started.  I haven't done enough research
to understand the various options we have for switching to a physical
scan (and the downsides of those approaches!).  I have no well-founded
opinions on how we might go about this.  Hopefully the potential win
here is enough to motivate someone to want to work on this.

^ permalink raw reply	[flat|nested] only message in thread