From: Dave Hansen <dave.hansen@linux.intel.com>
To: speck@linutronix.de
Subject: [MODERATED] Re: [PATCH v4 3/8] [PATCH v4 3/8] Linux Patch #3
Date: Mon, 25 Jun 2018 10:26:10 -0700
Message-ID: <b0c08867-32a9-4a36-a1ab-c1b5f990a0b7@linux.intel.com>
In-Reply-To: <cb3b10a8-1e8a-e8e4-07cb-fa767e22675b@redhat.com>

On 06/25/2018 09:46 AM, speck for Paolo Bonzini wrote:
>> 32k is theoretically enough, but _only_ if none of the lines being
>> touched were in the cache previously.  That's why it was a 64k buffer in
>> some examples.
> 
> But pre-Skylake has 16k cache only, doesn't it?  Does it need to read in
> 4 times the cache size?

I thought it's been 32k for a while.  But, either way, I guess we should
be doing the *enumerated* L1D size rather than a fixed 32k.

Here's a Haswell system, btw:

dave@o2:~$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
dave@o2:~$ cat /proc/cpuinfo  | grep model
model name	: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

Or a Westmere Xeon:

dave@bigbox:~$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
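
If we do go with the enumerated size, something like the sketch below
could compute it from CPUID leaf 4 (a userspace illustration under my
own naming; the real code would presumably reuse the kernel's existing
cacheinfo rather than re-parsing CPUID):

#include <stdio.h>
#include <cpuid.h>

/*
 * Sketch: walk the CPUID leaf 4 subleaves until the level-1 data cache
 * shows up, then compute its size from the reported geometry:
 * size = ways * partitions * line_size * sets.
 */
static unsigned int l1d_size_bytes(void)
{
	unsigned int eax, ebx, ecx, edx, subleaf;

	for (subleaf = 0; ; subleaf++) {
		__cpuid_count(4, subleaf, eax, ebx, ecx, edx);

		unsigned int type  = eax & 0x1f;	/* 0 == no more caches */
		unsigned int level = (eax >> 5) & 0x7;

		if (!type)
			break;
		if (type == 1 && level == 1) {		/* L1 data cache */
			unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;
			unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;
			unsigned int line_size  =  (ebx        & 0xfff) + 1;
			unsigned int sets       = ecx + 1;

			return ways * partitions * line_size * sets;
		}
	}
	return 32 * 1024;	/* assumption: fall back to the common 32k */
}

int main(void)
{
	printf("L1D: %u bytes\n", l1d_size_bytes());
	return 0;
}

That should report the same 32K that sysfs shows on the boxes above.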

>> You also need guard pages at either end to ensure the prefetchers don't
>> run into the next page.
> 
> Hmm, it would be a pity to require order 5 even.  Earlier in the thread
> someone said that 52 KiB were enough, if that's confirmed we could keep
> order 4 and have guard pages.

52k was the theoretical floor: the smallest size that would guarantee
all 32k got _evicted_.  But the recommendation from the hardware folks
was to use more than 52k so there is some margin in case the analysis
was imprecise.
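
Just to make the mechanics concrete: the eviction itself is nothing
fancier than touching every cache line in a buffer comfortably bigger
than the L1D.  A minimal userspace sketch (the 64k size and the names
are my assumptions, not what the patch does):

#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE	64
#define FLUSH_BUF_SIZE	(64 * 1024)	/* assumption: 2x a 32k L1D */

/*
 * Read one byte from every cache line in the buffer.  Each load pulls
 * a fresh line into L1D, so by the end everything that was resident
 * before should have been displaced.
 */
static void l1d_flush_sw(const volatile uint8_t *buf)
{
	volatile uint8_t sink = 0;
	size_t off;

	for (off = 0; off < FLUSH_BUF_SIZE; off += CACHE_LINE)
		sink ^= buf[off];
	(void)sink;
}

The guard pages mentioned earlier in the thread would sit on either
side of 'buf' so the prefetchers have somewhere harmless to run into;
they are not shown here.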

BTW, this buffer does *not* necessarily need to be per-logical-CPU.
Tony Luck pointed out that we could get away with one buffer per SMT
sibling index: Core-0/Thread-0 could share its buffer with
Core-1/Thread-0, for instance.  Having them be NUMA-node-local would be
nice too, but not required for correctness.

At the point that we've got two per NUMA node, I'm not sure we really
care much whether it's 128k or 64k consumed.  I'd much rather do what
the hardware folks are comfortable with than save 64k.
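
For illustration only, a rough kernel-style sketch of the "one buffer
per SMT sibling index, preferably node-local" idea (the array layout,
names, and two-way-SMT assumption are all mine, not from the series):

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/smp.h>
#include <linux/init.h>
#include <linux/topology.h>
#include <linux/nodemask.h>

#define FLUSH_BUF_ORDER	4	/* 16 pages == 64k */

/* One buffer per (NUMA node, SMT sibling index) pair, assuming 2-way SMT */
static void *flush_buf[MAX_NUMNODES][2];

static int __init alloc_flush_buffers(void)
{
	int node, idx;

	for_each_online_node(node) {
		for (idx = 0; idx < 2; idx++) {
			struct page *page;

			/* Node-local when possible; correctness does not depend on it */
			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
						FLUSH_BUF_ORDER);
			if (!page)
				return -ENOMEM;
			flush_buf[node][idx] = page_address(page);
		}
	}
	return 0;
}

/* Pick the buffer for this CPU (caller has preemption disabled):
 * its node plus its sibling index. */
static void *this_cpu_flush_buf(void)
{
	int cpu = smp_processor_id();
	int idx = cpu != cpumask_first(topology_sibling_cpumask(cpu));

	return flush_buf[cpu_to_node(cpu)][idx];
}

With that layout the cost is two 64k buffers per node regardless of
core count, which is where the 128k-versus-64k number above comes from.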


