Subject: [MODERATED] Re: [PATCH v4 3/8] Linux Patch #3
From: Dave Hansen
Date: Mon, 25 Jun 2018 10:26:10 -0700
To: speck@linutronix.de

On 06/25/2018 09:46 AM, speck for Paolo Bonzini wrote:
>> 32k is theoretically enough, but _only_ if none of the lines being
>> touched were in the cache previously. That's why it was a 64k
>> buffer in some examples.
>
> But pre-Skylake has 16k cache only, doesn't it? Does it need to
> read in 4 times the cache size?

I thought it's been 32k for a while. But, either way, I guess we
should be using the *enumerated* L1D size rather than a fixed 32k.

Here's a Haswell system, btw:

dave@o2:~$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
dave@o2:~$ cat /proc/cpuinfo | grep model
model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

Or a Westmere Xeon:

dave@bigbox:~$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K

>> You also need guard pages at either end to ensure the prefetchers
>> don't run into the next page.
>
> Hmm, it would be a pity to require order 5 even. Earlier in the
> thread someone said that 52 KiB were enough; if that's confirmed we
> could keep order 4 and have guard pages.

52k was the theoretical floor: the smallest buffer size that would
guarantee all 32k of the L1D got _evicted_. But the recommendation
from the hardware folks was to use more than 52k so there was some
margin in case the analysis was imprecise.

BTW, this buffer does not necessarily need to be per-logical-CPU.
Tony Luck pointed out that we could just have one buffer per
hyperthread *slot* within a core: Core-0/Thread-0 could share its
buffer with Core-1/Thread-0, for instance. Having them be
NUMA-node-local would be nice too, but is not required for
correctness.

At the point where we've got two per NUMA node, I'm not sure we
really care much whether it's 128k or 64k consumed. I'd much rather
do what the hardware folks are comfortable with than save 64k.
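
For the "enumerated L1D size" part, here's an untested sketch of what
I mean. The kernel already discovers this via the cacheinfo code, so
a real patch would probably just reuse that, but this shows where the
number comes from: CPUID leaf 0x4 (deterministic cache parameters).

/*
 * Untested sketch: find the L1 data cache size from CPUID leaf 0x4.
 * Returns bytes, or 0 if no L1D was enumerated.
 */
static unsigned int l1d_size_bytes(void)
{
	unsigned int eax, ebx, ecx, edx;
	int i;

	for (i = 0; i < 16; i++) {
		cpuid_count(4, i, &eax, &ebx, &ecx, &edx);

		if ((eax & 0x1f) == 0)	/* cache type 0: no more caches */
			break;
		/* type 1 == data cache, level is in bits 7:5 */
		if ((eax & 0x1f) != 1 || ((eax >> 5) & 0x7) != 1)
			continue;

		return (((ebx >> 22) & 0x3ff) + 1) *	/* ways */
		       (((ebx >> 12) & 0x3ff) + 1) *	/* partitions */
		       ((ebx & 0xfff) + 1) *		/* line size */
		       (ecx + 1);			/* sets */
	}
	return 0;
}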
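
And a rough, untested sketch of Tony's sharing idea.
l1d_flush_buffer[] and MAX_SMT_SLOTS are made-up names for wherever
the per-slot buffers end up living; the point is just that the index
is the thread's position within its own core, not the logical CPU
number:

/* Made-up storage: one buffer per SMT sibling slot (NUMA-local
 * placement left out of the sketch). */
static void *l1d_flush_buffer[MAX_SMT_SLOTS];

/*
 * Map a logical CPU to its buffer by the thread's index within its
 * core, so Core-0/Thread-0 and Core-1/Thread-0 land on the same
 * buffer.
 */
static void *l1d_flush_buffer_for(int cpu)
{
	int slot = 0, sibling;

	for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
		if (sibling == cpu)
			break;
		slot++;
	}
	return l1d_flush_buffer[slot];
}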
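
On the guard pages: this is just a thought, but if the flush loop
only needs virtual contiguity, vmalloc() areas are separated by guard
holes already, so we'd get the "prefetchers can't run into the next
page" property for free and avoid a high-order physically-contiguous
allocation entirely:

/*
 * Untested thought: vmalloc gives us guard holes around the area, so
 * the prefetchers can't wander off the end.  Only works if the flush
 * sequence is happy with a virtually-contiguous buffer.
 */
static void *alloc_l1d_flush_buffer(unsigned int l1d_bytes)
{
	/* 2x the L1D size, per the "more than 52k" guidance above */
	return vmalloc(2 * l1d_bytes);
}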