Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

From: David Rientjes <rientjes@google.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	mgorman@techsingularity.net, Vlastimil Babka <vbabka@suse.cz>,
	Michal Hocko <mhocko@kernel.org>,
	ying.huang@intel.com, s.priebe@profihost.ag,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	alex.williamson@redhat.com, lkp@01.org, kirill@shutemov.name,
	Andrew Morton <akpm@linux-foundation.org>,
	zi.yan@cs.rutgers.edu
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Date: Tue, 11 Dec 2018 16:37:22 -0800 (PST)
Message-ID: <alpine.DEB.2.21.1812111609060.255489@chino.kir.corp.google.com>
In-Reply-To: <20181210044916.GC24097@redhat.com>

On Sun, 9 Dec 2018, Andrea Arcangeli wrote:

> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
> 
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?
> 

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more 
complete benchmarking test across all of our platforms for access and 
allocation latency for x86 (both Intel and AMD), POWER8/9, and arm64, and 
doing so on a kernel with minimum overhead (for the allocation latency, I 
want to remove things like mem cgroup overhead from the result).

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
> 
> Skylake Xeon(R) Gold 5115:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo |head -1
> model name      : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
> 
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11% 
> remote 4k -> remote 2m: +26%
> 
> threadripper 1950x:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node   0   1
>   0:  10  16
>   1:  16  10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo |head -1
> model name      : AMD Ryzen Threadripper 1950X 16-Core Processor
> 
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10% 
> remote 4k -> remote 2m: +41%
> 
> Or if you prefer reversed in terms of compute time (negative
> percentage is better in this case):
> 
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
> 
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound also for the threadripper, the threadripper
> seems even more favorable to remote THP than skylake Xeon is.
> 

My results are organized slightly differently, since they take local
hugepages as the baseline, which is what we optimize for: on Broadwell, I've
obtained more accurate results that show local small pages at +3.8%,
remote hugepages at +12.8% and remote small pages at +18.8%.  I think we
both agree that the locality preference for workloads that fit within a
single node is local hugepage -> local small page -> remote hugepage ->
remote small page, and that has been unchanged in any of the benchmarking
results for either of us.
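
As a rough illustration of that baseline, the first Skylake run quoted
above can be re-expressed relative to local hugepages.  This is only a
sketch: the helper name and the variable names are made up for the
example, and the timings are copied from the quoted numactl runs (CPU 0
for local, CPU 10 for remote).

```c
#include <stdint.h>

/*
 * Overhead of a run relative to a baseline run, in percent.
 * Positive means the run took longer than the baseline.
 */
static double overhead_pct(uint64_t baseline_usec, uint64_t run_usec)
{
	return 100.0 * ((double)run_usec - (double)baseline_usec) /
	       (double)baseline_usec;
}

/* First line of each configuration in the quoted Skylake output, in usec. */
static const uint64_t skl_local_2m  = 10109753;	/* -C 0,  MADV_HUGEPAGE */
static const uint64_t skl_local_4k  = 13682041;	/* -C 0,  MADV_NOHUGEPAGE */
static const uint64_t skl_remote_2m = 15393923;	/* -C 10, MADV_HUGEPAGE */
static const uint64_t skl_remote_4k = 19644793;	/* -C 10, MADV_NOHUGEPAGE */
```

With local hugepages as the baseline, overhead_pct() yields increasing
values for local small pages, remote hugepages and remote small pages, in
that order, which matches the preference ordering above.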

> The above is the host bare metal result. Now let's try guest mode on
> the threadripper. The last two lines seems more reliable (the first
> two lines also needs to fault in the guest RAM because the guest
> was fresh booted).
> 
> guest backed by local 2M pages:
> 
> random writes MADV_HUGEPAGE 16025855 usec
> random writes MADV_NOHUGEPAGE 21903002 usec
> random writes MADV_NOHUGEPAGE 19762767 usec
> random writes MADV_HUGEPAGE 15189231 usec
> 
> guest backed by remote 2M pages:
> 
> random writes MADV_HUGEPAGE 25434251 usec
> random writes MADV_NOHUGEPAGE 32404119 usec
> random writes MADV_NOHUGEPAGE 31455592 usec
> random writes MADV_HUGEPAGE 22248304 usec
> 
> guest backed by local 4k pages:
> 
> random writes MADV_HUGEPAGE 28945251 usec
> random writes MADV_NOHUGEPAGE 32217690 usec
> random writes MADV_NOHUGEPAGE 30664731 usec
> random writes MADV_HUGEPAGE 22981082 usec
> 
> guest backed by remote 4k pages:
> 
> random writes MADV_HUGEPAGE 43772939 usec
> random writes MADV_NOHUGEPAGE 52745664 usec
> random writes MADV_NOHUGEPAGE 51632065 usec
> random writes MADV_HUGEPAGE 40263194 usec
> 
> I haven't yet tried the guest mode on the skylake nor
> haswell/broadwell. I can do that too but I don't expect a significant
> difference.
> 
> On a threadripper guest, the remote 2m is practically identical to
> local 4k. So shutting down compaction to try to generate local 4k
> memory looks a sure loss.
> 

I'm assuming your results above are with a defrag setting of "madvise" or 
"defer+madvise".

> Even if we ignore the guest mode results completely, if we don't make
> assumption on the workload to be able to fit in the node, if I use
> MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
> THP page ends up in a remote node, than not getting the +41% THP
> speedup on remote memory if the pagetable ends up being remote or the
> 4k page itself ends up being remote over time.
> 

I agree with you that the preference for remote hugepages over local
small pages depends on the configuration and the workload that you are
running, and there are clear advantages and disadvantages to both.  This is
different from what the long-standing NUMA preferences have been for thp
allocations.

I think we can optimize for *both* usecases without causing an unnecessary
regression for the other, and doing so is not extremely complex.

Since it depends on the workload, specifically on whether the workload
fits within a single node, I think the reasonable approach would be to have
a sane default regardless of the use of MADV_HUGEPAGE or thp defrag
settings and then optimize for the minority of cases where the workload
does not fit in a single node.  I'm assuming there is no debate about these
larger workloads being in the minority, although we have single machines
where this encompasses the totality of their workloads.

Regarding the role of direct reclaim in the allocator, I think we need to
work on the feedback from compaction to determine whether it's worthwhile.
That's difficult because of the point I continue to bring up:
isolate_freepages() is not necessarily always able to access this freed
memory.  But for cases where we get COMPACT_SKIPPED because the order-0
watermarks are failing, reclaim *is* likely to have an impact on the
success of compaction; otherwise we fail and defer because compaction
wasn't able to make a hugepage available.

 [ If we run compaction regardless of the order-0 watermark check and find
   a pageblock where we can likely free a hugepage because it is 
   fragmented movable pages, this is a pretty good indication that reclaim
   is worthwhile iff the reclaimed memory is beyond the migration scanner. ]
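
A minimal sketch of that feedback rule, as a userspace model rather than
allocator code.  The enum and function names are hypothetical stand-ins;
the kernel's own enum compact_result has more states than are modeled
here.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of compaction outcomes for this sketch. */
enum sketch_compact_result {
	SKETCH_COMPACT_SKIPPED,		/* order-0 watermarks were failing */
	SKETCH_COMPACT_DEFERRED,	/* compaction was deferred */
	SKETCH_COMPACT_NO_HUGEPAGE,	/* ran, but no hugepage became available */
	SKETCH_COMPACT_SUCCESS,		/* a hugepage became available */
};

/*
 * Reclaim is only expected to help compaction when compaction was skipped
 * for lack of order-0 memory.  In the other failure cases the allocation
 * fails and compaction is deferred.
 */
static bool sketch_reclaim_worthwhile(enum sketch_compact_result result)
{
	return result == SKETCH_COMPACT_SKIPPED;
}
```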

Let me try to list out what I think is a reasonable design for the various 
configs assuming we are able to address the reclaim concern above.  Note 
that this is for the majority of users where workloads do not span 
multiple nodes:

 - defrag=always: always compact, obviously

 - defrag=madvise/defer+madvise:

   - MADV_HUGEPAGE: always compact locally, fallback to small pages 
     locally (small pages become eligible for khugepaged to collapse
     locally later, no chance of additional access latency)

   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd locally, 
     fallback to small pages locally

 - defrag=defer: kick kcompactd locally, fallback to small pages locally

 - defrag=never: fallback to small pages locally

And that, AFAICT, has been the implementation for almost four years.

For workloads that *can* span multiple nodes, this doesn't make much 
sense, as you point out and have reported in your bug.  Considering the 
reclaim problem separately where we thrash a node unnecessarily, if we 
consider only hugepages and NUMA locality:

 - defrag=always: always compact for all allowed zones, zonelist ordered
   according to NUMA locality

 - defrag=madvise/defer+madvise:

   - MADV_HUGEPAGE: always compact for all allowed zones, try to allocate
     hugepages in zonelist order, only fallback to small pages when
     compaction fails

   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd for all
     allowed zones, fallback to small pages locally

 - defrag=defer: kick kcompactd for all allowed zones, fallback to small 
   pages locally

 - defrag=never: fallback to small pages locally
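
The two lists can be read as one decision table keyed on the defrag
setting, the madvise hint, and whether the workload spans nodes.  The
following is a sketch of that table only, not kernel code: every type and
function name is made up for the example, MADV_NOHUGEPAGE is not modeled,
and the small page fallback is left out.

```c
#include <stdbool.h>

enum sketch_defrag {
	DEFRAG_ALWAYS,
	DEFRAG_MADVISE,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_DEFER,
	DEFRAG_NEVER,
};

enum sketch_action {
	ACT_DIRECT_COMPACT,	/* "always compact" */
	ACT_KICK_KCOMPACTD,	/* background compaction only */
	ACT_NONE,		/* go straight to small pages */
};

enum sketch_scope {
	SCOPE_LOCAL_NODE,
	SCOPE_ALL_ALLOWED_ZONES,	/* zonelist order, __GFP_THISNODE cleared */
};

struct sketch_thp_plan {
	enum sketch_action action;
	enum sketch_scope scope;
};

/* madv_hugepage == false means neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE. */
static struct sketch_thp_plan
sketch_plan_thp_fault(enum sketch_defrag defrag, bool madv_hugepage,
		      bool spans_nodes)
{
	struct sketch_thp_plan plan = { ACT_NONE, SCOPE_LOCAL_NODE };

	switch (defrag) {
	case DEFRAG_ALWAYS:
		plan.action = ACT_DIRECT_COMPACT;
		break;
	case DEFRAG_MADVISE:
	case DEFRAG_DEFER_MADVISE:
		plan.action = madv_hugepage ? ACT_DIRECT_COMPACT
					    : ACT_KICK_KCOMPACTD;
		break;
	case DEFRAG_DEFER:
		plan.action = ACT_KICK_KCOMPACTD;
		break;
	case DEFRAG_NEVER:
		return plan;	/* small pages locally in both lists */
	}

	/* Only the compaction scope differs between the two lists. */
	if (spans_nodes)
		plan.scope = SCOPE_ALL_ALLOWED_ZONES;
	return plan;
}
```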

For this policy to be possible, we must clear __GFP_THISNODE.  How to 
determine when to do this?  I think we have three options: heuristics (rss 
vs zone managed pages), per-process prctl(), or global thp setting for 
machine-wide behavior.
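
For the heuristic option, the simplest form would be a direct comparison
of the two quantities.  This is a sketch under that assumption only; the
function and parameter names are hypothetical, and a real heuristic would
likely need some margin rather than a strict comparison.

```c
#include <stdbool.h>

/*
 * Hypothetical heuristic: keep __GFP_THISNODE for thp allocations while
 * the workload's rss could still be held by the local zone.
 */
static bool sketch_keep_thisnode(unsigned long rss_pages,
				 unsigned long zone_managed_pages)
{
	return rss_pages <= zone_managed_pages;
}
```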

I've been suggesting a per-process prctl() that can be set and carried
across fork so that no changes are needed to any workload, and we can
simply special-case the thp allocation policy to use __GFP_THISNODE, which
is the default for bare metal, and to not use it when we've said the
workload will span multiple nodes.  Depending on the size of the workload,
it may choose to use this setting on certain systems and not others.
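
A userspace model of how such a per-process setting could drive the gfp
mask, assuming a single boolean per task.  Nothing here is an existing
interface: the struct, the flag, the helper names and the bit value are
all hypothetical, and no prctl() option of this kind is implied to exist.

```c
#include <stdbool.h>

#define SKETCH_GFP_THISNODE 0x1u	/* stand-in bit, not the kernel's value */

struct sketch_task {
	bool thp_spans_nodes;	/* hypothetical per-process setting */
};

/* The child starts with the parent's setting, i.e. it is carried across fork. */
static struct sketch_task sketch_fork(const struct sketch_task *parent)
{
	return *parent;
}

/*
 * Default: thp allocations stay on the local node.  When the task has said
 * it will span multiple nodes, the THISNODE restriction is cleared.
 */
static unsigned int sketch_thp_gfp(const struct sketch_task *task,
				   unsigned int gfp)
{
	if (task->thp_spans_nodes)
		return gfp & ~SKETCH_GFP_THISNODE;
	return gfp | SKETCH_GFP_THISNODE;
}
```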