From: Bharata B Rao <bharata@amd.com>
To: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, bristot@redhat.com,
	dishaa.talreja@amd.com, "Huang2, Wei" <Wei.Huang2@amd.com>
Subject: Re: [RFC PATCH v0 0/3] sched/numa: Process Adaptive autoNUMA
Date: Tue, 1 Feb 2022 18:37:23 +0530	[thread overview]
Message-ID: <ff74bacd-9092-4ebb-a5bb-98e49cf314a9@amd.com> (raw)
In-Reply-To: <20220131121707.GW3301@suse.de>

On 1/31/2022 5:47 PM, Mel Gorman wrote:
> On Fri, Jan 28, 2022 at 10:58:48AM +0530, Bharata B Rao wrote:
>> Hi,
>>
>> This patchset implements an adaptive algorithm for calculating the autonuma
>> scan period.
> 
> autonuma refers to the khugepaged-like approach to NUMA balancing that
> was later superseded by NUMA Balancing (NUMAB) and is generally reflected
> by the naming e.g. git grep -i autonuma and note how few references there
> are to autonuma versus numab or "NUMA balancing". I know MMTests still
> refers to AutoNUMA but mostly because at the time it was written,
> autoNUMA was what was being evaluated and I never updated the naming.

Thanks. Noted, will use the appropriate terminology going forward.

> 
>> In the existing mechanism of scan period calculation,
>>
>> - scan period is derived from the per-thread stats.
>> - static threshold (NUMA_PERIOD_THRESHOLD) is used for changing the
>>   scan rate.
>>
>> In this new approach (Process Adaptive autoNUMA or PAN), we gather NUMA
>> fault stats at per-process level which allows for capturing the application
>> behaviour better. In addition, the algorithm learns and adjusts the scan
>> rate based on remote fault rate. By not sticking to a static threshold, the
>> algorithm can respond better to different workload behaviours.
>>
> 
> NUMA Balancing is concerned with threads (task) and an address space (mm)
> so basing the naming on Address Space rather than process may be more
> appropriate although I admit the acronym is not as snappy.

Sure, will think about a more appropriate name.

> 
>> Since the threads of a process are already considered as a group,
>> we add a bunch of metrics to the task's mm to track the various
>> types of faults and derive the scan rate from them.
>>
> 
> Enumerate the types of faults and note how the per-thread and
> per-address-space metrics are related.

Sure, will list the types of faults and describe them.

Per-address-space metrics are essentially an aggregate of the existing per-thread
metrics. Unlike the existing task_numa_group mechanism, the threads are
already implicitly considered part of the address space group (p->mm).
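
For illustration, here is a minimal userspace sketch of that aggregation idea
(pan_mm_stats and pan_account_fault are made-up names for this example and are
not from the patch; in the kernel the counters would hang off p->mm):

#include <stdatomic.h>
#include <stdbool.h>

/* Per-mm aggregate of the per-thread hinting-fault outcomes */
struct pan_mm_stats {
	atomic_ulong local_faults;	/* faults resolved on the local node */
	atomic_ulong remote_faults;	/* faults resolved on a remote node */
};

/* Called on the same event that bumps the existing per-thread stats */
static inline void pan_account_fault(struct pan_mm_stats *stats, bool local)
{
	if (local)
		atomic_fetch_add(&stats->local_faults, 1);
	else
		atomic_fetch_add(&stats->remote_faults, 1);
}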

> 
>> The new per-process fault stats contribute only to the per-process
>> scan period calculation, while the existing per-thread stats continue
>> to contribute towards the numa_group stats which eventually
>> determine the thresholds for migrating memory and threads
>> across nodes.
>>
>> This patchset has been tested with a bunch of benchmarks on the
>> following system:
>>
> 
> Please include the comparisons of both the headline metrics and notes on
> the change in scan rates in the changelog of the patch. Not all people
> have access to Google drive and it is not guaranteed to remain forever.
> Similarly, the leader is not guaranteed to appear in the git history.

Sure, noted.

> 
>> ------------------------------------------------------
>> % gain of PAN vs default (Avg of 3 runs)
>> ------------------------------------------------------
>> NAS-BT		-0.17
>> NAS-CG		+9.39
>> NAS-MG		+8.19
>> NAS-FT		+2.23
>> Hashjoin	+0.58
>> Graph500	+14.93
>> Pagerank	+0.37
> 
> 
> 
>> ------------------------------------------------------
>> 		Default		PAN		%diff
>> ------------------------------------------------------
>> 		NUMA hint faults(Total of 3 runs)
>> ------------------------------------------------------
>> NAS-BT		758282358	539850429	+29
>> NAS-CG		2179458823	1180301361	+46
>> NAS-MG		517641172	346066391	+33
>> NAS-FT		297044964	230033861	+23
>> Hashjoin	201684863	268436275	-33
>> Graph500	261808733	154338827	+41
>> Pagerank	217917818	211260310	+03
>> ------------------------------------------------------
>> 		Migrations(Total of 3 runs)
>> ------------------------------------------------------
>> NAS-BT		106888517	86482076	+19
>> NAS-CG		81191368	12859924	+84
>> NAS-MG		83927451	39651254	+53
>> NAS-FT		61807715	38934618	+37
>> Hashjoin	45406983	59828843	-32
>> Graph500	22798837	21560714	+05
>> Pagerank	59072135	44968673	+24
>> ------------------------------------------------------
>>
>> And here are some tests from a few microbenchmarks of the mmtests suite.
>> (The results are trimmed a bit here; the complete results can
>> be viewed in the above-mentioned link)
>>
>> Hackbench
>> ---------
>> hackbench-process-pipes
>>                            hackbench              hackbench
>>                              default                    pan
>> Min       256     23.5510 (   0.00%)     23.1900 (   1.53%)
>> Amean     256     24.4604 (   0.00%)     24.0353 *   1.74%*
>> Stddev    256      0.4420 (   0.00%)      0.7611 ( -72.18%)
>> CoeffVar  256      1.8072 (   0.00%)      3.1666 ( -75.22%)
>> Max       256     25.4930 (   0.00%)     30.5450 ( -19.82%)
>> BAmean-50 256     24.1074 (   0.00%)     23.6616 (   1.85%)
>> BAmean-95 256     24.4111 (   0.00%)     23.9308 (   1.97%)
>> BAmean-99 256     24.4499 (   0.00%)     23.9696 (   1.96%)
>>
>>                    hackbench   hackbench
>>                      default         pan
>> Duration User       25810.02    25158.93
>> Duration System    276322.70   271729.32
>> Duration Elapsed     2707.75     2671.33
>>
> 
>>                                       hackbench      hackbench
>>                                         default            pan
>> Ops NUMA alloc hit                1082415453.00  1088025994.00
>> Ops NUMA alloc miss                        0.00           0.00
>> Ops NUMA interleave hit                    0.00           0.00
>> Ops NUMA alloc local              1082415441.00  1088025974.00
>> Ops NUMA base-page range updates       33475.00      228900.00
>> Ops NUMA PTE updates                   33475.00      228900.00
>> Ops NUMA PMD updates                       0.00           0.00
>> Ops NUMA hint faults                   15758.00      222100.00
>> Ops NUMA hint local faults %           15371.00      214570.00
>> Ops NUMA hint local percent               97.54          96.61
>> Ops NUMA pages migrated                  235.00        4029.00
>> Ops AutoNUMA cost                         79.03        1112.18
>>
> 
> Hackbench processes are generally short-lived enough that NUMA balancing
> has a marginal impact. Interesting though that updates and hints were
> increased by a lot relatively speaking.

Yes, the increased AutoNUMA cost seen with these microbenchmarks is typically
not seen with the other benchmarks listed at the beginning, which we believe
contributes to the gains that those benchmarks see.

The algorithm aggressively tries to learn the application behaviour at the
beginning, so short-lived tasks will see more scanning than with the default.
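
As a rough illustration of that learning behaviour, here is a hedged userspace
sketch only; the 30% threshold, the 2/3 and 3/2 step factors and all names are
made up for this example and are not taken from the patch:

#include <stdio.h>

/* Shrink the scan period while remote faults dominate, relax it otherwise */
static unsigned int pan_next_scan_period(unsigned int period_ms,
					 unsigned long local_faults,
					 unsigned long remote_faults)
{
	const unsigned int min_ms = 1000, max_ms = 60000;
	unsigned long total = local_faults + remote_faults;
	unsigned int remote_pct;

	if (!total)
		return max_ms;			/* nothing observed: back off */

	remote_pct = remote_faults * 100 / total;
	if (remote_pct > 30)			/* still converging: scan faster */
		period_ms = period_ms * 2 / 3;
	else					/* mostly local: relax scanning */
		period_ms = period_ms * 3 / 2;

	if (period_ms < min_ms)
		period_ms = min_ms;
	if (period_ms > max_ms)
		period_ms = max_ms;
	return period_ms;
}

int main(void)
{
	/* 60% remote faults at a 4000ms period -> period drops to 2666ms */
	printf("%u\n", pan_next_scan_period(4000, 400, 600));
	return 0;
}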

Having said that, we need to investigate why some of these microbenchmarks
incur a higher AutoNUMA cost.
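
One way to do that is to snapshot the kernel's NUMA balancing counters around
each microbenchmark run; the mmtests "Ops NUMA ..." lines above come from the
same /proc/vmstat counters. A small helper along those lines (assumes a kernel
with CONFIG_NUMA_BALANCING so the numa_* fields appear in /proc/vmstat):

#include <stdio.h>
#include <string.h>

/* Print the numa_* counters (numa_pte_updates, numa_hint_faults, ...) */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "numa_", 5))
			fputs(line, stdout);
	fclose(f);
	return 0;
}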

> 
>> Netperf-RR
>> ----------
>> netperf-udp-rr
>>                            netperf                netperf
>>                         rr-default                 rr-pan
>> Min       1   104915.69 (   0.00%)   104505.71 (  -0.39%)
>> Hmean     1   105865.46 (   0.00%)   105899.22 *   0.03%*
>> Stddev    1      528.45 (   0.00%)      881.92 ( -66.89%)
>> CoeffVar  1        0.50 (   0.00%)        0.83 ( -66.83%)
>> Max       1   106410.28 (   0.00%)   107196.52 (   0.74%)
>> BHmean-50 1   106232.53 (   0.00%)   106568.26 (   0.32%)
>> BHmean-95 1   105972.05 (   0.00%)   106056.35 (   0.08%)
>> BHmean-99 1   105972.05 (   0.00%)   106056.35 (   0.08%)
>>
>>                      netperf     netperf
>>                   rr-default      rr-pan
>> Duration User          11.20       10.74
>> Duration System       202.40      201.32
>> Duration Elapsed      303.09      303.08
>>
>>                                         netperf        netperf
>>                                      rr-default         rr-pan
>> Ops NUMA alloc hit                    183999.00      183853.00
>> Ops NUMA alloc miss                        0.00           0.00
>> Ops NUMA interleave hit                    0.00           0.00
>> Ops NUMA alloc local                  183999.00      183853.00
>> Ops NUMA base-page range updates           0.00       24370.00
>> Ops NUMA PTE updates                       0.00       24370.00
>> Ops NUMA PMD updates                       0.00           0.00
>> Ops NUMA hint faults                     539.00       24470.00
>> Ops NUMA hint local faults %             539.00       24447.00
>> Ops NUMA hint local percent              100.00          99.91
>> Ops NUMA pages migrated                    0.00          23.00
>> Ops AutoNUMA cost                          2.69         122.52
>>
> 
> Netperf these days usually runs on the same node so NUMA balancing
> triggers very rarely.

But we still see an increase in hint faults, which we need to investigate.


>>                autonumabench  autonumabench
>>                      default         pan
>> Duration User       94363.43    94436.71
>> Duration System     81671.72    81408.53
>> Duration Elapsed     1676.81     1647.99
>>
>>                                   autonumabench  autonumabench
>>                                         default            pan
>> Ops NUMA alloc hit                 539544115.00   539522029.00
>> Ops NUMA alloc miss                        0.00           0.00
>> Ops NUMA interleave hit                    0.00           0.00
>> Ops NUMA alloc local               279025768.00   281735736.00
>> Ops NUMA base-page range updates    69695169.00    84767502.00
>> Ops NUMA PTE updates                69695169.00    84767502.00
>> Ops NUMA PMD updates                       0.00           0.00
>> Ops NUMA hint faults                69691818.00    87895044.00
>> Ops NUMA hint local faults %        56565519.00    65819747.00
>> Ops NUMA hint local percent               81.17          74.88
>> Ops NUMA pages migrated              5950362.00     8310169.00
>> Ops AutoNUMA cost                     349060.01      440226.49
>>
> 
> More hinting faults and migrations. Not clear which sub-test exactly but
> most likely NUMA02.

I will have to run them separately and check.

Regards,
Bharata.


Thread overview: 16+ messages
2022-01-28  5:28 [RFC PATCH v0 0/3] sched/numa: Process Adaptive autoNUMA Bharata B Rao
2022-01-28  5:28 ` [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework Bharata B Rao
2022-01-31 12:17   ` Mel Gorman
2022-02-01 12:22     ` Bharata B Rao
2022-02-01 14:15       ` Mel Gorman
2022-02-04 11:03         ` Bharata B Rao
2022-02-04 14:09           ` Mel Gorman
2023-06-21  5:50         ` Raghavendra K T
2022-01-28  5:28 ` [RFC PATCH v0 2/3] sched/numa: Add cumulative history of per-process fault stats Bharata B Rao
2022-01-31 12:17   ` Mel Gorman
2022-02-01 12:30     ` Bharata B Rao
2022-01-28  5:28 ` [RFC PATCH v0 3/3] sched/numa: Add adaptive scan period calculation Bharata B Rao
2022-01-31 12:17   ` Mel Gorman
2022-02-01 13:00     ` Bharata B Rao
2022-01-31 12:17 ` [RFC PATCH v0 0/3] sched/numa: Process Adaptive autoNUMA Mel Gorman
2022-02-01 13:07   ` Bharata B Rao [this message]
