Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Abel Wu <wuyun.abel@bytedance.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>, Mel Gorman <mgorman@suse.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <valentin.schneider@arm.com>
Cc: Josh Don <joshdon@google.com>, Chen Yu <yu.c.chen@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	Aubrey Li <aubrey.li@intel.com>,
	Qais Yousef <qais.yousef@arm.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Rik van Riel <riel@surriel.com>,
	Yicong Yang <yangyicong@huawei.com>,
	Barry Song <21cnbao@gmail.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS
Date: Tue, 15 Nov 2022 16:58:37 +0530	[thread overview]
Message-ID: <906747ff-148c-f058-dc94-7a9225125f52@amd.com> (raw)
In-Reply-To: <2a049755-57cb-4943-0850-cbbf2537c97e@bytedance.com>

Hello Abel,

Thank you for taking a look at the report.

On 11/15/2022 2:01 PM, Abel Wu wrote:
> Hi Prateek, thanks very much for your detailed testing!
> 
> On 11/14/22 1:45 PM, K Prateek Nayak wrote:
>> Hello Abel,
>>
>> Sorry for the delay. I've tested the patch on a dual socket Zen3 system
>> (2 x 64C/128T)
>>
>> tl;dr
>>
>> o I do not notice any regressions with the standard benchmarks.
>> o schbench sees a nice improvement to the tail latency when the number
>>    of worker are equal to the number of cores in the system in NPS1 and
>>    NPS2 mode. (Marked with "^")
>> o Few data points show improvements in tbench in NPS1 and NPS2 mode.
>>    (Marked with "^")
>>
>> I'm still in the process of running larger workloads. If there is any
>> specific workload you would like me to run on the test system, please
>> do let me know. Below is the detailed report:
> 
> Not particularly in my mind, and I think testing larger workloads is
> great. Thanks!
>
>>
>> Following are the results from running standard benchmarks on a
>> dual socket Zen3 (2 x 64C/128T) machine configured in different
>> NPS modes.
>>
>> NPS Modes are used to logically divide single socket into
>> multiple NUMA region.
>> Following is the NUMA configuration for each NPS mode on the system:
>>
>> NPS1: Each socket is a NUMA node.
>>      Total 2 NUMA nodes in the dual socket machine.
>>
>>      Node 0: 0-63,   128-191
>>      Node 1: 64-127, 192-255
>>
>> NPS2: Each socket is further logically divided into 2 NUMA regions.
>>      Total 4 NUMA nodes exist over 2 socket.
>>          Node 0: 0-31,   128-159
>>      Node 1: 32-63,  160-191
>>      Node 2: 64-95,  192-223
>>      Node 3: 96-127, 223-255
>>
>> NPS4: Each socket is logically divided into 4 NUMA regions.
>>      Total 8 NUMA nodes exist over 2 socket.
>>          Node 0: 0-15,    128-143
>>      Node 1: 16-31,   144-159
>>      Node 2: 32-47,   160-175
>>      Node 3: 48-63,   176-191
>>      Node 4: 64-79,   192-207
>>      Node 5: 80-95,   208-223
>>      Node 6: 96-111,  223-231
>>      Node 7: 112-127, 232-255
>>
>> Benchmark Results:
>>
>> Kernel versions:
>> - tip:          5.19.0 tip sched/core
>> - sis_core:     5.19.0 tip sched/core + this series
>>
>> When we started testing, the tip was at:
>> commit fdf756f71271 ("sched: Fix more TASK_state comparisons")
>>
>> ~~~~~~~~~~~~~
>> ~ hackbench ~
>> ~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test:            tip            sis_core
>>   1-groups:       4.06 (0.00 pct)       4.26 (-4.92 pct)    *
>>   1-groups:       4.14 (0.00 pct)       4.09 (1.20 pct)    [Verification Run]
>>   2-groups:       4.76 (0.00 pct)       4.71 (1.05 pct)
>>   4-groups:       5.22 (0.00 pct)       5.11 (2.10 pct)
>>   8-groups:       5.35 (0.00 pct)       5.31 (0.74 pct)
>> 16-groups:       7.21 (0.00 pct)       6.80 (5.68 pct)
>>
>> o NPS2
>>
>> Test:            tip            sis_core
>>   1-groups:       4.09 (0.00 pct)       4.08 (0.24 pct)
>>   2-groups:       4.70 (0.00 pct)       4.69 (0.21 pct)
>>   4-groups:       5.05 (0.00 pct)       4.92 (2.57 pct)
>>   8-groups:       5.35 (0.00 pct)       5.26 (1.68 pct)
>> 16-groups:       6.37 (0.00 pct)       6.34 (0.47 pct)
>>
>> o NPS4
>>
>> Test:            tip            sis_core
>>   1-groups:       4.07 (0.00 pct)       3.99 (1.96 pct)
>>   2-groups:       4.65 (0.00 pct)       4.59 (1.29 pct)
>>   4-groups:       5.13 (0.00 pct)       5.00 (2.53 pct)
>>   8-groups:       5.47 (0.00 pct)       5.43 (0.73 pct)
>> 16-groups:       6.82 (0.00 pct)       6.56 (3.81 pct)
> 
> Although each cpu will get 2.5 tasks when 16-groups, which can
> be considered overloaded, I tested in AMD EPYC 7Y83 machine and
> the total cpu usage was ~82% (with some older kernel version),
> so there is still lots of idle time.
> 
> I guess cutting off at 16-groups is because it's enough loaded
> compared to the real workloads, so testing more groups might just
> be a waste of time?

The machine has 16 LLCs so I capped the results at 16-groups.
Previously I had seen some run-to-run variance with larger group counts
so I limited the reports to 16-groups. I'll run hackbench with more
number of groups (32, 64, 128, 256) and get back to you with the
results along with results for a couple of long running workloads. 

> 
> Thanks & Best Regards,
>     Abel
> 
> [..snip..]
>

--
Thanks and Regards,
Prateek