From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
	vschneid@redhat.com, linux-kernel@vger.kernel.org,
	parth@linux.ibm.com, qyousef@layalina.io, chris.hyser@oracle.com,
	patrick.bellasi@matbug.net, David.Laight@aculab.com,
	pjt@google.com, pavel@ucw.cz, tj@kernel.org, qperret@google.com,
	tim.c.chen@linux.intel.com, joshdon@google.com, timj@gnu.org,
	yu.c.chen@intel.com, youssefesmat@chromium.org,
	joel@joelfernandes.org
Subject: Re: [PATCH v9 0/9] Add latency priority for CFS class
Date: Wed, 7 Dec 2022 21:56:04 +0530
Message-ID: <e3fdc51b-19aa-c85d-f51e-16ff9cf64e2a@amd.com>
In-Reply-To: <CAKfTPtDgVT8mGhDbh9Z40769Ju1DMFpL+zu+rEqnYyJRYetmfg@mail.gmail.com>

Hello Vincent,

Thank you for taking a look at the report.

On 11/28/2022 10:49 PM, Vincent Guittot wrote:
> Hi Prateek,
> 
> On Mon, 28 Nov 2022 at 12:52, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Vincent,
>>
>> Following are the test results on dual socket Zen3 machine (2 x 64C/128T)
>>
>> tl;dr
>>
>> o All benchmarks with DEFAULT_LATENCY_NICE value are comparable to tip.
>>   There is, however, a noticeable dip for unixbench-spawn test case.
>>
>> o With the 2 rbtree approach, I do not see much difference in the
>>   hackbench results with varying latency nice value. Tests on v5 did
>>   yield noticeable improvements for hackbench.
>>   (https://lore.kernel.org/lkml/cd48ebbb-9724-985f-28e3-e558dea07827@amd.com/)
> 
> The 2 rbtree approach is the one that was already used in v5. I just
> reran the hackbench tests with the latest tip and v6.2-rc7, and I can
> see a large performance improvement for the pipe tests on my system
> (an 8-core system). Could you try with a larger number of groups,
> like 64, 128, and 256 groups?

Ah! My bad. I've rerun hackbench with a larger number of groups and I see
a clear win for pipes with latency nice 19. Hackbench with sockets also
sees a small win.

o pipes

$ perf bench sched messaging -p -l 50000 -g <groups>

latency_nice:           0                       19                      -20
32-groups:         9.43 (0.00 pct)         6.42 (31.91 pct)        9.75 (-3.39 pct)
64-groups:        21.55 (0.00 pct)        12.97 (39.81 pct)       21.48 (0.32 pct)
128-groups:       41.15 (0.00 pct)        24.18 (41.23 pct)       46.69 (-13.46 pct)
256-groups:       78.87 (0.00 pct)        43.65 (44.65 pct)       78.84 (0.03 pct)
512-groups:      125.48 (0.00 pct)        78.91 (37.11 pct)      136.21 (-8.55 pct)
1024-groups:     292.81 (0.00 pct)       151.36 (48.30 pct)      323.57 (-10.50 pct)

o sockets

$ perf bench sched messaging  -l 100000 -g <groups>

latency_nice:           0                       19                      -20
32-groups:        27.23 (0.00 pct)        27.00 (0.84 pct)        26.92 (1.13 pct)
64-groups:        45.71 (0.00 pct)        44.58 (2.47 pct)        45.86 (-0.32 pct)
128-groups:       79.55 (0.00 pct)        78.22 (1.67 pct)        80.01 (-0.57 pct)
256-groups:      161.41 (0.00 pct)       164.04 (-1.62 pct)      169.57 (-5.05 pct)
512-groups:      326.41 (0.00 pct)       310.00 (5.02 pct)       342.17 (-4.82 pct)
1024-groups:     634.36 (0.00 pct)       633.59 (0.12 pct)       640.05 (-0.89 pct)

Note: All tests were done in NPS1 mode.
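
For anyone trying to reproduce these runs: below is a minimal sketch of
one way a task's latency nice can be set before exec'ing the benchmark,
assuming the sched_attr extension (the sched_latency_nice field and the
SCHED_FLAG_LATENCY_NICE flag) proposed in patch 4/9 of this series. The
struct layout and flag value are taken from my reading of the patches and
may change in later revisions, so treat this as illustrative rather than
as the exact harness used for the numbers above.

/*
 * lat_nice_run.c - set latency nice on the current task, then exec the
 * benchmark so that it is inherited by all of its children.
 * Build: gcc -o lat_nice_run lat_nice_run.c
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_FLAG_LATENCY_NICE
#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value from the series */
#endif

/* Local copy of the extended sched_attr proposed by this series. */
struct sched_attr_lat {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* Utilization clamps */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	/* New field added by this series */
	int32_t  sched_latency_nice;
};

int main(int argc, char **argv)
{
	struct sched_attr_lat attr;
	int latency_nice = (argc > 1) ? atoi(argv[1]) : 19;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = latency_nice;

	/* sched_setattr() has no glibc wrapper; use the raw syscall. */
	if (syscall(SYS_sched_setattr, 0 /* current task */, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/*
	 * The benchmark would be exec'd here so that every worker it
	 * forks inherits latency_nice (per patch 3/9), e.g.:
	 * execlp("perf", "perf", "bench", "sched", "messaging",
	 *        "-p", "-l", "50000", "-g", "128", (char *)NULL);
	 */
	return 0;
}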

> 
>>
>> [..snip..]
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Unixbench - DEFAULT_LATENCY_NICE ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> Test                    Metric    Parallelism                   tip                   latency_nice
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48929419.48 (   0.00%)    49137039.06 (   0.42%)
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6275526953.25 (   0.00%)  6265580479.15 (  -0.16%)
>> unixbench-syscall       Amean     unixbench-syscall-1        2994319.73 (   0.00%)     3008596.83 *  -0.48%*
>> unixbench-syscall       Amean     unixbench-syscall-512      7349715.87 (   0.00%)     7420994.50 *  -0.97%*
>> unixbench-pipe          Hmean     unixbench-pipe-1           2830206.03 (   0.00%)     2854405.99 *   0.86%*
>> unixbench-pipe          Hmean     unixbench-pipe-512       326207828.01 (   0.00%)   328997804.52 *   0.86%*
>> unixbench-spawn         Hmean     unixbench-spawn-1             6394.21 (   0.00%)        6367.75 (  -0.41%)
>> unixbench-spawn         Hmean     unixbench-spawn-512          72700.64 (   0.00%)       71454.19 *  -1.71%*
>> unixbench-execl         Hmean     unixbench-execl-1             4723.61 (   0.00%)        4750.59 (   0.57%)
>> unixbench-execl         Hmean     unixbench-execl-512          11212.05 (   0.00%)       11262.13 (   0.45%)
>>
>> o NPS2
>>
>> Test                    Metric    Parallelism                   tip                   latency_nice
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49271512.85 (   0.00%)    49245260.43 (  -0.05%)
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6267992483.03 (   0.00%)  6264951100.67 (  -0.05%)
>> unixbench-syscall       Amean     unixbench-syscall-1        2995885.93 (   0.00%)     3005975.10 *  -0.34%*
>> unixbench-syscall       Amean     unixbench-syscall-512      7388865.77 (   0.00%)     7276275.63 *   1.52%*
>> unixbench-pipe          Hmean     unixbench-pipe-1           2828971.95 (   0.00%)     2856578.72 *   0.98%*
>> unixbench-pipe          Hmean     unixbench-pipe-512       326225385.37 (   0.00%)   328941270.81 *   0.83%*
>> unixbench-spawn         Hmean     unixbench-spawn-1             6958.71 (   0.00%)        6954.21 (  -0.06%)
>> unixbench-spawn         Hmean     unixbench-spawn-512          85443.56 (   0.00%)       70536.42 * -17.45%* (0.67% vs 0.93% - CoEff var)
> 
> I don't expect any perf improvement or regression when the latency
> nice is not changed

This regression can be ignored. Although the results from back-to-back
runs are very stable, I see the results vary when I rebuild the
unixbench binaries on my test setup.

			  tip	      latency_nice
unixbench-spawn-512	73489.0		78260.4		(kexec)
unixbench-spawn-512 	73332.7		77821.2		(reboot)
unixbench-spawn-512	86207.4		82281.2		(rebuilt + reboot)

I'll go back and look more into the spawn test because there is
something else at play there, but the other Unixbench results seem to
be stable based on the rerun.
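
As an aside, the "CoEff var" figure quoted in the table above is simply
the sample standard deviation of the per-run scores divided by their
mean. A short sketch of that computation is below; the scores in it are
placeholders, not the actual per-run data:

/* cov.c - coefficient of variation of a set of benchmark scores.
 * Build: gcc -o cov cov.c -lm
 */
#include <math.h>
#include <stdio.h>

static double coeff_of_variation(const double *s, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += s[i];
	mean /= n;

	for (i = 0; i < n; i++)
		var += (s[i] - mean) * (s[i] - mean);
	var /= (n - 1);				/* sample variance */

	return 100.0 * sqrt(var) / mean;	/* in percent */
}

int main(void)
{
	/* Placeholder scores for unixbench-spawn-512, not real data. */
	double scores[] = { 85443.6, 86207.4, 85901.2, 84988.9 };
	int n = sizeof(scores) / sizeof(scores[0]);

	printf("CoV: %.2f%%\n", coeff_of_variation(scores, n));
	return 0;
}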

> 
>> unixbench-execl         Hmean     unixbench-execl-1             4767.99 (   0.00%)        4752.63 *  -0.32%*
>> unixbench-execl         Hmean     unixbench-execl-512          11250.72 (   0.00%)       11320.97 (   0.62%)
>>
>> o NPS4
>>
>> Test                    Metric    Parallelism                   tip                   latency_nice
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49041932.68 (   0.00%)    49156671.05 (   0.23%)
>> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6286981589.85 (   0.00%)  6285248711.40 (  -0.03%)
>> unixbench-syscall       Amean     unixbench-syscall-1        2992405.60 (   0.00%)     3008933.03 *  -0.55%*
>> unixbench-syscall       Amean     unixbench-syscall-512      7971789.70 (   0.00%)     7814622.23 *   1.97%*
>> unixbench-pipe          Hmean     unixbench-pipe-1           2822892.54 (   0.00%)     2852615.11 *   1.05%*
>> unixbench-pipe          Hmean     unixbench-pipe-512       326408309.83 (   0.00%)   329617202.56 *   0.98%*
>> unixbench-spawn         Hmean     unixbench-spawn-1             7685.31 (   0.00%)        7243.54 (  -5.75%)
>> unixbench-spawn         Hmean     unixbench-spawn-512          72245.56 (   0.00%)       77000.81 *   6.58%*
>> unixbench-execl         Hmean     unixbench-execl-1             4761.42 (   0.00%)        4733.12 *  -0.59%*
>> unixbench-execl         Hmean     unixbench-execl-512          11533.53 (   0.00%)       11660.17 (   1.10%)
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> ~ Hackbench - Various Latency Nice Values ~
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o 100000 loops
>>
>> - pipe (process)
>>
>> Test:                   LN: 0                   LN: 19                  LN: -20
>>  1-groups:         3.91 (0.00 pct)         3.91 (0.00 pct)         3.81 (2.55 pct)
>>  2-groups:         4.48 (0.00 pct)         4.52 (-0.89 pct)        4.53 (-1.11 pct)
>>  4-groups:         4.83 (0.00 pct)         4.83 (0.00 pct)         4.87 (-0.82 pct)
>>  8-groups:         5.09 (0.00 pct)         5.00 (1.76 pct)         5.07 (0.39 pct)
>> 16-groups:         6.92 (0.00 pct)         6.79 (1.87 pct)         6.96 (-0.57 pct)
>>
>> - pipe (thread)
>>
>>  1-groups:         4.13 (0.00 pct)         4.08 (1.21 pct)         4.11 (0.48 pct)
>>  2-groups:         4.78 (0.00 pct)         4.90 (-2.51 pct)        4.79 (-0.20 pct)
>>  4-groups:         5.12 (0.00 pct)         5.08 (0.78 pct)         5.16 (-0.78 pct)
>>  8-groups:         5.31 (0.00 pct)         5.28 (0.56 pct)         5.33 (-0.37 pct)
>> 16-groups:         7.34 (0.00 pct)         7.27 (0.95 pct)         7.33 (0.13 pct)
>>
>> - socket (process)
>>
>> Test:                   LN: 0                   LN: 19                  LN: -20
>>  1-groups:         6.61 (0.00 pct)         6.38 (3.47 pct)         6.54 (1.05 pct)
>>  2-groups:         6.59 (0.00 pct)         6.67 (-1.21 pct)        6.11 (7.28 pct)
>>  4-groups:         6.77 (0.00 pct)         6.78 (-0.14 pct)        6.79 (-0.29 pct)
>>  8-groups:         8.29 (0.00 pct)         8.39 (-1.20 pct)        8.36 (-0.84 pct)
>> 16-groups:        12.21 (0.00 pct)        12.03 (1.47 pct)        12.35 (-1.14 pct)
>>
>> - socket (thread)
>>
>> Test:                   LN: 0                   LN: 19                  LN: -20
>>  1-groups:         6.50 (0.00 pct)         5.99 (7.84 pct)         6.02 (7.38 pct)      ^
>>  2-groups:         6.07 (0.00 pct)         6.20 (-2.14 pct)        6.23 (-2.63 pct)
>>  4-groups:         6.61 (0.00 pct)         6.64 (-0.45 pct)        6.63 (-0.30 pct)
>>  8-groups:         8.87 (0.00 pct)         8.67 (2.25 pct)         8.78 (1.01 pct)
>> 16-groups:        12.63 (0.00 pct)        12.54 (0.71 pct)        12.59 (0.31 pct)
>>
>>> [..snip..]
>>>
>>
>> Apart from a couple of anomalies, latency nice reduces wait time, especially
>> when the system is heavily loaded. If there is any data, or any specific
>> workload you would like me to run on the test system, please do let me know.
>> Meanwhile, I'll try to get some numbers for larger workloads like SpecJBB
>> that did see improvements with latency nice on v5.

Following are results for SpecJBB in NPS1 mode:

+----------------------------------------------+
|                |   Latency Nice    |         |
|     Metric     |-------------------|   tip   |
|                |    0    |    19   |         |
|----------------|-------------------|---------|
|    Max jOPS    | 100.00% | 102.19% | 101.02% |
| Critical jOPS  | 100.00% | 122.41% | 100.41% |
+----------------------------------------------+

SpecJBB throughput for Max-jOPS is similar across the board, but
Critical-jOPS throughput again sees a good uplift with latency nice 19.

> 
> [..snip..]
>

If there is any specific workload you would like me to test,
please do let me know. I'll try to test more workloads I come
across with different latency nice values and update this
thread with the results.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
