Hello Yicong,

On 6/16/2022 1:25 PM, Yicong Yang wrote:
> Hi Prateek,
> 
> Seems my previous reply is in the wrong format so the server rejected it..just repost..
> 
> Thanks a lot for the test!

Thank you for looking into it and suggesting further steps
for debugging the issue.

>> [..snip..]
>>
>> We are still trying to root cause the underlying issue that
>> brought about such drastic regression in tbench performance. 
>>
>>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 05b6c2ad90b9..0595827d481d 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>  DEFINE_PER_CPU(int, sd_llc_size);
>>>  DEFINE_PER_CPU(int, sd_llc_id);
>>> +DEFINE_PER_CPU(int, sd_share_id);
>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>>>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>>
>>>  [..snip..]
>>>
>>
>> We would like some time to investigate this issue and root cause
>> the reason for this regression.
> 
> One significant difference I can see for now is that Kunpeng 920 is a non-SMT machine while Zen3
> is a SMT machine. So in the select_idle_sibling() path we won't use sd_llc_shared. Since sd_share_id
> and sd_cluster are inserted between sd_llc_id and sd_llc_shared which are both used in the path on
> your test, I guess the change of the layout may affect something like the cache behaviour.
> 
> Can you also test with SMT disabled?

Following are results on 2 x 64C Zen3 machine with SMT disabled in BIOS:

Clients:      tip                tip + Patch 1 only        tip + both patches        tip + both patches (Move declaration to top of block)
    1    456.35 (0.00 pct)       455.26 (-0.23 pct)        449.19 (-1.56 pct)        444.95 (-2.49 pct)
    2    887.44 (0.00 pct)       881.23 (-0.69 pct)        854.28 (-3.73 pct)        878.32 (-1.02 pct)
    4    1693.37 (0.00 pct)      1616.95 (-4.51 pct)       1635.48 (-3.41 pct)       1631.13 (-3.67 pct)
    8    3212.89 (0.00 pct)      3033.79 (-5.57 pct)   *   3135.47 (-2.40 pct)       3234.27 (0.66 pct)
   16    6234.92 (0.00 pct)      5201.92 (-16.56 pct)  *   5530.89 (-11.29 pct)  *   5912.84 (-5.16 pct)    *
   32    12237.45 (0.00 pct)     8156.19 (-33.35 pct)  *   10959.70 (-10.44 pct) *   11810.84 (-3.48 pct)
   64    24020.17 (0.00 pct)     13164.27 (-45.19 pct) *   14578.70 (-39.30 pct) *   21266.73 (-11.46 pct)  *
  128    34633.95 (0.00 pct)     32629.42 (-5.78 pct)  *   31811.74 (-8.14 pct)  *   28707.32 (-17.11 pct)  *
  256    34310.04 (0.00 pct)     37271.34 (8.63 pct)       34086.25 (-0.65 pct)      29466.42 (-14.11 pct)  *
  512    35216.68 (0.00 pct)     36763.77 (4.39 pct)       35632.86 (1.18 pct)       31823.27 (-9.63 pct)   *

As you can see the regression still exists. even with only first patch
of the series applied on top of the tip. Moving the declarations around
to top helps some cases but we are having troubles at the saturation
point with the move.

Following is the diff for "Move declaration to top of block":

--
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e9f3dc6dcbf4..97a3895416ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1750,12 +1750,12 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 	return sd;
 }
 
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
+DECLARE_PER_CPU(int, sd_share_id);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
-DECLARE_PER_CPU(int, sd_share_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
-DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
--

The System-map of each kernel is as follows:

- On "tip"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_llc_shared
0000000000020538 D sd_llc_id
000000000002053c D sd_llc_size
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_llc

- On "tip + Patch 1 only" and "tip + both patches"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_cluster     <-----
0000000000020538 D sd_llc_shared
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_share_id    <-----
0000000000020544 D sd_llc_id
0000000000020548 D sd_llc_size
0000000000020550 D sd_llc


- On "tip + both patches (Move declaration to top)"

0000000000020518 D sd_asym_cpucapacity
0000000000020520 D sd_asym_packing
0000000000020528 D sd_numa
0000000000020530 D sd_llc_shared
0000000000020538 D sd_llc_id
000000000002053c D sd_llc_size
-------------------------------------------- 64B Cacheline Boundary
0000000000020540 D sd_llc
0000000000020548 D sd_share_id    <-----
0000000000020550 D sd_cluster     <-----

> Or change the layout a bit to see if there's any difference,
> like:
> 
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(int, sd_share_id);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> 
> I need to further look into it and have some tests on a SMT machine. Would you mind to share
> the kernel config as well? I'd like to compare the config as well.

I've attached the kernel config used to build the test kernel
to this mail.

> 
> Thanks,
> Yicong

We are trying to debug the issue using perf and find an optimal
arrangement of the per cpu declarations to get the relevant data
used in the wakeup path on the same 64B cache line.

We'll keep you posted of out finding. Let me know if you need
anything else in the meantime.
--
Thanks and Regards,
Prateek