* [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS
@ 2022-09-09  5:52 Abel Wu
  2022-09-09  5:53 ` [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core Abel Wu
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09  5:52 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS,
plays an important role in maximizing the usage of cpu resources and
can greatly affect overall performance of the system.

SIS tries to find an idle cpu inside the target LLC to place the
woken-up task on. The cache-hot cpus are preferred, but if none of
them is idle, it falls back to the other cpus of that LLC by firing
a domain scan.  The domain scan works well under light pressure,
simply traversing the cpus of the LLC, because plenty of idle cpus
are usually available.
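
For reference, a highly simplified sketch of the flow described above
(illustrative only; scan_llc_for_idle_cpu() is a schematic stand-in,
not the exact mainline implementation):

	/* Conceptual shape of the SIS wakeup fastpath */
	static int sis_sketch(struct task_struct *p, int prev, int target)
	{
		/* 1. Prefer the cache-hot candidates: target, prev, ... */
		if (available_idle_cpu(target))
			return target;
		if (cpus_share_cache(prev, target) && available_idle_cpu(prev))
			return prev;

		/*
		 * 2. Fall back to scanning the rest of the LLC, bounded by
		 *    SIS_{UTIL,PROP}; this is the domain scan this series
		 *    tries to make cheaper.
		 */
		return scan_llc_for_idle_cpu(p, target);
	}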

But things change. The LLCs are getting bigger in modern and future
machines, and cloud service providers keep trying to reduce TCO by
pushing more load onto them, squeezing all kinds of resources. So
simple traversal doesn't fit future requirements well.

Features like SIS_{UTIL,PROP} already exist to cope with the
scalability issue by limiting the scan depth, and it would be even
better if we could also improve the way the scan itself works. That
is exactly what SIS_FILTER is for.

So this patchset will focus on improving the efficiency of the SIS
domain scan:

 - Patch 1~2, give a chance to scan for idle cores if has_idle_core
   is true even when the LLC is overloaded (nr_idle_scan == 0). This
   helps exploit the cpu resources by more actively putting idle
   cores to work.

 - Patch 3, add a check for pending tasks right at the beginning of
   __update_idle_core(), so the has_idle_cores hint won't be updated
   by mistake if the cpu isn't idle any more.

 - Patch 4, skip the SIS domain scan if the target LLC is fully busy,
   so cpu resources can be used more wisely rather than wasted on
   trying to find an idle cpu that probably doesn't exist.

 - Patch 5, introduce the SIS_FILTER feature to optimize the way the
   LLC is scanned, helping SIS_{UTIL,PROP} work more efficiently.

Benchmark
=========

Tests are done on a dual-socket (2 x 24C/48T) Intel Xeon(R) Platinum
8260 machine, with the following SNC configurations:

	SNC on:  4 NUMA nodes each of which has 12C/24T
	SNC off: 2 NUMA nodes each of which has 24C/48T

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Based on tip sched/core 0fba527e959d (v5.19.0).

Results
=======

hackbench-process-pipes
                         vanilla		patched
(SNC on)
Amean     1        0.4480 (   0.00%)      0.4500 (  -0.45%)
Amean     4        0.6137 (   0.00%)      0.5990 (   2.39%)
Amean     7        0.7530 (   0.00%)      0.7377 (   2.04%)
Amean     12       1.1230 (   0.00%)      1.0490 *   6.59%*
Amean     21       2.0567 (   0.00%)      1.8680 *   9.17%*
Amean     30       3.0847 (   0.00%)      2.7710 *  10.17%*
Amean     48       5.9043 (   0.00%)      4.5393 *  23.12%*
Amean     79       9.3477 (   0.00%)      7.6610 *  18.04%*
Amean     110     11.0647 (   0.00%)     10.5560 (   4.60%)
Amean     141     13.3297 (   0.00%)     12.6137 *   5.37%*
Amean     172     15.2210 (   0.00%)     14.6547 (   3.72%)
Amean     203     17.8510 (   0.00%)     16.9000 *   5.33%*
Amean     234     19.9263 (   0.00%)     18.7687 *   5.81%*
Amean     265     21.9117 (   0.00%)     21.4060 (   2.31%)
Amean     296     23.7683 (   0.00%)     23.0990 *   2.82%*
(SNC off)
Amean     1        0.2963 (   0.00%)      0.2933 (   1.01%)
Amean     4        0.6093 (   0.00%)      0.5883 (   3.45%)
Amean     7        0.7837 (   0.00%)      0.7570 (   3.40%)
Amean     12       1.2703 (   0.00%)      1.0780 (  15.14%)
Amean     21       2.6260 (   0.00%)      1.8903 *  28.01%*
Amean     30       4.3483 (   0.00%)      2.7903 *  35.83%*
Amean     48       7.9753 (   0.00%)      4.8920 *  38.66%*
Amean     79       9.6540 (   0.00%)      8.0127 *  17.00%*
Amean     110     11.2597 (   0.00%)     10.1557 *   9.80%*
Amean     141     13.8077 (   0.00%)     12.7387 *   7.74%*
Amean     172     16.3513 (   0.00%)     14.5860 *  10.80%*
Amean     203     19.0880 (   0.00%)     17.1950 *   9.92%*
Amean     234     21.7660 (   0.00%)     19.6763 *   9.60%*
Amean     265     23.0447 (   0.00%)     22.5557 (   2.12%)
Amean     296     25.4660 (   0.00%)     24.4273 (   4.08%)

hackbench-process-sockets
                         vanilla		patched
(SNC on)
Amean     1        0.6503 (   0.00%)      0.6430 (   1.13%)
Amean     4        1.6320 (   0.00%)      1.6247 (   0.45%)
Amean     7        2.4810 (   0.00%)      2.4927 (  -0.47%)
Amean     12       4.0943 (   0.00%)      4.0743 (   0.49%)
Amean     21       6.8833 (   0.00%)      6.9220 (  -0.56%)
Amean     30       9.7560 (   0.00%)      9.7107 *   0.46%*
Amean     48      15.5020 (   0.00%)     15.5013 (   0.00%)
Amean     79      25.7690 (   0.00%)     25.7860 (  -0.07%)
Amean     110     35.7700 (   0.00%)     35.9203 (  -0.42%)
Amean     141     45.8710 (   0.00%)     46.0040 (  -0.29%)
Amean     172     55.7460 (   0.00%)     56.1457 *  -0.72%*
Amean     203     65.9393 (   0.00%)     65.8903 (   0.07%)
Amean     234     75.7340 (   0.00%)     76.0653 *  -0.44%*
Amean     265     85.8940 (   0.00%)     85.9670 (  -0.08%)
Amean     296     96.0660 (   0.00%)     96.3823 (  -0.33%)
(SNC off)
Amean     1        0.4050 (   0.00%)      0.4267 (  -5.35%)
Amean     4        1.4517 (   0.00%)      1.4260 *   1.77%*
Amean     7        2.4617 (   0.00%)      2.4583 (   0.14%)
Amean     12       4.0730 (   0.00%)      4.0900 (  -0.42%)
Amean     21       6.9217 (   0.00%)      6.9297 (  -0.12%)
Amean     30       9.7527 (   0.00%)      9.7483 (   0.04%)
Amean     48      15.5490 (   0.00%)     15.5040 (   0.29%)
Amean     79      26.1670 (   0.00%)     26.2290 (  -0.24%)
Amean     110     36.3910 (   0.00%)     36.4073 (  -0.04%)
Amean     141     46.6660 (   0.00%)     46.1683 *   1.07%*
Amean     172     56.7627 (   0.00%)     55.9527 *   1.43%*
Amean     203     66.8097 (   0.00%)     66.4117 *   0.60%*
Amean     234     77.3577 (   0.00%)     76.3683 (   1.28%)
Amean     265     87.5923 (   0.00%)     85.9367 *   1.89%*
Amean     296     97.2430 (   0.00%)     96.4750 *   0.79%*

hackbench-thread-pipes
                         vanilla		patched
(SNC on)
Amean     1        0.4470 (   0.00%)      0.4477 (  -0.15%)
Amean     4        0.6563 (   0.00%)      0.6803 (  -3.66%)
Amean     7        0.8303 (   0.00%)      0.7963 (   4.09%)
Amean     12       1.3983 (   0.00%)      1.2387 *  11.42%*
Amean     21       2.8490 (   0.00%)      2.3220 *  18.50%*
Amean     30       4.3807 (   0.00%)      3.2800 *  25.13%*
Amean     48       7.0663 (   0.00%)      5.4790 *  22.46%*
Amean     79      10.0643 (   0.00%)      8.4573 *  15.97%*
Amean     110     12.2953 (   0.00%)     11.2157 *   8.78%*
Amean     141     15.1210 (   0.00%)     14.2923 *   5.48%*
Amean     172     16.8823 (   0.00%)     16.3013 *   3.44%*
Amean     203     19.3927 (   0.00%)     18.9503 (   2.28%)
Amean     234     22.1817 (   0.00%)     21.1763 *   4.53%*
Amean     265     24.4327 (   0.00%)     24.1200 (   1.28%)
Amean     296     26.9480 (   0.00%)     26.3020 *   2.40%*
(SNC off)
Amean     1        0.2663 (   0.00%)      0.2583 (   3.00%)
Amean     4        0.6627 (   0.00%)      0.6147 *   7.24%*
Amean     7        0.8183 (   0.00%)      0.7757 (   5.21%)
Amean     12       1.3550 (   0.00%)      1.2430 (   8.27%)
Amean     21       3.4853 (   0.00%)      2.2950 *  34.15%*
Amean     30       7.1470 (   0.00%)      3.6847 *  48.44%*
Amean     48       9.6940 (   0.00%)      6.9593 *  28.21%*
Amean     79      10.3100 (   0.00%)      8.7727 *  14.91%*
Amean     110     11.9537 (   0.00%)     10.4757 *  12.36%*
Amean     141     14.2673 (   0.00%)     12.8350 *  10.04%*
Amean     172     16.6850 (   0.00%)     15.1883 *   8.97%*
Amean     203     19.5060 (   0.00%)     18.1390 *   7.01%*
Amean     234     21.6900 (   0.00%)     20.2783 *   6.51%*
Amean     265     24.9073 (   0.00%)     23.3440 *   6.28%*
Amean     296     26.9260 (   0.00%)     26.3007 (   2.32%)

hackbench-thread-sockets
                         vanilla		patched
(SNC on)
Amean     1        0.6700 (   0.00%)      0.6983 (  -4.23%)
Amean     4        1.6450 (   0.00%)      1.6530 (  -0.49%)
Amean     7        2.5587 (   0.00%)      2.5420 *   0.65%*
Amean     12       4.1793 (   0.00%)      4.1630 (   0.39%)
Amean     21       7.1000 (   0.00%)      7.1087 (  -0.12%)
Amean     30       9.9783 (   0.00%)      9.9740 (   0.04%)
Amean     48      15.8737 (   0.00%)     15.8643 (   0.06%)
Amean     79      26.3667 (   0.00%)     26.4097 (  -0.16%)
Amean     110     36.5910 (   0.00%)     36.7513 (  -0.44%)
Amean     141     46.8240 (   0.00%)     47.0613 *  -0.51%*
Amean     172     56.9787 (   0.00%)     57.3263 *  -0.61%*
Amean     203     67.4160 (   0.00%)     67.6133 (  -0.29%)
Amean     234     77.9683 (   0.00%)     78.1877 (  -0.28%)
Amean     265     88.1377 (   0.00%)     88.2207 (  -0.09%)
Amean     296     98.1933 (   0.00%)     98.4537 (  -0.27%)
(SNC off)
Amean     1        0.4280 (   0.00%)      0.4973 ( -16.20%)
Amean     4        1.4897 (   0.00%)      1.4593 *   2.04%*
Amean     7        2.5220 (   0.00%)      2.4803 *   1.65%*
Amean     12       4.1887 (   0.00%)      4.1693 (   0.46%)
Amean     21       7.1040 (   0.00%)      7.1267 (  -0.32%)
Amean     30      10.0300 (   0.00%)     10.0183 (   0.12%)
Amean     48      15.9953 (   0.00%)     15.9597 (   0.22%)
Amean     79      26.8963 (   0.00%)     26.6973 *   0.74%*
Amean     110     37.1493 (   0.00%)     37.0533 (   0.26%)
Amean     141     47.3680 (   0.00%)     47.1407 (   0.48%)
Amean     172     57.8660 (   0.00%)     57.5513 (   0.54%)
Amean     203     68.0573 (   0.00%)     68.3440 (  -0.42%)
Amean     234     78.5393 (   0.00%)     78.2117 (   0.42%)
Amean     265     89.1383 (   0.00%)     88.6450 (   0.55%)
Amean     296     99.8007 (   0.00%)     99.1537 *   0.65%*

tbench4 Throughput
                         vanilla		patched
(SNC on)
Hmean     1        300.70 (   0.00%)      302.52 *   0.61%*
Hmean     2        597.53 (   0.00%)      604.76 *   1.21%*
Hmean     4       1188.34 (   0.00%)     1204.79 *   1.38%*
Hmean     8       2336.22 (   0.00%)     2375.87 *   1.70%*
Hmean     16      4459.17 (   0.00%)     4681.25 *   4.98%*
Hmean     32      7606.69 (   0.00%)     7607.93 (   0.02%)
Hmean     64      9009.48 (   0.00%)     8956.00 *  -0.59%*
Hmean     128    19456.88 (   0.00%)    19258.30 *  -1.02%*
Hmean     256    19771.10 (   0.00%)    20783.82 *   5.12%*
Hmean     384    20118.74 (   0.00%)    20407.40 *   1.43%*
(SNC off)
Hmean     1        284.44 (   0.00%)      286.27 *   0.64%*
Hmean     2        564.10 (   0.00%)      574.82 *   1.90%*
Hmean     4       1120.93 (   0.00%)     1137.27 *   1.46%*
Hmean     8       2248.94 (   0.00%)     2261.98 *   0.58%*
Hmean     16      4360.10 (   0.00%)     4430.95 *   1.63%*
Hmean     32      7300.52 (   0.00%)     7341.70 *   0.56%*
Hmean     64      8912.37 (   0.00%)     8954.61 *   0.47%*
Hmean     128    19874.16 (   0.00%)    20198.82 *   1.63%*
Hmean     256    19759.42 (   0.00%)    19785.57 *   0.13%*
Hmean     384    19502.40 (   0.00%)    19956.96 *   2.33%*

netperf-udp
			     vanilla		   patched
(SNC on)
Hmean     send-64         209.06 (   0.00%)      210.32 (   0.60%)
Hmean     send-128        416.70 (   0.00%)      415.34 (  -0.33%)
Hmean     send-256        819.65 (   0.00%)      808.52 *  -1.36%*
Hmean     send-1024      3163.12 (   0.00%)     3132.35 *  -0.97%*
Hmean     send-2048      5958.21 (   0.00%)     5926.40 (  -0.53%)
Hmean     send-3312      9168.81 (   0.00%)     9194.53 (   0.28%)
Hmean     send-4096     11039.27 (   0.00%)    11159.21 *   1.09%*
Hmean     send-8192     17804.42 (   0.00%)    17840.95 (   0.21%)
Hmean     send-16384    28529.57 (   0.00%)    28389.97 (  -0.49%)
Hmean     recv-64         209.06 (   0.00%)      210.32 (   0.60%)
Hmean     recv-128        416.70 (   0.00%)      415.34 (  -0.33%)
Hmean     recv-256        819.65 (   0.00%)      808.52 *  -1.36%*
Hmean     recv-1024      3163.12 (   0.00%)     3132.35 *  -0.97%*
Hmean     recv-2048      5958.21 (   0.00%)     5926.40 (  -0.53%)
Hmean     recv-3312      9168.81 (   0.00%)     9194.53 (   0.28%)
Hmean     recv-4096     11039.27 (   0.00%)    11159.10 *   1.09%*
Hmean     recv-8192     17804.32 (   0.00%)    17840.92 (   0.21%)
Hmean     recv-16384    28529.38 (   0.00%)    28389.96 (  -0.49%)
(SNC off)
Hmean     send-64         211.39 (   0.00%)      210.60 (  -0.37%)
Hmean     send-128        415.25 (   0.00%)      420.91 *   1.36%*
Hmean     send-256        814.75 (   0.00%)      822.22 *   0.92%*
Hmean     send-1024      3171.61 (   0.00%)     3135.16 (  -1.15%)
Hmean     send-2048      6015.92 (   0.00%)     5943.85 *  -1.20%*
Hmean     send-3312      9210.17 (   0.00%)     9159.59 (  -0.55%)
Hmean     send-4096     11084.55 (   0.00%)    11098.02 (   0.12%)
Hmean     send-8192     17769.83 (   0.00%)    17804.44 (   0.19%)
Hmean     send-16384    27718.62 (   0.00%)    27828.89 (   0.40%)
Hmean     recv-64         211.39 (   0.00%)      210.60 (  -0.37%)
Hmean     recv-128        415.25 (   0.00%)      420.90 *   1.36%*
Hmean     recv-256        814.75 (   0.00%)      822.22 *   0.92%*
Hmean     recv-1024      3171.61 (   0.00%)     3135.16 (  -1.15%)
Hmean     recv-2048      6015.92 (   0.00%)     5943.85 *  -1.20%*
Hmean     recv-3312      9210.17 (   0.00%)     9159.59 (  -0.55%)
Hmean     recv-4096     11084.55 (   0.00%)    11098.02 (   0.12%)
Hmean     recv-8192     17769.76 (   0.00%)    17804.41 (   0.20%)
Hmean     recv-16384    27718.62 (   0.00%)    27828.79 (   0.40%)

netperf-tcp
                         vanilla		patched
(SNC on)
Hmean     64        1192.41 (   0.00%)     1219.91 *   2.31%*
Hmean     128       2354.50 (   0.00%)     2360.65 (   0.26%)
Hmean     256       4371.10 (   0.00%)     4393.92 (   0.52%)
Hmean     1024     13813.84 (   0.00%)    13712.10 (  -0.74%)
Hmean     2048     21518.91 (   0.00%)    21950.82 *   2.01%*
Hmean     3312     25585.77 (   0.00%)    26087.72 *   1.96%*
Hmean     4096     27402.77 (   0.00%)    27927.67 *   1.92%*
Hmean     8192     31766.67 (   0.00%)    31914.49 (   0.47%)
Hmean     16384    36227.30 (   0.00%)    36630.26 (   1.11%)
(SNC off)
Hmean     64        1182.09 (   0.00%)     1211.82 *   2.52%*
Hmean     128       2316.35 (   0.00%)     2356.54 *   1.74%*
Hmean     256       4231.05 (   0.00%)     4282.52 (   1.22%)
Hmean     1024     13461.44 (   0.00%)    13571.02 (   0.81%)
Hmean     2048     21016.51 (   0.00%)    21356.97 *   1.62%*
Hmean     3312     24834.03 (   0.00%)    25140.70 *   1.23%*
Hmean     4096     26700.53 (   0.00%)    26734.55 (   0.13%)
Hmean     8192     31094.10 (   0.00%)    31201.16 (   0.34%)
Hmean     16384    34953.23 (   0.00%)    34949.00 (  -0.01%)

Conclusion
==========

Substantial improvement can be seen in the hackbench pipe tests,
while the others are generally neutral. This is because when a
workload already has a relatively high SIS success rate, e.g.
tbench4 ~= 50% and netperf ~= 100% compared to hackbench ~= 3.5%,
the cost of trying to improve the scan efficiency negates much of
the benefit it brings.

But the real world is more complicated. Different workloads can be
co-located on the same machine to share resources, and their
profiles can vary quite a lot, so the SIS success rate is not
predictable.

Please see more detailed analysis in individual patches.

v4 -> v5:
 - Add limited scan for idle cores when overloaded, suggested by Mel
 - Split out several patches since they are irrelevant to this scope
 - Add quick check on ttwu_pending before core update
 - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
 - Move the main filter logic to the idle path, because the newidle
   balance can bail out early if rq->avg_idle is small enough, losing
   chances to update the filter.

v3 -> v4:
 - Update filter in load_balance rather than in the tick
 - Now the filter contains unoccupied cpus rather than overloaded ones
 - Added mechanisms to deal with the false positive cases

v2 -> v3:
 - Removed sched-idle balance feature and focus on SIS
 - Take non-CFS tasks into consideration
 - Several fixes/improvement suggested by Josh Don

v1 -> v2:
 - Several optimizations on sched-idle balancing
 - Ignore asym topos in can_migrate_task
 - Add more benchmarks including SIS efficiency
 - Re-organize patch as suggested by Mel Gorman

v4: https://lore.kernel.org/lkml/20220619120451.95251-1-wuyun.abel@bytedance.com/
v3: https://lore.kernel.org/lkml/20220505122331.42696-1-wuyun.abel@bytedance.com/
v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@bytedance.com/
v1: https://lore.kernel.org/lkml/20220217154403.6497-1-wuyun.abel@bytedance.com/

Abel Wu (5):
  sched/fair: Ignore SIS_UTIL when has idle core
  sched/fair: Limited scan for idle cores when overloaded
  sched/fair: Skip core update if task pending
  sched/fair: Skip SIS domain scan if fully busy
  sched/fair: Introduce SIS_FILTER

 include/linux/sched/topology.h |  53 +++++++++-
 kernel/sched/core.c            |   1 +
 kernel/sched/fair.c            | 171 +++++++++++++++++++++++++++++----
 kernel/sched/features.h        |   1 +
 kernel/sched/sched.h           |   3 +
 kernel/sched/topology.c        |   9 +-
 6 files changed, 218 insertions(+), 20 deletions(-)

-- 
2.37.3



* [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core
  2022-09-09  5:52 [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS Abel Wu
@ 2022-09-09  5:53 ` Abel Wu
  2022-09-14 21:58   ` Tim Chen
  2022-09-09  5:53 ` [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded Abel Wu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Abel Wu @ 2022-09-09  5:53 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

When SIS_UTIL is enabled, the SIS domain scan will be skipped if
the LLC is overloaded, even if the has_idle_core hint is true. Since
idle load balancing is triggered at tick boundaries, the idle cores
can stay cold for the whole tick period, wasting time while some of
the other cpus might be overloaded.

Give it a chance to scan for idle cores if the hint implies the
effort is worthwhile.

Benchmark
=========

Tests are done on a dual-socket (2 x 24C/48T) Intel Xeon(R) Platinum
8260 machine, with the following SNC configurations:

	SNC on:  4 NUMA nodes each of which has 12C/24T
	SNC off: 2 NUMA nodes each of which has 24C/48T

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Based on tip sched/core 0fba527e959d (v5.19.0).

Results
=======

hackbench-process-pipes
                         vanilla		patched
(SNC on)
Amean     1        0.4480 (   0.00%)      0.4470 (   0.22%)
Amean     4        0.6137 (   0.00%)      0.5947 (   3.10%)
Amean     7        0.7530 (   0.00%)      0.7450 (   1.06%)
Amean     12       1.1230 (   0.00%)      1.1053 (   1.57%)
Amean     21       2.0567 (   0.00%)      1.9420 (   5.58%)
Amean     30       3.0847 (   0.00%)      2.9267 *   5.12%*
Amean     48       5.9043 (   0.00%)      4.7027 *  20.35%*
Amean     79       9.3477 (   0.00%)      7.7097 *  17.52%*
Amean     110     11.0647 (   0.00%)     10.0680 *   9.01%*
Amean     141     13.3297 (   0.00%)     12.5450 *   5.89%*
Amean     172     15.2210 (   0.00%)     15.0297 (   1.26%)
Amean     203     17.8510 (   0.00%)     16.8827 *   5.42%*
Amean     234     19.9263 (   0.00%)     19.1183 (   4.05%)
Amean     265     21.9117 (   0.00%)     20.9893 *   4.21%*
Amean     296     23.7683 (   0.00%)     23.3920 (   1.58%)
(SNC off)
Amean     1        0.2963 (   0.00%)      0.2717 (   8.32%)
Amean     4        0.6093 (   0.00%)      0.6257 (  -2.68%)
Amean     7        0.7837 (   0.00%)      0.7740 (   1.23%)
Amean     12       1.2703 (   0.00%)      1.2410 (   2.31%)
Amean     21       2.6260 (   0.00%)      2.6410 (  -0.57%)
Amean     30       4.3483 (   0.00%)      3.7620 (  13.48%)
Amean     48       7.9753 (   0.00%)      6.7757 (  15.04%)
Amean     79       9.6540 (   0.00%)      8.8827 *   7.99%*
Amean     110     11.2597 (   0.00%)     11.0583 (   1.79%)
Amean     141     13.8077 (   0.00%)     13.3387 (   3.40%)
Amean     172     16.3513 (   0.00%)     15.9583 *   2.40%*
Amean     203     19.0880 (   0.00%)     17.8757 *   6.35%*
Amean     234     21.7660 (   0.00%)     20.0543 *   7.86%*
Amean     265     23.0447 (   0.00%)     22.6643 *   1.65%*
Amean     296     25.4660 (   0.00%)     25.6677 (  -0.79%)

The more overloaded the system is, the more benefit can be seen
from exploiting the cpu resources by more actively putting idle
cores to work, e.g. 21~48 groups. But once more workload is applied
(79+ groups), the free cpu capacity that can be exploited becomes
smaller, so the improvement comes down to ~5%.

On the other hand, when the load is relatively low (<12 groups), not
much benefit can be seen because in that case it's not hard to find
an idle cpu, so the only gain is picking an idle core rather than
just an idle cpu, and the cost of full scans negates much of that
benefit.

The downside of a full scan is that the cost gets bigger on larger
LLCs, but the test results still seem positive. One possible reason
is the low SIS success rate (~3.5%), so the cost doesn't negate the
benefit.

tbench4 Throughput
                         vanilla		patched
(SNC on)
Hmean     1        284.44 (   0.00%)      287.90 *   1.22%*
Hmean     2        564.10 (   0.00%)      575.52 *   2.02%*
Hmean     4       1120.93 (   0.00%)     1137.94 *   1.52%*
Hmean     8       2248.94 (   0.00%)     2250.42 *   0.07%*
Hmean     16      4360.10 (   0.00%)     4363.41 (   0.08%)
Hmean     32      7300.52 (   0.00%)     7338.06 *   0.51%*
Hmean     64      8912.37 (   0.00%)     8914.66 (   0.03%)
Hmean     128    19874.16 (   0.00%)    19978.59 *   0.53%*
Hmean     256    19759.42 (   0.00%)    20057.49 *   1.51%*
Hmean     384    19502.40 (   0.00%)    19846.74 *   1.77%*
(SNC off)
Hmean     1        300.70 (   0.00%)      309.43 *   2.90%*
Hmean     2        597.53 (   0.00%)      613.92 *   2.74%*
Hmean     4       1188.34 (   0.00%)     1227.84 *   3.32%*
Hmean     8       2336.22 (   0.00%)     2379.04 *   1.83%*
Hmean     16      4459.17 (   0.00%)     4634.66 *   3.94%*
Hmean     32      7606.69 (   0.00%)     7592.12 *  -0.19%*
Hmean     64      9009.48 (   0.00%)     9241.11 *   2.57%*
Hmean     128    19456.88 (   0.00%)    17870.37 *  -8.15%*
Hmean     256    19771.10 (   0.00%)    19370.92 *  -2.02%*
Hmean     384    20118.74 (   0.00%)    19413.92 *  -3.50%*

netperf-udp
                         vanilla		patched
(SNC on)
Hmean     send-64         209.06 (   0.00%)      211.69 *   1.26%*
Hmean     send-128        416.70 (   0.00%)      417.00 (   0.07%)
Hmean     send-256        819.65 (   0.00%)      827.61 *   0.97%*
Hmean     send-1024      3163.12 (   0.00%)     3191.16 *   0.89%*
Hmean     send-2048      5958.21 (   0.00%)     6045.20 *   1.46%*
Hmean     send-3312      9168.81 (   0.00%)     9282.21 *   1.24%*
Hmean     send-4096     11039.27 (   0.00%)    11130.55 (   0.83%)
Hmean     send-8192     17804.42 (   0.00%)    17816.25 (   0.07%)
Hmean     send-16384    28529.57 (   0.00%)    28812.09 (   0.99%)
Hmean     recv-64         209.06 (   0.00%)      211.69 *   1.26%*
Hmean     recv-128        416.70 (   0.00%)      417.00 (   0.07%)
Hmean     recv-256        819.65 (   0.00%)      827.61 *   0.97%*
Hmean     recv-1024      3163.12 (   0.00%)     3191.16 *   0.89%*
Hmean     recv-2048      5958.21 (   0.00%)     6045.18 *   1.46%*
Hmean     recv-3312      9168.81 (   0.00%)     9282.21 *   1.24%*
Hmean     recv-4096     11039.27 (   0.00%)    11130.55 (   0.83%)
Hmean     recv-8192     17804.32 (   0.00%)    17816.23 (   0.07%)
Hmean     recv-16384    28529.38 (   0.00%)    28812.04 (   0.99%)
(SNC off)
Hmean     send-64         211.39 (   0.00%)      213.24 (   0.87%)
Hmean     send-128        415.25 (   0.00%)      426.45 *   2.70%*
Hmean     send-256        814.75 (   0.00%)      835.33 *   2.53%*
Hmean     send-1024      3171.61 (   0.00%)     3173.84 (   0.07%)
Hmean     send-2048      6015.92 (   0.00%)     6046.41 (   0.51%)
Hmean     send-3312      9210.17 (   0.00%)     9309.65 (   1.08%)
Hmean     send-4096     11084.55 (   0.00%)    11250.86 *   1.50%*
Hmean     send-8192     17769.83 (   0.00%)    18101.50 *   1.87%*
Hmean     send-16384    27718.62 (   0.00%)    28152.58 *   1.57%*
Hmean     recv-64         211.39 (   0.00%)      213.24 (   0.87%)
Hmean     recv-128        415.25 (   0.00%)      426.45 *   2.70%*
Hmean     recv-256        814.75 (   0.00%)      835.32 *   2.53%*
Hmean     recv-1024      3171.61 (   0.00%)     3173.84 (   0.07%)
Hmean     recv-2048      6015.92 (   0.00%)     6046.41 (   0.51%)
Hmean     recv-3312      9210.17 (   0.00%)     9309.65 (   1.08%)
Hmean     recv-4096     11084.55 (   0.00%)    11250.86 *   1.50%*
Hmean     recv-8192     17769.76 (   0.00%)    18101.32 *   1.87%*
Hmean     recv-16384    27718.62 (   0.00%)    28152.46 *   1.57%*

netperf-tcp
                         vanilla		patched
(SNC on)
Hmean     64        1192.41 (   0.00%)     1253.72 *   5.14%*
Hmean     128       2354.50 (   0.00%)     2375.97 (   0.91%)
Hmean     256       4371.10 (   0.00%)     4412.90 (   0.96%)
Hmean     1024     13813.84 (   0.00%)    13987.31 (   1.26%)
Hmean     2048     21518.91 (   0.00%)    21677.74 (   0.74%)
Hmean     3312     25585.77 (   0.00%)    25943.95 *   1.40%*
Hmean     4096     27402.77 (   0.00%)    27700.88 *   1.09%*
Hmean     8192     31766.67 (   0.00%)    32187.68 *   1.33%*
Hmean     16384    36227.30 (   0.00%)    36542.97 (   0.87%)
(SNC off)
Hmean     64        1182.09 (   0.00%)     1219.15 *   3.14%*
Hmean     128       2316.35 (   0.00%)     2361.89 *   1.97%*
Hmean     256       4231.05 (   0.00%)     4314.53 *   1.97%*
Hmean     1024     13461.44 (   0.00%)    13543.85 (   0.61%)
Hmean     2048     21016.51 (   0.00%)    21270.62 *   1.21%*
Hmean     3312     24834.03 (   0.00%)    24960.05 (   0.51%)
Hmean     4096     26700.53 (   0.00%)    26959.99 (   0.97%)
Hmean     8192     31094.10 (   0.00%)    30989.89 (  -0.34%)
Hmean     16384    34953.23 (   0.00%)    35310.35 (   1.02%)

Both netperf and tbench4 have a high SIS success rate, ~100% and
~50% respectively. So the effort spent on full scans for idle cores
is not very beneficial compared to its cost. This is similar to the
aforementioned <12 groups case in hackbench.

Conclusion
==========

Taking a full scan for idle cores is generally good for making
better use of the cpu resources, yet there is still room for
improvement under certain circumstances.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efceb670e755..5af9bf246274 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,7 +6437,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && !has_idle_core) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
-- 
2.37.3



* [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded
  2022-09-09  5:52 [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS Abel Wu
  2022-09-09  5:53 ` [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core Abel Wu
@ 2022-09-09  5:53 ` Abel Wu
  2022-09-09  9:29   ` Chen Yu
  2022-09-14 22:25   ` Tim Chen
  2022-09-09  5:53 ` [PATCH v5 3/5] sched/fair: Skip core update if task pending Abel Wu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09  5:53 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu, Mel Gorman

The has_idle_cores hint could be misleading with rapidly idling
workloads, especially when the LLC is overloaded. In that case, the
cost of a full scan is incurred yet the scan often fails to find an
idle core.

So limit the scan depth for idle cores in this case, making it a
speculative inspection at a reasonable cost.

Benchmark
=========

Tests are done on a dual-socket (2 x 24C/48T) Intel Xeon(R) Platinum
8260 machine, with the following SNC configurations:

	SNC on:  4 NUMA nodes each of which has 12C/24T
	SNC off: 2 NUMA nodes each of which has 24C/48T

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Based on tip sched/core 0fba527e959d (v5.19.0) plus previous patches
of this series.

Results
=======

hackbench-process-pipes
                         unpatched		  patched
(SNC on)
Amean     1        0.4470 (   0.00%)      0.4557 (  -1.94%)
Amean     4        0.5947 (   0.00%)      0.6033 (  -1.46%)
Amean     7        0.7450 (   0.00%)      0.7627 (  -2.37%)
Amean     12       1.1053 (   0.00%)      1.0653 (   3.62%)
Amean     21       1.9420 (   0.00%)      2.0283 *  -4.45%*
Amean     30       2.9267 (   0.00%)      2.9670 (  -1.38%)
Amean     48       4.7027 (   0.00%)      4.6863 (   0.35%)
Amean     79       7.7097 (   0.00%)      7.9443 *  -3.04%*
Amean     110     10.0680 (   0.00%)     10.2393 (  -1.70%)
Amean     141     12.5450 (   0.00%)     12.6343 (  -0.71%)
Amean     172     15.0297 (   0.00%)     14.9957 (   0.23%)
Amean     203     16.8827 (   0.00%)     16.9133 (  -0.18%)
Amean     234     19.1183 (   0.00%)     19.2433 (  -0.65%)
Amean     265     20.9893 (   0.00%)     21.6917 (  -3.35%)
Amean     296     23.3920 (   0.00%)     23.8743 (  -2.06%)
(SNC off)
Amean     1        0.2717 (   0.00%)      0.3143 ( -15.71%)
Amean     4        0.6257 (   0.00%)      0.6070 (   2.98%)
Amean     7        0.7740 (   0.00%)      0.7960 (  -2.84%)
Amean     12       1.2410 (   0.00%)      1.1947 (   3.73%)
Amean     21       2.6410 (   0.00%)      2.4837 (   5.96%)
Amean     30       3.7620 (   0.00%)      3.4577 (   8.09%)
Amean     48       6.7757 (   0.00%)      5.5227 *  18.49%*
Amean     79       8.8827 (   0.00%)      9.2933 (  -4.62%)
Amean     110     11.0583 (   0.00%)     11.0443 (   0.13%)
Amean     141     13.3387 (   0.00%)     13.1360 (   1.52%)
Amean     172     15.9583 (   0.00%)     15.7770 (   1.14%)
Amean     203     17.8757 (   0.00%)     17.9557 (  -0.45%)
Amean     234     20.0543 (   0.00%)     20.4373 *  -1.91%*
Amean     265     22.6643 (   0.00%)     23.6053 *  -4.15%*
Amean     296     25.6677 (   0.00%)     25.6803 (  -0.05%)

Run-to-run variation is large in the 1 group test, so it can be
safely ignored.

With limited scan for idle cores when the LLC is overloaded, a slight
regression can be seen on the smaller LLCs (SNC on). This is because
the cost of a full scan on these LLCs is much smaller than on machines
with bigger LLCs. When it comes to the SNC off case, the limited scan
provides an obvious benefit, especially when the frequency of such
scans is relatively high, e.g. <48 groups.

It's not a universal win, but considering that LLCs are getting bigger
nowadays, we should be careful about the scan depth, and a limited
scan is indeed necessary under certain circumstances.

tbench4 Throughput
                         unpatched		  patched
(SNC on)
Hmean     1        309.43 (   0.00%)      301.54 *  -2.55%*
Hmean     2        613.92 (   0.00%)      607.81 *  -0.99%*
Hmean     4       1227.84 (   0.00%)     1210.64 *  -1.40%*
Hmean     8       2379.04 (   0.00%)     2381.73 *   0.11%*
Hmean     16      4634.66 (   0.00%)     4601.21 *  -0.72%*
Hmean     32      7592.12 (   0.00%)     7626.84 *   0.46%*
Hmean     64      9241.11 (   0.00%)     9251.51 *   0.11%*
Hmean     128    17870.37 (   0.00%)    20620.98 *  15.39%*
Hmean     256    19370.92 (   0.00%)    20406.51 *   5.35%*
Hmean     384    19413.92 (   0.00%)    20312.97 *   4.63%*
(SNC off)
Hmean     1        287.90 (   0.00%)      292.37 *   1.55%*
Hmean     2        575.52 (   0.00%)      583.29 *   1.35%*
Hmean     4       1137.94 (   0.00%)     1155.83 *   1.57%*
Hmean     8       2250.42 (   0.00%)     2297.63 *   2.10%*
Hmean     16      4363.41 (   0.00%)     4562.44 *   4.56%*
Hmean     32      7338.06 (   0.00%)     7425.69 *   1.19%*
Hmean     64      8914.66 (   0.00%)     9021.77 *   1.20%*
Hmean     128    19978.59 (   0.00%)    20257.76 *   1.40%*
Hmean     256    20057.49 (   0.00%)    20043.54 *  -0.07%*
Hmean     384    19846.74 (   0.00%)    19528.03 *  -1.61%*

Conclusion
==========

Limited scan for idle cores when the LLC is overloaded is almost
neutral compared to a full scan on smaller LLCs, but is an obvious
win on the bigger ones, which is where machines are heading.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5af9bf246274..7abe188a1533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,26 +6437,42 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	if (sched_feat(SIS_UTIL) && !has_idle_core) {
+	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
 			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
-			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
+
+			/*
+			 * Overloaded LLC is unlikely to have idle cpus.
+			 * But if has_idle_core hint is true, a limited
+			 * speculative scan might help without incurring
+			 * much overhead.
+			 */
+			if (has_idle_core)
+				nr = nr > 1 ? INT_MAX : 3;
+			else if (nr == 1)
 				return -1;
 		}
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		/*
+		 * This might get the has_idle_cores hint cleared by a
+		 * partial scan for idle cores, but the hint is probably
+		 * wrong anyway. More importantly, not clearing the hint
+		 * may result in excessive partial scans for idle cores,
+		 * introducing non-negligible overhead.
+		 */
+		if (!--nr)
+			break;
+
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
-- 
2.37.3



* [PATCH v5 3/5] sched/fair: Skip core update if task pending
  2022-09-09  5:52 [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS Abel Wu
  2022-09-09  5:53 ` [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core Abel Wu
  2022-09-09  5:53 ` [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded Abel Wu
@ 2022-09-09  5:53 ` Abel Wu
  2022-09-09 10:09   ` Chen Yu
  2022-09-14 22:37   ` Tim Chen
  2022-09-09  5:53 ` [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy Abel Wu
  2022-09-09  5:53 ` [PATCH v5 5/5] sched/fair: Introduce SIS_FILTER Abel Wu
  4 siblings, 2 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09  5:53 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

The function __update_idle_core() assumes this cpu is idle, so it
only checks its siblings to decide whether the resident core is
idle or not and updates the has_idle_cores hint if necessary. The
problem is that this cpu might not be idle any more at that moment,
e.g. due to a pending wakeup, resulting in a misleading hint.

It's not appropriate to make this check everywhere in the idle path,
but checking just before the core update can make the has_idle_cores
hint more reliable at negligible cost.
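
To illustrate the window being closed (schematic only, not kernel
code; the actual change is the small check below):

	cpu A (entering idle)             cpu B (waker)
	---------------------             -------------
	__update_idle_core(rq_A)
	                                  ttwu() queues a task on A and
	                                  sets rq_A->ttwu_pending
	all SMT siblings look idle
	marks the core as idle        <-- stale: A is about to run
	                                  the woken task

Bailing out early when rq->ttwu_pending is set avoids publishing the
stale hint.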

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7abe188a1533..fad289530e07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
 	int core = cpu_of(rq);
 	int cpu;
 
+	if (rq->ttwu_pending)
+		return;
+
 	rcu_read_lock();
 	if (test_idle_cores(core, true))
 		goto unlock;
-- 
2.37.3



* [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy
  2022-09-09  5:52 [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (2 preceding siblings ...)
  2022-09-09  5:53 ` [PATCH v5 3/5] sched/fair: Skip core update if task pending Abel Wu
@ 2022-09-09  5:53 ` Abel Wu
  2022-09-14  6:21   ` Yicong Yang
  2022-09-15  0:22   ` Tim Chen
  2022-09-09  5:53 ` [PATCH v5 5/5] sched/fair: Introduce SIS_FILTER Abel Wu
  4 siblings, 2 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09  5:53 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

If a full domain scan fails, then no unoccupied cpus are available
and the LLC is fully busy.  In this case we'd better use the cpus
more wisely, rather than wasting them trying to find an idle cpu
that probably doesn't exist. The fully-busy status will be cleared
when any cpu of that LLC goes idle, and everything goes back to
normal again.

Make the has_idle_cores boolean hint richer by turning it into a
state machine.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 35 +++++++++++++++++-
 kernel/sched/fair.c            | 67 ++++++++++++++++++++++++++++------
 2 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..cc6089765b64 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,10 +77,43 @@ extern int sched_domain_level_max;
 
 struct sched_group;
 
+/*
+ * States of the sched-domain
+ *
+ * - sd_has_icores
+ *	This state is only used in LLC domains to indicate that a
+ *	full SIS scan is worthwhile because idle cores are available.
+ *
+ * - sd_has_icpus
+ *	This state indicates that unoccupied (sched-idle/idle) cpus
+ *	might exist in this domain. For the LLC domains it is the
+ *	default state since these cpus are the main targets of SIS
+ *	search, and is also used as a fallback state of the other
+ *	states.
+ *
+ * - sd_is_busy
+ *	This state indicates there are no unoccupied cpus in this
+ *	domain. So for LLC domains, it hints at whether we should
+ *	spend effort on the SIS search or not.
+ *
+ * For LLC domains, sd_has_icores is set when the last non-idle cpu of
+ * a core becomes idle. After a full SIS scan and if no idle cores found,
+ * sd_has_icores must be cleared and the state will be set to sd_has_icpus
+ * or sd_is_busy depending on whether there is any idle cpu. And during
+ * load balancing on each SMT domain inside the LLC, the state will be
+ * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
+ * exist.
+ */
+enum sd_state {
+	sd_has_icores,
+	sd_has_icpus,
+	sd_is_busy
+};
+
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
+	enum sd_state	state;
 	int		nr_idle_scan;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fad289530e07..25df73c7e73c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6262,26 +6262,47 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
-static inline void set_idle_cores(int cpu, int val)
+static inline void set_llc_state(int cpu, enum sd_state state)
 {
 	struct sched_domain_shared *sds;
 
 	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
 	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
+		WRITE_ONCE(sds->state, state);
 }
 
-static inline bool test_idle_cores(int cpu, bool def)
+static inline enum sd_state get_llc_state(int cpu, enum sd_state def)
 {
 	struct sched_domain_shared *sds;
 
 	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
 	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+		return READ_ONCE(sds->state);
 
 	return def;
 }
 
+static inline void clear_idle_cpus(int cpu)
+{
+	set_llc_state(cpu, sd_is_busy);
+}
+
+static inline bool test_idle_cpus(int cpu)
+{
+	return get_llc_state(cpu, sd_has_icpus) != sd_is_busy;
+}
+
+static inline void set_idle_cores(int cpu, int core_idle)
+{
+	set_llc_state(cpu, core_idle ? sd_has_icores : sd_has_icpus);
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+	return sd_has_icores ==
+	       get_llc_state(cpu, def ? sd_has_icores : sd_has_icpus);
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->has_idle_cores.
@@ -6291,25 +6312,29 @@ static inline bool test_idle_cores(int cpu, bool def)
  */
 void __update_idle_core(struct rq *rq)
 {
-	int core = cpu_of(rq);
-	int cpu;
+	enum sd_state old, new = sd_has_icores;
+	int core = cpu_of(rq), cpu;
 
 	if (rq->ttwu_pending)
 		return;
 
 	rcu_read_lock();
-	if (test_idle_cores(core, true))
+	old = get_llc_state(core, sd_has_icores);
+	if (old == sd_has_icores)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		if (cpu == core)
 			continue;
 
-		if (!available_idle_cpu(cpu))
-			goto unlock;
+		if (!available_idle_cpu(cpu)) {
+			new = sd_has_icpus;
+			break;
+		}
 	}
 
-	set_idle_cores(core, 1);
+	if (old != new)
+		set_llc_state(core, new);
 unlock:
 	rcu_read_unlock();
 }
@@ -6370,6 +6395,15 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+static inline void clear_idle_cpus(int cpu)
+{
+}
+
+static inline bool test_idle_cpus(int cpu)
+{
+	return true;
+}
+
 static inline void set_idle_cores(int cpu, int val)
 {
 }
@@ -6406,6 +6440,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct sched_domain *this_sd;
 	u64 time = 0;
 
+	if (!test_idle_cpus(target))
+		return -1;
+
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
@@ -6482,8 +6519,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
-	if (has_idle_core)
-		set_idle_cores(target, false);
+	/*
+	 * If no idle cpu can be found, set the LLC state to busy so that
+	 * subsequent SIS domain scans can be skipped, saving a few cycles.
+	 */
+	if (idle_cpu == -1)
+		clear_idle_cpus(target);
+	else if (has_idle_core)
+		set_idle_cores(target, 0);
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		time = cpu_clock(this) - time;
-- 
2.37.3



* [PATCH v5 5/5] sched/fair: Introduce SIS_FILTER
  2022-09-09  5:52 [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS Abel Wu
                   ` (3 preceding siblings ...)
  2022-09-09  5:53 ` [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy Abel Wu
@ 2022-09-09  5:53 ` Abel Wu
  4 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09  5:53 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Abel Wu

The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS,
plays an important role in maximizing the usage of cpu resources and
can greatly affect overall performance of the system.

SIS tries to find an idle cpu inside the target LLC to place the
woken-up task on. The cache-hot cpus are preferred, but if none of
them is idle, it falls back to the other cpus of that LLC by firing
a domain scan.  The domain scan works well under light pressure,
simply traversing the cpus of the LLC, because plenty of idle cpus
are usually available.

But things change. The LLCs are getting bigger in modern and future
machines, and cloud service providers keep trying to reduce TCO by
pushing more load onto them, squeezing all kinds of resources. So
simple traversal doesn't fit future requirements well.

Features like SIS_{UTIL,PROP} already exist to cope with the
scalability issue by limiting the scan depth, and it would be even
better if we could also improve the way the scan itself works. That
is exactly what SIS_FILTER is for.

The SIS filter is supposed to contain the idle and sched-idle cpus
of the target LLC, and tries to improve the efficiency of the SIS
domain scan by filtering out busy cpus, so the limited scan depth
is used more wisely.

Idle cpus are propagated to the filter lazily if their resident
cores are already present in the filter. This eases the pain of LLC
cache traffic, and can also bring benefit by spreading load out to
different cores.

There is also a sequence number to indicate the generation of the
filter. The generation advances every time the filter is reset.
Once a cpu is propagated/set into the filter, the filter generation
is cached locally on that cpu, so whether it is still present in
the filter can be checked without peeking at the LLC-shared filter.
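
A minimal userspace sketch of this generation trick (the names mirror
the patch, but this is an illustration of the technique rather than
the kernel code itself):

	#include <stdbool.h>
	#include <stdint.h>

	#define NR_CPUS		64

	static uint64_t filter_seq = 1;			/* shared per-LLC generation */
	static uint64_t last_idle_seq[NR_CPUS];		/* cached per-cpu generation */
	static uint64_t filter_bits;			/* the filter itself (bitmap) */

	static void filter_set_cpu(int cpu)
	{
		last_idle_seq[cpu] = filter_seq;	/* remember current generation */
		filter_bits |= 1ULL << cpu;
	}

	static bool filter_test_cpu(int cpu)
	{
		/* present only if set during the current generation */
		return last_idle_seq[cpu] >= filter_seq;
	}

	static void filter_reset(void)
	{
		filter_bits = 0;
		filter_seq++;	/* lazily invalidates every cached state */
	}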

Benchmark
=========

Tests are done on a dual-socket (2 x 24C/48T) Intel Xeon(R) Platinum
8260 machine, with the following SNC configurations:

	SNC on:  4 NUMA nodes each of which has 12C/24T
	SNC off: 2 NUMA nodes each of which has 24C/48T

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.

Baseline is tip sched/core 0fba527e959d (v5.19.0) plus the first two
patches of this series, and "patched" contains the whole series.

Results
=======

hackbench-process-pipes
                         baseline		  patched
(SNC on)
Amean     1        0.4557 (   0.00%)      0.4500 (   1.24%)
Amean     4        0.6033 (   0.00%)      0.5990 (   0.72%)
Amean     7        0.7627 (   0.00%)      0.7377 (   3.28%)
Amean     12       1.0653 (   0.00%)      1.0490 (   1.53%)
Amean     21       2.0283 (   0.00%)      1.8680 *   7.90%*
Amean     30       2.9670 (   0.00%)      2.7710 *   6.61%*
Amean     48       4.6863 (   0.00%)      4.5393 (   3.14%)
Amean     79       7.9443 (   0.00%)      7.6610 *   3.57%*
Amean     110     10.2393 (   0.00%)     10.5560 (  -3.09%)
Amean     141     12.6343 (   0.00%)     12.6137 (   0.16%)
Amean     172     14.9957 (   0.00%)     14.6547 *   2.27%*
Amean     203     16.9133 (   0.00%)     16.9000 (   0.08%)
Amean     234     19.2433 (   0.00%)     18.7687 (   2.47%)
Amean     265     21.6917 (   0.00%)     21.4060 (   1.32%)
Amean     296     23.8743 (   0.00%)     23.0990 *   3.25%*
(SNC off)
Amean     1        0.3143 (   0.00%)      0.2933 (   6.68%)
Amean     4        0.6070 (   0.00%)      0.5883 (   3.08%)
Amean     7        0.7960 (   0.00%)      0.7570 (   4.90%)
Amean     12       1.1947 (   0.00%)      1.0780 *   9.77%*
Amean     21       2.4837 (   0.00%)      1.8903 *  23.89%*
Amean     30       3.4577 (   0.00%)      2.7903 *  19.30%*
Amean     48       5.5227 (   0.00%)      4.8920 (  11.42%)
Amean     79       9.2933 (   0.00%)      8.0127 *  13.78%*
Amean     110     11.0443 (   0.00%)     10.1557 *   8.05%*
Amean     141     13.1360 (   0.00%)     12.7387 (   3.02%)
Amean     172     15.7770 (   0.00%)     14.5860 *   7.55%*
Amean     203     17.9557 (   0.00%)     17.1950 *   4.24%*
Amean     234     20.4373 (   0.00%)     19.6763 *   3.72%*
Amean     265     23.6053 (   0.00%)     22.5557 (   4.45%)
Amean     296     25.6803 (   0.00%)     24.4273 *   4.88%*

Generally the results show better improvement on larger LLCs. When
the load increases but doesn't saturate the cpus (<30 groups), the
benefit is obvious, and even under extreme pressure the filter still
helps squeeze out some extra performance (remember that the baseline
already includes the first two patches).

tbench4 Throughput
                         baseline		  patched
(SNC on)
Hmean     1        301.54 (   0.00%)      302.52 *   0.32%*
Hmean     2        607.81 (   0.00%)      604.76 *  -0.50%*
Hmean     4       1210.64 (   0.00%)     1204.79 *  -0.48%*
Hmean     8       2381.73 (   0.00%)     2375.87 *  -0.25%*
Hmean     16      4601.21 (   0.00%)     4681.25 *   1.74%*
Hmean     32      7626.84 (   0.00%)     7607.93 *  -0.25%*
Hmean     64      9251.51 (   0.00%)     8956.00 *  -3.19%*
Hmean     128    20620.98 (   0.00%)    19258.30 *  -6.61%*
Hmean     256    20406.51 (   0.00%)    20783.82 *   1.85%*
Hmean     384    20312.97 (   0.00%)    20407.40 *   0.46%*
(SNC off)
Hmean     1        292.37 (   0.00%)      286.27 *  -2.09%*
Hmean     2        583.29 (   0.00%)      574.82 *  -1.45%*
Hmean     4       1155.83 (   0.00%)     1137.27 *  -1.61%*
Hmean     8       2297.63 (   0.00%)     2261.98 *  -1.55%*
Hmean     16      4562.44 (   0.00%)     4430.95 *  -2.88%*
Hmean     32      7425.69 (   0.00%)     7341.70 *  -1.13%*
Hmean     64      9021.77 (   0.00%)     8954.61 *  -0.74%*
Hmean     128    20257.76 (   0.00%)    20198.82 *  -0.29%*
Hmean     256    20043.54 (   0.00%)    19785.57 *  -1.29%*
Hmean     384    19528.03 (   0.00%)    19956.96 *   2.20%*

The slight regression indicates that if a workload already has a
relatively high SIS success rate, e.g. tbench4 ~= 50%, the benefit
the filter brings is reduced while its cost remains, and the benefit
might not balance the cost once the SIS success rate gets high
enough.

But the real world is more complicated. Different workloads can be
co-located on the same machine to share resources, and their profiles
can vary quite a lot, so the SIS success rate is not predictable.

Conclusion
==========

The SIS filter is much more efficient than a linear scan under certain
circumstances, and in the unlucky cases the filter can be disabled at
any time.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 18 ++++++++
 kernel/sched/core.c            |  1 +
 kernel/sched/fair.c            | 83 ++++++++++++++++++++++++++++++++--
 kernel/sched/features.h        |  1 +
 kernel/sched/sched.h           |  3 ++
 kernel/sched/topology.c        |  9 +++-
 6 files changed, 109 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cc6089765b64..f8e6154b5c37 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -114,7 +114,20 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	enum sd_state	state;
+	u64		seq;
 	int		nr_idle_scan;
+
+	/*
+	 * The SIS filter
+	 *
+	 * Record idle and sched-idle cpus to improve efficiency of
+	 * the SIS domain scan.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	icpus[];
 };
 
 struct sched_domain {
@@ -200,6 +213,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+static inline struct cpumask *sched_domain_icpus(struct sched_domain *sd)
+{
+	return to_cpumask(sd->shared->icpus);
+}
+
 extern void partition_sched_domains_locked(int ndoms_new,
 					   cpumask_var_t doms_new[],
 					   struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7d289d87acf7..a0cbf6c0d540 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9719,6 +9719,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->last_idle_seq = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25df73c7e73c..354e6e646a7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6262,6 +6262,57 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+static inline u64 filter_seq(int cpu)
+{
+	struct sched_domain_shared *sds;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds)
+		return READ_ONCE(sds->seq);
+
+	return 0;
+}
+
+static inline void filter_set_cpu(int cpu, int nr)
+{
+	struct sched_domain_shared *sds;
+
+	if (!sched_feat(SIS_FILTER))
+		return;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		cpu_rq(nr)->last_idle_seq = filter_seq(cpu);
+		set_bit(nr, sds->icpus);
+	}
+}
+
+static inline bool filter_test_cpu(int cpu, int nr)
+{
+	if (!sched_feat(SIS_FILTER))
+		return true;
+
+	return cpu_rq(nr)->last_idle_seq >= filter_seq(cpu);
+}
+
+static inline void filter_reset(int cpu)
+{
+	struct sched_domain_shared *sds;
+
+	if (!sched_feat(SIS_FILTER))
+		return;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		bitmap_zero(sds->icpus, nr_cpumask_bits);
+		/*
+		 * The seq field is racy but at least we can
+		 * use WRITE_ONCE() to prevent store tearing.
+		 */
+		WRITE_ONCE(sds->seq, filter_seq(cpu) + 1);
+	}
+}
+
 static inline void set_llc_state(int cpu, enum sd_state state)
 {
 	struct sched_domain_shared *sds;
@@ -6285,6 +6336,8 @@ static inline enum sd_state get_llc_state(int cpu, enum sd_state def)
 static inline void clear_idle_cpus(int cpu)
 {
 	set_llc_state(cpu, sd_is_busy);
+	if (sched_smt_active())
+		filter_reset(cpu);
 }
 
 static inline bool test_idle_cpus(int cpu)
@@ -6314,13 +6367,15 @@ void __update_idle_core(struct rq *rq)
 {
 	enum sd_state old, new = sd_has_icores;
 	int core = cpu_of(rq), cpu;
+	int exist;
 
 	if (rq->ttwu_pending)
 		return;
 
 	rcu_read_lock();
 	old = get_llc_state(core, sd_has_icores);
-	if (old == sd_has_icores)
+	exist = filter_test_cpu(core, core);
+	if (old == sd_has_icores && exist)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6329,11 +6384,26 @@ void __update_idle_core(struct rq *rq)
 
 		if (!available_idle_cpu(cpu)) {
 			new = sd_has_icpus;
-			break;
+
+			/*
+			 * If any cpu of this core has already
+			 * been set in the filter, then this
+			 * core is present and won't be missed
+			 * during SIS domain scan.
+			 */
+			if (exist)
+				break;
+			if (!sched_idle_cpu(cpu))
+				continue;
 		}
+
+		if (!exist)
+			exist = filter_test_cpu(core, cpu);
 	}
 
-	if (old != new)
+	if (!exist)
+		filter_set_cpu(core, core);
+	if (old != sd_has_icores && old != new)
 		set_llc_state(core, new);
 unlock:
 	rcu_read_unlock();
@@ -6447,8 +6517,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (!this_sd)
 		return -1;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
@@ -6496,6 +6564,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	if (sched_smt_active() && sched_feat(SIS_FILTER))
+		cpumask_and(cpus, sched_domain_icpus(sd), p->cpus_ptr);
+	else
+		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		/*
 		 * This might get the has_idle_cores hint cleared by a
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..1bebdb87c2f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FILTER, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b75ac74986fb..1fe1b152bc20 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1071,6 +1071,9 @@ struct rq {
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 
+	/* Cached filter generation when setting this cpu */
+	u64			last_idle_seq;
+
 #ifdef CONFIG_HOTPLUG_CPU
 	struct rcuwait		hotplug_wait;
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..01dccaca0fa8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,13 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		cpumask_copy(sched_domain_icpus(sd), sd_span);
+
+		/*
+		 * The cached per-cpu seq starts from 0, so initialize
+		 * filter seq to 1 to discard all cached cpu state.
+		 */
+		WRITE_ONCE(sd->shared->seq, 1);
 	}
 
 	sd->private = sdd;
@@ -2106,7 +2113,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded
  2022-09-09  5:53 ` [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded Abel Wu
@ 2022-09-09  9:29   ` Chen Yu
  2022-09-09 10:11     ` Abel Wu
  2022-09-14 22:25   ` Tim Chen
  1 sibling, 1 reply; 18+ messages in thread
From: Chen Yu @ 2022-09-09  9:29 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Mel Gorman

On 2022-09-09 at 13:53:01 +0800, Abel Wu wrote:
> The has_idle_cores hint could be misleading due to some kind of
> rapid idling workloads, especially when LLC is overloaded. If this
> is the case, then there will be some full scan cost incurred that
> often fails to find a core.
> 
> So limit the scan depth for idle cores in such case to make a
> speculative inspection at a reasonable cost.
> 
> Benchmark
> =========
> 
> Tests are done in a dual socket (2 x 24C/48T) machine modeled Intel
> Xeon(R) Platinum 8260, with SNC configuration:
> 
> 	SNC on:  4 NUMA nodes each of which has 12C/24T
> 	SNC off: 2 NUMA nodes each of which has 24C/48T
> 
> All of the benchmarks are done inside a normal cpu cgroup in a clean
> environment with cpu turbo disabled.
> 
> Based on tip sched/core 0fba527e959d (v5.19.0) plus previous patches
> of this series.
> 
> Results
> =======
> 
> hackbench-process-pipes
>                          unpatched		  patched
> (SNC on)
> Amean     1        0.4470 (   0.00%)      0.4557 (  -1.94%)
> Amean     4        0.5947 (   0.00%)      0.6033 (  -1.46%)
> Amean     7        0.7450 (   0.00%)      0.7627 (  -2.37%)
> Amean     12       1.1053 (   0.00%)      1.0653 (   3.62%)
> Amean     21       1.9420 (   0.00%)      2.0283 *  -4.45%*
> Amean     30       2.9267 (   0.00%)      2.9670 (  -1.38%)
> Amean     48       4.7027 (   0.00%)      4.6863 (   0.35%)
> Amean     79       7.7097 (   0.00%)      7.9443 *  -3.04%*
> Amean     110     10.0680 (   0.00%)     10.2393 (  -1.70%)
> Amean     141     12.5450 (   0.00%)     12.6343 (  -0.71%)
> Amean     172     15.0297 (   0.00%)     14.9957 (   0.23%)
> Amean     203     16.8827 (   0.00%)     16.9133 (  -0.18%)
> Amean     234     19.1183 (   0.00%)     19.2433 (  -0.65%)
> Amean     265     20.9893 (   0.00%)     21.6917 (  -3.35%)
> Amean     296     23.3920 (   0.00%)     23.8743 (  -2.06%)
> (SNC off)
> Amean     1        0.2717 (   0.00%)      0.3143 ( -15.71%)
> Amean     4        0.6257 (   0.00%)      0.6070 (   2.98%)
> Amean     7        0.7740 (   0.00%)      0.7960 (  -2.84%)
> Amean     12       1.2410 (   0.00%)      1.1947 (   3.73%)
> Amean     21       2.6410 (   0.00%)      2.4837 (   5.96%)
> Amean     30       3.7620 (   0.00%)      3.4577 (   8.09%)
> Amean     48       6.7757 (   0.00%)      5.5227 *  18.49%*
> Amean     79       8.8827 (   0.00%)      9.2933 (  -4.62%)
> Amean     110     11.0583 (   0.00%)     11.0443 (   0.13%)
> Amean     141     13.3387 (   0.00%)     13.1360 (   1.52%)
> Amean     172     15.9583 (   0.00%)     15.7770 (   1.14%)
> Amean     203     17.8757 (   0.00%)     17.9557 (  -0.45%)
> Amean     234     20.0543 (   0.00%)     20.4373 *  -1.91%*
> Amean     265     22.6643 (   0.00%)     23.6053 *  -4.15%*
> Amean     296     25.6677 (   0.00%)     25.6803 (  -0.05%)
> 
> Run to run variations is large in the 1 group test, so can be safely
> ignored.
> 
> With limited scan for idle cores when the LLC is overloaded, a slight
> regression can be seen on the smaller LLC machine. It is because the
> cost of full scan on these LLCs is much smaller than the machines with
> bigger LLCs. So when comes to the SNC off case, the limited scan can
> provide obvious benefit especially when the frequency of such scan is
> relatively high, e.g. <48 groups.
> 
> It's not a universal win, but considering the LLCs are getting bigger
> nowadays, we should be careful on the scan depth and limited scan on
> certain circumstance is indeed necessary.
> 
> tbench4 Throughput
>                          unpatched		  patched
> (SNC on)
> Hmean     1        309.43 (   0.00%)      301.54 *  -2.55%*
> Hmean     2        613.92 (   0.00%)      607.81 *  -0.99%*
> Hmean     4       1227.84 (   0.00%)     1210.64 *  -1.40%*
> Hmean     8       2379.04 (   0.00%)     2381.73 *   0.11%*
> Hmean     16      4634.66 (   0.00%)     4601.21 *  -0.72%*
> Hmean     32      7592.12 (   0.00%)     7626.84 *   0.46%*
> Hmean     64      9241.11 (   0.00%)     9251.51 *   0.11%*
> Hmean     128    17870.37 (   0.00%)    20620.98 *  15.39%*
> Hmean     256    19370.92 (   0.00%)    20406.51 *   5.35%*
> Hmean     384    19413.92 (   0.00%)    20312.97 *   4.63%*
> (SNC off)
> Hmean     1        287.90 (   0.00%)      292.37 *   1.55%*
> Hmean     2        575.52 (   0.00%)      583.29 *   1.35%*
> Hmean     4       1137.94 (   0.00%)     1155.83 *   1.57%*
> Hmean     8       2250.42 (   0.00%)     2297.63 *   2.10%*
> Hmean     16      4363.41 (   0.00%)     4562.44 *   4.56%*
> Hmean     32      7338.06 (   0.00%)     7425.69 *   1.19%*
> Hmean     64      8914.66 (   0.00%)     9021.77 *   1.20%*
> Hmean     128    19978.59 (   0.00%)    20257.76 *   1.40%*
> Hmean     256    20057.49 (   0.00%)    20043.54 *  -0.07%*
> Hmean     384    19846.74 (   0.00%)    19528.03 *  -1.61%*
> 
> Conclusion
> ==========
> 
> Limited scan for idle cores when LLC is overloaded is almost neutral
> compared to full scan given smaller LLCs, but is an obvious win at
> the bigger ones which are future-oriented.
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  kernel/sched/fair.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5af9bf246274..7abe188a1533 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6437,26 +6437,42 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		time = cpu_clock(this);
>  	}
>  
> -	if (sched_feat(SIS_UTIL) && !has_idle_core) {
> +	if (sched_feat(SIS_UTIL)) {
Patch [1/5] added !has_idle_core, but this patch removes the check.
I'm trying to figure out the reason. Is it to better illustrate the
benchmark difference?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/5] sched/fair: Skip core update if task pending
  2022-09-09  5:53 ` [PATCH v5 3/5] sched/fair: Skip core update if task pending Abel Wu
@ 2022-09-09 10:09   ` Chen Yu
  2022-09-09 10:13     ` Abel Wu
  2022-09-14 22:37   ` Tim Chen
  1 sibling, 1 reply; 18+ messages in thread
From: Chen Yu @ 2022-09-09 10:09 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 2022-09-09 at 13:53:02 +0800, Abel Wu wrote:
> The function __update_idle_core() considers this cpu is idle so
> only checks its siblings to decide whether the resident core is
> idle or not and update has_idle_cores hint if necessary. But the
> problem is that this cpu might not be idle at that moment any
> more, resulting in the hint being misleading.
> 
> It's not proper to make this check everywhere in the idle path,
> but checking just before core updating can make the has_idle_core
> hint more reliable with negligible cost.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7abe188a1533..fad289530e07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
>  	int core = cpu_of(rq);
>  	int cpu;
>  
> +	if (rq->ttwu_pending)
> +		return;
> +
Is it to deal with the race condition? I'm thinking of the
following scenario: task p1 on rq1 is about to switch to idle.
However, when p1 reaches __update_idle_core(), someone on another
CPU tries to wake up p2 and leverages rq1 to queue p2,
thus setting the ttwu_pending flag on rq1. It is likely that
rq1 becomes idle but soon finds that TF_NEED_RESCHED is set, thus
quitting the idle loop. As a result rq1 will not be idle and we
would get a false positive here.
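
Roughly, the timeline would be (illustrative only):

	CPU A (rq1)                          CPU B
	p1 blocks, rq1 starts going idle
	                                     try_to_wake_up(p2) selects rq1 and
	                                     sets rq1->ttwu_pending
	__update_idle_core(rq1) runs:
	  siblings look idle, so without the
	  ttwu_pending check the core could
	  be reported idle although rq1 is
	  about to run p2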

thanks,
Chenyu
>  	rcu_read_lock();
>  	if (test_idle_cores(core, true))
>  		goto unlock;
> -- 
> 2.37.3
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded
  2022-09-09  9:29   ` Chen Yu
@ 2022-09-09 10:11     ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09 10:11 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Mel Gorman

On 9/9/22 5:29 PM, Chen Yu wrote:
> On 2022-09-09 at 13:53:01 +0800, Abel Wu wrote:
>> The has_idle_cores hint could be misleading due to some kind of
>> rapid idling workloads, especially when LLC is overloaded. If this
>> is the case, then there will be some full scan cost incurred that
>> often fails to find a core.
>>
>> So limit the scan depth for idle cores in such case to make a
>> speculative inspection at a reasonable cost.
>>
>> Benchmark
>> =========
>>
>> Tests are done in a dual socket (2 x 24C/48T) machine modeled Intel
>> Xeon(R) Platinum 8260, with SNC configuration:
>>
>> 	SNC on:  4 NUMA nodes each of which has 12C/24T
>> 	SNC off: 2 NUMA nodes each of which has 24C/48T
>>
>> All of the benchmarks are done inside a normal cpu cgroup in a clean
>> environment with cpu turbo disabled.
>>
>> Based on tip sched/core 0fba527e959d (v5.19.0) plus previous patches
>> of this series.
>>
>> Results
>> =======
>>
>> hackbench-process-pipes
>>                           unpatched		  patched
>> (SNC on)
>> Amean     1        0.4470 (   0.00%)      0.4557 (  -1.94%)
>> Amean     4        0.5947 (   0.00%)      0.6033 (  -1.46%)
>> Amean     7        0.7450 (   0.00%)      0.7627 (  -2.37%)
>> Amean     12       1.1053 (   0.00%)      1.0653 (   3.62%)
>> Amean     21       1.9420 (   0.00%)      2.0283 *  -4.45%*
>> Amean     30       2.9267 (   0.00%)      2.9670 (  -1.38%)
>> Amean     48       4.7027 (   0.00%)      4.6863 (   0.35%)
>> Amean     79       7.7097 (   0.00%)      7.9443 *  -3.04%*
>> Amean     110     10.0680 (   0.00%)     10.2393 (  -1.70%)
>> Amean     141     12.5450 (   0.00%)     12.6343 (  -0.71%)
>> Amean     172     15.0297 (   0.00%)     14.9957 (   0.23%)
>> Amean     203     16.8827 (   0.00%)     16.9133 (  -0.18%)
>> Amean     234     19.1183 (   0.00%)     19.2433 (  -0.65%)
>> Amean     265     20.9893 (   0.00%)     21.6917 (  -3.35%)
>> Amean     296     23.3920 (   0.00%)     23.8743 (  -2.06%)
>> (SNC off)
>> Amean     1        0.2717 (   0.00%)      0.3143 ( -15.71%)
>> Amean     4        0.6257 (   0.00%)      0.6070 (   2.98%)
>> Amean     7        0.7740 (   0.00%)      0.7960 (  -2.84%)
>> Amean     12       1.2410 (   0.00%)      1.1947 (   3.73%)
>> Amean     21       2.6410 (   0.00%)      2.4837 (   5.96%)
>> Amean     30       3.7620 (   0.00%)      3.4577 (   8.09%)
>> Amean     48       6.7757 (   0.00%)      5.5227 *  18.49%*
>> Amean     79       8.8827 (   0.00%)      9.2933 (  -4.62%)
>> Amean     110     11.0583 (   0.00%)     11.0443 (   0.13%)
>> Amean     141     13.3387 (   0.00%)     13.1360 (   1.52%)
>> Amean     172     15.9583 (   0.00%)     15.7770 (   1.14%)
>> Amean     203     17.8757 (   0.00%)     17.9557 (  -0.45%)
>> Amean     234     20.0543 (   0.00%)     20.4373 *  -1.91%*
>> Amean     265     22.6643 (   0.00%)     23.6053 *  -4.15%*
>> Amean     296     25.6677 (   0.00%)     25.6803 (  -0.05%)
>>
>> Run to run variations is large in the 1 group test, so can be safely
>> ignored.
>>
>> With limited scan for idle cores when the LLC is overloaded, a slight
>> regression can be seen on the smaller LLC machine. It is because the
>> cost of full scan on these LLCs is much smaller than the machines with
>> bigger LLCs. So when comes to the SNC off case, the limited scan can
>> provide obvious benefit especially when the frequency of such scan is
>> relatively high, e.g. <48 groups.
>>
>> It's not a universal win, but considering the LLCs are getting bigger
>> nowadays, we should be careful on the scan depth and limited scan on
>> certain circumstance is indeed necessary.
>>
>> tbench4 Throughput
>>                           unpatched		  patched
>> (SNC on)
>> Hmean     1        309.43 (   0.00%)      301.54 *  -2.55%*
>> Hmean     2        613.92 (   0.00%)      607.81 *  -0.99%*
>> Hmean     4       1227.84 (   0.00%)     1210.64 *  -1.40%*
>> Hmean     8       2379.04 (   0.00%)     2381.73 *   0.11%*
>> Hmean     16      4634.66 (   0.00%)     4601.21 *  -0.72%*
>> Hmean     32      7592.12 (   0.00%)     7626.84 *   0.46%*
>> Hmean     64      9241.11 (   0.00%)     9251.51 *   0.11%*
>> Hmean     128    17870.37 (   0.00%)    20620.98 *  15.39%*
>> Hmean     256    19370.92 (   0.00%)    20406.51 *   5.35%*
>> Hmean     384    19413.92 (   0.00%)    20312.97 *   4.63%*
>> (SNC off)
>> Hmean     1        287.90 (   0.00%)      292.37 *   1.55%*
>> Hmean     2        575.52 (   0.00%)      583.29 *   1.35%*
>> Hmean     4       1137.94 (   0.00%)     1155.83 *   1.57%*
>> Hmean     8       2250.42 (   0.00%)     2297.63 *   2.10%*
>> Hmean     16      4363.41 (   0.00%)     4562.44 *   4.56%*
>> Hmean     32      7338.06 (   0.00%)     7425.69 *   1.19%*
>> Hmean     64      8914.66 (   0.00%)     9021.77 *   1.20%*
>> Hmean     128    19978.59 (   0.00%)    20257.76 *   1.40%*
>> Hmean     256    20057.49 (   0.00%)    20043.54 *  -0.07%*
>> Hmean     384    19846.74 (   0.00%)    19528.03 *  -1.61%*
>>
>> Conclusion
>> ==========
>>
>> Limited scan for idle cores when LLC is overloaded is almost neutral
>> compared to full scan given smaller LLCs, but is an obvious win at
>> the bigger ones which are future-oriented.
>>
>> Suggested-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   kernel/sched/fair.c | 26 +++++++++++++++++++++-----
>>   1 file changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5af9bf246274..7abe188a1533 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6437,26 +6437,42 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   		time = cpu_clock(this);
>>   	}
>>   
>> -	if (sched_feat(SIS_UTIL) && !has_idle_core) {
>> +	if (sched_feat(SIS_UTIL)) {
> [1/5] patch added !has_idle_core, but this patch removes the check.
> I'm trying to figure out the reason. Is it to better illustrating the
> benchmark difference?

This patch moves the check inside the SIS_UTIL block to limit the scan
depth when (!nr_idle_scan && has_idle_core), but if nr_idle_scan is
non-zero, which means the LLC is not overloaded, a full scan will be
triggered like before. So in general, this patch turns nr_idle_scan
into a boolean value when has_idle_core.
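
In other words, a minimal sketch of the resulting depth selection (code
lifted from the hunk quoted above, comments added for illustration):

	nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
	if (has_idle_core)
		/* nr_idle_scan == 0: short speculative scan (3 cpus);
		 * nr_idle_scan >= 1: effectively unlimited, full scan */
		nr = nr > 1 ? INT_MAX : 3;
	else if (nr == 1)
		/* overloaded and no idle core hinted: skip the scan */
		return -1;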

Best Regards,
Abel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/5] sched/fair: Skip core update if task pending
  2022-09-09 10:09   ` Chen Yu
@ 2022-09-09 10:13     ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-09 10:13 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mel Gorman, Vincent Guittot, Josh Don, Tim Chen,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/9/22 6:09 PM, Chen Yu wrote:
> On 2022-09-09 at 13:53:02 +0800, Abel Wu wrote:
>> The function __update_idle_core() considers this cpu is idle so
>> only checks its siblings to decide whether the resident core is
>> idle or not and update has_idle_cores hint if necessary. But the
>> problem is that this cpu might not be idle at that moment any
>> more, resulting in the hint being misleading.
>>
>> It's not proper to make this check everywhere in the idle path,
>> but checking just before core updating can make the has_idle_core
>> hint more reliable with negligible cost.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   kernel/sched/fair.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7abe188a1533..fad289530e07 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
>>   	int core = cpu_of(rq);
>>   	int cpu;
>>   
>> +	if (rq->ttwu_pending)
>> +		return;
>> +
> Is it to deal with the race condition? I'm thinking of the
> following scenario: task p1 on rq1 is about to switch to idle.
> However when p1 reaches __update_idle_core(), someone on other
> CPU tries to wake up p2, and leverages rq1 to queue p2
> thus set the ttwu_pending flag on rq1. It is likely that
> rq1 becomes idle but soon finds that TF_NEED_RESCHED is set, thus
> quits the idle loop. As a result rq will not be idle and we will
> get false positive here.

Yes, exactly as you said.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy
  2022-09-09  5:53 ` [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy Abel Wu
@ 2022-09-14  6:21   ` Yicong Yang
  2022-09-14  7:43     ` Abel Wu
  2022-09-15  0:22   ` Tim Chen
  1 sibling, 1 reply; 18+ messages in thread
From: Yicong Yang @ 2022-09-14  6:21 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, Tim Chen, K Prateek Nayak,
	Gautham R . Shenoy, linux-kernel

On 2022/9/9 13:53, Abel Wu wrote:
> If a full domain scan failed, then no unoccupied cpus available
> and the LLC is fully busy.  In this case we'd better use cpus
> more wisely, rather than wasting it trying to find an idle cpu
> that probably not exist. The fully busy status will be cleared
> when any cpu of that LLC goes idle and everything goes back to
> normal again.
> 
> Make the has_idle_cores boolean hint more rich by turning it
> into a state machine.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  include/linux/sched/topology.h | 35 +++++++++++++++++-
>  kernel/sched/fair.c            | 67 ++++++++++++++++++++++++++++------
>  2 files changed, 89 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 816df6cc444e..cc6089765b64 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -77,10 +77,43 @@ extern int sched_domain_level_max;
>  
>  struct sched_group;
>  
> +/*
> + * States of the sched-domain
> + *
> + * - sd_has_icores
> + *	This state is only used in LLC domains to indicate worthy
> + *	of a full scan in SIS due to idle cores available.
> + *
> + * - sd_has_icpus
> + *	This state indicates that unoccupied (sched-idle/idle) cpus
> + *	might exist in this domain. For the LLC domains it is the
> + *	default state since these cpus are the main targets of SIS
> + *	search, and is also used as a fallback state of the other
> + *	states.
> + *
> + * - sd_is_busy
> + *	This state indicates there are no unoccupied cpus in this
> + *	domain. So for LLC domains, it gives the hint on whether
> + *	we should put efforts on the SIS search or not.
> + *
> + * For LLC domains, sd_has_icores is set when the last non-idle cpu of
> + * a core becomes idle. After a full SIS scan and if no idle cores found,
> + * sd_has_icores must be cleared and the state will be set to sd_has_icpus
> + * or sd_is_busy depending on whether there is any idle cpu. And during
> + * load balancing on each SMT domain inside the LLC, the state will be
> + * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
> + * exist.
> + */
> +enum sd_state {
> +	sd_has_icores,
> +	sd_has_icpus,
> +	sd_is_busy
> +};
> +
>  struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
> -	int		has_idle_cores;
> +	enum sd_state	state;
>  	int		nr_idle_scan;
>  };
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fad289530e07..25df73c7e73c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6262,26 +6262,47 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>  EXPORT_SYMBOL_GPL(sched_smt_present);
>  
> -static inline void set_idle_cores(int cpu, int val)
> +static inline void set_llc_state(int cpu, enum sd_state state)
>  {
>  	struct sched_domain_shared *sds;
>  
>  	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>  	if (sds)
> -		WRITE_ONCE(sds->has_idle_cores, val);
> +		WRITE_ONCE(sds->state, state);
>  }
>  
> -static inline bool test_idle_cores(int cpu, bool def)
> +static inline enum sd_state get_llc_state(int cpu, enum sd_state def)
>  {
>  	struct sched_domain_shared *sds;
>  
>  	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>  	if (sds)
> -		return READ_ONCE(sds->has_idle_cores);
> +		return READ_ONCE(sds->state);
>  
>  	return def;
>  }
>  
> +static inline void clear_idle_cpus(int cpu)
> +{
> +	set_llc_state(cpu, sd_is_busy);
> +}
> +
> +static inline bool test_idle_cpus(int cpu)
> +{
> +	return get_llc_state(cpu, sd_has_icpus) != sd_is_busy;
> +}
> +
> +static inline void set_idle_cores(int cpu, int core_idle)
> +{
> +	set_llc_state(cpu, core_idle ? sd_has_icores : sd_has_icpus);
> +}
> +
> +static inline bool test_idle_cores(int cpu, bool def)
> +{
> +	return sd_has_icores ==
> +	       get_llc_state(cpu, def ? sd_has_icores : sd_has_icpus);
> +}
> +
>  /*
>   * Scans the local SMT mask to see if the entire core is idle, and records this
>   * information in sd_llc_shared->has_idle_cores.
> @@ -6291,25 +6312,29 @@ static inline bool test_idle_cores(int cpu, bool def)
>   */
>  void __update_idle_core(struct rq *rq)
>  {
> -	int core = cpu_of(rq);
> -	int cpu;
> +	enum sd_state old, new = sd_has_icores;
> +	int core = cpu_of(rq), cpu;
>  
>  	if (rq->ttwu_pending)
>  		return;
>  
>  	rcu_read_lock();
> -	if (test_idle_cores(core, true))
> +	old = get_llc_state(core, sd_has_icores);
> +	if (old == sd_has_icores)
>  		goto unlock;
>  
>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
>  		if (cpu == core)
>  			continue;
>  
> -		if (!available_idle_cpu(cpu))
> -			goto unlock;
> +		if (!available_idle_cpu(cpu)) {
> +			new = sd_has_icpus;
> +			break;
> +		}
>  	}
>  
> -	set_idle_cores(core, 1);
> +	if (old != new)
> +		set_llc_state(core, new);
>  unlock:
>  	rcu_read_unlock();
>  }

We'll reach this function only when SMT is active (sched_smt_present == True)...

> @@ -6370,6 +6395,15 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>  
>  #else /* CONFIG_SCHED_SMT */
>  
> +static inline void clear_idle_cpus(int cpu)
> +{
> +}
> +
> +static inline bool test_idle_cpus(int cpu)
> +{
> +	return true;
> +}
> +
>  static inline void set_idle_cores(int cpu, int val)
>  {
>  }
> @@ -6406,6 +6440,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  	struct sched_domain *this_sd;
>  	u64 time = 0;
>  
> +	if (!test_idle_cpus(target))
> +		return -1;
> +

...and on a non-SMT machine, we'll always fail here after the first time we clear_idle_cpus() below,
since there is no place to set sds->state back to sd_has_icpus again. You may need to update
sds->state in set_next_task_idle() as well when SMT is inactive.
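
A very rough sketch of that idea (hypothetical, not an actual patch; note
that set_llc_state() is static to fair.c in this series, so a real change
would need some plumbing):

	/* on idle entry when SMT is inactive, drop the sd_is_busy hint
	 * so that select_idle_cpu() is not skipped forever */
	if (!sched_smt_active())
		set_llc_state(cpu_of(rq), sd_has_icpus);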

>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
>  		return -1;
> @@ -6482,8 +6519,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		}
>  	}
>  
> -	if (has_idle_core)
> -		set_idle_cores(target, false);
> +	/*
> +	 * If no idle cpu can be found, set LLC state to busy to prevent
> +	 * us from SIS domain scan to save a few cycles.
> +	 */
> +	if (idle_cpu == -1)
> +		clear_idle_cpus(target);
> +	else if (has_idle_core)
> +		set_idle_cores(target, 0);
>  
>  	if (sched_feat(SIS_PROP) && !has_idle_core) {
>  		time = cpu_clock(this) - time;
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy
  2022-09-14  6:21   ` Yicong Yang
@ 2022-09-14  7:43     ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-14  7:43 UTC (permalink / raw)
  To: Yicong Yang, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: yangyicong, Josh Don, Chen Yu, Tim Chen, K Prateek Nayak,
	Gautham R . Shenoy, linux-kernel

On 9/14/22 2:21 PM, Yicong Yang wrote:
> On 2022/9/9 13:53, Abel Wu wrote:
>> If a full domain scan failed, then no unoccupied cpus available
>> and the LLC is fully busy.  In this case we'd better use cpus
>> more wisely, rather than wasting it trying to find an idle cpu
>> that probably not exist. The fully busy status will be cleared
>> when any cpu of that LLC goes idle and everything goes back to
>> normal again.
>>
>> Make the has_idle_cores boolean hint more rich by turning it
>> into a state machine.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   include/linux/sched/topology.h | 35 +++++++++++++++++-
>>   kernel/sched/fair.c            | 67 ++++++++++++++++++++++++++++------
>>   2 files changed, 89 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 816df6cc444e..cc6089765b64 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -77,10 +77,43 @@ extern int sched_domain_level_max;
>>   
>>   struct sched_group;
>>   
>> +/*
>> + * States of the sched-domain
>> + *
>> + * - sd_has_icores
>> + *	This state is only used in LLC domains to indicate worthy
>> + *	of a full scan in SIS due to idle cores available.
>> + *
>> + * - sd_has_icpus
>> + *	This state indicates that unoccupied (sched-idle/idle) cpus
>> + *	might exist in this domain. For the LLC domains it is the
>> + *	default state since these cpus are the main targets of SIS
>> + *	search, and is also used as a fallback state of the other
>> + *	states.
>> + *
>> + * - sd_is_busy
>> + *	This state indicates there are no unoccupied cpus in this
>> + *	domain. So for LLC domains, it gives the hint on whether
>> + *	we should put efforts on the SIS search or not.
>> + *
>> + * For LLC domains, sd_has_icores is set when the last non-idle cpu of
>> + * a core becomes idle. After a full SIS scan and if no idle cores found,
>> + * sd_has_icores must be cleared and the state will be set to sd_has_icpus
>> + * or sd_is_busy depending on whether there is any idle cpu. And during
>> + * load balancing on each SMT domain inside the LLC, the state will be
>> + * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus
>> + * exist.
>> + */
>> +enum sd_state {
>> +	sd_has_icores,
>> +	sd_has_icpus,
>> +	sd_is_busy
>> +};
>> +
>>   struct sched_domain_shared {
>>   	atomic_t	ref;
>>   	atomic_t	nr_busy_cpus;
>> -	int		has_idle_cores;
>> +	enum sd_state	state;
>>   	int		nr_idle_scan;
>>   };
>>   
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fad289530e07..25df73c7e73c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6262,26 +6262,47 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>   DEFINE_STATIC_KEY_FALSE(sched_smt_present);
>>   EXPORT_SYMBOL_GPL(sched_smt_present);
>>   
>> -static inline void set_idle_cores(int cpu, int val)
>> +static inline void set_llc_state(int cpu, enum sd_state state)
>>   {
>>   	struct sched_domain_shared *sds;
>>   
>>   	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>   	if (sds)
>> -		WRITE_ONCE(sds->has_idle_cores, val);
>> +		WRITE_ONCE(sds->state, state);
>>   }
>>   
>> -static inline bool test_idle_cores(int cpu, bool def)
>> +static inline enum sd_state get_llc_state(int cpu, enum sd_state def)
>>   {
>>   	struct sched_domain_shared *sds;
>>   
>>   	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>   	if (sds)
>> -		return READ_ONCE(sds->has_idle_cores);
>> +		return READ_ONCE(sds->state);
>>   
>>   	return def;
>>   }
>>   
>> +static inline void clear_idle_cpus(int cpu)
>> +{
>> +	set_llc_state(cpu, sd_is_busy);
>> +}
>> +
>> +static inline bool test_idle_cpus(int cpu)
>> +{
>> +	return get_llc_state(cpu, sd_has_icpus) != sd_is_busy;
>> +}
>> +
>> +static inline void set_idle_cores(int cpu, int core_idle)
>> +{
>> +	set_llc_state(cpu, core_idle ? sd_has_icores : sd_has_icpus);
>> +}
>> +
>> +static inline bool test_idle_cores(int cpu, bool def)
>> +{
>> +	return sd_has_icores ==
>> +	       get_llc_state(cpu, def ? sd_has_icores : sd_has_icpus);
>> +}
>> +
>>   /*
>>    * Scans the local SMT mask to see if the entire core is idle, and records this
>>    * information in sd_llc_shared->has_idle_cores.
>> @@ -6291,25 +6312,29 @@ static inline bool test_idle_cores(int cpu, bool def)
>>    */
>>   void __update_idle_core(struct rq *rq)
>>   {
>> -	int core = cpu_of(rq);
>> -	int cpu;
>> +	enum sd_state old, new = sd_has_icores;
>> +	int core = cpu_of(rq), cpu;
>>   
>>   	if (rq->ttwu_pending)
>>   		return;
>>   
>>   	rcu_read_lock();
>> -	if (test_idle_cores(core, true))
>> +	old = get_llc_state(core, sd_has_icores);
>> +	if (old == sd_has_icores)
>>   		goto unlock;
>>   
>>   	for_each_cpu(cpu, cpu_smt_mask(core)) {
>>   		if (cpu == core)
>>   			continue;
>>   
>> -		if (!available_idle_cpu(cpu))
>> -			goto unlock;
>> +		if (!available_idle_cpu(cpu)) {
>> +			new = sd_has_icpus;
>> +			break;
>> +		}
>>   	}
>>   
>> -	set_idle_cores(core, 1);
>> +	if (old != new)
>> +		set_llc_state(core, new);
>>   unlock:
>>   	rcu_read_unlock();
>>   }
> 
> We'll reach this function only when SMT is active (sched_smt_present == True)...
> 
>> @@ -6370,6 +6395,15 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>>   
>>   #else /* CONFIG_SCHED_SMT */
>>   
>> +static inline void clear_idle_cpus(int cpu)
>> +{
>> +}
>> +
>> +static inline bool test_idle_cpus(int cpu)
>> +{
>> +	return true;
>> +}
>> +
>>   static inline void set_idle_cores(int cpu, int val)
>>   {
>>   }
>> @@ -6406,6 +6440,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   	struct sched_domain *this_sd;
>>   	u64 time = 0;
>>   
>> +	if (!test_idle_cpus(target))
>> +		return -1;
>> +
> 
> ...and on a non-SMT machine, we'll always fail here after the first time we clear_idle_cpus() below.
> since we have no place to make sds->state to sd_has_icpus again. You may need to make sds->state
> update in set_next_task_idle() also when smt is inactive.

Nice catch! The last two patches are now being reworked.

With the filter enabled, this can lead to undesired results even in the
SMT case. It's OK to mark the LLC busy if a full scan fails to find any
idle cpus, but with the filter applied, the semantics of 'full scan'
change: not all cpus of the LLC are scanned, only the cpus present in
the filter.

Since the filter is updated at core granularity (a newly idle cpu can
skip being set in the filter if its SMT sibling has already been set),
there might be idle cpus that don't show up in the filter. So even if a
full scan of the filter fails, it doesn't mean the LLC is fully busy,
and idle cpus may indeed exist. Since the busy state is only cleared
when a cpu goes idle, this can hurt performance, especially when the
LLC is under light pressure.
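
A concrete sequence showing how this can happen (illustrative, assuming
a core made of cpu0 and cpu1):

 1. cpu1 goes idle; nothing of the core is in the filter yet, so its
    bit gets set.
 2. cpu0 goes idle while cpu1 is still idle; the core is already
    represented (exist == true), so cpu0 is not added to the filter.
 3. cpu1 picks up a task while cpu0 stays idle; the filter now only
    names the busy cpu1.
 4. A SIS scan walks sched_domain_icpus(), sees cpu1 occupied, never
    visits cpu0, and may mark the LLC sd_is_busy although cpu0 is idle.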

Thanks & BR,
Abel

> 
>>   	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>   	if (!this_sd)
>>   		return -1;
>> @@ -6482,8 +6519,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   		}
>>   	}
>>   
>> -	if (has_idle_core)
>> -		set_idle_cores(target, false);
>> +	/*
>> +	 * If no idle cpu can be found, set LLC state to busy to prevent
>> +	 * us from SIS domain scan to save a few cycles.
>> +	 */
>> +	if (idle_cpu == -1)
>> +		clear_idle_cpus(target);
>> +	else if (has_idle_core)
>> +		set_idle_cores(target, 0);
>>   
>>   	if (sched_feat(SIS_PROP) && !has_idle_core) {
>>   		time = cpu_clock(this) - time;
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core
  2022-09-09  5:53 ` [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core Abel Wu
@ 2022-09-14 21:58   ` Tim Chen
  0 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2022-09-14 21:58 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
> When SIS_UTIL is enabled, SIS domain scan will be skipped if the
> LLC is overloaded even the has_idle_core hint is true. Since idle
> load balancing is triggered at tick boundary, the idle cores can
> stay cold for the whole tick period wasting time meanwhile some
> of other cpus might be overloaded.
> 
> Give it a chance to scan for idle cores if the hint implies a
> worthy effort.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded
  2022-09-09  5:53 ` [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded Abel Wu
  2022-09-09  9:29   ` Chen Yu
@ 2022-09-14 22:25   ` Tim Chen
  2022-09-15  3:08     ` Abel Wu
  1 sibling, 1 reply; 18+ messages in thread
From: Tim Chen @ 2022-09-14 22:25 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Mel Gorman

On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5af9bf246274..7abe188a1533 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6437,26 +6437,42 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		time = cpu_clock(this);
>  	}
>  
> -	if (sched_feat(SIS_UTIL) && !has_idle_core) {
> +	if (sched_feat(SIS_UTIL)) {
>  		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>  		if (sd_share) {
>  			/* because !--nr is the condition to stop scan */
>  			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> -			/* overloaded LLC is unlikely to have idle cpu/core */
> -			if (nr == 1)
> +
> +			/*
> +			 * Overloaded LLC is unlikely to have idle cpus.
> +			 * But if has_idle_core hint is true, a limited
> +			 * speculative scan might help without incurring
> +			 * much overhead.
> +			 */
> +			if (has_idle_core)
> +				nr = nr > 1 ? INT_MAX : 3;

The choice of nr is a very abrupt function of utilization when has_idle_core==true;
it is either feast or famine.  Why is such a choice better than a smoother
reduction of nr vs utilization?  I agree that we want to scan more aggressively than
when !has_idle_core, but it is not obvious why the above works better, versus something
like nr = nr*2+1.
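
For illustration (numbers derived from the hunk above, reading nr*2+1
as applied to the nr computed from nr_idle_scan + 1):

	nr_idle_scan :     0        1        2        4
	patch        :     3    INT_MAX  INT_MAX  INT_MAX
	nr = nr*2+1  :     3        5        7       11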

Tim

> +			else if (nr == 1)
>  				return -1;
>  		}
>  	}
>  
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
> +		/*
> +		 * This might get the has_idle_cores hint cleared for a
> +		 * partial scan for idle cores but the hint is probably
> +		 * wrong anyway. What more important is that not clearing
> +		 * the hint may result in excessive partial scan for idle
> +		 * cores introducing innegligible overhead.
> +		 */
> +		if (!--nr)
> +			break;
> +
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>  			if ((unsigned int)i < nr_cpumask_bits)
>  				return i;
>  
>  		} else {
> -			if (!--nr)
> -				return -1;
>  			idle_cpu = __select_idle_cpu(cpu, p);
>  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>  				break;


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/5] sched/fair: Skip core update if task pending
  2022-09-09  5:53 ` [PATCH v5 3/5] sched/fair: Skip core update if task pending Abel Wu
  2022-09-09 10:09   ` Chen Yu
@ 2022-09-14 22:37   ` Tim Chen
  1 sibling, 0 replies; 18+ messages in thread
From: Tim Chen @ 2022-09-14 22:37 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
> The function __update_idle_core() considers this cpu is idle so
> only checks its siblings to decide whether the resident core is
> idle or not and update has_idle_cores hint if necessary. But the
> problem is that this cpu might not be idle at that moment any
> more, resulting in the hint being misleading.
> 
> It's not proper to make this check everywhere in the idle path,
> but checking just before core updating can make the has_idle_core
> hint more reliable with negligible cost.
> 

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7abe188a1533..fad289530e07 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq)
>  	int core = cpu_of(rq);
>  	int cpu;
>  
> +	if (rq->ttwu_pending)
> +		return;
> +
>  	rcu_read_lock();
>  	if (test_idle_cores(core, true))
>  		goto unlock;


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy
  2022-09-09  5:53 ` [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy Abel Wu
  2022-09-14  6:21   ` Yicong Yang
@ 2022-09-15  0:22   ` Tim Chen
  2022-09-15  3:11     ` Abel Wu
  1 sibling, 1 reply; 18+ messages in thread
From: Tim Chen @ 2022-09-15  0:22 UTC (permalink / raw)
  To: Abel Wu, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
> If a full domain scan failed, then no unoccupied cpus available
> and the LLC is fully busy.  In this case we'd better use cpus
> more wisely, rather than wasting it trying to find an idle cpu
> that probably not exist. The fully busy status will be cleared
> when any cpu of that LLC goes idle and everything goes back to
> normal again.
> 
> Make the has_idle_cores boolean hint more rich by turning it
> into a state machine.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
>  include/linux/sched/topology.h | 35 +++++++++++++++++-
>  kernel/sched/fair.c            | 67 ++++++++++++++++++++++++++++------
>  2 files changed, 89 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 816df6cc444e..cc6089765b64 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -77,10 +77,43 @@ extern int sched_domain_level_max;
>  
>  struct sched_group;
>  
> +/*
> + * States of the sched-domain
> + *
> + * - sd_has_icores
> + *	This state is only used in LLC domains to indicate worthy
> + *	of a full scan in SIS due to idle cores available.
> + *
> + * - sd_has_icpus
> + *	This state indicates that unoccupied (sched-idle/idle) cpus
> + *	might exist in this domain. For the LLC domains it is the
> + *	default state since these cpus are the main targets of SIS
> + *	search, and is also used as a fallback state of the other
> + *	states.
> + *
> + * - sd_is_busy
> + *	This state indicates there are no unoccupied cpus in this

Suggest reword to

.. indicates that all cpus are occupied in this ...

> + *	domain. So for LLC domains, it gives the hint on whether
> + *	we should put efforts on the SIS search or not.
> + *
> 

Tim


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded
  2022-09-14 22:25   ` Tim Chen
@ 2022-09-15  3:08     ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-15  3:08 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Mel Gorman

Hi Tim, thanks for your review!

On 9/15/22 6:25 AM, Tim Chen wrote:
> On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5af9bf246274..7abe188a1533 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6437,26 +6437,42 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>   		time = cpu_clock(this);
>>   	}
>>   
>> -	if (sched_feat(SIS_UTIL) && !has_idle_core) {
>> +	if (sched_feat(SIS_UTIL)) {
>>   		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>>   		if (sd_share) {
>>   			/* because !--nr is the condition to stop scan */
>>   			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
>> -			/* overloaded LLC is unlikely to have idle cpu/core */
>> -			if (nr == 1)
>> +
>> +			/*
>> +			 * Overloaded LLC is unlikely to have idle cpus.
>> +			 * But if has_idle_core hint is true, a limited
>> +			 * speculative scan might help without incurring
>> +			 * much overhead.
>> +			 */
>> +			if (has_idle_core)
>> +				nr = nr > 1 ? INT_MAX : 3;
> 
> The choice of nr is a very abrupt function of utilization when has_idle_core==true,
> it is either feast or famine.  Why is such choice better than a smoother
> reduction of nr vs utilization?  I agree that we want to scan more aggressively than
> !has_idle_core, but it is not obvious why the above work better, versus something
> like nr = nr*2+1.
This has been discussed with Mel, and he suggested doing simple things
first before scaling the depth.

https://lore.kernel.org/all/20220906095717.maao4qtel4fhbmfq@techsingularity.net/

Thanks and BR,
Abel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy
  2022-09-15  0:22   ` Tim Chen
@ 2022-09-15  3:11     ` Abel Wu
  0 siblings, 0 replies; 18+ messages in thread
From: Abel Wu @ 2022-09-15  3:11 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Mel Gorman, Vincent Guittot
  Cc: Josh Don, Chen Yu, K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/15/22 8:22 AM, Tim Chen wrote:
> On Fri, 2022-09-09 at 13:53 +0800, Abel Wu wrote:
>> If a full domain scan failed, then no unoccupied cpus available
>> and the LLC is fully busy.  In this case we'd better use cpus
>> more wisely, rather than wasting it trying to find an idle cpu
>> that probably not exist. The fully busy status will be cleared
>> when any cpu of that LLC goes idle and everything goes back to
>> normal again.
>>
>> Make the has_idle_cores boolean hint more rich by turning it
>> into a state machine.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>>   include/linux/sched/topology.h | 35 +++++++++++++++++-
>>   kernel/sched/fair.c            | 67 ++++++++++++++++++++++++++++------
>>   2 files changed, 89 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 816df6cc444e..cc6089765b64 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -77,10 +77,43 @@ extern int sched_domain_level_max;
>>   
>>   struct sched_group;
>>   
>> +/*
>> + * States of the sched-domain
>> + *
>> + * - sd_has_icores
>> + *	This state is only used in LLC domains to indicate worthy
>> + *	of a full scan in SIS due to idle cores available.
>> + *
>> + * - sd_has_icpus
>> + *	This state indicates that unoccupied (sched-idle/idle) cpus
>> + *	might exist in this domain. For the LLC domains it is the
>> + *	default state since these cpus are the main targets of SIS
>> + *	search, and is also used as a fallback state of the other
>> + *	states.
>> + *
>> + * - sd_is_busy
>> + *	This state indicates there are no unoccupied cpus in this
> 
> Suggest reword to
> 
> .. indicates that all cpus are occupied in this ...

OK, I will make an update in the next version!

Thanks,
Abel

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-09-15  3:11 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-09  5:52 [PATCH v5 0/5] sched/fair: Improve scan efficiency of SIS Abel Wu
2022-09-09  5:53 ` [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core Abel Wu
2022-09-14 21:58   ` Tim Chen
2022-09-09  5:53 ` [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded Abel Wu
2022-09-09  9:29   ` Chen Yu
2022-09-09 10:11     ` Abel Wu
2022-09-14 22:25   ` Tim Chen
2022-09-15  3:08     ` Abel Wu
2022-09-09  5:53 ` [PATCH v5 3/5] sched/fair: Skip core update if task pending Abel Wu
2022-09-09 10:09   ` Chen Yu
2022-09-09 10:13     ` Abel Wu
2022-09-14 22:37   ` Tim Chen
2022-09-09  5:53 ` [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy Abel Wu
2022-09-14  6:21   ` Yicong Yang
2022-09-14  7:43     ` Abel Wu
2022-09-15  0:22   ` Tim Chen
2022-09-15  3:11     ` Abel Wu
2022-09-09  5:53 ` [PATCH v5 5/5] sched/fair: Introduce SIS_FILTER Abel Wu
