From: Chen Yu <yu.c.chen@intel.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Tim Chen <tim.c.chen@intel.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Abel Wu" <wuyun.abel@bytedance.com>,
	Yicong Yang <yangyicong@hisilicon.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	Len Brown <len.brown@intel.com>, Chen Yu <yu.chen.surf@gmail.com>,
	Arjan Van De Ven <arjan.van.de.ven@intel.com>,
	Aaron Lu <aaron.lu@intel.com>, Barry Song <baohua@kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first
Date: Thu, 18 May 2023 00:57:15 +0800	[thread overview]
Message-ID: <ZGUHa+Si4dJbdsZN@chenyu5-mobl1> (raw)
In-Reply-To: <795a6d9475ecb444d219a9d36dc93b48a69e960e.camel@gmx.de>

On 2023-05-16 at 13:51:26 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 16:41 +0800, Chen Yu wrote:
> > >
> > IMO this issue could be generic; the cost of C2C is O(sqrt(n)), so in
> > theory, on a system with a large LLC and with SMT enabled, the issue
> > is easy to detect.
> >
> > As an example, I did not choose a super big system,
> > but a desktop i9-10980XE, and launched 2 pairs of ping-pong tasks.
> >
> > Each pair of tasks is bound to 1 dedicated core:
> > ./context_switch1_processes -s 30 -t 2
> > average:956883
> >
> > No CPU affinity for the tasks:
> > ./context_switch1_processes -s 30 -t 2 -n
> > average:849209
> >
> > We can see that waking up the wakee on the local core brings benefits on this platform.
> 
> Sure, tiny ping-pong balls fly really fast between siblings sharing an L2. The
> players being truly synchronous, there's not a whole lot of room for
> resource competition, and they do about as close to nothing but
> schedule as you can get.
> 
> That's not a workload, much less one worthy of special consideration.
> It is a useful tool though, exposed a big socket issue, good job tool.
> Changing task placement strategy in response to that issue makes zero
> sense to me.  There are many ways to schedule and wake other tasks at
> high frequency, all use the same paths.  Reduce the pain that big box
> sees when playing high speed ping-pong, and real workloads will profit
> in boxen large and small.  More in large than small, but nobody loses,
> everybody wins.
>
I'm thinking of two directions based on the current patch:

1. Check the task duration: if the two tasks form a high-speed ping-pong
   pair, let the wakee search for an idle SMT sibling on the current core.

   This strategy gives the best overall performance improvement, but the
   short-task-duration threshold, which would have to be tuned based on
   the number of online CPUs, would be an obstacle.
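
   For illustration only, a rough (non-compilable) sketch of what option 1
   might look like inside select_idle_sibling(); the p->se.dur_avg field and
   the threshold scaled by the online CPU count are hypothetical, not part of
   mainline or of the patch below:

	/*
	 * Illustrative only: dur_avg and the per-CPU-count scaling of the
	 * threshold are hypothetical; everything else mirrors the SIS_PAIR
	 * hunk further down.
	 */
	if (sched_feat(SIS_PAIR) && sched_smt_active() &&
	    current->last_wakee == p && p->last_wakee == current &&
	    p->se.dur_avg < sysctl_sched_migration_cost / num_online_cpus() &&
	    current->se.dur_avg < sysctl_sched_migration_cost / num_online_cpus()) {
		i = select_idle_smt(p, smp_processor_id());

		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}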
 
Or

2. Honor the idle core.
   That is to say, if there is an idle core in the system, choose that
   idle core first. Otherwise, fall back to searching for an idle SMT
   sibling rather than choosing an idle CPU on a random half-busy core.

   This strategy could partially mitigate the C2C overhead without
   breaking the idle-core-first strategy. So I gave it a try; with the
   above change, I did see some improvement when the system is around
   half busy (after all, has_idle_core has to be false):


Core(TM) i9-10980XE Cascade Lake, 18C/36T
netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  9-threads        1.00 (  0.50)   +0.48 (  0.39)
TCP_RR                  18-threads       1.00 (  0.39)  +10.95 (  0.31)
TCP_RR                  27-threads       1.00 (  0.31)   +8.82 (  0.17)
TCP_RR                  36-threads       1.00 (  0.86)   +0.02 (  0.28)
TCP_RR                  45-threads       1.00 (  4.17)   +0.22 (  4.02)
TCP_RR                  54-threads       1.00 (  1.28)   -0.32 (  1.66)
TCP_RR                  63-threads       1.00 (  3.21)   +0.28 (  2.76)
TCP_RR                  72-threads       1.00 (  0.36)   -0.38 (  0.61)
UDP_RR                  9-threads        1.00 (  0.47)   -0.19 (  0.61)
UDP_RR                  18-threads       1.00 (  0.12)   +8.30 (  0.12)
UDP_RR                  27-threads       1.00 (  3.70)   +9.62 (  0.21)
UDP_RR                  36-threads       1.00 (  0.37)   +0.02 (  0.19)
UDP_RR                  45-threads       1.00 ( 13.52)   -0.76 ( 17.59)
UDP_RR                  54-threads       1.00 ( 11.44)   +0.41 (  1.43)
UDP_RR                  63-threads       1.00 ( 17.26)   +0.42 (  3.00)
UDP_RR                  72-threads       1.00 (  0.36)   -0.29 (  7.90)

TCP_MAERTS              9-threads        1.00 (  0.11)   +0.04 (  0.06)
TCP_MAERTS              18-threads       1.00 (  0.26)   +1.60 (  0.07)
TCP_MAERTS              27-threads       1.00 (  0.03)   -0.02 (  0.01)
TCP_MAERTS              36-threads       1.00 (  0.01)   -0.01 (  0.01)
TCP_MAERTS              45-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_MAERTS              54-threads       1.00 (  0.00)   +0.00 (  0.01)
TCP_MAERTS              63-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_MAERTS              72-threads       1.00 (  0.01)   +0.00 (  0.00)
TCP_SENDFILE            9-threads        1.00 (  0.10)   +0.34 (  0.25)
TCP_SENDFILE            18-threads       1.00 (  0.01)   -0.02 (  0.03)
TCP_SENDFILE            27-threads       1.00 (  0.02)   -0.01 (  0.01)
TCP_SENDFILE            36-threads       1.00 (  0.00)   -0.00 (  0.00)
TCP_SENDFILE            45-threads       1.00 (  0.00)   +0.00 (  0.00)
TCP_SENDFILE            54-threads       1.00 (  0.00)   +0.00 (  0.00)
TCP_SENDFILE            63-threads       1.00 (  0.00)   +0.00 (  0.00)
TCP_SENDFILE            72-threads       1.00 (  0.00)   -0.00 (  0.00)
TCP_STREAM              9-threads        1.00 (  0.01)   -0.12 (  0.01)
TCP_STREAM              18-threads       1.00 (  0.37)   +1.46 (  0.10)
TCP_STREAM              27-threads       1.00 (  0.01)   -0.01 (  0.02)
TCP_STREAM              36-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM              45-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM              54-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM              63-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM              72-threads       1.00 (  0.01)   -0.00 (  0.01)


UDP_STREAM              9-threads        1.00 ( 99.99)   +1.40 ( 99.99)
UDP_STREAM              18-threads       1.00 (101.21)   +7.33 (100.51)
UDP_STREAM              27-threads       1.00 ( 99.98)   +0.03 ( 99.98)
UDP_STREAM              36-threads       1.00 ( 99.97)   +0.16 ( 99.97)
UDP_STREAM              45-threads       1.00 (100.06)   -0.00 (100.06)
UDP_STREAM              54-threads       1.00 (100.02)   +0.10 (100.02)
UDP_STREAM              63-threads       1.00 (100.25)   +0.05 (100.25)
UDP_STREAM              72-threads       1.00 ( 99.89)   -0.03 ( 99.87)

The UDP_STREAM results seem to be quite unstable; ignore them for now...

Xeon(R) Platinum 8480+ Sapphire Rapids, 56C/112T, 1 socket online
netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  28-threads       1.00 (  2.19)   -0.32 (  2.26)
TCP_RR                  56-threads       1.00 (  4.09)   -0.22 (  3.85)
TCP_RR                  84-threads       1.00 (  0.23)  +10.11 (  0.27)
TCP_RR                  112-threads      1.00 (  0.16)   +5.12 (  0.19)
TCP_RR                  140-threads      1.00 (  9.58)   -0.08 ( 10.26)
TCP_RR                  168-threads      1.00 ( 10.96)   -0.61 ( 11.98)
TCP_RR                  196-threads      1.00 ( 14.20)   -1.31 ( 14.34)
TCP_RR                  224-threads      1.00 (  9.70)   -0.84 (  9.80)
UDP_RR                  28-threads       1.00 (  0.99)   -0.29 (  1.06)
UDP_RR                  56-threads       1.00 (  6.68)   +0.96 (  7.26)
UDP_RR                  84-threads       1.00 ( 15.05)  +10.37 ( 24.45)
UDP_RR                  112-threads      1.00 (  8.55)   +3.98 ( 12.17)
UDP_RR                  140-threads      1.00 ( 14.55)   +0.25 ( 12.53)
UDP_RR                  168-threads      1.00 ( 12.87)   -0.81 ( 18.60)
UDP_RR                  196-threads      1.00 ( 16.61)   +0.10 ( 17.12)
UDP_RR                  224-threads      1.00 ( 15.81)   +0.00 ( 14.11)

Similarly, there are improvements when the system is above half busy.

[Limitations]
Per the test, this patch only brings benefits when there is an idle sibling
in the SMT domain and no idle core is available. This is a trade-off: it
avoids hurting the latency of low-end platforms while mitigating the C2C
overhead on high-end ones.
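
In case it helps with reproducing the pair behaviour without will-it-scale,
below is a minimal pipe-based ping-pong in the same spirit as
context_switch1_processes (a rough stand-in, not the actual benchmark
source). Run it bare, and again under taskset pinning the pair to two SMT
siblings, to compare placements:

/*
 * Minimal ping-pong pair: parent and child bounce one byte over two pipes.
 * Rough stand-in for will-it-scale's context_switch1_processes.
 *
 * Build and run (duration in seconds as the only argument):
 *   gcc -O2 pingpong.c -o pingpong
 *   ./pingpong 30
 *   taskset -c 0,1 ./pingpong 30   # pin the pair, e.g. to SMT siblings
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int seconds = (argc > 1 && atoi(argv[1]) > 0) ? atoi(argv[1]) : 10;
	int ping[2], pong[2];
	unsigned long iters = 0;
	struct timespec start, now;
	char c = 0;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* Child: echo every byte back on the other pipe. */
		close(ping[1]);
		close(pong[0]);
		while (read(ping[0], &c, 1) == 1 && write(pong[1], &c, 1) == 1)
			;
		return 0;
	}

	/* Parent: drive the ping-pong for the requested duration. */
	close(ping[0]);
	close(pong[1]);
	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		if (write(ping[1], &c, 1) != 1 || read(pong[0], &c, 1) != 1)
			break;
		iters++;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < seconds);

	printf("round trips per second: %lu\n", iters / seconds);
	return 0;
}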



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d2613ab392c..572d663065e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7126,6 +7126,22 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
+	/*
+	 * If the waker and the wakee are good friends to each other,
+	 * putting them within the same SMT domain could reduce C2C
+	 * overhead. But this only applies when there is no idle core
+	 * available. An idle SMT sibling should be preferred to the
+	 * wakee's previous CPU, because the latter still carries the
+	 * risk of C2C overhead.
+	 */
+	if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
+	    current->last_wakee == p && p->last_wakee == current) {
+		i = select_idle_smt(p, smp_processor_id());
+
+		if ((unsigned int)i < nr_cpumask_bits)
+			return i;
+	}
+
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..86b5c4f16199 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_PAIR, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1


