Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU

From: Peter Zijlstra <peterz@infradead.org>
To: Mike Galbraith <efault@gmx.de>
Cc: Chen Yu <yu.c.chen@intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Tim Chen <tim.c.chen@intel.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	Yicong Yang <yangyicong@hisilicon.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	Honglei Wang <wanghonglei@didichuxing.com>,
	Len Brown <len.brown@intel.com>, Chen Yu <yu.chen.surf@gmail.com>,
	Tianchen Ding <dtcccc@linux.alibaba.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Josh Don <joshdon@google.com>,
	kernel test robot <yujie.liu@intel.com>,
	Arjan Van De Ven <arjan.van.de.ven@intel.com>,
	Aaron Lu <aaron.lu@intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 2/2] sched/fair: Introduce SIS_CURRENT to wake up short task on current CPU
Date: Mon, 1 May 2023 10:25:36 +0200
Message-ID: <20230501082536.GA1597476@hirez.programming.kicks-ass.net>
In-Reply-To: <66406be50c8e040870217f5c9131b901d4dd2013.camel@gmx.de>

On Sat, Apr 29, 2023 at 09:34:06PM +0200, Mike Galbraith wrote:
> On Sat, 2023-04-29 at 07:16 +0800, Chen Yu wrote:
> > [Problem Statement]
> > For a workload that is doing frequent context switches, the throughput
> > scales well until the number of instances reaches a peak point. After
> > that peak point, the throughput drops significantly if the number of
> > instances continues to increase.
> >
> > The will-it-scale context_switch1 test case exposes the issue. The
> > test platform has 2 x 56C/112T and 224 CPUs in total. will-it-scale
> > launches 1, 8, 16 ... instances respectively. Each instance is composed
> > of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> > pipe_read() and pipe_write(). No task is bound to any CPU. It is found
> > that, once the number of instances is higher than 56, the throughput
> > drops accordingly:
> >
> >           ^
> > throughput|
> >           |                 X
> >           |               X   X X
> >           |             X         X X
> >           |           X               X
> >           |         X                   X
> >           |       X
> >           |     X
> >           |   X
> >           | X
> >           |
> >           +-----------------.------------------->
> >                             56
> >                                  number of instances
> 
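
For readers who want to picture the workload: below is a minimal sketch
of one ping-pong pair as described above, i.e. two tasks bouncing a
single byte over a pair of pipes so that every read and write forces a
wakeup and a context switch. This is not the will-it-scale source; the
function name ping_pong() and the two-pipe layout are assumptions made
for illustration, and it needs a Unix-like system with fork().

```python
import os


def ping_pong(round_trips: int) -> int:
    """Bounce one byte between parent and child over two pipes.

    Returns the number of round trips the parent completed.
    Hypothetical helper for illustration, not will-it-scale code.
    """
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: close the ends it does not use, then echo every byte.
        os.close(to_child_w)
        os.close(to_parent_r)
        while True:
            data = os.read(to_child_r, 1)   # sleeps until the parent writes
            if not data:                    # EOF: parent closed its write end
                break
            os.write(to_parent_w, data)     # wakes the parent
        os._exit(0)

    # Parent: close the ends it does not use.
    os.close(to_child_r)
    os.close(to_parent_w)

    completed = 0
    for _ in range(round_trips):
        os.write(to_child_w, b"x")              # wakes the child
        if os.read(to_parent_r, 1) != b"x":     # sleeps until the child answers
            break
        completed += 1

    os.close(to_child_w)    # EOF lets the child leave its loop
    os.waitpid(pid, 0)
    os.close(to_parent_r)
    return completed
```

Running many such pairs at once, with no CPU affinity set, gives roughly
the access pattern under discussion: each task is a short-running waker
and wakee in turn.
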
> Should these buddy pairs not start interfering with one another at 112
> instances instead of 56? NR_CPUS/2 buddy pair instances is the point at
> which trying to turn waker/wakee overlap into throughput should tend
> toward being a loser due to man-in-the-middle wakeup delay pain more
> than offsetting overlap recovery gain, rendering sync wakeup thereafter
> an ever more likely win.
> 
> Anyway..
> 
> What I see in my box, and I bet a virtual nickel it's a player in your
> box as well, is WA_WEIGHT making a mess of things by stacking tasks,
> sometimes very badly.  Below, I start NR_CPUS tbench buddy pairs in
> crusty ole i4790 desktop box with WA_WEIGHT turned off, then turn it on
> remotely so as not to have noisy GUI muck up my demo.
> 
> ...
>    8   3155749  3606.79 MB/sec  warmup  38 sec  latency 3.852 ms
>    8   3238485  3608.75 MB/sec  warmup  39 sec  latency 3.839 ms
>    8   3321578  3608.59 MB/sec  warmup  40 sec  latency 3.882 ms
>    8   3404746  3608.09 MB/sec  warmup  41 sec  latency 2.273 ms
>    8   3487885  3607.58 MB/sec  warmup  42 sec  latency 3.869 ms
>    8   3571034  3607.12 MB/sec  warmup  43 sec  latency 3.855 ms
>    8   3654067  3607.48 MB/sec  warmup  44 sec  latency 3.857 ms
>    8   3736973  3608.83 MB/sec  warmup  45 sec  latency 4.008 ms
>    8   3820160  3608.33 MB/sec  warmup  46 sec  latency 3.849 ms
>    8   3902963  3607.60 MB/sec  warmup  47 sec  latency 14.241 ms
>    8   3986117  3607.17 MB/sec  warmup  48 sec  latency 20.290 ms
>    8   4069256  3606.70 MB/sec  warmup  49 sec  latency 28.284 ms
>    8   4151986  3608.35 MB/sec  warmup  50 sec  latency 17.216 ms
>    8   4235070  3608.06 MB/sec  warmup  51 sec  latency 23.221 ms
>    8   4318221  3607.81 MB/sec  warmup  52 sec  latency 28.285 ms
>    8   4401456  3607.29 MB/sec  warmup  53 sec  latency 20.835 ms
>    8   4484606  3607.06 MB/sec  warmup  54 sec  latency 28.943 ms
>    8   4567609  3607.32 MB/sec  warmup  55 sec  latency 28.254 ms
> 
> Where I turned it on is hard to miss.
> 
> Short duration thread pool workers can be stacked all the way to the
> ceiling by WA_WEIGHT during burst wakeups, with wake_wide() not being
> able to intervene due to lack of cross coupling between waker/wakees
> leading to heuristic failure.  A (now long) while ago I caught that
> happening with firefox event threads; it launched 32 of 'em in my 8 rq
> box (hmm), and them being essentially the scheduler equivalent of
> neutrinos (nearly massless), we stuffed 'em all into one rq.. and got
> away with it because those particular threads don't seem to do much of
> anything.  However, were they to go active, the latency hit that we set
> up could have stung mightily. That scenario being highly generic leads
> me to suspect that somewhere out there in the big wide world, folks are
> eating that burst serialization.

I'm thinking WA_BIAS makes this worse...
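
To make the stacking argument concrete, here is a toy model, loosely
following the shape of the load comparison in wake_affine_weight() in
kernel/sched/fair.c. It is a sketch under assumptions, not the kernel
code: the load numbers, the imbalance_pct value of 117, the equal CPU
capacities, and the helper stacked_in_burst() are all made up for
illustration, and it ignores select_idle_sibling() and everything else
that runs after the affine decision.

```python
def wake_affine_weight(this_load, prev_load, task_load, *,
                       sync=False, current_load=0,
                       wa_bias=True, imbalance_pct=117,
                       this_capacity=1024, prev_capacity=1024):
    """Toy model: True if the waker's CPU wins over the wakee's previous CPU.

    All parameters are plain integers standing in for scheduler load
    values; imbalance_pct=117 is an assumed value, not read from a domain.
    """
    this_eff = this_load
    if sync:
        # A sync waker is about to sleep, so discount its own load.
        if current_load > this_eff:
            return True
        this_eff -= current_load

    # Load on the waker's CPU if the wakee lands there.
    this_eff += task_load
    if wa_bias:
        this_eff *= 100
    this_eff *= prev_capacity

    # Load on the previous CPU without the wakee.
    prev_eff = prev_load - task_load
    if wa_bias:
        # The bias inflates the previous CPU's load, favouring the waker's CPU.
        prev_eff *= 100 + (imbalance_pct - 100) // 2
    prev_eff *= this_capacity

    if sync:
        prev_eff += 1   # break ties toward the waker's CPU

    return this_eff < prev_eff


def stacked_in_burst(n_wakees, this_load, prev_load, task_load, **kw):
    """Hypothetical helper: count wakees pulled onto the waker's CPU in a burst.

    Assumes each pulled wakee immediately adds task_load to the waker's CPU.
    """
    stacked = 0
    for _ in range(n_wakees):
        if wake_affine_weight(this_load, prev_load, task_load, **kw):
            stacked += 1
            this_load += task_load
    return stacked
```

In this model, stacked_in_burst(32, this_load=50, prev_load=200,
task_load=1) pulls all 32 near-massless wakees onto the waker's CPU,
because each one barely moves this_load. With equal loads of 200 on both
sides and task_load=1, the comparison fails without the bias and
succeeds with it, which is the direction in which WA_BIAS tilts things.
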