From: Mel Gorman <mgorman@techsingularity.net>
To: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: peterz@infradead.org, bristot@redhat.com, bsegall@google.com,
	dietmar.eggemann@arm.com, joshdon@google.com,
	juri.lelli@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org,
	linux@rasmusvillemoes.dk, mgorman@suse.de, mingo@kernel.org,
	rostedt@goodmis.org, valentin.schneider@arm.com,
	vincent.guittot@linaro.org
Subject: Re: [PATCH 1/1] sched/fair: improve yield_to vs fairness
Date: Fri, 23 Jul 2021 10:35:23 +0100
Message-ID: <20210723093523.GX3809@techsingularity.net>
In-Reply-To: <20210707123402.13999-2-borntraeger@de.ibm.com>

On Wed, Jul 07, 2021 at 02:34:02PM +0200, Christian Borntraeger wrote:
> After some debugging in situations where a smaller sched_latency_ns and
> smaller sched_migration_cost settings helped for KVM host, I was able to
> come up with a reduced testcase.
> This testcase has 2 vcpus working on a shared memory location and
> waiting for mem % 2 == cpu number to then do an add on the shared
> memory.
> To start simple I pinned all vcpus to one host CPU. Without the
> yield_to in KVM the testcase was horribly slow. This is expected as each
> vcpu will spin a whole time slice. With the yield_to from KVM things are
> much better, but I was still seeing yields being ignored.
> In the end pick_next_entity decided to keep the current process running
> due to fairness reasons.  On this path we really know that there is no
> point in continuing current. So let us make things a bit unfairer to
> current.
> This makes the reduced testcase noticeably faster. It improved a more
> realistic test case (many guests on some host CPUs with overcommitment)
> even more.
> In the end this is similar to the old compat_sched_yield approach with
> an important difference:
> Instead of doing it for all yields we now only do it for yield_to,
> a place where we really know that current is waiting for the target.
> 
> What are alternative implementations for this patch?
> - do the same as the old compat_sched_yield:
>   current->vruntime = rightmost->vruntime+1
> - provide a new tunable sched_ns_yield_penalty: how much vruntime to add
>   (could be per architecture)
> - also fiddle with the vruntime of the target
>   e.g. subtract from the target what we add to the source
> 
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

I think this one accidentally fell off everyone's radar, including mine.
At the time this patch was mailed I remember thinking that playing games with
vruntime might have other consequences. For example, what I believe is
the most relevant problem for KVM is that a task spinning to acquire a
lock may be waiting on a vcpu holding the lock that has been
descheduled. Without vcpu pinning, it's possible that the holder is on
the same runqueue as the lock acquirer so the acquirer is wasting CPU.

In such a case, changing the acquirer's vruntime may mean that it unfairly
loses CPU time simply because it's a lock acquirer. Vincent, what do you
think? Christian, would you mind testing this as an alternative with your
demonstration test case and more importantly the "realistic test case"?
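
For readers without the original reproducer, the reduced testcase quoted
above (each participant waits for mem % N == its own number, then adds to
the shared memory) can be approximated in userspace. This is only a sketch
under stated assumptions, not Christian's actual test: it uses pthreads
instead of vcpus, and sched_yield() stands in for yield_to(), which is an
in-kernel interface and, unlike sched_yield(), names a target task. The
names ring_run, ring_worker and RING_MAX_THREADS are made up for this
illustration.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_MAX_THREADS 64	/* hypothetical limit for this sketch */

struct ring_worker {
	int id, nthreads, rounds;
	atomic_ulong *shared;	/* the shared memory location */
	atomic_bool *stop;	/* set if thread creation failed */
	unsigned long spins;	/* failed checks before getting a turn */
};

static void *ring_worker_fn(void *arg)
{
	struct ring_worker *w = arg;
	unsigned long n = (unsigned long)w->nthreads;
	unsigned long me = (unsigned long)w->id;

	for (int r = 0; r < w->rounds; r++) {
		/* Wait for our turn: mem % N == our number. */
		while (atomic_load(w->shared) % n != me) {
			if (atomic_load(w->stop))
				return NULL;
			w->spins++;
			sched_yield();	/* stand-in for a directed yield_to() */
		}
		/* Our turn: do the add, which hands over to the next thread. */
		atomic_fetch_add(w->shared, 1);
	}
	return NULL;
}

/*
 * Run the ring. Returns the final shared value (nthreads * rounds when
 * everything worked) or -1 on bad arguments or thread creation failure.
 * If spins_out is non-NULL it receives one spin count per thread.
 */
static long ring_run(int nthreads, int rounds, unsigned long *spins_out)
{
	pthread_t tids[RING_MAX_THREADS];
	struct ring_worker workers[RING_MAX_THREADS];
	atomic_ulong shared;
	atomic_bool stop;
	int started = 0;

	if (nthreads < 1 || nthreads > RING_MAX_THREADS || rounds < 0)
		return -1;

	atomic_init(&shared, 0);
	atomic_init(&stop, false);

	for (int i = 0; i < nthreads; i++) {
		workers[i] = (struct ring_worker){
			.id = i, .nthreads = nthreads, .rounds = rounds,
			.shared = &shared, .stop = &stop,
		};
		if (pthread_create(&tids[i], NULL, ring_worker_fn, &workers[i])) {
			atomic_store(&stop, true);
			break;
		}
		started++;
	}

	for (int i = 0; i < started; i++) {
		pthread_join(tids[i], NULL);
		if (spins_out)
			spins_out[i] = workers[i].spins;
	}

	return started == nthreads ? (long)atomic_load(&shared) : -1;
}
```

Built with something like "cc -pthread" and pinned to a single CPU (for
example with taskset), the per-thread spin counts give a rough idea of how
often a yield failed to hand the CPU to the thread whose turn it was.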

--8<--
sched: Do not select highest priority task to run if it should be skipped

pick_next_entity will consider the "next buddy" over the highest priority
task if it's not unfair to do so (as determined by wakeup_preempt_entity).
The potential problem is that an in-kernel user of yield_to() such as
KVM may explicitly want to yield the current task because it is trying
to acquire a spinlock from a task that is currently descheduled and
potentially running on the same runqueue. However, if it's more fair from
the scheduler perspective to continue running the current task, it'll continue
to spin uselessly waiting on a descheduled task to run.

This patch will select the targeted task to run even if it's unfair if the
highest priority task is explicitly marked as "skip".

This was evaluated using a debugging patch to expose yield_to as a system
call. A demonstration program creates N threads and arranges them in a
ring; the threads update a shared value in memory. Each thread
spins until the value matches the thread ID. It then updates the value
and wakes the next thread in the ring. It measures how many times it spins
before it gets its turn. Without the patch, the number of spins is highly
variable and unstable but with the patch it's more consistent.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..ddc0212d520f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next &&
+	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
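
The effect of the changed condition can be illustrated with a small
userspace model. This is a sketch only: the toy_* types and functions are
invented for illustration, the wakeup granularity is an assumed constant
(the kernel computes it via wakeup_gran()), and only the next-buddy check
is modelled, not the rest of pick_next_entity. The return convention of
toy_wakeup_preempt_entity mirrors wakeup_preempt_entity: -1 or 0 means it
is not unfair to run the first argument, 1 means it is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct sched_entity: only the field needed here. */
struct toy_se {
	uint64_t vruntime;
};

/* Toy stand-in for the buddy pointers kept in struct cfs_rq. */
struct toy_rq {
	struct toy_se *next;	/* next buddy, e.g. the yield_to() target */
	struct toy_se *skip;	/* entity that asked to be skipped */
};

/* Assumed fixed granularity in ns; not the kernel's actual value. */
#define TOY_WAKEUP_GRAN 1000000ULL

static int toy_wakeup_preempt_entity(const struct toy_se *curr,
				     const struct toy_se *se)
{
	int64_t vdiff = (int64_t)(curr->vruntime - se->vruntime);

	if (vdiff <= 0)
		return -1;
	if (vdiff > (int64_t)TOY_WAKEUP_GRAN)
		return 1;
	return 0;
}

/*
 * Returns true if the next buddy would be chosen over 'left', the entity
 * with the smallest vruntime. 'patched' selects the condition from the
 * diff above; otherwise the original condition is used.
 */
static bool toy_pick_next_buddy(const struct toy_rq *rq,
				const struct toy_se *left, bool patched)
{
	if (!rq->next)
		return false;
	if (patched && rq->skip == left)
		return true;
	return toy_wakeup_preempt_entity(rq->next, left) < 1;
}
```

With a yielding task at vruntime 0 that is both 'left' and 'skip', and a
target whose vruntime is ahead by more than the granularity, the original
condition rejects the target and the patched condition selects it. When
the target is within the granularity, both conditions select it.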

-- 
Mel Gorman
SUSE Labs
