Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

From: Youssef Esmat <youssefesmat@chromium.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	 Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	 Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	 Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	 Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org,
	 Tobias Huschle <huschle@linux.ibm.com>,
	Luis Machado <luis.machado@arm.com>,
	 Chen Yu <yu.c.chen@intel.com>,
	Abel Wu <wuyun.abel@bytedance.com>,
	 Tianchen Ding <dtcccc@linux.alibaba.com>,
	Xuewen Yan <xuewen.yan94@gmail.com>,
	 "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption
Date: Mon, 25 Mar 2024 10:13:17 -0500	[thread overview]
Message-ID: <CA+q576OCK24VSp+s4OLD2ogO48i95y39_JO=zV=TwHSEg3_b1w@mail.gmail.com> (raw)
In-Reply-To: <20240325060226.1540-2-kprateek.nayak@amd.com>

On Mon, Mar 25, 2024 at 1:03 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> With the curr entity's eligibility check, a wakeup preemption is very
> likely when an entity with positive lag joins the runqueue pushing the
> avg_vruntime of the runqueue backwards, making the vruntime of the
> current entity ineligible. This leads to aggressive wakeup preemption
> which was previously guarded by wakeup_granularity_ns in legacy CFS.
> Below figure depicts one such aggressive preemption scenario with EEVDF
> in DeathStarBench [1]:
>
>                                 deadline for Nginx
>                    |
>         +-------+  |                    |
>     /-- | Nginx | -|------------------> |
>     |   +-------+  |                    |
>     |              |
>     |   -----------|-------------------------------> vruntime timeline
>     |              \--> rq->avg_vruntime
>     |
>     |   wakes service on the same runqueue since system is busy
>     |
>     |   +---------+|
>     \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
>         +---------+|
>           |        |
>    wakeup |     +--|-----+                       |
>  preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
>                 +--|-----+                       |
>            (Nginx ineligible)
>         -----------|-------------------------------> vruntime timeline
>                    \--> rq->avg_vruntime
>
> When NGINX server is involuntarily switched out, it cannot accept any
> incoming request, leading to longer turn around time for the clients and
> thus loss in DeathStarBench throughput.
>
>     ==================================================================
>     Test          : DeathStarBench
>     Units         : Normalized latency
>     Interpretation: Lower is better
>     Statistic     : Mean
>     ==================================================================
>     tip         1.00
>     eevdf       1.14 (+14.61%)
>
> For current running task, skip eligibility check in pick_eevdf() if it
> has not exhausted the slice promised to it during selection despite the
> situation having changed since. The behavior is guarded by
> RUN_TO_PARITY_WAKEUP sched_feat to simplify testing. With
> RUN_TO_PARITY_WAKEUP enabled, performance loss seen with DeathStarBench
> since the merge of EEVDF disappears. Following are the results from
> testing on a Dual Socket 3rd Generation EPYC server (2 x 64C/128T):
>
>     ==================================================================
>     Test          : DeathStarBench
>     Units         : Normalized throughput
>     Interpretation: Higher is better
>     Statistic     : Mean
>     ==================================================================
>     Pinning      scaling     tip            run-to-parity-wakeup(pct imp)
>      1CCD           1       1.00            1.16 (%diff: 16%)
>      2CCD           2       1.00            1.03 (%diff: 3%)
>      4CCD           4       1.00            1.12 (%diff: 12%)
>      8CCD           8       1.00            1.05 (%diff: 6%)
>
> With spec_rstack_overflow=off, the DeathStarBench performance with the
> proposed solution is same as the performance on v6.5 release before
> EEVDF was merged.

Thanks for sharing this Prateek.
We actually noticed we could also gain performance by disabling
eligibility checks (but disable it on all paths).
The following are a few threads we had on the topic:

Discussion around eligibility:
https://lore.kernel.org/lkml/CA+q576MS0-MV1Oy-eecvmYpvNT3tqxD8syzrpxQ-Zk310hvRbw@mail.gmail.com/
Some of our results:
https://lore.kernel.org/lkml/CA+q576Mov1jpdfZhPBoy_hiVh3xSWuJjXdP3nS4zfpqfOXtq7Q@mail.gmail.com/
Sched feature to disable eligibility:
https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/

>
> This may lead to newly waking task waiting longer for its turn on the
> CPU, however, testing on the same system did not reveal any consistent
> regressions with the standard benchmarks.
>
> Link: https://github.com/delimitrou/DeathStarBench/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c     | 24 ++++++++++++++++++++----
>  kernel/sched/features.h |  1 +
>  2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..a9b145a4eab0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>   *
>   * Which allows tree pruning through eligibility.
>   */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
>  {
>         struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>         struct sched_entity *se = __pick_first_entity(cfs_rq);
> @@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>         if (cfs_rq->nr_running == 1)
>                 return curr && curr->on_rq ? curr : se;
>
> -       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> +       if (curr && !curr->on_rq)
> +               curr = NULL;
> +
> +       /*
> +        * When an entity with positive lag wakes up, it pushes the
> +        * avg_vruntime of the runqueue backwards. This may causes the
> +        * current entity to be ineligible soon into its run leading to
> +        * wakeup preemption.
> +        *
> +        * To prevent such aggressive preemption of the current running
> +        * entity during task wakeups, skip the eligibility check if the
> +        * slice promised to the entity since its selection has not yet
> +        * elapsed.
> +        */
> +       if (curr &&
> +           !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
> +           !entity_eligible(cfs_rq, curr))
>                 curr = NULL;
>
>         /*
> @@ -5460,7 +5476,7 @@ pick_next_entity(struct cfs_rq *cfs_rq)
>             cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
>                 return cfs_rq->next;
>
> -       return pick_eevdf(cfs_rq);
> +       return pick_eevdf(cfs_rq, false);
>  }
>
>  static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -8340,7 +8356,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>         /*
>          * XXX pick_eevdf(cfs_rq) != se ?
>          */
> -       if (pick_eevdf(cfs_rq) == pse)
> +       if (pick_eevdf(cfs_rq, true) == pse)
>                 goto preempt;
>
>         return;
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..027bab5b4031 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>  SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed
> --
> 2.34.1
>