All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zhang Qiao <zhangqiao22@huawei.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <linux-kernel@vger.kernel.org>, <mingo@redhat.com>,
	<juri.lelli@redhat.com>, <dietmar.eggemann@arm.com>,
	<rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>,
	<bristot@redhat.com>, <vschneid@redhat.com>, <rkagan@amazon.de>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated
Date: Wed, 15 Mar 2023 17:16:26 +0800	[thread overview]
Message-ID: <5a08aa29-f197-1379-c441-8c1db5fa7ab7@huawei.com> (raw)
In-Reply-To: <ZBCTg/dGAIRsldHU@vingu-book>



在 2023/3/14 23:32, Vincent Guittot 写道:
> Le mardi 14 mars 2023 à 14:39:49 (+0100), Vincent Guittot a écrit :
>> On Tue, 14 Mar 2023 at 14:38, Zhang Qiao <zhangqiao22@huawei.com> wrote:
>>>
>>>
>>>
>>> 在 2023/3/14 21:26, Vincent Guittot 写道:
>>>> On Tue, 14 Mar 2023 at 12:03, Zhang Qiao <zhangqiao22@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> 在 2023/3/13 22:23, Vincent Guittot 写道:
>>>>>> On Sat, 11 Mar 2023 at 10:57, Zhang Qiao <zhangqiao22@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 在 2023/3/10 22:29, Vincent Guittot 写道:
>>>>>>>> Le jeudi 09 mars 2023 à 16:14:38 (+0100), Vincent Guittot a écrit :
>>>>>>>>> On Thu, 9 Mar 2023 at 15:37, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 09, 2023 at 03:28:25PM +0100, Peter Zijlstra wrote:
>>>>>>>>>>> On Thu, Mar 09, 2023 at 02:34:05PM +0100, Vincent Guittot wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Then, even if we don't clear exec_start before migrating and keep
>>>>>>>>>>>> current value to be used in place_entity on the new cpu, we can't
>>>>>>>>>>>> compare the rq_clock_task(rq_of(cfs_rq)) of 2 different rqs AFAICT
>>>>>>>>>>>
>>>>>>>>>>> Blergh -- indeed, irq and steal time can skew them between CPUs :/
>>>>>>>>>>> I suppose we can fudge that... wait_start (which is basically what we're
>>>>>>>>>>> making it do) also does that IIRC.
>>>>>>>>>>>
>>>>>>>>>>> I really dislike having this placement muck spreadout like proposed.
>>>>>>>>>>
>>>>>>>>>> Also, I think we might be over-engineering this, we don't care about
>>>>>>>>>> accuracy at all, all we really care about is 'long-time'.
>>>>>>>>>
>>>>>>>>> you mean taking the patch 1/2 that you mentioned here to add a
>>>>>>>>> migrated field:
>>>>>>>>> https://lore.kernel.org/all/68832dfbb60fda030540b5f4e39c5801942689b1.1648228023.git.tim.c.chen@linux.intel.com/T/#ma5637eb8010f3f4a4abff778af8db705429d003b
>>>>>>>>>
>>>>>>>>> And assume that the divergence between the rq_clock_task() can be ignored ?
>>>>>>>>>
>>>>>>>>> That could probably work but we need to replace the (60LL *
>>>>>>>>> NSEC_PER_SEC) by ((1ULL << 63) / NICE_0_LOAD) because 60sec divergence
>>>>>>>>> would not be unrealistic.
>>>>>>>>> and a comment to explain why it's acceptable
>>>>>>>>
>>>>>>>> Zhang,
>>>>>>>>
>>>>>>>> Could you try the patch below ?
>>>>>>>> This is a rebase/merge/update of:
>>>>>>>> -patch 1/2 above and
>>>>>>>> -https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/
>>>>>>>
>>>>>>>
>>>>>>> I applyed and tested this patch, and it make hackbench slower.
>>>>>>> According to my previous test results. The good result is 82.1(s).
>>>>>>> But the result of this patch is 108.725(s).
>>>>>>
>>>>>> By "the result of this patch is 108.725(s)",  you mean the result of
>>>>>> https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/
>>>>>> alone, don't you ?
>>>>>
>>>>> No, with your patch, the test results is 108.725(s),
>>>>
>>>> Ok
>>>>
>>>>>
>>>>> git diff:
>>>>>
>>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>>> index 63d242164b1a..93a3909ae4c4 100644
>>>>> --- a/include/linux/sched.h
>>>>> +++ b/include/linux/sched.h
>>>>> @@ -550,6 +550,7 @@ struct sched_entity {
>>>>>         struct rb_node                  run_node;
>>>>>         struct list_head                group_node;
>>>>>         unsigned int                    on_rq;
>>>>> +       unsigned int                    migrated;
>>>>>
>>>>>         u64                             exec_start;
>>>>>         u64                             sum_exec_runtime;
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index ff4dbbae3b10..e60defc39f6e 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -1057,6 +1057,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>>         /*
>>>>>          * We are starting a new run period:
>>>>>          */
>>>>> +       se->migrated = 0;
>>>>>         se->exec_start = rq_clock_task(rq_of(cfs_rq));
>>>>>  }
>>>>>
>>>>> @@ -4690,9 +4691,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>>>>>          * inversed due to s64 overflow.
>>>>>          */
>>>>>         sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
>>>>> -       if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
>>>>> +       if ((s64)sleep_time > (1ULL << 63) / scale_load_down(NICE_0_LOAD) / 2) {
>>>>>                 se->vruntime = vruntime;
>>>>> -       else
>>>>> +       } else
>>>>>                 se->vruntime = max_vruntime(se->vruntime, vruntime);
>>>>>  }
>>>>>
>>>>> @@ -7658,8 +7659,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>>>>>         se->avg.last_update_time = 0;
>>>>>
>>>>>         /* We have migrated, no longer consider this task hot */
>>>>> -       se->exec_start = 0;
>>>>> -
>>>>> +       se->migrated = 1;
>>>>>         update_scan_period(p, new_cpu);
>>>>>  }
>>>>>
>>>>> @@ -8343,6 +8343,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>>>>>
>>>>>         if (sysctl_sched_migration_cost == 0)
>>>>>                 return 0;
>>>>> +       if (p->se.migrated)
>>>>> +               return 0;
>>>>>
>>>>>         delta = rq_clock_task(env->src_rq) - p->se.exec_start;
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> version1: v6.2
>>>>>>>> version2: v6.2 + commit 829c1651e9c4
>>>>>>>> version3: v6.2 + commit 829c1651e9c4 + this patch
>>>>>>>>
>>>>>>>> -------------------------------------------------
>>>>>>>>       version1        version2        version3
>>>>>>>> test1 81.0            118.1           82.1
>>>>>>>> test2 82.1            116.9           80.3
>>>>>>>> test3 83.2            103.9           83.3
>>>>>>>> avg(s)        82.1            113.0           81.9
>>>>>>
>>>>>> Ok, it looks like we are back to normal figures
>>>>
>>>> What do those results refer to then ?
>>>
>>> Quote from this email (https://lore.kernel.org/lkml/1cd19d3f-18c4-92f9-257a-378cc18cfbc7@huawei.com/).
>>
>> ok.
>>
>> Then, there is something wrong in my patch. Let me look at it more deeply
> 
> Coudl you try the patc below. It fixes the problem on my system
> 

Yes, it fixes the problem.

------

 Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):

             79.53 +- 1.21 seconds time elapsed  ( +-  1.52% )



> ---
>  kernel/sched/fair.c | 84 +++++++++++++++++++++++++++++++--------------
>  1 file changed, 59 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f499e9a74b5..f8722e47bb0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4648,23 +4648,36 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  #endif
>  }
> 
> +static inline bool entity_is_long_sleeper(struct sched_entity *se)
> +{
> +       struct cfs_rq *cfs_rq;
> +       u64 sleep_time;
> +
> +       if (se->exec_start == 0)
> +               return false;
> +
> +       cfs_rq = cfs_rq_of(se);
> +
> +       sleep_time = rq_clock_task(rq_of(cfs_rq));
> +
> +       /* Happen while migrating because of clock task divergence */
> +       if (sleep_time <= se->exec_start)
> +	       return false;
> +
> +       sleep_time -= se->exec_start;
> +       if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
> +               return true;
> +
> +       return false;
> +}
> +
>  static void
> -place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> +place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>  	u64 vruntime = cfs_rq->min_vruntime;
> -	u64 sleep_time;
> -
> -	/*
> -	 * The 'current' period is already promised to the current tasks,
> -	 * however the extra weight of the new task will slow them down a
> -	 * little, place the new task so that it fits in the slot that
> -	 * stays open at the end.
> -	 */
> -	if (initial && sched_feat(START_DEBIT))
> -		vruntime += sched_vslice(cfs_rq, se);
> 
>  	/* sleeps up to a single latency don't count. */
> -	if (!initial) {
> +	if (flags & ENQUEUE_WAKEUP) {
>  		unsigned long thresh;
> 
>  		if (se_is_idle(se))
> @@ -4680,20 +4693,43 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  			thresh >>= 1;
> 
>  		vruntime -= thresh;
> +	} else if sched_feat(START_DEBIT) {

There's a parenthesis missing here.


> +		/*
> +		 * The 'current' period is already promised to the current tasks,
> +		 * however the extra weight of the new task will slow them down a
> +		 * little, place the new task so that it fits in the slot that
> +		 * stays open at the end.
> +		 */
> +		vruntime += sched_vslice(cfs_rq, se);
>  	}
> 
>  	/*
>  	 * Pull vruntime of the entity being placed to the base level of
> -	 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
> -	 * slept for a long time, don't even try to compare its vruntime with
> -	 * the base as it may be too far off and the comparison may get
> -	 * inversed due to s64 overflow.
> -	 */
> -	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> -	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> +	 * cfs_rq, to prevent boosting it if placed backwards.
> +	 * However, min_vruntime can advance much faster than real time, with
> +	 * the exterme being when an entity with the minimal weight always runs
> +	 * on the cfs_rq. If the new entity slept for long, its vruntime
> +	 * difference from min_vruntime may overflow s64 and their comparison
> +	 * may get inversed, so ignore the entity's original vruntime in that
> +	 * case.
> +	 * The maximal vruntime speedup is given by the ratio of normal to
> +	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
> +	 * When placing a migrated waking entity, its exec_start has been set
> +	 * from a different rq. In order to take into account a possible
> +	 * divergence between new and prev rq's clocks task because of irq and
> +	 * stolen time, we take an additional margin.
> +	 * So, cutting off on the sleep time of
> +	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
> +	 * should be safe.
> +
> +	 */
> +	if (entity_is_long_sleeper(se))
>  		se->vruntime = vruntime;
>  	else
>  		se->vruntime = max_vruntime(se->vruntime, vruntime);
> +
> +	if (flags & ENQUEUE_MIGRATED)
> +		se->exec_start = 0;
>  }
> 
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> @@ -4769,7 +4805,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	account_entity_enqueue(cfs_rq, se);
> 
>  	if (flags & ENQUEUE_WAKEUP)
> -		place_entity(cfs_rq, se, 0);
> +		place_entity(cfs_rq, se, flags);
> 
>  	check_schedstat_required();
>  	update_stats_enqueue_fair(cfs_rq, se, flags);
> @@ -7665,9 +7701,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  	/* Tell new CPU we are migrated */
>  	se->avg.last_update_time = 0;
> 
> -	/* We have migrated, no longer consider this task hot */
> -	se->exec_start = 0;
> -
>  	update_scan_period(p, new_cpu);
>  }
> 
> @@ -11993,7 +12026,7 @@ static void task_fork_fair(struct task_struct *p)
>  		update_curr(cfs_rq);
>  		se->vruntime = curr->vruntime;
>  	}
> -	place_entity(cfs_rq, se, 1);
> +	place_entity(cfs_rq, se, 0);
> 
>  	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
>  		/*
> @@ -12137,8 +12170,9 @@ static void detach_task_cfs_rq(struct task_struct *p)
>  		/*
>  		 * Fix up our vruntime so that the current sleep doesn't
>  		 * cause 'unlimited' sleep bonus.
> +		 * This is the same as placing a waking task.
>  		 */
> -		place_entity(cfs_rq, se, 0);
> +		place_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>  		se->vruntime -= cfs_rq->min_vruntime;
>  	}
> 
> --
> 2.34.1
> 
> 
>>
>>>
>>>>
> .
> 

  reply	other threads:[~2023-03-15  9:16 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-06 13:24 [PATCH v2] sched/fair: sanitize vruntime of entity being migrated Zhang Qiao
2023-03-06 13:53 ` Vincent Guittot
2023-03-07 10:26   ` Vincent Guittot
2023-03-07 11:05     ` Zhang Qiao
2023-03-07 13:41     ` Zhang Qiao
2023-03-08  8:01       ` Vincent Guittot
2023-03-08 12:55         ` Vincent Guittot
2023-03-09  8:37           ` Zhang Qiao
2023-03-09  9:09             ` Dietmar Eggemann
2023-03-09  9:30               ` Zhang Qiao
2023-03-09 10:48             ` Vincent Guittot
2023-03-09 14:23               ` Zhang Qiao
2023-03-07  2:16 ` kernel test robot
2023-03-07 12:45 ` Dietmar Eggemann
2023-03-07 14:06   ` Zhang Qiao
2023-03-09  9:43   ` Zhang Qiao
2023-03-08 14:33 ` Chen Yu
2023-03-09 13:05 ` Peter Zijlstra
2023-03-09 13:34   ` Vincent Guittot
2023-03-09 14:28     ` Peter Zijlstra
2023-03-09 14:36       ` Peter Zijlstra
2023-03-09 15:14         ` Vincent Guittot
2023-03-10 14:29           ` Vincent Guittot
2023-03-11  9:57             ` Zhang Qiao
2023-03-13 14:23               ` Vincent Guittot
2023-03-14 11:03                 ` Zhang Qiao
2023-03-14 13:26                   ` Vincent Guittot
2023-03-14 13:38                     ` Zhang Qiao
2023-03-14 13:39                       ` Vincent Guittot
2023-03-14 15:32                         ` Vincent Guittot
2023-03-15  9:16                           ` Zhang Qiao [this message]
2023-03-15 15:30                             ` Vincent Guittot
2023-03-13  9:06             ` Dietmar Eggemann
2023-03-13 18:17               ` Dietmar Eggemann
2023-03-14  7:41                 ` Vincent Guittot
2023-03-14 12:07                   ` Peter Zijlstra
2023-03-14 13:24                     ` Vincent Guittot
2023-03-14 17:16                       ` Peter Zijlstra
2023-03-15  7:18                         ` Vincent Guittot
2023-03-15  8:42                           ` Vincent Guittot
2023-03-15 10:15                             ` Dietmar Eggemann
2023-03-15 10:21                               ` Vincent Guittot
2023-03-15 13:35                                 ` Dietmar Eggemann
2023-03-15 15:32                                   ` Vincent Guittot
2023-03-14 13:29                     ` Dietmar Eggemann
2023-03-14 13:37                       ` Dietmar Eggemann
2023-03-17 16:08 Vincent Guittot
2023-03-18  7:45 ` Zhang Qiao
2023-03-20 12:29   ` Dietmar Eggemann
2023-03-20 13:26     ` Vincent Guittot
2023-03-21 10:02 ` Peter Zijlstra
2023-03-21 10:29   ` Dietmar Eggemann
2023-03-21 10:49     ` Peter Zijlstra
2023-03-21 11:12       ` Vincent Guittot
2023-03-21 11:13       ` Dietmar Eggemann
2023-03-21 12:26         ` Peter Zijlstra
2023-03-21 12:28 ` Peter Zijlstra
2023-03-21 12:38   ` Vincent Guittot
2023-03-24  4:05 ` Chen Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5a08aa29-f197-1379-c441-8c1db5fa7ab7@huawei.com \
    --to=zhangqiao22@huawei.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rkagan@amazon.de \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.