From: Zhang Qiao <zhangqiao22@huawei.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <linux-kernel@vger.kernel.org>, <mingo@redhat.com>,
<juri.lelli@redhat.com>, <dietmar.eggemann@arm.com>,
<rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>,
<bristot@redhat.com>, <vschneid@redhat.com>, <rkagan@amazon.de>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated
Date: Tue, 14 Mar 2023 19:03:02 +0800 [thread overview]
Message-ID: <55754a59-a01f-206a-43f6-d07ea37300dd@huawei.com> (raw)
In-Reply-To: <CAKfTPtBaBdxfc9uoViNT8gsWU-GdgnHrDdWPpAduadTFmu1ZGg@mail.gmail.com>
在 2023/3/13 22:23, Vincent Guittot 写道:
> On Sat, 11 Mar 2023 at 10:57, Zhang Qiao <zhangqiao22@huawei.com> wrote:
>>
>>
>>
>> 在 2023/3/10 22:29, Vincent Guittot 写道:
>>> Le jeudi 09 mars 2023 à 16:14:38 (+0100), Vincent Guittot a écrit :
>>>> On Thu, 9 Mar 2023 at 15:37, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>
>>>>> On Thu, Mar 09, 2023 at 03:28:25PM +0100, Peter Zijlstra wrote:
>>>>>> On Thu, Mar 09, 2023 at 02:34:05PM +0100, Vincent Guittot wrote:
>>>>>>
>>>>>>> Then, even if we don't clear exec_start before migrating and keep
>>>>>>> current value to be used in place_entity on the new cpu, we can't
>>>>>>> compare the rq_clock_task(rq_of(cfs_rq)) of 2 different rqs AFAICT
>>>>>>
>>>>>> Blergh -- indeed, irq and steal time can skew them between CPUs :/
>>>>>> I suppose we can fudge that... wait_start (which is basically what we're
>>>>>> making it do) also does that IIRC.
>>>>>>
>>>>>> I really dislike having this placement muck spreadout like proposed.
>>>>>
>>>>> Also, I think we might be over-engineering this, we don't care about
>>>>> accuracy at all, all we really care about is 'long-time'.
>>>>
>>>> you mean taking the patch 1/2 that you mentioned here to add a
>>>> migrated field:
>>>> https://lore.kernel.org/all/68832dfbb60fda030540b5f4e39c5801942689b1.1648228023.git.tim.c.chen@linux.intel.com/T/#ma5637eb8010f3f4a4abff778af8db705429d003b
>>>>
>>>> And assume that the divergence between the rq_clock_task() can be ignored ?
>>>>
>>>> That could probably work but we need to replace the (60LL *
>>>> NSEC_PER_SEC) by ((1ULL << 63) / NICE_0_LOAD) because 60sec divergence
>>>> would not be unrealistic.
>>>> and a comment to explain why it's acceptable
>>>
>>> Zhang,
>>>
>>> Could you try the patch below ?
>>> This is a rebase/merge/update of:
>>> -patch 1/2 above and
>>> -https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/
>>
>>
>> I applyed and tested this patch, and it make hackbench slower.
>> According to my previous test results. The good result is 82.1(s).
>> But the result of this patch is 108.725(s).
>
> By "the result of this patch is 108.725(s)", you mean the result of
> https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/
> alone, don't you ?
No, with your patch, the test results is 108.725(s),
git diff:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..93a3909ae4c4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,7 @@ struct sched_entity {
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq;
+ unsigned int migrated;
u64 exec_start;
u64 sum_exec_runtime;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff4dbbae3b10..e60defc39f6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1057,6 +1057,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* We are starting a new run period:
*/
+ se->migrated = 0;
se->exec_start = rq_clock_task(rq_of(cfs_rq));
}
@@ -4690,9 +4691,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
* inversed due to s64 overflow.
*/
sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
- if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
+ if ((s64)sleep_time > (1ULL << 63) / scale_load_down(NICE_0_LOAD) / 2) {
se->vruntime = vruntime;
- else
+ } else
se->vruntime = max_vruntime(se->vruntime, vruntime);
}
@@ -7658,8 +7659,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
se->avg.last_update_time = 0;
/* We have migrated, no longer consider this task hot */
- se->exec_start = 0;
-
+ se->migrated = 1;
update_scan_period(p, new_cpu);
}
@@ -8343,6 +8343,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
if (sysctl_sched_migration_cost == 0)
return 0;
+ if (p->se.migrated)
+ return 0;
delta = rq_clock_task(env->src_rq) - p->se.exec_start;
>
>>
>>
>>> version1: v6.2
>>> version2: v6.2 + commit 829c1651e9c4
>>> version3: v6.2 + commit 829c1651e9c4 + this patch
>>>
>>> -------------------------------------------------
>>> version1 version2 version3
>>> test1 81.0 118.1 82.1
>>> test2 82.1 116.9 80.3
>>> test3 83.2 103.9 83.3
>>> avg(s) 82.1 113.0 81.9
>
> Ok, it looks like we are back to normal figures
>
>>>
>>> -------------------------------------------------
>>>
>>> The proposal accepts a divergence of up to 52 days between the 2 rqs.
>>>
>>> If this work, we will prepare a proper patch
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 63d242164b1a..cb8af0a137f7 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -550,6 +550,7 @@ struct sched_entity {
>>> struct rb_node run_node;
>>> struct list_head group_node;
>>> unsigned int on_rq;
>>> + unsigned int migrated;
>>>
>>> u64 exec_start;
>>> u64 sum_exec_runtime;
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a1b1f855b96..36acd9598b40 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1057,6 +1057,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> /*
>>> * We are starting a new run period:
>>> */
>>> + se->migrated = 0;
>>> se->exec_start = rq_clock_task(rq_of(cfs_rq));
>>> }
>>>
>>> @@ -4684,13 +4685,23 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>>>
>>> /*
>>> * Pull vruntime of the entity being placed to the base level of
>>> - * cfs_rq, to prevent boosting it if placed backwards. If the entity
>>> - * slept for a long time, don't even try to compare its vruntime with
>>> - * the base as it may be too far off and the comparison may get
>>> - * inversed due to s64 overflow.
>>> + * cfs_rq, to prevent boosting it if placed backwards.
>>> + * However, min_vruntime can advance much faster than real time, with
>>> + * the exterme being when an entity with the minimal weight always runs
>>> + * on the cfs_rq. If the new entity slept for long, its vruntime
>>> + * difference from min_vruntime may overflow s64 and their comparison
>>> + * may get inversed, so ignore the entity's original vruntime in that
>>> + * case.
>>> + * The maximal vruntime speedup is given by the ratio of normal to
>>> + * minimal weight: NICE_0_LOAD / MIN_SHARES, so cutting off on the
>>
>> why not is `scale_load_down(NICE_0_LOAD) / MIN_SHARES` here ?
>
> yes, you are right.
>
>>
>>
>>> + * sleep time of 2^63 / NICE_0_LOAD should be safe.
>>> + * When placing a migrated waking entity, its exec_start has been set
>>> + * from a different rq. In order to take into account a possible
>>> + * divergence between new and prev rq's clocks task because of irq and
>>
>> This divergence might be larger, it cause `sleep_time` maybe negative.
>
> AFAICT, we are safe with ((1ULL << 63) / scale_load_down(NICE_0_LOAD)
> / 2) as long as the divergence between the 2 rqs clocks task is lower
> than 2^52nsec. Do you expect a divergence higher than 2^52 nsec
> (around 52 days)?
>
> We can probably keep using (1ULL << 63) / scale_load_down(NICE_0_LOAD)
> which is already half the max value if needed.
>
> the fact that sleep_time can be negative is not a problem as
> s64)sleep_time > will take care of this.
In my opinion, when comparing signed with unsigned, the compiler converts the signed value to unsigned.
So, if sleep_time < 0, "(s64)sleep_time > (1ULL << 63) / NICE_0_LOAD / 2" will be true.
>
>>
>>> + * stolen time, we take an additional margin.
>>> */
>>> sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
>>> - if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
>>> + if ((s64)sleep_time > (1ULL << 63) / NICE_0_LOAD / 2)> se->vruntime = vruntime;
>>> else
>>> se->vruntime = max_vruntime(se->vruntime, vruntime);
>>> @@ -7658,7 +7669,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>>> se->avg.last_update_time = 0;
>>>
>>> /* We have migrated, no longer consider this task hot */
>>> - se->exec_start = 0;
>>> + se->migrated = 1;
>>>
>>> update_scan_period(p, new_cpu);
>>> }
>>> @@ -8344,6 +8355,9 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>>> if (sysctl_sched_migration_cost == 0)
>>> return 0;
>>>
>>> + if (p->se.migrated)
>>> + return 0;
>>> +
>>> delta = rq_clock_task(env->src_rq) - p->se.exec_start;
>>>
>>> return delta < (s64)sysctl_sched_migration_cost;
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>>
>>> .
>>>
> .
>
next prev parent reply other threads:[~2023-03-14 11:04 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-06 13:24 [PATCH v2] sched/fair: sanitize vruntime of entity being migrated Zhang Qiao
2023-03-06 13:53 ` Vincent Guittot
2023-03-07 10:26 ` Vincent Guittot
2023-03-07 11:05 ` Zhang Qiao
2023-03-07 13:41 ` Zhang Qiao
2023-03-08 8:01 ` Vincent Guittot
2023-03-08 12:55 ` Vincent Guittot
2023-03-09 8:37 ` Zhang Qiao
2023-03-09 9:09 ` Dietmar Eggemann
2023-03-09 9:30 ` Zhang Qiao
2023-03-09 10:48 ` Vincent Guittot
2023-03-09 14:23 ` Zhang Qiao
2023-03-07 2:16 ` kernel test robot
2023-03-07 12:45 ` Dietmar Eggemann
2023-03-07 14:06 ` Zhang Qiao
2023-03-09 9:43 ` Zhang Qiao
2023-03-08 14:33 ` Chen Yu
2023-03-09 13:05 ` Peter Zijlstra
2023-03-09 13:34 ` Vincent Guittot
2023-03-09 14:28 ` Peter Zijlstra
2023-03-09 14:36 ` Peter Zijlstra
2023-03-09 15:14 ` Vincent Guittot
2023-03-10 14:29 ` Vincent Guittot
2023-03-11 9:57 ` Zhang Qiao
2023-03-13 14:23 ` Vincent Guittot
2023-03-14 11:03 ` Zhang Qiao [this message]
2023-03-14 13:26 ` Vincent Guittot
2023-03-14 13:38 ` Zhang Qiao
2023-03-14 13:39 ` Vincent Guittot
2023-03-14 15:32 ` Vincent Guittot
2023-03-15 9:16 ` Zhang Qiao
2023-03-15 15:30 ` Vincent Guittot
2023-03-13 9:06 ` Dietmar Eggemann
2023-03-13 18:17 ` Dietmar Eggemann
2023-03-14 7:41 ` Vincent Guittot
2023-03-14 12:07 ` Peter Zijlstra
2023-03-14 13:24 ` Vincent Guittot
2023-03-14 17:16 ` Peter Zijlstra
2023-03-15 7:18 ` Vincent Guittot
2023-03-15 8:42 ` Vincent Guittot
2023-03-15 10:15 ` Dietmar Eggemann
2023-03-15 10:21 ` Vincent Guittot
2023-03-15 13:35 ` Dietmar Eggemann
2023-03-15 15:32 ` Vincent Guittot
2023-03-14 13:29 ` Dietmar Eggemann
2023-03-14 13:37 ` Dietmar Eggemann
2023-03-17 16:08 Vincent Guittot
2023-03-18 7:45 ` Zhang Qiao
2023-03-20 12:29 ` Dietmar Eggemann
2023-03-20 13:26 ` Vincent Guittot
2023-03-21 10:02 ` Peter Zijlstra
2023-03-21 10:29 ` Dietmar Eggemann
2023-03-21 10:49 ` Peter Zijlstra
2023-03-21 11:12 ` Vincent Guittot
2023-03-21 11:13 ` Dietmar Eggemann
2023-03-21 12:26 ` Peter Zijlstra
2023-03-21 12:28 ` Peter Zijlstra
2023-03-21 12:38 ` Vincent Guittot
2023-03-24 4:05 ` Chen Yu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55754a59-a01f-206a-43f6-d07ea37300dd@huawei.com \
--to=zhangqiao22@huawei.com \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rkagan@amazon.de \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).