linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
@ 2019-01-23  9:46 ufo19890607
  2019-01-25  3:11 ` 王贇
  2019-02-06 17:19 ` Peter Zijlstra
  0 siblings, 2 replies; 10+ messages in thread
From: ufo19890607 @ 2019-01-23  9:46 UTC (permalink / raw)
  To: mingo, peterz, yun.wang, yuzhoujian; +Cc: linux-kernel

From: yuzhoujian <yuzhoujian@didichuxing.com>

We can monitor the sum wait time of a task group since commit 3d6c50c27bd6
("sched/debug: Show the sum wait time of a task group"). However, this
wait_sum only represents the conflict between different task groups, since
it simply sums the wait time of the task_group's cfs_rq. We still cannot
evaluate the conflict between all the tasks within the hierarchy of this
group, so the hierarchy wait time is still needed.

Thus we introduce hierarchy wait_sum which summarizes the total wait sum of
all the tasks in the hierarchy of a group.

The 'cpu.stat' is modified to show the statistic, like:

	nr_periods 0
	nr_throttled 0
	throttled_time 0
	intergroup wait_sum 2842251984
	hierarchy wait_sum 6389509389332798

From now on we can monitor both the wait_sum of intergroup and hierarchy,
which will inevitably help a system administrator know how intense the CPU
competition is within a task group and between different task groups. We
can calculate the wait rate of a task group based on hierarchy wait_sum and
cpuacct.usage.

For example:
	X% = (current_wait_sum - last_wait_sum) / ((current_usage -
last_usage) + (current_wait_sum - last_wait_sum))

That means the task group spent X percent of its time waiting on the
runqueue for the CPU.
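
As an illustration only (not part of this patch; the helper name and types
are hypothetical), a userspace monitor could compute that percentage from
two samples of the two counters:

	/*
	 * Percentage of time a group spent waiting on the runqueue
	 * between two samples of hierarchy wait_sum and cpuacct.usage.
	 * Both counters are in nanoseconds.
	 */
	static unsigned int wait_rate_pct(unsigned long long cur_wait,
					  unsigned long long last_wait,
					  unsigned long long cur_usage,
					  unsigned long long last_usage)
	{
		unsigned long long dwait = cur_wait - last_wait;
		unsigned long long dusage = cur_usage - last_usage;

		if (dwait + dusage == 0)
			return 0;
		return dwait * 100 / (dwait + dusage);
	}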

Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
---
 kernel/sched/core.c  | 11 +++++++----
 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee77636..172e6fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6760,13 +6760,16 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 	seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
 
 	if (schedstat_enabled() && tg != &root_task_group) {
-		u64 ws = 0;
+		u64 inter_ws = 0, hierarchy_ws = 0;
 		int i;
 
-		for_each_possible_cpu(i)
-			ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+		for_each_possible_cpu(i) {
+			inter_ws += schedstat_val(tg->se[i]->statistics.wait_sum);
+			hierarchy_ws += tg->cfs_rq[i]->hierarchy_wait_sum;
+		}
 
-		seq_printf(sf, "wait_sum %llu\n", ws);
+		seq_printf(sf, "intergroup wait_sum %llu\n", inter_ws);
+		seq_printf(sf, "hierarchy wait_sum %llu\n", hierarchy_ws);
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2ff4b6..35e89ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
 }
 
 static inline void
+update_hierarchy_wait_sum(struct sched_entity *se,
+			u64 delta_wait)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (cfs_rq->tg != &root_task_group)
+			__schedstat_add(cfs_rq->hierarchy_wait_sum,
+					delta_wait);
+	}
+}
+
+static inline void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct task_struct *p;
@@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
 			return;
 		}
 		trace_sched_stat_wait(p, delta);
+		update_hierarchy_wait_sum(se, delta);
 	}
 
 	__schedstat_set(se->statistics.wait_max,
@@ -10273,6 +10287,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifndef CONFIG_64BIT
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
+#ifdef CONFIG_SCHEDSTATS
+	cfs_rq->hierarchy_wait_sum = 0;
+#endif
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d27c1a5..c01ab99 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -496,6 +496,9 @@ struct cfs_rq {
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
+#ifdef CONFIG_SCHEDSTATS
+	u64			hierarchy_wait_sum;
+#endif
 
 	struct rb_root_cached	tasks_timeline;
 
-- 
1.8.3.1



* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-01-23  9:46 [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group ufo19890607
@ 2019-01-25  3:11 ` 王贇
       [not found]   ` <CAHCio2iYoKOSgxJWFX6KcRcn7GK=aEJP7Sz0vU6SwXK2C56mew@mail.gmail.com>
  2019-02-06 17:19 ` Peter Zijlstra
  1 sibling, 1 reply; 10+ messages in thread
From: 王贇 @ 2019-01-25  3:11 UTC (permalink / raw)
  To: ufo19890607, mingo, peterz, yuzhoujian; +Cc: linux-kernel



On 2019/1/23 5:46 PM, ufo19890607@gmail.com wrote:
> From: yuzhoujian <yuzhoujian@didichuxing.com>
> 
> We can monitor the sum wait time of a task group since commit 3d6c50c27bd6
> ("sched/debug: Show the sum wait time of a task group"). However, this
> wait_sum only represents the conflict between different task groups, since
> it simply sums the wait time of the task_group's cfs_rq. We still cannot
> evaluate the conflict between all the tasks within the hierarchy of this
> group, so the hierarchy wait time is still needed.

Could you please give us a scenario where we really need this hierarchy
wait_sum, despite the extra overhead?

Regards,
Michael Wang

> 
> Thus we introduce hierarchy wait_sum which summarizes the total wait sum of
> all the tasks in the hierarchy of a group.
> 
> The 'cpu.stat' is modified to show the statistic, like:
> 
> 	nr_periods 0
> 	nr_throttled 0
> 	throttled_time 0
> 	intergroup wait_sum 2842251984
> 	hierarchy wait_sum 6389509389332798
> 
>  From now on we can monitor both the wait_sum of intergroup and hierarchy,
> which will inevitably help a system administrator know how intense the CPU
> competition is within a task group and between different task groups. We
> can calculate the wait rate of a task group based on hierarchy wait_sum and
> cpuacct.usage.
> 
> For example:
> 	X% = (current_wait_sum - last_wait_sum) / ((current_usage -
> last_usage) + (current_wait_sum - last_wait_sum))
> 
> That means the task group spent X percent of its time waiting on the
> runqueue for the CPU.
> 
> Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
> ---
>   kernel/sched/core.c  | 11 +++++++----
>   kernel/sched/fair.c  | 17 +++++++++++++++++
>   kernel/sched/sched.h |  3 +++
>   3 files changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee77636..172e6fb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6760,13 +6760,16 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>   	seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>   
>   	if (schedstat_enabled() && tg != &root_task_group) {
> -		u64 ws = 0;
> +		u64 inter_ws = 0, hierarchy_ws = 0;
>   		int i;
>   
> -		for_each_possible_cpu(i)
> -			ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +		for_each_possible_cpu(i) {
> +			inter_ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +			hierarchy_ws += tg->cfs_rq[i]->hierarchy_wait_sum;
> +		}
>   
> -		seq_printf(sf, "wait_sum %llu\n", ws);
> +		seq_printf(sf, "intergroup wait_sum %llu\n", inter_ws);
> +		seq_printf(sf, "hierarchy wait_sum %llu\n", hierarchy_ws);
>   	}
>   
>   	return 0;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2ff4b6..35e89ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
>   }
>   
>   static inline void
> +update_hierarchy_wait_sum(struct sched_entity *se,
> +			u64 delta_wait)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (cfs_rq->tg != &root_task_group)
> +			__schedstat_add(cfs_rq->hierarchy_wait_sum,
> +					delta_wait);
> +	}
> +}
> +
> +static inline void
>   update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   {
>   	struct task_struct *p;
> @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
>   			return;
>   		}
>   		trace_sched_stat_wait(p, delta);
> +		update_hierarchy_wait_sum(se, delta);
>   	}
>   
>   	__schedstat_set(se->statistics.wait_max,
> @@ -10273,6 +10287,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>   #ifndef CONFIG_64BIT
>   	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>   #endif
> +#ifdef CONFIG_SCHEDSTATS
> +	cfs_rq->hierarchy_wait_sum = 0;
> +#endif
>   #ifdef CONFIG_SMP
>   	raw_spin_lock_init(&cfs_rq->removed.lock);
>   #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d27c1a5..c01ab99 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -496,6 +496,9 @@ struct cfs_rq {
>   #ifndef CONFIG_64BIT
>   	u64			min_vruntime_copy;
>   #endif
> +#ifdef CONFIG_SCHEDSTATS
> +	u64			hierarchy_wait_sum;
> +#endif
>   
>   	struct rb_root_cached	tasks_timeline;
>   
> 


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
       [not found]   ` <CAHCio2iYoKOSgxJWFX6KcRcn7GK=aEJP7Sz0vU6SwXK2C56mew@mail.gmail.com>
@ 2019-01-28  2:53     ` 王贇
  2019-01-28  7:21       ` 禹舟键
  0 siblings, 1 reply; 10+ messages in thread
From: 王贇 @ 2019-01-28  2:53 UTC (permalink / raw)
  To: 禹舟键; +Cc: mingo, Peter Zijlstra, Wind Yu, linux-kernel



On 2019/1/25 3:55 PM, 禹舟键 wrote:
> Hi, Michael
> Actually, I just created a container which had just 1 CPU, and I ran two tasks (while(1) loops) in this container. Then I read cpu.stat at a 1-second interval, but the wait_sum difference is almost 0.

Task competition inside a cgroup won't be considered as the cgroup's
competition; please try creating another cgroup with a dead loop on
each CPU.

> I checked your patch and found that you just sum the task group's se wait_sum rather than the tasks' se wait_sum. So we can see that the increment of 'wait_sum' will always be 0 as long as there are running tasks in this task group.

Running tasks doesn't mean there is no competition, unless that cgroup
occupied the CPU exclusively at that moment.

> Unfortunately, this is not what we want. Our expectation is to evaluate the CPU contention in a task group, just as I said. We need to ensure that containers with high priority have sufficient CPU resources.

No offense, but I'm afraid you misunderstand the problem we try to solve
by wait_sum. If your purpose is to have a way to tell whether there is
sufficient CPU inside a container, please try lxcfs + top: if there is
almost no idle time and the load is high, then the CPU resource is not
sufficient.

> 
> I implemented 'hierarchy wait_sum' referring to cpuacct.usage. The hierarchy wait_sum summarizes all the tasks' se wait_sum within the hierarchy of this task group. We
> can calculate the wait rate of a task group based on hierarchy wait_sum and cpuacct.usage
Frankly speaking, this sounds like a supplement rather than a missing
piece. Although we don't rely on lxcfs and modify the kernel ourselves to
support the container environment, I still don't think this kind of
solution should be in the kernel.

Regards,
Michael Wang

> 
> The extra overhead of calculating the hierarchy wait_sum is traversing the cfs_rq/se chain from the target task's se up to the root_task_group's child se.
> 
> Regartds
> Yuzhoujian
> 
> 
王贇 <yun.wang@linux.alibaba.com> wrote on Fri, Jan 25, 2019 at 11:12 AM:
> 
> 
> 
>     On 2019/1/23 5:46 PM, ufo19890607@gmail.com wrote:
>      > From: yuzhoujian <yuzhoujian@didichuxing.com>
>      >
>      > We can monitor the sum wait time of a task group since commit 3d6c50c27bd6
>      > ("sched/debug: Show the sum wait time of a task group"). However, this
>      > wait_sum only represents the conflict between different task groups, since
>      > it simply sums the wait time of the task_group's cfs_rq. We still cannot
>      > evaluate the conflict between all the tasks within the hierarchy of this
>      > group, so the hierarchy wait time is still needed.
> 
>     Could you please give us a scenario where we really need this hierarchy
>     wait_sum, despite the extra overhead?
> 
>     Regards,
>     Michael Wang
> 
>      >
>      > Thus we introduce hierarchy wait_sum which summarizes the total wait sum of
>      > all the tasks in the hierarchy of a group.
>      >
>      > The 'cpu.stat' is modified to show the statistic, like:
>      >
>      >       nr_periods 0
>      >       nr_throttled 0
>      >       throttled_time 0
>      >       intergroup wait_sum 2842251984
>      >       hierarchy wait_sum 6389509389332798
>      >
>      >  From now on we can monitor both the wait_sum of intergroup and hierarchy,
>      > which will inevitably help a system administrator know how intense the CPU
>      > competition is within a task group and between different task groups. We
>      > can calculate the wait rate of a task group based on hierarchy wait_sum and
>      > cpuacct.usage.
>      >
>      > For example:
>      >       X% = (current_wait_sum - last_wait_sum) / ((current_usage -
>      > last_usage) + (current_wait_sum - last_wait_sum))
>      >
>      > That means the task group spent X percent of its time waiting on the
>      > runqueue for the CPU.
>      >
>      > Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
>      > ---
>      >   kernel/sched/core.c  | 11 +++++++----
>      >   kernel/sched/fair.c  | 17 +++++++++++++++++
>      >   kernel/sched/sched.h |  3 +++
>      >   3 files changed, 27 insertions(+), 4 deletions(-)
>      >
>      > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>      > index ee77636..172e6fb 100644
>      > --- a/kernel/sched/core.c
>      > +++ b/kernel/sched/core.c
>      > @@ -6760,13 +6760,16 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>      >       seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>      >
>      >       if (schedstat_enabled() && tg != &root_task_group) {
>      > -             u64 ws = 0;
>      > +             u64 inter_ws = 0, hierarchy_ws = 0;
>      >               int i;
>      >
>      > -             for_each_possible_cpu(i)
>      > -                     ws += schedstat_val(tg->se[i]->statistics.wait_sum);
>      > +             for_each_possible_cpu(i) {
>      > +                     inter_ws += schedstat_val(tg->se[i]->statistics.wait_sum);
>      > +                     hierarchy_ws += tg->cfs_rq[i]->hierarchy_wait_sum;
>      > +             }
>      >
>      > -             seq_printf(sf, "wait_sum %llu\n", ws);
>      > +             seq_printf(sf, "intergroup wait_sum %llu\n", inter_ws);
>      > +             seq_printf(sf, "hierarchy wait_sum %llu\n", hierarchy_ws);
>      >       }
>      >
>      >       return 0;
>      > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>      > index e2ff4b6..35e89ca 100644
>      > --- a/kernel/sched/fair.c
>      > +++ b/kernel/sched/fair.c
>      > @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
>      >   }
>      >
>      >   static inline void
>      > +update_hierarchy_wait_sum(struct sched_entity *se,
>      > +                     u64 delta_wait)
>      > +{
>      > +     for_each_sched_entity(se) {
>      > +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
>      > +
>      > +             if (cfs_rq->tg != &root_task_group)
>      > +                     __schedstat_add(cfs_rq->hierarchy_wait_sum,
>      > +                                     delta_wait);
>      > +     }
>      > +}
>      > +
>      > +static inline void
>      >   update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>      >   {
>      >       struct task_struct *p;
>      > @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
>      >                       return;
>      >               }
>      >               trace_sched_stat_wait(p, delta);
>      > +             update_hierarchy_wait_sum(se, delta);
>      >       }
>      >
>      >       __schedstat_set(se->statistics.wait_max,
>      > @@ -10273,6 +10287,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>      >   #ifndef CONFIG_64BIT
>      >       cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>      >   #endif
>      > +#ifdef CONFIG_SCHEDSTATS
>      > +     cfs_rq->hierarchy_wait_sum = 0;
>      > +#endif
>      >   #ifdef CONFIG_SMP
>      >       raw_spin_lock_init(&cfs_rq->removed.lock);
>      >   #endif
>      > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>      > index d27c1a5..c01ab99 100644
>      > --- a/kernel/sched/sched.h
>      > +++ b/kernel/sched/sched.h
>      > @@ -496,6 +496,9 @@ struct cfs_rq {
>      >   #ifndef CONFIG_64BIT
>      >       u64                     min_vruntime_copy;
>      >   #endif
>      > +#ifdef CONFIG_SCHEDSTATS
>      > +     u64                     hierarchy_wait_sum;
>      > +#endif
>      >
>      >       struct rb_root_cached   tasks_timeline;
>      >
>      >
> 


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-01-28  2:53     ` 王贇
@ 2019-01-28  7:21       ` 禹舟键
  2019-01-30  1:53         ` 王贇
  0 siblings, 1 reply; 10+ messages in thread
From: 禹舟键 @ 2019-01-28  7:21 UTC (permalink / raw)
  To: 王贇; +Cc: mingo, Peter Zijlstra, Wind Yu, linux-kernel

Hi Michael
> Task competition inside a cgroup won't be considered as the cgroup's
> competition; please try creating another cgroup with a dead loop on
> each CPU.

Yes, you are right, but I don't think we only need to account for the
cgroup-level competition, because that factor does not reflect the
cgroup's internal conditions. We still need a proper method to evaluate
CPU competition inside a cgroup.


> Running tasks doesn't mean there is no competition, unless that cgroup
> occupied the CPU exclusively at that moment.

I care a lot about CPU competition inside a cgroup. Without cgroup
hierarchy wait_sum, I can only read '/proc/$pid/schedstat' thousands of
times to get every task's wait_sum, and that definitely takes a really
long time (40ms for 8000 tasks in a container).
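
For reference, that per-task fallback is roughly the sketch below (a
hypothetical userspace helper; error handling is trimmed). Each
/proc/<pid>/schedstat holds three fields: time on the CPU, time waiting
on a runqueue, and the number of timeslices. Repeating this read for
thousands of PIDs is what costs the 40ms:

	#include <stdio.h>

	/* Read the runqueue wait time (second field, in nanoseconds)
	 * of one task from /proc/<pid>/schedstat. */
	static unsigned long long task_wait_sum(int pid)
	{
		unsigned long long run = 0, wait = 0;
		unsigned long slices = 0;
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/schedstat", pid);
		f = fopen(path, "r");
		if (!f)
			return 0;
		if (fscanf(f, "%llu %llu %lu", &run, &wait, &slices) != 3)
			wait = 0;
		fclose(f);
		return wait;
	}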


> No offense, but I'm afraid you misunderstand the problem we try to solve
> by wait_sum. If your purpose is to have a way to tell whether there is
> sufficient CPU inside a container, please try lxcfs + top: if there is
> almost no idle time and the load is high, then the CPU resource is not
> sufficient.

emmmm... Maybe I didn't make it clear. We need to dynamically adjust the
number of CPUs for a container based on the running state of the tasks
inside the container. If we find the tasks' wait_sum is really high, we
will add more CPU cores to this container; otherwise we will take some
CPUs away from it. In a word, we want to ensure 'co-scheduling' for
high-priority containers. (A rough sketch of that policy follows.)
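
Purely as a sketch (nothing below is in the patch; the function and the
watermarks are invented for illustration), the adjustment policy could
look like:

	/*
	 * Hypothetical policy: pick a container's CPU count from the
	 * wait rate sampled via hierarchy wait_sum and cpuacct.usage.
	 * The 25%/5% watermarks are made up for illustration.
	 */
	static int target_cpus(int cur_cpus, unsigned int wait_pct)
	{
		if (wait_pct > 25)			/* heavy queueing: grow */
			return cur_cpus + 1;
		if (wait_pct < 5 && cur_cpus > 1)	/* headroom: shrink */
			return cur_cpus - 1;
		return cur_cpus;
	}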


> Frankly speaking, this sounds like a supplement rather than a missing
> piece. Although we don't rely on lxcfs and modify the kernel ourselves to
> support the container environment, I still don't think this kind of
> solution should be in the kernel.

I don't care whether this value is considered a supplement or a missing
piece. I only care about how I can assess the running state inside a
container. I think lxcfs is really a good solution to improve the
visibility of container resources, but it is not good enough at the
moment.

/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime

We can read these procfs files inside a container, but they still cannot
reflect real-time information. Please think about the following scenario:
a 'rabbit' process generates 2000 tasks every 30ms, and these child tasks
just run 1~5ms and then exit. How can we detect this thrashing workload
without hierarchy wait_sum?

Thanks,
Yuzhoujian


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-01-28  7:21       ` 禹舟键
@ 2019-01-30  1:53         ` 王贇
  0 siblings, 0 replies; 10+ messages in thread
From: 王贇 @ 2019-01-30  1:53 UTC (permalink / raw)
  To: 禹舟键; +Cc: mingo, Peter Zijlstra, Wind Yu, linux-kernel



On 2019/1/28 3:21 PM, 禹舟键 wrote:
[snip]
>> No offense, but I'm afraid you misunderstand the problem we try to solve
>> by wait_sum. If your purpose is to have a way to tell whether there is
>> sufficient CPU inside a container, please try lxcfs + top: if there is
>> almost no idle time and the load is high, then the CPU resource is not
>> sufficient.
> 
> emmmm... Maybe I didn't make it clear. We need to dynamically adjust the
> number of CPUs for a container based on the running state of the tasks
> inside the container. If we find the tasks' wait_sum is really high, we
> will add more CPU cores to this container; otherwise we will take some
> CPUs away from it. In a word, we want to ensure 'co-scheduling' for
> high-priority containers.
> 

I understand that you want to use task wait time, which is a raw metric,
but IMHO when tasks wait more, idle will be lower and load will be higher;
those are more general metrics for telling whether CPUs are starving than
task wait time on the rq, and we rely on them too.

The only issue we had previously is that we didn't know what caused the
low idle and high load: it could be a wrong resource assignment or cgroup
competition. Now with wait_sum we can first make sure the competition is
low; then, if idle is still low and load is still high inside the
container, it's time to assign more CPUs.

> 
>> Frankly speaking, this sounds like a supplement rather than a missing
>> piece. Although we don't rely on lxcfs and modify the kernel ourselves to
>> support the container environment, I still don't think this kind of
>> solution should be in the kernel.
> 
> I don't care whether this value is considered a supplement or a missing
> piece. I only care about how I can assess the running state inside a
> container. I think lxcfs is really a good solution to improve the
> visibility of container resources, but it is not good enough at the
> moment.
> 
> /proc/cpuinfo
> /proc/diskstats
> /proc/meminfo
> /proc/stat
> /proc/swaps
> /proc/uptime
> 
> We can read these procfs files inside a container, but they still cannot
> reflect real-time information. Please think about the following scenario:
> a 'rabbit' process generates 2000 tasks every 30ms, and these child tasks
> just run 1~5ms and then exit. How can we detect this thrashing workload
> without hierarchy wait_sum?

As mentioned, we implement the isolation ourselves, so we see the isolated
idle and load information inside the container rather than the host data.
We don't rely on lxcfs, but we know it does similar work. So, for what you
get by reading /proc/stat: does it show the isolated idle data? You would
need an isolated /proc/loadavg too.

Anyway, IMHO this is a special requirement only for the container
environment, not a general solution to a kernel problem, so I would
suggest either helping improve lxcfs to make it useful for your
production, or doing the modification in your own kernel.

Regards,
Michael Wang

> 
> Thanks,
> Yuzhoujian
> 


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-01-23  9:46 [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group ufo19890607
  2019-01-25  3:11 ` 王贇
@ 2019-02-06 17:19 ` Peter Zijlstra
  2019-02-11  2:43   ` 禹舟键
  2019-02-12  3:14   ` 禹舟键
  1 sibling, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2019-02-06 17:19 UTC (permalink / raw)
  To: ufo19890607; +Cc: mingo, yun.wang, yuzhoujian, linux-kernel

On Wed, Jan 23, 2019 at 05:46:56PM +0800, ufo19890607@gmail.com wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2ff4b6..35e89ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
>  }
>  
>  static inline void
> +update_hierarchy_wait_sum(struct sched_entity *se,
> +			u64 delta_wait)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (cfs_rq->tg != &root_task_group)
> +			__schedstat_add(cfs_rq->hierarchy_wait_sum,
> +					delta_wait);
> +	}
> +}
> +
> +static inline void
>  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	struct task_struct *p;
> @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
>  			return;
>  		}
>  		trace_sched_stat_wait(p, delta);
> +		update_hierarchy_wait_sum(se, delta);
>  	}
>  
>  	__schedstat_set(se->statistics.wait_max,

The problem I have with this is that it will make schedstats even more
expensive :/


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-02-06 17:19 ` Peter Zijlstra
@ 2019-02-11  2:43   ` 禹舟键
  2019-02-12  3:14   ` 禹舟键
  1 sibling, 0 replies; 10+ messages in thread
From: 禹舟键 @ 2019-02-11  2:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, 王贇, Wind Yu, linux-kernel

Hi Peter
> The problem I have with this is that it will make schedstats even more
> expensive :/

I think the overhead of accounting the hierarchy wait time is just the
same as for cpuacct.usage. If the performance overhead is low enough
(< 1%), is it acceptable?

Thanks
Yuzhoujian

Peter Zijlstra <peterz@infradead.org> wrote on Thu, Feb 7, 2019 at 1:19 AM:
>
> On Wed, Jan 23, 2019 at 05:46:56PM +0800, ufo19890607@gmail.com wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e2ff4b6..35e89ca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
> >  }
> >
> >  static inline void
> > +update_hierarchy_wait_sum(struct sched_entity *se,
> > +                     u64 delta_wait)
> > +{
> > +     for_each_sched_entity(se) {
> > +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +             if (cfs_rq->tg != &root_task_group)
> > +                     __schedstat_add(cfs_rq->hierarchy_wait_sum,
> > +                                     delta_wait);
> > +     }
> > +}
> > +
> > +static inline void
> >  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >       struct task_struct *p;
> > @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
> >                       return;
> >               }
> >               trace_sched_stat_wait(p, delta);
> > +             update_hierarchy_wait_sum(se, delta);
> >       }
> >
> >       __schedstat_set(se->statistics.wait_max,
>
> The problem I have with this is that it will make schedstats even more
> expensive :/


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-02-06 17:19 ` Peter Zijlstra
  2019-02-11  2:43   ` 禹舟键
@ 2019-02-12  3:14   ` 禹舟键
  2019-02-19  2:15     ` 禹舟键
  1 sibling, 1 reply; 10+ messages in thread
From: 禹舟键 @ 2019-02-12  3:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, 王贇, Wind Yu, linux-kernel

Hi, Peter
I think the hierarchy wait time of task groups is worth accounting
despite a little extra overhead, because we can evaluate the task
groups' condition with a more direct metric. We cannot get the real
situation just from general metrics like idle or loadavg, since their
values decay as time elapses. So I think general metrics cannot
satisfy this requirement for task groups.

Thanks
Yuzhoujian


Peter Zijlstra <peterz@infradead.org> wrote on Thu, Feb 7, 2019 at 1:19 AM:
>
> On Wed, Jan 23, 2019 at 05:46:56PM +0800, ufo19890607@gmail.com wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e2ff4b6..35e89ca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
> >  }
> >
> >  static inline void
> > +update_hierarchy_wait_sum(struct sched_entity *se,
> > +                     u64 delta_wait)
> > +{
> > +     for_each_sched_entity(se) {
> > +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +             if (cfs_rq->tg != &root_task_group)
> > +                     __schedstat_add(cfs_rq->hierarchy_wait_sum,
> > +                                     delta_wait);
> > +     }
> > +}
> > +
> > +static inline void
> >  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >       struct task_struct *p;
> > @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
> >                       return;
> >               }
> >               trace_sched_stat_wait(p, delta);
> > +             update_hierarchy_wait_sum(se, delta);
> >       }
> >
> >       __schedstat_set(se->statistics.wait_max,
>
> The problem I have with this is that it will make schedstats even more
> expensive :/


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-02-12  3:14   ` 禹舟键
@ 2019-02-19  2:15     ` 禹舟键
  2019-02-25  3:29       ` 禹舟键
  0 siblings, 1 reply; 10+ messages in thread
From: 禹舟键 @ 2019-02-19  2:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, 王贇, Wind Yu, linux-kernel

PING

禹舟键 <ufo19890607@gmail.com> wrote on Tue, Feb 12, 2019 at 11:14 AM:
>
> Hi, Peter
> I think the hierarchy wait time of task groups is worth accounting
> despite a little extra overhead, because we can evaluate the task
> groups' condition with a more direct metric. We cannot get the real
> situation just from general metrics like idle or loadavg, since their
> values decay as time elapses. So I think general metrics cannot
> satisfy this requirement for task groups.
>
> Thanks
> Yuzhoujian
>
>
> Peter Zijlstra <peterz@infradead.org> wrote on Thu, Feb 7, 2019 at 1:19 AM:
> >
> > On Wed, Jan 23, 2019 at 05:46:56PM +0800, ufo19890607@gmail.com wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e2ff4b6..35e89ca 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
> > >  }
> > >
> > >  static inline void
> > > +update_hierarchy_wait_sum(struct sched_entity *se,
> > > +                     u64 delta_wait)
> > > +{
> > > +     for_each_sched_entity(se) {
> > > +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > > +
> > > +             if (cfs_rq->tg != &root_task_group)
> > > +                     __schedstat_add(cfs_rq->hierarchy_wait_sum,
> > > +                                     delta_wait);
> > > +     }
> > > +}
> > > +
> > > +static inline void
> > >  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >  {
> > >       struct task_struct *p;
> > > @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
> > >                       return;
> > >               }
> > >               trace_sched_stat_wait(p, delta);
> > > +             update_hierarchy_wait_sum(se, delta);
> > >       }
> > >
> > >       __schedstat_set(se->statistics.wait_max,
> >
> > The problem I have with this is that it will make schedstats even more
> > expensive :/


* Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group
  2019-02-19  2:15     ` 禹舟键
@ 2019-02-25  3:29       ` 禹舟键
  0 siblings, 0 replies; 10+ messages in thread
From: 禹舟键 @ 2019-02-25  3:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, 王贇, Wind Yu, linux-kernel

PING


禹舟键 <ufo19890607@gmail.com> wrote on Tue, Feb 19, 2019 at 10:15 AM:
>
> PING
>
> 禹舟键 <ufo19890607@gmail.com> wrote on Tue, Feb 12, 2019 at 11:14 AM:
> >
> > Hi, Peter
> > I think the hierarchy wait time of task groups is worth accounting
> > despite a little extra overhead, because we can evaluate the task
> > groups' condition with a more direct metric. We cannot get the real
> > situation just from general metrics like idle or loadavg, since their
> > values decay as time elapses. So I think general metrics cannot
> > satisfy this requirement for task groups.
> >
> > Thanks
> > Yuzhoujian
> >
> >
> > Peter Zijlstra <peterz@infradead.org> wrote on Thu, Feb 7, 2019 at 1:19 AM:
> > >
> > > On Wed, Jan 23, 2019 at 05:46:56PM +0800, ufo19890607@gmail.com wrote:
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index e2ff4b6..35e89ca 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
> > > >  }
> > > >
> > > >  static inline void
> > > > +update_hierarchy_wait_sum(struct sched_entity *se,
> > > > +                     u64 delta_wait)
> > > > +{
> > > > +     for_each_sched_entity(se) {
> > > > +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > > > +
> > > > +             if (cfs_rq->tg != &root_task_group)
> > > > +                     __schedstat_add(cfs_rq->hierarchy_wait_sum,
> > > > +                                     delta_wait);
> > > > +     }
> > > > +}
> > > > +
> > > > +static inline void
> > > >  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > >  {
> > > >       struct task_struct *p;
> > > > @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
> > > >                       return;
> > > >               }
> > > >               trace_sched_stat_wait(p, delta);
> > > > +             update_hierarchy_wait_sum(se, delta);
> > > >       }
> > > >
> > > >       __schedstat_set(se->statistics.wait_max,
> > >
> > > The problem I have with this is that it will make schedstats even more
> > > expensive :/


Thread overview: 10 messages
2019-01-23  9:46 [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group ufo19890607
2019-01-25  3:11 ` 王贇
     [not found]   ` <CAHCio2iYoKOSgxJWFX6KcRcn7GK=aEJP7Sz0vU6SwXK2C56mew@mail.gmail.com>
2019-01-28  2:53     ` 王贇
2019-01-28  7:21       ` 禹舟键
2019-01-30  1:53         ` 王贇
2019-02-06 17:19 ` Peter Zijlstra
2019-02-11  2:43   ` 禹舟键
2019-02-12  3:14   ` 禹舟键
2019-02-19  2:15     ` 禹舟键
2019-02-25  3:29       ` 禹舟键
