* [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller
@ 2017-07-20 18:48 ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-07-20 18:48 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

These are two patches to implement the cgroup2 CPU controller
interface.  It's separated out from the previous version [L] of the
combined cgroup2 threaded mode + CPU controller patchset.

* "cpu.weight.nice" added which allows reading and setting weights
  using nice values so that it's easier to configure when threaded
  cgroups are competing against threads.

* Doc and other misc updates.

 0001-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch
 0002-sched-Implement-interface-for-cgroup-unified-hierarc.patch

The patchset is on top of the "cgroup: implement cgroup2 thread mode, v4"
patchset [T] and is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-cpu-on-v4

 Documentation/cgroup-v2.txt |   36 ++------
 kernel/sched/core.c         |  186 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/cpuacct.c      |   53 +++++++++---
 kernel/sched/cpuacct.h      |    5 +
 4 files changed, 241 insertions(+), 39 deletions(-)

--
tejun

[L] http://lkml.kernel.org/r/20170610140351.10703-1-tj@kernel.org
[T] http://lkml.kernel.org/r/20170719194439.2915602-1-tj@kernel.org


* [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
  2017-07-20 18:48 ` Tejun Heo
@ 2017-07-20 18:48 ` Tejun Heo
  -1 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-07-20 18:48 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

Make the following changes in preparation for the cgroup2 cpu
controller interface implementation.  This patch introduces no
functional changes.

* s/cpu_stats_show()/cpu_cfs_stats_show()/

* s/cpu_files/cpu_legacy_files/

* Separate out cpuacct_stats_read() from cpuacct_stats_show().  While
  at it, make the @val array u64 for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    |  8 ++++----
 kernel/sched/cpuacct.c | 29 ++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17c667b..71a060f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,7 +6578,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 	return ret;
 }
 
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -6618,7 +6618,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
 		.name = "shares",
@@ -6639,7 +6639,7 @@ static struct cftype cpu_files[] = {
 	},
 	{
 		.name = "stat",
-		.seq_show = cpu_stats_show,
+		.seq_show = cpu_cfs_stats_show,
 	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -6665,7 +6665,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.legacy_cftypes	= cpu_files,
+	.legacy_cftypes	= cpu_legacy_files,
 	.early_init	= true,
 };
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f95ab29..6151c23 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -276,26 +276,33 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	return 0;
 }
 
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca,
+			       u64 (*val)[CPUACCT_STAT_NSTATS])
 {
-	struct cpuacct *ca = css_ca(seq_css(sf));
-	s64 val[CPUACCT_STAT_NSTATS];
 	int cpu;
-	int stat;
 
-	memset(val, 0, sizeof(val));
+	memset(val, 0, sizeof(*val));
+
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+		(*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
+		(*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
 	}
+}
+
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+	u64 val[CPUACCT_STAT_NSTATS];
+	int stat;
+
+	cpuacct_stats_read(css_ca(seq_css(sf)), &val);
 
 	for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-		seq_printf(sf, "%s %lld\n",
+		seq_printf(sf, "%s %llu\n",
 			   cpuacct_stat_desc[stat],
 			   (long long)nsec_to_clock_t(val[stat]));
 	}
-- 
2.9.3


* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
  2017-07-20 18:48 ` Tejun Heo
@ 2017-07-20 18:48 ` Tejun Heo
  2017-07-29  9:17     ` Peter Zijlstra
  -1 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2017-07-20 18:48 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

There are a couple of interface issues which can be addressed in the
cgroup2 interface.

* Stats from cpuacct being reported separately from the cpu stats.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patchset implements the cpu controller's interface on cgroup2,
which adheres to the controller file conventions described in
Documentation/cgroup-v2.txt.  Overall, the following changes
are made.

* cpuacct is implicitly enabled and disabled by cpu and its information
  is reported through "cpu.stat" which now uses microseconds for all
  time durations.  All time duration fields now have "_usec" appended
  to them for clarity.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it will be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if weight is W and its
  scaled value is S, W / 100 == S / 1024.  While the mapped range is a
  bit smaller than the original scheduler weight range, the dead zones
  on both sides are relatively small and cover a wider range than the
  nice value mappings.  This file doesn't make sense in the root
  cgroup and isn't create on root.

* "cpu.weight.nice" is added. When read, it reads back the nice value
  which is closest to the current "cpu.weight".  When written, it sets
  "cpu.weight" to the weight value which matches the nice value.  This
  makes it easy to configure cgroups when they're competing against
  threads in threaded subtrees.

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
  "cpu.rt.max" which contains both runtime and period.

v3: - Added "cpu.weight.nice" to allow using nice values when
      configuring the weight.  The feature is requested by PeterZ.
    - Merge the patch to enable threaded support on cpu and cpuacct.
    - Dropped the bits about getting rid of cpuacct from patch
      description as there is a pretty strong case for making cpuacct
      an implicit controller so that basic cpu usage stats are always
      available.
    - Documentation updated accordingly.  "cpu.rt.max" section is
      dropped for now.

v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
      for CFS bandwidth stats and also using raw division for u64.
      Use CONFIG_CFS_BANDWIDTH and do_div() instead.  "cpu.rt.max" is
      not included yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroup-v2.txt |  36 +++------
 kernel/sched/core.c         | 178 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.c      |  24 ++++++
 kernel/sched/cpuacct.h      |   5 ++
 4 files changed, 219 insertions(+), 24 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cb9ea28..43f9811 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -860,10 +860,6 @@ Controllers
 CPU
 ---
 
-.. note::
-
-	The interface for the cpu controller hasn't been merged yet
-
 The "cpu" controllers regulates distribution of CPU cycles.  This
 controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
@@ -893,6 +889,18 @@ All time durations are in microseconds.
 
 	The weight in the range [1, 10000].
 
+  cpu.weight.nice
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	The nice value is in the range [-20, 19].
+
+	This interface file is an alternative interface for
+	"cpu.weight" and allows reading and setting weight using the
+	same values used by nice(2).  Because the range is smaller and
+	granularity is coarser for the nice values, the read value is
+	the closest approximation of the current weight.
+
   cpu.max
 	A read-write two value file which exists on non-root cgroups.
 	The default is "max 100000".
@@ -905,26 +913,6 @@ All time durations are in microseconds.
 	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
 	one number is written, $MAX is updated.
 
-  cpu.rt.max
-	.. note::
-
-	   The semantics of this file is still under discussion and the
-	   interface hasn't been merged yet
-
-	A read-write two value file which exists on all cgroups.
-	The default is "0 100000".
-
-	The maximum realtime runtime allocation.  Over-committing
-	configurations are disallowed and process migrations are
-	rejected if not enough bandwidth is available.  It's in the
-	following format::
-
-	  $MAX $PERIOD
-
-	which indicates that the group may consume upto $MAX in each
-	$PERIOD duration.  If only one number is written, $MAX is
-	updated.
-
 
 Memory
 ------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71a060f..9766670 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6657,6 +6657,175 @@ static struct cftype cpu_legacy_files[] = {
 	{ }	/* Terminate */
 };
 
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+	cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		struct task_group *tg = css_tg(seq_css(sf));
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		u64 throttled_usec;
+
+		throttled_usec = cfs_b->throttled_time;
+		do_div(throttled_usec, NSEC_PER_USEC);
+
+		seq_printf(sf, "nr_periods %d\n"
+			   "nr_throttled %d\n"
+			   "throttled_usec %llu\n",
+			   cfs_b->nr_periods, cfs_b->nr_throttled,
+			   throttled_usec);
+	}
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+	u64 weight = scale_load_down(tg->shares);
+
+	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+				struct cftype *cft, u64 weight)
+{
+	/*
+	 * cgroup weight knobs should use the common MIN, DFL and MAX
+	 * values which are 1, 100 and 10000 respectively.  While it loses
+	 * a bit of range on both ends, it maps pretty well onto the shares
+	 * value used by scheduler and the round-trip conversions preserve
+	 * the original value over the entire range.
+	 */
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+
+static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
+				    struct cftype *cft)
+{
+	unsigned long weight = scale_load_down(css_tg(css)->shares);
+	int last_delta = INT_MAX;
+	int prio, delta;
+
+	/* find the closest nice value to the current weight */
+	for (prio = 0; prio < ARRAY_SIZE(sched_prio_to_weight); prio++) {
+		delta = abs(sched_prio_to_weight[prio] - weight);
+		if (delta >= last_delta)
+			break;
+		last_delta = delta;
+	}
+
+	return PRIO_TO_NICE(prio - 1 + MAX_RT_PRIO);
+}
+
+static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
+				     struct cftype *cft, s64 nice)
+{
+	unsigned long weight;
+
+	if (nice < MIN_NICE || nice > MAX_NICE)
+		return -ERANGE;
+
+	weight = sched_prio_to_weight[NICE_TO_PRIO(nice) - MAX_RT_PRIO];
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+						  long period, long quota)
+{
+	if (quota < 0)
+		seq_puts(sf, "max");
+	else
+		seq_printf(sf, "%ld", quota);
+
+	seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+						 u64 *periodp, u64 *quotap)
+{
+	char tok[21];	/* U64_MAX */
+
+	if (!sscanf(buf, "%s %llu", tok, periodp))
+		return -EINVAL;
+
+	*periodp *= NSEC_PER_USEC;
+
+	if (sscanf(tok, "%llu", quotap))
+		*quotap *= NSEC_PER_USEC;
+	else if (!strcmp(tok, "max"))
+		*quotap = RUNTIME_INF;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+			     char *buf, size_t nbytes, loff_t off)
+{
+	struct task_group *tg = css_tg(of_css(of));
+	u64 period = tg_get_cfs_period(tg);
+	u64 quota;
+	int ret;
+
+	ret = cpu_period_quota_parse(buf, &period, &quota);
+	if (!ret)
+		ret = tg_set_cfs_bandwidth(tg, period, quota);
+	return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+	{
+		.name = "stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_stats_show,
+	},
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_weight_read_u64,
+		.write_u64 = cpu_weight_write_u64,
+	},
+	{
+		.name = "weight.nice",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_weight_nice_read_s64,
+		.write_s64 = cpu_weight_nice_write_s64,
+	},
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_max_show,
+		.write = cpu_max_write,
+	},
+#endif
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
@@ -6666,7 +6835,16 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
+	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+	.threaded	= true,
+#ifdef CONFIG_CGROUP_CPUACCT
+	/*
+	 * cpuacct is enabled together with cpu on the unified hierarchy
+	 * and its stats are reported through "cpu.stat".
+	 */
+	.depends_on	= 1 << cpuacct_cgrp_id,
+#endif
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 6151c23..ee4976a 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -347,6 +347,29 @@ static struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+	struct cgroup_subsys_state *css;
+	u64 usage, val[CPUACCT_STAT_NSTATS];
+
+	css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+	usage = cpuusage_read(css, seq_cft(sf));
+	cpuacct_stats_read(css_ca(css), &val);
+
+	do_div(usage, NSEC_PER_USEC);
+	do_div(val[CPUACCT_STAT_USER], NSEC_PER_USEC);
+	do_div(val[CPUACCT_STAT_SYSTEM], NSEC_PER_USEC);
+
+	seq_printf(sf, "usage_usec %llu\n"
+		   "user_usec %llu\n"
+		   "system_usec %llu\n",
+		   usage, val[CPUACCT_STAT_USER], val[CPUACCT_STAT_SYSTEM]);
+
+	css_put(css);
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
@@ -389,4 +412,5 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
 	.css_free	= cpuacct_css_free,
 	.legacy_cftypes	= files,
 	.early_init	= true,
+	.threaded	= true,
 };
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
 
 extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
 
 #else
 
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
 {
 }
 
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
 #endif
-- 
2.9.3
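
[Editor's note: a minimal user-space sketch of driving the new cgroup2
files added by this patch.  Illustrative only: it assumes cgroup2 is
mounted at /sys/fs/cgroup with the cpu controller enabled and a child
cgroup "test" already created; the paths and values are hypothetical.]

  #include <stdio.h>

  static int write_file(const char *path, const char *val)
  {
  	FILE *f = fopen(path, "w");

  	if (!f)
  		return -1;
  	fputs(val, f);
  	return fclose(f);
  }

  int main(void)
  {
  	char line[256];
  	FILE *f;

  	/* double the default weight (maps to cfs shares 2048) */
  	write_file("/sys/fs/cgroup/test/cpu.weight", "200");

  	/* allow at most 50ms of CPU time per 100ms period */
  	write_file("/sys/fs/cgroup/test/cpu.max", "50000 100000");

  	/* cpu.stat reports usage_usec, user_usec, system_usec, ... */
  	f = fopen("/sys/fs/cgroup/test/cpu.stat", "r");
  	if (f) {
  		while (fgets(line, sizeof(line), f))
  			fputs(line, stdout);
  		fclose(f);
  	}
  	return 0;
  }

Writing "max 100000" to cpu.max would lift the bandwidth limit again,
matching the "$MAX $PERIOD" format documented above.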


* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller
@ 2017-07-27 19:51   ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-07-27 19:51 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

On Thu, Jul 20, 2017 at 02:48:06PM -0400, Tejun Heo wrote:
> These are two patches to implement the cgroup2 CPU controller
> interface.  It's separated out from the previous version [L] of the
> combined cgroup2 threaded mode + CPU controller patchset.
> 
> * "cpu.weight.nice" added which allows reading and setting weights
>   using nice values so that it's easier to configure when threaded
>   cgroups are competing against threads.
> 
> * Doc and other misc updates.
> 
>  0001-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch
>  0002-sched-Implement-interface-for-cgroup-unified-hierarc.patch

Ping.  Any comments on the proposed interface?

Thanks!

-- 
tejun


* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
@ 2017-07-29  9:17     ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2017-07-29  9:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Thu, Jul 20, 2017 at 02:48:08PM -0400, Tejun Heo wrote:
> There are a couple of interface issues which can be addressed in the
> cgroup2 interface.
> 
> * Stats from cpuacct being reported separately from the cpu stats.
> 
> * Use of different time units.  Writable control knobs use
>   microseconds, some stat fields use nanoseconds while other cpuacct
>   stat fields use centiseconds.
> 
> * Control knobs which can't be used in the root cgroup still show up
>   in the root.
> 
> * Control knob names and semantics aren't consistent with other
>   controllers.
> 
> This patchset implements the cpu controller's interface on cgroup2,
> which adheres to the controller file conventions described in
> Documentation/cgroup-v2.txt.  Overall, the following changes
> are made.
> 
> * cpuacct is implicitly enabled and disabled by cpu and its information
>   is reported through "cpu.stat" which now uses microseconds for all
>   time durations.  All time duration fields now have "_usec" appended
>   to them for clarity.
> 
>   Note that cpuacct.usage_percpu is currently not included in
>   "cpu.stat".  If this information is actually called for, it will be
>   added later.
> 
> * "cpu.shares" is replaced with "cpu.weight" and operates on the
>   standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
>   The weight is scaled to scheduler weight so that 100 maps to 1024
>   and the ratio relationship is preserved - if weight is W and its
>   scaled value is S, W / 100 == S / 1024.  While the mapped range is a
>   bit smaller than the original scheduler weight range, the dead zones
>   on both sides are relatively small and cover a wider range than the
>   nice value mappings.  This file doesn't make sense in the root
>   cgroup and isn't create on root.

s/create/&d/

Thanks!

> * "cpu.weight.nice" is added. When read, it reads back the nice value
>   which is closest to the current "cpu.weight".  When written, it sets
>   "cpu.weight" to the weight value which matches the nice value.  This
>   makes it easy to configure cgroups when they're competing against
>   threads in threaded subtrees.
> 
> * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
>   which contains both quota and period.
> 
> * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
>   "cpu.rt.max" which contains both runtime and period.

So we've been looking at overhauling the whole RT stuff. But sadly we've
not been able to find something that works with all the legacy
constraints (like RT tasks having arbitrary affinities).

Let's just hope we can preserve this interface :/
> 
> v3: - Added "cpu.weight.nice" to allow using nice values when
>       configuring the weight.  The feature is requested by PeterZ.
>     - Merge the patch to enable threaded support on cpu and cpuacct.

>     - Dropped the bits about getting rid of cpuacct from patch
>       description as there is a pretty strong case for making cpuacct
>       an implicit controller so that basic cpu usage stats are always
>       available.

What about the whole double accounting thing? Because currently cpuacct
and cpu do a fair bit of duplication. It would be very good to get rid
of that.

>     - Documentation updated accordingly.  "cpu.rt.max" section is
>       dropped for now.


* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
  2017-07-29  9:17     ` Peter Zijlstra
@ 2017-08-01 20:17     ` Tejun Heo
  2017-08-01 21:40         ` Peter Zijlstra
  -1 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2017-08-01 20:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Sat, Jul 29, 2017 at 11:17:07AM +0200, Peter Zijlstra wrote:
> > * "cpu.shares" is replaced with "cpu.weight" and operates on the
> >   standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
> >   The weight is scaled to scheduler weight so that 100 maps to 1024
> >   and the ratio relationship is preserved - if weight is W and its
> >   scaled value is S, W / 100 == S / 1024.  While the mapped range is a
> >   bit smaller than the original scheduler weight range, the dead zones
> >   on both sides are relatively small and cover a wider range than the
> >   nice value mappings.  This file doesn't make sense in the root
> >   cgroup and isn't create on root.
> 
> s/create/&d/

Updated, thanks.

> > * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
> >   "cpu.rt.max" which contains both runtime and period.
> 
> So we've been looking at overhauling the whole RT stuff. But sadly we've
> not been able to find something that works with all the legacy
> constraints (like RT tasks having arbitrary affinities).
> 
> Let's just hope we can preserve this interface :/

Ah, should have dropped this from the description.  Yeah, we can wait
till the RT side settles down and go for a better matching interface
as necessary.

> > v3: - Added "cpu.weight.nice" to allow using nice values when
> >       configuring the weight.  The feature is requested by PeterZ.
> >     - Merge the patch to enable threaded support on cpu and cpuacct.
> 
> >     - Dropped the bits about getting rid of cpuacct from patch
> >       description as there is a pretty strong case for making cpuacct
> >       an implicit controller so that basic cpu usage stats are always
> >       available.
> 
> What about the whole double accounting thing? Because currently cpuacct
> and cpu do a fair bit of duplication. It would be very good to get rid
> of that.

I'm not that sure at this point.  Here are my current thoughts on
cpuacct.

* It is useful to have basic cpu statistics on cgroup without having
  to enable the cpu controller, especially because enabling cpu
  controller always changes how cpu cycles are distributed and
  currently comes at some performance overhead.

* On cgroup2, there is only one hierarchy.  It'd be great to have
  basic resource accounting enabled by default on all cgroups.  Note
  that we couldn't do that on v1 because there could be any number of
  hierarchies and the cost would increase with the number of
  hierarchies.

* It is bothersome that we're walking up the tree each time for
  cpuacct although being percpu && just walking up the tree makes it
  relatively cheap.  Anyways, I'm thinking about shifting the
  aggregation to the reader side so that the hot path always only
  updates local counters in a way which can scale even when there are
  a lot of (idle) cgroups.  Will follow up on this later.

Thanks.

-- 
tejun


* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
@ 2017-08-01 21:40         ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2017-08-01 21:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Tue, Aug 01, 2017 at 01:17:45PM -0700, Tejun Heo wrote:

> > What about the whole double accounting thing? Because currently cpuacct
> > and cpu do a fair bit of duplication. It would be very good to get rid
> > of that.
> 
> I'm not that sure at this point.  Here are my current thoughts on
> cpuacct.
> 
> * It is useful to have basic cpu statistics on cgroup without having
>   to enable the cpu controller, especially because enabling cpu
>   controller always changes how cpu cycles are distributed and
>   currently comes at some performance overhead.
> 
> * On cgroup2, there is only one hierarchy.  It'd be great to have
>   basic resource accounting enabled by default on all cgroups.  Note
>   that we couldn't do that on v1 because there could be any number of
>   hierarchies and the cost would increase with the number of
>   hierarchies.

Yes, the whole single hierarchy thing makes doing away with the double
accounting possible.

> * It is bothersome that we're walking up the tree each time for
>   cpuacct although being percpu && just walking up the tree makes it
>   relatively cheap.

So even if it's only CPU-local accounting, you still have all the pointer
chasing and misses, not to mention that a faster O(depth) is still
O(depth).

>		Anyways, I'm thinking about shifting the
>   aggregation to the reader side so that the hot path always only
>   updates local counters in a way which can scale even when there are
>   a lot of (idle) cgroups.  Will follow up on this later.

Not entirely sure I follow, we currently only update the current cgroup
and its immediate parents, no? Or are you looking to only account into
the current cgroup and propagate into the parents on reading?


* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
@ 2017-08-02 15:41           ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-08-02 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Tue, Aug 01, 2017 at 11:40:38PM +0200, Peter Zijlstra wrote:
> > * On cgroup2, there is only one hierarchy.  It'd be great to have
> >   basic resource accounting enabled by default on all cgroups.  Note
> >   that we couldn't do that on v1 because there could be any number of
> >   hierarchies and the cost would increase with the number of
> >   hierarchies.
> 
> Yes, the whole single hierarchy thing makes doing away with the double
> accounting possible.

Yeah, we can either do that or make it cheaper so that we can have
basic stats by default.

> > * It is bothersome that we're walking up the tree each time for
> >   cpuacct although being percpu && just walking up the tree makes it
> >   relatively cheap.
> 
> So even if it's only CPU-local accounting, you still have all the pointer
> chasing and misses, not to mention that a faster O(depth) is still
> O(depth).
> 
> >		Anyways, I'm thinking about shifting the
> >   aggregation to the reader side so that the hot path always only
> >   updates local counters in a way which can scale even when there are
> >   a lot of (idle) cgroups.  Will follow up on this later.
> 
> Not entirely sure I follow, we currently only update the current cgroup
> and its immediate parents, no? Or are you looking to only account into
> the current cgroup and propagate into the parents on reading?

Yeah, shifting the cost to the readers and being smart with
propagation so that reading isn't O(nr_descendants) but
O(nr_descendants_which_have_run_since_last_read).  That way, we can
show the basic stats with reasonable scalability and without taxing
the hot paths.
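
[Editor's note: a toy sketch of the reader-side aggregation described
above, to make the idea concrete.  It is an illustration only, not the
eventual kernel implementation: locking, real per-cpu allocation and
the bookkeeping that limits a read to groups which have actually run
since the last read are all elided.]

  #include <stdint.h>
  #include <stdio.h>

  #define NR_CPUS 8

  struct grp_stat {
  	uint64_t percpu[NR_CPUS];	/* hot path touches only this */
  	uint64_t usage;			/* folded in on read */
  	struct grp_stat *parent;
  };

  /* hot path: purely CPU-local, no walk up the hierarchy */
  static void grp_charge(struct grp_stat *g, int cpu, uint64_t delta)
  {
  	g->percpu[cpu] += delta;
  }

  /* reader: fold per-cpu deltas and propagate them up the tree */
  static uint64_t grp_read(struct grp_stat *g)
  {
  	struct grp_stat *p;
  	int cpu;

  	for (cpu = 0; cpu < NR_CPUS; cpu++) {
  		uint64_t delta = g->percpu[cpu];

  		g->percpu[cpu] = 0;
  		for (p = g; p; p = p->parent)
  			p->usage += delta;
  	}
  	return g->usage;
  }

  int main(void)
  {
  	struct grp_stat root = { 0 };
  	struct grp_stat child = { .parent = &root };

  	grp_charge(&child, 1, 1000);	/* hot path: one local add */
  	printf("child %llu\n", (unsigned long long)grp_read(&child));
  	printf("root  %llu\n", (unsigned long long)grp_read(&root));
  	return 0;
  }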

I have a couple of questions about cpuacct, though.

* The stat file is sampling based and the usage files are calculated
  from actual scheduling events.  Is this because the latter is more
  accurate?

* Why do we have user/sys breakdown in usage numbers?  It tries to
  distinguish user or sys by looking at task_pt_regs().  I can't see
  how this would work (e.g. interrupt handlers never schedule) and w/o
  kernel preemption, the sys part is always zero.  What is this number
  supposed to mean?

Thanks.

-- 
tejun


* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
  2017-08-02 15:41           ` Tejun Heo
@ 2017-08-02 16:05           ` Peter Zijlstra
  2017-08-02 18:16             ` Tejun Heo
  -1 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2017-08-02 16:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:
> > Not entirely sure I follow, we currently only update the current cgroup
> > and its immediate parents, no? Or are you looking to only account into
> > the current cgroup and propagate into the parents on reading?
> 
> Yeah, shifting the cost to the readers and being smart with
> propagation so that reading isn't O(nr_descendants) but
> O(nr_descendants_which_have_run_since_last_read).  That way, we can
> show the basic stats without taxing the hot paths with reasonable
> scalability.

Right, that would be good.

> I have a couple questions about cpuacct tho.
> 
> * The stat file is sampling based and the usage files are calculated
>   from actual scheduling events.  Is this because the latter is more
>   accurate?

So I actually don't know the history of this stuff too well. But I would
think so. This all looks rather dodgy.

> * Why do we have user/sys breakdown in usage numbers?  It tries to
>   distinguish user or sys by looking at task_pt_regs().  I can't see
>   how this would work (e.g. interrupt handlers never schedule) and w/o
>   kernel preemption, the sys part is always zero.  What is this number
>   supposed to mean?

For normal scheduler stuff we account the total runtime in ns and use
the user/kernel tick samples to divide it into user/kernel time parts.
See cputime_adjust().
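
[Editor's note: a stripped-down sketch of the proportional split that
cputime_adjust() performs, for illustration only.  The real function
also guarantees monotonicity across reads and handles more edge
cases.]

  #include <stdint.h>
  #include <stdio.h>

  /* split precise ns runtime using the utime/stime tick-sample ratio */
  static void split_runtime(uint64_t runtime_ns,
  			  uint64_t utime_ticks, uint64_t stime_ticks,
  			  uint64_t *user_ns, uint64_t *sys_ns)
  {
  	uint64_t total = utime_ticks + stime_ticks;

  	if (!total) {			/* no samples yet */
  		*user_ns = runtime_ns;
  		*sys_ns = 0;
  		return;
  	}
  	/* a real implementation must keep this multiply from overflowing */
  	*sys_ns = runtime_ns * stime_ticks / total;
  	*user_ns = runtime_ns - *sys_ns;
  }

  int main(void)
  {
  	uint64_t user, sys;

  	/* 10ms of runtime; 3 user ticks vs 1 system tick observed */
  	split_runtime(10000000, 3, 1, &user, &sys);
  	printf("user %llu ns, sys %llu ns\n",
  	       (unsigned long long)user, (unsigned long long)sys);
  	return 0;
  }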

But looking at the cpuacct I have no clue, that looks wonky at best.

Ideally we'd reuse the normal cputime code and do the same thing
per-cgroup, but clearly that isn't happening now.

I never really looked further than that cpuacct_charge() doing _another_
cgroup iteration, even though we already account that delta to each
cgroup (modulo scheduling class crud).


* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
  2017-08-02 16:05           ` Peter Zijlstra
@ 2017-08-02 18:16             ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-08-02 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Wed, Aug 02, 2017 at 06:05:11PM +0200, Peter Zijlstra wrote:
> > * The stat file is sampling based and the usage files are calculated
> >   from actual scheduling events.  Is this because the latter is more
> >   accurate?
> 
> So I actually don't know the history of this stuff too well. But I would
> think so. This all looks rather dodgy.

I see.

> > * Why do we have user/sys breakdown in usage numbers?  It tries to
> >   distinguish user or sys by looking at task_pt_regs().  I can't see
> >   how this would work (e.g. interrupt handlers never schedule) and w/o
> >   kernel preemption, the sys part is always zero.  What is this number
> >   supposed to mean?
> 
> For normal scheduler stuff we account the total runtime in ns and use
> the user/kernel tick samples to divide it into user/kernel time parts.
> See cputime_adjust().
> 
> But looking at the cpuacct I have no clue, that looks wonky at best.
> 
> Ideally we'd reuse the normal cputime code and do the same thing
> per-cgroup, but clearly that isn't happening now.
> 
> I never really looked further than that cpuacct_charge() doing _another_
> cgroup iteration, even though we already account that delta to each
> cgroup (modulo scheduling class crud).

Yeah, it's kinda silly.  I'll see if I can just kill cpuacct for
cgroup2.

Thanks.

-- 
tejun

