* [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller
From: Tejun Heo @ 2017-07-20 18:48 UTC
To: lizefan, hannes, peterz, mingo, longman
Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro
Hello,
These are two patches implementing the cgroup2 CPU controller
interface. They are split out from the previous combined cgroup2
threaded mode + CPU controller patchset [L].
* "cpu.weight.nice" added, which allows reading and setting weights
  using nice values so that it's easier to configure cgroups when
  they're competing against threads in threaded subtrees.
* Doc and other misc updates.
0001-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch
0002-sched-Implement-interface-for-cgroup-unified-hierarc.patch
The patchset is on top of the "cgroup: implement cgroup2 thread mode, v4"
patchset [T] and is also available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-cpu-on-v4
Documentation/cgroup-v2.txt | 36 ++------
kernel/sched/core.c | 186 +++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/cpuacct.c | 53 +++++++++---
kernel/sched/cpuacct.h | 5 +
4 files changed, 241 insertions(+), 39 deletions(-)
--
tejun
[L] http://lkml.kernel.org/r/20170610140351.10703-1-tj@kernel.org
[T] http://lkml.kernel.org/r/20170719194439.2915602-1-tj@kernel.org
* [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
From: Tejun Heo @ 2017-07-20 18:48 UTC
To: lizefan, hannes, peterz, mingo, longman
Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
guro, Tejun Heo
Make the following changes in preparation for the cpu controller
interface implementation for cgroup2. This patch doesn't cause any
functional differences.
* s/cpu_stats_show()/cpu_cfs_stats_show()/
* s/cpu_files/cpu_legacy_files/
* Separate out cpuacct_stats_read() from cpuacct_stats_show(). While
at it, make the @val array u64 for consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/core.c | 8 ++++----
kernel/sched/cpuacct.c | 29 ++++++++++++++++++-----------
2 files changed, 22 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17c667b..71a060f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,7 +6578,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
return ret;
}
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
{
struct task_group *tg = css_tg(seq_css(sf));
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -6618,7 +6618,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
}
#endif /* CONFIG_RT_GROUP_SCHED */
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
.name = "shares",
@@ -6639,7 +6639,7 @@ static struct cftype cpu_files[] = {
},
{
.name = "stat",
- .seq_show = cpu_stats_show,
+ .seq_show = cpu_cfs_stats_show,
},
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -6665,7 +6665,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
- .legacy_cftypes = cpu_files,
+ .legacy_cftypes = cpu_legacy_files,
.early_init = true,
};
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f95ab29..6151c23 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -276,26 +276,33 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
return 0;
}
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca,
+ u64 (*val)[CPUACCT_STAT_NSTATS])
{
- struct cpuacct *ca = css_ca(seq_css(sf));
- s64 val[CPUACCT_STAT_NSTATS];
int cpu;
- int stat;
- memset(val, 0, sizeof(val));
+ memset(val, 0, sizeof(*val));
+
for_each_possible_cpu(cpu) {
u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
- val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
- val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
- val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
- val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
- val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+ (*val)[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
+ (*val)[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
+ (*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
+ (*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
+ (*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
}
+}
+
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+ u64 val[CPUACCT_STAT_NSTATS];
+ int stat;
+
+ cpuacct_stats_read(css_ca(seq_css(sf)), &val);
for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
- seq_printf(sf, "%s %lld\n",
+ seq_printf(sf, "%s %llu\n",
cpuacct_stat_desc[stat],
(long long)nsec_to_clock_t(val[stat]));
}
--
2.9.3
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Tejun Heo @ 2017-07-20 18:48 UTC
To: lizefan, hannes, peterz, mingo, longman
Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
guro, Tejun Heo
There are a couple of interface issues which can be addressed in the
cgroup2 interface.
* Stats from cpuacct being reported separately from the cpu stats.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds, while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements the cpu controller's interface on cgroup2,
adhering to the controller file conventions described in
Documentation/cgroup-v2.txt. Overall, the following changes
are made.
* cpuacct is implicitly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it will be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the original scheduler weight range, the dead zones
on both sides are relatively small and the mapped range is still wider
than what the nice value mappings cover. This file doesn't make sense
in the root cgroup and isn't created on root.
* "cpu.weight.nice" is added. When read, it reads back the nice value
which is closest to the current "cpu.weight". When written, it sets
"cpu.weight" to the weight value which matches the nice value. This
makes it easy to configure cgroups when they're competing against
threads in threaded subtrees.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
"cpu.rt.max" which contains both runtime and period.
v3: - Added "cpu.weight.nice" to allow using nice values when
configuring the weight. The feature was requested by PeterZ.
- Merged the patch to enable threaded support on cpu and cpuacct.
- Dropped the bits about getting rid of cpuacct from patch
description as there is a pretty strong case for making cpuacct
an implicit controller so that basic cpu usage stats are always
available.
- Documentation updated accordingly. "cpu.rt.max" section is
dropped for now.
v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
for CFS bandwidth stats and also using raw division for u64.
Use CONFIG_CFS_BANDWIDTH and do_div() instead. "cpu.rt.max" is
not included yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroup-v2.txt | 36 +++------
kernel/sched/core.c | 178 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/cpuacct.c | 24 ++++++
kernel/sched/cpuacct.h | 5 ++
4 files changed, 219 insertions(+), 24 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cb9ea28..43f9811 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -860,10 +860,6 @@ Controllers
CPU
---
-.. note::
-
- The interface for the cpu controller hasn't been merged yet
-
The "cpu" controllers regulates distribution of CPU cycles. This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
@@ -893,6 +889,18 @@ All time durations are in microseconds.
The weight in the range [1, 10000].
+ cpu.weight.nice
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ The nice value is in the range [-20, 19].
+
+ This interface file is an alternative interface for
+ "cpu.weight" and allows reading and setting weight using the
+ same values used by nice(2). Because the range is smaller and
+ granularity is coarser for the nice values, the read value is
+ the closest approximation of the current weight.
+
cpu.max
A read-write two value file which exists on non-root cgroups.
The default is "max 100000".
@@ -905,26 +913,6 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
- cpu.rt.max
- .. note::
-
- The semantics of this file is still under discussion and the
- interface hasn't been merged yet
-
- A read-write two value file which exists on all cgroups.
- The default is "0 100000".
-
- The maximum realtime runtime allocation. Over-committing
- configurations are disallowed and process migrations are
- rejected if not enough bandwidth is available. It's in the
- following format::
-
- $MAX $PERIOD
-
- which indicates that the group may consume upto $MAX in each
- $PERIOD duration. If only one number is written, $MAX is
- updated.
-
Memory
------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71a060f..9766670 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6657,6 +6657,175 @@ static struct cftype cpu_legacy_files[] = {
{ } /* Terminate */
};
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+ cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ struct task_group *tg = css_tg(seq_css(sf));
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 throttled_usec;
+
+ throttled_usec = cfs_b->throttled_time;
+ do_div(throttled_usec, NSEC_PER_USEC);
+
+ seq_printf(sf, "nr_periods %d\n"
+ "nr_throttled %d\n"
+ "throttled_usec %llu\n",
+ cfs_b->nr_periods, cfs_b->nr_throttled,
+ throttled_usec);
+ }
+#endif
+ return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+ u64 weight = scale_load_down(tg->shares);
+
+ return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 weight)
+{
+ /*
+ * cgroup weight knobs should use the common MIN, DFL and MAX
+ * values which are 1, 100 and 10000 respectively. While it loses
+ * a bit of range on both ends, it maps pretty well onto the shares
+ * value used by scheduler and the round-trip conversions preserve
+ * the original value over the entire range.
+ */
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+
+static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ unsigned long weight = scale_load_down(css_tg(css)->shares);
+ int last_delta = INT_MAX;
+ int prio, delta;
+
+ /* find the closest nice value to the current weight */
+ for (prio = 0; prio < ARRAY_SIZE(sched_prio_to_weight); prio++) {
+ delta = abs(sched_prio_to_weight[prio] - weight);
+ if (delta >= last_delta)
+ break;
+ last_delta = delta;
+ }
+
+ return PRIO_TO_NICE(prio - 1 + MAX_RT_PRIO);
+}
+
+static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft, s64 nice)
+{
+ unsigned long weight;
+
+ if (nice < MIN_NICE || nice > MAX_NICE)
+ return -ERANGE;
+
+ weight = sched_prio_to_weight[NICE_TO_PRIO(nice) - MAX_RT_PRIO];
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+ long period, long quota)
+{
+ if (quota < 0)
+ seq_puts(sf, "max");
+ else
+ seq_printf(sf, "%ld", quota);
+
+ seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+ u64 *periodp, u64 *quotap)
+{
+ char tok[21]; /* U64_MAX */
+
+ if (!sscanf(buf, "%s %llu", tok, periodp))
+ return -EINVAL;
+
+ *periodp *= NSEC_PER_USEC;
+
+ if (sscanf(tok, "%llu", quotap))
+ *quotap *= NSEC_PER_USEC;
+ else if (!strcmp(tok, "max"))
+ *quotap = RUNTIME_INF;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period = tg_get_cfs_period(tg);
+ u64 quota;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period, "a);
+ if (!ret)
+ ret = tg_set_cfs_bandwidth(tg, period, quota);
+ return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_stats_show,
+ },
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_weight_read_u64,
+ .write_u64 = cpu_weight_write_u64,
+ },
+ {
+ .name = "weight.nice",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_s64 = cpu_weight_nice_read_s64,
+ .write_s64 = cpu_weight_nice_write_s64,
+ },
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_max_show,
+ .write = cpu_max_write,
+ },
+#endif
+ { } /* terminate */
+};
+
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
@@ -6666,7 +6835,16 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
+ .dfl_cftypes = cpu_files,
.early_init = true,
+ .threaded = true,
+#ifdef CONFIG_CGROUP_CPUACCT
+ /*
+ * cpuacct is enabled together with cpu on the unified hierarchy
+ * and its stats are reported through "cpu.stat".
+ */
+ .depends_on = 1 << cpuacct_cgrp_id,
+#endif
};
#endif /* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 6151c23..ee4976a 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -347,6 +347,29 @@ static struct cftype files[] = {
{ } /* terminate */
};
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+ struct cgroup_subsys_state *css;
+ u64 usage, val[CPUACCT_STAT_NSTATS];
+
+ css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+ usage = cpuusage_read(css, seq_cft(sf));
+ cpuacct_stats_read(css_ca(css), &val);
+
+ do_div(usage, NSEC_PER_USEC);
+ do_div(val[CPUACCT_STAT_USER], NSEC_PER_USEC);
+ do_div(val[CPUACCT_STAT_SYSTEM], NSEC_PER_USEC);
+
+ seq_printf(sf, "usage_usec %llu\n"
+ "user_usec %llu\n"
+ "system_usec %llu\n",
+ usage, val[CPUACCT_STAT_USER], val[CPUACCT_STAT_SYSTEM]);
+
+ css_put(css);
+}
+
/*
* charge this task's execution time to its accounting group.
*
@@ -389,4 +412,5 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
.css_free = cpuacct_css_free,
.legacy_cftypes = files,
.early_init = true,
+ .threaded = true,
};
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
#else
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
{
}
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
#endif
--
2.9.3
* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Peter Zijlstra @ 2017-07-29 9:17 UTC
To: Tejun Heo
Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
kernel-team, pjt, luto, efault, torvalds, guro
On Thu, Jul 20, 2017 at 02:48:08PM -0400, Tejun Heo wrote:
> There are a couple interface issues which can be addressed in cgroup2
> interface.
>
> * Stats from cpuacct being reported separately from the cpu stats.
>
> * Use of different time units. Writable control knobs use
> microseconds, some stat fields use nanoseconds while other cpuacct
> stat fields use centiseconds.
>
> * Control knobs which can't be used in the root cgroup still show up
> in the root.
>
> * Control knob names and semantics aren't consistent with other
> controllers.
>
> This patchset implements cpu controller's interface on cgroup2 which
> adheres to the controller file conventions described in
> Documentation/cgroups/cgroup-v2.txt. Overall, the following changes
> are made.
>
> * cpuacct is implictly enabled and disabled by cpu and its information
> is reported through "cpu.stat" which now uses microseconds for all
> time durations. All time duration fields now have "_usec" appended
> to them for clarity.
>
> Note that cpuacct.usage_percpu is currently not included in
> "cpu.stat". If this information is actually called for, it will be
> added later.
>
> * "cpu.shares" is replaced with "cpu.weight" and operates on the
> standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
> The weight is scaled to scheduler weight so that 100 maps to 1024
> and the ratio relationship is preserved - if weight is W and its
> scaled value is S, W / 100 == S / 1024. While the mapped range is a
> bit smaller than the orignal scheduler weight range, the dead zones
> on both sides are relatively small and covers wider range than the
> nice value mappings. This file doesn't make sense in the root
> cgroup and isn't create on root.
s/create/&d/
Thanks!
> * "cpu.weight.nice" is added. When read, it reads back the nice value
> which is closest to the current "cpu.weight". When written, it sets
> "cpu.weight" to the weight value which matches the nice value. This
> makes it easy to configure cgroups when they're competing against
> threads in threaded subtrees.
>
> * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
> which contains both quota and period.
>
> * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
> "cpu.rt.max" which contains both runtime and period.
So we've been looking at overhauling the whole RT stuff. But sadly we've
not been able to find something that works with all the legacy
constraints (like RT tasks having arbitrary affinities).
Let's just hope we can preserve this interface :/
>
> v3: - Added "cpu.weight.nice" to allow using nice values when
> configuring the weight. The feature is requested by PeterZ.
> - Merge the patch to enable threaded support on cpu and cpuacct.
> - Dropped the bits about getting rid of cpuacct from patch
> description as there is a pretty strong case for making cpuacct
> an implicit controller so that basic cpu usage stats are always
> available.
What about the whole double accounting thing? Because currently cpuacct
and cpu do a fair bit of duplication. It would be very good to get rid
of that.
> - Documentation updated accordingly. "cpu.rt.max" section is
> dropped for now.
* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Tejun Heo @ 2017-08-01 20:17 UTC
To: Peter Zijlstra
Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
kernel-team, pjt, luto, efault, torvalds, guro
Hello, Peter.
On Sat, Jul 29, 2017 at 11:17:07AM +0200, Peter Zijlstra wrote:
> > * "cpu.shares" is replaced with "cpu.weight" and operates on the
> > standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
> > The weight is scaled to scheduler weight so that 100 maps to 1024
> > and the ratio relationship is preserved - if weight is W and its
> > scaled value is S, W / 100 == S / 1024. While the mapped range is a
> > bit smaller than the orignal scheduler weight range, the dead zones
> > on both sides are relatively small and covers wider range than the
> > nice value mappings. This file doesn't make sense in the root
> > cgroup and isn't create on root.
>
> s/create/&d/
Updated, thanks.
> > * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
> > "cpu.rt.max" which contains both runtime and period.
>
> So we've been looking at overhauling the whole RT stuff. But sadly we've
> not been able to find something that works with all the legacy
> constraints (like RT tasks having arbitrary affinities).
>
> Lets just hope we can preserve this interface :/
Ah, should have dropped this from the description. Yeah, we can wait
till the RT side settles down and go for a better matching interface
as necessary.
> > v3: - Added "cpu.weight.nice" to allow using nice values when
> > configuring the weight. The feature is requested by PeterZ.
> > - Merge the patch to enable threaded support on cpu and cpuacct.
>
> > - Dropped the bits about getting rid of cpuacct from patch
> > description as there is a pretty strong case for making cpuacct
> > an implicit controller so that basic cpu usage stats are always
> > available.
>
> What about the whole double accounting thing? Because currently cpuacct
> and cpu do a fair bit of duplication. It would be very good to get rid
> of that.
I'm not that sure at this point. Here are my current thoughts on
cpuacct.
* It is useful to have basic cpu statistics on cgroup without having
to enable the cpu controller, especially because enabling cpu
controller always changes how cpu cycles are distributed and
currently comes at some performance overhead.
* On cgroup2, there is only one hierarchy. It'd be great to have
basic resource accounting enabled by default on all cgroups. Note
that we couldn't do that on v1 because there could be any number of
hierarchies and the cost would increase with the number of
hierarchies.
* It is bothersome that we're walking up the tree each time for
cpuacct although being percpu && just walking up the tree makes it
relatively cheap. Anyways, I'm thinking about shifting the
aggregation to the reader side so that the hot path always only
updates local counters in a way which can scale even when there are
a lot of (idle) cgroups. Will follow up on this later.
Thanks.
--
tejun
* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Peter Zijlstra @ 2017-08-01 21:40 UTC
To: Tejun Heo
Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
kernel-team, pjt, luto, efault, torvalds, guro
On Tue, Aug 01, 2017 at 01:17:45PM -0700, Tejun Heo wrote:
> > What about the whole double accounting thing? Because currently cpuacct
> > and cpu do a fair bit of duplication. It would be very good to get rid
> > of that.
>
> I'm not that sure at this point. Here are my current thoughts on
> cpuacct.
>
> * It is useful to have basic cpu statistics on cgroup without having
> to enable the cpu controller, especially because enabling cpu
> controller always changes how cpu cycles are distributed and
> currently comes at some performance overhead.
>
> * On cgroup2, there is only one hierarchy. It'd be great to have
> basic resource accounting enabled by default on all cgroups. Note
> that we couldn't do that on v1 because there could be any number of
> hierarchies and the cost would increase with the number of
> hierarchies.
Yes, the whole single hierarchy thing makes doing away with the double
accounting possible.
> * It is bothersome that we're walking up the tree each time for
> cpuacct although being percpu && just walking up the tree makes it
> relatively cheap.
So even if it's only CPU local accounting, you still have all the pointer
chasing and misses, not to mention that a faster O(depth) is still
O(depth).
> Anyways, I'm thinking about shifting the
> aggregation to the reader side so that the hot path always only
> updates local counters in a way which can scale even when there are
> a lot of (idle) cgroups. Will follow up on this later.
Not entirely sure I follow; we currently only update the current cgroup
and its immediate parents, no? Or are you looking to only account into
the current cgroup and propagate into the parents on reading?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
@ 2017-08-02 15:41 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-08-02 15:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
kernel-team, pjt, luto, efault, torvalds, guro
Hello, Peter.
On Tue, Aug 01, 2017 at 11:40:38PM +0200, Peter Zijlstra wrote:
> > * On cgroup2, there is only one hierarchy. It'd be great to have
> > basic resource accounting enabled by default on all cgroups. Note
> > that we couldn't do that on v1 because there could be any number of
> > hierarchies and the cost would increase with the number of
> > hierarchies.
>
> Yes, the whole single hierarchy thing makes doing away with the double
> accounting possible.
Yeah, we can either do that or make it cheaper so that we can have
basic stats by default.
> > * It is bothersome that we're walking up the tree each time for
> > cpuacct although being percpu && just walking up the tree makes it
> > relatively cheap.
>
> So even if it's only CPU local accounting, you still have all the pointer
> chasing and misses, not to mention that a faster O(depth) is still
> O(depth).
>
> > Anyways, I'm thinking about shifting the
> > aggregation to the reader side so that the hot path always only
> > updates local counters in a way which can scale even when there are
> > a lot of (idle) cgroups. Will follow up on this later.
>
> Not entirely sure I follow, we currently only update the current cgroup
> and its immediate parents, no? Or are you looking to only account into
> the current cgroup and propagate into the parents on reading?
Yeah, shifting the cost to the readers and being smart with
propagation so that reading isn't O(nr_descendants) but
O(nr_descendants_which_have_run_since_last_read). That way, we can
show the basic stats with reasonable scalability without taxing the
hot paths.
I have a couple questions about cpuacct tho.
* The stat file is sampling based and the usage files are calculated
from actual scheduling events. Is this because the latter is more
accurate?
* Why do we have user/sys breakdown in usage numbers? It tries to
distinguish user or sys by looking at task_pt_regs(). I can't see
how this would work (e.g. interrupt handlers never schedule) and w/o
kernel preemption, the sys part is always zero. What is this number
supposed to mean?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
2017-08-02 15:41 ` Tejun Heo
(?)
@ 2017-08-02 16:05 ` Peter Zijlstra
2017-08-02 18:16 ` Tejun Heo
-1 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-08-02 16:05 UTC (permalink / raw)
To: Tejun Heo
Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
kernel-team, pjt, luto, efault, torvalds, guro
On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:
> > Not entirely sure I follow, we currently only update the current cgroup
> > and its immediate parents, no? Or are you looking to only account into
> > the current cgroup and propagate into the parents on reading?
>
> Yeah, shifting the cost to the readers and being smart with
> propagation so that reading isn't O(nr_descendants) but
> O(nr_descendants_which_have_run_since_last_read). That way, we can
> show the basic stats with reasonable scalability without taxing the
> hot paths.
Right, that would be good.
> I have a couple questions about cpuacct tho.
>
> * The stat file is sampling based and the usage files are calculated
> from actual scheduling events. Is this because the latter is more
> accurate?
So I actually don't know the history of this stuff too well. But I would
think so. This all looks rather dodgy.
> * Why do we have user/sys breakdown in usage numbers? It tries to
> distinguish user or sys by looking at task_pt_regs(). I can't see
> how this would work (e.g. interrupt handlers never schedule) and w/o
> kernel preemption, the sys part is always zero. What is this number
> supposed to mean?
For normal scheduler stuff we account the total runtime in ns and use
the user/kernel tick samples to divide it into user/kernel time parts.
See cputime_adjust().
But looking at the cpuacct I have no clue, that looks wonky at best.
Ideally we'd reuse the normal cputime code and do the same thing
per-cgroup, but clearly that isn't happening now.
I never really looked further than that cpuacct_charge() doing _another_
cgroup iteration, even though we already account that delta to each
cgroup (modulo scheduling class crud).
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
2017-08-02 16:05 ` Peter Zijlstra
@ 2017-08-02 18:16 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-08-02 18:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
kernel-team, pjt, luto, efault, torvalds, guro
Hello, Peter.
On Wed, Aug 02, 2017 at 06:05:11PM +0200, Peter Zijlstra wrote:
> > * The stat file is sampling based and the usage files are calculated
> > from actual scheduling events. Is this because the latter is more
> > accurate?
>
> So I actually don't know the history of this stuff too well. But I would
> think so. This all looks rather dodgy.
I see.
> > * Why do we have user/sys breakdown in usage numbers? It tries to
> > distinguish user or sys by looking at task_pt_regs(). I can't see
> > how this would work (e.g. interrupt handlers never schedule) and w/o
> > kernel preemption, the sys part is always zero. What is this number
> > supposed to mean?
>
> For normal scheduler stuff we account the total runtime in ns and use
> the user/kernel tick samples to divide it into user/kernel time parts.
> See cputime_adjust().
>
> But looking at the cpuacct I have no clue, that looks wonky at best.
>
> Ideally we'd reuse the normal cputime code and do the same thing
> per-cgroup, but clearly that isn't happening now.
>
> I never really looked further than that cpuacct_charge() doing _another_
> cgroup iteration, even though we already account that delta to each
> cgroup (modulo scheduling class crud).
Yeah, it's kinda silly. I'll see if I can just kill cpuacct for
cgroup2.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller
@ 2017-07-27 19:51 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-07-27 19:51 UTC (permalink / raw)
To: lizefan, hannes, peterz, mingo, longman
Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro
Hello,
On Thu, Jul 20, 2017 at 02:48:06PM -0400, Tejun Heo wrote:
> These are two patches to implement the cgroup2 CPU controller
> interface. It's separated from the [L] previous version of combined
> cgroup2 threaded mode + CPU controller patchset.
>
> * "cpu.weight.nice" added which allows reading and setting weights
> using nice values so that it's easier to configure when threaded
> cgroups are competing against threads.
>
> * Doc and other misc updates.
>
> 0001-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch
> 0002-sched-Implement-interface-for-cgroup-unified-hierarc.patch
Ping. Any comments on the proposed interface?
Thanks!
--
tejun
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCHSET REPOST for-4.15] cgroup, sched: cgroup2 interface for CPU controller (on basic acct)
@ 2017-09-25 16:00 Tejun Heo
2017-09-25 16:00 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-09-25 16:00 UTC (permalink / raw)
To: lizefan, hannes, peterz, mingo
Cc: longman, cgroups, linux-kernel, kernel-team, pjt, luto, efault,
torvalds, guro
Hello,
(Rebased on cgroup/for-4.15, no real changes. Once acked, I think
it'd be easiest to route these through the cgroup branch;
alternatively, we can pull cgroup/for-4.15 to a sched branch and
apply these there.)
These are two patches to implement the cgroup2 CPU controller
interface. Changes from the last revision[L].
* Updated on top of the cgroup2 basic resource accounting patchset[B]
and now uses cgroup2's CPU accounting instead of cpuacct as stat
source.
* cpuacct is no longer necessary and not converted.
This patchset contains the following two patches.
0001-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch
0002-sched-Implement-interface-for-cgroup-unified-hierarc.patch
It's on top of the cgroup2 basic resource accounting patchset, which
has been merged to cgroup/for-4.15, and available in the following git
branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-cpu-on-basic-acct
diffstat follows. Thanks.
Documentation/cgroup-v2.txt | 36 ++------
kernel/sched/core.c | 179 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 187 insertions(+), 28 deletions(-)
--
tejun
[L] http://lkml.kernel.org/r/20170720184808.1433868-1-tj@kernel.org
[B] http://lkml.kernel.org/r/20170811163754.3939102-1-tj@kernel.org
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
2017-09-25 16:00 [PATCHSET REPOST for-4.15] cgroup, sched: cgroup2 interface for CPU controller (on basic acct) Tejun Heo
@ 2017-09-25 16:00 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-09-25 16:00 UTC (permalink / raw)
To: lizefan, hannes, peterz, mingo
Cc: longman, cgroups, linux-kernel, kernel-team, pjt, luto, efault,
torvalds, guro, Tejun Heo
There are a couple of interface issues which can be addressed in the
cgroup2 interface.
* Stats from cpuacct being reported separately from the cpu stats.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements the cpu controller's interface on cgroup2, which
adheres to the controller file conventions described in
Documentation/cgroups/cgroup-v2.txt. Overall, the following changes
are made.
* cpuacct is implicitly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it will be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the original scheduler weight range, the dead zones
on both sides are relatively small and cover a wider range than the
nice value mappings. This file doesn't make sense in the root
cgroup and isn't created on root.
* "cpu.weight.nice" is added. When read, it reads back the nice value
which is closest to the current "cpu.weight". When written, it sets
"cpu.weight" to the weight value which matches the nice value. This
makes it easy to configure cgroups when they're competing against
threads in threaded subtrees.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
v4: - Use cgroup2 basic usage stat as the information source instead
of cpuacct.
v3: - Added "cpu.weight.nice" to allow using nice values when
configuring the weight. The feature is requested by PeterZ.
- Merge the patch to enable threaded support on cpu and cpuacct.
- Dropped the bits about getting rid of cpuacct from patch
description as there is a pretty strong case for making cpuacct
an implicit controller so that basic cpu usage stats are always
available.
- Documentation updated accordingly. "cpu.rt.max" section is
dropped for now.
v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
for CFS bandwidth stats and also using raw division for u64.
Use CONFIG_CFS_BANDWIDTH and do_div() instead. "cpu.rt.max" is
not included yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroup-v2.txt | 36 ++++------
kernel/sched/core.c | 171 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 183 insertions(+), 24 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f82169..0bbdc72 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -902,10 +902,6 @@ Controllers
CPU
---
-.. note::
-
- The interface for the cpu controller hasn't been merged yet
-
The "cpu" controllers regulates distribution of CPU cycles. This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
@@ -935,6 +931,18 @@ All time durations are in microseconds.
The weight in the range [1, 10000].
+ cpu.weight.nice
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ The nice value is in the range [-20, 19].
+
+ This interface file is an alternative interface for
+ "cpu.weight" and allows reading and setting weight using the
+ same values used by nice(2). Because the range is smaller and
+ granularity is coarser for the nice values, the read value is
+ the closest approximation of the current weight.
+
cpu.max
A read-write two value file which exists on non-root cgroups.
The default is "max 100000".
@@ -947,26 +955,6 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
- cpu.rt.max
- .. note::
-
- The semantics of this file is still under discussion and the
- interface hasn't been merged yet
-
- A read-write two value file which exists on all cgroups.
- The default is "0 100000".
-
- The maximum realtime runtime allocation. Over-committing
- configurations are disallowed and process migrations are
- rejected if not enough bandwidth is available. It's in the
- following format::
-
- $MAX $PERIOD
-
- which indicates that the group may consume upto $MAX in each
- $PERIOD duration. If only one number is written, $MAX is
- updated.
-
Memory
------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6815fa4..ad25516 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6678,6 +6678,175 @@ static struct cftype cpu_legacy_files[] = {
{ } /* Terminate */
};
+static int cpu_stat_show(struct seq_file *sf, void *v)
+{
+ cgroup_stat_show_cputime(sf, "");
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ struct task_group *tg = css_tg(seq_css(sf));
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 throttled_usec;
+
+ throttled_usec = cfs_b->throttled_time;
+ do_div(throttled_usec, NSEC_PER_USEC);
+
+ seq_printf(sf, "nr_periods %d\n"
+ "nr_throttled %d\n"
+ "throttled_usec %llu\n",
+ cfs_b->nr_periods, cfs_b->nr_throttled,
+ throttled_usec);
+ }
+#endif
+ return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+ u64 weight = scale_load_down(tg->shares);
+
+ return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 weight)
+{
+ /*
+ * cgroup weight knobs should use the common MIN, DFL and MAX
+ * values which are 1, 100 and 10000 respectively. While it loses
+ * a bit of range on both ends, it maps pretty well onto the shares
+ * value used by scheduler and the round-trip conversions preserve
+ * the original value over the entire range.
+ */
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+
+static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ unsigned long weight = scale_load_down(css_tg(css)->shares);
+ int last_delta = INT_MAX;
+ int prio, delta;
+
+ /* find the closest nice value to the current weight */
+ for (prio = 0; prio < ARRAY_SIZE(sched_prio_to_weight); prio++) {
+ delta = abs(sched_prio_to_weight[prio] - weight);
+ if (delta >= last_delta)
+ break;
+ last_delta = delta;
+ }
+
+ return PRIO_TO_NICE(prio - 1 + MAX_RT_PRIO);
+}
+
+static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft, s64 nice)
+{
+ unsigned long weight;
+
+ if (nice < MIN_NICE || nice > MAX_NICE)
+ return -ERANGE;
+
+ weight = sched_prio_to_weight[NICE_TO_PRIO(nice) - MAX_RT_PRIO];
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+ long period, long quota)
+{
+ if (quota < 0)
+ seq_puts(sf, "max");
+ else
+ seq_printf(sf, "%ld", quota);
+
+ seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+ u64 *periodp, u64 *quotap)
+{
+ char tok[21]; /* U64_MAX */
+
+ if (!sscanf(buf, "%s %llu", tok, periodp))
+ return -EINVAL;
+
+ *periodp *= NSEC_PER_USEC;
+
+ if (sscanf(tok, "%llu", quotap))
+ *quotap *= NSEC_PER_USEC;
+ else if (!strcmp(tok, "max"))
+ *quotap = RUNTIME_INF;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period = tg_get_cfs_period(tg);
+ u64 quota;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period, &quota);
+ if (!ret)
+ ret = tg_set_cfs_bandwidth(tg, period, quota);
+ return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_stat_show,
+ },
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_weight_read_u64,
+ .write_u64 = cpu_weight_write_u64,
+ },
+ {
+ .name = "weight.nice",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_s64 = cpu_weight_nice_read_s64,
+ .write_s64 = cpu_weight_nice_write_s64,
+ },
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_max_show,
+ .write = cpu_max_write,
+ },
+#endif
+ { } /* terminate */
+};
+
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
@@ -6687,7 +6856,9 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
+ .dfl_cftypes = cpu_files,
.early_init = true,
+ .threaded = true,
};
#endif /* CONFIG_CGROUP_SCHED */
--
2.9.5
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller (on basic acct)
@ 2017-08-11 16:47 Tejun Heo
2017-08-11 16:47 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-08-11 16:47 UTC (permalink / raw)
To: lizefan, hannes, peterz, mingo, longman
Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro
Hello,
These are two patches to implement the cgroup2 CPU controller
interface. Changes from the last revision[L].
* Updated on top of the cgroup2 basic resource accounting patchset[B]
and now uses cgroup2's CPU accounting instead of cpuacct as stat
source.
* cpuacct is no longer necessary and not converted.
This patchset contains the following two patches.
0001-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch
0002-sched-Implement-interface-for-cgroup-unified-hierarc.patch
It's on top of the cgroup2 basic resource accounting patchset and
available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-cpu-on-basic-acct
diffstat follows. Thanks.
Documentation/cgroup-v2.txt | 36 ++------
kernel/sched/core.c | 179 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 187 insertions(+), 28 deletions(-)
--
tejun
[L] http://lkml.kernel.org/r/20170720184808.1433868-1-tj@kernel.org
[B] http://lkml.kernel.org/r/20170811163754.3939102-1-tj@kernel.org
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
2017-08-11 16:47 [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller (on basic acct) Tejun Heo
@ 2017-08-11 16:47 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-08-11 16:47 UTC (permalink / raw)
To: lizefan, hannes, peterz, mingo, longman
Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
guro, Tejun Heo
There are a couple of interface issues which can be addressed in the
cgroup2 interface.
* Stats from cpuacct being reported separately from the cpu stats.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements the cpu controller's interface on cgroup2, which
adheres to the controller file conventions described in
Documentation/cgroups/cgroup-v2.txt. Overall, the following changes
are made.
* cpuacct is implicitly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it will be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the original scheduler weight range, the dead zones
on both sides are relatively small and cover a wider range than the
nice value mappings. This file doesn't make sense in the root
cgroup and isn't created on root.
* "cpu.weight.nice" is added. When read, it reads back the nice value
which is closest to the current "cpu.weight". When written, it sets
"cpu.weight" to the weight value which matches the nice value. This
makes it easy to configure cgroups when they're competing against
threads in threaded subtrees.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
v4: - Use cgroup2 basic usage stat as the information source instead
of cpuacct.
v3: - Added "cpu.weight.nice" to allow using nice values when
configuring the weight. The feature is requested by PeterZ.
- Merge the patch to enable threaded support on cpu and cpuacct.
- Dropped the bits about getting rid of cpuacct from patch
description as there is a pretty strong case for making cpuacct
an implicit controller so that basic cpu usage stats are always
available.
- Documentation updated accordingly. "cpu.rt.max" section is
dropped for now.
v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
for CFS bandwidth stats and also using raw division for u64.
Use CONFIG_CFS_BANDWIDTH and do_div() instead. "cpu.rt.max" is
not included yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroup-v2.txt | 36 ++++------
kernel/sched/core.c | 171 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 183 insertions(+), 24 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f82169..0bbdc72 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -902,10 +902,6 @@ Controllers
CPU
---
-.. note::
-
- The interface for the cpu controller hasn't been merged yet
-
The "cpu" controllers regulates distribution of CPU cycles. This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
@@ -935,6 +931,18 @@ All time durations are in microseconds.
The weight in the range [1, 10000].
+ cpu.weight.nice
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ The nice value is in the range [-20, 19].
+
+ This interface file is an alternative interface for
+ "cpu.weight" and allows reading and setting weight using the
+ same values used by nice(2). Because the range is smaller and
+ granularity is coarser for the nice values, the read value is
+ the closest approximation of the current weight.
+
cpu.max
A read-write two value file which exists on non-root cgroups.
The default is "max 100000".
@@ -947,26 +955,6 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
- cpu.rt.max
- .. note::
-
- The semantics of this file is still under discussion and the
- interface hasn't been merged yet
-
- A read-write two value file which exists on all cgroups.
- The default is "0 100000".
-
- The maximum realtime runtime allocation. Over-committing
- configurations are disallowed and process migrations are
- rejected if not enough bandwidth is available. It's in the
- following format::
-
- $MAX $PERIOD
-
- which indicates that the group may consume upto $MAX in each
- $PERIOD duration. If only one number is written, $MAX is
- updated.
-
Memory
------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9220391..e37cc97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6657,6 +6657,175 @@ static struct cftype cpu_legacy_files[] = {
{ } /* Terminate */
};
+static int cpu_stat_show(struct seq_file *sf, void *v)
+{
+ cgroup_stat_show_cputime(sf, "");
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ struct task_group *tg = css_tg(seq_css(sf));
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 throttled_usec;
+
+ throttled_usec = cfs_b->throttled_time;
+ do_div(throttled_usec, NSEC_PER_USEC);
+
+ seq_printf(sf, "nr_periods %d\n"
+ "nr_throttled %d\n"
+ "throttled_usec %llu\n",
+ cfs_b->nr_periods, cfs_b->nr_throttled,
+ throttled_usec);
+ }
+#endif
+ return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+ u64 weight = scale_load_down(tg->shares);
+
+ return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 weight)
+{
+ /*
+ * cgroup weight knobs should use the common MIN, DFL and MAX
+ * values which are 1, 100 and 10000 respectively. While it loses
+ * a bit of range on both ends, it maps pretty well onto the shares
+ * value used by scheduler and the round-trip conversions preserve
+ * the original value over the entire range.
+ */
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+
+static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ unsigned long weight = scale_load_down(css_tg(css)->shares);
+ int last_delta = INT_MAX;
+ int prio, delta;
+
+ /* find the closest nice value to the current weight */
+ for (prio = 0; prio < ARRAY_SIZE(sched_prio_to_weight); prio++) {
+ delta = abs(sched_prio_to_weight[prio] - weight);
+ if (delta >= last_delta)
+ break;
+ last_delta = delta;
+ }
+
+ return PRIO_TO_NICE(prio - 1 + MAX_RT_PRIO);
+}
+
+static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft, s64 nice)
+{
+ unsigned long weight;
+
+ if (nice < MIN_NICE || nice > MAX_NICE)
+ return -ERANGE;
+
+ weight = sched_prio_to_weight[NICE_TO_PRIO(nice) - MAX_RT_PRIO];
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+ long period, long quota)
+{
+ if (quota < 0)
+ seq_puts(sf, "max");
+ else
+ seq_printf(sf, "%ld", quota);
+
+ seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+ u64 *periodp, u64 *quotap)
+{
+ char tok[21]; /* U64_MAX */
+
+ if (!sscanf(buf, "%s %llu", tok, periodp))
+ return -EINVAL;
+
+ *periodp *= NSEC_PER_USEC;
+
+ if (sscanf(tok, "%llu", quotap))
+ *quotap *= NSEC_PER_USEC;
+ else if (!strcmp(tok, "max"))
+ *quotap = RUNTIME_INF;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period = tg_get_cfs_period(tg);
+ u64 quota;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period, &quota);
+ if (!ret)
+ ret = tg_set_cfs_bandwidth(tg, period, quota);
+ return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_stat_show,
+ },
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_weight_read_u64,
+ .write_u64 = cpu_weight_write_u64,
+ },
+ {
+ .name = "weight.nice",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_s64 = cpu_weight_nice_read_s64,
+ .write_s64 = cpu_weight_nice_write_s64,
+ },
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_max_show,
+ .write = cpu_max_write,
+ },
+#endif
+ { } /* terminate */
+};
+
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
@@ -6666,7 +6835,9 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
+ .dfl_cftypes = cpu_files,
.early_init = true,
+ .threaded = true,
};
#endif /* CONFIG_CGROUP_SCHED */
--
2.9.3
* [Documentation] State of CPU controller in cgroup v2
@ 2016-08-05 17:07 Tejun Heo
2016-08-05 17:09 ` Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2016-08-05 17:07 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
Cc: linux-kernel, cgroups, linux-api, kernel-team
Hello,
There have been several discussions around CPU controller support.
Unfortunately, no consensus was reached and cgroup v2 is sorely
lacking CPU controller support. This document includes summary of the
situation and arguments along with an interim solution for parties who
want to use the out-of-tree patches for CPU controller cgroup v2
support. I'll post the two patches as replies for reference.
Thanks.
CPU Controller on Control Group v2
August, 2016 Tejun Heo <tj@kernel.org>
While most controllers have support for cgroup v2 now, the CPU
controller support is not upstream yet due to objections from the
scheduler maintainers on the basic designs of cgroup v2. This
document explains the current situation as well as an interim
solution, and details the disagreements and arguments. The latest
version of this document can be found at the following URL.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
CONTENTS
1. Current Situation and Interim Solution
2. Disagreements and Arguments
2-1. Contentious Restrictions
2-1-1. Process Granularity
2-1-2. No Internal Process Constraint
2-2. Impact on CPU Controller
2-2-1. Impact of Process Granularity
2-2-2. Impact of No Internal Process Constraint
2-3. Arguments for cgroup v2
3. Way Forward
4. References
1. Current Situation and Interim Solution
All objections from the scheduler maintainers apply to cgroup v2 core
design, and there are no known objections to the specifics of the CPU
controller cgroup v2 interface. The only blocked part is changes to
expose the CPU controller interface on cgroup v2, which comprises the
following two patches:
[1] sched: Misc preps for cgroup unified hierarchy interface
[2] sched: Implement interface for cgroup unified hierarchy
The necessary changes are superficial and implement the interface
files on cgroup v2. The combined diffstat is as follows.
kernel/sched/core.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/cpuacct.c | 57 ++++++++++++------
kernel/sched/cpuacct.h | 5 +
3 files changed, 189 insertions(+), 22 deletions(-)
The patches are easy to apply and forward-port. The following git
branch will always carry the two patches on top of the latest release
of the upstream kernel.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu
There also are versioned branches going back to v4.4.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER
While it's difficult to tell whether the CPU controller support will
be merged, there are crucial resource control features in cgroup v2
that are only possible due to the design choices that are being
objected to, and every effort will be made to ease enabling the CPU
controller cgroup v2 support out-of-tree for parties which choose to.
2. Disagreements and Arguments
There have been several lengthy discussion threads [3][4] on LKML
around the structural constraints of cgroup v2. The two that affect
the CPU controller are process granularity and no internal process
constraint. Both arise primarily from the need for common resource
domain definition across different resources.
The common resource domain is a powerful concept in cgroup v2 that
allows controllers to make basic assumptions about the structural
organization of processes and controllers inside the cgroup hierarchy,
and thus solve problems spanning multiple types of resources. The
prime example for this is page cache writeback: dirty page cache is
regulated through throttling buffered writers based on memory
availability, and initiating batched write outs to the disk based on
IO capacity. Tracking and controlling writeback inside a cgroup thus
requires the direct cooperation of the memory and the IO controller.
This easily extends to other areas, such as CPU cycles consumed while
performing memory reclaim or IO encryption.
2-1. Contentious Restrictions
For controllers of different resources to work together, they must
agree on a common organization. This uniform model across controllers
imposes two contentious restrictions on the CPU controller: process
granularity and the no-internal-process constraint.
2-1-1. Process Granularity
For memory, because an address space is shared between all threads
of a process, the terminal consumer is a process, not a thread.
Separating the threads of a single process into different memory
control domains doesn't make semantic sense. cgroup v2 ensures
that all controllers can agree on the same organization by requiring
that threads of the same process belong to the same cgroup.
There are other reasons to enforce process granularity. One
important one is isolating system-level management operations from
in-process application operations. The cgroup interface, being a
virtual filesystem, is very unfit for multiple independent
operations taking place at the same time as most operations have to
be multi-step and there is no way to synchronize multiple accessors.
See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"
2-1-2. No Internal Process Constraint
cgroup v2 does not allow processes to belong to any cgroup which has
child cgroups when resource controllers are enabled on it (the
notable exception being the root cgroup itself). This is because,
for some resources, a resource domain (cgroup) is not directly
comparable to the terminal consumer (process/task) of said resource,
and so putting the two into a sibling relationship isn't meaningful.
- Differing Control Parameters and Capabilities
A cgroup controller has different resource control parameters and
capabilities from a terminal consumer, be that a task or process.
There are a couple cases where a cgroup control knob can be mapped
to a per-task or per-process API but they are exceptions and the
mappings aren't obvious even in those cases.
For example, task priorities (also known as nice values) set
through setpriority(2) are mapped to the CPU controller
"cpu.shares" values. However, how exactly the two ranges map and
even the fact that they map to each other at all are not obvious.
The situation gets further muddled when considering other resource
types and control knobs. IO priorities set through ioprio_set(2)
cannot be mapped to IO controller weights and most cgroup resource
control knobs including the bandwidth control knobs of the CPU
controller don't have counterparts in the terminal consumers.
- Anonymous Resource Consumption
For CPU, every time slice consumed from inside a cgroup, which
comprises most but not all of consumed CPU time for the cgroup,
can be clearly attributed to a specific task or process. Because
these two types of entities are directly comparable as consumers
of CPU time, it's theoretically possible to mix tasks and cgroups
on the same tree levels and let them directly compete for the time
quota available to their common ancestor.
However, the same can't be said for resource types like memory or
IO: the memory consumed by the page cache, for example, can be
tracked on a per-cgroup level, but due to mismatches in lifetimes
of involved objects (page cache can persist long after processes
are gone), shared usages and the implementation overhead of
tracking persistent state, it can no longer be attributed to
individual processes after instantiation. Consequently, any IO
incurred by page cache writeback can be attributed to a cgroup,
but not to the individual consumers inside the cgroup.
For memory and IO, this makes a resource domain (cgroup) an object
of a fundamentally different type than a terminal consumer
(process). A process can't be a first class object in the resource
distribution graph as its total resource consumption can't be
described without the containing resource domain.
Disallowing processes in internal cgroups avoids competition between
cgroups and processes which cannot be meaningfully defined for these
resources. All resource control takes place among cgroups and a
terminal consumer interacts with the containing cgroup the same way
it would with the system without cgroup.
Root cgroup is exempt from this constraint, which is in line with
how root cgroup is handled in general - it's excluded from cgroup
resource accounting and control.
Enforcing process granularity and no internal process constraint
allows all controllers to be on the same footing in terms of resource
distribution hierarchy.
2-2. Impact on CPU Controller
As indicated earlier, the CPU controller's resource distribution graph
is the simplest. Every schedulable resource consumption can be
attributed to a specific task. In addition, for weight based control,
the per-task priority set through setpriority(2) can be translated to
and from a per-cgroup weight. As such, the CPU controller can treat a
task and a cgroup symmetrically, allowing support for any tree layout
of cgroups and tasks. Both process granularity and the no internal
process constraint restrict how the CPU controller can be used.
2-2-1. Impact of Process Granularity
Process granularity prevents tasks belonging to the same process from
being assigned to different cgroups. It was pointed out [6] that this
excludes the valid use case of hierarchical CPU distribution within
processes.
To address this issue, the rgroup (resource group) [7][8][9]
interface, an extension of the existing setpriority(2) API, was
proposed, which is in line with other programmable priority
mechanisms and eliminates the risk of in-application configuration
and system configuration stepping on each other's toes.
Unfortunately, the proposal quickly turned into discussions around
cgroup v2 design decisions [4] and no consensus could be reached.
2-2-2. Impact of No Internal Process Constraint
The no internal process constraint disallows tasks from competing
directly against cgroups. Here is an excerpt from Peter Zijlstra
pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
t4 are tasks:
R
/ | \
t1 t2 A
/ \
t3 t4
Is fundamentally different from:
R
/ \
L A
/ \ / \
t1 t2 t3 t4
Because if in the first hierarchy you add a task (t5) to R, all of
its A will run at 1/4th of total bandwidth where before it had
1/3rd, whereas with the second example, if you add our t5 to L, A
doesn't get any less bandwidth.
It is true that the trees are semantically different from each other
and the symmetric handling of tasks and cgroups is aesthetically
pleasing. However, it isn't clear what the practical usefulness of
a layout with direct competition between tasks and cgroups would be,
considering that the number and behavior of tasks are controlled by each
application, and cgroups primarily deal with system-level resource
distribution; changes in the number of active threads would directly
impact resource distribution. Real world use cases of such layouts
could not be established during the discussions.
2-3. Arguments for cgroup v2
There are strong demands for comprehensive hierarchical resource
control across all major resources, and establishing a common resource
hierarchy is an essential step. As with most engineering decisions,
common resource hierarchy definition comes with its trade-offs. With
cgroup v2, the trade-offs are in the form of structural constraints
which, among others, restrict the CPU controller's space of possible
configurations.
However, even with the restrictions, cgroup v2, in combination with
rgroup, covers most of the identified real-world use cases while enabling
new important use cases of resource control across multiple resource
types that were fundamentally broken previously.
Furthermore, for resource control, treating resource domains as
objects of a different type from terminal consumers has important
advantages - it can account for resource consumptions which are not
tied to any specific terminal consumer, be that a task or process, and
allows decoupling resource distribution controls from in-application
APIs. Even the CPU controller may benefit from it as the kernel can
consume a significant amount of CPU cycles in interrupt context or in
tasks shared across multiple resource domains (e.g. softirq).
Finally, it's important to note that enabling cgroup v2 support for
the CPU controller doesn't block use cases which require the features
which are not available on cgroup v2. Unlikely, but should anybody
actually rely on the CPU controller's symmetric handling of tasks and
cgroups, backward compatibility is and will be maintained by being
able to disconnect the controller from the cgroup v2 hierarchy and use
it standalone. This also holds for cpuset which is often used in
highly customized configurations which might be a poor fit for common
resource domains.
The required changes are minimal, the benefits for the target use
cases are critical and obvious, and use cases which have to use v1 can
continue to do so.
3. Way Forward
cgroup v2 primarily aims to solve the problem of comprehensive
hierarchical resource control across all major computing resources,
which is one of the core problems of modern server infrastructure
engineering. The trade-offs that cgroup v2 took are results of
pursuing that goal and gaining a better understanding of the nature of
resource control in the process.
I believe that real world usages will prove cgroup v2's model right,
considering the crucial pieces of comprehensive resource control that
cannot be implemented without common resource domains. This is not to
say that cgroup v2 is fixed in stone and can't be updated; if there is
an approach which better serves both comprehensive resource control
and the CPU controller's flexibility, we will surely move towards
that. It goes without saying that discussions around such approach
should consider practical aspects of resource control as a whole
rather than absolutely focusing on a particular controller.
Until such consensus can be reached, the CPU controller cgroup v2
support will be maintained out of the mainline kernel in an easily
accessible form. If there is anything cgroup developers can do to
ease the pain, please feel free to contact us on the cgroup mailing
list at cgroups@vger.kernel.org.
4. References
[1] http://lkml.kernel.org/r/20160105164834.GE5995@mtj.duckdns.org
[PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
Tejun Heo <tj@kernel.org>
[2] http://lkml.kernel.org/r/20160105164852.GF5995@mtj.duckdns.org
[PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
Tejun Heo <tj@kernel.org>
[3] http://lkml.kernel.org/r/1438641689-14655-4-git-send-email-tj@kernel.org
[PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
Tejun Heo <tj@kernel.org>
[4] http://lkml.kernel.org/r/20160407064549.GH3430@twins.programming.kicks-ass.net
Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
Peter Zijlstra <peterz@infradead.org>
[5] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
Control Group v2
Tejun Heo <tj@kernel.org>
[6] http://lkml.kernel.org/r/CAPM31RJNy3jgG=DYe6GO=wyL4BPPxwUm1f2S6YXacQmo7viFZA@mail.gmail.com
Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
Paul Turner <pjt@google.com>
[7] http://lkml.kernel.org/r/20160105154503.GC5995@mtj.duckdns.org
[RFD] cgroup: thread granularity support for cpu controller
Tejun Heo <tj@kernel.org>
[8] http://lkml.kernel.org/r/1457710888-31182-1-git-send-email-tj@kernel.org
[PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
Tejun Heo <tj@kernel.org>
[9] http://lkml.kernel.org/r/20160311160522.GA24046@htj.duckdns.org
Example program for PRIO_RGRP
Tejun Heo <tj@kernel.org>
[10] http://lkml.kernel.org/r/20160407082810.GN3430@twins.programming.kicks-ass.net
Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource
Peter Zijlstra <peterz@infradead.org>
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
2016-08-05 17:07 [Documentation] State of CPU controller in cgroup v2 Tejun Heo
@ 2016-08-05 17:09 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2016-08-05 17:09 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
Cc: linux-kernel, cgroups, linux-api, kernel-team
From ed6d93036ec930cb774da10b7c87f67905ce71f1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500
While the cpu controller doesn't have any functional problems, there
are a couple interface issues which can be addressed in the v2
interface.
* cpuacct being a separate controller. This separation is artificial
and rather pointless as demonstrated by most use cases co-mounting
the two controllers. It also forces certain information to be
accounted twice.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements cpu controller's interface on the unified
hierarchy which adheres to the controller file conventions described
in Documentation/cgroups/unified-hierarchy.txt. Overall, the
following changes are made.
* cpuacct is implicitly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity. While this doesn't solve the double accounting
immediately, once majority of users switch to v2, cpu can directly
account and report the relevant stats and cpuacct can be disabled on
the unified hierarchy.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it can be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the original scheduler weight range, the dead zones
on both sides are relatively small and cover a wider range than the
nice value mappings. This file doesn't make sense in the root
cgroup and isn't created on root.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
"cpu.rt.max" which contains both runtime and period.
v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
CFS bandwidth stats and also using raw division for u64. Use
CONFIG_CFS_BANDWIDTH and do_div() instead.
The semantics of "cpu.rt.max" is not fully decided yet. Dropped
for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/core.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/cpuacct.c | 24 +++++++++
kernel/sched/cpuacct.h | 5 ++
3 files changed, 170 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c148dfe..7bba2c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8561,6 +8561,139 @@ static struct cftype cpu_legacy_files[] = {
{ } /* terminate */
};
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+ cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ struct task_group *tg = css_tg(seq_css(sf));
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 throttled_usec;
+
+ throttled_usec = cfs_b->throttled_time;
+ do_div(throttled_usec, NSEC_PER_USEC);
+
+ seq_printf(sf, "nr_periods %d\n"
+ "nr_throttled %d\n"
+ "throttled_usec %llu\n",
+ cfs_b->nr_periods, cfs_b->nr_throttled,
+ throttled_usec);
+ }
+#endif
+ return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+ u64 weight = scale_load_down(tg->shares);
+
+ return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 weight)
+{
+ /*
+ * cgroup weight knobs should use the common MIN, DFL and MAX
+ * values which are 1, 100 and 10000 respectively. While it loses
+ * a bit of range on both ends, it maps pretty well onto the shares
+ * value used by scheduler and the round-trip conversions preserve
+ * the original value over the entire range.
+ */
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+ long period, long quota)
+{
+ if (quota < 0)
+ seq_puts(sf, "max");
+ else
+ seq_printf(sf, "%ld", quota);
+
+ seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+ u64 *periodp, u64 *quotap)
+{
+ char tok[21]; /* U64_MAX */
+
+ if (!sscanf(buf, "%s %llu", tok, periodp))
+ return -EINVAL;
+
+ *periodp *= NSEC_PER_USEC;
+
+ if (sscanf(tok, "%llu", quotap))
+ *quotap *= NSEC_PER_USEC;
+ else if (!strcmp(tok, "max"))
+ *quotap = RUNTIME_INF;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period = tg_get_cfs_period(tg);
+ u64 quota;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period, &quota);
+ if (!ret)
+ ret = tg_set_cfs_bandwidth(tg, period, quota);
+ return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_stats_show,
+ },
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_weight_read_u64,
+ .write_u64 = cpu_weight_write_u64,
+ },
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_max_show,
+ .write = cpu_max_write,
+ },
+#endif
+ { } /* terminate */
+};
+
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_released = cpu_cgroup_css_released,
@@ -8569,7 +8702,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
+ .dfl_cftypes = cpu_files,
.early_init = true,
+#ifdef CONFIG_CGROUP_CPUACCT
+ /*
+ * cpuacct is enabled together with cpu on the unified hierarchy
+ * and its stats are reported through "cpu.stat".
+ */
+ .depends_on = 1 << cpuacct_cgrp_id,
+#endif
};
#endif /* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3eb9eda..7a02d26 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -305,6 +305,30 @@ static struct cftype files[] = {
{ } /* terminate */
};
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+ struct cgroup_subsys_state *css;
+ u64 usage, user, sys;
+
+ css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+ usage = cpuusage_read(css, seq_cft(sf));
+ cpuacct_stats_read(css_ca(css), &user, &sys);
+
+ user *= TICK_NSEC;
+ sys *= TICK_NSEC;
+ do_div(usage, NSEC_PER_USEC);
+ do_div(user, NSEC_PER_USEC);
+ do_div(sys, NSEC_PER_USEC);
+
+ seq_printf(sf, "usage_usec %llu\n"
+ "user_usec %llu\n"
+ "system_usec %llu\n", usage, user, sys);
+
+ css_put(css);
+}
+
/*
* charge this task's execution time to its accounting group.
*
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
#else
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
{
}
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
#endif
--
2.7.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
@ 2016-08-05 17:09 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2016-08-05 17:09 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
Cc: linux-kernel, cgroups, linux-api, kernel-team
From ed6d93036ec930cb774da10b7c87f67905ce71f1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500
While the cpu controller doesn't have any functional problems, there
are a couple interface issues which can be addressed in the v2
interface.
* cpuacct being a separate controller. This separation is artificial
and rather pointless as demonstrated by most use cases co-mounting
the two controllers. It also forces certain information to be
accounted twice.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements cpu controller's interface on the unified
hierarchy which adheres to the controller file conventions described
in Documentation/cgroups/unified-hierarchy.txt. Overall, the
following changes are made.
* cpuacct is implicitly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity. While this doesn't solve the double accounting
immediately, once the majority of users switch to v2, cpu can directly
account and report the relevant stats and cpuacct can be disabled on
the unified hierarchy.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it can be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the original scheduler weight range, the dead zones
on both sides are relatively small and cover a wider range than the
nice value mappings. This file doesn't make sense in the root
cgroup and isn't created on root.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
"cpu.rt.max" which contains both runtime and period.
v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
CFS bandwidth stats and also using raw division for u64. Use
CONFIG_CFS_BANDWIDTH and do_div() instead.
The semantics of "cpu.rt.max" are not fully decided yet. Dropped
for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/core.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/cpuacct.c | 24 +++++++++
kernel/sched/cpuacct.h | 5 ++
3 files changed, 170 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c148dfe..7bba2c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8561,6 +8561,139 @@ static struct cftype cpu_legacy_files[] = {
{ } /* terminate */
};
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+ cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ struct task_group *tg = css_tg(seq_css(sf));
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 throttled_usec;
+
+ throttled_usec = cfs_b->throttled_time;
+ do_div(throttled_usec, NSEC_PER_USEC);
+
+ seq_printf(sf, "nr_periods %d\n"
+ "nr_throttled %d\n"
+ "throttled_usec %llu\n",
+ cfs_b->nr_periods, cfs_b->nr_throttled,
+ throttled_usec);
+ }
+#endif
+ return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+ u64 weight = scale_load_down(tg->shares);
+
+ return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 weight)
+{
+ /*
+ * cgroup weight knobs should use the common MIN, DFL and MAX
+ * values which are 1, 100 and 10000 respectively. While it loses
+ * a bit of range on both ends, it maps pretty well onto the shares
+ * value used by scheduler and the round-trip conversions preserve
+ * the original value over the entire range.
+ */
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+ long period, long quota)
+{
+ if (quota < 0)
+ seq_puts(sf, "max");
+ else
+ seq_printf(sf, "%ld", quota);
+
+ seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+ u64 *periodp, u64 *quotap)
+{
+ char tok[21]; /* U64_MAX */
+
+ if (!sscanf(buf, "%s %llu", tok, periodp))
+ return -EINVAL;
+
+ *periodp *= NSEC_PER_USEC;
+
+ if (sscanf(tok, "%llu", quotap))
+ *quotap *= NSEC_PER_USEC;
+ else if (!strcmp(tok, "max"))
+ *quotap = RUNTIME_INF;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period = tg_get_cfs_period(tg);
+ u64 quota;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period, &quota);
+ if (!ret)
+ ret = tg_set_cfs_bandwidth(tg, period, quota);
+ return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_stats_show,
+ },
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_weight_read_u64,
+ .write_u64 = cpu_weight_write_u64,
+ },
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_max_show,
+ .write = cpu_max_write,
+ },
+#endif
+ { } /* terminate */
+};
+
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_released = cpu_cgroup_css_released,
@@ -8569,7 +8702,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
+ .dfl_cftypes = cpu_files,
.early_init = true,
+#ifdef CONFIG_CGROUP_CPUACCT
+ /*
+ * cpuacct is enabled together with cpu on the unified hierarchy
+ * and its stats are reported through "cpu.stat".
+ */
+ .depends_on = 1 << cpuacct_cgrp_id,
+#endif
};
#endif /* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3eb9eda..7a02d26 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -305,6 +305,30 @@ static struct cftype files[] = {
{ } /* terminate */
};
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+ struct cgroup_subsys_state *css;
+ u64 usage, user, sys;
+
+ css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+ usage = cpuusage_read(css, seq_cft(sf));
+ cpuacct_stats_read(css_ca(css), &user, &sys);
+
+ user *= TICK_NSEC;
+ sys *= TICK_NSEC;
+ do_div(usage, NSEC_PER_USEC);
+ do_div(user, NSEC_PER_USEC);
+ do_div(sys, NSEC_PER_USEC);
+
+ seq_printf(sf, "usage_usec %llu\n"
+ "user_usec %llu\n"
+ "system_usec %llu\n", usage, user, sys);
+
+ css_put(css);
+}
+
/*
* charge this task's execution time to its accounting group.
*
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
#else
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
{
}
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
#endif
--
2.7.4
* [PATCHSET REPOST] sched, cgroup: implement cgroup v2 interface for cpu controller
@ 2016-01-05 16:47 Tejun Heo
2016-01-05 16:48 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2016-01-05 16:47 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Li Zefan, Johannes Weiner, Paul Turner
Cc: linux-kernel, cgroups, Andrew Morton, Linus Torvalds, kernel-team
Hello,
This patchset implements cgroup v2 interface for cpu controller and a
repost of the following.
http://lkml.kernel.org/g/1438641689-14655-1-git-send-email-tj@kernel.org
The first patch is folded and applied into v2 documentation. The
second patch is unchanged. The third is unchanged from the posted v2.
The patchset got blocked by discussion on in-process hierarchical
resource control. The following message is to restart the discussion.
http://lkml.kernel.org/g/20160105154503.GC5995@mtj.duckdns.org
While the conclusion hasn't been reached, none of the issues being
debated affects the system level interface and there's no technical
reason to hold off these patches especially given that cgroup v2 is
being released in the coming merge window for memory and io
controllers.
Can we please go ahead with these?
diffstat follows.
kernel/sched/core.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/cpuacct.c | 57 ++++++++++++------
kernel/sched/cpuacct.h | 5 +
3 files changed, 189 insertions(+), 22 deletions(-)
Thanks.
--
tejun
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
2016-01-05 16:47 [PATCHSET REPOST] sched, cgroup: implement cgroup v2 interface for cpu controller Tejun Heo
@ 2016-01-05 16:48 ` Tejun Heo
0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2016-01-05 16:48 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Li Zefan, Johannes Weiner, Paul Turner
Cc: linux-kernel, cgroups, Andrew Morton, Linus Torvalds, kernel-team
While the cpu controller doesn't have any functional problems, there
are a couple of interface issues which can be addressed in the v2
interface.
* cpuacct being a separate controller. This separation is artificial
and rather pointless as demonstrated by most use cases co-mounting
the two controllers. It also forces certain information to be
accounted twice.
* Use of different time units. Writable control knobs use
microseconds, some stat fields use nanoseconds while other cpuacct
stat fields use centiseconds.
* Control knobs which can't be used in the root cgroup still show up
in the root.
* Control knob names and semantics aren't consistent with other
controllers.
This patchset implements cpu controller's interface on the unified
hierarchy which adheres to the controller file conventions described
in Documentation/cgroups/unified-hierarchy.txt. Overall, the
following changes are made.
* cpuacct is implicitly enabled and disabled by cpu and its information
is reported through "cpu.stat" which now uses microseconds for all
time durations. All time duration fields now have "_usec" appended
to them for clarity. While this doesn't solve the double accounting
immediately, once the majority of users switch to v2, cpu can directly
account and report the relevant stats and cpuacct can be disabled on
the unified hierarchy.
Note that cpuacct.usage_percpu is currently not included in
"cpu.stat". If this information is actually called for, it can be
added later.
* "cpu.shares" is replaced with "cpu.weight" and operates on the
standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
The weight is scaled to scheduler weight so that 100 maps to 1024
and the ratio relationship is preserved - if weight is W and its
scaled value is S, W / 100 == S / 1024. While the mapped range is a
bit smaller than the original scheduler weight range, the dead zones
on both sides are relatively small and cover a wider range than the
nice value mappings. This file doesn't make sense in the root
cgroup and isn't created on root.
* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
which contains both quota and period.
* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
"cpu.rt.max" which contains both runtime and period.
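As a quick illustration of the "cpu.max" syntax above, here is a small
bash sketch that parses the single-line "QUOTA PERIOD" format (values in
microseconds, with the literal "max" meaning no limit). The helper name
and output strings are invented for this example; the real parsing is
done in the kernel by cpu_max_write().

```shell
#!/usr/bin/env bash
# parse_cpu_max: illustrative parser for the cgroup2 "cpu.max" format.
# Input is one line "QUOTA PERIOD" where QUOTA is either a number of
# microseconds or "max" (no bandwidth limit).
parse_cpu_max() {
	local quota period
	read -r quota period <<<"$1"
	if [ "$quota" = "max" ]; then
		echo "no quota, period ${period}us"
	else
		echo "quota ${quota}us per ${period}us period"
	fi
}

parse_cpu_max "max 100000"      # -> no quota, period 100000us
parse_cpu_max "50000 100000"    # -> quota 50000us per 100000us period
```

On a system with cgroup2 mounted, the same two-field line is what you
would write to a group's cpu.max file, e.g.
`echo "50000 100000" > cpu.max` for a 50ms quota every 100ms.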
v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
CFS bandwidth stats and also using raw division for u64. Use
CONFIG_CFS_BANDWIDTH and do_div() instead.
The semantics of "cpu.rt.max" are not fully decided yet. Dropped
for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/core.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/cpuacct.c | 24 ++++++++
kernel/sched/cpuacct.h | 5 +
3 files changed, 170 insertions(+)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8567,6 +8567,139 @@ static struct cftype cpu_legacy_files[]
{ } /* terminate */
};
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+ cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ struct task_group *tg = css_tg(seq_css(sf));
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+ u64 throttled_usec;
+
+ throttled_usec = cfs_b->throttled_time;
+ do_div(throttled_usec, NSEC_PER_USEC);
+
+ seq_printf(sf, "nr_periods %d\n"
+ "nr_throttled %d\n"
+ "throttled_usec %llu\n",
+ cfs_b->nr_periods, cfs_b->nr_throttled,
+ throttled_usec);
+ }
+#endif
+ return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct task_group *tg = css_tg(css);
+ u64 weight = scale_load_down(tg->shares);
+
+ return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 weight)
+{
+ /*
+ * cgroup weight knobs should use the common MIN, DFL and MAX
+ * values which are 1, 100 and 10000 respectively. While it loses
+ * a bit of range on both ends, it maps pretty well onto the shares
+ * value used by scheduler and the round-trip conversions preserve
+ * the original value over the entire range.
+ */
+ if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+ return -ERANGE;
+
+ weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+ return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+ long period, long quota)
+{
+ if (quota < 0)
+ seq_puts(sf, "max");
+ else
+ seq_printf(sf, "%ld", quota);
+
+ seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+ u64 *periodp, u64 *quotap)
+{
+ char tok[21]; /* U64_MAX */
+
+ if (!sscanf(buf, "%s %llu", tok, periodp))
+ return -EINVAL;
+
+ *periodp *= NSEC_PER_USEC;
+
+ if (sscanf(tok, "%llu", quotap))
+ *quotap *= NSEC_PER_USEC;
+ else if (!strcmp(tok, "max"))
+ *quotap = RUNTIME_INF;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+ struct task_group *tg = css_tg(seq_css(sf));
+
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct task_group *tg = css_tg(of_css(of));
+ u64 period = tg_get_cfs_period(tg);
+ u64 quota;
+ int ret;
+
+ ret = cpu_period_quota_parse(buf, &period, &quota);
+ if (!ret)
+ ret = tg_set_cfs_bandwidth(tg, period, quota);
+ return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "stat",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_stats_show,
+ },
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ {
+ .name = "weight",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_weight_read_u64,
+ .write_u64 = cpu_weight_write_u64,
+ },
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_max_show,
+ .write = cpu_max_write,
+ },
+#endif
+ { } /* terminate */
+};
+
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_free = cpu_cgroup_css_free,
@@ -8576,7 +8709,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_legacy_files,
+ .dfl_cftypes = cpu_files,
.early_init = 1,
+#ifdef CONFIG_CGROUP_CPUACCT
+ /*
+ * cpuacct is enabled together with cpu on the unified hierarchy
+ * and its stats are reported through "cpu.stat".
+ */
+ .depends_on = 1 << cpuacct_cgrp_id,
+#endif
};
#endif /* CONFIG_CGROUP_SCHED */
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -224,6 +224,30 @@ static struct cftype files[] = {
{ } /* terminate */
};
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+ struct cgroup_subsys_state *css;
+ u64 usage, user, sys;
+
+ css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+ usage = cpuusage_read(css, seq_cft(sf));
+ cpuacct_stats_read(css_ca(css), &user, &sys);
+
+ user *= TICK_NSEC;
+ sys *= TICK_NSEC;
+ do_div(usage, NSEC_PER_USEC);
+ do_div(user, NSEC_PER_USEC);
+ do_div(sys, NSEC_PER_USEC);
+
+ seq_printf(sf, "usage_usec %llu\n"
+ "user_usec %llu\n"
+ "system_usec %llu\n", usage, user, sys);
+
+ css_put(css);
+}
+
/*
* charge this task's execution time to its accounting group.
*
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
extern void cpuacct_account_field(struct task_struct *p, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
#else
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct
{
}
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
#endif
end of thread, other threads:[~2017-09-25 16:00 UTC | newest]
Thread overview: 20+ messages
-- links below jump to the message on this page --
2017-07-20 18:48 [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller Tejun Heo
2017-07-20 18:48 ` Tejun Heo
2017-07-20 18:48 ` [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface Tejun Heo
2017-07-20 18:48 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
2017-07-29 9:17 ` Peter Zijlstra
2017-07-29 9:17 ` Peter Zijlstra
2017-08-01 20:17 ` Tejun Heo
2017-08-01 21:40 ` Peter Zijlstra
2017-08-01 21:40 ` Peter Zijlstra
2017-08-02 15:41 ` Tejun Heo
2017-08-02 15:41 ` Tejun Heo
2017-08-02 16:05 ` Peter Zijlstra
2017-08-02 18:16 ` Tejun Heo
2017-07-27 19:51 ` [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller Tejun Heo
2017-07-27 19:51 ` Tejun Heo
-- strict thread matches above, loose matches on Subject: below --
2017-09-25 16:00 [PATCHSET REPOST for-4.15] cgroup, sched: cgroup2 interface for CPU controller (on basic acct) Tejun Heo
2017-09-25 16:00 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
2017-08-11 16:47 [PATCHSET for-4.14] cgroup, sched: cgroup2 interface for CPU controller (on basic acct) Tejun Heo
2017-08-11 16:47 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
2016-08-05 17:07 [Documentation] State of CPU controller in cgroup v2 Tejun Heo
2016-08-05 17:09 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
2016-08-05 17:09 ` Tejun Heo
2016-01-05 16:47 [PATCHSET REPOST] sched, cgroup: implement cgroup v2 interface for cpu controller Tejun Heo
2016-01-05 16:48 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo