* [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
From: Tejun Heo @ 2017-08-11 16:37 UTC
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

This patchset implements cgroup2 basic resource accounting and tracks
CPU usage on all cgroups by default.

In cgroup1, collecting non-trivial statistics by default wasn't viable
because there can be any number of hierarchies.  While cgroup2's
single hierarchy removes that part of the problem, propagating up the
hierarchy on each accounting event is still problematic.

This patchset implements a cgroup2 basic resource accounting
mechanism which keeps all the counters per-cpu and propagates them
lazily.  This allows the accounting side to perform a single per-cpu
update in most cases and keeps the reader side's complexity at
O(nr updated descendants) instead of O(nr total descendants).
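
Condensed from the interface added in patch 3, the two sides boil down
to the following sketch (locking and the field variants omitted; see
__cgroup_account_cputime() and cgroup_stat_flush() in the last patch):

	/* hot path: bump one per-cpu counter, no walk up the hierarchy */
	cstat = cgroup_cpu_stat_account_begin(cgrp);
	cstat->cputime.sum_exec_runtime += delta_exec;
	cgroup_cpu_stat_account_end(cgrp, cstat);	/* marks cgrp updated */

	/* read path: flush only the cgroups updated since the last read */
	cgroup_stat_flush(cgrp);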

This will be used by the cgroup2 CPU interface later.  The patchset
contains the following three patches.

  0001-sched-cputime-Expose-cputime_adjust.patch
  0002-cpuacct-Introduce-cgroup_account_cputime-_field.patch
  0003-cgroup-Implement-cgroup2-basic-CPU-usage-accounting.patch

The patchset is on top of cgroup/for-4.14 and available in the
following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-basic-acct

diffstat follows.  Thanks.

 Documentation/cgroup-v2.txt     |    9 +
 include/linux/cgroup-defs.h     |   47 +++++
 include/linux/cgroup.h          |   60 +++++++
 include/linux/sched/cputime.h   |    3 
 kernel/cgroup/Makefile          |    2 
 kernel/cgroup/cgroup-internal.h |    8 
 kernel/cgroup/cgroup.c          |   24 ++
 kernel/cgroup/stat.c            |  333 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.h          |   17 --
 kernel/sched/cputime.c          |    7 
 kernel/sched/deadline.c         |    2 
 kernel/sched/fair.c             |    2 
 kernel/sched/rt.c               |    2 
 kernel/sched/sched.h            |    1 
 kernel/sched/stop_task.c        |    2 
 15 files changed, 489 insertions(+), 30 deletions(-)

--
tejun


* [PATCH 1/3] sched/cputime: Expose cputime_adjust()
From: Tejun Heo @ 2017-08-11 16:37 UTC
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

This will be used by basic cgroup resource stat reporting later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched/cputime.h | 3 ++-
 kernel/sched/cputime.c        | 5 ++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 4c5b973..9251044 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -53,7 +53,8 @@ static inline void task_cputime_scaled(struct task_struct *t,
 
 extern void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st);
 extern void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st);
-
+extern void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
+			   u64 *ut, u64 *st);
 
 /*
  * Thread group CPU time accounting.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6e3ea4a..3a8bfcc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -585,9 +585,8 @@ static u64 scale_stime(u64 stime, u64 rtime, u64 total)
  *
  * Assuming that rtime_i+1 >= rtime_i.
  */
-static void cputime_adjust(struct task_cputime *curr,
-			   struct prev_cputime *prev,
-			   u64 *ut, u64 *st)
+void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
+		    u64 *ut, u64 *st)
 {
 	u64 rtime, stime, utime;
 	unsigned long flags;
-- 
2.9.3


* [PATCH 2/3] cpuacct: Introduce cgroup_account_cputime[_field]()
From: Tejun Heo @ 2017-08-11 16:37 UTC
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
and cpuacct_account_field().  This doesn't introduce any functional
changes and will be used to add cgroup basic resource accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/cgroup.h   | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.h   | 17 -----------------
 kernel/sched/cputime.c   |  2 +-
 kernel/sched/deadline.c  |  2 +-
 kernel/sched/fair.c      |  2 +-
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  1 -
 kernel/sched/stop_task.c |  2 +-
 8 files changed, 43 insertions(+), 23 deletions(-)
 delete mode 100644 kernel/sched/cpuacct.h

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 085056e..f395e02 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,7 @@
 #include <linux/nsproxy.h>
 #include <linux/user_namespace.h>
 #include <linux/refcount.h>
+#include <linux/kernel_stat.h>
 
 #include <linux/cgroup-defs.h>
 
@@ -675,6 +676,43 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 #endif /* !CONFIG_CGROUPS */
 
 /*
+ * Basic resource stats.
+ */
+#ifdef CONFIG_CGROUPS
+
+#ifdef CONFIG_CGROUP_CPUACCT
+void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_account_field(struct task_struct *tsk, int index,
+					 u64 val) {}
+#endif
+
+static inline void cgroup_account_cputime(struct task_struct *task,
+					  u64 delta_exec)
+{
+	cpuacct_charge(task, delta_exec);
+}
+
+static inline void cgroup_account_cputime_field(struct task_struct *task,
+						enum cpu_usage_stat index,
+						u64 delta_exec)
+{
+	cpuacct_account_field(task, index, delta_exec);
+}
+
+#else	/* CONFIG_CGROUPS */
+
+static inline void cgroup_account_cputime(struct task_struct *task,
+					  u64 delta_exec) {}
+static inline void cgroup_account_cputime_field(struct task_struct *task,
+						enum cpu_usage_stat index,
+						u64 delta_exec) {}
+
+#endif	/* CONFIG_CGROUPS */
+
+/*
  * sock->sk_cgrp_data handling.  For more info, see sock_cgroup_data
  * definition in cgroup-defs.h.
  */
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
deleted file mode 100644
index ba72807..0000000
--- a/kernel/sched/cpuacct.h
+++ /dev/null
@@ -1,17 +0,0 @@
-#ifdef CONFIG_CGROUP_CPUACCT
-
-extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
-extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
-
-#else
-
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
-}
-
-static inline void
-cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
-{
-}
-
-#endif
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3a8bfcc..d13149d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -109,7 +109,7 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	 */
 	__this_cpu_add(kernel_cpustat.cpustat[index], tmp);
 
-	cpuacct_account_field(p, index, tmp);
+	cgroup_account_cputime_field(p, index, tmp);
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a84299f..4d5119f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1145,7 +1145,7 @@ static void update_curr_dl(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
-	cpuacct_charge(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..9b2e0b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -864,7 +864,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cpuacct_charge(curtask, delta_exec);
+		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 45caf93..ed6e95d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -979,7 +979,7 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
-	cpuacct_charge(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3..35812d779 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -36,7 +36,6 @@
 
 #include "cpupri.h"
 #include "cpudeadline.h"
-#include "cpuacct.h"
 
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 9f69fb6..ec0bb5a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -71,7 +71,7 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
-	cpuacct_charge(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
-- 
2.9.3


* [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
From: Tejun Heo @ 2017-08-11 16:37 UTC
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to a combination of two factors -
1. enabling the cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usage on.  The
cpuacct controller is effectively used to designate a hierarchy to
track CPU usage on.

cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usage by default.  While we could use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other, and some of the exposed statistics don't make
much sense.  Also, it propagates all the way up the hierarchy on each
accounting event, which is unnecessary.

This patch adds a basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usage using it.

* All accounting is done per-cpu and doesn't propagate immediately.
  An update just bumps the per-cgroup per-cpu counters and links the
  cgroup to the parent's updated list if it isn't already on it.

* On a read, the per-cpu counters are collected into the global ones
  and then propagated upwards.  Only the per-cpu counters which have
  changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with the
  "cpu." prefix.  Total usage is collected from scheduling events.
  User/sys breakdown is sourced from tick sampling and adjusted to the
  usage using cputime_adjust() (see the example output below).

This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).
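
For illustration, reading "cgroup.stat" in a non-root cgroup would
then look roughly as follows (path and numbers are made up; the nr_*
fields are pre-existing):

  # cat /sys/fs/cgroup/test/cgroup.stat
  nr_descendants 2
  nr_dying_descendants 0
  cpu.usage_usec 1426387
  cpu.user_usec 1152003
  cpu.system_usec 274384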

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroup-v2.txt     |   9 ++
 include/linux/cgroup-defs.h     |  47 ++++++
 include/linux/cgroup.h          |  22 +++
 kernel/cgroup/Makefile          |   2 +-
 kernel/cgroup/cgroup-internal.h |   8 +
 kernel/cgroup/cgroup.c          |  24 ++-
 kernel/cgroup/stat.c            | 333 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 442 insertions(+), 3 deletions(-)
 create mode 100644 kernel/cgroup/stat.c

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dc44785..3f82169 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -886,6 +886,15 @@ All cgroup core files are prefixed with "cgroup."
 		A dying cgroup can consume system resources not exceeding
 		limits, which were active at the moment of cgroup deletion.
 
+	  cpu.usage_usec
+		CPU time consumed in the subtree.
+
+	  cpu.user_usec
+		User CPU time consumed in the subtree.
+
+	  cpu.system_usec
+		System CPU time consumed in the subtree.
+
 
 Controllers
 ===========
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 59e4ad9..17da9c8 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include <linux/refcount.h>
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/u64_stats_sync.h>
 #include <linux/workqueue.h>
 #include <linux/bpf-cgroup.h>
 
@@ -249,6 +250,47 @@ struct css_set {
 	struct rcu_head rcu_head;
 };
 
+/*
+ * cgroup basic resource usage statistics.  Accounting is done per-cpu in
+ * cgroup_cpu_stat which is then lazily propagated up the hierarchy on
+ * reads.  The propagation is selective - only the cgroup_cpu_stats which
+ * have been updated since the last propagation are propagated.
+ */
+struct cgroup_cpu_stat {
+	/*
+	 * ->sync protects all the current counters.  These are the only
+	 * fields which get updated in the hot path.
+	 */
+	struct u64_stats_sync sync;
+	struct task_cputime cputime;
+
+	/*
+	 * Snapshots at the last reading.  These are used to calculate the
+	 * deltas to propagate to the global counters.
+	 */
+	struct task_cputime last_cputime;
+
+	/*
+	 * Child cgroups with stat updates on this cpu since the last read
+	 * are linked on the parent's ->updated_children through
+	 * ->updated_next.
+	 *
+	 * In addition to being more compact, singly-linked list pointing
+	 * to the cgroup makes it unnecessary for each per-cpu struct to
+	 * point back to the associated cgroup.
+	 *
+	 * Protected by per-cpu cgroup_cpu_stat_lock.
+	 */
+	struct cgroup *updated_children;	/* terminated by self cgroup */
+	struct cgroup *updated_next;		/* NULL iff not on the list */
+};
+
+struct cgroup_stat {
+	/* per-cpu statistics are collected into the following global counters */
+	struct task_cputime cputime;
+	struct prev_cputime prev_cputime;
+};
+
 struct cgroup {
 	/* self css with NULL ->ss, points back to this cgroup */
 	struct cgroup_subsys_state self;
@@ -348,6 +390,11 @@ struct cgroup {
 	 */
 	struct cgroup *dom_cgrp;
 
+	/* cgroup basic resource statistics */
+	struct cgroup_cpu_stat __percpu *cpu_stat;
+	struct cgroup_stat pending_stat;	/* pending from children */
+	struct cgroup_stat stat;
+
 	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f395e02..4c82212 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -689,17 +689,39 @@ static inline void cpuacct_account_field(struct task_struct *tsk, int index,
 					 u64 val) {}
 #endif
 
+void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix);
+
+void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
+void __cgroup_account_cputime_field(struct cgroup *cgrp,
+				    enum cpu_usage_stat index, u64 delta_exec);
+
 static inline void cgroup_account_cputime(struct task_struct *task,
 					  u64 delta_exec)
 {
+	struct cgroup *cgrp;
+
 	cpuacct_charge(task, delta_exec);
+
+	rcu_read_lock();
+	cgrp = task_dfl_cgroup(task);
+	if (cgroup_parent(cgrp))
+		__cgroup_account_cputime(cgrp, delta_exec);
+	rcu_read_unlock();
 }
 
 static inline void cgroup_account_cputime_field(struct task_struct *task,
 						enum cpu_usage_stat index,
 						u64 delta_exec)
 {
+	struct cgroup *cgrp;
+
 	cpuacct_account_field(task, index, delta_exec);
+
+	rcu_read_lock();
+	cgrp = task_dfl_cgroup(task);
+	if (cgroup_parent(cgrp))
+		__cgroup_account_cputime_field(cgrp, index, delta_exec);
+	rcu_read_unlock();
 }
 
 #else	/* CONFIG_CGROUPS */
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index ce693cc..0acee61 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -1,4 +1,4 @@
-obj-y := cgroup.o namespace.o cgroup-v1.o
+obj-y := cgroup.o stat.o namespace.o cgroup-v1.o
 
 obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c167a40..f17da09 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -197,6 +197,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 int cgroup_task_count(const struct cgroup *cgrp);
 
 /*
+ * stat.c
+ */
+void cgroup_stat_flush(struct cgroup *cgrp);
+int cgroup_stat_init(struct cgroup *cgrp);
+void cgroup_stat_exit(struct cgroup *cgrp);
+void cgroup_stat_boot(void);
+
+/*
  * namespace.c
  */
 extern const struct proc_ns_operations cgroupns_operations;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c038ccf..a399a55 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -142,12 +142,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
 };
 #undef SUBSYS
 
+static DEFINE_PER_CPU(struct cgroup_cpu_stat, cgrp_dfl_root_cpu_stat);
+
 /*
  * The default hierarchy, reserved for the subsystems that are otherwise
  * unattached - it never has more than a single cgroup, and all tasks are
  * part of that cgroup.
  */
-struct cgroup_root cgrp_dfl_root;
+struct cgroup_root cgrp_dfl_root = { .cgrp.cpu_stat = &cgrp_dfl_root_cpu_stat };
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
 
 /*
@@ -3296,6 +3298,8 @@ static int cgroup_stat_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "nr_dying_descendants %d\n",
 		   cgroup->nr_dying_descendants);
 
+	cgroup_stat_show_cputime(seq, "cpu.");
+
 	return 0;
 }
 
@@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
 			 */
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
+			if (cgroup_on_dfl(cgrp))
+				cgroup_stat_exit(cgrp);
 			kfree(cgrp);
 		} else {
 			/*
@@ -4510,6 +4516,9 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		trace_cgroup_release(cgrp);
 
+		if (cgroup_on_dfl(cgrp))
+			cgroup_stat_flush(cgrp);
+
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
 		     tcgrp = cgroup_parent(tcgrp))
 			tcgrp->nr_dying_descendants--;
@@ -4696,6 +4705,12 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (ret)
 		goto out_free_cgrp;
 
+	if (cgroup_on_dfl(parent)) {
+		ret = cgroup_stat_init(cgrp);
+		if (ret)
+			goto out_cancel_ref;
+	}
+
 	/*
 	 * Temporarily set the pointer to NULL, so idr_find() won't return
 	 * a half-baked cgroup.
@@ -4703,7 +4718,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
 	if (cgrp->id < 0) {
 		ret = -ENOMEM;
-		goto out_cancel_ref;
+		goto out_stat_exit;
 	}
 
 	init_cgroup_housekeeping(cgrp);
@@ -4752,6 +4767,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 
 	return cgrp;
 
+out_stat_exit:
+	if (cgroup_on_dfl(parent))
+		cgroup_stat_exit(cgrp);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
@@ -5146,6 +5164,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
 
+	cgroup_stat_boot();
+
 	/*
 	 * The latency of the synchronize_sched() is too high for cgroups,
 	 * avoid it at the cost of forcing all readers into the slow path.
diff --git a/kernel/cgroup/stat.c b/kernel/cgroup/stat.c
new file mode 100644
index 0000000..19a10b2
--- /dev/null
+++ b/kernel/cgroup/stat.c
@@ -0,0 +1,333 @@
+#include "cgroup-internal.h"
+
+#include <linux/sched/cputime.h>
+
+static DEFINE_MUTEX(cgroup_stat_mutex);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);
+
+static struct cgroup_cpu_stat *cgroup_cpu_stat(struct cgroup *cgrp, int cpu)
+{
+	return per_cpu_ptr(cgrp->cpu_stat, cpu);
+}
+
+/**
+ * cgroup_cpu_stat_updated - keep track of updated cpu_stat
+ * @cgrp: target cgroup
+ * @cpu: cpu on which cpu_stat was updated
+ *
+ * @cgrp's cpu_stat on @cpu was updated.  Put it on the parent's matching
+ * cpu_stat->updated_children list.
+ */
+static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
+	struct cgroup *parent;
+	unsigned long flags;
+
+	/*
+	 * Speculative already-on-list test.  This may race leading to
+	 * temporary inaccuracies, which is fine.
+	 *
+	 * Because @parent's updated_children is terminated with @parent
+	 * instead of NULL, we can tell whether @cgrp is on the list by
+	 * testing the next pointer for NULL.
+	 */
+	if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
+		return;
+
+	raw_spin_lock_irqsave(cpu_lock, flags);
+
+	/* put @cgrp and all ancestors on the corresponding updated lists */
+	for (parent = cgroup_parent(cgrp); parent;
+	     cgrp = parent, parent = cgroup_parent(cgrp)) {
+		struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
+		struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
+
+		/*
+		 * Both additions and removals are bottom-up.  If a cgroup
+		 * is already in the tree, all ancestors are.
+		 */
+		if (cstat->updated_next)
+			break;
+
+		cstat->updated_next = pcstat->updated_children;
+		pcstat->updated_children = cgrp;
+	}
+
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+}
+
+/**
+ * cgroup_cpu_stat_next_updated - iterate cpu_stat updated tree
+ * @pos: current position
+ * @root: root of the tree to traverse
+ * @cpu: target cpu
+ *
+ * Walks the updated cpu_stat tree on @cpu from @root.  %NULL @pos starts
+ * the traversal and %NULL return indicates the end.  During traversal,
+ * each returned cgroup is unlinked from the tree.  Must be called with the
+ * matching cgroup_cpu_stat_lock held.
+ *
+ * The only ordering guarantee is that, for a parent and a child pair
+ * covered by a given traversal, if a child is visited, its parent is
+ * guaranteed to be visited afterwards.
+ */
+static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
+						   struct cgroup *root, int cpu)
+{
+	struct cgroup_cpu_stat *cstat;
+	struct cgroup *parent;
+
+	if (pos == root)
+		return NULL;
+
+	/*
+	 * We're gonna walk down to the first leaf and visit/remove it.  We
+	 * can pick any unvisited node as the starting point.
+	 */
+	if (!pos)
+		pos = root;
+	else
+		pos = cgroup_parent(pos);
+
+	/* walk down to the first leaf */
+	while (true) {
+		cstat = cgroup_cpu_stat(pos, cpu);
+		if (cstat->updated_children == pos)
+			break;
+		pos = cstat->updated_children;
+	}
+
+	/*
+	 * Unlink @pos from the tree.  As the updated_children list is
+	 * singly linked, we have to walk it to find the removal point.
+	 * However, due to the way we traverse, @pos will be the first
+	 * child in most cases. The only exception is @root.
+	 */
+	parent = cgroup_parent(pos);
+	if (parent && cstat->updated_next) {
+		struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
+		struct cgroup_cpu_stat *ncstat;
+		struct cgroup **nextp;
+
+		nextp = &pcstat->updated_children;
+		while (true) {
+			ncstat = cgroup_cpu_stat(*nextp, cpu);
+			if (*nextp == pos)
+				break;
+
+			WARN_ON_ONCE(*nextp == parent);
+			nextp = &ncstat->updated_next;
+		}
+
+		*nextp = cstat->updated_next;
+		cstat->updated_next = NULL;
+	}
+
+	return pos;
+}
+
+static void cgroup_stat_accumulate(struct cgroup_stat *dst_stat,
+				   struct cgroup_stat *src_stat)
+{
+	dst_stat->cputime.utime += src_stat->cputime.utime;
+	dst_stat->cputime.stime += src_stat->cputime.stime;
+	dst_stat->cputime.sum_exec_runtime += src_stat->cputime.sum_exec_runtime;
+}
+
+static void cgroup_cpu_stat_flush_one(struct cgroup *cgrp, int cpu)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+	struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
+	struct task_cputime *last_cputime = &cstat->last_cputime;
+	struct task_cputime cputime;
+	struct cgroup_stat delta;
+	unsigned seq;
+
+	lockdep_assert_held(&cgroup_stat_mutex);
+
+	/* fetch the current per-cpu values */
+	do {
+		seq = __u64_stats_fetch_begin(&cstat->sync);
+		cputime = cstat->cputime;
+	} while (__u64_stats_fetch_retry(&cstat->sync, seq));
+
+	/* accumulate the deltas to propagate */
+	delta.cputime.utime = cputime.utime - last_cputime->utime;
+	delta.cputime.stime = cputime.stime - last_cputime->stime;
+	delta.cputime.sum_exec_runtime = cputime.sum_exec_runtime -
+					 last_cputime->sum_exec_runtime;
+	*last_cputime = cputime;
+
+	/* transfer the pending stat into delta */
+	cgroup_stat_accumulate(&delta, &cgrp->pending_stat);
+	memset(&cgrp->pending_stat, 0, sizeof(cgrp->pending_stat));
+
+	/* propagate delta into the global stat and the parent's pending */
+	cgroup_stat_accumulate(&cgrp->stat, &delta);
+	if (parent)
+		cgroup_stat_accumulate(&parent->pending_stat, &delta);
+}
+
+/* see cgroup_stat_flush() */
+static void cgroup_stat_flush_locked(struct cgroup *cgrp)
+{
+	int cpu;
+
+	lockdep_assert_held(&cgroup_stat_mutex);
+
+	for_each_possible_cpu(cpu) {
+		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
+		struct cgroup *pos = NULL;
+
+		raw_spin_lock_irq(cpu_lock);
+		while ((pos = cgroup_cpu_stat_next_updated(pos, cgrp, cpu)))
+			cgroup_cpu_stat_flush_one(pos, cpu);
+		raw_spin_unlock_irq(cpu_lock);
+	}
+}
+
+/**
+ * cgroup_stat_flush - flush stats in @cgrp's subtree
+ * @cgrp: target cgroup
+ *
+ * Collect all per-cpu stats in @cgrp's subtree into the global counters
+ * and propagate them upwards.  After this function returns, all cgroups in
+ * the subtree have up-to-date ->stat.
+ *
+ * This also gets all cgroups in the subtree including @cgrp off the
+ * ->updated_children lists.
+ */
+void cgroup_stat_flush(struct cgroup *cgrp)
+{
+	mutex_lock(&cgroup_stat_mutex);
+	cgroup_stat_flush_locked(cgrp);
+	mutex_unlock(&cgroup_stat_mutex);
+}
+
+static struct cgroup_cpu_stat *cgroup_cpu_stat_account_begin(struct cgroup *cgrp)
+{
+	struct cgroup_cpu_stat *cstat;
+
+	cstat = get_cpu_ptr(cgrp->cpu_stat);
+	u64_stats_update_begin(&cstat->sync);
+	return cstat;
+}
+
+static void cgroup_cpu_stat_account_end(struct cgroup *cgrp,
+					struct cgroup_cpu_stat *cstat)
+{
+	u64_stats_update_end(&cstat->sync);
+	cgroup_cpu_stat_updated(cgrp, smp_processor_id());
+	put_cpu_ptr(cstat);
+}
+
+void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
+{
+	struct cgroup_cpu_stat *cstat;
+
+	cstat = cgroup_cpu_stat_account_begin(cgrp);
+	cstat->cputime.sum_exec_runtime += delta_exec;
+	cgroup_cpu_stat_account_end(cgrp, cstat);
+}
+
+void __cgroup_account_cputime_field(struct cgroup *cgrp,
+				    enum cpu_usage_stat index, u64 delta_exec)
+{
+	struct cgroup_cpu_stat *cstat;
+
+	cstat = cgroup_cpu_stat_account_begin(cgrp);
+
+	switch (index) {
+	case CPUTIME_USER:
+	case CPUTIME_NICE:
+		cstat->cputime.utime += delta_exec;
+		break;
+	case CPUTIME_SYSTEM:
+	case CPUTIME_IRQ:
+	case CPUTIME_SOFTIRQ:
+		cstat->cputime.stime += delta_exec;
+		break;
+	default:
+		break;
+	}
+
+	cgroup_cpu_stat_account_end(cgrp, cstat);
+}
+
+void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	u64 usage, utime, stime;
+
+	if (!cgroup_parent(cgrp))
+		return;
+
+	mutex_lock(&cgroup_stat_mutex);
+
+	cgroup_stat_flush_locked(cgrp);
+
+	usage = cgrp->stat.cputime.sum_exec_runtime;
+	cputime_adjust(&cgrp->stat.cputime, &cgrp->stat.prev_cputime,
+		       &utime, &stime);
+
+	mutex_unlock(&cgroup_stat_mutex);
+
+	do_div(usage, NSEC_PER_USEC);
+	do_div(utime, NSEC_PER_USEC);
+	do_div(stime, NSEC_PER_USEC);
+
+	seq_printf(seq, "%susage_usec %llu\n"
+		   "%suser_usec %llu\n"
+		   "%ssystem_usec %llu\n",
+		   prefix, usage, prefix, utime, prefix, stime);
+}
+
+int cgroup_stat_init(struct cgroup *cgrp)
+{
+	int cpu;
+
+	/* the root cgrp has cpu_stat preallocated */
+	if (!cgrp->cpu_stat) {
+		cgrp->cpu_stat = alloc_percpu(struct cgroup_cpu_stat);
+		if (!cgrp->cpu_stat)
+			return -ENOMEM;
+	}
+
+	/* ->updated_children list is self terminated */
+	for_each_possible_cpu(cpu)
+		cgroup_cpu_stat(cgrp, cpu)->updated_children = cgrp;
+
+	prev_cputime_init(&cgrp->stat.prev_cputime);
+
+	return 0;
+}
+
+void cgroup_stat_exit(struct cgroup *cgrp)
+{
+	int cpu;
+
+	cgroup_stat_flush(cgrp);
+
+	/* sanity check */
+	for_each_possible_cpu(cpu) {
+		struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
+
+		if (WARN_ON_ONCE(cstat->updated_children != cgrp) ||
+		    WARN_ON_ONCE(cstat->updated_next))
+			return;
+	}
+
+	free_percpu(cgrp->cpu_stat);
+	cgrp->cpu_stat = NULL;
+}
+
+void __init cgroup_stat_boot(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		raw_spin_lock_init(per_cpu_ptr(&cgroup_cpu_stat_lock, cpu));
+
+	WARN_ON(cgroup_stat_init(&cgrp_dfl_root.cgrp));
+}
-- 
2.9.3


* [PATCH v2 2/3] cpuacct: Introduce cgroup_account_cputime[_field]()
From: Tejun Heo @ 2017-08-11 17:28 UTC
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

From 1f88d33de13b02e2d6235c0d9451cd10c6e8210f Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Aug 2017 10:24:12 -0700

Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
and cpuacct_account_field().  This doesn't introduce any functional
changes and will be used to add cgroup basic resource accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
Fix build failure from missing include.

Thanks.

 include/linux/cgroup.h   | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.h   | 17 -----------------
 kernel/sched/cputime.c   |  2 +-
 kernel/sched/deadline.c  |  2 +-
 kernel/sched/fair.c      |  2 +-
 kernel/sched/rt.c        |  2 +-
 kernel/sched/sched.h     |  2 +-
 kernel/sched/stop_task.c |  2 +-
 8 files changed, 44 insertions(+), 23 deletions(-)
 delete mode 100644 kernel/sched/cpuacct.h

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 085056e..f395e02 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,7 @@
 #include <linux/nsproxy.h>
 #include <linux/user_namespace.h>
 #include <linux/refcount.h>
+#include <linux/kernel_stat.h>
 
 #include <linux/cgroup-defs.h>
 
@@ -675,6 +676,43 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 #endif /* !CONFIG_CGROUPS */
 
 /*
+ * Basic resource stats.
+ */
+#ifdef CONFIG_CGROUPS
+
+#ifdef CONFIG_CGROUP_CPUACCT
+void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+#else
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_account_field(struct task_struct *tsk, int index,
+					 u64 val) {}
+#endif
+
+static inline void cgroup_account_cputime(struct task_struct *task,
+					  u64 delta_exec)
+{
+	cpuacct_charge(task, delta_exec);
+}
+
+static inline void cgroup_account_cputime_field(struct task_struct *task,
+						enum cpu_usage_stat index,
+						u64 delta_exec)
+{
+	cpuacct_account_field(task, index, delta_exec);
+}
+
+#else	/* CONFIG_CGROUPS */
+
+static inline void cgroup_account_cputime(struct task_struct *task,
+					  u64 delta_exec) {}
+static inline void cgroup_account_cputime_field(struct task_struct *task,
+						enum cpu_usage_stat index,
+						u64 delta_exec) {}
+
+#endif	/* CONFIG_CGROUPS */
+
+/*
  * sock->sk_cgrp_data handling.  For more info, see sock_cgroup_data
  * definition in cgroup-defs.h.
  */
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
deleted file mode 100644
index ba72807..0000000
--- a/kernel/sched/cpuacct.h
+++ /dev/null
@@ -1,17 +0,0 @@
-#ifdef CONFIG_CGROUP_CPUACCT
-
-extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
-extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
-
-#else
-
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
-}
-
-static inline void
-cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
-{
-}
-
-#endif
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3a8bfcc..d13149d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -109,7 +109,7 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	 */
 	__this_cpu_add(kernel_cpustat.cpustat[index], tmp);
 
-	cpuacct_account_field(p, index, tmp);
+	cgroup_account_cputime_field(p, index, tmp);
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a84299f..4d5119f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1145,7 +1145,7 @@ static void update_curr_dl(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
-	cpuacct_charge(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..9b2e0b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -864,7 +864,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cpuacct_charge(curtask, delta_exec);
+		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 45caf93..ed6e95d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -979,7 +979,7 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
-	cpuacct_charge(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3..06f606c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -29,6 +29,7 @@
 #include <linux/irq_work.h>
 #include <linux/tick.h>
 #include <linux/slab.h>
+#include <linux/cgroup.h>
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
@@ -36,7 +37,6 @@
 
 #include "cpupri.h"
 #include "cpudeadline.h"
-#include "cpuacct.h"
 
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 9f69fb6..ec0bb5a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -71,7 +71,7 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
-	cpuacct_charge(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
-- 
2.9.3


* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
From: Waiman Long @ 2017-08-11 20:19 UTC
  To: Tejun Heo, lizefan, hannes, peterz, mingo
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

On 08/11/2017 12:37 PM, Tejun Heo wrote:
> In cgroup1, while cpuacct isn't actually controlling any resources, it
> is a separate controller due to combinaton of two factors -
> 1. enabling cpu controller has significant side effects, and 2. we
> have to pick one of the hierarchies to account CPU usages on.  cpuacct
> controller is effectively used to designate a hierarchy to track CPU
> usages on.
>
> cgroup2's unified hierarchy removes the second reason and we can
> account basic CPU usages by default.  While we can use cpuacct for
> this purpose, both its interface and implementation leave a lot to be
> desired - it collects and exposes two sources of truth which don't
> agree with each other and some of the exposed statistics don't make
> much sense.  Also, it propagates all the way up the hierarchy on each
> accounting event which is unnecessary.
>
> This patch adds basic resource accounting mechanism to cgroup2's
> unified hierarchy and accounts CPU usages using it.
>
> * All accountings are done per-cpu and don't propagate immediately.
>   It just bumps the per-cgroup per-cpu counters and links to the
>   parent's updated list if not already on it.
>
> * On a read, the per-cpu counters are collected into the global ones
>   and then propagated upwards.  Only the per-cpu counters which have
>   changed since the last read are propagated.
>
> * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
>   prefix.  Total usage is collected from scheduling events.  User/sys
>   breakdown is sourced from tick sampling and adjusted to the usage
>   using cputime_adjuts().

Typo: cputime_adjust()

>  	seq_printf(seq, "nr_dying_descendants %d\n",
>  		   cgroup->nr_dying_descendants);
>  
> +	cgroup_stat_show_cputime(seq, "cpu.");
> +
>  	return 0;
>  }
>  
> @@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
>  			 */
>  			cgroup_put(cgroup_parent(cgrp));
>  			kernfs_put(cgrp->kn);
> +			if (cgroup_on_dfl(cgrp))
> +				cgroup_stat_exit(cgrp);
>  			kfree(cgrp);
>  		} else {
>  			/*
> @@ -4510,6 +4516,9 @@ static void css_release_work_fn(struct work_struct *work)
>  		/* cgroup release path */
>  		trace_cgroup_release(cgrp);
>  
> +		if (cgroup_on_dfl(cgrp))
> +			cgroup_stat_flush(cgrp);
> +
>  		for (tcgrp = cgroup_parent(cgrp); tcgrp;
>  		     tcgrp = cgroup_parent(tcgrp))
>  			tcgrp->nr_dying_descendants--;
> @@ -4696,6 +4705,12 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>  	if (ret)
>  		goto out_free_cgrp;
>  
> +	if (cgroup_on_dfl(parent)) {
> +		ret = cgroup_stat_init(cgrp);
> +		if (ret)
> +			goto out_cancel_ref;
> +	}
> +
>  	/*
>  	 * Temporarily set the pointer to NULL, so idr_find() won't return
>  	 * a half-baked cgroup.
> @@ -4703,7 +4718,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>  	cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
>  	if (cgrp->id < 0) {
>  		ret = -ENOMEM;
> -		goto out_cancel_ref;
> +		goto out_stat_exit;
>  	}
>  
>  	init_cgroup_housekeeping(cgrp);
> @@ -4752,6 +4767,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>  
>  	return cgrp;
>  
> +out_stat_exit:
> +	if (cgroup_on_dfl(parent))
> +		cgroup_stat_exit(cgrp);
>  out_cancel_ref:
>  	percpu_ref_exit(&cgrp->self.refcnt);
>  out_free_cgrp:
> @@ -5146,6 +5164,8 @@ int __init cgroup_init(void)
>  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
>  	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
>  
> +	cgroup_stat_boot();
> +
>  	/*
>  	 * The latency of the synchronize_sched() is too high for cgroups,
>  	 * avoid it at the cost of forcing all readers into the slow path.
> diff --git a/kernel/cgroup/stat.c b/kernel/cgroup/stat.c
> new file mode 100644
> index 0000000..19a10b2
> --- /dev/null
> +++ b/kernel/cgroup/stat.c
> @@ -0,0 +1,333 @@
> +#include "cgroup-internal.h"
> +
> +#include <linux/sched/cputime.h>
> +
> +static DEFINE_MUTEX(cgroup_stat_mutex);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);

If the hierarchy is large enough and the stat data hasn't been read
recently, it may take a while to accumulate all the stat data even for
one cpu in cgroup_stat_flush_locked(). So I think it will make more
sense to use regular spinlock_t instead of raw_spinlock_t.

> +
> +static struct cgroup_cpu_stat *cgroup_cpu_stat(struct cgroup *cgrp, int cpu)
> +{
> +	return per_cpu_ptr(cgrp->cpu_stat, cpu);
> +}
> +
> +/**
> + * cgroup_cpu_stat_updated - keep track of updated cpu_stat
> + * @cgrp: target cgroup
> + * @cpu: cpu on which cpu_stat was updated
> + *
> + * @cgrp's cpu_stat on @cpu was updated.  Put it on the parent's matching
> + * cpu_stat->updated_children list.
> + */
> +static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
> +{
> +	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
> +	struct cgroup *parent;
> +	unsigned long flags;
> +
> +	/*
> +	 * Speculative already-on-list test.  This may race leading to
> +	 * temporary inaccuracies, which is fine.
> +	 *
> +	 * Because @parent's updated_children is terminated with @parent
> +	 * instead of NULL, we can tell whether @cgrp is on the list by
> +	 * testing the next pointer for NULL.
> +	 */
> +	if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
> +		return;
> +
> +	raw_spin_lock_irqsave(cpu_lock, flags);
> +
> +	/* put @cgrp and all ancestors on the corresponding updated lists */
> +	for (parent = cgroup_parent(cgrp); parent;
> +	     cgrp = parent, parent = cgroup_parent(cgrp)) {
> +		struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
> +		struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
> +
> +		/*
> +		 * Both additions and removals are bottom-up.  If a cgroup
> +		 * is already in the tree, all ancestors are.
> +		 */
> +		if (cstat->updated_next)
> +			break;
> +
> +		cstat->updated_next = pcstat->updated_children;
> +		pcstat->updated_children = cgrp;
> +	}
> +
> +	raw_spin_unlock_irqrestore(cpu_lock, flags);
> +}
> +
> +/**
> + * cgroup_cpu_stat_next_updated - iterate cpu_stat updated tree
> + * @pos: current position
> + * @root: root of the tree to traversal
> + * @cpu: target cpu
> + *
> + * Walks the updated cpu_stat tree on @cpu from @root.  %NULL @pos starts
> + * the traversal and %NULL return indicates the end.  During traversal,
> + * each returned cgroup is unlinked from the tree.  Must be called with the
> + * matching cgroup_cpu_stat_lock held.
> + *
> + * The only ordering guarantee is that, for a parent and a child pair
> + * covered by a given traversal, if a child is visited, its parent is
> + * guaranteed to be visited afterwards.
> + */
> +static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
> +						   struct cgroup *root, int cpu)

This function is trying to unwind one cgrp from the updated_children and
updated_next linkage. It is somewhat like the opposite of
cgroup_cpu_stat_updated(). I just feel like its name isn't intuitive
enough to convey what it is doing. Maybe use a name like
cgroup_cpu_stat_unlink_one() to match cgroup_cpu_stat_flush_one().

Cheers,
Longman

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-13 19:44       ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-13 19:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Waiman.

On Fri, Aug 11, 2017 at 04:19:07PM -0400, Waiman Long wrote:
> > * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
> >   prefix.  Total usage is collected from scheduling events.  User/sys
> >   breakdown is sourced from tick sampling and adjusted to the usage
> >   using cputime_adjuts().
> 
> Typo: cputime_adjust()

Oops, will fix.

> > +static DEFINE_MUTEX(cgroup_stat_mutex);
> > +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);
> 
> If the hierarchy is large enough and the stat data hasn't been read
> recently, it may take a while to accumulate all the stat data even for
> one cpu in cgroup_stat_flush_locked(). So I think it will make more
> sense to use regular spinlock_t instead of raw_spinlock_t.

They need to be raw spinlocks because the accounting side may grab
them while scheduling.  If the accumulating latency becomes
problematic, we can test for need_resched and spin_needbreak and break
out as necessary.  The iteration logic is built to allow that and an
earlier revision actually did that but I wasn't sure whether it's
actually necessary and removed it for simplicity.

If the latency ever becomes a problem, resurrecting that isn't
difficult at all.
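
For reference, the resurrected version would be something along these
lines in cgroup_stat_flush_locked() - a completely untested sketch:
the helper names are the ones from this thread, their signatures are
assumed, and only need_resched() is tested because spin_needbreak()
as-is operates on a regular spinlock_t:

static void cgroup_stat_flush_locked(struct cgroup *cgrp)
{
        int cpu;

        lockdep_assert_held(&cgroup_stat_mutex);

        for_each_possible_cpu(cpu) {
                raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock,
                                                       cpu);
                struct cgroup *pos = NULL;

                raw_spin_lock_irq(cpu_lock);
                while ((pos = cgroup_cpu_stat_pop_updated(pos, cgrp, cpu)))
                        cgroup_cpu_stat_flush_one(pos, cpu);
                raw_spin_unlock_irq(cpu_lock);

                /* bound the latency - play nice between cpus */
                cond_resched();
        }
}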

> > +static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
> > +						   struct cgroup *root, int cpu)
> 
> This function is trying to unwind one cgrp from the updated_children and
> updated_next linkage. It is somewhat like the opposite of
> cgroup_cpu_stat_updated(). I just feel like its name isn't intuitive
> enough to convey what it is doing. Maybe use a name like
> cgroup_cpu_stat_unlink_one() to match cgroup_cpu_stat_flush_one().

Hmm... the name comes from it being an iterator - most iterators are
named _next.  But, yeah, the name doesn't signify that it unlinks as
it goes along.  I'll rename it to cgroup_cpu_stat_pop_updated().
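
i.e. same signature, clearer name:

static struct cgroup *cgroup_cpu_stat_pop_updated(struct cgroup *pos,
                                                  struct cgroup *root, int cpu);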

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-13 23:20         ` Waiman Long
  0 siblings, 0 replies; 54+ messages in thread
From: Waiman Long @ 2017-08-13 23:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On 08/13/2017 03:44 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, Aug 11, 2017 at 04:19:07PM -0400, Waiman Long wrote:
>>> * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
>>>   prefix.  Total usage is collected from scheduling events.  User/sys
>>>   breakdown is sourced from tick sampling and adjusted to the usage
>>>   using cputime_adjuts().
>> Typo: cputime_adjust()
> Oops, will fix.
>
>>> +static DEFINE_MUTEX(cgroup_stat_mutex);
>>> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);
>> If the hierarchy is large enough and the stat data hasn't been read
>> recently, it may take a while to accumulate all the stat data even for
>> one cpu in cgroup_stat_flush_locked(). So I think it will make more
>> sense to use regular spinlock_t instead of raw_spinlock_t.
> They need to be raw spinlocks because the accounting side may grab
> them while scheduling.  If the accumulating latency becomes
> problematic, we can test for need_resched and spin_needbreak and break
> out as necessary.  The iteration logic is built to allow that and an
> earlier revision actually did that but I wasn't sure whether it's
> actually necessary and removed it for simplicity.
>
> If the latency ever becomes a problem, resurrecting that isn't
> difficult at all.

Right, I forgot they will be used by the scheduler.

I think it is prudent to limit the latency, but it can be another patch
when the need arises.

>>> +static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
>>> +						   struct cgroup *root, int cpu)
>> This function is trying to unwind one cgrp from the updated_children and
>> updated_next linkage. It is somewhat like the opposite of
>> cgroup_cpu_stat_updated(). I just feel like its name isn't intuitive
>> enough to convey what it is doing. Maybe use a name like
>> cgroup_cpu_stat_unlink_one() to match cgroup_cpu_stat_flush_one().
> Hmm... the name comes from it being an iterator - most iterators are
> named _next.  But, yeah, the name doesn't signify that it unlinks as
> it goes along.  I'll rename it to cgroup_cpu_stat_pop_updated().

I am fine with that.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
  2017-08-11 16:37 ` Tejun Heo
                   ` (3 preceding siblings ...)
  (?)
@ 2017-08-16 18:52 ` Tejun Heo
  2017-08-17  8:13   ` Ingo Molnar
  -1 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2017-08-16 18:52 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

On Fri, Aug 11, 2017 at 09:37:51AM -0700, Tejun Heo wrote:
> This patchset implement cgroup2 basic resource accounting and track
> CPU usages on all cgroups by default.

Ping.  I've renamed it as Waiman suggested.  If there's no further
objection, how should we route these patches and the following cpu
cgroup2 interface patch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
  2017-08-16 18:52 ` [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting Tejun Heo
@ 2017-08-17  8:13   ` Ingo Molnar
  2017-08-17 13:01     ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2017-08-17  8:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, peterz, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Fri, Aug 11, 2017 at 09:37:51AM -0700, Tejun Heo wrote:
> > This patchset implement cgroup2 basic resource accounting and track
> > CPU usages on all cgroups by default.
> 
> Ping.  I've renamed as Waiman suggested.  If there's no further
> objection, how should we route these patches and the following cpu
> cgroup2 interface patch?

Please hold off on applying it until PeterZ gets back from vacation and has had a 
chance to review it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
  2017-08-11 16:37 ` [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting Tejun Heo
@ 2017-08-17 12:07     ` Roman Gushchin
  2017-08-17 12:07     ` Roman Gushchin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Roman Gushchin @ 2017-08-17 12:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, peterz, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds

Hi Tejun!

On Fri, Aug 11, 2017 at 09:37:54AM -0700, Tejun Heo wrote:
> In cgroup1, while cpuacct isn't actually controlling any resources, it
> is a separate controller due to combinaton of two factors -

s/combinaton/combination

> @@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
>  			 */
>  			cgroup_put(cgroup_parent(cgrp));
>  			kernfs_put(cgrp->kn);
> +			if (cgroup_on_dfl(cgrp))
> +				cgroup_stat_exit(cgrp);

It looks like this "if (cgroup_on_dfl(cgrp))" here and in the similar spots
below effectively acts as "#ifdef CGROUP_V2". I wonder if it would be better
to move the check into the called function: cgroup_stat_exit() in this case.


> +void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix)
> +{

Are there any possible prefix values other than "cpu."?

> +void __init cgroup_stat_boot(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		raw_spin_lock_init(per_cpu_ptr(&cgroup_cpu_stat_lock, cpu));
> +
> +	WARN_ON(cgroup_stat_init(&cgrp_dfl_root.cgrp));

I'm not sure WARN_ON() is enough here: if cgroup_stat_init() returns -ENOMEM,
the subsequent oops is unavoidable, as the cpu_stat pointer is never checked.
But it's very unlikely, of course.


Overall looks very good!

Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
  2017-08-17  8:13   ` Ingo Molnar
@ 2017-08-17 13:01     ` Tejun Heo
  2017-08-17 15:03         ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2017-08-17 13:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: lizefan, hannes, peterz, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Ingo.

On Thu, Aug 17, 2017 at 10:13:04AM +0200, Ingo Molnar wrote:
> Please hold off on applying it until PeterZ gets back from vacation and has had a 
> chance to review it.

Of course.  These aren't going anywhere without Peter's explicit acks.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-17 13:13       ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-17 13:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lizefan, hannes, peterz, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds

Hello,

On Thu, Aug 17, 2017 at 01:07:41PM +0100, Roman Gushchin wrote:
> On Fri, Aug 11, 2017 at 09:37:54AM -0700, Tejun Heo wrote:
> > In cgroup1, while cpuacct isn't actually controlling any resources, it
> > is a separate controller due to combinaton of two factors -
> 
> s/combinaton/combination

Fixed.

> > @@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
> >  			 */
> >  			cgroup_put(cgroup_parent(cgrp));
> >  			kernfs_put(cgrp->kn);
> > +			if (cgroup_on_dfl(cgrp))
> > +				cgroup_stat_exit(cgrp);
> 
> It looks like this "if (cgroup_on_dfl(cgrp))" here and in the similar spots
> below effectively acts as "#ifdef CGROUP_V2". I wonder if it would be better
> to move the check into the called function: cgroup_stat_exit() in this case.

I have a slight preference for keeping these topology-aware tests on
the core / management part of code because that makes it obvious that
these stats aren't available for all cgroups.  Also, during cgroup
creation, because @cgrp isn't linked to its parent yet, we'd have to
pass @parent to cgroup_stat_init/exit() too.
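
To illustrate, pushing the test down would end up something like the
following on the init side - a sketch only, with the body approximated
from the structures in this patch rather than the actual
implementation:

int cgroup_stat_init(struct cgroup *cgrp, struct cgroup *parent)
{
        int cpu;

        /* the check moves here, but @parent now has to be passed in */
        if (!cgroup_on_dfl(parent))
                return 0;

        cgrp->cpu_stat = alloc_percpu(struct cgroup_cpu_stat);
        if (!cgrp->cpu_stat)
                return -ENOMEM;

        for_each_possible_cpu(cpu) {
                struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);

                cstat->updated_children = cgrp;
                u64_stats_init(&cstat->sync);
        }

        prev_cputime_init(&cgrp->stat.prev_cputime);
        return 0;
}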

> > +void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix)
> > +{
> 
> Are there any possible prefix values other than "cpu."?

Empty string when the stats are exposed through cpu.stat.
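
e.g. something like the following once the cpu controller interface
lands (a sketch - the callback and file names here are assumptions,
not the actual interface patch):

/* cpu.stat would reuse the same helper with an empty prefix */
static int cpu_stat_show(struct seq_file *seq, void *v)
{
        cgroup_stat_show_cputime(seq, "");
        return 0;
}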

> > +void __init cgroup_stat_boot(void)
> > +{
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu)
> > +		raw_spin_lock_init(per_cpu_ptr(&cgroup_cpu_stat_lock, cpu));
> > +
> > +	WARN_ON(cgroup_stat_init(&cgrp_dfl_root.cgrp));
> 
> I'm not sure WARN_ON() is enough here: if cgroup_stat_init() returns -ENOMEM,
> the subsequent oops is unavoidable, as the cpu_stat pointer is never checked.
> But it's very unlikely, of course.

Sure, will switch to BUG_ON().
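
i.e. the boot-time init failure then becomes fatal:

        BUG_ON(cgroup_stat_init(&cgrp_dfl_root.cgrp));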

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-08-17 15:03         ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2017-08-17 15:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, peterz, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> On Thu, Aug 17, 2017 at 10:13:04AM +0200, Ingo Molnar wrote:
> > Please hold off on applying it until PeterZ gets back from vacation and has had a 
> > chance to review it.
> 
> Of course.  These aren't going anywhere without Peter's explicit acks.

Thank you!

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-08-24 17:20   ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-24 17:20 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

A friendly ping.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-29 14:32     ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2017-08-29 14:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro


So I mostly like. On accounting it only adds to the immediate cgroup (if
it has a parent, aka !root).

On update it does a DFS of all sub-groups and propagates the deltas up
to the requested group.

On Fri, Aug 11, 2017 at 09:37:54AM -0700, Tejun Heo wrote:
> +/**
> + * cgroup_cpu_stat_updated - keep track of updated cpu_stat
> + * @cgrp: target cgroup
> + * @cpu: cpu on which cpu_stat was updated
> + *
> + * @cgrp's cpu_stat on @cpu was updated.  Put it on the parent's matching
> + * cpu_stat->updated_children list.
> + */
> +static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
> +{
> +	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
> +	struct cgroup *parent;
> +	unsigned long flags;
> +
> +	/*
> +	 * Speculative already-on-list test.  This may race leading to
> +	 * temporary inaccuracies, which is fine.
> +	 *
> +	 * Because @parent's updated_children is terminated with @parent
> +	 * instead of NULL, we can tell whether @cgrp is on the list by
> +	 * testing the next pointer for NULL.
> +	 */
> +	if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
> +		return;
> +
> +	raw_spin_lock_irqsave(cpu_lock, flags);
> +
> +	/* put @cgrp and all ancestors on the corresponding updated lists */
> +	for (parent = cgroup_parent(cgrp); parent;
> +	     cgrp = parent, parent = cgroup_parent(cgrp)) {
> +		struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
> +		struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
> +
> +		/*
> +		 * Both additions and removals are bottom-up.  If a cgroup
> +		 * is already in the tree, all ancestors are.
> +		 */
> +		if (cstat->updated_next)
> +			break;
> +
> +		cstat->updated_next = pcstat->updated_children;
> +		pcstat->updated_children = cgrp;
> +	}
> +
> +	raw_spin_unlock_irqrestore(cpu_lock, flags);
> +}

> +static struct cgroup_cpu_stat *cgroup_cpu_stat_account_begin(struct cgroup *cgrp)
> +{
> +	struct cgroup_cpu_stat *cstat;
> +
> +	cstat = get_cpu_ptr(cgrp->cpu_stat);
> +	u64_stats_update_begin(&cstat->sync);
> +	return cstat;
> +}
> +
> +static void cgroup_cpu_stat_account_end(struct cgroup *cgrp,
> +					struct cgroup_cpu_stat *cstat)
> +{
> +	u64_stats_update_end(&cstat->sync);
> +	cgroup_cpu_stat_updated(cgrp, smp_processor_id());
> +	put_cpu_ptr(cstat);
> +}
> +
> +void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
> +{
> +	struct cgroup_cpu_stat *cstat;
> +
> +	cstat = cgroup_cpu_stat_account_begin(cgrp);
> +	cstat->cputime.sum_exec_runtime += delta_exec;
> +	cgroup_cpu_stat_account_end(cgrp, cstat);
> +}

What I don't get is why you need cgroup_cpu_stat_updated(). That is, I
see you use it to keep the DFS 'stack' up-to-date, but what I
don't see is why you'd need that.

Have a look at walk_tg_tree_from(), I think we can do something like
that on struct cgroup_subsys_state, it has that children list and the
parent pointer.

And yes, walk_tg_tree_from() is tricky, it always takes a fair while to
remember how it works.
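
To illustrate, transcribed onto css it would be something like the
below (completely untested; ->children, ->sibling and ->parent are the
existing cgroup_subsys_state fields, and the caller would need to hold
rcu_read_lock()):

static int walk_css_tree_from(struct cgroup_subsys_state *from,
                              int (*down)(struct cgroup_subsys_state *, void *),
                              int (*up)(struct cgroup_subsys_state *, void *),
                              void *data)
{
        struct cgroup_subsys_state *parent, *child;
        int ret;

        parent = from;
down:
        /* visit @parent before any of its children */
        ret = down(parent, data);
        if (ret)
                goto out;
        list_for_each_entry_rcu(child, &parent->children, sibling) {
                parent = child;
                goto down;
up:
                continue;
        }
        /* all children done - visit @parent on the way back up */
        ret = up(parent, data);
        if (ret || parent == from)
                goto out;

        child = parent;
        parent = parent->parent;
        if (parent)
                goto up;
out:
        return ret;
}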

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-29 15:24       ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-29 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Tue, Aug 29, 2017 at 04:32:52PM +0200, Peter Zijlstra wrote:
> So I mostly like. On accounting it only adds to the immediate cgroup (if
> it has a parent, aka !root).
> 
> On update it does a DFS of all sub-groups and propagates the deltas up
> to the requested group.
...
> What I don't get is why you need cgroup_cpu_stat_updated(). That is, I
> see you use it to keep the DFS 'stack' up-to-date, but what I
> don't see is why you'd need that.

That is to make reading stats O(number of descendants which have been
active since last read) instead of O(number of all descendants) as
there can be a lot of not-too-active cgroups in a system.  Stat
reading can be frequent, so the combination can get really bad.  By
keeping the updated list separate, increasing read frequency decreases
the cost of each read.

Also, please note that a system may end up with a lot of cgroups
without the user intending to.  memcg drains removed cgroups lazily
and the number of draining cgroups can reach very high numbers if the
system isn't under memory pressure.  The plan is to add basic stats
for other resources too and keeping it scalable w.r.t. idle cgroups
allows using the same mechanism for all resources.

> Have a look at walk_tg_tree_from(), I think we can do something like
> that on struct cgroup_subsys_state, it has that children list and the
> parent pointer.
> 
> And yes, walk_tg_tree_from() is tricky, it always takes a fair while to
> remember how it works.

We can propagate an "updated" flag up the tree (we need to, otherwise we
can't tell which subtree to descend into) and prune the iteration on
subtrees which haven't been updated; however, this can still become
very costly depending on the topology as it can't jump over the
siblings which haven't been updated.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-29 15:32         ` Waiman Long
  0 siblings, 0 replies; 54+ messages in thread
From: Waiman Long @ 2017-08-29 15:32 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: lizefan, hannes, mingo, cgroups, linux-kernel, kernel-team, pjt,
	luto, efault, torvalds, guro

On 08/29/2017 11:24 AM, Tejun Heo wrote:
> Hello, Peter.
>
> On Tue, Aug 29, 2017 at 04:32:52PM +0200, Peter Zijlstra wrote:
>> So I mostly like. On accounting it only adds to the immediate cgroup (if
>> it has a parent, aka !root).
>>
>> On update it does a DFS of all sub-groups and propagates the deltas up
>> to the requested group.
> ...
>> What I don't get is why you need cgroup_cpu_stat_updated(). That is, I
>> see you use it to keep the DFS 'stack' up-to-date, but what I
>> don't see is why you'd need that.
> That is to make reading stats O(number of descendants which have been
> active since last read) instead of O(number of all descendants) as
> there can be a lot of not-too-active cgroups in a system.  Stat
> reading can be frequent, so the combination can get really bad.  By
> keeping the updated list separate, increasing read frequency decreases
> the cost of each read.
>
> Also, please note that a system may end up with a lot of cgroups
> without the user intending to.  memcg drains removed cgroups lazily
> and the number of draining cgroups can reach very high numbers if the
> system isn't under memory pressure.  The plan is to add basic stats
> for other resources too and keeping it scalable w.r.t. idle cgroups
> allows using the same mechanism for all resources.

I think it will be good to put the rationale behind this function as a
comment in the code.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
  2017-08-29 15:32         ` Waiman Long
@ 2017-08-29 15:42         ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-29 15:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, lizefan, hannes, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Waiman.

On Tue, Aug 29, 2017 at 11:32:35AM -0400, Waiman Long wrote:
> > Also, please note that a system may end up with a lot of cgroups
> > without the user intending to.  memcg drains removed cgroups lazily
> > and the number of draining cgroups can reach very high numbers if the
> > system isn't under memory pressure.  The plan is to add basic stats
> > for other resources too and keeping it scalable w.r.t. idle cgroups
> > allows using the same mechanism for all resources.
> 
> I think it will be good to put the rationale behind this function as a
> comment in the code.

Ah, good point.  I'll beef up the overall comment in the header file
and point to that from the function comment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-29 15:59         ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2017-08-29 15:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Tue, Aug 29, 2017 at 08:24:27AM -0700, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, Aug 29, 2017 at 04:32:52PM +0200, Peter Zijlstra wrote:
> > So I mostly like. On accounting it only adds to the immediate cgroup (if
> > it has a parent, aka !root).
> > 
> > On update it does a DFS of all sub-groups and propagates the deltas up
> > to the requested group.
> ...
> > What I don't get is why you need cgroup_cpu_stat_updated(). That is, I
> > see you use it to keep the DFS 'stack' up-to-date, but what I
> > don't see is why you'd need that.
> 
> That is to make reading stats O(number of descendants which have been
> active since last read) instead of O(number of all descendants) as
> there can be a lot of not-too-active cgroups in a system.  Stat
> reading can be frequent, so the combination can get really bad.  By
> keeping the updated list separate, increasing read frequency decreases
> the cost of each read.

Hmm, and when it's never read, we only do the O(depth) thing _once_,
right?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-29 17:43     ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-29 17:43 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to a combination of two factors -
1. enabling the cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on.  The
cpuacct controller is effectively used to designate a hierarchy to
track CPU usages on.

cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default.  While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense.  Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.

This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.

* All accounting is done per-cpu and doesn't propagate immediately.
  An update just bumps the per-cgroup per-cpu counters and links the
  cgroup to the parent's updated list if not already on it.

* On a read, the per-cpu counters are collected into the global ones
  and then propagated upwards.  Only the per-cpu counters which have
  changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
  prefix.  Total usage is collected from scheduling events.  User/sys
  breakdown is sourced from tick sampling and adjusted to the usage
  using cputime_adjust().

This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).

v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Roman Gushchin <guro@fb.com>
---
 Documentation/cgroup-v2.txt     |    9 +
 include/linux/cgroup-defs.h     |   57 ++++++
 include/linux/cgroup.h          |   22 ++
 kernel/cgroup/Makefile          |    2 
 kernel/cgroup/cgroup-internal.h |    8 
 kernel/cgroup/cgroup.c          |   24 ++
 kernel/cgroup/stat.c            |  334 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 453 insertions(+), 3 deletions(-)

--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -886,6 +886,15 @@ All cgroup core files are prefixed with
 		A dying cgroup can consume system resources not exceeding
 		limits, which were active at the moment of cgroup deletion.
 
+	  cpu.usage_usec
+		CPU time consumed in the subtree.
+
+	  cpu.user_usec
+		User CPU time consumed in the subtree.
+
+	  cpu.system_usec
+		System CPU time consumed in the subtree.
+
 
 Controllers
 ===========
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include <linux/refcount.h>
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/u64_stats_sync.h>
 #include <linux/workqueue.h>
 #include <linux/bpf-cgroup.h>
 
@@ -249,6 +250,57 @@ struct css_set {
 	struct rcu_head rcu_head;
 };
 
+/*
+ * cgroup basic resource usage statistics.  Accounting is done per-cpu in
+ * cgroup_cpu_stat which is then lazily propagated up the hierarchy on
+ * reads.
+ *
+ * When a stat gets updated, the cgroup_cpu_stat and its ancestors are
+ * linked into the updated tree.  On the following read, propagation only
+ * considers and consumes the updated tree.  This makes reading O(the
+ * number of descendants which have been active since last read) instead of
+ * O(the total number of descendants).
+ *
+ * This is important because there can be a lot of (draining) cgroups which
+ * aren't active and stat may be read frequently.  The combination can
+ * become very expensive.  By propagating selectively, increasing reading
+ * frequency decreases the cost of each read.
+ */
+struct cgroup_cpu_stat {
+	/*
+	 * ->sync protects all the current counters.  These are the only
+	 * fields which get updated in the hot path.
+	 */
+	struct u64_stats_sync sync;
+	struct task_cputime cputime;
+
+	/*
+	 * Snapshots at the last reading.  These are used to calculate the
+	 * deltas to propagate to the global counters.
+	 */
+	struct task_cputime last_cputime;
+
+	/*
+	 * Child cgroups with stat updates on this cpu since the last read
+	 * are linked on the parent's ->updated_children through
+	 * ->updated_next.
+	 *
+	 * In addition to being more compact, singly-linked list pointing
+	 * to the cgroup makes it unnecessary for each per-cpu struct to
+	 * point back to the associated cgroup.
+	 *
+	 * Protected by per-cpu cgroup_cpu_stat_lock.
+	 */
+	struct cgroup *updated_children;	/* terminated by self cgroup */
+	struct cgroup *updated_next;		/* NULL iff not on the list */
+};
+
+struct cgroup_stat {
+	/* per-cpu statistics are collected into the following global counters */
+	struct task_cputime cputime;
+	struct prev_cputime prev_cputime;
+};
+
 struct cgroup {
 	/* self css with NULL ->ss, points back to this cgroup */
 	struct cgroup_subsys_state self;
@@ -348,6 +400,11 @@ struct cgroup {
 	 */
 	struct cgroup *dom_cgrp;
 
+	/* cgroup basic resource statistics */
+	struct cgroup_cpu_stat __percpu *cpu_stat;
+	struct cgroup_stat pending_stat;	/* pending from children */
+	struct cgroup_stat stat;
+
 	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -689,17 +689,39 @@ static inline void cpuacct_account_field
 					 u64 val) {}
 #endif
 
+void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix);
+
+void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
+void __cgroup_account_cputime_field(struct cgroup *cgrp,
+				    enum cpu_usage_stat index, u64 delta_exec);
+
 static inline void cgroup_account_cputime(struct task_struct *task,
 					  u64 delta_exec)
 {
+	struct cgroup *cgrp;
+
 	cpuacct_charge(task, delta_exec);
+
+	rcu_read_lock();
+	cgrp = task_dfl_cgroup(task);
+	if (cgroup_parent(cgrp))
+		__cgroup_account_cputime(cgrp, delta_exec);
+	rcu_read_unlock();
 }
 
 static inline void cgroup_account_cputime_field(struct task_struct *task,
 						enum cpu_usage_stat index,
 						u64 delta_exec)
 {
+	struct cgroup *cgrp;
+
 	cpuacct_account_field(task, index, delta_exec);
+
+	rcu_read_lock();
+	cgrp = task_dfl_cgroup(task);
+	if (cgroup_parent(cgrp))
+		__cgroup_account_cputime_field(cgrp, index, delta_exec);
+	rcu_read_unlock();
 }
 
 #else	/* CONFIG_CGROUPS */
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -1,4 +1,4 @@
-obj-y := cgroup.o namespace.o cgroup-v1.o
+obj-y := cgroup.o stat.o namespace.o cgroup-v1.o
 
 obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -197,6 +197,14 @@ int cgroup_show_path(struct seq_file *sf
 int cgroup_task_count(const struct cgroup *cgrp);
 
 /*
+ * stat.c
+ */
+void cgroup_stat_flush(struct cgroup *cgrp);
+int cgroup_stat_init(struct cgroup *cgrp);
+void cgroup_stat_exit(struct cgroup *cgrp);
+void cgroup_stat_boot(void);
+
+/*
  * namespace.c
  */
 extern const struct proc_ns_operations cgroupns_operations;
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -142,12 +142,14 @@ static struct static_key_true *cgroup_su
 };
 #undef SUBSYS
 
+static DEFINE_PER_CPU(struct cgroup_cpu_stat, cgrp_dfl_root_cpu_stat);
+
 /*
  * The default hierarchy, reserved for the subsystems that are otherwise
  * unattached - it never has more than a single cgroup, and all tasks are
  * part of that cgroup.
  */
-struct cgroup_root cgrp_dfl_root;
+struct cgroup_root cgrp_dfl_root = { .cgrp.cpu_stat = &cgrp_dfl_root_cpu_stat };
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
 
 /*
@@ -3296,6 +3298,8 @@ static int cgroup_stat_show(struct seq_f
 	seq_printf(seq, "nr_dying_descendants %d\n",
 		   cgroup->nr_dying_descendants);
 
+	cgroup_stat_show_cputime(seq, "cpu.");
+
 	return 0;
 }
 
@@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work
 			 */
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
+			if (cgroup_on_dfl(cgrp))
+				cgroup_stat_exit(cgrp);
 			kfree(cgrp);
 		} else {
 			/*
@@ -4510,6 +4516,9 @@ static void css_release_work_fn(struct w
 		/* cgroup release path */
 		trace_cgroup_release(cgrp);
 
+		if (cgroup_on_dfl(cgrp))
+			cgroup_stat_flush(cgrp);
+
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
 		     tcgrp = cgroup_parent(tcgrp))
 			tcgrp->nr_dying_descendants--;
@@ -4696,6 +4705,12 @@ static struct cgroup *cgroup_create(stru
 	if (ret)
 		goto out_free_cgrp;
 
+	if (cgroup_on_dfl(parent)) {
+		ret = cgroup_stat_init(cgrp);
+		if (ret)
+			goto out_cancel_ref;
+	}
+
 	/*
 	 * Temporarily set the pointer to NULL, so idr_find() won't return
 	 * a half-baked cgroup.
@@ -4703,7 +4718,7 @@ static struct cgroup *cgroup_create(stru
 	cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
 	if (cgrp->id < 0) {
 		ret = -ENOMEM;
-		goto out_cancel_ref;
+		goto out_stat_exit;
 	}
 
 	init_cgroup_housekeeping(cgrp);
@@ -4752,6 +4767,9 @@ static struct cgroup *cgroup_create(stru
 
 	return cgrp;
 
+out_stat_exit:
+	if (cgroup_on_dfl(parent))
+		cgroup_stat_exit(cgrp);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
@@ -5146,6 +5164,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
 
+	cgroup_stat_boot();
+
 	/*
 	 * The latency of the synchronize_sched() is too high for cgroups,
 	 * avoid it at the cost of forcing all readers into the slow path.
--- /dev/null
+++ b/kernel/cgroup/stat.c
@@ -0,0 +1,334 @@
+#include "cgroup-internal.h"
+
+#include <linux/sched/cputime.h>
+
+static DEFINE_MUTEX(cgroup_stat_mutex);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);
+
+static struct cgroup_cpu_stat *cgroup_cpu_stat(struct cgroup *cgrp, int cpu)
+{
+	return per_cpu_ptr(cgrp->cpu_stat, cpu);
+}
+
+/**
+ * cgroup_cpu_stat_updated - keep track of updated cpu_stat
+ * @cgrp: target cgroup
+ * @cpu: cpu on which cpu_stat was updated
+ *
+ * @cgrp's cpu_stat on @cpu was updated.  Put it on the parent's matching
+ * cpu_stat->updated_children list.  See the comment on top of
+ * cgroup_cpu_stat definition for details.
+ */
+static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
+	struct cgroup *parent;
+	unsigned long flags;
+
+	/*
+	 * Speculative already-on-list test.  This may race leading to
+	 * temporary inaccuracies, which is fine.
+	 *
+	 * Because @parent's updated_children is terminated with @parent
+	 * instead of NULL, we can tell whether @cgrp is on the list by
+	 * testing the next pointer for NULL.
+	 */
+	if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
+		return;
+
+	raw_spin_lock_irqsave(cpu_lock, flags);
+
+	/* put @cgrp and all ancestors on the corresponding updated lists */
+	for (parent = cgroup_parent(cgrp); parent;
+	     cgrp = parent, parent = cgroup_parent(cgrp)) {
+		struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
+		struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
+
+		/*
+		 * Both additions and removals are bottom-up.  If a cgroup
+		 * is already in the tree, all ancestors are.
+		 */
+		if (cstat->updated_next)
+			break;
+
+		cstat->updated_next = pcstat->updated_children;
+		pcstat->updated_children = cgrp;
+	}
+
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+}
+
+/**
+ * cgroup_cpu_stat_pop_updated - iterate and dismantle cpu_stat updated tree
+ * @pos: current position
+ * @root: root of the tree to traverse
+ * @cpu: target cpu
+ *
+ * Walks the updated cpu_stat tree on @cpu from @root.  %NULL @pos starts
+ * the traversal and %NULL return indicates the end.  During traversal,
+ * each returned cgroup is unlinked from the tree.  Must be called with the
+ * matching cgroup_cpu_stat_lock held.
+ *
+ * The only ordering guarantee is that, for a parent and a child pair
+ * covered by a given traversal, if a child is visited, its parent is
+ * guaranteed to be visited afterwards.
+ */
+static struct cgroup *cgroup_cpu_stat_pop_updated(struct cgroup *pos,
+						  struct cgroup *root, int cpu)
+{
+	struct cgroup_cpu_stat *cstat;
+	struct cgroup *parent;
+
+	if (pos == root)
+		return NULL;
+
+	/*
+	 * We're gonna walk down to the first leaf and visit/remove it.  We
+	 * can pick any unvisited node as the starting point.
+	 */
+	if (!pos)
+		pos = root;
+	else
+		pos = cgroup_parent(pos);
+
+	/* walk down to the first leaf */
+	while (true) {
+		cstat = cgroup_cpu_stat(pos, cpu);
+		if (cstat->updated_children == pos)
+			break;
+		pos = cstat->updated_children;
+	}
+
+	/*
+	 * Unlink @pos from the tree.  As the updated_children list is
+	 * singly linked, we have to walk it to find the removal point.
+	 * However, due to the way we traverse, @pos will be the first
+	 * child in most cases. The only exception is @root.
+	 */
+	parent = cgroup_parent(pos);
+	if (parent && cstat->updated_next) {
+		struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
+		struct cgroup_cpu_stat *ncstat;
+		struct cgroup **nextp;
+
+		nextp = &pcstat->updated_children;
+		while (true) {
+			ncstat = cgroup_cpu_stat(*nextp, cpu);
+			if (*nextp == pos)
+				break;
+
+			WARN_ON_ONCE(*nextp == parent);
+			nextp = &ncstat->updated_next;
+		}
+
+		*nextp = cstat->updated_next;
+		cstat->updated_next = NULL;
+	}
+
+	return pos;
+}
+
+static void cgroup_stat_accumulate(struct cgroup_stat *dst_stat,
+				   struct cgroup_stat *src_stat)
+{
+	dst_stat->cputime.utime += src_stat->cputime.utime;
+	dst_stat->cputime.stime += src_stat->cputime.stime;
+	dst_stat->cputime.sum_exec_runtime += src_stat->cputime.sum_exec_runtime;
+}
+
+static void cgroup_cpu_stat_flush_one(struct cgroup *cgrp, int cpu)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+	struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
+	struct task_cputime *last_cputime = &cstat->last_cputime;
+	struct task_cputime cputime;
+	struct cgroup_stat delta;
+	unsigned seq;
+
+	lockdep_assert_held(&cgroup_stat_mutex);
+
+	/* fetch the current per-cpu values */
+	do {
+		seq = __u64_stats_fetch_begin(&cstat->sync);
+		cputime = cstat->cputime;
+	} while (__u64_stats_fetch_retry(&cstat->sync, seq));
+
+	/* accumulate the deltas to propagate */
+	delta.cputime.utime = cputime.utime - last_cputime->utime;
+	delta.cputime.stime = cputime.stime - last_cputime->stime;
+	delta.cputime.sum_exec_runtime = cputime.sum_exec_runtime -
+					 last_cputime->sum_exec_runtime;
+	*last_cputime = cputime;
+
+	/* transfer the pending stat into delta */
+	cgroup_stat_accumulate(&delta, &cgrp->pending_stat);
+	memset(&cgrp->pending_stat, 0, sizeof(cgrp->pending_stat));
+
+	/* propagate delta into the global stat and the parent's pending */
+	cgroup_stat_accumulate(&cgrp->stat, &delta);
+	if (parent)
+		cgroup_stat_accumulate(&parent->pending_stat, &delta);
+}
+
+/* see cgroup_stat_flush() */
+static void cgroup_stat_flush_locked(struct cgroup *cgrp)
+{
+	int cpu;
+
+	lockdep_assert_held(&cgroup_stat_mutex);
+
+	for_each_possible_cpu(cpu) {
+		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
+		struct cgroup *pos = NULL;
+
+		raw_spin_lock_irq(cpu_lock);
+		while ((pos = cgroup_cpu_stat_pop_updated(pos, cgrp, cpu)))
+			cgroup_cpu_stat_flush_one(pos, cpu);
+		raw_spin_unlock_irq(cpu_lock);
+	}
+}
+
+/**
+ * cgroup_stat_flush - flush stats in @cgrp's subtree
+ * @cgrp: target cgroup
+ *
+ * Collect all per-cpu stats in @cgrp's subtree into the global counters
+ * and propagate them upwards.  After this function returns, all cgroups in
+ * the subtree have up-to-date ->stat.
+ *
+ * This also gets all cgroups in the subtree including @cgrp off the
+ * ->updated_children lists.
+ */
+void cgroup_stat_flush(struct cgroup *cgrp)
+{
+	mutex_lock(&cgroup_stat_mutex);
+	cgroup_stat_flush_locked(cgrp);
+	mutex_unlock(&cgroup_stat_mutex);
+}
+
+static struct cgroup_cpu_stat *cgroup_cpu_stat_account_begin(struct cgroup *cgrp)
+{
+	struct cgroup_cpu_stat *cstat;
+
+	cstat = get_cpu_ptr(cgrp->cpu_stat);
+	u64_stats_update_begin(&cstat->sync);
+	return cstat;
+}
+
+static void cgroup_cpu_stat_account_end(struct cgroup *cgrp,
+					struct cgroup_cpu_stat *cstat)
+{
+	u64_stats_update_end(&cstat->sync);
+	cgroup_cpu_stat_updated(cgrp, smp_processor_id());
+	put_cpu_ptr(cstat);
+}
+
+void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
+{
+	struct cgroup_cpu_stat *cstat;
+
+	cstat = cgroup_cpu_stat_account_begin(cgrp);
+	cstat->cputime.sum_exec_runtime += delta_exec;
+	cgroup_cpu_stat_account_end(cgrp, cstat);
+}
+
+void __cgroup_account_cputime_field(struct cgroup *cgrp,
+				    enum cpu_usage_stat index, u64 delta_exec)
+{
+	struct cgroup_cpu_stat *cstat;
+
+	cstat = cgroup_cpu_stat_account_begin(cgrp);
+
+	switch (index) {
+	case CPUTIME_USER:
+	case CPUTIME_NICE:
+		cstat->cputime.utime += delta_exec;
+		break;
+	case CPUTIME_SYSTEM:
+	case CPUTIME_IRQ:
+	case CPUTIME_SOFTIRQ:
+		cstat->cputime.stime += delta_exec;
+		break;
+	default:
+		break;
+	}
+
+	cgroup_cpu_stat_account_end(cgrp, cstat);
+}
+
+void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	u64 usage, utime, stime;
+
+	if (!cgroup_parent(cgrp))
+		return;
+
+	mutex_lock(&cgroup_stat_mutex);
+
+	cgroup_stat_flush_locked(cgrp);
+
+	usage = cgrp->stat.cputime.sum_exec_runtime;
+	cputime_adjust(&cgrp->stat.cputime, &cgrp->stat.prev_cputime,
+		       &utime, &stime);
+
+	mutex_unlock(&cgroup_stat_mutex);
+
+	do_div(usage, NSEC_PER_USEC);
+	do_div(utime, NSEC_PER_USEC);
+	do_div(stime, NSEC_PER_USEC);
+
+	seq_printf(seq, "%susage_usec %llu\n"
+		   "%suser_usec %llu\n"
+		   "%ssystem_usec %llu\n",
+		   prefix, usage, prefix, utime, prefix, stime);
+}
+
+int cgroup_stat_init(struct cgroup *cgrp)
+{
+	int cpu;
+
+	/* the root cgrp has cpu_stat preallocated */
+	if (!cgrp->cpu_stat) {
+		cgrp->cpu_stat = alloc_percpu(struct cgroup_cpu_stat);
+		if (!cgrp->cpu_stat)
+			return -ENOMEM;
+	}
+
+	/* ->updated_children list is self terminated */
+	for_each_possible_cpu(cpu)
+		cgroup_cpu_stat(cgrp, cpu)->updated_children = cgrp;
+
+	prev_cputime_init(&cgrp->stat.prev_cputime);
+
+	return 0;
+}
+
+void cgroup_stat_exit(struct cgroup *cgrp)
+{
+	int cpu;
+
+	cgroup_stat_flush(cgrp);
+
+	/* sanity check */
+	for_each_possible_cpu(cpu) {
+		struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
+
+		if (WARN_ON_ONCE(cstat->updated_children != cgrp) ||
+		    WARN_ON_ONCE(cstat->updated_next))
+			return;
+	}
+
+	free_percpu(cgrp->cpu_stat);
+	cgrp->cpu_stat = NULL;
+}
+
+void __init cgroup_stat_boot(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		raw_spin_lock_init(per_cpu_ptr(&cgroup_cpu_stat_lock, cpu));
+
+	BUG_ON(cgroup_stat_init(&cgrp_dfl_root.cgrp));
+}
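
To eyeball the result, the new keys show up in "cgroup.stat" of any
non-root cgroup.  A trivial reader - the cgroup path here is made up
for the example, use whatever cgroup you're testing with:

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/test/cgroup.stat";
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	/* prints cpu.usage_usec, cpu.user_usec and cpu.system_usec */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "cpu.", 4))
			fputs(line, stdout);
	fclose(f);
	return 0;
}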

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
@ 2017-08-29 18:06       ` Waiman Long
  0 siblings, 0 replies; 54+ messages in thread
From: Waiman Long @ 2017-08-29 18:06 UTC (permalink / raw)
  To: Tejun Heo, lizefan, hannes, peterz, mingo
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

On 08/29/2017 01:43 PM, Tejun Heo wrote:
> In cgroup1, while cpuacct isn't actually controlling any resources, it
> is a separate controller due to a combination of two factors -
> 1. enabling the cpu controller has significant side effects, and 2. we
> have to pick one of the hierarchies to account CPU usages on.  The
> cpuacct controller is effectively used to designate a hierarchy to
> track CPU usages on.
>
> cgroup2's unified hierarchy removes the second reason and we can
> account basic CPU usages by default.  While we can use cpuacct for
> this purpose, both its interface and implementation leave a lot to be
> desired - it collects and exposes two sources of truth which don't
> agree with each other and some of the exposed statistics don't make
> much sense.  Also, it propagates all the way up the hierarchy on each
> accounting event which is unnecessary.
>
> This patch adds basic resource accounting mechanism to cgroup2's
> unified hierarchy and accounts CPU usages using it.
>
> * All accounting is done per-cpu and doesn't propagate immediately.
>   An update just bumps the per-cgroup per-cpu counters and links the
>   cgroup to the parent's updated list if not already on it.
>
> * On a read, the per-cpu counters are collected into the global ones
>   and then propagated upwards.  Only the per-cpu counters which have
>   changed since the last read are propagated.
>
> * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
>   prefix.  Total usage is collected from scheduling events.  User/sys
>   breakdown is sourced from tick sampling and adjusted to the usage
>   using cputime_adjust().
>
> This keeps the accounting side hot path O(1) and per-cpu and the read
> side O(nr_updated_since_last_read).
>
> v2: Minor changes and documentation updates as suggested by Waiman and
>     Roman.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Waiman Long <waiman.long@hpe.com>

I just noticed that you have used my obsolete email address in the cc
line. Would you mind changing it to the one that I am currently using?

Thanks,
Longman

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
  2017-08-29 18:06       ` Waiman Long
@ 2017-08-29 18:10       ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-08-29 18:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Tue, Aug 29, 2017 at 02:06:24PM -0400, Waiman Long wrote:
> > Cc: Waiman Long <waiman.long@hpe.com>
> 
> I just noticed that you have used my obsolete email address in the cc
> line. Would you mind changing it to the one that I am currently using?

Ah, my apologies.  Will update.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-09-22 18:05   ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-09-22 18:05 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Peter, ping?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-09-23 13:37     ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2017-09-23 13:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Fri, Sep 22, 2017 at 11:05:30AM -0700, Tejun Heo wrote:
> Peter, ping?

Hmm. So I think I was good, but I was under the impression you'd send
a new version better explaining the need/design of that iteration stuff.
Could be I lost the plot somehow.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-09-23 13:44       ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-09-23 13:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Sat, Sep 23, 2017 at 03:37:31PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 11:05:30AM -0700, Tejun Heo wrote:
> > Peter, ping?
> 
> Hmm. So I think I was good, but I was under the impression you'd send
> a new version better explaining the need/design of that iteration stuff.
> Could be I lost the plot somehow.

The updated version was posted quite a while ago.

  http://lkml.kernel.org/r/20170829174325.GS491396@devbig577.frc2.facebook.com

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-09-25  7:27         ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2017-09-25  7:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Sat, Sep 23, 2017 at 06:44:02AM -0700, Tejun Heo wrote:
> Hello, Peter.
> 
> On Sat, Sep 23, 2017 at 03:37:31PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 22, 2017 at 11:05:30AM -0700, Tejun Heo wrote:
> > > Peter, ping?
> > 
> > Hmm. So I think I was good, but I was under the impression you'd send
> > a new version better explaining the need/design of that iteration stuff.
> > Could be I lost the plot somehow.
> 
> The updated version was posted quite a while ago.
> 
>   http://lkml.kernel.org/r/20170829174325.GS491396@devbig577.frc2.facebook.com


Oh, that's in the same thread, wasn't expecting that. Looks ok, I suppose.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting
@ 2017-09-25 15:07           ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-09-25 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello,

On Mon, Sep 25, 2017 at 09:27:18AM +0200, Peter Zijlstra wrote:
> On Sat, Sep 23, 2017 at 06:44:02AM -0700, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Sat, Sep 23, 2017 at 03:37:31PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 22, 2017 at 11:05:30AM -0700, Tejun Heo wrote:
> > > > Peter, ping?
> > > 
> > > Hmm. So I think I was good, but I was under the impression you'd send
> > > a new version better explaining the need/design of that iteration stuff.
> > > Could be I lost the plot somehow.
> > 
> > The updated version was posted quite a while ago.
> > 
> >   http://lkml.kernel.org/r/20170829174325.GS491396@devbig577.frc2.facebook.com
> 
> 
> Oh, that's in the same thread, wasn't expecting that. Looks ok, I suppose.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Applying to cgroup/for-4.15.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH cgroup/for-4.15] cgroup: statically initialize init_css_set->dfl_cgrp
@ 2017-09-25 21:10   ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-09-25 21:10 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

A follow-up patch to fix a sporadic oops during boot, which is caused
by trying to dereference the default cgroup from accounting code before
cgroup is initialized.

Applied to cgroup/for-4.15.

Thanks.
------ 8< ------
From 38683148828165ea0b66ace93a9fedc2d3281e27 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 25 Sep 2017 13:50:20 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Like other csets, init_css_set's dfl_cgrp is initialized when the cset
gets linked.  init_css_set gets linked in cgroup_init().  This has
been fine till now, but the recently added basic CPU usage accounting
may end up accessing init's dfl_cgrp before cgroup_init(), leading to
the following oops.

  SELinux:  Initializing.
  BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
  IP: account_system_index_time+0x60/0x90
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd64 #10
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
  task: ffffffff81e10480 task.stack: ffffffff81e00000
  RIP: 0010:account_system_index_time+0x60/0x90
  RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
  RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
  RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
  RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
  R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
  R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
  FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   account_system_time+0x45/0x60
   account_process_tick+0x5a/0x140
   update_process_times+0x22/0x60
   tick_periodic+0x2b/0x90
   tick_handle_periodic+0x25/0x70
   timer_interrupt+0x15/0x20
   __handle_irq_event_percpu+0x7e/0x1b0
   handle_irq_event_percpu+0x23/0x60
   handle_irq_event+0x42/0x70
   handle_level_irq+0x83/0x100
   handle_irq+0x6f/0x110
   do_IRQ+0x46/0xd0
   common_interrupt+0x9d/0x9d

Fix it by statically initializing init_css_set.dfl_cgrp so that init's
default cgroup is accessible from the get-go.

Fixes: 041cd640b2f3 ("cgroup: Implement cgroup2 basic CPU usage accounting")
Reported-by: “kbuild-all@01.org” <kbuild-all@01.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d036625..7975b20 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -649,6 +649,14 @@ struct css_set init_css_set = {
 	.cgrp_links		= LIST_HEAD_INIT(init_css_set.cgrp_links),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
+
+	/*
+	 * The following field is re-initialized when this cset gets linked
+	 * in cgroup_init().  However, let's initialize the field
+	 * statically too so that the default cgroup can be accessed safely
+	 * early during boot.
+	 */
+	.dfl_cgrp		= &cgrp_dfl_root.cgrp,
 };
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 54+ messages in thread
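
The fix above is a boot-ordering repair: make the pointer valid from
the moment the image is loaded instead of from the moment
cgroup_init() runs.  A minimal userspace analogue of both the hazard
and the repair, with all names invented for illustration:

  /*
   * Userspace analogue of the boot-ordering hazard and its fix.  All
   * names are invented; this is not kernel code.
   */
  #include <stdio.h>

  struct cgroup_sketch { const char *name; };

  static struct cgroup_sketch dfl_root = { .name = "default root" };

  /* Mirrors the bug: the field stays NULL until a later init call. */
  static struct { struct cgroup_sketch *dfl_cgrp; } late_set;

  /* Mirrors the fix: statically initialized, valid from the start. */
  static struct { struct cgroup_sketch *dfl_cgrp; } early_set = {
      .dfl_cgrp = &dfl_root,
  };

  /* Stand-in for the tick path that can run before full init. */
  static void account_tick(struct cgroup_sketch *cgrp)
  {
      if (!cgrp) {  /* the real kernel oopsed on this dereference */
          puts("BUG: NULL dfl_cgrp (tick before cgroup_init())");
          return;
      }
      printf("charging %s\n", cgrp->name);
  }

  int main(void)
  {
      account_tick(late_set.dfl_cgrp);   /* reproduces the window */
      account_tick(early_set.dfl_cgrp);  /* safe from the get-go */
      return 0;
  }

Note the design choice mirrored from the patch comment: the field is
still re-initialized when the cset gets linked; the static initializer
only closes the window before cgroup_init().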

* [PATCH cgroup/for-4.15] sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  2017-08-11 16:37 ` Tejun Heo
@ 2017-09-25 21:34   ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2017-09-25 21:34 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

Another follow-up fix.  I missed a dummy implementation, leading to a
build failure on s390.  Because the bug and fix are trivial, I
directly applied the patch to cgroup/for-4.15.  Please let me know if
it should be routed differently.

Thanks.
------ 8< ------
From 8157a7faf94135386bf04b1cf94e6efd3fb94702 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 25 Sep 2017 14:27:54 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

cfb766da54d9 ("sched/cputime: Expose cputime_adjust()") made
cputime_adjust() public for cgroup basic cpu stat support; however,
the commit forgot to add a dummy implementation for
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, leading to compiler errors on some
s390 configurations.

Fix it by adding the missing dummy implementation.

Reported-by: “kbuild-all@01.org” <kbuild-all@01.org>
Fixes: cfb766da54d9 ("sched/cputime: Expose cputime_adjust()")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/cputime.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index e01b699..5498f20 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -447,6 +447,13 @@ void vtime_account_irq_enter(struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
 
+void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
+		    u64 *ut, u64 *st)
+{
+	*ut = curr->utime;
+	*st = curr->stime;
+}
+
 void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 {
 	*ut = p->utime;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 54+ messages in thread
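
For context on why the dummy can simply copy the counters through:
with tick-based accounting, utime/stime are statistical samples that
cputime_adjust() rescales so they sum to the precisely measured
sum_exec_runtime, while CONFIG_VIRT_CPU_ACCOUNTING_NATIVE maintains
exact per-state times that need no adjustment.  A simplified sketch of
the contrast; the names are invented, the zero-sample policy is
simplified, and the real function's monotonicity handling via
prev_cputime is omitted:

  /* Sketch of sampled vs. native cputime reporting; not kernel code. */
  #include <stdio.h>

  struct task_cputime_sketch {
      unsigned long long utime;             /* tick-sampled user time */
      unsigned long long stime;             /* tick-sampled system time */
      unsigned long long sum_exec_runtime;  /* precise scheduler runtime */
  };

  /* Tick-based case: rescale the samples to the precise total. */
  static void adjust_sampled(const struct task_cputime_sketch *c,
                             unsigned long long *ut, unsigned long long *st)
  {
      unsigned long long total = c->utime + c->stime;

      if (!total) {  /* no ticks landed; simplified: call it all user */
          *ut = c->sum_exec_runtime;
          *st = 0;
          return;
      }
      *st = c->sum_exec_runtime * c->stime / total;
      *ut = c->sum_exec_runtime - *st;
  }

  /* Native case, as in the dummy above: counters are already exact. */
  static void adjust_native(const struct task_cputime_sketch *c,
                            unsigned long long *ut, unsigned long long *st)
  {
      *ut = c->utime;
      *st = c->stime;
  }

  int main(void)
  {
      struct task_cputime_sketch c = {
          .utime = 3, .stime = 1,         /* 4 samples, 3:1 user:sys */
          .sum_exec_runtime = 1000000,    /* precise runtime (toy units) */
      };
      unsigned long long ut, st;

      adjust_sampled(&c, &ut, &st);
      printf("sampled: user=%llu sys=%llu\n", ut, st);  /* 750000 250000 */
      adjust_native(&c, &ut, &st);
      printf("native:  user=%llu sys=%llu\n", ut, st);
      return 0;
  }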

* Re: [PATCH cgroup/for-4.15] sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  2017-09-25 21:34   ` Tejun Heo
  (?)
@ 2017-09-26  9:10   ` Peter Zijlstra
  -1 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2017-09-26  9:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Mon, Sep 25, 2017 at 02:34:21PM -0700, Tejun Heo wrote:
> Hello,
> 
> Another follow-up fix.  I missed a dummy implementation, leading to a
> build failure on s390.  Because the bug and fix are trivial, I
> directly applied the patch to cgroup/for-4.15.  Please let me know if
> it should be routed differently.
> 
> Thanks.
> ------ 8< ------
> From 8157a7faf94135386bf04b1cf94e6efd3fb94702 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Mon, 25 Sep 2017 14:27:54 -0700
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> cfb766da54d9 ("sched/cputime: Expose cputime_adjust()") made
> cputime_adjust() public for cgroup basic cpu stat support; however,
> the commit forgot to add a dummy implementation for
> CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, leading to compiler errors on some
> s390 configurations.
> 
> Fix it by adding the missing dummy implementation.
> 
> Reported-by: “kbuild-all@01.org” <kbuild-all@01.org>
> Fixes: cfb766da54d9 ("sched/cputime: Expose cputime_adjust()")
> Signed-off-by: Tejun Heo <tj@kernel.org>

Seems fine,

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2017-09-26  9:10 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-11 16:37 [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting Tejun Heo
2017-08-11 16:37 ` Tejun Heo
2017-08-11 16:37 ` [PATCH 1/3] sched/cputime: Expose cputime_adjust() Tejun Heo
2017-08-11 16:37 ` [PATCH 2/3] cpuacct: Introduce cgroup_account_cputime[_field]() Tejun Heo
2017-08-11 16:37   ` Tejun Heo
2017-08-11 17:28   ` [PATCH v2 " Tejun Heo
2017-08-11 17:28     ` Tejun Heo
2017-08-11 16:37 ` [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting Tejun Heo
2017-08-11 20:19   ` Waiman Long
2017-08-11 20:19     ` Waiman Long
2017-08-13 19:44     ` Tejun Heo
2017-08-13 19:44       ` Tejun Heo
2017-08-13 23:20       ` Waiman Long
2017-08-13 23:20         ` Waiman Long
2017-08-17 12:07   ` Roman Gushchin
2017-08-17 12:07     ` Roman Gushchin
2017-08-17 13:13     ` Tejun Heo
2017-08-17 13:13       ` Tejun Heo
2017-08-29 14:32   ` Peter Zijlstra
2017-08-29 14:32     ` Peter Zijlstra
2017-08-29 15:24     ` Tejun Heo
2017-08-29 15:24       ` Tejun Heo
2017-08-29 15:32       ` Waiman Long
2017-08-29 15:32         ` Waiman Long
2017-08-29 15:42         ` Tejun Heo
2017-08-29 15:59       ` Peter Zijlstra
2017-08-29 15:59         ` Peter Zijlstra
2017-08-29 17:43   ` [PATCH v2 " Tejun Heo
2017-08-29 17:43     ` Tejun Heo
2017-08-29 18:06     ` Waiman Long
2017-08-29 18:06       ` Waiman Long
2017-08-29 18:10       ` Tejun Heo
2017-08-16 18:52 ` [PATCHSET for-4.14] cgroup, sched: cgroup2 basic resource accounting Tejun Heo
2017-08-17  8:13   ` Ingo Molnar
2017-08-17 13:01     ` Tejun Heo
2017-08-17 15:03       ` Ingo Molnar
2017-08-17 15:03         ` Ingo Molnar
2017-08-24 17:20 ` Tejun Heo
2017-08-24 17:20   ` Tejun Heo
2017-09-22 18:05 ` Tejun Heo
2017-09-22 18:05   ` Tejun Heo
2017-09-23 13:37   ` Peter Zijlstra
2017-09-23 13:37     ` Peter Zijlstra
2017-09-23 13:44     ` Tejun Heo
2017-09-23 13:44       ` Tejun Heo
2017-09-25  7:27       ` Peter Zijlstra
2017-09-25  7:27         ` Peter Zijlstra
2017-09-25 15:07         ` Tejun Heo
2017-09-25 15:07           ` Tejun Heo
2017-09-25 21:10 ` [PATCH cgroup/for-4.15] cgroup: statically initialize init_css_set->dfl_cgrp Tejun Heo
2017-09-25 21:10   ` Tejun Heo
2017-09-25 21:34 ` [PATCH cgroup/for-4.15] sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE Tejun Heo
2017-09-25 21:34   ` Tejun Heo
2017-09-26  9:10   ` Peter Zijlstra
