linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/6] freezer for cgroup v2
@ 2018-11-12 23:04 Roman Gushchin
  2018-11-12 23:04 ` [PATCH] cgroup: document cgroup v2 freezer interface Roman Gushchin
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team, Roman Gushchin

This patchset implements freezer for cgroup v2.

It provides functionality similar to the v1 freezer, but the interface
conforms to the cgroup v2 interface design principles and provides
a better user experience: tasks can be killed, ptrace works,
there is no separate controller that has to be enabled, etc.
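
From user space, the new interface reduces to writing "0"/"1" to the
cgroup.freeze file and watching the "frozen" field in cgroup.events. The
following is a minimal sketch of such a client (not part of the patchset;
the helper names and the poll interval are illustrative):

```python
import time
from pathlib import Path

def set_freeze(cgroup_dir, on=True):
    """Request (un)freezing by writing to the cgroup.freeze file."""
    (Path(cgroup_dir) / "cgroup.freeze").write_text("1" if on else "0")

def is_frozen(cgroup_dir):
    """Parse the 'frozen' field of cgroup.events (the actual, async state)."""
    events = (Path(cgroup_dir) / "cgroup.events").read_text()
    for line in events.splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value == "1"
    return False  # field absent on kernels without this patchset

def wait_frozen(cgroup_dir, timeout=5.0):
    """Poll until the freeze actually completes, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_frozen(cgroup_dir):
            return True
        time.sleep(0.01)
    return False
```

A real client would use poll()/inotify on cgroup.events instead of busy
polling; the point is only that the desired state and the actual state are
separate files.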

Patches (1) and (2) are preparatory work, patch (3) contains
the implementation, patch (4) is a small cgroup kselftest fix,
patch (5) adds 6 new kselftests to cover the freezer
functionality, and patch (6) adds the corresponding docs.

v1->v2:
  - fixed locking around calling cgroup_freezer_leave()
  - added docs

Roman Gushchin (6):
  cgroup: rename freezer.c into legacy_freezer.c
  cgroup: implement __cgroup_task_count() helper
  cgroup: cgroup v2 freezer
  kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
  kselftests: cgroup: add freezer controller self-tests
  cgroup: document cgroup v2 freezer interface

 Documentation/admin-guide/cgroup-v2.rst       |  26 +
 include/linux/cgroup-defs.h                   |  33 +
 include/linux/cgroup.h                        |  42 ++
 include/linux/sched.h                         |   5 +-
 include/linux/sched/jobctl.h                  |   5 +-
 kernel/cgroup/Makefile                        |   4 +-
 kernel/cgroup/cgroup-internal.h               |   1 +
 kernel/cgroup/cgroup-v1.c                     |  16 -
 kernel/cgroup/cgroup.c                        | 132 +++-
 kernel/cgroup/freezer.c                       | 590 +++++----------
 kernel/cgroup/legacy_freezer.c                | 481 ++++++++++++
 kernel/ptrace.c                               |   6 +
 kernel/signal.c                               |  54 +-
 tools/testing/selftests/cgroup/.gitignore     |   1 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  85 ++-
 tools/testing/selftests/cgroup/cgroup_util.h  |   7 +
 tools/testing/selftests/cgroup/test_freezer.c | 685 ++++++++++++++++++
 18 files changed, 1724 insertions(+), 451 deletions(-)
 create mode 100644 kernel/cgroup/legacy_freezer.c
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

-- 
2.17.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] cgroup: document cgroup v2 freezer interface
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 1/6] cgroup: rename freezer.c into legacy_freezer.c Roman Gushchin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team, Roman Gushchin

Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 26 +++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..a065c0bed88c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -862,6 +862,8 @@ All cgroup core files are prefixed with "cgroup."
 	  populated
 		1 if the cgroup or its descendants contains any live
 		processes; otherwise, 0.
+	  frozen
+		1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
 	A read-write single value files.  The default is "max".
@@ -895,6 +897,30 @@ All cgroup core files are prefixed with "cgroup."
 		A dying cgroup can consume system resources not exceeding
 		limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+	A read-write single value file which exists on non-root cgroups.
+	Allowed values are "0" and "1". The default is "0".
+
+	Writing "1" to the file causes freezing of the cgroup and all
+	descendant cgroups. This means that all processes belonging to
+	these cgroups will be stopped and will not run until the cgroup
+	is explicitly unfrozen. Freezing of the cgroup may take some time;
+	when it is complete, the "frozen" value in the cgroup.events
+	control file will be updated and a notification will be issued.
+
+	A cgroup can be frozen either by its own settings, or by the
+	settings of any ancestor cgroup. If any ancestor cgroup is
+	frozen, the cgroup will remain frozen.
+
+	Processes in a frozen cgroup can be killed by a fatal signal.
+	They can also enter and leave a frozen cgroup: either through an
+	explicit move by a user, or if freezing of the cgroup races with
+	fork(). If a process is moved into a frozen cgroup, it stops. If
+	a process is moved out of a frozen cgroup, it becomes running.
+
+	The frozen status of a cgroup doesn't affect any cgroup tree operations:
+	it's possible to delete a frozen (and empty) cgroup, as well as
+	create new sub-cgroups.
 
 Controllers
 ===========
-- 
2.17.2



* [PATCH v2 1/6] cgroup: rename freezer.c into legacy_freezer.c
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
  2018-11-12 23:04 ` [PATCH] cgroup: document cgroup v2 freezer interface Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 2/6] cgroup: implement __cgroup_task_count() helper Roman Gushchin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team, Roman Gushchin

freezer.c will contain an implementation of the cgroup v2 freezer,
so let's rename the v1 freezer to avoid naming conflicts.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
---
 kernel/cgroup/Makefile                        | 2 +-
 kernel/cgroup/{freezer.c => legacy_freezer.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename kernel/cgroup/{freezer.c => legacy_freezer.c} (100%)

diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index bfcdae896122..8d5689ca94b9 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
 
-obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
+obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/legacy_freezer.c
similarity index 100%
rename from kernel/cgroup/freezer.c
rename to kernel/cgroup/legacy_freezer.c
-- 
2.17.2



* [PATCH v2 2/6] cgroup: implement __cgroup_task_count() helper
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
  2018-11-12 23:04 ` [PATCH] cgroup: document cgroup v2 freezer interface Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 1/6] cgroup: rename freezer.c into legacy_freezer.c Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 3/6] cgroup: cgroup v2 freezer Roman Gushchin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team, Roman Gushchin

The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.

Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
---
 kernel/cgroup/cgroup-internal.h |  1 +
 kernel/cgroup/cgroup-v1.c       | 16 ----------------
 kernel/cgroup/cgroup.c          | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 75568fcf2180..fe01a9fa4a8d 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -224,6 +224,7 @@ int cgroup_rmdir(struct kernfs_node *kn);
 int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 		     struct kernfs_root *kf_root);
 
+int __cgroup_task_count(const struct cgroup *cgrp);
 int cgroup_task_count(const struct cgroup *cgrp);
 
 /*
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 51063e7a93c2..6134fef07d57 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -336,22 +336,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
 	return l;
 }
 
-/**
- * cgroup_task_count - count the number of tasks in a cgroup.
- * @cgrp: the cgroup in question
- */
-int cgroup_task_count(const struct cgroup *cgrp)
-{
-	int count = 0;
-	struct cgrp_cset_link *link;
-
-	spin_lock_irq(&css_set_lock);
-	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-		count += link->cset->nr_tasks;
-	spin_unlock_irq(&css_set_lock);
-	return count;
-}
-
 /*
  * Load a cgroup's pidarray with either procs' tgids or tasks' pids
  */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 4a3dae2a8283..ef3442555b32 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -561,6 +561,39 @@ static void cgroup_get_live(struct cgroup *cgrp)
 	css_get(&cgrp->self);
 }
 
+/**
+ * __cgroup_task_count - count the number of tasks in a cgroup. The caller
+ * is responsible for taking the css_set_lock.
+ * @cgrp: the cgroup in question
+ */
+int __cgroup_task_count(const struct cgroup *cgrp)
+{
+	int count = 0;
+	struct cgrp_cset_link *link;
+
+	lockdep_assert_held(&css_set_lock);
+
+	list_for_each_entry(link, &cgrp->cset_links, cset_link)
+		count += link->cset->nr_tasks;
+
+	return count;
+}
+
+/**
+ * cgroup_task_count - count the number of tasks in a cgroup.
+ * @cgrp: the cgroup in question
+ */
+int cgroup_task_count(const struct cgroup *cgrp)
+{
+	int count;
+
+	spin_lock_irq(&css_set_lock);
+	count = __cgroup_task_count(cgrp);
+	spin_unlock_irq(&css_set_lock);
+
+	return count;
+}
+
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
-- 
2.17.2
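
The pattern used here, a raw helper that asserts the lock is held plus a
wrapper that takes the lock, can be sketched in a few lines. This is a toy
model for illustration only (the names mirror the kernel functions, but it
is not kernel code):

```python
import threading

class TaskCounter:
    """Sketch of the cgroup_task_count()/__cgroup_task_count() split."""
    def __init__(self, csets):
        self._lock = threading.Lock()   # stands in for css_set_lock
        self._csets = csets             # list of {"nr_tasks": int} entries

    def _task_count_locked(self):
        # Like __cgroup_task_count(): caller must already hold the lock
        # (the kernel enforces this with lockdep_assert_held()).
        assert self._lock.locked(), "caller must hold the lock"
        return sum(c["nr_tasks"] for c in self._csets)

    def task_count(self):
        # Like cgroup_task_count(): take the lock around the raw helper.
        with self._lock:
            return self._task_count_locked()
```

Callers that already hold the lock (such as the freezer in patch 3) use the
raw helper directly, avoiding a recursive acquisition.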



* [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
                   ` (2 preceding siblings ...)
  2018-11-12 23:04 ` [PATCH v2 2/6] cgroup: implement __cgroup_task_count() helper Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  2018-11-13  2:08   ` Tejun Heo
                     ` (2 more replies)
  2018-11-12 23:04 ` [PATCH v2 4/6] kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy() Roman Gushchin
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team, Roman Gushchin

Cgroup v1 implements the freezer controller, which provides the ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements a freezer for cgroup v2. While the functionality
is similar, the interface is different from cgroup v1: it follows
cgroup v2 interface design principles.

Key differences are:
1) There is no separate controller: the functionality is always
available and is represented by cgroup.freeze and cgroup.events
cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes to the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
regardless of whether some cgroups are frozen.
5) Tasks in a frozen cgroup can be killed.
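
The hierarchical behavior in (2) can be modeled with an effective freeze
counter per cgroup (e_freeze in the patch): setting cgroup.freeze bumps the
counter on the cgroup and its whole subtree, and a cgroup should be frozen
iff its counter is positive. Below is a toy model of that propagation, not
kernel code (it uses a plain parent-pointer tree and no locking):

```python
class Cgroup:
    """Toy model of the e_freeze bookkeeping from this patch."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.freeze = False      # desired state (cgroup.freeze)
        self.e_freeze = 0        # self freeze + frozen ancestors
        if parent:
            parent.children.append(self)
            self.e_freeze = parent.e_freeze  # new child inherits (see patch)

    def should_be_frozen(self):
        return self.e_freeze > 0

    def set_freeze(self, freeze):
        if self.freeze == freeze:
            return  # nothing changed
        self.freeze = freeze
        delta = 1 if freeze else -1
        # Walk the subtree, mirroring css_for_each_descendant_pre().
        stack = [self]
        while stack:
            cg = stack.pop()
            cg.e_freeze += delta
            stack.extend(cg.children)
```

This shows why unfreezing a cgroup does not thaw a descendant whose own
cgroup.freeze is still set: the descendant's counter stays positive.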

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
---
 include/linux/cgroup-defs.h  |  33 +++++
 include/linux/cgroup.h       |  42 ++++++
 include/linux/sched.h        |   5 +-
 include/linux/sched/jobctl.h |   5 +-
 kernel/cgroup/Makefile       |   2 +-
 kernel/cgroup/cgroup.c       |  99 +++++++++++++-
 kernel/cgroup/freezer.c      | 247 +++++++++++++++++++++++++++++++++++
 kernel/ptrace.c              |   6 +
 kernel/signal.c              |  54 ++++++--
 9 files changed, 475 insertions(+), 18 deletions(-)
 create mode 100644 kernel/cgroup/freezer.c

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 22254c1fe1c5..600165d0f5a2 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -63,6 +63,12 @@ enum {
 	 * specified at mount time and thus is implemented here.
 	 */
 	CGRP_CPUSET_CLONE_CHILDREN,
+
+	/* Control group has to be frozen. */
+	CGRP_FREEZE,
+
+	/* Cgroup is frozen. */
+	CGRP_FROZEN,
 };
 
 /* cgroup_root->flags */
@@ -314,6 +320,27 @@ struct cgroup_rstat_cpu {
 	struct cgroup *updated_next;		/* NULL iff not on the list */
 };
 
+struct cgroup_freezer_state {
+	/* Should the cgroup actually be frozen?
+	 * (protected by cgroup_mutex)
+	 */
+	int e_freeze;
+
+	/* Fields below are protected by css_set_lock */
+
+	/* Number of frozen descendant cgroups */
+	int nr_frozen_descendants;
+
+	/* Number of tasks to freeze */
+	int nr_tasks_to_freeze;
+
+	/* Number of frozen tasks */
+	int nr_frozen_tasks;
+
+	/* Used for delayed notifications */
+	struct work_struct notify_work;
+};
+
 struct cgroup {
 	/* self css with NULL ->ss, points back to this cgroup */
 	struct cgroup_subsys_state self;
@@ -442,6 +469,12 @@ struct cgroup {
 	/* If there is block congestion on this cgroup. */
 	atomic_t congestion_count;
 
+	/* Should the cgroup and its descendants be frozen. */
+	bool freeze;
+
+	/* Used to store internal freezer state */
+	struct cgroup_freezer_state freezer;
+
 	/* ids of the ancestors at each level including self */
 	int ancestor_ids[];
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 32c553556bbd..8f7e82b05bf8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -871,4 +871,46 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
 		free_cgroup_ns(ns);
 }
 
+#ifdef CONFIG_CGROUPS
+
+static inline bool cgroup_frozen(struct task_struct *task)
+{
+	bool ret;
+
+	rcu_read_lock();
+	ret = test_bit(CGRP_FREEZE, &task_dfl_cgroup(task)->flags);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool cgroup_task_in_freezer(struct task_struct *task)
+{
+	return task->frozen;
+}
+
+void cgroup_freezer_enter(void);
+void cgroup_freezer_leave(void);
+void cgroup_freeze(struct cgroup *cgrp, bool freeze);
+void cgroup_notify_frozen_fn(struct work_struct *work);
+void cgroup_notify_frozen(struct cgroup *cgrp, bool frozen);
+void cgroup_queue_notify_frozen(struct cgroup *cgrp);
+void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
+				 struct cgroup *dst);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline bool cgroup_task_in_freezer(struct task_struct *task)
+{
+	return false;
+}
+static inline void cgroup_freezer_enter(void) { }
+static inline void cgroup_freezer_leave(void) { }
+static inline bool cgroup_frozen(struct task_struct *task)
+{
+	return false;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57d7bc9..8ef5d3174e50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -83,7 +83,8 @@ struct task_group;
 #define TASK_WAKING			0x0200
 #define TASK_NOLOAD			0x0400
 #define TASK_NEW			0x0800
-#define TASK_STATE_MAX			0x1000
+#define TASK_FROZEN			0x1000
+#define TASK_STATE_MAX			0x2000
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -733,6 +734,8 @@ struct task_struct {
 #ifdef CONFIG_CGROUPS
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
+	/* task is in the cgroup freezer loop */
+	unsigned			frozen:1;
 #endif
 #ifdef CONFIG_BLK_CGROUP
 	/* to be used once the psi infrastructure lands upstream. */
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index 98228bd48aee..6c49455dcfe6 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -18,6 +18,7 @@ struct task_struct;
 #define JOBCTL_TRAP_NOTIFY_BIT	20	/* trap for NOTIFY */
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
+#define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -26,8 +27,10 @@ struct task_struct;
 #define JOBCTL_TRAP_NOTIFY	(1UL << JOBCTL_TRAP_NOTIFY_BIT)
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
+#define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
 
-#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
+#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY | \
+				 JOBCTL_TRAP_FREEZE)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
 
 extern bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask);
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 8d5689ca94b9..5d7a76bfbbb7 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
+obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o freezer.o
 
 obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index ef3442555b32..4cffaae075af 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2358,6 +2358,10 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 			css_set_move_task(task, from_cset, to_cset, true);
 			put_css_set_locked(from_cset);
 			from_cset->nr_tasks--;
+
+			cgroup_freezer_migrate_task(task,
+						    from_cset->dfl_cgrp,
+						    to_cset->dfl_cgrp);
 		}
 	}
 	spin_unlock_irq(&css_set_lock);
@@ -3401,8 +3405,11 @@ static ssize_t cgroup_max_depth_write(struct kernfs_open_file *of,
 
 static int cgroup_events_show(struct seq_file *seq, void *v)
 {
-	seq_printf(seq, "populated %d\n",
-		   cgroup_is_populated(seq_css(seq)->cgroup));
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+
+	seq_printf(seq, "populated %d\n", cgroup_is_populated(cgrp));
+	seq_printf(seq, "frozen %d\n", test_bit(CGRP_FROZEN, &cgrp->flags));
+
 	return 0;
 }
 
@@ -3449,6 +3456,40 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }
 
+static int cgroup_freeze_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+
+	seq_printf(seq, "%d\n", cgrp->freeze);
+
+	return 0;
+}
+
+static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup *cgrp;
+	ssize_t ret;
+	int freeze;
+
+	ret = kstrtoint(strstrip(buf), 0, &freeze);
+	if (ret)
+		return ret;
+
+	if (freeze < 0 || freeze > 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	cgroup_freeze(cgrp, freeze);
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of->kn->priv;
@@ -4574,6 +4615,12 @@ static struct cftype cgroup_base_files[] = {
 		.name = "cgroup.stat",
 		.seq_show = cgroup_stat_show,
 	},
+	{
+		.name = "cgroup.freeze",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_freeze_show,
+		.write = cgroup_freeze_write,
+	},
 	{
 		.name = "cpu.stat",
 		.flags = CFTYPE_NOT_ON_ROOT,
@@ -4899,11 +4946,20 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (ret)
 		goto out_idr_free;
 
+	INIT_WORK(&cgrp->freezer.notify_work, cgroup_notify_frozen_fn);
+	cgrp->freezer.e_freeze = parent->freezer.e_freeze;
+	if (cgrp->freezer.e_freeze)
+		set_bit(CGRP_FROZEN, &cgrp->flags);
+
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
 
-		if (tcgrp != cgrp)
+		if (tcgrp != cgrp) {
 			tcgrp->nr_descendants++;
+
+			if (cgrp->freezer.e_freeze)
+				tcgrp->freezer.nr_frozen_descendants++;
+		}
 	}
 
 	if (notify_on_release(parent))
@@ -5190,6 +5246,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) {
 		tcgrp->nr_descendants--;
 		tcgrp->nr_dying_descendants++;
+		if (cgrp->freezer.e_freeze)
+			tcgrp->freezer.nr_frozen_descendants--;
 	}
 
 	cgroup1_check_for_release(parent);
@@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
 			cset->nr_tasks++;
 			css_set_move_task(child, NULL, cset, false);
 		}
+
+		if (unlikely(cgroup_frozen(child) &&
+			     (child->flags & ~PF_KTHREAD))) {
+			struct cgroup *cgrp;
+			unsigned long flags;
+
+			if (lock_task_sighand(child, &flags)) {
+				cgrp = cset->dfl_cgrp;
+				cgrp->freezer.nr_tasks_to_freeze++;
+				WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
+					     cgrp->freezer.nr_frozen_tasks);
+				child->jobctl |= JOBCTL_TRAP_FREEZE;
+				signal_wake_up(child, false);
+				unlock_task_sighand(child, &flags);
+			}
+		}
+
 		spin_unlock_irq(&css_set_lock);
 	}
 
@@ -5690,6 +5765,24 @@ void cgroup_exit(struct task_struct *tsk)
 		spin_lock_irq(&css_set_lock);
 		css_set_move_task(tsk, cset, NULL, false);
 		cset->nr_tasks--;
+
+		if (unlikely(cgroup_frozen(tsk) &&
+			     (tsk->flags & ~PF_KTHREAD))) {
+			struct cgroup *frozen_cgrp = cset->dfl_cgrp;
+
+			frozen_cgrp->freezer.nr_tasks_to_freeze--;
+
+			WARN_ON(tsk->frozen);
+			WARN_ON_ONCE(frozen_cgrp->freezer.nr_tasks_to_freeze <
+				     0);
+			WARN_ON_ONCE(frozen_cgrp->freezer.nr_tasks_to_freeze <
+				     frozen_cgrp->freezer.nr_frozen_tasks);
+
+			if (frozen_cgrp->freezer.nr_frozen_tasks ==
+			    frozen_cgrp->freezer.nr_tasks_to_freeze)
+				cgroup_queue_notify_frozen(frozen_cgrp);
+		}
+
 		spin_unlock_irq(&css_set_lock);
 	} else {
 		get_css_set(cset);
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
new file mode 100644
index 000000000000..b81e215e2cce
--- /dev/null
+++ b/kernel/cgroup/freezer.c
@@ -0,0 +1,247 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cgroup.h>
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/sched/signal.h>
+
+#include "cgroup-internal.h"
+
+void cgroup_notify_frozen(struct cgroup *cgrp, bool frozen)
+{
+	int delta = 1;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/*
+	 * Did we race with fork() or exit()? No, everything is still frozen.
+	 */
+	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
+		return;
+
+	if (frozen)
+		set_bit(CGRP_FROZEN, &cgrp->flags);
+	else
+		clear_bit(CGRP_FROZEN, &cgrp->flags);
+	cgroup_file_notify(&cgrp->events_file);
+
+	while ((cgrp = cgroup_parent(cgrp))) {
+		if (frozen) {
+			cgrp->freezer.nr_frozen_descendants += delta;
+			if (!test_bit(CGRP_FROZEN, &cgrp->flags) &&
+			    test_bit(CGRP_FREEZE, &cgrp->flags) &&
+			    cgrp->freezer.nr_frozen_descendants ==
+			    cgrp->nr_descendants) {
+				set_bit(CGRP_FROZEN, &cgrp->flags);
+				cgroup_file_notify(&cgrp->events_file);
+				delta++;
+			}
+		} else {
+			cgrp->freezer.nr_frozen_descendants -= delta;
+			if (test_bit(CGRP_FROZEN, &cgrp->flags)) {
+				clear_bit(CGRP_FROZEN, &cgrp->flags);
+				cgroup_file_notify(&cgrp->events_file);
+				delta++;
+			}
+		}
+	}
+}
+
+void cgroup_notify_frozen_fn(struct work_struct *work)
+{
+	struct cgroup *cgrp = container_of(work, struct cgroup,
+					   freezer.notify_work);
+
+	mutex_lock(&cgroup_mutex);
+	cgroup_notify_frozen(cgrp, true);
+	mutex_unlock(&cgroup_mutex);
+
+	css_put(&cgrp->self);
+}
+
+void cgroup_queue_notify_frozen(struct cgroup *cgrp)
+{
+	if (work_pending(&cgrp->freezer.notify_work))
+		return;
+
+	css_get(&cgrp->self);
+	schedule_work(&cgrp->freezer.notify_work);
+}
+
+void cgroup_freezer_enter(void)
+{
+	long state = current->state;
+	struct cgroup *cgrp;
+
+	if (!current->frozen) {
+		spin_lock_irq(&css_set_lock);
+		current->frozen = true;
+		cgrp = task_dfl_cgroup(current);
+		cgrp->freezer.nr_frozen_tasks++;
+
+		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
+			     cgrp->freezer.nr_frozen_tasks);
+
+		if (cgrp->freezer.nr_tasks_to_freeze ==
+		    cgrp->freezer.nr_frozen_tasks)
+			cgroup_queue_notify_frozen(cgrp);
+		spin_unlock_irq(&css_set_lock);
+	}
+
+	/* refrigerator */
+	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
+	clear_thread_flag(TIF_SIGPENDING);
+	schedule();
+	recalc_sigpending();
+
+	set_current_state(state);
+}
+
+void cgroup_freezer_leave(void)
+{
+	struct cgroup *cgrp;
+
+	spin_lock_irq(&css_set_lock);
+	cgrp = task_dfl_cgroup(current);
+	cgrp->freezer.nr_frozen_tasks--;
+	WARN_ON_ONCE(cgrp->freezer.nr_frozen_tasks < 0);
+	current->frozen = false;
+	spin_unlock_irq(&css_set_lock);
+}
+
+static void cgroup_do_freeze(struct cgroup *cgrp, bool freeze)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	unsigned long flags;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	spin_lock_irq(&css_set_lock);
+	if (freeze) {
+		cgrp->freezer.nr_tasks_to_freeze = __cgroup_task_count(cgrp);
+		set_bit(CGRP_FREEZE, &cgrp->flags);
+	} else {
+		clear_bit(CGRP_FREEZE, &cgrp->flags);
+		cgroup_notify_frozen(cgrp, false);
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	css_task_iter_start(&cgrp->self, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (task->flags & PF_KTHREAD)
+			continue;
+
+		if (!lock_task_sighand(task, &flags))
+			continue;
+
+		if (freeze) {
+			task->jobctl |= JOBCTL_TRAP_FREEZE;
+			signal_wake_up(task, false);
+		} else {
+			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
+			wake_up_process(task);
+		}
+
+		unlock_task_sighand(task, &flags);
+	}
+	css_task_iter_end(&it);
+
+	if (freeze && cgrp->nr_descendants ==
+	    cgrp->freezer.nr_frozen_descendants) {
+		spin_lock_irq(&css_set_lock);
+		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
+			     cgrp->freezer.nr_frozen_tasks);
+		if (cgrp->freezer.nr_tasks_to_freeze ==
+		    cgrp->freezer.nr_frozen_tasks)
+			cgroup_notify_frozen(cgrp, true);
+		spin_unlock_irq(&css_set_lock);
+	}
+}
+
+void cgroup_freezer_migrate_task(struct task_struct *task,
+				 struct cgroup *src, struct cgroup *dst)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&css_set_lock);
+
+	if (task->flags & PF_KTHREAD)
+		return;
+
+	if (test_bit(CGRP_FREEZE, &src->flags) || task->frozen)
+		src->freezer.nr_tasks_to_freeze--;
+	if (test_bit(CGRP_FREEZE, &dst->flags) || task->frozen)
+		dst->freezer.nr_tasks_to_freeze++;
+
+	if (task->frozen) {
+		src->freezer.nr_frozen_tasks--;
+		dst->freezer.nr_frozen_tasks++;
+	}
+
+	if (test_bit(CGRP_FREEZE, &src->flags) ==
+	    test_bit(CGRP_FREEZE, &dst->flags))
+		return;
+
+	if (lock_task_sighand(task, &flags)) {
+		if (test_bit(CGRP_FREEZE, &dst->flags))
+			task->jobctl |= JOBCTL_TRAP_FREEZE;
+		else
+			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
+		signal_wake_up(task, false);
+		unlock_task_sighand(task, &flags);
+	}
+}
+
+void cgroup_freeze(struct cgroup *cgrp, bool freeze)
+{
+	struct cgroup_subsys_state *css;
+	struct cgroup *dsct;
+	bool applied = false;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/*
+	 * Nothing changed? Just exit.
+	 */
+	if (cgrp->freeze == freeze)
+		return;
+
+	cgrp->freeze = freeze;
+
+	/*
+	 * Propagate changes downwards the cgroup tree.
+	 */
+	css_for_each_descendant_pre(css, &cgrp->self) {
+		dsct = css->cgroup;
+
+		if (cgroup_is_dead(dsct))
+			continue;
+
+		if (freeze) {
+			dsct->freezer.e_freeze++;
+			/*
+			 * Already frozen because of ancestor's settings?
+			 */
+			if (dsct->freezer.e_freeze > 1)
+				continue;
+		} else {
+			dsct->freezer.e_freeze--;
+			/*
+			 * Still frozen because of ancestor's settings?
+			 */
+			if (dsct->freezer.e_freeze > 0)
+				continue;
+
+			WARN_ON_ONCE(dsct->freezer.e_freeze < 0);
+		}
+
+		/*
+		 * Do change actual state: freeze or unfreeze.
+		 */
+		cgroup_do_freeze(dsct, freeze);
+		applied = true;
+	}
+
+	if (!applied)
+		cgroup_file_notify(&cgrp->events_file);
+}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 21fec73d45d4..5e484e2480e5 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -400,6 +400,12 @@ static int ptrace_attach(struct task_struct *task, long request,
 
 	spin_lock(&task->sighand->siglock);
 
+	/*
+	 * Kick the process to get it out of the refrigerator.
+	 */
+	if (cgroup_frozen(task))
+		wake_up_process(task);
+
 	/*
 	 * If the task is already STOPPED, set JOBCTL_TRAP_STOP and
 	 * TRAPPING, and kick it so that it transits to TRACED.  TRAPPING
diff --git a/kernel/signal.c b/kernel/signal.c
index 5843c541fda9..6d7f0654f60d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -326,6 +326,11 @@ void task_clear_jobctl_pending(struct task_struct *task, unsigned long mask)
 	if (mask & JOBCTL_STOP_PENDING)
 		mask |= JOBCTL_STOP_CONSUME | JOBCTL_STOP_DEQUEUED;
 
+	/*
+	 * JOBCTL_TRAP_FREEZE is set and cleared from cgroup side,
+	 * don't touch it here.
+	 */
+	mask &= ~JOBCTL_TRAP_FREEZE;
 	task->jobctl &= ~mask;
 
 	if (!(task->jobctl & JOBCTL_PENDING_MASK))
@@ -2252,7 +2257,7 @@ static bool do_signal_stop(int signr)
 }
 
 /**
- * do_jobctl_trap - take care of ptrace jobctl traps
+ * do_jobctl_trap - take care of ptrace and cgroup freezer jobctl traps
  *
  * When PT_SEIZED, it's used for both group stop and explicit
  * SEIZE/INTERRUPT traps.  Both generate PTRACE_EVENT_STOP trap with
@@ -2268,20 +2273,35 @@ static bool do_signal_stop(int signr)
  */
 static void do_jobctl_trap(void)
 {
+	struct sighand_struct *sighand = current->sighand;
 	struct signal_struct *signal = current->signal;
 	int signr = current->jobctl & JOBCTL_STOP_SIGMASK;
 
-	if (current->ptrace & PT_SEIZED) {
-		if (!signal->group_stop_count &&
-		    !(signal->flags & SIGNAL_STOP_STOPPED))
-			signr = SIGTRAP;
-		WARN_ON_ONCE(!signr);
-		ptrace_do_notify(signr, signr | (PTRACE_EVENT_STOP << 8),
-				 CLD_STOPPED);
-	} else {
-		WARN_ON_ONCE(!signr);
-		ptrace_stop(signr, CLD_STOPPED, 0, NULL);
-		current->exit_code = 0;
+	if (current->jobctl & (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)) {
+		if (current->ptrace & PT_SEIZED) {
+			if (!signal->group_stop_count &&
+			    !(signal->flags & SIGNAL_STOP_STOPPED))
+				signr = SIGTRAP;
+			WARN_ON_ONCE(!signr);
+			ptrace_do_notify(signr,
+					 signr | (PTRACE_EVENT_STOP << 8),
+					 CLD_STOPPED);
+		} else {
+			WARN_ON_ONCE(!signr);
+			ptrace_stop(signr, CLD_STOPPED, 0, NULL);
+			current->exit_code = 0;
+		}
+	} else if (current->jobctl & JOBCTL_TRAP_FREEZE) {
+		/*
+		 * Enter the freezer, unless the task is about to exit.
+		 */
+		if (fatal_signal_pending(current)) {
+			current->jobctl &= ~JOBCTL_TRAP_FREEZE;
+		} else {
+			spin_unlock_irq(&sighand->siglock);
+			cgroup_freezer_enter();
+			spin_lock_irq(&sighand->siglock);
+		}
 	}
 }
 
@@ -2403,6 +2423,16 @@ bool get_signal(struct ksignal *ksig)
 			goto relock;
 		}
 
+		/*
+		 * If the task is leaving the freezer loop, let's notify
+		 * the cgroup and reset the frozen mark.
+		 */
+		if (unlikely(cgroup_task_in_freezer(current))) {
+			spin_unlock_irq(&sighand->siglock);
+			cgroup_freezer_leave();
+			spin_lock_irq(&sighand->siglock);
+		}
+
 		signr = dequeue_signal(current, &current->blocked, &ksig->info);
 
 		if (!signr)
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 4/6] kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
                   ` (3 preceding siblings ...)
  2018-11-12 23:04 ` [PATCH v2 3/6] cgroup: cgroup v2 freezer Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 5/6] kselftests: cgroup: add freezer controller self-tests Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 6/6] cgroup: document cgroup v2 freezer interface Roman Gushchin
  6 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team,
	Roman Gushchin, Shuah Khan

If cgroup destruction races with the exit() of belonging processes,
cg_kill_all() may fail. That's not a good reason to make cg_destroy()
fail and leave the cgroup in place, potentially causing subsequent
test runs to fail.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
---
 tools/testing/selftests/cgroup/cgroup_util.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 14c9fe284806..eba06f94433b 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -227,9 +227,7 @@ int cg_destroy(const char *cgroup)
 retry:
 	ret = rmdir(cgroup);
 	if (ret && errno == EBUSY) {
-		ret = cg_killall(cgroup);
-		if (ret)
-			return ret;
+		cg_killall(cgroup);
 		usleep(100);
 		goto retry;
 	}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 5/6] kselftests: cgroup: add freezer controller self-tests
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
                   ` (4 preceding siblings ...)
  2018-11-12 23:04 ` [PATCH v2 4/6] kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy() Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  2018-11-12 23:04 ` [PATCH v2 6/6] cgroup: document cgroup v2 freezer interface Roman Gushchin
  6 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team,
	Roman Gushchin, Shuah Khan

This patch implements six tests for the freezer controller for
cgroup v2:
1) a simple test, which aims to freeze and unfreeze a cgroup with 100
processes
2) a more complicated tree test, which creates a hierarchy of cgroups,
puts some processes in some cgroups, and tries to freeze and unfreeze
different parts of the subtree
3) a forkbomb test: the test aims to freeze a forkbomb running in a
cgroup, kill all tasks in the cgroup, and remove the cgroup without
unfreezing it.
4) rmdir test: the test creates two nested cgroups, freezes the parent
one, checks that the child can be successfully removed, and a new
child can be created
5) migration tests: the test checks the migration of a task between
cgroups in various states: from a frozen cgroup to a running one,
from a running one to a frozen one, and between two frozen cgroups.
6) ptrace test: the test checks that it's possible to attach to
a process in a frozen cgroup, get some information, and detach,
while the cgroup remains frozen.

Expected output:

  $ ./test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_rmdir
  ok 5 test_cgfreezer_migrate
  ok 6 test_cgfreezer_ptrace

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
---
 tools/testing/selftests/cgroup/.gitignore     |   1 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  81 ++-
 tools/testing/selftests/cgroup/cgroup_util.h  |   7 +
 tools/testing/selftests/cgroup/test_freezer.c | 685 ++++++++++++++++++
 5 files changed, 775 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index adacda50a4b2..7f9835624793 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -1,2 +1,3 @@
 test_memcontrol
 test_core
+test_freezer
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 23fbaa4a9630..8d369b6a2069 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -5,8 +5,10 @@ all:
 
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_core
+TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c
 $(OUTPUT)/test_core: cgroup_util.c
+$(OUTPUT)/test_freezer: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index eba06f94433b..e9cdad673901 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -74,6 +74,16 @@ char *cg_name_indexed(const char *root, const char *name, int index)
 	return ret;
 }
 
+char *cg_control(const char *cgroup, const char *control)
+{
+	size_t len = strlen(cgroup) + strlen(control) + 2;
+	char *ret = malloc(len);
+
+	snprintf(ret, len, "%s/%s", cgroup, control);
+
+	return ret;
+}
+
 int cg_read(const char *cgroup, const char *control, char *buf, size_t len)
 {
 	char path[PATH_MAX];
@@ -196,7 +206,59 @@ int cg_create(const char *cgroup)
 	return mkdir(cgroup, 0644);
 }
 
-static int cg_killall(const char *cgroup)
+int cg_for_all_procs(const char *cgroup, int (*fn)(int pid, void *arg),
+		     void *arg)
+{
+	char buf[PAGE_SIZE];
+	char *ptr = buf;
+	int ret;
+
+	if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+		return -1;
+
+	while (ptr < buf + sizeof(buf)) {
+		int pid = strtol(ptr, &ptr, 10);
+
+		if (pid == 0)
+			break;
+		if (*ptr)
+			ptr++;
+		else
+			break;
+		ret = fn(pid, arg);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int cg_wait_for_proc_count(const char *cgroup, int count)
+{
+	char buf[10 * PAGE_SIZE] = {0};
+	int attempts;
+	char *ptr;
+
+	for (attempts = 10; attempts >= 0; attempts--) {
+		int nr = 0;
+
+		if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+			break;
+
+		for (ptr = buf; *ptr; ptr++)
+			if (*ptr == '\n')
+				nr++;
+
+		if (nr >= count)
+			return 0;
+
+		usleep(100000);
+	}
+
+	return -1;
+}
+
+int cg_killall(const char *cgroup)
 {
 	char buf[PAGE_SIZE];
 	char *ptr = buf;
@@ -238,6 +300,14 @@ int cg_destroy(const char *cgroup)
 	return ret;
 }
 
+int cg_enter(const char *cgroup, int pid)
+{
+	char pidbuf[64];
+
+	snprintf(pidbuf, sizeof(pidbuf), "%d", pid);
+	return cg_write(cgroup, "cgroup.procs", pidbuf);
+}
+
 int cg_enter_current(const char *cgroup)
 {
 	char pidbuf[64];
@@ -367,3 +437,12 @@ int set_oom_adj_score(int pid, int score)
 	close(fd);
 	return 0;
 }
+
+ssize_t proc_read_text(int pid, const char *item, char *buf, size_t size)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), "/proc/%d/%s", pid, item);
+
+	return read_text(path, buf, size);
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 9ac8b7958f83..8ee63c00a668 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -18,6 +18,7 @@ static inline int values_close(long a, long b, int err)
 extern int cg_find_unified_root(char *root, size_t len);
 extern char *cg_name(const char *root, const char *name);
 extern char *cg_name_indexed(const char *root, const char *name, int index);
+extern char *cg_control(const char *cgroup, const char *control);
 extern int cg_create(const char *cgroup);
 extern int cg_destroy(const char *cgroup);
 extern int cg_read(const char *cgroup, const char *control,
@@ -32,6 +33,7 @@ extern int cg_write(const char *cgroup, const char *control, char *buf);
 extern int cg_run(const char *cgroup,
 		  int (*fn)(const char *cgroup, void *arg),
 		  void *arg);
+extern int cg_enter(const char *cgroup, int pid);
 extern int cg_enter_current(const char *cgroup);
 extern int cg_run_nowait(const char *cgroup,
 			 int (*fn)(const char *cgroup, void *arg),
@@ -41,3 +43,8 @@ extern int alloc_pagecache(int fd, size_t size);
 extern int alloc_anon(const char *cgroup, void *arg);
 extern int is_swap_enabled(void);
 extern int set_oom_adj_score(int pid, int score);
+extern int cg_for_all_procs(const char *cgroup, int (*fn)(int pid, void *arg),
+			    void *arg);
+extern int cg_wait_for_proc_count(const char *cgroup, int count);
+extern int cg_killall(const char *cgroup);
+extern ssize_t proc_read_text(int pid, const char *item, char *buf, size_t size);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
new file mode 100644
index 000000000000..dc9830986f08
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -0,0 +1,685 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <stdbool.h>
+#include <linux/limits.h>
+#include <sys/ptrace.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <errno.h>
+#include <poll.h>
+#include <stdlib.h>
+#include <sys/inotify.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+#define DEBUG
+#ifdef DEBUG
+#define debug(args...) fprintf(stderr, args)
+#else
+#define debug(args...)
+#endif
+
+/*
+ * Freeze the given cgroup and wait for the inotify signal.
+ * If there is no signal in 10 seconds, treat this as an error.
+ */
+static int cg_freeze_wait(const char *cgroup, bool freeze)
+{
+	int fd, wd;
+	struct pollfd fds;
+	int ret = -1;
+
+	fd = inotify_init1(IN_NONBLOCK);
+	if (fd == -1)
+		return fd;
+
+	wd = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
+			       IN_MODIFY);
+	if (wd == -1) {
+		close(fd);
+		return wd;
+	}
+	fds.fd = fd;
+	fds.events = POLLIN;
+
+	ret = cg_write(cgroup, "cgroup.freeze", freeze ? "1" : "0");
+	if (ret) {
+		close(fd);
+		return ret;
+	}
+
+	while (true) {
+		wd = poll(&fds, 1, 10000);
+
+		if (wd == -1 && errno == EINTR)
+			continue;
+
+		if (wd == 1 && fds.revents & POLLIN)
+			ret = 0;
+
+		break;
+	}
+
+	close(fd);
+
+	return ret;
+}
+
+/*
+ * Check if the process is frozen and parked in a proper place.
+ */
+static int proc_check_frozen(int pid, void *arg)
+{
+	char buf[PAGE_SIZE];
+	int len;
+
+	len = proc_read_text(pid, "stat", buf, sizeof(buf));
+	if (len == -1) {
+		debug("Can't get %d stat\n", pid);
+		return -1;
+	}
+
+	if (strstr(buf, "(test_freezer) S ") == NULL) {
+		debug("Process %d is in an unexpected state: %s\n", pid, buf);
+		return -1;
+	}
+
+	len = proc_read_text(pid, "stack", buf, sizeof(buf));
+	if (len == -1) {
+		debug("Can't get stack of the process %d\n", pid);
+		return -1;
+	}
+
+	if (strstr(buf, "[<0>] cgroup_freezer") != buf) {
+		debug("Process %d has unexpected stacktrace: %s\n", pid, buf);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * Check if the cgroup is frozen and all belonging processes are
+ * parked in a proper place.
+ */
+static int cg_check_frozen(const char *cgroup, bool frozen)
+{
+	if (frozen) {
+		/*
+		 * Check the cgroup.events::frozen value.
+		 */
+		if (cg_read_strstr(cgroup, "cgroup.events", "frozen 1") != 0) {
+			debug("Cgroup %s isn't frozen\n", cgroup);
+			return -1;
+		}
+
+		/*
+		 * Check that all processes are parked in the proper place.
+		 */
+		if (cg_for_all_procs(cgroup, proc_check_frozen, NULL)) {
+			debug("Some processes of cgroup %s are not frozen\n",
+			      cgroup);
+			return -1;
+		}
+	} else {
+		/*
+		 * Check the cgroup.events::frozen value.
+		 */
+		if (cg_read_strstr(cgroup, "cgroup.events", "frozen 0") != 0) {
+			debug("Cgroup %s is unexpectedly frozen\n", cgroup);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * A simple process running in a sleep loop until being
+ * re-parented.
+ */
+static int child_fn(const char *cgroup, void *arg)
+{
+	int ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+/*
+ * A simple test for the cgroup freezer: it populates a cgroup with
+ * 100 running processes and freezes it. Then it unfreezes the cgroup,
+ * kills all processes, and destroys the cgroup.
+ */
+static int test_cgfreezer_simple(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	int i;
+
+	cgroup = cg_name(root, "cg_test");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	for (i = 0; i < 100; i++)
+		cg_run_nowait(cgroup, child_fn, NULL);
+
+	if (cg_wait_for_proc_count(cgroup, 100))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, false))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, true))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, false))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * The test creates the following hierarchy:
+ *       A
+ *    / / \ \
+ *   B  E  I K
+ *  /\  |
+ * C  D F
+ *      |
+ *      G
+ *      |
+ *      H
+ *
+ * with a process in C, H and 3 processes in K.
+ * Then it tries to freeze and unfreeze the whole tree.
+ */
+static int test_cgfreezer_tree(const char *root)
+{
+	char *cgroup[10] = {0};
+	int ret = KSFT_FAIL;
+	int i;
+
+	cgroup[0] = cg_name(root, "cg_test_A");
+	if (!cgroup[0])
+		goto cleanup;
+
+	cgroup[1] = cg_name(cgroup[0], "cg_test_B");
+	if (!cgroup[1])
+		goto cleanup;
+
+	cgroup[2] = cg_name(cgroup[1], "cg_test_C");
+	if (!cgroup[2])
+		goto cleanup;
+
+	cgroup[3] = cg_name(cgroup[1], "cg_test_D");
+	if (!cgroup[3])
+		goto cleanup;
+
+	cgroup[4] = cg_name(cgroup[0], "cg_test_E");
+	if (!cgroup[4])
+		goto cleanup;
+
+	cgroup[5] = cg_name(cgroup[4], "cg_test_F");
+	if (!cgroup[5])
+		goto cleanup;
+
+	cgroup[6] = cg_name(cgroup[5], "cg_test_G");
+	if (!cgroup[6])
+		goto cleanup;
+
+	cgroup[7] = cg_name(cgroup[6], "cg_test_H");
+	if (!cgroup[7])
+		goto cleanup;
+
+	cgroup[8] = cg_name(cgroup[0], "cg_test_I");
+	if (!cgroup[8])
+		goto cleanup;
+
+	cgroup[9] = cg_name(cgroup[0], "cg_test_K");
+	if (!cgroup[9])
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		if (cg_create(cgroup[i]))
+			goto cleanup;
+
+	cg_run_nowait(cgroup[2], child_fn, NULL);
+	cg_run_nowait(cgroup[7], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+	cg_run_nowait(cgroup[9], child_fn, NULL);
+
+	/*
+	 * Wait until all child processes enter the
+	 * corresponding cgroups.
+	 */
+
+	if (cg_wait_for_proc_count(cgroup[2], 1) ||
+	    cg_wait_for_proc_count(cgroup[7], 1) ||
+	    cg_wait_for_proc_count(cgroup[9], 3))
+		goto cleanup;
+
+	/*
+	 * Freeze B.
+	 */
+	if (cg_freeze_wait(cgroup[1], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	/*
+	 * Freeze F.
+	 */
+	if (cg_freeze_wait(cgroup[5], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[5], true))
+		goto cleanup;
+
+	/*
+	 * Freeze G.
+	 */
+	if (cg_freeze_wait(cgroup[6], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[6], true))
+		goto cleanup;
+
+	/*
+	 * Check that A and E are not frozen.
+	 */
+	if (cg_check_frozen(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[4], false))
+		goto cleanup;
+
+	/*
+	 * Freeze A. Check that A, B and E are frozen.
+	 */
+	if (cg_freeze_wait(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[4], true))
+		goto cleanup;
+
+	/*
+	 * Unfreeze B, F and G
+	 */
+	if (cg_freeze_wait(cgroup[1], false))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup[5], false))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup[6], false))
+		goto cleanup;
+
+	/*
+	 * Check that C and H are still frozen.
+	 */
+	if (cg_check_frozen(cgroup[2], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[7], true))
+		goto cleanup;
+
+	/*
+	 * Unfreeze A. Check that A, C and K are not frozen.
+	 */
+	if (cg_freeze_wait(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[2], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[9], false))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	for (i = 9; i >= 0 && cgroup[i]; i--) {
+		cg_destroy(cgroup[i]);
+		free(cgroup[i]);
+	}
+
+	return ret;
+}
+
+/*
+ * A fork bomb emulator.
+ */
+static int forkbomb_fn(const char *cgroup, void *arg)
+{
+	int ppid;
+
+	fork();
+	fork();
+
+	ppid = getppid();
+
+	while (getppid() == ppid)
+		usleep(1000);
+
+	return getppid() == ppid;
+}
+
+/*
+ * The test runs a fork bomb in a cgroup and tries to freeze it.
+ * Then it kills all processes and checks that the cgroup isn't
+ * populated anymore.
+ */
+static int test_cgfreezer_forkbomb(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+
+	cgroup = cg_name(root, "cg_forkbomb_test");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	cg_run_nowait(cgroup, forkbomb_fn, NULL);
+
+	usleep(100000);
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, true))
+		goto cleanup;
+
+	if (cg_killall(cgroup))
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup, 0))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+/*
+ * The test creates two nested cgroups, freezes the parent
+ * and removes the child. Then it checks that the parent cgroup
+ * remains frozen and it's possible to create a new child
+ * without unfreezing. The new child is frozen too.
+ */
+static int test_cgfreezer_rmdir(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *parent, *child = NULL;
+
+	parent = cg_name(root, "cg_test_A");
+	if (!parent)
+		goto cleanup;
+
+	child = cg_name(parent, "cg_test_B");
+	if (!child)
+		goto cleanup;
+
+	if (cg_create(parent))
+		goto cleanup;
+
+	if (cg_create(child))
+		goto cleanup;
+
+	if (cg_freeze_wait(parent, true))
+		goto cleanup;
+
+	if (cg_check_frozen(parent, true))
+		goto cleanup;
+
+	if (cg_destroy(child))
+		goto cleanup;
+
+	if (cg_check_frozen(parent, true))
+		goto cleanup;
+
+	if (cg_create(child))
+		goto cleanup;
+
+	if (cg_check_frozen(child, true))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (child)
+		cg_destroy(child);
+	free(child);
+	if (parent)
+		cg_destroy(parent);
+	free(parent);
+	return ret;
+}
+
+/*
+ * The test creates two cgroups: A and B. Then it runs a process in A,
+ * and performs several migrations:
+ * 1) A (running) -> B (frozen)
+ * 2) B (frozen) -> A (running)
+ * 3) A (frozen) -> B (frozen)
+ *
+ * On each step, it checks that the actual state of the cgroups matches
+ * the expected state.
+ */
+static int test_cgfreezer_migrate(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup[2] = {0};
+	int pid;
+
+	cgroup[0] = cg_name(root, "cg_test_A");
+	if (!cgroup[0])
+		goto cleanup;
+
+	cgroup[1] = cg_name(root, "cg_test_B");
+	if (!cgroup[1])
+		goto cleanup;
+
+	if (cg_create(cgroup[0]))
+		goto cleanup;
+
+	if (cg_create(cgroup[1]))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup[0], child_fn, NULL);
+	if (pid < 0)
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup[0], 1))
+		goto cleanup;
+
+	/*
+	 * Migrate from A (running) to B (frozen)
+	 */
+	if (cg_freeze_wait(cgroup[1], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	if (cg_enter(cgroup[1], pid))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	/*
+	 * Migrate from B (frozen) to A (running)
+	 */
+	if (cg_enter(cgroup[0], pid))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], false))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	/*
+	 * Migrate from A (frozen) to B (frozen)
+	 */
+	if (cg_freeze_wait(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_enter(cgroup[1], pid))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[0], true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup[1], true))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup[0])
+		cg_destroy(cgroup[0]);
+	free(cgroup[0]);
+	if (cgroup[1])
+		cg_destroy(cgroup[1]);
+	free(cgroup[1]);
+	return ret;
+}
+
+/*
+ * The test checks that it's possible to ptrace a process in a frozen cgroup.
+ */
+static int test_cgfreezer_ptrace(const char *root)
+{
+	int ret = KSFT_FAIL;
+	char *cgroup = NULL;
+	siginfo_t siginfo;
+	int pid;
+
+	cgroup = cg_name(root, "cg_test");
+	if (!cgroup)
+		goto cleanup;
+
+	if (cg_create(cgroup))
+		goto cleanup;
+
+	pid = cg_run_nowait(cgroup, child_fn, NULL);
+	if (pid < 0)
+		goto cleanup;
+
+	if (cg_wait_for_proc_count(cgroup, 1))
+		goto cleanup;
+
+	if (cg_freeze_wait(cgroup, true))
+		goto cleanup;
+
+	if (cg_check_frozen(cgroup, true))
+		goto cleanup;
+
+	if (ptrace(PTRACE_SEIZE, pid, NULL, NULL))
+		goto cleanup;
+
+	if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL))
+		goto cleanup;
+
+	waitpid(pid, NULL, 0);
+
+	if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &siginfo))
+		goto cleanup;
+
+	if (ptrace(PTRACE_DETACH, pid, NULL, NULL))
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (cgroup)
+		cg_destroy(cgroup);
+	free(cgroup);
+	return ret;
+}
+
+#define T(x) { x, #x }
+struct cgfreezer_test {
+	int (*fn)(const char *root);
+	const char *name;
+} tests[] = {
+	T(test_cgfreezer_simple),
+	T(test_cgfreezer_tree),
+	T(test_cgfreezer_forkbomb),
+	T(test_cgfreezer_rmdir),
+	T(test_cgfreezer_migrate),
+	T(test_cgfreezer_ptrace),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+	char root[PATH_MAX];
+	int i, ret = EXIT_SUCCESS;
+
+	if (cg_find_unified_root(root, sizeof(root)))
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		switch (tests[i].fn(root)) {
+		case KSFT_PASS:
+			ksft_test_result_pass("%s\n", tests[i].name);
+			break;
+		case KSFT_SKIP:
+			ksft_test_result_skip("%s\n", tests[i].name);
+			break;
+		default:
+			ret = EXIT_FAILURE;
+			ksft_test_result_fail("%s\n", tests[i].name);
+			break;
+		}
+	}
+
+	return ret;
+}
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 6/6] cgroup: document cgroup v2 freezer interface
  2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
                   ` (5 preceding siblings ...)
  2018-11-12 23:04 ` [PATCH v2 5/6] kselftests: cgroup: add freezer controller self-tests Roman Gushchin
@ 2018-11-12 23:04 ` Roman Gushchin
  6 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-12 23:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, cgroups, linux-kernel, kernel-team, Roman Gushchin

Describe the cgroup v2 freezer interface in the cgroup v2 admin guide.
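
For illustration, here is a minimal userspace sketch of consuming the
interface documented below. The helper is illustrative, not part of the
patch; it only assumes the "frozen 0/1" line format of cgroup.events
described in this series.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative helper (not part of this patch): parse the "frozen"
 * field from the contents of a cgroup.events file, which looks like
 * "populated 1\nfrozen 0\n" with this series applied. */
static bool cgroup_events_frozen(const char *buf)
{
	const char *p = buf;

	while (p && *p) {
		/* Each line is a "key value" pair. */
		if (!strncmp(p, "frozen ", 7))
			return p[7] == '1';
		/* Advance to the next line, if any. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return false;
}
```

A userspace agent would read /sys/fs/cgroup/<cgroup>/cgroup.events into
a buffer, pass it to a helper like this, and re-check after each poll()
or inotify notification, as the selftests in this series do.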

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 26 +++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..a065c0bed88c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -862,6 +862,8 @@ All cgroup core files are prefixed with "cgroup."
 	  populated
 		1 if the cgroup or its descendants contains any live
 		processes; otherwise, 0.
+	  frozen
+		1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
 	A read-write single value files.  The default is "max".
@@ -895,6 +897,30 @@ All cgroup core files are prefixed with "cgroup."
 		A dying cgroup can consume system resources not exceeding
 		limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+	A read-write single value file which exists on non-root cgroups.
+	Allowed values are "0" and "1". The default is "0".
+
+	Writing "1" to the file causes freezing of the cgroup and all
+	descendant cgroups. This means that all belonging processes will
+	be stopped and will not run until the cgroup is explicitly
+	unfrozen. Freezing of the cgroup may take some time; when this
+	process is complete, the "frozen" value in the cgroup.events control
+	file will be updated and the corresponding notification will be issued.
+
+	A cgroup can be frozen either by its own settings or by the settings
+	of any ancestor cgroup. If any ancestor cgroup is frozen, the
+	cgroup will remain frozen.
+
+	Processes in a frozen cgroup can be killed by a fatal signal.
+	They can also enter and leave a frozen cgroup: either by an explicit
+	move by a user, or if freezing of the cgroup races with fork().
+	If a process is moved into a frozen cgroup, it stops. If a process
+	is moved out of a frozen cgroup, it becomes running.
+
+	The frozen state of a cgroup doesn't affect any cgroup tree operations:
+	it's possible to delete a frozen (and empty) cgroup, as well as
+	to create new sub-cgroups.
 
 Controllers
 ===========
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-12 23:04 ` [PATCH v2 3/6] cgroup: cgroup v2 freezer Roman Gushchin
@ 2018-11-13  2:08   ` Tejun Heo
  2018-11-13 18:47     ` Roman Gushchin
  2018-11-13 15:37   ` Oleg Nesterov
  2018-11-13 15:48   ` Oleg Nesterov
  2 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2018-11-13  2:08 UTC (permalink / raw)
  To: Roman Gushchin, Oleg Nesterov
  Cc: cgroups, linux-kernel, kernel-team, Roman Gushchin,
	Linus Torvalds, Andrew Morton, Ingo Molnar

(cc'ing Linus, Andrew, Ingo for the addition of a new task state)

Hello,

On Mon, Nov 12, 2018 at 03:04:19PM -0800, Roman Gushchin wrote:
> Cgroup v1 implements the freezer controller, which provides an ability
> to stop the workload in a cgroup and temporarily free up some
> resources (cpu, io, network bandwidth and, potentially, memory)
> for some other tasks. Cgroup v2 lacks this functionality.
> 
> This patch implements freezer for cgroup v2. While the functionality
> is similar, the interface is different from cgroup v1: it follows
> cgroup v2 interface principles.
> 
> Key differences are:
> 1) There is no separate controller: the functionality is always
> available and is represented by cgroup.freeze and cgroup.events
> cgroup control files.
> 2) The desired state is defined by the cgroup.freeze control file.
> Any hierarchical configuration is allowed.
> 3) The interface is asynchronous. The actual state is available
> using cgroup.events control file ("frozen" field). There are no
> dedicated transitional states.
> 4) It's allowed to make any changes with the cgroup hierarchy
> (create new cgroups, remove old cgroups, move tasks between cgroups)
> no matter if some cgroups are frozen.
> 5) Tasks in a frozen cgroup can be killed.

I think it'd be worthwhile to explain that tasks are now frozen in a
state which is essentially equivalent to the job control stop state
but which can only be controlled by the freezer and its interactions
with ptrace.

Oleg, I'd really appreciate if you can review the signal delivery /
ptrace parts of the patch.

> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 22254c1fe1c5..600165d0f5a2 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -63,6 +63,12 @@ enum {
>  	 * specified at mount time and thus is implemented here.
>  	 */
>  	CGRP_CPUSET_CLONE_CHILDREN,
> +
> +	/* Control group has to be frozen. */
> +	CGRP_FREEZE,
> +
> +	/* Cgroup is frozen. */
> +	CGRP_FROZEN,
>  };
>  
>  /* cgroup_root->flags */
> @@ -314,6 +320,27 @@ struct cgroup_rstat_cpu {
>  	struct cgroup *updated_next;		/* NULL iff not on the list */
>  };
>  
> +struct cgroup_freezer_state {
> +	/* Should the cgroup actually be frozen?
> +	 * (protected by cgroup_mutex)
> +	 */
> +	int e_freeze;
> +
> +	/* Fields below are protected by css_set_lock */
> +
> +	/* Number of frozen descendant cgroups */
> +	int nr_frozen_descendants;
> +
> +	/* Number of tasks to freeze */
> +	int nr_tasks_to_freeze;
> +
> +	/* Number of frozen tasks */
> +	int nr_frozen_tasks;
> +
> +	/* Used for delayed notifications */
> +	struct work_struct notify_work;
> +};
> +
>  struct cgroup {
>  	/* self css with NULL ->ss, points back to this cgroup */
>  	struct cgroup_subsys_state self;
> @@ -442,6 +469,12 @@ struct cgroup {
>  	/* If there is block congestion on this cgroup. */
>  	atomic_t congestion_count;
>  
> +	/* Should the cgroup and its descendants be frozen. */
> +	bool freeze;

Why not have this in freezer too?

> +	/* Used to store internal freezer state */
> +	struct cgroup_freezer_state freezer;
> +
>  	/* ids of the ancestors at each level including self */
>  	int ancestor_ids[];
>  };
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 32c553556bbd..8f7e82b05bf8 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -871,4 +871,46 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>  		free_cgroup_ns(ns);
>  }
>  
> +#ifdef CONFIG_CGROUPS
> +
> +static inline bool cgroup_frozen(struct task_struct *task)
> +{
> +	bool ret;
> +
> +	rcu_read_lock();
> +	ret = test_bit(CGRP_FREEZE, &task_dfl_cgroup(task)->flags);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool cgroup_task_in_freezer(struct task_struct *task)
> +{
> +	return task->frozen;
> +}

I think the above are pretty confusing.  CGRP_FREEZE is frozen and
task->frozen is cgroup_task_in_freezer()?  Can't they be e.g.
cgroup_task_freeze() (or freezing, which probably is clearer) and
cgroup_task_frozen()?

> +void cgroup_freezer_enter(void);
> +void cgroup_freezer_leave(void);

So, if we use freeze, freezing, frozen instead, the aboves can be
cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).

> +void cgroup_freeze(struct cgroup *cgrp, bool freeze);
> +void cgroup_notify_frozen_fn(struct work_struct *work);

This doesn't need to be exported outside of cgroup proper, right?

> +void cgroup_notify_frozen(struct cgroup *cgrp, bool frozen);
> +void cgroup_queue_notify_frozen(struct cgroup *cgrp);
> +void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
> +				 struct cgroup *dst);

Ditto.  Do these need to be exposed outside freezer or cgroup proper?

> +
> +#else /* !CONFIG_CGROUPS */
> +
> +static inline bool cgroup_task_in_freezer(struct task_struct *task)
> +{
> +	return false;
> +}
> +static inline void cgroup_freezer_enter(void) { }
> +static inline void cgroup_freezer_leave(void) { }
> +static inline bool cgroup_frozen(struct task_struct *task)
> +{
> +	return false;
> +}
> +
> +#endif /* !CONFIG_CGROUPS */
> +
>  #endif /* _LINUX_CGROUP_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 977cb57d7bc9..8ef5d3174e50 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -83,7 +83,8 @@ struct task_group;
>  #define TASK_WAKING			0x0200
>  #define TASK_NOLOAD			0x0400
>  #define TASK_NEW			0x0800
> -#define TASK_STATE_MAX			0x1000
> +#define TASK_FROZEN			0x1000
> +#define TASK_STATE_MAX			0x2000

We should also cc linux-api@vger.kernel.org as this is visible from
userland, I think.

>  /* Convenience macros for the sake of set_current_state: */
>  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
> @@ -733,6 +734,8 @@ struct task_struct {
>  #ifdef CONFIG_CGROUPS
>  	/* disallow userland-initiated cgroup migration */
>  	unsigned			no_cgroup_migration:1;
> +	/* task is in the cgroup freezer loop */

The above comment isn't strictly true, right?

> +	unsigned			frozen:1;
>  #endif
>  #ifdef CONFIG_BLK_CGROUP
>  	/* to be used once the psi infrastructure lands upstream. */
> diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
> index 98228bd48aee..6c49455dcfe6 100644
> --- a/include/linux/sched/jobctl.h
> +++ b/include/linux/sched/jobctl.h
> @@ -18,6 +18,7 @@ struct task_struct;
>  #define JOBCTL_TRAP_NOTIFY_BIT	20	/* trap for NOTIFY */
>  #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
>  #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
> +#define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
>  
>  #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
>  #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
> @@ -26,8 +27,10 @@ struct task_struct;
>  #define JOBCTL_TRAP_NOTIFY	(1UL << JOBCTL_TRAP_NOTIFY_BIT)
>  #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
>  #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
> +#define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
>  
> -#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> +#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY | \
> +				 JOBCTL_TRAP_FREEZE)
>  #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
>  
>  extern bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask);
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index 8d5689ca94b9..5d7a76bfbbb7 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> -obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
> +obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o freezer.o
>  
>  obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
>  obj-$(CONFIG_CGROUP_PIDS) += pids.o
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index ef3442555b32..4cffaae075af 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2358,6 +2358,10 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
>  			css_set_move_task(task, from_cset, to_cset, true);
>  			put_css_set_locked(from_cset);
>  			from_cset->nr_tasks--;
> +
> +			cgroup_freezer_migrate_task(task,
> +						    from_cset->dfl_cgrp,
> +						    to_cset->dfl_cgrp);
>  		}
>  	}
>  	spin_unlock_irq(&css_set_lock);
> @@ -3401,8 +3405,11 @@ static ssize_t cgroup_max_depth_write(struct kernfs_open_file *of,
>  
>  static int cgroup_events_show(struct seq_file *seq, void *v)
>  {
> -	seq_printf(seq, "populated %d\n",
> -		   cgroup_is_populated(seq_css(seq)->cgroup));
> +	struct cgroup *cgrp = seq_css(seq)->cgroup;
> +
> +	seq_printf(seq, "populated %d\n", cgroup_is_populated(cgrp));
> +	seq_printf(seq, "frozen %d\n", test_bit(CGRP_FROZEN, &cgrp->flags));
> +
>  	return 0;
>  }
>  
> @@ -3449,6 +3456,40 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
>  	return ret;
>  }
>  
> +static int cgroup_freeze_show(struct seq_file *seq, void *v)
> +{
> +	struct cgroup *cgrp = seq_css(seq)->cgroup;
> +
> +	seq_printf(seq, "%d\n", cgrp->freeze);
> +
> +	return 0;
> +}
> +
> +static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> +				   char *buf, size_t nbytes, loff_t off)
> +{
> +	struct cgroup *cgrp;
> +	ssize_t ret;
> +	int freeze;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &freeze);
> +	if (ret)
> +		return ret;
> +
> +	if (freeze < 0 || freeze > 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	cgroup_freeze(cgrp, freeze);
> +
> +	cgroup_kn_unlock(of->kn);
> +
> +	return nbytes;
> +}
> +
>  static int cgroup_file_open(struct kernfs_open_file *of)
>  {
>  	struct cftype *cft = of->kn->priv;
> @@ -4574,6 +4615,12 @@ static struct cftype cgroup_base_files[] = {
>  		.name = "cgroup.stat",
>  		.seq_show = cgroup_stat_show,
>  	},
> +	{
> +		.name = "cgroup.freeze",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = cgroup_freeze_show,
> +		.write = cgroup_freeze_write,
> +	},
>  	{
>  		.name = "cpu.stat",
>  		.flags = CFTYPE_NOT_ON_ROOT,
> @@ -4899,11 +4946,20 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>  	if (ret)
>  		goto out_idr_free;
>  
> +	INIT_WORK(&cgrp->freezer.notify_work, cgroup_notify_frozen_fn);
> +	cgrp->freezer.e_freeze = parent->freezer.e_freeze;
> +	if (cgrp->freezer.e_freeze)
> +		set_bit(CGRP_FROZEN, &cgrp->flags);
> +
>  	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
>  		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
>  
> -		if (tcgrp != cgrp)
> +		if (tcgrp != cgrp) {
>  			tcgrp->nr_descendants++;
> +
> +			if (cgrp->freezer.e_freeze)
> +				tcgrp->freezer.nr_frozen_descendants++;
> +		}
>  	}
>  
>  	if (notify_on_release(parent))
> @@ -5190,6 +5246,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
>  	for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) {
>  		tcgrp->nr_descendants--;
>  		tcgrp->nr_dying_descendants++;
> +		if (cgrp->freezer.e_freeze)
> +			tcgrp->freezer.nr_frozen_descendants--;
>  	}
>  
>  	cgroup1_check_for_release(parent);
> @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
>  			cset->nr_tasks++;
>  			css_set_move_task(child, NULL, cset, false);
>  		}
> +
> +		if (unlikely(cgroup_frozen(child) &&
> +			     (child->flags & ~PF_KTHREAD))) {

I don't think we need explicit PF_KTHREAD test here.  We don't allow
kthreads in non-root cgroups anyway and if we wanna change that there
are a bunch of other things which need updating anyway.

> +			struct cgroup *cgrp;
> +			unsigned long flags;
> +
> +			if (lock_task_sighand(child, &flags)) {
> +				cgrp = cset->dfl_cgrp;
> +				cgrp->freezer.nr_tasks_to_freeze++;
> +				WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> +					     cgrp->freezer.nr_frozen_tasks);
> +				child->jobctl |= JOBCTL_TRAP_FREEZE;
> +				signal_wake_up(child, false);
> +				unlock_task_sighand(child, &flags);
> +			}
> +		}
> +
>  		spin_unlock_irq(&css_set_lock);
>  	}
>  
> @@ -5690,6 +5765,24 @@ void cgroup_exit(struct task_struct *tsk)
>  		spin_lock_irq(&css_set_lock);
>  		css_set_move_task(tsk, cset, NULL, false);
>  		cset->nr_tasks--;
> +
> +		if (unlikely(cgroup_frozen(tsk) &&
> +			     (tsk->flags & ~PF_KTHREAD))) {

Ditto.

> +			struct cgroup *frozen_cgrp = cset->dfl_cgrp;
> +
> +			frozen_cgrp->freezer.nr_tasks_to_freeze--;
> +
> +			WARN_ON(tsk->frozen);

Why not WARN_ON_ONCE here too?

> +			WARN_ON_ONCE(frozen_cgrp->freezer.nr_tasks_to_freeze <
> +				     0);
> +			WARN_ON_ONCE(frozen_cgrp->freezer.nr_tasks_to_freeze <
> +				     frozen_cgrp->freezer.nr_frozen_tasks);
> +
> +			if (frozen_cgrp->freezer.nr_frozen_tasks ==
> +			    frozen_cgrp->freezer.nr_tasks_to_freeze)
> +				cgroup_queue_notify_frozen(frozen_cgrp);
> +		}
> +
>  		spin_unlock_irq(&css_set_lock);
>  	} else {
>  		get_css_set(cset);
> diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
> new file mode 100644
> index 000000000000..b81e215e2cce
> --- /dev/null
> +++ b/kernel/cgroup/freezer.c
> @@ -0,0 +1,247 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cgroup.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task.h>
> +#include <linux/sched/signal.h>
> +
> +#include "cgroup-internal.h"
> +
> +void cgroup_notify_frozen(struct cgroup *cgrp, bool frozen)
> +{
> +	int delta = 1;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	/*
> +	 * Did we race with fork() or exit()? Nope, everything is still frozen.
> +	 */
> +	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> +		return;
> +
> +	if (frozen)
> +		set_bit(CGRP_FROZEN, &cgrp->flags);
> +	else
> +		clear_bit(CGRP_FROZEN, &cgrp->flags);

I'm not sure this is wrong but it feels a bit weird to tie the actual
state transition to notification.  Wouldn't it be more
straight-forward if CGRP_FROZEN bit is purely determined by whether
the tasks are frozen or not and the notification just checks that
against the last notified state and generate a notification if they're
different?

> +	cgroup_file_notify(&cgrp->events_file);
> +
> +	while ((cgrp = cgroup_parent(cgrp))) {
> +		if (frozen) {
> +			cgrp->freezer.nr_frozen_descendants += delta;
> +			if (!test_bit(CGRP_FROZEN, &cgrp->flags) &&
> +			    test_bit(CGRP_FREEZE, &cgrp->flags) &&
> +			    cgrp->freezer.nr_frozen_descendants ==
> +			    cgrp->nr_descendants) {
> +				set_bit(CGRP_FROZEN, &cgrp->flags);
> +				cgroup_file_notify(&cgrp->events_file);
> +				delta++;
> +			}
> +		} else {
> +			cgrp->freezer.nr_frozen_descendants -= delta;
> +			if (test_bit(CGRP_FROZEN, &cgrp->flags)) {
> +				clear_bit(CGRP_FROZEN, &cgrp->flags);
> +				cgroup_file_notify(&cgrp->events_file);
> +				delta++;
> +			}
> +		}
> +	}
> +}

So that all these state transitions are synchronous with the actual
freezing events and we can just queue per-cgroup work items all the
way to the top if the new state is different from the last one
cgroup-by-cgroup?

> +void cgroup_notify_frozen_fn(struct work_struct *work)
> +{
> +	struct cgroup *cgrp = container_of(work, struct cgroup,
> +					   freezer.notify_work);
> +
> +	mutex_lock(&cgroup_mutex);
> +	cgroup_notify_frozen(cgrp, true);
> +	mutex_unlock(&cgroup_mutex);
> +
> +	css_put(&cgrp->self);
> +}
> +
> +void cgroup_queue_notify_frozen(struct cgroup *cgrp)
> +{
> +	if (work_pending(&cgrp->freezer.notify_work))
> +		return;
> +
> +	css_get(&cgrp->self);
> +	schedule_work(&cgrp->freezer.notify_work);
> +}
> +
> +void cgroup_freezer_enter(void)
> +{
> +	long state = current->state;
> +	struct cgroup *cgrp;
> +
> +	if (!current->frozen) {

I think this needs a lot more comments explaining what's going on.
It's really subtle that this is where the frozen state is becoming
sticky until the task breaks out of the signal delivery path.

> +		spin_lock_irq(&css_set_lock);
> +		current->frozen = true;
> +		cgrp = task_dfl_cgroup(current);
> +		cgrp->freezer.nr_frozen_tasks++;
> +
> +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> +			     cgrp->freezer.nr_frozen_tasks);
> +
> +		if (cgrp->freezer.nr_tasks_to_freeze ==
> +		    cgrp->freezer.nr_frozen_tasks)
> +			cgroup_queue_notify_frozen(cgrp);
> +		spin_unlock_irq(&css_set_lock);
> +	}
> +
> +	/* refrigerator */
> +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
> +	clear_thread_flag(TIF_SIGPENDING);

The toggling of TIF_SIGPENDING needs explanation too.

> +	schedule();
> +	recalc_sigpending();
> +
> +	set_current_state(state);
> +}
> +
> +void cgroup_freezer_leave(void)
> +{
> +	struct cgroup *cgrp;
> +
> +	spin_lock_irq(&css_set_lock);
> +	cgrp = task_dfl_cgroup(current);
> +	cgrp->freezer.nr_frozen_tasks--;
> +	WARN_ON_ONCE(cgrp->freezer.nr_frozen_tasks < 0);
> +	current->frozen = false;
> +	spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_do_freeze(struct cgroup *cgrp, bool freeze)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	spin_lock_irq(&css_set_lock);
> +	if (freeze) {
> +		cgrp->freezer.nr_tasks_to_freeze = __cgroup_task_count(cgrp);
> +		set_bit(CGRP_FREEZE, &cgrp->flags);
> +	} else {
> +		clear_bit(CGRP_FREEZE, &cgrp->flags);
> +		cgroup_notify_frozen(cgrp, false);
> +	}
> +	spin_unlock_irq(&css_set_lock);
> +
> +	css_task_iter_start(&cgrp->self, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (task->flags & PF_KTHREAD)
> +			continue;

WARN_ON_ONCE?

> +		if (!lock_task_sighand(task, &flags))
> +			continue;

So, this gotta be a dying task.  A comment wouldn't hurt.

> +		if (freeze) {
> +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> +			signal_wake_up(task, false);
> +		} else {
> +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> +			wake_up_process(task);

Again, a comment explaining why one's signal_wake_up() and the other's
wake_up_process() would be great.

> +		}
> +
> +		unlock_task_sighand(task, &flags);
> +	}
> +	css_task_iter_end(&it);
> +
> +	if (freeze && cgrp->nr_descendants ==
> +	    cgrp->freezer.nr_frozen_descendants) {
> +		spin_lock_irq(&css_set_lock);
> +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> +			     cgrp->freezer.nr_frozen_tasks);
> +		if (cgrp->freezer.nr_tasks_to_freeze ==
> +		    cgrp->freezer.nr_frozen_tasks)
> +			cgroup_notify_frozen(cgrp, true);
> +		spin_unlock_irq(&css_set_lock);

I think notification handling would be easier if we separate out state
transitions and notifications.

> +	}
> +}
> +
> +void cgroup_freezer_migrate_task(struct task_struct *task,
> +				 struct cgroup *src, struct cgroup *dst)
> +{
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&css_set_lock);
> +
> +	if (task->flags & PF_KTHREAD)
> +		return;
> +
> +	if (test_bit(CGRP_FREEZE, &src->flags) || task->frozen)
> +		src->freezer.nr_tasks_to_freeze--;
> +	if (test_bit(CGRP_FREEZE, &dst->flags) || task->frozen)
> +		dst->freezer.nr_tasks_to_freeze++;
> +
> +	if (task->frozen) {
> +		src->freezer.nr_frozen_tasks--;
> +		dst->freezer.nr_frozen_tasks++;
> +	}
> +
> +	if (test_bit(CGRP_FREEZE, &src->flags) ==
> +	    test_bit(CGRP_FREEZE, &dst->flags))
> +		return;
> +
> +	if (lock_task_sighand(task, &flags)) {
> +		if (test_bit(CGRP_FREEZE, &dst->flags))
> +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> +		else
> +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;

How are these flags synchronized?

> +		signal_wake_up(task, false);

Here, we're using signal_wake_up() for both transitions unlike the
earlier spot.

> +		unlock_task_sighand(task, &flags);
> +	}
> +}
> +
> +void cgroup_freeze(struct cgroup *cgrp, bool freeze)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct cgroup *dsct;
> +	bool applied = false;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	/*
> +	 * Nothing changed? Just exit.
> +	 */
> +	if (cgrp->freeze == freeze)
> +		return;
> +
> +	cgrp->freeze = freeze;
> +
> +	/*
> +	 * Propagate changes down the cgroup tree.
> +	 */
> +	css_for_each_descendant_pre(css, &cgrp->self) {
> +		dsct = css->cgroup;
> +
> +		if (cgroup_is_dead(dsct))
> +			continue;
> +
> +		if (freeze) {
> +			dsct->freezer.e_freeze++;
> +			/*
> +			 * Already frozen because of ancestor's settings?
> +			 */
> +			if (dsct->freezer.e_freeze > 1)
> +				continue;
> +		} else {
> +			dsct->freezer.e_freeze--;
> +			/*
> +			 * Still frozen because of ancestor's settings?
> +			 */
> +			if (dsct->freezer.e_freeze > 0)
> +				continue;
> +
> +			WARN_ON_ONCE(dsct->freezer.e_freeze < 0);
> +		}
> +
> +		/*
> +		 * Do change actual state: freeze or unfreeze.
> +		 */
> +		cgroup_do_freeze(dsct, freeze);
> +		applied = true;
> +	}
> +
> +	if (!applied)
> +		cgroup_file_notify(&cgrp->events_file);
> +}
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 21fec73d45d4..5e484e2480e5 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -400,6 +400,12 @@ static int ptrace_attach(struct task_struct *task, long request,
>  
>  	spin_lock(&task->sighand->siglock);
>  
> +	/*
> +	 * Kick the process to get it out of the refrigerator.
> +	 */
> +	if (cgroup_frozen(task))
> +		wake_up_process(task);
> +
>  	/*
>  	 * If the task is already STOPPED, set JOBCTL_TRAP_STOP and
>  	 * TRAPPING, and kick it so that it transits to TRACED.  TRAPPING
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5843c541fda9..6d7f0654f60d 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -326,6 +326,11 @@ void task_clear_jobctl_pending(struct task_struct *task, unsigned long mask)
>  	if (mask & JOBCTL_STOP_PENDING)
>  		mask |= JOBCTL_STOP_CONSUME | JOBCTL_STOP_DEQUEUED;
>  
> +	/*
> +	 * JOBCTL_TRAP_FREEZE is set and cleared from cgroup side,
> +	 * don't touch it here.
> +	 */
> +	mask &= ~JOBCTL_TRAP_FREEZE;

It's a bit weird to toggle the bit from here.  Can't we do this from
the callers?

>  	task->jobctl &= ~mask;
>  
>  	if (!(task->jobctl & JOBCTL_PENDING_MASK))
> @@ -2252,7 +2257,7 @@ static bool do_signal_stop(int signr)
>  }
>  
>  /**
> - * do_jobctl_trap - take care of ptrace jobctl traps
> + * do_jobctl_trap - take care of ptrace and cgroup freezer jobctl traps
>   *
>   * When PT_SEIZED, it's used for both group stop and explicit
>   * SEIZE/INTERRUPT traps.  Both generate PTRACE_EVENT_STOP trap with
> @@ -2268,20 +2273,35 @@ static bool do_signal_stop(int signr)
>   */
>  static void do_jobctl_trap(void)
>  {
> +	struct sighand_struct *sighand = current->sighand;
>  	struct signal_struct *signal = current->signal;
>  	int signr = current->jobctl & JOBCTL_STOP_SIGMASK;
>  
> -	if (current->ptrace & PT_SEIZED) {
> -		if (!signal->group_stop_count &&
> -		    !(signal->flags & SIGNAL_STOP_STOPPED))
> -			signr = SIGTRAP;
> -		WARN_ON_ONCE(!signr);
> -		ptrace_do_notify(signr, signr | (PTRACE_EVENT_STOP << 8),
> -				 CLD_STOPPED);
> -	} else {
> -		WARN_ON_ONCE(!signr);
> -		ptrace_stop(signr, CLD_STOPPED, 0, NULL);
> -		current->exit_code = 0;
> +	if (current->jobctl & (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)) {
> +		if (current->ptrace & PT_SEIZED) {
> +			if (!signal->group_stop_count &&
> +			    !(signal->flags & SIGNAL_STOP_STOPPED))
> +				signr = SIGTRAP;
> +			WARN_ON_ONCE(!signr);
> +			ptrace_do_notify(signr,
> +					 signr | (PTRACE_EVENT_STOP << 8),
> +					 CLD_STOPPED);
> +		} else {
> +			WARN_ON_ONCE(!signr);
> +			ptrace_stop(signr, CLD_STOPPED, 0, NULL);
> +			current->exit_code = 0;
> +		}
> +	} else if (current->jobctl & JOBCTL_TRAP_FREEZE) {
> +		/*
> +		 * Enter the freezer, unless the task is about to exit.
> +		 */
> +		if (fatal_signal_pending(current)) {
> +			current->jobctl &= ~JOBCTL_TRAP_FREEZE;
> +		} else {
> +			spin_unlock_irq(&sighand->siglock);
> +			cgroup_freezer_enter();
> +			spin_lock_irq(&sighand->siglock);
> +		}

We'll need a healthy amount of explanation in terms of how freezer
interacts with jobctl stop and ptrace.

>  	}
>  }
>  
> @@ -2403,6 +2423,16 @@ bool get_signal(struct ksignal *ksig)
>  			goto relock;
>  		}
>  
> +		/*
> +		 * If the task is leaving the freezer loop, let's notify
> +		 * the cgroup and reset the frozen mark.
> +		 */
> +		if (unlikely(cgroup_task_in_freezer(current))) {
> +			spin_unlock_irq(&sighand->siglock);
> +			cgroup_freezer_leave();
> +			spin_lock_irq(&sighand->siglock);
> +		}
> +
>  		signr = dequeue_signal(current, &current->blocked, &ksig->info);
>  
>  		if (!signr)

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-12 23:04 ` [PATCH v2 3/6] cgroup: cgroup v2 freezer Roman Gushchin
  2018-11-13  2:08   ` Tejun Heo
@ 2018-11-13 15:37   ` Oleg Nesterov
  2018-11-13 15:43     ` Tejun Heo
  2018-11-13 15:48   ` Oleg Nesterov
  2 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2018-11-13 15:37 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, cgroups, linux-kernel, kernel-team, Roman Gushchin

On 11/12, Roman Gushchin wrote:
>
> This patch implements freezer for cgroup v2. While the functionality
> is similar, the interface is different from cgroup v1: it follows
> cgroup v2 interface principles.

Oh, it seems that I actually need to apply this patch to (try to) understand
the details ;) Will try tomorrow.

> --- a/include/linux/sched/jobctl.h
> +++ b/include/linux/sched/jobctl.h
> @@ -18,6 +18,7 @@ struct task_struct;
>  #define JOBCTL_TRAP_NOTIFY_BIT	20	/* trap for NOTIFY */
>  #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
>  #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
> +#define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
>
>  #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
>  #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
> @@ -26,8 +27,10 @@ struct task_struct;
>  #define JOBCTL_TRAP_NOTIFY	(1UL << JOBCTL_TRAP_NOTIFY_BIT)
>  #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
>  #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
> +#define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
>
> -#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> +#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY | \
> +				 JOBCTL_TRAP_FREEZE)

Again, I didn't actually read the patch yet, but my gut feeling tells me
we shouldn't change JOBCTL_TRAP_MASK... and the fact you had to change
task_clear_jobctl_pending() to filter out the JOBCTL_TRAP_FREEZE bit arguably
proves this.

This

	if (current->jobctl & (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY))
	...
	else if (current->jobctl & JOBCTL_TRAP_FREEZE)

code in do_jobctl_trap() doesn't look nice either.

OK, please forget for now, but perhaps it would be more clean to add
JOBCTL_TRAP_FREEZE to the JOBCTL_PENDING_MASK check in recalc_sigpending()
and change get_signal to check JOBCTL_TRAP_MASK | JOBCTL_TRAP_FREEZE; and
I am not even sure cgroup_freezer_enter() should live in do_jobctl_trap().

> @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
>  			cset->nr_tasks++;
>  			css_set_move_task(child, NULL, cset, false);
>  		}
> +
> +		if (unlikely(cgroup_frozen(child) &&
> +			     (child->flags & ~PF_KTHREAD))) {
> +			struct cgroup *cgrp;
> +			unsigned long flags;
> +
> +			if (lock_task_sighand(child, &flags)) {

You can just do spin_lock_irq(siglock). The new child can't go away
until wake_up_new_task(), otherwise any usage of "child" including
lock_task_sighand() was not safe.

> +				cgrp = cset->dfl_cgrp;
> +				cgrp->freezer.nr_tasks_to_freeze++;
> +				WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> +					     cgrp->freezer.nr_frozen_tasks);
> +				child->jobctl |= JOBCTL_TRAP_FREEZE;
> +				signal_wake_up(child, false);

signal_wake_up() is pointless.

wake_up_process() has no effect, set_tsk_thread_flag(TIF_SIGPENDING) is
not needed because schedule_tail() does calculate_sigpending() which should
notice JOBCTL_TRAP_FREEZE.

> +	} else if (current->jobctl & JOBCTL_TRAP_FREEZE) {
> +		/*
> +		 * Enter the freezer, unless the task is about to exit.
> +		 */
> +		if (fatal_signal_pending(current)) {
> +			current->jobctl &= ~JOBCTL_TRAP_FREEZE;

And again, please note that we need this because task_clear_jobctl_pending()
drops JOBCTL_TRAP_FREEZE. It shouldn't, I think...

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 15:37   ` Oleg Nesterov
@ 2018-11-13 15:43     ` Tejun Heo
  2018-11-13 16:00       ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2018-11-13 15:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roman Gushchin, cgroups, linux-kernel, kernel-team, Roman Gushchin

Hello, Oleg.

On Tue, Nov 13, 2018 at 04:37:01PM +0100, Oleg Nesterov wrote:
> On 11/12, Roman Gushchin wrote:
> >
> > This patch implements freezer for cgroup v2. While the functionality
> > is similar, the interface is different from cgroup v1: it follows
> > cgroup v2 interface principles.
> 
> Oh, it seems that I actually need to apply this patch to (try to) understand
> the details ;) Will try tomorrow.

Yeah, it's a bit of a head spin like everything in signal delivery /
ptrace paths.

...
> OK, please forget for now, but perhaps it would be more clean to add
> JOBCTL_TRAP_FREEZE to the JOBCTL_PENDING_MASK check in recalc_sigpending()
> and change get_signal to check JOBCTL_TRAP_MASK | JOBCTL_TRAP_FREEZE; and
> I am not even sure cgroup_freezer_enter() should live in do_jobctl_trap().

I'm sure you're aware of the context but just to refresh - one thing
which was really broken about cgroup1 freezer was that it piggybacked
on hibernation freezer and put frozen tasks in a state which is
undefined when seen from userspace - they're just stuck in D sleep
somewhere in the kernel.  That's fine when the whole system is not
gonna be running, but not when only a subportion is being frozen.

So, the primary goal of cgroup2 freezer is putting the tasks in an
equivalent state as jobctl stop.  It's a jobctl stop but controlled by
cgroup frozen state, meaning that they can be killed, PTRACE_SEIZE'd
and INTERRUPT'ed (PTRACE_ATTACH doesn't work as signal delivery should
be blocked but that's fine) and so on.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-12 23:04 ` [PATCH v2 3/6] cgroup: cgroup v2 freezer Roman Gushchin
  2018-11-13  2:08   ` Tejun Heo
  2018-11-13 15:37   ` Oleg Nesterov
@ 2018-11-13 15:48   ` Oleg Nesterov
  2018-11-13 21:59     ` Roman Gushchin
  2 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2018-11-13 15:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, cgroups, linux-kernel, kernel-team, Roman Gushchin

On 11/12, Roman Gushchin wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -83,7 +83,8 @@ struct task_group;
>  #define TASK_WAKING			0x0200
>  #define TASK_NOLOAD			0x0400
>  #define TASK_NEW			0x0800
> -#define TASK_STATE_MAX			0x1000
> +#define TASK_FROZEN			0x1000
> +#define TASK_STATE_MAX			0x2000

Just noticed the new task state... Why? Can't we avoid it?

...

> +void cgroup_freezer_enter(void)
> +{
> +	long state = current->state;

Why? it must be TASK_RUNNING?

If not, set_current_state() at the end is simply wrong... Yes, __refrigerator()
does this, but at least it has a reason although it is wrong too.

> +	struct cgroup *cgrp;
> +
> +	if (!current->frozen) {
> +		spin_lock_irq(&css_set_lock);
> +		current->frozen = true;
> +		cgrp = task_dfl_cgroup(current);
> +		cgrp->freezer.nr_frozen_tasks++;
> +
> +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> +			     cgrp->freezer.nr_frozen_tasks);
> +
> +		if (cgrp->freezer.nr_tasks_to_freeze ==
> +		    cgrp->freezer.nr_frozen_tasks)
> +			cgroup_queue_notify_frozen(cgrp);
> +		spin_unlock_irq(&css_set_lock);
> +	}
> +
> +	/* refrigerator */
> +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);

Why not __set_current_state() ?

If ->state include TASK_INTERRUPTIBLE, why do we need TASK_WAKEKILL?

And again, why TASK_FROZEN?

> +	clear_thread_flag(TIF_SIGPENDING);
> +	schedule();
> +	recalc_sigpending();

I simply can't understand these 3 lines above but I bet this is not correct ;)

if nothing else recalc_sigpending() without ->siglock is wrong, it can race
with signal_wakeup/etc.

> +	set_current_state(state);

see above...

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 15:43     ` Tejun Heo
@ 2018-11-13 16:00       ` Oleg Nesterov
  0 siblings, 0 replies; 23+ messages in thread
From: Oleg Nesterov @ 2018-11-13 16:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, cgroups, linux-kernel, kernel-team, Roman Gushchin

Hi Tejun,

On 11/13, Tejun Heo wrote:
>
> > OK, please forget for now, but perhaps it would be more clean to add
> > JOBCTL_TRAP_FREEZE to the JOBCTL_PENDING_MASK check in recalc_sigpending()
> > and change get_signal to check JOBCTL_TRAP_MASK | JOBCTL_TRAP_FREEZE; and
> > I am not even sure cgroup_freezer_enter() should live in do_jobctl_trap().
>
> I'm sure you're aware of the context but just to refresh - one thing
> which was really broken about cgroup1 freezer was that it piggybacked
> on hibernation freezer and put frozen tasks in a state which is
> undefined when seen from userspace - they're just stuck in D sleep
> somewhere in the kernel.  That's fine when the whole system is not
> gonna be running, but not when only a subportion is being frozen.

Thanks, I see.

> So, the primary goal of cgroup2 freezer is putting the tasks in an
> equivalent state as jobctl stop.  It's a jobctl stop but controlled by
> cgroup frozen state, meaning that they can be killed, PTRACE_SEIZE'd
> and INTERRUPT'ed (PTRACE_ATTACH doesn't work as signal delivery should
> be blocked but that's fine) and so on.

And I agree, JOBCTL_TRAP_FREEZE looks fine.

Just somehow I _feel_ that we can improve this logic a bit, but let me
repeat that of course I can be easily wrong and I didn't even read the
patch yet.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13  2:08   ` Tejun Heo
@ 2018-11-13 18:47     ` Roman Gushchin
  2018-11-13 19:15       ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2018-11-13 18:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Oleg Nesterov, cgroups, linux-kernel,
	Kernel Team, Linus Torvalds, Andrew Morton, Ingo Molnar

On Mon, Nov 12, 2018 at 06:08:17PM -0800, Tejun Heo wrote:
> (cc'ing Linus, Andrew, Ingo for the addition of a new task state)
> 
> Hello,
> 
> On Mon, Nov 12, 2018 at 03:04:19PM -0800, Roman Gushchin wrote:
> > Cgroup v1 implements the freezer controller, which provides an ability
> > to stop the workload in a cgroup and temporarily free up some
> > resources (cpu, io, network bandwidth and, potentially, memory)
> > for some other tasks. Cgroup v2 lacks this functionality.
> > 
> > This patch implements freezer for cgroup v2. While the functionality
> > is similar, the interface is different from cgroup v1: it follows
> > cgroup v2 interface principles.
> > 
> > Key differences are:
> > 1) There is no separate controller: the functionality is always
> > available and is represented by cgroup.freeze and cgroup.events
> > cgroup control files.
> > 2) The desired state is defined by the cgroup.freeze control file.
> > Any hierarchical configuration is allowed.
> > 3) The interface is asynchronous. The actual state is available
> > using cgroup.events control file ("frozen" field). There are no
> > dedicated transitional states.
> > 4) It's allowed to make any changes with the cgroup hierarchy
> > (create new cgroups, remove old cgroups, move tasks between cgroups)
> > no matter if some cgroups are frozen.
> > 5) Tasks in a frozen cgroup can be killed.
> 
> I think it'd be worthwhile to explain that tasks are now frozen in a
> state which is essentially equivalent to the job control stop state
> but which can only be controlled by the freezer and its interactions
> with ptrace.

Agreed.

> 
> Oleg, I'd really appreciate if you can review the signal delivery /
> ptrace parts of the patch.
> 
> > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > index 22254c1fe1c5..600165d0f5a2 100644
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -63,6 +63,12 @@ enum {
> >  	 * specified at mount time and thus is implemented here.
> >  	 */
> >  	CGRP_CPUSET_CLONE_CHILDREN,
> > +
> > +	/* Control group has to be frozen. */
> > +	CGRP_FREEZE,
> > +
> > +	/* Cgroup is frozen. */
> > +	CGRP_FROZEN,
> >  };
> >  
> >  /* cgroup_root->flags */
> > @@ -314,6 +320,27 @@ struct cgroup_rstat_cpu {
> >  	struct cgroup *updated_next;		/* NULL iff not on the list */
> >  };
> >  
> > +struct cgroup_freezer_state {
> > +	/* Should the cgroup actually be frozen?
> > +	 * (protected by cgroup_mutex)
> > +	 */
> > +	int e_freeze;
> > +
> > +	/* Fields below are protected by css_set_lock */
> > +
> > +	/* Number of frozen descendant cgroups */
> > +	int nr_frozen_descendants;
> > +
> > +	/* Number of tasks to freeze */
> > +	int nr_tasks_to_freeze;
> > +
> > +	/* Number of frozen tasks */
> > +	int nr_frozen_tasks;
> > +
> > +	/* Used for delayed notifications */
> > +	struct work_struct notify_work;
> > +};
> > +
> >  struct cgroup {
> >  	/* self css with NULL ->ss, points back to this cgroup */
> >  	struct cgroup_subsys_state self;
> > @@ -442,6 +469,12 @@ struct cgroup {
> >  	/* If there is block congestion on this cgroup. */
> >  	atomic_t congestion_count;
> >  
> > +	/* Should the cgroup and its descendants be frozen. */
> > +	bool freeze;
> 
> Why not have this in freezer too?

I thought that this variable is just the state of the cgroup.freeze knob,
where the freezer field contains the internal state of the freezer, and
can in theory be allocated dynamically.

Not a strong preference, I can move it there too, if you prefer to.

> 
> > +	/* Used to store internal freezer state */
> > +	struct cgroup_freezer_state freezer;
> > +
> >  	/* ids of the ancestors at each level including self */
> >  	int ancestor_ids[];
> >  };
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 32c553556bbd..8f7e82b05bf8 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -871,4 +871,46 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> >  		free_cgroup_ns(ns);
> >  }
> >  
> > +#ifdef CONFIG_CGROUPS
> > +
> > +static inline bool cgroup_frozen(struct task_struct *task)
> > +{
> > +	bool ret;
> > +
> > +	rcu_read_lock();
> > +	ret = test_bit(CGRP_FREEZE, &task_dfl_cgroup(task)->flags);
> > +	rcu_read_unlock();
> > +
> > +	return ret;
> > +}
> > +
> > +static inline bool cgroup_task_in_freezer(struct task_struct *task)
> > +{
> > +	return task->frozen;
> > +}
> 
> I think the above are pretty confusing.  CGRP_FREEZE is frozen and
> task->frozen is cgroup_task_in_freezer()?  Can't they be e.g.
> cgroup_task_freeze() (or freezing, which probably is clearer) and
> cgroup_task_frozen()?

Yeah, looks good to me, will rename.

> 
> > +void cgroup_freezer_enter(void);
> > +void cgroup_freezer_leave(void);
> 
> So, if we use freeze, freezing, frozen instead, the aboves can be
> cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).

Idk, maybe cgroup_enter_frozen()/cgroup_leave_frozen() ?

> 
> > +void cgroup_freeze(struct cgroup *cgrp, bool freeze);
> > +void cgroup_notify_frozen_fn(struct work_struct *work);
> 
> This doesn't need to be exported outside of cgroup proper, right?
> 
> > +void cgroup_notify_frozen(struct cgroup *cgrp, bool frozen);
> > +void cgroup_queue_notify_frozen(struct cgroup *cgrp);
> > +void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
> > +				 struct cgroup *dst);
> 
> Ditto.  Do these need to be exposed outside freezer or cgroup proper?

Yeah, makes sense, will remove.

> 
> > +
> > +#else /* !CONFIG_CGROUPS */
> > +
> > +static inline bool cgroup_task_in_freezer(struct task_struct *task)
> > +{
> > +	return false;
> > +}
> > +static inline void cgroup_freezer_enter(void) { }
> > +static inline void cgroup_freezer_leave(void) { }
> > +static inline bool cgroup_frozen(struct task_struct *task)
> > +{
> > +	return false;
> > +}
> > +
> > +#endif /* !CONFIG_CGROUPS */
> > +
> >  #endif /* _LINUX_CGROUP_H */
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 977cb57d7bc9..8ef5d3174e50 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -83,7 +83,8 @@ struct task_group;
> >  #define TASK_WAKING			0x0200
> >  #define TASK_NOLOAD			0x0400
> >  #define TASK_NEW			0x0800
> > -#define TASK_STATE_MAX			0x1000
> > +#define TASK_FROZEN			0x1000
> > +#define TASK_STATE_MAX			0x2000
> 
> We should also cc linux-api@vger.kernel.org as this is visible from
> userland, I think.

Ok.

> 
> >  /* Convenience macros for the sake of set_current_state: */
> >  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
> > @@ -733,6 +734,8 @@ struct task_struct {
> >  #ifdef CONFIG_CGROUPS
> >  	/* disallow userland-initiated cgroup migration */
> >  	unsigned			no_cgroup_migration:1;
> > +	/* task is in the cgroup freezer loop */
> 
> The above comment isn't strictly true, right?

Why so?

It actually means that the task is looping somewhere in the signal delivery loop
after entering cgroup_freezer_enter() and before cgroup_freezer_leave().

Maybe simply "task is frozen by the cgroup freezer"?

> 
> > +	unsigned			frozen:1;
> >  #endif
> >  #ifdef CONFIG_BLK_CGROUP
> >  	/* to be used once the psi infrastructure lands upstream. */
> > diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
> > index 98228bd48aee..6c49455dcfe6 100644
> > --- a/include/linux/sched/jobctl.h
> > +++ b/include/linux/sched/jobctl.h
> > @@ -18,6 +18,7 @@ struct task_struct;
> >  #define JOBCTL_TRAP_NOTIFY_BIT	20	/* trap for NOTIFY */
> >  #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
> >  #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
> > +#define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
> >  
> >  #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
> >  #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
> > @@ -26,8 +27,10 @@ struct task_struct;
> >  #define JOBCTL_TRAP_NOTIFY	(1UL << JOBCTL_TRAP_NOTIFY_BIT)
> >  #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
> >  #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
> > +#define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
> >  
> > -#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> > +#define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY | \
> > +				 JOBCTL_TRAP_FREEZE)
> >  #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
> >  
> >  extern bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask);
> > diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> > index 8d5689ca94b9..5d7a76bfbbb7 100644
> > --- a/kernel/cgroup/Makefile
> > +++ b/kernel/cgroup/Makefile
> > @@ -1,5 +1,5 @@
> >  # SPDX-License-Identifier: GPL-2.0
> > -obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
> > +obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o freezer.o
> >  
> >  obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
> >  obj-$(CONFIG_CGROUP_PIDS) += pids.o
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index ef3442555b32..4cffaae075af 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2358,6 +2358,10 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
> >  			css_set_move_task(task, from_cset, to_cset, true);
> >  			put_css_set_locked(from_cset);
> >  			from_cset->nr_tasks--;
> > +
> > +			cgroup_freezer_migrate_task(task,
> > +						    from_cset->dfl_cgrp,
> > +						    to_cset->dfl_cgrp);
> >  		}
> >  	}
> >  	spin_unlock_irq(&css_set_lock);
> > @@ -3401,8 +3405,11 @@ static ssize_t cgroup_max_depth_write(struct kernfs_open_file *of,
> >  
> >  static int cgroup_events_show(struct seq_file *seq, void *v)
> >  {
> > -	seq_printf(seq, "populated %d\n",
> > -		   cgroup_is_populated(seq_css(seq)->cgroup));
> > +	struct cgroup *cgrp = seq_css(seq)->cgroup;
> > +
> > +	seq_printf(seq, "populated %d\n", cgroup_is_populated(cgrp));
> > +	seq_printf(seq, "frozen %d\n", test_bit(CGRP_FROZEN, &cgrp->flags));
> > +
> >  	return 0;
> >  }
> >  
> > @@ -3449,6 +3456,40 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
> >  	return ret;
> >  }
> >  
> > +static int cgroup_freeze_show(struct seq_file *seq, void *v)
> > +{
> > +	struct cgroup *cgrp = seq_css(seq)->cgroup;
> > +
> > +	seq_printf(seq, "%d\n", cgrp->freeze);
> > +
> > +	return 0;
> > +}
> > +
> > +static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> > +				   char *buf, size_t nbytes, loff_t off)
> > +{
> > +	struct cgroup *cgrp;
> > +	ssize_t ret;
> > +	int freeze;
> > +
> > +	ret = kstrtoint(strstrip(buf), 0, &freeze);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (freeze < 0 || freeze > 1)
> > +		return -ERANGE;
> > +
> > +	cgrp = cgroup_kn_lock_live(of->kn, false);
> > +	if (!cgrp)
> > +		return -ENOENT;
> > +
> > +	cgroup_freeze(cgrp, freeze);
> > +
> > +	cgroup_kn_unlock(of->kn);
> > +
> > +	return nbytes;
> > +}
> > +
> >  static int cgroup_file_open(struct kernfs_open_file *of)
> >  {
> >  	struct cftype *cft = of->kn->priv;
> > @@ -4574,6 +4615,12 @@ static struct cftype cgroup_base_files[] = {
> >  		.name = "cgroup.stat",
> >  		.seq_show = cgroup_stat_show,
> >  	},
> > +	{
> > +		.name = "cgroup.freeze",
> > +		.flags = CFTYPE_NOT_ON_ROOT,
> > +		.seq_show = cgroup_freeze_show,
> > +		.write = cgroup_freeze_write,
> > +	},
> >  	{
> >  		.name = "cpu.stat",
> >  		.flags = CFTYPE_NOT_ON_ROOT,
> > @@ -4899,11 +4946,20 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> >  	if (ret)
> >  		goto out_idr_free;
> >  
> > +	INIT_WORK(&cgrp->freezer.notify_work, cgroup_notify_frozen_fn);
> > +	cgrp->freezer.e_freeze = parent->freezer.e_freeze;
> > +	if (cgrp->freezer.e_freeze)
> > +		set_bit(CGRP_FROZEN, &cgrp->flags);
> > +
> >  	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
> >  		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
> >  
> > -		if (tcgrp != cgrp)
> > +		if (tcgrp != cgrp) {
> >  			tcgrp->nr_descendants++;
> > +
> > +			if (cgrp->freezer.e_freeze)
> > +				tcgrp->freezer.nr_frozen_descendants++;
> > +		}
> >  	}
> >  
> >  	if (notify_on_release(parent))
> > @@ -5190,6 +5246,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
> >  	for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) {
> >  		tcgrp->nr_descendants--;
> >  		tcgrp->nr_dying_descendants++;
> > +		if (cgrp->freezer.e_freeze)
> > +			tcgrp->freezer.nr_frozen_descendants--;
> >  	}
> >  
> >  	cgroup1_check_for_release(parent);
> > @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
> >  			cset->nr_tasks++;
> >  			css_set_move_task(child, NULL, cset, false);
> >  		}
> > +
> > +		if (unlikely(cgroup_frozen(child) &&
> > +			     (child->flags & ~PF_KTHREAD))) {
> 
> I don't think we need explicit PF_KTHREAD test here.  We don't allow
> kthreads in non-root cgroups anyway and if we wanna change that there
> are a bunch of other things which need updating anyway.

Don't we? I think we do. I've proposed a patch to fix this some time ago
(https://lkml.org/lkml/2017/10/12/556), but was NAKed by Peter.

> 
> > +			struct cgroup *cgrp;
> > +			unsigned long flags;
> > +
> > +			if (lock_task_sighand(child, &flags)) {
> > +				cgrp = cset->dfl_cgrp;
> > +				cgrp->freezer.nr_tasks_to_freeze++;
> > +				WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> > +					     cgrp->freezer.nr_frozen_tasks);
> > +				child->jobctl |= JOBCTL_TRAP_FREEZE;
> > +				signal_wake_up(child, false);
> > +				unlock_task_sighand(child, &flags);
> > +			}
> > +		}
> > +
> >  		spin_unlock_irq(&css_set_lock);
> >  	}
> >  
> > @@ -5690,6 +5765,24 @@ void cgroup_exit(struct task_struct *tsk)
> >  		spin_lock_irq(&css_set_lock);
> >  		css_set_move_task(tsk, cset, NULL, false);
> >  		cset->nr_tasks--;
> > +
> > +		if (unlikely(cgroup_frozen(tsk) &&
> > +			     (tsk->flags & ~PF_KTHREAD))) {
> 
> Ditto.
> 
> > +			struct cgroup *frozen_cgrp = cset->dfl_cgrp;
> > +
> > +			frozen_cgrp->freezer.nr_tasks_to_freeze--;
> > +
> > +			WARN_ON(tsk->frozen);
> 
> Why not WARN_ON_ONCE here too?

My bad, will fix.

> 
> > +			WARN_ON_ONCE(frozen_cgrp->freezer.nr_tasks_to_freeze <
> > +				     0);
> > +			WARN_ON_ONCE(frozen_cgrp->freezer.nr_tasks_to_freeze <
> > +				     frozen_cgrp->freezer.nr_frozen_tasks);
> > +
> > +			if (frozen_cgrp->freezer.nr_frozen_tasks ==
> > +			    frozen_cgrp->freezer.nr_tasks_to_freeze)
> > +				cgroup_queue_notify_frozen(frozen_cgrp);
> > +		}
> > +
> >  		spin_unlock_irq(&css_set_lock);
> >  	} else {
> >  		get_css_set(cset);
> > diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
> > new file mode 100644
> > index 000000000000..b81e215e2cce
> > --- /dev/null
> > +++ b/kernel/cgroup/freezer.c
> > @@ -0,0 +1,247 @@
> > +//SPDX-License-Identifier: GPL-2.0
> > +#include <linux/cgroup.h>
> > +#include <linux/sched.h>
> > +#include <linux/sched/task.h>
> > +#include <linux/sched/signal.h>
> > +
> > +#include "cgroup-internal.h"
> > +
> > +void cgroup_notify_frozen(struct cgroup *cgrp, bool frozen)
> > +{
> > +	int delta = 1;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	/*
> > +	 * Did we race with fork() or exit()? Np, everything is still frozen.
> > +	 */
> > +	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> > +		return;
> > +
> > +	if (frozen)
> > +		set_bit(CGRP_FROZEN, &cgrp->flags);
> > +	else
> > +		clear_bit(CGRP_FROZEN, &cgrp->flags);
> 
> I'm not sure this is wrong but it feels a bit weird to tie the actual
> state transition to notification.  Wouldn't it be more
> straight-forward if CGRP_FROZEN bit is purely determined by whether
> the tasks are frozen or not and the notification just checks that
> against the last notified state and generate a notification if they're
> different?

So, maybe cgroup_notify_frozen() is not the best name, maybe
cgroup_propagate_frozen() better reflects what's happening here.
We need to recalc the state of ancestor cgroups, and we have to do it
with cgroup_mutex held, which is why we do it from the delayed work
context (on hot paths).

The first part of the function can probably be separated and called
immediately. Is this what you're suggesting?

> 
> > +	cgroup_file_notify(&cgrp->events_file);
> > +
> > +	while ((cgrp = cgroup_parent(cgrp))) {
> > +		if (frozen) {
> > +			cgrp->freezer.nr_frozen_descendants += delta;
> > +			if (!test_bit(CGRP_FROZEN, &cgrp->flags) &&
> > +			    test_bit(CGRP_FREEZE, &cgrp->flags) &&
> > +			    cgrp->freezer.nr_frozen_descendants ==
> > +			    cgrp->nr_descendants) {
> > +				set_bit(CGRP_FROZEN, &cgrp->flags);
> > +				cgroup_file_notify(&cgrp->events_file);
> > +				delta++;
> > +			}
> > +		} else {
> > +			cgrp->freezer.nr_frozen_descendants -= delta;
> > +			if (test_bit(CGRP_FROZEN, &cgrp->flags)) {
> > +				clear_bit(CGRP_FROZEN, &cgrp->flags);
> > +				cgroup_file_notify(&cgrp->events_file);
> > +				delta++;
> > +			}
> > +		}
> > +	}
> > +}
> 
> So that all these state transitions are synchronous with the actual
> freezing events and we can just queue per-cgroup work items all the
> way to the top if the new state is different from the last one
> cgroup-by-cgroup?

Hm, Idk. Why is it better?

> 
> > +void cgroup_notify_frozen_fn(struct work_struct *work)
> > +{
> > +	struct cgroup *cgrp = container_of(work, struct cgroup,
> > +					   freezer.notify_work);
> > +
> > +	mutex_lock(&cgroup_mutex);
> > +	cgroup_notify_frozen(cgrp, true);
> > +	mutex_unlock(&cgroup_mutex);
> > +
> > +	css_put(&cgrp->self);
> > +}
> > +
> > +void cgroup_queue_notify_frozen(struct cgroup *cgrp)
> > +{
> > +	if (work_pending(&cgrp->freezer.notify_work))
> > +		return;
> > +
> > +	css_get(&cgrp->self);
> > +	schedule_work(&cgrp->freezer.notify_work);
> > +}
> > +
> > +void cgroup_freezer_enter(void)
> > +{
> > +	long state = current->state;
> > +	struct cgroup *cgrp;
> > +
> > +	if (!current->frozen) {
> 
> I think this needs a lot more comments explaining what's going on.
> It's really subtle that this is where the frozen state is becoming
> sticky until the task breaks out of the signal delivery path.

Agreed.

> 
> > +		spin_lock_irq(&css_set_lock);
> > +		current->frozen = true;
> > +		cgrp = task_dfl_cgroup(current);
> > +		cgrp->freezer.nr_frozen_tasks++;
> > +
> > +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> > +			     cgrp->freezer.nr_frozen_tasks);
> > +
> > +		if (cgrp->freezer.nr_tasks_to_freeze ==
> > +		    cgrp->freezer.nr_frozen_tasks)
> > +			cgroup_queue_notify_frozen(cgrp);
> > +		spin_unlock_irq(&css_set_lock);
> > +	}
> > +
> > +	/* refrigerator */
> > +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
> > +	clear_thread_flag(TIF_SIGPENDING);
> 
> The toggling of TIF_SIGPENDING needs explanation too.
> 
> > +	schedule();
> > +	recalc_sigpending();
> > +
> > +	set_current_state(state);
> > +}
> > +
> > +void cgroup_freezer_leave(void)
> > +{
> > +	struct cgroup *cgrp;
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	cgrp = task_dfl_cgroup(current);
> > +	cgrp->freezer.nr_frozen_tasks--;
> > +	WARN_ON_ONCE(cgrp->freezer.nr_frozen_tasks < 0);
> > +	current->frozen = false;
> > +	spin_unlock_irq(&css_set_lock);
> > +}
> > +
> > +static void cgroup_do_freeze(struct cgroup *cgrp, bool freeze)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +	unsigned long flags;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	spin_lock_irq(&css_set_lock);
> > +	if (freeze) {
> > +		cgrp->freezer.nr_tasks_to_freeze = __cgroup_task_count(cgrp);
> > +		set_bit(CGRP_FREEZE, &cgrp->flags);
> > +	} else {
> > +		clear_bit(CGRP_FREEZE, &cgrp->flags);
> > +		cgroup_notify_frozen(cgrp, false);
> > +	}
> > +	spin_unlock_irq(&css_set_lock);
> > +
> > +	css_task_iter_start(&cgrp->self, 0, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		if (task->flags & PF_KTHREAD)
> > +			continue;
> 
> WARN_ON_ONCE?

Unfortunately, no. See comments above.

> 
> > +		if (!lock_task_sighand(task, &flags))
> > +			continue;
> 
> So, this gotta be a dying task.  A comment wouldn't hurt.

Agree.

> 
> > +		if (freeze) {
> > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > +			signal_wake_up(task, false);
> > +		} else {
> > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > +			wake_up_process(task);
> 
> Again, a comment explaining why one's signal_wake_up() and the other's
> wake_up_process() would be great.
> 
> > +		}
> > +
> > +		unlock_task_sighand(task, &flags);
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	if (freeze && cgrp->nr_descendants ==
> > +	    cgrp->freezer.nr_frozen_descendants) {
> > +		spin_lock_irq(&css_set_lock);
> > +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> > +			     cgrp->freezer.nr_frozen_tasks);
> > +		if (cgrp->freezer.nr_tasks_to_freeze ==
> > +		    cgrp->freezer.nr_frozen_tasks)
> > +			cgroup_notify_frozen(cgrp, true);
> > +		spin_unlock_irq(&css_set_lock);
> 
> I think notification handling would be easier if we separate out state
> transitions and notifications.

Hm, really? I mean the notification part is really simple: every time
we do the state transition (set or clear CGRP_FROZEN flag), we call
cgroup_file_notify(). That's it.

I can wrap it in a helper function though, which might look better.
E.g. cgroup_set/clear_frozen(), which will toggle the bit and call
cgroup_file_notify() if necessary.

> 
> > +	}
> > +}
> > +
> > +void cgroup_freezer_migrate_task(struct task_struct *task,
> > +				 struct cgroup *src, struct cgroup *dst)
> > +{
> > +	unsigned long flags;
> > +
> > +	lockdep_assert_held(&css_set_lock);
> > +
> > +	if (task->flags & PF_KTHREAD)
> > +		return;
> > +
> > +	if (test_bit(CGRP_FREEZE, &src->flags) || task->frozen)
> > +		src->freezer.nr_tasks_to_freeze--;
> > +	if (test_bit(CGRP_FREEZE, &dst->flags) || task->frozen)
> > +		dst->freezer.nr_tasks_to_freeze++;
> > +
> > +	if (task->frozen) {
> > +		src->freezer.nr_frozen_tasks--;
> > +		dst->freezer.nr_frozen_tasks++;
> > +	}
> > +
> > +	if (test_bit(CGRP_FREEZE, &src->flags) ==
> > +	    test_bit(CGRP_FREEZE, &dst->flags))
> > +		return;
> > +
> > +	if (lock_task_sighand(task, &flags)) {
> > +		if (test_bit(CGRP_FREEZE, &dst->flags))
> > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > +		else
> > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> 
> How are these flags synchronized?

Using the css_set_lock.

> 
> > +		signal_wake_up(task, false);
> 
> Here, we're using signal_wake_up() for both transitions unlike the
> earlier spot.

Yeah, let me check it out.

> 
> > +		unlock_task_sighand(task, &flags);
> > +	}
> > +}
> > +
> > +void cgroup_freeze(struct cgroup *cgrp, bool freeze)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	struct cgroup *dsct;
> > +	bool applied = false;
> > +
> > +	lockdep_assert_held(&cgroup_mutex);
> > +
> > +	/*
> > +	 * Nothing changed? Just exit.
> > +	 */
> > +	if (cgrp->freeze == freeze)
> > +		return;
> > +
> > +	cgrp->freeze = freeze;
> > +
> > +	/*
> > +	 * Propagate changes downwards the cgroup tree.
> > +	 */
> > +	css_for_each_descendant_pre(css, &cgrp->self) {
> > +		dsct = css->cgroup;
> > +
> > +		if (cgroup_is_dead(dsct))
> > +			continue;
> > +
> > +		if (freeze) {
> > +			dsct->freezer.e_freeze++;
> > +			/*
> > +			 * Already frozen because of ancestor's settings?
> > +			 */
> > +			if (dsct->freezer.e_freeze > 1)
> > +				continue;
> > +		} else {
> > +			dsct->freezer.e_freeze--;
> > +			/*
> > +			 * Still frozen because of ancestor's settings?
> > +			 */
> > +			if (dsct->freezer.e_freeze > 0)
> > +				continue;
> > +
> > +			WARN_ON_ONCE(dsct->freezer.e_freeze < 0);
> > +		}
> > +
> > +		/*
> > +		 * Do change actual state: freeze or unfreeze.
> > +		 */
> > +		cgroup_do_freeze(dsct, freeze);
> > +		applied = true;
> > +	}
> > +
> > +	if (!applied)
> > +		cgroup_file_notify(&cgrp->events_file);
> > +}
> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > index 21fec73d45d4..5e484e2480e5 100644
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -400,6 +400,12 @@ static int ptrace_attach(struct task_struct *task, long request,
> >  
> >  	spin_lock(&task->sighand->siglock);
> >  
> > +	/*
> > +	 * Kick the process to get it out of the refrigerator.
> > +	 */
> > +	if (cgroup_frozen(task))
> > +		wake_up_process(task);
> > +
> >  	/*
> >  	 * If the task is already STOPPED, set JOBCTL_TRAP_STOP and
> >  	 * TRAPPING, and kick it so that it transits to TRACED.  TRAPPING
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 5843c541fda9..6d7f0654f60d 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -326,6 +326,11 @@ void task_clear_jobctl_pending(struct task_struct *task, unsigned long mask)
> >  	if (mask & JOBCTL_STOP_PENDING)
> >  		mask |= JOBCTL_STOP_CONSUME | JOBCTL_STOP_DEQUEUED;
> >  
> > +	/*
> > +	 * JOBCTL_TRAP_FREEZE is set and cleared from cgroup side,
> > +	 * don't touch it here.
> > +	 */
> > +	mask &= ~JOBCTL_TRAP_FREEZE;
> 
> It's a bit weird to toggle the bit from here.  Can't we do this from
> the callers?

Probably not. But we could exclude JOBCTL_TRAP_FREEZE from
JOBCTL_PENDING_MASK, and that would be enough.

> 
> >  	task->jobctl &= ~mask;
> >  
> >  	if (!(task->jobctl & JOBCTL_PENDING_MASK))
> > @@ -2252,7 +2257,7 @@ static bool do_signal_stop(int signr)
> >  }
> >  
> >  /**
> > - * do_jobctl_trap - take care of ptrace jobctl traps
> > + * do_jobctl_trap - take care of ptrace and cgroup freezer jobctl traps
> >   *
> >   * When PT_SEIZED, it's used for both group stop and explicit
> >   * SEIZE/INTERRUPT traps.  Both generate PTRACE_EVENT_STOP trap with
> > @@ -2268,20 +2273,35 @@ static bool do_signal_stop(int signr)
> >   */
> >  static void do_jobctl_trap(void)
> >  {
> > +	struct sighand_struct *sighand = current->sighand;
> >  	struct signal_struct *signal = current->signal;
> >  	int signr = current->jobctl & JOBCTL_STOP_SIGMASK;
> >  
> > -	if (current->ptrace & PT_SEIZED) {
> > -		if (!signal->group_stop_count &&
> > -		    !(signal->flags & SIGNAL_STOP_STOPPED))
> > -			signr = SIGTRAP;
> > -		WARN_ON_ONCE(!signr);
> > -		ptrace_do_notify(signr, signr | (PTRACE_EVENT_STOP << 8),
> > -				 CLD_STOPPED);
> > -	} else {
> > -		WARN_ON_ONCE(!signr);
> > -		ptrace_stop(signr, CLD_STOPPED, 0, NULL);
> > -		current->exit_code = 0;
> > +	if (current->jobctl & (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)) {
> > +		if (current->ptrace & PT_SEIZED) {
> > +			if (!signal->group_stop_count &&
> > +			    !(signal->flags & SIGNAL_STOP_STOPPED))
> > +				signr = SIGTRAP;
> > +			WARN_ON_ONCE(!signr);
> > +			ptrace_do_notify(signr,
> > +					 signr | (PTRACE_EVENT_STOP << 8),
> > +					 CLD_STOPPED);
> > +		} else {
> > +			WARN_ON_ONCE(!signr);
> > +			ptrace_stop(signr, CLD_STOPPED, 0, NULL);
> > +			current->exit_code = 0;
> > +		}
> > +	} else if (current->jobctl & JOBCTL_TRAP_FREEZE) {
> > +		/*
> > +		 * Enter the freezer, unless the task is about to exit.
> > +		 */
> > +		if (fatal_signal_pending(current)) {
> > +			current->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > +		} else {
> > +			spin_unlock_irq(&sighand->siglock);
> > +			cgroup_freezer_enter();
> > +			spin_lock_irq(&sighand->siglock);
> > +		}
> 
> We'll need a healthy amount of explanation in terms of how freezer
> interacts with jobctl stop and ptrace.

Sure, I will add a lot more comments once we agree on the actual implementation
in the thread with Oleg.

Thank you for the code review!


* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 18:47     ` Roman Gushchin
@ 2018-11-13 19:15       ` Tejun Heo
  2018-11-13 20:55         ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2018-11-13 19:15 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Roman Gushchin, Oleg Nesterov, cgroups, linux-kernel,
	Kernel Team, Linus Torvalds, Andrew Morton, Ingo Molnar

Hello, Roman.

On Tue, Nov 13, 2018 at 06:47:55PM +0000, Roman Gushchin wrote:
> > > +	/* Should the cgroup and its descendants be frozen. */
> > > +	bool freeze;
> > 
> > Why not have this in freezer too?
> 
> I thought that this variable is just the state of the cgroup.freeze knob,
> where the freezer field contains the internal state of the freezer, and
> can in theory be allocated dynamically.
> 
> Not a strong preference, I can move it there too, if you prefer to.

Yeah, let's just put it together.

> > > +void cgroup_freezer_enter(void);
> > > +void cgroup_freezer_leave(void);
> > 
> > So, if we use freeze, freezing, frozen instead, the aboves can be
> > cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).
> 
> Idk, maybe cgroup_enter_frozen()/cgroup_leave_frozen() ?

Sure.

> > > +	/* task is in the cgroup freezer loop */
> > 
> > The above comment isn't strictly true, right?
> 
> Why so?
> 
> It actually means that the task is looping somewhere in the signal delivery loop
> after entering cgroup_freezer_enter() and before cgroup_freezer_leave().
> 
> > Maybe simply "task is frozen by the cgroup freezer"?

Yeah, sounds good.

> > > @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
> > >  			cset->nr_tasks++;
> > >  			css_set_move_task(child, NULL, cset, false);
> > >  		}
> > > +
> > > +		if (unlikely(cgroup_frozen(child) &&
> > > +			     (child->flags & ~PF_KTHREAD))) {
> > 
> > I don't think we need explicit PF_KTHREAD test here.  We don't allow
> > kthreads in non-root cgroups anyway and if we wanna change that there
> > are a bunch of other things which need updating anyway.
> 
> Don't we? I think we do. I've proposed a patch to fix this some time ago
> (https://lkml.org/lkml/2017/10/12/556), but was NAKed by Peter.

Ah, right, I thought that went in.  Oh well, let's keep the test then.

> > > +	/*
> > > +	 * Did we race with fork() or exit()? Np, everything is still frozen.
> > > +	 */
> > > +	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> > > +		return;
> > > +
> > > +	if (frozen)
> > > +		set_bit(CGRP_FROZEN, &cgrp->flags);
> > > +	else
> > > +		clear_bit(CGRP_FROZEN, &cgrp->flags);
> > 
> > I'm not sure this is wrong but it feels a bit weird to tie the actual
> > state transition to notification.  Wouldn't it be more
> > straight-forward if CGRP_FROZEN bit is purely determined by whether
> > the tasks are frozen or not and the notification just checks that
> > against the last notified state and generate a notification if they're
> > different?
> 
> So, maybe cgroup_notify_frozen() is not the best name, maybe
> cgroup_propagate_frozen() better reflects what's happening here.
> We need to recalc the state of ancestor cgroups, and we have to do it
> with cgroup_mutex held, which is why we do it from the delayed work
> context (on hot paths).

Can't we protect that state with css_set_lock too?  That's what task
states are protected by and the cgroup state is a mere aggregation of
task states.

> The first part of the function can probably be separated and called
> immediately. Is this what you're suggesting?

Pretty much.  I think separating out state transitions and
notifications would make it more straightforward.

> > So that all these state transitions are synchronous with the actual
> > freezing events and we can just queue per-cgroup work items all the
> > way to the top if the new state is different from the last one
> > cgroup-by-cgroup?
> 
> > Hm, Idk. Why is it better?

So, the pieces are - 1. task states, 2. cgroup states and
3. notifications.  The current code ties #2 and #3 together,
which is weird because #2 is a mere aggregation of #1.  Also, that
way, notifications become a lot more robust because whether to
generate a notification or not can be solely determined from #2
flipping.  ie. sth like the following

	change_task_frozen_state()
	{
		update counters
		if (cgroup state needs to change) {
			change cgroup state;
			queue notification work;
			repeat for the parent;
		}
	}

where the notification work always notifies.  This should work and
trivially satisfies the requirement (there should be at least one
notification since the last state transition) without any further work.  Wouldn't
this be easier and more robust?  The current code depends on
annotating each possible transition event, which is kinda fragile.

> > > +	if (lock_task_sighand(task, &flags)) {
> > > +		if (test_bit(CGRP_FREEZE, &dst->flags))
> > > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > > +		else
> > > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > 
> > How are these flags synchronized?
> 
> Using the css_set_lock.

But other JOBCTL_TRAP bits aren't synchronized by css_set_lock, right?

Thanks.

-- 
tejun


* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 19:15       ` Tejun Heo
@ 2018-11-13 20:55         ` Roman Gushchin
  2018-11-13 20:58           ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2018-11-13 20:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Oleg Nesterov, cgroups, linux-kernel,
	Kernel Team, Linus Torvalds, Andrew Morton, Ingo Molnar

On Tue, Nov 13, 2018 at 11:15:41AM -0800, Tejun Heo wrote:
> Hello, Roman.
> 
> On Tue, Nov 13, 2018 at 06:47:55PM +0000, Roman Gushchin wrote:
> > > > +	/* Should the cgroup and its descendants be frozen. */
> > > > +	bool freeze;
> > > 
> > > Why not have this in freezer too?
> > 
> > I thought that this variable is just the state of the cgroup.freeze knob,
> > where the freezer field contains the internal state of the freezer, and
> > can in theory be allocated dynamically.
> > 
> > Not a strong preference, I can move it there too, if you prefer to.
> 
> Yeah, let's just put it together.
> 
> > > > +void cgroup_freezer_enter(void);
> > > > +void cgroup_freezer_leave(void);
> > > 
> > > So, if we use freeze, freezing, frozen instead, the aboves can be
> > > cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).
> > 
> > Idk, maybe cgroup_enter_frozen()/cgroup_leave_frozen() ?
> 
> Sure.
> 
> > > > +	/* task is in the cgroup freezer loop */
> > > 
> > > The above comment isn't strictly true, right?
> > 
> > Why so?
> > 
> > It actually means that the task is looping somewhere in the signal delivery loop
> > after entering cgroup_freezer_enter() and before cgroup_freezer_leave().
> > 
> > Maybe simply "task is frozen by the cgroup freezer"?
> 
> Yeah, sounds good.
> 
> > > > @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
> > > >  			cset->nr_tasks++;
> > > >  			css_set_move_task(child, NULL, cset, false);
> > > >  		}
> > > > +
> > > > +		if (unlikely(cgroup_frozen(child) &&
> > > > +			     (child->flags & ~PF_KTHREAD))) {
> > > 
> > > I don't think we need explicit PF_KTHREAD test here.  We don't allow
> > > kthreads in non-root cgroups anyway and if we wanna change that there
> > > are a bunch of other things which need updating anyway.
> > 
> > Don't we? I think we do. I've proposed a patch to fix this some time ago
> > (https://lkml.org/lkml/2017/10/12/556), but was NAKed by Peter.
> 
> Ah, right, I thought that went in.  Oh well, let's keep the test then.
> 
> > > > +	/*
> > > > +	 * Did we race with fork() or exit()? Np, everything is still frozen.
> > > > +	 */
> > > > +	if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> > > > +		return;
> > > > +
> > > > +	if (frozen)
> > > > +		set_bit(CGRP_FROZEN, &cgrp->flags);
> > > > +	else
> > > > +		clear_bit(CGRP_FROZEN, &cgrp->flags);
> > > 
> > > I'm not sure this is wrong but it feels a bit weird to tie the actual
> > > state transition to notification.  Wouldn't it be more
> > > straight-forward if CGRP_FROZEN bit is purely determined by whether
> > > the tasks are frozen or not and the notification just checks that
> > > against the last notified state and generate a notification if they're
> > > different?
> > 
> > So, maybe cgroup_notify_frozen() is not the best name, maybe
> > cgroup_propagate_frozen() better reflects what's happening here.
> > We need to recalc the state of ancestor cgroups, and we have to do it
> > with cgroup_mutex held; this is why we do it from the delayed work
> > context (on hot paths).
> 
> Can't we protect that state with css_set_lock too?  That's what task
> states are protected by and the cgroup state is a mere aggregation of
> task states.

See below.

> 
> > The first part of the function can probably be separated and called
> > immediately. Is this what you're suggesting?
> 
> Pretty much.  I think separating out state transitions and
> notifications would make it more straightforward.
> 
> > > So that all these state transitions are synchronous with the actual
> > > freezing events and we can just queue per-cgroup work items all the
> > > way to the top if the new state is different from the last one
> > > cgroup-by-cgroup?
> > 
> > Hm, Idk. Why it's better?
> 
> So, the pieces are - 1. task states, 2. cgroup states and
> 3. notifications.  The current code ties #2 and #3 together, which is
> weird because #2 is a mere aggregation of #1.  Also, that
> way, notifications become a lot more robust because whether to
> generate a notification or not can be solely determined from #2
> flipping.  ie. sth like the following
> 
> 	change_task_frozen_state()
> 	{
> 		update counters
> 		if (cgroup state needs to change) {
> 			change cgroup state;
> 			queue notification work;
> 			repeat for the parent;
> 		}
> 	}
> 
> where the notification work always notifies.  This should work and
> trivially satisfies the requirement (there should be at least one
> notification since the last state transition) without any further
> work.  Wouldn't
> this be easier and more robust?  The current code depends on
> annotating each possible transition event, which is kinda fragile.

Ok, it looks like I can additionally synchronize descendant counters
using css_set_lock, and then eliminate the whole thing with delayed
transitions/notifications.

Will do in v3.

> 
> > > > +	if (lock_task_sighand(task, &flags)) {
> > > > +		if (test_bit(CGRP_FREEZE, &dst->flags))
> > > > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > > > +		else
> > > > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > > 
> > > How are these flags synchronized?
> > 
> > Using the css_set_lock.
> 
> But other JOBCTL_TRAP bits aren't synchronized by css_set_lock, right?


But if we don't touch this bit anywhere else, should be fine, right?

Thanks!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 20:55         ` Roman Gushchin
@ 2018-11-13 20:58           ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2018-11-13 20:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Roman Gushchin, Oleg Nesterov, cgroups, linux-kernel,
	Kernel Team, Linus Torvalds, Andrew Morton, Ingo Molnar

On Tue, Nov 13, 2018 at 08:55:11PM +0000, Roman Gushchin wrote:
> > > > > +	if (lock_task_sighand(task, &flags)) {
> > > > > +		if (test_bit(CGRP_FREEZE, &dst->flags))
> > > > > +			task->jobctl |= JOBCTL_TRAP_FREEZE;
> > > > > +		else
> > > > > +			task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > > > 
> > > > How are these flags synchronized?
> > > 
> > > Using the css_set_lock.
> > 
> > But other JOBCTL_TRAP bits aren't synchronized by css_set_lock, right?
> 
> But if we don't touch this bit anywhere else, should be fine, right?

But other JOBCTL_TRAP_ bits aren't synchronized with css_set_lock, so
they can be updated (read-modify-write) concurrently and clobber each
other, no?

Thanks.

-- 
tejun


* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 15:48   ` Oleg Nesterov
@ 2018-11-13 21:59     ` Roman Gushchin
  2018-11-14 16:56       ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2018-11-13 21:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roman Gushchin, Tejun Heo, cgroups, linux-kernel, Kernel Team

Hi Oleg!

On Tue, Nov 13, 2018 at 04:48:25PM +0100, Oleg Nesterov wrote:
> On 11/12, Roman Gushchin wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -83,7 +83,8 @@ struct task_group;
> >  #define TASK_WAKING			0x0200
> >  #define TASK_NOLOAD			0x0400
> >  #define TASK_NEW			0x0800
> > -#define TASK_STATE_MAX			0x1000
> > +#define TASK_FROZEN			0x1000
> > +#define TASK_STATE_MAX			0x2000
> 
> Just noticed the new task state... Why? Can't we avoid it?

We can, but it's nice to show to userspace that tasks are frozen,
rather than just stuck somewhere in the kernel...

> 
> ...
> 
> > +void cgroup_freezer_enter(void)
> > +{
> > +	long state = current->state;
> 
> Why? it must be TASK_RUNNING?
> 
> If not set_current_state() at the end is simply wrong... Yes, __refrigerator()
> does this, but at least it has a reason although it is wrong too.
> 
> > +	struct cgroup *cgrp;
> > +
> > +	if (!current->frozen) {
> > +		spin_lock_irq(&css_set_lock);
> > +		current->frozen = true;
> > +		cgrp = task_dfl_cgroup(current);
> > +		cgrp->freezer.nr_frozen_tasks++;
> > +
> > +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> > +			     cgrp->freezer.nr_frozen_tasks);
> > +
> > +		if (cgrp->freezer.nr_tasks_to_freeze ==
> > +		    cgrp->freezer.nr_frozen_tasks)
> > +			cgroup_queue_notify_frozen(cgrp);
> > +		spin_unlock_irq(&css_set_lock);
> > +	}
> > +
> > +	/* refrigerator */
> > +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
> 
> Why not __set_current_state() ?

Hm, it's not a hot path at all, so set_current_state() is good enough.
Not a strong preference, of course.

> 
> If ->state include TASK_INTERRUPTIBLE, why do we need TASK_WAKEKILL?
> 
> And again, why TASK_FROZEN?

So, should it be just TASK_INTERRUPTIBLE | TASK_FROZEN ?

> 
> > +	clear_thread_flag(TIF_SIGPENDING);
> > +	schedule();
> > +	recalc_sigpending();
> 
> I simply can't understand these 3 lines above but I bet this is not correct ;)

So, yeah, the problem is that if there is TIF_SIGPENDING bit set, schedule()
will return immediately, so we're getting pretty much a busy loop here.
This is a nasty workaround.

I believe we can clear and not call recalc_sigpending() at all. Does this seem
to be correct?

> 
> if nothing else recalc_sigpending() without ->siglock is wrong, it can race
> with signal_wakeup/etc.
> 
> > +	set_current_state(state);
> 
> see above...

Thank you for the review!
And looking forward to more comments from you!


* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-13 21:59     ` Roman Gushchin
@ 2018-11-14 16:56       ` Oleg Nesterov
  2018-11-14 17:06         ` Roman Gushchin
  2018-11-28 17:36         ` Roman Gushchin
  0 siblings, 2 replies; 23+ messages in thread
From: Oleg Nesterov @ 2018-11-14 16:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Roman Gushchin, Tejun Heo, cgroups, linux-kernel, Kernel Team

Hi Roman,

On 11/13, Roman Gushchin wrote:
>
> > > +#define TASK_FROZEN			0x1000
> > > +#define TASK_STATE_MAX			0x2000
> >
> > Just noticed the new task state... Why? Can't we avoid it?
>
> We can, but it's nice to show to userspace that tasks are frozen,
> rather than just stuck somewhere in the kernel...

But then you need to change get_task_state() too. Which iiuc could
probably check ->frozen along with ->state.

I do not think the new task state is a good idea, at least I would like
to ask you to make a separate patch which we can discuss separately.


> > > +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
> >
> > Why not __set_current_state() ?
>
> Hm, it's not a hot path at all, so set_current_state() is good enough.
> Not a strong preference, of course.

It is not about performance, to me set_current_state() looks as if we need
a memory barrier for some obscure/undocumented reason and this doesn't help
to understand the code.

> > If ->state include TASK_INTERRUPTIBLE, why do we need TASK_WAKEKILL?
> >
> > And again, why TASK_FROZEN?
>
> So, should it be just TASK_INTERRUPTIBLE | TASK_FROZEN ?

Again, TASK_FROZEN is pointless at least until you change fs/proc or until
you have wake_up_state(TASK_FROZEN). Maybe cgroup_do_freeze() and/or
ptrace_attach() could use it, but see above, I'd suggest to make another
patch.

Looks like you need TASK_KILLABLE, see below.

> > > +	clear_thread_flag(TIF_SIGPENDING);
> > > +	schedule();
> > > +	recalc_sigpending();
> >
> > I simply can't understand these 3 lines above but I bet this is not correct ;)
>
> So, yeah, the problem is that if there is TIF_SIGPENDING bit set, schedule()
> will return immediately, so we're getting pretty much a busy loop here.

I suspected this answer ;)

> This is a nasty workaround.

No, this is very wrong. Just suppose the caller is killed right before
clear_thread_flag(TIF_SIGPENDING).

> I believe we can clear and not call recalc_sigpending() at all. Does this seem
> to be correct?

I think you need to simply remove both clear_thread_flag() and recalc_sigpending().
If schedule() is called in TASK_KILLABLE state it will return only if
fatal_signal_pending() is true, and this is what we want, right?

OK, it seems you are going to make the new version anyway, so I can wait for it
and not read this series ;)

Oleg.



* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-14 16:56       ` Oleg Nesterov
@ 2018-11-14 17:06         ` Roman Gushchin
  2018-11-14 17:36           ` Oleg Nesterov
  2018-11-28 17:36         ` Roman Gushchin
  1 sibling, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2018-11-14 17:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roman Gushchin, Tejun Heo, cgroups, linux-kernel, Kernel Team

On Wed, Nov 14, 2018 at 05:56:32PM +0100, Oleg Nesterov wrote:
> Hi Roman,
> 
> On 11/13, Roman Gushchin wrote:
> >
> > > > +#define TASK_FROZEN			0x1000
> > > > +#define TASK_STATE_MAX			0x2000
> > >
> > > Just noticed the new task state... Why? Can't we avoid it?
> >
> > We can, but it's nice to show to userspace that tasks are frozen,
> > rather than just stuck somewhere in the kernel...
> 
> But then you need to change get_task_state() too. Which iiuc could
> probably check ->frozen along with ->state.
> 
> I do not think the new task state is a good idea, at least I would like
> to ask you to make a separate patch which we can discuss separately.

Yeah, I have separated it in v3.

> 
> 
> > > > +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
> > >
> > > Why not __set_current_state() ?
> >
> > Hm, it's not a hot path at all, so set_current_state() is good enough.
> > Not a strong preference, of course.
> 
> It is not about performance, to me set_current_state() looks as if we need
> a memory barrier for some obscure/undocumented reason and this doesn't help
> to understand the code.
> 
> > > If ->state include TASK_INTERRUPTIBLE, why do we need TASK_WAKEKILL?
> > >
> > > And again, why TASK_FROZEN?
> >
> > So, should it be just TASK_INTERRUPTIBLE | TASK_FROZEN ?
> 
> Again, TASK_FROZEN is pointless at least until you change fs/proc or until
> you have wake_up_state(TASK_FROZEN). Maybe cgroup_do_freeze() and/or
> ptrace_attach() could use it, but see above, I'd suggest to make another
> patch.
> 
> Looks like you need TASK_KILLABLE, see below.
> 
> > > > +	clear_thread_flag(TIF_SIGPENDING);
> > > > +	schedule();
> > > > +	recalc_sigpending();
> > >
> > > I simply can't understand these 3 lines above but I bet this is not correct ;)
> >
> > So, yeah, the problem is that if there is TIF_SIGPENDING bit set, schedule()
> > will return immediately, so we're getting pretty much a busy loop here.
> 
> I suspected this answer ;)
> 
> > This is a nasty workaround.
> 
> No, this is very wrong. Just suppose the caller is killed right before
> clear_thread_flag(TIF_SIGPENDING).

So, I had TASK_KILLABLE before, but had some issues with ptrace/gdb.
I'll revisit this option.

> 
> > I believe we can clear and not call recalc_sigpending() at all. Does this seem
> > to be correct?
> 
> I think you need to simply remove both clear_thread_flag() and recalc_sigpending().
> If schedule() is called in TASK_KILLABLE state it will return only if
> fatal_signal_pending() is true, and this is what we want, right?
> 
> OK, it seems you are going to make the new version anyway, so I can wait for it
> and not read this series ;)

Sure! I'm about to post it.

Thanks!


* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-14 17:06         ` Roman Gushchin
@ 2018-11-14 17:36           ` Oleg Nesterov
  2018-11-14 17:39             ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2018-11-14 17:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Roman Gushchin, Tejun Heo, cgroups, linux-kernel, Kernel Team

On 11/14, Roman Gushchin wrote:
>
> > No, this is very wrong. Just suppose the caller is killed right before
> > clear_thread_flag(TIF_SIGPENDING).
>
> So, I had TASK_KILLABLE before, but had some issues with ptrace/gdb.
> I'll revisit this option.

Yes, ptrace_signal_wake_up() won't wake a TASK_KILLABLE tracee up...

BTW. Could you explain in the changelog how a frozen task should interact
with ptrace? 0/6 says "ptrace works" but that is all. It is not clear to me
what do we actually want wrt ptrace...

Oleg.



* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-14 17:36           ` Oleg Nesterov
@ 2018-11-14 17:39             ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-14 17:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roman Gushchin, Tejun Heo, cgroups, linux-kernel, Kernel Team

On Wed, Nov 14, 2018 at 06:36:11PM +0100, Oleg Nesterov wrote:
> On 11/14, Roman Gushchin wrote:
> >
> > > No, this is very wrong. Just suppose the caller is killed right before
> > > clear_thread_flag(TIF_SIGPENDING).
> >
> > So, I had TASK_KILLABLE before, but had some issues with ptrace/gdb.
> > I'll revisit this option.
> 
> Yes, ptrace_signal_wake_up() won't wake a TASK_KILLABLE tracee up...
> 
> BTW. Could you explain in the changelog how should a frozen task interact
> with ptrace? 0/6 says "ptrace works" but that is all. It is not clear to me
> what do we actually want wrt ptrace...

Sure, will add.
Basically, I want some minimal functionality to work properly.
E.g. attach to a frozen task, get some info, detach.


* Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer
  2018-11-14 16:56       ` Oleg Nesterov
  2018-11-14 17:06         ` Roman Gushchin
@ 2018-11-28 17:36         ` Roman Gushchin
  1 sibling, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2018-11-28 17:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roman Gushchin, Tejun Heo, cgroups, linux-kernel, Kernel Team

On Wed, Nov 14, 2018 at 05:56:32PM +0100, Oleg Nesterov wrote:
> Hi Roman,
> 
> OK, it seems you are going to make the new version anyway, so I can wait for it
> and not read this series ;)

Hi Oleg!

Can you, please, take a look at v3, which I posted recently?
https://lkml.org/lkml/2018/11/17/193

Thank you!


end of thread, other threads:[~2018-11-28 17:37 UTC | newest]

Thread overview: 23+ messages
2018-11-12 23:04 [PATCH v2 0/6] freezer for cgroup v2 Roman Gushchin
2018-11-12 23:04 ` [PATCH] cgroup: document cgroup v2 freezer interface Roman Gushchin
2018-11-12 23:04 ` [PATCH v2 1/6] cgroup: rename freezer.c into legacy_freezer.c Roman Gushchin
2018-11-12 23:04 ` [PATCH v2 2/6] cgroup: implement __cgroup_task_count() helper Roman Gushchin
2018-11-12 23:04 ` [PATCH v2 3/6] cgroup: cgroup v2 freezer Roman Gushchin
2018-11-13  2:08   ` Tejun Heo
2018-11-13 18:47     ` Roman Gushchin
2018-11-13 19:15       ` Tejun Heo
2018-11-13 20:55         ` Roman Gushchin
2018-11-13 20:58           ` Tejun Heo
2018-11-13 15:37   ` Oleg Nesterov
2018-11-13 15:43     ` Tejun Heo
2018-11-13 16:00       ` Oleg Nesterov
2018-11-13 15:48   ` Oleg Nesterov
2018-11-13 21:59     ` Roman Gushchin
2018-11-14 16:56       ` Oleg Nesterov
2018-11-14 17:06         ` Roman Gushchin
2018-11-14 17:36           ` Oleg Nesterov
2018-11-14 17:39             ` Roman Gushchin
2018-11-28 17:36         ` Roman Gushchin
2018-11-12 23:04 ` [PATCH v2 4/6] kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy() Roman Gushchin
2018-11-12 23:04 ` [PATCH v2 5/6] kselftests: cgroup: add freezer controller self-tests Roman Gushchin
2018-11-12 23:04 ` [PATCH v2 6/6] cgroup: document cgroup v2 freezer interface Roman Gushchin
