[PATCH 1/3 cgroup/for-5.20] cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree

* [PATCH 1/3 cgroup/for-5.20] cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
@ 2022-07-15  4:38 Tejun Heo
  2022-07-15  4:38   ` Tejun Heo
  2022-07-26 14:31   ` Michal Koutný
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-15  4:38 UTC (permalink / raw)
  To: Christian Brauner, Michal Koutný
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

cgroup_update_dfl_csses() write-locks the threadgroup_rwsem as updating the
csses can trigger process migrations. However, if the subtree doesn't
contain any tasks, there aren't gonna be any cgroup migrations. This
condition can be trivially detected by testing whether
mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if
the subtree is empty.

After this optimization, the usage pattern of creating a cgroup, enabling
the necessary controllers, seeding it with CLONE_INTO_CGROUP, and then
removing the cgroup after it becomes empty doesn't need to write-lock
threadgroup_rwsem at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2933,12 +2933,11 @@ static int cgroup_update_dfl_csses(struc
 	struct cgroup_subsys_state *d_css;
 	struct cgroup *dsct;
 	struct css_set *src_cset;
+	bool has_tasks;
 	int ret;
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	percpu_down_write(&cgroup_threadgroup_rwsem);
-
 	/* look up all csses currently attached to @cgrp's subtree */
 	spin_lock_irq(&css_set_lock);
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
@@ -2949,6 +2948,16 @@ static int cgroup_update_dfl_csses(struc
 	}
 	spin_unlock_irq(&css_set_lock);
 
+	/*
+	 * We need to write-lock threadgroup_rwsem while migrating tasks.
+	 * However, if there are no source csets for @cgrp, changing its
+	 * controllers isn't gonna produce any task migrations and the
+	 * write-locking can be skipped safely.
+	 */
+	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
+	if (has_tasks)
+		percpu_down_write(&cgroup_threadgroup_rwsem);
+
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
 	if (ret)
@@ -2967,7 +2976,8 @@ static int cgroup_update_dfl_csses(struc
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (has_tasks)
+		percpu_up_write(&cgroup_threadgroup_rwsem);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 32+ messages in thread
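
The "seeding" step of the usage pattern described in the commit message above can be sketched from userspace roughly as follows. This is a minimal illustration only, assuming Linux 5.7 or later and kernel headers that provide struct clone_args and CLONE_INTO_CGROUP; the helper name spawn_into_cgroup() and the convention of passing an already-opened cgroup directory fd are hypothetical choices made for the example, not part of the patch.

```c
#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>		/* SIGCHLD */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>	/* SYS_clone3 */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child directly into the cgroup whose
 * directory is open as @cgroup_fd. The child starts life in the target
 * cgroup, so no later migration of it is needed.
 *
 * Returns the child's pid in the parent, 0 in the child, and -1 with
 * errno set on failure (for example when @cgroup_fd is not a valid
 * cgroup2 directory fd or the kernel lacks clone3()).
 */
pid_t spawn_into_cgroup(int cgroup_fd)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;		/* behave like fork() on exit */
	args.cgroup = (uint64_t)cgroup_fd;	/* fd of the target cgroup dir */

	return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}
```

A caller would typically create the cgroup directory, enable controllers through the parent's cgroup.subtree_control, open the new directory with O_RDONLY | O_DIRECTORY, and pass that fd in. Since the new cgroup holds no tasks while its controllers are being enabled, this appears to be the case in which the patch skips the write lock.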

* [PATCH 2/3 cgroup/for-5.20] cgroup: Add "no" prefixed mount options
@ 2022-07-15  4:38   ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-15  4:38 UTC (permalink / raw)
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

We allow modifying these mount options via remount. Let's add "no" prefixed
variants so that they can be turned off too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
---
 Documentation/admin-guide/cgroup-v2.rst |    6 +++---
 kernel/cgroup/cgroup.c                  |   20 +++++++++++++++-----
 2 files changed, 18 insertions(+), 8 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -177,14 +177,14 @@ disabling controllers in v1 and make the
 
 cgroup v2 currently supports the following mount options.
 
-  nsdelegate
+  [no]nsdelegate
 	Consider cgroup namespaces as delegation boundaries.  This
 	option is system wide and can only be set on mount or modified
 	through remount from the init namespace.  The mount option is
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
-  memory_localevents
+  memory_[no]localevents
         Only populate memory.events with data for the current cgroup,
         and not any subtrees. This is legacy behaviour, the default
         behaviour without this option is to include subtree counts.
@@ -192,7 +192,7 @@ cgroup v2 currently supports the followi
         modified through remount from the init namespace. The mount
         option is ignored on non-init namespace mounts.
 
-  memory_recursiveprot
+  memory_[no]recursiveprot
         Recursively apply memory.min and memory.low protection to
         entire subtrees, without requiring explicit downward
         propagation into leaf cgroups.  This allows protecting entire
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -279,8 +279,6 @@ bool cgroup_ssid_enabled(int ssid)
  *
  * - When mounting an existing superblock, mount options should match.
  *
- * - Remount is disallowed.
- *
  * - rename(2) is disallowed.
  *
  * - "tasks" is removed.  Everything should be at process granularity.  Use
@@ -1859,16 +1857,19 @@ int cgroup_show_path(struct seq_file *sf
 }
 
 enum cgroup2_param {
-	Opt_nsdelegate,
-	Opt_memory_localevents,
-	Opt_memory_recursiveprot,
+	Opt_nsdelegate, Opt_nonsdelegate,
+	Opt_memory_localevents, Opt_memory_nolocalevents,
+	Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
 	nr__cgroup2_params
 };
 
 static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
+	fsparam_flag("nonsdelegate",		Opt_nonsdelegate),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
+	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
+	fsparam_flag("memory_norecursiveprot",	Opt_memory_norecursiveprot),
 	{}
 };
 
@@ -1886,12 +1887,21 @@ static int cgroup2_parse_param(struct fs
 	case Opt_nsdelegate:
 		ctx->flags |= CGRP_ROOT_NS_DELEGATE;
 		return 0;
+	case Opt_nonsdelegate:
+		ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
+		return 0;
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
+	case Opt_memory_nolocalevents:
+		ctx->flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
+		return 0;
 	case Opt_memory_recursiveprot:
 		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		return 0;
+	case Opt_memory_norecursiveprot:
+		ctx->flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		return 0;
 	}
 	return -EINVAL;
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread
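
The set/clear pairing that this patch adds to cgroup2_parse_param() can be modelled in plain userspace C. The sketch below is a table-driven stand-in rather than the kernel code: the MODEL_* flag values, struct model_opt and model_parse_param() are hypothetical names invented for the example, and only the option strings are taken from the patch.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical flag values for this model; the kernel's own differ. */
#define MODEL_NS_DELEGATE		(1u << 0)
#define MODEL_MEMORY_LOCAL_EVENTS	(1u << 1)
#define MODEL_MEMORY_RECURSIVE_PROT	(1u << 2)

struct model_opt {
	const char *name;	/* option string as given at (re)mount */
	unsigned int flag;	/* flag bit the option controls */
	int set;		/* 1: sets the bit, 0: "no" variant clears it */
};

static const struct model_opt model_opts[] = {
	{ "nsdelegate",			MODEL_NS_DELEGATE,		1 },
	{ "nonsdelegate",		MODEL_NS_DELEGATE,		0 },
	{ "memory_localevents",		MODEL_MEMORY_LOCAL_EVENTS,	1 },
	{ "memory_nolocalevents",	MODEL_MEMORY_LOCAL_EVENTS,	0 },
	{ "memory_recursiveprot",	MODEL_MEMORY_RECURSIVE_PROT,	1 },
	{ "memory_norecursiveprot",	MODEL_MEMORY_RECURSIVE_PROT,	0 },
};

/*
 * Apply one option to *flags. Later options override earlier ones, so
 * "nsdelegate" followed by "nonsdelegate" leaves the bit clear.
 * Returns 0 on success or -EINVAL for an unknown option.
 */
int model_parse_param(unsigned int *flags, const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(model_opts) / sizeof(model_opts[0]); i++) {
		if (strcmp(name, model_opts[i].name) != 0)
			continue;
		if (model_opts[i].set)
			*flags |= model_opts[i].flag;
		else
			*flags &= ~model_opts[i].flag;
		return 0;
	}
	return -EINVAL;
}
```

Note where the "no" goes for the memory options: it follows the "memory_" prefix (memory_nolocalevents), so a spelling such as "nomemory_localevents" is rejected in this model just as an unknown option would be.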

* [PATCH 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
@ 2022-07-15  4:39     ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-15  4:39 UTC (permalink / raw)
  To: Christian Brauner, Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in
__cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
because the implied synchronize_rcu() on write locking was pushing up the
latencies too much for Android which constantly moves processes between
cgroups.

This makes the hotter paths - fork and exit - slower as they're always
forced into the slow path. There is no reason to force this on everyone
especially given that the more common static usage pattern can now
completely avoid write-locking the rwsem. Write-locking is elided when
turning on and off controllers on empty sub-trees and CLONE_INTO_CGROUP
enables seeding a cgroup without grabbing the rwsem.

Restore the default percpu operations and introduce the mount option
"favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
lower latencies for the dynamic operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |    8 +++++
 kernel/cgroup/cgroup-v1.c               |   17 +++++++++++-
 kernel/cgroup/cgroup.c                  |   43 ++++++++++++++++++++++++++------
 3 files changed, 60 insertions(+), 8 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -184,6 +184,14 @@ cgroup v2 currently supports the followi
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
+  [no]favordynmods
+        Reduce the latencies of dynamic cgroup modifications such as
+        task migrations and controller on/offs at the cost of making
+        hot path operations such as forks and exits more expensive.
+        The static usage pattern of creating a cgroup, enabling
+        controllers, and then seeding it with CLONE_INTO_CGROUP is
+        not affected by this option.
+
   memory_[no]localevents
         Only populate memory.events with data for the current cgroup,
         and not any subtrees. This is legacy behaviour, the default
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -875,6 +875,8 @@ static int cgroup1_show_options(struct s
 		seq_puts(seq, ",xattr");
 	if (root->flags & CGRP_ROOT_CPUSET_V2_MODE)
 		seq_puts(seq, ",cpuset_v2_mode");
+	if (root->flags & CGRP_ROOT_FAVOR_DYNMODS)
+		seq_puts(seq, ",favordynmods");
 
 	spin_lock(&release_agent_path_lock);
 	if (strlen(root->release_agent_path))
@@ -898,6 +900,8 @@ enum cgroup1_param {
 	Opt_noprefix,
 	Opt_release_agent,
 	Opt_xattr,
+	Opt_favordynmods,
+	Opt_nofavordynmods,
 };
 
 const struct fs_parameter_spec cgroup1_fs_parameters[] = {
@@ -909,6 +913,8 @@ const struct fs_parameter_spec cgroup1_f
 	fsparam_flag  ("noprefix",	Opt_noprefix),
 	fsparam_string("release_agent",	Opt_release_agent),
 	fsparam_flag  ("xattr",		Opt_xattr),
+	fsparam_flag  ("favordynmods",	Opt_favordynmods),
+	fsparam_flag  ("nofavordynmods", Opt_nofavordynmods),
 	{}
 };
 
@@ -960,6 +966,12 @@ int cgroup1_parse_param(struct fs_contex
 	case Opt_xattr:
 		ctx->flags |= CGRP_ROOT_XATTR;
 		break;
+	case Opt_favordynmods:
+		ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+		break;
+	case Opt_nofavordynmods:
+		ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+		break;
 	case Opt_release_agent:
 		/* Specifying two release agents is forbidden */
 		if (ctx->release_agent)
@@ -1211,8 +1223,11 @@ static int cgroup1_root_to_use(struct fs
 	init_cgroup_root(ctx);
 
 	ret = cgroup_setup_root(root, ctx->subsys_mask);
-	if (ret)
+	if (!ret)
+		cgroup_favor_dynmods(root, ctx->flags & CGRP_ROOT_FAVOR_DYNMODS);
+	else
 		cgroup_free_root(root);
+
 	return ret;
 }
 
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1305,6 +1305,20 @@ struct cgroup_root *cgroup_root_from_kf(
 	return root_cgrp->root;
 }
 
+void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
+{
+	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
+
+	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	if (favor && !favoring) {
+		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+	} else if (!favor && favoring) {
+		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
+		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+	}
+}
+
 static int cgroup_init_root_id(struct cgroup_root *root)
 {
 	int id;
@@ -1365,6 +1379,7 @@ static void cgroup_destroy_root(struct c
 		cgroup_root_count--;
 	}
 
+	cgroup_favor_dynmods(root, false);
 	cgroup_exit_root_id(root);
 
 	mutex_unlock(&cgroup_mutex);
@@ -1858,6 +1873,7 @@ int cgroup_show_path(struct seq_file *sf
 
 enum cgroup2_param {
 	Opt_nsdelegate, Opt_nonsdelegate,
+	Opt_favordynmods, Opt_nofavordynmods,
 	Opt_memory_localevents, Opt_memory_nolocalevents,
 	Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
 	nr__cgroup2_params
@@ -1866,6 +1882,8 @@ enum cgroup2_param {
 static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
 	fsparam_flag("nonsdelegate",		Opt_nonsdelegate),
+	fsparam_flag("favordynmods",		Opt_favordynmods),
+	fsparam_flag("nofavordynmods",		Opt_nofavordynmods),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
 	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
@@ -1890,6 +1908,12 @@ static int cgroup2_parse_param(struct fs
 	case Opt_nonsdelegate:
 		ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
 		return 0;
+	case Opt_favordynmods:
+		ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+		return 0;
+	case Opt_nofavordynmods:
+		ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+		return 0;
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
@@ -1914,6 +1938,9 @@ static void apply_cgroup_root_flags(unsi
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
 
+		cgroup_favor_dynmods(&cgrp_dfl_root,
+				     root_flags & CGRP_ROOT_FAVOR_DYNMODS);
+
 		if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		else
@@ -1930,6 +1957,8 @@ static int cgroup_show_options(struct se
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_FAVOR_DYNMODS)
+		seq_puts(seq, ",favordynmods");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 		seq_puts(seq, ",memory_localevents");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
@@ -1980,7 +2009,8 @@ void init_cgroup_root(struct cgroup_fs_c
 	cgrp->root = root;
 	init_cgroup_housekeeping(cgrp);
 
-	root->flags = ctx->flags;
+	/* DYNMODS must be modified through cgroup_favor_dynmods() */
+	root->flags = ctx->flags & ~CGRP_ROOT_FAVOR_DYNMODS;
 	if (ctx->release_agent)
 		strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
 	if (ctx->name)
@@ -2202,6 +2232,10 @@ static int cgroup_init_fs_context(struct
 	put_user_ns(fc->user_ns);
 	fc->user_ns = get_user_ns(ctx->ns->user_ns);
 	fc->global = true;
+
+#ifdef CONFIG_CGROUP_FAVOR_DYNMODS
+	ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+#endif
 	return 0;
 }
 
@@ -5854,12 +5888,6 @@ int __init cgroup_init(void)
 
 	cgroup_rstat_boot();
 
-	/*
-	 * The latency of the synchronize_rcu() is too high for cgroups,
-	 * avoid it at the cost of forcing all readers into the slow path.
-	 */
-	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
-
 	get_user_ns(init_cgroup_ns.user_ns);
 
 	mutex_lock(&cgroup_mutex);
@@ -6771,6 +6799,7 @@ static ssize_t features_show(struct kobj
 {
 	return snprintf(buf, PAGE_SIZE,
 			"nsdelegate\n"
+			"favordynmods\n"
 			"memory_localevents\n"
 			"memory_recursiveprot\n");
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread
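
The reason the patch routes every change of CGRP_ROOT_FAVOR_DYNMODS through cgroup_favor_dynmods() seems to be that each rcu_sync_enter() must be balanced by exactly one rcu_sync_exit(). The userspace model below illustrates that bookkeeping under a simplifying assumption: the rcu_sync state is reduced to a plain nesting counter, and struct root_model, model_favor_dynmods() and model_sync_count() are hypothetical names used only for this sketch.

```c
#include <stdbool.h>

/* Hypothetical bit for the model; mirrors the role of CGRP_ROOT_FAVOR_DYNMODS. */
#define MODEL_FAVOR_DYNMODS	(1u << 4)

/* Stand-in for the rcu_sync nesting count: > 0 means readers take the slow path. */
static int model_sync_refs;

struct root_model {
	unsigned int flags;
};

/*
 * Same shape as cgroup_favor_dynmods() in the patch: only an actual state
 * change touches the shared counter, so repeated requests for the state a
 * root is already in are no-ops and enter/exit stay balanced per root.
 */
void model_favor_dynmods(struct root_model *root, bool favor)
{
	bool favoring = root->flags & MODEL_FAVOR_DYNMODS;

	if (favor && !favoring) {
		model_sync_refs++;		/* models rcu_sync_enter() */
		root->flags |= MODEL_FAVOR_DYNMODS;
	} else if (!favor && favoring) {
		model_sync_refs--;		/* models rcu_sync_exit() */
		root->flags &= ~MODEL_FAVOR_DYNMODS;
	}
}

int model_sync_count(void)
{
	return model_sync_refs;
}
```

In this model, enabling the option twice on one root still holds a single reference, and the count returns to zero only after every root that enabled it has disabled it or been destroyed. That is also why init_cgroup_root() in the patch masks the flag out of ctx->flags instead of copying it directly: copying the bit would mark the root as favoring without the matching enter.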

* Re: [PATCH 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
  2022-07-15  4:39     ` Tejun Heo
@ 2022-07-23  5:12       ` Tejun Heo
  -1 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-23  5:12 UTC (permalink / raw)
  To: Christian Brauner, Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

Applying 1-3 to cgroup/for-5.20.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
  2022-07-15  4:39     ` Tejun Heo
@ 2022-07-23 14:28       ` Tejun Heo
  -1 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-23 14:28 UTC (permalink / raw)
  To: Christian Brauner, Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in
__cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
because the implied synchronize_rcu() on write locking was pushing up the
latencies too much for Android which constantly moves processes between
cgroups.

This makes the hotter paths - fork and exit - slower as they're always
forced into the slow path. There is no reason to force this on everyone
especially given that the more common static usage pattern can now
completely avoid write-locking the rwsem. Write-locking is elided when
turning on and off controllers on empty sub-trees and CLONE_INTO_CGROUP
enables seeding a cgroup without grabbing the rwsem.

Restore the default percpu operations and introduce the mount option
"favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
lower latencies for the dynamic operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
I messed up patch generation and the patch was missing diffs for a few
files.

Thanks.

 Documentation/admin-guide/cgroup-v2.rst |    8 +++++
 include/linux/cgroup-defs.h             |   19 +++++++++++---
 init/Kconfig                            |   10 +++++++
 kernel/cgroup/cgroup-internal.h         |    1 
 kernel/cgroup/cgroup-v1.c               |   17 +++++++++++-
 kernel/cgroup/cgroup.c                  |   43 ++++++++++++++++++++++++++------
 6 files changed, 87 insertions(+), 11 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -184,6 +184,14 @@ cgroup v2 currently supports the followi
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
+  [no]favordynmods
+        Reduce the latencies of dynamic cgroup modifications such as
+        task migrations and controller on/offs at the cost of making
+        hot path operations such as forks and exits more expensive.
+        The static usage pattern of creating a cgroup, enabling
+        controllers, and then seeding it with CLONE_INTO_CGROUP is
+        not affected by this option.
+
   memory_[no]localevents
         Only populate memory.events with data for the current cgroup,
         and not any subtrees. This is legacy behaviour, the default
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -89,19 +89,32 @@ enum {
 	CGRP_ROOT_NS_DELEGATE	= (1 << 3),
 
 	/*
+	 * Reduce latencies on dynamic cgroup modifications such as task
+	 * migrations and controller on/offs by disabling percpu operation on
+	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
+	 * forks and exits into the slow path and more expensive.
+	 *
+	 * The static usage pattern of creating a cgroup, enabling controllers,
+	 * and then seeding it with CLONE_INTO_CGROUP doesn't require write
+	 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
+	 * favordynmod.
+	 */
+	CGRP_ROOT_FAVOR_DYNMODS = (1 << 4),
+
+	/*
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
-	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+	CGRP_ROOT_CPUSET_V2_MODE = (1 << 16),
 
 	/*
 	 * Enable legacy local memory.events.
 	 */
-	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
+	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 17),
 
 	/*
 	 * Enable recursive subtree protection
 	 */
-	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
+	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
 };
 
 /* cftype->flags */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -936,6 +936,16 @@ if CGROUPS
 config PAGE_COUNTER
 	bool
 
+config CGROUP_FAVOR_DYNMODS
+        bool "Favor dynamic modification latency reduction by default"
+        help
+          This option enables the "favordynmods" mount option by default
+          which reduces the latencies of dynamic cgroup modifications such
+          as task migrations and controller on/offs at the cost of making
+          hot path operations such as forks and exits more expensive.
+
+          Say N if unsure.
+
 config MEMCG
 	bool "Memory controller"
 	select PAGE_COUNTER
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -233,6 +233,7 @@ void cgroup_kn_unlock(struct kernfs_node
 int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
 			  struct cgroup_namespace *ns);
 
+void cgroup_favor_dynmods(struct cgroup_root *root, bool favor);
 void cgroup_free_root(struct cgroup_root *root);
 void init_cgroup_root(struct cgroup_fs_context *ctx);
 int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -875,6 +875,8 @@ static int cgroup1_show_options(struct s
 		seq_puts(seq, ",xattr");
 	if (root->flags & CGRP_ROOT_CPUSET_V2_MODE)
 		seq_puts(seq, ",cpuset_v2_mode");
+	if (root->flags & CGRP_ROOT_FAVOR_DYNMODS)
+		seq_puts(seq, ",favordynmods");
 
 	spin_lock(&release_agent_path_lock);
 	if (strlen(root->release_agent_path))
@@ -898,6 +900,8 @@ enum cgroup1_param {
 	Opt_noprefix,
 	Opt_release_agent,
 	Opt_xattr,
+	Opt_favordynmods,
+	Opt_nofavordynmods,
 };
 
 const struct fs_parameter_spec cgroup1_fs_parameters[] = {
@@ -909,6 +913,8 @@ const struct fs_parameter_spec cgroup1_f
 	fsparam_flag  ("noprefix",	Opt_noprefix),
 	fsparam_string("release_agent",	Opt_release_agent),
 	fsparam_flag  ("xattr",		Opt_xattr),
+	fsparam_flag  ("favordynmods",	Opt_favordynmods),
+	fsparam_flag  ("nofavordynmods", Opt_nofavordynmods),
 	{}
 };
 
@@ -960,6 +966,12 @@ int cgroup1_parse_param(struct fs_contex
 	case Opt_xattr:
 		ctx->flags |= CGRP_ROOT_XATTR;
 		break;
+	case Opt_favordynmods:
+		ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+		break;
+	case Opt_nofavordynmods:
+		ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+		break;
 	case Opt_release_agent:
 		/* Specifying two release agents is forbidden */
 		if (ctx->release_agent)
@@ -1211,8 +1223,11 @@ static int cgroup1_root_to_use(struct fs
 	init_cgroup_root(ctx);
 
 	ret = cgroup_setup_root(root, ctx->subsys_mask);
-	if (ret)
+	if (!ret)
+		cgroup_favor_dynmods(root, ctx->flags & CGRP_ROOT_FAVOR_DYNMODS);
+	else
 		cgroup_free_root(root);
+
 	return ret;
 }
 
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1305,6 +1305,20 @@ struct cgroup_root *cgroup_root_from_kf(
 	return root_cgrp->root;
 }
 
+void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
+{
+	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
+
+	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	if (favor && !favoring) {
+		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
+		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+	} else if (!favor && favoring) {
+		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
+		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+	}
+}
+
 static int cgroup_init_root_id(struct cgroup_root *root)
 {
 	int id;
@@ -1365,6 +1379,7 @@ static void cgroup_destroy_root(struct c
 		cgroup_root_count--;
 	}
 
+	cgroup_favor_dynmods(root, false);
 	cgroup_exit_root_id(root);
 
 	mutex_unlock(&cgroup_mutex);
@@ -1858,6 +1873,7 @@ int cgroup_show_path(struct seq_file *sf
 
 enum cgroup2_param {
 	Opt_nsdelegate, Opt_nonsdelegate,
+	Opt_favordynmods, Opt_nofavordynmods,
 	Opt_memory_localevents, Opt_memory_nolocalevents,
 	Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
 	nr__cgroup2_params
@@ -1866,6 +1882,8 @@ enum cgroup2_param {
 static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
 	fsparam_flag("nonsdelegate",		Opt_nonsdelegate),
+	fsparam_flag("favordynmods",		Opt_favordynmods),
+	fsparam_flag("nofavordynmods",		Opt_nofavordynmods),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
 	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
@@ -1890,6 +1908,12 @@ static int cgroup2_parse_param(struct fs
 	case Opt_nonsdelegate:
 		ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
 		return 0;
+	case Opt_favordynmods:
+		ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+		return 0;
+	case Opt_nofavordynmods:
+		ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
+		return 0;
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
@@ -1914,6 +1938,9 @@ static void apply_cgroup_root_flags(unsi
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
 
+		cgroup_favor_dynmods(&cgrp_dfl_root,
+				     root_flags & CGRP_ROOT_FAVOR_DYNMODS);
+
 		if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		else
@@ -1930,6 +1957,8 @@ static int cgroup_show_options(struct se
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_FAVOR_DYNMODS)
+		seq_puts(seq, ",favordynmods");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 		seq_puts(seq, ",memory_localevents");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
@@ -1980,7 +2009,8 @@ void init_cgroup_root(struct cgroup_fs_c
 	cgrp->root = root;
 	init_cgroup_housekeeping(cgrp);
 
-	root->flags = ctx->flags;
+	/* DYNMODS must be modified through cgroup_favor_dynmods() */
+	root->flags = ctx->flags & ~CGRP_ROOT_FAVOR_DYNMODS;
 	if (ctx->release_agent)
 		strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
 	if (ctx->name)
@@ -2202,6 +2232,10 @@ static int cgroup_init_fs_context(struct
 	put_user_ns(fc->user_ns);
 	fc->user_ns = get_user_ns(ctx->ns->user_ns);
 	fc->global = true;
+
+#ifdef CONFIG_CGROUP_FAVOR_DYNMODS
+	ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
+#endif
 	return 0;
 }
 
@@ -5854,12 +5888,6 @@ int __init cgroup_init(void)
 
 	cgroup_rstat_boot();
 
-	/*
-	 * The latency of the synchronize_rcu() is too high for cgroups,
-	 * avoid it at the cost of forcing all readers into the slow path.
-	 */
-	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
-
 	get_user_ns(init_cgroup_ns.user_ns);
 
 	mutex_lock(&cgroup_mutex);
@@ -6771,6 +6799,7 @@ static ssize_t features_show(struct kobj
 {
 	return snprintf(buf, PAGE_SIZE,
 			"nsdelegate\n"
+			"favordynmods\n"
 			"memory_localevents\n"
 			"memory_recursiveprot\n");
 }
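
The interplay between the Kconfig default, the [no]favordynmods mount
options and cgroup_favor_dynmods() in the diff above can be modeled in
userspace roughly as follows. This is only a sketch: model_root,
model_sync_depth, model_favor_dynmods() and model_parse_option() are
hypothetical names, and the counter merely stands in for the
rcu_sync_enter()/rcu_sync_exit() calls on cgroup_threadgroup_rwsem.rss.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Same bit as in the patch; everything else below is a userspace model. */
#define CGRP_ROOT_FAVOR_DYNMODS (1u << 4)

/* Hypothetical stand-in for the rcu_sync enter/exit nesting depth. */
static int model_sync_depth;

struct model_root {
	unsigned int flags;
};

/* Same transition logic as cgroup_favor_dynmods(): act only on a change. */
static void model_favor_dynmods(struct model_root *root, bool favor)
{
	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;

	if (favor && !favoring) {
		model_sync_depth++;	/* stands in for rcu_sync_enter() */
		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
	} else if (!favor && favoring) {
		model_sync_depth--;	/* stands in for rcu_sync_exit() */
		assert(model_sync_depth >= 0);
		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
	}
}

/* Apply one mount option to the context flags, in the order given. */
static unsigned int model_parse_option(unsigned int ctx_flags, const char *opt)
{
	if (strcmp(opt, "favordynmods") == 0)
		return ctx_flags | CGRP_ROOT_FAVOR_DYNMODS;
	if (strcmp(opt, "nofavordynmods") == 0)
		return ctx_flags & ~CGRP_ROOT_FAVOR_DYNMODS;
	return ctx_flags;	/* other options are ignored by this model */
}
```

Because the flag is only ever changed through the helper, repeated
requests for the same state leave the enter/exit depth untouched, which
is presumably why init_cgroup_root() masks the bit out of root->flags.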

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
@ 2022-07-25 12:12         ` Oleg Nesterov
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2022-07-25 12:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Brauner, Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, linux-kernel,
	cgroups

On 07/23, Tejun Heo wrote:
>
> +void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
> +{
> +	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
> +
> +	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> +	if (favor && !favoring) {
> +		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> +		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
> +	} else if (!favor && favoring) {
> +		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> +		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
> +	}
> +}

I see no problems in this patch. But just for the record, we do not need
synchronize_rcu() in the "favor && !favoring" case, so we can probably
do something like

	--- a/kernel/rcu/sync.c
	+++ b/kernel/rcu/sync.c
	@@ -118,7 +118,7 @@ static void rcu_sync_func(struct rcu_head *rhp)
	  * optimize away the grace-period wait via a state machine implemented
	  * by rcu_sync_enter(), rcu_sync_exit(), and rcu_sync_func().
	  */
	-void rcu_sync_enter(struct rcu_sync *rsp)
	+void __rcu_sync_enter(struct rcu_sync *rsp, bool wait)
	 {
		int gp_state;
	 
	@@ -146,13 +146,20 @@ void rcu_sync_enter(struct rcu_sync *rsp)
			 * See the comment above, this simply does the "synchronous"
			 * call_rcu(rcu_sync_func) which does GP_ENTER -> GP_PASSED.
			 */
	-		synchronize_rcu();
	-		rcu_sync_func(&rsp->cb_head);
	-		/* Not really needed, wait_event() would see GP_PASSED. */
	-		return;
	+		if (wait) {
	+			synchronize_rcu();
	+			rcu_sync_func(&rsp->cb_head);
	+		} else {
	+			rcu_sync_call(rsp);
	+		}
	+	} else if (wait) {
	+		wait_event(rsp->gp_wait, READ_ONCE(rsp->gp_state) >= GP_PASSED);
		}
	+}
	 
	-	wait_event(rsp->gp_wait, READ_ONCE(rsp->gp_state) >= GP_PASSED);
	+void rcu_sync_enter(struct rcu_sync *rsp)
	+{
	+	__rcu_sync_enter(rsp, true);
	 }
	 
	 /**

later.

__rcu_sync_enter(rsp, false) works just like rcu_sync_enter_start() but it can
be safely called at any moment.
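
To make the difference between the waiting and non-waiting variants
concrete, here is a toy userspace model. It is a sketch under heavy
assumptions: it keeps only three states, replaces the RCU grace period
with an explicit model_grace_period_done() call, and all model_* names
are hypothetical rather than part of kernel/rcu/sync.c.

```c
#include <assert.h>
#include <stdbool.h>

enum model_gp_state { MODEL_GP_IDLE, MODEL_GP_ENTER, MODEL_GP_PASSED };

struct model_rcu_sync {
	enum model_gp_state gp_state;
	int gp_count;
	bool cb_pending;	/* a deferred grace-period callback is queued */
};

/* Readers may use the per-CPU fast path only while the state is idle. */
static bool model_is_idle(const struct model_rcu_sync *rsp)
{
	return rsp->gp_state == MODEL_GP_IDLE;
}

/* Stand-in for rcu_sync_func() running after a grace period. */
static void model_grace_period_done(struct model_rcu_sync *rsp)
{
	if (rsp->gp_state == MODEL_GP_ENTER)
		rsp->gp_state = MODEL_GP_PASSED;
	rsp->cb_pending = false;
}

/* Model of the proposed __rcu_sync_enter(rsp, wait). */
static void model_sync_enter(struct model_rcu_sync *rsp, bool wait)
{
	bool was_idle = model_is_idle(rsp);

	if (was_idle) {
		assert(rsp->gp_count == 0);
		rsp->gp_state = MODEL_GP_ENTER;	/* readers leave the fast path now */
	}
	rsp->gp_count++;

	if (was_idle) {
		if (wait)
			model_grace_period_done(rsp);	/* synchronize_rcu() + callback */
		else
			rsp->cb_pending = true;		/* finish asynchronously later */
	}
	/* A waiting caller that was not first would block on gp_wait here. */
}
```

In both variants the readers stop seeing the idle state immediately;
the only difference in this model is whether the caller itself waits for
the transition to the passed state.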

And I can't resist an off-topic question... Say, cgroup_attach_task_all() does

	mutex_lock(&cgroup_mutex);
	percpu_down_write(&cgroup_threadgroup_rwsem);

and this means that synchronize_rcu() can be called with cgroup_mutex held.
Perhaps it makes sense to change this code to do

	rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
	mutex_lock(&cgroup_mutex);
	percpu_down_write(&cgroup_threadgroup_rwsem);
	...
	percpu_up_write(&cgroup_threadgroup_rwsem);
	mutex_unlock(&cgroup_mutex);
	rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);

? Just curious.

> -	/*
> -	 * The latency of the synchronize_rcu() is too high for cgroups,
> -	 * avoid it at the cost of forcing all readers into the slow path.
> -	 */
> -	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);

Note that it has no other users, so you can probably kill it.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
@ 2022-07-25 14:16         ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2022-07-25 14:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

On Sat, Jul 23, 2022 at 04:28:28AM -1000, Tejun Heo wrote:
> 3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in
> __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
> because the implied synchronize_rcu() on write locking was pushing up the
> latencies too much for Android which constantly moves processes between
> cgroups.
> 
> This makes the hotter paths - fork and exit - slower as they're always
> forced into the slow path. There is no reason to force this on everyone
> especially given that the more common static usage pattern can now completely
> avoid write-locking the rwsem. Write-locking is elided when turning on and
> off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
> cgroup without grabbing the rwsem.
> 
> Restore the default percpu operations and introduce the mount option
> "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
> lower latencies for the dynamic operations.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Dmitry Shmidt <dimitrysh@google.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> ---

Seems sane,
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/3 cgroup/for-5.20] cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
@ 2022-07-26 14:31   ` Michal Koutný
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Koutný @ 2022-07-26 14:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Brauner, Peter Zijlstra, John Stultz, Dmitry Shmidt,
	Oleg Nesterov, linux-kernel, cgroups

[-- Attachment #1: Type: text/plain, Size: 634 bytes --]

Hello.

On Thu, Jul 14, 2022 at 06:38:15PM -1000, Tejun Heo <tj@kernel.org> wrote:
> However, if the subtree doesn't contain any tasks, there aren't gonna
> be any cgroup migrations.

Nice catch.

> This condition can be trivially detected by testing whether
> mgctx.preloaded_src_csets is empty. Elide write-locking
> threadgroup_rwsem if the subtree is empty.

This check is perhaps even more robust than, e.g. cgroup_is_populated()
due to possible zombie cases.
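
The elision described in the quoted text can be pictured with a small
userspace model. This is a sketch, not the kernel code: model_cset,
model_write_locks and model_update_csses() are hypothetical names, the
list is a plain singly linked list standing in for
mgctx.preloaded_src_csets, and the counter stands in for
percpu_down_write() on cgroup_threadgroup_rwsem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter standing in for percpu_down_write() calls. */
static int model_write_locks;

/* Toy list node standing in for an entry on mgctx.preloaded_src_csets. */
struct model_cset {
	struct model_cset *next;
};

/*
 * Model of the update path: the source css_sets are collected first and
 * the write lock is taken only if at least one was found.
 * Returns true when the write lock was taken.
 */
static bool model_update_csses(const struct model_cset *preloaded_src_csets)
{
	bool has_tasks = preloaded_src_csets != NULL;

	if (has_tasks)
		model_write_locks++;	/* percpu_down_write() in the kernel */

	/* ... migration would happen here ... */

	/* Assumption: the matching unlock is conditional in the same way. */
	assert(model_write_locks >= 0);
	return has_tasks;
}
```

With an empty list the model never touches the lock counter, which is
the property the create, enable controllers, CLONE_INTO_CGROUP pattern
relies on.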

>  kernel/cgroup/cgroup.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/3 cgroup/for-5.20] cgroup: Add "no" prefixed mount options
@ 2022-07-26 14:32     ` Michal Koutný
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Koutný @ 2022-07-26 14:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

On Thu, Jul 14, 2022 at 06:38:43PM -1000, Tejun Heo <tj@kernel.org> wrote:
> We allow modifying these mount options via remount. Let's add "no" prefixed
> variants so that they can be turned off too.

They can be turned off:

> // on v5.19-rc?
> :~ # grep cg /proc/mounts
> cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
> :~ # mount -t cgroup2 cgroup2 /sys/fs/cgroup/ -oremount
> :~ # grep cg /proc/mounts
> cgroup2 /sys/fs/cgroup cgroup2 rw,relatime 0 0

mount(2) says about remounting:
> The  mountflags  and  data  arguments should match the values used in
> the original mount() call, except for those parameters that are being
> deliberately changed.

Or is this a provision for the fsconfig(2) API?

> +	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
> +	fsparam_flag("memory_norecursiveprot",	Opt_memory_norecursiveprot),

These are not 'no' prefixes of the option :-)

I.e. it seems more consistent to prefix the whole boolean option name (in
accordance with other FS options, though I know only a limited subset of them).
In the end, this should be handled generically for boolean options in
the VFS and not via custom options.

Also, this allows both
	'nsdelegate,nonsdelegate'
and
	'nonsdelegate,nsdelegate'
(nsdelegate is just an example), where the fact that the 'no' variant always
overrides is a hidden implementation detail.

I find this patch a bit weird.


Thanks,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
  2022-07-23 14:28       ` Tejun Heo
                         ` (2 preceding siblings ...)
  (?)
@ 2022-07-26 14:32       ` Michal Koutný
  2022-07-26 17:33           ` Tejun Heo
  -1 siblings, 1 reply; 32+ messages in thread
From: Michal Koutný @ 2022-07-26 14:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Brauner, Peter Zijlstra, John Stultz, Dmitry Shmidt,
	Oleg Nesterov, linux-kernel, cgroups

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

On Sat, Jul 23, 2022 at 04:28:28AM -1000, Tejun Heo <tj@kernel.org> wrote:
> This makes the hotter paths - fork and exit - slower as they're always
> forced into the slow path. There is no reason to force this on everyone
> especially given that more common static usage pattern can now completely
> avoid write-locking the rwsem. Write-locking is elided when turning on and
> off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
> cgroup without grabbing the rwsem.

Just a practical note that CLONE_INTO_CGROUP may not be so widespread
yet [1][2].
But generally, the change makes sense to me.


> +	CGRP_ROOT_FAVOR_DYNMODS = (1 << 4),
> +
> +	/*
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
> -	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +	CGRP_ROOT_CPUSET_V2_MODE = (1 << 16),
>  
>  	/*
>  	 * Enable legacy local memory.events.
>  	 */
> -	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
> +	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 17),
>  
>  	/*
>  	 * Enable recursive subtree protection
>  	 */
> -	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
> +	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),

Why this new gap in flag bits?

[1] https://github.com/systemd/systemd/pull/16706
[2] https://github.com/search?q=org%3Aopencontainers+CLONE_INTO_CGROUP&type=all (empty)

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
@ 2022-07-26 17:33           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-26 17:33 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christian Brauner, Peter Zijlstra, John Stultz, Dmitry Shmidt,
	Oleg Nesterov, linux-kernel, cgroups

Hello,

On Tue, Jul 26, 2022 at 04:32:57PM +0200, Michal Koutný wrote:
> On Sat, Jul 23, 2022 at 04:28:28AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > This makes the hotter paths - fork and exit - slower as they're always
> > forced into the slow path. There is no reason to force this on everyone
> > especially given that more common static usage pattern can now completely
> > avoid write-locking the rwsem. Write-locking is elided when turning on and
> > off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
> > cgroup without grabbing the rwsem.
> 
> Just a practical note that CLONE_INTO_CGROUP may not be so widespread
> yet [1][2].
> But generally, the change makes sense to me.

Yeah, I was disappointed that it wasn't being used by systemd already. It'd
be great if the glibc situation can be rectified soon because this is a much
better interface.
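
For reference, a minimal sketch of seeding a cgroup this way without glibc
support, by calling clone3(2) directly through syscall(2). It assumes kernel
headers new enough to define struct clone_args with the cgroup field and
CLONE_INTO_CGROUP (Linux 5.7 or later). demo_fork_into_cgroup() is a made-up
helper name, and cgroup_fd is expected to be an open descriptor for the target
cgroup directory.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child directly into the cgroup referred to
 * by cgroup_fd, so that no later migration of the child is needed.
 */
static pid_t demo_fork_into_cgroup(int cgroup_fd)
{
	struct clone_args args;

	/* Reject obviously invalid descriptors before entering the kernel. */
	if (cgroup_fd < 0) {
		errno = EBADF;
		return -1;
	}

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;		/* place the child in args.cgroup */
	args.exit_signal = SIGCHLD;		/* behave like fork() on child exit */
	args.cgroup = (uint64_t)cgroup_fd;

	return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}
```

The helper behaves like fork(): it returns 0 in the child, the child's PID in
the parent, and -1 with errno set on failure.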

> > +	CGRP_ROOT_FAVOR_DYNMODS = (1 << 4),
> > +
> > +	/*
> >  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
> >  	 */
> > -	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> > +	CGRP_ROOT_CPUSET_V2_MODE = (1 << 16),
> >  
> >  	/*
> >  	 * Enable legacy local memory.events.
> >  	 */
> > -	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
> > +	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 17),
> >  
> >  	/*
> >  	 * Enable recursive subtree protection
> >  	 */
> > -	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
> > +	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
> 
> Why this new gap in flag bits?

To distinguish core and per-controller flags.
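
As a rough illustration of that split (a sketch only: the bit values are the
ones in the quoted hunk, while the two DEMO_* masks and the helper names are
made up here and are not kernel symbols), keeping core flags in the low half
and per-controller flags from bit 16 up lets either group be tested with a
single mask:

```c
#include <assert.h>

/* Bit positions as they appear in the quoted hunk. */
enum {
	CGRP_ROOT_FAVOR_DYNMODS		= (1 << 4),
	CGRP_ROOT_CPUSET_V2_MODE	= (1 << 16),
	CGRP_ROOT_MEMORY_LOCAL_EVENTS	= (1 << 17),
	CGRP_ROOT_MEMORY_RECURSIVE_PROT	= (1 << 18),
};

/* Hypothetical masks: low 16 bits for core flags, the rest for controllers. */
#define DEMO_CORE_FLAG_MASK		0x0000ffffu
#define DEMO_CONTROLLER_FLAG_MASK	0xffff0000u

/* The two ranges must never overlap for the split to be meaningful. */
static_assert((DEMO_CORE_FLAG_MASK & DEMO_CONTROLLER_FLAG_MASK) == 0,
	      "core and controller flag ranges overlap");

/* Nonzero if any bit of @flags lies in the core range. */
static int demo_has_core_flag(unsigned int flags)
{
	return (flags & DEMO_CORE_FLAG_MASK) != 0;
}

/* Nonzero if any bit of @flags lies in the per-controller range. */
static int demo_has_controller_flag(unsigned int flags)
{
	return (flags & DEMO_CONTROLLER_FLAG_MASK) != 0;
}
```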

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/3 cgroup/for-5.20] cgroup: Add "no" prefixed mount options
@ 2022-07-26 20:01       ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-26 20:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

Hello,

On Tue, Jul 26, 2022 at 04:32:46PM +0200, Michal Koutný wrote:
> On Thu, Jul 14, 2022 at 06:38:43PM -1000, Tejun Heo <tj@kernel.org> wrote:
> > We allow modifying these mount options via remount. Let's add "no" prefixed
> > variants so that they can be turned off too.
> 
> They can be turned off:
> 
> > // on v5.19-rc?
> > :~ # grep cg /proc/mounts
> > cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
> > :~ # mount -t cgroup2 cgroup2 /sys/fs/cgroup/ -oremount
> > :~ # grep cg /proc/mounts
> > cgroup2 /sys/fs/cgroup cgroup2 rw,relatime 0 0
> 
> The mount(2) says about remounting:
> > The  mountflags  and  data  arguments should match the values used in
> > the original mount() call, except for those parameters that are being
> > deliberately changed.
>
> Or is this a provision for the fsconfig(2) API?

It's just me not knowing how these things work. I just looked at other real
filesystems and copied.

> > +	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
> > +	fsparam_flag("memory_norecursiveprot",	Opt_memory_norecursiveprot),
> 
> These are not 'no' prefixes of the option :-)

Oh, I tried that first but nomemory_recursiveprot looked really weird. The
thing is that the underbar is added to separate the subsystem from the
actual option and we're now prepending no to the subsystem part of the name.
I'm not super attached to the current names tho.

> I.e. it seem more consistent to prefix whole boolean option name (in
> accordance with other FS options but I know limited subset of them).
> In the end, this should be handled generically for boolean options in
> the VFS and not via custom options.
> 
> Also, this allows both
> 	'nsdelegate,nonsdelegate'
> and
> 	'nonsdelegate,nsdelegate'
> (nsdelegate is just an example) where the 'no' always overrides being a
> hidden implementation detail.
> 
> I find this patch a bit weird.

It is a bit weird. Lemme play a bit with turning off the options and I'll
remove the no options if they can be turned off without explicitly
specifying them.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
@ 2022-07-26 23:14           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-26 23:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, linux-kernel,
	cgroups

Hello, Oleg.

On Mon, Jul 25, 2022 at 02:12:09PM +0200, Oleg Nesterov wrote:
> I see no problems in this patch. But just for record, we do not need
> synchronize_rcu() in the "favor && !favoring" case, so we can probably
> do something like
> 
> 	@@ -146,13 +146,20 @@ void rcu_sync_enter(struct rcu_sync *rsp)
> 			 * See the comment above, this simply does the "synchronous"
> 			 * call_rcu(rcu_sync_func) which does GP_ENTER -> GP_PASSED.
> 			 */
> 	+		if (wait) {
> 	+			synchronize_rcu();
> 	+			rcu_sync_func(&rsp->cb_head);
> 	+		} else {
> 	+			rcu_sync_call(rsp);
> 	+		}
> 	+	} else if (wait) {
> 	+		wait_event(rsp->gp_wait, READ_ONCE(rsp->gp_state) >= GP_PASSED);
...
> later.
> 
> __rcu_sync_enter(rsp, false) works just like rcu_sync_enter_start() but it can
> be safely called at any moment.

Yeah, I originally used rcu_sync_enter_start() but quickly found out that it
can't be reverted reliably. Given how cold the option switching path is, I
think it's fine to pay an extra synchronize_rcu() there rather than adding
more complexity to rcu_sync_enter() unless this will be useful somewhere
else too.

> And can't resist, off-topic question... Say, cgroup_attach_task_all() does
> 
> 	mutex_lock(&cgroup_mutex);
> 	percpu_down_write(&cgroup_threadgroup_rwsem);
> 
> and this means that synchronize_rcu() can be called with cgroup_mutex held.
> Perhaps it makes sense to change this code to do
> 
> 	rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> 	mutex_lock(&cgroup_mutex);
> 	percpu_down_write(&cgroup_threadgroup_rwsem);
> 	...
> 	percpu_up_write(&cgroup_threadgroup_rwsem);
> 	mutex_unlock(&cgroup_mutex);
> 	rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> 
> ? Just curious.

I'm not quite following. Are you saying that if we switch the rwsem into
slow mode before grabbing the locks, we can avoid inducing latencies on
other users? Hmm... assuming that I'm understanding you correctly, one
problem with that approach is that everyone would be doing synchronize_rcu()
whether or not they want to change the favoring state. In the vast majority
of cases, users won't care about this flag but most users will end up
mounting cgroup and doing the rcu_sync_enter(), so we'd end up adding a grace
period wait in most boot scenarios. It's not a lot in itself but seems less
desirable than making the users who want to change the mode pay at the time
of changing.

> > -	/*
> > -	 * The latency of the synchronize_rcu() is too high for cgroups,
> > -	 * avoid it at the cost of forcing all readers into the slow path.
> > -	 */
> > -	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);
> 
> Note that it doesn't have other users, probably you can kill it.

Ah, nice, will do that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/3 cgroup/for-5.20] cgroup: Add "no" prefixed mount options
  2022-07-26 20:01       ` Tejun Heo
  (?)
@ 2022-07-26 23:30       ` Tejun Heo
  2022-07-26 23:48           ` Tejun Heo
  -1 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2022-07-26 23:30 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

Hello,

On Tue, Jul 26, 2022 at 10:01:04AM -1000, Tejun Heo wrote:
> On Tue, Jul 26, 2022 at 04:32:46PM +0200, Michal Koutný wrote:
> > On Thu, Jul 14, 2022 at 06:38:43PM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > We allow modifying these mount options via remount. Let's add "no" prefixed
> > > variants so that they can be turned off too.
> > 
> > They can be turned off:
> > 
> > > // on v5.19-rc?
> > > :~ # grep cg /proc/mounts
> > > cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
> > > :~ # mount -t cgroup2 cgroup2 /sys/fs/cgroup/ -oremount
> > > :~ # grep cg /proc/mounts
> > > cgroup2 /sys/fs/cgroup cgroup2 rw,relatime 0 0

  root@test ~# grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
  root@test ~# mount -o remount,exec /sys/fs/cgroup
  root@test ~# grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
  root@test ~# mount -o remount /sys/fs/cgroup
  root@test ~# grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
  root@test ~# mount -o remount,nsdelegate,memory_recursiveprot cgroup2 /sys/fs/cgroup
  root@test ~# grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0
  root@test ~# mount -o remount,memory_recursiveprot cgroup2 /sys/fs/cgroup
  root@test ~# grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,memory_recursiveprot 0 0
  root@test ~# mount -o remount cgroup2 /sys/fs/cgroup
  root@test ~# grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,relatime 0 0

Man, I had no idea that `mount -o remount,$OPTS $MOUNT_POINT` and `mount -o
remount,$OPTS $SRC $MOUNT_POINT` behave completely differently in how they
handle existing options. I wonder why other filesystems are implementing
explicit no prefixed options.
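
One way to model what the transcript shows, as a userspace sketch rather than
the actual kernel code: if every remount rebuilds the flag word from an empty
value and each recognised option can only set its bit, then any option missing
from the string handed to the kernel ends up off. The DEMO_* names and helpers
below are hypothetical stand-ins for the CGRP_ROOT_* flags and the real parser.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bits standing in for the CGRP_ROOT_* flags. */
#define DEMO_NS_DELEGATE		(1u << 0)
#define DEMO_FAVOR_DYNMODS		(1u << 1)
#define DEMO_MEMORY_LOCAL_EVENTS	(1u << 2)
#define DEMO_MEMORY_RECURSIVE_PROT	(1u << 3)

struct demo_opt {
	const char *name;
	unsigned int flag;
};

static const struct demo_opt demo_opts[] = {
	{ "nsdelegate",			DEMO_NS_DELEGATE },
	{ "favordynmods",		DEMO_FAVOR_DYNMODS },
	{ "memory_localevents",		DEMO_MEMORY_LOCAL_EVENTS },
	{ "memory_recursiveprot",	DEMO_MEMORY_RECURSIVE_PROT },
};

/* Return the flag for one token of length @len, or 0 if it is unknown. */
static unsigned int demo_lookup(const char *tok, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(demo_opts) / sizeof(demo_opts[0]); i++) {
		if (strlen(demo_opts[i].name) == len &&
		    !strncmp(demo_opts[i].name, tok, len))
			return demo_opts[i].flag;
	}
	return 0;
}

/*
 * Build the flag word for a (re)mount from scratch: start from zero and
 * only ever set bits, so options absent from @data come out disabled.
 */
static unsigned int demo_remount_flags(const char *data)
{
	unsigned int flags = 0;
	const char *p = data;

	assert(data != NULL);
	while (*p) {
		size_t len = strcspn(p, ",");

		flags |= demo_lookup(p, len);
		p += len;
		if (*p == ',')
			p++;
	}
	return flags;
}
```

With this model, demo_remount_flags("memory_recursiveprot") leaves only that
bit set, which matches nsdelegate disappearing in the transcript once it is
omitted and the source argument is given.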

Anyways, will soon post a patch to remove the no prefixed options.

Thanks for pointing it out.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH cgroup/for-5.20] cgroup: remove "no" prefixed mount options
@ 2022-07-26 23:48           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-26 23:48 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

30312730bd02 ("cgroup: Add "no" prefixed mount options") added "no" prefixed
mount options to allow turning them off and 6a010a49b63a ("cgroup: Make
!percpu threadgroup_rwsem operations optional") added one more "no" prefixed
mount option. However, Michal pointed out that the "no" prefixed options
aren't necessary in allowing mount options to be turned off:

  # grep group /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
  # mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup
  # grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0

Note that this is different from the remount behavior when mount(8) is
invoked without the device argument ("none"):

 # grep cgroup /proc/mounts
 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
 # mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup
 # grep cgroup /proc/mounts
 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

While a bit confusing, given that there is a way to turn off the options,
there's no reason to have the explicit "no" prefixed options. Let's remove
them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
---
 Documentation/admin-guide/cgroup-v2.rst |    8 ++++----
 kernel/cgroup/cgroup.c                  |   24 ++++--------------------
 2 files changed, 8 insertions(+), 24 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -177,14 +177,14 @@ disabling controllers in v1 and make the
 
 cgroup v2 currently supports the following mount options.
 
-  [no]nsdelegate
+  nsdelegate
 	Consider cgroup namespaces as delegation boundaries.  This
 	option is system wide and can only be set on mount or modified
 	through remount from the init namespace.  The mount option is
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
-  [no]favordynmods
+  favordynmods
         Reduce the latencies of dynamic cgroup modifications such as
         task migrations and controller on/offs at the cost of making
         hot path operations such as forks and exits more expensive.
@@ -192,7 +192,7 @@ cgroup v2 currently supports the followi
         controllers, and then seeding it with CLONE_INTO_CGROUP is
         not affected by this option.
 
-  memory_[no]localevents
+  memory_localevents
         Only populate memory.events with data for the current cgroup,
         and not any subtrees. This is legacy behaviour, the default
         behaviour without this option is to include subtree counts.
@@ -200,7 +200,7 @@ cgroup v2 currently supports the followi
         modified through remount from the init namespace. The mount
         option is ignored on non-init namespace mounts.
 
-  memory_[no]recursiveprot
+  memory_recursiveprot
         Recursively apply memory.min and memory.low protection to
         entire subtrees, without requiring explicit downward
         propagation into leaf cgroups.  This allows protecting entire
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1872,22 +1872,18 @@ int cgroup_show_path(struct seq_file *sf
 }
 
 enum cgroup2_param {
-	Opt_nsdelegate, Opt_nonsdelegate,
-	Opt_favordynmods, Opt_nofavordynmods,
-	Opt_memory_localevents, Opt_memory_nolocalevents,
-	Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
+	Opt_nsdelegate,
+	Opt_favordynmods,
+	Opt_memory_localevents,
+	Opt_memory_recursiveprot,
 	nr__cgroup2_params
 };
 
 static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
-	fsparam_flag("nonsdelegate",		Opt_nonsdelegate),
 	fsparam_flag("favordynmods",		Opt_favordynmods),
-	fsparam_flag("nofavordynmods",		Opt_nofavordynmods),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
-	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
-	fsparam_flag("memory_norecursiveprot",	Opt_memory_norecursiveprot),
 	{}
 };
 
@@ -1905,27 +1901,15 @@ static int cgroup2_parse_param(struct fs
 	case Opt_nsdelegate:
 		ctx->flags |= CGRP_ROOT_NS_DELEGATE;
 		return 0;
-	case Opt_nonsdelegate:
-		ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
-		return 0;
 	case Opt_favordynmods:
 		ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 		return 0;
-	case Opt_nofavordynmods:
-		ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
-		return 0;
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
-	case Opt_memory_nolocalevents:
-		ctx->flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
-		return 0;
 	case Opt_memory_recursiveprot:
 		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		return 0;
-	case Opt_memory_norecursiveprot:
-		ctx->flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
-		return 0;
 	}
 	return -EINVAL;
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH cgroup/for-5.20] cgroup: remove "no" prefixed mount options options
@ 2022-07-26 23:48           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-26 23:48 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

30312730bd02 ("cgroup: Add "no" prefixed mount options") added "no" prefixed
mount options to allow turning them off and 6a010a49b63a ("cgroup: Make
!percpu threadgroup_rwsem operations optional") added one more "no" prefixed
mount option. However, Michal pointed out that the "no" prefixed options
aren't necessary in allowing mount options to be turned off:

  # grep group /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
  # mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup
  # grep cgroup /proc/mounts
  cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0

Note that this is different from the remount behavior when the mount(1) is
invoked without the device argument - "none":

 # grep cgroup /proc/mounts
 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
 # mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup
 # grep cgroup /proc/mounts
 cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

While a bit confusing, given that there is a way to turn off the options,
there's no reason to have the explicit "no" prefixed options. Let's remove
them.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org>
---
 Documentation/admin-guide/cgroup-v2.rst |    8 ++++----
 kernel/cgroup/cgroup.c                  |   24 ++++--------------------
 2 files changed, 8 insertions(+), 24 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -177,14 +177,14 @@ disabling controllers in v1 and make the
 
 cgroup v2 currently supports the following mount options.
 
-  [no]nsdelegate
+  nsdelegate
 	Consider cgroup namespaces as delegation boundaries.  This
 	option is system wide and can only be set on mount or modified
 	through remount from the init namespace.  The mount option is
 	ignored on non-init namespace mounts.  Please refer to the
 	Delegation section for details.
 
-  [no]favordynmods
+  favordynmods
         Reduce the latencies of dynamic cgroup modifications such as
         task migrations and controller on/offs at the cost of making
         hot path operations such as forks and exits more expensive.
@@ -192,7 +192,7 @@ cgroup v2 currently supports the followi
         controllers, and then seeding it with CLONE_INTO_CGROUP is
         not affected by this option.
 
-  memory_[no]localevents
+  memory_localevents
         Only populate memory.events with data for the current cgroup,
         and not any subtrees. This is legacy behaviour, the default
         behaviour without this option is to include subtree counts.
@@ -200,7 +200,7 @@ cgroup v2 currently supports the followi
         modified through remount from the init namespace. The mount
         option is ignored on non-init namespace mounts.
 
-  memory_[no]recursiveprot
+  memory_recursiveprot
         Recursively apply memory.min and memory.low protection to
         entire subtrees, without requiring explicit downward
         propagation into leaf cgroups.  This allows protecting entire
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1872,22 +1872,18 @@ int cgroup_show_path(struct seq_file *sf
 }
 
 enum cgroup2_param {
-	Opt_nsdelegate, Opt_nonsdelegate,
-	Opt_favordynmods, Opt_nofavordynmods,
-	Opt_memory_localevents, Opt_memory_nolocalevents,
-	Opt_memory_recursiveprot, Opt_memory_norecursiveprot,
+	Opt_nsdelegate,
+	Opt_favordynmods,
+	Opt_memory_localevents,
+	Opt_memory_recursiveprot,
 	nr__cgroup2_params
 };
 
 static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
-	fsparam_flag("nonsdelegate",		Opt_nonsdelegate),
 	fsparam_flag("favordynmods",		Opt_favordynmods),
-	fsparam_flag("nofavordynmods",		Opt_nofavordynmods),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
-	fsparam_flag("memory_nolocalevents",	Opt_memory_nolocalevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
-	fsparam_flag("memory_norecursiveprot",	Opt_memory_norecursiveprot),
 	{}
 };
 
@@ -1905,27 +1901,15 @@ static int cgroup2_parse_param(struct fs
 	case Opt_nsdelegate:
 		ctx->flags |= CGRP_ROOT_NS_DELEGATE;
 		return 0;
-	case Opt_nonsdelegate:
-		ctx->flags &= ~CGRP_ROOT_NS_DELEGATE;
-		return 0;
 	case Opt_favordynmods:
 		ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 		return 0;
-	case Opt_nofavordynmods:
-		ctx->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
-		return 0;
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
-	case Opt_memory_nolocalevents:
-		ctx->flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
-		return 0;
 	case Opt_memory_recursiveprot:
 		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		return 0;
-	case Opt_memory_norecursiveprot:
-		ctx->flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
-		return 0;
 	}
 	return -EINVAL;
 }
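
As a rough illustration of what flag-only parsing implies for remount, here is a minimal userspace sketch. It is not the kernel code: the `demo_*` names and the flag values are hypothetical, and it assumes, as the patch description suggests, that each mount or remount builds its flag set from scratch, so an option left out of a remount ends up cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; the kernel's CGRP_ROOT_* values are not reproduced. */
enum {
	DEMO_NS_DELEGATE           = 1u << 0,
	DEMO_FAVOR_DYNMODS         = 1u << 1,
	DEMO_MEMORY_LOCAL_EVENTS   = 1u << 2,
	DEMO_MEMORY_RECURSIVE_PROT = 1u << 3,
};

struct demo_opt {
	const char *name;
	unsigned int flag;
};

/* Only positive option names are listed; there are no "no" variants. */
static const struct demo_opt demo_opts[] = {
	{ "nsdelegate",           DEMO_NS_DELEGATE },
	{ "favordynmods",         DEMO_FAVOR_DYNMODS },
	{ "memory_localevents",   DEMO_MEMORY_LOCAL_EVENTS },
	{ "memory_recursiveprot", DEMO_MEMORY_RECURSIVE_PROT },
};

/* Set the bit for one option name; return -EINVAL for anything unknown. */
static int demo_parse_param(const char *name, unsigned int *flags)
{
	size_t i;

	assert(name != NULL && flags != NULL);
	for (i = 0; i < sizeof(demo_opts) / sizeof(demo_opts[0]); i++) {
		if (strcmp(name, demo_opts[i].name) == 0) {
			*flags |= demo_opts[i].flag;
			return 0;
		}
	}
	return -EINVAL;
}

/*
 * Parse a whole option list starting from zero, so any option that is
 * absent from the list is cleared in the result. *out is only written
 * on success.
 */
static int demo_parse_mount(const char *const *names, size_t n,
			    unsigned int *out)
{
	unsigned int flags = 0;
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = demo_parse_param(names[i], &flags);
		if (ret)
			return ret;
	}
	*out = flags;
	return 0;
}
```

In this sketch, parsing { "nsdelegate", "favordynmods" } and then re-parsing just { "nsdelegate" } leaves only the nsdelegate bit set, while a name such as "nofavordynmods" is rejected with -EINVAL.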

* Re: [PATCH cgroup/for-5.20] cgroup: remove "no" prefixed mount options options
@ 2022-07-27  9:27             ` Michal Koutný
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Koutný @ 2022-07-27  9:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

On Tue, Jul 26, 2022 at 01:48:17PM -1000, Tejun Heo <tj@kernel.org> wrote:
Thanks.

> While a bit confusing, given that there is a way to turn off the options,
> there's no reason to have the explicit "no" prefixed options. Let's remove
> them.

This is sensible...

>  Documentation/admin-guide/cgroup-v2.rst |    8 ++++----
>  kernel/cgroup/cgroup.c                  |   24 ++++--------------------
>  2 files changed, 8 insertions(+), 24 deletions(-)

...and cleaner.

Michal

* Re: [PATCH RESEND 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional
  2022-07-26 23:14           ` Tejun Heo
  (?)
@ 2022-07-27 17:39           ` Oleg Nesterov
  -1 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2022-07-27 17:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Brauner, Michal Koutný,
	Peter Zijlstra, John Stultz, Dmitry Shmidt, linux-kernel,
	cgroups

Hi Tejun,

On 07/26, Tejun Heo wrote:
>
> > __rcu_sync_enter(rsp, false) works just like rcu_sync_enter_start() but it can
> > be safely called at any moment.
>
> Yeah, I originally used rcu_sync_enter_start() but quickly found out that it
> can't be reverted reliably. Given how cold the option switching path is, I
> think it's fine to pay an extra synchronize_rcu() there rather than adding
> more complexity to rcu_sync_enter() unless this will be useful somewhere
> else too.

Yes, agreed. As I said, this is just for the record, so that I can find this (simple)
patch on lkml if we have another user of __rcu_sync_enter(rsp, bool wait).

> > And can't resist, off-topic question... Say, cgroup_attach_task_all() does
> >
> > 	mutex_lock(&cgroup_mutex);
> > 	percpu_down_write(&cgroup_threadgroup_rwsem);
> >
> > and this means that synchronize_rcu() can be called with cgroup_mutex held.
> > Perhaps it makes sense to change this code to do
> >
> > 	rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
> > 	mutex_lock(&cgroup_mutex);
> > 	percpu_down_write(&cgroup_threadgroup_rwsem);
> > 	...
> > 	percpu_up_write(&cgroup_threadgroup_rwsem);
> > 	mutex_unlock(&cgroup_mutex);
> > 	rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
> >
> > ? Just curious.
>
> I'm not quite following.

Me too ;)

> Are you saying that if we switch the rwsem into
> slow mode before grabbing the locks, we can avoid inducing latencies on
> other users?

Well yes, in that another mutex_lock(&cgroup_mutex) won't have to sleep until
synchronize_rcu() (called under cgroup_mutex) completes.

> Hmm... assuming that I'm understanding you correctly, one
> problem with that approach is that everyone would be doing synchronize_rcu()
> whether or not they want to change the favoring state.

Hmm... I didn't mean changing the favoring state... And in any case,
this won't cause any additional synchronize_rcu().

Never mind, please forget, this probably makes no sense.

Oleg.


* Re: [PATCH cgroup/for-5.20] cgroup: remove "no" prefixed mount options options
@ 2022-07-27 17:55               ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2022-07-27 17:55 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Peter Zijlstra, John Stultz, Dmitry Shmidt, Oleg Nesterov,
	linux-kernel, cgroups

On Wed, Jul 27, 2022 at 11:27:15AM +0200, Michal Koutný wrote:
> On Tue, Jul 26, 2022 at 01:48:17PM -1000, Tejun Heo <tj@kernel.org> wrote:
> Thanks.
> 
> > While a bit confusing, given that there is a way to turn off the options,
> > there's no reason to have the explicit "no" prefixed options. Let's remove
> > them.
> 
> This is sensible...
> 
> >  Documentation/admin-guide/cgroup-v2.rst |    8 ++++----
> >  kernel/cgroup/cgroup.c                  |   24 ++++--------------------
> >  2 files changed, 8 insertions(+), 24 deletions(-)
> 
> ...and cleaner.

Alright, applied to cgroup/for-5.20.

Thanks.

-- 
tejun

end of thread, other threads:[~2022-07-27 18:56 UTC | newest]

Thread overview: 32+ messages
2022-07-15  4:38 [PATCH 1/3 cgroup/for-5.20] cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree Tejun Heo
2022-07-15  4:38 ` [PATCH 2/3 cgroup/for-5.20] cgroup: Add "no" prefixed mount options Tejun Heo
2022-07-15  4:38   ` Tejun Heo
2022-07-15  4:39   ` [PATCH 3/3 cgroup/for-5.20] cgroup: Make !percpu threadgroup_rwsem operations optional Tejun Heo
2022-07-15  4:39     ` Tejun Heo
2022-07-23  5:12     ` Tejun Heo
2022-07-23  5:12       ` Tejun Heo
2022-07-23 14:28     ` [PATCH RESEND " Tejun Heo
2022-07-23 14:28       ` Tejun Heo
2022-07-25 12:12       ` Oleg Nesterov
2022-07-25 12:12         ` Oleg Nesterov
2022-07-26 23:14         ` Tejun Heo
2022-07-26 23:14           ` Tejun Heo
2022-07-27 17:39           ` Oleg Nesterov
2022-07-25 14:16       ` Christian Brauner
2022-07-25 14:16         ` Christian Brauner
2022-07-26 14:32       ` Michal Koutný
2022-07-26 17:33         ` Tejun Heo
2022-07-26 17:33           ` Tejun Heo
2022-07-26 14:32   ` [PATCH 2/3 cgroup/for-5.20] cgroup: Add "no" prefixed mount options Michal Koutný
2022-07-26 14:32     ` Michal Koutný
2022-07-26 20:01     ` Tejun Heo
2022-07-26 20:01       ` Tejun Heo
2022-07-26 23:30       ` Tejun Heo
2022-07-26 23:48         ` [PATCH cgroup/for-5.20] cgroup: remove "no" prefixed mount options options Tejun Heo
2022-07-26 23:48           ` Tejun Heo
2022-07-27  9:27           ` Michal Koutný
2022-07-27  9:27             ` Michal Koutný
2022-07-27 17:55             ` Tejun Heo
2022-07-27 17:55               ` Tejun Heo
2022-07-26 14:31 ` [PATCH 1/3 cgroup/for-5.20] cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree Michal Koutný
2022-07-26 14:31   ` Michal Koutný
