[RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths
@ 2023-05-02 13:38 ` Michal Koutný
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

Idea for these modification came up when css_set_lock seemed unneeded in
cgroup_show_path.
It's a delicate change, so the deciding factor was when cgroup_show_path popped
up also in some profiles of frequent mountinfo readers.
The idea is to trade the exclusive css_set_lock for the shared
namespace_sem when rendering cgroup paths. Details are described more in
individual commits.

Michal Koutný (3):
  cgroup: Drop unused function for cgroup_path
  cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root
    explicitly
  cgroup: Do not take css_set_lock in cgroup_show_path

 fs/namespace.c         |  5 +++-
 include/linux/mount.h  |  4 +++
 kernel/cgroup/cgroup.c | 58 ++++++++----------------------------------
 3 files changed, 18 insertions(+), 49 deletions(-)

-- 
2.40.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths
@ 2023-05-02 13:38 ` Michal Koutný
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

Idea for these modification came up when css_set_lock seemed unneeded in
cgroup_show_path.
It's a delicate change, so the deciding factor was when cgroup_show_path popped
up also in some profiles of frequent mountinfo readers.
The idea is to trade the exclusive css_set_lock for the shared
namespace_sem when rendering cgroup paths. Details are described more in
individual commits.

Michal Koutn√Ω (3):
  cgroup: Drop unused function for cgroup_path
  cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root
    explicitly
  cgroup: Do not take css_set_lock in cgroup_show_path

 fs/namespace.c         |  5 +++-
 include/linux/mount.h  |  4 +++
 kernel/cgroup/cgroup.c | 58 ++++++++----------------------------------
 3 files changed, 18 insertions(+), 49 deletions(-)

-- 
2.40.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/3] cgroup: Drop unused function for cgroup_path
@ 2023-05-02 13:38   ` Michal Koutný
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

There is no current user and there are alternative methods to obtain
task's cgroup path.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c | 39 ---------------------------------------
 1 file changed, 39 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 625d7483951c..55e5f0110e3b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2378,45 +2378,6 @@ int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
 }
 EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
-/**
- * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
- * @task: target task
- * @buf: the buffer to write the path into
- * @buflen: the length of the buffer
- *
- * Determine @task's cgroup on the first (the one with the lowest non-zero
- * hierarchy_id) cgroup hierarchy and copy its path into @buf.  This
- * function grabs cgroup_mutex and shouldn't be used inside locks used by
- * cgroup controller callbacks.
- *
- * Return value is the same as kernfs_path().
- */
-int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
-{
-	struct cgroup_root *root;
-	struct cgroup *cgrp;
-	int hierarchy_id = 1;
-	int ret;
-
-	cgroup_lock();
-	spin_lock_irq(&css_set_lock);
-
-	root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
-
-	if (root) {
-		cgrp = task_cgroup_from_root(task, root);
-		ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
-	} else {
-		/* if no hierarchy exists, everyone is in "/" */
-		ret = strscpy(buf, "/", buflen);
-	}
-
-	spin_unlock_irq(&css_set_lock);
-	cgroup_unlock();
-	return ret;
-}
-EXPORT_SYMBOL_GPL(task_cgroup_path);
-
 /**
  * cgroup_attach_lock - Lock for ->attach()
  * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/3] cgroup: Drop unused function for cgroup_path
@ 2023-05-02 13:38   ` Michal Koutný
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

There is no current user and there are alternative methods to obtain
task's cgroup path.

Signed-off-by: Michal Koutn√Ω <mkoutny-IBi9RG/b67k@public.gmane.org>
---
 kernel/cgroup/cgroup.c | 39 ---------------------------------------
 1 file changed, 39 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 625d7483951c..55e5f0110e3b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2378,45 +2378,6 @@ int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
 }
 EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
-/**
- * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
- * @task: target task
- * @buf: the buffer to write the path into
- * @buflen: the length of the buffer
- *
- * Determine @task's cgroup on the first (the one with the lowest non-zero
- * hierarchy_id) cgroup hierarchy and copy its path into @buf.  This
- * function grabs cgroup_mutex and shouldn't be used inside locks used by
- * cgroup controller callbacks.
- *
- * Return value is the same as kernfs_path().
- */
-int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
-{
-	struct cgroup_root *root;
-	struct cgroup *cgrp;
-	int hierarchy_id = 1;
-	int ret;
-
-	cgroup_lock();
-	spin_lock_irq(&css_set_lock);
-
-	root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
-
-	if (root) {
-		cgrp = task_cgroup_from_root(task, root);
-		ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
-	} else {
-		/* if no hierarchy exists, everyone is in "/" */
-		ret = strscpy(buf, "/", buflen);
-	}
-
-	spin_unlock_irq(&css_set_lock);
-	cgroup_unlock();
-	return ret;
-}
-EXPORT_SYMBOL_GPL(task_cgroup_path);
-
 /**
  * cgroup_attach_lock - Lock for ->attach()
  * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly
  2023-05-02 13:38 ` Michal Koutný
@ 2023-05-02 13:38   ` Michal Koutný
  -1 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

The function current_cgns_cgroup_from_root() expects a stable
cgroup_root, which is currently ensured with RCU read side paired with
cgroup_destroy_root() called after RCU period.

The particular current_cgns_cgroup_from_root() is called from VFS code
and cgroup_root stability can be also ensured by namespace_sem. Mark it
explicitly as a preparation for further rework.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 fs/namespace.c         | 5 ++++-
 include/linux/mount.h  | 4 ++++
 kernel/cgroup/cgroup.c | 7 +++----
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 54847db5b819..0d2333832064 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -71,7 +71,10 @@ static DEFINE_IDA(mnt_group_ida);
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
-static DECLARE_RWSEM(namespace_sem);
+DECLARE_RWSEM(namespace_sem);
+#ifdef CONFIG_LOCKDEP
+EXPORT_SYMBOL_GPL(namespace_sem);
+#endif
 static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1ea326c368f7..6277435f6748 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -80,6 +80,10 @@ static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
 	return smp_load_acquire(&mnt->mnt_idmap);
 }
 
+#ifdef CONFIG_LOCKDEP
+extern struct rw_semaphore namespace_sem;
+#endif
+
 extern int mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct file *file);
 extern void mnt_drop_write(struct vfsmount *mnt);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 55e5f0110e3b..32d693a797b9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1440,13 +1440,12 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 
 	lockdep_assert_held(&css_set_lock);
 
-	rcu_read_lock();
+	/* namespace_sem ensures `root` stability on unmount */
+	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
 
 	cset = current->nsproxy->cgroup_ns->root_cset;
 	res = __cset_cgroup_from_root(cset, root);
 
-	rcu_read_unlock();
-
 	return res;
 }
 
@@ -1454,7 +1453,7 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
  * Look up cgroup associated with current task's cgroup namespace on the default
  * hierarchy.
  *
- * Unlike current_cgns_cgroup_from_root(), this doesn't need locks:
+ * Relaxed locking requirements:
  * - Internal rcu_read_lock is unnecessary because we don't dereference any rcu
  *   pointers.
  * - css_set_lock is not needed because we just read cset->dfl_cgrp.
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly
@ 2023-05-02 13:38   ` Michal Koutný
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

The function current_cgns_cgroup_from_root() expects a stable
cgroup_root, which is currently ensured with RCU read side paired with
cgroup_destroy_root() called after RCU period.

The particular current_cgns_cgroup_from_root() is called from VFS code
and cgroup_root stability can be also ensured by namespace_sem. Mark it
explicitly as a preparation for further rework.

Signed-off-by: Michal Koutn√Ω <mkoutny@suse.com>
---
 fs/namespace.c         | 5 ++++-
 include/linux/mount.h  | 4 ++++
 kernel/cgroup/cgroup.c | 7 +++----
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 54847db5b819..0d2333832064 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -71,7 +71,10 @@ static DEFINE_IDA(mnt_group_ida);
 static struct hlist_head *mount_hashtable __read_mostly;
 static struct hlist_head *mountpoint_hashtable __read_mostly;
 static struct kmem_cache *mnt_cache __read_mostly;
-static DECLARE_RWSEM(namespace_sem);
+DECLARE_RWSEM(namespace_sem);
+#ifdef CONFIG_LOCKDEP
+EXPORT_SYMBOL_GPL(namespace_sem);
+#endif
 static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1ea326c368f7..6277435f6748 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -80,6 +80,10 @@ static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
 	return smp_load_acquire(&mnt->mnt_idmap);
 }
 
+#ifdef CONFIG_LOCKDEP
+extern struct rw_semaphore namespace_sem;
+#endif
+
 extern int mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct file *file);
 extern void mnt_drop_write(struct vfsmount *mnt);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 55e5f0110e3b..32d693a797b9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1440,13 +1440,12 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 
 	lockdep_assert_held(&css_set_lock);
 
-	rcu_read_lock();
+	/* namespace_sem ensures `root` stability on unmount */
+	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
 
 	cset = current->nsproxy->cgroup_ns->root_cset;
 	res = __cset_cgroup_from_root(cset, root);
 
-	rcu_read_unlock();
-
 	return res;
 }
 
@@ -1454,7 +1453,7 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
  * Look up cgroup associated with current task's cgroup namespace on the default
  * hierarchy.
  *
- * Unlike current_cgns_cgroup_from_root(), this doesn't need locks:
+ * Relaxed locking requirements:
  * - Internal rcu_read_lock is unnecessary because we don't dereference any rcu
  *   pointers.
  * - css_set_lock is not needed because we just read cset->dfl_cgrp.
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
  2023-05-02 13:38 ` Michal Koutný
@ 2023-05-02 13:38   ` Michal Koutný
  -1 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

/proc/$pid/mountinfo may accumulate lots of entries (causing frequent
re-reads of whole file) or lots cgroupfs entries alone.
The cgroupfs entries rendered with cgroup_show_path() may increase/be
subject of css_set_lock contention causing further slowdown -- not only
mountinfo rendering but any other css_set_lock user.

We leverage the fact that mountinfo reading happens with namespace_sem
taken and hierarchy roots thus cannot be freed concurrently.

There are three relevant nodes for each cgroupfs entry:

        R ... cgroup hierarchy root
        M ... mount root
        C ... reader's cgroup NS root

mountinfo is supposed to show path from C to M.

Current's css_set (and linked root cgroups) are stable under
namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need
css_set_lock.

When the path is assembled in kernfs_path_from_node(), we know that:
- C kernfs_node is (transitively) pinned via current->nsproxy,
- M kernfs_node is pinned thanks to namespace_sem,
- path C to M is pinned via child->parent references (this holds also
  when C and M are in distinct subtrees).

Streamline mountinfo rendering a bit by relieving css_set_lock and add
careful notes about that.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 32d693a797b9..e2ec6f0028be 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
 	struct cgroup *res_cgroup = NULL;
 
 	if (cset == &init_css_set) {
+		/* callers must ensure root stability */
 		res_cgroup = &root->cgrp;
 	} else if (root == &cgrp_dfl_root) {
 		res_cgroup = cset->dfl_cgrp;
 	} else {
 		struct cgrp_cset_link *link;
-		lockdep_assert_held(&css_set_lock);
+		/* cset's cgroups are pinned unless they are root cgroups that
+		 * were unmounted.  We look at links to !cgrp_dfl_root
+		 * cgroup_root, either lock ensures the list is not mutated
+		 */
+		lockdep_assert(lockdep_is_held(&css_set_lock) ||
+			       lockdep_is_held_type(&namespace_sem, -1));
 
 		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 			struct cgroup *c = link->cgrp;
@@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 	struct cgroup *res = NULL;
 	struct css_set *cset;
 
-	lockdep_assert_held(&css_set_lock);
-
 	/* namespace_sem ensures `root` stability on unmount */
 	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
 
@@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 	if (!buf)
 		return -ENOMEM;
 
-	spin_lock_irq(&css_set_lock);
 	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
 	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
-	spin_unlock_irq(&css_set_lock);
 
 	if (len >= PATH_MAX)
 		len = -ERANGE;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
@ 2023-05-02 13:38   ` Michal Koutný
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Koutný @ 2023-05-02 13:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

/proc/$pid/mountinfo may accumulate lots of entries (causing frequent
re-reads of whole file) or lots cgroupfs entries alone.
The cgroupfs entries rendered with cgroup_show_path() may increase/be
subject of css_set_lock contention causing further slowdown -- not only
mountinfo rendering but any other css_set_lock user.

We leverage the fact that mountinfo reading happens with namespace_sem
taken and hierarchy roots thus cannot be freed concurrently.

There are three relevant nodes for each cgroupfs entry:

        R ... cgroup hierarchy root
        M ... mount root
        C ... reader's cgroup NS root

mountinfo is supposed to show path from C to M.

Current's css_set (and linked root cgroups) are stable under
namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need
css_set_lock.

When the path is assembled in kernfs_path_from_node(), we know that:
- C kernfs_node is (transitively) pinned via current->nsproxy,
- M kernfs_node is pinned thanks to namespace_sem,
- path C to M is pinned via child->parent references (this holds also
  when C and M are in distinct subtrees).

Streamline mountinfo rendering a bit by relieving css_set_lock and add
careful notes about that.

Signed-off-by: Michal Koutn√Ω <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 32d693a797b9..e2ec6f0028be 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
 	struct cgroup *res_cgroup = NULL;
 
 	if (cset == &init_css_set) {
+		/* callers must ensure root stability */
 		res_cgroup = &root->cgrp;
 	} else if (root == &cgrp_dfl_root) {
 		res_cgroup = cset->dfl_cgrp;
 	} else {
 		struct cgrp_cset_link *link;
-		lockdep_assert_held(&css_set_lock);
+		/* cset's cgroups are pinned unless they are root cgroups that
+		 * were unmounted.  We look at links to !cgrp_dfl_root
+		 * cgroup_root, either lock ensures the list is not mutated
+		 */
+		lockdep_assert(lockdep_is_held(&css_set_lock) ||
+			       lockdep_is_held_type(&namespace_sem, -1));
 
 		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 			struct cgroup *c = link->cgrp;
@@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 	struct cgroup *res = NULL;
 	struct css_set *cset;
 
-	lockdep_assert_held(&css_set_lock);
-
 	/* namespace_sem ensures `root` stability on unmount */
 	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
 
@@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 	if (!buf)
 		return -ENOMEM;
 
-	spin_lock_irq(&css_set_lock);
 	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
 	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
-	spin_unlock_irq(&css_set_lock);
 
 	if (len >= PATH_MAX)
 		len = -ERANGE;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly
  2023-05-02 13:38   ` Michal Koutný
  (?)
@ 2023-05-02 19:50   ` Waiman Long
  -1 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2023-05-02 19:50 UTC (permalink / raw)
  To: Michal Koutný, linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

On 5/2/23 09:38, Michal Koutný wrote:
> The function current_cgns_cgroup_from_root() expects a stable
> cgroup_root, which is currently ensured with RCU read side paired with
> cgroup_destroy_root() called after RCU period.
>
> The particular current_cgns_cgroup_from_root() is called from VFS code
> and cgroup_root stability can be also ensured by namespace_sem. Mark it
> explicitly as a preparation for further rework.
>
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>   fs/namespace.c         | 5 ++++-
>   include/linux/mount.h  | 4 ++++
>   kernel/cgroup/cgroup.c | 7 +++----
>   3 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 54847db5b819..0d2333832064 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -71,7 +71,10 @@ static DEFINE_IDA(mnt_group_ida);
>   static struct hlist_head *mount_hashtable __read_mostly;
>   static struct hlist_head *mountpoint_hashtable __read_mostly;
>   static struct kmem_cache *mnt_cache __read_mostly;
> -static DECLARE_RWSEM(namespace_sem);
> +DECLARE_RWSEM(namespace_sem);
> +#ifdef CONFIG_LOCKDEP
> +EXPORT_SYMBOL_GPL(namespace_sem);

I don't think fs/namespace.o and kernel/cgroup/cgroup.o can't be built 
into a kernel module. I doubt we need to export it.


> +#endif
>   static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>   static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>   
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1ea326c368f7..6277435f6748 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -80,6 +80,10 @@ static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
>   	return smp_load_acquire(&mnt->mnt_idmap);
>   }
>   
> +#ifdef CONFIG_LOCKDEP
> +extern struct rw_semaphore namespace_sem;
> +#endif
> +
>   extern int mnt_want_write(struct vfsmount *mnt);
>   extern int mnt_want_write_file(struct file *file);
>   extern void mnt_drop_write(struct vfsmount *mnt);
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 55e5f0110e3b..32d693a797b9 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1440,13 +1440,12 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
>   
>   	lockdep_assert_held(&css_set_lock);
>   
> -	rcu_read_lock();
> +	/* namespace_sem ensures `root` stability on unmount */
> +	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
It will be easier if you just use lockdep_is_held() without the 2nd argment.
>   
>   	cset = current->nsproxy->cgroup_ns->root_cset;
>   	res = __cset_cgroup_from_root(cset, root);
>   
> -	rcu_read_unlock();
> -
>   	return res;
>   }
>   
> @@ -1454,7 +1453,7 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
>    * Look up cgroup associated with current task's cgroup namespace on the default
>    * hierarchy.
>    *
> - * Unlike current_cgns_cgroup_from_root(), this doesn't need locks:
> + * Relaxed locking requirements:
>    * - Internal rcu_read_lock is unnecessary because we don't dereference any rcu
>    *   pointers.
>    * - css_set_lock is not needed because we just read cset->dfl_cgrp.
Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
@ 2023-05-02 19:56     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2023-05-02 19:56 UTC (permalink / raw)
  To: Michal Koutný, linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

On 5/2/23 09:38, Michal Koutný wrote:
> /proc/$pid/mountinfo may accumulate lots of entries (causing frequent
> re-reads of whole file) or lots cgroupfs entries alone.
> The cgroupfs entries rendered with cgroup_show_path() may increase/be
> subject of css_set_lock contention causing further slowdown -- not only
> mountinfo rendering but any other css_set_lock user.
>
> We leverage the fact that mountinfo reading happens with namespace_sem
> taken and hierarchy roots thus cannot be freed concurrently.
>
> There are three relevant nodes for each cgroupfs entry:
>
>          R ... cgroup hierarchy root
>          M ... mount root
>          C ... reader's cgroup NS root
>
> mountinfo is supposed to show path from C to M.
>
> Current's css_set (and linked root cgroups) are stable under
> namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need
> css_set_lock.
>
> When the path is assembled in kernfs_path_from_node(), we know that:
> - C kernfs_node is (transitively) pinned via current->nsproxy,
> - M kernfs_node is pinned thanks to namespace_sem,
> - path C to M is pinned via child->parent references (this holds also
>    when C and M are in distinct subtrees).
>
> Streamline mountinfo rendering a bit by relieving css_set_lock and add
> careful notes about that.
>
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>   kernel/cgroup/cgroup.c | 12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 32d693a797b9..e2ec6f0028be 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
>   	struct cgroup *res_cgroup = NULL;
>   
>   	if (cset == &init_css_set) {
> +		/* callers must ensure root stability */
>   		res_cgroup = &root->cgrp;
>   	} else if (root == &cgrp_dfl_root) {
>   		res_cgroup = cset->dfl_cgrp;
>   	} else {
>   		struct cgrp_cset_link *link;
> -		lockdep_assert_held(&css_set_lock);
> +		/* cset's cgroups are pinned unless they are root cgroups that
> +		 * were unmounted.  We look at links to !cgrp_dfl_root
> +		 * cgroup_root, either lock ensures the list is not mutated
> +		 */
> +		lockdep_assert(lockdep_is_held(&css_set_lock) ||
> +			       lockdep_is_held_type(&namespace_sem, -1));
Again lockdep_is_held(&namespace_sem) is good enough.

>   
>   		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
>   			struct cgroup *c = link->cgrp;
> @@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
>   	struct cgroup *res = NULL;
>   	struct css_set *cset;
>   
> -	lockdep_assert_held(&css_set_lock);
> -
>   	/* namespace_sem ensures `root` stability on unmount */
>   	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
>   
> @@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
>   	if (!buf)
>   		return -ENOMEM;
>   
> -	spin_lock_irq(&css_set_lock);
>   	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
>   	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
> -	spin_unlock_irq(&css_set_lock);
>   
>   	if (len >= PATH_MAX)
>   		len = -ERANGE;
Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
@ 2023-05-02 19:56     ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2023-05-02 19:56 UTC (permalink / raw)
  To: Michal Koutný,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

On 5/2/23 09:38, Michal Koutný wrote:
> /proc/$pid/mountinfo may accumulate lots of entries (causing frequent
> re-reads of whole file) or lots cgroupfs entries alone.
> The cgroupfs entries rendered with cgroup_show_path() may increase/be
> subject of css_set_lock contention causing further slowdown -- not only
> mountinfo rendering but any other css_set_lock user.
>
> We leverage the fact that mountinfo reading happens with namespace_sem
> taken and hierarchy roots thus cannot be freed concurrently.
>
> There are three relevant nodes for each cgroupfs entry:
>
>          R ... cgroup hierarchy root
>          M ... mount root
>          C ... reader's cgroup NS root
>
> mountinfo is supposed to show path from C to M.
>
> Current's css_set (and linked root cgroups) are stable under
> namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need
> css_set_lock.
>
> When the path is assembled in kernfs_path_from_node(), we know that:
> - C kernfs_node is (transitively) pinned via current->nsproxy,
> - M kernfs_node is pinned thanks to namespace_sem,
> - path C to M is pinned via child->parent references (this holds also
>    when C and M are in distinct subtrees).
>
> Streamline mountinfo rendering a bit by relieving css_set_lock and add
> careful notes about that.
>
> Signed-off-by: Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org>
> ---
>   kernel/cgroup/cgroup.c | 12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 32d693a797b9..e2ec6f0028be 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
>   	struct cgroup *res_cgroup = NULL;
>   
>   	if (cset == &init_css_set) {
> +		/* callers must ensure root stability */
>   		res_cgroup = &root->cgrp;
>   	} else if (root == &cgrp_dfl_root) {
>   		res_cgroup = cset->dfl_cgrp;
>   	} else {
>   		struct cgrp_cset_link *link;
> -		lockdep_assert_held(&css_set_lock);
> +		/* cset's cgroups are pinned unless they are root cgroups that
> +		 * were unmounted.  We look at links to !cgrp_dfl_root
> +		 * cgroup_root, either lock ensures the list is not mutated
> +		 */
> +		lockdep_assert(lockdep_is_held(&css_set_lock) ||
> +			       lockdep_is_held_type(&namespace_sem, -1));
Again lockdep_is_held(&namespace_sem) is good enough.

>   
>   		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
>   			struct cgroup *c = link->cgrp;
> @@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
>   	struct cgroup *res = NULL;
>   	struct css_set *cset;
>   
> -	lockdep_assert_held(&css_set_lock);
> -
>   	/* namespace_sem ensures `root` stability on unmount */
>   	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
>   
> @@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
>   	if (!buf)
>   		return -ENOMEM;
>   
> -	spin_lock_irq(&css_set_lock);
>   	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
>   	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
> -	spin_unlock_irq(&css_set_lock);
>   
>   	if (len >= PATH_MAX)
>   		len = -ERANGE;
Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/3] cgroup: Drop unused function for cgroup_path
  2023-05-02 13:38   ` Michal Koutný
  (?)
@ 2023-05-02 19:58   ` Waiman Long
  -1 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2023-05-02 19:58 UTC (permalink / raw)
  To: Michal Koutný, linux-fsdevel, linux-kernel, cgroups
  Cc: Alexander Viro, Christian Brauner, Tejun Heo, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

On 5/2/23 09:38, Michal Koutný wrote:
> There is no current user and there are alternative methods to obtain
> task's cgroup path.
>
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>   kernel/cgroup/cgroup.c | 39 ---------------------------------------
>   1 file changed, 39 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 625d7483951c..55e5f0110e3b 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2378,45 +2378,6 @@ int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
>   }
>   EXPORT_SYMBOL_GPL(cgroup_path_ns);
>   
> -/**
> - * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
> - * @task: target task
> - * @buf: the buffer to write the path into
> - * @buflen: the length of the buffer
> - *
> - * Determine @task's cgroup on the first (the one with the lowest non-zero
> - * hierarchy_id) cgroup hierarchy and copy its path into @buf.  This
> - * function grabs cgroup_mutex and shouldn't be used inside locks used by
> - * cgroup controller callbacks.
> - *
> - * Return value is the same as kernfs_path().
> - */
> -int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
> -{
> -	struct cgroup_root *root;
> -	struct cgroup *cgrp;
> -	int hierarchy_id = 1;
> -	int ret;
> -
> -	cgroup_lock();
> -	spin_lock_irq(&css_set_lock);
> -
> -	root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
> -
> -	if (root) {
> -		cgrp = task_cgroup_from_root(task, root);
> -		ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
> -	} else {
> -		/* if no hierarchy exists, everyone is in "/" */
> -		ret = strscpy(buf, "/", buflen);
> -	}
> -
> -	spin_unlock_irq(&css_set_lock);
> -	cgroup_unlock();
> -	return ret;
> -}
> -EXPORT_SYMBOL_GPL(task_cgroup_path);
> -
>   /**
>    * cgroup_attach_lock - Lock for ->attach()
>    * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem

I went to a few of earlier Linux version down to v3.11. 
task_cgroup_path() doesn't seems to have any users in my few attempts. 
Anyway,

Reviewed-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
  2023-05-02 13:38   ` Michal Koutný
  (?)
  (?)
@ 2023-05-05 15:45   ` Tejun Heo
  2023-05-05 17:32     ` Michal Koutný
  -1 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2023-05-05 15:45 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro,
	Christian Brauner, Zefan Li, Johannes Weiner, Dave Chinner,
	Rik van Riel, Jiri Wiesner

Hello,

On Tue, May 02, 2023 at 03:38:47PM +0200, Michal Koutný wrote:
> /proc/$pid/mountinfo may accumulate lots of entries (causing frequent
> re-reads of whole file) or lots cgroupfs entries alone.
> The cgroupfs entries rendered with cgroup_show_path() may increase/be
> subject of css_set_lock contention causing further slowdown -- not only
> mountinfo rendering but any other css_set_lock user.
> 
> We leverage the fact that mountinfo reading happens with namespace_sem
> taken and hierarchy roots thus cannot be freed concurrently.
> 
> There are three relevant nodes for each cgroupfs entry:
> 
>         R ... cgroup hierarchy root
>         M ... mount root
>         C ... reader's cgroup NS root
> 
> mountinfo is supposed to show path from C to M.

At least for cgroup2, the path from C to M isn't gonna change once NS is
established, right? Nothing can be moved or renamed while the NS root is
there. If so, can't we just cache the formatted path and return the same
thing without any locking? The proposed changes seem a bit too brittle to
me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
  2023-05-05 15:45   ` Tejun Heo
@ 2023-05-05 17:32     ` Michal Koutný
  2023-05-05 18:17       ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Koutný @ 2023-05-05 17:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro,
	Christian Brauner, Zefan Li, Johannes Weiner, Dave Chinner,
	Rik van Riel, Jiri Wiesner

[-- Attachment #1: Type: text/plain, Size: 1271 bytes --]

On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > There are three relevant nodes for each cgroupfs entry:
> > 
> >         R ... cgroup hierarchy root
> >         M ... mount root
> >         C ... reader's cgroup NS root
> > 
> > mountinfo is supposed to show path from C to M.
> 
> At least for cgroup2, the path from C to M isn't gonna change once NS is
> established, right?

Right. Although, the argument about M (when C above M or when C and M in
different subtrees) implicitly relies on the namespace_sem.

> Nothing can be moved or renamed while the NS root is there. If so,
> can't we just cache the formatted path and return the same thing
> without any locking?

Here I find the caching complexity not worth it for v2 only (also C is
in the eye of the beholder) -- because then css_set_lock can be dropped
from cgroup_show_path with simpler reasoning.

(Sadly, the bigger benefit would be on hybrid setups multiplied by the
number of v1 hierarchies listed.)

(OTOH, the caching argument and weights for it may be different for
/proc/$pid/cgroup.)

> The proposed changes seem a bit too brittle to me.

OK, I'll look into separate cgroup_show_path and cgroup1_show_path
approach.

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
  2023-05-05 17:32     ` Michal Koutný
@ 2023-05-05 18:17       ` Tejun Heo
  2023-05-09 10:34         ` Michal Koutný
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2023-05-05 18:17 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro,
	Christian Brauner, Zefan Li, Johannes Weiner, Dave Chinner,
	Rik van Riel, Jiri Wiesner

Hello,

On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote:
> On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > There are three relevant nodes for each cgroupfs entry:
> > > 
> > >         R ... cgroup hierarchy root
> > >         M ... mount root
> > >         C ... reader's cgroup NS root
> > > 
> > > mountinfo is supposed to show path from C to M.
> > 
> > At least for cgroup2, the path from C to M isn't gonna change once NS is
> > established, right?
> 
> Right. Although, the argument about M (when C above M or when C and M in
> different subtrees) implicitly relies on the namespace_sem.

I don't follow. Can you please elaborate a bit more?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
  2023-05-05 18:17       ` Tejun Heo
@ 2023-05-09 10:34         ` Michal Koutný
  2023-05-22 20:55           ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Koutný @ 2023-05-09 10:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro,
	Christian Brauner, Zefan Li, Johannes Weiner, Dave Chinner,
	Rik van Riel, Jiri Wiesner

[-- Attachment #1: Type: text/plain, Size: 1580 bytes --]

On Fri, May 05, 2023 at 08:17:10AM -1000, Tejun Heo <tj@kernel.org> wrote:
> On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote:
> > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > > There are three relevant nodes for each cgroupfs entry:
> > > > 
> > > >         R ... cgroup hierarchy root
> > > >         M ... mount root
> > > >         C ... reader's cgroup NS root
> > > > 
> > > > mountinfo is supposed to show path from C to M.
> > > 
> > > At least for cgroup2, the path from C to M isn't gonna change once NS is
> > > established, right?
> > 
> > Right. Although, the argument about M (when C above M or when C and M in
> > different subtrees) implicitly relies on the namespace_sem.
> 
> I don't follow. Can you please elaborate a bit more?

I wanted to say that even with restriction to cgroup2, the css_set_lock
removal would also rely on namespace_sem.

For a given mountinfo entry the path C--M won't change (no renames).
The question is whether cgroup M will stay around (with the relaxed
locking):

  - C >= M (C is below M) 
    -> C (transitively) pins M

  - C < M (C is above M) or C and M are in two disjoint subtrees (path
    goes through a common ancestor)
    -> M could be released without relation to C (even on cgroup2, with
       the css_set_lock removed) but such a destructive operation on M
       is excluded as long as namespace_sem is held during entry
       rendering.

Does that clarify the trade-off of removing css_set_lock at this spot?

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path
  2023-05-09 10:34         ` Michal Koutný
@ 2023-05-22 20:55           ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2023-05-22 20:55 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro,
	Christian Brauner, Zefan Li, Johannes Weiner, Dave Chinner,
	Rik van Riel, Jiri Wiesner

Hello, Michal.

Sorry about the delay.

On Tue, May 09, 2023 at 12:34:53PM +0200, Michal Koutný wrote:
> On Fri, May 05, 2023 at 08:17:10AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote:
> > > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > > > There are three relevant nodes for each cgroupfs entry:
> > > > > 
> > > > >         R ... cgroup hierarchy root
> > > > >         M ... mount root
> > > > >         C ... reader's cgroup NS root
> > > > > 
> > > > > mountinfo is supposed to show path from C to M.
> > > > 
> > > > At least for cgroup2, the path from C to M isn't gonna change once NS is
> > > > established, right?
> > > 
> > > Right. Although, the argument about M (when C above M or when C and M in
> > > different subtrees) implicitly relies on the namespace_sem.
> > 
> > I don't follow. Can you please elaborate a bit more?
> 
> I wanted to say that even with restriction to cgroup2, the css_set_lock
> removal would also rely on namespace_sem.
> 
> For a given mountinfo entry the path C--M won't change (no renames).
> The question is whether cgroup M will stay around (with the relaxed
> locking):
> 
>   - C >= M (C is below M) 
>     -> C (transitively) pins M

Yeah, this was what I was thinking.

>   - C < M (C is above M) or C and M are in two disjoint subtrees (path
>     goes through a common ancestor)
>     -> M could be released without relation to C (even on cgroup2, with
>        the css_set_lock removed) but such a destructive operation on M
>        is excluded as long as namespace_sem is held during entry
>        rendering.
> 
> Does that clarify the trade-off of removing css_set_lock at this spot?

Right, you can have cgroup outside NS root still mounted and that mount root
can be viewed from multiple cgroup NS's, so the the path isn't fixed either.

Having enough lockdep annotations should do but if reasonable the preference
is being a bit more self-contained.

Thanks for the explanation.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly
@ 2023-05-23 10:42     ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2023-05-23 10:42 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro, Tejun Heo,
	Zefan Li, Johannes Weiner, Dave Chinner, Rik van Riel,
	Jiri Wiesner

On Tue, May 02, 2023 at 03:38:46PM +0200, Michal Koutný wrote:
> The function current_cgns_cgroup_from_root() expects a stable
> cgroup_root, which is currently ensured with RCU read side paired with
> cgroup_destroy_root() called after RCU period.
> 
> The particular current_cgns_cgroup_from_root() is called from VFS code
> and cgroup_root stability can be also ensured by namespace_sem. Mark it
> explicitly as a preparation for further rework.
> 
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>  fs/namespace.c         | 5 ++++-
>  include/linux/mount.h  | 4 ++++
>  kernel/cgroup/cgroup.c | 7 +++----
>  3 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 54847db5b819..0d2333832064 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -71,7 +71,10 @@ static DEFINE_IDA(mnt_group_ida);
>  static struct hlist_head *mount_hashtable __read_mostly;
>  static struct hlist_head *mountpoint_hashtable __read_mostly;
>  static struct kmem_cache *mnt_cache __read_mostly;
> -static DECLARE_RWSEM(namespace_sem);
> +DECLARE_RWSEM(namespace_sem);
> +#ifdef CONFIG_LOCKDEP
> +EXPORT_SYMBOL_GPL(namespace_sem);
> +#endif
>  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1ea326c368f7..6277435f6748 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -80,6 +80,10 @@ static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
>  	return smp_load_acquire(&mnt->mnt_idmap);
>  }
>  
> +#ifdef CONFIG_LOCKDEP
> +extern struct rw_semaphore namespace_sem;
> +#endif

Nope, we're not putting namespace_sem in a header. The code it protects
is massively sensitive and it interacts with mount_lock and other locks.
This stays private to fs/namespace.c as far as I'm concerned.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly
@ 2023-05-23 10:42     ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2023-05-23 10:42 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Alexander Viro, Tejun Heo,
	Zefan Li, Johannes Weiner, Dave Chinner, Rik van Riel,
	Jiri Wiesner

On Tue, May 02, 2023 at 03:38:46PM +0200, Michal Koutný wrote:
> The function current_cgns_cgroup_from_root() expects a stable
> cgroup_root, which is currently ensured with RCU read side paired with
> cgroup_destroy_root() called after RCU period.
> 
> The particular current_cgns_cgroup_from_root() is called from VFS code
> and cgroup_root stability can be also ensured by namespace_sem. Mark it
> explicitly as a preparation for further rework.
> 
> Signed-off-by: Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org>
> ---
>  fs/namespace.c         | 5 ++++-
>  include/linux/mount.h  | 4 ++++
>  kernel/cgroup/cgroup.c | 7 +++----
>  3 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 54847db5b819..0d2333832064 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -71,7 +71,10 @@ static DEFINE_IDA(mnt_group_ida);
>  static struct hlist_head *mount_hashtable __read_mostly;
>  static struct hlist_head *mountpoint_hashtable __read_mostly;
>  static struct kmem_cache *mnt_cache __read_mostly;
> -static DECLARE_RWSEM(namespace_sem);
> +DECLARE_RWSEM(namespace_sem);
> +#ifdef CONFIG_LOCKDEP
> +EXPORT_SYMBOL_GPL(namespace_sem);
> +#endif
>  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1ea326c368f7..6277435f6748 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -80,6 +80,10 @@ static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
>  	return smp_load_acquire(&mnt->mnt_idmap);
>  }
>  
> +#ifdef CONFIG_LOCKDEP
> +extern struct rw_semaphore namespace_sem;
> +#endif

Nope, we're not putting namespace_sem in a header. The code it protects
is massively sensitive and it interacts with mount_lock and other locks.
This stays private to fs/namespace.c as far as I'm concerned.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths
  2023-05-02 13:38 ` Michal Koutný
@ 2023-05-23 12:09   ` Christian Brauner
  -1 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2023-05-23 12:09 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro, Tejun Heo,
	Zefan Li, Johannes Weiner, Dave Chinner, Rik van Riel,
	Jiri Wiesner

On Tue, May 02, 2023 at 03:38:44PM +0200, Michal Koutný wrote:
> Idea for these modification came up when css_set_lock seemed unneeded in
> cgroup_show_path.
> It's a delicate change, so the deciding factor was when cgroup_show_path popped
> up also in some profiles of frequent mountinfo readers.
> The idea is to trade the exclusive css_set_lock for the shared
> namespace_sem when rendering cgroup paths. Details are described more in

I have no issue with the cgroup specific part of relying on
namespace_sem but kernel/cgroup/ has no business of being aware of
namespace semaphore in any way. Leave a comment to clarify what you're
doing but we're not going to sprinkle namespace_sem references - even if
only for the sake of lockdep - into other subsystems.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths
@ 2023-05-23 12:09   ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2023-05-23 12:09 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-fsdevel, linux-kernel, cgroups, Alexander Viro, Tejun Heo,
	Zefan Li, Johannes Weiner, Dave Chinner, Rik van Riel,
	Jiri Wiesner

On Tue, May 02, 2023 at 03:38:44PM +0200, Michal KoutnÃ½ wrote:
> Idea for these modification came up when css_set_lock seemed unneeded in
> cgroup_show_path.
> It's a delicate change, so the deciding factor was when cgroup_show_path popped
> up also in some profiles of frequent mountinfo readers.
> The idea is to trade the exclusive css_set_lock for the shared
> namespace_sem when rendering cgroup paths. Details are described more in

I have no issue with the cgroup specific part of relying on
namespace_sem but kernel/cgroup/ has no business of being aware of
namespace semaphore in any way. Leave a comment to clarify what you're
doing but we're not going to sprinkle namespace_sem references - even if
only for the sake of lockdep - into other subsystems.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly
  2023-05-23 10:42     ` Christian Brauner
  (?)
@ 2023-05-23 19:12     ` Tejun Heo
  -1 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2023-05-23 19:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michal Koutný,
	linux-fsdevel, linux-kernel, cgroups, Alexander Viro, Zefan Li,
	Johannes Weiner, Dave Chinner, Rik van Riel, Jiri Wiesner

Hello,

On Tue, May 23, 2023 at 12:42:46PM +0200, Christian Brauner wrote:
...
> Nope, we're not putting namespace_sem in a header. The code it protects
> is massively sensitive and it interacts with mount_lock and other locks.
> This stays private to fs/namespace.c as far as I'm concerned.

Michal, would it make sense to add a separate locking in cgroup.c? It'll add
a bit more overhead but not massively so and we should be able to get
similar gain without entangling with namespace locking.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-05-23 19:12 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-02 13:38 [RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths Michal Koutný
2023-05-02 13:38 ` Michal Koutný
2023-05-02 13:38 ` [RFC PATCH 1/3] cgroup: Drop unused function for cgroup_path Michal Koutný
2023-05-02 13:38   ` Michal Koutný
2023-05-02 19:58   ` Waiman Long
2023-05-02 13:38 ` [RFC PATCH 2/3] cgroup: Rely on namespace_sem in current_cgns_cgroup_from_root explicitly Michal Koutný
2023-05-02 13:38   ` Michal Koutný
2023-05-02 19:50   ` Waiman Long
2023-05-23 10:42   ` Christian Brauner
2023-05-23 10:42     ` Christian Brauner
2023-05-23 19:12     ` Tejun Heo
2023-05-02 13:38 ` [RFC PATCH 3/3] cgroup: Do not take css_set_lock in cgroup_show_path Michal Koutný
2023-05-02 13:38   ` Michal Koutný
2023-05-02 19:56   ` Waiman Long
2023-05-02 19:56     ` Waiman Long
2023-05-05 15:45   ` Tejun Heo
2023-05-05 17:32     ` Michal Koutný
2023-05-05 18:17       ` Tejun Heo
2023-05-09 10:34         ` Michal Koutný
2023-05-22 20:55           ` Tejun Heo
2023-05-23 12:09 ` [RFC PATCH 0/3] Rework locking when rendering mountinfo cgroup paths Christian Brauner
2023-05-23 12:09   ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.