* [PATCH 0/2] shm: shm_rmid_forced feature fixes
@ 2021-10-27 22:43 Alexander Mikhalitsyn
  2021-10-27 22:43 ` [PATCH 1/2] ipc: WARN if trying to remove ipc object which is absent Alexander Mikhalitsyn
  2021-10-27 22:43 ` [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
  0 siblings, 2 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2021-10-27 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Eric W. Biederman, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Manfred Spraul, Alexander Mikhalitsyn, stable

There is a long story behind all of this...

Some time ago I hit a kernel crash after a CRIU restore procedure.
Fortunately, since it was a CRIU restore, I had the dump files and
could repeat the restore many times, and the crash reproduced easily.
After some investigation I constructed a minimal reproducer. It
turned out to be a use-after-free, and it happens only if
sysctl kernel.shm_rmid_forced = 1.

The key to the problem is that the exit_shm() function does not
handle destroying shp objects when task->sysvshm.shm_clist contains
items from different IPC namespaces. In most cases this list will
contain items from only one IPC namespace.

Why may this list contain objects from different namespaces?
exit_shm() is designed to clean up this list whenever a process
leaves its IPC namespace. But we made a mistake a long time ago and
did not add an exit_shm() call to the setns() syscall procedures.
The first idea was simply to add this call to setns(), but that
obviously changes the semantics of the setns() syscall and is a
userspace-visible change. So, I gave up on this idea.

The first real attempt to address the issue was simply to skip the
forced destroy if we meet an shp object that is not from the current
task's IPC namespace [1]. But that was not the best idea, because
task->sysvshm.shm_clist was protected by the rwsem belonging to the
current task's IPC namespace. This means that list corruption could
occur.

The second approach was to extend exit_shm() to properly handle
shp objects from different IPC namespaces [2]. This is a really
non-trivial thing; I put a lot of effort into it, but I did not
believe it was possible to make it fully safe, clean and clear.

Thanks to the efforts of Manfred Spraul, a working and elegant
solution was designed. Thanks a lot, Manfred!

Eric also suggested a way to address the issue in
("[RFC][PATCH] shm: In shm_exit destroy all created and never attached segments").
Eric's idea was to maintain a list of shm_clists, one per IPC namespace,
using lock-less lists. But there were some concerns about the extra
memory consumption this would cause.

An alternative solution, which I suggested, was implemented in
("shm: reset shm_clist on setns but omit forced shm destroy").
The idea is pretty simple: we add an exit_shm() call to setns() but
DO NOT destroy shm segments even if sysctl kernel.shm_rmid_forced = 1;
we just clean up the task->sysvshm.shm_clist list. This changes the
semantics of the setns() syscall a little bit, but in comparison to
the "naive" solution of adding exit_shm() without any special
exclusions, it looks like a safer option.

[1] https://lkml.org/lkml/2021/7/6/1108
[2] https://lkml.org/lkml/2021/7/14/736

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Alexander Mikhalitsyn (2):
  ipc: WARN if trying to remove ipc object which is absent
  shm: extend forced shm destroy to support objects from several IPC
    nses

 include/linux/ipc_namespace.h |  15 +++
 include/linux/sched/task.h    |   2 +-
 include/linux/shm.h           |   2 +-
 ipc/shm.c                     | 170 +++++++++++++++++++++++++---------
 ipc/util.c                    |   6 +-
 5 files changed, 145 insertions(+), 50 deletions(-)

-- 
2.31.1



* [PATCH 1/2] ipc: WARN if trying to remove ipc object which is absent
  2021-10-27 22:43 [PATCH 0/2] shm: shm_rmid_forced feature fixes Alexander Mikhalitsyn
@ 2021-10-27 22:43 ` Alexander Mikhalitsyn
  2021-10-27 22:43 ` [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
  1 sibling, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2021-10-27 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Eric W. Biederman, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Manfred Spraul, Alexander Mikhalitsyn, stable

Let's produce a warning if we try to remove a non-existing
IPC object from the IPC namespace kht/idr structures.

This allows us to catch possible bugs when the ipc_rmid() function
is called with inconsistent struct ipc_ids * and struct
kern_ipc_perm * arguments.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
---
 ipc/util.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 0027e47626b7..b28003c653d1 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -447,8 +447,8 @@ static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
 static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 {
 	if (ipcp->key != IPC_PRIVATE)
-		rhashtable_remove_fast(&ids->key_ht, &ipcp->khtnode,
-				       ipc_kht_params);
+		WARN_ON_ONCE(rhashtable_remove_fast(&ids->key_ht, &ipcp->khtnode,
+				       ipc_kht_params));
 }
 
 /**
@@ -498,7 +498,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 {
 	int idx = ipcid_to_idx(ipcp->id);
 
-	idr_remove(&ids->ipcs_idr, idx);
+	WARN_ON_ONCE(idr_remove(&ids->ipcs_idr, idx) != ipcp);
 	ipc_kht_remove(ids, ipcp);
 	ids->in_use--;
 	ipcp->deleted = true;
-- 
2.31.1



* [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses
  2021-10-27 22:43 [PATCH 0/2] shm: shm_rmid_forced feature fixes Alexander Mikhalitsyn
  2021-10-27 22:43 ` [PATCH 1/2] ipc: WARN if trying to remove ipc object which is absent Alexander Mikhalitsyn
@ 2021-10-27 22:43 ` Alexander Mikhalitsyn
  2021-10-30  4:26   ` Eric W. Biederman
  2021-11-05 17:46   ` Eric W. Biederman
  1 sibling, 2 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2021-10-27 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Eric W. Biederman, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Manfred Spraul, Alexander Mikhalitsyn, stable

Currently, the exit_shm() function is not designed to work properly
when task->sysvshm.shm_clist holds shm objects from different IPC
namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
it leads to a use-after-free (a reproducer exists).

This particular patch is an attempt to fix the problem by extending
the exit_shm mechanism to handle destroying shm objects from several
IPC namespaces.

To achieve that we do several things:
1. add a namespace (non-refcounted) pointer to struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall), initialize
this pointer from the current task's IPC namespace
3. fully rework exit_shm() so that it traverses all shp objects in
task->sysvshm.shm_clist and gets the IPC namespace not from the
current task, as it did before, but from the shp object itself, then
calls shm_destroy(shp, ns).

Note: we need to be really careful here because, as was said in (1),
our pointer to the IPC namespace is non-refcounted. To be on the safe
side we use the special helper get_ipc_ns_not_zero(), which takes the
IPC namespace refcount only if the namespace is not already in the
process of being destroyed.

Q/A

Q: Why can we access shp->ns memory through a non-refcounted pointer?
A: Because the shp object lifetime is always shorter than the IPC
namespace lifetime, so if we get the shp object from
task->sysvshm.shm_clist while holding task_lock(task), nobody can
steal our namespace.

Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
A: No. It just fixes a non-covered case where a process may leave an
IPC namespace without getting its task->sysvshm.shm_clist list cleaned
up.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
---
 include/linux/ipc_namespace.h |  15 +++
 include/linux/sched/task.h    |   2 +-
 include/linux/shm.h           |   2 +-
 ipc/shm.c                     | 170 +++++++++++++++++++++++++---------
 4 files changed, 142 insertions(+), 47 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..b75395ec8d52 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	if (ns) {
+		if (refcount_inc_not_zero(&ns->ns.count))
+			return ns;
+	}
+
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
+
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..bfdf84dab4be 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/include/linux/shm.h b/include/linux/shm.h
index d8e69aed3d32..709f6d0451c0 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -11,7 +11,7 @@ struct file;
 
 #ifdef CONFIG_SYSVIPC
 struct sysv_shm {
-	struct list_head shm_clist;
+	struct list_head	shm_clist;
 };
 
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e376ca..29667e17b12a 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
 	struct pid		*shm_lprid;
 	struct ucounts		*mlock_ucounts;
 
-	/* The task created the shm object.  NULL if the task is dead. */
+	/*
+	 * The task created the shm object, for looking up
+	 * task->sysvshm.shm_clist_lock
+	 */
 	struct task_struct	*shm_creator;
-	struct list_head	shm_clist;	/* list by creator */
+
+	/*
+	 * list by creator. shm_clist_lock required for read/write
+	 * if list_empty(), then the creator is dead already
+	 */
+	struct list_head	shm_clist;
+	struct ipc_namespace	*ns;
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct shmid_kernel *shp;
 
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+	WARN_ON(ns != shp->ns);
 
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
 	kfree(shp);
 }
 
-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
 {
-	list_del(&s->shm_clist);
-	ipc_rmid(&shm_ids(ns), &s->shm_perm);
+	struct task_struct *creator;
+
+	/*
+	 * A concurrent exit_shm may do a list_del_init() as well.
+	 * Just do nothing if exit_shm already did the work
+	 */
+	if (list_empty(&shp->shm_clist))
+		return;
+
+	/*
+	 * shp->shm_creator is guaranteed to be valid *only*
+	 * if shp->shm_clist is not empty.
+	 */
+	creator = shp->shm_creator;
+
+	task_lock(creator);
+	list_del_init(&shp->shm_clist);
+	task_unlock(creator);
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
+{
+	shm_clist_rm(s);
+	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
 }
 
 
@@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	shm_rmid(ns, shp);
+	shm_rmid(shp);
 	shm_unlock(shp);
 	if (!is_file_hugepages(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  *
  * 2) sysctl kernel.shm_rmid_forced is set to 1.
  */
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
 {
 	return (shp->shm_nattch == 0) &&
-	       (ns->shm_rmid_forced ||
+	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
 }
 
@@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
@@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (!list_empty(&shp->shm_clist))
 		return 0;
 
-	if (shm_may_destroy(ns, shp)) {
+	if (shm_may_destroy(shp)) {
 		shm_lock_by_ptr(shp);
 		shm_destroy(ns, shp);
 	}
@@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
-	struct shmid_kernel *shp, *n;
+	for (;;) {
+		struct shmid_kernel *shp;
+		struct ipc_namespace *ns;
 
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
+		task_lock(task);
+
+		if (list_empty(&task->sysvshm.shm_clist)) {
+			task_unlock(task);
+			break;
+		}
+
+		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+				shm_clist);
+
+		/* 1) unlink */
+		list_del_init(&shp->shm_clist);
 
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
 		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
+		 * 2) Get a pointer to the ipc namespace. It is worth
+		 * noting that this pointer is guaranteed to be valid
+		 * because the shp lifetime is always shorter than the
+		 * lifetime of the namespace in which shp lives.
+		 * We have taken task_lock(), so shp won't be freed.
 		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+		ns = shp->ns;
 
-	/*
-	 * Destroy all already created segments, that were not yet mapped,
-	 * and mark any mapped as orphan to cover the sysctl toggling.
-	 * Destroy is skipped if shm_may_destroy() returns false.
-	 */
-	down_write(&shm_ids(ns).rwsem);
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		shp->shm_creator = NULL;
+		/*
+		 * 3) If kernel.shm_rmid_forced is not set then only keep track of
+		 * which shmids are orphaned, so that a later set of the sysctl
+		 * can clean them up.
+		 */
+		if (!ns->shm_rmid_forced) {
+			task_unlock(task);
+			continue;
+		}
 
-		if (shm_may_destroy(ns, shp)) {
+		/*
+		 * 4) get a reference to the namespace.
+		 *    The refcount could be already 0. If it is 0, then
+		 *    the shm objects will be free by free_ipc_work().
+		 */
+		ns = get_ipc_ns_not_zero(ns);
+		if (ns) {
+			/*
+			 * 5) get a reference to the shp itself.
+			 *   This cannot fail: shm_clist_rm() is called before
+			 *   ipc_rmid(), thus the refcount cannot be 0.
+			 */
+			WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
+		}
+
+		task_unlock(task);
+
+		if (ns) {
+			down_write(&shm_ids(ns).rwsem);
 			shm_lock_by_ptr(shp);
-			shm_destroy(ns, shp);
+			/*
+			 * rcu_read_lock was implicitly taken in
+			 * shm_lock_by_ptr, it's safe to call
+			 * ipc_rcu_putref here
+			 */
+			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+			if (ipc_valid_object(&shp->shm_perm)) {
+				if (shm_may_destroy(shp))
+					shm_destroy(ns, shp);
+				else
+					shm_unlock(shp);
+			} else {
+				/*
+				 * Someone else deleted the shp from namespace
+				 * idr/kht while we have waited.
+				 * Just unlock and continue.
+				 */
+				shm_unlock(shp);
+			}
+
+			up_write(&shm_ids(ns).rwsem);
+			put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
 		}
 	}
-
-	/* Remove the list head from any segments still attached. */
-	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (error < 0)
 		goto no_id;
 
+	shp->ns = ns;
+
+	task_lock(current);
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	task_unlock(current);
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
-- 
2.31.1



* Re: [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses
  2021-10-27 22:43 ` [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
@ 2021-10-30  4:26   ` Eric W. Biederman
  2021-10-30 13:11     ` Manfred Spraul
  2021-11-05 17:46   ` Eric W. Biederman
  1 sibling, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2021-10-30  4:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: linux-kernel, Andrew Morton, Davidlohr Bueso, Greg KH,
	Andrei Vagin, Pavel Tikhomirov, Vasily Averin, Manfred Spraul,
	Alexander Mikhalitsyn, stable

Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> writes:

> Currently, exit_shm function not designed to work properly when
> task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
>
> This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
> it leads to use-after-free (reproducer exists).
>
> That particular patch is attempt to fix the problem by extending exit_shm
> mechanism to handle shm's destroy from several IPC ns'es.
>
> To achieve that we do several things:
> 1. add namespace (non-refcounted) pointer to the struct shmid_kernel
> 2. during new shm object creation (newseg()/shmget syscall) we initialize
> this pointer by current task IPC ns
> 3. exit_shm() fully reworked such that it traverses over all
> shp's in task->sysvshm.shm_clist and gets IPC namespace not
> from current task as it was before but from shp's object itself, then
> call shm_destroy(shp, ns).
>
> Note. We need to be really careful here, because as it was said before
> (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
> special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
> only if IPC ns not in the "state of destruction".




> Q/A
>
> Q: Why we can access shp->ns memory using non-refcounted pointer?
> A: Because shp object lifetime is always shorther
> than IPC namespace lifetime, so, if we get shp object from the
> task->sysvshm.shm_clist while holding task_lock(task) nobody can
> steal our namespace.

Not true.  A struct shmid_kernel can outlive the namespace in which it
was created.  If you look at do_shm_rmid, which is called for every
shmid_kernel in the namespace when the namespace is destroyed, you
will see that if the struct shmid_kernel still has users, only
ipc_set_key_private is called.  The struct shmid_kernel continues to
exist.

> Q: Does this patch change semantics of unshare/setns/clone syscalls?
> A: Not. It's just fixes non-covered case when process may leave
> IPC namespace without getting task->sysvshm.shm_clist list cleaned up.


Just from reading through exit_shm, the code is not currently safe.

At a minimum, do_shm_rmid needs to set shp->ns to NULL.  Otherwise
the struct shmid_kernel can contain a namespace pointer after the
namespace exits, which results in a different use after free.


Beyond that, there is the dropping of the task lock.  The code holds a
reference to the namespace, which means that it does not need to worry
about free_ipcs.  But references from mappings are still possible.

Which means that the code could see:
exit_shm()
   task_lock()
   shp = ...;
   task_unlock()
                                     shm_close()
                                         down_write(&shm_ids(ns).rwsem);
                                         ...
                                         shm_destroy(shp);
                                         up_write(&shm_ids(ns).rwsem);
   down_write(&shm_ids(ns)->rwsem);
   shm_lock_by_ptr(shp);	/* use after free */


I am trying to imagine how to close that race with the current code
structure.  Maybe something could be done by looking at the shm_nattch
count and making it safe to look at that count under the task_lock.

But even then, because the shmid_kernel is still in the hash table, it
could be mapped and unmapped in the window when task_lock was dropped.
Alternatively, shmctl(id, IPC_RMID) can be called while task_lock is
dropped.  Much less code is involved there than in mapping and
unmapping, so it is much more likely to win the race.

I don't see how that race can be closed.

Am I missing something?

Eric


> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: Vasily Averin <vvs@virtuozzo.com>
> Cc: Manfred Spraul <manfred@colorfullife.com>
> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
> Cc: stable@vger.kernel.org
> Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
> ---
>  include/linux/ipc_namespace.h |  15 +++
>  include/linux/sched/task.h    |   2 +-
>  include/linux/shm.h           |   2 +-
>  ipc/shm.c                     | 170 +++++++++++++++++++++++++---------
>  4 files changed, 142 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index 05e22770af51..b75395ec8d52 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>  	return ns;
>  }
>  
> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
> +{
> +	if (ns) {
> +		if (refcount_inc_not_zero(&ns->ns.count))
> +			return ns;
> +	}
> +
> +	return NULL;
> +}
> +
>  extern void put_ipc_ns(struct ipc_namespace *ns);
>  #else
>  static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
> @@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>  	return ns;
>  }
>  
> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
> +{
> +	return ns;
> +}
> +
>  static inline void put_ipc_ns(struct ipc_namespace *ns)
>  {
>  }
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ef02be869cf2..bfdf84dab4be 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>   * pins the final release of task.io_context.  Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),
> diff --git a/include/linux/shm.h b/include/linux/shm.h
> index d8e69aed3d32..709f6d0451c0 100644
> --- a/include/linux/shm.h
> +++ b/include/linux/shm.h
> @@ -11,7 +11,7 @@ struct file;
>  
>  #ifdef CONFIG_SYSVIPC
>  struct sysv_shm {
> -	struct list_head shm_clist;
> +	struct list_head	shm_clist;
>  };
>  
>  long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 748933e376ca..29667e17b12a 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
>  	struct pid		*shm_lprid;
>  	struct ucounts		*mlock_ucounts;
>  
> -	/* The task created the shm object.  NULL if the task is dead. */
> +	/*
> +	 * The task created the shm object, for looking up
> +	 * task->sysvshm.shm_clist_lock
> +	 */
>  	struct task_struct	*shm_creator;
> -	struct list_head	shm_clist;	/* list by creator */
> +
> +	/*
> +	 * list by creator. shm_clist_lock required for read/write
> +	 * if list_empty(), then the creator is dead already
> +	 */
> +	struct list_head	shm_clist;
> +	struct ipc_namespace	*ns;
>  } __randomize_layout;
>  
>  /* shm_mode upper byte flags */
> @@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>  	struct shmid_kernel *shp;
>  
>  	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> +	WARN_ON(ns != shp->ns);
>  
>  	if (shp->shm_nattch) {
>  		shp->shm_perm.mode |= SHM_DEST;
> @@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
>  	kfree(shp);
>  }
>  
> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
> +/*
> + * It has to be called with shp locked.
> + * It must be called before ipc_rmid()
> + */
> +static inline void shm_clist_rm(struct shmid_kernel *shp)
>  {
> -	list_del(&s->shm_clist);
> -	ipc_rmid(&shm_ids(ns), &s->shm_perm);
> +	struct task_struct *creator;
> +
> +	/*
> +	 * A concurrent exit_shm may do a list_del_init() as well.
> +	 * Just do nothing if exit_shm already did the work
> +	 */
> +	if (list_empty(&shp->shm_clist))
> +		return;
> +
> +	/*
> +	 * shp->shm_creator is guaranteed to be valid *only*
> +	 * if shp->shm_clist is not empty.
> +	 */
> +	creator = shp->shm_creator;
> +
> +	task_lock(creator);
> +	list_del_init(&shp->shm_clist);
> +	task_unlock(creator);

Lock ordering
   rwsem
       ipc_lock
          task_lock
       
> +}
> +
> +static inline void shm_rmid(struct shmid_kernel *s)
> +{
> +	shm_clist_rm(s);
> +	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
>  }
>  
>  
> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>  	shm_file = shp->shm_file;
>  	shp->shm_file = NULL;
>  	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -	shm_rmid(ns, shp);
> +	shm_rmid(shp);
>  	shm_unlock(shp);
>  	if (!is_file_hugepages(shm_file))
>  		shmem_lock(shm_file, 0, shp->mlock_ucounts);
> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>   *
>   * 2) sysctl kernel.shm_rmid_forced is set to 1.
>   */
> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> +static bool shm_may_destroy(struct shmid_kernel *shp)
>  {
>  	return (shp->shm_nattch == 0) &&
> -	       (ns->shm_rmid_forced ||
> +	       (shp->ns->shm_rmid_forced ||
>  		(shp->shm_perm.mode & SHM_DEST));
>  }
>  
> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
>  	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>  	shp->shm_dtim = ktime_get_real_seconds();
>  	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +	if (shm_may_destroy(shp))
>  		shm_destroy(ns, shp);
>  	else
>  		shm_unlock(shp);
> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>  	 *
>  	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>  	 */
> -	if (shp->shm_creator != NULL)
> +	if (!list_empty(&shp->shm_clist))
>  		return 0;
>  
> -	if (shm_may_destroy(ns, shp)) {
> +	if (shm_may_destroy(shp)) {
>  		shm_lock_by_ptr(shp);
>  		shm_destroy(ns, shp);
>  	}
> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>  /* Locking assumes this will only be called with task == current */
>  void exit_shm(struct task_struct *task)
>  {
> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> -	struct shmid_kernel *shp, *n;
> +	for (;;) {
> +		struct shmid_kernel *shp;
> +		struct ipc_namespace *ns;
>  
> -	if (list_empty(&task->sysvshm.shm_clist))
> -		return;
> +		task_lock(task);
> +
> +		if (list_empty(&task->sysvshm.shm_clist)) {
> +			task_unlock(task);
> +			break;
> +		}
> +
> +		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
> +				shm_clist);
> +
> +		/* 1) unlink */
> +		list_del_init(&shp->shm_clist);
>  
> -	/*
> -	 * If kernel.shm_rmid_forced is not set then only keep track of
> -	 * which shmids are orphaned, so that a later set of the sysctl
> -	 * can clean them up.
> -	 */
> -	if (!ns->shm_rmid_forced) {
> -		down_read(&shm_ids(ns).rwsem);
> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> -			shp->shm_creator = NULL;
>  		/*
> -		 * Only under read lock but we are only called on current
> -		 * so no entry on the list will be shared.
> +		 * 2) Get pointer to the ipc namespace. It is worth to say
> +		 * that this pointer is guaranteed to be valid because
> +		 * shp lifetime is always shorter than namespace lifetime
> +		 * in which shp lives.
> +		 * We taken task_lock it means that shp won't be freed.
>  		 */
> -		list_del(&task->sysvshm.shm_clist);
> -		up_read(&shm_ids(ns).rwsem);
> -		return;
> -	}
> +		ns = shp->ns;
>  
> -	/*
> -	 * Destroy all already created segments, that were not yet mapped,
> -	 * and mark any mapped as orphan to cover the sysctl toggling.
> -	 * Destroy is skipped if shm_may_destroy() returns false.
> -	 */
> -	down_write(&shm_ids(ns).rwsem);
> -	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> -		shp->shm_creator = NULL;
> +		/*
> +		 * 3) If kernel.shm_rmid_forced is not set then only keep track of
> +		 * which shmids are orphaned, so that a later set of the sysctl
> +		 * can clean them up.
> +		 */
> +		if (!ns->shm_rmid_forced) {
> +			task_unlock(task);
> +			continue;
> +		}
>  
> -		if (shm_may_destroy(ns, shp)) {
> +		/*
> +		 * 4) get a reference to the namespace.
> +		 *    The refcount could be already 0. If it is 0, then
> +		 *    the shm objects will be free by free_ipc_work().
> +		 */
> +		ns = get_ipc_ns_not_zero(ns);
> +		if (ns) {
> +			/*
> +			 * 5) get a reference to the shp itself.
> +			 *   This cannot fail: shm_clist_rm() is called before
> +			 *   ipc_rmid(), thus the refcount cannot be 0.
> +			 */
> +			WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
> +		}
> +
> +		task_unlock(task);

<<<<<<<<< BOOM >>>>>>>

I don't see anything that prevents another task from
calling shm_destroy(ns, shp) here and freeing it before
this task can take the rwsem for writing.

> +
> +		if (ns) {
> +			down_write(&shm_ids(ns).rwsem);
>  			shm_lock_by_ptr(shp);
> -			shm_destroy(ns, shp);
> +			/*
> +			 * rcu_read_lock was implicitly taken in
> +			 * shm_lock_by_ptr, it's safe to call
> +			 * ipc_rcu_putref here
> +			 */
> +			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
> +
> +			if (ipc_valid_object(&shp->shm_perm)) {
> +				if (shm_may_destroy(shp))
> +					shm_destroy(ns, shp);
> +				else
> +					shm_unlock(shp);
> +			} else {
> +				/*
> +				 * Someone else deleted the shp from namespace
> +				 * idr/kht while we have waited.
> +				 * Just unlock and continue.
> +				 */
> +				shm_unlock(shp);
> +			}
> +
> +			up_write(&shm_ids(ns).rwsem);
> +			put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
>  		}
>  	}
> -
> -	/* Remove the list head from any segments still attached. */
> -	list_del(&task->sysvshm.shm_clist);
> -	up_write(&shm_ids(ns).rwsem);
>  }
>  
>  static vm_fault_t shm_fault(struct vm_fault *vmf)
> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>  	if (error < 0)
>  		goto no_id;
>  
> +	shp->ns = ns;
> +
> +	task_lock(current);
>  	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
> +	task_unlock(current);
>  
>  	/*
>  	 * shmid gets reported as "inode#" in /proc/pid/maps.
> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
>  	down_write(&shm_ids(ns).rwsem);
>  	shp = shm_lock(ns, shmid);
>  	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +
> +	if (shm_may_destroy(shp))
>  		shm_destroy(ns, shp);
>  	else
>  		shm_unlock(shp);


* Re: [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses
  2021-10-30  4:26   ` Eric W. Biederman
@ 2021-10-30 13:11     ` Manfred Spraul
  0 siblings, 0 replies; 12+ messages in thread
From: Manfred Spraul @ 2021-10-30 13:11 UTC (permalink / raw)
  To: Eric W. Biederman, Alexander Mikhalitsyn
  Cc: linux-kernel, Andrew Morton, Davidlohr Bueso, Greg KH,
	Andrei Vagin, Pavel Tikhomirov, Vasily Averin,
	Alexander Mikhalitsyn, stable

[-- Attachment #1: Type: text/plain, Size: 17784 bytes --]

On 10/30/21 06:26, Eric W. Biederman wrote:
> Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> writes:
>
>> Currently, exit_shm function not designed to work properly when
>> task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
>>
>> This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
>> it leads to use-after-free (reproducer exists).
>>
>> That particular patch is attempt to fix the problem by extending exit_shm
>> mechanism to handle shm's destroy from several IPC ns'es.
>>
>> To achieve that we do several things:
>> 1. add namespace (non-refcounted) pointer to the struct shmid_kernel
>> 2. during new shm object creation (newseg()/shmget syscall) we initialize
>> this pointer by current task IPC ns
>> 3. exit_shm() fully reworked such that it traverses over all
>> shp's in task->sysvshm.shm_clist and gets IPC namespace not
>> from current task as it was before but from shp's object itself, then
>> call shm_destroy(shp, ns).
>>
>> Note. We need to be really careful here, because as it was said before
>> (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
>> special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
>> only if IPC ns not in the "state of destruction".
>
>
>
>> Q/A
>>
>> Q: Why we can access shp->ns memory using non-refcounted pointer?
>> A: Because shp object lifetime is always shorther
>> than IPC namespace lifetime, so, if we get shp object from the
>> task->sysvshm.shm_clist while holding task_lock(task) nobody can
>> steal our namespace.
> Not true.  A struct shmid_kernel can outlive the namespace in which it
> was created.  I you look at do_shm_rmid which is called when the
> namespace is destroyed for every shmid_kernel in the namespace that if
> the struct shmid_kernel still has users only ipc_set_key_private is
> called.  The struct shmid_kernel continues to exist.

No, shm_nattach is always 0 when a namespace is destroyed.

Thus it is impossible for a shmid_kernel to continue to exist after its namespace is destroyed.

Let's check all shm_nattach modifications:

1) do_shmat:

     shp->shm_nattach++;

     sfd->ns = get_ipc_ns(ns);

     shp->shm_nattach--;

pairs with

    shm_release()

         put_ipc_ns()

2) shm_open()

only shp->shm_nattach++

shm_open() unconditionally accesses shm_file_data, i.e. sfd must be valid, so there must be a reference to the namespace.

pairs with shm_close()

only shp->shm_nattach--;

shm_close() unconditionally accesses shm_file_data, i.e. sfd must be valid, so there must be a reference to the namespace.

As shm_open()/shm_close() "nest" inside do_shmat(), there is always a get_ipc_ns().

Or, much simpler: Check shm_open() and shm_close():

These two functions address a shm segment by namespace and ID, not by a shm pointer. Thus _if_ it were possible for shm_nattach to be > 0 at namespace destruction, there would be far more issues.
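The pairing argument above can be modeled in a few lines of userspace C. This is only an illustrative sketch with made-up `model_*` names, not kernel code: every operation that raises shm_nattach also holds a namespace reference, so the namespace refcount cannot reach its base while shm_nattach > 0.

```c
struct model_ns  { int refcount; };
struct model_shp { int nattach; struct model_ns *ns; };

/* models do_shmat(): the mapping pins the namespace via sfd->ns */
static void model_attach(struct model_shp *shp)
{
	shp->ns->refcount++;	/* get_ipc_ns(ns), stored in sfd */
	shp->nattach++;		/* shm_open() */
}

/* models unmapping: shm_close(), then shm_release()/put_ipc_ns() */
static void model_detach(struct model_shp *shp)
{
	shp->nattach--;
	shp->ns->refcount--;
}
```

In this model the namespace refcount is always >= shm_nattach plus the base reference, which is the invariant Manfred's walk through the shm_nattach modifications establishes.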


Or: Attached are a log file, a test application, and a patch that adds 
pr_info statements.

The namespace is destroyed immediately when no segments are mapped; the 
destruction is delayed until exit() if there are mapped segments.


>> Q: Does this patch change semantics of unshare/setns/clone syscalls?
>> A: Not. It's just fixes non-covered case when process may leave
>> IPC namespace without getting task->sysvshm.shm_clist list cleaned up.
>
> Just reading through exit_shm the code is not currently safe.
>
> At a minimum do_shm_rmid needs to set the shp->ns to NULL.  Otherwise
> the struct shmid_kernel can contain a namespace pointer after
> the namespace exits.  Which results in a different use after free.
No [unless there are additional bugs]
>
> Beyond that there is dropping the task lock.  The code holds a reference
> to the namespace which means that the code does not need to worry about
> free_ipcs.  References from mappings are still possible.
>
> Which means that the code could see:
> exit_shm()
>     task_lock()
>     shp = ...;

>     task_unlock()
>                                       shm_close()
>                                           down_write(&shm_ids(ns).rwsem);
>                                           ...
>                                           shm_destroy(shp);
>                                           up_write(&shm_ids(ns).rwsem);
>     down_write(&shm_ids(ns)->rwsem);
>     shm_lock_by_ptr(shp);	/* use after free */
>
>
> I am trying to imagine how to close that race with the current code
> structure.  Maybe something could be done by looking at shm_nattach
> count and making it safe to look at that count under the task_lock.

There is no race: before dropping task_lock, references to both the 
namespace and the shp are obtained.

Thus neither one can disappear.
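That ordering can be sketched as a minimal userspace model (illustrative names only; `ref_get_not_zero()` stands in for `refcount_inc_not_zero()` as used by `get_ipc_ns_not_zero()` and `ipc_rcu_getref()`): with the task lock held, both objects are pinned before the lock is dropped, and a namespace whose refcount already hit zero is simply skipped.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ref { atomic_int count; };

/* analogue of refcount_inc_not_zero(): succeed only if not already 0 */
static bool ref_get_not_zero(struct ref *r)
{
	int v = atomic_load(&r->count);

	while (v != 0)
		if (atomic_compare_exchange_weak(&r->count, &v, v + 1))
			return true;
	return false;
}

/* called with the task lock held (task_lock() in the real code) */
static bool pin_ns_and_shp(struct ref *ns, struct ref *shp)
{
	if (!ref_get_not_zero(ns))	/* get_ipc_ns_not_zero() */
		return false;		/* ns already dying: nothing to do */
	ref_get_not_zero(shp);		/* ipc_rcu_getref(); cannot fail
					   while shp is on shm_clist */
	return true;
}
```

Only after pin_ns_and_shp() succeeds is the task lock released, which is why dropping it does not reopen the use-after-free window.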

> But even then because shmid_kernel is still in the hash table it could
> be mapped and unmapped in the window when task_lock was dropped.

We have ipc_valid_object(), i.e. the perm->deleted flag. If it is set, 
the pointer and the spinlock are still valid, even though the rest is 
already destroyed.

ipc_rmid() just sets deleted; the (RCU-delayed) kfree is done via 
ipc_rcu_putref().
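The deleted-flag protocol can be shown as a toy model (hypothetical `model_*` names, not the kernel API): ipc_rmid() only marks the entry, so a task that re-acquires the locks can still safely test the flag and turn the whole operation into a NOP.

```c
#include <stdbool.h>

struct model_perm { bool deleted; };

/* ipc_valid_object() analogue: live unless ipc_rmid() marked it */
static bool model_ipc_valid(const struct model_perm *p)
{
	return !p->deleted;
}

/* destroyer side: ipc_rmid() only flips the flag; the actual kfree
 * is RCU-delayed and happens via ipc_rcu_putref() */
static void model_ipc_rmid(struct model_perm *p)
{
	p->deleted = true;
}

/* waiter side, after re-taking rwsem and the object lock */
static const char *model_after_relock(struct model_perm *p)
{
	return model_ipc_valid(p) ? "destroy-or-unlock" : "skip";
}
```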
> Alternatively shmctl(id, IPC_RMID) can be called in when task_lock is
> dropped.  Much less code is involved than mapping and unmapping so it is
> much more likely to win the race.
>
> I don't see how that race can be closed.
>
> Am I missing something?
>
> Eric
>
>
>> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
>>
>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Andrei Vagin <avagin@gmail.com>
>> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>> Cc: Vasily Averin <vvs@virtuozzo.com>
>> Cc: Manfred Spraul <manfred@colorfullife.com>
>> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
>> Cc: stable@vger.kernel.org
>> Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
>> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
>> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Should/can I mark that I have tested the code?

I would drop one change, and one comment is incorrect; otherwise no 
findings. See the attached 0002 patch.

Tested-by: Manfred Spraul <manfred@colorfullife.com>

>> ---
>>   include/linux/ipc_namespace.h |  15 +++
>>   include/linux/sched/task.h    |   2 +-
>>   include/linux/shm.h           |   2 +-
>>   ipc/shm.c                     | 170 +++++++++++++++++++++++++---------
>>   4 files changed, 142 insertions(+), 47 deletions(-)
>>
>> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>> index 05e22770af51..b75395ec8d52 100644
>> --- a/include/linux/ipc_namespace.h
>> +++ b/include/linux/ipc_namespace.h
>> @@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>>   	return ns;
>>   }
>>   
>> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
>> +{
>> +	if (ns) {
>> +		if (refcount_inc_not_zero(&ns->ns.count))
>> +			return ns;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>>   extern void put_ipc_ns(struct ipc_namespace *ns);
>>   #else
>>   static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
>> @@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>>   	return ns;
>>   }
>>   
>> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
>> +{
>> +	return ns;
>> +}
>> +
>>   static inline void put_ipc_ns(struct ipc_namespace *ns)
>>   {
>>   }
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index ef02be869cf2..bfdf84dab4be 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
>>    * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>    * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>    * pins the final release of task.io_context.  Also protects ->cpuset and
>> - * ->cgroup.subsys[]. And ->vfork_done.
>> + * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
>>    *
>>    * Nests both inside and outside of read_lock(&tasklist_lock).
>>    * It must not be nested with write_lock_irq(&tasklist_lock),
>> diff --git a/include/linux/shm.h b/include/linux/shm.h
>> index d8e69aed3d32..709f6d0451c0 100644
>> --- a/include/linux/shm.h
>> +++ b/include/linux/shm.h
>> @@ -11,7 +11,7 @@ struct file;
>>   
>>   #ifdef CONFIG_SYSVIPC
>>   struct sysv_shm {
>> -	struct list_head shm_clist;
>> +	struct list_head	shm_clist;
>>   };
>>   
This is a whitespace-only change. We can drop it.
>>   long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
>> diff --git a/ipc/shm.c b/ipc/shm.c
>> index 748933e376ca..29667e17b12a 100644
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
>>   	struct pid		*shm_lprid;
>>   	struct ucounts		*mlock_ucounts;
>>   
>> -	/* The task created the shm object.  NULL if the task is dead. */
>> +	/*
>> +	 * The task created the shm object, for looking up
>> +	 * task->sysvshm.shm_clist_lock
>> +	 */
>>   	struct task_struct	*shm_creator;
>> -	struct list_head	shm_clist;	/* list by creator */
>> +
>> +	/*
>> +	 * list by creator. shm_clist_lock required for read/write
>> +	 * if list_empty(), then the creator is dead already
>> +	 */
shm_clist_lock was replaced by task_lock(->shm_creator).
>> +	struct list_head	shm_clist;
>> +	struct ipc_namespace	*ns;
>>   } __randomize_layout;
>>   
>>   /* shm_mode upper byte flags */
>> @@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>>   	struct shmid_kernel *shp;
>>   
>>   	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
>> +	WARN_ON(ns != shp->ns);
>>   
>>   	if (shp->shm_nattch) {
>>   		shp->shm_perm.mode |= SHM_DEST;
>> @@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
>>   	kfree(shp);
>>   }
>>   
>> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
>> +/*
>> + * It has to be called with shp locked.
>> + * It must be called before ipc_rmid()
>> + */
>> +static inline void shm_clist_rm(struct shmid_kernel *shp)
>>   {
>> -	list_del(&s->shm_clist);
>> -	ipc_rmid(&shm_ids(ns), &s->shm_perm);
>> +	struct task_struct *creator;
>> +
>> +	/*
>> +	 * A concurrent exit_shm may do a list_del_init() as well.
>> +	 * Just do nothing if exit_shm already did the work
>> +	 */
>> +	if (list_empty(&shp->shm_clist))
>> +		return;
>> +
>> +	/*
>> +	 * shp->shm_creator is guaranteed to be valid *only*
>> +	 * if shp->shm_clist is not empty.
>> +	 */
>> +	creator = shp->shm_creator;
>> +
>> +	task_lock(creator);
>> +	list_del_init(&shp->shm_clist);
>> +	task_unlock(creator);
> Lock ordering
>     rwsem
>         ipc_lock
>            task_lock
>         
correct.
>> +}
>> +
>> +static inline void shm_rmid(struct shmid_kernel *s)
>> +{
>> +	shm_clist_rm(s);
>> +	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
>>   }
>>   
>>   
>> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>>   	shm_file = shp->shm_file;
>>   	shp->shm_file = NULL;
>>   	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> -	shm_rmid(ns, shp);
>> +	shm_rmid(shp);
>>   	shm_unlock(shp);
>>   	if (!is_file_hugepages(shm_file))
>>   		shmem_lock(shm_file, 0, shp->mlock_ucounts);
>> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>>    *
>>    * 2) sysctl kernel.shm_rmid_forced is set to 1.
>>    */
>> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>> +static bool shm_may_destroy(struct shmid_kernel *shp)
>>   {
>>   	return (shp->shm_nattch == 0) &&
>> -	       (ns->shm_rmid_forced ||
>> +	       (shp->ns->shm_rmid_forced ||
>>   		(shp->shm_perm.mode & SHM_DEST));
>>   }
>>   
>> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
>>   	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>>   	shp->shm_dtim = ktime_get_real_seconds();
>>   	shp->shm_nattch--;
>> -	if (shm_may_destroy(ns, shp))
>> +	if (shm_may_destroy(shp))
>>   		shm_destroy(ns, shp);
>>   	else
>>   		shm_unlock(shp);
>> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>>   	 *
>>   	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>>   	 */
>> -	if (shp->shm_creator != NULL)
>> +	if (!list_empty(&shp->shm_clist))
>>   		return 0;
>>   
>> -	if (shm_may_destroy(ns, shp)) {
>> +	if (shm_may_destroy(shp)) {
>>   		shm_lock_by_ptr(shp);
>>   		shm_destroy(ns, shp);
>>   	}
>> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>>   /* Locking assumes this will only be called with task == current */
>>   void exit_shm(struct task_struct *task)
>>   {
>> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
>> -	struct shmid_kernel *shp, *n;
>> +	for (;;) {
>> +		struct shmid_kernel *shp;
>> +		struct ipc_namespace *ns;
>>   
>> -	if (list_empty(&task->sysvshm.shm_clist))
>> -		return;
>> +		task_lock(task);
>> +
>> +		if (list_empty(&task->sysvshm.shm_clist)) {
>> +			task_unlock(task);
>> +			break;
>> +		}
>> +
>> +		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
>> +				shm_clist);
>> +
>> +		/* 1) unlink */
>> +		list_del_init(&shp->shm_clist);
>>   
>> -	/*
>> -	 * If kernel.shm_rmid_forced is not set then only keep track of
>> -	 * which shmids are orphaned, so that a later set of the sysctl
>> -	 * can clean them up.
>> -	 */
>> -	if (!ns->shm_rmid_forced) {
>> -		down_read(&shm_ids(ns).rwsem);
>> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
>> -			shp->shm_creator = NULL;
>>   		/*
>> -		 * Only under read lock but we are only called on current
>> -		 * so no entry on the list will be shared.
>> +		 * 2) Get pointer to the ipc namespace. It is worth to say
>> +		 * that this pointer is guaranteed to be valid because
>> +		 * shp lifetime is always shorter than namespace lifetime
>> +		 * in which shp lives.
>> +		 * We taken task_lock it means that shp won't be freed.
>>   		 */
>> -		list_del(&task->sysvshm.shm_clist);
>> -		up_read(&shm_ids(ns).rwsem);
>> -		return;
>> -	}
>> +		ns = shp->ns;
>>   
>> -	/*
>> -	 * Destroy all already created segments, that were not yet mapped,
>> -	 * and mark any mapped as orphan to cover the sysctl toggling.
>> -	 * Destroy is skipped if shm_may_destroy() returns false.
>> -	 */
>> -	down_write(&shm_ids(ns).rwsem);
>> -	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
>> -		shp->shm_creator = NULL;
>> +		/*
>> +		 * 3) If kernel.shm_rmid_forced is not set then only keep track of
>> +		 * which shmids are orphaned, so that a later set of the sysctl
>> +		 * can clean them up.
>> +		 */
>> +		if (!ns->shm_rmid_forced) {
>> +			task_unlock(task);
>> +			continue;
>> +		}
>>   
>> -		if (shm_may_destroy(ns, shp)) {
>> +		/*
>> +		 * 4) get a reference to the namespace.
>> +		 *    The refcount could be already 0. If it is 0, then
>> +		 *    the shm objects will be free by free_ipc_work().
>> +		 */
>> +		ns = get_ipc_ns_not_zero(ns);
>> +		if (ns) {
>> +			/*
>> +			 * 5) get a reference to the shp itself.
>> +			 *   This cannot fail: shm_clist_rm() is called before
>> +			 *   ipc_rmid(), thus the refcount cannot be 0.
>> +			 */
>> +			WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
>> +		}
>> +
>> +		task_unlock(task);
> <<<<<<<<< BOOM >>>>>>>
>
> I don't see anything that prevents another task from
> calling shm_destroy(ns, shp) here and freeing it before
> this task can take the rwsem for writing.

shm_destroy() can be called. But due to the ipc_rcu_getref(), the 
structure will remain valid.
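Why the structure remains valid can be sketched as well (illustrative userspace model, made-up names): a concurrent shm_destroy() marks the object dead and drops the idr's reference, but never frees memory while another task still holds a reference of its own.

```c
#include <stdbool.h>

struct model_obj { int refs; bool deleted; bool freed; };

/* drop one reference; the final put models the RCU-delayed kfree */
static void model_put(struct model_obj *o)
{
	if (--o->refs == 0)
		o->freed = true;
}

/* concurrent shm_destroy(): logical death plus dropping the idr's
 * reference, but no free while other references remain */
static void model_destroy(struct model_obj *o)
{
	o->deleted = true;
	model_put(o);
}
```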


>> +
>> +		if (ns) {
>> +			down_write(&shm_ids(ns).rwsem);
>>   			shm_lock_by_ptr(shp);
>> -			shm_destroy(ns, shp);
>> +			/*
>> +			 * rcu_read_lock was implicitly taken in
>> +			 * shm_lock_by_ptr, it's safe to call
>> +			 * ipc_rcu_putref here
>> +			 */
>> +			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
>> +
>> +			if (ipc_valid_object(&shp->shm_perm)) {

And this will return false if there was a shm_destroy().


>> +				if (shm_may_destroy(shp))
>> +					shm_destroy(ns, shp);
>> +				else
>> +					shm_unlock(shp);
>> +			} else {
>> +				/*
>> +				 * Someone else deleted the shp from namespace
>> +				 * idr/kht while we have waited.
>> +				 * Just unlock and continue.
>> +				 */

-> just do a NOP if shm_destroy() was already performed.

Actually, the same design is used by find_alloc_undo() in ipc/sem.c.

>> +				shm_unlock(shp);
>> +			}
>> +
>> +			up_write(&shm_ids(ns).rwsem);
>> +			put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
>>   		}
>>   	}
>> -
>> -	/* Remove the list head from any segments still attached. */
>> -	list_del(&task->sysvshm.shm_clist);
>> -	up_write(&shm_ids(ns).rwsem);
>>   }
>>   
>>   static vm_fault_t shm_fault(struct vm_fault *vmf)
>> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>>   	if (error < 0)
>>   		goto no_id;
>>   
>> +	shp->ns = ns;
>> +
>> +	task_lock(current);
>>   	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
>> +	task_unlock(current);
>>   
>>   	/*
>>   	 * shmid gets reported as "inode#" in /proc/pid/maps.
>> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
>>   	down_write(&shm_ids(ns).rwsem);
>>   	shp = shm_lock(ns, shmid);
>>   	shp->shm_nattch--;
>> -	if (shm_may_destroy(ns, shp))
>> +
>> +	if (shm_may_destroy(shp))
>>   		shm_destroy(ns, shp);
>>   	else
>>   		shm_unlock(shp);


[-- Attachment #2: 0002-shm-extend-forced-shm-destroy-to-support-objects-fro.patch --]
[-- Type: text/x-patch, Size: 11893 bytes --]

From c9b0b5037865aa7714b0e7c96082e0296d8a42b9 Mon Sep 17 00:00:00 2001
From: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Date: Thu, 28 Oct 2021 01:43:48 +0300
Subject: [PATCH 2/3] shm: extend forced shm destroy to support objects from
 several IPC nses

Currently, exit_shm function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
it leads to use-after-free (reproducer exists).

That particular patch is attempt to fix the problem by extending exit_shm
mechanism to handle shm's destroy from several IPC ns'es.

To achieve that we do several things:
1. add namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we initialize
this pointer by current task IPC ns
3. exit_shm() fully reworked such that it traverses over all
shp's in task->sysvshm.shm_clist and gets IPC namespace not
from current task as it was before but from shp's object itself, then
call shm_destroy(shp, ns).

Note. We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using
special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter
only if IPC ns not in the "state of destruction".

Q/A

Q: Why we can access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther
than IPC namespace lifetime, so, if we get shp object from the
task->sysvshm.shm_clist while holding task_lock(task) nobody can
steal our namespace.

Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: Not. It's just fixes non-covered case when process may leave
IPC namespace without getting task->sysvshm.shm_clist list cleaned up.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
---
 include/linux/ipc_namespace.h |  15 +++
 include/linux/sched/task.h    |   2 +-
 include/linux/shm.h           |   2 +-
 ipc/shm.c                     | 170 +++++++++++++++++++++++++---------
 4 files changed, 142 insertions(+), 47 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..b75395ec8d52 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	if (ns) {
+		if (refcount_inc_not_zero(&ns->ns.count))
+			return ns;
+	}
+
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
+
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..bfdf84dab4be 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/ipc/shm.c b/ipc/shm.c
index ab749be6d8b7..ebb25a8ecc58 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
 	struct pid		*shm_lprid;
 	struct ucounts		*mlock_ucounts;
 
-	/* The task created the shm object.  NULL if the task is dead. */
+	/*
+	 * The task created the shm object, for looking up
+	 * task->sysvshm.shm_clist_lock
+	 */
 	struct task_struct	*shm_creator;
-	struct list_head	shm_clist;	/* list by creator */
+
+	/*
+	 * List by creator. task_lock(->shm_creator) required for read/write.
+	 * If list_empty(), then the creator is dead already.
+	 */
+	struct list_head	shm_clist;
+	struct ipc_namespace	*ns;
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct shmid_kernel *shp;
 
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+	WARN_ON(ns != shp->ns);
 
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
 	kfree(shp);
 }
 
-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
 {
-	list_del(&s->shm_clist);
-	ipc_rmid(&shm_ids(ns), &s->shm_perm);
+	struct task_struct *creator;
+
+	/*
+	 * A concurrent exit_shm may do a list_del_init() as well.
+	 * Just do nothing if exit_shm already did the work
+	 */
+	if (list_empty(&shp->shm_clist))
+		return;
+
+	/*
+	 * shp->shm_creator is guaranteed to be valid *only*
+	 * if shp->shm_clist is not empty.
+	 */
+	creator = shp->shm_creator;
+
+	task_lock(creator);
+	list_del_init(&shp->shm_clist);
+	task_unlock(creator);
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
+{
+	shm_clist_rm(s);
+	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
 }
 
 
@@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	shm_rmid(ns, shp);
+	shm_rmid(shp);
 	shm_unlock(shp);
 	if (!is_file_hugepages(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  *
  * 2) sysctl kernel.shm_rmid_forced is set to 1.
  */
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
 {
 	return (shp->shm_nattch == 0) &&
-	       (ns->shm_rmid_forced ||
+	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
 }
 
@@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
@@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (!list_empty(&shp->shm_clist))
 		return 0;
 
-	if (shm_may_destroy(ns, shp)) {
+	if (shm_may_destroy(shp)) {
 		shm_lock_by_ptr(shp);
 		shm_destroy(ns, shp);
 	}
@@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
-	struct shmid_kernel *shp, *n;
+	for (;;) {
+		struct shmid_kernel *shp;
+		struct ipc_namespace *ns;
 
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
+		task_lock(task);
+
+		if (list_empty(&task->sysvshm.shm_clist)) {
+			task_unlock(task);
+			break;
+		}
+
+		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+				shm_clist);
+
+		/* 1) unlink */
+		list_del_init(&shp->shm_clist);
 
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
 		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
+		 * 2) Get pointer to the ipc namespace. It is worth to say
+		 * that this pointer is guaranteed to be valid because
+		 * shp lifetime is always shorter than namespace lifetime
+		 * in which shp lives.
+		 * We taken task_lock it means that shp won't be freed.
 		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+		ns = shp->ns;
 
-	/*
-	 * Destroy all already created segments, that were not yet mapped,
-	 * and mark any mapped as orphan to cover the sysctl toggling.
-	 * Destroy is skipped if shm_may_destroy() returns false.
-	 */
-	down_write(&shm_ids(ns).rwsem);
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		shp->shm_creator = NULL;
+		/*
+		 * 3) If kernel.shm_rmid_forced is not set then only keep track of
+		 * which shmids are orphaned, so that a later set of the sysctl
+		 * can clean them up.
+		 */
+		if (!ns->shm_rmid_forced) {
+			task_unlock(task);
+			continue;
+		}
 
-		if (shm_may_destroy(ns, shp)) {
+		/*
+		 * 4) get a reference to the namespace.
+		 *    The refcount could be already 0. If it is 0, then
+		 *    the shm objects will be free by free_ipc_work().
+		 */
+		ns = get_ipc_ns_not_zero(ns);
+		if (ns) {
+			/*
+			 * 5) get a reference to the shp itself.
+			 *   This cannot fail: shm_clist_rm() is called before
+			 *   ipc_rmid(), thus the refcount cannot be 0.
+			 */
+			WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
+		}
+
+		task_unlock(task);
+
+		if (ns) {
+			down_write(&shm_ids(ns).rwsem);
 			shm_lock_by_ptr(shp);
-			shm_destroy(ns, shp);
+			/*
+			 * rcu_read_lock was implicitly taken in
+			 * shm_lock_by_ptr, it's safe to call
+			 * ipc_rcu_putref here
+			 */
+			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+			if (ipc_valid_object(&shp->shm_perm)) {
+				if (shm_may_destroy(shp))
+					shm_destroy(ns, shp);
+				else
+					shm_unlock(shp);
+			} else {
+				/*
+				 * Someone else deleted the shp from namespace
+			 * idr/kht while we were waiting.
+				 * Just unlock and continue.
+				 */
+				shm_unlock(shp);
+			}
+
+			up_write(&shm_ids(ns).rwsem);
+			put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
 		}
 	}
-
-	/* Remove the list head from any segments still attached. */
-	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (error < 0)
 		goto no_id;
 
+	shp->ns = ns;
+
+	task_lock(current);
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	task_unlock(current);
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
-- 
2.31.1


[-- Attachment #3: 0003-DEBUG-CODE-instrummented-ipc-shm.c.patch --]
[-- Type: text/x-patch, Size: 3848 bytes --]

From ed67173357031d9a501e41b6be05cfc438f44adc Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Sat, 30 Oct 2021 14:27:25 +0200
Subject: [PATCH 3/3] [DEBUG CODE] instrummented ipc/shm.c

Target: show that a shm segment cannot outlive its ipc namespace.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 ipc/shm.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/ipc/shm.c b/ipc/shm.c
index ebb25a8ecc58..6222d5b8acf6 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -126,6 +126,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
 	WARN_ON(ns != shp->ns);
 
+pr_info("do_shm_rmid(): shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
 		/* Do not find it any more */
@@ -138,9 +139,11 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 #ifdef CONFIG_IPC_NS
 void shm_exit_ns(struct ipc_namespace *ns)
 {
+pr_info("namespace %px: in exit_ns.\n", ns);
 	free_ipcs(ns, &shm_ids(ns), do_shm_rmid);
 	idr_destroy(&ns->ids[IPC_SHM_IDS].ipcs_idr);
 	rhashtable_destroy(&ns->ids[IPC_SHM_IDS].key_ht);
+pr_info("namespace %px: end of exit_ns.\n", ns);
 }
 #endif
 
@@ -287,6 +290,7 @@ static int __shm_open(struct vm_area_struct *vma)
 
 	shp->shm_atim = ktime_get_real_seconds();
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
+pr_info("__shm_open(): before ++: shp %px, sfd->file %px: shp->shm_nattch %ld.\n", shp, sfd->file, shp->shm_nattch);
 	shp->shm_nattch++;
 	shm_unlock(shp);
 	return 0;
@@ -344,6 +348,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  */
 static bool shm_may_destroy(struct shmid_kernel *shp)
 {
+pr_info("shm_may_destroy(): shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
 	return (shp->shm_nattch == 0) &&
 	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
@@ -375,6 +380,7 @@ static void shm_close(struct vm_area_struct *vma)
 
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
+pr_info("shm_close(): before --: shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
 	shp->shm_nattch--;
 	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
@@ -590,6 +596,7 @@ static int shm_release(struct inode *ino, struct file *file)
 {
 	struct shm_file_data *sfd = shm_file_data(file);
 
+pr_info("shm_release: file %px, put_ipc_ns().\n", sfd->file);
 	put_ipc_ns(sfd->ns);
 	fput(sfd->file);
 	shm_file_data(file) = NULL;
@@ -748,6 +755,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	shp->shm_segsz = size;
 	shp->shm_nattch = 0;
 	shp->shm_file = file;
+pr_info("newseg(): shp %px: shp->shm_nattch %ld ->shmfile %px.\n", shp, shp->shm_nattch, shp->shm_file);
 	shp->shm_creator = current;
 
 	/* ipc_addid() locks shp upon success. */
@@ -1588,6 +1596,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	 * detect shm ID reuse we need to compare the file pointers.
 	 */
 	base = get_file(shp->shm_file);
+pr_info("do_shmat(): shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
 	shp->shm_nattch++;
 	size = i_size_read(file_inode(base));
 	ipc_unlock_object(&shp->shm_perm);
@@ -1612,6 +1621,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	}
 
 	sfd->id = shp->shm_perm.id;
+pr_info("do_shmat(): shp %px: get_ipc_ns().\n", shp);
 	sfd->ns = get_ipc_ns(ns);
 	sfd->file = base;
 	sfd->vm_ops = NULL;
@@ -1651,6 +1661,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 out_nattch:
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
+pr_info("do_shmat() before --: shp %px: shp->shm_nattch %ld.\n", shp, shp->shm_nattch);
 	shp->shm_nattch--;
 
 	if (shm_may_destroy(shp))
-- 
2.31.1


[-- Attachment #4: shmns4.c --]
[-- Type: text/x-csrc, Size: 3215 bytes --]

/* _GNU_SOURCE must be defined before any libc header is included,
 * otherwise the unshare()/setns() prototypes are not exposed. */
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <fcntl.h>

#include <sched.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <pthread.h>
#include <unistd.h>

static void do_unshare(int num, int flags)
{
	int res;

	printf(" %d) unshare(0x%x).\n", num, flags);
	res = unshare(flags);
	if (res != 0) {
		printf("  %d: unshare(0x%x) failed, errno %d.\n", num, flags, errno);
		exit(3);
	}
}

static void set_rmid_forced(char *value)
{
	int fd;
	int i;

	fd=open("/proc/sys/kernel/shm_rmid_forced", O_RDWR);
	if (fd == -1) {
		printf("open shm_rmid_forced failed, errno %d.\n", errno);
		exit (1);
	}
	i = write(fd, value, 2);
	if (i != 2) {
		printf("unexpected result when writing %s to shm_rmid_forced: %d, errno %d.\n", value, i, errno);
		exit (2);
	}
	close(fd);
}

static void *do_shmget(bool map_it)
{
	int seg;
	void *ptr;

	if ((seg = shmget (IPC_PRIVATE, 1, IPC_CREAT| 0600)) == -1) {
		perror("shmget");
		exit(3);
	}
	if (map_it) {
		if ((ptr = shmat (seg, 0, 0)) == (void*)-1) {
			perror ("shmat");
			exit(4);
		}
	} else {
		ptr = NULL;
	}
	return ptr;
}

int main (int argc, char **argv)
{
	pid_t child;

	(void)argv;
	(void)argc;

	printf("shmns4:\n");
	printf("  One process creates and maps shm segments in multiple namespaces.\n");
	printf("  The namespaces are replaced before unmapping the segments.\n");

	do_unshare(1, CLONE_NEWIPC);
	set_rmid_forced("1\n");

	child = fork();
	if (child == -1) {
		perror ("fork");
		exit(5);
	}
	if (child == 0) {
		printf("create a namespace, create 2 shm segments, do not map them.\n");
		do_unshare(2, CLONE_NEWIPC);
		set_rmid_forced("1\n");
		do_shmget(false);
		do_shmget(false);

		do_unshare(2, CLONE_NEWIPC);
		set_rmid_forced("0\n");
		sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

		printf("create a namespace, create 2 shm segments, do not map them, no auto-rm.\n");
		do_shmget(false);
		do_shmget(false);

		do_unshare(2, CLONE_NEWIPC);
		set_rmid_forced("1\n");
		sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

		printf("create a namespace, create 2 shm segments, map them.\n");
		do_shmget(true);
		do_shmget(true);

		do_unshare(2, CLONE_NEWIPC);
		set_rmid_forced("0\n");
		sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

		printf("Once more: Create a namespace, create 2 shm segments, map them, no auto-rm.\n");
		do_shmget(true);
		do_shmget(true);


		printf("Orphan namespace (switch back to parent namespace).\n");
		{
			char path[255];
			int fd;

			sprintf(path, "/proc/%d/ns/ipc", getppid());
			fd = open(path, O_RDONLY);
			if (fd == -1) {
				perror("open ipc ns");
				exit(6);
			}
			if (setns(fd, 0) == -1) {
				perror("setns to parent");
				exit(7);			
			}
		}
		sleep(5); /* namespace destruction is done in a worker, thus wait a bit */

		printf("Before exit of child: 4 mappings exist in 2 namespaces.\n");

		exit(0);
		
	} else {
		int status;
		int ret;

		sleep(1);
		ret = waitpid(child, &status, 0);
		sleep(10);
		printf("parent:waitpid returned %d, status %d.\n", ret, status);
	}
	return 0;
}

[-- Attachment #5: log-ns4.txt --]
[-- Type: text/plain, Size: 5028 bytes --]

#./shmns4
shmns4:
  One process creates and maps shm segments in multiple namespaces.
  The namespaces are replaced before unmapping the segments.
 1) unshare(0x8000000).
create a namespace, create 2 shm segments, do not map them.
 2) unshare(0x8000000).
[   71.444890] newseg(): shp ffff888003a84f00: shp->shm_nattch 0 ->shmfile ffff88800428f500.
[   71.448696] newseg(): shp ffff888003a84e00: shp->shm_nattch 0 ->shmfile ffff88800428f900.
 2) unshare(0x8000000).
[   71.453352] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 0.
[   71.455822] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 0.
[   71.460332] namespace ffff888003679400: in exit_ns.
[   71.461783] namespace ffff888003679400: end of exit_ns.
create a namespace, create 2 shm segments, do not map them, no auto-rm.
[   76.481527] newseg(): shp ffff888003a84f00: shp->shm_nattch 0 ->shmfile ffff88800428f800.
[   76.486162] newseg(): shp ffff888003a84e00: shp->shm_nattch 0 ->shmfile ffff88800428f900.
 2) unshare(0x8000000).
[   76.496480] namespace ffff888003679800: in exit_ns.
[   76.499758] do_shm_rmid(): shp ffff888003a84f00: shp->shm_nattch 0.
[   76.515934] do_shm_rmid(): shp ffff888003a84e00: shp->shm_nattch 0.
[   76.537126] namespace ffff888003679800: end of exit_ns.
create a namespace, create 2 shm segments, map them.
[   81.517464] newseg(): shp ffff888003a84e00: shp->shm_nattch 0 ->shmfile ffff88800428f800.
[   81.526964] do_shmat(): shp ffff888003a84e00: shp->shm_nattch 0.
[   81.531575] do_shmat(): shp ffff888003a84e00: get_ipc_ns().
[   81.542459] __shm_open(): before ++: shp ffff888003a84e00, sfd->file ffff88800428f800: shp->shm_nattch 1.
[   81.549390] do_shmat() before --: shp ffff888003a84e00: shp->shm_nattch 2.
[   81.554699] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 1.
[   81.560649] newseg(): shp ffff888003a84f00: shp->shm_nattch 0 ->shmfile ffff88800428f500.
[   81.564649] do_shmat(): shp ffff888003a84f00: shp->shm_nattch 0.
[   81.568681] do_shmat(): shp ffff888003a84f00: get_ipc_ns().
[   81.573865] __shm_open(): before ++: shp ffff888003a84f00, sfd->file ffff88800428f500: shp->shm_nattch 1.
[   81.576866] do_shmat() before --: shp ffff888003a84f00: shp->shm_nattch 2.
[   81.580494] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 1.
 2) unshare(0x8000000).
[   81.589648] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 1.
[   81.592431] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 1.
Once more: Create a namespace, create 2 shm segments, map them, no auto-rm.
[   86.609807] newseg(): shp ffff888003a84000: shp->shm_nattch 0 ->shmfile ffff888004023a00.
[   86.613978] do_shmat(): shp ffff888003a84000: shp->shm_nattch 0.
[   86.616616] do_shmat(): shp ffff888003a84000: get_ipc_ns().
[   86.621714] __shm_open(): before ++: shp ffff888003a84000, sfd->file ffff888004023a00: shp->shm_nattch 1.
[   86.625975] do_shmat() before --: shp ffff888003a84000: shp->shm_nattch 2.
[   86.629578] shm_may_destroy(): shp ffff888003a84000: shp->shm_nattch 1.
[   86.633766] newseg(): shp ffff888003a84100: shp->shm_nattch 0 ->shmfile ffff888004023e00.
[   86.639642] do_shmat(): shp ffff888003a84100: shp->shm_nattch 0.
[   86.643634] do_shmat(): shp ffff888003a84100: get_ipc_ns().
[   86.646951] __shm_open(): before ++: shp ffff888003a84100, sfd->file ffff888004023e00: shp->shm_nattch 1.
[   86.651648] do_shmat() before --: shp ffff888003a84100: shp->shm_nattch 2.
[   86.660527] shm_may_destroy(): shp ffff888003a84100: shp->shm_nattch 1.
Orphan namespace (switch back to parent namespace).
Before exit of child: 4 mappings exist in 2 namespaces.
[   91.750385] shm_close(): before --: shp ffff888003a84100: shp->shm_nattch 1.
[   91.755503] shm_may_destroy(): shp ffff888003a84100: shp->shm_nattch 0.
[   91.758710] shm_close(): before --: shp ffff888003a84000: shp->shm_nattch 1.
[   91.761828] shm_may_destroy(): shp ffff888003a84000: shp->shm_nattch 0.
[   91.764879] shm_close(): before --: shp ffff888003a84f00: shp->shm_nattch 1.
[   91.768248] shm_may_destroy(): shp ffff888003a84f00: shp->shm_nattch 0.
[   91.772642] shm_close(): before --: shp ffff888003a84e00: shp->shm_nattch 1.
[   91.776417] shm_may_destroy(): shp ffff888003a84e00: shp->shm_nattch 0.
[   91.790861] shm_release: file ffff88800428f800, put_ipc_ns().
[   91.796858] shm_release: file ffff88800428f500, put_ipc_ns().
[   91.805866] namespace ffff888003679400: in exit_ns.
[   91.808458] namespace ffff888003679400: end of exit_ns.
[   91.816613] shm_release: file ffff888004023a00, put_ipc_ns().
[   91.821392] shm_release: file ffff888004023e00, put_ipc_ns().
[   91.825715] namespace ffff888003679800: in exit_ns.
[   91.828811] do_shm_rmid(): shp ffff888003a84000: shp->shm_nattch 0.
[   91.832453] do_shm_rmid(): shp ffff888003a84100: shp->shm_nattch 0.
[   91.843841] namespace ffff888003679800: end of exit_ns.
parent:waitpid returned 285, status 0.
[  101.882842] namespace ffff888003679000: in exit_ns.
[  101.885707] namespace ffff888003679000: end of exit_ns.
# 


* Re: [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses
  2021-10-27 22:43 ` [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
  2021-10-30  4:26   ` Eric W. Biederman
@ 2021-11-05 17:46   ` Eric W. Biederman
  2021-11-05 19:03     ` Manfred Spraul
  1 sibling, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2021-11-05 17:46 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: linux-kernel, Andrew Morton, Davidlohr Bueso, Greg KH,
	Andrei Vagin, Pavel Tikhomirov, Vasily Averin, Manfred Spraul,
	Alexander Mikhalitsyn, stable

Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> writes:

> Currently, the exit_shm() function is not designed to work properly when
> task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
>
> This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
> it leads to use-after-free (reproducer exists).
>
> This patch attempts to fix the problem by extending the exit_shm
> mechanism to handle destruction of shm objects from several IPC namespaces.
>
> To achieve that we do several things:
> 1. add namespace (non-refcounted) pointer to the struct shmid_kernel
> 2. during new shm object creation (newseg()/shmget syscall) we initialize
> this pointer with the current task's IPC ns
> 3. exit_shm() is fully reworked such that it traverses all
> shp's in task->sysvshm.shm_clist and gets the IPC namespace not
> from the current task, as it did before, but from the shp object itself,
> then calls shm_destroy(ns, shp).
>
> Note: we need to be really careful here because, as said in (1),
> our pointer to the IPC ns is not refcounted. To be on the safe side we use
> the special helper get_ipc_ns_not_zero(), which takes an IPC ns refcount
> only if the IPC ns is not already in the "state of destruction".
>
> Q/A
>
> Q: Why can we access shp->ns memory through a non-refcounted pointer?
> A: Because the shp object lifetime is always shorter
> than the IPC namespace lifetime, so if we get the shp object from
> task->sysvshm.shm_clist while holding task_lock(task), nobody can
> steal our namespace.
>
> Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
> A: No. It just fixes a previously uncovered case where a process may leave
> an IPC namespace without getting its task->sysvshm.shm_clist cleaned up.
>
> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

After reading Manfred's explanation I see what I was missing.

The ipc namespace exists as long as shm_nattch != 0.  I am annoyed
that shm_exit_ns calls do_shm_rmid which implies otherwise.

I had totally missed that ipc_rcu_getref and ipc_rcu_putref existed,
which is what makes taking a reference, dropping the lock, and then
retaking it possible.
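The take-reference/drop-lock/retake dance that ipc_rcu_getref() and get_ipc_ns_not_zero() enable rests on an "increment unless already zero" primitive. A userspace C11 sketch of that idea (an illustration only, not the kernel's refcount_t implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

/* "Increment unless already zero": the caller gets a reference only if
 * the object is not already being torn down.  Sketch of the idea behind
 * refcount_inc_not_zero()/get_ipc_ns_not_zero(); not the kernel code. */
static bool ref_get_not_zero(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		/* Move old -> old + 1 only if nobody raced us; a failed
		 * CAS reloads old, and the loop retries or gives up. */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
	}
	return false;	/* refcount already hit zero: hands off */
}
```

Once such a call succeeds, the holder may drop whatever lock protected the lookup and re-acquire locks in the proper order, which is exactly what the reworked exit_shm() does before shm_lock_by_ptr().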

From 10,000 feet:
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

This approach does directly address the reported issue without
touching anything else so I think this is a good approach to solve
the reported crash.


Comments on the actual code are below.  Mostly it is little
nits.  But at least one substantive issue as well.

> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: Vasily Averin <vvs@virtuozzo.com>
> Cc: Manfred Spraul <manfred@colorfullife.com>
> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
> Cc: stable@vger.kernel.org
> Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
> ---
>  include/linux/ipc_namespace.h |  15 +++
>  include/linux/sched/task.h    |   2 +-
>  include/linux/shm.h           |   2 +-
>  ipc/shm.c                     | 170 +++++++++++++++++++++++++---------
>  4 files changed, 142 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index 05e22770af51..b75395ec8d52 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>  	return ns;
>  }
>  
> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
> +{
> +	if (ns) {
> +		if (refcount_inc_not_zero(&ns->ns.count))
> +			return ns;
> +	}
> +
> +	return NULL;
> +}
> +
>  extern void put_ipc_ns(struct ipc_namespace *ns);
>  #else
>  static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
> @@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
>  	return ns;
>  }
>  
> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
> +{
> +	return ns;
> +}
> +
>  static inline void put_ipc_ns(struct ipc_namespace *ns)
>  {
>  }
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ef02be869cf2..bfdf84dab4be 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>   * pins the final release of task.io_context.  Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),
> diff --git a/include/linux/shm.h b/include/linux/shm.h
> index d8e69aed3d32..709f6d0451c0 100644
> --- a/include/linux/shm.h
> +++ b/include/linux/shm.h
> @@ -11,7 +11,7 @@ struct file;
>  
>  #ifdef CONFIG_SYSVIPC
>  struct sysv_shm {
> -	struct list_head shm_clist;
> +	struct list_head	shm_clist;
>  };

This change is unnecessary.

>  
>  long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 748933e376ca..29667e17b12a 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
>  	struct pid		*shm_lprid;
>  	struct ucounts		*mlock_ucounts;
>  
> -	/* The task created the shm object.  NULL if the task is dead. */
> +	/*
> +	 * The task created the shm object, for looking up
> +	 * task->sysvshm.shm_clist_lock
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           task_lock
> +	 */
>  	struct task_struct	*shm_creator;
> -	struct list_head	shm_clist;	/* list by creator */
> +
> +	/*
> +	 * list by creator. shm_clist_lock required for read/write
                            ^^^^^^^^^^^^^^
                            task_lock
> +	 * if list_empty(), then the creator is dead already
> +	 */
> +	struct list_head	shm_clist;
> +	struct ipc_namespace	*ns;
>  } __randomize_layout;
>  
>  /* shm_mode upper byte flags */
> @@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>  	struct shmid_kernel *shp;
>  
>  	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> +	WARN_ON(ns != shp->ns);

>  
>  	if (shp->shm_nattch) {
>  		shp->shm_perm.mode |= SHM_DEST;
> @@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head)
>  	kfree(shp);
>  }
>  
> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
> +/*
> + * It has to be called with shp locked.
> + * It must be called before ipc_rmid()
> + */
> +static inline void shm_clist_rm(struct shmid_kernel *shp)
>  {
> -	list_del(&s->shm_clist);
> -	ipc_rmid(&shm_ids(ns), &s->shm_perm);
> +	struct task_struct *creator;
> +
> +	/*
> +	 * A concurrent exit_shm may do a list_del_init() as well.
> +	 * Just do nothing if exit_shm already did the work
> +	 */
> +	if (list_empty(&shp->shm_clist))
> +		return;

This looks like a problem.  With no lock held, the list_empty here is
fundamentally an optimization.  So the rest of the function should run
properly if this list_empty is removed.

It does not look to me like the rest of the function will run properly
if list_empty is removed.

The code needs an rcu_lock or something like that to ensure that
shm_creator does not go away between the time it is read and when the
lock is taken.

> +
> +	/*
> +	 * shp->shm_creator is guaranteed to be valid *only*
> +	 * if shp->shm_clist is not empty.
> +	 */
> +	creator = shp->shm_creator;
> +
> +	task_lock(creator);
> +	list_del_init(&shp->shm_clist);
> +	task_unlock(creator);
> +}
> +
> +static inline void shm_rmid(struct shmid_kernel *s)
> +{
> +	shm_clist_rm(s);
> +	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
>  }
>  
>  
> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>  	shm_file = shp->shm_file;
>  	shp->shm_file = NULL;
>  	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -	shm_rmid(ns, shp);
> +	shm_rmid(shp);
>  	shm_unlock(shp);
>  	if (!is_file_hugepages(shm_file))
>  		shmem_lock(shm_file, 0, shp->mlock_ucounts);
> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>   *
>   * 2) sysctl kernel.shm_rmid_forced is set to 1.
>   */
> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> +static bool shm_may_destroy(struct shmid_kernel *shp)
>  {
>  	return (shp->shm_nattch == 0) &&
> -	       (ns->shm_rmid_forced ||
> +	       (shp->ns->shm_rmid_forced ||
>  		(shp->shm_perm.mode & SHM_DEST));
>  }
>  
> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
>  	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>  	shp->shm_dtim = ktime_get_real_seconds();
>  	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +	if (shm_may_destroy(shp))
>  		shm_destroy(ns, shp);
>  	else
>  		shm_unlock(shp);
> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>  	 *
>  	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>  	 */

We should add a comment why testing list_empty here is safe/reliable.

Now that the list deletion is only protected by task_lock it feels like
this introduces a race.

I don't think the race is meaningful as either the list is non-empty
or it is empty.  Plus none of the following tests are racy.  So there
is no danger of an attached segment being destroyed.
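The "none of the following tests are racy" claim is easiest to check against the destroy condition itself, which is a pure function of the segment's state. A stand-alone model of shm_may_destroy() from the patch (field names follow the kernel, but the struct here is a hypothetical stand-in):

```c
#include <stdbool.h>
#include <assert.h>

#define SHM_DEST 01000	/* matches the kernel's shm_perm.mode flag */

/* Userspace stand-in for struct shmid_kernel; illustration only. */
struct shp_stub {
	unsigned long shm_nattch;	/* number of current attaches */
	unsigned short mode;		/* SHM_DEST set after IPC_RMID */
	bool ns_shm_rmid_forced;	/* kernel.shm_rmid_forced of shp->ns */
};

/* A segment may be destroyed once nothing is attached AND either the
 * namespace forces removal or IPC_RMID already marked it SHM_DEST. */
static bool shm_may_destroy_model(const struct shp_stub *shp)
{
	return (shp->shm_nattch == 0) &&
	       (shp->ns_shm_rmid_forced || (shp->mode & SHM_DEST));
}
```

Because an attached segment has shm_nattch > 0, no ordering of the list_empty check against a concurrent exit_shm() can make this function return true for it.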

> -	if (shp->shm_creator != NULL)
> +	if (!list_empty(&shp->shm_clist))
>  		return 0;
>  
> -	if (shm_may_destroy(ns, shp)) {
> +	if (shm_may_destroy(shp)) {
>  		shm_lock_by_ptr(shp);
>  		shm_destroy(ns, shp);
>  	}
> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>  /* Locking assumes this will only be called with task == current */
>  void exit_shm(struct task_struct *task)
>  {
> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> -	struct shmid_kernel *shp, *n;
> +	for (;;) {
> +		struct shmid_kernel *shp;
> +		struct ipc_namespace *ns;
>  
> -	if (list_empty(&task->sysvshm.shm_clist))
> -		return;
> +		task_lock(task);
> +
> +		if (list_empty(&task->sysvshm.shm_clist)) {
> +			task_unlock(task);
> +			break;
> +		}
> +
> +		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
> +				shm_clist);
> +
> +		/* 1) unlink */
> +		list_del_init(&shp->shm_clist);
                ^^^^^^^
                The code should also clear shm_creator here, so that a
                stale reference becomes a NULL pointer dereference
                instead of a use-after-free.  Something like:

		/*
		 * The old shm_creator value will remain valid for
		 * at least an rcu grace period after this, see
		 * put_task_struct_rcu_user.
		 */
		rcu_assign_pointer(shp->shm_creator, NULL);

This allows shm_clist_rm to look like:
static inline void shm_clist_rm(struct shmid_kernel *shp)
{
	struct task_struct *creator;

	rcu_read_lock();
	creator = rcu_dereference(shp->shm_creator);
	if (creator) {
		task_lock(creator);
		list_del_init(&shp->shm_clist);
		task_unlock(creator);
	}
	rcu_read_unlock();
}
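The rcu_assign_pointer()/rcu_dereference() pair suggested above can be modelled in userspace with C11 atomics: publication is a store-release, the read is a dependency-ordered load. A minimal sketch (illustration only; real RCU also defers freeing until a grace period, which is not modelled here, and the names are hypothetical stand-ins):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

struct task_stub { int pid; };	/* stand-in for struct task_struct */

/* Publish a pointer so that readers see fully initialized contents:
 * the userspace analogue of rcu_assign_pointer(). */
static void publish_creator(_Atomic(struct task_stub *) *slot,
			    struct task_stub *t)
{
	atomic_store_explicit(slot, t, memory_order_release);
}

/* Dependency-ordered read: the userspace analogue of rcu_dereference().
 * Readers either see the old valid pointer or NULL, never a torn value. */
static struct task_stub *read_creator(_Atomic(struct task_stub *) *slot)
{
	return atomic_load_explicit(slot, memory_order_consume);
}
```

With this shape, an exiting task that unpublishes itself turns any later stale lookup into a NULL check rather than a use-after-free, which is the point of Eric's suggestion.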
	
>  
> -	/*
> -	 * If kernel.shm_rmid_forced is not set then only keep track of
> -	 * which shmids are orphaned, so that a later set of the sysctl
> -	 * can clean them up.
> -	 */
> -	if (!ns->shm_rmid_forced) {
> -		down_read(&shm_ids(ns).rwsem);
> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> -			shp->shm_creator = NULL;
>  		/*
> -		 * Only under read lock but we are only called on current
> -		 * so no entry on the list will be shared.
> +		 * 2) Get a pointer to the ipc namespace. This pointer is
> +		 * guaranteed to be valid: the lifetime of shp is always
> +		 * shorter than the lifetime of the namespace in which
> +		 * shp lives. Since we hold task_lock(), shp cannot be
> +		 * freed, and therefore neither can its namespace.
>  		 */
> -		list_del(&task->sysvshm.shm_clist);
> -		up_read(&shm_ids(ns).rwsem);
> -		return;
> -	}
> +		ns = shp->ns;
>  
> -	/*
> -	 * Destroy all already created segments, that were not yet mapped,
> -	 * and mark any mapped as orphan to cover the sysctl toggling.
> -	 * Destroy is skipped if shm_may_destroy() returns false.
> -	 */
> -	down_write(&shm_ids(ns).rwsem);
> -	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> -		shp->shm_creator = NULL;
> +		/*
> +		 * 3) If kernel.shm_rmid_forced is not set then only keep track of
> +		 * which shmids are orphaned, so that a later set of the sysctl
> +		 * can clean them up.
> +		 */
> +		if (!ns->shm_rmid_forced) {
> +			task_unlock(task);
> +			continue;
> +		}
>  
> -		if (shm_may_destroy(ns, shp)) {
> +		/*
> +		 * 4) get a reference to the namespace.
> +		 *    The refcount could already be 0. If it is 0, then
> +		 *    the shm objects will be freed by free_ipc_work().
> +		 */
> +		ns = get_ipc_ns_not_zero(ns);
> +		if (ns) {
                ^^^^^^^^^

                This test is probably easier to follow if it was simply:
                if (!ns) {
                	task_unlock(task);
                        continue;
                }

                Then the basic logic can all stay at the same
                indentation level, and ns does not need to be
                tested a second time.

> +			/*
> +			 * 5) get a reference to the shp itself.
> +			 *   This cannot fail: shm_clist_rm() is called before
> +			 *   ipc_rmid(), thus the refcount cannot be 0.
> +			 */
> +			WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        This calls for an ipc_getref that simply calls
                        refcount_inc.  Then the refcount code can
                        perform all of the sanity checks for you,
                        and the WARN_ON becomes unnecessary.

                        Plus the code then documents the fact you know
                        the refcount must be non-zero here.
> +		}
> +
> +		task_unlock(task);
> +
> +		if (ns) {
> +			down_write(&shm_ids(ns).rwsem);
>  			shm_lock_by_ptr(shp);
> -			shm_destroy(ns, shp);
> +			/*
> +			 * rcu_read_lock was implicitly taken in
> +			 * shm_lock_by_ptr, it's safe to call
> +			 * ipc_rcu_putref here
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          This comment should say something like:

                          rcu_read_lock was taken in shm_lock_by_ptr.
                          With rcu protecting our accesses of shp
                          holding a reference to shp is unnecessary.
                          
> +			 */
> +			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                        It probably makes most sense just to move
                        this decrement of the extra reference down to
                        just before put_ipc_ns.  Removing the need
                        for the comment and understanding the subtleties
                        there, and keeping all of the taking and putting
                        in a consistent order.
                        

> +
> +			if (ipc_valid_object(&shp->shm_perm)) {
> +				if (shm_may_destroy(shp))
> +					shm_destroy(ns, shp);
> +				else
> +					shm_unlock(shp);
> +			} else {
> +				/*
> +				 * Someone else deleted the shp from namespace
> +				 * idr/kht while we have waited.
> +				 * Just unlock and continue.
> +				 */
> +				shm_unlock(shp);
> +			}
> +
> +			up_write(&shm_ids(ns).rwsem);
> +			put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
>  		}
>  	}
> -
> -	/* Remove the list head from any segments still attached. */
> -	list_del(&task->sysvshm.shm_clist);
> -	up_write(&shm_ids(ns).rwsem);
>  }
>  
>  static vm_fault_t shm_fault(struct vm_fault *vmf)
> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>  	if (error < 0)
>  		goto no_id;
>  
> +	shp->ns = ns;
> +
> +	task_lock(current);
>  	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
> +	task_unlock(current);
>  
>  	/*
>  	 * shmid gets reported as "inode#" in /proc/pid/maps.
> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
>  	down_write(&shm_ids(ns).rwsem);
>  	shp = shm_lock(ns, shmid);
>  	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +
> +	if (shm_may_destroy(shp))
>  		shm_destroy(ns, shp);
>  	else
>  		shm_unlock(shp);

Eric


* Re: [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses
  2021-11-05 17:46   ` Eric W. Biederman
@ 2021-11-05 19:03     ` Manfred Spraul
  2021-11-05 21:34       ` [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified) Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Manfred Spraul @ 2021-11-05 19:03 UTC (permalink / raw)
  To: Eric W. Biederman, Alexander Mikhalitsyn
  Cc: linux-kernel, Andrew Morton, Davidlohr Bueso, Greg KH,
	Andrei Vagin, Pavel Tikhomirov, Vasily Averin,
	Alexander Mikhalitsyn, stable

Hi Eric,

On 11/5/21 18:46, Eric W. Biederman wrote:
>
>> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
>> +/*
>> + * It has to be called with shp locked.
>> + * It must be called before ipc_rmid()
>> + */
>> +static inline void shm_clist_rm(struct shmid_kernel *shp)
>>   {
>> -	list_del(&s->shm_clist);
>> -	ipc_rmid(&shm_ids(ns), &s->shm_perm);
>> +	struct task_struct *creator;
>> +
>> +	/*
>> +	 * A concurrent exit_shm may do a list_del_init() as well.
>> +	 * Just do nothing if exit_shm already did the work
>> +	 */
>> +	if (list_empty(&shp->shm_clist))
>> +		return;
> This looks like a problem.  With no lock held, the list_empty here is
> fundamentally an optimization.  So the rest of the function should run
> properly if this list_empty is removed.
>
> It does not look to me like the rest of the function will run properly
> if list_empty is removed.
>
> The code needs an rcu_lock or something like that to ensure that
> shm_creator does not go away between the time it is read and when the
> lock is taken.

>> +
>> +	/*
>> +	 * shp->shm_creator is guaranteed to be valid *only*
>> +	 * if shp->shm_clist is not empty.
>> +	 */
>> +	creator = shp->shm_creator;
>> +
>> +	task_lock(creator);
>> +	list_del_init(&shp->shm_clist);
>> +	task_unlock(creator);
>> +}
>> +

You are right!
I had checked the function several times, but I overlooked the
simple case. exit_shm() contains:

> task_lock()
> list_del_init()
> task_unlock()
>
> down_write(&shm_ids(ns).rwsem);
> shm_lock_by_ptr(shp);
>
<<< since shm_clist_rm() is called while holding the shp lock, 
exit_shm() cannot proceed. Thus !list_empty() guarantees that 
->creator will not disappear.

But: for !shm_rmid_forced, there is no lock of shp :-(


>> +static inline void shm_rmid(struct shmid_kernel *s)
>> +{
>> +	shm_clist_rm(s);
>> +	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
>>   }
>>   
>>   
>> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>>   	shm_file = shp->shm_file;
>>   	shp->shm_file = NULL;
>>   	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> -	shm_rmid(ns, shp);
>> +	shm_rmid(shp);
>>   	shm_unlock(shp);
>>   	if (!is_file_hugepages(shm_file))
>>   		shmem_lock(shm_file, 0, shp->mlock_ucounts);
>> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>>    *
>>    * 2) sysctl kernel.shm_rmid_forced is set to 1.
>>    */
>> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>> +static bool shm_may_destroy(struct shmid_kernel *shp)
>>   {
>>   	return (shp->shm_nattch == 0) &&
>> -	       (ns->shm_rmid_forced ||
>> +	       (shp->ns->shm_rmid_forced ||
>>   		(shp->shm_perm.mode & SHM_DEST));
>>   }
>>   
>> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma)
>>   	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>>   	shp->shm_dtim = ktime_get_real_seconds();
>>   	shp->shm_nattch--;
>> -	if (shm_may_destroy(ns, shp))
>> +	if (shm_may_destroy(shp))
>>   		shm_destroy(ns, shp);
>>   	else
>>   		shm_unlock(shp);
>> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>>   	 *
>>   	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>>   	 */
> We should add a comment why testing list_empty here is safe/reliable.
>
> Now that the list deletion is only protected by task_lock it feels like
> this introduces a race.
>
> I don't think the race is meaningful as either the list is non-empty
> or it is empty.  Plus none of the following tests are racy.  So there
> is no danger of an attached segment being destroyed.
The shp can be destroyed, in the sense that ->deleted is set. But this 
is handled.
>> -	if (shp->shm_creator != NULL)
>> +	if (!list_empty(&shp->shm_clist))
>>   		return 0;
>>   
>> -	if (shm_may_destroy(ns, shp)) {
>> +	if (shm_may_destroy(shp)) {
>>   		shm_lock_by_ptr(shp);
>>   		shm_destroy(ns, shp);
>>   	}
>> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>>   /* Locking assumes this will only be called with task == current */
>>   void exit_shm(struct task_struct *task)
>>   {
>> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
>> -	struct shmid_kernel *shp, *n;
>> +	for (;;) {
>> +		struct shmid_kernel *shp;
>> +		struct ipc_namespace *ns;
>>   
>> -	if (list_empty(&task->sysvshm.shm_clist))
>> -		return;
>> +		task_lock(task);
>> +
>> +		if (list_empty(&task->sysvshm.shm_clist)) {
>> +			task_unlock(task);
>> +			break;
>> +		}
>> +
>> +		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
>> +				shm_clist);
>> +
>> +		/* 1) unlink */
>> +		list_del_init(&shp->shm_clist);
>                  ^^^^^^^
>                  The code should also clear shm_creator here as well.
>                  So that a stale reference becomes a NULL pointer
>                  dereference instead of use-after-free.  Something like:
list_del_init() already contains a WRITE_ONCE(), and that pairs with the 
READ_ONCE() in list_empty().

Using both shp->shm_creator == NULL and list_empty() as protection 
doesn't help; it can only introduce new races.



> 		/*
>                   * The old shm_creator value will remain valid for
>                   * at least an rcu grace period after this, see
> 		 * put_task_struct_rcu_user.
>                   */
>                   
> 		rcu_assign_pointer(shp->shm_creator, NULL);
>
> This allows shm_clist_rm to look like:
> static inline void shm_clist_rm(struct shmid_kernel *shp)
> {
> 	struct task_struct *creator;
>
> 	rcu_read_lock();
>          creator = rcu_dereference(shp->shm_creator);

We must protect against a parallel 
exit_shm(); <...>; kmem_cache_free(, creator), correct?

No other races are relevant, as shp->shm_creator is written once and 
then never updated.

Thus, my current understanding: We need the rcu_read_lock().

And rcu_read_lock() is sufficient, as release_task ends with 
put_task_struct_rcu_user().

>          if (creator) {
>          	task_lock(creator);
>                  list_del_init(&shp->shm_clist);
>                  task_unlock(creator);
>          }
>          rcu_read_unlock();
> }
> 	
>>   
>> -	/*
>> -	 * If kernel.shm_rmid_forced is not set then only keep track of
>> -	 * which shmids are orphaned, so that a later set of the sysctl
>> -	 * can clean them up.
>> -	 */
>> -	if (!ns->shm_rmid_forced) {
>> -		down_read(&shm_ids(ns).rwsem);
>> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
>> -			shp->shm_creator = NULL;
>>   		/*
>> -		 * Only under read lock but we are only called on current
>> -		 * so no entry on the list will be shared.
>> +		 * 2) Get a pointer to the ipc namespace. It is worth noting
>> +		 * that this pointer is guaranteed to be valid because
>> +		 * shp's lifetime is always shorter than the lifetime of
>> +		 * the namespace in which shp lives.
>> +		 * We have taken task_lock, so shp won't be freed.
>>   		 */
>> -		list_del(&task->sysvshm.shm_clist);
>> -		up_read(&shm_ids(ns).rwsem);
>> -		return;
>> -	}
>> +		ns = shp->ns;
>>   
>> -	/*
>> -	 * Destroy all already created segments, that were not yet mapped,
>> -	 * and mark any mapped as orphan to cover the sysctl toggling.
>> -	 * Destroy is skipped if shm_may_destroy() returns false.
>> -	 */
>> -	down_write(&shm_ids(ns).rwsem);
>> -	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
>> -		shp->shm_creator = NULL;
>> +		/*
>> +		 * 3) If kernel.shm_rmid_forced is not set then only keep track of
>> +		 * which shmids are orphaned, so that a later set of the sysctl
>> +		 * can clean them up.
>> +		 */
>> +		if (!ns->shm_rmid_forced) {
>> +			task_unlock(task);
>> +			continue;
>> +		}
>>   
>> -		if (shm_may_destroy(ns, shp)) {
>> +		/*
>> +		 * 4) get a reference to the namespace.
>> +		 *    The refcount could be already 0. If it is 0, then
>> +		 *    the shm objects will be free by free_ipc_work().
>> +		 */
>> +		ns = get_ipc_ns_not_zero(ns);
>> +		if (ns) {
>                  ^^^^^^^^^
>
>                  This test is probably easier to follow if it was simply:
>                  if (!ns) {
>                  	task_unlock(task);
>                          continue;
>                  }
>
>                  Then the basic logic can all stay at the same
>                  indentation level, and ns does not need to be
>                  tested a second time.
>
>> +			/*
>> +			 * 5) get a reference to the shp itself.
>> +			 *   This cannot fail: shm_clist_rm() is called before
>> +			 *   ipc_rmid(), thus the refcount cannot be 0.
>> +			 */
>> +			WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                          This calls for an ipc_getref that simply calls
>                          refcount_inc.  Then the refcount code can
>                          perform all of the sanity checks for you,
>                          and the WARN_ON becomes unnecessary.
>
>                          Plus the code then documents the fact you know
>                          the refcount must be non-zero here.
>> +		}
>> +
>> +		task_unlock(task);
>> +
>> +		if (ns) {
>> +			down_write(&shm_ids(ns).rwsem);
>>   			shm_lock_by_ptr(shp);
>> -			shm_destroy(ns, shp);
>> +			/*
>> +			 * rcu_read_lock was implicitly taken in
>> +			 * shm_lock_by_ptr, it's safe to call
>> +			 * ipc_rcu_putref here
>                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                            This comment should say something like:
>
>                            rcu_read_lock was taken in shm_lock_by_ptr.
>                            With rcu protecting our accesses of shp
>                            holding a reference to shp is unnecessary.
>                            
>> +			 */
>> +			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                          It probably makes most sense just to move
>                          this decrement of the extra reference down to
>                          just before put_ipc_ns.  Removing the need
>                          for the comment and understanding the subtleties
>                          there, and keeping all of the taking and putting
>                          in a consistent order.
>                          
>
>> +
>> +			if (ipc_valid_object(&shp->shm_perm)) {
>> +				if (shm_may_destroy(shp))
>> +					shm_destroy(ns, shp);
>> +				else
>> +					shm_unlock(shp);
>> +			} else {
>> +				/*
>> +				 * Someone else deleted the shp from namespace
>> +				 * idr/kht while we have waited.
>> +				 * Just unlock and continue.
>> +				 */
>> +				shm_unlock(shp);
>> +			}
>> +
>> +			up_write(&shm_ids(ns).rwsem);
>> +			put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
>>   		}
>>   	}
>> -
>> -	/* Remove the list head from any segments still attached. */
>> -	list_del(&task->sysvshm.shm_clist);
>> -	up_write(&shm_ids(ns).rwsem);
>>   }
>>   
>>   static vm_fault_t shm_fault(struct vm_fault *vmf)
>> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>>   	if (error < 0)
>>   		goto no_id;
>>   
>> +	shp->ns = ns;
>> +
>> +	task_lock(current);
>>   	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
>> +	task_unlock(current);
>>   
>>   	/*
>>   	 * shmid gets reported as "inode#" in /proc/pid/maps.
>> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
>>   	down_write(&shm_ids(ns).rwsem);
>>   	shp = shm_lock(ns, shmid);
>>   	shp->shm_nattch--;
>> -	if (shm_may_destroy(ns, shp))
>> +
>> +	if (shm_may_destroy(shp))
>>   		shm_destroy(ns, shp);
>>   	else
>>   		shm_unlock(shp);
> Eric



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified)
  2021-11-05 19:03     ` Manfred Spraul
@ 2021-11-05 21:34       ` Eric W. Biederman
  2021-11-06  7:50         ` Manfred Spraul
  2021-11-06 14:42         ` Manfred Spraul
  0 siblings, 2 replies; 12+ messages in thread
From: Eric W. Biederman @ 2021-11-05 21:34 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Alexander Mikhalitsyn, linux-kernel, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Alexander Mikhalitsyn, stable


I have to dash so this is short.

This is what I am thinking this change should look like.

I am not certain this is truly reviewable as a single change, so I will
break it into a couple of smaller ones next time I get the chance.

Eric

 include/linux/ipc_namespace.h |  12 ++++
 include/linux/sched/task.h    |   2 +-
 ipc/shm.c                     | 135 +++++++++++++++++++++++++-----------------
 ipc/util.c                    |   5 ++
 ipc/util.h                    |   1 +
 kernel/fork.c                 |   1 -
 6 files changed, 100 insertions(+), 56 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..c220767a0cc1 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,13 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	if (ns && refcount_inc_not_zero(&ns->ns.count))
+		return ns;
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +154,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
+
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..1d9533d66f7e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/ipc/shm.c b/ipc/shm.c
index ab749be6d8b7..80e3595d3a69 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -63,8 +63,9 @@ struct shmid_kernel /* private to the kernel */
 	struct ucounts		*mlock_ucounts;
 
 	/* The task created the shm object.  NULL if the task is dead. */
-	struct task_struct	*shm_creator;
+	struct task_struct __rcu *shm_creator;
 	struct list_head	shm_clist;	/* list by creator */
+	struct ipc_namespace	*shm_ns;	/* valid when shm_nattch != 0 */
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -106,29 +107,17 @@ void shm_init_ns(struct ipc_namespace *ns)
 	ipc_init_ids(&shm_ids(ns));
 }
 
-/*
- * Called with shm_ids.rwsem (writer) and the shp structure locked.
- * Only shm_ids.rwsem remains locked on exit.
- */
-static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+static void do_shm_destroy(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
-	struct shmid_kernel *shp;
-
-	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
-
-	if (shp->shm_nattch) {
-		shp->shm_perm.mode |= SHM_DEST;
-		/* Do not find it any more */
-		ipc_set_key_private(&shm_ids(ns), &shp->shm_perm);
-		shm_unlock(shp);
-	} else
-		shm_destroy(ns, shp);
+	struct shmid_kernel *shp =
+		container_of(ipcp, struct shmid_kernel, shm_perm);
+	shm_destroy(ns, shp);
 }
 
 #ifdef CONFIG_IPC_NS
 void shm_exit_ns(struct ipc_namespace *ns)
 {
-	free_ipcs(ns, &shm_ids(ns), do_shm_rmid);
+	free_ipcs(ns, &shm_ids(ns), do_shm_destroy);
 	idr_destroy(&ns->ids[IPC_SHM_IDS].ipcs_idr);
 	rhashtable_destroy(&ns->ids[IPC_SHM_IDS].key_ht);
 }
@@ -225,9 +214,22 @@ static void shm_rcu_free(struct rcu_head *head)
 	kfree(shp);
 }
 
+static inline void shm_clist_del(struct shmid_kernel *shp)
+{
+	struct task_struct *creator;
+
+	rcu_read_lock();
+	creator = rcu_dereference(shp->shm_creator);
+	if (creator) {
+		task_lock(creator);
+		list_del(&shp->shm_clist);
+		task_unlock(creator);
+	}
+	rcu_read_unlock();
+}
+
 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
 {
-	list_del(&s->shm_clist);
 	ipc_rmid(&shm_ids(ns), &s->shm_perm);
 }
 
@@ -283,7 +285,9 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	shm_clist_del(shp);
 	shm_rmid(ns, shp);
+	shp->shm_ns = NULL;
 	shm_unlock(shp);
 	if (!is_file_hugepages(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -361,7 +365,7 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (rcu_access_pointer(shp->shm_creator) != NULL)
 		return 0;
 
 	if (shm_may_destroy(ns, shp)) {
@@ -382,48 +386,62 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
-	struct shmid_kernel *shp, *n;
-
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
-
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
-		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
-		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+	struct list_head *head = &task->sysvshm.shm_clist;
 
 	/*
 	 * Destroy all already created segments, that were not yet mapped,
 	 * and mark any mapped as orphan to cover the sysctl toggling.
 	 * Destroy is skipped if shm_may_destroy() returns false.
 	 */
-	down_write(&shm_ids(ns).rwsem);
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		shp->shm_creator = NULL;
+	for (;;) {
+		struct ipc_namespace *ns;
+		struct shmid_kernel *shp;
 
-		if (shm_may_destroy(ns, shp)) {
+		task_lock(task);
+		if (list_empty(head)) {
+			task_unlock(task);
+			break;
+		}
+
+		shp = list_first_entry(head, struct shmid_kernel, shm_clist);
+
+		list_del(&shp->shm_clist);
+		rcu_assign_pointer(shp->shm_creator, NULL);
+
+		/*
+		 * Guarantee that ns lives after task_lock is dropped.
+		 *
+		 * This shm segment may not be attached and its ipc
+		 * namespace may be exiting.  If so, ignore the shm
+		 * segment as it will be destroyed by shm_exit_ns.
+		 */
+		ns = get_ipc_ns_not_zero(shp->shm_ns);
+		if (!ns) {
+			task_unlock(task);
+			continue;
+		}
+
+		/* Guarantee shp lives after task_lock is dropped */
+		ipc_getref(&shp->shm_perm);
+
+		/* Drop task_lock so that shm_destroy may take it */
+		task_unlock(task);
+
+		/* Can the shm segment be destroyed? */
+		down_write(&shm_ids(ns).rwsem);
+		shm_lock_by_ptr(shp);
+		if (ipc_valid_object(&shp->shm_perm) &&
+		    shm_may_destroy(ns, shp)) {
 			shm_lock_by_ptr(shp);
 			shm_destroy(ns, shp);
+		} else {
+			shm_unlock(shp);
 		}
-	}
 
-	/* Remove the list head from any segments still attached. */
-	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
+		ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+		up_write(&shm_ids(ns).rwsem);
+		put_ipc_ns(ns);
+	}
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -673,14 +691,17 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	shp->shm_segsz = size;
 	shp->shm_nattch = 0;
 	shp->shm_file = file;
-	shp->shm_creator = current;
+	RCU_INIT_POINTER(shp->shm_creator, current);
+	shp->shm_ns = ns;
 
 	/* ipc_addid() locks shp upon success. */
 	error = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
 	if (error < 0)
 		goto no_id;
 
+	task_lock(current);
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	task_unlock(current);
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -913,8 +934,14 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
 	switch (cmd) {
 	case IPC_RMID:
 		ipc_lock_object(&shp->shm_perm);
-		/* do_shm_rmid unlocks the ipc object and rcu */
-		do_shm_rmid(ns, ipcp);
+		if (shp->shm_nattch) {
+			shp->shm_perm.mode |= SHM_DEST;
+			/* Do not find it any more */
+			ipc_set_key_private(&shm_ids(ns), &shp->shm_perm);
+			shm_unlock(shp);
+		} else
+			shm_destroy(ns, shp);
+		/* shm_unlock unlocked the ipc object and rcu */
 		goto out_up;
 	case IPC_SET:
 		ipc_lock_object(&shp->shm_perm);
diff --git a/ipc/util.c b/ipc/util.c
index fa2d86ef3fb8..58228f342397 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -525,6 +525,11 @@ void ipc_set_key_private(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 	ipcp->key = IPC_PRIVATE;
 }
 
+void ipc_getref(struct kern_ipc_perm *ptr)
+{
+	refcount_inc(&ptr->refcount);
+}
+
 bool ipc_rcu_getref(struct kern_ipc_perm *ptr)
 {
 	return refcount_inc_not_zero(&ptr->refcount);
diff --git a/ipc/util.h b/ipc/util.h
index 2dd7ce0416d8..e13b46ff675f 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -170,6 +170,7 @@ static inline int ipc_get_maxidx(struct ipc_ids *ids)
  * refcount is initialized by ipc_addid(), before that point call_rcu()
  * must be used.
  */
+void ipc_getref(struct kern_ipc_perm *ptr);
 bool ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
 			void (*func)(struct rcu_head *head));
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..3e881f78bcf2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3095,7 +3095,6 @@ int ksys_unshare(unsigned long unshare_flags)
 		if (unshare_flags & CLONE_NEWIPC) {
 			/* Orphan segments in old ns (see sem above). */
 			exit_shm(current);
-			shm_init_task(current);
 		}
 
 		if (new_nsproxy)


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified)
  2021-11-05 21:34       ` [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified) Eric W. Biederman
@ 2021-11-06  7:50         ` Manfred Spraul
  2021-11-06 14:42         ` Manfred Spraul
  1 sibling, 0 replies; 12+ messages in thread
From: Manfred Spraul @ 2021-11-06  7:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Mikhalitsyn, linux-kernel, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Alexander Mikhalitsyn, stable

Hi Eric,

On 11/5/21 22:34, Eric W. Biederman wrote:
> I have to dash so this is short.

As last time, I'll review the change and check for new/good ideas.

As first question: Is the change tested?

[...]

>   
>   	/* The task created the shm object.  NULL if the task is dead. */
> -	struct task_struct	*shm_creator;
> +	struct task_struct __rcu *shm_creator;
>   	struct list_head	shm_clist;	/* list by creator */
> +	struct ipc_namespace	*shm_ns;	/* valid when shm_nattch != 0 */
>   } __randomize_layout;
>   
There is no reason to modify shm_creator:

We need _one_ indicator that the creator has died, not two.

We have both list_empty() and shm_creator. Thus we should/must define 
what is the relevant indicator, and every function must use the same one.

exit_shm() must walk shm_clist. list_empty() must return the correct answer.

Thus I think it is simpler to make list_empty() the indicator.

In addition, as you have correctly noticed: if we make shm_creator == NULL 
the indicator, then we must use __rcu or at least READ_ONCE() accessors.

But: this would only solve a self-created problem. Just leave 
shm_creator unmodified - and the need for READ_ONCE() goes away.

>   /* shm_mode upper byte flags */
> @@ -106,29 +107,17 @@ void shm_init_ns(struct ipc_namespace *ns)
>   	ipc_init_ids(&shm_ids(ns));
>   }
>   
> -/*
> - * Called with shm_ids.rwsem (writer) and the shp structure locked.
> - * Only shm_ids.rwsem remains locked on exit.
> - */
> -static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
> +static void do_shm_destroy(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>   {
> -	struct shmid_kernel *shp;
> -
> -	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> -
> -	if (shp->shm_nattch) {
> -		shp->shm_perm.mode |= SHM_DEST;
> -		/* Do not find it any more */
> -		ipc_set_key_private(&shm_ids(ns), &shp->shm_perm);
> -		shm_unlock(shp);
> -	} else
> -		shm_destroy(ns, shp);
> +	struct shmid_kernel *shp =
> +		container_of(ipcp, struct shmid_kernel, shm_perm);
> +	shm_destroy(ns, shp);
>   }
>   
>   #ifdef CONFIG_IPC_NS
>   void shm_exit_ns(struct ipc_namespace *ns)
>   {
> -	free_ipcs(ns, &shm_ids(ns), do_shm_rmid);
> +	free_ipcs(ns, &shm_ids(ns), do_shm_destroy);
>   	idr_destroy(&ns->ids[IPC_SHM_IDS].ipcs_idr);
>   	rhashtable_destroy(&ns->ids[IPC_SHM_IDS].key_ht);
>   }
> @@ -225,9 +214,22 @@ static void shm_rcu_free(struct rcu_head *head)
>   	kfree(shp);
>   }
>   
> +static inline void shm_clist_del(struct shmid_kernel *shp)
> +{
> +	struct task_struct *creator;
> +
> +	rcu_read_lock();
> +	creator = rcu_dereference(shp->shm_creator);
> +	if (creator) {
> +		task_lock(creator);
> +		list_del(&shp->shm_clist);

Does this work? You are using list_del() instead of list_del_init().

I fear that this might break exit_shm().

> +		task_unlock(creator);
> +	}
> +	rcu_read_unlock();
> +}
> +
>   static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
>   {
> -	list_del(&s->shm_clist);
>   	ipc_rmid(&shm_ids(ns), &s->shm_perm);
>   }
>   
> @@ -283,7 +285,9 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>   	shm_file = shp->shm_file;
>   	shp->shm_file = NULL;
>   	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	shm_clist_del(shp);
>   	shm_rmid(ns, shp);
> +	shp->shm_ns = NULL;
>   	shm_unlock(shp);
>   	if (!is_file_hugepages(shm_file))
>   		shmem_lock(shm_file, 0, shp->mlock_ucounts);
> @@ -361,7 +365,7 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>   	 *
>   	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>   	 */
> -	if (shp->shm_creator != NULL)
> +	if (rcu_access_pointer(shp->shm_creator) != NULL)
>   		return 0;
>   
>   	if (shm_may_destroy(ns, shp)) {
> @@ -382,48 +386,62 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>   /* Locking assumes this will only be called with task == current */
>   void exit_shm(struct task_struct *task)
>   {
> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> -	struct shmid_kernel *shp, *n;
> -
> -	if (list_empty(&task->sysvshm.shm_clist))
> -		return;
> -
> -	/*
> -	 * If kernel.shm_rmid_forced is not set then only keep track of
> -	 * which shmids are orphaned, so that a later set of the sysctl
> -	 * can clean them up.
> -	 */
> -	if (!ns->shm_rmid_forced) {
> -		down_read(&shm_ids(ns).rwsem);
> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> -			shp->shm_creator = NULL;
> -		/*
> -		 * Only under read lock but we are only called on current
> -		 * so no entry on the list will be shared.
> -		 */
> -		list_del(&task->sysvshm.shm_clist);
> -		up_read(&shm_ids(ns).rwsem);
> -		return;
> -	}
> +	struct list_head *head = &task->sysvshm.shm_clist;
>   
>   	/*
>   	 * Destroy all already created segments, that were not yet mapped,
>   	 * and mark any mapped as orphan to cover the sysctl toggling.
>   	 * Destroy is skipped if shm_may_destroy() returns false.
>   	 */
> -	down_write(&shm_ids(ns).rwsem);
> -	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> -		shp->shm_creator = NULL;
> +	for (;;) {
> +		struct ipc_namespace *ns;
> +		struct shmid_kernel *shp;
>   
> -		if (shm_may_destroy(ns, shp)) {
> +		task_lock(task);
> +		if (list_empty(head)) {
> +			task_unlock(task);
> +			break;
> +		}
> +
> +		shp = list_first_entry(head, struct shmid_kernel, shm_clist);
> +
> +		list_del(&shp->shm_clist);
> +		rcu_assign_pointer(shp->shm_creator, NULL);
> +
> +		/*
> +		 * Guarantee that ns lives after task_lock is dropped.
> +		 *
> +		 * This shm segment may not be attached and its ipc
> +		 * namespace may be exiting.  If so, ignore the shm
> +		 * segment as it will be destroyed by shm_exit_ns.
> +		 */
> +		ns = get_ipc_ns_not_zero(shp->shm_ns);
> +		if (!ns) {
> +			task_unlock(task);
> +			continue;
> +		}
> +
> +		/* Guarantee shp lives after task_lock is dropped */
> +		ipc_getref(&shp->shm_perm);
> +
> +		/* Drop task_lock so that shm_destroy may take it */
> +		task_unlock(task);
> +
> +		/* Can the shm segment be destroyed? */
> +		down_write(&shm_ids(ns).rwsem);
> +		shm_lock_by_ptr(shp);
> +		if (ipc_valid_object(&shp->shm_perm) &&
> +		    shm_may_destroy(ns, shp)) {
>   			shm_lock_by_ptr(shp);
>   			shm_destroy(ns, shp);
> +		} else {
> +			shm_unlock(shp);
>   		}
> -	}
>   
> -	/* Remove the list head from any segments still attached. */
> -	list_del(&task->sysvshm.shm_clist);
> -	up_write(&shm_ids(ns).rwsem);
> +		ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
> +		up_write(&shm_ids(ns).rwsem);
> +		put_ipc_ns(ns);
> +	}
>   }
>   
>   static vm_fault_t shm_fault(struct vm_fault *vmf)
> @@ -673,14 +691,17 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>   	shp->shm_segsz = size;
>   	shp->shm_nattch = 0;
>   	shp->shm_file = file;
> -	shp->shm_creator = current;
> +	RCU_INIT_POINTER(shp->shm_creator, current);
> +	shp->shm_ns = ns;
>   
>   	/* ipc_addid() locks shp upon success. */
>   	error = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
>   	if (error < 0)
>   		goto no_id;
>   
> +	task_lock(current);
>   	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
> +	task_unlock(current);
>   
>   	/*
>   	 * shmid gets reported as "inode#" in /proc/pid/maps.
> @@ -913,8 +934,14 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
>   	switch (cmd) {
>   	case IPC_RMID:
>   		ipc_lock_object(&shp->shm_perm);
> -		/* do_shm_rmid unlocks the ipc object and rcu */
> -		do_shm_rmid(ns, ipcp);
> +		if (shp->shm_nattch) {
> +			shp->shm_perm.mode |= SHM_DEST;
> +			/* Do not find it any more */
> +			ipc_set_key_private(&shm_ids(ns), &shp->shm_perm);
> +			shm_unlock(shp);
> +		} else
> +			shm_destroy(ns, shp);
> +		/* shm_unlock unlocked the ipc object and rcu */
>   		goto out_up;
>   	case IPC_SET:
>   		ipc_lock_object(&shp->shm_perm);
> diff --git a/ipc/util.c b/ipc/util.c
> index fa2d86ef3fb8..58228f342397 100644
> --- a/ipc/util.c
> +++ b/ipc/util.c
> @@ -525,6 +525,11 @@ void ipc_set_key_private(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
>   	ipcp->key = IPC_PRIVATE;
>   }
>   
> +void ipc_getref(struct kern_ipc_perm *ptr)
> +{
> +	refcount_inc(&ptr->refcount);
> +}
> +
>   bool ipc_rcu_getref(struct kern_ipc_perm *ptr)
>   {
>   	return refcount_inc_not_zero(&ptr->refcount);
> diff --git a/ipc/util.h b/ipc/util.h
> index 2dd7ce0416d8..e13b46ff675f 100644
> --- a/ipc/util.h
> +++ b/ipc/util.h
> @@ -170,6 +170,7 @@ static inline int ipc_get_maxidx(struct ipc_ids *ids)
>    * refcount is initialized by ipc_addid(), before that point call_rcu()
>    * must be used.
>    */
> +void ipc_getref(struct kern_ipc_perm *ptr);
>   bool ipc_rcu_getref(struct kern_ipc_perm *ptr);
>   void ipc_rcu_putref(struct kern_ipc_perm *ptr,
>   			void (*func)(struct rcu_head *head));
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 38681ad44c76..3e881f78bcf2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -3095,7 +3095,6 @@ int ksys_unshare(unsigned long unshare_flags)
>   		if (unshare_flags & CLONE_NEWIPC) {
>   			/* Orphan segments in old ns (see sem above). */
>   			exit_shm(current);
> -			shm_init_task(current);
>   		}
>   
>   		if (new_nsproxy)



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified)
  2021-11-05 21:34       ` [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified) Eric W. Biederman
  2021-11-06  7:50         ` Manfred Spraul
@ 2021-11-06 14:42         ` Manfred Spraul
  2021-11-07 19:51           ` Eric W. Biederman
  1 sibling, 1 reply; 12+ messages in thread
From: Manfred Spraul @ 2021-11-06 14:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Mikhalitsyn, linux-kernel, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Alexander Mikhalitsyn, stable

[-- Attachment #1: Type: text/plain, Size: 1748 bytes --]

Hello everyone,

On 11/5/21 22:34, Eric W. Biederman wrote:
> I have to dash so this is short.
>
> This is what I am thinking this change should look like.
>
> I am not certain this is truly reviewable as a single change, so I will
> break it into a couple of smaller ones next time I get the chance.

I think we should concentrate on checking the commit from Alexander.

What I did was write two additional stress test apps - and now I'm 
able to trigger the use-after-free bug.

It is much simpler: the mutual exclusion of exit_shm() and IPC_RMID didn't 
work - regardless of whether your approach or the approach from Alexander/myself is used.
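For reference, the object lifecycle that such stress tests hammer concurrently can be outlined single-threaded. This is a hedged, illustrative sketch only: the real test apps (and the racing exit_shm()/IPC_RMID kernel paths) are not shown, and `shm_lifecycle` is a hypothetical helper name.

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <string.h>

/* Create, attach, mark-for-destroy and detach one SysV shm segment.
 * With sysctl kernel.shm_rmid_forced=1 the destroy may instead be
 * forced by exit_shm() - the path under discussion in this thread.
 * Returns 0 on success, -1 on any failure. */
int shm_lifecycle(void)
{
	int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	if (id < 0)
		return -1;

	char *p = shmat(id, NULL, 0);
	if (p == (void *)-1)
		return -1;
	strcpy(p, "hello");

	/* While attached (shm_nattch > 0), IPC_RMID only marks the
	 * segment SHM_DEST and hides its key. */
	if (shmctl(id, IPC_RMID, NULL) < 0)
		return -1;

	/* The last detach actually destroys the segment (shm_destroy()). */
	return shmdt(p);
}
```
The races appear only when many such lifecycles run in parallel with IPC_RMID and namespace teardown.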

>   
> +static inline void shm_clist_del(struct shmid_kernel *shp)
> +{
> +	struct task_struct *creator;
> +
> +	rcu_read_lock();
> +	creator = rcu_dereference(shp->shm_creator);
> +	if (creator) {
> +		task_lock(creator);
> +		list_del(&shp->shm_clist);
> +		task_unlock(creator);
> +	}
> +	rcu_read_unlock();
> +}
> +

shm_clist_del() only synchronizes against exit_shm() when shm_creator is 
not NULL.


> +		list_del(&shp->shm_clist);
> +		rcu_assign_pointer(shp->shm_creator, NULL);
> +

We set shm_creator to NULL -> no more synchronization.

Now IPC_RMID can run in parallel - regardless of whether we test for 
list_empty() or shm_creator.

> +
> +		/* Guarantee shp lives after task_lock is dropped */
> +		ipc_getref(&shp->shm_perm);
> +

task_lock() doesn't help: As soon as shm_creator is set to NULL, 
IPC_RMID won't acquire task_lock() anymore.

Thus shp can disappear before we arrive at this ipc_getref.

[Yes, I think I have introduced this bug. ]
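In toy form (illustrative C, not kernel code; the struct and function names are made up for this sketch), the synchronization IPC_RMID relies on evaporates the moment shm_creator is cleared:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of the race window: the proposed shm_clist_del()
 * takes task_lock(creator) only while shm_creator is non-NULL.
 * Once exit_shm() clears it, nothing serializes IPC_RMID against
 * the remainder of exit_shm(), and shp may be freed underneath it. */
struct toy_shp {
	void *shm_creator;	/* cleared by exit_shm() */
};

bool rmid_takes_task_lock(const struct toy_shp *shp)
{
	return shp->shm_creator != NULL;
}
```
Once `rmid_takes_task_lock()` turns false, the shp reference taken afterwards comes too late.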

Corrected version attached.


I'll reboot and retest the patch, then I will send it to akpm as a 
replacement for the current patch in mmotm.

--

     Manfred

[-- Attachment #2: 0001-shm-extend-forced-shm-destroy-to-support-objects-fro.patch --]
[-- Type: text/x-patch, Size: 12760 bytes --]

From df024d627beb3f23f2c5b795e1ae841a2dd6ff49 Mon Sep 17 00:00:00 2001
From: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Date: Thu, 28 Oct 2021 01:43:48 +0300
Subject: [PATCH] shm: extend forced shm destroy to support objects from
 several IPC nses

Currently, the exit_shm() function is not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
it leads to use-after-free (reproducer exists).

This patch is an attempt to fix the problem by extending the exit_shm
mechanism to handle destruction of shm objects from several IPC namespaces.

To achieve that we do several things:
1. add a namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we initialize
this pointer with the current task's IPC namespace
3. exit_shm() is fully reworked such that it traverses all
shp's in task->sysvshm.shm_clist and gets the IPC namespace not
from the current task as before, but from the shp object itself, then
calls shm_destroy(shp, ns).

Note: we need to be really careful here because, as noted in (1), our
pointer to the IPC namespace is not refcounted. To be on the safe side we
use the special helper get_ipc_ns_not_zero(), which takes a reference on
the IPC namespace only if it is not already being destroyed.

Q/A

Q: Why can we access shp->ns memory through a non-refcounted pointer?
A: Because the shp object lifetime is always shorter than the lifetime
of the IPC namespace it lives in, so if we get the shp object from
task->sysvshm.shm_clist while holding task_lock(task), nobody can
steal our namespace.

Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
A: No. It just fixes a previously uncovered case where a process may leave
an IPC namespace without getting its task->sysvshm.shm_clist cleaned up.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
- comment corrections, shm_clist_lock was replaced by task_lock()
- if/else exchanged as recommended by Eric
- bugfix for shp refcounting. Actually, this is probably the real
  use-after-free bug: the current code contains a use-after-free even
  without using namespaces.
- add rcu_read_lock() into shm_clist_rm(), to protect against
  a use-after-free for shm_creator->alloc_lock.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/ipc_namespace.h |  15 +++
 include/linux/sched/task.h    |   2 +-
 ipc/shm.c                     | 186 +++++++++++++++++++++++++---------
 3 files changed, 156 insertions(+), 47 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..b75395ec8d52 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	if (ns) {
+		if (refcount_inc_not_zero(&ns->ns.count))
+			return ns;
+	}
+
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
+
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba88a6987400..058d7f371e25 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -158,7 +158,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/ipc/shm.c b/ipc/shm.c
index ab749be6d8b7..939fa2de4de7 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
 	struct pid		*shm_lprid;
 	struct ucounts		*mlock_ucounts;
 
-	/* The task created the shm object.  NULL if the task is dead. */
+	/*
+	 * The task created the shm object, for
+	 * task_lock(shp->shm_creator)
+	 */
 	struct task_struct	*shm_creator;
-	struct list_head	shm_clist;	/* list by creator */
+
+	/*
+	 * List by creator. task_lock(->shm_creator) required for read/write.
+	 * If list_empty(), then the creator is dead already.
+	 */
+	struct list_head	shm_clist;
+	struct ipc_namespace	*ns;
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct shmid_kernel *shp;
 
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+	WARN_ON(ns != shp->ns);
 
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,43 @@ static void shm_rcu_free(struct rcu_head *head)
 	kfree(shp);
 }
 
-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
+{
+	struct task_struct *creator;
+
+	/* ensure that shm_creator does not disappear */
+	rcu_read_lock();
+
+	/*
+	 * A concurrent exit_shm may do a list_del_init() as well.
+	 * Just do nothing if exit_shm already did the work
+	 */
+	if (!list_empty(&shp->shm_clist)) {
+		/*
+		 * shp->shm_creator is guaranteed to be valid *only*
+		 * if shp->shm_clist is not empty.
+		 */
+		creator = shp->shm_creator;
+
+		task_lock(creator);
+		/*
+		 * list_del_init() is a nop if the entry was already removed
+		 * from the list.
+		 */
+		list_del_init(&shp->shm_clist);
+		task_unlock(creator);
+	}
+	rcu_read_unlock();
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
 {
-	list_del(&s->shm_clist);
-	ipc_rmid(&shm_ids(ns), &s->shm_perm);
+	shm_clist_rm(s);
+	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
 }
 
 
@@ -283,7 +326,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	shm_rmid(ns, shp);
+	shm_rmid(shp);
 	shm_unlock(shp);
 	if (!is_file_hugepages(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -306,10 +349,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  *
  * 2) sysctl kernel.shm_rmid_forced is set to 1.
  */
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
 {
 	return (shp->shm_nattch == 0) &&
-	       (ns->shm_rmid_forced ||
+	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
 }
 
@@ -340,7 +383,7 @@ static void shm_close(struct vm_area_struct *vma)
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
@@ -361,10 +404,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (!list_empty(&shp->shm_clist))
 		return 0;
 
-	if (shm_may_destroy(ns, shp)) {
+	if (shm_may_destroy(shp)) {
 		shm_lock_by_ptr(shp);
 		shm_destroy(ns, shp);
 	}
@@ -382,48 +425,94 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
-	struct shmid_kernel *shp, *n;
+	for (;;) {
+		struct shmid_kernel *shp;
+		struct ipc_namespace *ns;
 
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
+		task_lock(task);
+
+		if (list_empty(&task->sysvshm.shm_clist)) {
+			task_unlock(task);
+			break;
+		}
+
+		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+				shm_clist);
 
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
 		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
+		 * 1) get a reference to shp.
+		 *    This must be done first: Right now, task_lock() prevents
+		 *    any concurrent IPC_RMID calls. After the list_del_init(),
+		 *    IPC_RMID will not acquire task_lock(->shm_creator)
+		 *    anymore.
 		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+		WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
 
-	/*
-	 * Destroy all already created segments, that were not yet mapped,
-	 * and mark any mapped as orphan to cover the sysctl toggling.
-	 * Destroy is skipped if shm_may_destroy() returns false.
-	 */
-	down_write(&shm_ids(ns).rwsem);
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		shp->shm_creator = NULL;
+		/* 2) unlink */
+		list_del_init(&shp->shm_clist);
+
+		/*
+		 * 3) Get a pointer to the ipc namespace. It is worth noting
+		 * that this pointer is guaranteed to be valid because the
+		 * shp lifetime is always shorter than the lifetime of the
+		 * namespace in which shp lives.
+		 * We have taken task_lock(), so shp won't be freed.
+		 */
+		ns = shp->ns;
 
-		if (shm_may_destroy(ns, shp)) {
-			shm_lock_by_ptr(shp);
-			shm_destroy(ns, shp);
+		/*
+		 * 4) If kernel.shm_rmid_forced is not set then only keep track of
+		 * which shmids are orphaned, so that a later set of the sysctl
+		 * can clean them up.
+		 */
+		if (!ns->shm_rmid_forced) {
+			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+			task_unlock(task);
+			continue;
 		}
-	}
 
-	/* Remove the list head from any segments still attached. */
-	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
+		/*
+		 * 5) get a reference to the namespace.
+		 *    The refcount could be already 0. If it is 0, then
> +		 *    the shm objects will be freed by free_ipc_work().
+		 */
+		ns = get_ipc_ns_not_zero(ns);
+		if (!ns) {
+			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+			task_unlock(task);
+			continue;
+		}
+		task_unlock(task);
+
+		/*
+		 * 6) we have all references
+		 *    Thus lock & if needed destroy shp.
+		 */
+		down_write(&shm_ids(ns).rwsem);
+		shm_lock_by_ptr(shp);
+		/*
+		 * rcu_read_lock was implicitly taken in shm_lock_by_ptr, it's
+		 * safe to call ipc_rcu_putref here
+		 */
+		ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+		if (ipc_valid_object(&shp->shm_perm)) {
+			if (shm_may_destroy(shp))
+				shm_destroy(ns, shp);
+			else
+				shm_unlock(shp);
+		} else {
+			/*
+			 * Someone else deleted the shp from namespace
+			 * idr/kht while we have waited.
+			 * Just unlock and continue.
+			 */
+			shm_unlock(shp);
+		}
+
+		up_write(&shm_ids(ns).rwsem);
+		put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
+	}
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -680,7 +769,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (error < 0)
 		goto no_id;
 
+	shp->ns = ns;
+
+	task_lock(current);
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	task_unlock(current);
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1666,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
-- 
2.31.1



* Re: [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified)
  2021-11-06 14:42         ` Manfred Spraul
@ 2021-11-07 19:51           ` Eric W. Biederman
  2021-11-08 18:34             ` Manfred Spraul
  0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2021-11-07 19:51 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Alexander Mikhalitsyn, linux-kernel, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Alexander Mikhalitsyn, stable

Manfred Spraul <manfred@colorfullife.com> writes:

> Hello together,
>
> On 11/5/21 22:34, Eric W. Biederman wrote:
>>   +static inline void shm_clist_del(struct shmid_kernel *shp)
>> +{
>> +	struct task_struct *creator;
>> +
>> +	rcu_read_lock();
>> +	creator = rcu_dereference(shp->shm_creator);
>> +	if (creator) {
>> +		task_lock(creator);
>> +		list_del(&shp->shm_clist);
>> +		task_unlock(creator);
>> +	}
>> +	rcu_read_unlock();
>> +}
>> +
>
> shm_clist_del() only synchronizes against exit_shm() when shm_creator
> is not NULL.
>
>
>> +		list_del(&shp->shm_clist);
>> +		rcu_assign_pointer(shp->shm_creator, NULL);
>> +
>
> We set shm_creator to NULL -> no more synchronization.
>
> Now IPC_RMID can run in parallel - regardless if we test for
> list_empty() or shm_creator.
>
>> +
>> +		/* Guarantee shp lives after task_lock is dropped */
>> +		ipc_getref(&shp->shm_perm);
>> +
>
> task_lock() doesn't help: As soon as shm_creator is set to NULL,
> IPC_RMID won't acquire task_lock() anymore.
>
> Thus shp can disappear before we arrive at this ipc_getref.
>
> [Yes, I think I have introduced this bug. ]
>
> Corrected version attached.
>
>
> I'll reboot and retest the patch, then I would send it to akpm as
> replacement for current patch in mmotm.
>
> --
>
>     Manfred
>

> @@ -382,48 +425,94 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>  /* Locking assumes this will only be called with task == current */
>  void exit_shm(struct task_struct *task)
>  {
> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> -	struct shmid_kernel *shp, *n;
> +	for (;;) {
> +		struct shmid_kernel *shp;
> +		struct ipc_namespace *ns;
>  
> -	if (list_empty(&task->sysvshm.shm_clist))
> -		return;
> +		task_lock(task);
> +
> +		if (list_empty(&task->sysvshm.shm_clist)) {
> +			task_unlock(task);
> +			break;
> +		}
> +
> +		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
> +				shm_clist);
>  
> -	/*
> -	 * If kernel.shm_rmid_forced is not set then only keep track of
> -	 * which shmids are orphaned, so that a later set of the sysctl
> -	 * can clean them up.
> -	 */
> -	if (!ns->shm_rmid_forced) {
> -		down_read(&shm_ids(ns).rwsem);
> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> -			shp->shm_creator = NULL;
>  		/*
> -		 * Only under read lock but we are only called on current
> -		 * so no entry on the list will be shared.
> +		 * 1) get a reference to shp.
> +		 *    This must be done first: Right now, task_lock() prevents
> +		 *    any concurrent IPC_RMID calls. After the list_del_init(),
> +		 *    IPC_RMID will not acquire task_lock(->shm_creator)
> +		 *    anymore.
>  		 */
> -		list_del(&task->sysvshm.shm_clist);
> -		up_read(&shm_ids(ns).rwsem);
> -		return;
> -	}
> +		WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
>  
> -	/*
> -	 * Destroy all already created segments, that were not yet mapped,
> -	 * and mark any mapped as orphan to cover the sysctl toggling.
> -	 * Destroy is skipped if shm_may_destroy() returns false.
> -	 */
> -	down_write(&shm_ids(ns).rwsem);
> -	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> -		shp->shm_creator = NULL;
> +		/* 2) unlink */
> +		list_del_init(&shp->shm_clist);
> +
> +		/*
> +		 * 3) Get a pointer to the ipc namespace. It is worth noting
> +		 * that this pointer is guaranteed to be valid because the
> +		 * shp lifetime is always shorter than the lifetime of the
> +		 * namespace in which shp lives.
> +		 * We have taken task_lock(), so shp won't be freed.
> +		 */
> +		ns = shp->ns;
>  
> -		if (shm_may_destroy(ns, shp)) {
> -			shm_lock_by_ptr(shp);
> -			shm_destroy(ns, shp);
> +		/*
> +		 * 4) If kernel.shm_rmid_forced is not set then only keep track of
> +		 * which shmids are orphaned, so that a later set of the sysctl
> +		 * can clean them up.
> +		 */
> +		if (!ns->shm_rmid_forced) {
> +			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
> +			task_unlock(task);
> +			continue;
>  		}
> -	}
>  
> -	/* Remove the list head from any segments still attached. */
> -	list_del(&task->sysvshm.shm_clist);
> -	up_write(&shm_ids(ns).rwsem);
> +		/*
> +		 * 5) get a reference to the namespace.
> +		 *    The refcount could be already 0. If it is 0, then
> +		 *    the shm objects will be freed by free_ipc_work().
> +		 */
> +		ns = get_ipc_ns_not_zero(ns);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Isn't this increment also too late?  Doesn't this need to move up
by ipc_rcu_getref while shp is still on the list?

Assuming the code is running in parallel with shm_exit_ns after removal
from shm_clist shm_destroy can run to completion and shm_exit_ns can
run to completion and the ipc namespace can be freed.

Eric
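Eric's ordering constraint can be modeled in a few lines. This is illustrative only: the kernel's refcount_t is atomic, this toy version is not, and the `toy_*` names are invented for the sketch.

```c
#include <stdbool.h>

/* Toy non-atomic model of refcount_inc_not_zero(): grabbing a
 * reference succeeds only while someone else still holds one.
 * Hence both the shp reference and the ns reference must be taken
 * while shp is still reachable via shm_clist, i.e. before the
 * list_del_init() - afterwards the count may already have hit 0
 * and teardown (shm_destroy()/free_ipc_work()) may have run. */
struct toy_ref {
	int count;
};

bool toy_get_not_zero(struct toy_ref *r)
{
	if (r->count == 0)
		return false;	/* teardown already committed */
	r->count++;
	return true;
}

void toy_put(struct toy_ref *r)
{
	r->count--;
}
```
With this model, taking the reference after unlinking corresponds to calling `toy_get_not_zero()` when the count may already be 0 - exactly the window Eric points out.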


* Re: [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified)
  2021-11-07 19:51           ` Eric W. Biederman
@ 2021-11-08 18:34             ` Manfred Spraul
  0 siblings, 0 replies; 12+ messages in thread
From: Manfred Spraul @ 2021-11-08 18:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Mikhalitsyn, linux-kernel, Andrew Morton,
	Davidlohr Bueso, Greg KH, Andrei Vagin, Pavel Tikhomirov,
	Vasily Averin, Alexander Mikhalitsyn, stable

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]

Hi Eric,

On 11/7/21 20:51, Eric W. Biederman wrote:
> Manfred Spraul <manfred@colorfullife.com> writes:
>
>>
>>> +
>>> +		/* Guarantee shp lives after task_lock is dropped */
>>> +		ipc_getref(&shp->shm_perm);
>>> +
>> task_lock() doesn't help: As soon as shm_creator is set to NULL,
>> IPC_RMID won't acquire task_lock() anymore.
>>
>> Thus shp can disappear before we arrive at this ipc_getref.
>>
>> [Yes, I think I have introduced this bug. ]
>>
>> Corrected version attached.
>>
>>
[...]
>> +		/* 2) unlink */
>> +		list_del_init(&shp->shm_clist);
>> +
[...]
>> +		/*
>> +		 * 5) get a reference to the namespace.
>> +		 *    The refcount could be already 0. If it is 0, then
>> +		 *    the shm objects will be freed by free_ipc_work().
>> +		 */
>> +		ns = get_ipc_ns_not_zero(ns);
>                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Isn't this increment also too late?  Doesn't this need to move up
> by ipc_rcu_getref while shp is still on the list?

Yes, thanks.

Updated patch attached.

> Assuming the code is running in parallel with shm_exit_ns after removal
> from shm_clist shm_destroy can run to completion and shm_exit_ns can
> run to completion and the ipc namespace can be freed.
>
> Eric

--

     Manfred

[-- Attachment #2: 0001-shm-extend-forced-shm-destroy-to-support-objects-fro.patch --]
[-- Type: text/x-patch, Size: 12950 bytes --]

From 58c606c60dc1dc1bcc8ae426fbe13ab971b1e979 Mon Sep 17 00:00:00 2001
From: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Date: Thu, 28 Oct 2021 01:43:48 +0300
Subject: [PATCH] shm: extend forced shm destroy to support objects from
 several IPC nses

Currently, the exit_shm() function is not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because
it leads to use-after-free (reproducer exists).

This patch is an attempt to fix the problem by extending the exit_shm
mechanism to handle destruction of shm objects from several IPC namespaces.

To achieve that we do several things:
1. add a namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we initialize
this pointer with the current task's IPC namespace
3. exit_shm() is fully reworked such that it traverses all
shp's in task->sysvshm.shm_clist and gets the IPC namespace not
from the current task as before, but from the shp object itself, then
calls shm_destroy(shp, ns).

Note: we need to be really careful here because, as noted in (1), our
pointer to the IPC namespace is not refcounted. To be on the safe side we
use the special helper get_ipc_ns_not_zero(), which takes a reference on
the IPC namespace only if it is not already being destroyed.

Q/A

Q: Why can we access shp->ns memory through a non-refcounted pointer?
A: Because the shp object lifetime is always shorter than the lifetime
of the IPC namespace it lives in, so if we get the shp object from
task->sysvshm.shm_clist while holding task_lock(task), nobody can
steal our namespace.

Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
A: No. It just fixes a previously uncovered case where a process may leave
an IPC namespace without getting its task->sysvshm.shm_clist cleaned up.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
- comment corrections, shm_clist_lock was replaced by task_lock()
- if/else exchanged as recommended by Eric
- bugfix for shp refcounting. Actually, this is probably the real
  use-after-free bug: the current code contains a use-after-free even
  without using namespaces.
- add rcu_read_lock() into shm_clist_rm(), to protect against
  a use-after-free for shm_creator->alloc_lock.
- bugfix for review finding from Eric: ns could disappear as well,
  the list_del_init() must be done last.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/ipc_namespace.h |  15 +++
 include/linux/sched/task.h    |   2 +-
 ipc/shm.c                     | 189 +++++++++++++++++++++++++---------
 3 files changed, 159 insertions(+), 47 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..b75395ec8d52 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	if (ns) {
+		if (refcount_inc_not_zero(&ns->ns.count))
+			return ns;
+	}
+
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
+
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba88a6987400..058d7f371e25 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -158,7 +158,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
diff --git a/ipc/shm.c b/ipc/shm.c
index ab749be6d8b7..048eb183b24b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */
 	struct pid		*shm_lprid;
 	struct ucounts		*mlock_ucounts;
 
-	/* The task created the shm object.  NULL if the task is dead. */
+	/*
+	 * The task created the shm object, for
+	 * task_lock(shp->shm_creator)
+	 */
 	struct task_struct	*shm_creator;
-	struct list_head	shm_clist;	/* list by creator */
+
+	/*
+	 * List by creator. task_lock(->shm_creator) required for read/write.
+	 * If list_empty(), then the creator is dead already.
+	 */
+	struct list_head	shm_clist;
+	struct ipc_namespace	*ns;
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct shmid_kernel *shp;
 
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+	WARN_ON(ns != shp->ns);
 
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
@@ -225,10 +235,43 @@ static void shm_rcu_free(struct rcu_head *head)
 	kfree(shp);
 }
 
-static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
+/*
+ * It has to be called with shp locked.
+ * It must be called before ipc_rmid()
+ */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
 {
-	list_del(&s->shm_clist);
-	ipc_rmid(&shm_ids(ns), &s->shm_perm);
+	struct task_struct *creator;
+
+	/* ensure that shm_creator does not disappear */
+	rcu_read_lock();
+
+	/*
+	 * A concurrent exit_shm may do a list_del_init() as well.
+	 * Just do nothing if exit_shm already did the work
+	 */
+	if (!list_empty(&shp->shm_clist)) {
+		/*
+		 * shp->shm_creator is guaranteed to be valid *only*
+		 * if shp->shm_clist is not empty.
+		 */
+		creator = shp->shm_creator;
+
+		task_lock(creator);
+		/*
+		 * list_del_init() is a nop if the entry was already removed
+		 * from the list.
+		 */
+		list_del_init(&shp->shm_clist);
+		task_unlock(creator);
+	}
+	rcu_read_unlock();
+}
+
+static inline void shm_rmid(struct shmid_kernel *s)
+{
+	shm_clist_rm(s);
+	ipc_rmid(&shm_ids(s->ns), &s->shm_perm);
 }
 
 
@@ -283,7 +326,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	shm_rmid(ns, shp);
+	shm_rmid(shp);
 	shm_unlock(shp);
 	if (!is_file_hugepages(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
@@ -306,10 +349,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  *
  * 2) sysctl kernel.shm_rmid_forced is set to 1.
  */
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
 {
 	return (shp->shm_nattch == 0) &&
-	       (ns->shm_rmid_forced ||
+	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
 }
 
@@ -340,7 +383,7 @@ static void shm_close(struct vm_area_struct *vma)
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
@@ -361,10 +404,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (!list_empty(&shp->shm_clist))
 		return 0;
 
-	if (shm_may_destroy(ns, shp)) {
+	if (shm_may_destroy(shp)) {
 		shm_lock_by_ptr(shp);
 		shm_destroy(ns, shp);
 	}
@@ -382,48 +425,97 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
-	struct shmid_kernel *shp, *n;
+	for (;;) {
+		struct shmid_kernel *shp;
+		struct ipc_namespace *ns;
 
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
+		task_lock(task);
+
+		if (list_empty(&task->sysvshm.shm_clist)) {
+			task_unlock(task);
+			break;
+		}
+
+		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+				shm_clist);
 
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
 		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
+		 * 1) Get a pointer to the ipc namespace. This pointer is
+		 * guaranteed to be valid: the lifetime of shp is always
+		 * shorter than the lifetime of the namespace in which
+		 * shp lives. Since we hold task_lock(), shp cannot be
+		 * freed under us.
 		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+		ns = shp->ns;
 
-	/*
-	 * Destroy all already created segments, that were not yet mapped,
-	 * and mark any mapped as orphan to cover the sysctl toggling.
-	 * Destroy is skipped if shm_may_destroy() returns false.
-	 */
-	down_write(&shm_ids(ns).rwsem);
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		shp->shm_creator = NULL;
+		/*
+		 * 2) If kernel.shm_rmid_forced is not set then only keep track of
+		 * which shmids are orphaned, so that a later set of the sysctl
+		 * can clean them up.
+		 */
+		if (!ns->shm_rmid_forced)
+			goto unlink_continue;
 
-		if (shm_may_destroy(ns, shp)) {
-			shm_lock_by_ptr(shp);
-			shm_destroy(ns, shp);
+		/*
+		 * 3) get a reference to the namespace.
+		 *    The refcount could be already 0. If it is 0, then
+		 *    the shm objects will be freed by free_ipc_work().
+		 */
+		ns = get_ipc_ns_not_zero(ns);
+		if (!ns) {
+unlink_continue:
+			list_del_init(&shp->shm_clist);
+			task_unlock(task);
+			continue;
 		}
-	}
 
-	/* Remove the list head from any segments still attached. */
-	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
+		/*
+		 * 4) get a reference to shp.
+		 *   This cannot fail: shm_clist_rm() is called before
+		 *   ipc_rmid(), thus the refcount cannot be 0.
+		 */
+		WARN_ON(!ipc_rcu_getref(&shp->shm_perm));
+
+		/*
+		 * 5) unlink the shm segment from the list of segments
+		 *    created by current.
+		 *    This must be done last. After unlinking,
+		 *    only the refcounts obtained above prevent IPC_RMID
+		 *    from destroying the segment or the namespace.
+		 */
+		list_del_init(&shp->shm_clist);
+
+		task_unlock(task);
+
+		/*
+		 * 6) We have all references.
+		 *    Lock shp and, if needed, destroy it.
+		 */
+		down_write(&shm_ids(ns).rwsem);
+		shm_lock_by_ptr(shp);
+		/*
+		 * rcu_read_lock was implicitly taken in shm_lock_by_ptr,
+		 * so it is safe to call ipc_rcu_putref here.
+		 */
+		ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+		if (ipc_valid_object(&shp->shm_perm)) {
+			if (shm_may_destroy(shp))
+				shm_destroy(ns, shp);
+			else
+				shm_unlock(shp);
+		} else {
+			/*
+			 * Someone else deleted shp from the namespace
+			 * idr/kht while we were waiting.
+			 * Just unlock and continue.
+			 */
+			shm_unlock(shp);
+		}
+
+		up_write(&shm_ids(ns).rwsem);
+		put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */
+	}
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -680,7 +772,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (error < 0)
 		goto no_id;
 
+	shp->ns = ns;
+
+	task_lock(current);
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	task_unlock(current);
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1669,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
-- 
2.31.1



Thread overview: 12+ messages
2021-10-27 22:43 [PATCH 0/2] shm: shm_rmid_forced feature fixes Alexander Mikhalitsyn
2021-10-27 22:43 ` [PATCH 1/2] ipc: WARN if trying to remove ipc object which is absent Alexander Mikhalitsyn
2021-10-27 22:43 ` [PATCH 2/2] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
2021-10-30  4:26   ` Eric W. Biederman
2021-10-30 13:11     ` Manfred Spraul
2021-11-05 17:46   ` Eric W. Biederman
2021-11-05 19:03     ` Manfred Spraul
2021-11-05 21:34       ` [RFC] shm: extend forced shm destroy to support objects from several IPC nses (simplified) Eric W. Biederman
2021-11-06  7:50         ` Manfred Spraul
2021-11-06 14:42         ` Manfred Spraul
2021-11-07 19:51           ` Eric W. Biederman
2021-11-08 18:34             ` Manfred Spraul
