On 10/30/21 06:26, Eric W. Biederman wrote: > Alexander Mikhalitsyn writes: > >> Currently, exit_shm function not designed to work properly when >> task->sysvshm.shm_clist holds shm objects from different IPC namespaces. >> >> This is a real pain when sysctl kernel.shm_rmid_forced = 1, because >> it leads to use-after-free (reproducer exists). >> >> That particular patch is attempt to fix the problem by extending exit_shm >> mechanism to handle shm's destroy from several IPC ns'es. >> >> To achieve that we do several things: >> 1. add namespace (non-refcounted) pointer to the struct shmid_kernel >> 2. during new shm object creation (newseg()/shmget syscall) we initialize >> this pointer by current task IPC ns >> 3. exit_shm() fully reworked such that it traverses over all >> shp's in task->sysvshm.shm_clist and gets IPC namespace not >> from current task as it was before but from shp's object itself, then >> call shm_destroy(shp, ns). >> >> Note. We need to be really careful here, because as it was said before >> (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using >> special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter >> only if IPC ns not in the "state of destruction". > > > >> Q/A >> >> Q: Why we can access shp->ns memory using non-refcounted pointer? >> A: Because shp object lifetime is always shorther >> than IPC namespace lifetime, so, if we get shp object from the >> task->sysvshm.shm_clist while holding task_lock(task) nobody can >> steal our namespace. > Not true. A struct shmid_kernel can outlive the namespace in which it > was created. I you look at do_shm_rmid which is called when the > namespace is destroyed for every shmid_kernel in the namespace that if > the struct shmid_kernel still has users only ipc_set_key_private is > called. The struct shmid_kernel continues to exist. No, shm_nattach is always 0 when a namespace is destroyed. Thus it is impossible that shmid_kernel continues to exist. Let's check all shm_nattach modifications: 1) do_shmat:     shp->shm_nattach++;     sfd->ns = get_ipc_ns(ns);     shp->shm_nattach--; pairs with    shm_release()         put_ipc_ns() 2) shm_open() only shp->shm_nattach++ shm_open unconditionally accesses shm_file_data, i.e. sfd must be valid, there must be a reference to the namespace pairs with shm_close() only shp->shm_nattach--; shm_close unconditionally accesses shm_file_data, i.e. sfd must be valid, there must be a reference to the namespace As shm_open()/close "nests" inside do_shmat: there is always a get_ipc_ns(). Or, much simpler: Check shm_open() and shm_close(): These two functions address a shm segment by namespace and  ID, not by a shm pointer. Thus _if_ it is possible that shm_nattach is > 0 at namespace destruction, then there would be far more issues. Or: Attached is a log file, a test application, and a patch that adds pr_info statements. The namespace is destroyed immediately when no segments are mapped, the destruction is delayed until exit() if there are mapped segments. >> Q: Does this patch change semantics of unshare/setns/clone syscalls? >> A: Not. It's just fixes non-covered case when process may leave >> IPC namespace without getting task->sysvshm.shm_clist list cleaned up. > > Just reading through exit_shm the code is not currently safe. > > At a minimum do_shm_rmid needs to set the shp->ns to NULL. Otherwise > the struct shmid_kernel can contain a namespace pointer after > the namespace exits. Which results in a different use after free. No [unless there are additional bugs] > > Beyond that there is dropping the task lock. The code holds a reference > to the namespace which means that the code does not need to worry about > free_ipcs. References from mappings are still possible. > > Which means that the code could see: > exit_shm() > task_lock() > shp = ...; > task_unlock() > shm_close() > down_write(&shm_ids(ns).rwsem); > ... > shm_destroy(shp); > up_write(&shm_ids(ns).rwsem); > down_write(&shm_ids(ns)->rwsem); > shm_lock_by_ptr(shp); /* use after free */ > > > I am trying to imagine how to close that race with the current code > structure. Maybe something could be done by looking at shm_nattach > count and making it safe to look at that count under the task_lock. There is no race. Before dropping task_lock, a reference to both the namespace and the shp pointer is obtained. Thus neither one can disappear. > But even then because shmid_kernel is still in the hash table it could > be mapped and unmapped in the window when task_lock was dropped. We have ipc_valid_object(), i.e. perm->deleted. If set, then the pointer and the spinlock are valid, even though the rest is already destroyed. ipc_rmid() just sets deleted, the (rcu delayed) kfree is done via ipc_rcu_putref(). > Alternatively shmctl(id, IPC_RMID) can be called in when task_lock is > dropped. Much less code is involved than mapping and unmapping so it is > much more likely to win the race. > > I don't see how that race can be closed. > > Am I missing something? > > Eric > > >> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity") >> >> Cc: "Eric W. Biederman" >> Cc: Andrew Morton >> Cc: Davidlohr Bueso >> Cc: Greg KH >> Cc: Andrei Vagin >> Cc: Pavel Tikhomirov >> Cc: Vasily Averin >> Cc: Manfred Spraul >> Cc: Alexander Mikhalitsyn >> Cc: stable@vger.kernel.org >> Co-developed-by: Manfred Spraul >> Signed-off-by: Manfred Spraul >> Signed-off-by: Alexander Mikhalitsyn Should/can I mark that I have tested the code? I would drop one change and one comment is incorrect, otherwise no findings. See the attached 0002 patch Tested-by: Manfred Spraul >> --- >> include/linux/ipc_namespace.h | 15 +++ >> include/linux/sched/task.h | 2 +- >> include/linux/shm.h | 2 +- >> ipc/shm.c | 170 +++++++++++++++++++++++++--------- >> 4 files changed, 142 insertions(+), 47 deletions(-) >> >> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h >> index 05e22770af51..b75395ec8d52 100644 >> --- a/include/linux/ipc_namespace.h >> +++ b/include/linux/ipc_namespace.h >> @@ -131,6 +131,16 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns) >> return ns; >> } >> >> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns) >> +{ >> + if (ns) { >> + if (refcount_inc_not_zero(&ns->ns.count)) >> + return ns; >> + } >> + >> + return NULL; >> +} >> + >> extern void put_ipc_ns(struct ipc_namespace *ns); >> #else >> static inline struct ipc_namespace *copy_ipcs(unsigned long flags, >> @@ -147,6 +157,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns) >> return ns; >> } >> >> +static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns) >> +{ >> + return ns; >> +} >> + >> static inline void put_ipc_ns(struct ipc_namespace *ns) >> { >> } >> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h >> index ef02be869cf2..bfdf84dab4be 100644 >> --- a/include/linux/sched/task.h >> +++ b/include/linux/sched/task.h >> @@ -157,7 +157,7 @@ static inline struct vm_struct *task_stack_vm_area(const struct task_struct *t) >> * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring >> * subscriptions and synchronises with wait4(). Also used in procfs. Also >> * pins the final release of task.io_context. Also protects ->cpuset and >> - * ->cgroup.subsys[]. And ->vfork_done. >> + * ->cgroup.subsys[]. And ->vfork_done. And ->sysvshm.shm_clist. >> * >> * Nests both inside and outside of read_lock(&tasklist_lock). >> * It must not be nested with write_lock_irq(&tasklist_lock), >> diff --git a/include/linux/shm.h b/include/linux/shm.h >> index d8e69aed3d32..709f6d0451c0 100644 >> --- a/include/linux/shm.h >> +++ b/include/linux/shm.h >> @@ -11,7 +11,7 @@ struct file; >> >> #ifdef CONFIG_SYSVIPC >> struct sysv_shm { >> - struct list_head shm_clist; >> + struct list_head shm_clist; >> }; >> This is a whitespace only change. We can drop it. >> long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr, >> diff --git a/ipc/shm.c b/ipc/shm.c >> index 748933e376ca..29667e17b12a 100644 >> --- a/ipc/shm.c >> +++ b/ipc/shm.c >> @@ -62,9 +62,18 @@ struct shmid_kernel /* private to the kernel */ >> struct pid *shm_lprid; >> struct ucounts *mlock_ucounts; >> >> - /* The task created the shm object. NULL if the task is dead. */ >> + /* >> + * The task created the shm object, for looking up >> + * task->sysvshm.shm_clist_lock >> + */ >> struct task_struct *shm_creator; >> - struct list_head shm_clist; /* list by creator */ >> + >> + /* >> + * list by creator. shm_clist_lock required for read/write >> + * if list_empty(), then the creator is dead already >> + */ shm_clist_lock was replaced by task_lock(->shm_creator). >> + struct list_head shm_clist; >> + struct ipc_namespace *ns; >> } __randomize_layout; >> >> /* shm_mode upper byte flags */ >> @@ -115,6 +124,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) >> struct shmid_kernel *shp; >> >> shp = container_of(ipcp, struct shmid_kernel, shm_perm); >> + WARN_ON(ns != shp->ns); >> >> if (shp->shm_nattch) { >> shp->shm_perm.mode |= SHM_DEST; >> @@ -225,10 +235,36 @@ static void shm_rcu_free(struct rcu_head *head) >> kfree(shp); >> } >> >> -static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s) >> +/* >> + * It has to be called with shp locked. >> + * It must be called before ipc_rmid() >> + */ >> +static inline void shm_clist_rm(struct shmid_kernel *shp) >> { >> - list_del(&s->shm_clist); >> - ipc_rmid(&shm_ids(ns), &s->shm_perm); >> + struct task_struct *creator; >> + >> + /* >> + * A concurrent exit_shm may do a list_del_init() as well. >> + * Just do nothing if exit_shm already did the work >> + */ >> + if (list_empty(&shp->shm_clist)) >> + return; >> + >> + /* >> + * shp->shm_creator is guaranteed to be valid *only* >> + * if shp->shm_clist is not empty. >> + */ >> + creator = shp->shm_creator; >> + >> + task_lock(creator); >> + list_del_init(&shp->shm_clist); >> + task_unlock(creator); > Lock ordering > rwsem > ipc_lock > task_lock > correct. >> +} >> + >> +static inline void shm_rmid(struct shmid_kernel *s) >> +{ >> + shm_clist_rm(s); >> + ipc_rmid(&shm_ids(s->ns), &s->shm_perm); >> } >> >> >> @@ -283,7 +319,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp) >> shm_file = shp->shm_file; >> shp->shm_file = NULL; >> ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT; >> - shm_rmid(ns, shp); >> + shm_rmid(shp); >> shm_unlock(shp); >> if (!is_file_hugepages(shm_file)) >> shmem_lock(shm_file, 0, shp->mlock_ucounts); >> @@ -306,10 +342,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp) >> * >> * 2) sysctl kernel.shm_rmid_forced is set to 1. >> */ >> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp) >> +static bool shm_may_destroy(struct shmid_kernel *shp) >> { >> return (shp->shm_nattch == 0) && >> - (ns->shm_rmid_forced || >> + (shp->ns->shm_rmid_forced || >> (shp->shm_perm.mode & SHM_DEST)); >> } >> >> @@ -340,7 +376,7 @@ static void shm_close(struct vm_area_struct *vma) >> ipc_update_pid(&shp->shm_lprid, task_tgid(current)); >> shp->shm_dtim = ktime_get_real_seconds(); >> shp->shm_nattch--; >> - if (shm_may_destroy(ns, shp)) >> + if (shm_may_destroy(shp)) >> shm_destroy(ns, shp); >> else >> shm_unlock(shp); >> @@ -361,10 +397,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data) >> * >> * As shp->* are changed under rwsem, it's safe to skip shp locking. >> */ >> - if (shp->shm_creator != NULL) >> + if (!list_empty(&shp->shm_clist)) >> return 0; >> >> - if (shm_may_destroy(ns, shp)) { >> + if (shm_may_destroy(shp)) { >> shm_lock_by_ptr(shp); >> shm_destroy(ns, shp); >> } >> @@ -382,48 +418,87 @@ void shm_destroy_orphaned(struct ipc_namespace *ns) >> /* Locking assumes this will only be called with task == current */ >> void exit_shm(struct task_struct *task) >> { >> - struct ipc_namespace *ns = task->nsproxy->ipc_ns; >> - struct shmid_kernel *shp, *n; >> + for (;;) { >> + struct shmid_kernel *shp; >> + struct ipc_namespace *ns; >> >> - if (list_empty(&task->sysvshm.shm_clist)) >> - return; >> + task_lock(task); >> + >> + if (list_empty(&task->sysvshm.shm_clist)) { >> + task_unlock(task); >> + break; >> + } >> + >> + shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel, >> + shm_clist); >> + >> + /* 1) unlink */ >> + list_del_init(&shp->shm_clist); >> >> - /* >> - * If kernel.shm_rmid_forced is not set then only keep track of >> - * which shmids are orphaned, so that a later set of the sysctl >> - * can clean them up. >> - */ >> - if (!ns->shm_rmid_forced) { >> - down_read(&shm_ids(ns).rwsem); >> - list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist) >> - shp->shm_creator = NULL; >> /* >> - * Only under read lock but we are only called on current >> - * so no entry on the list will be shared. >> + * 2) Get pointer to the ipc namespace. It is worth to say >> + * that this pointer is guaranteed to be valid because >> + * shp lifetime is always shorter than namespace lifetime >> + * in which shp lives. >> + * We taken task_lock it means that shp won't be freed. >> */ >> - list_del(&task->sysvshm.shm_clist); >> - up_read(&shm_ids(ns).rwsem); >> - return; >> - } >> + ns = shp->ns; >> >> - /* >> - * Destroy all already created segments, that were not yet mapped, >> - * and mark any mapped as orphan to cover the sysctl toggling. >> - * Destroy is skipped if shm_may_destroy() returns false. >> - */ >> - down_write(&shm_ids(ns).rwsem); >> - list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) { >> - shp->shm_creator = NULL; >> + /* >> + * 3) If kernel.shm_rmid_forced is not set then only keep track of >> + * which shmids are orphaned, so that a later set of the sysctl >> + * can clean them up. >> + */ >> + if (!ns->shm_rmid_forced) { >> + task_unlock(task); >> + continue; >> + } >> >> - if (shm_may_destroy(ns, shp)) { >> + /* >> + * 4) get a reference to the namespace. >> + * The refcount could be already 0. If it is 0, then >> + * the shm objects will be free by free_ipc_work(). >> + */ >> + ns = get_ipc_ns_not_zero(ns); >> + if (ns) { >> + /* >> + * 5) get a reference to the shp itself. >> + * This cannot fail: shm_clist_rm() is called before >> + * ipc_rmid(), thus the refcount cannot be 0. >> + */ >> + WARN_ON(!ipc_rcu_getref(&shp->shm_perm)); >> + } >> + >> + task_unlock(task); > <<<<<<<<< BOOM >>>>>>> > > I don't see anything that prevents another task from > calling shm_destroy(ns, shp) here and freeing it before > this task can take the rwsem for writing. shm_destroy() can be called. But due to the ipc_rcu_getref(), the structure will remain valid. >> + >> + if (ns) { >> + down_write(&shm_ids(ns).rwsem); >> shm_lock_by_ptr(shp); >> - shm_destroy(ns, shp); >> + /* >> + * rcu_read_lock was implicitly taken in >> + * shm_lock_by_ptr, it's safe to call >> + * ipc_rcu_putref here >> + */ >> + ipc_rcu_putref(&shp->shm_perm, shm_rcu_free); >> + >> + if (ipc_valid_object(&shp->shm_perm)) { And this will return false if there was a shm_destroy(). >> + if (shm_may_destroy(shp)) >> + shm_destroy(ns, shp); >> + else >> + shm_unlock(shp); >> + } else { >> + /* >> + * Someone else deleted the shp from namespace >> + * idr/kht while we have waited. >> + * Just unlock and continue. >> + */ -> just do a NOP if shm_destroy() was alread performed. Actually, the same design is used by find_alloc_undo() in ipc/sem.c. >> + shm_unlock(shp); >> + } >> + >> + up_write(&shm_ids(ns).rwsem); >> + put_ipc_ns(ns); /* paired with get_ipc_ns_not_zero */ >> } >> } >> - >> - /* Remove the list head from any segments still attached. */ >> - list_del(&task->sysvshm.shm_clist); >> - up_write(&shm_ids(ns).rwsem); >> } >> >> static vm_fault_t shm_fault(struct vm_fault *vmf) >> @@ -680,7 +755,11 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) >> if (error < 0) >> goto no_id; >> >> + shp->ns = ns; >> + >> + task_lock(current); >> list_add(&shp->shm_clist, ¤t->sysvshm.shm_clist); >> + task_unlock(current); >> >> /* >> * shmid gets reported as "inode#" in /proc/pid/maps. >> @@ -1573,7 +1652,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, >> down_write(&shm_ids(ns).rwsem); >> shp = shm_lock(ns, shmid); >> shp->shm_nattch--; >> - if (shm_may_destroy(ns, shp)) >> + >> + if (shm_may_destroy(shp)) >> shm_destroy(ns, shp); >> else >> shm_unlock(shp);