* [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
@ 2021-07-06 13:22 Alexander Mikhalitsyn
  2021-07-06 13:22 ` [PATCH 1/2] shm: skip shm_destroy " Alexander Mikhalitsyn
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2021-07-06 13:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Andrew Morton, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Alexander Mikhalitsyn

Hello,

The IPC namespace has a shm_rmid_forced feature, which is per IPC namespace
and controlled by the kernel.shm_rmid_forced sysctl. When the feature is
turned on, all sysvshm segments are destroyed on task exit (and on
unshare(CLONE_NEWIPC)) by the exit_shm(struct task_struct *task) function.
But there is a problem if the task has changed its IPC namespace since the
shmget() call. In that situation exit_shm() will call
shm_destroy(<new_ipc_namespace_ptr>, <sysvshmem_from_old_ipc_namespace>),
which frees a sysvshm object that is still attached to the old IPC namespace;
later, during the old IPC namespace cleanup, we will try to free the same
sysvshm object a second time and hit a double free :)
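For illustration, a minimal sketch of the problematic sequence (just a
sketch, error handling omitted; it assumes kernel.shm_rmid_forced=1 and
that argv[1] points at another IPC namespace, e.g. some /proc/<pid>/ns/ipc):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sched.h>
	#include <sys/ipc.h>
	#include <sys/shm.h>

	int main(int argc, char **argv)
	{
		/* segment is created in the current IPC namespace and
		 * linked into current->sysvshm.shm_clist */
		shmget(0xAAAA, 4096, IPC_CREAT | 0700);

		/* switch IPC namespaces; shm_clist is NOT cleared here */
		setns(open(argv[1], O_RDONLY), CLONE_NEWIPC);

		/* on exit, exit_shm() passes the new namespace pointer to
		 * shm_destroy() for a segment that still lives in the old
		 * namespace; the old namespace cleanup later frees it again */
		return 0;
	}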

The first patch solves this problem by postponing shm_destroy() until the
IPC namespace cleanup runs.
The second patch helps to prevent (or easily catch) such bugs in the future
by adding corresponding WARNings.

Regards,
Alex

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Milton Miller <miltonm@bga.com>
Cc: Jack Miller <millerjo@us.ibm.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>

Alexander Mikhalitsyn (2):
  shm: skip shm_destroy if task IPC namespace was changed
  ipc: WARN if trying to remove ipc object which is absent

 ipc/shm.c  | 10 +++++++++-
 ipc/util.c |  6 +++---
 2 files changed, 12 insertions(+), 4 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] shm: skip shm_destroy if task IPC namespace was changed
  2021-07-06 13:22 [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mikhalitsyn
@ 2021-07-06 13:22 ` Alexander Mikhalitsyn
  2021-07-06 13:22 ` [PATCH 2/2] ipc: WARN if trying to remove ipc object which is absent Alexander Mikhalitsyn
  2021-07-10  1:12 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Andrew Morton
  2 siblings, 0 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2021-07-06 13:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Andrew Morton, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Alexander Mikhalitsyn, stable

A task may change its IPC namespace via setns(), but its sysvshm
objects remain in the original IPC namespace (the IPC namespace
the task was in when shmget() was called). Let's skip the forced
shm destroy in that case, because we cannot determine the IPC
namespace from the shm object alone. These problematic sysvshm
objects will be destroyed on IPC namespace cleanup.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Milton Miller <miltonm@bga.com>
Cc: Jack Miller <millerjo@us.ibm.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
---
 ipc/shm.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e376ca..70a41171b8bb 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -173,6 +173,14 @@ static inline struct shmid_kernel *shm_obtain_object_check(struct ipc_namespace
 	return container_of(ipcp, struct shmid_kernel, shm_perm);
 }
 
+static inline bool is_shm_in_ns(struct ipc_namespace *ns, struct shmid_kernel *shp)
+{
+	int idx = ipcid_to_idx(shp->shm_perm.id);
+	struct shmid_kernel *tshp = shm_obtain_object(ns, idx);
+
+	return !IS_ERR(tshp) && tshp == shp;
+}
+
 /*
  * shm_lock_(check_) routines are called in the paths where the rwsem
  * is not necessarily held.
@@ -415,7 +423,7 @@ void exit_shm(struct task_struct *task)
 	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
 		shp->shm_creator = NULL;
 
-		if (shm_may_destroy(ns, shp)) {
+		if (is_shm_in_ns(ns, shp) && shm_may_destroy(ns, shp)) {
 			shm_lock_by_ptr(shp);
 			shm_destroy(ns, shp);
 		}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] ipc: WARN if trying to remove ipc object which is absent
  2021-07-06 13:22 [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mikhalitsyn
  2021-07-06 13:22 ` [PATCH 1/2] shm: skip shm_destroy " Alexander Mikhalitsyn
@ 2021-07-06 13:22 ` Alexander Mikhalitsyn
  2021-07-10  1:12 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Andrew Morton
  2 siblings, 0 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2021-07-06 13:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Andrew Morton, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Alexander Mikhalitsyn

Let's produce a warning if we try to remove a non-existent
IPC object from the IPC namespace kht/idr structures.

This allows us to catch possible bugs when ipc_rmid() is
called with inconsistent struct ipc_ids* and struct kern_ipc_perm*
arguments.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Milton Miller <miltonm@bga.com>
Cc: Jack Miller <millerjo@us.ibm.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
---
 ipc/util.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 0027e47626b7..45bb8ce6c42c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -447,8 +447,8 @@ static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
 static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 {
 	if (ipcp->key != IPC_PRIVATE)
-		rhashtable_remove_fast(&ids->key_ht, &ipcp->khtnode,
-				       ipc_kht_params);
+		WARN_ON(rhashtable_remove_fast(&ids->key_ht, &ipcp->khtnode,
+				       ipc_kht_params));
 }
 
 /**
@@ -498,7 +498,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 {
 	int idx = ipcid_to_idx(ipcp->id);
 
-	idr_remove(&ids->ipcs_idr, idx);
+	WARN_ON(idr_remove(&ids->ipcs_idr, idx) != ipcp);
 	ipc_kht_remove(ids, ipcp);
 	ids->in_use--;
 	ipcp->deleted = true;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-06 13:22 [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mikhalitsyn
  2021-07-06 13:22 ` [PATCH 1/2] shm: skip shm_destroy " Alexander Mikhalitsyn
  2021-07-06 13:22 ` [PATCH 2/2] ipc: WARN if trying to remove ipc object which is absent Alexander Mikhalitsyn
@ 2021-07-10  1:12 ` Andrew Morton
  2021-07-10 10:55   ` Alexander Mihalicyn
  2 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2021-07-10  1:12 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: linux-kernel, Milton Miller, Jack Miller, Pavel Tikhomirov,
	Alexander Mikhalitsyn, Manfred Spraul, Davidlohr Bueso,
	Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Eric W. Biederman

On Tue,  6 Jul 2021 16:22:57 +0300 Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> wrote:

> Hello,
> 
> Task IPC namespace shm's has shm_rmid_forced feature which is per IPC namespace
> and controlled by kernel.shm_rmid_forced sysctl. When feature is turned on,
> then during task exit (and unshare(CLONE_NEWIPC)) all sysvshm's will be destroyed
> by exit_shm(struct task_struct *task) function. But there is a problem if task
> was changed IPC namespace since shmget() call. In such situation exit_shm() function
> will try to call
> shm_destroy(<new_ipc_namespace_ptr>, <sysvshmem_from_old_ipc_namespace>)
> which leads to the situation when sysvshm object still attached to old
> IPC namespace but freed; later during old IPC namespace cleanup we will try to
> free such sysvshm object for the second time and will get the problem :)
> 
> First patch solves this problem by postponing shm_destroy to the moment when
> IPC namespace cleanup will be called.
> Second patch is useful to prevent (or easy catch) such bugs in the future by
> adding corresponding WARNings.
> 

(cc's added)

I assume that a

Fixes: b34a6b1da371ed8af ("ipc: introduce shm_rmid_forced sysctl") is
appropriate here?

A double-free is serious.  Should this fix be backported into earlier
kernels?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-10  1:12 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Andrew Morton
@ 2021-07-10 10:55   ` Alexander Mihalicyn
       [not found]     ` <CALgW_8VUk0us_umLncUv2DUMkOi3qixmT+YkHV4Dhpt_nNMZHw@mail.gmail.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Mihalicyn @ 2021-07-10 10:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Milton Miller, Jack Miller, Pavel Tikhomirov,
	Manfred Spraul, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Eric W. Biederman, Andrei Vagin,
	Christian Brauner

On Sat, Jul 10, 2021 at 4:12 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  6 Jul 2021 16:22:57 +0300 Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> wrote:
>
> > Hello,
> >
> > Task IPC namespace shm's has shm_rmid_forced feature which is per IPC namespace
> > and controlled by kernel.shm_rmid_forced sysctl. When feature is turned on,
> > then during task exit (and unshare(CLONE_NEWIPC)) all sysvshm's will be destroyed
> > by exit_shm(struct task_struct *task) function. But there is a problem if task
> > was changed IPC namespace since shmget() call. In such situation exit_shm() function
> > will try to call
> > shm_destroy(<new_ipc_namespace_ptr>, <sysvshmem_from_old_ipc_namespace>)
> > which leads to the situation when sysvshm object still attached to old
> > IPC namespace but freed; later during old IPC namespace cleanup we will try to
> > free such sysvshm object for the second time and will get the problem :)
> >
> > First patch solves this problem by postponing shm_destroy to the moment when
> > IPC namespace cleanup will be called.
> > Second patch is useful to prevent (or easy catch) such bugs in the future by
> > adding corresponding WARNings.
> >

Andrew,

thanks for your attention to the patches!

>
> (cc's added)
>
> I assume that a
>
> Fixes: b34a6b1da371ed8af ("ipc: introduce shm_rmid_forced sysctl") is
> appropriate here?

Not really; that patch looks fully safe because it always takes the
objects to free from a concrete IPC namespace idr, with appropriate
locking. For example:

/* Destroy all already created segments, but not mapped yet */
down_write(&shm_ids(ns).rw_mutex);
idr_for_each(&shm_ids(ns).ipcs_idr, &shm_try_destroy_current, ns);
up_write(&shm_ids(ns).rw_mutex);

Here we take ns from current->nsproxy, lock the idr, and destroy only
shms from this particular IPC namespace.

But ab602f79915 ("shm: make exit_shm work proportional to task
activity") introduced a new field (struct sysv_shm sysvshm) in
task_struct, and exit_shm() was changed to do:

list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist)
    shm_mark_orphan(shp, ns);

instead of the previous
idr_for_each(&shm_ids(ns).ipcs_idr, &shm_try_destroy_current, ns);

Now, using the setns() syscall, we can construct a situation where the
task->sysvshm.shm_clist list contains shm items from several (!) IPC
namespaces. Before that patch we only ever destroyed shms from
current->nsproxy->ipc_ns, but now we can do something like this:

shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- adds an item to task->sysvshm.shm_clist
setns(fd, CLONE_NEWIPC);
shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- adds an item to task->sysvshm.shm_clist
*now we have two items in task->sysvshm.shm_clist, but from different
IPC namespaces*

(I have a reproducer program and can send it privately.)

That's the problem. If we take a look at
int ksys_unshare(unsigned long unshare_flags)

we can see the following part:

if (unshare_flags & CLONE_NEWIPC) {
   /* Orphan segments in old ns (see sem above). */
   exit_shm(current);
   shm_init_task(current);
}

Here we clean up the list BEFORE changing the IPC namespace. So if we
do something like:
shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- adds an item to task->sysvshm.shm_clist
unshare(CLONE_NEWIPC); <-- task->sysvshm.shm_clist is cleaned and reinitialized here
shmget(0xAAAA, 4096, IPC_CREAT|0700); <-- adds an item to task->sysvshm.shm_clist

everything is fine!

So the semantics of setns() and unshare() differ here. We could fix
this problem by adding analogous exit_shm()/shm_init_task() calls to

static void commit_nsset(struct nsset *nsset)
...
#ifdef CONFIG_IPC_NS
if (flags & CLONE_NEWIPC) {
     exit_sem(me);
+    exit_shm(current);
+    shm_init_task(current);
}
#endif

With this change the semantics of unshare() and setns() would be equal
in terms of the shm_rmid_forced feature. But this may break some
applications which use setns() and rely on IPC resources not being
destroyed after that syscall. (CRIU uses setns() extensively, and we
would have to change our IPC ns C/R implementation a little if we took
this way of fixing the problem.)

I've proposed a change which keeps the old behaviour of setns() but
fixes the double free.

>
> A double-free is serious.  Should this fix be backported into earlier
> kernels?

Yes. The IPC namespace code has not changed significantly, which means
we can easily apply this patch to older kernels.
(I've CCed the stable list on the patch that contains the fix.)

CCed: Andrei Vagin and Christian Brauner

Regards,
Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
       [not found]     ` <CALgW_8VUk0us_umLncUv2DUMkOi3qixmT+YkHV4Dhpt_nNMZHw@mail.gmail.com>
@ 2021-07-11 10:33       ` Alexander Mihalicyn
  2021-07-11 11:46         ` Manfred Spraul
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Mihalicyn @ 2021-07-11 10:33 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Andrew Morton, linux-kernel, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Eric W. Biederman, Andrei Vagin,
	Christian Brauner

Hi, Manfred,

On Sun, Jul 11, 2021 at 12:13 PM Manfred Spraul
<manfred@colorfullife.com> wrote:
>
> Hi,
>
>
> Am Samstag, 10. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
>>
>>
>> Now, using setns() syscall, we can construct situation when on
>> task->sysvshm.shm_clist list
>> we have shm items from several (!) IPC namespaces.
>>
>>
> Does this imply that locking ist affected as well? According to the initial patch, accesses to shm_clist are protected by "the" IPC shm namespace rwsem. This can't work if the list contains objects from several namespaces.

Of course, you are right. I have to rework this part. I could add a
check to static int newseg(struct ipc_namespace *ns, struct ipc_params *params):
before adding a new shm to the task list, check that the list is either
empty OR that the items already present on the list belong to the same
namespace as current->nsproxy->ipc_ns.
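A rough sketch of that check (assuming struct shmid_kernel grows a
namespace backpointer shp->ns; the error code is only a placeholder):

	if (!list_empty(&current->sysvshm.shm_clist)) {
		struct shmid_kernel *first;

		first = list_first_entry(&current->sysvshm.shm_clist,
					 struct shmid_kernel, shm_clist);
		/* refuse to mix namespaces on one task's shm_clist */
		if (first->ns != ns)
			return -EINVAL;
	}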

>
> From ipc/shm.c:
>
>  397                down_read(&shm_ids(ns).rwsem);
>  398                list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
>  399                        shp->shm_creator = NULL;
>  400                /*
>  401                 * Only under read lock but we are only called on current
>  402                 * so no entry on the list will be shared.
>  403                 */
>  404                list_del(&task->sysvshm.shm_clist);
>  405                up_read(&shm_ids(ns).rwsem);
>
>
> Task A and B in same namespace
>
> - A: create shm object
>
> - A: setns()
>
> - in parallel B) does shmctl(IPC_RMID), A) does exit()

Yep.

>
>
>>
>>
>> So, semantics of setns() and unshare() is different here. We can fix
>> this problem by adding
>> analogical calls to exit_shm(), shm_init_task() into
>>
>> static void commit_nsset(struct nsset *nsset)
>> ...
>> #ifdef CONFIG_IPC_NS
>> if (flags & CLONE_NEWIPC) {
>>      exit_sem(me);
>> +   shm_init_task(current);
>> +  exit_shm(current);
>> }
>> #endif
>>
>> with this change semantics of unshare() and setns() will be equal in
>> terms of the shm_rmid_forced
>> feature.
>
> Additional advantage: exit_sem() and exit_shm() would appear as pairs, both in unshare () and setns().
>
>> But this may break some applications which using setns() and
>> IPC resources not destroying
>> after that syscall. (CRIU using setns() extensively and we have to
>> change our IPC ns C/R implementation
>> a little bit if we take this way of fixing the problem).
>>
>> I've proposed a change which keeps the old behaviour of setns() but
>> fixes double free.
>>
> Assuming that locking works, I would consider this as a namespace design question: Do we want to support that a task contains shm objects from several ipc namespaces?

This depends on what we mean by "task contains shm objects from
several ipc namespaces". There are two meanings:

1. The task has attached shm objects from different IPC namespaces.

We already support that by design. When we change namespace using
unshare(CLONE_NEWIPC), even with sysctl shm_rmid_forced=1, we do not
detach all shms from the task! Let's look at the exit_shm() function,
which is validly called when we do unshare():

if (shm_may_destroy(ns, shp)) { /* <--- (shp->shm_nattch == 0) &&
                                   (ns->shm_rmid_forced || (shp->shm_perm.mode & SHM_DEST)) */
    shm_lock_by_ptr(shp);
    shm_destroy(ns, shp);
}

Here everything depends on shp->shm_nattch, which will be non-zero if
the user does something like this:

int id = shmget(0xAAAA, 4096, IPC_CREAT|0700);
void *addr = shmat(id, NULL, 0); // <-- map the shm into the task address space
unshare(CLONE_NEWIPC); // <--- task->sysvshm.shm_clist is cleared, but shm 0xAAAA remains attached
id = shmget(0xBBBB, 4096, IPC_CREAT|0700); // <-- adds an item to task->sysvshm.shm_clist; now the list only contains an object from the new IPC namespace
addr = shmat(id, NULL, 0);

So the task->sysvshm.shm_clist list is used only for the shm_rmid_forced
feature. It doesn't affect any mm-related things like /proc/<pid>/maps
or similar.

2. The task->sysvshm.shm_clist list has items from different IPC namespaces.

I'm not sure whether we need that or not. But I'm ready to prepare a patch
for whichever option we choose:
a) just add exit_shm(current) + shm_init_task(current) to setns();
b) prepare a PATCHv2 with an appropriate check in newseg() to prevent
adding items from a different namespace to the list;
c) rework the algorithm so we can safely have items from different
namespaces in task->sysvshm.shm_clist.

And, of course, I will prepare a test case with this particular bug
reproducer to prevent future regressions and increase coverage.
(I'm not publishing the reproducer program directly on the list at the
moment because it may not be fully safe, but I think each of us already
knows how to trigger the problem.)

>
> Does it work everywhere (/proc/{pid}, ...)?
> --
>    Manfred

Thanks,
Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-11 10:33       ` Alexander Mihalicyn
@ 2021-07-11 11:46         ` Manfred Spraul
  2021-07-12  9:54           ` Alexander Mihalicyn
  0 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2021-07-11 11:46 UTC (permalink / raw)
  To: Alexander Mihalicyn
  Cc: Andrew Morton, linux-kernel, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Eric W. Biederman, Andrei Vagin,
	Christian Brauner

Hi Alex,


On Sunday, 11 July 2021, Alexander Mihalicyn <alexander@mihalicyn.com> wrote:
>
> Hi, Manfred,
>
> On Sun, Jul 11, 2021 at 12:13 PM Manfred Spraul
> <manfred@colorfullife.com> wrote:
> >
> > Hi,
> >
> >
> > Am Samstag, 10. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> >>
> >>
> >> Now, using setns() syscall, we can construct situation when on
> >> task->sysvshm.shm_clist list
> >> we have shm items from several (!) IPC namespaces.
> >>
> >>
> > Does this imply that locking ist affected as well? According to the initial patch, accesses to shm_clist are protected by "the" IPC shm namespace rwsem. This can't work if the list contains objects from several namespaces.
>
> Of course, you are right. I've to rework this part -> I can add check into
> static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> function and before adding new shm into task list check that list is empty OR
> an item which is present on the list from the same namespace as
> current->nsproxy->ipc_ns.
>
Ok. (Sorry, I only have smartphone internet, so I could not check
the patch fully.)

> >> I've proposed a change which keeps the old behaviour of setns() but
> >> fixes double free.
> >>
> > Assuming that locking works, I would consider this as a namespace design question: Do we want to support that a task contains shm objects from several ipc namespaces?
>
> This depends on what we mean by "task contains shm objects from
> several ipc namespaces". There are two meanings:
>
> 1. Task has attached shm object from different ipc namespaces
>
> We already support that by design. When we doing a change of namespace
> using unshare(CLONE_NEWIPC) even with
> sysctl shm_rmid_forced=1 we not detach all ipc's from task!

OK. Thus shm and sem have different behavior anyways.

>
> 2. Task task->sysvshm.shm_clist list has items from different IPC namespaces.
>
> I'm not sure, do we need that or not. But I'm ready to prepare a patch
> for any of the options which we choose:
> a) just add exit_shm(current)+shm_init_task(current);
> b) prepare PATCHv2 with appropriate check in the newseg() to prevent
> adding new items from different namespace to the list
> c) rework algorithm so we can safely have items from different
> namespaces in task->sysvshm.shm_clist
>
Before you write something, let's wait for what the others say. I don't
qualify as a shm expert.

a) is user-space visible, without any good excuse.
c) is probably the largest amount of changes.
b) The impact is not clear to me. Would it mean that shm_rmid_forced
would stop working after setns()?

Correct?

--
   Manfred

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-11 11:46         ` Manfred Spraul
@ 2021-07-12  9:54           ` Alexander Mihalicyn
  2021-07-12 19:18             ` Eric W. Biederman
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Mihalicyn @ 2021-07-12  9:54 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Andrew Morton, linux-kernel, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Eric W. Biederman, Andrei Vagin,
	Christian Brauner

Hello Manfred,

On Sun, Jul 11, 2021 at 2:47 PM Manfred Spraul <manfred@colorfullife.com> wrote:
>
> Hi Alex,
>
>
> Am Sonntag, 11. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> >
> > Hi, Manfred,
> >
> > On Sun, Jul 11, 2021 at 12:13 PM Manfred Spraul
> > <manfred@colorfullife.com> wrote:
> > >
> > > Hi,
> > >
> > >
> > > Am Samstag, 10. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> > >>
> > >>
> > >> Now, using setns() syscall, we can construct situation when on
> > >> task->sysvshm.shm_clist list
> > >> we have shm items from several (!) IPC namespaces.
> > >>
> > >>
> > > Does this imply that locking ist affected as well? According to the initial patch, accesses to shm_clist are protected by "the" IPC shm namespace rwsem. This can't work if the list contains objects from several namespaces.
> >
> > Of course, you are right. I've to rework this part -> I can add check into
> > static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> > function and before adding new shm into task list check that list is empty OR
> > an item which is present on the list from the same namespace as
> > current->nsproxy->ipc_ns.
> >
> Ok. (Sorry, I have only smartphone internet, thus I could not check
> the patch fully)
>
> > >> I've proposed a change which keeps the old behaviour of setns() but
> > >> fixes double free.
> > >>
> > > Assuming that locking works, I would consider this as a namespace design question: Do we want to support that a task contains shm objects from several ipc namespaces?
> >
> > This depends on what we mean by "task contains shm objects from
> > several ipc namespaces". There are two meanings:
> >
> > 1. Task has attached shm object from different ipc namespaces
> >
> > We already support that by design. When we doing a change of namespace
> > using unshare(CLONE_NEWIPC) even with
> > sysctl shm_rmid_forced=1 we not detach all ipc's from task!
>
> OK. Thus shm and sem have different behavior anyways.
>
> >
> > 2. Task task->sysvshm.shm_clist list has items from different IPC namespaces.
> >
> > I'm not sure, do we need that or not. But I'm ready to prepare a patch
> > for any of the options which we choose:
> > a) just add exit_shm(current)+shm_init_task(current);
> > b) prepare PATCHv2 with appropriate check in the newseg() to prevent
> > adding new items from different namespace to the list
> > c) rework algorithm so we can safely have items from different
> > namespaces in task->sysvshm.shm_clist
> >
> Before you write something, let's wait what the others say. I don't
> qualify AS shm expert
>
> a) is user space visible, without any good excuse

Yes, but maybe we can decide that this is not so critical?
We need more people here :)

> c) is probably highest amount of Changes

Yep. But OK, I will prepare the patches quickly.

> b) Impact for me not clear. Would it mean that shm_rmid_forced would
> stop to work after setns()

Yes, in this case the "forced" destroy would stop working for shms newly
created in the new IPC namespace, but everything would continue to work
as before for old segments. Maybe not an elegant solution...

>
> Correct?

Absolutely ;)

>
> --
>    Manfred

Regards,
Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-12  9:54           ` Alexander Mihalicyn
@ 2021-07-12 19:18             ` Eric W. Biederman
  2021-07-12 19:27               ` Alexander Mihalicyn
  2021-09-23 16:36               ` Manfred Spraul
  0 siblings, 2 replies; 17+ messages in thread
From: Eric W. Biederman @ 2021-07-12 19:18 UTC (permalink / raw)
  To: Alexander Mihalicyn
  Cc: Manfred Spraul, Andrew Morton, linux-kernel, Milton Miller,
	Jack Miller, Pavel Tikhomirov, Davidlohr Bueso, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Andrei Vagin, Christian Brauner

Alexander Mihalicyn <alexander@mihalicyn.com> writes:

> Hello Manfred,
>
> On Sun, Jul 11, 2021 at 2:47 PM Manfred Spraul <manfred@colorfullife.com> wrote:
>>
>> Hi Alex,
>>
>>
>> Am Sonntag, 11. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
>> >
>> > Hi, Manfred,
>> >
>> > On Sun, Jul 11, 2021 at 12:13 PM Manfred Spraul
>> > <manfred@colorfullife.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > >
>> > > Am Samstag, 10. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
>> > >>
>> > >>
>> > >> Now, using setns() syscall, we can construct situation when on
>> > >> task->sysvshm.shm_clist list
>> > >> we have shm items from several (!) IPC namespaces.
>> > >>
>> > >>
>> > > Does this imply that locking ist affected as well? According to the initial patch, accesses to shm_clist are protected by "the" IPC shm namespace rwsem. This can't work if the list contains objects from several namespaces.
>> >
>> > Of course, you are right. I've to rework this part -> I can add check into
>> > static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>> > function and before adding new shm into task list check that list is empty OR
>> > an item which is present on the list from the same namespace as
>> > current->nsproxy->ipc_ns.
>> >
>> Ok. (Sorry, I have only smartphone internet, thus I could not check
>> the patch fully)
>>
>> > >> I've proposed a change which keeps the old behaviour of setns() but
>> > >> fixes double free.
>> > >>
>> > > Assuming that locking works, I would consider this as a namespace design question: Do we want to support that a task contains shm objects from several ipc namespaces?
>> >
>> > This depends on what we mean by "task contains shm objects from
>> > several ipc namespaces". There are two meanings:
>> >
>> > 1. Task has attached shm object from different ipc namespaces
>> >
>> > We already support that by design. When we doing a change of namespace
>> > using unshare(CLONE_NEWIPC) even with
>> > sysctl shm_rmid_forced=1 we not detach all ipc's from task!
>>
>> OK. Thus shm and sem have different behavior anyways.
>>
>> >
>> > 2. Task task->sysvshm.shm_clist list has items from different IPC namespaces.
>> >
>> > I'm not sure, do we need that or not. But I'm ready to prepare a patch
>> > for any of the options which we choose:
>> > a) just add exit_shm(current)+shm_init_task(current);
>> > b) prepare PATCHv2 with appropriate check in the newseg() to prevent
>> > adding new items from different namespace to the list
>> > c) rework algorithm so we can safely have items from different
>> > namespaces in task->sysvshm.shm_clist
>> >
>> Before you write something, let's wait what the others say. I don't
>> qualify AS shm expert
>>
>> a) is user space visible, without any good excuse
>
> yes, but maybe we decide that this is not so critical?
> We need more people here :)

It is barely visible.  You have to do something very silly
to see this happening.  It is probably ok, but the work to
verify that nothing cares so that we can safely backport
the change is probably much more work than just updating
the list to handle shmid's for multiple namespaces.


>> c) is probably highest amount of Changes
>
> yep. but ok, I will prepare patches fast.

Given that this is a bug I think c) is the safest option.

A couple of suggestions.
1) We can replace the test "shm_creator != NULL" with
   "list_empty(&shp->shm_clist)" and remove shm_creator.

   Along with replacing "shm_creator = NULL" with
   "list_del_init(&shp->shm_clist)".

2) We can update shmat to do "list_del_init(&shp->shm_clist)"
   upon shmat.  The last unmap will still shm_destroy the
   shm segment as ns->shm_rmid_forced is set.

   For a multi-threaded process I think this will nicely clean up
   the clist, and make it clear that the clist only cares about
   those segments that have been created but never attached.

3) Put a non-reference counted struct ipc_namespace in struct
   shmid_kernel, and use it to remove the namespace parameter
   from shm_destroy.

I think that is enough to fix this bug with no changes in semantics,
no additional memory consumed, and an implementation that is easier
to read and perhaps a little faster.
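
Roughly, as an untested sketch of what (1) and (2) amount to:

	/* (1) in shm_try_destroy_orphaned(): test list membership
	 *     instead of shm_creator */
	if (!list_empty(&shp->shm_clist))
		return 0;

	/* (1) in exit_shm(): unlink the entry instead of clearing
	 *     shm_creator */
	list_del_init(&shp->shm_clist);

	/* (2) in do_shmat(), after a successful attach: drop the segment
	 *     from the creator list; with shm_rmid_forced set, the last
	 *     detach still triggers shm_destroy() via shm_nattch */
	list_del_init(&shp->shm_clist);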

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-12 19:18             ` Eric W. Biederman
@ 2021-07-12 19:27               ` Alexander Mihalicyn
  2021-07-14 17:30                 ` [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
  2021-07-14 17:42                 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mihalicyn
  2021-09-23 16:36               ` Manfred Spraul
  1 sibling, 2 replies; 17+ messages in thread
From: Alexander Mihalicyn @ 2021-07-12 19:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Manfred Spraul, Andrew Morton, linux-kernel, Pavel Tikhomirov,
	Davidlohr Bueso, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrei Vagin, Christian Brauner

Hi Eric,

On Mon, Jul 12, 2021 at 10:18 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Alexander Mihalicyn <alexander@mihalicyn.com> writes:
>
> > Hello Manfred,
> >
> > On Sun, Jul 11, 2021 at 2:47 PM Manfred Spraul <manfred@colorfullife.com> wrote:
> >>
> >> Hi Alex,
> >>
> >>
> >> Am Sonntag, 11. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> >> >
> >> > Hi, Manfred,
> >> >
> >> > On Sun, Jul 11, 2021 at 12:13 PM Manfred Spraul
> >> > <manfred@colorfullife.com> wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > >
> >> > > Am Samstag, 10. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> >> > >>
> >> > >>
> >> > >> Now, using setns() syscall, we can construct situation when on
> >> > >> task->sysvshm.shm_clist list
> >> > >> we have shm items from several (!) IPC namespaces.
> >> > >>
> >> > >>
> >> > > Does this imply that locking ist affected as well? According to the initial patch, accesses to shm_clist are protected by "the" IPC shm namespace rwsem. This can't work if the list contains objects from several namespaces.
> >> >
> >> > Of course, you are right. I've to rework this part -> I can add check into
> >> > static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> >> > function and before adding new shm into task list check that list is empty OR
> >> > an item which is present on the list from the same namespace as
> >> > current->nsproxy->ipc_ns.
> >> >
> >> Ok. (Sorry, I have only smartphone internet, thus I could not check
> >> the patch fully)
> >>
> >> > >> I've proposed a change which keeps the old behaviour of setns() but
> >> > >> fixes double free.
> >> > >>
> >> > > Assuming that locking works, I would consider this as a namespace design question: Do we want to support that a task contains shm objects from several ipc namespaces?
> >> >
> >> > This depends on what we mean by "task contains shm objects from
> >> > several ipc namespaces". There are two meanings:
> >> >
> >> > 1. Task has attached shm object from different ipc namespaces
> >> >
> >> > We already support that by design. When we doing a change of namespace
> >> > using unshare(CLONE_NEWIPC) even with
> >> > sysctl shm_rmid_forced=1 we not detach all ipc's from task!
> >>
> >> OK. Thus shm and sem have different behavior anyways.
> >>
> >> >
> >> > 2. Task task->sysvshm.shm_clist list has items from different IPC namespaces.
> >> >
> >> > I'm not sure, do we need that or not. But I'm ready to prepare a patch
> >> > for any of the options which we choose:
> >> > a) just add exit_shm(current)+shm_init_task(current);
> >> > b) prepare PATCHv2 with appropriate check in the newseg() to prevent
> >> > adding new items from different namespace to the list
> >> > c) rework algorithm so we can safely have items from different
> >> > namespaces in task->sysvshm.shm_clist
> >> >
> >> Before you write something, let's wait what the others say. I don't
> >> qualify AS shm expert
> >>
> >> a) is user space visible, without any good excuse
> >
> > yes, but maybe we decide that this is not so critical?
> > We need more people here :)
>
> It is barely visible.  You have to do something very silly
> to see this happening.  It is probably ok, but the work to
> verify that nothing cares so that we can safely backport
> the change is probably much more work than just updating
> the list to handle shmid's for multiple namespaces.
>
>
> >> c) is probably highest amount of Changes
> >
> > yep. but ok, I will prepare patches fast.
>
> Given that this is a bug I think c) is the safest option.
>
> A couple of suggestions.
> 1) We can replace the test "shm_creator != NULL" with
>    "list_empty(&shp->shm_clist)" and remove shm_creator.
>
>    Along with replacing "shm_creator = NULL" with
>    "list_del_init(&shp->shm_clist)".
>
> 2) We can update shmat to do "list_del_init(&shp->shm_clist)"
>    upon shmat.  The last unmap will still shm_destroy the
>    shm segment as ns->shm_rmid_forced is set.
>
>    For a multi-threaded process I think this will nicely clean up
>    the clist, and make it clear that the clist only cares about
>    those segments that have been created but never attached.
>
> 3) Put a non-reference counted struct ipc_namespace in struct
>    shmid_kernel, and use it to remove the namespace parameter
>    from shm_destroy.

Thanks for your detailed suggestions! ;)
I will prepare a kernel patch tomorrow, test what happens with CRIU,
and prepare a fix for it as well.

>
> I think that is enough to fix this bug with no changes in semantics,
> no additional memory consumed, and an implementation that is easier
> to read and perhaps a little faster.
>
> Eric

Regards,
Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses
  2021-07-12 19:27               ` Alexander Mihalicyn
@ 2021-07-14 17:30                 ` Alexander Mikhalitsyn
  2021-07-21  6:32                   ` Manfred Spraul
  2021-07-22 18:46                   ` Manfred Spraul
  2021-07-14 17:42                 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mihalicyn
  1 sibling, 2 replies; 17+ messages in thread
From: Alexander Mikhalitsyn @ 2021-07-14 17:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Mikhalitsyn, Eric W . Biederman, Manfred Spraul,
	Andrew Morton, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrei Vagin, Christian Brauner,
	Pavel Tikhomirov, Alexander Mikhalitsyn

This is a total rework of the fix.
Thanks to Eric Biederman for the suggestions (though I may have misunderstood some of them :))

I've tested it with a reproducer of the original problem, but of course it needs
detailed testing. I hope to get some general comments about the design and implementation.

ToDo: remove the now-unneeded "ns" argument from shm_destroy(), shm_rmid() and other functions.

Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
Cc: Eric W. Biederman <ebiederm@xmission.com> 
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
---
 include/linux/shm.h |   5 +-
 ipc/shm.c           | 128 +++++++++++++++++++++++++++++++-------------
 2 files changed, 95 insertions(+), 38 deletions(-)

diff --git a/include/linux/shm.h b/include/linux/shm.h
index d8e69aed3d32..8053ed1433df 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -11,14 +11,15 @@ struct file;
 
 #ifdef CONFIG_SYSVIPC
 struct sysv_shm {
-	struct list_head shm_clist;
+	spinlock_t		shm_clist_lock;
+	struct list_head	shm_clist;
 };
 
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
 	      unsigned long shmlba);
 bool is_file_shm_hugepages(struct file *file);
 void exit_shm(struct task_struct *task);
-#define shm_init_task(task) INIT_LIST_HEAD(&(task)->sysvshm.shm_clist)
+void shm_init_task(struct task_struct *task);
 #else
 struct sysv_shm {
 	/* empty */
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e376ca..a746886a0e00 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -65,6 +65,7 @@ struct shmid_kernel /* private to the kernel */
 	/* The task created the shm object.  NULL if the task is dead. */
 	struct task_struct	*shm_creator;
 	struct list_head	shm_clist;	/* list by creator */
+	struct ipc_namespace	*ns;
 } __randomize_layout;
 
 /* shm_mode upper byte flags */
@@ -115,6 +116,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct shmid_kernel *shp;
 
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
+	BUG_ON(shp->ns && ns != shp->ns);
 
 	if (shp->shm_nattch) {
 		shp->shm_perm.mode |= SHM_DEST;
@@ -225,9 +227,32 @@ static void shm_rcu_free(struct rcu_head *head)
 	kfree(shp);
 }
 
+static inline void task_shm_clist_lock(struct task_struct *task)
+{
+	spin_lock(&task->sysvshm.shm_clist_lock);
+}
+
+static inline void task_shm_clist_unlock(struct task_struct *task)
+{
+	spin_unlock(&task->sysvshm.shm_clist_lock);
+}
+
+void shm_clist_rm(struct shmid_kernel *shp)
+{
+	if (!shp->shm_creator)
+		return;
+
+	task_shm_clist_lock(shp->shm_creator);
+	list_del_init(&shp->shm_clist);
+	task_shm_clist_unlock(shp->shm_creator);
+	shp->shm_creator = NULL;
+}
+
 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
 {
-	list_del(&s->shm_clist);
+	WARN_ON(s->ns && ns != s->ns);
+	//list_del_init(&s->shm_clist);
+	shm_clist_rm(s);
 	ipc_rmid(&shm_ids(ns), &s->shm_perm);
 }
 
@@ -306,10 +331,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  *
  * 2) sysctl kernel.shm_rmid_forced is set to 1.
  */
-static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
+static bool shm_may_destroy(struct shmid_kernel *shp)
 {
 	return (shp->shm_nattch == 0) &&
-	       (ns->shm_rmid_forced ||
+	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
 }
 
@@ -340,7 +365,7 @@ static void shm_close(struct vm_area_struct *vma)
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
@@ -361,10 +386,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
 	 *
 	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
 	 */
-	if (shp->shm_creator != NULL)
+	if (!list_empty(&shp->shm_clist))
 		return 0;
 
-	if (shm_may_destroy(ns, shp)) {
+	if (shm_may_destroy(shp)) {
 		shm_lock_by_ptr(shp);
 		shm_destroy(ns, shp);
 	}
@@ -379,51 +404,77 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
 	up_write(&shm_ids(ns).rwsem);
 }
 
+void shm_init_task(struct task_struct *task)
+{
+	INIT_LIST_HEAD(&(task)->sysvshm.shm_clist);
+	spin_lock_init(&task->sysvshm.shm_clist_lock);
+}
+
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
+	LIST_HEAD(tmp);
 	struct shmid_kernel *shp, *n;
 
 	if (list_empty(&task->sysvshm.shm_clist))
 		return;
 
-	/*
-	 * If kernel.shm_rmid_forced is not set then only keep track of
-	 * which shmids are orphaned, so that a later set of the sysctl
-	 * can clean them up.
-	 */
-	if (!ns->shm_rmid_forced) {
-		down_read(&shm_ids(ns).rwsem);
-		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
-			shp->shm_creator = NULL;
-		/*
-		 * Only under read lock but we are only called on current
-		 * so no entry on the list will be shared.
-		 */
-		list_del(&task->sysvshm.shm_clist);
-		up_read(&shm_ids(ns).rwsem);
-		return;
-	}
+	rcu_read_lock(); /* for refcount_inc_not_zero */
+	task_shm_clist_lock(task);
 
-	/*
-	 * Destroy all already created segments, that were not yet mapped,
-	 * and mark any mapped as orphan to cover the sysctl toggling.
-	 * Destroy is skipped if shm_may_destroy() returns false.
-	 */
-	down_write(&shm_ids(ns).rwsem);
 	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
+		struct ipc_namespace *ns = shp->ns;
+
+		/*
+		 * Remove shm from task list and nullify shm_creator which
+		 * marks object as orphaned.
+		 *
+		 * If kernel.shm_rmid_forced is not set then only keep track of
+		 * which shmids are orphaned, so that a later set of the sysctl
+		 * can clean them up.
+		 */
+		list_del_init(&shp->shm_clist);
 		shp->shm_creator = NULL;
 
-		if (shm_may_destroy(ns, shp)) {
-			shm_lock_by_ptr(shp);
-			shm_destroy(ns, shp);
+		printk("exit_shm() %px refcnt=%u, id=%d,key=%x\n", shp,
+			refcount_read(&shp->shm_perm.refcount),
+			shp->shm_perm.id, shp->shm_perm.key
+		);
+
+		/*
+		 * Will destroy all already created segments, that were not yet mapped,
+		 * and mark any mapped as orphan to cover the sysctl toggling.
+		 * Destroy is skipped if shm_may_destroy() returns false.
+		 */
+		if (shp->ns->shm_rmid_forced && shm_may_destroy(shp)) {
+			/*
+			 * We may race with shm_exit_ns() if refcounter
+			 * already zero. Let's skip shm_destroy() of such
+			 * shm object as it will be destroyed during shm_exit_ns()
+			 */
+			if (!refcount_inc_not_zero(&ns->ns.count)) /* get_ipc_ns */
+				continue;
+
+			list_add(&shp->shm_clist, &tmp);
 		}
 	}
 
-	/* Remove the list head from any segments still attached. */
 	list_del(&task->sysvshm.shm_clist);
-	up_write(&shm_ids(ns).rwsem);
+	task_shm_clist_unlock(task);
+	rcu_read_unlock();
+
+	list_for_each_entry_safe(shp, n, &tmp, shm_clist) {
+		struct ipc_namespace *ns = shp->ns;
+
+		list_del_init(&shp->shm_clist);
+
+		down_write(&shm_ids(ns).rwsem);
+		shm_lock_by_ptr(shp);
+		/* will do put_ipc_ns(shp->ns) */
+		shm_destroy(ns, shp);
+		up_write(&shm_ids(ns).rwsem);
+		put_ipc_ns(ns); /* see refcount_inc_not_zero */
+	}
 }
 
 static vm_fault_t shm_fault(struct vm_fault *vmf)
@@ -681,6 +732,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 		goto no_id;
 
 	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+	shp->ns = ns;
 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
@@ -1573,7 +1625,11 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	down_write(&shm_ids(ns).rwsem);
 	shp = shm_lock(ns, shmid);
 	shp->shm_nattch--;
-	if (shm_may_destroy(ns, shp))
+#if 0
+	if (shp->shm_nattch)
+		list_del_init(&shp->shm_clist);
+#endif
+	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
 		shm_unlock(shp);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-12 19:27               ` Alexander Mihalicyn
  2021-07-14 17:30                 ` [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
@ 2021-07-14 17:42                 ` Alexander Mihalicyn
  1 sibling, 0 replies; 17+ messages in thread
From: Alexander Mihalicyn @ 2021-07-14 17:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Manfred Spraul, Andrew Morton, linux-kernel, Pavel Tikhomirov,
	Davidlohr Bueso, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrei Vagin, Christian Brauner, alexander.mikhalitsyn

Hi Eric,

I've sent the RFC patch "shm: extend forced shm destroy to support
objects from several IPC nses"; please take a look when time permits.

A few words about the design:
1. In exit_shm() I'm using a temporary list to collect shms before
destroying them, because I can't take the rwsem (which protects the ns
idr structure) under the spinlock (which protects the per-task list).
2. I have to take an IPC ns reference in exit_shm() to prevent a
possible race with shm_exit_ns().
3. I haven't added list_del_init(&shp->shm_clist) to the shmat()
syscall, because I can't see why that would be safe. Maybe I need to
think more about that.
4. I still need to remove the extra "ns" argument (todo).

Thanks,
Alex

On Mon, Jul 12, 2021 at 10:27 PM Alexander Mihalicyn
<alexander@mihalicyn.com> wrote:
>
> Hi Eric,
>
> On Mon, Jul 12, 2021 at 10:18 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Alexander Mihalicyn <alexander@mihalicyn.com> writes:
> >
> > > Hello Manfred,
> > >
> > > On Sun, Jul 11, 2021 at 2:47 PM Manfred Spraul <manfred@colorfullife.com> wrote:
> > >>
> > >> Hi Alex,
> > >>
> > >>
> > >> Am Sonntag, 11. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> > >> >
> > >> > Hi, Manfred,
> > >> >
> > >> > On Sun, Jul 11, 2021 at 12:13 PM Manfred Spraul
> > >> > <manfred@colorfullife.com> wrote:
> > >> > >
> > >> > > Hi,
> > >> > >
> > >> > >
> > >> > > Am Samstag, 10. Juli 2021 schrieb Alexander Mihalicyn <alexander@mihalicyn.com>:
> > >> > >>
> > >> > >>
> > >> > >> Now, using setns() syscall, we can construct situation when on
> > >> > >> task->sysvshm.shm_clist list
> > >> > >> we have shm items from several (!) IPC namespaces.
> > >> > >>
> > >> > >>
> > >> > > Does this imply that locking ist affected as well? According to the initial patch, accesses to shm_clist are protected by "the" IPC shm namespace rwsem. This can't work if the list contains objects from several namespaces.
> > >> >
> > >> > Of course, you are right. I've to rework this part -> I can add check into
> > >> > static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> > >> > function and before adding new shm into task list check that list is empty OR
> > >> > an item which is present on the list from the same namespace as
> > >> > current->nsproxy->ipc_ns.
> > >> >
> > >> Ok. (Sorry, I have only smartphone internet, thus I could not check
> > >> the patch fully)
> > >>
> > >> > >> I've proposed a change which keeps the old behaviour of setns() but
> > >> > >> fixes double free.
> > >> > >>
> > >> > > Assuming that locking works, I would consider this as a namespace design question: Do we want to support that a task contains shm objects from several ipc namespaces?
> > >> >
> > >> > This depends on what we mean by "task contains shm objects from
> > >> > several ipc namespaces". There are two meanings:
> > >> >
> > >> > 1. Task has attached shm object from different ipc namespaces
> > >> >
> > >> > We already support that by design. When we doing a change of namespace
> > >> > using unshare(CLONE_NEWIPC) even with
> > >> > sysctl shm_rmid_forced=1 we not detach all ipc's from task!
> > >>
> > >> OK. Thus shm and sem have different behavior anyways.
> > >>
> > >> >
> > >> > 2. Task task->sysvshm.shm_clist list has items from different IPC namespaces.
> > >> >
> > >> > I'm not sure, do we need that or not. But I'm ready to prepare a patch
> > >> > for any of the options which we choose:
> > >> > a) just add exit_shm(current)+shm_init_task(current);
> > >> > b) prepare PATCHv2 with appropriate check in the newseg() to prevent
> > >> > adding new items from different namespace to the list
> > >> > c) rework algorithm so we can safely have items from different
> > >> > namespaces in task->sysvshm.shm_clist
> > >> >
> > >> Before you write something, let's wait what the others say. I don't
> > >> qualify AS shm expert
> > >>
> > >> a) is user space visible, without any good excuse
> > >
> > > yes, but maybe we decide that this is not so critical?
> > > We need more people here :)
> >
> > It is barely visible.  You have to do something very silly
> > to see this happening.  It is probably ok, but the work to
> > verify that nothing cares so that we can safely backport
> > the change is probably much more work than just updating
> > the list to handle shmid's for multiple namespaces.
> >
> >
> > >> c) is probably highest amount of Changes
> > >
> > > yep. but ok, I will prepare patches fast.
> >
> > Given that this is a bug I think c) is the safest option.
> >
> > A couple of suggestions.
> > 1) We can replace the test "shm_creator != NULL" with
> >    "list_empty(&shp->shm_clist)" and remove shm_creator.
> >
> >    Along with replacing "shm_creator = NULL" with
> >    "list_del_init(&shp->shm_clist)".
> >
> > 2) We can update shmat to do "list_del_init(&shp->shm_clist)"
> >    upon shmat.  The last unmap will still shm_destroy the
> >    shm segment as ns->shm_rmid_forced is set.
> >
> >    For a multi-threaded process I think this will nicely clean up
> >    the clist, and make it clear that the clist only cares about
> >    those segments that have been created but never attached.
> >
> > 3) Put a non-reference counted struct ipc_namespace in struct
> >    shmid_kernel, and use it to remove the namespace parameter
> >    from shm_destroy.
>
> Thanks for your detailed suggestions! ;)
> I will prepare a patch tomorrow for kernel + test what's happening with
> CRIU and will prepare a fix for it.
>
> >
> > I think that is enough to fix this bug with no changes in semantics,
> > no additional memory consumed, and an implementation that is easier
> > to read and perhaps a little faster.
> >
> > Eric
>
> Regards,
> Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses
  2021-07-14 17:30                 ` [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
@ 2021-07-21  6:32                   ` Manfred Spraul
  2021-07-22 18:46                   ` Manfred Spraul
  1 sibling, 0 replies; 17+ messages in thread
From: Manfred Spraul @ 2021-07-21  6:32 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, linux-kernel
  Cc: Eric W . Biederman, Andrew Morton, Davidlohr Bueso,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrei Vagin,
	Christian Brauner, Pavel Tikhomirov, Alexander Mikhalitsyn,
	1vier1

Hi Alexander,

On 7/14/21 7:30 PM, Alexander Mikhalitsyn wrote:
> This is total rework of fix.
> Thanks to Eric Biederman for suggestions (but may be I've misunderstood some of them :))
>
> I've tested it with reproducer of the original problem. But of course it needs
> detailed testing. I hope that I get some general comments about design and implementation.
>
> ToDo: remove unneeded "ns" argument from shm_destroy, shm_rmid and other functions.
>
> Fixes: ab602f79915 ("shm: make exit_shm work proportional to task activity")
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Manfred Spraul <manfred@colorfullife.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Christian Brauner <christian@brauner.io>
> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
> ---
>   include/linux/shm.h |   5 +-
>   ipc/shm.c           | 128 +++++++++++++++++++++++++++++++-------------
>   2 files changed, 95 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/shm.h b/include/linux/shm.h
> index d8e69aed3d32..8053ed1433df 100644
> --- a/include/linux/shm.h
> +++ b/include/linux/shm.h
> @@ -11,14 +11,15 @@ struct file;
>   
>   #ifdef CONFIG_SYSVIPC
>   struct sysv_shm {
> -	struct list_head shm_clist;
> +	spinlock_t		shm_clist_lock;
> +	struct list_head	shm_clist;
>   };
>   

Can we use task_lock() instead of adding a spinlock to struct task_struct?

And: please document the lock nesting.

If I see it right:

- ns namespace rwsem

- shm_perm.lock

- the new shm_clist_lock
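
E.g. as a comment next to the new lock, something like this (just a
sketch, assuming the nesting above is correct):

/*
 * Lock nesting (outermost first):
 *   shm_ids(ns).rwsem
 *     shm_perm.lock  (shm_lock() / ipc_lock_object())
 *       task->sysvshm.shm_clist_lock
 */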


>   long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
>   	      unsigned long shmlba);
>   bool is_file_shm_hugepages(struct file *file);
>   void exit_shm(struct task_struct *task);
> -#define shm_init_task(task) INIT_LIST_HEAD(&(task)->sysvshm.shm_clist)
> +void shm_init_task(struct task_struct *task);
>   #else
>   struct sysv_shm {
>   	/* empty */
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 748933e376ca..a746886a0e00 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -65,6 +65,7 @@ struct shmid_kernel /* private to the kernel */
>   	/* The task created the shm object.  NULL if the task is dead. */
>   	struct task_struct	*shm_creator;
>   	struct list_head	shm_clist;	/* list by creator */
> +	struct ipc_namespace	*ns;
>   } __randomize_layout;
>   
>   /* shm_mode upper byte flags */
> @@ -115,6 +116,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>   	struct shmid_kernel *shp;
>   
>   	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> +	BUG_ON(shp->ns && ns != shp->ns);
>   
>   	if (shp->shm_nattch) {
>   		shp->shm_perm.mode |= SHM_DEST;
> @@ -225,9 +227,32 @@ static void shm_rcu_free(struct rcu_head *head)
>   	kfree(shp);
>   }
>   
> +static inline void task_shm_clist_lock(struct task_struct *task)
> +{
> +	spin_lock(&task->sysvshm.shm_clist_lock);
> +}
> +
> +static inline void task_shm_clist_unlock(struct task_struct *task)
> +{
> +	spin_unlock(&task->sysvshm.shm_clist_lock);
> +}
> +
> +void shm_clist_rm(struct shmid_kernel *shp)
> +{
> +	if (!shp->shm_creator)
> +		return;
> +
> +	task_shm_clist_lock(shp->shm_creator);
> +	list_del_init(&shp->shm_clist);
> +	task_shm_clist_unlock(shp->shm_creator);
> +	shp->shm_creator = NULL;
> +}
> +
>   static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
>   {
> -	list_del(&s->shm_clist);
> +	WARN_ON(s->ns && ns != s->ns);
> +	//list_del_init(&s->shm_clist);
> +	shm_clist_rm(s);
>   	ipc_rmid(&shm_ids(ns), &s->shm_perm);
>   }
>   
> @@ -306,10 +331,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>    *
>    * 2) sysctl kernel.shm_rmid_forced is set to 1.
>    */
> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> +static bool shm_may_destroy(struct shmid_kernel *shp)
>   {
>   	return (shp->shm_nattch == 0) &&
> -	       (ns->shm_rmid_forced ||
> +	       (shp->ns->shm_rmid_forced ||
>   		(shp->shm_perm.mode & SHM_DEST));
>   }
>   
> @@ -340,7 +365,7 @@ static void shm_close(struct vm_area_struct *vma)
>   	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>   	shp->shm_dtim = ktime_get_real_seconds();
>   	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +	if (shm_may_destroy(shp))
>   		shm_destroy(ns, shp);
>   	else
>   		shm_unlock(shp);
> @@ -361,10 +386,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>   	 *
>   	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>   	 */
> -	if (shp->shm_creator != NULL)
> +	if (!list_empty(&shp->shm_clist))
>   		return 0;
>   
> -	if (shm_may_destroy(ns, shp)) {
> +	if (shm_may_destroy(shp)) {
>   		shm_lock_by_ptr(shp);
>   		shm_destroy(ns, shp);
>   	}
> @@ -379,51 +404,77 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>   	up_write(&shm_ids(ns).rwsem);
>   }
>   
> +void shm_init_task(struct task_struct *task)
> +{
> +	INIT_LIST_HEAD(&(task)->sysvshm.shm_clist);
> +	spin_lock_init(&task->sysvshm.shm_clist_lock);
> +}
> +
>   /* Locking assumes this will only be called with task == current */
>   void exit_shm(struct task_struct *task)
>   {
> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> +	LIST_HEAD(tmp);
>   	struct shmid_kernel *shp, *n;
>   
>   	if (list_empty(&task->sysvshm.shm_clist))
>   		return;
>   
> -	/*
> -	 * If kernel.shm_rmid_forced is not set then only keep track of
> -	 * which shmids are orphaned, so that a later set of the sysctl
> -	 * can clean them up.
> -	 */
> -	if (!ns->shm_rmid_forced) {
> -		down_read(&shm_ids(ns).rwsem);
> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> -			shp->shm_creator = NULL;
> -		/*
> -		 * Only under read lock but we are only called on current
> -		 * so no entry on the list will be shared.
> -		 */
> -		list_del(&task->sysvshm.shm_clist);
> -		up_read(&shm_ids(ns).rwsem);
> -		return;
> -	}
> +	rcu_read_lock(); /* for refcount_inc_not_zero */
> +	task_shm_clist_lock(task);
>   
> -	/*
> -	 * Destroy all already created segments, that were not yet mapped,
> -	 * and mark any mapped as orphan to cover the sysctl toggling.
> -	 * Destroy is skipped if shm_may_destroy() returns false.
> -	 */
> -	down_write(&shm_ids(ns).rwsem);
>   	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> +		struct ipc_namespace *ns = shp->ns;
> +
> +		/*
> +		 * Remove shm from task list and nullify shm_creator which
> +		 * marks object as orphaned.
> +		 *
> +		 * If kernel.shm_rmid_forced is not set then only keep track of
> +		 * which shmids are orphaned, so that a later set of the sysctl
> +		 * can clean them up.
> +		 */
> +		list_del_init(&shp->shm_clist);
>   		shp->shm_creator = NULL;
>   
> -		if (shm_may_destroy(ns, shp)) {
> -			shm_lock_by_ptr(shp);
> -			shm_destroy(ns, shp);
> +		printk("exit_shm() %px refcnt=%u, id=%d,key=%x\n", shp,
> +			refcount_read(&shp->shm_perm.refcount),
> +			shp->shm_perm.id, shp->shm_perm.key
> +		);
> +
> +		/*
> +		 * Will destroy all already created segments, that were not yet mapped,
> +		 * and mark any mapped as orphan to cover the sysctl toggling.
> +		 * Destroy is skipped if shm_may_destroy() returns false.
> +		 */

I see that you didn't write this comment, but is "not yet" correct?

i.e. if a segment is created, mapped, unmapped, then would it be a 
candidate for getting destroyed?


> +		if (shp->ns->shm_rmid_forced && shm_may_destroy(shp)) {
> +			/*
> +			 * We may race with shm_exit_ns() if refcounter
> +			 * already zero. Let's skip shm_destroy() of such
> +			 * shm object as it will be destroyed during shm_exit_ns()
> +			 */
> +			if (!refcount_inc_not_zero(&ns->ns.count)) /* get_ipc_ns */
> +				continue;
> +

The "/* get_ipc_ns */" comment means "copy/pasted from get_ipc_ns()", correct?

This asks for trouble; I would prefer that a get_ipc_ns_not_zero() helper
be added to ipc_namespace.h.
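
Something along these lines, as a sketch (naming and exact placement up
to you):

	static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
	{
		/* only take a reference if the namespace is still alive */
		if (ns) {
			if (refcount_inc_not_zero(&ns->ns.count))
				return ns;
		}
		return NULL;
	}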


> +			list_add(&shp->shm_clist, &tmp);
>   		}
>   	}
>   
> -	/* Remove the list head from any segments still attached. */
>   	list_del(&task->sysvshm.shm_clist);
> -	up_write(&shm_ids(ns).rwsem);
> +	task_shm_clist_unlock(task);
> +	rcu_read_unlock();
> +
> +	list_for_each_entry_safe(shp, n, &tmp, shm_clist) {
> +		struct ipc_namespace *ns = shp->ns;
> +
> +		list_del_init(&shp->shm_clist);
> +
> +		down_write(&shm_ids(ns).rwsem);
> +		shm_lock_by_ptr(shp);
> +		/* will do put_ipc_ns(shp->ns) */
> +		shm_destroy(ns, shp);

What if someone attaches between shm_may_destroy() and this shm_destroy()?

The other callers seem to do down_write(rwsem); if (shm_may_destroy()) 
shm_destroy();
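
i.e. roughly this pattern, as a sketch (shm_close() does essentially
this):

	down_write(&shm_ids(ns).rwsem);
	shm_lock_by_ptr(shp);
	if (shm_may_destroy(shp))
		shm_destroy(ns, shp);	/* unlocks shp */
	else
		shm_unlock(shp);
	up_write(&shm_ids(ns).rwsem);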


> +		up_write(&shm_ids(ns).rwsem);
> +		put_ipc_ns(ns); /* see refcount_inc_not_zero */
> +	}
>   }
>   
>   static vm_fault_t shm_fault(struct vm_fault *vmf)
> @@ -681,6 +732,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>   		goto no_id;
>   
>   	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
> +	shp->ns = ns;
>   
>   	/*
>   	 * shmid gets reported as "inode#" in /proc/pid/maps.
> @@ -1573,7 +1625,11 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
>   	down_write(&shm_ids(ns).rwsem);
>   	shp = shm_lock(ns, shmid);
>   	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +#if 0
> +	if (shp->shm_nattch)
> +		list_del_init(&shp->shm_clist);
> +#endif
What is the purpose of this change?


And I think after ipc_addid(), task_shm_clist_lock() is missing.

>        /* ipc_addid() locks shp upon success. */
>         error = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
>         if (error < 0)
>                 goto no_id;
>
+ task_shm_clist_lock(current);

>         list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+ task_shm_clist_unlock(current);

>         shp->ns = ns;

I think locking is needed here; otherwise a parallel IPC_RMID on an
unrelated segment could corrupt shm_clist.


--

     Manfred


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses
  2021-07-14 17:30                 ` [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
  2021-07-21  6:32                   ` Manfred Spraul
@ 2021-07-22 18:46                   ` Manfred Spraul
  2021-07-30 14:52                     ` Alexander Mihalicyn
  1 sibling, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2021-07-22 18:46 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, linux-kernel
  Cc: Eric W . Biederman, Andrew Morton, Davidlohr Bueso,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrei Vagin,
	Christian Brauner, Pavel Tikhomirov, Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 7806 bytes --]

Hi Alexander,

A few more remarks.

On 7/14/21 7:30 PM, Alexander Mikhalitsyn wrote:
> This is total rework of fix.
> Thanks to Eric Biederman for suggestions (but may be I've misunderstood some of them :))
>
> I've tested it with reproducer of the original problem. But of course it needs
> detailed testing. I hope that I get some general comments about design and implementation.
>
> ToDo: remove unneeded "ns" argument from shm_destroy, shm_rmid and other functions.

What ensures that shp->ns is not destroyed prematurely?

I did some tests, and it seems that shmat() acquires a namespace refcount
and shm_release() puts it again, and shm_release() runs late enough to
ensure that the ns cannot go out of scope.

But I haven't checked all combinations (with/without shm_rmid_forced, 
delete via exit(), shmctl(), shmdt(), mmap()).

And: This should be documented somewhere.
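
For reference, the pairing I saw is roughly this (sketch only, I have not
verified all paths):

	/* do_shmat(): each mapping holds a namespace reference */
	sfd->ns = get_ipc_ns(ns);
	...
	/* shm_release(): dropped when the mapping's file is released */
	put_ipc_ns(sfd->ns);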


> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -65,6 +65,7 @@ struct shmid_kernel /* private to the kernel */
>   	/* The task created the shm object.  NULL if the task is dead. */
>   	struct task_struct	*shm_creator;
>   	struct list_head	shm_clist;	/* list by creator */

I think the comments are wrong/outdated.

Some parts of the new code check with list_empty(shm_clist) rather than
by looking at shm_creator.

> +	struct ipc_namespace	*ns;
>   } __randomize_layout;
>   
>   /* shm_mode upper byte flags */
> @@ -115,6 +116,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>   	struct shmid_kernel *shp;
>   
>   	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> +	BUG_ON(shp->ns && ns != shp->ns);

Is shp->ns == NULL allowed/possible? From what I see, it is impossible.

We should not have a NULL check in some codepaths but not in others.
Either everywhere, or nowhere.


>   
>   	if (shp->shm_nattch) {
>   		shp->shm_perm.mode |= SHM_DEST;
> @@ -225,9 +227,32 @@ static void shm_rcu_free(struct rcu_head *head)
>   	kfree(shp);
>   }
>   
> +static inline void task_shm_clist_lock(struct task_struct *task)
> +{
> +	spin_lock(&task->sysvshm.shm_clist_lock);
> +}
> +
> +static inline void task_shm_clist_unlock(struct task_struct *task)
> +{
> +	spin_unlock(&task->sysvshm.shm_clist_lock);
> +}
> +
> +void shm_clist_rm(struct shmid_kernel *shp)
> +{
> +	if (!shp->shm_creator)
> +		return;
> +
> +	task_shm_clist_lock(shp->shm_creator);
> +	list_del_init(&shp->shm_clist);
> +	task_shm_clist_unlock(shp->shm_creator);
> +	shp->shm_creator = NULL;
> +}
> +
>   static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
>   {
> -	list_del(&s->shm_clist);
> +	WARN_ON(s->ns && ns != s->ns);
> +	//list_del_init(&s->shm_clist);
> +	shm_clist_rm(s);
>   	ipc_rmid(&shm_ids(ns), &s->shm_perm);
>   }
>   
> @@ -306,10 +331,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
>    *
>    * 2) sysctl kernel.shm_rmid_forced is set to 1.
>    */
> -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> +static bool shm_may_destroy(struct shmid_kernel *shp)
>   {
>   	return (shp->shm_nattch == 0) &&
> -	       (ns->shm_rmid_forced ||
> +	       (shp->ns->shm_rmid_forced ||
>   		(shp->shm_perm.mode & SHM_DEST));
>   }
>   
As written before: what ensures that shp->ns was not already released
when shp->ns->shm_rmid_forced is read?
> @@ -340,7 +365,7 @@ static void shm_close(struct vm_area_struct *vma)
>   	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
>   	shp->shm_dtim = ktime_get_real_seconds();
>   	shp->shm_nattch--;
> -	if (shm_may_destroy(ns, shp))
> +	if (shm_may_destroy(shp))
>   		shm_destroy(ns, shp);
>   	else
>   		shm_unlock(shp);
> @@ -361,10 +386,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
>   	 *
>   	 * As shp->* are changed under rwsem, it's safe to skip shp locking.
>   	 */
> -	if (shp->shm_creator != NULL)
> +	if (!list_empty(&shp->shm_clist))
>   		return 0;
>   
This collides with the comment above: here, list_empty() is used.
> -	if (shm_may_destroy(ns, shp)) {
> +	if (shm_may_destroy(shp)) {
>   		shm_lock_by_ptr(shp);
>   		shm_destroy(ns, shp);
>   	}
> @@ -379,51 +404,77 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
>   	up_write(&shm_ids(ns).rwsem);
>   }
>   
> +void shm_init_task(struct task_struct *task)
> +{
> +	INIT_LIST_HEAD(&(task)->sysvshm.shm_clist);
> +	spin_lock_init(&task->sysvshm.shm_clist_lock);
> +}
> +
>   /* Locking assumes this will only be called with task == current */
>   void exit_shm(struct task_struct *task)
>   {
> -	struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> +	LIST_HEAD(tmp);
>   	struct shmid_kernel *shp, *n;
>   
>   	if (list_empty(&task->sysvshm.shm_clist))
>   		return;
>   
> -	/*
> -	 * If kernel.shm_rmid_forced is not set then only keep track of
> -	 * which shmids are orphaned, so that a later set of the sysctl
> -	 * can clean them up.
> -	 */
> -	if (!ns->shm_rmid_forced) {
> -		down_read(&shm_ids(ns).rwsem);
> -		list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> -			shp->shm_creator = NULL;
> -		/*
> -		 * Only under read lock but we are only called on current
> -		 * so no entry on the list will be shared.
> -		 */
> -		list_del(&task->sysvshm.shm_clist);
> -		up_read(&shm_ids(ns).rwsem);
> -		return;
> -	}
> +	rcu_read_lock(); /* for refcount_inc_not_zero */
> +	task_shm_clist_lock(task);
>   
> -	/*
> -	 * Destroy all already created segments, that were not yet mapped,
> -	 * and mark any mapped as orphan to cover the sysctl toggling.
> -	 * Destroy is skipped if shm_may_destroy() returns false.
> -	 */
> -	down_write(&shm_ids(ns).rwsem);
>   	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> +		struct ipc_namespace *ns = shp->ns;
> +
> +		/*
> +		 * Remove shm from task list and nullify shm_creator which
> +		 * marks object as orphaned.
> +		 *
> +		 * If kernel.shm_rmid_forced is not set then only keep track of
> +		 * which shmids are orphaned, so that a later set of the sysctl
> +		 * can clean them up.
> +		 */
> +		list_del_init(&shp->shm_clist);
>   		shp->shm_creator = NULL;
>   
> -		if (shm_may_destroy(ns, shp)) {
> -			shm_lock_by_ptr(shp);
> -			shm_destroy(ns, shp);
> +		printk("exit_shm() %px refcnt=%u, id=%d,key=%x\n", shp,
> +			refcount_read(&shp->shm_perm.refcount),
> +			shp->shm_perm.id, shp->shm_perm.key
> +		);
> +
> +		/*
> +		 * Will destroy all already created segments, that were not yet mapped,
> +		 * and mark any mapped as orphan to cover the sysctl toggling.
> +		 * Destroy is skipped if shm_may_destroy() returns false.
> +		 */
> +		if (shp->ns->shm_rmid_forced && shm_may_destroy(shp)) {
> +			/*
> +			 * We may race with shm_exit_ns() if refcounter
> +			 * already zero. Let's skip shm_destroy() of such
> +			 * shm object as it will be destroyed during shm_exit_ns()
> +			 */
> +			if (!refcount_inc_not_zero(&ns->ns.count)) /* get_ipc_ns */
> +				continue;
> +
> +			list_add(&shp->shm_clist, &tmp);
>   		}
>   	}
>   
> -	/* Remove the list head from any segments still attached. */
>   	list_del(&task->sysvshm.shm_clist);
> -	up_write(&shm_ids(ns).rwsem);
> +	task_shm_clist_unlock(task);
> +	rcu_read_unlock();
> +
> +	list_for_each_entry_safe(shp, n, &tmp, shm_clist) {
> +		struct ipc_namespace *ns = shp->ns;
> +
> +		list_del_init(&shp->shm_clist);
> +
> +		down_write(&shm_ids(ns).rwsem);
> +		shm_lock_by_ptr(shp);
> +		/* will do put_ipc_ns(shp->ns) */
> +		shm_destroy(ns, shp);
> +		up_write(&shm_ids(ns).rwsem);
> +		put_ipc_ns(ns); /* see refcount_inc_not_zero */
> +	}
>   }
>   

I do not see the advantage of first collecting everything in a local 
list, and then destroying the elements.

Attached is my current test case. Feel free to merge whatever you
consider useful into your change.


--

     Manfred


[-- Attachment #2: 0003-ipc-shm-locking-questions.patch --]
[-- Type: text/x-patch, Size: 10628 bytes --]

From 7ac17dcf2615f40cef789b203d6f6876038627b6 Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@colorfullife.com>
Date: Thu, 22 Jul 2021 20:19:47 +0200
Subject: [PATCH 3/3] ipc/shm: locking / questions

for discussion, not for merging:
- make locking a bit more explicit
- there is no advantage of a temporary list in exit_shm(),
  thus delete segments one by one.
- lots of debug pr_info.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
---
 include/linux/ipc_namespace.h |  16 ++++
 ipc/namespace.c               |   4 +
 ipc/shm.c                     | 160 +++++++++++++++++++++-------------
 3 files changed, 121 insertions(+), 59 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 05e22770af51..f3a09604d0d5 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -126,11 +126,22 @@ extern struct ipc_namespace *copy_ipcs(unsigned long flags,
 
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 {
+pr_info("get_ipc_ns: inc for ns %px.\n", ns);
 	if (ns)
 		refcount_inc(&ns->ns.count);
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+pr_info("get_ipc_ns_not_zero: inc for ns %px.\n", ns);
+	if (ns) {
+		if (refcount_inc_not_zero(&ns->ns.count))
+			return ns;
+	}
+	return NULL;
+}
+
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
@@ -147,6 +158,11 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 	return ns;
 }
 
+static inline struct ipc_namespace *get_ipc_ns_not_zero(struct ipc_namespace *ns)
+{
+	return ns;
+}
+
 static inline void put_ipc_ns(struct ipc_namespace *ns)
 {
 }
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766ddc3b..4160ea18dcd2 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -117,6 +117,7 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 static void free_ipc_ns(struct ipc_namespace *ns)
 {
+pr_info("free_ipc_ns: worker task for namespace %px.\n", ns);
 	/* mq_put_mnt() waits for a grace period as kern_unmount()
 	 * uses synchronize_rcu().
 	 */
@@ -129,6 +130,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
 	put_user_ns(ns->user_ns);
 	ns_free_inum(&ns->ns);
 	kfree(ns);
+pr_info("free_ipc_ns: worker task for namespace %px done.\n", ns);
 }
 
 static LLIST_HEAD(free_ipc_list);
@@ -164,9 +166,11 @@ static DECLARE_WORK(free_ipc_work, free_ipc);
  */
 void put_ipc_ns(struct ipc_namespace *ns)
 {
+pr_info("put_ipc_ns: got called for %px.\n", ns);
 	if (refcount_dec_and_lock(&ns->ns.count, &mq_lock)) {
 		mq_clear_sbinfo(ns);
 		spin_unlock(&mq_lock);
+pr_info("put_ipc_ns: destroying namespace %px.\n", ns);
 
 		if (llist_add(&ns->mnt_llist, &free_ipc_list))
 			schedule_work(&free_ipc_work);
diff --git a/ipc/shm.c b/ipc/shm.c
index a746886a0e00..f55118d0a425 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,9 +62,14 @@ struct shmid_kernel /* private to the kernel */
 	struct pid		*shm_lprid;
 	struct ucounts		*mlock_ucounts;
 
-	/* The task created the shm object.  NULL if the task is dead. */
+	/* The task created the shm object, for looking up
+	 * task->sysvshm.shm_clist_lock */
 	struct task_struct	*shm_creator;
-	struct list_head	shm_clist;	/* list by creator */
+
+	/* list by creator. shm_clist_lock required for read/write
+	 * if list_empty(), then the creator is dead already
+	 */
+	struct list_head	shm_clist;
 	struct ipc_namespace	*ns;
 } __randomize_layout;
 
@@ -130,6 +135,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 #ifdef CONFIG_IPC_NS
 void shm_exit_ns(struct ipc_namespace *ns)
 {
+pr_info("shm_exit_ns(), %px.\n", ns);
 	free_ipcs(ns, &shm_ids(ns), do_shm_rmid);
 	idr_destroy(&ns->ids[IPC_SHM_IDS].ipcs_idr);
 	rhashtable_destroy(&ns->ids[IPC_SHM_IDS].key_ht);
@@ -237,15 +243,30 @@ static inline void task_shm_clist_unlock(struct task_struct *task)
 	spin_unlock(&task->sysvshm.shm_clist_lock);
 }
 
-void shm_clist_rm(struct shmid_kernel *shp)
+/* It has to be called with shp locked.
+ * It must be called before ipc_rmid() */
+static inline void shm_clist_rm(struct shmid_kernel *shp)
 {
-	if (!shp->shm_creator)
+	struct task_struct *creator;
+
+	creator = READ_ONCE(shp->shm_creator);
+	if (!creator) {
+pr_info("shm_clist_rm: creator already NULL for %px.\n", shp);
 		return;
+	}
+
+	task_shm_clist_lock(creator);
+
+pr_info("shm_clist_rm: creator %px locked for shp %px.\n", creator, shp);
 
-	task_shm_clist_lock(shp->shm_creator);
-	list_del_init(&shp->shm_clist);
-	task_shm_clist_unlock(shp->shm_creator);
-	shp->shm_creator = NULL;
+	/* A concurrent exit_shm may do a list_del_init() as well.
+	 * Just do nothing if exit_shm already did the work
+	 */
+	if (!list_empty(&shp->shm_clist)) {
+		list_del_init(&shp->shm_clist);
+		WRITE_ONCE(shp->shm_creator, NULL);
+	}
+	task_shm_clist_unlock(creator);
 }
 
 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
@@ -305,6 +326,8 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 {
 	struct file *shm_file;
 
+pr_info("shm_destroy: current %px, namespace %px, shp %px.\n", current, ns, shp);
+
 	shm_file = shp->shm_file;
 	shp->shm_file = NULL;
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -333,6 +356,11 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
  */
 static bool shm_may_destroy(struct shmid_kernel *shp)
 {
+pr_info("shm_may_destroy for current %px, ns %px, shp %px.\n", current, shp->ns, shp);
+
+/* TODO:
+ * understand refcounting for shp->ns: What guarantees that ns cannot go out of scope?
+ */
 	return (shp->shm_nattch == 0) &&
 	       (shp->ns->shm_rmid_forced ||
 		(shp->shm_perm.mode & SHM_DEST));
@@ -365,6 +393,7 @@ static void shm_close(struct vm_area_struct *vma)
 	ipc_update_pid(&shp->shm_lprid, task_tgid(current));
 	shp->shm_dtim = ktime_get_real_seconds();
 	shp->shm_nattch--;
+pr_info("shm_close for current %px, ns %px, shp %px.\n", current, shp->ns, shp);
 	if (shm_may_destroy(shp))
 		shm_destroy(ns, shp);
 	else
@@ -413,67 +442,66 @@ void shm_init_task(struct task_struct *task)
 /* Locking assumes this will only be called with task == current */
 void exit_shm(struct task_struct *task)
 {
-	LIST_HEAD(tmp);
-	struct shmid_kernel *shp, *n;
 
-	if (list_empty(&task->sysvshm.shm_clist))
-		return;
+pr_info("exit_shm: called for task %px, current %px\n", task, current);
 
-	rcu_read_lock(); /* for refcount_inc_not_zero */
-	task_shm_clist_lock(task);
+	for (;;) {
+		struct shmid_kernel *shp;
+		struct ipc_namespace *ns;
 
-	list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
-		struct ipc_namespace *ns = shp->ns;
 
-		/*
-		 * Remove shm from task list and nullify shm_creator which
-		 * marks object as orphaned.
-		 *
-		 * If kernel.shm_rmid_forced is not set then only keep track of
-		 * which shmids are orphaned, so that a later set of the sysctl
-		 * can clean them up.
-		 */
-		list_del_init(&shp->shm_clist);
-		shp->shm_creator = NULL;
+		task_shm_clist_lock(task);
+		if (list_empty(&task->sysvshm.shm_clist)) {
+			task_shm_clist_unlock(task);
+			break;
+		}
 
-		printk("exit_shm() %px refcnt=%u, id=%d,key=%x\n", shp,
-			refcount_read(&shp->shm_perm.refcount),
-			shp->shm_perm.id, shp->shm_perm.key
-		);
+		shp = list_first_entry(&task->sysvshm.shm_clist, struct shmid_kernel,
+				shm_clist);
+
+		/* 1) unlink */
+		list_del_init(&shp->shm_clist);
+		WRITE_ONCE(shp->shm_creator, NULL);
 
 		/*
-		 * Will destroy all already created segments, that were not yet mapped,
-		 * and mark any mapped as orphan to cover the sysctl toggling.
-		 * Destroy is skipped if shm_may_destroy() returns false.
+		 * 2) get a reference to the namespace.
+		 *    The refcount could be already 0. If it is 0, then
+		 *    the shm objects will be free by free_ipc_work().
 		 */
-		if (shp->ns->shm_rmid_forced && shm_may_destroy(shp)) {
+		ns = shp->ns;
+pr_info("exit_shm: called for task %px, current %px, processing shp %px, ns %px\n", task, current, shp, ns);
+
+		ns = get_ipc_ns_not_zero(ns);
+		if (ns) {
 			/*
-			 * We may race with shm_exit_ns() if refcounter
-			 * already zero. Let's skip shm_destroy() of such
-			 * shm object as it will be destroyed during shm_exit_ns()
+			 * 3) get a reference to the shp itself.
+			 *   This cannot fail: shm_clist_rm() is called before
+			 *   ipc_rmid(), thus the refcount cannot be 0.
 			 */
-			if (!refcount_inc_not_zero(&ns->ns.count)) /* get_ipc_ns */
-				continue;
-
-			list_add(&shp->shm_clist, &tmp);
+			ipc_rcu_getref(&shp->shm_perm);
+		}
+		task_shm_clist_unlock(task);
+
+		if (ns) {
+			down_write(&shm_ids(ns).rwsem);
+			shm_lock_by_ptr(shp);
+			ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+
+			if (ipc_valid_object(&shp->shm_perm)) {
+				if (shm_may_destroy(shp))
+					shm_destroy(ns, shp);
+				else
+					shm_unlock(shp);
+			} else {
+				/*
+				 * Someone else deleted the shp while we have waited.
+				 * Just unlock and continue.
+				 */
+				shm_unlock(shp);
+			}
+			up_write(&shm_ids(ns).rwsem);
+			put_ipc_ns(ns);
 		}
-	}
-
-	list_del(&task->sysvshm.shm_clist);
-	task_shm_clist_unlock(task);
-	rcu_read_unlock();
-
-	list_for_each_entry_safe(shp, n, &tmp, shm_clist) {
-		struct ipc_namespace *ns = shp->ns;
-
-		list_del_init(&shp->shm_clist);
-
-		down_write(&shm_ids(ns).rwsem);
-		shm_lock_by_ptr(shp);
-		/* will do put_ipc_ns(shp->ns) */
-		shm_destroy(ns, shp);
-		up_write(&shm_ids(ns).rwsem);
-		put_ipc_ns(ns); /* see refcount_inc_not_zero */
 	}
 }
 
@@ -566,6 +594,8 @@ static int shm_release(struct inode *ino, struct file *file)
 {
 	struct shm_file_data *sfd = shm_file_data(file);
 
+pr_info("shm_release for current %px.\n", current);
+
 	put_ipc_ns(sfd->ns);
 	fput(sfd->file);
 	shm_file_data(file) = NULL;
@@ -731,9 +761,21 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (error < 0)
 		goto no_id;
 
-	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
 	shp->ns = ns;
 
+	task_shm_clist_lock(current);
+	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);
+
+pr_info("newseg: current %px, namespace %px, shp %px.\n", current, ns, shp);
+	{
+		struct shmid_kernel *shp;
+
+		list_for_each_entry(shp, &current->sysvshm.shm_clist, shm_clist) {
+pr_info("newseg: current %px, list entry %px.\n", current, shp);
+		}
+	}
+	task_shm_clist_unlock(current);
+ 
 	/*
 	 * shmid gets reported as "inode#" in /proc/pid/maps.
 	 * proc-ps tools use this. Changing this will break them.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses
  2021-07-22 18:46                   ` Manfred Spraul
@ 2021-07-30 14:52                     ` Alexander Mihalicyn
  0 siblings, 0 replies; 17+ messages in thread
From: Alexander Mihalicyn @ 2021-07-30 14:52 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: linux-kernel, Eric W . Biederman, Andrew Morton, Davidlohr Bueso,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrei Vagin,
	Christian Brauner

Hi, Manfred,

I'm sorry for the long delay with the answer.
Bug hunting season is open, so I tried to catch another one last week :)

I will return to work on that problem next week.

Thank you very much for your review, suggestions and fixes. I will
take your fixes
and, if you allow, add your Co-developed-by tag ;)

I've left some comments below.

On Thu, Jul 22, 2021 at 9:46 PM Manfred Spraul <manfred@colorfullife.com> wrote:
>
> Hi Alexander,
>
> A few more remarks.
>
> On 7/14/21 7:30 PM, Alexander Mikhalitsyn wrote:
> > This is total rework of fix.
> > Thanks to Eric Biederman for suggestions (but may be I've misunderstood some of them :))
> >
> > I've tested it with reproducer of the original problem. But of course it needs
> > detailed testing. I hope that I get some general comments about design and implementation.
> >
> > ToDo: remove unneeded "ns" argument from shm_destroy, shm_rmid and other functions.
>
> What ensures the that shp->ns is not destroyed prematurely?
>
> I did some tests, and it seems that shmat() acquires a namespace
> refcount, and shm_release() puts it again, and the shm_release is late
> enough to ensure that the ns cannot go out of scope.
>
> But I haven't checked all combinations (with/without shm_rmid_forced,
> delete via exit(), shmctl(), shmdt(), mmap()).
>
> And: This should be documented somewhere.
>
>
> > --- a/ipc/shm.c
> > +++ b/ipc/shm.c
> > @@ -65,6 +65,7 @@ struct shmid_kernel /* private to the kernel */
> >       /* The task created the shm object.  NULL if the task is dead. */
> >       struct task_struct      *shm_creator;
> >       struct list_head        shm_clist;      /* list by creator */
>
> I think the comments are wrong/outdated.
>
> Some parts of the new code checks with list_empty(shm_clist), not by
> looking at shm_creator.
>
> > +     struct ipc_namespace    *ns;
> >   } __randomize_layout;
> >
> >   /* shm_mode upper byte flags */
> > @@ -115,6 +116,7 @@ static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
> >       struct shmid_kernel *shp;
> >
> >       shp = container_of(ipcp, struct shmid_kernel, shm_perm);
> > +     BUG_ON(shp->ns && ns != shp->ns);
>
> Is shp->ns == NULL allowed/possible? From what I see, it is impossible.

Yep, it's leftover from debugging.

>
> I think we should not have  NULL check in a few codepaths, but not in
> other codepaths. Either everywhere, or nowhere.
>
>
> >
> >       if (shp->shm_nattch) {
> >               shp->shm_perm.mode |= SHM_DEST;
> > @@ -225,9 +227,32 @@ static void shm_rcu_free(struct rcu_head *head)
> >       kfree(shp);
> >   }
> >
> > +static inline void task_shm_clist_lock(struct task_struct *task)
> > +{
> > +     spin_lock(&task->sysvshm.shm_clist_lock);
> > +}
> > +
> > +static inline void task_shm_clist_unlock(struct task_struct *task)
> > +{
> > +     spin_unlock(&task->sysvshm.shm_clist_lock);
> > +}
> > +
> > +void shm_clist_rm(struct shmid_kernel *shp)
> > +{
> > +     if (!shp->shm_creator)
> > +             return;
> > +
> > +     task_shm_clist_lock(shp->shm_creator);
> > +     list_del_init(&shp->shm_clist);
> > +     task_shm_clist_unlock(shp->shm_creator);
> > +     shp->shm_creator = NULL;
> > +}
> > +
> >   static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
> >   {
> > -     list_del(&s->shm_clist);
> > +     WARN_ON(s->ns && ns != s->ns);
> > +     //list_del_init(&s->shm_clist);
> > +     shm_clist_rm(s);
> >       ipc_rmid(&shm_ids(ns), &s->shm_perm);
> >   }
> >
> > @@ -306,10 +331,10 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> >    *
> >    * 2) sysctl kernel.shm_rmid_forced is set to 1.
> >    */
> > -static bool shm_may_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
> > +static bool shm_may_destroy(struct shmid_kernel *shp)
> >   {
> >       return (shp->shm_nattch == 0) &&
> > -            (ns->shm_rmid_forced ||
> > +            (shp->ns->shm_rmid_forced ||
> >               (shp->shm_perm.mode & SHM_DEST));
> >   }
> >
> As written before: what ensures that shp->ns->shm_rmid_forced was not
> released already?

As far as I understand, if we have locked a struct shmid_kernel, it means
that someone (a task) holds the IPC namespace. But I will check and
document this.

> > @@ -340,7 +365,7 @@ static void shm_close(struct vm_area_struct *vma)
> >       ipc_update_pid(&shp->shm_lprid, task_tgid(current));
> >       shp->shm_dtim = ktime_get_real_seconds();
> >       shp->shm_nattch--;
> > -     if (shm_may_destroy(ns, shp))
> > +     if (shm_may_destroy(shp))
> >               shm_destroy(ns, shp);
> >       else
> >               shm_unlock(shp);
> > @@ -361,10 +386,10 @@ static int shm_try_destroy_orphaned(int id, void *p, void *data)
> >        *
> >        * As shp->* are changed under rwsem, it's safe to skip shp locking.
> >        */
> > -     if (shp->shm_creator != NULL)
> > +     if (!list_empty(&shp->shm_clist))
> >               return 0;
> >
> This collides with the comment above: here, list_empty() is used.
> > -     if (shm_may_destroy(ns, shp)) {
> > +     if (shm_may_destroy(shp)) {
> >               shm_lock_by_ptr(shp);
> >               shm_destroy(ns, shp);
> >       }
> > @@ -379,51 +404,77 @@ void shm_destroy_orphaned(struct ipc_namespace *ns)
> >       up_write(&shm_ids(ns).rwsem);
> >   }
> >
> > +void shm_init_task(struct task_struct *task)
> > +{
> > +     INIT_LIST_HEAD(&(task)->sysvshm.shm_clist);
> > +     spin_lock_init(&task->sysvshm.shm_clist_lock);
> > +}
> > +
> >   /* Locking assumes this will only be called with task == current */
> >   void exit_shm(struct task_struct *task)
> >   {
> > -     struct ipc_namespace *ns = task->nsproxy->ipc_ns;
> > +     LIST_HEAD(tmp);
> >       struct shmid_kernel *shp, *n;
> >
> >       if (list_empty(&task->sysvshm.shm_clist))
> >               return;
> >
> > -     /*
> > -      * If kernel.shm_rmid_forced is not set then only keep track of
> > -      * which shmids are orphaned, so that a later set of the sysctl
> > -      * can clean them up.
> > -      */
> > -     if (!ns->shm_rmid_forced) {
> > -             down_read(&shm_ids(ns).rwsem);
> > -             list_for_each_entry(shp, &task->sysvshm.shm_clist, shm_clist)
> > -                     shp->shm_creator = NULL;
> > -             /*
> > -              * Only under read lock but we are only called on current
> > -              * so no entry on the list will be shared.
> > -              */
> > -             list_del(&task->sysvshm.shm_clist);
> > -             up_read(&shm_ids(ns).rwsem);
> > -             return;
> > -     }
> > +     rcu_read_lock(); /* for refcount_inc_not_zero */
> > +     task_shm_clist_lock(task);
> >
> > -     /*
> > -      * Destroy all already created segments, that were not yet mapped,
> > -      * and mark any mapped as orphan to cover the sysctl toggling.
> > -      * Destroy is skipped if shm_may_destroy() returns false.
> > -      */
> > -     down_write(&shm_ids(ns).rwsem);
> >       list_for_each_entry_safe(shp, n, &task->sysvshm.shm_clist, shm_clist) {
> > +             struct ipc_namespace *ns = shp->ns;
> > +
> > +             /*
> > +              * Remove shm from task list and nullify shm_creator which
> > +              * marks object as orphaned.
> > +              *
> > +              * If kernel.shm_rmid_forced is not set then only keep track of
> > +              * which shmids are orphaned, so that a later set of the sysctl
> > +              * can clean them up.
> > +              */
> > +             list_del_init(&shp->shm_clist);
> >               shp->shm_creator = NULL;
> >
> > -             if (shm_may_destroy(ns, shp)) {
> > -                     shm_lock_by_ptr(shp);
> > -                     shm_destroy(ns, shp);
> > +             printk("exit_shm() %px refcnt=%u, id=%d,key=%x\n", shp,
> > +                     refcount_read(&shp->shm_perm.refcount),
> > +                     shp->shm_perm.id, shp->shm_perm.key
> > +             );
> > +
> > +             /*
> > +              * Will destroy all already created segments, that were not yet mapped,
> > +              * and mark any mapped as orphan to cover the sysctl toggling.
> > +              * Destroy is skipped if shm_may_destroy() returns false.
> > +              */
> > +             if (shp->ns->shm_rmid_forced && shm_may_destroy(shp)) {
> > +                     /*
> > +                      * We may race with shm_exit_ns() if refcounter
> > +                      * already zero. Let's skip shm_destroy() of such
> > +                      * shm object as it will be destroyed during shm_exit_ns()
> > +                      */
> > +                     if (!refcount_inc_not_zero(&ns->ns.count)) /* get_ipc_ns */
> > +                             continue;
> > +
> > +                     list_add(&shp->shm_clist, &tmp);
> >               }
> >       }
> >
> > -     /* Remove the list head from any segments still attached. */
> >       list_del(&task->sysvshm.shm_clist);
> > -     up_write(&shm_ids(ns).rwsem);
> > +     task_shm_clist_unlock(task);
> > +     rcu_read_unlock();
> > +
> > +     list_for_each_entry_safe(shp, n, &tmp, shm_clist) {
> > +             struct ipc_namespace *ns = shp->ns;
> > +
> > +             list_del_init(&shp->shm_clist);
> > +
> > +             down_write(&shm_ids(ns).rwsem);
> > +             shm_lock_by_ptr(shp);
> > +             /* will do put_ipc_ns(shp->ns) */
> > +             shm_destroy(ns, shp);
> > +             up_write(&shm_ids(ns).rwsem);
> > +             put_ipc_ns(ns); /* see refcount_inc_not_zero */
> > +     }
> >   }
> >
>
> I do not see the advantage of first collecting everything in a local
> list, and then destroying the elements.

Ah, I see your idea. You take the list lock inside a loop over the list
and check that it's not empty, which allows you to split taking the
spin_lock and the rwsem. I tried to do the same (splitting the locks),
but using a temporary list head. :) Thanks!

>
> Attached is my current test case. Feel free to merge whatever you
> consider as useful into your change.
>
>
> --
>
>      Manfred
>

Regards,
Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-07-12 19:18             ` Eric W. Biederman
  2021-07-12 19:27               ` Alexander Mihalicyn
@ 2021-09-23 16:36               ` Manfred Spraul
  2021-09-24 16:45                 ` Eric W. Biederman
  1 sibling, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2021-09-23 16:36 UTC (permalink / raw)
  To: Eric W. Biederman, Alexander Mihalicyn
  Cc: Andrew Morton, linux-kernel, Milton Miller, Jack Miller,
	Pavel Tikhomirov, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrei Vagin, Christian Brauner

Hi Eric,

I'd like to restart the discussion; the issue should be fixed.

On 7/12/21 9:18 PM, Eric W. Biederman wrote:
>
> Given that this is a bug I think c) is the safest option.
>
> A couple of suggestions.
> 1) We can replace the test "shm_creator != NULL" with
>     "list_empty(&shp->shm_clist)"

Yes, good idea. list_del() already contains WRITE_ONCE() and 
list_empty() contains READ_ONCE(), i.e. we get clear memory ordering rules.
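
(Roughly, from include/linux/list.h, quoted from memory, so please
double-check:)

	static inline int list_empty(const struct list_head *head)
	{
		return READ_ONCE(head->next) == head;
	}

	static inline void __list_del(struct list_head *prev, struct list_head *next)
	{
		next->prev = prev;
		WRITE_ONCE(prev->next, next);
	}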


>   and remove shm_creator.

I do not see how we can remove shm_creator:

- thread A: creates new segment, doesn't map it.

- thread B: shmctl(,IPC_RMID,).

thread B must now locate the lock that protects ->shm_clist. And if the 
lock is per-thread, then I do not see how we can avoid having a pointer 
in shp to the lock.
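
i.e. something like the shm_clist_rm() from the RFC, as a sketch; shp is
the only way to reach the creator and thus the lock:

	/* IPC_RMID path: find the creator's lock via shp->shm_creator */
	creator = READ_ONCE(shp->shm_creator);
	if (creator) {
		task_shm_clist_lock(creator);
		if (!list_empty(&shp->shm_clist))
			list_del_init(&shp->shm_clist);
		task_shm_clist_unlock(creator);
	}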


>     Along with replacing "shm_creator = NULL" with
>     "list_del_init(&shp->shm_clist)".
Correct, list_del_init() is better than shm_create = NULL.
> 2) We can update shmat to do "list_del_init(&shp->shm_clist)"
>     upon shmat.

That would be a (tiny) user-space-visible change:

echo 0 > /proc/sys/kernel/shm_rmid_forced
shmget()
shmat()
shmdt()
echo 1 > /proc/sys/kernel/shm_rmid_forced
exit()

Right now: the segment is destroyed.

After your proposal: the segment is not destroyed.

I don't think that we should mix that with the bugfix.


>    The last unmap will still shm_destroy the
>     shm segment as ns->shm_rmid_forced is set.
But what if shm_rmid_forced is modified?
>     For a multi-threaded process I think this will nicely clean up
>     the clist, and make it clear that the clist only cares about
>     those segments that have been created but never attached.

> 3) Put a non-reference counted struct ipc_namespace in struct
>     shmid_kernel, and use it to remove the namespace parameter
>     from shm_destroy.
>
> I think that is enough to fix this bug with no changes in semantics,
> no additional memory consumed, and an implementation that is easier
> to read and perhaps a little faster.

I do not see how this solves the list corruption:

A thread creates 2 shm segments, then switches the namespace and creates
another 2 segments.

- corruption 1: in each of the namespaces, one thread calls 
shmctl(,IPC_RMID,) -> both will operate in parallel on ->shm_clist.

- corruption 2: exit_shm() in parallel to one thread

- corruption 3: one shmctl(,IPC_RMID,) in parallel to a shmget().

i.e.: we can have list_add() and multiple list_del() in parallel.

I don't see how this should work without a lock.

With regards to memory usage: I would propose using task_lock(); that
lock exists already.
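
E.g., as a sketch, the helpers from the RFC patch could simply wrap it
(assuming the task_lock() nesting rules allow it here):

	static inline void task_shm_clist_lock(struct task_struct *task)
	{
		task_lock(task);	/* reuses task->alloc_lock, no new spinlock in task_struct */
	}

	static inline void task_shm_clist_unlock(struct task_struct *task)
	{
		task_unlock(task);
	}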


--

     Manfred


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed
  2021-09-23 16:36               ` Manfred Spraul
@ 2021-09-24 16:45                 ` Eric W. Biederman
  0 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2021-09-24 16:45 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Alexander Mihalicyn, Andrew Morton, linux-kernel, Milton Miller,
	Jack Miller, Pavel Tikhomirov, Davidlohr Bueso, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Andrei Vagin, Christian Brauner

Manfred Spraul <manfred@colorfullife.com> writes:

> Hi Eric,
>
> I'd like to restart the discussion, the issue should be fixed.

Agreed.

I am going to come right out and say that, having read through everything,
my suggestions were confused and wrong.

Somehow I thought the cleanups I was suggesting would result in
shm_clist only being modified from the task that owns the list,
which would result in no need to use a per-task list.

Having looked through my suggestions again I was completely wrong.

The only useful bit I have to contribute from that original suggestion
is this: let's please have smallish patches that change one thing at a time.

That code is sufficiently interesting that it is way too easy to get
lost in big patches.


I am not going to discuss my broken suggestions right now, because every
time I look into them I go down a rabbit hole and don't get anything
productive done on fixing these issues, just something close and
frustrating.  Apologies if I derailed your patch.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-09-24 16:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-06 13:22 [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mikhalitsyn
2021-07-06 13:22 ` [PATCH 1/2] shm: skip shm_destroy " Alexander Mikhalitsyn
2021-07-06 13:22 ` [PATCH 2/2] ipc: WARN if trying to remove ipc object which is absent Alexander Mikhalitsyn
2021-07-10  1:12 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Andrew Morton
2021-07-10 10:55   ` Alexander Mihalicyn
     [not found]     ` <CALgW_8VUk0us_umLncUv2DUMkOi3qixmT+YkHV4Dhpt_nNMZHw@mail.gmail.com>
2021-07-11 10:33       ` Alexander Mihalicyn
2021-07-11 11:46         ` Manfred Spraul
2021-07-12  9:54           ` Alexander Mihalicyn
2021-07-12 19:18             ` Eric W. Biederman
2021-07-12 19:27               ` Alexander Mihalicyn
2021-07-14 17:30                 ` [RFC PATCH] shm: extend forced shm destroy to support objects from several IPC nses Alexander Mikhalitsyn
2021-07-21  6:32                   ` Manfred Spraul
2021-07-22 18:46                   ` Manfred Spraul
2021-07-30 14:52                     ` Alexander Mihalicyn
2021-07-14 17:42                 ` [PATCH 0/2] shm: omit forced shm destroy if task IPC namespace was changed Alexander Mihalicyn
2021-09-23 16:36               ` Manfred Spraul
2021-09-24 16:45                 ` Eric W. Biederman
