From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752265Ab3LQDOa (ORCPT <rfc822;w@1wt.eu>);
	Mon, 16 Dec 2013 22:14:30 -0500
Received: from szxga01-in.huawei.com ([119.145.14.64]:21968 "EHLO
	szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751088Ab3LQDO2 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 16 Dec 2013 22:14:28 -0500
Message-ID: <52AFC163.5010507@huawei.com>
Date: Tue, 17 Dec 2013 11:13:39 +0800
From: Li Zefan <lizefan@huawei.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130801 Thunderbird/17.0.8
MIME-Version: 1.0
To: Hugh Dickins <hughd@google.com>
CC: Tejun Heo <tj@kernel.org>, Michal Hocko <mhocko@suse.cz>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>,
        <linux-mm@kvack.org>, <cgroups@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>
Subject: Re: 3.13-rc breaks MEMCG_SWAP
References: <alpine.LNX.2.00.1312160025200.2785@eggly.anvils> <52AEC989.4080509@huawei.com> <20131216095345.GB23582@dhcp22.suse.cz> <20131216104042.GC23582@dhcp22.suse.cz> <20131216163530.GH32509@htj.dyndns.org> <20131216171937.GG26797@dhcp22.suse.cz> <20131216172143.GJ32509@htj.dyndns.org> <alpine.LNX.2.00.1312161718001.2037@eggly.anvils>
In-Reply-To: <alpine.LNX.2.00.1312161718001.2037@eggly.anvils>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.135.68.215]
X-CFilter-Loop: Reflected
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2013/12/17 9:41, Hugh Dickins wrote:
> On Mon, 16 Dec 2013, Tejun Heo wrote:
>> On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
>>> I have to think about it some more (the brain is not working anymore
>>> today). But what we really need is that nobody gets the same id while
>>> the css is alive.

That's what I meant to do in my last reply.

But I'm confused by

"How would this work? .. the swap will be there
after the last reference to css as well."

>>> So css_from_id returning NULL doesn't seem to be
>>> enough.
>>
>> Oh, I meant whether it's necessary to keep css_from_id() working
>> (ie. doing successful lookups) between offline and release, because
>> that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
>> not recycle the ID until all css's are released && fail css_from_id()
>> lookup after the css is offlined, I can make a five liner quick fix.
> 
> Don't take my word on it, I'm too fuzzy on this: but although it would
> be good to refrain from recycling the ID until all css's are released,
> I believe that it would not be good enough to fail css_from_id() once
> the css is offlined - mem_cgroup_uncharge_swap() needs to uncharge the
> hierarchy of the dead memcg (for example, when tmpfs file is removed).
> 
> Uncharging the dead memcg itself is presumably irrelevant, but it does
> need to locate the right parent to uncharge, and NULL css_from_id()
> would make that impossible.  It would be easy if we said those charges
> migrate to root rather than to parent, but that's inconsistent with
> what we have happily converged upon doing elsewhere (in the preferred
> use_hierarchy case), and it would be a change in behaviour.
> 
> I'm not nearly as enthusiastic for my patch as Michal is: I really
> would prefer a five-liner from you or from Zefan. 

I've come up with a fix. Though it's more than five lines, it mostly moves
a few lines from one place to another. I've tested it with your script.
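
To illustrate the lifetime rule the patch aims for, here is a minimal
userspace sketch. It is not kernel code: every toy_* name is hypothetical,
the fixed-size table only stands in for the cgroup idr, and a plain
counter stands in for the css percpu refcount. The point it tries to show
is that the id stays reserved, and lookups keep working, until the last
css reference is dropped rather than until the group goes offline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_IDS 8

/* Hypothetical stand-in for a cgroup with one css; not a kernel type. */
struct toy_group {
	int id;        /* slot in toy_id_table, or -1 once released */
	int css_refs;  /* references still held, e.g. by swap records */
	bool offline;  /* set by the toy equivalent of rmdir */
};

static struct toy_group *toy_id_table[TOY_MAX_IDS];

/* Bring a group online: take the lowest free id and one base reference. */
static int toy_online(struct toy_group *g)
{
	for (int i = 0; i < TOY_MAX_IDS; i++) {
		if (!toy_id_table[i]) {
			toy_id_table[i] = g;
			g->id = i;
			g->css_refs = 1;
			g->offline = false;
			return i;
		}
	}
	return -1;
}

/* Pin the group, as a swap record would pin the css. */
static void toy_get(struct toy_group *g)
{
	g->css_refs++;
}

/* Drop a reference; only the final put gives the id back for reuse. */
static void toy_put(struct toy_group *g)
{
	assert(g->css_refs > 0);
	if (--g->css_refs == 0) {
		toy_id_table[g->id] = NULL;
		g->id = -1;
	}
}

/* Take the group offline: drop the base reference, keep the id reserved. */
static void toy_offline(struct toy_group *g)
{
	g->offline = true;
	toy_put(g);
}

/* Look up by id; returns NULL only after the last reference is gone. */
static struct toy_group *toy_from_id(int id)
{
	if (id < 0 || id >= TOY_MAX_IDS)
		return NULL;
	return toy_id_table[id];
}
```

In this model, a group that is pinned with toy_get() and then taken
offline still resolves through toy_from_id(), and a group created
afterwards gets a different id. That is the behaviour Hugh's script
needs: the "new" group must not inherit the "old" group's swap records.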

============================

From: Li Zefan <lizefan@huawei.com>
Date: Tue, 17 Dec 2013 10:45:09 +0800
Subject: [PATCH] cgroup: don't recycle cgroup id until all csses have been destroyed

Hugh reported this bug:

> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.

Instead of removing the cgroup id right after all the csses have been
offlined, we should remove it only after all the csses have been destroyed.

To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cgroup.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c36d906..769b5bb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -868,6 +868,15 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 		struct cgroup *cgrp = dentry->d_fsdata;
 
 		BUG_ON(!(cgroup_is_dead(cgrp)));
+
+		/*
+		 * We should remove the cgroup object from idr before its
+		 * grace period starts, so we won't be looking up a cgroup
+		 * while the cgroup is being freed.
+		 */
+		idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+		cgrp->id = -1;
+
 		call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
 	} else {
 		struct cfent *cfe = __d_cfe(dentry);
@@ -4104,6 +4113,7 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	rcu_assign_pointer(css->cgroup->subsys[css->ss->subsys_id], NULL);
 	call_rcu(&css->rcu_head, css_free_rcu_fn);
 }
 
@@ -4545,14 +4555,6 @@ static void cgroup_destroy_css_killed(struct cgroup *cgrp)
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 
-	/*
-	 * We should remove the cgroup object from idr before its grace
-	 * period starts, so we won't be looking up a cgroup while the
-	 * cgroup is being freed.
-	 */
-	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
-	cgrp->id = -1;
-
 	dput(d);
 
 	set_bit(CGRP_RELEASABLE, &parent->flags);
-- 
1.8.0.2

