linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Glauber Costa <glommer@parallels.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>, <linux-kernel@vger.kernel.org>,
	<cgroups@vger.kernel.org>, <kamezawa.hiroyu@jp.fujitsu.com>,
	<devel@openvz.org>, Tejun Heo <tj@kernel.org>,
	<linux-mm@kvack.org>, "Suleiman Souhlal" <suleiman@google.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"Mel Gorman" <mgorman@suse.de>,
	David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v3 12/13] execute the whole memcg freeing in rcu callback
Date: Mon, 8 Oct 2012 13:45:35 +0400	[thread overview]
Message-ID: <5072A0BF.2060306@parallels.com> (raw)
In-Reply-To: <20121005153100.GB2625@cmpxchg.org>

On 10/05/2012 07:31 PM, Johannes Weiner wrote:
> On Thu, Oct 04, 2012 at 02:53:13PM +0400, Glauber Costa wrote:
>> On 10/01/2012 05:27 PM, Michal Hocko wrote:
>>> On Tue 18-09-12 18:04:09, Glauber Costa wrote:
>>>> A lot of the initialization we do in mem_cgroup_create() is done with softirqs
>>>> enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and
>>>> the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the
>>>> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This
>>>> means that the freeing of memcg structure must happen in a compatible context,
>>>> otherwise we'll get a deadlock.
>>>
>>> Maybe I am missing something obvious but why cannot we simply disble
>>> (soft)irqs in mem_cgroup_create rather than make the free path much more
>>> complicated. It really feels strange to defer everything (e.g. soft
>>> reclaim tree cleanup which should be a no-op at the time because there
>>> shouldn't be any user pages in the group).
>>>
>>
>> Ok.
>>
>> I was just able to come back to this today - I was mostly working on the
>> slab feedback over the past few days. I will answer yours and Tejun's
>> concerns at once:
>>
>> Here is the situation: the backtrace I get is this one:
>>
>> [  124.956725] =================================
>> [  124.957217] [ INFO: inconsistent lock state ]
>> [  124.957217] 3.5.0+ #99 Not tainted
>> [  124.957217] ---------------------------------
>> [  124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
>> [  124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes:
>> [  124.957217]  (&(&ss->id_lock)->rlock){+.?...}, at:
>> [<ffffffff810aa7b2>] spin_lock+0x9/0xb
>> [  124.957217] {SOFTIRQ-ON-W} state was registered at:
>> [  124.957217]   [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68
>> [  124.957217]   [<ffffffff8109a660>] lock_acquire+0x108/0x15c
>> [  124.957217]   [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f
>> [  124.957217]   [<ffffffff810aa7b2>] spin_lock+0x9/0xb
>> [  124.957217]   [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3
>> [  124.957217]   [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60
>> [  124.957217]   [<ffffffff81b20e04>] cgroup_init+0x50/0x100
>> [  124.957217]   [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee
>> [  124.957217]   [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5
>> [  124.957217]   [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b
>>
>>
>> So what we learn from it, is: we are acquiring a specific lock (the css
>> id one) from softirq context. It was previously taken in a
>> softirq-enabled context, that seems to be coming directly from
>> get_new_cssid.
>>
>> Tejun correctly pointed out that we should never acquire that lock from
>> a softirq context, in which he is right.
>>
>> But the situation changes slightly with kmem. Now, the following excerpt
>> of a backtrace is possible:
>>
>> [   48.602775]  [<ffffffff81103095>] free_accounted_pages+0x47/0x4c
>> [   48.602775]  [<ffffffff81047f90>] free_task+0x31/0x5c
>> [   48.602775]  [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
>> [   48.602775]  [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
>> [   48.602775]  [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
>> [   48.602775]  [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
>> [   48.602775]  [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
>> [   48.602775]  [<ffffffff8105266d>] __do_softirq+0x122/0x277
>>
>> So as you can see, free_accounted_pages (that will trigger a memcg_put()
>> -> mem_cgroup_free()) can now be called from softirq context, which is,
>> an rcu callback (and I just realized I wrote the exact opposite in the
>> subj line: man, I really suck at that!!)
>> As a matter of fact, we could not move to our rcu callback as well:
>>
>> we need to move it to a worker thread with the rest.
>>
>> We already have a worker thread: he reason we have it is not
>> static_branches: The reason is vfree(), that will BUG_ON(in_interrupt())
>> and could not be called from rcu callback as well. We moved static
>> branches in there as well for a similar problem, but haven't introduced it.
>>
>> Could we move just part of it to the worker thread? Absolutely yes.
>> Moving just free_css_id() is enough to make it work. But since it is not
>> the first context related problem we had, I thought: "to hell with that,
>> let's move everything and be safe".
>>
>> I am fine moving free_css_id() only if you would prefer.
>>
>> Can we disable softirqs when we initialize css_id? Maybe. My machine
>> seems to boot fine and survive the simple workload that would trigger
>> that bug if I use irqsave spinlocks instead of normal spinlocks. But
>> this has to be done from cgroup core: We have no control over css
>> creation in memcg.
>>
>> How would you guys like me to handle this ?
> 
> Without the vfree callback, I would have preferred just making the
> id_lock softirq safe.  But since we have to defer (parts of) freeing
> anyway, I like your approach of just deferring the rest as well
> better.
> 
> But please add comments why the stuff in there is actually deferred.
> Just simple notes like:
> 
> "this can be called from atomic contexts, <examples>",
> 
> "vfree must run from process context" and "css_id locking is not soft
> irq safe",
> 
> "to hell with that, let's just do everything from the workqueue and be
> safe and simple".
> 
> (And this may be personal preference, but why have free_work call
> __mem_cgroup_free()?  Does anyone else need to call that code?  There
> are too many layers already, why not just keep it all in free_work()
> and have one less stack frame on your mind? :))
> 
It is used when create fails.


  reply	other threads:[~2012-10-08  9:45 UTC|newest]

Thread overview: 127+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-18 14:03 [PATCH v3 00/13] kmem controller for memcg Glauber Costa
2012-09-18 14:03 ` [PATCH v3 01/13] memcg: Make it possible to use the stock for more than one page Glauber Costa
2012-10-01 18:48   ` Johannes Weiner
2012-09-18 14:03 ` [PATCH v3 02/13] memcg: Reclaim when more than one page needed Glauber Costa
2012-10-01 19:00   ` Johannes Weiner
2012-09-18 14:04 ` [PATCH v3 03/13] memcg: change defines to an enum Glauber Costa
2012-10-01 19:06   ` Johannes Weiner
2012-10-02  9:10     ` Glauber Costa
2012-09-18 14:04 ` [PATCH v3 04/13] kmem accounting basic infrastructure Glauber Costa
2012-09-21 16:34   ` Tejun Heo
2012-09-24  8:09     ` Glauber Costa
2012-09-26 14:03   ` Michal Hocko
2012-09-26 14:33     ` Glauber Costa
2012-09-26 16:01       ` Michal Hocko
2012-09-26 17:34         ` Glauber Costa
2012-09-26 16:36     ` Tejun Heo
2012-09-26 17:36       ` Glauber Costa
2012-09-26 17:44         ` Tejun Heo
2012-09-26 17:53           ` Glauber Costa
2012-09-26 18:01             ` Tejun Heo
2012-09-26 18:56               ` Glauber Costa
2012-09-26 19:34                 ` Tejun Heo
2012-09-26 19:46                   ` Glauber Costa
2012-09-26 19:56                     ` Tejun Heo
2012-09-26 20:02                       ` Glauber Costa
2012-09-26 20:16                         ` Tejun Heo
2012-09-26 21:24                           ` Glauber Costa
2012-09-26 22:10                             ` Tejun Heo
2012-09-26 22:29                               ` Glauber Costa
2012-09-26 22:42                                 ` Tejun Heo
2012-09-26 22:54                                   ` Glauber Costa
2012-09-26 23:08                                     ` Tejun Heo
2012-09-26 23:20                                       ` Glauber Costa
2012-09-26 23:33                                         ` Tejun Heo
2012-09-27 12:15                                           ` Michal Hocko
2012-09-27 12:20                                             ` Glauber Costa
2012-09-27 12:40                                               ` Michal Hocko
2012-09-27 12:40                                                 ` Glauber Costa
2012-09-27 12:54                                                   ` Michal Hocko
2012-09-27 14:28                                       ` Mel Gorman
2012-09-27 14:49                                         ` Tejun Heo
2012-09-27 14:57                                           ` Glauber Costa
2012-09-27 17:46                                             ` Tejun Heo
2012-09-27 17:56                                               ` Michal Hocko
2012-09-27 18:45                                               ` Glauber Costa
2012-09-30  7:57                                                 ` Tejun Heo
2012-09-30  8:02                                                   ` Tejun Heo
2012-09-30  8:56                                                     ` James Bottomley
2012-09-30 10:37                                                       ` Tejun Heo
2012-09-30 11:25                                                         ` James Bottomley
2012-10-01  0:57                                                           ` Tejun Heo
2012-10-01  8:43                                                             ` Glauber Costa
2012-10-01  8:46                                                         ` Glauber Costa
2012-10-03 22:59                                                           ` Tejun Heo
2012-10-01  8:36                                                   ` Glauber Costa
2012-09-27 12:08                             ` Michal Hocko
2012-09-27 12:11                               ` Glauber Costa
2012-09-27 14:33                               ` Tejun Heo
2012-09-27 14:43                                 ` Mel Gorman
2012-09-27 14:58                                   ` Tejun Heo
2012-09-27 18:30                                     ` Glauber Costa
2012-09-30  8:23                                       ` Tejun Heo
2012-10-01  8:45                                         ` Glauber Costa
2012-10-03 22:54                                           ` Tejun Heo
2012-10-04 11:55                                             ` Glauber Costa
2012-10-06  2:19                                               ` Tejun Heo
2012-09-27 15:09                                 ` Michal Hocko
2012-09-30  8:47                                   ` Tejun Heo
2012-10-01  9:27                                     ` Michal Hocko
2012-10-03 22:43                                       ` Tejun Heo
2012-10-05 13:47                                         ` Michal Hocko
2012-09-26 22:11                         ` Johannes Weiner
2012-09-26 22:45                           ` Glauber Costa
2012-09-18 14:04 ` [PATCH v3 05/13] Add a __GFP_KMEMCG flag Glauber Costa
2012-09-18 14:15   ` Rik van Riel
2012-09-18 15:06   ` Christoph Lameter
2012-09-19  7:39     ` Glauber Costa
2012-09-19 14:07       ` Christoph Lameter
2012-09-27 13:34   ` Mel Gorman
2012-09-27 13:41     ` Glauber Costa
2012-10-01 19:09   ` Johannes Weiner
2012-09-18 14:04 ` [PATCH v3 06/13] memcg: kmem controller infrastructure Glauber Costa
2012-09-20 16:05   ` JoonSoo Kim
2012-09-21  8:41     ` Glauber Costa
2012-09-21  9:14       ` JoonSoo Kim
2012-09-26 15:51   ` Michal Hocko
2012-09-27 11:31     ` Glauber Costa
2012-09-27 13:44       ` Michal Hocko
2012-09-28 11:34         ` Glauber Costa
2012-09-30  8:25           ` Tejun Heo
2012-10-01  8:28             ` Glauber Costa
2012-10-03 22:11               ` Tejun Heo
2012-10-01  9:44             ` Michal Hocko
2012-10-01  9:48           ` Michal Hocko
2012-10-01 10:09             ` Glauber Costa
2012-10-01 11:51               ` Michal Hocko
2012-10-01 11:51                 ` Glauber Costa
2012-10-01 11:58                   ` Michal Hocko
2012-10-01 12:04                     ` Glauber Costa
2012-09-18 14:04 ` [PATCH v3 07/13] mm: Allocate kernel pages to the right memcg Glauber Costa
2012-09-27 13:50   ` Mel Gorman
2012-09-28  9:43     ` Glauber Costa
2012-09-28 13:28       ` Mel Gorman
2012-09-27 13:52   ` Michal Hocko
2012-09-18 14:04 ` [PATCH v3 08/13] res_counter: return amount of charges after res_counter_uncharge Glauber Costa
2012-10-01 10:00   ` Michal Hocko
2012-10-01 10:01     ` Glauber Costa
2012-09-18 14:04 ` [PATCH v3 09/13] memcg: kmem accounting lifecycle management Glauber Costa
2012-10-01 12:15   ` Michal Hocko
2012-10-01 12:29     ` Glauber Costa
2012-10-01 12:36       ` Michal Hocko
2012-10-01 12:43         ` Glauber Costa
2012-09-18 14:04 ` [PATCH v3 10/13] memcg: use static branches when code not in use Glauber Costa
2012-10-01 12:25   ` Michal Hocko
2012-10-01 12:27     ` Glauber Costa
2012-09-18 14:04 ` [PATCH v3 11/13] memcg: allow a memcg with kmem charges to be destructed Glauber Costa
2012-10-01 12:30   ` Michal Hocko
2012-09-18 14:04 ` [PATCH v3 12/13] execute the whole memcg freeing in rcu callback Glauber Costa
2012-09-21 17:23   ` Tejun Heo
2012-09-24  8:48     ` Glauber Costa
2012-10-01 13:27   ` Michal Hocko
2012-10-04 10:53     ` Glauber Costa
2012-10-04 14:20       ` Glauber Costa
2012-10-05 15:31       ` Johannes Weiner
2012-10-08  9:45         ` Glauber Costa [this message]
2012-09-18 14:04 ` [PATCH v3 13/13] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs Glauber Costa
2012-10-01 13:17   ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5072A0BF.2060306@parallels.com \
    --to=glommer@parallels.com \
    --cc=cgroups@vger.kernel.org \
    --cc=devel@openvz.org \
    --cc=fweisbec@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=rientjes@google.com \
    --cc=suleiman@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).