From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=j5h/=5D=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8F169C10DCE
	for <linux-mm@archiver.kernel.org>; Wed, 18 Mar 2020 10:18:17 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 4D6432076D
	for <linux-mm@archiver.kernel.org>; Wed, 18 Mar 2020 10:18:17 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4D6432076D
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=suse.cz
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id D31096B0078; Wed, 18 Mar 2020 06:18:16 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id CBBD86B007B; Wed, 18 Mar 2020 06:18:16 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id B83736B007D; Wed, 18 Mar 2020 06:18:16 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0200.hostedemail.com [216.40.44.200])
	by kanga.kvack.org (Postfix) with ESMTP id 9C6B86B0078
	for <linux-mm@kvack.org>; Wed, 18 Mar 2020 06:18:16 -0400 (EDT)
Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 547B733CD
	for <linux-mm@kvack.org>; Wed, 18 Mar 2020 10:18:16 +0000 (UTC)
X-FDA: 76608082992.12.wind70_653a5e2648949
X-HE-Tag: wind70_653a5e2648949
X-Filterd-Recvd-Size: 7206
Received: from mx2.suse.de (mx2.suse.de [195.135.220.15])
	by imf33.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Wed, 18 Mar 2020 10:18:15 +0000 (UTC)
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (unknown [195.135.220.254])
	by mx2.suse.de (Postfix) with ESMTP id 43548AB98;
	Wed, 18 Mar 2020 10:18:13 +0000 (UTC)
Subject: Re: Slub: Increased mem consumption on cpu,mem-less node powerpc
 guest
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: bharata@linux.ibm.com, linux-mm@kvack.org,
 Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>,
 David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>,
 Andrew Morton <akpm@linux-foundation.org>, linuxppc-dev@ozlabs.org,
 aneesh.kumar@linux.ibm.com, Sachin Sant <sachinp@linux.vnet.ibm.com>,
 Michal Hocko <mhocko@kernel.org>
References: <20200317092624.GB22538@in.ibm.com>
 <20200317115339.GA26049@in.ibm.com>
 <4088ae3c-4dfa-62ae-f56a-b46773788fc7@suse.cz>
 <20200317162536.GB27520@linux.vnet.ibm.com>
 <080b2d00-76ef-2187-ec78-c9d181ef1701@suse.cz>
 <20200318032044.GC4879@linux.vnet.ibm.com>
From: Vlastimil Babka <vbabka@suse.cz>
Message-ID: <088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz>
Date: Wed, 18 Mar 2020 11:18:11 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.5.0
MIME-Version: 1.0
In-Reply-To: <20200318032044.GC4879@linux.vnet.ibm.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 3/18/20 4:20 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:
>> 
>> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
>> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
>> that there has to be something else that calls kmalloc_node(node) where node is
>> one that doesn't have present pages.
>> 
>> He mentions alloc_fair_sched_group() which has:
>> 
>>         for_each_possible_cpu(i) {
>>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>>                                       GFP_KERNEL, cpu_to_node(i));
>> ...
>>                 se = kzalloc_node(sizeof(struct sched_entity),
>>                                   GFP_KERNEL, cpu_to_node(i));
>> 
> 
> 
> Sachin's experiment.
> Upstream-next/ memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.

So you're saying the crash was actually for an allocation on e.g. node 2, not
node 0? But I believe it was on node 0, because init_kmem_cache_nodes() will only
allocate kmem_cache_node on nodes with N_NORMAL_MEMORY (which doesn't include
node 0), and slab_mem_going_online_callback() was probably not called for node 0
(it was not dynamically onlined).
Also, if node 0 was fine, node_to_mem_node() for nodes 2-31 (not initialized
explicitly) would have returned 0, and those allocations would not crash either.
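
To illustrate the point about nodes that were never set up explicitly, here is
a small userspace model, not kernel code. The names node_mem_map,
model_set_numa_mem() and model_node_to_mem_node() are hypothetical, and the
model assumes the lookup table lives in zero-initialized static storage, so any
entry that nobody wrote still reads back as node 0.

```c
#include <assert.h>

#define MODEL_MAX_NODES 32   /* assumption: possible nodes 0-31, as in Sachin's setup */

/* Static storage is zero-initialized, so every entry starts out as node 0. */
static int node_mem_map[MODEL_MAX_NODES];

/* Record the nearest node with memory for `node` (hypothetical setter). */
static void model_set_numa_mem(int node, int mem_node)
{
    assert(node >= 0 && node < MODEL_MAX_NODES);
    assert(mem_node >= 0 && mem_node < MODEL_MAX_NODES);
    node_mem_map[node] = mem_node;
}

/* Look up the fallback node; entries that were never set read back as 0. */
static int model_node_to_mem_node(int node)
{
    assert(node >= 0 && node < MODEL_MAX_NODES);
    return node_mem_map[node];
}
```

In this model, a lookup for node 2 or node 31 without a prior
model_set_numa_mem() call returns 0, which is the behaviour argued above.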

> Bharata's experiment.
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node/ for_each_possible_cpu
> i.e kmalloc is called for N_ONLINE nodes.
> So wouldn't crash
> 
> Even if his possible nodes were 0-256. I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence its not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.

Are you sure that cpu_to_node() will be 1 for every CPU visited by
for_each_possible_cpu()? Are all of them properly initialized, or is there a
similar issue as with node_to_mem_node(), where some entries were not initialized
and thus cpu_to_node() will return 0?

Because AFAICS, if kzalloc_node() was always called with node 1, then
node_present_pages(1) is true, and the "hack" that Bharata reports to work in
his original mail would make no functional difference.
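
One way to check the question above is to count how many possible CPUs map to
a node without present pages. The following is only a userspace sketch under
assumed inputs: model_count_memoryless_cpus() is a hypothetical helper, and the
cpu_node and present_pages arrays stand in for whatever cpu_to_node() and
node_present_pages() report on the real system.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the CPUs whose node has no present pages.
 * cpu_node[cpu] models cpu_to_node(cpu); present_pages[node] models
 * node_present_pages(node). Both arrays are assumed inputs.
 */
static size_t model_count_memoryless_cpus(const int *cpu_node, size_t nr_cpus,
                                          const unsigned long *present_pages,
                                          size_t nr_nodes)
{
    size_t count = 0;

    for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
        int node = cpu_node[cpu];

        assert(node >= 0 && (size_t)node < nr_nodes);
        if (present_pages[node] == 0)
            count++;
    }
    return count;
}
```

If the count is zero, every allocation targets a node with memory and a check
on node_present_pages() never changes anything. A non-zero count means some
kzalloc_node() calls do target a memoryless node.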

> 
>> I assume one of these structs is 1k and other 512 bytes (rounded) and that for
>> some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And as
>> Bharata pasted, node_to_mem_node(0) = 0
>> So this looks like the same scenario, but it doesn't crash? Is the node 0
>> actually online here, and/or does it have N_NORMAL_MEMORY state?
> 
> I still dont have any clue on the leak though.

Let's assume that kzalloc_node() was called with node 0 for some of the possible
CPUs. I still wonder why it doesn't crash, but let's assume kmem_cache_node does
exist for node 0 here.
So the execution AFAICS goes like this:

slab_alloc_node(0)
  c = raw_cpu_ptr(s->cpu_slab);
  object = c->freelist;
  page = c->page;
  if (unlikely(!object || !node_match(page, node))) {
  // whatever we have in the per-cpu cache must be from node 1
  // because node 0 has no memory, so there's no node_match and thus
   __slab_alloc(node == 0)
    ___slab_alloc(node == 0)
      page = c->page;
     redo:
      if (unlikely(!node_match(page, node))) { // still no match
        int searchnode = node;

        if (node != NUMA_NO_NODE && !node_present_pages(node))
          // true && true for node 0
          searchnode = node_to_mem_node(node);
          // searchnode is 0, not 1

        if (unlikely(!node_match(page, searchnode))) {
          // page still from node 1, searchnode is 0, no match
          stat(s, ALLOC_NODE_MISMATCH);
          deactivate_slab(s, page, c->freelist, c);
          // we removed the slab from cpu's cache
          goto new_slab;
        }

     new_slab:
      if (slub_percpu_partial(c)) {
        page = c->page = slub_percpu_partial(c);
        slub_set_percpu_partial(c, page);
        stat(s, CPU_PARTIAL_ALLOC);
        goto redo;
        // huh, so with CONFIG_SLUB_CPU_PARTIAL
        // this can become an infinite loop actually?
      }
// Bharata's slub stats don't include cpu_partial_alloc so I assume
// CONFIG_SLUB_CPU_PARTIAL is not enabled and we don't loop
      freelist = new_slab_objects(s, gfpflags, node, &c);
        freelist = get_partial(s, flags, node, c);

         if (node == NUMA_NO_NODE) // false, it's 0
         else if (!node_present_pages(node)) // true for 0
            searchnode = node_to_mem_node(node); // still 0

         object = get_partial_node(s, get_node(s, searchnode),...);
         // object is NULL as node 0 has nothing
         // but we have node == 0 so we return the NULL
         if (object || node != NUMA_NO_NODE)
                return object;
         // and we don't fallback to get_any_partial which would
         // have found e.g. the slab we deactivated earlier
         return get_any_partial(s, flags, c);

       page = new_slab(s, flags, node);
       // we attempt to allocate new slab on node 0, but it will come
       // from node 1

So that explains the leak, I think. We keep throwing away slabs from node 1 only
to allocate new ones, again on node 1. Effectively, each cfs_rq object and each
sched_entity object allocated for a possible CPU where cpu_to_node() is 0 will
get a new (high-order?) page.
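
The effect can be reproduced in a toy userspace model of the path traced above.
This is a sketch under stated assumptions, not the SLUB code: all sim_* names
are hypothetical, node 0 has no memory and node 1 has all of it,
node_to_mem_node(0) is 0 as Bharata reported, a slab holds an arbitrary 8
objects, and partial lists are left out entirely (for node 0 that matches the
trace, since get_partial_node() finds nothing there and get_any_partial() is
never reached).

```c
#include <assert.h>

#define NUMA_NO_NODE   (-1)
#define OBJS_PER_SLAB  8   /* assumption: arbitrary slab capacity */

struct sim_cpu_slab {
    int has_page;      /* is a slab currently cached for this CPU? */
    int page_node;     /* node the cached slab's page came from */
    int free_objects;  /* objects still available in the cached slab */
};

struct sim_stats {
    int new_slabs;     /* slabs obtained from the page allocator */
    int deactivations; /* cached slabs thrown away on node mismatch */
};

/* Model topology: only node 1 has present pages. */
static int sim_node_present_pages(int node) { return node == 1; }

/* Model of the reported state: node 0 maps to itself, not to node 1. */
static int sim_node_to_mem_node(int node) { return node; }

static int sim_node_match(const struct sim_cpu_slab *c, int node)
{
    return node == NUMA_NO_NODE || c->page_node == node;
}

/* Allocate one object for `node`; returns the node the object came from. */
static int sim_alloc(struct sim_cpu_slab *c, struct sim_stats *st, int node)
{
    int searchnode = node;

    assert(node == NUMA_NO_NODE || node == 0 || node == 1);

    if (node != NUMA_NO_NODE && !sim_node_present_pages(node))
        searchnode = sim_node_to_mem_node(node);

    /* Fast path: the cached slab matches the node and has room. */
    if (c->has_page && c->free_objects > 0 && sim_node_match(c, searchnode)) {
        c->free_objects--;
        return c->page_node;
    }

    /* Node mismatch: the cached slab is dropped, free objects and all. */
    if (c->has_page && !sim_node_match(c, searchnode))
        st->deactivations++;

    /* New slab: the page allocator can only satisfy this from node 1. */
    st->new_slabs++;
    c->has_page = 1;
    c->page_node = 1;
    c->free_objects = OBJS_PER_SLAB - 1;
    return c->page_node;
}
```

With these assumptions, 16 calls to sim_alloc() with node 0 consume 16 slabs,
while 16 calls with node 1 consume 2.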