From: Vlastimil Babka <vbabka@suse.cz>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: bharata@linux.ibm.com, linux-mm@kvack.org, Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>, David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>, Andrew Morton <akpm@linux-foundation.org>,
	linuxppc-dev@ozlabs.org, aneesh.kumar@linux.ibm.com,
	Sachin Sant <sachinp@linux.vnet.ibm.com>, Michal Hocko <mhocko@kernel.org>
Subject: Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest
Date: Wed, 18 Mar 2020 11:18:11 +0100
Message-ID: <088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz>
In-Reply-To: <20200318032044.GC4879@linux.vnet.ibm.com>

On 3/18/20 4:20 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:
>>
>> Yes, that Kirill's patch was about the memcg shrinker map allocation. But
>> from the patch hunk that Bharata posted as a "hack" that fixes the problem,
>> it follows that there has to be something else that calls
>> kmalloc_node(node) where node is one that doesn't have present pages.
>>
>> He mentions alloc_fair_sched_group() which has:
>>
>>         for_each_possible_cpu(i) {
>>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>>                                       GFP_KERNEL, cpu_to_node(i));
>>                 ...
>>                 se = kzalloc_node(sizeof(struct sched_entity),
>>                                   GFP_KERNEL, cpu_to_node(i));
>>
>
> Sachin's experiment:
> Upstream-next / memcg
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.

So you're saying the crash was actually for an allocation on e.g. node 2,
not node 0? But I believe it was on node 0, because init_kmem_cache_nodes()
will only allocate kmem_cache_node on nodes with N_NORMAL_MEMORY (which
doesn't include 0), and slab_mem_going_online_callback() was probably not
called for node 0 (it was not dynamically onlined).
Also, if node 0 was fine, node_to_mem_node(2-31) (not initialized
explicitly) would have returned 0 and thus not crashed either.

> Bharata's experiment:
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node / for_each_possible_cpu,
> i.e. kmalloc is called for N_ONLINE nodes.
> So wouldn't crash.
>
> Even if his possible nodes were 0-256, I don't think we have kmalloc_node
> being called on !N_ONLINE nodes. Hence it's not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node,
> which in Bharata's case will always return 1.

Are you sure that cpu_to_node() will be 1 for each possible cpu? Are all of
them properly initialized, or is there a similar issue as with
node_to_mem_node(), where some were not initialized and thus cpu_to_node()
will return 0? Because AFAICS, if kzalloc_node() was always called with 1,
then node_present_pages(1) is true, and the "hack" that Bharata reports to
work in his original mail would make no functional difference.

>
>> I assume one of these structs is 1k and the other 512 bytes (rounded) and
>> that for some possible cpus cpu_to_node(i) will be 0, which has no present
>> pages. And as Bharata pasted, node_to_mem_node(0) = 0.
>> So this looks like the same scenario, but it doesn't crash? Is node 0
>> actually online here, and/or does it have N_NORMAL_MEMORY state?
>
> I still dont have any clue on the leak though.

Let's assume that kzalloc_node() was called with 0 for some of the possible
CPUs. I still wonder why it won't crash, but let's assume kmem_cache_node
does exist for node 0 here.
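(Side note: the "returns 0 when not initialized explicitly" behavior of
node_to_mem_node() mentioned above falls straight out of its definition;
roughly, from include/linux/topology.h of kernels of that era, with
CONFIG_HAVE_MEMORYLESS_NODES -- a sketch from memory, so check the exact
tree:)

	int _node_numa_mem_[MAX_NUMNODES];	/* static storage, zeroed at boot */

	static inline int node_to_mem_node(int node)
	{
		/* any node never touched by set_numa_mem() maps to node 0 */
		return _node_numa_mem_[node];
	}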
So the execution AFAICS goes like this:

slab_alloc_node(0)
    c = raw_cpu_ptr(s->cpu_slab);
    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
        // whatever we have in the per-cpu cache must be from node 1,
        // because node 0 has no memory, so there's no node_match and thus

    __slab_alloc(node == 0)
      ___slab_alloc(node == 0)
        page = c->page;
redo:
        if (unlikely(!node_match(page, node))) { // still no match
            int searchnode = node;
            if (node != NUMA_NO_NODE && !node_present_pages(node))
                // true && true for node 0
                searchnode = node_to_mem_node(node);
                // searchnode is 0, not 1

            if (unlikely(!node_match(page, searchnode))) {
                // page still from node 1, searchnode is 0, no match
                stat(s, ALLOC_NODE_MISMATCH);
                deactivate_slab(s, page, c->freelist, c);
                // we removed the slab from the cpu's cache
                goto new_slab;
            }

new_slab:
        if (slub_percpu_partial(c)) {
            page = c->page = slub_percpu_partial(c);
            slub_set_percpu_partial(c, page);
            stat(s, CPU_PARTIAL_ALLOC);
            goto redo;
            // huh, so with CONFIG_SLUB_CPU_PARTIAL
            // this can become an infinite loop actually?
        }
        // Bharata's slub stats don't include cpu_partial_alloc, so I assume
        // CONFIG_SLUB_CPU_PARTIAL is not enabled and we don't loop

        freelist = new_slab_objects(s, gfpflags, node, &c);

        new_slab_objects(s, gfpflags, node, &c)
            if (node == NUMA_NO_NODE)            // false, it's 0
            else if (!node_present_pages(node))  // true for 0
                searchnode = node_to_mem_node(node); // still 0

            object = get_partial_node(s, get_node(s, searchnode), ...);
            // object is NULL as node 0 has nothing,
            // but we have node == 0 so we return the NULL
            if (object || node != NUMA_NO_NODE)
                return object;
            // and we don't fall back to get_any_partial(), which would
            // have found e.g. the slab we deactivated earlier
            return get_any_partial(s, flags, c);

            page = new_slab(s, flags, node);
            // we attempt to allocate a new slab on node 0, but it will
            // come from node 1

So that explains the leak, I think. We keep throwing away slabs from node 1
only to allocate new ones on node 1. Effectively, each cfs_rq object and
each sched_entity object allocated for a possible cpu where cpu_to_node()
is 0 gets its own new (high-order?) page.
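(To see the arithmetic of the leak concretely, here is a toy userspace
simulation of the control flow above -- all names and numbers invented for
illustration, not kernel code: because searchnode collapses to the
memoryless node 0 and never matches the node-1 slab cached on the cpu,
every allocation deactivates the current slab and grabs a fresh page, so N
objects cost N pages instead of N / objects-per-slab:)

	#include <stdio.h>

	/* Toy model of the ___slab_alloc() path traced above. Node 0 is
	 * memoryless; every slab actually comes from node 1. */
	#define OBJS_PER_SLAB 8

	static int pages_allocated;
	static int cpu_slab_node = -1;	/* node of the slab cached on this cpu */

	static int node_present_pages(int node) { return node != 0; }
	static int node_to_mem_node(int node)   { return 0; } /* uninitialized => 0 */

	static void alloc_object(int node)
	{
		int searchnode = node;

		if (!node_present_pages(node))
			searchnode = node_to_mem_node(node);	/* still 0 */

		if (cpu_slab_node != searchnode) {	/* never matches node-1 slab */
			cpu_slab_node = -1;		/* deactivate_slab() */
			pages_allocated++;		/* new_slab(): fresh node-1 page */
			cpu_slab_node = 1;
		}
	}

	int main(void)
	{
		for (int i = 0; i < 1024; i++)
			alloc_object(0);	/* kzalloc_node(..., 0) per possible cpu */
		printf("%d objects -> %d pages (could have fit in ~%d)\n",
		       1024, pages_allocated, 1024 / OBJS_PER_SLAB);
		return 0;
	}

(This prints "1024 objects -> 1024 pages (could have fit in ~128)": each
object burns a whole slab, matching the increased memory consumption
Bharata observed.)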