LKML Archive on lore.kernel.org
From: Vlastimil Babka <vbabka@suse.cz>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, Mel Gorman <mgorman@suse.de>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org, Christopher Lameter <cl@linux.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus
Date: Thu, 12 Mar 2020 17:41:58 +0100
Message-ID: <e115048c-be38-c298-b8d1-d4b513e7d2fb@suse.cz> (raw)
In-Reply-To: <20200312161310.GC3277@linux.vnet.ibm.com>

On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> 
>> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
>> > 
>> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >> >>>> proper node.
>> >> >>> 
>> >> >>> Just curious, is this somehow related to
>> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> >> >>> 
>> >> >> 
>> >> >> The issue I am trying to fix has been a known issue in powerpc for many
>> >> >> years, so it is surely not a problem introduced by a75056fc1e7c
>> >> >> ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node").
>> >> >> 
>> > 
>> > While I am not an expert in the slub area, I looked at commit
>> > a75056fc1e7c and had some thoughts on why it could be causing this issue.
>> > 
>> > On the system where the crash happens, the number of possible nodes is much
>> > greater than the number of online nodes. The pgdat, i.e. NODE_DATA, is only
>> > available for online nodes.
>> > 
>> > With a75056fc1e7c, memcg_alloc_shrinker_maps() ends up calling kzalloc_node()
>> > for all possible nodes, and in ___slab_alloc() we end up looking at
>> > node_present_pages(), which is NODE_DATA(nid)->node_present_pages,
>> > i.e. for a node whose pgdat struct is not allocated, we end up
>> > dereferencing an invalid pointer.
>> 
>> From what we saw, the pgdat does exist, the problem is that slab's per-node data
>> doesn't exist for a node that doesn't have present pages, as it would be a waste
>> of memory.
> 
> Just to be clear
> Before my 3 patches to fix dummy node:
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 0-1

OK

>> 
>> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
>> Sachin's first report [1] we have
>> 
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> [    0.000000] numa:     NODE_DATA(0) on node 1
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
>> 
> 
> So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> rest 30 nodes.

I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
check online first, as you suggested.
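To make the suggestion concrete, the check could look something like the following userspace model. The node state is stubbed to mirror Sachin's configuration (possible 0-31, only node 1 online with memory), and `safe_search_node()` is an invented name for illustration, not an existing kernel helper:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 32
#define NUMA_NO_NODE (-1)

/* Stubbed node state: in the kernel these would be node_online() and
 * node_present_pages(); here only node 1 is online with memory. */
static bool stub_online[MAX_NUMNODES]  = { [1] = true };
static long stub_present[MAX_NUMNODES] = { [1] = 1024 };

static bool node_online_stub(int nid)        { return stub_online[nid]; }
static long node_present_pages_stub(int nid) { return stub_present[nid]; }

/* Hypothetical helper: only trust a requested node that is online AND
 * has present pages; otherwise redirect to a known-good fallback,
 * instead of dereferencing per-node data that may not exist. */
static int safe_search_node(int node, int fallback)
{
	if (node == NUMA_NO_NODE)
		return fallback;
	if (!node_online_stub(node) || !node_present_pages_stub(node))
		return fallback;
	return node;
}
```

With this model, an allocation aimed at a possible-but-offline node (say node 5) is redirected to the fallback rather than touching a missing pgdat or missing slab per-node data.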

>> But in this thread, with your patches Sachin reports:
> 
> and with my patches
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 1
> 
>> 
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> 
> 
> so we only see one pgdat.
> 
>> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
>> 
>> [1]
>> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>> 
>> > Also for a memoryless/cpuless node, or for possible but not present nodes,
>> > node_to_mem_node(node) will still end up as node (at least on powerpc).
>> 
>> I think that's the place where this would be best to fix.
>> 
> 
> Maybe. I thought about it, but the current set_numa_mem() semantics are apt
> for a memoryless cpu node and not for possible nodes. We could have up to 256
> possible nodes and only 2 nodes (1,2) with cpus and 1 node (1) with memory.
> node_to_mem_node() returns whatever was set by set_numa_mem(), and
> set_numa_mem() sets the numa_mem node for the current memoryless node to the
> node passed as parameter.
> 
> But how do we set numa_mem for all the other 253 possible nodes, which will
> probably default to 0?
> 
> Should we introduce another API such that we could update for all possible
> nodes?

If we want to rely on node_to_mem_node() to give us something safe for each
possible node, then probably it would have to be like that, yeah.
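Such an API could be modeled roughly as below. This is only a sketch of the idea under discussion: the name `set_numa_mem_for_possible_nodes()` is invented, and the kernel's per-cpu/per-node `_numa_mem_` state is stubbed as a plain array:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 32

/* Stubbed topology: only node 1 is online with memory. */
static bool online[MAX_NUMNODES]  = { [1] = true };
static long present[MAX_NUMNODES] = { [1] = 1024 };

/* The kernel keeps this per cpu/node as _numa_mem_; stubbed here. */
static int numa_mem[MAX_NUMNODES];

static int first_memory_node(void)
{
	for (int n = 0; n < MAX_NUMNODES; n++)
		if (online[n] && present[n])
			return n;
	return 0;
}

/* Hypothetical API: give every possible node a sane numa_mem fallback
 * up front, so node_to_mem_node() never hands back an offline node. */
static void set_numa_mem_for_possible_nodes(void)
{
	int fb = first_memory_node();

	for (int n = 0; n < MAX_NUMNODES; n++)
		numa_mem[n] = (online[n] && present[n]) ? n : fb;
}
```

After running this once at boot (and again on node online/offline events), even the 253 never-present nodes would map to an online node with memory instead of defaulting to 0.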

>> > I tried with this hunk below and it works.
>> > 
>> > But I am not sure if we need the same check at other places where
>> > node_present_pages() is being called.
>> 
>> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
>> return only nodes that are online with present memory?
>> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
>> support for node_to_mem_node() to determine the fallback node")
>> 
> 
> Agree 
> 
>> I think we do need well defined and documented rules around node_to_mem_node(),
>> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
>> everyone handles it the same, safe way.

So let's try to brainstorm what this would look like. What I mean are some
rules like the ones below, even if some details in my current understanding
are most likely incorrect:

with nid present in:
N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
node with memory so that we don't require everyone to search for it in slightly
different ways
N_ONLINE - pgdat must exist, there doesn't have to be present memory,
node_to_mem_node() still has to return something else (?)
N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
N_HIGH_MEMORY - node has present high memory
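The proposed contract for the first three states could be modeled as a small decision function. This only encodes the rules sketched above, with node states stubbed and the helper name `mem_fallback_node()` invented for illustration:

```c
#include <assert.h>

#define MAX_NUMNODES 32

enum nstate { POSSIBLE, ONLINE, NORMAL_MEMORY };

/* Stubbed states: node 0 possible only (pgdat may not exist),
 * node 1 online with present memory, node 2 online but memoryless.
 * All remaining nodes default to POSSIBLE. */
static enum nstate state[MAX_NUMNODES] = {
	[0] = POSSIBLE,
	[1] = NORMAL_MEMORY,
	[2] = ONLINE,
};

/* Model of the proposed node_to_mem_node() contract: a node with
 * present memory maps to itself; any other possible or online node
 * maps to some online node with memory. */
static int mem_fallback_node(int nid)
{
	if (state[nid] == NORMAL_MEMORY)
		return nid;
	for (int n = 0; n < MAX_NUMNODES; n++)
		if (state[n] == NORMAL_MEMORY)
			return n;
	return 0;
}
```

The point of pinning down such a contract is that every caller (slab, memcg, drivers) could then rely on the returned node being safe to dereference, instead of each open-coding slightly different online/present checks.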

> 
> Other option would be to tweak Kirill Tkhai's patch such that we call
> kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> if the node is offline.

I really would like a solution that hides these ugly details from callers so
they don't have to work around the APIs we provide. kvmalloc_node() really
shouldn't crash, and it should fall back automatically if we don't pass it
__GFP_THISNODE.

However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
wasteful on systems with 256 possible nodes and only a few present, as it
allocates effectively dead structures for each memcg.

SLUB tries to be smart, so it allocates the per-node per-cache structures only
when the node goes online in slab_mem_going_online_callback(). This is why
there's a crash when such non-existing structures are accessed for a node that's
not online, and why they shouldn't be accessed.

Perhaps memcg should do the same on-demand allocation, if possible.
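The SLUB pattern referenced here, allocating per-node structures from a memory hotplug callback instead of up front, could be modeled like this. The callback wiring and names are illustrative only, not the actual memcg or SLUB code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NUMNODES 32

/* Per-node data allocated lazily, in the spirit of SLUB's
 * slab_mem_going_online_callback(): NULL until the node onlines. */
static long *per_node_map[MAX_NUMNODES];

/* Hypothetical MEM_GOING_ONLINE handler: allocate the node's
 * structure only when the node actually gains memory. */
static int mem_going_online(int nid)
{
	if (!per_node_map[nid])
		per_node_map[nid] = calloc(64, sizeof(long));
	return per_node_map[nid] ? 0 : -1;
}

/* Accessors must then tolerate NULL for never-onlined nodes,
 * which is exactly the invariant the crash here violated. */
static long *node_map(int nid)
{
	return per_node_map[nid]; /* may be NULL */
}
```

The trade-off is visible in the model: memory is spent only on nodes that ever come online, but every reader of the per-node data must handle the NULL case or go through a helper that does.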

>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index 626cbcbd977f..bddb93bed55e 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> >  	if (unlikely(!node_match(page, node))) {
>> >  		int searchnode = node;
>> >  
>> > -		if (node != NUMA_NO_NODE && !node_present_pages(node))
>> > -			searchnode = node_to_mem_node(node);
>> > -
>> > +		if (node != NUMA_NO_NODE) {
>> > +			if (!node_online(node) || !node_present_pages(node)) {
>> > +				searchnode = node_to_mem_node(node);
>> > +				if (!node_online(searchnode))
>> > +					searchnode = first_online_node;
>> > +			}
>> > +		}
>> >  		if (unlikely(!node_match(page, searchnode))) {
>> >  			stat(s, ALLOC_NODE_MISMATCH);
>> >  			deactivate_slab(s, page, c->freelist, c);
> 


Thread overview: 24+ messages
2020-03-11 11:02 [PATCH 0/3] Offline memoryless cpuless node 0 Srikar Dronamraju
2020-03-11 11:02 ` [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus Srikar Dronamraju
2020-03-11 11:57   ` Michal Hocko
2020-03-12  5:27     ` Srikar Dronamraju
2020-03-12  8:23       ` Sachin Sant
2020-03-12  9:30         ` Vlastimil Babka
2020-03-12 13:14           ` Srikar Dronamraju
2020-03-12 13:51             ` Vlastimil Babka
2020-03-12 16:13               ` Srikar Dronamraju
2020-03-12 16:41                 ` Vlastimil Babka [this message]
2020-03-13  9:47                   ` Joonsoo Kim
2020-03-13 11:04                     ` Srikar Dronamraju
2020-03-13 11:38                       ` Vlastimil Babka
2020-03-16  8:15                         ` Joonsoo Kim
2020-03-13 11:22                   ` Srikar Dronamraju
2020-03-16  9:06                   ` Michal Hocko
2020-03-17 13:44                     ` Vlastimil Babka
2020-03-17 14:01                       ` Michal Hocko
2020-03-11 11:02 ` [PATCH 2/3] powerpc/numa: Prefer node id queried from vphn Srikar Dronamraju
2020-03-11 11:02 ` [PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline Srikar Dronamraju
2020-03-15 14:20   ` Christopher Lameter
2020-03-16  8:54     ` Michal Hocko
2020-03-18  7:50       ` Srikar Dronamraju
2020-03-18 18:57       ` Christopher Lameter
