linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Michael Ellerman <mpe@ellerman.id.au>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Oscar Salvador <osalvador@suse.de>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Subject: Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes
Date: Sun, 10 Apr 2022 21:28:38 +1000	[thread overview]
Message-ID: <874k316uu1.fsf@mpe.ellerman.id.au> (raw)
In-Reply-To: <20220408122537.GD568950@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]:
>
>> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
>> >  arch/powerpc/mm/numa.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> > index b9b7fefbb64b..13022d734951 100644
>> > --- a/arch/powerpc/mm/numa.c
>> > +++ b/arch/powerpc/mm/numa.c
>> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>> >  	if (new_nid < 0 || !node_possible(new_nid))
>> >  		new_nid = first_online_node;
>> >  
>> > -	if (NODE_DATA(new_nid) == NULL) {
>> > +	if (!node_online(new_nid)) {
>> >  #ifdef CONFIG_MEMORY_HOTPLUG
>> >  		/*
>> >  		 * Need to ensure that NODE_DATA is initialized for a node from
>> 
>> Because of this fix, I wanted to check whether we might have any more fallouts due
>> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
>> as to why powerpc is the only platform that special cases try_online_node(),
>> while all others rely on cpu_up()->try_online_node() to do the right thing.
>> 
>> So, I had a look.
>> Let us rewind a bit:
>> 
>> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
>> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>> 
>> In there, it says that we have the following picture:
>> 
>> partition_sched_domains
>>  arch_update_cpu_topology
>>   numa_update_cpu_topology
>>    find_and_online_cpu_nid
>> 
>> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
>> too late as we might have referenced some NODE_DATA() fields.
>> Note that this happens at a much later stage in cpuhp.
>> 
>> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
>> which should allocate-and-online the node and prevent accessing garbage.
>> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
>> (see arch/powerpc/mm/numa.c:mem_topology_setup),
>> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
>> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
>> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>> 
>> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
>>  tries to online node0, or whatever old mapping numa<->cpu is there).
>> 
>> So, we have something like:
>> 
>> dlpar_online_cpu
>>  device_online
>>   dev->bus->online
>>    cpu_subsys_online
>>     cpu_device_up
>>      cpu_up
>>       try_online_node (old mapping nid <-> cpu )
>>       ...
>>       ...
>>       cphp_callbacks
>>        sched_cpu_activate
>>         cpuset_update_active_cpus
>>          schedule_work(&cpuset_hotplug_work)
>>           cpuset_hotplug_work
>>            partition_sched_domains
>>             arch_update_cpu_topology
>>              numa_update_cpu_topology
>>               find_and_online_cpu_nid (online new_nid)
>> 
>> 
>> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
>> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
>> fixed the issue, but we should not need this special casing at all.
>> 
>> We do already know the numa<->cpu associativity in
>> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
>> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>> 
>> With this in mind, I came up with the following patch, where I carried out a battery
>> of tests for CPU hotplug stuff and it worked as expected.
>> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
>> I would like to hear from more familiar people.
>> 
>> The patch is:
>> 
>> From: Oscar Salvador <osalvador@suse.de>
>> Date: Wed, 6 Apr 2022 14:39:15 +0200
>> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>> 
>> powerpc is the only platform that do not rely on
>> cpu_up()->try_online_node() to bring up a numa node,
>> and special cases it, instead, deep in its own machinery:
>> 
>> dlpar_online_cpu
>>  find_and_online_cpu_nid
>>   try_online_node
>> 
>> This should not be needed, but the thing is that the try_online_node()
>> from cpu_up() will not apply on the right node, because cpu_to_node()
>> will return the old mapping numa<->cpu that gets set on boot stage
>> for all possible cpus.
>> 
>> That can be seen easily if we try to print out the numa node passed
>> to try_online_node() in cpu_up().
>> 
>> The thing is that the numa<->cpu mapping does not get updated till a much
>> later stage in start_secondary:
>> 
>> start_secondary:
>>  set_numa_node(numa_cpu_lookup_table[cpu])
>> 
>> But we do not really care, as we already now the
>> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
>> so let us make use of that and set the proper numa<->cpu mapping,
>> so cpu_to_node() in cpu_up() returns the right node and
>> try_online_node() can do its work.
>> 
>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>
> Looks good to me.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Yeah agreed, thanks for getting to the root of the problem.

Can you resend as a standalone patch. Because you sent it as a reply it
won't be recognised by patchwork[1] which means it risks getting lost.

cheers

1: http://patchwork.ozlabs.org/project/linuxppc-dev/list/


  reply	other threads:[~2022-04-10 11:28 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-30 13:51 [PATCH] powerpc/numa: Handle partially initialized numa nodes Srikar Dronamraju
2022-03-30 13:59 ` Michal Hocko
2022-03-30 14:04 ` Oscar Salvador
2022-04-05  4:41 ` Mike Rapoport
2022-04-06 16:19 ` Oscar Salvador
2022-04-08 12:25   ` Srikar Dronamraju
2022-04-10 11:28     ` Michael Ellerman [this message]
2022-04-11  8:00       ` Oscar Salvador
2022-04-11  6:29     ` Geetika Moolchandani1
2022-04-10 12:27 ` Michael Ellerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874k316uu1.fsf@mpe.ellerman.id.au \
    --to=mpe@ellerman.id.au \
    --cc=Geetika.Moolchandani1@ibm.com \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mhocko@kernel.org \
    --cc=osalvador@suse.de \
    --cc=srikar@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).