RE: [PATCH] cpu-topology: warn if NUMA configurations conflicts with lower layer

From: "Zengtao (B)" <prime.zeng@hisilicon.com>
To: Sudeep Holla <sudeep.holla@arm.com>
Cc: Linuxarm <linuxarm@huawei.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Morten Rasmussen" <morten.rasmussen@arm.com>
Subject: RE: [PATCH] cpu-topology: warn if NUMA configurations conflicts with lower layer
Date: Thu, 2 Jan 2020 12:47:01 +0000	[thread overview]
Message-ID: <678F3D1BB717D949B966B68EAEB446ED340AEB67@dggemm526-mbx.china.huawei.com> (raw)
In-Reply-To: <20200102112955.GC4864@bogus>

> -----Original Message-----
> From: Sudeep Holla [mailto:sudeep.holla@arm.com]
> Sent: Thursday, January 02, 2020 7:30 PM
> To: Zengtao (B)
> Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> linux-kernel@vger.kernel.org; Morten Rasmussen
> Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations conflicts
> with lower layer
> 
> On Thu, Jan 02, 2020 at 03:05:40AM +0000, Zengtao (B) wrote:
> > Hi Sudeep:
> >
> > Thanks for your reply.
> >
> > > -----Original Message-----
> > > From: Sudeep Holla [mailto:sudeep.holla@arm.com]
> > > Sent: Wednesday, January 01, 2020 12:41 AM
> > > To: Zengtao (B)
> > > Cc: Linuxarm; Greg Kroah-Hartman; Rafael J. Wysocki;
> > > linux-kernel@vger.kernel.org; Sudeep Holla; Morten Rasmussen
> > > Subject: Re: [PATCH] cpu-topology: warn if NUMA configurations
> conflicts
> > > with lower layer
> > >
> > > On Mon, Dec 23, 2019 at 04:16:19PM +0800, z00214469 wrote:
> > > > As we know, from sched domain's perspective, the DIE layer should
> be
> > > > larger than or at least equal to the MC layer, and in some cases, MC
> > > > is defined by the arch specified hardware, MPIDR for example, but
> > > NUMA
> > > > can be defined by users,
> > >
> > > Who are the users you are referring above ?
> > For example, when I use QEMU to start a guest linux, I can define the
> > NUMA topology of the guest linux whatever i want.
> 
> OK and how is the information passed to the kernel ? DT or ACPI ?
> We need to fix the miss match if any during the initial parse of those
> information.
> 

Both, For the current QEMU, we don't have the correct cpu topology
passed to linux. Luckily drjones planed to deal with the issue.
https://patchwork.ozlabs.org/cover/939301/

> > > > with the following system configrations:
> > >
> > > Do you mean ACPI tables or DT or some firmware tables ?
> > >
> > > > *************************************
> > > > NUMA:      	 0-2,  3-7
> > >
> > > Is the above simply wrong with respect to hardware and it actually
> match
> > > core_siblings ?
> > >
> > Actually, we can't simply say this is wrong, i just want to show an
> example.
> > And this example also can be:
> > NUMA:  0-23,  24-47
> > core_siblings:   0-15,  16-31, 32-47
> >
> 
> Are you sure of the above ? Possible values w.r.t hardware config:
> core_siblings:   0-15,  16-23, 24-31, 32-47
> 
> But what you have specified above is still wrong core_siblings IMO.
> 
It depends on the hardware, on my platform, 16 cores per cluster.

> 
> [...]
> 
> > > > diff --git a/drivers/base/arch_topology.c
> b/drivers/base/arch_topology.c
> > > > index 1eb81f11..5fe44b3 100644
> > > > --- a/drivers/base/arch_topology.c
> > > > +++ b/drivers/base/arch_topology.c
> > > > @@ -439,10 +439,18 @@ const struct cpumask
> > > *cpu_coregroup_mask(int cpu)
> > > >  	if (cpumask_subset(&cpu_topology[cpu].core_sibling,
> core_mask)) {
> > > >  		/* not numa in package, lets use the package siblings */
> > > >  		core_mask = &cpu_topology[cpu].core_sibling;
> > > > -	}
> > > > +	} else
> > > > +		pr_warn_once("Warning: suspicous broken topology:
> cpu:[%d]'s
> > > core_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > > +			cpu,
> cpumask_pr_args(&cpu_topology[cpu].core_sibling),
> > > > +			cpumask_pr_args(core_mask));
> > > > +
> > >
> > > Won't this print warning on all systems that don't have numa within a
> > > package ? What are you trying to achieve here ?
> >
> > Since in my case, when this corner case happens, the linux kernel just fall
> into
> > dead loop with no prompt, here this is a helping message will help a lot.
> >
> 
> As I said, wrong configurations need to be detected when generating
> DT/ACPI if possible. The above will print warning on systems with NUMA
> within package.
> 
> NUMA:  0-7, 8-15
> core_siblings:   0-15
> 
> The above is the example where the die has 16 CPUs and 2 NUMA nodes
> within a package, your change throws error to the above config which is
> wrong.
>
From your example, the core 7 and core 8 has got different LLC but the same Low
Level cache?
From schedule view of point, lower level sched domain should be a subset of higher
Level sched domain.

> > >
> > > >  	if (cpu_topology[cpu].llc_id != -1) {
> > > >  		if (cpumask_subset(&cpu_topology[cpu].llc_sibling,
> core_mask))
> > > >  			core_mask = &cpu_topology[cpu].llc_sibling;
> > > > +		else
> > > > +			pr_warn_once("Warning: suspicous broken topology:
> > > cpu:[%d]'s llc_sibling:[%*pbl] not a subset of numa node:[%*pbl]\n",
> > > > +				cpu,
> > > cpumask_pr_args(&cpu_topology[cpu].llc_sibling),
> > > > +				cpumask_pr_args(core_mask));
> > > >  	}
> > > >
> > >
> > > This will trigger warning on all systems that lack cacheinfo topology.
> > > I don't understand the intent of this patch at all. Can you explain
> > > all the steps you follow and the issue you face ?
> >
> > Can you show me an example, what I really want to warn is the case that
> > NUMA topology conflicts with lower level.
> >
> 
> I was wrong here, I mis-read this section. I still fail to understand
> why the above change is needed. I understood the QEMU example, but you
> haven't specified how cacheinfo looks like there.
> 

My idea:
(1) If the dtb/acpi is not legal, the linux kernel should check or prompt to help debug.
(2)From scheduler view of point, low level sched domain should be a subset of high level
Sched domain.

I am very sure about (2), any example to show me (2) is wrong?

Thanks. 
Zengtao