Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

From: Michal Hocko <mhocko@kernel.org>
To: Roman Gushchin <guro@fb.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	linux-mm@kvack.org, kernel-team@fb.com,
	linux-kernel@vger.kernel.org, "Rik van Riel" <riel@surriel.com>,
	"Andreas Schaufler" <andreas.schaufler@gmx.de>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Guido Günther" <agx@sigxcpu.org>,
	"Naresh Kamboju" <naresh.kamboju@linaro.org>
Subject: Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set
Date: Thu, 19 Mar 2020 17:16:44 +0100
Message-ID: <20200319161644.GH20800@dhcp22.suse.cz>
In-Reply-To: <20200318175529.GA6263@carbon.dhcp.thefacebook.com>

On Wed 18-03-20 10:55:29, Roman Gushchin wrote:
> On Wed, Mar 18, 2020 at 05:16:25PM +0100, Michal Hocko wrote:
> > On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> > > If CONFIG_NUMA isn't set, there is no need to ensure that
> > > the hugetlb cma area belongs to a specific numa node.
> > > 
> > > min/max_low_pfn can be used for limiting the maximum size
> > > of the hugetlb_cma area.
> > > 
> > > Also for_each_mem_pfn_range() is defined only if
> > > CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> > > other architectures) it depends on CONFIG_NUMA. This makes the
> > > build fail if CONFIG_NUMA isn't set.
> > 
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
> > already. Is there any real reason we cannot make it unconditional?
> > Essentially make the functionality always enabled and drop the config?
> 
> It depends on CONFIG_NUMA only on arm, and I really don't know
> if there is a good justification for it. If not, that will be a much
> simpler fix.

I have checked the history and the dependency has been there since NUMA
was introduced in arm64. So it would be great to double check with the
arch maintainers.

> > The code below is ugly as hell. Just look at it. You have
> > for_each_node_state without any ifdefery but then having ifdef
> > CONFIG_NUMA. That just doesn't make any sense.
> 
> I don't think it makes no sense:
> it tries to reserve a cma area on each node (need for_each_node_state()),
> and it uses the for_each_mem_pfn_range() to get a min and max pfn
> for each node. With !CONFIG_NUMA the first part is reduced to one
> iteration and the second part is not required at all.

Sure, the resulting code logic makes sense. I meant that it doesn't make
much sense wrt readability. There is a loop over all existing NUMA nodes,
only to have an ifdef for NUMA inside the loop. See?

> I agree that for_each_mem_pfn_range() here looks quite ugly, but I don't know
> of a better way to get min/max pfns for a node so early in the boot process.
> If somebody has any ideas here, I'd appreciate it a lot.

The loop is ok. Maybe we have another memblock API that would be better,
but I am not aware of one off the top of my head. I would stick with
it. It just sucks that this API depends on HAVE_MEMBLOCK_NODE_MAP and
that it is not generally available. This is what I am complaining about.
Just look at what kind of dirty hack it made you create ;)

> I know Rik plans some further improvements here, so the goal for now
> is to fix the build. If you think that enabling CONFIG_HAVE_MEMBLOCK_NODE_MAP
> unconditionally is a way to go, I'm fine with it too.

This is not the first time HAVE_MEMBLOCK_NODE_MAP has been problematic.
I might be missing something, but I really do not get why we need it
these days. As for !NUMA, I suspect we can make it generate the right
thing there.
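
A minimal user-space sketch of how a memblock-style walk could do the
right thing for !NUMA without any ifdef at the call site. Everything in
it is hypothetical and for illustration only: MOCK_NUMA stands in for
CONFIG_NUMA, and struct mem_range, mock_ranges, range_node() and
node_pfn_span() are invented names, not kernel API.

```c
#include <stddef.h>

/* Hypothetical stand-in for one memblock region; not a kernel type. */
struct mem_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Made-up memory map, used only to exercise the sketch. */
static const struct mem_range mock_ranges[] = {
	{ 0, 0x100, 0x400 },
	{ 1, 0x400, 0x900 },
	{ 0, 0x900, 0xa00 },
};

#define NR_MOCK_RANGES (sizeof(mock_ranges) / sizeof(mock_ranges[0]))

/*
 * The config dependency lives in one place. MOCK_NUMA is an assumed
 * stand-in for CONFIG_NUMA: without it every range reports node 0.
 */
#ifdef MOCK_NUMA
static int range_node(const struct mem_range *r) { return r->nid; }
#else
static int range_node(const struct mem_range *r) { (void)r; return 0; }
#endif

/*
 * Find the lowest start pfn and highest end pfn on @nid.
 * Returns 1 if the node has memory, 0 otherwise (outputs untouched).
 */
static int node_pfn_span(int nid, unsigned long *min_pfn,
			 unsigned long *max_pfn)
{
	unsigned long lo = 0, hi = 0;
	int found = 0;
	size_t i;

	for (i = 0; i < NR_MOCK_RANGES; i++) {
		const struct mem_range *r = &mock_ranges[i];

		if (range_node(r) != nid)
			continue;
		if (!found || r->start_pfn < lo)
			lo = r->start_pfn;
		if (!found || r->end_pfn > hi)
			hi = r->end_pfn;
		found = 1;
	}
	if (found) {
		*min_pfn = lo;
		*max_pfn = hi;
	}
	return found;
}
```

Built without MOCK_NUMA, node 0 covers the whole mock map and no other
node has memory, so a per-node loop in the caller needs no ifdef of its
own. With -DMOCK_NUMA the same caller sees each node's own span.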
-- 
Michal Hocko
SUSE Labs