From: Anshuman Khandual <anshuman.khandual@arm.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
Barry Song <song.bao.hua@hisilicon.com>,
Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>,
Hanjun Guo <guohanjun@huawei.com>,
linuxarm@huawei.com, Sudeep Holla <sudeep.holla@arm.com>,
Tejun Heo <tj@kernel.org>,
Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCH] arm64: numa: rightsize the distance array
Date: Thu, 20 Aug 2020 08:39:14 +0530 [thread overview]
Message-ID: <a3df03b6-0266-2d12-df3e-1630ccd12421@arm.com> (raw)
In-Reply-To: <20200819104106.00002e01@Huawei.com>
On 08/19/2020 03:11 PM, Jonathan Cameron wrote:
> On Wed, 19 Aug 2020 10:31:18 +0530
> Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>
>> Hello Jonathan,
>
> Hi Anshuman,
>
>>
>> On 07/08/2020 05:08 PM, Jonathan Cameron wrote:
>>> Unfortunately, calling numa_alloc_distance well before we call
>>> setup_node_to_cpu_mask_map means that nr_node_ids is still set to MAX_NUMNODES.
>>> This wastes a bit of memory and is confusing to the reader.
>>
>> With defconfig, where CONFIG_NODES_SHIFT = 2, i.e. MAX_NUMNODES = 4, the total
>> numa_distance size is 16 bytes (individual entries here are just u8).
>
> That's another issue to tidy up. The current defconfig isn't big enough to
> support existing platforms (we are shipping 8-node systems),
> let alone the next generation. It is a bit of an open question how far to push it up
> though. Perhaps 64? Given CXL, we are probably soon going to see systems with
> a lot more nodes. I'll send a patch for that and we can argue the exact
> value then.
I would expect it to at least match the maximum number of nodes possible on
currently available systems.
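For example, bumping the default to CONFIG_NODES_SHIFT = 6 (MAX_NUMNODES = 64,
the value floated above) would still keep the full table at only
64 * 64 * sizeof(u8) = 4 KiB.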
>
>> Even
>> with MAX_NUMNODES = 256, numa_distance is going to be just 64K. Hence there
>> might not be enough space saved to justify optimizing this path.
>> Please correct me if I have missed something.
>
> Agreed. The saving is small. This was mostly about the code being misleading.
> It gives the impression of doing dynamic sizing but doesn't actually do so as it
> is using nr_node_ids before that value has been updated. I would prefer
> to make the code 'obviously' correct. The x86 path is one approach; simply
> hard-coding the size as a constant is another.
>
> The secondary advantage would be to make it easy to unify this code across
> architectures. It is pretty much inevitable that riscv will be here soon, and it would be nice
> not to have a third slight variation on the code. I'd like to explore such
> possible unification but it wasn't the main aim of this patch.
There is already a RISC-V proposal in this regard, which is trying to create
a generic NUMA framework selectable via GENERIC_ARCH_NUMA. If the ultimate
goal is to unify the NUMA init path, we should try and see whether that can
also accommodate x86.
https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=332897
>
>>
>>>
>>> Note we could just decide to hardcode it as MAX_NUMNODES but if so we should
>>> do so explicitly.
>>
>> nr_node_ids is initialized to MAX_NUMNODES in mm/page_alloc.c; yes, asserting
>> with a WARN_ON() that it is indeed MAX_NUMNODES would make sense.
>
> OK, a WARN_ON would at least make it apparent this is a constant, not a dynamic
> value as a reader of the code might assume (I did).
>
> Could we just go further and use MAX_NUMNODES directly and ignore nr_node_ids
> for this purpose?
IMHO this actually sounds better and cleaner. But the first preference should
be to unify all these paths under GENERIC_ARCH_NUMA.
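FWIW, a rough and untested sketch (not part of the posted patch, just to
illustrate the idea, reusing the existing names from arch/arm64/mm/numa.c)
of sizing the table directly with MAX_NUMNODES could look like this:

static int __init numa_alloc_distance(void)
{
	/* fixed size: no dependency on nr_node_ids or numa_nodes_parsed */
	size_t size = MAX_NUMNODES * MAX_NUMNODES * sizeof(numa_distance[0]);
	u64 phys;
	int i, j;

	phys = memblock_find_in_range(0, PFN_PHYS(max_pfn), size, PAGE_SIZE);
	if (WARN_ON(!phys))
		return -ENOMEM;

	memblock_reserve(phys, size);
	numa_distance = __va(phys);
	numa_distance_cnt = MAX_NUMNODES;

	/* fill with the default distances */
	for (i = 0; i < numa_distance_cnt; i++)
		for (j = 0; j < numa_distance_cnt; j++)
			numa_distance[i * numa_distance_cnt + j] =
				i == j ? LOCAL_DISTANCE : REMOTE_DISTANCE;

	return 0;
}

That way there is no assumption about either nr_node_ids or numa_nodes_parsed
having been finalized at the time of the call.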
>
>>
>>>
>>> Looking at what x86 does, they do a walk of nodes_parsed and locally
>>> establish the maximum node count seen. We can't actually do that where we
>>> were previously calling it in numa_init because nodes_parsed isn't set up
>>> yet either. So let us take a leaf entirely out of x86's book and make
>>> the true assumption that nodes_parsed will definitely be set up before
>>> we try to put a real value in this array. Hence just do it on demand.
>>
>> So it is replacing one assumption, i.e. nr_node_ids == MAX_NUMNODES, with another,
>> i.e. that nodes_parsed has been initialized, while trying to populate an entry.
>
> Yes. If we make it clearly a fixed value, then I'm happy with that solution
> to this particular problem.
>
> If not, I would argue that the use of nodes_parsed makes it very much obvious that
> we are assuming it was initialized. (Though by that argument we would have
> assumed nr_node_ids had its final value in the existing code, and it doesn't,
> so I come back to my main argument being to make the code 'obviously' correct).
>
>>
>>>
>>> In order to avoid trying and failing to allocate the array multiple times
>>> we do the same thing as x86 and set numa_distance = 1. This requires a
>>> few small modifications elsewhere.
>>
>> Where? I don't see numa_distance being set to 1.
>
> See below, I have highlighted the line.
>
>>
>>>
>>> Worth noting that, with one exception (which it appears can be removed [1]),
>>> the x86 and arm numa distance code is now identical. Worth factoring it
>>> out to some common location?
>>>
>>> [1] https://lkml.kernel.org/r/20170406124459.dwn5zhpr2xqg3lqm@node.shutemov.name
>>>
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> ---
>>> arch/arm64/mm/numa.c | 35 ++++++++++++++++++-----------------
>>> 1 file changed, 18 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index aafcee3e3f7e..a2f549ef0a36 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -255,13 +255,11 @@ void __init numa_free_distance(void)
>>> {
>>> size_t size;
>>>
>>> - if (!numa_distance)
>>> - return;
>>> -
>>> size = numa_distance_cnt * numa_distance_cnt *
>>> sizeof(numa_distance[0]);
>>> -
>>> - memblock_free(__pa(numa_distance), size);
>>> + /* numa_distance could be 1LU marking allocation failure, test cnt */
>>> + if (numa_distance_cnt)
>>> + memblock_free(__pa(numa_distance), size);
>>> numa_distance_cnt = 0;
>>> numa_distance = NULL;
>>> }
>>> @@ -271,20 +269,29 @@ void __init numa_free_distance(void)
>>> */
>>> static int __init numa_alloc_distance(void)
>>> {
>>> + nodemask_t nodes_parsed;
>>> size_t size;
>>> + int i, j, cnt = 0;
>>> u64 phys;
>>> - int i, j;
>>>
>>> - size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
>>> + /* size the new table and allocate it */
>>> + nodes_parsed = numa_nodes_parsed;
>>> + for_each_node_mask(i, nodes_parsed)
>>> + cnt = i;
>>
>> Is there no nodemask-related helper to fetch the highest set bit?
>
> Not that I can find.
>
> There is nodes_weight(), but I think we can currently end up with holes,
> and it definitely isn't obvious from the local code that we can't.
Right, the node mask can have holes, so nodes_weight() would not give the
maximum possible node number. A new helper such as nodes_max() could do the
job, but I am not sure whether there are any other existing call sites that
could use such a helper.
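Something along these lines (hypothetical, not in mainline, and only lightly
thought through) is what such a helper could look like, built on the existing
find_last_bit():

/*
 * Return the highest node number set in *mask, or MAX_NUMNODES if the
 * mask is empty (mirroring find_last_bit() semantics).
 */
static inline unsigned int nodes_max(const nodemask_t *mask)
{
	return find_last_bit(mask->bits, MAX_NUMNODES);
}

numa_alloc_distance() could then simply do:

	cnt = nodes_max(&numa_nodes_parsed) + 1;

assuming numa_nodes_parsed is guaranteed to be non-empty at that point.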
>
>
>>
>>> + cnt++;
>>> + size = cnt * cnt * sizeof(numa_distance[0]);
>>> phys = memblock_find_in_range(0, PFN_PHYS(max_pfn),
>>> size, PAGE_SIZE);
>>> - if (WARN_ON(!phys))
>>> + if (!phys) {
>>> + pr_warn("Warning: can't allocate distance table!\n");
>>> + /* don't retry until explicitly reset */
>>> + numa_distance = (void *)1LU;
>
> Here is where we set numa_distance to 1 in the error path.
Got it.
>
>>> return -ENOMEM;
>>> -
>>> + }
>>> memblock_reserve(phys, size);
>>>
>>> numa_distance = __va(phys);
>>> - numa_distance_cnt = nr_node_ids;
>>> + numa_distance_cnt = cnt;
>>>
>>> /* fill with the default distances */
>>> for (i = 0; i < numa_distance_cnt; i++)
>>> @@ -311,10 +318,8 @@ static int __init numa_alloc_distance(void)
>>> */
>>> void __init numa_set_distance(int from, int to, int distance)
>>> {
>>> - if (!numa_distance) {
>>> - pr_warn_once("Warning: distance table not allocated yet\n");
>>> + if (!numa_distance && numa_alloc_distance() < 0)
>>> return;
>>> - }
>>>
>>> if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
>>> from < 0 || to < 0) {
>>> @@ -384,10 +389,6 @@ static int __init numa_init(int (*init_func)(void))
>>> nodes_clear(node_possible_map);
>>> nodes_clear(node_online_map);
>>>
>>> - ret = numa_alloc_distance();
>>> - if (ret < 0)
>>> - return ret;
>>> -
>>> ret = init_func();
>>> if (ret < 0)
>>> goto out_free_distance;
>>>
>>
>> What is the primary objective here? Reducing memory for numa_distance[],
>> or unifying arm64's numa_init() with x86's?
>
> As mentioned above, the objective when I originally looked at this was
> to make the code do what it appeared to do. The approach chosen has the
> added benefit of unifying this part with x86 + riscv etc.
> Always good not to reinvent the wheel.
Agreed.