From: Jan Beulich <jbeulich@suse.com>
To: Wei Chen <Wei.Chen@arm.com>
Cc: nd@arm.com, "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"George Dunlap" <george.dunlap@citrix.com>,
	"Julien Grall" <julien@xen.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common
Date: Thu, 29 Sep 2022 14:14:20 +0200	[thread overview]
Message-ID: <ce60f432-fed5-0fbb-c544-36b767c22130@suse.com> (raw)
In-Reply-To: <72691b9b-761e-a89b-97df-afd5cf0ddebb@arm.com>

On 29.09.2022 09:43, Wei Chen wrote:
> On 2022/9/27 16:19, Jan Beulich wrote:
>> On 20.09.2022 11:12, Wei Chen wrote:
>>> +        nodes_used++;
>>> +        if ( epdx > memtop )
>>> +            memtop = epdx;
>>> +    }
>>> +
>>> +    if ( nodes_used <= 1 )
>>> +        i = BITS_PER_LONG - 1;
>>
>> Is this actually going to be correct for all architectures? Aiui
>> Arm64 has only up to 48 physical address bits, but what about an
>> architecture allowing the use of all 64 bits? I think at the very
>> least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here.
>>
> 
> OK, I will add the BUILD_BUG_ON() above. I also have a question: why
> can't we use PADDR_BITS here directly?

Well, if you used PADDR_BITS, then you would use it without subtracting
1, and you'd be in trouble again when PADDR_BITS == BITS_PER_LONG. What
may be possible to do instead of BUILD_BUG_ON() is

    if ( nodes_used <= 1 )
        i = min(PADDR_BITS, BITS_PER_LONG - 1);

>>> +    else
>>> +        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);
>>> +
>>> +    memnodemapsize = (memtop >> i) + 1;
>>
>> Again perhaps the subject of a separate patch: Isn't there an off-by-1
>> mistake here? memtop is the maximum of all epdx-es, which are
>> calculated to be the first PDX following the region. Hence I'd expect
>>
>>      memnodemapsize = ((memtop - 1) >> i) + 1;
>>
>> here. I guess I'll make patches for both issues, which you may then
>> need to re-base over.
>>
> 
> Thanks, I will wait for your patches.

Already sent out yesterday.

>>> +static void cf_check dump_numa(unsigned char key)
>>> +{
>>> +    s_time_t now = NOW();
>>> +    unsigned int i, j, n;
>>> +    struct domain *d;
>>> +    const struct page_info *page;
>>> +    unsigned int page_num_node[MAX_NUMNODES];
>>> +    const struct vnuma_info *vnuma;
>>> +
>>> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
>>> +           now);
>>> +
>>> +    for_each_online_node ( i )
>>> +    {
>>> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>>> +
>>> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
>>> +               i, node_start_pfn(i), node_spanned_pages(i),
>>> +               avail_node_heap_pages(i));
>>> +        /* Sanity check phys_to_nid() */
>>> +        if ( phys_to_nid(pa) != i )
>>> +            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>>> +                   pa, phys_to_nid(pa), i);
>>> +    }
>>> +
>>> +    j = cpumask_first(&cpu_online_map);
>>> +    n = 0;
>>> +    for_each_online_cpu ( i )
>>> +    {
>>> +        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
>>> +        {
>>> +            if ( n > 1 )
>>> +                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>>> +            else
>>> +                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>>> +            j = i;
>>> +            n = 1;
>>> +        }
>>> +        else
>>> +            ++n;
>>> +    }
>>> +    if ( n > 1 )
>>> +        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>>> +    else
>>> +        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>>> +
>>> +    rcu_read_lock(&domlist_read_lock);
>>> +
>>> +    printk("Memory location of each domain:\n");
>>> +    for_each_domain ( d )
>>> +    {
>>> +        process_pending_softirqs();
>>> +
>>> +        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
>>> +
>>> +        for_each_online_node ( i )
>>> +            page_num_node[i] = 0;
>>
>> I'd be inclined to suggest to use memset() here, but I won't insist
>> on you doing this "on the fly". Along with this would likely go the
>> request to limit the scope of page_num_node[] (and then perhaps also
>> vnuma and page).
>>
> 
> Using memset for page_num_node makes sense; I will do it before
> for_each_domain ( d ).

That won't be right - array elements need clearing on every iteration.
Plus ...

> About limiting the scope, did you mean we should move:
> 
> "const struct page_info *page;
> unsigned int page_num_node[MAX_NUMNODES];
> const struct vnuma_info *vnuma;"
> 
> to the block of for_each_domain ( d )?

... this limiting of scope (yes to your question) would also conflict
with the movement you suggest. It is exactly such a mistaken movement
(among other things) that the narrower scope is intended to prevent.

Jan


Thread overview: 23+ messages
2022-09-20  9:12 [PATCH v5 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
2022-09-20  9:12 ` [PATCH v5 1/6] xen/x86: Provide helpers for common code to access acpi_numa Wei Chen
2022-09-27  7:37   ` Jan Beulich
2022-09-29  6:29     ` Wei Chen
2022-09-20  9:12 ` [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
2022-09-27  8:19   ` Jan Beulich
2022-09-27  9:39     ` Jan Beulich
2022-09-29  7:58       ` Wei Chen
2022-09-29  7:43     ` Wei Chen
2022-09-29 12:14       ` Jan Beulich [this message]
2022-09-30  1:45         ` Wei Chen
2022-09-20  9:12 ` [PATCH v5 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
2022-09-20  9:12 ` [PATCH v5 4/6] xen/x86: use arch_get_ram_range to get information from E820 map Wei Chen
2022-09-20  9:12 ` [PATCH v5 5/6] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
2022-09-27 15:48   ` Jan Beulich
2022-09-29  8:21     ` Wei Chen
2022-09-29 12:21       ` Jan Beulich
2022-09-30  1:40         ` Wei Chen
2022-09-30  6:03           ` Jan Beulich
2022-10-09  7:25             ` Wei Chen
2022-10-10  7:03               ` Wei Chen
2022-10-10  8:25                 ` Jan Beulich
2022-09-20  9:12 ` [PATCH v5 6/6] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
