From: Wei Chen <Wei.Chen@arm.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: nd@arm.com, "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"George Dunlap" <george.dunlap@citrix.com>,
	"Julien Grall" <julien@xen.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common
Date: Thu, 29 Sep 2022 15:43:30 +0800
Message-ID: <72691b9b-761e-a89b-97df-afd5cf0ddebb@arm.com>
In-Reply-To: <359e87d9-aec7-2198-86ab-1d0f76bf567d@suse.com>

Hi Jan,

On 2022/9/27 16:19, Jan Beulich wrote:
> On 20.09.2022 11:12, Wei Chen wrote:
>> Some code in x86/numa.c can be shared by other architectures
>> to implement NUMA support, such as the variables and functions
>> used to check and store the NUMA memory map, and those used
>> for NUMA initialization.
>>
>> In this patch, we move them to common/numa.c and xen/numa.h,
>> and use CONFIG_NUMA to gate them for architectures without
>> NUMA support. As the target header file is Xen-style, we also
>> trim stray spaces and replace tabs in the code that has been
>> moved to xen/numa.h.
>>
>> As acpi_scan_nodes is now used from a common function, it
>> doesn't make sense to keep an acpi_xxx name in common code, so
>> we rename it to numa_scan_nodes in this patch too. After that,
> 
> numa_process_nodes() now?

Oh, yes, I will update it.

> 
>> keeping CONFIG_ACPI_NUMA as the gate for numa_scan_nodes in
>> numa_initmem_init no longer makes sense, since CONFIG_NUMA is
>> selected by CONFIG_ACPI_NUMA on x86. So we replace
>> CONFIG_ACPI_NUMA with CONFIG_NUMA to gate numa_scan_nodes.
>>
>> As arch_numa_disabled has been implemented for ACPI NUMA,
>> we can rename srat_disabled to numa_disabled and move it
>> to common code as well.
> 
> Please update the description: arch_numa_disabled() appears in patch 5
> only. Of course if you follow the comments to patch 2, the wording here
> would be correct again.
> 

I will update the description.

>> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
>> +                                                  nodeid_t numnodes)
>> +{
>> +    unsigned int i, nodes_used = 0;
>> +    unsigned long spdx, epdx;
>> +    unsigned long bitfield = 0, memtop = 0;
>> +
>> +    for ( i = 0; i < numnodes; i++ )
>> +    {
>> +        spdx = paddr_to_pdx(nodes[i].start);
>> +        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>> +
>> +        if ( spdx >= epdx )
>> +            continue;
>> +
>> +        bitfield |= spdx;
> 
> Perhaps to be taken care of in a separate patch: We accumulate only
> the bits from the start addresses here, contrary to what the comment
> ahead of the function says (and I think it is the comment which is
> putting things correctly).

If a node has non-zero memory, the lowest set bit of its end address
will be >= that of its start address. As we only use this function to
calculate the LSB, I don't think accumulating the bits of the start
addresses alone is a problem. Instead, I think we should update the
function's comment to explain why accumulating the start-address bits
is sufficient.
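
To make the mechanics concrete (made-up numbers, treating pdx == pfn
for simplicity): with two nodes starting at pdx 0x40000 and 0x80000,
bitfield ends up as 0xc0000, find_first_bit() returns 18, and the
memnodemap then works on 2^18-pdx granules.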

> 
>> +        nodes_used++;
>> +        if ( epdx > memtop )
>> +            memtop = epdx;
>> +    }
>> +
>> +    if ( nodes_used <= 1 )
>> +        i = BITS_PER_LONG - 1;
> 
> Is this actually going to be correct for all architectures? Aiui
> Arm64 has only up to 48 physical address bits, but what about an
> architecture allowing the use of all 64 bits? I think at the very
> least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here.
> 

OK, I will add the BUILD_BUG_ON above. I also have a question: why
can't we use PADDR_BITS here directly?
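
Something like the following is what I understand the suggestion to be
(the exact placement is my guess):

    if ( nodes_used <= 1 )
    {
        /* Catch configs where physical addresses may use all BITS_PER_LONG bits. */
        BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG);
        i = BITS_PER_LONG - 1;
    }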

>> +    else
>> +        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);
>> +
>> +    memnodemapsize = (memtop >> i) + 1;
> 
> Again perhaps the subject of a separate patch: Isn't there an off-by-1
> mistake here? memtop is the maximum of all epdx-es, which are
> calculated to be the first PDX following the region. Hence I'd expect
> 
>      memnodemapsize = ((memtop - 1) >> i) + 1;
> 
> here. I guess I'll make patches for both issues, which you may then
> need to re-base over.
> 

Thanks, I will wait for your patches.
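
(For my own understanding, with a made-up value: if the largest epdx is
exactly 1UL << i, the current formula gives a map size of 2 while only
1 entry is needed; ((memtop - 1) >> i) + 1 gives 1.)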

>> +static int __init cf_check numa_setup(const char *opt)
>> +{
>> +    if ( !strncmp(opt, "off", 3) )
>> +        numa_off = true;
>> +    else if ( !strncmp(opt, "on", 2) )
>> +        numa_off = false;
>> +#ifdef CONFIG_NUMA_EMU
>> +    else if ( !strncmp(opt, "fake=", 5) )
>> +    {
>> +        numa_off = false;
>> +        numa_fake = simple_strtoul(opt + 5, NULL, 0);
>> +        if ( numa_fake >= MAX_NUMNODES )
>> +            numa_fake = MAX_NUMNODES;
>> +    }
>> +#endif
>> +    else
>> +        return arch_numa_setup(opt);
>> +
>> +    return 0;
>> +}
>> +custom_param("numa", numa_setup);
> 
> Note that with this moved here at some point during your work (when
> allowing NUMA=y for Arm) you'll need to update the command line doc.
> 

I have prepared a patch for this doc in the part#3 (Arm) series.
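
FWIW, the forms accepted by numa_setup() above, which that doc entry
will need to cover, are:

    numa=on
    numa=off
    numa=fake=<n>        (CONFIG_NUMA_EMU only; capped at MAX_NUMNODES)
    numa=<other>         (passed through to arch_numa_setup())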

>> +static void cf_check dump_numa(unsigned char key)
>> +{
>> +    s_time_t now = NOW();
>> +    unsigned int i, j, n;
>> +    struct domain *d;
>> +    const struct page_info *page;
>> +    unsigned int page_num_node[MAX_NUMNODES];
>> +    const struct vnuma_info *vnuma;
>> +
>> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
>> +           now);
>> +
>> +    for_each_online_node ( i )
>> +    {
>> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>> +
>> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
>> +               i, node_start_pfn(i), node_spanned_pages(i),
>> +               avail_node_heap_pages(i));
>> +        /* Sanity check phys_to_nid() */
>> +        if ( phys_to_nid(pa) != i )
>> +            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>> +                   pa, phys_to_nid(pa), i);
>> +    }
>> +
>> +    j = cpumask_first(&cpu_online_map);
>> +    n = 0;
>> +    for_each_online_cpu ( i )
>> +    {
>> +        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
>> +        {
>> +            if ( n > 1 )
>> +                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>> +            else
>> +                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>> +            j = i;
>> +            n = 1;
>> +        }
>> +        else
>> +            ++n;
>> +    }
>> +    if ( n > 1 )
>> +        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>> +    else
>> +        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>> +
>> +    rcu_read_lock(&domlist_read_lock);
>> +
>> +    printk("Memory location of each domain:\n");
>> +    for_each_domain ( d )
>> +    {
>> +        process_pending_softirqs();
>> +
>> +        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
>> +
>> +        for_each_online_node ( i )
>> +            page_num_node[i] = 0;
> 
> I'd be inclined to suggest to use memset() here, but I won't insist
> on you doing this "on the fly". Along with this would likely go the
> request to limit the scope of page_num_node[] (and then perhaps also
> vnuma and page).
> 

Using memset() for page_num_node makes sense; I will use it to replace
the for_each_online_node ( i ) reset loop above.
About limiting the scope, did you mean we should move

"const struct page_info *page;
unsigned int page_num_node[MAX_NUMNODES];
const struct vnuma_info *vnuma;"

into the for_each_domain ( d ) block?
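
i.e. something like this (just a sketch to confirm I've understood):

    for_each_domain ( d )
    {
        const struct page_info *page;
        unsigned int page_num_node[MAX_NUMNODES];
        const struct vnuma_info *vnuma;

        process_pending_softirqs();

        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));

        memset(page_num_node, 0, sizeof(page_num_node));
        ...
    }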


>> +        spin_lock(&d->page_alloc_lock);
>> +        page_list_for_each ( page, &d->page_list )
>> +        {
>> +            i = phys_to_nid(page_to_maddr(page));
>> +            page_num_node[i]++;
>> +        }
>> +        spin_unlock(&d->page_alloc_lock);
>> +
>> +        for_each_online_node ( i )
>> +            printk("    Node %u: %u\n", i, page_num_node[i]);
>> +
>> +        if ( !read_trylock(&d->vnuma_rwlock) )
>> +            continue;
>> +
>> +        if ( !d->vnuma )
>> +        {
>> +            read_unlock(&d->vnuma_rwlock);
>> +            continue;
>> +        }
>> +
>> +        vnuma = d->vnuma;
>> +        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
>> +               vnuma->nr_vnodes, d->max_vcpus);
>> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
>> +        {
>> +            unsigned int start_cpu = ~0U;
>> +
>> +            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
>> +                printk("       %3u: pnode ???,", i);
>> +            else
>> +                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
>> +
>> +            printk(" vcpus ");
>> +
>> +            for ( j = 0; j < d->max_vcpus; j++ )
>> +            {
>> +                if ( !(j & 0x3f) )
>> +                    process_pending_softirqs();
>> +
>> +                if ( vnuma->vcpu_to_vnode[j] == i )
>> +                {
>> +                    if ( start_cpu == ~0U )
>> +                    {
>> +                        printk("%d", j);
> 
> j being "unsigned int", would you mind switching to %u here and below?
> 

OK, I will do it here and below.

>> --- a/xen/include/xen/numa.h
>> +++ b/xen/include/xen/numa.h
>> @@ -18,4 +18,71 @@
>>     (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
>>      ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
>>   
>> +/* The following content can be used when NUMA feature is enabled */
>> +#ifdef CONFIG_NUMA
>> +
>> +extern nodeid_t      cpu_to_node[NR_CPUS];
>> +extern cpumask_t     node_to_cpumask[];
>> +
>> +#define cpu_to_node(cpu)        cpu_to_node[cpu]
>> +#define parent_node(node)       (node)
>> +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])
> 
> I can't spot a use of this - perhaps better drop than generalize (if
> done right here then along with mentioning this in the description)?
> 

Yes, no code uses this macro anymore; I will delete it and mention
that in the commit log.

Cheers,
Wei Chen


> Jan

