From: Wei Chen <Wei.Chen@arm.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: nd@arm.com, "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"George Dunlap" <george.dunlap@citrix.com>,
	"Julien Grall" <julien@xen.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common
Date: Thu, 29 Sep 2022 15:43:30 +0800
Message-ID: <72691b9b-761e-a89b-97df-afd5cf0ddebb@arm.com>
In-Reply-To: <359e87d9-aec7-2198-86ab-1d0f76bf567d@suse.com>

Hi Jan,

On 2022/9/27 16:19, Jan Beulich wrote:
> On 20.09.2022 11:12, Wei Chen wrote:
>> Some code in x86/numa.c can be shared by other architectures
>> to implement NUMA support, such as the variables and functions
>> used to check and store the NUMA memory map, and those used
>> for NUMA initialization.
>>
>> In this patch, we move them to common/numa.c and xen/numa.h,
>> and use CONFIG_NUMA to gate them for architectures without
>> NUMA support. As the target header file is Xen-style, we also
>> trim stray spaces and replace tabs in the code that has been
>> moved to xen/numa.h.
>>
>> As acpi_scan_nodes is now used from a common function, it
>> doesn't make sense to keep an acpi_xxx name in common code, so
>> we rename it to numa_scan_nodes in this patch too. After that,
> 
> numa_process_nodes() now?

Oh, yes, I will update it.

> 
>> keeping CONFIG_ACPI_NUMA as the gate for numa_scan_nodes in
>> numa_initmem_init no longer makes sense, since CONFIG_NUMA is
>> selected by CONFIG_ACPI_NUMA on x86. So we replace
>> CONFIG_ACPI_NUMA with CONFIG_NUMA to gate numa_scan_nodes.
>>
>> As arch_numa_disabled has been implemented for ACPI NUMA,
>> we can rename srat_disabled to numa_disabled and move it
>> to common code as well.
> 
> Please update the description: arch_numa_disabled() appears in patch 5
> only. Of course if you follow the comments to patch 2, the wording here
> would be correct again.
> 

I will update the description.

>> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
>> +                                                  nodeid_t numnodes)
>> +{
>> +    unsigned int i, nodes_used = 0;
>> +    unsigned long spdx, epdx;
>> +    unsigned long bitfield = 0, memtop = 0;
>> +
>> +    for ( i = 0; i < numnodes; i++ )
>> +    {
>> +        spdx = paddr_to_pdx(nodes[i].start);
>> +        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>> +
>> +        if ( spdx >= epdx )
>> +            continue;
>> +
>> +        bitfield |= spdx;
> 
> Perhaps to be taken care of in a separate patch: We accumulate only
> the bits from the start addresses here, contrary to what the comment
> ahead of the function says (and I think it is the comment which is
> putting things correctly).

If a node has non-zero memory, the lowest set bit of its end address
will be >= that of its start address. As we only use this function to
calculate the LSB, I don't think accumulating the bits of the start
addresses alone is a problem. Instead, I think we should update the
function's comment to explain why accumulating the start-address bits
is sufficient.
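
To make the mechanics concrete (made-up numbers, treating pdx == pfn
for simplicity): with two nodes starting at pdx 0x40000 and 0x80000,
bitfield ends up as 0xc0000, find_first_bit() returns 18, and the
memnodemap then works on 2^18-pdx granules.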

> 
>> +        nodes_used++;
>> +        if ( epdx > memtop )
>> +            memtop = epdx;
>> +    }
>> +
>> +    if ( nodes_used <= 1 )
>> +        i = BITS_PER_LONG - 1;
> 
> Is this actually going to be correct for all architectures? Aiui
> Arm64 has only up to 48 physical address bits, but what about an
> architecture allowing the use of all 64 bits? I think at the very
> least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here.
> 

OK, I will add the BUILD_BUG_ON above. I also have a question: why
can't we use PADDR_BITS here directly?
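
Something like the following is what I understand the suggestion to be
(the exact placement is my guess):

    if ( nodes_used <= 1 )
    {
        /* Catch configs where physical addresses may use all BITS_PER_LONG bits. */
        BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG);
        i = BITS_PER_LONG - 1;
    }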

>> +    else
>> +        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);
>> +
>> +    memnodemapsize = (memtop >> i) + 1;
> 
> Again perhaps the subject of a separate patch: Isn't there an off-by-1
> mistake here? memtop is the maximum of all epdx-es, which are
> calculated to be the first PDX following the region. Hence I'd expect
> 
>      memnodemapsize = ((memtop - 1) >> i) + 1;
> 
> here. I guess I'll make patches for both issues, which you may then
> need to re-base over.
> 

Thanks, I will wait for your patches.
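
(For my own understanding, with a made-up value: if the largest epdx is
exactly 1UL << i, the current formula gives a map size of 2 while only
1 entry is needed; ((memtop - 1) >> i) + 1 gives 1.)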

>> +static int __init cf_check numa_setup(const char *opt)
>> +{
>> +    if ( !strncmp(opt, "off", 3) )
>> +        numa_off = true;
>> +    else if ( !strncmp(opt, "on", 2) )
>> +        numa_off = false;
>> +#ifdef CONFIG_NUMA_EMU
>> +    else if ( !strncmp(opt, "fake=", 5) )
>> +    {
>> +        numa_off = false;
>> +        numa_fake = simple_strtoul(opt + 5, NULL, 0);
>> +        if ( numa_fake >= MAX_NUMNODES )
>> +            numa_fake = MAX_NUMNODES;
>> +    }
>> +#endif
>> +    else
>> +        return arch_numa_setup(opt);
>> +
>> +    return 0;
>> +}
>> +custom_param("numa", numa_setup);
> 
> Note that with this moved here at some point during your work (when
> allowing NUMA=y for Arm) you'll need to update the command line doc.
> 

I have prepared a patch for this doc in the part#3 (Arm) series.
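
FWIW, the forms accepted by numa_setup() above, which that doc entry
will need to cover, are:

    numa=on
    numa=off
    numa=fake=<n>        (CONFIG_NUMA_EMU only; capped at MAX_NUMNODES)
    numa=<other>         (passed through to arch_numa_setup())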

>> +static void cf_check dump_numa(unsigned char key)
>> +{
>> +    s_time_t now = NOW();
>> +    unsigned int i, j, n;
>> +    struct domain *d;
>> +    const struct page_info *page;
>> +    unsigned int page_num_node[MAX_NUMNODES];
>> +    const struct vnuma_info *vnuma;
>> +
>> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
>> +           now);
>> +
>> +    for_each_online_node ( i )
>> +    {
>> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>> +
>> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
>> +               i, node_start_pfn(i), node_spanned_pages(i),
>> +               avail_node_heap_pages(i));
>> +        /* Sanity check phys_to_nid() */
>> +        if ( phys_to_nid(pa) != i )
>> +            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>> +                   pa, phys_to_nid(pa), i);
>> +    }
>> +
>> +    j = cpumask_first(&cpu_online_map);
>> +    n = 0;
>> +    for_each_online_cpu ( i )
>> +    {
>> +        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
>> +        {
>> +            if ( n > 1 )
>> +                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>> +            else
>> +                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>> +            j = i;
>> +            n = 1;
>> +        }
>> +        else
>> +            ++n;
>> +    }
>> +    if ( n > 1 )
>> +        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>> +    else
>> +        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>> +
>> +    rcu_read_lock(&domlist_read_lock);
>> +
>> +    printk("Memory location of each domain:\n");
>> +    for_each_domain ( d )
>> +    {
>> +        process_pending_softirqs();
>> +
>> +        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
>> +
>> +        for_each_online_node ( i )
>> +            page_num_node[i] = 0;
> 
> I'd be inclined to suggest to use memset() here, but I won't insist
> on you doing this "on the fly". Along with this would likely go the
> request to limit the scope of page_num_node[] (and then perhaps also
> vnuma and page).
> 

Using memset() for page_num_node makes sense; I will use it to replace
the for_each_online_node ( i ) reset loop above.
About limiting the scope, did you mean we should move

"const struct page_info *page;
unsigned int page_num_node[MAX_NUMNODES];
const struct vnuma_info *vnuma;"

into the for_each_domain ( d ) block?
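
i.e. something like this (just a sketch to confirm I've understood):

    for_each_domain ( d )
    {
        const struct page_info *page;
        unsigned int page_num_node[MAX_NUMNODES];
        const struct vnuma_info *vnuma;

        process_pending_softirqs();

        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));

        memset(page_num_node, 0, sizeof(page_num_node));
        ...
    }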


>> +        spin_lock(&d->page_alloc_lock);
>> +        page_list_for_each ( page, &d->page_list )
>> +        {
>> +            i = phys_to_nid(page_to_maddr(page));
>> +            page_num_node[i]++;
>> +        }
>> +        spin_unlock(&d->page_alloc_lock);
>> +
>> +        for_each_online_node ( i )
>> +            printk("    Node %u: %u\n", i, page_num_node[i]);
>> +
>> +        if ( !read_trylock(&d->vnuma_rwlock) )
>> +            continue;
>> +
>> +        if ( !d->vnuma )
>> +        {
>> +            read_unlock(&d->vnuma_rwlock);
>> +            continue;
>> +        }
>> +
>> +        vnuma = d->vnuma;
>> +        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
>> +               vnuma->nr_vnodes, d->max_vcpus);
>> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
>> +        {
>> +            unsigned int start_cpu = ~0U;
>> +
>> +            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
>> +                printk("       %3u: pnode ???,", i);
>> +            else
>> +                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
>> +
>> +            printk(" vcpus ");
>> +
>> +            for ( j = 0; j < d->max_vcpus; j++ )
>> +            {
>> +                if ( !(j & 0x3f) )
>> +                    process_pending_softirqs();
>> +
>> +                if ( vnuma->vcpu_to_vnode[j] == i )
>> +                {
>> +                    if ( start_cpu == ~0U )
>> +                    {
>> +                        printk("%d", j);
> 
> j being "unsigned int", would you mind switching to %u here and below?
> 

OK, I will do it here and below.

>> --- a/xen/include/xen/numa.h
>> +++ b/xen/include/xen/numa.h
>> @@ -18,4 +18,71 @@
>>     (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
>>      ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
>>   
>> +/* The following content can be used when NUMA feature is enabled */
>> +#ifdef CONFIG_NUMA
>> +
>> +extern nodeid_t      cpu_to_node[NR_CPUS];
>> +extern cpumask_t     node_to_cpumask[];
>> +
>> +#define cpu_to_node(cpu)        cpu_to_node[cpu]
>> +#define parent_node(node)       (node)
>> +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])
> 
> I can't spot a use of this - perhaps better drop than generalize (if
> done right here then along with mentioning this in the description)?
> 

Yes, no code uses this macro anymore; I will delete it and mention
that in the commit log.

Cheers,
Wei Chen


> Jan

