* [PATCH 0/2] NUMA: phys_to_nid() related adjustments
@ 2022-12-13 11:35 Jan Beulich
  2022-12-13 11:36 ` [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses Jan Beulich
  2022-12-13 11:38 ` [PATCH 2/2] NUMA: replace phys_to_nid() Jan Beulich
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Beulich @ 2022-12-13 11:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

First of all we need to deal with fallout from converting the
function's previously dead checks to ASSERT(). An approach with a
weakness is proposed in patch 1; see the RFC remark there. Plus
phys_to_nid() has been somewhat inefficient with respect to all of
its present callers.

1: x86/mm: avoid phys_to_nid() calls for invalid addresses
2: NUMA: replace phys_to_nid()

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses
  2022-12-13 11:35 [PATCH 0/2] NUMA: phys_to_nid() related adjustments Jan Beulich
@ 2022-12-13 11:36 ` Jan Beulich
  2022-12-14  3:28   ` Wei Chen
  2022-12-16 19:24   ` Andrew Cooper
  2022-12-13 11:38 ` [PATCH 2/2] NUMA: replace phys_to_nid() Jan Beulich
  1 sibling, 2 replies; 15+ messages in thread
From: Jan Beulich @ 2022-12-13 11:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

With phys_to_nid() now actively checking that a valid node ID is on
record, the two uses in paging_init() can actually trigger at least the
2nd of the assertions there. They're used to calculate allocation flags,
but the calculated flags wouldn't be used when dealing with an invalid
(unpopulated) address range. Defer the calculations such that they can
be done with a validated MFN in hand. This also does away with the
artificial calculations of an address to pass to phys_to_nid().

Note that while the variable is provably written before use, at least
some compiler versions can't actually verify that. Hence the variable
also needs to gain a (dead) initializer.

Fixes: e9c72d524fbd ("xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: With small enough a NUMA hash shift it would still be possible to
     hit an SRAT hole, despite mfn_valid() passing. Hence, as was the
     original plan, it may still be necessary to relax the checking in
     phys_to_nid() (or its designated replacements). At which point the
     value of this change here would shrink to merely reducing the
     chance of unintentionally doing NUMA_NO_NODE allocations.

--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -498,7 +498,7 @@ error:
 void __init paging_init(void)
 {
     unsigned long i, mpt_size, va;
-    unsigned int n, memflags;
+    unsigned int n, memflags = 0;
     l3_pgentry_t *l3_ro_mpt;
     l2_pgentry_t *pl2e = NULL, *l2_ro_mpt = NULL;
     struct page_info *l1_pg;
@@ -547,8 +547,6 @@ void __init paging_init(void)
     {
         BUILD_BUG_ON(RO_MPT_VIRT_START & ((1UL << L3_PAGETABLE_SHIFT) - 1));
         va = RO_MPT_VIRT_START + (i << L2_PAGETABLE_SHIFT);
-        memflags = MEMF_node(phys_to_nid(i <<
-            (L2_PAGETABLE_SHIFT - 3 + PAGE_SHIFT)));
 
         if ( cpu_has_page1gb &&
              !((unsigned long)pl2e & ~PAGE_MASK) &&
@@ -559,10 +557,15 @@ void __init paging_init(void)
             for ( holes = k = 0; k < 1 << PAGETABLE_ORDER; ++k)
             {
                 for ( n = 0; n < CNT; ++n)
-                    if ( mfn_valid(_mfn(MFN(i + k) + n * PDX_GROUP_COUNT)) )
+                {
+                    mfn = _mfn(MFN(i + k) + n * PDX_GROUP_COUNT);
+                    if ( mfn_valid(mfn) )
                         break;
+                }
                 if ( n == CNT )
                     ++holes;
+                else if ( k == holes )
+                    memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
             }
             if ( k == holes )
             {
@@ -593,8 +596,14 @@ void __init paging_init(void)
         }
 
         for ( n = 0; n < CNT; ++n)
-            if ( mfn_valid(_mfn(MFN(i) + n * PDX_GROUP_COUNT)) )
+        {
+            mfn = _mfn(MFN(i) + n * PDX_GROUP_COUNT);
+            if ( mfn_valid(mfn) )
+            {
+                memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
                 break;
+            }
+        }
         if ( n == CNT )
             l1_pg = NULL;
         else if ( (l1_pg = alloc_domheap_pages(NULL, PAGETABLE_ORDER,
@@ -663,15 +672,19 @@ void __init paging_init(void)
                  sizeof(*compat_machine_to_phys_mapping));
     for ( i = 0; i < (mpt_size >> L2_PAGETABLE_SHIFT); i++, pl2e++ )
     {
-        memflags = MEMF_node(phys_to_nid(i <<
-            (L2_PAGETABLE_SHIFT - 2 + PAGE_SHIFT)));
         for ( n = 0; n < CNT; ++n)
-            if ( mfn_valid(_mfn(MFN(i) + n * PDX_GROUP_COUNT)) )
+        {
+            mfn = _mfn(MFN(i) + n * PDX_GROUP_COUNT);
+            if ( mfn_valid(mfn) )
+            {
+                memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
                 break;
+            }
+        }
         if ( n == CNT )
             continue;
         if ( (l1_pg = alloc_domheap_pages(NULL, PAGETABLE_ORDER,
-                                               memflags)) == NULL )
+                                          memflags)) == NULL )
             goto nomem;
         map_pages_to_xen(
             RDWR_COMPAT_MPT_VIRT_START + (i << L2_PAGETABLE_SHIFT),




* [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 11:35 [PATCH 0/2] NUMA: phys_to_nid() related adjustments Jan Beulich
  2022-12-13 11:36 ` [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses Jan Beulich
@ 2022-12-13 11:38 ` Jan Beulich
  2022-12-13 12:06   ` Julien Grall
  2022-12-16 11:49   ` Andrew Cooper
  1 sibling, 2 replies; 15+ messages in thread
From: Jan Beulich @ 2022-12-13 11:38 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	Bertrand Marquis, Volodymyr Babchuk

All callers convert frame numbers (perhaps in turn derived from struct
page_info pointers) to an address, just for the function to convert it
back to a frame number (as the first step of paddr_to_pdx()). Replace
the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
call sites by the respectively most suitable one.

While there also introduce a !NUMA stub, eliminating the need for Arm
(and potentially other ports) to carry one individually.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
At the top of free_heap_pages() mfn_to_nid() could also be used, since
the MFN is calculated immediately ahead. The choice of using
page_to_nid() (for now at least) was with the earlier patch's RFC in
mind, addressing of which may require making mfn_to_nid() do weaker
checking than page_to_nid().

--- a/xen/arch/arm/include/asm/numa.h
+++ b/xen/arch/arm/include/asm/numa.h
@@ -11,11 +11,6 @@ typedef u8 nodeid_t;
 #define cpu_to_node(cpu) 0
 #define node_to_cpumask(node)   (cpu_online_map)
 
-static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
-{
-    return 0;
-}
-
 /*
  * TODO: make first_valid_mfn static when NUMA is supported on Arm, this
  * is required because the dummy helpers are using it.
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -492,7 +492,7 @@ p2m_pod_offline_or_broken_replace(struct
 {
     struct domain *d;
     struct p2m_domain *p2m;
-    nodeid_t node = phys_to_nid(page_to_maddr(p));
+    nodeid_t node = page_to_nid(p);
 
     if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) )
         return;
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -565,7 +565,7 @@ void __init paging_init(void)
                 if ( n == CNT )
                     ++holes;
                 else if ( k == holes )
-                    memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
+                    memflags = MEMF_node(mfn_to_nid(mfn));
             }
             if ( k == holes )
             {
@@ -600,7 +600,7 @@ void __init paging_init(void)
             mfn = _mfn(MFN(i) + n * PDX_GROUP_COUNT);
             if ( mfn_valid(mfn) )
             {
-                memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
+                memflags = MEMF_node(mfn_to_nid(mfn));
                 break;
             }
         }
@@ -677,7 +677,7 @@ void __init paging_init(void)
             mfn = _mfn(MFN(i) + n * PDX_GROUP_COUNT);
             if ( mfn_valid(mfn) )
             {
-                memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
+                memflags = MEMF_node(mfn_to_nid(mfn));
                 break;
             }
         }
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -671,15 +671,15 @@ static void cf_check dump_numa(unsigned
 
     for_each_online_node ( i )
     {
-        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
+        mfn_t mfn = _mfn(node_start_pfn(i) + 1);
 
         printk("NODE%u start->%lu size->%lu free->%lu\n",
                i, node_start_pfn(i), node_spanned_pages(i),
                avail_node_heap_pages(i));
-        /* Sanity check phys_to_nid() */
-        if ( phys_to_nid(pa) != i )
-            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
-                   pa, phys_to_nid(pa), i);
+        /* Sanity check mfn_to_nid() */
+        if ( node_spanned_pages(i) && mfn_to_nid(mfn) != i )
+            printk("mfn_to_nid(%"PRI_mfn") -> %d should be %u\n",
+                   mfn_x(mfn), mfn_to_nid(mfn), i);
     }
 
     j = cpumask_first(&cpu_online_map);
@@ -721,7 +721,7 @@ static void cf_check dump_numa(unsigned
         spin_lock(&d->page_alloc_lock);
         page_list_for_each ( page, &d->page_list )
         {
-            i = phys_to_nid(page_to_maddr(page));
+            i = page_to_nid(page);
             page_num_node[i]++;
         }
         spin_unlock(&d->page_alloc_lock);
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -971,7 +971,7 @@ static struct page_info *alloc_heap_page
         return NULL;
     }
 
-    node = phys_to_nid(page_to_maddr(pg));
+    node = page_to_nid(pg);
     zone = page_to_zone(pg);
     buddy_order = PFN_ORDER(pg);
 
@@ -1078,7 +1078,7 @@ static struct page_info *alloc_heap_page
 /* Remove any offlined page in the buddy pointed to by head. */
 static int reserve_offlined_page(struct page_info *head)
 {
-    unsigned int node = phys_to_nid(page_to_maddr(head));
+    unsigned int node = page_to_nid(head);
     int zone = page_to_zone(head), i, head_order = PFN_ORDER(head), count = 0;
     struct page_info *cur_head;
     unsigned int cur_order, first_dirty;
@@ -1443,7 +1443,7 @@ static void free_heap_pages(
 {
     unsigned long mask;
     mfn_t mfn = page_to_mfn(pg);
-    unsigned int i, node = phys_to_nid(mfn_to_maddr(mfn));
+    unsigned int i, node = page_to_nid(pg);
     unsigned int zone = page_to_zone(pg);
     bool pg_offlined = false;
 
@@ -1487,7 +1487,7 @@ static void free_heap_pages(
                  !page_state_is(predecessor, free) ||
                  (predecessor->count_info & PGC_static) ||
                  (PFN_ORDER(predecessor) != order) ||
-                 (phys_to_nid(page_to_maddr(predecessor)) != node) )
+                 (page_to_nid(predecessor) != node) )
                 break;
 
             check_and_stop_scrub(predecessor);
@@ -1511,7 +1511,7 @@ static void free_heap_pages(
                  !page_state_is(successor, free) ||
                  (successor->count_info & PGC_static) ||
                  (PFN_ORDER(successor) != order) ||
-                 (phys_to_nid(page_to_maddr(successor)) != node) )
+                 (page_to_nid(successor) != node) )
                 break;
 
             check_and_stop_scrub(successor);
@@ -1574,7 +1574,7 @@ static unsigned long mark_page_offline(s
 static int reserve_heap_page(struct page_info *pg)
 {
     struct page_info *head = NULL;
-    unsigned int i, node = phys_to_nid(page_to_maddr(pg));
+    unsigned int i, node = page_to_nid(pg);
     unsigned int zone = page_to_zone(pg);
 
     for ( i = 0; i <= MAX_ORDER; i++ )
@@ -1794,7 +1794,7 @@ static void _init_heap_pages(const struc
                              bool need_scrub)
 {
     unsigned long s, e;
-    unsigned int nid = phys_to_nid(page_to_maddr(pg));
+    unsigned int nid = page_to_nid(pg);
 
     s = mfn_x(page_to_mfn(pg));
     e = mfn_x(mfn_add(page_to_mfn(pg + nr_pages - 1), 1));
@@ -1869,7 +1869,7 @@ static void init_heap_pages(
 #ifdef CONFIG_SEPARATE_XENHEAP
         unsigned int zone = page_to_zone(pg);
 #endif
-        unsigned int nid = phys_to_nid(page_to_maddr(pg));
+        unsigned int nid = page_to_nid(pg);
         unsigned long left = nr_pages - i;
         unsigned long contig_pages;
 
@@ -1893,7 +1893,7 @@ static void init_heap_pages(
                 break;
 #endif
 
-            if ( nid != (phys_to_nid(page_to_maddr(pg + contig_pages))) )
+            if ( nid != (page_to_nid(pg + contig_pages)) )
                 break;
         }
 
@@ -1934,7 +1934,7 @@ void __init end_boot_allocator(void)
     {
         struct bootmem_region *r = &bootmem_region_list[i];
         if ( (r->s < r->e) &&
-             (phys_to_nid(pfn_to_paddr(r->s)) == cpu_to_node(0)) )
+             (mfn_to_nid(_mfn(r->s)) == cpu_to_node(0)) )
         {
             init_heap_pages(mfn_to_page(_mfn(r->s)), r->e - r->s);
             r->e = r->s;
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -1,6 +1,7 @@
 #ifndef _XEN_NUMA_H
 #define _XEN_NUMA_H
 
+#include <xen/mm-frame.h>
 #include <asm/numa.h>
 
 #define NUMA_NO_NODE     0xFF
@@ -68,12 +69,15 @@ struct node_data {
 
 extern struct node_data node_data[];
 
-static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
+static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
 {
     nodeid_t nid;
-    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
-    nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
+    unsigned long pdx = mfn_to_pdx(mfn);
+
+    ASSERT((pdx >> memnode_shift) < memnodemapsize);
+    nid = memnodemap[pdx >> memnode_shift];
     ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
+
     return nid;
 }
 
@@ -102,6 +106,15 @@ extern bool numa_update_node_memblks(nod
                                      paddr_t start, paddr_t size, bool hotplug);
 extern void numa_set_processor_nodes_parsed(nodeid_t node);
 
+#else
+
+static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
+{
+    return 0;
+}
+
 #endif
 
+#define page_to_nid(pg) mfn_to_nid(page_to_mfn(pg))
+
 #endif /* _XEN_NUMA_H */




* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 11:38 ` [PATCH 2/2] NUMA: replace phys_to_nid() Jan Beulich
@ 2022-12-13 12:06   ` Julien Grall
  2022-12-13 12:46     ` Jan Beulich
  2022-12-16 11:49   ` Andrew Cooper
  1 sibling, 1 reply; 15+ messages in thread
From: Julien Grall @ 2022-12-13 12:06 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	Roger Pau Monné,
	Bertrand Marquis, Volodymyr Babchuk

Hi Jan,

On 13/12/2022 11:38, Jan Beulich wrote:
> All callers convert frame numbers (perhaps in turn derived from struct
> page_info pointers) to an address, just for the function to convert it
> back to a frame number (as the first step of paddr_to_pdx()). Replace
> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
> call sites by the respectively most suitable one.
> 
> While there also introduce a !NUMA stub, eliminating the need for Arm
> (and potentially other ports) to carry one individually.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> At the top of free_heap_pages() mfn_to_nid() could also be used, since
> the MFN is calculated immediately ahead. The choice of using
> page_to_nid() (for now at least) was with the earlier patch's RFC in
> mind, addressing of which may require to make mfn_to_nid() do weaker
> checking than page_to_nid().

I haven't looked in detail at the previous patch. However, I don't like 
the idea of making mfn_to_nid() do weaker checking because this could 
easily confuse the reader/developer.

If you want to use a weaker check, then it would be better if a separate 
helper were provided with a name reflecting its purpose.

> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -671,15 +671,15 @@ static void cf_check dump_numa(unsigned
>   
>       for_each_online_node ( i )
>       {
> -        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
> +        mfn_t mfn = _mfn(node_start_pfn(i) + 1);
>   
>           printk("NODE%u start->%lu size->%lu free->%lu\n",
>                  i, node_start_pfn(i), node_spanned_pages(i),
>                  avail_node_heap_pages(i));
> -        /* Sanity check phys_to_nid() */
> -        if ( phys_to_nid(pa) != i )
> -            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
> -                   pa, phys_to_nid(pa), i);
> +        /* Sanity check mfn_to_nid() */
> +        if ( node_spanned_pages(i) && mfn_to_nid(mfn) != i )


 From the commit message, I would have expected that we would only 
replace phys_to_nid() with either mfn_to_nid() or page_to_nid(). 
However, here you added node_spanned_pages(). Can you explain why?

Cheers,

-- 
Julien Grall



* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 12:06   ` Julien Grall
@ 2022-12-13 12:46     ` Jan Beulich
  2022-12-13 13:48       ` Julien Grall
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2022-12-13 12:46 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	xen-devel, Roger Pau Monné,
	Bertrand Marquis, Volodymyr Babchuk

On 13.12.2022 13:06, Julien Grall wrote:
> On 13/12/2022 11:38, Jan Beulich wrote:
>> All callers convert frame numbers (perhaps in turn derived from struct
>> page_info pointers) to an address, just for the function to convert it
>> back to a frame number (as the first step of paddr_to_pdx()). Replace
>> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
>> call sites by the respectively most suitable one.
>>
>> While there also introduce a !NUMA stub, eliminating the need for Arm
>> (and potentially other ports) to carry one individually.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> At the top of free_heap_pages() mfn_to_nid() could also be used, since
>> the MFN is calculated immediately ahead. The choice of using
>> page_to_nid() (for now at least) was with the earlier patch's RFC in
>> mind, addressing of which may require to make mfn_to_nid() do weaker
>> checking than page_to_nid().
> 
> I haven't looked in details at the previous patch. However, I don't like 
> the idea of making mfn_to_nid() do weaker checking because this could 
> easily confuse the reader/developper.
> 
> If you want to use weaker check, then it would be better if a separate 
> helper is provided with a name reflecting its purpose.

Well, the purpose then still is the very same conversion, so the name
is quite appropriate. I don't view mfn_to_nid_bug_dont_look_very_closely()
(exaggerating) as very sensible a name.

>> --- a/xen/common/numa.c
>> +++ b/xen/common/numa.c
>> @@ -671,15 +671,15 @@ static void cf_check dump_numa(unsigned
>>   
>>       for_each_online_node ( i )
>>       {
>> -        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>> +        mfn_t mfn = _mfn(node_start_pfn(i) + 1);
>>   
>>           printk("NODE%u start->%lu size->%lu free->%lu\n",
>>                  i, node_start_pfn(i), node_spanned_pages(i),
>>                  avail_node_heap_pages(i));
>> -        /* Sanity check phys_to_nid() */
>> -        if ( phys_to_nid(pa) != i )
>> -            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>> -                   pa, phys_to_nid(pa), i);
>> +        /* Sanity check mfn_to_nid() */
>> +        if ( node_spanned_pages(i) && mfn_to_nid(mfn) != i )
> 
> 
>  From the commit message, I would have expected that we would only 
> replace phys_to_nid() with either mfn_to_nid() or page_to_nid(). 
> However, here you added node_spanned_pages(). Can you explain why?

Oh, indeed, I meant to say a word on this but then forgot. This
simply is because the adding of 1 to the start PFN (which by
itself is imo a little funny) makes it so that the printk()
inside the conditional would be certain to be called for an
empty (e.g. CPU-only) node.

Jan



* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 12:46     ` Jan Beulich
@ 2022-12-13 13:48       ` Julien Grall
  2022-12-13 14:08         ` Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Julien Grall @ 2022-12-13 13:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	xen-devel, Roger Pau Monné,
	Bertrand Marquis, Volodymyr Babchuk

Hi Jan,

On 13/12/2022 12:46, Jan Beulich wrote:
> On 13.12.2022 13:06, Julien Grall wrote:
>> On 13/12/2022 11:38, Jan Beulich wrote:
>>> All callers convert frame numbers (perhaps in turn derived from struct
>>> page_info pointers) to an address, just for the function to convert it
>>> back to a frame number (as the first step of paddr_to_pdx()). Replace
>>> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
>>> call sites by the respectively most suitable one.
>>>
>>> While there also introduce a !NUMA stub, eliminating the need for Arm
>>> (and potentially other ports) to carry one individually.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> At the top of free_heap_pages() mfn_to_nid() could also be used, since
>>> the MFN is calculated immediately ahead. The choice of using
>>> page_to_nid() (for now at least) was with the earlier patch's RFC in
>>> mind, addressing of which may require to make mfn_to_nid() do weaker
>>> checking than page_to_nid().
>>
>> I haven't looked in detail at the previous patch. However, I don't like
>> the idea of making mfn_to_nid() do weaker checking because this could
>> easily confuse the reader/developer.
>>
>> If you want to use a weaker check, then it would be better if a separate
>> helper were provided with a name reflecting its purpose.
> 
> Well, the purpose then still is the very same conversion, so the name
> is quite appropriate. I don't view mfn_to_nid_bug_dont_look_very_closely()
> (exaggerating) as very sensible a name.

I understand they are both doing the same conversion. But the checks 
will be different. With your proposal, we are now going to say: if the 
caller is "buggy", use mfn_to_nid(); if not, you can use either.

I think it is wrong to hide the "bug" just because the name is longer. 
In fact, it means that any non-buggy caller will still get the relaxed 
check. The risk is that we will introduce more "buggy" callers in the 
future.

So from my perspective there are only two acceptable solutions:
   1. Provide a different helper that will be used for just "buggy" 
callers. This will make it super clear that the helper should only be 
used in very limited circumstances.
   2. Fix the "buggy" callers.

From your previous e-mails, it wasn't clear whether 2) is possible. So 
that leaves us only with 1).

>>> --- a/xen/common/numa.c
>>> +++ b/xen/common/numa.c
>>> @@ -671,15 +671,15 @@ static void cf_check dump_numa(unsigned
>>>    
>>>        for_each_online_node ( i )
>>>        {
>>> -        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>>> +        mfn_t mfn = _mfn(node_start_pfn(i) + 1);
>>>    
>>>            printk("NODE%u start->%lu size->%lu free->%lu\n",
>>>                   i, node_start_pfn(i), node_spanned_pages(i),
>>>                   avail_node_heap_pages(i));
>>> -        /* Sanity check phys_to_nid() */
>>> -        if ( phys_to_nid(pa) != i )
>>> -            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>>> -                   pa, phys_to_nid(pa), i);
>>> +        /* Sanity check mfn_to_nid() */
>>> +        if ( node_spanned_pages(i) && mfn_to_nid(mfn) != i )
>>
>>
>>   From the commit message, I would have expected that we would only
>> replace phys_to_nid() with either mfn_to_nid() or page_to_nid().
>> However, here you added node_spanned_pages(). Can you explain why?
> 
> Oh, indeed, I meant to say a word on this but then forgot. This
> simply is because the adding of 1 to the start PFN (which by
> itself is imo a little funny) makes it so that the printk()
> inside the conditional would be certain to be called for an
> empty (e.g. CPU-only) node.

Ok. I think this wants to be a separate patch, as this sounds like a bug 
and we should avoid mixing code conversion with a bug fix.

Cheers,

-- 
Julien Grall



* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 13:48       ` Julien Grall
@ 2022-12-13 14:08         ` Jan Beulich
  2022-12-13 21:33           ` Julien Grall
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2022-12-13 14:08 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	xen-devel, Roger Pau Monné,
	Bertrand Marquis, Volodymyr Babchuk

On 13.12.2022 14:48, Julien Grall wrote:
> On 13/12/2022 12:46, Jan Beulich wrote:
>> On 13.12.2022 13:06, Julien Grall wrote:
>>> On 13/12/2022 11:38, Jan Beulich wrote:
>>>> All callers convert frame numbers (perhaps in turn derived from struct
>>>> page_info pointers) to an address, just for the function to convert it
>>>> back to a frame number (as the first step of paddr_to_pdx()). Replace
>>>> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
>>>> call sites by the respectively most suitable one.
>>>>
>>>> While there also introduce a !NUMA stub, eliminating the need for Arm
>>>> (and potentially other ports) to carry one individually.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> At the top of free_heap_pages() mfn_to_nid() could also be used, since
>>>> the MFN is calculated immediately ahead. The choice of using
>>>> page_to_nid() (for now at least) was with the earlier patch's RFC in
>>>> mind, addressing of which may require to make mfn_to_nid() do weaker
>>>> checking than page_to_nid().
>>>
>>> I haven't looked in detail at the previous patch. However, I don't like
>>> the idea of making mfn_to_nid() do weaker checking because this could
>>> easily confuse the reader/developer.
>>>
>>> If you want to use a weaker check, then it would be better if a separate
>>> helper were provided with a name reflecting its purpose.
>>
>> Well, the purpose then still is the very same conversion, so the name
>> is quite appropriate. I don't view mfn_to_nid_bug_dont_look_very_closely()
>> (exaggerating) as very sensible a name.
> 
> I understand they are both doing the same conversion. But the checks 
> will be different. With your proposal, we are now going to say: if the 
> caller is "buggy", use mfn_to_nid(); if not, you can use either.
> 
> I think it is wrong to hide the "bug" just because the name is longer. 
> In fact, it means that any non-buggy caller will still get the relaxed 
> check. The risk is that we will introduce more "buggy" callers in the 
> future.

While I, too, have taken your perspective as one possible one, I've
also been considering a slightly different perspective: page_to_nid()
implies the caller to have a struct page_info *, which in turn implies
you pass in something identifying valid memory (which hence should have
a valid node ID associated with it). mfn_to_nid(), otoh, has nothing
to pre-qualify (see patch 1's RFC remark as to mfn_valid() not being
sufficient). Hence less rigid checking there can make sense (and you'll
notice that mfn_to_nid() was also used quite sparingly in the course of
the conversion.)

> So from my perspective there are only two acceptable solutions:
>    1. Provide a different helper that will be used for just "buggy" 
> callers. This will make it super clear that the helper should only be 
> used in very limited circumstances.
>    2. Fix the "buggy" callers.
> 
> From your previous e-mails, it wasn't clear whether 2) is possible. So 
> that leaves us only with 1).

The buggy callers are the ones touched by patch 1; see (again) the RFC
remark there for limitations of that approach.

>>>> --- a/xen/common/numa.c
>>>> +++ b/xen/common/numa.c
>>>> @@ -671,15 +671,15 @@ static void cf_check dump_numa(unsigned
>>>>    
>>>>        for_each_online_node ( i )
>>>>        {
>>>> -        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>>>> +        mfn_t mfn = _mfn(node_start_pfn(i) + 1);
>>>>    
>>>>            printk("NODE%u start->%lu size->%lu free->%lu\n",
>>>>                   i, node_start_pfn(i), node_spanned_pages(i),
>>>>                   avail_node_heap_pages(i));
>>>> -        /* Sanity check phys_to_nid() */
>>>> -        if ( phys_to_nid(pa) != i )
>>>> -            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>>>> -                   pa, phys_to_nid(pa), i);
>>>> +        /* Sanity check mfn_to_nid() */
>>>> +        if ( node_spanned_pages(i) && mfn_to_nid(mfn) != i )
>>>
>>>
>>>   From the commit message, I would have expected that we would only
>>> replace phys_to_nid() with either mfn_to_nid() or page_to_nid().
>>> However, here you added node_spanned_pages(). Can you explain why?
>>
>> Oh, indeed, I meant to say a word on this but then forgot. This
>> simply is because the adding of 1 to the start PFN (which by
>> itself is imo a little funny) makes it so that the printk()
>> inside the conditional would be certain to be called for an
>> empty (e.g. CPU-only) node.
> 
> Ok. I think this wants to be a separate patch, as this sounds like a bug 
> and we should avoid mixing code conversion with a bug fix.

Yet then this is only in a debug key handler. (Else I would have made
it a separate patch, yes.)

Jan



* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 14:08         ` Jan Beulich
@ 2022-12-13 21:33           ` Julien Grall
  0 siblings, 0 replies; 15+ messages in thread
From: Julien Grall @ 2022-12-13 21:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	xen-devel, Roger Pau Monné,
	Bertrand Marquis, Volodymyr Babchuk

Hi Jan,

On 13/12/2022 14:08, Jan Beulich wrote:
> On 13.12.2022 14:48, Julien Grall wrote:
>> On 13/12/2022 12:46, Jan Beulich wrote:
>>> On 13.12.2022 13:06, Julien Grall wrote:
>>>> On 13/12/2022 11:38, Jan Beulich wrote:
>>>>> All callers convert frame numbers (perhaps in turn derived from struct
>>>>> page_info pointers) to an address, just for the function to convert it
>>>>> back to a frame number (as the first step of paddr_to_pdx()). Replace
>>>>> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
>>>>> call sites by the respectively most suitable one.
>>>>>
>>>>> While there also introduce a !NUMA stub, eliminating the need for Arm
>>>>> (and potentially other ports) to carry one individually.
>>>>>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> ---
>>>>> At the top of free_heap_pages() mfn_to_nid() could also be used, since
>>>>> the MFN is calculated immediately ahead. The choice of using
>>>>> page_to_nid() (for now at least) was with the earlier patch's RFC in
>>>>> mind, addressing of which may require to make mfn_to_nid() do weaker
>>>>> checking than page_to_nid().
>>>>
>>>> I haven't looked in details at the previous patch. However, I don't like
>>>> the idea of making mfn_to_nid() do weaker checking because this could
>>>> easily confuse the reader/developer.
>>>>
>>>> If you want to use weaker check, then it would be better if a separate
>>>> helper is provided with a name reflecting its purpose.
>>>
>>> Well, the purpose then still is the very same conversion, so the name
>>> is quite appropriate. I don't view mfn_to_nid_bug_dont_look_very_closely()
>>> (exaggerating) as very sensible a name.
>>
>> I understand they are both doing the same conversion. But the checks
>> will be different. With your proposal, we are now going to say: if the
>> caller is "buggy" then use mfn_to_nid(); if not, then you can use either.
>>
>> I think it is wrong to hide the "bug" just because the name would be
>> longer. In fact, it means that any non-buggy caller will still get the
>> relaxed check. The risk is that we then introduce more "buggy" callers
>> in the future.
> 
> While I, too, have taken your perspective as one possible one, I've
> also been considering a slightly different perspective: page_to_nid()
> implies the caller to have a struct page_info *, which in turn implies
> you pass in something identifying valid memory (which hence should have
> a valid node ID associated with it). mfn_to_nid(), otoh, has nothing
> to pre-qualify (see patch 1's RFC remark as to mfn_valid() not being
> sufficient). Hence less rigid checking there can make sense (and you'll
> notice that mfn_to_nid() was also used quite sparingly in the course of
> the conversion.)
> 
>> So from my perspective there are only two acceptable solutions:
>>     1. Provide a different helper that will be used for just "buggy"
>> caller. This will make super clear that the helper should only be used
>> in very limited circumstances.
>>     2. Fix the "buggy" callers.
>>
>>   From your previous e-mails, it wasn't clear whether 2) is possible. So
>> that leaves us only with 1).
> 
> The buggy callers are the ones touched by patch 1; see (again) the RFC
> remark there for limitations of that approach.

Even with what you wrote above, I still think that relaxing the check 
for everyone is wrong. Anyway, this patch is not changing the helper, so 
I will wait for a formal proposal.

> 
>>>>> --- a/xen/common/numa.c
>>>>> +++ b/xen/common/numa.c
>>>>> @@ -671,15 +671,15 @@ static void cf_check dump_numa(unsigned
>>>>>     
>>>>>         for_each_online_node ( i )
>>>>>         {
>>>>> -        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>>>>> +        mfn_t mfn = _mfn(node_start_pfn(i) + 1);
>>>>>     
>>>>>             printk("NODE%u start->%lu size->%lu free->%lu\n",
>>>>>                    i, node_start_pfn(i), node_spanned_pages(i),
>>>>>                    avail_node_heap_pages(i));
>>>>> -        /* Sanity check phys_to_nid() */
>>>>> -        if ( phys_to_nid(pa) != i )
>>>>> -            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>>>>> -                   pa, phys_to_nid(pa), i);
>>>>> +        /* Sanity check mfn_to_nid() */
>>>>> +        if ( node_spanned_pages(i) && mfn_to_nid(mfn) != i )
>>>>
>>>>
>>>>    From the commit message, I would have expected that we would only
>>>> replace phys_to_nid() with either mfn_to_nid() or page_to_nid().
>>>> However, here you added node_spanned_pages(). Can you explain why?
>>>
>>> Oh, indeed, I meant to say a word on this but then forgot. This
>>> simply is because the adding of 1 to the start PFN (which by
>>> itself is imo a little funny) makes it so that the printk()
>>> inside the conditional would be certain to be called for an
>>> empty (e.g. CPU-only) node.
>>
>> Ok. I think this wants to be a separate patch, as this sounds like a bug
>> and we should avoid mixing code conversion with a bug fix.
> 
> Yet then this is only in a debug key handler. (Else I would have made
> it a separate patch, yes.)

IMO, the fact it is a debug key handler doesn't matter. While I am 
generally OK with a minor cleanup swept into a patch modifying the 
behavior, I think the other way around is quite confusing. And therefore, 
I would rather prefer the split unless another maintainer thinks otherwise.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses
  2022-12-13 11:36 ` [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses Jan Beulich
@ 2022-12-14  3:28   ` Wei Chen
  2022-12-14  7:44     ` Jan Beulich
  2022-12-16 19:24   ` Andrew Cooper
  1 sibling, 1 reply; 15+ messages in thread
From: Wei Chen @ 2022-12-14  3:28 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Hi Jan,

On 2022/12/13 19:36, Jan Beulich wrote:
> With phys_to_nid() now actively checking that a valid node ID is on
> record, the two uses in paging_init() can actually trigger at least the
> 2nd of the assertions there. They're used to calculate allocation flags,
> but the calculated flags wouldn't be used when dealing with an invalid
> (unpopulated) address range. Defer the calculations such that they can
> be done with a validated MFN in hands. This also does away with the
> artificial calculations of an address to pass to phys_to_nid().
> 
> Note that while the variable is provably written before use, at least
> some compiler versions can't actually verify that. Hence the variable
> also needs to gain a (dead) initializer.
> 
> Fixes: e9c72d524fbd ("xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: With small enough a NUMA hash shift it would still be possible to
>       hit an SRAT hole, despite mfn_valid() passing. Hence, like was the
>       original plan, it may still be necessary to relax the checking in
>       phys_to_nid() (or its designated replacements). At which point the
>       value of this change here would shrink to merely reducing the
>       chance of unintentionally doing NUMA_NO_NODE allocations.
> 

I think it's better to place the last sentence, or the whole RFC remark,
in the commit log. Without the RFC content, when I check this commit 
again after a while, I will be confused about what problem it solved, 
because just looking at the changes, as you said in the RFC, it doesn't 
completely solve the problem.

Cheers,
Wei Chen

> --- a/xen/arch/x86/x86_64/mm.c
> +++ b/xen/arch/x86/x86_64/mm.c
> @@ -498,7 +498,7 @@ error:
>   void __init paging_init(void)
>   {
>       unsigned long i, mpt_size, va;
> -    unsigned int n, memflags;
> +    unsigned int n, memflags = 0;
>       l3_pgentry_t *l3_ro_mpt;
>       l2_pgentry_t *pl2e = NULL, *l2_ro_mpt = NULL;
>       struct page_info *l1_pg;
> @@ -547,8 +547,6 @@ void __init paging_init(void)
>       {
>           BUILD_BUG_ON(RO_MPT_VIRT_START & ((1UL << L3_PAGETABLE_SHIFT) - 1));
>           va = RO_MPT_VIRT_START + (i << L2_PAGETABLE_SHIFT);
> -        memflags = MEMF_node(phys_to_nid(i <<
> -            (L2_PAGETABLE_SHIFT - 3 + PAGE_SHIFT)));
>   
>           if ( cpu_has_page1gb &&
>                !((unsigned long)pl2e & ~PAGE_MASK) &&
> @@ -559,10 +557,15 @@ void __init paging_init(void)
>               for ( holes = k = 0; k < 1 << PAGETABLE_ORDER; ++k)
>               {
>                   for ( n = 0; n < CNT; ++n)
> -                    if ( mfn_valid(_mfn(MFN(i + k) + n * PDX_GROUP_COUNT)) )
> +                {
> +                    mfn = _mfn(MFN(i + k) + n * PDX_GROUP_COUNT);
> +                    if ( mfn_valid(mfn) )
>                           break;
> +                }
>                   if ( n == CNT )
>                       ++holes;
> +                else if ( k == holes )
> +                    memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
>               }
>               if ( k == holes )
>               {
> @@ -593,8 +596,14 @@ void __init paging_init(void)
>           }
>   
>           for ( n = 0; n < CNT; ++n)
> -            if ( mfn_valid(_mfn(MFN(i) + n * PDX_GROUP_COUNT)) )
> +        {
> +            mfn = _mfn(MFN(i) + n * PDX_GROUP_COUNT);
> +            if ( mfn_valid(mfn) )
> +            {
> +                memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
>                   break;
> +            }
> +        }
>           if ( n == CNT )
>               l1_pg = NULL;
>           else if ( (l1_pg = alloc_domheap_pages(NULL, PAGETABLE_ORDER,
> @@ -663,15 +672,19 @@ void __init paging_init(void)
>                    sizeof(*compat_machine_to_phys_mapping));
>       for ( i = 0; i < (mpt_size >> L2_PAGETABLE_SHIFT); i++, pl2e++ )
>       {
> -        memflags = MEMF_node(phys_to_nid(i <<
> -            (L2_PAGETABLE_SHIFT - 2 + PAGE_SHIFT)));
>           for ( n = 0; n < CNT; ++n)
> -            if ( mfn_valid(_mfn(MFN(i) + n * PDX_GROUP_COUNT)) )
> +        {
> +            mfn = _mfn(MFN(i) + n * PDX_GROUP_COUNT);
> +            if ( mfn_valid(mfn) )
> +            {
> +                memflags = MEMF_node(phys_to_nid(mfn_to_maddr(mfn)));
>                   break;
> +            }
> +        }
>           if ( n == CNT )
>               continue;
>           if ( (l1_pg = alloc_domheap_pages(NULL, PAGETABLE_ORDER,
> -                                               memflags)) == NULL )
> +                                          memflags)) == NULL )
>               goto nomem;
>           map_pages_to_xen(
>               RDWR_COMPAT_MPT_VIRT_START + (i << L2_PAGETABLE_SHIFT),
> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses
  2022-12-14  3:28   ` Wei Chen
@ 2022-12-14  7:44     ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2022-12-14  7:44 UTC (permalink / raw)
  To: Wei Chen
  Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné, xen-devel

On 14.12.2022 04:28, Wei Chen wrote:
> Hi Jan,
> 
> On 2022/12/13 19:36, Jan Beulich wrote:
>> With phys_to_nid() now actively checking that a valid node ID is on
>> record, the two uses in paging_init() can actually trigger at least the
>> 2nd of the assertions there. They're used to calculate allocation flags,
>> but the calculated flags wouldn't be used when dealing with an invalid
>> (unpopulated) address range. Defer the calculations such that they can
>> be done with a validated MFN in hands. This also does away with the
>> artificial calculations of an address to pass to phys_to_nid().
>>
>> Note that while the variable is provably written before use, at least
>> some compiler versions can't actually verify that. Hence the variable
>> also needs to gain a (dead) initializer.
>>
>> Fixes: e9c72d524fbd ("xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: With small enough a NUMA hash shift it would still be possible to
>>       hit an SRAT hole, despite mfn_valid() passing. Hence, like was the
>>       original plan, it may still be necessary to relax the checking in
>>       phys_to_nid() (or its designated replacements). At which point the
>>       value of this change here would shrink to merely reducing the
>>       chance of unintentionally doing NUMA_NO_NODE allocations.
>>
> 
> I think it's better to place the last sentence, or the whole RFC remark,
> in the commit log. Without the RFC content, when I check this commit 
> again after a while, I will be confused about what problem it solved, 
> because just looking at the changes, as you said in the RFC, it doesn't 
> completely solve the problem.

Moving some/all of this to the commit message is one of the ways to
resolve this RFC, yes. But the other one (flipping the order of the
two patches and making mfn_to_nid() do weaker checking than page_to_nid())
is one where the commit message here would need re-writing anyway. IOW
the primary question here is what route to go.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-13 11:38 ` [PATCH 2/2] NUMA: replace phys_to_nid() Jan Beulich
  2022-12-13 12:06   ` Julien Grall
@ 2022-12-16 11:49   ` Andrew Cooper
  2022-12-16 11:59     ` Jan Beulich
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Cooper @ 2022-12-16 11:49 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Roger Pau Monne, Bertrand Marquis, Volodymyr Babchuk

On 13/12/2022 11:38 am, Jan Beulich wrote:
> All callers convert frame numbers (perhaps in turn derived from struct
> page_info pointers) to an address, just for the function to convert it
> back to a frame number (as the first step of paddr_to_pdx()). Replace
> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
> call sites by the respectively most suitable one.
>
> While there also introduce a !NUMA stub, eliminating the need for Arm
> (and potentially other ports) to carry one individually.

Thanks.  This will help RISC-V too.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, albeit with one
deletion.

> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -1,6 +1,7 @@
>  #ifndef _XEN_NUMA_H
>  #define _XEN_NUMA_H
>  
> +#include <xen/mm-frame.h>
>  #include <asm/numa.h>
>  
>  #define NUMA_NO_NODE     0xFF
> @@ -68,12 +69,15 @@ struct node_data {
>  
>  extern struct node_data node_data[];
>  
> -static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
> +static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
>  {
>      nodeid_t nid;
> -    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
> -    nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
> +    unsigned long pdx = mfn_to_pdx(mfn);
> +
> +    ASSERT((pdx >> memnode_shift) < memnodemapsize);
> +    nid = memnodemap[pdx >> memnode_shift];
>      ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
> +
>      return nid;
>  }
>  
> @@ -102,6 +106,15 @@ extern bool numa_update_node_memblks(nod
>                                       paddr_t start, paddr_t size, bool hotplug);
>  extern void numa_set_processor_nodes_parsed(nodeid_t node);
>  
> +#else
> +
> +static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
> +{
> +    return 0;
> +}

pure is useless on a stub like this, whereas it's false on the non-stub
form (uses several non-const variables) in a way that the compiler can
prove (because it's static inline), and will discard.

As you're modifying both lines anyway, just drop the attribute.

~Andrew

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-16 11:49   ` Andrew Cooper
@ 2022-12-16 11:59     ` Jan Beulich
  2022-12-16 14:27       ` Andrew Cooper
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2022-12-16 11:59 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Roger Pau Monne, Bertrand Marquis, Volodymyr Babchuk, xen-devel

On 16.12.2022 12:49, Andrew Cooper wrote:
> On 13/12/2022 11:38 am, Jan Beulich wrote:
>> All callers convert frame numbers (perhaps in turn derived from struct
>> page_info pointers) to an address, just for the function to convert it
>> back to a frame number (as the first step of paddr_to_pdx()). Replace
>> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
>> call sites by the respectively most suitable one.
>>
>> While there also introduce a !NUMA stub, eliminating the need for Arm
>> (and potentially other ports) to carry one individually.
> 
> Thanks.  This will help RISC-V too.
> 
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,

Thanks. You realize though that the patch may change depending on the
verdict on patch 1 (and, if that one's to change, the two likely
flipped with the actual fix moving here in the form of more relaxed
assertions, one way or another)?

> albeit with one deletion.
> 
>> --- a/xen/include/xen/numa.h
>> +++ b/xen/include/xen/numa.h
>> @@ -1,6 +1,7 @@
>>  #ifndef _XEN_NUMA_H
>>  #define _XEN_NUMA_H
>>  
>> +#include <xen/mm-frame.h>
>>  #include <asm/numa.h>
>>  
>>  #define NUMA_NO_NODE     0xFF
>> @@ -68,12 +69,15 @@ struct node_data {
>>  
>>  extern struct node_data node_data[];
>>  
>> -static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
>> +static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
>>  {
>>      nodeid_t nid;
>> -    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
>> -    nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
>> +    unsigned long pdx = mfn_to_pdx(mfn);
>> +
>> +    ASSERT((pdx >> memnode_shift) < memnodemapsize);
>> +    nid = memnodemap[pdx >> memnode_shift];
>>      ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
>> +
>>      return nid;
>>  }
>>  
>> @@ -102,6 +106,15 @@ extern bool numa_update_node_memblks(nod
>>                                       paddr_t start, paddr_t size, bool hotplug);
>>  extern void numa_set_processor_nodes_parsed(nodeid_t node);
>>  
>> +#else
>> +
>> +static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
>> +{
>> +    return 0;
>> +}
> 
> pure is useless on a stub like this, whereas it's false on the non-stub
> form (uses several non-const variables) in a way that the compiler can
> prove (because it's static inline), and will discard.
> 
> As you're modifying both lines anyway, just drop the attribute.

Hmm, yes, I agree for the stub, so I've dropped it there. "Several non-
const variables", however, is only partly true. These are __ro_after_init
and not written anymore once set. Are you sure the compiler will ignore
a "pure" attribute if it finds it (formally) violated? That would be
somewhat odd, as it means differing behavior depending on whether the
same piece of code is in an inline or out-of-line function.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] NUMA: replace phys_to_nid()
  2022-12-16 11:59     ` Jan Beulich
@ 2022-12-16 14:27       ` Andrew Cooper
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Cooper @ 2022-12-16 14:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Roger Pau Monne, Bertrand Marquis, Volodymyr Babchuk, xen-devel

On 16/12/2022 11:59 am, Jan Beulich wrote:
> On 16.12.2022 12:49, Andrew Cooper wrote:
>> On 13/12/2022 11:38 am, Jan Beulich wrote:
>>> All callers convert frame numbers (perhaps in turn derived from struct
>>> page_info pointers) to an address, just for the function to convert it
>>> back to a frame number (as the first step of paddr_to_pdx()). Replace
>>> the function by mfn_to_nid() plus a page_to_nid() wrapper macro. Replace
>>> call sites by the respectively most suitable one.
>>>
>>> While there also introduce a !NUMA stub, eliminating the need for Arm
>>> (and potentially other ports) to carry one individually.
>> Thanks.  This will help RISC-V too.
>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,
> Thanks. You realize though that the patch may change depending on the
> verdict on patch 1 (and, if that one's to change, the two likely
> flipped with the actual fix moving here in the form of more relaxed
> assertions, one way or another)?

Yeah, the tweak sounded entirely reasonable.

>
>> albeit with one deletion.
>>
>>> --- a/xen/include/xen/numa.h
>>> +++ b/xen/include/xen/numa.h
>>> @@ -1,6 +1,7 @@
>>>  #ifndef _XEN_NUMA_H
>>>  #define _XEN_NUMA_H
>>>  
>>> +#include <xen/mm-frame.h>
>>>  #include <asm/numa.h>
>>>  
>>>  #define NUMA_NO_NODE     0xFF
>>> @@ -68,12 +69,15 @@ struct node_data {
>>>  
>>>  extern struct node_data node_data[];
>>>  
>>> -static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
>>> +static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
>>>  {
>>>      nodeid_t nid;
>>> -    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
>>> -    nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
>>> +    unsigned long pdx = mfn_to_pdx(mfn);
>>> +
>>> +    ASSERT((pdx >> memnode_shift) < memnodemapsize);
>>> +    nid = memnodemap[pdx >> memnode_shift];
>>>      ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
>>> +
>>>      return nid;
>>>  }
>>>  
>>> @@ -102,6 +106,15 @@ extern bool numa_update_node_memblks(nod
>>>                                       paddr_t start, paddr_t size, bool hotplug);
>>>  extern void numa_set_processor_nodes_parsed(nodeid_t node);
>>>  
>>> +#else
>>> +
>>> +static inline nodeid_t __attribute_pure__ mfn_to_nid(mfn_t mfn)
>>> +{
>>> +    return 0;
>>> +}
>> pure is useless on a stub like this, whereas it's false on the non-stub
>> form (uses several non-const variables) in a way that the compiler can
>> prove (because it's static inline), and will discard.
>>
>> As you're modifying both lines anyway, just drop the attribute.
> Hmm, yes, I agree for the stub, so I've dropped it there. "Several non-
> const variables", however, is only partly true. These are __ro_after_init
> and not written anymore once set.

They're still read-write as far as C is concerned, and some of these
uses are before modifications finish.

>  Are you sure the compiler will ignore
> a "pure" attribute if it finds it (formally) violated?

Yes, very sure.  It got discussed at length on one of the speculation lists.

When the compiler can prove that the programmer doesn't know the rules
concerning pure/const, the attributes will be discarded.

To abuse the rules, you really do need the operation hidden in a place
that GCC can't see, so either a separate translation unit, or in inline
assembly.

~Andrew

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses
  2022-12-13 11:36 ` [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses Jan Beulich
  2022-12-14  3:28   ` Wei Chen
@ 2022-12-16 19:24   ` Andrew Cooper
  2022-12-19  7:14     ` Jan Beulich
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Cooper @ 2022-12-16 19:24 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monne

On 13/12/2022 11:36 am, Jan Beulich wrote:
> With phys_to_nid() now actively checking that a valid node ID is on
> record, the two uses in paging_init() can actually trigger at least the
> 2nd of the assertions there. They're used to calculate allocation flags,
> but the calculated flags wouldn't be used when dealing with an invalid
> (unpopulated) address range. Defer the calculations such that they can
> be done with a validated MFN in hands. This also does away with the
> artificial calculations of an address to pass to phys_to_nid().
>
> Note that while the variable is provably written before use, at least
> some compiler versions can't actually verify that. Hence the variable
> also needs to gain a (dead) initializer.

I'm not surprised in the slightest that GCC can't prove that it is
always initialised.  I suspect a lot of humans would struggle too.

> Fixes: e9c72d524fbd ("xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This does appear to fix things.  (Testing hasn't finished yet, but all
systems have installed, and they didn't get that far previously).

> ---
> RFC: With small enough a NUMA hash shift it would still be possible to
>      hit an SRAT hole, despite mfn_valid() passing. Hence, like was the
>      original plan, it may still be necessary to relax the checking in
>      phys_to_nid() (or its designated replacements). At which point the
>      value of this change here would shrink to merely reducing the
>      chance of unintentionally doing NUMA_NO_NODE allocations.

Why does the NUMA shift matter?  Can't this occur for badly constructed
SRAT tables too?


Nevertheless, this is a clear improvement over what's currently in tree,
so I'm going to commit it to try and unblock OSSTest.  The tree has been
blocked for too long.  Further adjustments can come in due course.

~Andrew

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses
  2022-12-16 19:24   ` Andrew Cooper
@ 2022-12-19  7:14     ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2022-12-19  7:14 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Wei Liu, Roger Pau Monne, xen-devel

On 16.12.2022 20:24, Andrew Cooper wrote:
> On 13/12/2022 11:36 am, Jan Beulich wrote:
>> RFC: With small enough a NUMA hash shift it would still be possible to
>>      hit an SRAT hole, despite mfn_valid() passing. Hence, like was the
>>      original plan, it may still be necessary to relax the checking in
>>      phys_to_nid() (or its designated replacements). At which point the
>>      value of this change here would shrink to merely reducing the
>>      chance of unintentionally doing NUMA_NO_NODE allocations.
> 
> Why does the NUMA shift matter?  Can't this occur for badly constructed
> SRAT tables too?

Well, the NUMA hash shift is computed from the SRAT table entries, so
often "badly constructed" => "too small shift".

> Nevertheless, this is a clear improvement over what's currently in tree,
> so I'm going to commit it to try and unblock OSSTest.  The tree has been
> blocked for too long.  Further adjustments can come in due course.

Thanks. And I see it has unblocked the tree.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-12-19  7:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-13 11:35 [PATCH 0/2] NUMA: phys_to_nid() related adjustments Jan Beulich
2022-12-13 11:36 ` [PATCH 1/2] x86/mm: avoid phys_to_nid() calls for invalid addresses Jan Beulich
2022-12-14  3:28   ` Wei Chen
2022-12-14  7:44     ` Jan Beulich
2022-12-16 19:24   ` Andrew Cooper
2022-12-19  7:14     ` Jan Beulich
2022-12-13 11:38 ` [PATCH 2/2] NUMA: replace phys_to_nid() Jan Beulich
2022-12-13 12:06   ` Julien Grall
2022-12-13 12:46     ` Jan Beulich
2022-12-13 13:48       ` Julien Grall
2022-12-13 14:08         ` Jan Beulich
2022-12-13 21:33           ` Julien Grall
2022-12-16 11:49   ` Andrew Cooper
2022-12-16 11:59     ` Jan Beulich
2022-12-16 14:27       ` Andrew Cooper
