* [PATCH v3] Xen: Spread boot time page scrubbing across all available CPUs
@ 2014-04-11 18:08 Konrad Rzeszutek Wilk
  2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-11 18:08 UTC (permalink / raw)
  To: dario.faggioli, tim, keir, andrew.cooper3, xen-devel, JBeulich,
	malcolm.crossley

Hey,

Please see v3 of this patchset. It should have all review comments addressed.
I did change the algorithm for the NUMA-node-but-no-CPUs case. Instead of
using just one CPU (the boot CPU), it now spreads the work across all of the
CPUs. The first pass retains the same algorithm.

 docs/misc/xen-command-line.markdown |   10 ++
 xen/common/page_alloc.c             |  177 +++++++++++++++++++++++++++++++----
 
Malcolm Crossley (1):
      Xen: Spread boot time page scrubbing across all available CPU's


* [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-11 18:08 [PATCH v3] Xen: Spread boot time page scrubbing across all available CPUs Konrad Rzeszutek Wilk
@ 2014-04-11 18:08 ` Konrad Rzeszutek Wilk
  2014-04-11 18:38   ` Andrew Cooper
                     ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-11 18:08 UTC (permalink / raw)
  To: dario.faggioli, tim, keir, andrew.cooper3, xen-devel, JBeulich,
	malcolm.crossley

From: Malcolm Crossley <malcolm.crossley@citrix.com>

The page scrubbing is done in 128MB chunks in lockstep across all the CPU's.
This allows for the boot CPU to hold the heap_lock whilst each chunk is being
scrubbed and then release the heap_lock when all CPU's are finished scrubing
their individual chunk. This allows for the heap_lock to not be held
continously and for pending softirqs are to be serviced periodically across
all CPU's.

The page scrub memory chunks are allocated to the CPU's in a NUMA aware
fashion to reduce socket interconnect overhead and improve performance.
Specifically in the first phase we scrub at the same time on all the
NUMA nodes that have CPUs - we also weed out the SMT threads so that
we only use cores (that gives a 50% boost). The second phase is to use
all of the CPUs for the NUMA nodes that have no CPUs.

This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron
6386 machine from 49 seconds to 3 seconds.
On a IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes
to 117 seconds.

v2
 - Reduced default chunk size to 128MB
 - Added code to scrub NUMA nodes with no active CPU linked to them
 - Be robust to boot CPU not being linked to a NUMA node

v3:
 - Don't use SMT threads
 - Take care of remainder if the number of CPUs (or memory) is odd
 - Restructure the worker thread
 - s/u64/unsigned long/

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 docs/misc/xen-command-line.markdown |   10 ++
 xen/common/page_alloc.c             |  177 +++++++++++++++++++++++++++++++----
 2 files changed, 167 insertions(+), 20 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 87de2dc..a7da227 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -197,6 +197,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
 accidentally leaking sensitive VM data into other VMs if Xen crashes
 and reboots.
 
+### bootscrub_chunk_
+> `= <size>`
+
+> Default: `128MiB`
+
+Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
+and not running softirqs. Reduce this if softirqs are not being run frequently
+enough. Setting this to a high value may cause cause boot failure, particularly
+if the NMI watchdog is also enabled.
+
 ### cachesize
 > `= <size>`
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 601319c..3ad6e1d 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata = 1;
 boolean_param("bootscrub", opt_bootscrub);
 
 /*
+ * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held
+ */
+static unsigned int __initdata opt_bootscrub_chunk = 128 * 1024 * 1024;
+size_param("bootscrub_chunk", opt_bootscrub_chunk);
+
+/*
  * Bit width of the DMA heap -- used to override NUMA-node-first.
  * allocation strategy, which can otherwise exhaust low memory.
  */
@@ -90,6 +96,16 @@ static struct bootmem_region {
 } *__initdata bootmem_region_list;
 static unsigned int __initdata nr_bootmem_regions;
 
+struct scrub_region {
+    unsigned long offset;
+    unsigned long start;
+    unsigned long per_cpu_sz;
+    unsigned long rem;
+    cpumask_t cpu;
+};
+static struct scrub_region __initdata region[MAX_NUMNODES];
+static unsigned long __initdata chunk_size;
+
 static void __init boot_bug(int line)
 {
     panic("Boot BUG at %s:%d", __FILE__, line);
@@ -1256,45 +1272,166 @@ void __init end_boot_allocator(void)
     printk("\n");
 }
 
+void __init smp_scrub_heap_pages(void *data)
+{
+    unsigned long mfn, start, end;
+    struct page_info *pg;
+    struct scrub_region *r;
+    unsigned int temp_cpu, node, cpu_idx = 0;
+    unsigned int cpu = smp_processor_id();
+
+    if ( data )
+        r = data;
+    else {
+        node = cpu_to_node(cpu);
+        if ( node == NUMA_NO_NODE )
+            return;
+        r = &region[node];
+    }
+    ASSERT(r != NULL);
+
+    /* Determine the current CPU's index into CPU's linked to this node*/
+    for_each_cpu ( temp_cpu, &r->cpu )
+    {
+        if ( cpu == temp_cpu )
+            break;
+        cpu_idx++;
+    }
+
+    /* Calculate the starting mfn for this CPU's memory block */
+    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
+
+    /* Calculate the end mfn into this CPU's memory block for this iteration */
+    if ( r->offset + chunk_size > r->per_cpu_sz ) {
+        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
+        if ( r->rem && ((cpumask_weight(&r->cpu) - 1 == cpu_idx )) )
+            end += r->rem;
+    }
+    else
+        end = start + chunk_size;
+
+    for ( mfn = start; mfn < end; mfn++ )
+    {
+        pg = mfn_to_page(mfn);
+
+        /* Check the mfn is valid and page is free. */
+        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
+            continue;
+
+        scrub_one_page(pg);
+    }
+    wmb();
+}
+
 /*
- * Scrub all unallocated pages in all heap zones. This function is more
- * convoluted than appears necessary because we do not want to continuously
- * hold the lock while scrubbing very large memory areas.
+ * Scrub all unallocated pages in all heap zones. This function uses all
+ * online cpu's to scrub the memory in parallel.
  */
 void __init scrub_heap_pages(void)
 {
-    unsigned long mfn;
-    struct page_info *pg;
+    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }};
+    unsigned int i, j, cpu, sibling;
+    unsigned long offset, max_per_cpu_sz = 0;
+    unsigned long start, end;
+    unsigned long rem = 0;
 
     if ( !opt_bootscrub )
         return;
 
-    printk("Scrubbing Free RAM: ");
+    /* Scrub block size */
+    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
+    if ( chunk_size == 0 )
+        chunk_size = 1;
 
-    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
+    /* Round #0 - figure out amounts and which CPUs to use */
+    for_each_online_node ( i )
     {
+        /* Calculate Node memory start and end address */
+        start = max(node_start_pfn(i), first_valid_mfn);
+        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
+        /* CPUs that are online and on this node (if none, that it is OK */
+        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map);
+        cpumask_copy(&temp_cpus, &node_cpus);
+        /* Rip out threads. */
+        for_each_cpu ( j, &temp_cpus )
+        {
+            cpu = 0;
+            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) {
+                if (cpu++ == 0) /* Skip 1st CPU - the core */
+                    continue;
+                cpumask_clear_cpu(sibling, &node_cpus);
+            }
+        }
+        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
+        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */
+            rem = 0;
+            region[i].per_cpu_sz = (end - start);
+        } else {
+            rem = (end - start) % cpumask_weight(&node_cpus);
+            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
+            if ( region[i].per_cpu_sz > max_per_cpu_sz )
+                max_per_cpu_sz = region[i].per_cpu_sz;
+        }
+        region[i].start = start;
+        region[i].rem = rem;
+        cpumask_copy(&region[i].cpu, &node_cpus);
+
+    }
+    cpu = smp_processor_id();
+    /* Round default chunk size down if required */
+    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
+        chunk_size = max_per_cpu_sz;
+
+    printk("Scrubbing Free RAM on %u nodes using %u CPUs: ", num_online_nodes(),
+           cpumask_weight(&all_worker_cpus));
+
+    /* Round: #1 - do NUMA nodes with CPUs */
+    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
+    {
+        for_each_online_node ( i )
+            region[i].offset = offset;
+
         process_pending_softirqs();
 
-        pg = mfn_to_page(mfn);
+        spin_lock(&heap_lock);
+        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
+        spin_unlock(&heap_lock);
 
-        /* Quick lock-free check. */
-        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
+        printk(".");
+    }
+
+    /* Round #2: NUMA nodes with no CPUs get scrubbed with all CPUs. */
+    for_each_online_node ( i )
+    {
+        node_cpus = node_to_cpumask(i);
+
+        if ( !cpumask_empty(&node_cpus) )
             continue;
 
-        /* Every 100MB, print a progress dot. */
-        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
-            printk(".");
+        /* We already have the node information from round #0 */
+        end = region[i].start + region[i].per_cpu_sz;
+        rem = region[i].per_cpu_sz % cpumask_weight(&all_worker_cpus);
 
-        spin_lock(&heap_lock);
+        region[i].rem = rem;
+        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus);
+        max_per_cpu_sz = region[i].per_cpu_sz;
+        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
+            chunk_size = max_per_cpu_sz;
+        cpumask_copy(&region[i].cpu, &all_worker_cpus);
 
-        /* Re-check page status with lock held. */
-        if ( page_state_is(pg, free) )
-            scrub_one_page(pg);
+        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
+        {
+            region[i].offset = offset;
 
-        spin_unlock(&heap_lock);
-    }
+            process_pending_softirqs();
+
+            spin_lock(&heap_lock);
+            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1);
+            spin_unlock(&heap_lock);
 
-    printk("done.\n");
+            printk(".");
+        }
+    }
 
     /* Now that the heap is initialized, run checks and set bounds
      * for the low mem virq algorithm. */
-- 
1.7.7.6


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
@ 2014-04-11 18:38   ` Andrew Cooper
  2014-04-11 19:19   ` Julien Grall
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2014-04-11 18:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, dario.faggioli, tim, xen-devel, JBeulich, malcolm.crossley

On 11/04/14 19:08, Konrad Rzeszutek Wilk wrote:
> From: Malcolm Crossley <malcolm.crossley@citrix.com>
>
> The page scrubbing is done in 128MB chunks in lockstep across all the CPU's.

No apostrophe in plural CPUs (also further through)

> This allows for the boot CPU to hold the heap_lock whilst each chunk is being
> scrubbed and then release the heap_lock when all CPU's are finished scrubing
> their individual chunk. This allows for the heap_lock to not be held
> continously and for pending softirqs are to be serviced periodically across
> all CPU's.
>
> The page scrub memory chunks are allocated to the CPU's in a NUMA aware
> fashion to reduce socket interconnect overhead and improve performance.
> Specifically in the first phase we scrub at the same time on all the
> NUMA nodes that have CPUs - we also weed out the SMT threads so that
> we only use cores (that gives a 50% boost). The second phase is to use
> all of the CPUs for the NUMA nodes that have no CPUs.
>
> This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron
> 6386 machine from 49 seconds to 3 seconds.
> On a IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes
> to 117 seconds.

The numbers from our 1TB box are also along a similar line, but I don't
have them to hand.

>
> v2
>  - Reduced default chunk size to 128MB
>  - Added code to scrub NUMA nodes with no active CPU linked to them
>  - Be robust to boot CPU not being linked to a NUMA node
>
> v3:
>  - Don't use SMT threads
>  - Take care of remainder if the number of CPUs (or memory) is odd
>  - Restructure the worker thread
>  - s/u64/unsigned long/
>
> Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  docs/misc/xen-command-line.markdown |   10 ++
>  xen/common/page_alloc.c             |  177 +++++++++++++++++++++++++++++++----
>  2 files changed, 167 insertions(+), 20 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 87de2dc..a7da227 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -197,6 +197,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
>  
> +### bootscrub_chunk_

Erroneous trailing underscore, and the first one needs escaping for the
markdown to end up properly formatted.
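
(Presumably something like

    ### bootscrub\_chunk

so the underscore is not taken as emphasis markup.)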

> +> `= <size>`
> +
> +> Default: `128MiB`

`128MB`

128MiB will be rejected by the size parsing code

> +
> +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> +and not running softirqs. Reduce this if softirqs are not being run frequently
> +enough. Setting this to a high value may cause cause boot failure, particularly
> +if the NMI watchdog is also enabled.
> +
>  ### cachesize
>  > `= <size>`
>  
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 601319c..3ad6e1d 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata = 1;
>  boolean_param("bootscrub", opt_bootscrub);
>  
>  /*
> + * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held

Stale comment?

> + */
> +static unsigned int __initdata opt_bootscrub_chunk = 128 * 1024 * 1024;
> +size_param("bootscrub_chunk", opt_bootscrub_chunk);
> +
> +/*
>   * Bit width of the DMA heap -- used to override NUMA-node-first.
>   * allocation strategy, which can otherwise exhaust low memory.
>   */
> @@ -90,6 +96,16 @@ static struct bootmem_region {
>  } *__initdata bootmem_region_list;
>  static unsigned int __initdata nr_bootmem_regions;
>  
> +struct scrub_region {
> +    unsigned long offset;
> +    unsigned long start;
> +    unsigned long per_cpu_sz;
> +    unsigned long rem;
> +    cpumask_t cpu;

cpus surely?

> +};
> +static struct scrub_region __initdata region[MAX_NUMNODES];
> +static unsigned long __initdata chunk_size;
> +
>  static void __init boot_bug(int line)
>  {
>      panic("Boot BUG at %s:%d", __FILE__, line);
> @@ -1256,45 +1272,166 @@ void __init end_boot_allocator(void)
>      printk("\n");
>  }
>  
> +void __init smp_scrub_heap_pages(void *data)
> +{
> +    unsigned long mfn, start, end;
> +    struct page_info *pg;
> +    struct scrub_region *r;
> +    unsigned int temp_cpu, node, cpu_idx = 0;
> +    unsigned int cpu = smp_processor_id();
> +
> +    if ( data )
> +        r = data;
> +    else {
> +        node = cpu_to_node(cpu);
> +        if ( node == NUMA_NO_NODE )
> +            return;
> +        r = &region[node];
> +    }
> +    ASSERT(r != NULL);

Under what conditions would NULL be passed?  Can't the caller do
something more sane?

> +
> +    /* Determine the current CPU's index into CPU's linked to this node*/

Space at the end of the comment

> +    for_each_cpu ( temp_cpu, &r->cpu )
> +    {
> +        if ( cpu == temp_cpu )
> +            break;
> +        cpu_idx++;
> +    }

Is this really the best way to do this?  Perhaps, but it absolutely
should be calculated once on the first chunk scrubbed rather than for
each chunk.  Or better yet, probably calculated by the BSP when it is
splitting up RAM.
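
For example (untested sketch - cpu_node_idx is only an illustrative name, not
something in this patch):

static unsigned int __initdata cpu_node_idx[NR_CPUS];

/* In scrub_heap_pages(), once region[i].cpu has been filled in: */
for_each_online_node ( i )
{
    unsigned int idx = 0, c;

    for_each_cpu ( c, &region[i].cpu )
        cpu_node_idx[c] = idx++;
}

/* ... and in smp_scrub_heap_pages() the lookup becomes: */
cpu_idx = cpu_node_idx[cpu];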

> +
> +    /* Calculate the starting mfn for this CPU's memory block */
> +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
> +
> +    /* Calculate the end mfn into this CPU's memory block for this iteration */
> +    if ( r->offset + chunk_size > r->per_cpu_sz ) {
> +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
> +        if ( r->rem && ((cpumask_weight(&r->cpu) - 1 == cpu_idx )) )
> +            end += r->rem;
> +    }
> +    else
> +        end = start + chunk_size;
> +
> +    for ( mfn = start; mfn < end; mfn++ )
> +    {
> +        pg = mfn_to_page(mfn);
> +
> +        /* Check the mfn is valid and page is free. */
> +        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> +            continue;
> +
> +        scrub_one_page(pg);
> +    }
> +    wmb();

Why this barrier?

> +}
> +
>  /*
> - * Scrub all unallocated pages in all heap zones. This function is more
> - * convoluted than appears necessary because we do not want to continuously
> - * hold the lock while scrubbing very large memory areas.
> + * Scrub all unallocated pages in all heap zones. This function uses all
> + * online cpu's to scrub the memory in parallel.
>   */
>  void __init scrub_heap_pages(void)
>  {
> -    unsigned long mfn;
> -    struct page_info *pg;
> +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }};
> +    unsigned int i, j, cpu, sibling;
> +    unsigned long offset, max_per_cpu_sz = 0;
> +    unsigned long start, end;
> +    unsigned long rem = 0;
>  
>      if ( !opt_bootscrub )
>          return;
>  
> -    printk("Scrubbing Free RAM: ");
> +    /* Scrub block size */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = 1;
>  
> -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> +    /* Round #0 - figure out amounts and which CPUs to use */
> +    for_each_online_node ( i )
>      {
> +        /* Calculate Node memory start and end address */
> +        start = max(node_start_pfn(i), first_valid_mfn);
> +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> +        /* CPUs that are online and on this node (if none, that it is OK */
> +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map);
> +        cpumask_copy(&temp_cpus, &node_cpus);
> +        /* Rip out threads. */
> +        for_each_cpu ( j, &temp_cpus )
> +        {
> +            cpu = 0;
> +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) {
> +                if (cpu++ == 0) /* Skip 1st CPU - the core */
> +                    continue;
> +                cpumask_clear_cpu(sibling, &node_cpus);
> +            }
> +        }
> +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */
> +            rem = 0;
> +            region[i].per_cpu_sz = (end - start);
> +        } else {
> +            rem = (end - start) % cpumask_weight(&node_cpus);
> +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> +                max_per_cpu_sz = region[i].per_cpu_sz;
> +        }
> +        region[i].start = start;
> +        region[i].rem = rem;
> +        cpumask_copy(&region[i].cpu, &node_cpus);
> +
> +    }
> +    cpu = smp_processor_id();

This is surely always 0?, and I don't see it being used any further down....

~Andrew

> +    /* Round default chunk size down if required */
> +    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> +        chunk_size = max_per_cpu_sz;
> +
> +    printk("Scrubbing Free RAM on %u nodes using %u CPUs: ", num_online_nodes(),
> +           cpumask_weight(&all_worker_cpus));
> +
> +    /* Round: #1 - do NUMA nodes with CPUs */
> +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> +    {
> +        for_each_online_node ( i )
> +            region[i].offset = offset;
> +
>          process_pending_softirqs();
>  
> -        pg = mfn_to_page(mfn);
> +        spin_lock(&heap_lock);
> +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
> +        spin_unlock(&heap_lock);
>  
> -        /* Quick lock-free check. */
> -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> +        printk(".");
> +    }
> +
> +    /* Round #2: NUMA nodes with no CPUs get scrubbed with all CPUs. */
> +    for_each_online_node ( i )
> +    {
> +        node_cpus = node_to_cpumask(i);
> +
> +        if ( !cpumask_empty(&node_cpus) )
>              continue;
>  
> -        /* Every 100MB, print a progress dot. */
> -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
> -            printk(".");
> +        /* We already have the node information from round #0 */
> +        end = region[i].start + region[i].per_cpu_sz;
> +        rem = region[i].per_cpu_sz % cpumask_weight(&all_worker_cpus);
>  
> -        spin_lock(&heap_lock);
> +        region[i].rem = rem;
> +        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus);
> +        max_per_cpu_sz = region[i].per_cpu_sz;
> +        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> +            chunk_size = max_per_cpu_sz;
> +        cpumask_copy(&region[i].cpu, &all_worker_cpus);
>  
> -        /* Re-check page status with lock held. */
> -        if ( page_state_is(pg, free) )
> -            scrub_one_page(pg);
> +        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> +        {
> +            region[i].offset = offset;
>  
> -        spin_unlock(&heap_lock);
> -    }
> +            process_pending_softirqs();
> +
> +            spin_lock(&heap_lock);
> +            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1);
> +            spin_unlock(&heap_lock);
>  
> -    printk("done.\n");
> +            printk(".");
> +        }
> +    }
>  
>      /* Now that the heap is initialized, run checks and set bounds
>       * for the low mem virq algorithm. */


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
  2014-04-11 18:38   ` Andrew Cooper
@ 2014-04-11 19:19   ` Julien Grall
  2014-04-12 13:19     ` Konrad Rzeszutek Wilk
  2014-04-13 22:38   ` Tim Deegan
  2014-04-14  9:39   ` Jan Beulich
  3 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-04-11 19:19 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, andrew.cooper3, dario.faggioli, tim, xen-devel, JBeulich,
	malcolm.crossley

Hi Konrad,

This patch will break ARM build (see below).

On 04/11/2014 07:08 PM, Konrad Rzeszutek Wilk wrote:
> +void __init smp_scrub_heap_pages(void *data)
> +{
> +    unsigned long mfn, start, end;
> +    struct page_info *pg;
> +    struct scrub_region *r;
> +    unsigned int temp_cpu, node, cpu_idx = 0;
> +    unsigned int cpu = smp_processor_id();
> +
> +    if ( data )
> +        r = data;
> +    else {

else
{

> +        node = cpu_to_node(cpu);
> +        if ( node == NUMA_NO_NODE )
> +            return;
> +        r = &region[node];
> +    }
> +    ASSERT(r != NULL);
> +
> +    /* Determine the current CPU's index into CPU's linked to this node*/
> +    for_each_cpu ( temp_cpu, &r->cpu )
> +    {
> +        if ( cpu == temp_cpu )
> +            break;
> +        cpu_idx++;
> +    }
> +
> +    /* Calculate the starting mfn for this CPU's memory block */
> +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
> +
> +    /* Calculate the end mfn into this CPU's memory block for this iteration */
> +    if ( r->offset + chunk_size > r->per_cpu_sz ) {

if ( ... )
{

[..]

>  void __init scrub_heap_pages(void)
>  {
> -    unsigned long mfn;
> -    struct page_info *pg;
> +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }};
> +    unsigned int i, j, cpu, sibling;
> +    unsigned long offset, max_per_cpu_sz = 0;
> +    unsigned long start, end;
> +    unsigned long rem = 0;
>  
>      if ( !opt_bootscrub )
>          return;
>  
> -    printk("Scrubbing Free RAM: ");
> +    /* Scrub block size */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = 1;
>  
> -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> +    /* Round #0 - figure out amounts and which CPUs to use */
> +    for_each_online_node ( i )
>      {
> +        /* Calculate Node memory start and end address */
> +        start = max(node_start_pfn(i), first_valid_mfn);

node_start_pfn doesn't exist on ARM.
As we don't yet support NUMA, you will have to add a dummy #define in
asm-arm/numa.h which will return the correct value.

> +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> +        /* CPUs that are online and on this node (if none, that it is OK */
> +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map);
> +        cpumask_copy(&temp_cpus, &node_cpus);
> +        /* Rip out threads. */
> +        for_each_cpu ( j, &temp_cpus )
> +        {
> +            cpu = 0;
> +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) {

for_each_cpu( ... )
{

> +                if (cpu++ == 0) /* Skip 1st CPU - the core */

if ( ... )

> +                    continue;
> +                cpumask_clear_cpu(sibling, &node_cpus);
> +            }
> +        }
> +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */

if ( ... )
{

Regards,

-- 
Julien Grall


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-11 19:19   ` Julien Grall
@ 2014-04-12 13:19     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-12 13:19 UTC (permalink / raw)
  To: Julien Grall
  Cc: keir, andrew.cooper3, dario.faggioli, tim, xen-devel,
	Konrad Rzeszutek Wilk, JBeulich, malcolm.crossley

On Fri, Apr 11, 2014 at 08:19:27PM +0100, Julien Grall wrote:
> Hi Konrad,
> 
> This patch will break ARM build (see below).
> 
> On 04/11/2014 07:08 PM, Konrad Rzeszutek Wilk wrote:
> > +void __init smp_scrub_heap_pages(void *data)
> > +{
> > +    unsigned long mfn, start, end;
> > +    struct page_info *pg;
> > +    struct scrub_region *r;
> > +    unsigned int temp_cpu, node, cpu_idx = 0;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    if ( data )
> > +        r = data;
> > +    else {
> 
> else
> {

Right.
> 
> > +        node = cpu_to_node(cpu);
> > +        if ( node == NUMA_NO_NODE )
> > +            return;
> > +        r = &region[node];
> > +    }
> > +    ASSERT(r != NULL);
> > +
> > +    /* Determine the current CPU's index into CPU's linked to this node*/
> > +    for_each_cpu ( temp_cpu, &r->cpu )
> > +    {
> > +        if ( cpu == temp_cpu )
> > +            break;
> > +        cpu_idx++;
> > +    }
> > +
> > +    /* Calculate the starting mfn for this CPU's memory block */
> > +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
> > +
> > +    /* Calculate the end mfn into this CPU's memory block for this iteration */
> > +    if ( r->offset + chunk_size > r->per_cpu_sz ) {
> 
> if ( ... )
> {
> 
> [..]
<sigh>
> 
> >  void __init scrub_heap_pages(void)
> >  {
> > -    unsigned long mfn;
> > -    struct page_info *pg;
> > +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }};
> > +    unsigned int i, j, cpu, sibling;
> > +    unsigned long offset, max_per_cpu_sz = 0;
> > +    unsigned long start, end;
> > +    unsigned long rem = 0;
> >  
> >      if ( !opt_bootscrub )
> >          return;
> >  
> > -    printk("Scrubbing Free RAM: ");
> > +    /* Scrub block size */
> > +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> > +    if ( chunk_size == 0 )
> > +        chunk_size = 1;
> >  
> > -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> > +    /* Round #0 - figure out amounts and which CPUs to use */
> > +    for_each_online_node ( i )
> >      {
> > +        /* Calculate Node memory start and end address */
> > +        start = max(node_start_pfn(i), first_valid_mfn);
> 
> node_start_pfn doesn't exist on ARM.
> As we don't yet support NUMA, you will have to add a dummy #define in
> asm-arm/numa.h which will return the correct value.

OK. Will add that.
> 
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> > +        /* CPUs that are online and on this node (if none, that it is OK */
> > +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map);
> > +        cpumask_copy(&temp_cpus, &node_cpus);
> > +        /* Rip out threads. */
> > +        for_each_cpu ( j, &temp_cpus )
> > +        {
> > +            cpu = 0;
> > +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) {
> 
> for_each_cpu( ... )
> {
> 
> > +                if (cpu++ == 0) /* Skip 1st CPU - the core */
> 
> if ( ... )
> 
> > +                    continue;
> > +                cpumask_clear_cpu(sibling, &node_cpus);
> > +            }
> > +        }
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> > +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */
> 
> if ( ... )
> {
> 
> Regards,

Thanks for your review!
> 
> -- 
> Julien Grall
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
  2014-04-11 18:38   ` Andrew Cooper
  2014-04-11 19:19   ` Julien Grall
@ 2014-04-13 22:38   ` Tim Deegan
  2014-04-14  9:39   ` Jan Beulich
  3 siblings, 0 replies; 13+ messages in thread
From: Tim Deegan @ 2014-04-13 22:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, andrew.cooper3, dario.faggioli, xen-devel, JBeulich,
	malcolm.crossley

Hi,

Thanks for picking this up!

One or two comments:

At 14:08 -0400 on 11 Apr (1397221681), Konrad Rzeszutek Wilk wrote:
> +    /* Calculate the end mfn into this CPU's memory block for this iteration */
> +    if ( r->offset + chunk_size > r->per_cpu_sz ) {
> +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
> +        if ( r->rem && ((cpumask_weight(&r->cpu) - 1 == cpu_idx )) )
> +            end += r->rem;

I think the outer gate here needs to be a '>=' to handle the case where 
per_cpu_sz % chunk_size == 0 but node_spanned_pages() % per_cpu_sz != 0.
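
For instance, with per_cpu_sz == 2 * chunk_size the final iteration has
r->offset + chunk_size == r->per_cpu_sz, so the branch that clamps 'end' and
adds r->rem is never taken; i.e. presumably:

    if ( r->offset + chunk_size >= r->per_cpu_sz ) {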

> +    }
> +    else
> +        end = start + chunk_size;
> +

[...]

> +    /* Round default chunk size down if required */
> +    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> +        chunk_size = max_per_cpu_sz;

Is this necessary?  The worker function should DTRT with a too-large
chunk size.  (Likewise in round #2)

Cheers,

Tim.


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
                     ` (2 preceding siblings ...)
  2014-04-13 22:38   ` Tim Deegan
@ 2014-04-14  9:39   ` Jan Beulich
  2014-04-14 15:04     ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2014-04-14  9:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, andrew.cooper3, dario.faggioli, tim, xen-devel, malcolm.crossley

>>> On 11.04.14 at 20:08, <konrad@kernel.org> wrote:

(I'll try to not duplicate comments others already made.)

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata = 1;
>  boolean_param("bootscrub", opt_bootscrub);
>  
>  /*
> + * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held
> + */
> +static unsigned int __initdata opt_bootscrub_chunk = 128 * 1024 * 1024;

MB(128). Also perhaps better "unsigned long" to allow the value to
be 4Gb or above? Or alternatively make this a page count instead
of a byte one.
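
E.g. (illustrative only, combining both of those):

    static unsigned long __initdata opt_bootscrub_chunk = MB(128);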

> @@ -1256,45 +1272,166 @@ void __init end_boot_allocator(void)
>      printk("\n");
>  }
>  
> +void __init smp_scrub_heap_pages(void *data)

static

> +{
> +    unsigned long mfn, start, end;
> +    struct page_info *pg;
> +    struct scrub_region *r;
> +    unsigned int temp_cpu, node, cpu_idx = 0;
> +    unsigned int cpu = smp_processor_id();
> +
> +    if ( data )
> +        r = data;
> +    else {
> +        node = cpu_to_node(cpu);
> +        if ( node == NUMA_NO_NODE )
> +            return;
> +        r = &region[node];
> +    }
> +    ASSERT(r != NULL);
> +
> +    /* Determine the current CPU's index into CPU's linked to this node*/

There's not just a blank missing here, but also a stop (and similarly on
various other comments).

>  void __init scrub_heap_pages(void)
>  {
> -    unsigned long mfn;
> -    struct page_info *pg;
> +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }};

No open coding like this please. Either (preferred) use cpumask_clear()
or limit the initializer to {} (thus not depending on other than the fact
that this is a compound type).

> +    unsigned int i, j, cpu, sibling;
> +    unsigned long offset, max_per_cpu_sz = 0;
> +    unsigned long start, end;
> +    unsigned long rem = 0;
>  
>      if ( !opt_bootscrub )
>          return;
>  
> -    printk("Scrubbing Free RAM: ");
> +    /* Scrub block size */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = 1;

This seems awfully low - I'd suggest reverting to the default when the
passed value is zero.
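
Something along these lines, perhaps (untested):

    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
    if ( chunk_size == 0 )
        chunk_size = MB(128) >> PAGE_SHIFT;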

> -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> +    /* Round #0 - figure out amounts and which CPUs to use */
> +    for_each_online_node ( i )
>      {
> +        /* Calculate Node memory start and end address */
> +        start = max(node_start_pfn(i), first_valid_mfn);
> +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> +        /* CPUs that are online and on this node (if none, that it is OK */

There's a closing parenthesis missing here.

> +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map);
> +        cpumask_copy(&temp_cpus, &node_cpus);
> +        /* Rip out threads. */
> +        for_each_cpu ( j, &temp_cpus )
> +        {
> +            cpu = 0;
> +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) {

We commonly treat these for_each_ constructs as extended
keywords, i.e. requiring blanks on both sides of the parentheses. I'd
be fine with either style, as long as you use it consistently. But across
the last dozen lines you have three different variants.

> +                if (cpu++ == 0) /* Skip 1st CPU - the core */

Coding style.

> +                    continue;
> +                cpumask_clear_cpu(sibling, &node_cpus);

I think this can be done without the inner loop:

cpumask_clear(&temp_cpus);
for_each_cpu ( j, &node_cpus )
{
    if ( cpumask_intersects(&temp_cpus, per_cpu(cpu_sibling_mask, j)) )
        continue;
    cpu = cpumask_any(per_cpu(cpu_sibling_mask, j));
    cpumask_set_cpu(cpu, &temp_cpus);
}

(Of course you may prefer cpumask_first() over cpumask_any().)

> +            }
> +        }
> +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */
> +            rem = 0;
> +            region[i].per_cpu_sz = (end - start);
> +        } else {
> +            rem = (end - start) % cpumask_weight(&node_cpus);
> +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> +                max_per_cpu_sz = region[i].per_cpu_sz;
> +        }
> +        region[i].start = start;
> +        region[i].rem = rem;
> +        cpumask_copy(&region[i].cpu, &node_cpus);
> +
> +    }
> +    cpu = smp_processor_id();
> +    /* Round default chunk size down if required */
> +    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> +        chunk_size = max_per_cpu_sz;
> +
> +    printk("Scrubbing Free RAM on %u nodes using %u CPUs: ", num_online_nodes(),
> +           cpumask_weight(&all_worker_cpus));
> +
> +    /* Round: #1 - do NUMA nodes with CPUs */
> +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> +    {
> +        for_each_online_node ( i )
> +            region[i].offset = offset;
> +
>          process_pending_softirqs();
>  
> -        pg = mfn_to_page(mfn);
> +        spin_lock(&heap_lock);
> +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
> +        spin_unlock(&heap_lock);
>  
> -        /* Quick lock-free check. */
> -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> +        printk(".");
> +    }
> +
> +    /* Round #2: NUMA nodes with no CPUs get scrubbed with all CPUs. */
> +    for_each_online_node ( i )
> +    {
> +        node_cpus = node_to_cpumask(i);
> +
> +        if ( !cpumask_empty(&node_cpus) )
>              continue;
>  
> -        /* Every 100MB, print a progress dot. */
> -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
> -            printk(".");
> +        /* We already have the node information from round #0 */
> +        end = region[i].start + region[i].per_cpu_sz;

The value calculated here doesn't seem to be used anywhere.

> +        rem = region[i].per_cpu_sz % cpumask_weight(&all_worker_cpus);
>  
> -        spin_lock(&heap_lock);
> +        region[i].rem = rem;
> +        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus);
> +        max_per_cpu_sz = region[i].per_cpu_sz;
> +        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> +            chunk_size = max_per_cpu_sz;

So you may end up limiting chunk size further on each iteration.
That's surely not very efficient.

> +        cpumask_copy(&region[i].cpu, &all_worker_cpus);
>  
> -        /* Re-check page status with lock held. */
> -        if ( page_state_is(pg, free) )
> -            scrub_one_page(pg);
> +        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> +        {
> +            region[i].offset = offset;
>  
> -        spin_unlock(&heap_lock);
> -    }
> +            process_pending_softirqs();
> +
> +            spin_lock(&heap_lock);
> +            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1);

So in round 2 you're telling all CPUs to scrub one node's memory. That
ought to be particularly inefficient when you have relatively many
CPUs and relatively little memory on the node. In the hope that nodes
without CPUs aren't that common (or are relatively symmetric in terms
of their count relative to the count of nodes with CPUs), wouldn't it be
better (and yielding simpler code) to have each CPU scrub one such
node in its entirety (ultimately, if this turns out to be a more common
case, node distance could even be taken into consideration when
picking which CPU does what).
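
E.g. something of this shape (untested, purely to illustrate the idea; it
reuses the round #0 bookkeeping and handles the CPU-less nodes one after
another):

/* Round #2 (alternative): hand each CPU-less node to a single worker. */
cpu = cpumask_first(&all_worker_cpus);
for_each_online_node ( i )
{
    if ( !cpumask_empty(&node_to_cpumask(i)) )
        continue;

    /* One worker scrubs the whole node; per_cpu_sz is already (end - start). */
    cpumask_clear(&region[i].cpu);
    cpumask_set_cpu(cpu, &region[i].cpu);
    region[i].rem = 0;

    for ( offset = 0; offset < region[i].per_cpu_sz; offset += chunk_size )
    {
        region[i].offset = offset;
        process_pending_softirqs();
        spin_lock(&heap_lock);
        on_selected_cpus(&region[i].cpu, smp_scrub_heap_pages, &region[i], 1);
        spin_unlock(&heap_lock);
    }

    /* Use a different worker for the next CPU-less node. */
    cpu = cpumask_next(cpu, &all_worker_cpus);
    if ( cpu >= nr_cpu_ids )
        cpu = cpumask_first(&all_worker_cpus);
}

Picking 'cpu' by node distance would then be a further refinement if such
systems turn out to matter.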

Jan


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-14  9:39   ` Jan Beulich
@ 2014-04-14 15:04     ` Konrad Rzeszutek Wilk
  2014-04-14 15:18       ` NUMA nodes with no cpus Andrew Cooper
  2014-04-14 15:38       ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Jan Beulich
  0 siblings, 2 replies; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-14 15:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, andrew.cooper3, dario.faggioli, tim, xen-devel,
	Konrad Rzeszutek Wilk, malcolm.crossley

. snip.. [agreed to the comments]
> > -        spin_lock(&heap_lock);
> > +        region[i].rem = rem;
> > +        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus);
> > +        max_per_cpu_sz = region[i].per_cpu_sz;
> > +        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz )
> > +            chunk_size = max_per_cpu_sz;
> 
> So you may end up limiting chunk size further on each iteration.
> That's surely not very efficient.

Tim also raised this. I think a better option would be to
stash the 'chunk_size' in the 'region' instead and have each
node worker select the most optimal value.

That will require that the loop which kicks off all the CPUs not
increment 'chunk' based on the 'chunk_size'

 chunk = 0; chunk < max_per_cpu_sz; chunk += chunk_size
                                             ^^^^^^^^^^

but on a different value. Perhaps just make it a max iteration value:

 j = 0; j < max_per_cpu_sz / opt_bootscrub_chunk; j++

and each node worker will be given the 'j' value as its 'offset' and can
figure out whether:

   if ( r->start + r->offset * r->per_cpu_sz >= r->end )
       return;
   else
       /* do the work */

(and we would add 'end' in the new region).
> 
> > +        cpumask_copy(&region[i].cpu, &all_worker_cpus);
> >  
> > -        /* Re-check page status with lock held. */
> > -        if ( page_state_is(pg, free) )
> > -            scrub_one_page(pg);
> > +        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> > +        {
> > +            region[i].offset = offset;
> >  
> > -        spin_unlock(&heap_lock);
> > -    }
> > +            process_pending_softirqs();
> > +
> > +            spin_lock(&heap_lock);
> > +            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1);
> 
> So in round 2 you're telling all CPUs to scrub one node's memory. That
> ought to be particularly inefficient when you have relatively many
> CPUs and relatively little memory on the node. In the hope that nodes

Aha! I knew I missed one case.

> without CPUs aren't that common (or are relatively symmetric in terms
> of their count relative to the count of nodes with CPUs), wouldn't it be
> better (and yielding simpler code) to have each CPU scrub one such
> node in its entirety (ultimately, if this turns out to be a more common
> case, node distance could even be taken into consideration when
> picking which CPU does what).

That would work too. The non-CPU-full-NUMA case is an oddity that I have only
reproduced if I use 'numa=fake=8' on a single socket machine. In that case
the 1st node has all of the CPUs and the rest are without. And it made it faster
to just use all of the available CPU cores on the rest of the nodes.

Perhaps a check:

	(end - start)  < opt_bootscrub_chunk * 2

If the amount of memory per node is less than the bootscrub chunk size (assume
it is in bytes) times two, then use just one CPU.

But if it is larger, then split the amount of _all_worker_cpus_ by the
total count of nodes - and have them work in lock-step? Something like this:

 node = 0;
 non_cpu_nodes = 0;
 for_each_online_node(node)
 {
	if ( cpumask_empty(&node_cpus) )
		non_cpu_nodes++;
 }
 cnt = cpumask_weight(&all_worker_cpus) / non_cpu_nodes;
 for (node = 0, i = 0; i < cpumask_weight(&all_worker_cpus); i++)
 {
	cpumask_clear(&r->cpus[node]);
	for (j = i; j < cnt + i; j++)
		cpumask_set_cpu(j, &r->cpus[node]);
	/* Take care to deal with SMT threads. */
 }

then we would have just a subset of CPUs hammering on the non-CPU-nodes.


P.S.
I don't know of any real-life scenarios where there are NUMA nodes without
CPUs.

> 
> Jan


* Re: NUMA nodes with no cpus
  2014-04-14 15:04     ` Konrad Rzeszutek Wilk
@ 2014-04-14 15:18       ` Andrew Cooper
  2014-04-14 15:38       ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Jan Beulich
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2014-04-14 15:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, dario.faggioli, tim, xen-devel, Konrad Rzeszutek Wilk,
	Jan Beulich, malcolm.crossley

On 14/04/14 16:04, Konrad Rzeszutek Wilk wrote:
>
> P.S.
> I don't know of any real-life scenarios where there are NUMA nodes without
> CPUs.

Very easy to end up with a scenario like this by disabling CPUs in the
BIOS or on the Xen command line.  I would hope it is never seen in a production
scenario.

~Andrew


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-14 15:04     ` Konrad Rzeszutek Wilk
  2014-04-14 15:18       ` NUMA nodes with no cpus Andrew Cooper
@ 2014-04-14 15:38       ` Jan Beulich
  2014-04-14 15:54         ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Beulich @ 2014-04-14 15:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, andrew.cooper3, dario.faggioli, tim, xen-devel,
	Konrad Rzeszutek Wilk, malcolm.crossley

>>> On 14.04.14 at 17:04, <konrad.wilk@oracle.com> wrote:
> I don't know of any real-life scenarios where there are NUMA nodes without
> CPUs.

I've seen mention of such, but without encountering any I think it
makes sense to handle the case with the smallest possible amount of
(and most understandable) code. No tricks - those can be introduced
by whoever first gets annoyed by the process being too slow on such
a system.

Jan


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-14 15:38       ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Jan Beulich
@ 2014-04-14 15:54         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-14 15:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, andrew.cooper3, dario.faggioli, tim, xen-devel,
	Konrad Rzeszutek Wilk, malcolm.crossley

On Mon, Apr 14, 2014 at 04:38:54PM +0100, Jan Beulich wrote:
> >>> On 14.04.14 at 17:04, <konrad.wilk@oracle.com> wrote:
> > I don't know of any real-life scenarios where there are NUMA nodes without
> > CPUs.
> 
> I've seen mention of such, but without encountering any I think it
> makes sense to handle the case with the smallest possible amount of
> (and most understandable) code. No tricks - those can be introduced
> by whoever first gets annoyed by the process being too slow on such
> a system.

Heheh. OK, will follow the design you laid out.

> 
> Jan
> 


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
  2014-04-12 10:41 Konrad Rzeszutek Wilk
@ 2014-04-13 22:45 ` Tim Deegan
  0 siblings, 0 replies; 13+ messages in thread
From: Tim Deegan @ 2014-04-13 22:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, Andrew Cooper, dario.faggioli, xen-devel,
	Konrad Rzeszutek Wilk, JBeulich, malcolm.crossley

At 06:41 -0400 on 12 Apr (1397281284), Konrad Rzeszutek Wilk wrote:
> On Apr 11, 2014 2:38 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > On 11/04/14 19:08, Konrad Rzeszutek Wilk wrote: 
> > > +    if ( data ) 
> > > +        r = data; 
> > > +    else { 
> > > +        node = cpu_to_node(cpu); 
> > > +        if ( node == NUMA_NO_NODE ) 
> > > +            return; 
> > > +        r = &region[node]; 
> > > +    } 
> > > +    ASSERT(r != NULL); 
> >
> > Under what conditions would NULL be passed?  Can't the caller do 
> > something more sane? 

In round 1, the caller (via on_selected_cpus()) passes NULL to tell
the workers to scrub their local NUMA nodes.  In round 2, it passes
explicit node details to all workers.  Seems OK to me.

> > > +    wmb(); 
> >
> > Why this barrier?
> 
> Tim asked for it in his review - see http://lists.xen.org/archives/html/xen-devel/2013-10/msg00131.html
> " + scrub_one_page(pg); > + } There should be a wmb() here, to make sure the main scrub dispatcher can't exit while the last worker is still issuing writes."
> 
> But with the @wait = 1 this fix is not needed anymore. Will remove it.

Yep, on_selected_cpus will DTRT here.

Cheers,

Tim.


* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
@ 2014-04-12 10:41 Konrad Rzeszutek Wilk
  2014-04-13 22:45 ` Tim Deegan
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-12 10:41 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: keir, dario.faggioli, tim, xen-devel, Konrad Rzeszutek Wilk,
	JBeulich, malcolm.crossley


On Apr 11, 2014 2:38 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 11/04/14 19:08, Konrad Rzeszutek Wilk wrote: 
> > From: Malcolm Crossley <malcolm.crossley@citrix.com> 
> > 
> > The page scrubbing is done in 128MB chunks in lockstep across all the CPU's. 
>
> No apostrophe in plural CPUs (also further through) 
>
> > This allows for the boot CPU to hold the heap_lock whilst each chunk is being 
> > scrubbed and then release the heap_lock when all CPU's are finished scrubing 
> > their individual chunk. This allows for the heap_lock to not be held 
> > continously and for pending softirqs are to be serviced periodically across 
> > all CPU's. 
> > 
> > The page scrub memory chunks are allocated to the CPU's in a NUMA aware 
> > fashion to reduce socket interconnect overhead and improve performance. 
> > Specifically in the first phase we scrub at the same time on all the 
> > NUMA nodes that have CPUs - we also weed out the SMT threads so that 
> > we only use cores (that gives a 50% boost). The second phase is to use 
> > all of the CPUs for the NUMA nodes that have no CPUs. 
> > 
> > This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron 
> > 6386 machine from 49 seconds to 3 seconds. 
> > On a IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes 
> > to 117 seconds. 
>
> The numbers from our 1TB box are also along a similar line, but I don't 
> have them to hand. 
>
> > 
> > v2 
> >  - Reduced default chunk size to 128MB 
> >  - Added code to scrub NUMA nodes with no active CPU linked to them 
> >  - Be robust to boot CPU not being linked to a NUMA node 
> > 
> > v3: 
> >  - Don't use SMT threads 
> >  - Take care of remainder if the number of CPUs (or memory) is odd 
> >  - Restructure the worker thread 
> >  - s/u64/unsigned long/ 
> > 
> > Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com> 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 
> > --- 
> >  docs/misc/xen-command-line.markdown |   10 ++ 
> >  xen/common/page_alloc.c             |  177 +++++++++++++++++++++++++++++++---- 
> >  2 files changed, 167 insertions(+), 20 deletions(-) 
> > 
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown 
> > index 87de2dc..a7da227 100644 
> > --- a/docs/misc/xen-command-line.markdown 
> > +++ b/docs/misc/xen-command-line.markdown 
> > @@ -197,6 +197,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent 
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes 
> >  and reboots. 
> >  
> > +### bootscrub_chunk_ 
>
> Erroneous trailing underscore, and the first one needs escaping for the 
> markdown to end up properly formatted. 
>
> > +> `= <size>` 
> > + 
> > +> Default: `128MiB` 
>
> `128MB` 
>
> 128MiB will be rejected by the size parsing code

Yes. Will correct.
>
> > + 
> > +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock 
> > +and not running softirqs. Reduce this if softirqs are not being run frequently 
> > +enough. Setting this to a high value may cause cause boot failure, particularly 
> > +if the NMI watchdog is also enabled. 
> > + 
> >  ### cachesize 
> >  > `= <size>` 
> >  
> > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c 
> > index 601319c..3ad6e1d 100644 
> > --- a/xen/common/page_alloc.c 
> > +++ b/xen/common/page_alloc.c 
> > @@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata = 1; 
> >  boolean_param("bootscrub", opt_bootscrub); 
> >  
> >  /* 
> > + * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held 
>
> Stale comment?

Aye
>
> > + */ 
> > +static unsigned int __initdata opt_bootscrub_chunk = 128 * 1024 * 1024; 
> > +size_param("bootscrub_chunk", opt_bootscrub_chunk); 
> > + 
> > +/* 
> >   * Bit width of the DMA heap -- used to override NUMA-node-first. 
> >   * allocation strategy, which can otherwise exhaust low memory. 
> >   */ 
> > @@ -90,6 +96,16 @@ static struct bootmem_region { 
> >  } *__initdata bootmem_region_list; 
> >  static unsigned int __initdata nr_bootmem_regions; 
> >  
> > +struct scrub_region { 
> > +    unsigned long offset; 
> > +    unsigned long start; 
> > +    unsigned long per_cpu_sz; 
> > +    unsigned long rem; 
> > +    cpumask_t cpu; 
>
> cpus surely? 
>
> > +}; 
> > +static struct scrub_region __initdata region[MAX_NUMNODES]; 
> > +static unsigned long __initdata chunk_size; 
> > + 
> >  static void __init boot_bug(int line) 
> >  { 
> >      panic("Boot BUG at %s:%d", __FILE__, line); 
> > @@ -1256,45 +1272,166 @@ void __init end_boot_allocator(void) 
> >      printk("\n"); 
> >  } 
> >  
> > +void __init smp_scrub_heap_pages(void *data) 
> > +{ 
> > +    unsigned long mfn, start, end; 
> > +    struct page_info *pg; 
> > +    struct scrub_region *r; 
> > +    unsigned int temp_cpu, node, cpu_idx = 0; 
> > +    unsigned int cpu = smp_processor_id(); 
> > + 
> > +    if ( data ) 
> > +        r = data; 
> > +    else { 
> > +        node = cpu_to_node(cpu); 
> > +        if ( node == NUMA_NO_NODE ) 
> > +            return; 
> > +        r = &region[node]; 
> > +    } 
> > +    ASSERT(r != NULL); 
>
> Under what conditions would NULL be passed?  Can't the caller do 
> something more sane? 
>

Left over code from v2.
> > + 
> > +    /* Determine the current CPU's index into CPU's linked to this node*/ 
>
> Space at the end of the comment 
>
> > +    for_each_cpu ( temp_cpu, &r->cpu ) 
> > +    { 
> > +        if ( cpu == temp_cpu ) 
> > +            break; 
> > +        cpu_idx++; 
> > +    } 
>
> Is this really the best way to do this?  Perhaps, but it absolutely 
> should be calculated once on the first chunk scrubbed rather than for 
> each chunk.  Or better yet, probably calculated by the BSP when it is 
> splitting up RAM.

Where would you store this? The region is per-node. We would need to allocate a per-CPU array to store the cpu_idx values. I am not sure it is worth that, considering this code only runs at boot.
>
> > + 
> > +    /* Calculate the starting mfn for this CPU's memory block */ 
> > +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset; 
> > + 
> > +    /* Calculate the end mfn into this CPU's memory block for this iteration */ 
> > +    if ( r->offset + chunk_size > r->per_cpu_sz ) { 
> > +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz; 
> > +        if ( r->rem && ((cpumask_weight(&r->cpu) - 1 == cpu_idx )) ) 
> > +            end += r->rem; 
> > +    } 
> > +    else 
> > +        end = start + chunk_size; 
> > + 
> > +    for ( mfn = start; mfn < end; mfn++ ) 
> > +    { 
> > +        pg = mfn_to_page(mfn); 
> > + 
> > +        /* Check the mfn is valid and page is free. */ 
> > +        if ( !mfn_valid(mfn) || !page_state_is(pg, free) ) 
> > +            continue; 
> > + 
> > +        scrub_one_page(pg); 
> > +    } 
> > +    wmb(); 
>
> Why this barrier?

Tim asked for it in his review - see http://lists.xen.org/archives/html/xen-devel/2013-10/msg00131.html:

  "+        scrub_one_page(pg);
   +    }
   There should be a wmb() here, to make sure the main scrub dispatcher
   can't exit while the last worker is still issuing writes."

But with @wait == 1 that barrier is not needed anymore. Will remove it.
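To spell it out, the dispatcher already uses the synchronous form:

    spin_lock(&heap_lock);
    /* @wait == 1: returns only after every selected CPU has run the handler. */
    on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
    spin_unlock(&heap_lock);

so all the workers' scrub writes have completed before we print the progress
dot and move on to the next chunk; the wmb() in the worker does not add
anything.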

>
> > +} 
> > + 
> >  /* 
> > - * Scrub all unallocated pages in all heap zones. This function is more 
> > - * convoluted than appears necessary because we do not want to continuously 
> > - * hold the lock while scrubbing very large memory areas. 
> > + * Scrub all unallocated pages in all heap zones. This function uses all 
> > + * online cpu's to scrub the memory in parallel. 
> >   */ 
> >  void __init scrub_heap_pages(void) 
> >  { 
> > -    unsigned long mfn; 
> > -    struct page_info *pg; 
> > +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }}; 
> > +    unsigned int i, j, cpu, sibling; 
> > +    unsigned long offset, max_per_cpu_sz = 0; 
> > +    unsigned long start, end; 
> > +    unsigned long rem = 0; 
> >  
> >      if ( !opt_bootscrub ) 
> >          return; 
> >  
> > -    printk("Scrubbing Free RAM: "); 
> > +    /* Scrub block size */ 
> > +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT; 
> > +    if ( chunk_size == 0 ) 
> > +        chunk_size = 1; 
> >  
> > -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ ) 
> > +    /* Round #0 - figure out amounts and which CPUs to use */ 
> > +    for_each_online_node ( i ) 
> >      { 
> > +        /* Calculate Node memory start and end address */ 
> > +        start = max(node_start_pfn(i), first_valid_mfn); 
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page); 
> > +        /* CPUs that are online and on this node (if none, that it is OK */ 
> > +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map); 
> > +        cpumask_copy(&temp_cpus, &node_cpus); 
> > +        /* Rip out threads. */ 
> > +        for_each_cpu ( j, &temp_cpus ) 
> > +        { 
> > +            cpu = 0; 
> > +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) { 
> > +                if (cpu++ == 0) /* Skip 1st CPU - the core */ 
> > +                    continue; 
> > +                cpumask_clear_cpu(sibling, &node_cpus); 
> > +            } 
> > +        } 
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus); 
> > +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */ 
> > +            rem = 0; 
> > +            region[i].per_cpu_sz = (end - start); 
> > +        } else { 
> > +            rem = (end - start) % cpumask_weight(&node_cpus); 
> > +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus); 
> > +            if ( region[i].per_cpu_sz > max_per_cpu_sz ) 
> > +                max_per_cpu_sz = region[i].per_cpu_sz; 
> > +        } 
> > +        region[i].start = start; 
> > +        region[i].rem = rem; 
> > +        cpumask_copy(&region[i].cpu, &node_cpus); 
> > + 
> > +    } 
> > +    cpu = smp_processor_id(); 
>
> This is surely always 0?, and I don't see it being used any further down.... 
>

Leftover from v2.

Thanks for looking!
> ~Andrew 
>
> > +    /* Round default chunk size down if required */ 
> > +    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz ) 
> > +        chunk_size = max_per_cpu_sz; 
> > + 
> > +    printk("Scrubbing Free RAM on %u nodes using %u CPUs: ", num_online_nodes(), 
> > +           cpumask_weight(&all_worker_cpus)); 
> > + 
> > +    /* Round: #1 - do NUMA nodes with CPUs */ 
> > +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size ) 
> > +    { 
> > +        for_each_online_node ( i ) 
> > +            region[i].offset = offset; 
> > + 
> >          process_pending_softirqs(); 
> >  
> > -        pg = mfn_to_page(mfn); 
> > +        spin_lock(&heap_lock); 
> > +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1); 
> > +        spin_unlock(&heap_lock); 
> >  
> > -        /* Quick lock-free check. */ 
> > -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) ) 
> > +        printk("."); 
> > +    } 
> > + 
> > +    /* Round #2: NUMA nodes with no CPUs get scrubbed with all CPUs. */ 
> > +    for_each_online_node ( i ) 
> > +    { 
> > +        node_cpus = node_to_cpumask(i); 
> > + 
> > +        if ( !cpumask_empty(&node_cpus) ) 
> >              continue; 
> >  
> > -        /* Every 100MB, print a progress dot. */ 
> > -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 ) 
> > -            printk("."); 
> > +        /* We already have the node information from round #0 */ 
> > +        end = region[i].start + region[i].per_cpu_sz; 
> > +        rem = region[i].per_cpu_sz % cpumask_weight(&all_worker_cpus); 
> >  
> > -        spin_lock(&heap_lock); 
> > +        region[i].rem = rem; 
> > +        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus); 
> > +        max_per_cpu_sz = region[i].per_cpu_sz; 
> > +        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz ) 
> > +            chunk_size = max_per_cpu_sz; 
> > +        cpumask_copy(&region[i].cpu, &all_worker_cpus); 
> >  
> > -        /* Re-check page status with lock held. */ 
> > -        if ( page_state_is(pg, free) ) 
> > -            scrub_one_page(pg); 
> > +        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size ) 
> > +        { 
> > +            region[i].offset = offset; 
> >  
> > -        spin_unlock(&heap_lock); 
> > -    } 
> > +            process_pending_softirqs(); 
> > + 
> > +            spin_lock(&heap_lock); 
> > +            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1); 
> > +            spin_unlock(&heap_lock); 
> >  
> > -    printk("done.\n"); 
> > +            printk("."); 
> > +        } 
> > +    } 
> >  
> >      /* Now that the heap is initialized, run checks and set bounds 
> >       * for the low mem virq algorithm. */ 
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
