* [PATCH v5] Spread boot time scrubbing across available CPUs.
From: Konrad Rzeszutek Wilk @ 2014-06-04 13:29 UTC
  To: xen-devel, JBeulich, andrew.cooper3, tim, dario.faggioli, julien.grall

Please see v5 of this patchset. It should address all of the review comments.

I did change the algorithm for the NUMA-node-but-no-CPUs case. It now
picks the CPUs of the closest NUMA node to do the scrubbing. If that
node does not have any CPUs either, it keeps looking until it finds
one, falling back on the first online node. And if the first node has
no CPUs either, it just picks the BSP and calls it a day. That should
hopefully take care of running on broken hardware.
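
In rough outline the selection looks like this (a sketch only; the
authoritative code is in the patch below):

    /* For a CPU-less node i, pick the nearest online node that has CPUs. */
    last_distance = INT_MAX;
    best_node = first_node(node_online_map);    /* default fallback */
    for_each_online_node ( j )
    {
        if ( i == j || cpumask_empty(&node_to_cpumask(j)) )
            continue;
        if ( __node_distance(i, j) < last_distance )
        {
            last_distance = __node_distance(i, j);
            best_node = j;
        }
    }
    /* If even the fallback node has no usable CPUs, just use the BSP. */
    if ( find_non_smt(best_node, &node_cpus) == 0 )
        cpumask_set_cpu(smp_processor_id(), &node_cpus);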

I've also cross-compiled it on ARM, but haven't yet run it under the
emulator to make sure it works correctly. I figured I would do that once
the x86 folks are comfortable with the patch.

Thank you everybody for reviewing the patch over and over.

 docs/misc/xen-command-line.markdown |  10 ++
 xen/common/page_alloc.c             | 208 ++++++++++++++++++++++++++++++++----
 xen/include/asm-arm/numa.h          |   1 +
 3 files changed, 201 insertions(+), 18 deletions(-)

* [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)
From: Konrad Rzeszutek Wilk @ 2014-06-04 13:29 UTC
  To: xen-devel, JBeulich, andrew.cooper3, tim, dario.faggioli, julien.grall
  Cc: Malcolm Crossley, Konrad Rzeszutek Wilk

From: Malcolm Crossley <malcolm.crossley@citrix.com>

The page scrubbing is done in 128MB chunks in lockstep across all the
non-SMT CPUs. This allows the boot CPU to hold the heap_lock whilst each
chunk is being scrubbed, and to release the heap_lock once the CPUs have
finished scrubbing their individual chunks. That way the heap_lock is
not held continuously, and pending softirqs are serviced periodically
across the CPUs.

The page scrub memory chunks are assigned to the CPUs in a NUMA-aware
fashion to reduce socket interconnect overhead and improve performance.
Specifically, in the first phase we scrub simultaneously on all the
NUMA nodes that have CPUs; we also weed out the SMT threads so that
only full cores are used (which gives a 50% boost). The second phase
covers NUMA nodes that have no CPUs: for those we use the closest NUMA
node's CPUs (non-SMT again) to do the job.
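
In outline, each worker CPU derives its slice of a node like this
(sketch; the names are as in smp_scrub_heap_pages() in the patch below):

    /* Each CPU owns per_cpu_sz pages of the region and walks them
     * chunk_size pages at a time, advanced via r->offset. */
    start = r->start + r->per_cpu_sz * cpu_idx + r->offset;
    if ( r->offset + chunk_size >= r->per_cpu_sz )      /* final chunk */
    {
        end = r->start + r->per_cpu_sz * (cpu_idx + 1);
        if ( r->rem && cpu_idx == cpumask_weight(&r->cpus) - 1 )
            end += r->rem;      /* last CPU also covers the remainder */
    }
    else
        end = start + chunk_size;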

This patch reduces the boot page scrub time on a 128GB, 64-core AMD
Opteron 6386 machine from 49 seconds to 3 seconds.
On an IvyBridge-EX 8-socket box with 1.5TB of RAM it cuts the time down
from 15 minutes to 63 seconds.

v2:
 - Reduced default chunk size to 128MB
 - Added code to scrub NUMA nodes with no active CPU linked to them
 - Be robust to boot CPU not being linked to a NUMA node

v3:
 - Don't use SMT threads
 - Take care of remainder if the number of CPUs (or memory) is odd
 - Restructure the worker thread
 - s/u64/unsigned long/

v4:
 - Don't use all CPUs on non-CPU NUMA nodes, just closest one
 - Syntax and docs updates
 - Compile on ARM
 - Fix bug when NUMA node has 0 pages

v5:
 - Properly figure out best NUMA node.
 - Fix comments to be proper style.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 docs/misc/xen-command-line.markdown |  10 ++
 xen/common/page_alloc.c             | 208 ++++++++++++++++++++++++++++++++----
 xen/include/asm-arm/numa.h          |   1 +
 3 files changed, 201 insertions(+), 18 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 514c7b2..39a67be 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
 accidentally leaking sensitive VM data into other VMs if Xen crashes
 and reboots.
 
+### bootscrub_chunk_
+> `= <size>`
+
+> Default: `128MB`
+
+Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
+and not running softirqs. Reduce this if softirqs are not being run frequently
+enough. Setting this to a high value may cause cause boot failure, particularly
+if the NMI watchdog is also enabled.
+
 ### cachesize
 > `= <size>`
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 601319c..d7390b6 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -65,6 +65,13 @@ static bool_t opt_bootscrub __initdata = 1;
 boolean_param("bootscrub", opt_bootscrub);
 
 /*
+ * bootscrub_chunk -> Amount of bytes to scrub lockstep on non-SMT CPUs
+ * on all NUMA nodes.
+ */
+static unsigned long __initdata opt_bootscrub_chunk = MB(128);
+size_param("bootscrub_chunk", opt_bootscrub_chunk);
+
+/*
  * Bit width of the DMA heap -- used to override NUMA-node-first.
  * allocation strategy, which can otherwise exhaust low memory.
  */
@@ -90,6 +97,16 @@ static struct bootmem_region {
 } *__initdata bootmem_region_list;
 static unsigned int __initdata nr_bootmem_regions;
 
+struct scrub_region {
+    unsigned long offset;
+    unsigned long start;
+    unsigned long per_cpu_sz;
+    unsigned long rem;
+    cpumask_t cpus;
+};
+static struct scrub_region __initdata region[MAX_NUMNODES];
+static unsigned long __initdata chunk_size;
+
 static void __init boot_bug(int line)
 {
     panic("Boot BUG at %s:%d", __FILE__, line);
@@ -1256,42 +1273,197 @@ void __init end_boot_allocator(void)
     printk("\n");
 }
 
+static void __init smp_scrub_heap_pages(void *data)
+{
+    unsigned long mfn, start, end;
+    struct page_info *pg;
+    struct scrub_region *r;
+    unsigned int temp_cpu, node, cpu_idx = 0;
+    unsigned int cpu = smp_processor_id();
+
+    if ( data )
+        r = data;
+    else
+    {
+        node = cpu_to_node(cpu);
+        if ( node == NUMA_NO_NODE )
+            return;
+        r = &region[node];
+    }
+
+    /* Determine the current CPU's index into CPU's linked to this node. */
+    for_each_cpu ( temp_cpu, &r->cpus )
+    {
+        if ( cpu == temp_cpu )
+            break;
+        cpu_idx++;
+    }
+
+    /* Calculate the starting mfn for this CPU's memory block. */
+    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
+
+    /* Calculate the end mfn into this CPU's memory block for this iteration. */
+    if ( r->offset + chunk_size >= r->per_cpu_sz )
+    {
+        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
+
+        if ( r->rem && ((cpumask_weight(&r->cpus) - 1 == cpu_idx )) )
+            end += r->rem;
+    }
+    else
+        end = start + chunk_size;
+
+    for ( mfn = start; mfn < end; mfn++ )
+    {
+        pg = mfn_to_page(mfn);
+
+        /* Check the mfn is valid and page is free. */
+        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
+            continue;
+
+        scrub_one_page(pg);
+    }
+}
+
+static int __init find_non_smt(unsigned int node, cpumask_t *dest)
+{
+    cpumask_t node_cpus;
+    unsigned int i, cpu;
+
+    cpumask_and(&node_cpus, &node_to_cpumask(node), &cpu_online_map);
+    cpumask_clear(dest);
+    for_each_cpu ( i, &node_cpus )
+    {
+        if ( cpumask_intersects(dest, per_cpu(cpu_sibling_mask, i)) )
+            continue;
+        cpu = cpumask_first(per_cpu(cpu_sibling_mask, i));
+        cpumask_set_cpu(cpu, dest);
+    }
+    return cpumask_weight(dest);
+}
+
 /*
- * Scrub all unallocated pages in all heap zones. This function is more
- * convoluted than appears necessary because we do not want to continuously
- * hold the lock while scrubbing very large memory areas.
+ * Scrub all unallocated pages in all heap zones. This function uses all
+ * online cpu's to scrub the memory in parallel.
  */
 void __init scrub_heap_pages(void)
 {
-    unsigned long mfn;
-    struct page_info *pg;
+    cpumask_t node_cpus, all_worker_cpus;
+    unsigned int i, j;
+    unsigned long offset, max_per_cpu_sz = 0;
+    unsigned long start, end;
+    unsigned long rem = 0;
+    int last_distance, best_node;
 
     if ( !opt_bootscrub )
         return;
 
-    printk("Scrubbing Free RAM: ");
+    cpumask_clear(&all_worker_cpus);
+    /* Scrub block size. */
+    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
+    if ( chunk_size == 0 )
+        chunk_size = MB(128);
+
+    /* Round #0 - figure out amounts and which CPUs to use. */
+    for_each_online_node ( i )
+    {
+        if ( !node_spanned_pages(i) )
+            continue;
+        /* Calculate Node memory start and end address. */
+        start = max(node_start_pfn(i), first_valid_mfn);
+        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
+        /* CPUs that are online and on this node (if none, that it is OK). */
+        find_non_smt(i, &node_cpus);
+        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
+        if ( cpumask_empty(&node_cpus) )
+        {
+            /* No CPUs on this node. Round #2 will take of it. */
+            rem = 0;
+            region[i].per_cpu_sz = (end - start);
+        } else {
+            rem = (end - start) % cpumask_weight(&node_cpus);
+            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
+            if ( region[i].per_cpu_sz > max_per_cpu_sz )
+                max_per_cpu_sz = region[i].per_cpu_sz;
+        }
+        region[i].start = start;
+        region[i].rem = rem;
+        cpumask_copy(&region[i].cpus, &node_cpus);
+
+    }
+
+    printk("Scrubbing Free RAM on %u nodes using %u CPUs\n", num_online_nodes(),
+           cpumask_weight(&all_worker_cpus));
 
-    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
+    /* Round: #1 - do NUMA nodes with CPUs. */
+    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
     {
+        for_each_online_node ( i )
+            region[i].offset = offset;
+
         process_pending_softirqs();
 
-        pg = mfn_to_page(mfn);
+        spin_lock(&heap_lock);
+        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
+        spin_unlock(&heap_lock);
 
-        /* Quick lock-free check. */
-        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
+        printk(".");
+    }
+
+    /*
+     * Round #2: NUMA nodes with no CPUs get scrubbed with CPUs on the node
+     * closest to us and with CPUs.
+     */
+    for_each_online_node ( i )
+    {
+        node_cpus = node_to_cpumask(i);
+
+        if ( !cpumask_empty(&node_cpus) )
             continue;
 
-        /* Every 100MB, print a progress dot. */
-        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
-            printk(".");
+        last_distance = INT_MAX;
+        best_node = first_node(node_online_map);
+        /* Figure out which NODE CPUs are close. */
+        for_each_online_node ( j )
+        {
+            int distance;
 
-        spin_lock(&heap_lock);
+            if ( i == j )
+                continue;
+            distance = __node_distance(i, j);
+            if ( distance < last_distance )
+            {
+                if ( cpumask_empty(&node_to_cpumask(j)) )
+                    continue;
+                last_distance = distance;
+                best_node = j;
+            }
+        }
+        /*
+         * Use CPUs from best node, and if there are no CPUs on the
+         * first node (the default) use the BSP.
+         */
+        if ( find_non_smt(best_node, &node_cpus) == 0 )
+            cpumask_set_cpu(smp_processor_id(), &node_cpus);
 
-        /* Re-check page status with lock held. */
-        if ( page_state_is(pg, free) )
-            scrub_one_page(pg);
+        /* We already have the node information from round #0. */
+        region[i].rem = region[i].per_cpu_sz % cpumask_weight(&node_cpus);
+        region[i].per_cpu_sz /= cpumask_weight(&node_cpus);
+        max_per_cpu_sz = region[i].per_cpu_sz;
+        cpumask_copy(&region[i].cpus, &node_cpus);
 
-        spin_unlock(&heap_lock);
+        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
+        {
+            region[i].offset = offset;
+
+            process_pending_softirqs();
+
+            spin_lock(&heap_lock);
+            on_selected_cpus(&node_cpus, smp_scrub_heap_pages, &region[i], 1);
+            spin_unlock(&heap_lock);
+
+            printk(".");
+        }
     }
 
     printk("done.\n");
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index cb8f2ba..2c019d7 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -12,6 +12,7 @@ static inline __attribute__((pure)) int phys_to_nid(paddr_t addr)
 
 /* XXX: implement NUMA support */
 #define node_spanned_pages(nid) (total_pages)
+#define node_start_pfn(nid) (frametable_base_mfn)
 #define __node_distance(a, b) (20)
 
 #endif /* __ARCH_ARM_NUMA_H */
-- 
1.9.3

* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)
From: Andrew Cooper @ 2014-06-04 13:35 UTC
  To: Konrad Rzeszutek Wilk
  Cc: dario.faggioli, julien.grall, tim, JBeulich, xen-devel, Malcolm Crossley

On 04/06/14 14:29, Konrad Rzeszutek Wilk wrote:
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 514c7b2..39a67be 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
>  
> +### bootscrub_chunk_
> +> `= <size>`
> +
> +> Default: `128MB`

The 'B' here is still erroneous.

>      printk("done.\n");
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index cb8f2ba..2c019d7 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -12,6 +12,7 @@ static inline __attribute__((pure)) int phys_to_nid(paddr_t addr)
>  
>  /* XXX: implement NUMA support */
>  #define node_spanned_pages(nid) (total_pages)
> +#define node_start_pfn(nid) (frametable_base_mfn)

What is this supposed to achieve?

~Andrew

>  #define __node_distance(a, b) (20)
>  
>  #endif /* __ARCH_ARM_NUMA_H */

* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)
From: Konrad Rzeszutek Wilk @ 2014-06-04 13:52 UTC
  To: Andrew Cooper
  Cc: dario.faggioli, julien.grall, tim, JBeulich, xen-devel, Malcolm Crossley

On Wed, Jun 04, 2014 at 02:35:49PM +0100, Andrew Cooper wrote:
> On 04/06/14 14:29, Konrad Rzeszutek Wilk wrote:
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> > index 514c7b2..39a67be 100644
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes
> >  and reboots.
> >  
> > +### bootscrub_chunk_
> > +> `= <size>`
> > +
> > +> Default: `128MB`
> 
> The 'B' here is still erroneous.
> 
> >      printk("done.\n");
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index cb8f2ba..2c019d7 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -12,6 +12,7 @@ static inline __attribute__((pure)) int phys_to_nid(paddr_t addr)
> >  
> >  /* XXX: implement NUMA support */
> >  #define node_spanned_pages(nid) (total_pages)
> > +#define node_start_pfn(nid) (frametable_base_mfn)
> 
> What is this supposed to achieve?

It makes it scrub memory on ARM (which defines a flat one-node NUMA
topology). Perhaps I am not understanding your question?

> 
> ~Andrew
> 
> >  #define __node_distance(a, b) (20)
> >  
> >  #endif /* __ARCH_ARM_NUMA_H */
> 

* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)
From: Tim Deegan @ 2014-06-05 10:09 UTC
  To: Konrad Rzeszutek Wilk
  Cc: andrew.cooper3, dario.faggioli, julien.grall, JBeulich,
	xen-devel, Malcolm Crossley

Hi,

At 09:29 -0400 on 04 Jun (1401870576), Konrad Rzeszutek Wilk wrote:
> +    /* Scrub block size. */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = MB(128);

ITYM (MB(128) >> PAGE_SHIFT) here!
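
That is, presumably:

    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
    if ( chunk_size == 0 )
        chunk_size = MB(128) >> PAGE_SHIFT; /* fallback in pages, not bytes */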

Otherwise, Reviewed-by: Tim Deegan <tim@xen.org>

Cheers,

Tim.

* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)
From: Jan Beulich @ 2014-06-05 11:22 UTC
  To: Konrad Rzeszutek Wilk
  Cc: andrew.cooper3, dario.faggioli, julien.grall, tim, xen-devel,
	Malcolm Crossley

>>> On 04.06.14 at 15:29, <konrad.wilk@oracle.com> wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
>  
> +### bootscrub_chunk_

Looking at other examples in that file, underscores appear to need
backslash escaping here. And I don't think the trailing one should
be there.
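
I.e. presumably:

    ### bootscrub\_chunk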

> +> `= <size>`
> +
> +> Default: `128MB`
> +
> +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> +and not running softirqs. Reduce this if softirqs are not being run frequently
> +enough. Setting this to a high value may cause cause boot failure, particularly

Double "cause".

> +static void __init smp_scrub_heap_pages(void *data)
> +{
> +    unsigned long mfn, start, end;
> +    struct page_info *pg;
> +    struct scrub_region *r;
> +    unsigned int temp_cpu, node, cpu_idx = 0;
> +    unsigned int cpu = smp_processor_id();
> +
> +    if ( data )
> +        r = data;
> +    else
> +    {
> +        node = cpu_to_node(cpu);
> +        if ( node == NUMA_NO_NODE )
> +            return;
> +        r = &region[node];
> +    }
> +
> +    /* Determine the current CPU's index into CPU's linked to this node. */
> +    for_each_cpu ( temp_cpu, &r->cpus )
> +    {
> +        if ( cpu == temp_cpu )
> +            break;
> +        cpu_idx++;
> +    }
> +
> +    /* Calculate the starting mfn for this CPU's memory block. */
> +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
> +
> +    /* Calculate the end mfn into this CPU's memory block for this iteration. */
> +    if ( r->offset + chunk_size >= r->per_cpu_sz )
> +    {
> +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
> +
> +        if ( r->rem && ((cpumask_weight(&r->cpus) - 1 == cpu_idx )) )
> +            end += r->rem;
> +    }
> +    else
> +        end = start + chunk_size;
> +
> +    for ( mfn = start; mfn < end; mfn++ )
> +    {
> +        pg = mfn_to_page(mfn);
> +
> +        /* Check the mfn is valid and page is free. */
> +        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> +            continue;
> +
> +        scrub_one_page(pg);
> +    }
> +}
> +
> +static int __init find_non_smt(unsigned int node, cpumask_t *dest)
> +{
> +    cpumask_t node_cpus;
> +    unsigned int i, cpu;
> +
> +    cpumask_and(&node_cpus, &node_to_cpumask(node), &cpu_online_map);
> +    cpumask_clear(dest);
> +    for_each_cpu ( i, &node_cpus )
> +    {
> +        if ( cpumask_intersects(dest, per_cpu(cpu_sibling_mask, i)) )
> +            continue;
> +        cpu = cpumask_first(per_cpu(cpu_sibling_mask, i));
> +        cpumask_set_cpu(cpu, dest);
> +    }
> +    return cpumask_weight(dest);
> +}
> +
>  /*
> - * Scrub all unallocated pages in all heap zones. This function is more
> - * convoluted than appears necessary because we do not want to continuously
> - * hold the lock while scrubbing very large memory areas.
> + * Scrub all unallocated pages in all heap zones. This function uses all
> + * online cpu's to scrub the memory in parallel.
>   */
>  void __init scrub_heap_pages(void)
>  {
> -    unsigned long mfn;
> -    struct page_info *pg;
> +    cpumask_t node_cpus, all_worker_cpus;
> +    unsigned int i, j;
> +    unsigned long offset, max_per_cpu_sz = 0;
> +    unsigned long start, end;
> +    unsigned long rem = 0;
> +    int last_distance, best_node;
>  
>      if ( !opt_bootscrub )
>          return;
>  
> -    printk("Scrubbing Free RAM: ");
> +    cpumask_clear(&all_worker_cpus);
> +    /* Scrub block size. */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = MB(128);
> +
> +    /* Round #0 - figure out amounts and which CPUs to use. */
> +    for_each_online_node ( i )
> +    {
> +        if ( !node_spanned_pages(i) )
> +            continue;
> +        /* Calculate Node memory start and end address. */
> +        start = max(node_start_pfn(i), first_valid_mfn);

This implies you're aware that possibly node_start_pfn(i) <
first_valid_mfn.

> +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);

Which in turn means that this may yield end < start. Is all the rest
of the code prepared to deal with this? At least the divide and
modulo operations on the difference further down don't look like
they are.

> +        /* CPUs that are online and on this node (if none, that it is OK). */
> +        find_non_smt(i, &node_cpus);

Here you could latch the weight, avoiding ... 

> +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> +        if ( cpumask_empty(&node_cpus) )

... the extra operation on the mask here and the explicit use of
cpumask_weight() on node_cpus in the else path.

> +        {
> +            /* No CPUs on this node. Round #2 will take of it. */
> +            rem = 0;
> +            region[i].per_cpu_sz = (end - start);
> +        } else {

Coding style - this takes 3 lines.

> +            rem = (end - start) % cpumask_weight(&node_cpus);
> +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> +                max_per_cpu_sz = region[i].per_cpu_sz;
> +        }
> +        region[i].start = start;
> +        region[i].rem = rem;
> +        cpumask_copy(&region[i].cpus, &node_cpus);
> +
> +    }
> +
> +    printk("Scrubbing Free RAM on %u nodes using %u CPUs\n", num_online_nodes(),
> +           cpumask_weight(&all_worker_cpus));
>  
> -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> +    /* Round: #1 - do NUMA nodes with CPUs. */
> +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
>      {
> +        for_each_online_node ( i )
> +            region[i].offset = offset;
> +
>          process_pending_softirqs();
>  
> -        pg = mfn_to_page(mfn);
> +        spin_lock(&heap_lock);
> +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
> +        spin_unlock(&heap_lock);
>  
> -        /* Quick lock-free check. */
> -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> +        printk(".");
> +    }
> +
> +    /*
> +     * Round #2: NUMA nodes with no CPUs get scrubbed with CPUs on the node
> +     * closest to us and with CPUs.
> +     */
> +    for_each_online_node ( i )
> +    {
> +        node_cpus = node_to_cpumask(i);
> +
> +        if ( !cpumask_empty(&node_cpus) )
>              continue;
>  
> -        /* Every 100MB, print a progress dot. */
> -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
> -            printk(".");
> +        last_distance = INT_MAX;
> +        best_node = first_node(node_online_map);
> +        /* Figure out which NODE CPUs are close. */
> +        for_each_online_node ( j )
> +        {
> +            int distance;
>  
> -        spin_lock(&heap_lock);
> +            if ( i == j )
> +                continue;

This could be replaced with cpumask_empty(&node_to_cpumask(j)),
allowing ...

> +            distance = __node_distance(i, j);
> +            if ( distance < last_distance )
> +            {
> +                if ( cpumask_empty(&node_to_cpumask(j)) )
> +                    continue;

this check to be dropped.

Jan

* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)
From: Konrad Rzeszutek Wilk @ 2014-06-05 17:49 UTC
  To: Jan Beulich
  Cc: andrew.cooper3, dario.faggioli, julien.grall, tim, xen-devel,
	Malcolm Crossley

On Thu, Jun 05, 2014 at 12:22:41PM +0100, Jan Beulich wrote:
> >>> On 04.06.14 at 15:29, <konrad.wilk@oracle.com> wrote:
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes
> >  and reboots.
> >  
> > +### bootscrub_chunk_
> 
> Looking at other examples in that file, underscores appear to need
> backslash escaping here. And I don't think the trailing one should
> be there.
> 
> > +> `= <size>`
> > +
> > +> Default: `128MB`
> > +
> > +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> > +and not running softirqs. Reduce this if softirqs are not being run frequently
> > +enough. Setting this to a high value may cause cause boot failure, particularly
> 
> Double "cause".
> 
> > +static void __init smp_scrub_heap_pages(void *data)
> > +{
> > +    unsigned long mfn, start, end;
> > +    struct page_info *pg;
> > +    struct scrub_region *r;
> > +    unsigned int temp_cpu, node, cpu_idx = 0;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    if ( data )
> > +        r = data;
> > +    else
> > +    {
> > +        node = cpu_to_node(cpu);
> > +        if ( node == NUMA_NO_NODE )
> > +            return;
> > +        r = &region[node];
> > +    }
> > +
> > +    /* Determine the current CPU's index into CPU's linked to this node. */
> > +    for_each_cpu ( temp_cpu, &r->cpus )
> > +    {
> > +        if ( cpu == temp_cpu )
> > +            break;
> > +        cpu_idx++;
> > +    }
> > +
> > +    /* Calculate the starting mfn for this CPU's memory block. */
> > +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
> > +
> > +    /* Calculate the end mfn into this CPU's memory block for this iteration. */
> > +    if ( r->offset + chunk_size >= r->per_cpu_sz )
> > +    {
> > +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
> > +
> > +        if ( r->rem && ((cpumask_weight(&r->cpus) - 1 == cpu_idx )) )
> > +            end += r->rem;
> > +    }
> > +    else
> > +        end = start + chunk_size;
> > +
> > +    for ( mfn = start; mfn < end; mfn++ )
> > +    {
> > +        pg = mfn_to_page(mfn);
> > +
> > +        /* Check the mfn is valid and page is free. */
> > +        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> > +            continue;
> > +
> > +        scrub_one_page(pg);
> > +    }
> > +}
> > +
> > +static int __init find_non_smt(unsigned int node, cpumask_t *dest)
> > +{
> > +    cpumask_t node_cpus;
> > +    unsigned int i, cpu;
> > +
> > +    cpumask_and(&node_cpus, &node_to_cpumask(node), &cpu_online_map);
> > +    cpumask_clear(dest);
> > +    for_each_cpu ( i, &node_cpus )
> > +    {
> > +        if ( cpumask_intersects(dest, per_cpu(cpu_sibling_mask, i)) )
> > +            continue;
> > +        cpu = cpumask_first(per_cpu(cpu_sibling_mask, i));
> > +        cpumask_set_cpu(cpu, dest);
> > +    }
> > +    return cpumask_weight(dest);
> > +}
> > +
> >  /*
> > - * Scrub all unallocated pages in all heap zones. This function is more
> > - * convoluted than appears necessary because we do not want to continuously
> > - * hold the lock while scrubbing very large memory areas.
> > + * Scrub all unallocated pages in all heap zones. This function uses all
> > + * online cpu's to scrub the memory in parallel.
> >   */
> >  void __init scrub_heap_pages(void)
> >  {
> > -    unsigned long mfn;
> > -    struct page_info *pg;
> > +    cpumask_t node_cpus, all_worker_cpus;
> > +    unsigned int i, j;
> > +    unsigned long offset, max_per_cpu_sz = 0;
> > +    unsigned long start, end;
> > +    unsigned long rem = 0;
> > +    int last_distance, best_node;
> >  
> >      if ( !opt_bootscrub )
> >          return;
> >  
> > -    printk("Scrubbing Free RAM: ");
> > +    cpumask_clear(&all_worker_cpus);
> > +    /* Scrub block size. */
> > +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> > +    if ( chunk_size == 0 )
> > +        chunk_size = MB(128);
> > +
> > +    /* Round #0 - figure out amounts and which CPUs to use. */
> > +    for_each_online_node ( i )
> > +    {
> > +        if ( !node_spanned_pages(i) )
> > +            continue;
> > +        /* Calculate Node memory start and end address. */
> > +        start = max(node_start_pfn(i), first_valid_mfn);
> 
> This implies you're aware that possibly node_start_pfn(i) <
> first_valid_mfn.
> 
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> 
> Which in turn means that this may yield end < start. Is all the rest
> of the code prepared to deal with this? At least the divide and
> modulo operations on the difference further down don't look like
> they are.

It will loop forever. I think adding

	end = max(end, start);

would fix it.
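
I.e., roughly where the guard would go in round #0:

    start = max(node_start_pfn(i), first_valid_mfn);
    end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
    end = max(end, start);  /* node sits entirely below first_valid_mfn */
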
> 
> > +        /* CPUs that are online and on this node (if none, that it is OK). */
> > +        find_non_smt(i, &node_cpus);
> 
> Here you could latch the weight, avoiding ... 
> 
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> > +        if ( cpumask_empty(&node_cpus) )
> 
> ... the extra operation on the mask here and the explicit use of
> cpumask_weight() on node_cpus in the else path.

<nods>
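
Something like this, I presume (with find_non_smt() still returning the
weight):

    unsigned int weight = find_non_smt(i, &node_cpus); /* latch it once */

    cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
    if ( weight == 0 )
    {
        /* No CPUs on this node. Round #2 will take care of it. */
        rem = 0;
        region[i].per_cpu_sz = end - start;
    }
    else
    {
        rem = (end - start) % weight;
        region[i].per_cpu_sz = (end - start) / weight;
        if ( region[i].per_cpu_sz > max_per_cpu_sz )
            max_per_cpu_sz = region[i].per_cpu_sz;
    }
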
> 
> > +        {
> > +            /* No CPUs on this node. Round #2 will take of it. */
> > +            rem = 0;
> > +            region[i].per_cpu_sz = (end - start);
> > +        } else {
> 
> Coding style - this takes 3 lines.
> 
> > +            rem = (end - start) % cpumask_weight(&node_cpus);
> > +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> > +            if ( region[i].per_cpu_sz > max_per_cpu_sz )
> > +                max_per_cpu_sz = region[i].per_cpu_sz;
> > +        }
> > +        region[i].start = start;
> > +        region[i].rem = rem;
> > +        cpumask_copy(&region[i].cpus, &node_cpus);
> > +
> > +    }
> > +
> > +    printk("Scrubbing Free RAM on %u nodes using %u CPUs\n", num_online_nodes(),
> > +           cpumask_weight(&all_worker_cpus));
> >  
> > -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> > +    /* Round: #1 - do NUMA nodes with CPUs. */
> > +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
> >      {
> > +        for_each_online_node ( i )
> > +            region[i].offset = offset;
> > +
> >          process_pending_softirqs();
> >  
> > -        pg = mfn_to_page(mfn);
> > +        spin_lock(&heap_lock);
> > +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
> > +        spin_unlock(&heap_lock);
> >  
> > -        /* Quick lock-free check. */
> > -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
> > +        printk(".");
> > +    }
> > +
> > +    /*
> > +     * Round #2: NUMA nodes with no CPUs get scrubbed with CPUs on the node
> > +     * closest to us and with CPUs.
> > +     */
> > +    for_each_online_node ( i )
> > +    {
> > +        node_cpus = node_to_cpumask(i);
> > +
> > +        if ( !cpumask_empty(&node_cpus) )
> >              continue;
> >  
> > -        /* Every 100MB, print a progress dot. */
> > -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
> > -            printk(".");
> > +        last_distance = INT_MAX;
> > +        best_node = first_node(node_online_map);
> > +        /* Figure out which NODE CPUs are close. */
> > +        for_each_online_node ( j )
> > +        {
> > +            int distance;
> >  
> > -        spin_lock(&heap_lock);
> > +            if ( i == j )
> > +                continue;
> 
> This could be replaced with cpumask_empty(&node_to_cpumask(j)),
> allowing ...
> 
> > +            distance = __node_distance(i, j);
> > +            if ( distance < last_distance )
> > +            {
> > +                if ( cpumask_empty(&node_to_cpumask(j)) )
> > +                    continue;
> 
> this check to be dropped.

Clever! Will do.
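
I.e. the loop would become roughly:

    for_each_online_node ( j )
    {
        int distance;

        if ( cpumask_empty(&node_to_cpumask(j)) )
            continue;
        distance = __node_distance(i, j);
        if ( distance < last_distance )
        {
            last_distance = distance;
            best_node = j;
        }
    }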
> 
> Jan
