* [PATCH v2 0/9] Memory scrubbing from idle loop
@ 2017-04-03 16:50 Boris Ostrovsky
  2017-04-03 16:50 ` [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine Boris Ostrovsky
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

When a domain is destroyed the hypervisor must scrub the domain's pages before
giving them to another guest in order to prevent leaking the deceased
guest's data. Currently this is done during guest destruction, which can make
the cleanup process very lengthy.

This series adds support for scrubbing released pages from the idle loop,
making guest destruction significantly faster. For example, destroying a
1TB guest can now be completed in 40+ seconds, as opposed to about 9 minutes
with the existing scrubbing algorithm.

The downside of this series is that we sometimes fail to allocate high-order
sets of pages, since dirty pages may not yet have been merged into higher-order
sets while they are waiting to be scrubbed.

Briefly, the new algorithm places dirty pages at the end of the heap's page
list for each node/zone/order, so that finding dirty pages does not require
scanning the full list. One processor from each node checks whether the node
has any dirty pages and, if such pages are found, scrubs them. Scrubbing itself
happens without holding the heap lock, so other users may access the heap in
the meantime. If the heap allocator requests a chunk of pages that the idle
loop is currently scrubbing, scrubbing of that chunk is stopped immediately.
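
For orientation, the core of the idle-loop path (condensed from patch 4 below;
the lock-dropping and preemption refinements of patches 5-7 are omitted, so
this is only a rough sketch, not the final code) looks roughly like this:

    bool_t scrub_free_pages(void)
    {
        unsigned int i, zone, cpu = smp_processor_id();
        int order;
        nodeid_t node = node_to_scrub(true);        /* claim a node to scrub */

        if ( node == NUMA_NO_NODE )
            return false;                           /* nothing to do */

        spin_lock(&heap_lock);

        for ( zone = 0; zone < NR_ZONES; zone++ )
            for ( order = MAX_ORDER; order >= 0; order-- )
                while ( !page_list_empty(&heap(node, zone, order)) )
                {
                    /* Dirty buddies are kept at the tail of the list. */
                    struct page_info *pg =
                        page_list_last(&heap(node, zone, order));

                    if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
                        break;                      /* rest of this list is clean */

                    for ( i = 0; i < (1UL << order); i++ )
                    {
                        scrub_one_page(&pg[i]);
                        if ( softirq_pending(cpu) )
                            goto out;               /* stay preemptible */
                    }

                    pg->count_info &= ~PGC_need_scrub;
                    page_list_del(pg, &heap(node, zone, order));
                    merge_chunks(pg, node, zone, order);
                    node_need_scrub[node] -= 1UL << order;
                }

     out:
        spin_unlock(&heap_lock);
        node_clear(node, node_scrubbing);
        return softirq_pending(cpu) || (node_to_scrub(false) != NUMA_NO_NODE);
    }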

On the allocation side, alloc_heap_pages() first tries to satisfy the
allocation request using only clean pages. If this is not possible, the search
is repeated and any dirty pages found are scrubbed by the allocator.
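
In outline, the allocator-side fallback (a non-compilable sketch of the retry
added in patch 3; the surrounding node/zone search is elided) is:

     again:
        /* First pass: skip buddies that still need scrubbing (order > 0). */
        if ( (pg = page_list_remove_head(&heap(node, zone, j))) )
        {
            if ( (order == 0) || use_unscrubbed ||
                 !test_bit(_PGC_need_scrub, &pg->count_info) )
                goto found;
            page_list_add_tail(pg, &heap(node, zone, j));   /* put it back */
        }
        /* ... */
     not_found:
        if ( !use_unscrubbed )
        {
            use_unscrubbed = 1;     /* second pass: dirty buddies are fine, */
            goto again;             /* the allocator will scrub them itself */
        }
        return NULL;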

This series is somewhat based on earlier work by Bob Liu.

V1:
* Only set PGC_need_scrub bit for the buddy head, thus making it unnecessary
  to scan whole buddy
* Fix spin_lock_cb()
* Scrub CPU-less nodes
* ARM support. Note that I have not been able to test this, only built the
  binary
* Added scrub test patch (last one). Not sure whether it should be considered
  for committing but I have been running with it.

V2:
* merge_chunks() returns new buddy head
* scrub_free_pages() returns softirq-pending status in addition to the
  (factored out) status of unscrubbed memory
* _spin_lock() uses the inlined _spin_lock_cb()
* Scrub debugging code checks the whole page, not just the first word.

Deferred:
* Per-node heap locks. In addition to (presumably) improving performance in
  general, once they are available we can parallelize scrubbing further by
  allowing more than one core per node to do idle loop scrubbing.
* AVX-based scrubbing
* Use idle loop scrubbing during boot.


Boris Ostrovsky (9):
  mm: Separate free page chunk merging into its own routine
  mm: Place unscrubbed pages at the end of pagelist
  mm: Scrub pages in alloc_heap_pages() if needed
  mm: Scrub memory from idle loop
  mm: Do not discard already-scrubbed pages if softirqs are pending
  spinlock: Introduce spin_lock_cb()
  mm: Keep pages available for allocation while scrubbing
  mm: Print number of unscrubbed pages in 'H' debug handler
  mm: Make sure pages are scrubbed

 xen/Kconfig.debug          |    7 +
 xen/arch/arm/domain.c      |   13 +-
 xen/arch/x86/domain.c      |    3 +-
 xen/common/page_alloc.c    |  475 +++++++++++++++++++++++++++++++++++++-------
 xen/common/spinlock.c      |    9 +-
 xen/include/asm-arm/mm.h   |    8 +
 xen/include/asm-x86/mm.h   |    8 +
 xen/include/xen/mm.h       |    1 +
 xen/include/xen/spinlock.h |    3 +
 9 files changed, 451 insertions(+), 76 deletions(-)



* [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-04 11:16   ` Jan Beulich
  2017-04-03 16:50 ` [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist Boris Ostrovsky
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

This is needed for subsequent changes to memory scrubbing.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---

Changes in v2:
* Return boolean (as opposed to integer) values in can_merge()
* Return new buddy head from merge_chunks()


 xen/common/page_alloc.c |   88 +++++++++++++++++++++++++++++-----------------
 1 files changed, 55 insertions(+), 33 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 68dba19..15fd7f4 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -924,11 +924,64 @@ static int reserve_offlined_page(struct page_info *head)
     return count;
 }
 
+static bool_t can_merge(struct page_info *buddy, unsigned int node,
+                        unsigned int order)
+{
+    if ( !mfn_valid(_mfn(page_to_mfn(buddy))) ||
+         !page_state_is(buddy, free) ||
+         (PFN_ORDER(buddy) != order) ||
+         (phys_to_nid(page_to_maddr(buddy)) != node) )
+        return false;
+
+    return true;
+}
+
+/* Returns new buddy head. */
+static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
+                                      unsigned int zone, unsigned int order)
+{
+    ASSERT(spin_is_locked(&heap_lock));
+
+    /* Merge chunks as far as possible. */
+    while ( order < MAX_ORDER )
+    {
+        unsigned long mask = 1UL << order;
+        struct page_info *buddy;
+
+        if ( (page_to_mfn(pg) & mask) )
+        {
+            /* Merge with predecessor block? */
+            buddy = pg - mask;
+            if ( !can_merge(buddy, node, order) )
+                break;
+
+            pg = buddy;
+            page_list_del(pg, &heap(node, zone, order));
+        }
+        else
+        {
+            /* Merge with successor block? */
+            buddy = pg + mask;
+            if ( !can_merge(buddy, node, order) )
+                break;
+
+            page_list_del(buddy, &heap(node, zone, order));
+        }
+
+        order++;
+    }
+
+    PFN_ORDER(pg) = order;
+    page_list_add_tail(pg, &heap(node, zone, order));
+
+    return pg;
+}
+
 /* Free 2^@order set of pages. */
 static void free_heap_pages(
     struct page_info *pg, unsigned int order)
 {
-    unsigned long mask, mfn = page_to_mfn(pg);
+    unsigned long mfn = page_to_mfn(pg);
     unsigned int i, node = phys_to_nid(page_to_maddr(pg)), tainted = 0;
     unsigned int zone = page_to_zone(pg);
 
@@ -975,38 +1028,7 @@ static void free_heap_pages(
         midsize_alloc_zone_pages = max(
             midsize_alloc_zone_pages, total_avail_pages / MIDSIZE_ALLOC_FRAC);
 
-    /* Merge chunks as far as possible. */
-    while ( order < MAX_ORDER )
-    {
-        mask = 1UL << order;
-
-        if ( (page_to_mfn(pg) & mask) )
-        {
-            /* Merge with predecessor block? */
-            if ( !mfn_valid(_mfn(page_to_mfn(pg-mask))) ||
-                 !page_state_is(pg-mask, free) ||
-                 (PFN_ORDER(pg-mask) != order) ||
-                 (phys_to_nid(page_to_maddr(pg-mask)) != node) )
-                break;
-            pg -= mask;
-            page_list_del(pg, &heap(node, zone, order));
-        }
-        else
-        {
-            /* Merge with successor block? */
-            if ( !mfn_valid(_mfn(page_to_mfn(pg+mask))) ||
-                 !page_state_is(pg+mask, free) ||
-                 (PFN_ORDER(pg+mask) != order) ||
-                 (phys_to_nid(page_to_maddr(pg+mask)) != node) )
-                break;
-            page_list_del(pg + mask, &heap(node, zone, order));
-        }
-
-        order++;
-    }
-
-    PFN_ORDER(pg) = order;
-    page_list_add_tail(pg, &heap(node, zone, order));
+    pg = merge_chunks(pg, node, zone, order);
 
     if ( tainted )
         reserve_offlined_page(pg);
-- 
1.7.1



* [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
  2017-04-03 16:50 ` [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-04 14:46   ` Jan Beulich
  2017-04-03 16:50 ` [PATCH v2 3/9] mm: Scrub pages in alloc_heap_pages() if needed Boris Ostrovsky
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

. so that it's easy to find pages that need to be scrubbed (those pages are
now marked with the _PGC_need_scrub bit).

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Changes in v2:
* Added page_list_add_scrub()
* Mark pages as needing a scrub irrespective of 'tainted' in free_heap_pages()

 xen/common/page_alloc.c  |  106 +++++++++++++++++++++++++++++++++++++---------
 xen/include/asm-arm/mm.h |    4 ++
 xen/include/asm-x86/mm.h |    4 ++
 3 files changed, 94 insertions(+), 20 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 15fd7f4..56486a8 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -383,6 +383,8 @@ typedef struct page_list_head heap_by_zone_and_order_t[NR_ZONES][MAX_ORDER+1];
 static heap_by_zone_and_order_t *_heap[MAX_NUMNODES];
 #define heap(node, zone, order) ((*_heap[node])[zone][order])
 
+static unsigned long node_need_scrub[MAX_NUMNODES];
+
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
 
@@ -683,6 +685,20 @@ static void check_low_mem_virq(void)
     }
 }
 
+/* Pages that need scrub are added to tail, otherwise to head. */
+static void page_list_add_scrub(struct page_info *pg, unsigned int node,
+                                unsigned int zone, unsigned int order,
+                                bool_t need_scrub)
+{
+    if ( need_scrub )
+    {
+        pg->count_info |= PGC_need_scrub;
+        page_list_add_tail(pg, &heap(node, zone, order));
+    }
+    else
+        page_list_add(pg, &heap(node, zone, order));
+}
+
 /* Allocate 2^@order contiguous pages. */
 static struct page_info *alloc_heap_pages(
     unsigned int zone_lo, unsigned int zone_hi,
@@ -807,7 +823,7 @@ static struct page_info *alloc_heap_pages(
     while ( j != order )
     {
         PFN_ORDER(pg) = --j;
-        page_list_add_tail(pg, &heap(node, zone, j));
+        page_list_add(pg, &heap(node, zone, j));
         pg += 1 << j;
     }
 
@@ -827,6 +843,8 @@ static struct page_info *alloc_heap_pages(
         BUG_ON(pg[i].count_info != PGC_state_free);
         pg[i].count_info = PGC_state_inuse;
 
+        BUG_ON(test_bit(_PGC_need_scrub, &pg[i].count_info));
+
         if ( !(memflags & MEMF_no_tlbflush) )
             accumulate_tlbflush(&need_tlbflush, &pg[i],
                                 &tlbflush_timestamp);
@@ -856,6 +874,7 @@ static int reserve_offlined_page(struct page_info *head)
     int zone = page_to_zone(head), i, head_order = PFN_ORDER(head), count = 0;
     struct page_info *cur_head;
     int cur_order;
+    bool_t need_scrub = !!test_bit(_PGC_need_scrub, &head->count_info);
 
     ASSERT(spin_is_locked(&heap_lock));
 
@@ -897,8 +916,8 @@ static int reserve_offlined_page(struct page_info *head)
             {
             merge:
                 /* We don't consider merging outside the head_order. */
-                page_list_add_tail(cur_head, &heap(node, zone, cur_order));
                 PFN_ORDER(cur_head) = cur_order;
+                page_list_add_scrub(cur_head, node, zone, cur_order, need_scrub);
                 cur_head += (1 << cur_order);
                 break;
             }
@@ -925,7 +944,7 @@ static int reserve_offlined_page(struct page_info *head)
 }
 
 static bool_t can_merge(struct page_info *buddy, unsigned int node,
-                        unsigned int order)
+                        unsigned int order, bool_t need_scrub)
 {
     if ( !mfn_valid(_mfn(page_to_mfn(buddy))) ||
          !page_state_is(buddy, free) ||
@@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
          (phys_to_nid(page_to_maddr(buddy)) != node) )
         return false;
 
+    if ( need_scrub !=
+         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
+        return false;
+
     return true;
 }
 
@@ -940,6 +963,8 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
 static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
                                       unsigned int zone, unsigned int order)
 {
+    bool_t need_scrub = !!test_bit(_PGC_need_scrub, &pg->count_info);
+
     ASSERT(spin_is_locked(&heap_lock));
 
     /* Merge chunks as far as possible. */
@@ -952,9 +977,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
         {
             /* Merge with predecessor block? */
             buddy = pg - mask;
-            if ( !can_merge(buddy, node, order) )
+            if ( !can_merge(buddy, node, order, need_scrub) )
                 break;
 
+            pg->count_info &= ~PGC_need_scrub;
             pg = buddy;
             page_list_del(pg, &heap(node, zone, order));
         }
@@ -962,9 +988,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
         {
             /* Merge with successor block? */
             buddy = pg + mask;
-            if ( !can_merge(buddy, node, order) )
+            if ( !can_merge(buddy, node, order, need_scrub) )
                 break;
 
+            buddy->count_info &= ~PGC_need_scrub;
             page_list_del(buddy, &heap(node, zone, order));
         }
 
@@ -972,14 +999,51 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
     }
 
     PFN_ORDER(pg) = order;
-    page_list_add_tail(pg, &heap(node, zone, order));
+    page_list_add_scrub(pg, node, zone, order, need_scrub);
 
     return pg;
 }
 
+static void scrub_free_pages(unsigned int node)
+{
+    struct page_info *pg;
+    unsigned int i, zone;
+    int order;
+
+    ASSERT(spin_is_locked(&heap_lock));
+
+    if ( !node_need_scrub[node] )
+        return;
+
+    for ( zone = 0; zone < NR_ZONES; zone++ )
+    {
+        for ( order = MAX_ORDER; order >= 0; order-- )
+        {
+            while ( !page_list_empty(&heap(node, zone, order)) )
+            {
+                /* Unscrubbed pages are always at the end of the list. */
+                pg = page_list_last(&heap(node, zone, order));
+                if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
+                    break;
+
+                for ( i = 0; i < (1UL << order); i++)
+                    scrub_one_page(&pg[i]);
+
+                pg->count_info &= ~PGC_need_scrub;
+
+                page_list_del(pg, &heap(node, zone, order));
+                (void)merge_chunks(pg, node, zone, order);
+
+                node_need_scrub[node] -= (1UL << order);
+            }
+        }
+    }
+ }
+
+
 /* Free 2^@order set of pages. */
 static void free_heap_pages(
-    struct page_info *pg, unsigned int order)
+    struct page_info *pg, unsigned int order, bool_t need_scrub)
 {
     unsigned long mfn = page_to_mfn(pg);
     unsigned int i, node = phys_to_nid(page_to_maddr(pg)), tainted = 0;
@@ -1028,11 +1092,20 @@ static void free_heap_pages(
         midsize_alloc_zone_pages = max(
             midsize_alloc_zone_pages, total_avail_pages / MIDSIZE_ALLOC_FRAC);
 
+    if ( need_scrub )
+    {
+        pg->count_info |= PGC_need_scrub;
+        node_need_scrub[node] += (1UL << order);
+    }
+
     pg = merge_chunks(pg, node, zone, order);
 
     if ( tainted )
         reserve_offlined_page(pg);
 
+    if ( need_scrub )
+        scrub_free_pages(node);
+
     spin_unlock(&heap_lock);
 }
 
@@ -1253,7 +1326,7 @@ unsigned int online_page(unsigned long mfn, uint32_t *status)
     spin_unlock(&heap_lock);
 
     if ( (y & PGC_state) == PGC_state_offlined )
-        free_heap_pages(pg, 0);
+        free_heap_pages(pg, 0, 0);
 
     return ret;
 }
@@ -1322,7 +1395,7 @@ static void init_heap_pages(
             nr_pages -= n;
         }
 
-        free_heap_pages(pg+i, 0);
+        free_heap_pages(pg + i, 0, 0);
     }
 }
 
@@ -1649,7 +1722,7 @@ void free_xenheap_pages(void *v, unsigned int order)
 
     memguard_guard_range(v, 1 << (order + PAGE_SHIFT));
 
-    free_heap_pages(virt_to_page(v), order);
+    free_heap_pages(virt_to_page(v), order, 0);
 }
 
 #else
@@ -1703,12 +1776,9 @@ void free_xenheap_pages(void *v, unsigned int order)
     pg = virt_to_page(v);
 
     for ( i = 0; i < (1u << order); i++ )
-    {
-        scrub_one_page(&pg[i]);
         pg[i].count_info &= ~PGC_xen_heap;
-    }
 
-    free_heap_pages(pg, order);
+    free_heap_pages(pg, order, 1);
 }
 
 #endif
@@ -1817,7 +1887,7 @@ struct page_info *alloc_domheap_pages(
     if ( d && !(memflags & MEMF_no_owner) &&
          assign_pages(d, pg, order, memflags) )
     {
-        free_heap_pages(pg, order);
+        free_heap_pages(pg, order, 0);
         return NULL;
     }
     
@@ -1885,11 +1955,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             scrub = 1;
         }
 
-        if ( unlikely(scrub) )
-            for ( i = 0; i < (1 << order); i++ )
-                scrub_one_page(&pg[i]);
-
-        free_heap_pages(pg, order);
+        free_heap_pages(pg, order, scrub);
     }
 
     if ( drop_dom_ref )
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index 4892155..149940b 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -107,6 +107,10 @@ struct page_info
 #define PGC_count_width   PG_shift(9)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
+/* Page needs to be scrubbed */
+#define _PGC_need_scrub   PG_shift(10)
+#define PGC_need_scrub    PG_mask(1, 10)
+
 extern unsigned long xenheap_mfn_start, xenheap_mfn_end;
 extern vaddr_t xenheap_virt_end;
 #ifdef CONFIG_ARM_64
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index e22603c..f3d4443 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -233,6 +233,10 @@ struct page_info
 #define PGC_count_width   PG_shift(9)
 #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
 
+/* Page needs to be scrubbed */
+#define _PGC_need_scrub   PG_shift(10)
+#define PGC_need_scrub    PG_mask(1, 10)
+
 struct spage_info
 {
        unsigned long type_info;
-- 
1.7.1



* [PATCH v2 3/9] mm: Scrub pages in alloc_heap_pages() if needed
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
  2017-04-03 16:50 ` [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine Boris Ostrovsky
  2017-04-03 16:50 ` [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-03 16:50 ` [PATCH v2 4/9] mm: Scrub memory from idle loop Boris Ostrovsky
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

When allocating pages in alloc_heap_pages(), first look for clean pages. If
none are found then retry, this time taking pages marked as unscrubbed and
scrubbing them.

Note that we shouldn't find unscrubbed pages in alloc_heap_pages() yet. However,
this will become possible when we stop scrubbing from free_heap_pages() and
instead do it from the idle loop.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 xen/common/page_alloc.c |   87 ++++++++++++++++++++++++++++++++++-------------
 1 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 56486a8..a560d3e 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -705,34 +705,17 @@ static struct page_info *alloc_heap_pages(
     unsigned int order, unsigned int memflags,
     struct domain *d)
 {
-    unsigned int i, j, zone = 0, nodemask_retry = 0;
-    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
+    unsigned int i, j, zone, nodemask_retry;
+    nodeid_t first_node, node, req_node;
     unsigned long request = 1UL << order;
     struct page_info *pg;
-    nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
-    bool_t need_tlbflush = 0;
+    nodemask_t nodemask;
+    bool_t need_scrub, need_tlbflush = 0, use_unscrubbed = 0;
     uint32_t tlbflush_timestamp = 0;
 
     /* Make sure there are enough bits in memflags for nodeID. */
     BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t)));
 
-    if ( node == NUMA_NO_NODE )
-    {
-        if ( d != NULL )
-        {
-            node = next_node(d->last_alloc_node, nodemask);
-            if ( node >= MAX_NUMNODES )
-                node = first_node(nodemask);
-        }
-        if ( node >= MAX_NUMNODES )
-            node = cpu_to_node(smp_processor_id());
-    }
-    first_node = node;
-
-    ASSERT(node < MAX_NUMNODES);
-    ASSERT(zone_lo <= zone_hi);
-    ASSERT(zone_hi < NR_ZONES);
-
     if ( unlikely(order > MAX_ORDER) )
         return NULL;
 
@@ -746,7 +729,10 @@ static struct page_info *alloc_heap_pages(
           total_avail_pages + tmem_freeable_pages()) &&
           ((memflags & MEMF_no_refcount) ||
            !d || d->outstanding_pages < request) )
-        goto not_found;
+    {
+        spin_unlock(&heap_lock);
+        return NULL;
+    }
 
     /*
      * TMEM: When available memory is scarce due to tmem absorbing it, allow
@@ -759,6 +745,28 @@ static struct page_info *alloc_heap_pages(
          tmem_freeable_pages() )
         goto try_tmem;
 
+ again:
+
+    nodemask_retry = 0;
+    nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
+    node = req_node = MEMF_get_node(memflags);
+    if ( node == NUMA_NO_NODE )
+    {
+        if ( d != NULL )
+        {
+            node = next_node(d->last_alloc_node, nodemask);
+            if ( node >= MAX_NUMNODES )
+                node = first_node(nodemask);
+        }
+        if ( node >= MAX_NUMNODES )
+            node = cpu_to_node(smp_processor_id());
+    }
+    first_node = node;
+
+    ASSERT(node < MAX_NUMNODES);
+    ASSERT(zone_lo <= zone_hi);
+    ASSERT(zone_hi < NR_ZONES);
+
     /*
      * Start with requested node, but exhaust all node memory in requested 
      * zone before failing, only calc new node value if we fail to find memory 
@@ -774,8 +782,16 @@ static struct page_info *alloc_heap_pages(
 
             /* Find smallest order which can satisfy the request. */
             for ( j = order; j <= MAX_ORDER; j++ )
+            {
                 if ( (pg = page_list_remove_head(&heap(node, zone, j))) )
-                    goto found;
+                {
+                    if ( (order == 0) || use_unscrubbed ||
+                         !test_bit(_PGC_need_scrub, &pg->count_info) )
+                        goto found;
+
+                    page_list_add_tail(pg, &heap(node, zone, j));
+                }
+            }
         } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
 
         if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
@@ -814,18 +830,32 @@ static struct page_info *alloc_heap_pages(
     }
 
  not_found:
+    /*
+     * If we couldn't find clean page let's search again and this time
+     * take unscrubbed pages if available.
+     */
+    if ( !use_unscrubbed )
+    {
+        use_unscrubbed = 1;
+        goto again;
+    }
+
     /* No suitable memory blocks. Fail the request. */
     spin_unlock(&heap_lock);
     return NULL;
 
  found: 
+    need_scrub = !!test_bit(_PGC_need_scrub, &pg->count_info);
+
     /* We may have to halve the chunk a number of times. */
     while ( j != order )
     {
         PFN_ORDER(pg) = --j;
-        page_list_add(pg, &heap(node, zone, j));
+        page_list_add_scrub(pg, node, zone, j, need_scrub);
         pg += 1 << j;
     }
+    if ( need_scrub )
+        pg->count_info |= PGC_need_scrub;
 
     ASSERT(avail[node][zone] >= request);
     avail[node][zone] -= request;
@@ -837,6 +867,15 @@ static struct page_info *alloc_heap_pages(
     if ( d != NULL )
         d->last_alloc_node = node;
 
+    if ( need_scrub )
+    {
+        for ( i = 0; i < (1 << order); i++ )
+            scrub_one_page(&pg[i]);
+        pg->count_info &= ~PGC_need_scrub;
+        node_need_scrub[node] -= (1 << order);
+    }
+
+
     for ( i = 0; i < (1 << order); i++ )
     {
         /* Reference count must continuously be zero for free pages. */
-- 
1.7.1



* [PATCH v2 4/9] mm: Scrub memory from idle loop
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (2 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 3/9] mm: Scrub pages in alloc_heap_pages() if needed Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-12 16:11   ` Jan Beulich
  2017-04-03 16:50 ` [PATCH v2 5/9] mm: Do not discard already-scrubbed pages if softirqs are pending Boris Ostrovsky
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Changes in v2:
* Added node_to_scrub()
* Include softirq_pending() in scrub_free_pages()'s return value


 xen/arch/arm/domain.c   |   13 +++++++----
 xen/arch/x86/domain.c   |    3 +-
 xen/common/page_alloc.c |   52 ++++++++++++++++++++++++++++++++++++++--------
 xen/include/xen/mm.h    |    1 +
 4 files changed, 54 insertions(+), 15 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index bb327da..fdf06e1 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -45,13 +45,16 @@ void idle_loop(void)
         if ( cpu_is_offline(smp_processor_id()) )
             stop_cpu();
 
-        local_irq_disable();
-        if ( cpu_is_haltable(smp_processor_id()) )
+        if ( !scrub_free_pages() )
         {
-            dsb(sy);
-            wfi();
+            local_irq_disable();
+            if ( cpu_is_haltable(smp_processor_id()) )
+            {
+                dsb(sy);
+                wfi();
+            }
+            local_irq_enable();
         }
-        local_irq_enable();
 
         do_tasklet();
         do_softirq();
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 90e2b1f..a5f62b5 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -118,7 +118,8 @@ static void idle_loop(void)
     {
         if ( cpu_is_offline(smp_processor_id()) )
             play_dead();
-        (*pm_idle)();
+        if ( !scrub_free_pages() )
+            (*pm_idle)();
         do_tasklet();
         do_softirq();
         /*
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index a560d3e..5cc489a 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1043,16 +1043,44 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
     return pg;
 }
 
-static void scrub_free_pages(unsigned int node)
+static nodemask_t node_scrubbing;
+static unsigned int node_to_scrub(bool_t get_node)
+{
+    nodeid_t node = cpu_to_node(smp_processor_id()), local_node;
+
+    if ( node == NUMA_NO_NODE )
+        node = 0;
+    local_node = node;
+
+    /*
+     * Check the local node first and then see if there are any memory-only
+     * nodes that may need scrubbing.
+     */
+    while ( 1 )
+    {
+        if ( node_need_scrub[node] &&
+             (!node_test_and_set(node, node_scrubbing) || !get_node) )
+                return node;
+        do {
+            node = cycle_node(node, node_online_map);
+            if ( node == local_node )
+                return NUMA_NO_NODE;
+        } while ( !cpumask_empty(&node_to_cpumask(node)) );
+    }
+}
+
+bool_t scrub_free_pages()
 {
     struct page_info *pg;
     unsigned int i, zone;
-    int order;
+    int order, cpu = smp_processor_id();
+    nodeid_t node;
 
-    ASSERT(spin_is_locked(&heap_lock));
+    node = node_to_scrub(true);
+    if ( node == NUMA_NO_NODE )
+        return false;
 
-    if ( !node_need_scrub[node] )
-        return;
+    spin_lock(&heap_lock);
 
     for ( zone = 0; zone < NR_ZONES; zone++ )
     {
@@ -1066,7 +1094,11 @@ static void scrub_free_pages(unsigned int node)
                     break;
 
                 for ( i = 0; i < (1UL << order); i++)
+                {
                     scrub_one_page(&pg[i]);
+                    if ( softirq_pending(cpu) )
+                        goto out;
+                }
 
                 pg->count_info &= ~PGC_need_scrub;
 
@@ -1077,7 +1109,12 @@ static void scrub_free_pages(unsigned int node)
             }
         }
     }
- }
+
+ out:
+    spin_unlock(&heap_lock);
+    node_clear(node, node_scrubbing);
+    return softirq_pending(cpu) || (node_to_scrub(false) != NUMA_NO_NODE);
+}
 
 
 /* Free 2^@order set of pages. */
@@ -1142,9 +1179,6 @@ static void free_heap_pages(
     if ( tainted )
         reserve_offlined_page(pg);
 
-    if ( need_scrub )
-        scrub_free_pages(node);
-
     spin_unlock(&heap_lock);
 }
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 88de3c1..3d93fcc 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -138,6 +138,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe);
 void xenheap_max_mfn(unsigned long mfn);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
+bool_t scrub_free_pages(void);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 /* Map machine page range in Xen virtual address space. */
-- 
1.7.1



* [PATCH v2 5/9] mm: Do not discard already-scrubbed pages if softirqs are pending
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (3 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 4/9] mm: Scrub memory from idle loop Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-13 15:41   ` Jan Beulich
  2017-04-03 16:50 ` [PATCH v2 6/9] spinlock: Introduce spin_lock_cb() Boris Ostrovsky
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 xen/common/page_alloc.c |   74 ++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 5cc489a..1c23991 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1000,7 +1000,8 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
 
 /* Returns new buddy head. */
 static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
-                                      unsigned int zone, unsigned int order)
+                                      unsigned int zone, unsigned int order,
+                                      bool_t is_frag)
 {
     bool_t need_scrub = !!test_bit(_PGC_need_scrub, &pg->count_info);
 
@@ -1025,9 +1026,12 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
         }
         else
         {
-            /* Merge with successor block? */
+            /*
+             * Merge with successor block?
+             * Only un-fragmented buddy can be merged forward.
+             */
             buddy = pg + mask;
-            if ( !can_merge(buddy, node, order, need_scrub) )
+            if ( is_frag || !can_merge(buddy, node, order, need_scrub) )
                 break;
 
             buddy->count_info &= ~PGC_need_scrub;
@@ -1069,10 +1073,13 @@ static unsigned int node_to_scrub(bool_t get_node)
     }
 }
 
+#define SCRUB_CHUNK_ORDER  8
 bool_t scrub_free_pages()
 {
     struct page_info *pg;
     unsigned int i, zone;
+    unsigned int num_scrubbed, scrub_order, start, end;
+    bool_t preempt, is_frag;
     int order, cpu = smp_processor_id();
     nodeid_t node;
 
@@ -1093,19 +1100,64 @@ bool_t scrub_free_pages()
                 if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
                     break;
 
-                for ( i = 0; i < (1UL << order); i++)
+                page_list_del(pg, &heap(node, zone, order));
+
+                scrub_order = MIN(order, SCRUB_CHUNK_ORDER);
+                num_scrubbed = 0;
+                is_frag = false;
+                while ( num_scrubbed < (1 << order) )
                 {
-                    scrub_one_page(&pg[i]);
+                    for ( i = 0; i < (1 << scrub_order); i++ )
+                        scrub_one_page(&pg[num_scrubbed + i]);
+
+                    num_scrubbed += (1 << scrub_order);
                     if ( softirq_pending(cpu) )
-                        goto out;
+                    {
+                        preempt = 1;
+                        is_frag = (num_scrubbed < (1 << order));
+                        break;
+                    }
                 }
 
-                pg->count_info &= ~PGC_need_scrub;
+                start = 0;
+                end = num_scrubbed;
 
-                page_list_del(pg, &heap(node, zone, order));
-                (void)merge_chunks(pg, node, zone, order);
+                /* Merge clean pages */
+                pg->count_info &= ~PGC_need_scrub;
+                while ( start < end )
+                {
+                    /*
+                     * Largest power-of-two chunk starting @start,
+                     * not greater than @end
+                     */
+                    unsigned chunk_order = flsl(end - start) - 1;
+                    struct page_info *ppg = &pg[start];
+
+                    node_need_scrub[node] -= (1 << chunk_order);
+
+                    PFN_ORDER(ppg) = chunk_order;
+                    (void)merge_chunks(ppg, node, zone, chunk_order, is_frag);
+                    start += (1 << chunk_order);
+                }
 
-                node_need_scrub[node] -= (1UL << order);
+                /* Merge unscrubbed pages */
+                while ( end < (1 << order) )
+                {
+                    /*
+                     * Largest power-of-two chunk starting @end, not crossing
+                     * next power-of-two boundary
+                     */
+                    unsigned chunk_order = ffsl(end) - 1;
+                    struct page_info *ppg = &pg[end];
+
+                    PFN_ORDER(ppg) = chunk_order;
+                    ppg->count_info |= PGC_need_scrub;
+                    (void)merge_chunks(ppg, node, zone, chunk_order, true);
+                    end += (1 << chunk_order);
+                 }
+
+                if ( preempt )
+                    goto out;
             }
         }
     }
@@ -1174,7 +1226,7 @@ static void free_heap_pages(
         node_need_scrub[node] += (1UL << order);
     }
 
-    pg = merge_chunks(pg, node, zone, order);
+    pg = merge_chunks(pg, node, zone, order, false);
 
     if ( tainted )
         reserve_offlined_page(pg);
-- 
1.7.1



* [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (4 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 5/9] mm: Do not discard already-scrubbed pages if softirqs are pending Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-13 15:46   ` Jan Beulich
  2017-04-03 16:50 ` [PATCH v2 7/9] mm: Keep pages available for allocation while scrubbing Boris Ostrovsky
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

While waiting for a lock we may want to periodically run some
code. We could use spin_trylock(), but since that doesn't take a lock
ticket it may take a long time until the lock is acquired.

Add spin_lock_cb() that allows us to execute a callback while waiting.
Also add spin_lock_kick() that will wake up the waiters.
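
A minimal usage sketch (not part of this patch; patch 7 is the real user, and
the check_abort()/wait-state names below are purely illustrative):

    /* Waiter side: spin for the ticket lock, polling a callback each loop. */
    static void check_abort(void *data)
    {
        struct scrub_wait_state *st = data;     /* cf. patch 7 */

        if ( st->pg->u.free.scrub_state & PAGE_SCRUB_ABORT )
            st->drop = true;                    /* lock holder wants the chunk */
    }

    spin_lock_cb(&heap_lock, check_abort, &st);

    /* Holder side: after setting the flag, wake up anyone spinning on the lock. */
    spin_lock_kick();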

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Changes in v2:
* Use inlined _spin_lock_cb() for _spin_lock()

 xen/common/spinlock.c      |    9 ++++++++-
 xen/include/xen/spinlock.h |    3 +++
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/xen/common/spinlock.c b/xen/common/spinlock.c
index 2a06406..3c1caae 100644
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -129,7 +129,7 @@ static always_inline u16 observe_head(spinlock_tickets_t *t)
     return read_atomic(&t->head);
 }
 
-void _spin_lock(spinlock_t *lock)
+void inline _spin_lock_cb(spinlock_t *lock, void (*cb)(void *), void *data)
 {
     spinlock_tickets_t tickets = SPINLOCK_TICKET_INC;
     LOCK_PROFILE_VAR;
@@ -140,6 +140,8 @@ void _spin_lock(spinlock_t *lock)
     while ( tickets.tail != observe_head(&lock->tickets) )
     {
         LOCK_PROFILE_BLOCK;
+        if ( unlikely(cb) )
+            cb(data);
         arch_lock_relax();
     }
     LOCK_PROFILE_GOT;
@@ -147,6 +149,11 @@ void _spin_lock(spinlock_t *lock)
     arch_lock_acquire_barrier();
 }
 
+void _spin_lock(spinlock_t *lock)
+{
+     _spin_lock_cb(lock, NULL, NULL);
+}
+
 void _spin_lock_irq(spinlock_t *lock)
 {
     ASSERT(local_irq_is_enabled());
diff --git a/xen/include/xen/spinlock.h b/xen/include/xen/spinlock.h
index c1883bd..1cd91b7 100644
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -153,6 +153,7 @@ typedef struct spinlock {
 #define spin_lock_init(l) (*(l) = (spinlock_t)SPIN_LOCK_UNLOCKED)
 
 void _spin_lock(spinlock_t *lock);
+void _spin_lock_cb(spinlock_t *lock, void (*cond)(void *), void *data);
 void _spin_lock_irq(spinlock_t *lock);
 unsigned long _spin_lock_irqsave(spinlock_t *lock);
 
@@ -169,6 +170,8 @@ void _spin_lock_recursive(spinlock_t *lock);
 void _spin_unlock_recursive(spinlock_t *lock);
 
 #define spin_lock(l)                  _spin_lock(l)
+#define spin_lock_cb(l, c, d)         _spin_lock_cb(l, c, d)
+#define spin_lock_kick(l)             arch_lock_signal()
 #define spin_lock_irq(l)              _spin_lock_irq(l)
 #define spin_lock_irqsave(l, f)                                 \
     ({                                                          \
-- 
1.7.1



* [PATCH v2 7/9] mm: Keep pages available for allocation while scrubbing
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (5 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 6/9] spinlock: Introduce spin_lock_cb() Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-13 15:59   ` Jan Beulich
  2017-04-03 16:50 ` [PATCH v2 8/9] mm: Print number of unscrubbed pages in 'H' debug handler Boris Ostrovsky
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

Instead of scrubbing pages while holding the heap lock we can mark the
buddy's head as being scrubbed and drop the lock temporarily.
If someone (most likely alloc_heap_pages()) tries to access
this chunk it will signal the scrubber to abort the scrub by setting the
head's PAGE_SCRUB_ABORT bit. The scrubber checks this bit after
processing each page and stops its work as soon as it sees it set.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 xen/common/page_alloc.c  |   76 ++++++++++++++++++++++++++++++++++++++++++++--
 xen/include/asm-arm/mm.h |    4 ++
 xen/include/asm-x86/mm.h |    4 ++
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 1c23991..666b79a 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -699,6 +699,18 @@ static void page_list_add_scrub(struct page_info *pg, unsigned int node,
         page_list_add(pg, &heap(node, zone, order));
 }
 
+static void check_and_stop_scrub(struct page_info *head)
+{
+    if ( head->u.free.scrub_state & PAGE_SCRUBBING )
+    {
+        head->u.free.scrub_state |= PAGE_SCRUB_ABORT;
+        smp_mb();
+        spin_lock_kick();
+        while ( ACCESS_ONCE(head->u.free.scrub_state) & PAGE_SCRUB_ABORT )
+            cpu_relax();
+    }
+}
+
 /* Allocate 2^@order contiguous pages. */
 static struct page_info *alloc_heap_pages(
     unsigned int zone_lo, unsigned int zone_hi,
@@ -785,10 +797,15 @@ static struct page_info *alloc_heap_pages(
             {
                 if ( (pg = page_list_remove_head(&heap(node, zone, j))) )
                 {
-                    if ( (order == 0) || use_unscrubbed ||
-                         !test_bit(_PGC_need_scrub, &pg->count_info) )
+                    if ( !test_bit(_PGC_need_scrub, &pg[0].count_info) )
                         goto found;
 
+                    if ( (order == 0) || use_unscrubbed )
+                    {
+                        check_and_stop_scrub(pg);
+                        goto found;
+                    }
+
                     page_list_add_tail(pg, &heap(node, zone, j));
                 }
             }
@@ -919,6 +936,8 @@ static int reserve_offlined_page(struct page_info *head)
 
     cur_head = head;
 
+    check_and_stop_scrub(head);
+
     page_list_del(head, &heap(node, zone, head_order));
 
     while ( cur_head < (head + (1 << head_order)) )
@@ -995,6 +1014,9 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
          !!test_bit(_PGC_need_scrub, &buddy->count_info) )
         return false;
 
+    if ( buddy->u.free.scrub_state & PAGE_SCRUBBING )
+        return false;
+
     return true;
 }
 
@@ -1074,12 +1096,34 @@ static unsigned int node_to_scrub(bool_t get_node)
 }
 
 #define SCRUB_CHUNK_ORDER  8
+
+struct scrub_wait_state {
+    struct page_info *pg;
+    bool_t drop;
+};
+
+static void scrub_continue(void *data)
+{
+    struct scrub_wait_state *st = (struct scrub_wait_state *)data;
+
+    if ( st->drop )
+        return;
+
+    if ( st->pg->u.free.scrub_state & PAGE_SCRUB_ABORT )
+    {
+        /* There is a waiter for this chunk. Release it. */
+        st->drop = true;
+        st->pg->u.free.scrub_state = 0;
+    }
+}
+
 bool_t scrub_free_pages()
 {
     struct page_info *pg;
     unsigned int i, zone;
     unsigned int num_scrubbed, scrub_order, start, end;
     bool_t preempt, is_frag;
+    struct scrub_wait_state st;
     int order, cpu = smp_processor_id();
     nodeid_t node;
 
@@ -1100,7 +1144,10 @@ bool_t scrub_free_pages()
                 if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
                     break;
 
-                page_list_del(pg, &heap(node, zone, order));
+                ASSERT(!pg->u.free.scrub_state);
+                pg->u.free.scrub_state = PAGE_SCRUBBING;
+
+                spin_unlock(&heap_lock);
 
                 scrub_order = MIN(order, SCRUB_CHUNK_ORDER);
                 num_scrubbed = 0;
@@ -1108,7 +1155,15 @@ bool_t scrub_free_pages()
                 while ( num_scrubbed < (1 << order) )
                 {
                     for ( i = 0; i < (1 << scrub_order); i++ )
+                    {
                         scrub_one_page(&pg[num_scrubbed + i]);
+                        if ( ACCESS_ONCE(pg->u.free.scrub_state) & PAGE_SCRUB_ABORT )
+                        {
+                            /* Someone wants this chunk. Drop everything. */
+                            pg->u.free.scrub_state = 0;
+                            goto out_nolock;
+                        }
+                    }
 
                     num_scrubbed += (1 << scrub_order);
                     if ( softirq_pending(cpu) )
@@ -1119,6 +1174,15 @@ bool_t scrub_free_pages()
                     }
                 }
 
+                st.pg = pg;
+                st.drop = false;
+                spin_lock_cb(&heap_lock, scrub_continue, &st);
+
+                if ( st.drop )
+                    goto out;
+
+                page_list_del(pg, &heap(node, zone, order));
+
                 start = 0;
                 end = num_scrubbed;
 
@@ -1156,6 +1220,8 @@ bool_t scrub_free_pages()
                     end += (1 << chunk_order);
                  }
 
+                pg->u.free.scrub_state = 0;
+
                 if ( preempt )
                     goto out;
             }
@@ -1164,6 +1230,8 @@ bool_t scrub_free_pages()
 
  out:
     spin_unlock(&heap_lock);
+
+ out_nolock:
     node_clear(node, node_scrubbing);
     return softirq_pending(cpu) || (node_to_scrub(false) != NUMA_NO_NODE);
 }
@@ -1203,6 +1271,8 @@ static void free_heap_pages(
         if ( page_state_is(&pg[i], offlined) )
             tainted = 1;
 
+        pg[i].u.free.scrub_state=0;
+
         /* If a page has no owner it will need no safety TLB flush. */
         pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL);
         if ( pg[i].u.free.need_tlbflush )
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index 149940b..b2d9dd3 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -35,6 +35,10 @@ struct page_info
         } inuse;
         /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
         struct {
+#define PAGE_SCRUBBING      (1<<1)
+#define PAGE_SCRUB_ABORT    (1<<2)
+            unsigned char scrub_state;
+
             /* Do TLBs need flushing for safety before next page use? */
             bool_t need_tlbflush;
         } free;
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index f3d4443..31e53e9 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -87,6 +87,10 @@ struct page_info
 
         /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
         struct {
+#define PAGE_SCRUBBING      (1<<1)
+#define PAGE_SCRUB_ABORT    (1<<2)
+            unsigned char scrub_state;
+
             /* Do TLBs need flushing for safety before next page use? */
             bool_t need_tlbflush;
         } free;
-- 
1.7.1



* [PATCH v2 8/9] mm: Print number of unscrubbed pages in 'H' debug handler
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (6 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 7/9] mm: Keep pages available for allocation while scrubbing Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-03 16:50 ` [PATCH v2 9/9] mm: Make sure pages are scrubbed Boris Ostrovsky
  2017-04-04 15:21 ` [PATCH v2 0/9] Memory scrubbing from idle loop George Dunlap
  9 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
 xen/common/page_alloc.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 666b79a..8273102 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2249,6 +2249,13 @@ static void dump_heap(unsigned char key)
             printk("heap[node=%d][zone=%d] -> %lu pages\n",
                    i, j, avail[i][j]);
     }
+
+    for ( i = 0; i < MAX_NUMNODES; i++ )
+    {
+        if ( !node_need_scrub[i] )
+            continue;
+        printk("Node %d has %lu unscrubbed pages\n", i, node_need_scrub[i]);
+    }
 }
 
 static __init int register_heap_trigger(void)
-- 
1.7.1



* [PATCH v2 9/9] mm: Make sure pages are scrubbed
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (7 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 8/9] mm: Print number of unscrubbed pages in 'H' debug handler Boris Ostrovsky
@ 2017-04-03 16:50 ` Boris Ostrovsky
  2017-04-04 15:21 ` [PATCH v2 0/9] Memory scrubbing from idle loop George Dunlap
  9 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-03 16:50 UTC (permalink / raw)
  To: xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich, Boris Ostrovsky

Add a debug Kconfig option that will make the page allocator verify
that pages that were supposed to be scrubbed are, in fact, clean.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Changes in v2:
* Check full page for having been scrubbed

Note that poison_one_page() may be unnecessary since the chances that a dirty
page starts with 0xc2c2c2c2c2c2c2c2 are fairly low.

 xen/Kconfig.debug       |    7 ++++++
 xen/common/page_alloc.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/xen/Kconfig.debug b/xen/Kconfig.debug
index 689f297..f3bf9a9 100644
--- a/xen/Kconfig.debug
+++ b/xen/Kconfig.debug
@@ -114,6 +114,13 @@ config DEVICE_TREE_DEBUG
 	  logged in the Xen ring buffer.
 	  If unsure, say N here.
 
+config SCRUB_DEBUG
+    bool "Page scrubbing test"
+    default DEBUG
+    ---help---
+      Verify that pages that need to be scrubbed before being allocated to
+      a guest are indeed scrubbed.
+
 endif # DEBUG || EXPERT
 
 endmenu
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 8273102..b82aa51 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -699,6 +699,31 @@ static void page_list_add_scrub(struct page_info *pg, unsigned int node,
         page_list_add(pg, &heap(node, zone, order));
 }
 
+#define SCRUB_BYTE_PATTERN 0xc2c2c2c2c2c2c2c2
+#ifdef CONFIG_SCRUB_DEBUG
+static void poison_one_page(struct page_info *pg)
+{
+    mfn_t mfn = _mfn(page_to_mfn(pg));
+    uint64_t *ptr;
+
+    ptr = map_domain_page(mfn);
+    *ptr = ~SCRUB_BYTE_PATTERN;
+    unmap_domain_page(ptr);
+}
+
+static void check_one_page(struct page_info *pg)
+{
+    mfn_t mfn = _mfn(page_to_mfn(pg));
+    uint64_t *ptr;
+    unsigned i;
+
+    ptr = map_domain_page(mfn);
+    for ( i = 0; i < PAGE_SIZE / sizeof (*ptr); i++ )
+        ASSERT(ptr[i] == SCRUB_BYTE_PATTERN);
+    unmap_domain_page(ptr);
+}
+#endif /* CONFIG_SCRUB_DEBUG */
+
 static void check_and_stop_scrub(struct page_info *head)
 {
     if ( head->u.free.scrub_state & PAGE_SCRUBBING )
@@ -913,6 +938,11 @@ static struct page_info *alloc_heap_pages(
          * guest can control its own visibility of/through the cache.
          */
         flush_page_to_ram(page_to_mfn(&pg[i]));
+
+#ifdef CONFIG_SCRUB_DEBUG
+        if ( d && !is_idle_domain(d) )
+            check_one_page(&pg[i]);
+#endif
     }
 
     spin_unlock(&heap_lock);
@@ -1294,6 +1324,11 @@ static void free_heap_pages(
     {
         pg->count_info |= PGC_need_scrub;
         node_need_scrub[node] += (1UL << order);
+
+#ifdef CONFIG_SCRUB_DEBUG
+        for ( i = 0; i < (1 << order); i++ )
+            poison_one_page(&pg[i]);
+#endif
     }
 
     pg = merge_chunks(pg, node, zone, order, false);
@@ -1590,6 +1625,14 @@ static void init_heap_pages(
             nr_pages -= n;
         }
 
+#ifdef CONFIG_SCRUB_DEBUG
+        /*
+         * These pages get into heap and are allocated to dom0 before
+         * boot scrub happens.
+         * Not scrubbing them here will cause failure in check_one_page().
+         */
+        scrub_one_page(pg + i);
+#endif
         free_heap_pages(pg + i, 0, 0);
     }
 }
@@ -2123,6 +2166,9 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             {
                 BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);
                 arch_free_heap_page(d, &pg[i]);
+#ifdef CONFIG_SCRUB_DEBUG
+                scrub_one_page(&pg[i]);
+#endif
             }
 
             drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
@@ -2226,7 +2272,8 @@ void scrub_one_page(struct page_info *pg)
 
 #ifndef NDEBUG
     /* Avoid callers relying on allocations returning zeroed pages. */
-    unmap_domain_page(memset(__map_domain_page(pg), 0xc2, PAGE_SIZE));
+    unmap_domain_page(memset(__map_domain_page(pg),
+                             SCRUB_BYTE_PATTERN & 0xff, PAGE_SIZE));
 #else
     /* For a production build, clear_page() is the fastest way to scrub. */
     clear_domain_page(_mfn(page_to_mfn(pg)));
-- 
1.7.1



* Re: [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine
  2017-04-03 16:50 ` [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine Boris Ostrovsky
@ 2017-04-04 11:16   ` Jan Beulich
  2017-04-04 13:48     ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-04 11:16 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -924,11 +924,64 @@ static int reserve_offlined_page(struct page_info *head)
>      return count;
>  }
>  
> +static bool_t can_merge(struct page_info *buddy, unsigned int node,

Plain bool please, and const for the pointer.

> +{
> +    if ( !mfn_valid(_mfn(page_to_mfn(buddy))) ||
> +         !page_state_is(buddy, free) ||

As a helper of freeing, this is fine, but then the name of both
functions is too generic: Fundamentally certain other page types
might be mergeable too.

> +         (PFN_ORDER(buddy) != order) ||
> +         (phys_to_nid(page_to_maddr(buddy)) != node) )
> +        return false;
> +
> +    return true;

Is there any reason not to make this a single return statement? I
don't think it would be any worse to read.
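
(For illustration, the single-return form would be something like

    return mfn_valid(_mfn(page_to_mfn(buddy))) &&
           page_state_is(buddy, free) &&
           (PFN_ORDER(buddy) == order) &&
           (phys_to_nid(page_to_maddr(buddy)) == node);

i.e. the same four checks joined with &&.)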

> +static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
> +                                      unsigned int zone, unsigned int order)
> +{
> +    ASSERT(spin_is_locked(&heap_lock));
> +
> +    /* Merge chunks as far as possible. */
> +    while ( order < MAX_ORDER )
> +    {
> +        unsigned long mask = 1UL << order;
> +        struct page_info *buddy;
> +
> +        if ( (page_to_mfn(pg) & mask) )
> +        {
> +            /* Merge with predecessor block? */
> +            buddy = pg - mask;
> +            if ( !can_merge(buddy, node, order) )
> +                break;
> +
> +            pg = buddy;
> +            page_list_del(pg, &heap(node, zone, order));
> +        }
> +        else
> +        {
> +            /* Merge with successor block? */
> +            buddy = pg + mask;
> +            if ( !can_merge(buddy, node, order) )
> +                break;
> +
> +            page_list_del(buddy, &heap(node, zone, order));

This and its companion in the if() branch are now the same (taking
into account the assignment ahead of the former) - please move
them past the if/else.

> +        }
> +
> +        order++;
> +    }
> +
> +    PFN_ORDER(pg) = order;
> +    page_list_add_tail(pg, &heap(node, zone, order));

I don't think this strictly is part of the merge anymore, so judging
by the name of the function this last line would rather belong into
the caller.
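
Illustratively (not a requirement), the call sites could then read along
the lines of

    pg = merge_chunks(pg, node, zone, order);
    page_list_add_tail(pg, &heap(node, zone, PFN_ORDER(pg)));

since the merge loop already records the final order in PFN_ORDER(pg).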

Jan



* Re: [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine
  2017-04-04 11:16   ` Jan Beulich
@ 2017-04-04 13:48     ` Boris Ostrovsky
  2017-04-04 14:01       ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-04 13:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/04/2017 07:16 AM, Jan Beulich wrote:
>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -924,11 +924,64 @@ static int reserve_offlined_page(struct page_info *head)
>>      return count;
>>  }
>>  
>> +static bool_t can_merge(struct page_info *buddy, unsigned int node,
> Plain bool please, and const for the pointer.

Wei asked for that too but it wasn't clear to me what the hypervisor
preference is. We currently have much higher use of bool_t than bool
(1015 vs 321, according to cscope). page_alloc.c uses only the former.

>
>> +{
>> +    if ( !mfn_valid(_mfn(page_to_mfn(buddy))) ||
>> +         !page_state_is(buddy, free) ||
> As a helper of freeing, this is fine, but then the name of both
> functions is too generic: Fundamentally certain other page types
> might be mergeable too.

merge_free_chunks()/can_merge_free() ?

>
>> +         (PFN_ORDER(buddy) != order) ||
>> +         (phys_to_nid(page_to_maddr(buddy)) != node) )
>> +        return false;
>> +
>> +    return true;
> Is there any reason not to make this a single return statement? I
> don't think it would be any worse to read.

See patches 2 and 7, where more checks are added. While those two might
be merged (they both deal with scrubbing), combining all 3 of them into
a single statement would make it a bit difficult to read IMO.
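
For reference, with the extra checks from patches 2 and 7 folded in, the
shape being aimed for is roughly the following (untested sketch pieced
together from the hunks quoted in this thread; the exact placement of the
patch 7 check is an assumption):

    static bool can_merge_free(const struct page_info *buddy, unsigned int node,
                               unsigned int order, bool need_scrub)
    {
        if ( !mfn_valid(_mfn(page_to_mfn(buddy))) ||
             !page_state_is(buddy, free) ||
             (PFN_ORDER(buddy) != order) ||
             (phys_to_nid(page_to_maddr(buddy)) != node) )
            return false;

        /* Patch 2: only merge with a buddy in the same scrub state. */
        if ( need_scrub !=
             !!test_bit(_PGC_need_scrub, &buddy->count_info) )
            return false;

        /* Patch 7 (assumed shape): don't merge with a buddy being scrubbed. */
        if ( buddy->u.free.scrub_state & PAGE_SCRUBBING )
            return false;

        return true;
    }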

>
>> +static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>> +                                      unsigned int zone, unsigned int order)
>> +{
>> +    ASSERT(spin_is_locked(&heap_lock));
>> +
>> +    /* Merge chunks as far as possible. */
>> +    while ( order < MAX_ORDER )
>> +    {
>> +        unsigned long mask = 1UL << order;
>> +        struct page_info *buddy;
>> +
>> +        if ( (page_to_mfn(pg) & mask) )
>> +        {
>> +            /* Merge with predecessor block? */
>> +            buddy = pg - mask;
>> +            if ( !can_merge(buddy, node, order) )
>> +                break;
>> +
>> +            pg = buddy;
>> +            page_list_del(pg, &heap(node, zone, order));
>> +        }
>> +        else
>> +        {
>> +            /* Merge with successor block? */
>> +            buddy = pg + mask;
>> +            if ( !can_merge(buddy, node, order) )
>> +                break;
>> +
>> +            page_list_del(buddy, &heap(node, zone, order));
> This and its companion in the if() branch are now the same (taking
> into account the assignment ahead of the former) - please move
> them past the if/else.
>
>> +        }
>> +
>> +        order++;
>> +    }
>> +
>> +    PFN_ORDER(pg) = order;
>> +    page_list_add_tail(pg, &heap(node, zone, order));
> I don't think this strictly is part of the merge anymore, so judging
> by the name of the function this last line would rather belong into
> the caller.

All users of merge_chunks() want to place the new chunk back into the
heap and I don't think I see why anyone would want to just get the new
buddy without putting it there.

-boris


* Re: [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine
  2017-04-04 13:48     ` Boris Ostrovsky
@ 2017-04-04 14:01       ` Jan Beulich
  2017-04-04 14:23         ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-04 14:01 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 04.04.17 at 15:48, <boris.ostrovsky@oracle.com> wrote:
> On 04/04/2017 07:16 AM, Jan Beulich wrote:
>>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -924,11 +924,64 @@ static int reserve_offlined_page(struct page_info *head)
>>>      return count;
>>>  }
>>>  
>>> +static bool_t can_merge(struct page_info *buddy, unsigned int node,
>> Plain bool please, and const for the pointer.
> 
> Wei asked for that too but it wasn't clear to me what the hypervisor
> preference is. We currently have much higher use of bool_t than bool
> (1015 vs 321, according to cscope). page_alloc.c uses only the former.

That's because in this release cycle we've only started the transition.

>>> +{
>>> +    if ( !mfn_valid(_mfn(page_to_mfn(buddy))) ||
>>> +         !page_state_is(buddy, free) ||
>> As a helper of freeing, this is fine, but then the name of both
>> functions is too generic: Fundamentally certain other page types
>> might be mergeable too.
> 
> merge_free_chunks()/can_merge_free() ?

Fine with me.

>>> +         (PFN_ORDER(buddy) != order) ||
>>> +         (phys_to_nid(page_to_maddr(buddy)) != node) )
>>> +        return false;
>>> +
>>> +    return true;
>> Is there any reason not to make this a single return statement? I
>> don't think it would be any worse to read.
> 
> See patches 2 and 7, where more checks are added. While those two might
> be merged (they both deal with scrubbing) , combining all 3 of them into
> a single statement would make it a bit difficult to read IMO.

Okay, likely makes sense (didn't get that far yet).

>>> +static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>> +                                      unsigned int zone, unsigned int order)
>>> +{
>>> +    ASSERT(spin_is_locked(&heap_lock));
>>> +
>>> +    /* Merge chunks as far as possible. */
>>> +    while ( order < MAX_ORDER )
>>> +    {
>>> +        unsigned long mask = 1UL << order;
>>> +        struct page_info *buddy;
>>> +
>>> +        if ( (page_to_mfn(pg) & mask) )
>>> +        {
>>> +            /* Merge with predecessor block? */
>>> +            buddy = pg - mask;
>>> +            if ( !can_merge(buddy, node, order) )
>>> +                break;
>>> +
>>> +            pg = buddy;
>>> +            page_list_del(pg, &heap(node, zone, order));
>>> +        }
>>> +        else
>>> +        {
>>> +            /* Merge with successor block? */
>>> +            buddy = pg + mask;
>>> +            if ( !can_merge(buddy, node, order) )
>>> +                break;
>>> +
>>> +            page_list_del(buddy, &heap(node, zone, order));
>> This and its companion in the if() branch are now the same (taking
>> into account the assignment ahead of the former) - please move
>> them past the if/else.
>>
>>> +        }
>>> +
>>> +        order++;
>>> +    }
>>> +
>>> +    PFN_ORDER(pg) = order;
>>> +    page_list_add_tail(pg, &heap(node, zone, order));
>> I don't think this strictly is part of the merge anymore, so judging
>> by the name of the function this last line would rather belong into
>> the caller.
> 
> All users of merge_chunks() want to place the new chunk back into the
> heap and I don't think I see why anyone would want to just get the new
> buddy without putting it there.

I understand that, yet the name of the function says otherwise.
At the very least I'd expect a comment right ahead of the function
to state this. Or maybe, as a slight variation of the new name you
suggest above, free_merged_chunks() (merge_chunks_and_free()
or some such would look ugly to me)?

Jan



* Re: [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine
  2017-04-04 14:01       ` Jan Beulich
@ 2017-04-04 14:23         ` Boris Ostrovsky
  0 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-04 14:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel


>>>> +        }
>>>> +
>>>> +        order++;
>>>> +    }
>>>> +
>>>> +    PFN_ORDER(pg) = order;
>>>> +    page_list_add_tail(pg, &heap(node, zone, order));
>>> I don't think this strictly is part of the merge anymore, so judging
>>> by the name of the function this last line would rather belong into
>>> the caller.
>> All users of merge_chunks() want to place the new chunk back into the
>> heap and I don't think I see why anyone would want to just get the new
>> buddy without putting it there.
> I understand that, yet the name of the function says otherwise.
> At the very least I'd expect a comment right ahead of the function
> to state this. Or maybe, as a slight variation of the new name you
> suggest above, free_merged_chunks() (merge_chunks_and_free()
> or some such would look ugly to me)?

merge_and_free_chunks(). And, in fact, merge_and_free_buddy(), and
probably s/chunk/buddy/ in scrub_free_pages(), introduced in a later patch.

-boris


* Re: [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-03 16:50 ` [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist Boris Ostrovsky
@ 2017-04-04 14:46   ` Jan Beulich
  2017-04-04 15:14     ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-04 14:46 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
> @@ -856,6 +874,7 @@ static int reserve_offlined_page(struct page_info *head)
>      int zone = page_to_zone(head), i, head_order = PFN_ORDER(head), count = 0;
>      struct page_info *cur_head;
>      int cur_order;
> +    bool_t need_scrub = !!test_bit(_PGC_need_scrub, &head->count_info);

With our use of plain bool now there's no need for !! here anymore.

> @@ -897,8 +916,8 @@ static int reserve_offlined_page(struct page_info *head)
>              {
>              merge:
>                  /* We don't consider merging outside the head_order. */
> -                page_list_add_tail(cur_head, &heap(node, zone, cur_order));
>                  PFN_ORDER(cur_head) = cur_order;
> +                page_list_add_scrub(cur_head, node, zone, cur_order, need_scrub);

With this re-arrangement, what's the point of also passing a
separate order argument to the function?

> @@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
>           (phys_to_nid(page_to_maddr(buddy)) != node) )
>          return false;
>  
> +    if ( need_scrub !=
> +         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
> +        return false;

I don't think leaving the tree in a state where larger order chunks
don't become available for allocation right away is going to be
acceptable. Hence with this issue being dealt with only in patch 7
as it seems, you should state clearly and visibly that (at least)
patches 2...7 should only be committed together.

> @@ -952,9 +977,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>          {
>              /* Merge with predecessor block? */
>              buddy = pg - mask;
> -            if ( !can_merge(buddy, node, order) )
> +            if ( !can_merge(buddy, node, order, need_scrub) )
>                  break;
>  
> +            pg->count_info &= ~PGC_need_scrub;
>              pg = buddy;
>              page_list_del(pg, &heap(node, zone, order));
>          }
> @@ -962,9 +988,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>          {
>              /* Merge with successor block? */
>              buddy = pg + mask;
> -            if ( !can_merge(buddy, node, order) )
> +            if ( !can_merge(buddy, node, order, need_scrub) )
>                  break;
>  
> +            buddy->count_info &= ~PGC_need_scrub;
>              page_list_del(buddy, &heap(node, zone, order));
>          }

For both of these, how come you can / want to clear the need-scrub
flag? Wouldn't it be better for each individual page to retain it, so
when encountering a higher-order one you know which pages need
scrubbing and which don't? Couldn't that also be used to avoid
suppressing their merging here right away?

> +static void scrub_free_pages(unsigned int node)
> +{
> +    struct page_info *pg;
> +    unsigned int i, zone;
> +    int order;

There are no negative orders.

> +    ASSERT(spin_is_locked(&heap_lock));
> +
> +    if ( !node_need_scrub[node] )
> +        return;
> +
> +    for ( zone = 0; zone < NR_ZONES; zone++ )
> +    {
> +        for ( order = MAX_ORDER; order >= 0; order-- )
> +        {
> +            while ( !page_list_empty(&heap(node, zone, order)) )
> +            {
> +                /* Unscrubbed pages are always at the end of the list. */
> +                pg = page_list_last(&heap(node, zone, order));
> +                if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
> +                    break;
> +
> +                for ( i = 0; i < (1UL << order); i++)

Types of loop variable and upper bound do not match.

> +                    scrub_one_page(&pg[i]);
> +
> +                pg->count_info &= ~PGC_need_scrub;
> +
> +                page_list_del(pg, &heap(node, zone, order));
> +                (void)merge_chunks(pg, node, zone, order);

Pointless cast.

> +                node_need_scrub[node] -= (1UL << order);

Perhaps worth returning right away if the new value is zero?
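
I.e. something like

    node_need_scrub[node] -= 1UL << order;
    if ( !node_need_scrub[node] )
        return;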

> +            }
> +        }
> +    }
> + }
> +
> +

Stray double blank lines

> @@ -1253,7 +1326,7 @@ unsigned int online_page(unsigned long mfn, uint32_t *status)
>      spin_unlock(&heap_lock);
>  
>      if ( (y & PGC_state) == PGC_state_offlined )
> -        free_heap_pages(pg, 0);
> +        free_heap_pages(pg, 0, 0);

false (also elsewhere, and similarly when passing true)

> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -233,6 +233,10 @@ struct page_info
>  #define PGC_count_width   PG_shift(9)
>  #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
>  
> +/* Page needs to be scrubbed */
> +#define _PGC_need_scrub   PG_shift(10)
> +#define PGC_need_scrub    PG_mask(1, 10)

So why not a new PGC_state_dirty instead of this independent
flag? Pages other than PGC_state_free should never make it
to the scrubber, so the flag is meaningless for all other
PGC_state_*.

Jan


* Re: [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-04 14:46   ` Jan Beulich
@ 2017-04-04 15:14     ` Boris Ostrovsky
  2017-04-04 15:29       ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-04 15:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/04/2017 10:46 AM, Jan Beulich wrote:
>> @@ -897,8 +916,8 @@ static int reserve_offlined_page(struct page_info *head)
>>              {
>>              merge:
>>                  /* We don't consider merging outside the head_order. */
>> -                page_list_add_tail(cur_head, &heap(node, zone, cur_order));
>>                  PFN_ORDER(cur_head) = cur_order;
>> +                page_list_add_scrub(cur_head, node, zone, cur_order, need_scrub);
> With this re-arrangement, what's the point of also passing a
> separate order argument to the function?

No reason indeed.

>
>> @@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
>>           (phys_to_nid(page_to_maddr(buddy)) != node) )
>>          return false;
>>  
>> +    if ( need_scrub !=
>> +         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
>> +        return false;
> I don't think leaving the tree in a state where larger order chunks
> don't become available for allocation right away is going to be
> acceptable. Hence with this issue being dealt with only in patch 7
> as it seems, you should state clearly and visibly that (at least)
> patches 2...7 should only be committed together.

The dirty pages are available for allocation as a result of this patch but
they might not be merged with higher orders (which is what this check is
for)

Patch 7 is not named well; it should be called "Keep heap available for
allocation during idle-loop scrubbing".

>
>> @@ -952,9 +977,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>          {
>>              /* Merge with predecessor block? */
>>              buddy = pg - mask;
>> -            if ( !can_merge(buddy, node, order) )
>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>                  break;
>>  
>> +            pg->count_info &= ~PGC_need_scrub;
>>              pg = buddy;
>>              page_list_del(pg, &heap(node, zone, order));
>>          }
>> @@ -962,9 +988,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>          {
>>              /* Merge with successor block? */
>>              buddy = pg + mask;
>> -            if ( !can_merge(buddy, node, order) )
>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>                  break;
>>  
>> +            buddy->count_info &= ~PGC_need_scrub;
>>              page_list_del(buddy, &heap(node, zone, order));
>>          }
> For both of these, how come you can / want to clear the need-scrub
> flag? Wouldn't it be better for each individual page to retain it, so
> when encountering a higher-order one you know which pages need
> scrubbing and which don't? Couldn't that also be used to avoid
> suppressing their merging here right away?

I am trying to avoid having to keep dirty bit for each page since a
buddy is either fully clean or fully dirty. That way we shouldn't need
to walk the list and clear the bit. (I, in fact, suspect that there may
be other state bits/fields that we might be able to keep at a buddy only)

Later, in patch 5, we can safely break a buddy that is being cleaned
into clean and dirty subsets if preemption is needed.

>
>> +static void scrub_free_pages(unsigned int node)
>> +{
>> +    struct page_info *pg;
>> +    unsigned int i, zone;
>> +    int order;
> There are no negative orders.

It actually becomes negative in the loop below and this is the loop exit
condition.

>
>> +    ASSERT(spin_is_locked(&heap_lock));
>> +
>> +    if ( !node_need_scrub[node] )
>> +        return;
>> +
>> +    for ( zone = 0; zone < NR_ZONES; zone++ )
>> +    {
>> +        for ( order = MAX_ORDER; order >= 0; order-- )
>> +        {
>> +            while ( !page_list_empty(&heap(node, zone, order)) )
>> +            {
>> +                /* Unscrubbed pages are always at the end of the list. */
>> +                pg = page_list_last(&heap(node, zone, order));
>> +                if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
>> +                    break;
>> +
>> +                for ( i = 0; i < (1UL << order); i++)
> Types of loop variable and upper bound do not match.
>
>> +                    scrub_one_page(&pg[i]);
>> +
>> +                pg->count_info &= ~PGC_need_scrub;
>> +
>> +                page_list_del(pg, &heap(node, zone, order));
>> +                (void)merge_chunks(pg, node, zone, order);
> Pointless cast.

Didn't coverity complain about those types of things? That was the
reason I have the cast here. If not I'll drop it.

>
>> +                node_need_scrub[node] -= (1UL << order);
> Perhaps worth returning right away if the new value is zero?


Yes.


>> --- a/xen/include/asm-x86/mm.h
>> +++ b/xen/include/asm-x86/mm.h
>> @@ -233,6 +233,10 @@ struct page_info
>>  #define PGC_count_width   PG_shift(9)
>>  #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
>>  
>> +/* Page needs to be scrubbed */
>> +#define _PGC_need_scrub   PG_shift(10)
>> +#define PGC_need_scrub    PG_mask(1, 10)
> So why not a new PGC_state_dirty instead of this independent
> flag? Pages other than PGC_state_free should never make it
> to the scrubber, so the flag is meaningless for all other
> PGC_state_*.

Wouldn't doing this require possibly making two checks ---
page_state_is(pg, free) || page_state_is(pg, dirty)?
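
(Or, to keep the call sites readable, a small helper along the lines of

    /* Hypothetical helper, only to illustrate the PGC_state_dirty variant. */
    #define page_is_free_or_dirty(pg) \
        (page_state_is(pg, free) || page_state_is(pg, dirty))

but that still means touching every place which currently only tests for
"free".)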

-boris



* Re: [PATCH v2 0/9] Memory scrubbing from idle loop
  2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
                   ` (8 preceding siblings ...)
  2017-04-03 16:50 ` [PATCH v2 9/9] mm: Make sure pages are scrubbed Boris Ostrovsky
@ 2017-04-04 15:21 ` George Dunlap
  9 siblings, 0 replies; 31+ messages in thread
From: George Dunlap @ 2017-04-04 15:21 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, tim, jbeulich

On 03/04/17 17:50, Boris Ostrovsky wrote:
> When a domain is destroyed the hypervisor must scrub domain's pages before
> giving them to another guest in order to prevent leaking the deceased
> guest's data. Currently this is done during guest's destruction, possibly
> causing very lengthy cleanup process.
> 
> This series adds support for scrubbing released pages from idle loop,
> making guest destruction significantly faster. For example, destroying a
> 1TB guest can now be completed in 40+ seconds as opposed to about 9 minutes
> using existing scrubbing algorithm.
> 
> The downside of this series is that we sometimes fail to allocate high-order
> sets of pages since dirty pages may not yet be merged into higher-order sets
> while they are waiting to be scrubbed.
> 
> Briefly, the new algorithm places dirty pages at the end of heap's page list
> for each node/zone/order to avoid having to scan full list while searching
> for dirty pages. One processor form each node checks whether the node has any
> dirty pages and, if such pages are found, scrubs them. Scrubbing itself
> happens without holding heap lock so other users may access heap in the
> meantime. If while idle loop is scrubbing a particular chunk of pages this
> chunk is requested by the heap allocator, scrubbing is immediately stopped.
> 
> On the allocation side, alloc_heap_pages() first tries to satisfy allocation
> request using only clean pages. If this is not possible, the search is
> repeated and dirty pages are scrubbed by the allocator.
> 
> This series is somewhat based on earlier work by Bob Liu.
> 
> V1:
> * Only set PGC_need_scrub bit for the buddy head, thus making it unnecessary
>   to scan whole buddy
> * Fix spin_lock_cb()
> * Scrub CPU-less nodes
> * ARM support. Note that I have not been able to test this, only built the
>   binary
> * Added scrub test patch (last one). Not sure whether it should be considered
>   for committing but I have been running with it.
> 
> V2:
> * merge_chunks() returns new buddy head
> * scrub_free_pages() returns softirq pending status in addition to (factored out)
>   status of unscrubbed memory
> * spin_lock uses inlined spin_lock_cb()
> * scrub debugging code checks whole page, not just the first word.
> 
> Deferred:
> * Per-node heap locks. In addition to (presumably) improving performance in
>   general, once they are available we can parallelize scrubbing further by
>   allowing more than one core per node to do idle loop scrubbing.
> * AVX-based scrubbing
> * Use idle loop scrubbing during boot.
> 
> 
> Boris Ostrovsky (9):
>   mm: Separate free page chunk merging into its own routine
>   mm: Place unscrubbed pages at the end of pagelist
>   mm: Scrub pages in alloc_heap_pages() if needed
>   mm: Scrub memory from idle loop
>   mm: Do not discard already-scrubbed pages softirqs are pending
>   spinlock: Introduce spin_lock_cb()
>   mm: Keep pages available for allocation while scrubbing
>   mm: Print number of unscrubbed pages in 'H' debug handler
>   mm: Make sure pages are scrubbed
> 
>  xen/Kconfig.debug          |    7 +
>  xen/arch/arm/domain.c      |   13 +-
>  xen/arch/x86/domain.c      |    3 +-
>  xen/common/page_alloc.c    |  475 +++++++++++++++++++++++++++++++++++++-------
>  xen/common/spinlock.c      |    9 +-
>  xen/include/asm-arm/mm.h   |    8 +
>  xen/include/asm-x86/mm.h   |    8 +
>  xen/include/xen/mm.h       |    1 +
>  xen/include/xen/spinlock.h |    3 +

FAOD, even though this series has 'mm' in the title of almost every
patch, none of it comes under my explicit maintainership (which is
arch/x86/mm), so I don't think it needs any Acks specifically from me to
get in.

On the whole the series looks like it's going in the right direction, so
I won't give a detailed review unless someone specifically asks for it.

 -George


* Re: [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-04 15:14     ` Boris Ostrovsky
@ 2017-04-04 15:29       ` Jan Beulich
  2017-04-04 15:39         ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-04 15:29 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 04.04.17 at 17:14, <boris.ostrovsky@oracle.com> wrote:
> On 04/04/2017 10:46 AM, Jan Beulich wrote:
>>> @@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
>>>           (phys_to_nid(page_to_maddr(buddy)) != node) )
>>>          return false;
>>>  
>>> +    if ( need_scrub !=
>>> +         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
>>> +        return false;
>> I don't think leaving the tree in a state where larger order chunks
>> don't become available for allocation right away is going to be
>> acceptable. Hence with this issue being dealt with only in patch 7
>> as it seems, you should state clearly and visibly that (at least)
>> patches 2...7 should only be committed together.
> 
> The dirty pages are available for allocation as result of this patch but
> they might not be merged with higher orders (which is what this check is
> for)

The individual chunks are available for allocation, but not the
combined one (for a suitably high order request). Or am I
missing something?

>>> @@ -952,9 +977,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>>          {
>>>              /* Merge with predecessor block? */
>>>              buddy = pg - mask;
>>> -            if ( !can_merge(buddy, node, order) )
>>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>>                  break;
>>>  
>>> +            pg->count_info &= ~PGC_need_scrub;
>>>              pg = buddy;
>>>              page_list_del(pg, &heap(node, zone, order));
>>>          }
>>> @@ -962,9 +988,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>>          {
>>>              /* Merge with successor block? */
>>>              buddy = pg + mask;
>>> -            if ( !can_merge(buddy, node, order) )
>>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>>                  break;
>>>  
>>> +            buddy->count_info &= ~PGC_need_scrub;
>>>              page_list_del(buddy, &heap(node, zone, order));
>>>          }
>> For both of these, how come you can / want to clear the need-scrub
>> flag? Wouldn't it be better for each individual page to retain it, so
>> when encountering a higher-order one you know which pages need
>> scrubbing and which don't? Couldn't that also be used to avoid
>> suppressing their merging here right away?
> 
> I am trying to avoid having to keep dirty bit for each page since a
> buddy is either fully clean or fully dirty. That way we shouldn't need
> to walk the list and clear the bit. (I, in fact, suspect that there may
> be other state bits/fields that we might be able to keep at a buddy only)

But as said - at the expense of not being able to merge early. I
consider this a serious limitation.

>>> +static void scrub_free_pages(unsigned int node)
>>> +{
>>> +    struct page_info *pg;
>>> +    unsigned int i, zone;
>>> +    int order;
>> There are no negative orders.
> 
> It actually becomes negative in the loop below and this is the loop exit
> condition.

Only because of the way you've coded the loop. It becoming
negative can be easily avoided.
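
E.g. (just one possible idiom):

    unsigned int order;

    for ( order = MAX_ORDER + 1; order--; )
    {
        /*
         * Body runs for order == MAX_ORDER, MAX_ORDER - 1, ..., 0,
         * without order ever going negative.
         */
    }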

>>> +    ASSERT(spin_is_locked(&heap_lock));
>>> +
>>> +    if ( !node_need_scrub[node] )
>>> +        return;
>>> +
>>> +    for ( zone = 0; zone < NR_ZONES; zone++ )
>>> +    {
>>> +        for ( order = MAX_ORDER; order >= 0; order-- )
>>> +        {
>>> +            while ( !page_list_empty(&heap(node, zone, order)) )
>>> +            {
>>> +                /* Unscrubbed pages are always at the end of the list. */
>>> +                pg = page_list_last(&heap(node, zone, order));
>>> +                if ( !test_bit(_PGC_need_scrub, &pg->count_info) )
>>> +                    break;
>>> +
>>> +                for ( i = 0; i < (1UL << order); i++)
>> Types of loop variable and upper bound do not match.
>>
>>> +                    scrub_one_page(&pg[i]);
>>> +
>>> +                pg->count_info &= ~PGC_need_scrub;
>>> +
>>> +                page_list_del(pg, &heap(node, zone, order));
>>> +                (void)merge_chunks(pg, node, zone, order);
>> Pointless cast.
> 
> Didn't coverity complain about those types of things? That was the
> reason I have the cast here. If not I'll drop it.

I don't know why Coverity would complain about an unused
return value. We've got plenty of such cases throughout the
code base. If this was a macro, the story might be different.

>>> --- a/xen/include/asm-x86/mm.h
>>> +++ b/xen/include/asm-x86/mm.h
>>> @@ -233,6 +233,10 @@ struct page_info
>>>  #define PGC_count_width   PG_shift(9)
>>>  #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
>>>  
>>> +/* Page needs to be scrubbed */
>>> +#define _PGC_need_scrub   PG_shift(10)
>>> +#define PGC_need_scrub    PG_mask(1, 10)
>> So why not a new PGC_state_dirty instead of this independent
>> flag? Pages other than PGC_state_free should never make it
>> to the scrubber, so the flag is meaningless for all other
>> PGC_state_*.
> 
> Wouldn't doing this require possibly making two checks ---
> page_state_is(pg, free) || page_state_is(pg, dirty)?

Well, your goal would normally be to first look for pages not
needing scrubbing anyway, so quite likely you'd do two
passes anyway. But of course much depends on whether to
merge early or late.

Jan


* Re: [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-04 15:29       ` Jan Beulich
@ 2017-04-04 15:39         ` Boris Ostrovsky
  2017-04-04 15:50           ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-04 15:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/04/2017 11:29 AM, Jan Beulich wrote:
>>>> On 04.04.17 at 17:14, <boris.ostrovsky@oracle.com> wrote:
>> On 04/04/2017 10:46 AM, Jan Beulich wrote:
>>>> @@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
>>>>           (phys_to_nid(page_to_maddr(buddy)) != node) )
>>>>          return false;
>>>>  
>>>> +    if ( need_scrub !=
>>>> +         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
>>>> +        return false;
>>> I don't think leaving the tree in a state where larger order chunks
>>> don't become available for allocation right away is going to be
>>> acceptable. Hence with this issue being dealt with only in patch 7
>>> as it seems, you should state clearly and visibly that (at least)
>>> patches 2...7 should only be committed together.
>> The dirty pages are available for allocation as a result of this patch but
>> they might not be merged with higher orders (which is what this check is
>> for)
> The individual chunks are available for allocation, but not the
> combined one (for a suitably high order request). Or am I
> missing something?


Correct, but this is not changed by any later patch (including patch 7).
We only merge with a buddy with the same level of cleanliness (so to
speak ;-))


>
>>>> @@ -952,9 +977,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>>>          {
>>>>              /* Merge with predecessor block? */
>>>>              buddy = pg - mask;
>>>> -            if ( !can_merge(buddy, node, order) )
>>>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>>>                  break;
>>>>  
>>>> +            pg->count_info &= ~PGC_need_scrub;
>>>>              pg = buddy;
>>>>              page_list_del(pg, &heap(node, zone, order));
>>>>          }
>>>> @@ -962,9 +988,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>>>          {
>>>>              /* Merge with successor block? */
>>>>              buddy = pg + mask;
>>>> -            if ( !can_merge(buddy, node, order) )
>>>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>>>                  break;
>>>>  
>>>> +            buddy->count_info &= ~PGC_need_scrub;
>>>>              page_list_del(buddy, &heap(node, zone, order));
>>>>          }
>>> For both of these, how come you can / want to clear the need-scrub
>>> flag? Wouldn't it be better for each individual page to retain it, so
>>> when encountering a higher-order one you know which pages need
>>> scrubbing and which don't? Couldn't that also be used to avoid
>>> suppressing their merging here right away?
>> I am trying to avoid having to keep dirty bit for each page since a
>> buddy is either fully clean or fully dirty. That way we shouldn't need
>> to walk the list and clear the bit. (I, in fact, suspect that there may
>> be other state bits/fields that we might be able to keep at a buddy only)
> But as said - at the expense of not being able to merge early. I
> consider this a serious limitation.

What do you mean by "early"? At freeing time?

But then we will always have to scan the buddy during allocation to see
if any pages are dirty.


>
>>>> --- a/xen/include/asm-x86/mm.h
>>>> +++ b/xen/include/asm-x86/mm.h
>>>> @@ -233,6 +233,10 @@ struct page_info
>>>>  #define PGC_count_width   PG_shift(9)
>>>>  #define PGC_count_mask    ((1UL<<PGC_count_width)-1)
>>>>  
>>>> +/* Page needs to be scrubbed */
>>>> +#define _PGC_need_scrub   PG_shift(10)
>>>> +#define PGC_need_scrub    PG_mask(1, 10)
>>> So why not a new PGC_state_dirty instead of this independent
>>> flag? Pages other than PGC_state_free should never make it
>>> to the scrubber, so the flag is meaningless for all other
>>> PGC_state_*.
>> Wouldn't doing this require possibly making two checks ---
>> page_state_is(pg, free) || page_state_is(pg, dirty)?
> Well, your goal would normally be to first look for pages not
> needing scrubbing anyway, so quite likely you'd do two
> passes anyway. But of course much depends on whether to
> merge early or late.

Again, I need to understand what you consider "early" and "late".

-boris



* Re: [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-04 15:39         ` Boris Ostrovsky
@ 2017-04-04 15:50           ` Jan Beulich
  2017-04-04 16:22             ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-04 15:50 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 04.04.17 at 17:39, <boris.ostrovsky@oracle.com> wrote:
> On 04/04/2017 11:29 AM, Jan Beulich wrote:
>>>>> On 04.04.17 at 17:14, <boris.ostrovsky@oracle.com> wrote:
>>> On 04/04/2017 10:46 AM, Jan Beulich wrote:
>>>>> @@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
>>>>>           (phys_to_nid(page_to_maddr(buddy)) != node) )
>>>>>          return false;
>>>>>  
>>>>> +    if ( need_scrub !=
>>>>> +         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
>>>>> +        return false;
>>>> I don't think leaving the tree in a state where larger order chunks
>>>> don't become available for allocation right away is going to be
>>>> acceptable. Hence with this issue being dealt with only in patch 7
>>>> as it seems, you should state clearly and visibly that (at least)
>>>> patches 2...7 should only be committed together.
>>>> The dirty pages are available for allocation as a result of this patch but
>>> they might not be merged with higher orders (which is what this check is
>>> for)
>> The individual chunks are available for allocation, but not the
>> combined one (for a suitably high order request). Or am I
>> missing something?
> 
> 
> Correct, but this is not changed by any later patch (including patch 7).
> We only merge with a buddy with the same level of cleanliness (so to
> speak ;-))

Hmm, that aspect was one of the main things I had objected to
back when one of your colleagues had a first take at this.

>>>>> @@ -952,9 +977,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>>>>          {
>>>>>              /* Merge with predecessor block? */
>>>>>              buddy = pg - mask;
>>>>> -            if ( !can_merge(buddy, node, order) )
>>>>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>>>>                  break;
>>>>>  
>>>>> +            pg->count_info &= ~PGC_need_scrub;
>>>>>              pg = buddy;
>>>>>              page_list_del(pg, &heap(node, zone, order));
>>>>>          }
>>>>> @@ -962,9 +988,10 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>>>>>          {
>>>>>              /* Merge with successor block? */
>>>>>              buddy = pg + mask;
>>>>> -            if ( !can_merge(buddy, node, order) )
>>>>> +            if ( !can_merge(buddy, node, order, need_scrub) )
>>>>>                  break;
>>>>>  
>>>>> +            buddy->count_info &= ~PGC_need_scrub;
>>>>>              page_list_del(buddy, &heap(node, zone, order));
>>>>>          }
>>>> For both of these, how come you can / want to clear the need-scrub
>>>> flag? Wouldn't it be better for each individual page to retain it, so
>>>> when encountering a higher-order one you know which pages need
>>>> scrubbing and which don't? Couldn't that also be used to avoid
>>>> suppressing their merging here right away?
>>> I am trying to avoid having to keep dirty bit for each page since a
>>> buddy is either fully clean or fully dirty. That way we shouldn't need
>>> to walk the list and clear the bit. (I, in fact, suspect that there may
>>> be other state bits/fields that we might be able to keep at a buddy only)
>> But as said - at the expense of not being able to merge early. I
>> consider this a serious limitation.
> 
> What do you mean by "early"? At freeing time?

Right.

> But then we will always have to scan the buddy during allocation to see
> if any pages are dirty.

There could be a summary flag to avoid this for entirely clean
buddies. Plus perhaps some auxiliary indication where the first
unclean part is, to speed up the scanning.
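
I.e. something along these lines (field name, width and placement purely
illustrative):

    /*
     * Added to the existing "free" part of struct page_info's union:
     * index of the first possibly dirty page in the buddy, equal to
     * 1u << order when the whole buddy is known to be clean.
     */
    unsigned int first_dirty;

    /* The allocation / scrub side then only has to look at the tail: */
    for ( i = head->u.free.first_dirty; i < (1u << order); i++ )
        if ( test_bit(_PGC_need_scrub, &head[i].count_info) )
            scrub_one_page(&head[i]);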

Jan



* Re: [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist
  2017-04-04 15:50           ` Jan Beulich
@ 2017-04-04 16:22             ` Boris Ostrovsky
  0 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-04 16:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/04/2017 11:50 AM, Jan Beulich wrote:
>>>> On 04.04.17 at 17:39, <boris.ostrovsky@oracle.com> wrote:
>> On 04/04/2017 11:29 AM, Jan Beulich wrote:
>>>>>> On 04.04.17 at 17:14, <boris.ostrovsky@oracle.com> wrote:
>>>> On 04/04/2017 10:46 AM, Jan Beulich wrote:
>>>>>> @@ -933,6 +952,10 @@ static bool_t can_merge(struct page_info *buddy, unsigned int node,
>>>>>>           (phys_to_nid(page_to_maddr(buddy)) != node) )
>>>>>>          return false;
>>>>>>  
>>>>>> +    if ( need_scrub !=
>>>>>> +         !!test_bit(_PGC_need_scrub, &buddy->count_info) )
>>>>>> +        return false;
>>>>> I don't think leaving the tree in a state where larger order chunks
>>>>> don't become available for allocation right away is going to be
>>>>> acceptable. Hence with this issue being dealt with only in patch 7
>>>>> as it seems, you should state clearly and visibly that (at least)
>>>>> patches 2...7 should only be committed together.
>>>> The dirty pages are available for allocation as result of this patch but
>>>> they might not be merged with higher orders (which is what this check is
>>>> for)
>>> The individual chunks are available for allocation, but not the
>>> combined one (for a suitably high order request). Or am I
>>> missing something?
>>
>> Correct, but this is not changed by any later patch (including patch 7).
>> We only merge with a buddy with the same level of cleanliness (so to
>> speak ;-))
> Hmm, that aspect was one of the main things I had objected to
> back when one of your colleagues had a first take at this.

I thought your objections were over having a period of time when a chunk
is removed from the heap for scrubbing and thus is not available at all.


>
>> But then we will always have to scan the buddy during allocation to see
>> if any pages are dirty.
> There could be a summary flag to avoid this for entirely clean
> buddies. Plus perhaps some auxiliary indication where the first
> unclean part is, to speed up the scanning.


This should be doable. Let me work on this.

-boris



* Re: [PATCH v2 4/9] mm: Scrub memory from idle loop
  2017-04-03 16:50 ` [PATCH v2 4/9] mm: Scrub memory from idle loop Boris Ostrovsky
@ 2017-04-12 16:11   ` Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2017-04-12 16:11 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1043,16 +1043,44 @@ static struct page_info *merge_chunks(struct page_info *pg, unsigned int node,
>      return pg;
>  }
>  
> -static void scrub_free_pages(unsigned int node)
> +static nodemask_t node_scrubbing;
> +static unsigned int node_to_scrub(bool_t get_node)

Blank line between these two please.

> +{
> +    nodeid_t node = cpu_to_node(smp_processor_id()), local_node;
> +
> +    if ( node == NUMA_NO_NODE )
> +        node = 0;
> +    local_node = node;
> +
> +    /*
> +     * Check the local node first and then see if there are any memory-only
> +     * nodes that may need scrubbing
> +     */
> +    while ( 1 )
> +    {
> +        if ( node_need_scrub[node] &&
> +             (!node_test_and_set(node, node_scrubbing) || !get_node) )
> +                return node;
> +        do {
> +            node = cycle_node(node, node_online_map);
> +            if ( node == local_node )
> +                return NUMA_NO_NODE;
> +        } while ( !cpumask_empty(&node_to_cpumask(node)) );

I think this needs logic to prefer closer nodes over more distant
ones, or else the latency of coming back out of the idle loop may
grow needlessly high.
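
For the CPU-less-node part, that could look roughly like the following
(rough, untested sketch, not the actual patch; __node_distance() being
the existing NUMA distance helper):

    nodeid_t n, closest = NUMA_NO_NODE;
    u8 shortest = 0xff;

    for_each_online_node ( n )
    {
        if ( n == local_node || !cpumask_empty(&node_to_cpumask(n)) ||
             !node_need_scrub[n] )
            continue;

        if ( __node_distance(local_node, n) < shortest )
        {
            shortest = __node_distance(local_node, n);
            closest = n;
        }
    }

    if ( closest != NUMA_NO_NODE && get_node &&
         node_test_and_set(closest, node_scrubbing) )
        closest = NUMA_NO_NODE;   /* raced with another scrubber */

    return closest;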

> +    }
> +}
> +
> +bool_t scrub_free_pages()

Plain bool please and missing void.

>  {
>      struct page_info *pg;
>      unsigned int i, zone;
> -    int order;
> +    int order, cpu = smp_processor_id();

Both unsigned int.

Jan



* Re: [PATCH v2 5/9] mm: Do not discard already-scrubbed pages softirqs are pending
  2017-04-03 16:50 ` [PATCH v2 5/9] mm: Do not discard already-scrubbed pages softirqs are pending Boris Ostrovsky
@ 2017-04-13 15:41   ` Jan Beulich
  2017-04-13 16:46     ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-13 15:41 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

To be honest, without a proper description and with an apparently
not really well formed title I don't want to start guessing what the
patch intends, and whether the changes done match that intention.
I guess this should really have been an RFC ...

Jan



* Re: [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-03 16:50 ` [PATCH v2 6/9] spinlock: Introduce spin_lock_cb() Boris Ostrovsky
@ 2017-04-13 15:46   ` Jan Beulich
  2017-04-13 16:55     ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-13 15:46 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
> While waiting for a lock we may want to periodically run some
> code. We could use spin_trylock() but since it doesn't take lock
> ticket it may take a long time until the lock is taken.
> 
> Add spin_lock_cb() that allows us to execute a callback while waiting.

You don't add any user(s) of this new interface and you also don't
outline under what conditions you think using this might be a good
idea. On that basis I don't think this makes much sense. I am
particularly worried of undue latencies use of this function may
incur.

> Also add spin_lock_kick() that will wake up the waiters.

Same here - I can't even seem to guess what use you intend this to
have.

Nevertheless the code itself looks okay.

Jan



* Re: [PATCH v2 7/9] mm: Keep pages available for allocation while scrubbing
  2017-04-03 16:50 ` [PATCH v2 7/9] mm: Keep pages available for allocation while scrubbing Boris Ostrovsky
@ 2017-04-13 15:59   ` Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2017-04-13 15:59 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
> Instead of scrubbing pages while holding heap lock we can mark
> buddy's head as being scrubbed and drop the lock temporarily.
> If someone (most likely alloc_heap_pages()) tries to access
> this chunk it will signal the scrubber to abort scrub by setting
> head's PAGE_SCRUB_ABORT bit. The scrubber checks this bit after
> processing each page and stops its work as soon as it sees it.

So if the scrubber managed to handle all but one page of, say, a
1Gb buddy, you'd re-do all of it synchronously? One more argument
to track dirty/scrubbed state per page instead of per buddy, I think.

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -699,6 +699,18 @@ static void page_list_add_scrub(struct page_info *pg, 
> unsigned int node,
>          page_list_add(pg, &heap(node, zone, order));
>  }
>  
> +static void check_and_stop_scrub(struct page_info *head)
> +{
> +    if ( head->u.free.scrub_state & PAGE_SCRUBBING )
> +    {
> +        head->u.free.scrub_state |= PAGE_SCRUB_ABORT;
> +        smp_mb();
> +        spin_lock_kick();

I think the barrier would better be in the kicking construct than
explicit here.
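
I.e. something like (the exact signature and the reuse of the existing
arch_lock_signal() are assumptions on my part)

    #define spin_lock_kick()    \
    ({                          \
        smp_mb();               \
        arch_lock_signal();     \
    })

so that callers like the above don't need their own barrier.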

> @@ -785,10 +797,15 @@ static struct page_info *alloc_heap_pages(
>              {
>                  if ( (pg = page_list_remove_head(&heap(node, zone, j))) )
>                  {
> -                    if ( (order == 0) || use_unscrubbed ||
> -                         !test_bit(_PGC_need_scrub, &pg->count_info) )
> +                    if ( !test_bit(_PGC_need_scrub, &pg[0].count_info) )

Any reason to change from -> to [0]. here?

> @@ -1074,12 +1096,34 @@ static unsigned int node_to_scrub(bool_t get_node)
>  }
>  
>  #define SCRUB_CHUNK_ORDER  8
> +
> +struct scrub_wait_state {
> +    struct page_info *pg;
> +    bool_t drop;
> +};
> +
> +static void scrub_continue(void *data)
> +{
> +    struct scrub_wait_state *st = (struct scrub_wait_state *)data;

Pointless cast.

> @@ -1203,6 +1271,8 @@ static void free_heap_pages(
>          if ( page_state_is(&pg[i], offlined) )
>              tainted = 1;
>  
> +        pg[i].u.free.scrub_state=0;

Style.

> --- a/xen/include/asm-arm/mm.h
> +++ b/xen/include/asm-arm/mm.h
> @@ -35,6 +35,10 @@ struct page_info
>          } inuse;
>          /* Page is on a free list: ((count_info & PGC_count_mask) == 0). */
>          struct {
> +#define PAGE_SCRUBBING      (1<<1)
> +#define PAGE_SCRUB_ABORT    (1<<2)

Any reason not to start from bit 0? I'm also not sure boolean flags
are the ideal solution here: You really only have three states afaict
(none, scrubbing, abort).

Jan



* Re: [PATCH v2 5/9] mm: Do not discard already-scrubbed pages softirqs are pending
  2017-04-13 15:41   ` Jan Beulich
@ 2017-04-13 16:46     ` Boris Ostrovsky
  0 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-13 16:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/13/2017 11:41 AM, Jan Beulich wrote:
>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> To be honest, without a proper description and with an apparently
> not really well formed title I don't want to start guessing what the
> patch intends, and whether the changes done match that intention.
> I guess this should really have been an RFC ...

There was an RFC for this
(https://lists.xenproject.org/archives/html/xen-devel/2017-02/msg03237.html)
but your point is well taken.

Commit message should have been something like:

    To avoid delaying softirq processing while scrubbing, check
    for softirqs every 256 pages. If softirq is pending, stop scrubbing
    and merge the partially-scrubbed buddy by breaking the clean
    portion into smaller power-of-2 chunks. Then repeat the same
    process for the dirty part.
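
In code, the splitting step could look roughly like this (illustrative
sketch only, with the node_need_scrub[] accounting and the final merge
step elided):

    /* The first nr_clean pages of the 2^order buddy at pg are scrubbed. */
    static void return_split_buddy(struct page_info *pg, unsigned int node,
                                   unsigned int zone, unsigned int order,
                                   unsigned long nr_clean)
    {
        while ( order && nr_clean && nr_clean < (1UL << order) )
        {
            unsigned long half = 1UL << (order - 1);

            if ( nr_clean >= half )
            {
                /* Left half is fully clean: hand it back right away. */
                PFN_ORDER(pg) = order - 1;
                page_list_add(pg, &heap(node, zone, order - 1));
                pg += half;
                nr_clean -= half;
            }
            else
            {
                /* Right half is fully dirty: requeue it for scrubbing. */
                PFN_ORDER(pg + half) = order - 1;
                pg[half].count_info |= PGC_need_scrub;
                page_list_add_tail(pg + half, &heap(node, zone, order - 1));
            }
            order--;
        }

        /* The remainder is uniform: clean iff nr_clean covers all of it. */
        PFN_ORDER(pg) = order;
        if ( nr_clean < (1UL << order) )
        {
            pg->count_info |= PGC_need_scrub;
            page_list_add_tail(pg, &heap(node, zone, order));
        }
        else
            page_list_add(pg, &heap(node, zone, order));
    }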

I was about to post the new version of this series, with dirty bit per
page. I'll make updates based on your comments today (and yesterday) and
if you review this patch in this version I'll include your comment into
that version as well.


-boris


* Re: [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-13 15:46   ` Jan Beulich
@ 2017-04-13 16:55     ` Boris Ostrovsky
  2017-04-18  6:49       ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-13 16:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/13/2017 11:46 AM, Jan Beulich wrote:
>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>> While waiting for a lock we may want to periodically run some
>> code. We could use spin_trylock() but since it doesn't take lock
>> ticket it may take a long time until the lock is taken.
>>
>> Add spin_lock_cb() that allows us to execute a callback while waiting.
> You don't add any user(s) of this new interface and you also don't
> outline under what conditions you think using this might be a good
> idea. On that basis I don't think this makes much sense. I am
> particularly worried of undue latencies use of this function may
> incur.

There is (currently) only one user of this interface and it is
introduced in the next patch.

If you don't think explanation above is sufficient I can add

    "This code may, for example, allow the caller to release resources
      held by it that are no longer needed in the critical section protected
      by the lock."

after the first sentence.

As for latency, the fast path is not affected; it's only if the lock is
already taken that we make the callback.

-boris

>
>> Also add spin_lock_kick() that will wake up the waiters.
> Same here - I can't even seem to guess what use you intend this to
> have.
>
> Nevertheless the code itself looks okay.
>
> Jan
>



* Re: [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-13 16:55     ` Boris Ostrovsky
@ 2017-04-18  6:49       ` Jan Beulich
  2017-04-18 12:32         ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-18  6:49 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 13.04.17 at 18:55, <boris.ostrovsky@oracle.com> wrote:
> On 04/13/2017 11:46 AM, Jan Beulich wrote:
>>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>>> While waiting for a lock we may want to periodically run some
>>> code. We could use spin_trylock() but since it doesn't take lock
>>> ticket it may take a long time until the lock is taken.
>>>
>>> Add spin_lock_cb() that allows us to execute a callback while waiting.
>> You don't add any user(s) of this new interface and you also don't
>> outline under what conditions you think using this might be a good
>> idea. On that basis I don't think this makes much sense. I am
>> particularly worried of undue latencies use of this function may
>> incur.
> 
> There is (currently) only one user of this interface and it is
> introduced in the next patch.
> 
> If you don't think explanation above is sufficient I can add
> 
>     "This code may, for example, allow the caller to release resources
>       held by it that are no longer needed in the critical section protected
>       by the lock."
> 
> after the first sentence.
> 
> As for latency, the fast path is not affected, it's only if the lock is
> already taken do we make the callback.

That's a rather relevant aspect, which I think needs calling out
explicitly. As you may have noticed, my initial understanding of
the basic idea here was that the callback would be invoked while
spinning (i.e. to use the spinning time to do something useful),
not while holding the lock.

Jan



* Re: [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-18  6:49       ` Jan Beulich
@ 2017-04-18 12:32         ` Boris Ostrovsky
  2017-04-18 12:43           ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-18 12:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/18/2017 02:49 AM, Jan Beulich wrote:
>>>> On 13.04.17 at 18:55, <boris.ostrovsky@oracle.com> wrote:
>> On 04/13/2017 11:46 AM, Jan Beulich wrote:
>>>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>>>> While waiting for a lock we may want to periodically run some
>>>> code. We could use spin_trylock() but since it doesn't take lock
>>>> ticket it may take a long time until the lock is taken.
>>>>
>>>> Add spin_lock_cb() that allows us to execute a callback while waiting.
>>> You don't add any user(s) of this new interface and you also don't
>>> outline under what conditions you think using this might be a good
>>> idea. On that basis I don't think this makes much sense. I am
>>> particularly worried of undue latencies use of this function may
>>> incur.
>> There is (currently) only one user of this interface and it is
>> introduced in the next patch.
>>
>> If you don't think explanation above is sufficient I can add
>>
>>     "This code may, for example, allow the caller to release resources
>>       held by it that are no longer needed in the critical section protected
>>       by the lock."
>>
>> after the first sentence.
>>
>> As for latency, the fast path is not affected, it's only if the lock is
>> already taken do we make the callback.
> That's a rather relevant aspect, which I think needs calling out
> explicitly. As you may have noticed, my initial understanding of
> the basic idea here was that the callback would be invoked while
> spinning (i.e. to use to spinning time to do something useful),
> not while holding the lock.

The callback *is* invoked while we are spinning, waiting for the lock. I
probably should have said "only if the lock is already taken by someone
else". However, on the fast path, where no one is holding the lock and
the caller can grab it right away, the callback is not invoked.
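
With the toy lock sketched in my earlier reply, a caller would then look
roughly like this (again illustration only, with invented names):

/* Hypothetical callback: release resources that are no longer needed
 * inside the critical section.  Invoked only while we are still
 * waiting for the lock. */
static void drop_deferred_work(void *data)
{
    unsigned int *deferred = data;

    if ( *deferred )
        *deferred = 0;      /* stand-in for real cleanup work */
}

static void example(ticketlock_t *lock)
{
    unsigned int deferred = 1;

    /*
     * Fast path: lock is free, we take it at once, callback never runs.
     * Contended path: callback runs while we spin for our ticket.
     */
    ticket_lock_cb(lock, drop_deferred_work, &deferred);

    /* ... critical section; the callback is not invoked here ... */

    ticket_unlock(lock);
}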

-boris





* Re: [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-18 12:32         ` Boris Ostrovsky
@ 2017-04-18 12:43           ` Jan Beulich
  2017-04-18 13:14             ` Boris Ostrovsky
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2017-04-18 12:43 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

>>> On 18.04.17 at 14:32, <boris.ostrovsky@oracle.com> wrote:
> On 04/18/2017 02:49 AM, Jan Beulich wrote:
>>>>> On 13.04.17 at 18:55, <boris.ostrovsky@oracle.com> wrote:
>>> On 04/13/2017 11:46 AM, Jan Beulich wrote:
>>>>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>>>>> While waiting for a lock we may want to periodically run some
>>>>> code. We could use spin_trylock() but since it doesn't take lock
>>>>> ticket it may take a long time until the lock is taken.
>>>>>
>>>>> Add spin_lock_cb() that allows us to execute a callback while waiting.
>>>> You don't add any user(s) of this new interface and you also don't
>>>> outline under what conditions you think using this might be a good
>>>> idea. On that basis I don't think this makes much sense. I am
>>>> particularly worried of undue latencies use of this function may
>>>> incur.
>>> There is (currently) only one user of this interface and it is
>>> introduced in the next patch.
>>>
>>> If you don't think explanation above is sufficient I can add
>>>
>>>     "This code may, for example, allow the caller to release resources
>>>       held by it that are no longer needed in the critical section protected
>>>       by the lock."
>>>
>>> after the first sentence.
>>>
>>> As for latency, the fast path is not affected, it's only if the lock is
>>> already taken do we make the callback.
>> That's a rather relevant aspect, which I think needs calling out
>> explicitly. As you may have noticed, my initial understanding of
>> the basic idea here was that the callback would be invoked while
>> spinning (i.e. to use to spinning time to do something useful),
>> not while holding the lock.
> 
> The callback *is* invoked when we are spinning waiting for the lock. I
> probably should have said "only if the lock is already taken by someone
> else". However, on the fast path, where noone is holding the lock and
> the caller can grab it right away, the callback is not invoked.

Oh, so we're back to what I originally understood, and back to my
latency concerns. Yes, releasing a resource ought not to incur much
latency, but as you know, once we have a mechanism like this, other
less clear use cases may appear. Therefore, while I'm not outright
objecting to the idea, I'm not really convinced of it either (especially
since the try-lock approach still exists as a possible alternative). At
the very least, a clear warning needs to be placed next to the function
declaration and/or definition.
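
Something along these lines next to the declaration would go a long way
(the wording and the exact signature below are illustrative only):

/*
 * spin_lock_cb: like spin_lock(), but repeatedly invoke cb(data) while
 * the lock is contended, i.e. while we are still spinning and before
 * the lock has been acquired.  The callback is never invoked on the
 * uncontended fast path or while the lock is held.  It must be cheap,
 * must not assume the lock is held, and must not itself try to acquire
 * this lock; a long-running callback delays every waiter queued behind
 * the caller.
 */
void spin_lock_cb(spinlock_t *lock, void (*cb)(void *data), void *data);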

Jan



* Re: [PATCH v2 6/9] spinlock: Introduce spin_lock_cb()
  2017-04-18 12:43           ` Jan Beulich
@ 2017-04-18 13:14             ` Boris Ostrovsky
  0 siblings, 0 replies; 31+ messages in thread
From: Boris Ostrovsky @ 2017-04-18 13:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, sstabellini, wei.liu2, George.Dunlap, andrew.cooper3,
	ian.jackson, xen-devel

On 04/18/2017 08:43 AM, Jan Beulich wrote:
>>>> On 18.04.17 at 14:32, <boris.ostrovsky@oracle.com> wrote:
>> On 04/18/2017 02:49 AM, Jan Beulich wrote:
>>>>>> On 13.04.17 at 18:55, <boris.ostrovsky@oracle.com> wrote:
>>>> On 04/13/2017 11:46 AM, Jan Beulich wrote:
>>>>>>>> On 03.04.17 at 18:50, <boris.ostrovsky@oracle.com> wrote:
>>>>>> While waiting for a lock we may want to periodically run some
>>>>>> code. We could use spin_trylock() but since it doesn't take lock
>>>>>> ticket it may take a long time until the lock is taken.
>>>>>>
>>>>>> Add spin_lock_cb() that allows us to execute a callback while waiting.
>>>>> You don't add any user(s) of this new interface and you also don't
>>>>> outline under what conditions you think using this might be a good
>>>>> idea. On that basis I don't think this makes much sense. I am
>>>>> particularly worried of undue latencies use of this function may
>>>>> incur.
>>>> There is (currently) only one user of this interface and it is
>>>> introduced in the next patch.
>>>>
>>>> If you don't think explanation above is sufficient I can add
>>>>
>>>>     "This code may, for example, allow the caller to release resources
>>>>       held by it that are no longer needed in the critical section protected
>>>>       by the lock."
>>>>
>>>> after the first sentence.
>>>>
>>>> As for latency, the fast path is not affected, it's only if the lock is
>>>> already taken do we make the callback.
>>> That's a rather relevant aspect, which I think needs calling out
>>> explicitly. As you may have noticed, my initial understanding of
>>> the basic idea here was that the callback would be invoked while
>>> spinning (i.e. to use to spinning time to do something useful),
>>> not while holding the lock.
>> The callback *is* invoked when we are spinning waiting for the lock. I
>> probably should have said "only if the lock is already taken by someone
>> else". However, on the fast path, where noone is holding the lock and
>> the caller can grab it right away, the callback is not invoked.
> Oh, so back to what I was originally understanding, and back to my
> latency concerns. Yes, releasing a resource ought to not incur much
> latency, but as you know once we have a certain mechanism, other
> less clear use cases may appear. Therefore while I'm not outright
> objecting to the idea, I'm not really convinced of it either (the more
> that the try-lock approach still exists as a possible alternative). At
> least a very clear warning needs to be placed next to the function
> declaration and/or definition.


I'd rather add a warning in the header file, since the trylock approach is
essentially unbounded, especially on large systems (where this series
becomes useful).

Alternatively, I could drop the lock and restart in alloc_heap_pages()
whenever a buddy that is being scrubbed is selected, and have the scrubber
break it into a clean part and a dirty part. But that would cause us to lose
the high-order chunk.
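
As a toy model of that alternative (illustration only; a pthread mutex
stands in for the heap lock and a single structure stands in for the
free list, which is not what the series does):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* One buddy stands in for the whole free list in this toy model. */
struct buddy {
    bool free;
    bool being_scrubbed;   /* set/cleared by the scrubber under the lock */
};

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static struct buddy the_buddy = { .free = true, .being_scrubbed = false };

static struct buddy *alloc_with_restart(void)
{
    for ( ; ; )
    {
        pthread_mutex_lock(&heap_lock);

        if ( !the_buddy.free )
        {
            pthread_mutex_unlock(&heap_lock);
            return NULL;                    /* nothing suitable */
        }

        if ( !the_buddy.being_scrubbed )
        {
            the_buddy.free = false;         /* claim it under the lock */
            pthread_mutex_unlock(&heap_lock);
            return &the_buddy;
        }

        /*
         * Buddy is under scrub: drop the lock and retry.  In the real
         * scheme the scrubber would meanwhile split it into a clean
         * part and a dirty part, so the high-order chunk is lost.
         */
        pthread_mutex_unlock(&heap_lock);
    }
}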

-boris



end of thread, other threads:[~2017-04-18 13:14 UTC | newest]

Thread overview: 31+ messages
2017-04-03 16:50 [PATCH v2 0/9] Memory scrubbing from idle loop Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 1/9] mm: Separate free page chunk merging into its own routine Boris Ostrovsky
2017-04-04 11:16   ` Jan Beulich
2017-04-04 13:48     ` Boris Ostrovsky
2017-04-04 14:01       ` Jan Beulich
2017-04-04 14:23         ` Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 2/9] mm: Place unscrubbed pages at the end of pagelist Boris Ostrovsky
2017-04-04 14:46   ` Jan Beulich
2017-04-04 15:14     ` Boris Ostrovsky
2017-04-04 15:29       ` Jan Beulich
2017-04-04 15:39         ` Boris Ostrovsky
2017-04-04 15:50           ` Jan Beulich
2017-04-04 16:22             ` Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 3/9] mm: Scrub pages in alloc_heap_pages() if needed Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 4/9] mm: Scrub memory from idle loop Boris Ostrovsky
2017-04-12 16:11   ` Jan Beulich
2017-04-03 16:50 ` [PATCH v2 5/9] mm: Do not discard already-scrubbed pages softirqs are pending Boris Ostrovsky
2017-04-13 15:41   ` Jan Beulich
2017-04-13 16:46     ` Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 6/9] spinlock: Introduce spin_lock_cb() Boris Ostrovsky
2017-04-13 15:46   ` Jan Beulich
2017-04-13 16:55     ` Boris Ostrovsky
2017-04-18  6:49       ` Jan Beulich
2017-04-18 12:32         ` Boris Ostrovsky
2017-04-18 12:43           ` Jan Beulich
2017-04-18 13:14             ` Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 7/9] mm: Keep pages available for allocation while scrubbing Boris Ostrovsky
2017-04-13 15:59   ` Jan Beulich
2017-04-03 16:50 ` [PATCH v2 8/9] mm: Print number of unscrubbed pages in 'H' debug handler Boris Ostrovsky
2017-04-03 16:50 ` [PATCH v2 9/9] mm: Make sure pages are scrubbed Boris Ostrovsky
2017-04-04 15:21 ` [PATCH v2 0/9] Memory scrubbing from idle loop George Dunlap
