xen-devel.lists.xenproject.org archive mirror
* [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
@ 2020-01-08 17:24 David Woodhouse
  2020-01-08 17:24 ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image David Woodhouse
  2020-01-13 11:54 ` [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
  0 siblings, 2 replies; 16+ messages in thread
From: David Woodhouse @ 2020-01-08 17:24 UTC (permalink / raw)
  To: Xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné


When doing a live update, Xen needs to be very careful not to scribble
on pages which contain guest memory or state information for the
domains which are being preserved.

The information about which pages are in use is contained in the live
update state passed from the previous Xen — which is mostly just a
guest-transparent live migration data stream, except that it points to
the page tables in place in memory while traditional live migration
obviously copies the pages separately.

Our initial implementation actually prepended a list of 'in-use' ranges
to the live update state, and made the boot allocator treat them the
same as 'bad pages'. That worked well enough for initial development
but wouldn't scale to a live production system, mainly because the boot
allocator has a limit of 512 memory ranges that it can keep track of,
and a real system would end up more fragmented than that.

My other concern with that approach is that it required two passes over
the domain-owned pages. We have to do a later pass *anyway*, as we set
up ownership in the frametable for each page — and that has to happen
after we've managed to allocate a 'struct domain' for each page_info to
point to. If we want to keep the pause time due to a live update down
to a bare minimum, doing two passes over the full set of domain pages
isn't my favourite strategy.

So we've settled on a simpler approach — reserve a contiguous region
of physical memory which *won't* be used for domain pages. Let the boot
allocator see *only* that region of memory, and plug the rest of the
memory in later only after doing a full pass of the live update state.

This means that we have to ensure the reserved region is large enough,
but ultimately we had that problem either way — even if we were
processing the actual free ranges, if the page_info grew and we didn't
have enough contiguous space for the new frametable we were hosed
anyway.

So the straw man patch ends up being really simple, as a seed for
bikeshedding. Just take a 'liveupdate=' region on the command line,
which kexec(8) can find from the running Xen. The initial Xen needs to
ensure that it *won't* allocate any pages from that range which will
subsequently need to be preserved across live update, which isn't done
yet. We just need to make sure that any page which might be given to
share_xen_page_with_guest() is allocated appropriately.
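For concreteness, the grammar the new parser accepts is SIZE[@START[:DATA]], so e.g. `liveupdate=256M`, `liveupdate=256M@1G` or `liveupdate=256M@1G:0x1234`. Below is a minimal userspace model of that grammar; it is an illustrative sketch (names invented), not the hypervisor parser itself:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Illustrative model of the 'liveupdate=' grammar: SIZE[@START[:DATA]],
 * where SIZE and START take k/M/G suffixes in the style of Xen's
 * parse_size_and_unit().  Userspace sketch only.
 */
static uint64_t parse_size(const char *s, const char **end)
{
    uint64_t v = strtoull(s, (char **)end, 0);

    switch ( **end )
    {
    case 'G': case 'g': v <<= 30; ++*end; break;
    case 'M': case 'm': v <<= 20; ++*end; break;
    case 'K': case 'k': v <<= 10; ++*end; break;
    }

    return v;
}

/* Returns 0 on success; *start and *data are 0 when absent. */
int parse_liveupdate_arg(const char *s, uint64_t *size,
                         uint64_t *start, uint64_t *data)
{
    *start = *data = 0;

    *size = parse_size(s, &s);
    if ( !*size )
        return -1;

    if ( *s == '@' )
    {
        *start = parse_size(s + 1, &s);
        if ( !*start )
            return -1;
        if ( *s == ':' )
            *data = strtoull(s + 1, (char **)&s, 0);
    }

    /* Any trailing junk (or a ':' without '@') is rejected. */
    return *s ? -1 : 0;
}
```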

The part which actually hands over the live update state isn't included
yet, so this really does just *defer* the addition of the memory until
a little bit later in __start_xen(). Actually taking ranges out of it
will come later.


David Woodhouse (3):
      x86/setup: Don't skip 2MiB underneath relocated Xen image
      x86/boot: Reserve live update boot memory
      Add KEXEC_RANGE_MA_LIVEUPDATE

 xen/arch/x86/machine_kexec.c |  15 ++++--
 xen/arch/x86/setup.c         | 122 +++++++++++++++++++++++++++++++++++++++----
 xen/include/public/kexec.h   |   1 +
 3 files changed, 124 insertions(+), 14 deletions(-)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image
  2020-01-08 17:24 [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
@ 2020-01-08 17:24 ` David Woodhouse
  2020-01-08 17:24   ` [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory David Woodhouse
                     ` (2 more replies)
  2020-01-13 11:54 ` [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
  1 sibling, 3 replies; 16+ messages in thread
From: David Woodhouse @ 2020-01-08 17:24 UTC (permalink / raw)
  To: Xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné

From: David Woodhouse <dwmw@amazon.co.uk>

Set 'e' correctly to reflect the location that Xen is actually relocated
to from its default 2MiB location. Not 2MiB below that.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 xen/arch/x86/setup.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 501f3f5e4b..47e065e5fe 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1077,9 +1077,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
             unsigned long pte_update_limit;
 
             /* Select relocation address. */
-            e = end - reloc_size;
-            xen_phys_start = e;
-            bootsym(trampoline_xen_phys_start) = e;
+            xen_phys_start = end - reloc_size;
+            e = xen_phys_start + XEN_IMG_OFFSET;
+            bootsym(trampoline_xen_phys_start) = xen_phys_start;
 
             /*
              * No PTEs pointing above this address are candidates for relocation.
@@ -1096,7 +1096,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
              * data until after we have switched to the relocated pagetables!
              */
             barrier();
-            move_memory(e + XEN_IMG_OFFSET, XEN_IMG_OFFSET, _end - _start, 1);
+            move_memory(e, XEN_IMG_OFFSET, _end - _start, 1);
 
             /* Walk initial pagetables, relocating page directory entries. */
             pl4e = __va(__pa(idle_pg_table));
-- 
2.21.0



* [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory
  2020-01-08 17:24 ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image David Woodhouse
@ 2020-01-08 17:24   ` David Woodhouse
  2020-01-20 16:58     ` Jan Beulich
  2020-01-08 17:25   ` [Xen-devel] [RFC PATCH 3/3] Add KEXEC_RANGE_MA_LIVEUPDATE David Woodhouse
  2020-01-10 11:15   ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image Durrant, Paul
  2 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2020-01-08 17:24 UTC (permalink / raw)
  To: Xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné

From: David Woodhouse <dwmw@amazon.co.uk>

For live update to work, it will need a region of memory that can be
given to the boot allocator while it parses the state information from
the previous Xen and works out which of the other pages of memory it
can consume.

Reserve that like the crashdump region, and accept it on the command
line. Use only that region for early boot, and register the remaining
RAM (all of it for now, until the real live update happens) later.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 xen/arch/x86/setup.c | 114 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 107 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 47e065e5fe..650d70c1fc 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -678,6 +678,41 @@ static unsigned int __init copy_bios_e820(struct e820entry *map, unsigned int li
     return n;
 }
 
+static unsigned long lu_bootmem_start, lu_bootmem_size, lu_data;
+
+static int __init parse_liveupdate(const char *str)
+{
+    const char *cur;
+    lu_bootmem_size = parse_size_and_unit(cur = str, &str);
+    if (!lu_bootmem_size || cur == str)
+        return -EINVAL;
+
+    if (!*str) {
+        printk("Live update size 0x%lx\n", lu_bootmem_size);
+        return 0;
+    }
+    if (*str != '@')
+        return -EINVAL;
+    lu_bootmem_start = parse_size_and_unit(cur = str + 1, &str);
+    if (!lu_bootmem_start || cur == str)
+        return -EINVAL;
+
+    printk("Live update area 0x%lx-0x%lx (0x%lx)\n", lu_bootmem_start,
+           lu_bootmem_start + lu_bootmem_size, lu_bootmem_size);
+
+    if (!*str)
+        return 0;
+    if (*str != ':')
+        return -EINVAL;
+    lu_data = simple_strtoull(cur = str + 1, &str, 0);
+    if (!lu_data || cur == str)
+        return -EINVAL;
+
+    printk("Live update data at 0x%lx\n", lu_data);
+    return 0;
+}
+custom_param("liveupdate", parse_liveupdate);
+
 void __init noreturn __start_xen(unsigned long mbi_p)
 {
     char *memmap_type = NULL;
@@ -687,7 +722,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     module_t *mod;
     unsigned long nr_pages, raw_max_page, modules_headroom, module_map[1];
     int i, j, e820_warn = 0, bytes = 0;
-    bool acpi_boot_table_init_done = false, relocated = false;
+    bool acpi_boot_table_init_done = false, relocated = false, lu_reserved = false;
     int ret;
     struct ns16550_defaults ns16550 = {
         .data_bits = 8,
@@ -980,6 +1015,22 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     set_kexec_crash_area_size((u64)nr_pages << PAGE_SHIFT);
     kexec_reserve_area(&boot_e820);
 
+    if ( lu_bootmem_start )
+    {
+        /* XX: Check it's in usable memory first */
+        reserve_e820_ram(&boot_e820, lu_bootmem_start, lu_bootmem_start + lu_bootmem_size);
+
+        /* Since it will already be out of the e820 map by the time of
+         * the first loop over physical memory, map it manually here. */
+        set_pdx_range(lu_bootmem_start >> PAGE_SHIFT,
+                      (lu_bootmem_start + lu_bootmem_size) >> PAGE_SHIFT);
+        map_pages_to_xen((unsigned long)__va(lu_bootmem_start),
+                         maddr_to_mfn(lu_bootmem_start),
+                         PFN_DOWN(lu_bootmem_size), PAGE_HYPERVISOR);
+
+        lu_reserved = true;
+    }
+
     initial_images = mod;
     nr_initial_images = mbi->mods_count;
 
@@ -1204,6 +1255,16 @@ void __init noreturn __start_xen(unsigned long mbi_p)
             printk("New Xen image base address: %#lx\n", xen_phys_start);
         }
 
+        /* Is the region suitable for the live update bootmem region? */
+        if ( lu_bootmem_size && !lu_bootmem_start && e < limit )
+        {
+            end = consider_modules(s, e, lu_bootmem_size, mod, mbi->mods_count + relocated, -1);
+            if ( end )
+            {
+                e = lu_bootmem_start = end - lu_bootmem_size;
+            }
+        }
+
         /* Is the region suitable for relocating the multiboot modules? */
         for ( j = mbi->mods_count - 1; j >= 0; j-- )
         {
@@ -1267,6 +1328,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( !xen_phys_start )
         panic("Not enough memory to relocate Xen\n");
 
+    if ( lu_bootmem_start )
+    {
+        if ( !lu_reserved )
+            reserve_e820_ram(&boot_e820, lu_bootmem_start, lu_bootmem_start + lu_bootmem_size);
+        printk("LU bootmem: 0x%lx - 0x%lx\n", lu_bootmem_start, lu_bootmem_start + lu_bootmem_size);
+        init_boot_pages(lu_bootmem_start, lu_bootmem_start + lu_bootmem_size);
+        lu_reserved = true;
+    }
+
     /* This needs to remain in sync with xen_in_range(). */
     reserve_e820_ram(&boot_e820, __pa(_stext), __pa(__2M_rwdata_end));
 
@@ -1278,8 +1348,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
         xenheap_max_mfn(PFN_DOWN(highmem_start - 1));
 
     /*
-     * Walk every RAM region and map it in its entirety (on x86/64, at least)
-     * and notify it to the boot allocator.
+     * Walk every RAM region and map it in its entirety and (unless in
+     * live update mode) notify it to the boot allocator.
      */
     for ( i = 0; i < boot_e820.nr_map; i++ )
     {
@@ -1329,6 +1399,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
                 printk(XENLOG_WARNING "Ignoring inaccessible memory range"
                                       " %013"PRIx64"-%013"PRIx64"\n",
                        s, e);
+                reserve_e820_ram(&boot_e820, s, e);
                 continue;
             }
             map_e = e;
@@ -1336,6 +1407,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
             printk(XENLOG_WARNING "Ignoring inaccessible memory range"
                                   " %013"PRIx64"-%013"PRIx64"\n",
                    e, map_e);
+            reserve_e820_ram(&boot_e820, e, map_e);
         }
 
         set_pdx_range(s >> PAGE_SHIFT, e >> PAGE_SHIFT);
@@ -1346,7 +1418,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
                       ARRAY_SIZE(l2_identmap) << L2_PAGETABLE_SHIFT);
 
         /* Pass mapped memory to allocator /before/ creating new mappings. */
-        init_boot_pages(s, min(map_s, e));
+        if ( !lu_reserved )
+            init_boot_pages(s, min(map_s, e));
+
         s = map_s;
         if ( s < map_e )
         {
@@ -1354,7 +1428,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
             map_s = (s + mask) & ~mask;
             map_e &= ~mask;
-            init_boot_pages(map_s, map_e);
+            if ( !lu_reserved )
+                init_boot_pages(map_s, map_e);
         }
 
         if ( map_s > map_e )
@@ -1370,7 +1445,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
             {
                 map_pages_to_xen((unsigned long)__va(map_e), maddr_to_mfn(map_e),
                                  PFN_DOWN(end - map_e), PAGE_HYPERVISOR);
-                init_boot_pages(map_e, end);
+                if ( !lu_reserved )
+                    init_boot_pages(map_e, end);
                 map_e = end;
             }
         }
@@ -1385,7 +1461,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
         {
             map_pages_to_xen((unsigned long)__va(s), maddr_to_mfn(s),
                              PFN_DOWN(map_s - s), PAGE_HYPERVISOR);
-            init_boot_pages(s, map_s);
+            if ( !lu_reserved )
+                init_boot_pages(s, map_s);
         }
     }
 
@@ -1483,6 +1560,29 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     numa_initmem_init(0, raw_max_page);
 
+    if ( lu_bootmem_start )
+    {
+        unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
+        uint64_t mask = PAGE_SIZE - 1;
+
+        for ( i = 0; i < boot_e820.nr_map; i++ )
+        {
+            uint64_t s, e;
+
+            if ( boot_e820.map[i].type != E820_RAM )
+                continue;
+            s = (boot_e820.map[i].addr + mask) & ~mask;
+            e = (boot_e820.map[i].addr + boot_e820.map[i].size) & ~mask;
+            s = max_t(uint64_t, s, 1<<20);
+            if ( PFN_DOWN(s) > limit )
+                continue;
+            if ( PFN_DOWN(e) > limit )
+                e = pfn_to_paddr(limit);
+
+            init_boot_pages(s, e);
+        }
+    }
+
     if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) )
     {
         unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
-- 
2.21.0



* [Xen-devel] [RFC PATCH 3/3] Add KEXEC_RANGE_MA_LIVEUPDATE
  2020-01-08 17:24 ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image David Woodhouse
  2020-01-08 17:24   ` [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory David Woodhouse
@ 2020-01-08 17:25   ` David Woodhouse
  2020-01-10 11:15   ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image Durrant, Paul
  2 siblings, 0 replies; 16+ messages in thread
From: David Woodhouse @ 2020-01-08 17:25 UTC (permalink / raw)
  To: Xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné

From: David Woodhouse <dwmw@amazon.co.uk>

This allows kexec userspace to tell the next Xen where the range is,
on its command line.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 xen/arch/x86/machine_kexec.c | 15 ++++++++++++---
 xen/arch/x86/setup.c         |  2 +-
 xen/include/public/kexec.h   |  1 +
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/machine_kexec.c b/xen/arch/x86/machine_kexec.c
index b70d5a6a86..f0c4617234 100644
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -184,11 +184,20 @@ void machine_kexec(struct kexec_image *image)
                 image->head, image->entry_maddr, reloc_flags);
 }
 
+extern unsigned long lu_bootmem_start, lu_bootmem_size;
+
 int machine_kexec_get(xen_kexec_range_t *range)
 {
-	if (range->range != KEXEC_RANGE_MA_XEN)
-		return -EINVAL;
-	return machine_kexec_get_xen(range);
+    switch (range->range) {
+    case KEXEC_RANGE_MA_XEN:
+        return machine_kexec_get_xen(range);
+    case KEXEC_RANGE_MA_LIVEUPDATE:
+        range->start = lu_bootmem_start;
+        range->size = lu_bootmem_size;
+        return 0;
+    default:
+        return -EINVAL;
+    }
 }
 
 void arch_crash_save_vmcoreinfo(void)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 650d70c1fc..11c1ba8e91 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -678,7 +678,7 @@ static unsigned int __init copy_bios_e820(struct e820entry *map, unsigned int li
     return n;
 }
 
-static unsigned long lu_bootmem_start, lu_bootmem_size, lu_data;
+unsigned long lu_bootmem_start, lu_bootmem_size, lu_data;
 
 static int __init parse_liveupdate(const char *str)
 {
diff --git a/xen/include/public/kexec.h b/xen/include/public/kexec.h
index 3f2a118381..298381af8d 100644
--- a/xen/include/public/kexec.h
+++ b/xen/include/public/kexec.h
@@ -150,6 +150,7 @@ typedef struct xen_kexec_load_v1 {
 #define KEXEC_RANGE_MA_EFI_MEMMAP 5 /* machine address and size of
                                      * of the EFI Memory Map */
 #define KEXEC_RANGE_MA_VMCOREINFO 6 /* machine address and size of vmcoreinfo */
+#define KEXEC_RANGE_MA_LIVEUPDATE 7 /* Boot mem for live update */
 
 /*
  * Find the address and size of certain memory areas
-- 
2.21.0
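On the consumer side, kexec(8) would query the new range and turn it into the command-line option described in the cover letter. The sketch below shows only the string construction; the hypercall plumbing (e.g. via libxenctrl's xc_kexec_get_range()) is alluded to in comments only, and the helper name is invented for illustration:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper for kexec userspace: given the start and size of
 * the KEXEC_RANGE_MA_LIVEUPDATE range (as would be returned by the
 * kexec_get_range interface), build the 'liveupdate=' option for the
 * next Xen's command line.
 */
static int format_liveupdate_opt(char *buf, size_t len,
                                 uint64_t start, uint64_t size)
{
    int n;

    if ( !size )
        return -1;          /* no live update region advertised */

    n = snprintf(buf, len, "liveupdate=%#"PRIx64"@%#"PRIx64, size, start);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```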



* Re: [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image
  2020-01-08 17:24 ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image David Woodhouse
  2020-01-08 17:24   ` [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory David Woodhouse
  2020-01-08 17:25   ` [Xen-devel] [RFC PATCH 3/3] Add KEXEC_RANGE_MA_LIVEUPDATE David Woodhouse
@ 2020-01-10 11:15   ` Durrant, Paul
  2020-01-10 12:15     ` David Woodhouse
  2 siblings, 1 reply; 16+ messages in thread
From: Durrant, Paul @ 2020-01-10 11:15 UTC (permalink / raw)
  To: David Woodhouse, Xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> David Woodhouse
> Sent: 08 January 2020 17:25
> To: Xen-devel <xen-devel@lists.xenproject.org>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; Julien Grall
> <julien@xen.org>; Wei Liu <wl@xen.org>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>;
> Andrew Cooper <andrew.cooper3@citrix.com>; paul@xen.org; Ian Jackson
> <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>; Roger Pau
> Monné <roger.pau@citrix.com>
> Subject: [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath
> relocated Xen image
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Set 'e' correctly to reflect the location that Xen is actually relocated
> to from its default 2MiB location. Not 2MiB below that.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  xen/arch/x86/setup.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 501f3f5e4b..47e065e5fe 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1077,9 +1077,9 @@ void __init noreturn __start_xen(unsigned long
> mbi_p)
>              unsigned long pte_update_limit;
> 
>              /* Select relocation address. */
> -            e = end - reloc_size;
> -            xen_phys_start = e;
> -            bootsym(trampoline_xen_phys_start) = e;
> +            xen_phys_start = end - reloc_size;
> +            e = xen_phys_start + XEN_IMG_OFFSET;
> +            bootsym(trampoline_xen_phys_start) = xen_phys_start;
> 
>              /*
>               * No PTEs pointing above this address are candidates for
> relocation.

Do you not also need to adjust the setting of pte_update_limit that's just out of context below here?

  Paul

* Re: [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image
  2020-01-10 11:15   ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image Durrant, Paul
@ 2020-01-10 12:15     ` David Woodhouse
  0 siblings, 0 replies; 16+ messages in thread
From: David Woodhouse @ 2020-01-10 12:15 UTC (permalink / raw)
  To: Durrant, Paul, Xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné


On Fri, 2020-01-10 at 11:15 +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf
> > Of
> > David Woodhouse
> > Sent: 08 January 2020 17:25
> > To: Xen-devel <xen-devel@lists.xenproject.org>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; Julien Grall
> > <julien@xen.org>; Wei Liu <wl@xen.org>; Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com>; George Dunlap <
> > George.Dunlap@eu.citrix.com>;
> > Andrew Cooper <andrew.cooper3@citrix.com>; paul@xen.org; Ian
> > Jackson
> > <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>; Roger
> > Pau
> > Monné <roger.pau@citrix.com>
> > Subject: [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB
> > underneath
> > relocated Xen image
> > 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > Set 'e' correctly to reflect the location that Xen is actually
> > relocated
> > to from its default 2MiB location. Not 2MiB below that.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  xen/arch/x86/setup.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> > index 501f3f5e4b..47e065e5fe 100644
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -1077,9 +1077,9 @@ void __init noreturn __start_xen(unsigned
> > long
> > mbi_p)
> >              unsigned long pte_update_limit;
> > 
> >              /* Select relocation address. */
> > -            e = end - reloc_size;
> > -            xen_phys_start = e;
> > -            bootsym(trampoline_xen_phys_start) = e;
> > +            xen_phys_start = end - reloc_size;
> > +            e = xen_phys_start + XEN_IMG_OFFSET;
> > +            bootsym(trampoline_xen_phys_start) = xen_phys_start;
> > 
> >              /*
> >               * No PTEs pointing above this address are candidates
> > for
> > relocation.
> 
> Do you not also need to adjust the setting of pte_update_limit that's
> just out of context below here?

Yes. I missed that when forward-porting to the master branch. Thanks.



* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-08 17:24 [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
  2020-01-08 17:24 ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image David Woodhouse
@ 2020-01-13 11:54 ` David Woodhouse
  2020-01-14 14:15   ` Julien Grall
  1 sibling, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2020-01-13 11:54 UTC (permalink / raw)
  To: Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Jan Beulich,
	Roger Pau Monné


On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> When doing a live update, Xen needs to be very careful not to scribble
> on pages which contain guest memory or state information for the
> domains which are being preserved.
> 
> The information about which pages are in use is contained in the live
> update state passed from the previous Xen — which is mostly just a
> guest-transparent live migration data stream, except that it points to
> the page tables in place in memory while traditional live migration
> obviously copies the pages separately.
> 
> Our initial implementation actually prepended a list of 'in-use' ranges
> to the live update state, and made the boot allocator treat them the
> same as 'bad pages'. That worked well enough for initial development
> but wouldn't scale to a live production system, mainly because the boot
> allocator has a limit of 512 memory ranges that it can keep track of,
> and a real system would end up more fragmented than that.
> 
> My other concern with that approach is that it required two passes over
> the domain-owned pages. We have to do a later pass *anyway*, as we set
> up ownership in the frametable for each page — and that has to happen
> after we've managed to allocate a 'struct domain' for each page_info to
> point to. If we want to keep the pause time due to a live update down
> to a bare minimum, doing two passes over the full set of domain pages
> isn't my favourite strategy.
> 
> So we've settled on a simpler approach — reserve a contiguous region
> of physical memory which *won't* be used for domain pages. Let the boot
> allocator see *only* that region of memory, and plug the rest of the
> memory in later only after doing a full pass of the live update state.
> 
> This means that we have to ensure the reserved region is large enough,
> but ultimately we had that problem either way — even if we were
> processing the actual free ranges, if the page_info grew and we didn't
> have enough contiguous space for the new frametable we were hosed
> anyway.
> 
> So the straw man patch ends up being really simple, as a seed for
> bikeshedding. Just take a 'liveupdate=' region on the command line,
> which kexec(8) can find from the running Xen. The initial Xen needs to
> ensure that it *won't* allocate any pages from that range which will
> subsequently need to be preserved across live update, which isn't done
> yet. We just need to make sure that any page which might be given to
> share_xen_page_with_guest() is allocated appropriately.
> 
> The part which actually hands over the live update state isn't included
> yet, so this really does just *defer* the addition of the memory until
> a little bit later in __start_xen(). Actually taking ranges out of it
> will come later.

What isn't addressed in this series is actually *honouring* the promise
not to put pages into the reserved LU bootmem region that need to be
preserved over live update. As things stand, we just add them to the
heap anyway in end_boot_allocator().

It isn't even sufficient to use these pages for xenheap allocations and
not domheap, since there are cases where we allocate from the xenheap
and then share pages to a domain.

Hongyan's patches to kill the directmap have already started addressing
a bunch of the places that do that, so what I'm inclined to do in the
short term is just *not* use the remaining space in the reserved LU
bootmem region. Use it for boot time allocations (including the
frametable) only, and *not* insert the rest of those pages into the
heap allocator in end_boot_allocator() for now. If sized appropriately,
there shouldn't be much wastage anyway. We can refine it and ensure
that we can use those pages but *not* for domain allocations, once the
dust has settled on the directmap removal.
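The check described above could be as simple as a range test applied when end_boot_allocator() walks the remaining boot pages, so that pages inside the reserved live update region are never handed to the heap. A standalone sketch, with hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate for end_boot_allocator(): skip any page whose
 * machine address falls inside the reserved live update bootmem region,
 * so it can never be allocated to (or shared with) a domain.
 */
static bool in_lu_bootmem_region(uint64_t maddr, uint64_t lu_start,
                                 uint64_t lu_size)
{
    return maddr >= lu_start && maddr - lu_start < lu_size;
}
```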




* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-13 11:54 ` [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
@ 2020-01-14 14:15   ` Julien Grall
  2020-01-14 14:48     ` David Woodhouse
  0 siblings, 1 reply; 16+ messages in thread
From: Julien Grall @ 2020-01-14 14:15 UTC (permalink / raw)
  To: David Woodhouse, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné

Hi David,

On 13/01/2020 11:54, David Woodhouse wrote:
> On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
>> When doing a live update, Xen needs to be very careful not to scribble
>> on pages which contain guest memory or state information for the
>> domains which are being preserved.
>>
>> The information about which pages are in use is contained in the live
>> update state passed from the previous Xen — which is mostly just a
>> guest-transparent live migration data stream, except that it points to
>> the page tables in place in memory while traditional live migration
>> obviously copies the pages separately.
>>
>> Our initial implementation actually prepended a list of 'in-use' ranges
>> to the live update state, and made the boot allocator treat them the
>> same as 'bad pages'. That worked well enough for initial development
>> but wouldn't scale to a live production system, mainly because the boot
>> allocator has a limit of 512 memory ranges that it can keep track of,
>> and a real system would end up more fragmented than that.
>>
>> My other concern with that approach is that it required two passes over
>> the domain-owned pages. We have to do a later pass *anyway*, as we set
>> up ownership in the frametable for each page — and that has to happen
>> after we've managed to allocate a 'struct domain' for each page_info to
>> point to. If we want to keep the pause time due to a live update down
>> to a bare minimum, doing two passes over the full set of domain pages
>> isn't my favourite strategy.

We actually need one more pass for PV domains (at least). That pass is 
used to allocate the page types (e.g. L4, L1, ...). It can't be done 
earlier because the pages need to belong to the guest before we walk 
its page-tables.

>>
>> So we've settled on a simpler approach — reserve a contiguous region
>> of physical memory which *won't* be used for domain pages. Let the boot
>> allocator see *only* that region of memory, and plug the rest of the
>> memory in later only after doing a full pass of the live update state.

It is a bit unclear what the region will be used for. If you plan to put 
the state of the VMs in it, then you can't possibly use it for boot 
allocation (e.g. the frametable), otherwise it may be overwritten when 
doing the live update.
The problem would arise in the first Xen, but also in the second Xen if 
you plan to live update again.

Did I miss anything?

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-14 14:15   ` Julien Grall
@ 2020-01-14 14:48     ` David Woodhouse
  2020-01-14 15:00       ` Julien Grall
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2020-01-14 14:48 UTC (permalink / raw)
  To: Julien Grall, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné


On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
> Hi David,
> 
> On 13/01/2020 11:54, David Woodhouse wrote:
> > On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> > > When doing a live update, Xen needs to be very careful not to scribble
> > > on pages which contain guest memory or state information for the
> > > domains which are being preserved.
> > > 
> > > The information about which pages are in use is contained in the live
> > > update state passed from the previous Xen — which is mostly just a
> > > guest-transparent live migration data stream, except that it points to
> > > the page tables in place in memory while traditional live migration
> > > obviously copies the pages separately.
> > > 
> > > Our initial implementation actually prepended a list of 'in-use' ranges
> > > to the live update state, and made the boot allocator treat them the
> > > same as 'bad pages'. That worked well enough for initial development
> > > but wouldn't scale to a live production system, mainly because the boot
> > > allocator has a limit of 512 memory ranges that it can keep track of,
> > > and a real system would end up more fragmented than that.
> > > 
> > > My other concern with that approach is that it required two passes over
> > > the domain-owned pages. We have to do a later pass *anyway*, as we set
> > > up ownership in the frametable for each page — and that has to happen
> > > after we've managed to allocate a 'struct domain' for each page_info to
> > > point to. If we want to keep the pause time due to a live update down
> > > to a bare minimum, doing two passes over the full set of domain pages
> > > isn't my favourite strategy.
> 
> We actually need one more pass for PV domains (at least). That pass is 
> used to allocate the page types (e.g. L4, L1, ...). It can't be done 
> earlier because the pages need to belong to the guest before we walk 
> its page-tables.

All the more reason why I don't want to do an *additional* pass just
for the allocator.

> > > 
> > > So we've settled on a simpler approach — reserve a contiguous region
> > > of physical memory which *won't* be used for domain pages. Let the boot
> > > allocator see *only* that region of memory, and plug the rest of the
> > > memory in later only after doing a full pass of the live update state.
> 
> It is a bit unclear what the region will be used for. If you plan to put
> the state of the VMs in it, then you can't possibly use it for boot
> allocation (e.g. the frametable), otherwise it may be overwritten when
> doing the live update.

Right. This is only for boot time allocations by Xen#2, before it's
processed the LU data and knows which parts of the rest of memory it
can use. It allocates its frame table from there, as well as anything
else it needs to allocate before/while processing the LU data.

As an implementation detail, I anticipate that we'll be using the boot
allocator for that early part from the reserved region, and that the
switch to using the full available memory (less those pages already in-
use) will *coincide* with switching to the real heap allocator.

The reserved region *isn't* for the LU data itself. That can be
allocated from arbitrary pages *outside* the reserved area, in Xen#1.
Xen#2 can vmap those pages, and needs to avoid stomping on them just
like it needs to avoid stomping on actual domain-owned pages.

The plan is that Xen#1 allocates arbitrary pages to store the actual LU
data. Then another page (or higher order allocation if we need >2MiB of
actual LU data) containing the MFNs of all those data pages. Then we
need to somehow pass the address of that MFN-list to Xen#2.

My current plan is to put *that* in the first 64 bits of the reserved
LU bootmem region, and load it from there early in the Xen#2 boot
process. I'm looking at adding an IND_WRITE64 primitive to the kimage
processing, to allow it to be trivially appended for kexec_reloc() to
obey.



* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-14 14:48     ` David Woodhouse
@ 2020-01-14 15:00       ` Julien Grall
  2020-01-14 15:20         ` David Woodhouse
  0 siblings, 1 reply; 16+ messages in thread
From: Julien Grall @ 2020-01-14 15:00 UTC (permalink / raw)
  To: David Woodhouse, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné



On 14/01/2020 14:48, David Woodhouse wrote:
> On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
>> Hi David,
>>
>> On 13/01/2020 11:54, David Woodhouse wrote:
>>> On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
>>>> So we've settled on a simpler approach — reserve a contiguous region
>>>> of physical memory which *won't* be used for domain pages. Let the boot
>>>> allocator see *only* that region of memory, and plug the rest of the
>>>> memory in later only after doing a full pass of the live update state.
>>
>> It is a bit unclear what the region will be used for. If you plan to put
>> the state of the VMs in it, then you can't possibly use it for boot
>> allocation (e.g. the frametable), otherwise it may be overwritten when
>> doing the live update.
> 
> Right. This is only for boot time allocations by Xen#2, before it's
> processed the LU data and knows which parts of the rest of memory it
> can use. It allocates its frame table from there, as well as anything
> else it needs to allocate before/while processing the LU data.

It would be worth documenting what the expectations for the buffer are. 
Maybe in xen-command-line along with the rest of the new option you 
introduced? Or in a separate document.

> As an implementation detail, I anticipate that we'll be using the boot
> allocator for that early part from the reserved region, and that the
> switch to using the full available memory (less those pages already in-
> use) will *coincide* with switching to the real heap allocator.
> 
> The reserved region *isn't* for the LU data itself. That can be
> allocated from arbitrary pages *outside* the reserved area, in Xen#1.
> Xen#2 can vmap those pages, and needs to avoid stomping on them just
> like it needs to avoid stomping on actual domain-owned pages.
> 
> The plan is that Xen#1 allocates arbitrary pages to store the actual LU
> data. Then another page (or higher order allocation if we need >2MiB of
> actual LU data) containing the MFNs of all those data pages. Then we
> need to somehow pass the address of that MFN-list to Xen#2.
> 
> My current plan is to put *that* in the first 64 bits of the reserved
> LU bootmem region, and load it from there early in the Xen#2 boot
> process. I'm looking at adding an IND_WRITE64 primitive to the kimage
> processing, to allow it to be trivially appended for kexec_reloc() to
> obey.

Wouldn't it be better to reserve the first 4K page of the LU bootmem region?

Otherwise, you may run into the same trouble as described above (to a 
lesser extent) if the 64-bit value overwrites anything useful for the 
current Xen. But I guess you could delay the write until just before 
you jump to Xen#2.

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-14 15:00       ` Julien Grall
@ 2020-01-14 15:20         ` David Woodhouse
  2020-01-14 16:29           ` Julien Grall
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2020-01-14 15:20 UTC (permalink / raw)
  To: Julien Grall, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné


On Tue, 2020-01-14 at 15:00 +0000, Julien Grall wrote:
> 
> On 14/01/2020 14:48, David Woodhouse wrote:
> > On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
> > > Hi David,
> > > 
> > > On 13/01/2020 11:54, David Woodhouse wrote:
> > > > On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
> > > > > So we've settled on a simpler approach — reserve a contiguous region
> > > > > of physical memory which *won't* be used for domain pages. Let the boot
> > > > > allocator see *only* that region of memory, and plug the rest of the
> > > > > memory in later only after doing a full pass of the live update state.
> > > 
> > > It is a bit unclear what the region will be used for. If you plan to put
> > > the state of the VMs in it, then you can't possibly use it for boot
> > > allocation (e.g. the frametable), otherwise it may be overwritten when
> > > doing the live update.
> > 
> > Right. This is only for boot time allocations by Xen#2, before it's
> > processed the LU data and knows which parts of the rest of memory it
> > can use. It allocates its frame table from there, as well as anything
> > else it needs to allocate before/while processing the LU data.
> 
> It would be worth documenting what the expectations for the buffer are. 
> Maybe in xen-command-line along with the rest of the new option you 
> introduced? Or in a separate document.

Kind of need to implement that part too, and then we can document what
it finally looks like :)

> > As an implementation detail, I anticipate that we'll be using the boot
> > allocator for that early part from the reserved region, and that the
> > switch to using the full available memory (less those pages already in-
> > use) will *coincide* with switching to the real heap allocator.
> > 
> > The reserved region *isn't* for the LU data itself. That can be
> > allocated from arbitrary pages *outside* the reserved area, in Xen#1.
> > Xen#2 can vmap those pages, and needs to avoid stomping on them just
> > like it needs to avoid stomping on actual domain-owned pages.
> > 
> > The plan is that Xen#1 allocates arbitrary pages to store the actual LU
> > data. Then another page (or higher order allocation if we need >2MiB of
> > actual LU data) containing the MFNs of all those data pages. Then we
> > need to somehow pass the address of that MFN-list to Xen#2.
> > 
> > My current plan is to put *that* in the first 64 bits of the reserved
> > LU bootmem region, and load it from there early in the Xen#2 boot
> > process. I'm looking at adding an IND_WRITE64 primitive to the kimage
> > processing, to allow it to be trivially appended for kexec_reloc() to
> > obey.
> 
> Wouldn't it be better to reserve the first 4K page of the LU bootmem region?
> 
> Otherwise, you may run into the same trouble as described above (to a 
> lesser extent) if the 64-bit value overwrites anything useful for the 
> current Xen. But I guess you could delay the write until just before 
> you jump to Xen#2.

That's the point in appending an IND_WRITE64 operation to the kimage
stream. The actual write is done in the last gasp of kexec_reloc()
after Xen#1 is quiescent, on the way into purgatory.

So when Xen#1 has created the LU data stream (for which the pointer to
the root of that data structure is page-aligned), it just calls
  kimage_add_entry(image, IND_WRITE64 | lu_data_address);

--- a/xen/arch/x86/x86_64/kexec_reloc.S
+++ b/xen/arch/x86/x86_64/kexec_reloc.S
@@ -131,11 +131,18 @@ is_source:
         jmp     next_entry
 is_zero:
         testb   $IND_ZERO, %cl
-        jz      next_entry
+        jz      is_write64
         movl    $(PAGE_SIZE / 8), %ecx  /* Zero the destination page. */
         xorl    %eax, %eax
         rep stosq
         jmp     next_entry
+is_write64:
+        testb   $IND_WRITE64, %cl
+        jz      next_entry
+        andq    $PAGE_MASK, %rcx
+        movq    %rcx, %rax
+        stosq
+        jmp     next_entry
 done:
         popq    %rbx


* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-14 15:20         ` David Woodhouse
@ 2020-01-14 16:29           ` Julien Grall
  2020-01-15  7:40             ` David Woodhouse
  0 siblings, 1 reply; 16+ messages in thread
From: Julien Grall @ 2020-01-14 16:29 UTC (permalink / raw)
  To: David Woodhouse, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné

Hi David,

On 14/01/2020 15:20, David Woodhouse wrote:
> On Tue, 2020-01-14 at 15:00 +0000, Julien Grall wrote:
>>
>> On 14/01/2020 14:48, David Woodhouse wrote:
>>> On Tue, 2020-01-14 at 14:15 +0000, Julien Grall wrote:
>>>> Hi David,
>>>>
>>>> On 13/01/2020 11:54, David Woodhouse wrote:
>>>>> On Wed, 2020-01-08 at 17:24 +0000, David Woodhouse wrote:
>>>>>> So we've settled on a simpler approach — reserve a contiguous region
>>>>>> of physical memory which *won't* be used for domain pages. Let the boot
>>>>>> allocator see *only* that region of memory, and plug the rest of the
>>>>>> memory in later only after doing a full pass of the live update state.
>>>>
>>>> It is a bit unclear what the region will be used for. If you plan to put
>>>> the state of the VMs in it, then you can't possibly use it for boot
>>>> allocation (e.g. the frametable), otherwise it may be overwritten when
>>>> doing the live update.
>>>
>>> Right. This is only for boot time allocations by Xen#2, before it's
>>> processed the LU data and knows which parts of the rest of memory it
>>> can use. It allocates its frame table from there, as well as anything
>>> else it needs to allocate before/while processing the LU data.
>>
>> It would be worth documenting what the expectations for the buffer are.
>> Maybe in xen-command-line along with the rest of the new option you
>> introduced? Or in a separate document.
> 
> Kind of need to implement that part too, and then we can document what
> it finally looks like :)
> 
>>> As an implementation detail, I anticipate that we'll be using the boot
>>> allocator for that early part from the reserved region, and that the
>>> switch to using the full available memory (less those pages already in-
>>> use) will *coincide* with switching to the real heap allocator.
>>>
>>> The reserved region *isn't* for the LU data itself. That can be
>>> allocated from arbitrary pages *outside* the reserved area, in Xen#1.
>>> Xen#2 can vmap those pages, and needs to avoid stomping on them just
>>> like it needs to avoid stomping on actual domain-owned pages.
>>>
>>> The plan is that Xen#1 allocates arbitrary pages to store the actual LU
>>> data. Then another page (or higher order allocation if we need >2MiB of
>>> actual LU data) containing the MFNs of all those data pages. Then we
>>> need to somehow pass the address of that MFN-list to Xen#2.
>>>
>>> My current plan is to put *that* in the first 64 bits of the reserved
>>> LU bootmem region, and load it from there early in the Xen#2 boot
>>> process. I'm looking at adding an IND_WRITE64 primitive to the kimage
>>> processing, to allow it to be trivially appended for kexec_reloc() to
>>> obey.
>>
>> Wouldn't it be better to reserve the first 4K page of the LU bootmem region?
>>
>> Otherwise, you may run into the same trouble as described above (to a
>> lesser extent) if the 64-bit value overwrites anything useful for the
>> current Xen. But I guess you could delay the write until just before
>> you jump to Xen#2.
> 
> That's the point in appending an IND_WRITE64 operation to the kimage
> stream. The actual write is done in the last gasp of kexec_reloc()
> after Xen#1 is quiescent, on the way into purgatory.

I was not sure what you meant by IND_WRITE64. Maybe I should have asked 
about it first :). Thank you for the explanation!

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-14 16:29           ` Julien Grall
@ 2020-01-15  7:40             ` David Woodhouse
  2020-01-15 10:26               ` Julien Grall
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2020-01-15  7:40 UTC (permalink / raw)
  To: Julien Grall, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné


On Tue, 2020-01-14 at 16:29 +0000, Julien Grall wrote:
> > That's the point in appending an IND_WRITE64 operation to the kimage
> > stream. The actual write is done in the last gasp of kexec_reloc()
> > after Xen#1 is quiescent, on the way into purgatory.
> 
> I was not sure what you meant by IND_WRITE64. Maybe I should have asked 
> about it first :). Thank you for the explanation!

Don't you often find an email is made easier to understand by the
addition of a few lines of unified diff of assembler code...?


* Re: [Xen-devel] [RFC PATCH 0/3] Live update boot memory management
  2020-01-15  7:40             ` David Woodhouse
@ 2020-01-15 10:26               ` Julien Grall
  0 siblings, 0 replies; 16+ messages in thread
From: Julien Grall @ 2020-01-15 10:26 UTC (permalink / raw)
  To: David Woodhouse, Xen-devel, Xia, Hongyan
  Cc: Stefano Stabellini, Wei Liu, paul, George Dunlap, Andrew Cooper,
	Konrad Rzeszutek Wilk, Ian Jackson, Jan Beulich,
	Roger Pau Monné



On 15/01/2020 07:40, David Woodhouse wrote:
> On Tue, 2020-01-14 at 16:29 +0000, Julien Grall wrote:
>>> That's the point in appending an IND_WRITE64 operation to the kimage
>>> stream. The actual write is done in the last gasp of kexec_reloc()
>>> after Xen#1 is quiescent, on the way into purgatory.
>>
>> I was not sure what you meant by IND_WRITE64. Maybe I should have asked
>> about it first :). Thank you for the explanation!
> 
> Don't you often find an email is made easier to understand by the
> addition of a few lines of unified diff of assembler code...?

It definitely helps. I tend to prefer diff over a long paragraph trying 
to explain the same :).

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory
  2020-01-08 17:24   ` [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory David Woodhouse
@ 2020-01-20 16:58     ` Jan Beulich
  2020-01-20 17:24       ` David Woodhouse
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2020-01-20 16:58 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Xen-devel,
	Roger Pau Monné

On 08.01.2020 18:24, David Woodhouse wrote:
> @@ -980,6 +1015,22 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      set_kexec_crash_area_size((u64)nr_pages << PAGE_SHIFT);
>      kexec_reserve_area(&boot_e820);
>  
> +    if ( lu_bootmem_start )
> +    {
> +        /* XX: Check it's in usable memory first */
> +        reserve_e820_ram(&boot_e820, lu_bootmem_start, lu_bootmem_start + lu_bootmem_size);
> +
> +        /* Since it will already be out of the e820 map by the time the first
> +         * loop over physical memory runs, map it manually now. */
> +        set_pdx_range(lu_bootmem_start >> PAGE_SHIFT,
> +                      (lu_bootmem_start + lu_bootmem_size) >> PAGE_SHIFT);
> +        map_pages_to_xen((unsigned long)__va(lu_bootmem_start),
> +                         maddr_to_mfn(lu_bootmem_start),
> +                         PFN_DOWN(lu_bootmem_size), PAGE_HYPERVISOR);

Doesn't this require the range to be a multiple of 2Mb and below
4Gb? I don't see this enforced anywhere.

> @@ -1278,8 +1348,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>          xenheap_max_mfn(PFN_DOWN(highmem_start - 1));
>  
>      /*
> -     * Walk every RAM region and map it in its entirety (on x86/64, at least)
> -     * and notify it to the boot allocator.
> +     * Walk every RAM region and map it in its entirety and (unless in
> +     * live update mode) notify it to the boot allocator.
>       */
>      for ( i = 0; i < boot_e820.nr_map; i++ )
>      {
> @@ -1329,6 +1399,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>                  printk(XENLOG_WARNING "Ignoring inaccessible memory range"
>                                        " %013"PRIx64"-%013"PRIx64"\n",
>                         s, e);
> +                reserve_e820_ram(&boot_e820, s, e);
>                  continue;
>              }
>              map_e = e;
> @@ -1336,6 +1407,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>              printk(XENLOG_WARNING "Ignoring inaccessible memory range"
>                                    " %013"PRIx64"-%013"PRIx64"\n",
>                     e, map_e);
> +            reserve_e820_ram(&boot_e820, e, map_e);
>          }
>  
>          set_pdx_range(s >> PAGE_SHIFT, e >> PAGE_SHIFT);

What are these two hunks needed for? The comment you change further up
relates to ...

> @@ -1346,7 +1418,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>                        ARRAY_SIZE(l2_identmap) << L2_PAGETABLE_SHIFT);
>  
>          /* Pass mapped memory to allocator /before/ creating new mappings. */
> -        init_boot_pages(s, min(map_s, e));
> +        if ( !lu_reserved)
> +            init_boot_pages(s, min(map_s, e));

... this afaict.

Apart from this, also applicable to patch 3 - where I have no other
comments - there's quite a bit of style cleanup to be done here. And
of course the new command line option wants documenting. I can't
e.g. guess yet what lu_data is about, and hence why this is
apparently an address without an accompanying size.

Jan


* Re: [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory
  2020-01-20 16:58     ` Jan Beulich
@ 2020-01-20 17:24       ` David Woodhouse
  0 siblings, 0 replies; 16+ messages in thread
From: David Woodhouse @ 2020-01-20 17:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, paul, Ian Jackson, Xen-devel,
	Roger Pau Monné


On Mon, 2020-01-20 at 17:58 +0100, Jan Beulich wrote:
> On 08.01.2020 18:24, David Woodhouse wrote:
> > @@ -980,6 +1015,22 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >      set_kexec_crash_area_size((u64)nr_pages << PAGE_SHIFT);
> >      kexec_reserve_area(&boot_e820);
> >  
> > +    if ( lu_bootmem_start )
> > +    {
> > +        /* XX: Check it's in usable memory first */
> > +        reserve_e820_ram(&boot_e820, lu_bootmem_start, lu_bootmem_start + lu_bootmem_size);
> > +
> > +        /* Since it will already be out of the e820 map by the time the first
> > +         * loop over physical memory runs, map it manually now. */
> > +        set_pdx_range(lu_bootmem_start >> PAGE_SHIFT,
> > +                      (lu_bootmem_start + lu_bootmem_size) >> PAGE_SHIFT);
> > +        map_pages_to_xen((unsigned long)__va(lu_bootmem_start),
> > +                         maddr_to_mfn(lu_bootmem_start),
> > +                         PFN_DOWN(lu_bootmem_size), PAGE_HYPERVISOR);
> 
> Doesn't this require the range to be a multiple of 2Mb and below
> 4Gb? I don't see this enforced anywhere.

Aha, so *that's* why the mapping succeeded without having to allocate
any memory for PTEs. That did confuse me for a while, before I figured
my time was better spent in the short term by focusing on things I
don't understand that *weren't* working, rather than things I didn't
understand that *were*. :)

Yes, if this is the solution we end up with (and I do think it's still
the best option I've seen), then the requirement should be clearly
documented and enforced.

Andy and Hongyan are busy messing with all the 1:1 mappings, both at
boot time and at run time, so the actual restrictions may change.

> > @@ -1278,8 +1348,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >          xenheap_max_mfn(PFN_DOWN(highmem_start - 1));
> >  
> >      /*
> > -     * Walk every RAM region and map it in its entirety (on x86/64, at least)
> > -     * and notify it to the boot allocator.
> > +     * Walk every RAM region and map it in its entirety and (unless in
> > +     * live update mode) notify it to the boot allocator.
> >       */
> >      for ( i = 0; i < boot_e820.nr_map; i++ )
> >      {
> > @@ -1329,6 +1399,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >                  printk(XENLOG_WARNING "Ignoring inaccessible memory range"
> >                                        " %013"PRIx64"-%013"PRIx64"\n",
> >                         s, e);
> > +                reserve_e820_ram(&boot_e820, s, e);
> >                  continue;
> >              }
> >              map_e = e;
> > @@ -1336,6 +1407,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >              printk(XENLOG_WARNING "Ignoring inaccessible memory range"
> >                                    " %013"PRIx64"-%013"PRIx64"\n",
> >                     e, map_e);
> > +            reserve_e820_ram(&boot_e820, e, map_e);
> >          }
> >  
> >          set_pdx_range(s >> PAGE_SHIFT, e >> PAGE_SHIFT);
> 
> What are these two hunks needed for? The comment you change further up
> relates to ...

When we use only the LU-reserved region for bootmem, we defer the
registration of the other regions found in E820 to a later pass, after
we've consumed the live update state (and know which pages not to
touch).

So instead of just ignoring those inaccessible regions in the first
loop as we did before, we need to *mark* them reserved in our E820 data
so that they don't get registered by that second pass.

> > @@ -1346,7 +1418,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >                        ARRAY_SIZE(l2_identmap) << L2_PAGETABLE_SHIFT);
> >  
> >          /* Pass mapped memory to allocator /before/ creating new mappings. */
> > -        init_boot_pages(s, min(map_s, e));
> > +        if ( !lu_reserved)
> > +            init_boot_pages(s, min(map_s, e));
> 
> ... this afaict.

Kind of, but more to the point applicable to where we later *do*
register those pages, around line 1600.

> Apart from this, also applicable to patch 3 - where I have no other
> comments - there's quite a bit of style cleanup to b done here. And
> of course the new command line option wants documenting. I can't
> e.g. guess yet what lu_data is about, and hence why this is
> apparently an address without an accompanying size.

Right. The lu_data is intended to be the 'breadcrumb' which leads to
the actual live update state, which is scatter/gather across any pages
*outside* the reserved bootmem region.

Although it's hard to put it on the command line, since that has to be
handled by *userspace*, while the live update state is created *during*
the kexec hypercall by Xen itself. We've settled for now on putting
that breadcrumb into the start of the reserved bootmem region itself,
removing the need for a separate lu_data argument.

The series continues at
https://xenbits.xen.org/gitweb/?p=people/dwmw2/xen.git;a=shortlog;h=refs/heads/lu-master
and has reached the point where I can write "Hello World" to a live
update data stream and then frown grumpily at the next Xen telling me

(XEN) 1 pages of live update data at 23e24d000
(XEN) First live update data page at MFN 23ea34:
(XEN)  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

... because the first data page got zeroed during the transition.

I'll fix that, implement the code which actually excludes busy pages
from being registered in the heap (and fix up the fact that bad pages
above HYPERVISOR_VIRT_END are also not being dropped as they should,
while I'm at it), and post a second set for comment.

I'm mostly after feedback on the direction (for which the comment about 
how the first mapping succeeds was massively useful; thanks!) rather 
than the finer details of the implementation at this point. It's just 
that code is sometimes a better explanation of what I mean than prose.



end of thread, other threads:[~2020-01-20 17:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-08 17:24 [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
2020-01-08 17:24 ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image David Woodhouse
2020-01-08 17:24   ` [Xen-devel] [RFC PATCH 2/3] x86/boot: Reserve live update boot memory David Woodhouse
2020-01-20 16:58     ` Jan Beulich
2020-01-20 17:24       ` David Woodhouse
2020-01-08 17:25   ` [Xen-devel] [RFC PATCH 3/3] Add KEXEC_RANGE_MA_LIVEUPDATE David Woodhouse
2020-01-10 11:15   ` [Xen-devel] [RFC PATCH 1/3] x86/setup: Don't skip 2MiB underneath relocated Xen image Durrant, Paul
2020-01-10 12:15     ` David Woodhouse
2020-01-13 11:54 ` [Xen-devel] [RFC PATCH 0/3] Live update boot memory management David Woodhouse
2020-01-14 14:15   ` Julien Grall
2020-01-14 14:48     ` David Woodhouse
2020-01-14 15:00       ` Julien Grall
2020-01-14 15:20         ` David Woodhouse
2020-01-14 16:29           ` Julien Grall
2020-01-15  7:40             ` David Woodhouse
2020-01-15 10:26               ` Julien Grall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).