linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] xen: remove memory limits from pv-domains
@ 2014-09-04 12:38 Juergen Gross
  2014-09-04 12:38 ` [PATCH 1/3] xen: sync some headers with xen tree Juergen Gross
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Juergen Gross @ 2014-09-04 12:38 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich
  Cc: Juergen Gross

When a Xen pv-domain is booted, the initial memory map contains multiple
objects in the top 2 GB, including the initrd and the p2m list. This
limits the supported maximum size of the initrd, and restricts the
maximum initial memory size to about 500 GB.

Xen, however, supports loading the initrd without mapping it, and the
initial p2m list can be mapped by Xen to an arbitrarily selected virtual
address. The following patches activate those options and thus remove
the limitations.
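The "about 500 GB" figure can be sanity-checked with some back-of-the-envelope
arithmetic (the sizes below are the usual x86-64 values, not numbers taken from
the patches themselves): the p2m list needs one 8-byte machine-frame entry per
4 KiB page of guest memory, so at ~500 GB of RAM the list alone approaches
1 GB and no longer fits in the top 2 GB alongside the kernel, initrd and the
other initial-mapping objects.

```python
# Rough estimate of the initial p2m list size for a 64-bit pv-domain:
# one 8-byte machine frame number per 4 KiB page of guest memory.
PAGE_SIZE = 4096
P2M_ENTRY_SIZE = 8  # bytes per entry (64-bit mfn)

def p2m_list_bytes(mem_bytes):
    return (mem_bytes // PAGE_SIZE) * P2M_ENTRY_SIZE

GiB = 1 << 30
# At 500 GiB of guest memory the p2m list alone is just under 1 GiB.
print(p2m_list_bytes(500 * GiB) / GiB)  # ~0.98
```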

Juergen Gross (3):
  xen: sync some headers with xen tree
  xen: eliminate scalability issues from initrd handling
  xen: eliminate scalability issues from initial mapping setup

 arch/x86/xen/enlighten.c        |  15 ++-
 arch/x86/xen/mmu.c              | 116 +++++++++++++++--
 arch/x86/xen/setup.c            |  65 +++++-----
 arch/x86/xen/xen-head.S         |   5 +
 include/xen/interface/elfnote.h | 102 ++++++++++++++-
 include/xen/interface/xen.h     | 272 ++++++++++++++++++++++++++++++++++++----
 6 files changed, 512 insertions(+), 63 deletions(-)

-- 
1.8.4.5


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] xen: sync some headers with xen tree
  2014-09-04 12:38 [PATCH 0/3] xen: remove memory limits from pv-domains Juergen Gross
@ 2014-09-04 12:38 ` Juergen Gross
  2014-09-04 12:52   ` Jan Beulich
  2014-09-04 12:38 ` [PATCH 2/3] xen: eliminate scalability issues from initrd handling Juergen Gross
  2014-09-04 12:38 ` [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
  2 siblings, 1 reply; 19+ messages in thread
From: Juergen Gross @ 2014-09-04 12:38 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich
  Cc: Juergen Gross

To be able to use an initially unmapped initrd with Xen, the following
header files must be synced to a newer version from the xen tree:

include/xen/interface/elfnote.h
include/xen/interface/xen.h

As the KEXEC- and DUMPCORE-related ELF notes are not relevant for the
kernel, they are omitted from elfnote.h.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 include/xen/interface/elfnote.h | 102 ++++++++++++++-
 include/xen/interface/xen.h     | 272 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 348 insertions(+), 26 deletions(-)

diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h
index 6f4eae3..5501e7a 100644
--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -3,6 +3,24 @@
  *
  * Definitions used for the Xen ELF notes.
  *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
  * Copyright (c) 2006, Ian Campbell, XenSource Ltd.
  */
 
@@ -18,12 +36,13 @@
  *
  * LEGACY indicates the fields in the legacy __xen_guest string which
  * this note type replaces.
+ *
+ * String values (for non-legacy) are NULL terminated ASCII, also known
+ * as ASCIZ type.
  */
 
 /*
  * NAME=VALUE pair (string).
- *
- * LEGACY: FEATURES and PAE
  */
 #define XEN_ELFNOTE_INFO           0
 
@@ -137,10 +156,30 @@
 
 /*
  * Whether or not the guest supports cooperative suspend cancellation.
+ * This is a numeric value.
+ *
+ * Default is 0
  */
 #define XEN_ELFNOTE_SUSPEND_CANCEL 14
 
 /*
+ * The (non-default) location the initial phys-to-machine map should be
+ * placed at by the hypervisor (Dom0) or the tools (DomU).
+ * The kernel must be prepared for this mapping to be established using
+ * large pages, despite such otherwise not being available to guests.
+ * The kernel must also be able to handle the page table pages used for
+ * this mapping not being accessible through the initial mapping.
+ * (Only x86-64 supports this at present.)
+ */
+#define XEN_ELFNOTE_INIT_P2M      15
+
+/*
+ * Whether or not the guest can deal with being passed an initrd not
+ * mapped through its initial page tables.
+ */
+#define XEN_ELFNOTE_MOD_START_PFN 16
+
+/*
  * The features supported by this kernel (numeric).
  *
  * Other than XEN_ELFNOTE_FEATURES on pre-4.2 Xen, this note allows a
@@ -153,6 +192,65 @@
  */
 #define XEN_ELFNOTE_SUPPORTED_FEATURES 17
 
+/*
+ * The number of the highest elfnote defined.
+ */
+#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES
+
+/*
+ * System information exported through crash notes.
+ *
+ * The kexec / kdump code will create one XEN_ELFNOTE_CRASH_INFO
+ * note in case of a system crash. This note will contain various
+ * information about the system, see xen/include/xen/elfcore.h.
+ */
+#define XEN_ELFNOTE_CRASH_INFO 0x1000001
+
+/*
+ * System registers exported through crash notes.
+ *
+ * The kexec / kdump code will create one XEN_ELFNOTE_CRASH_REGS
+ * note per cpu in case of a system crash. This note is architecture
+ * specific and will contain registers not saved in the "CORE" note.
+ * See xen/include/xen/elfcore.h for more information.
+ */
+#define XEN_ELFNOTE_CRASH_REGS 0x1000002
+
+
+/*
+ * xen dump-core none note.
+ * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_NONE
+ * in its dump file to indicate that the file is xen dump-core
+ * file. This note doesn't have any other information.
+ * See tools/libxc/xc_core.h for more information.
+ */
+#define XEN_ELFNOTE_DUMPCORE_NONE               0x2000000
+
+/*
+ * xen dump-core header note.
+ * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_HEADER
+ * in its dump file.
+ * See tools/libxc/xc_core.h for more information.
+ */
+#define XEN_ELFNOTE_DUMPCORE_HEADER             0x2000001
+
+/*
+ * xen dump-core xen version note.
+ * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_XEN_VERSION
+ * in its dump file. It contains the xen version obtained via the
+ * XENVER hypercall.
+ * See tools/libxc/xc_core.h for more information.
+ */
+#define XEN_ELFNOTE_DUMPCORE_XEN_VERSION        0x2000002
+
+/*
+ * xen dump-core format version note.
+ * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_FORMAT_VERSION
+ * in its dump file. It contains a format version identifier.
+ * See tools/libxc/xc_core.h for more information.
+ */
+#define XEN_ELFNOTE_DUMPCORE_FORMAT_VERSION     0x2000003
+
 #endif /* __XEN_PUBLIC_ELFNOTE_H__ */
 
 /*
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index de08213..f68719f 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -3,6 +3,24 @@
  *
  * Guest OS interface to Xen.
  *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
  * Copyright (c) 2004, K A Fraser
  */
 
@@ -73,13 +91,23 @@
  * VIRTUAL INTERRUPTS
  *
  * Virtual interrupts that a guest OS may receive from Xen.
+ * In the side comments, 'V.' denotes a per-VCPU VIRQ while 'G.' denotes a
+ * global VIRQ. The former can be bound once per VCPU and cannot be re-bound.
+ * The latter can be allocated only once per guest: they must initially be
+ * allocated to VCPU0 but can subsequently be re-bound.
  */
-#define VIRQ_TIMER      0  /* Timebase update, and/or requested timeout.  */
-#define VIRQ_DEBUG      1  /* Request guest to dump debug info.           */
-#define VIRQ_CONSOLE    2  /* (DOM0) Bytes received on emergency console. */
-#define VIRQ_DOM_EXC    3  /* (DOM0) Exceptional event for some domain.   */
-#define VIRQ_DEBUGGER   6  /* (DOM0) A domain has paused for debugging.   */
-#define VIRQ_PCPU_STATE 9  /* (DOM0) PCPU state changed                   */
+#define VIRQ_TIMER      0  /* V. Timebase update, and/or requested timeout.  */
+#define VIRQ_DEBUG      1  /* V. Request guest to dump debug info.           */
+#define VIRQ_CONSOLE    2  /* G. (DOM0) Bytes received on emergency console. */
+#define VIRQ_DOM_EXC    3  /* G. (DOM0) Exceptional event for some domain.   */
+#define VIRQ_TBUF       4  /* G. (DOM0) Trace buffer has records available.  */
+#define VIRQ_DEBUGGER   6  /* G. (DOM0) A domain has paused for debugging.   */
+#define VIRQ_XENOPROF   7  /* V. XenOprofile interrupt: new sample available */
+#define VIRQ_CON_RING   8  /* G. (DOM0) Bytes received on console            */
+#define VIRQ_PCPU_STATE 9  /* G. (DOM0) PCPU state changed                   */
+#define VIRQ_MEM_EVENT  10 /* G. (DOM0) A memory event has occurred          */
+#define VIRQ_XC_RESERVED 11 /* G. Reserved for XenClient                     */
+#define VIRQ_ENOMEM     12 /* G. (DOM0) Low on heap memory       */
 
 /* Architecture-specific VIRQ definitions. */
 #define VIRQ_ARCH_0    16
@@ -92,24 +120,68 @@
 #define VIRQ_ARCH_7    23
 
 #define NR_VIRQS       24
+
 /*
- * MMU-UPDATE REQUESTS
- *
- * HYPERVISOR_mmu_update() accepts a list of (ptr, val) pairs.
- * A foreigndom (FD) can be specified (or DOMID_SELF for none).
- * Where the FD has some effect, it is described below.
- * ptr[1:0] specifies the appropriate MMU_* command.
+ * enum neg_errnoval HYPERVISOR_mmu_update(const struct mmu_update reqs[],
+ *                                         unsigned count, unsigned *pdone,
+ *                                         unsigned foreigndom)
+ * @reqs is an array of mmu_update_t structures ((ptr, val) pairs).
+ * @count is the length of the above array.
+ * @pdone is an output parameter indicating the number of completed operations.
+ * @foreigndom[15:0]: FD, the expected owner of data pages referenced in this
+ *                    hypercall invocation. Can be DOMID_SELF.
+ * @foreigndom[31:16]: PFD, the expected owner of pagetable pages referenced
+ *                     in this hypercall invocation. The value of this field
+ *                     (x) encodes the PFD as follows:
+ *                     x == 0 => PFD == DOMID_SELF
+ *                     x != 0 => PFD == x - 1
  *
+ * Sub-commands: ptr[1:0] specifies the appropriate MMU_* command.
+ * -------------
  * ptr[1:0] == MMU_NORMAL_PT_UPDATE:
- * Updates an entry in a page table. If updating an L1 table, and the new
- * table entry is valid/present, the mapped frame must belong to the FD, if
- * an FD has been specified. If attempting to map an I/O page then the
- * caller assumes the privilege of the FD.
+ * Updates an entry in a page table belonging to PFD. If updating an L1 table,
+ * and the new table entry is valid/present, the mapped frame must belong to
+ * FD. If attempting to map an I/O page then the caller assumes the privilege
+ * of the FD.
  * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
  * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
  * ptr[:2]  -- Machine address of the page-table entry to modify.
  * val      -- Value to write.
  *
+ * There are also certain implicit requirements when using this hypercall.
+ * The pages that make up a pagetable must be mapped read-only in the guest.
+ * This prevents uncontrolled guest updates to the pagetable. Xen strictly
+ * enforces this, and will disallow any pagetable update which would end up
+ * mapping a pagetable page RW, and will disallow using any writable page as
+ * a pagetable. In practice it means that when constructing a page table for
+ * a process, thread, etc., we MUST be very diligent in following these rules:
+ *  1). Start with top-level page (PGD or in Xen language: L4). Fill out
+ *      the entries.
+ *  2). Keep on going, filling out the upper (PUD or L3), and middle (PMD
+ *      or L2).
+ *  3). Start filling out the PTE table (L1) with the PTE entries. Once
+ *      done, make sure to set each of those entries to RO (so writeable bit
+ *      is unset). Once that has been completed, set the PMD (L2) for this
+ *      PTE table as RO.
+ *  4). When completed with all of the PMD (L2) entries, and all of them have
+ *      been set to RO, make sure to set RO the PUD (L3). Do the same
+ *      operation on PGD (L4) pagetable entries that have a PUD (L3) entry.
+ *  5). Now before you can use those pages (i.e. set the cr3), you MUST also
+ *      pin them so that the hypervisor can verify the entries. This is done
+ *      via the HYPERVISOR_mmuext_op(MMUEXT_PIN_L4_TABLE, guest physical frame
+ *      number of the PGD (L4)). At this point the HYPERVISOR_mmuext_op(
+ *      MMUEXT_NEW_BASEPTR, guest physical frame number of the PGD (L4)) can be
+ *      issued.
+ * For 32-bit guests, the L4 is not used (as there are fewer pagetable
+ * levels), so instead use L3.
+ * At this point the pagetables can be modified using the MMU_NORMAL_PT_UPDATE
+ * hypercall. Also if so desired the OS can also try to write to the PTE
+ * and be trapped by the hypervisor (as the PTE entry is RO).
+ *
+ * To deallocate the pages, the operations are the reverse of the steps
+ * mentioned above. The argument is MMUEXT_UNPIN_TABLE for all levels and the
+ * pagetable MUST not be in use (meaning that the cr3 is not set to it).
+ *
  * ptr[1:0] == MMU_MACHPHYS_UPDATE:
  * Updates an entry in the machine->pseudo-physical mapping table.
  * ptr[:2]  -- Machine address within the frame whose mapping to modify.
@@ -119,6 +191,72 @@
  * ptr[1:0] == MMU_PT_UPDATE_PRESERVE_AD:
  * As MMU_NORMAL_PT_UPDATE above, but A/D bits currently in the PTE are ORed
  * with those in @val.
+ *
+ * @val is usually the machine frame number along with some attributes.
+ * The attributes by default follow the architecture-defined bits, meaning
+ * that if this is an x86-64 machine and the four-level page table layout is
+ * used, the layout of val is:
+ *  - 63 if set means No execute (NX)
+ *  - 46-13 the machine frame number
+ *  - 12 available for guest
+ *  - 11 available for guest
+ *  - 10 available for guest
+ *  - 9 available for guest
+ *  - 8 global
+ *  - 7 PAT (PSE is disabled, must use hypercall to make 4MB or 2MB pages)
+ *  - 6 dirty
+ *  - 5 accessed
+ *  - 4 page cached disabled
+ *  - 3 page write through
+ *  - 2 userspace accessible
+ *  - 1 writeable
+ *  - 0 present
+ *
+ *  The one bit that does not fit with the default layout is the PAGE_PSE
+ *  (also called PAGE_PAT) bit. The MMUEXT_[UN]MARK_SUPER arguments to the
+ *  HYPERVISOR_mmuext_op serve as a mechanism to mark a page as a 4MB
+ *  (or 2MB) superpage instead of using the PAGE_PSE bit.
+ *
+ *  The reason that the PAGE_PSE (bit 7) is not being utilized is that Xen
+ *  uses it as the Page Attribute Table (PAT) bit - for details on it please
+ *  refer to Intel SDM 10.12. The PAT allows setting the caching attributes
+ *  of pages instead of using MTRRs.
+ *
+ *  The PAT MSR is as follows (it is a 64-bit value, each entry is 8 bits):
+ *                    PAT4                 PAT0
+ *  +-----+-----+----+----+----+-----+----+----+
+ *  | UC  | UC- | WC | WB | UC | UC- | WC | WB |  <= Linux
+ *  +-----+-----+----+----+----+-----+----+----+
+ *  | UC  | UC- | WT | WB | UC | UC- | WT | WB |  <= BIOS (default when machine boots)
+ *  +-----+-----+----+----+----+-----+----+----+
+ *  | rsv | rsv | WP | WC | UC | UC- | WT | WB |  <= Xen
+ *  +-----+-----+----+----+----+-----+----+----+
+ *
+ *  The lookup of this index table translates to looking up
+ *  Bit 7, Bit 4, and Bit 3 of val entry:
+ *
+ *  PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).
+ *
+ *  If all bits are off, then we are using PAT0. If bit 3 is turned on,
+ *  then we are using PAT1; if bit 3 and bit 4 are on, then PAT3, and so on.
+ *
+ *  As you can see, the Linux PAT1 translates to PAT4 under Xen. This means
+ *  that if a guest follows Linux's PAT setup and would like to set Write
+ *  Combined on pages, it MUST use the PAT4 entry, meaning that bit 7
+ *  (PAGE_PAT) is set. For example, Linux only uses PAT0, PAT1, and PAT3
+ *  for the caching as:
+ *
+ *   WB = none (so PAT0)
+ *   WC = PWT (bit 3 on)
+ *   UC = PWT | PCD (bit 3 and 4 are on).
+ *
+ * To make this work with Xen, the guest needs to translate the WC bit as
+ * follows:
+ *  PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3
+ *
+ * And to translate back:
+ *
+ * PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.
  */
 #define MMU_NORMAL_PT_UPDATE      0 /* checked '*ptr = val'. ptr is MA.       */
 #define MMU_MACHPHYS_UPDATE       1 /* ptr = MA of frame to modify entry for  */
@@ -127,7 +265,12 @@
 /*
  * MMU EXTENDED OPERATIONS
  *
- * HYPERVISOR_mmuext_op() accepts a list of mmuext_op structures.
+ * enum neg_errnoval HYPERVISOR_mmuext_op(mmuext_op_t uops[],
+ *                                        unsigned int count,
+ *                                        unsigned int *pdone,
+ *                                        unsigned int foreigndom)
+ */
+/* HYPERVISOR_mmuext_op() accepts a list of mmuext_op structures.
  * A foreigndom (FD) can be specified (or DOMID_SELF for none).
  * Where the FD has some effect, it is described below.
  *
@@ -164,9 +307,23 @@
  * cmd: MMUEXT_FLUSH_CACHE
  * No additional arguments. Writes back and flushes cache contents.
  *
+ * cmd: MMUEXT_FLUSH_CACHE_GLOBAL
+ * No additional arguments. Writes back and flushes cache contents
+ * on all CPUs in the system.
+ *
  * cmd: MMUEXT_SET_LDT
  * linear_addr: Linear address of LDT base (NB. must be page-aligned).
  * nr_ents: Number of entries in LDT.
+ *
+ * cmd: MMUEXT_CLEAR_PAGE
+ * mfn: Machine frame number to be cleared.
+ *
+ * cmd: MMUEXT_COPY_PAGE
+ * mfn: Machine frame number of the destination page.
+ * src_mfn: Machine frame number of the source page.
+ *
+ * cmd: MMUEXT_[UN]MARK_SUPER
+ * mfn: Machine frame number of head of superpage to be [un]marked.
  */
 #define MMUEXT_PIN_L1_TABLE      0
 #define MMUEXT_PIN_L2_TABLE      1
@@ -183,12 +340,18 @@
 #define MMUEXT_FLUSH_CACHE      12
 #define MMUEXT_SET_LDT          13
 #define MMUEXT_NEW_USER_BASEPTR 15
+#define MMUEXT_CLEAR_PAGE       16
+#define MMUEXT_COPY_PAGE        17
+#define MMUEXT_FLUSH_CACHE_GLOBAL 18
+#define MMUEXT_MARK_SUPER       19
+#define MMUEXT_UNMARK_SUPER     20
 
 #ifndef __ASSEMBLY__
 struct mmuext_op {
 	unsigned int cmd;
 	union {
-		/* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR */
+		/* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
+		 * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
 		xen_pfn_t mfn;
 		/* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
 		unsigned long linear_addr;
@@ -198,6 +361,8 @@ struct mmuext_op {
 		unsigned int nr_ents;
 		/* TLB_FLUSH_MULTI, INVLPG_MULTI */
 		void *vcpumask;
+		/* COPY_PAGE */
+		xen_pfn_t src_mfn;
 	} arg2;
 };
 DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
@@ -225,10 +390,23 @@ DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
  */
 #define VMASST_CMD_enable                0
 #define VMASST_CMD_disable               1
+
+/* x86/32 guests: simulate full 4GB segment limits. */
 #define VMASST_TYPE_4gb_segments         0
+
+/* x86/32 guests: trap (vector 15) whenever above vmassist is used. */
 #define VMASST_TYPE_4gb_segments_notify  1
+
+/*
+ * x86 guests: support writes to bottom-level PTEs.
+ * NB1. Page-directory entries cannot be written.
+ * NB2. Guest must continue to remove all writable mappings of PTEs.
+ */
 #define VMASST_TYPE_writable_pagetables  2
+
+/* x86/PAE guests: support PDPTs above 4GB. */
 #define VMASST_TYPE_pae_extended_cr3     3
+
 #define MAX_VMASST_TYPE 3
 
 #ifndef __ASSEMBLY__
@@ -260,6 +438,15 @@ typedef uint16_t domid_t;
  */
 #define DOMID_XEN  (0x7FF2U)
 
+/* DOMID_COW is used as the owner of sharable pages */
+#define DOMID_COW  (0x7FF3U)
+
+/* DOMID_INVALID is used to identify pages with unknown owner. */
+#define DOMID_INVALID (0x7FF4U)
+
+/* Idle domain. */
+#define DOMID_IDLE (0x7FFFU)
+
 /*
  * Send an array of these to HYPERVISOR_mmu_update().
  * NB. The fields are natural pointer/address size for this architecture.
@@ -272,7 +459,9 @@ DEFINE_GUEST_HANDLE_STRUCT(mmu_update);
 
 /*
  * Send an array of these to HYPERVISOR_multicall().
- * NB. The fields are natural register size for this architecture.
+ * NB. The fields are logically the natural register size for this
+ * architecture. In cases where xen_ulong_t is larger than this then
+ * any unused bits in the upper portion must be zero.
  */
 struct multicall_entry {
     xen_ulong_t op;
@@ -442,8 +631,48 @@ struct start_info {
 	unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
 	unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
 	int8_t cmd_line[MAX_GUEST_CMDLINE];
+	/* The pfn range here covers both page table and p->m table frames.   */
+	unsigned long first_p2m_pfn;/* 1st pfn forming initial P->M table.    */
+	unsigned long nr_p2m_frames;/* # of pfns forming initial P->M table.  */
 };
 
+/* These flags are passed in the 'flags' field of start_info_t. */
+#define SIF_PRIVILEGED    (1<<0)  /* Is the domain privileged? */
+#define SIF_INITDOMAIN    (1<<1)  /* Is this the initial control domain? */
+#define SIF_MULTIBOOT_MOD (1<<2)  /* Is mod_start a multiboot module? */
+#define SIF_MOD_START_PFN (1<<3)  /* Is mod_start a PFN? */
+#define SIF_PM_MASK       (0xFF<<8) /* reserve 1 byte for xen-pm options */
+
+/*
+ * A multiboot module is a package containing modules very similar to a
+ * multiboot module array. The only differences are:
+ * - the array of module descriptors is by convention simply at the beginning
+ *   of the multiboot module,
+ * - addresses in the module descriptors are based on the beginning of the
+ *   multiboot module,
+ * - the number of modules is determined by a termination descriptor that has
+ *   mod_start == 0.
+ *
+ * This permits both building it statically and referencing it in a
+ * configuration file, and lets the PV guest easily rebase the addresses to
+ * virtual addresses and at the same time count the number of modules.
+ */
+struct xen_multiboot_mod_list {
+	/* Address of first byte of the module */
+	uint32_t mod_start;
+	/* Address of last byte of the module (inclusive) */
+	uint32_t mod_end;
+	/* Address of zero-terminated command line */
+	uint32_t cmdline;
+	/* Unused, must be zero */
+	uint32_t pad;
+};
+/*
+ * The console structure in start_info.console.dom0
+ *
+ * This structure includes a variety of information required to
+ * have a working VGA/VESA console.
+ */
 struct dom0_vga_console_info {
 	uint8_t video_type;
 #define XEN_VGATYPE_TEXT_MODE_3 0x03
@@ -484,11 +713,6 @@ struct dom0_vga_console_info {
 	} u;
 };
 
-/* These flags are passed in the 'flags' field of start_info_t. */
-#define SIF_PRIVILEGED    (1<<0)  /* Is the domain privileged? */
-#define SIF_INITDOMAIN    (1<<1)  /* Is this the initial control domain? */
-#define SIF_PM_MASK       (0xFF<<8) /* reserve 1 byte for xen-pm options */
-
 typedef uint64_t cpumap_t;
 
 typedef uint8_t xen_domain_handle_t[16];
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] xen: eliminate scalability issues from initrd handling
  2014-09-04 12:38 [PATCH 0/3] xen: remove memory limits from pv-domains Juergen Gross
  2014-09-04 12:38 ` [PATCH 1/3] xen: sync some headers with xen tree Juergen Gross
@ 2014-09-04 12:38 ` Juergen Gross
  2014-09-04 12:52   ` David Vrabel
  2014-09-04 12:38 ` [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
  2 siblings, 1 reply; 19+ messages in thread
From: Juergen Gross @ 2014-09-04 12:38 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich
  Cc: Juergen Gross

Size restrictions that native kernels wouldn't have resulted from the
initrd getting mapped into the initial mapping. The kernel doesn't really
need the initrd to be mapped, so use the infrastructure available in Xen
to avoid the mapping and hence the restriction.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/enlighten.c | 15 +++++++++++++--
 arch/x86/xen/xen-head.S  |  3 +++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index c0cb11f..c8e4e6a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1519,6 +1519,7 @@ static void __init xen_pvh_early_guest_init(void)
 asmlinkage __visible void __init xen_start_kernel(void)
 {
 	struct physdev_set_iopl set_iopl;
+	unsigned long initrd_start = 0;
 	int rc;
 
 	if (!xen_start_info)
@@ -1667,10 +1668,20 @@ asmlinkage __visible void __init xen_start_kernel(void)
 	new_cpu_data.x86_capability[0] = cpuid_edx(1);
 #endif
 
+	if (xen_start_info->mod_start)
+		initrd_start = __pa(xen_start_info->mod_start);
+#ifdef CONFIG_BLK_DEV_INITRD
+#ifdef CONFIG_X86_32
+	BUG_ON(xen_start_info->flags & SIF_MOD_START_PFN);
+#else
+	if (xen_start_info->flags & SIF_MOD_START_PFN)
+		initrd_start = PFN_PHYS(xen_start_info->mod_start);
+#endif
+#endif
+
 	/* Poke various useful things into boot_params */
 	boot_params.hdr.type_of_loader = (9 << 4) | 0;
-	boot_params.hdr.ramdisk_image = xen_start_info->mod_start
-		? __pa(xen_start_info->mod_start) : 0;
+	boot_params.hdr.ramdisk_image = initrd_start;
 	boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
 	boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
 
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 485b695..a458fd7 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -124,6 +124,9 @@ NEXT_HYPERCALL(arch_6)
 	ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
 		.quad _PAGE_PRESENT; .quad _PAGE_PRESENT)
 	ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long 1)
+#ifdef CONFIG_X86_64
+	ELFNOTE(Xen, XEN_ELFNOTE_MOD_START_PFN,  .long 1)
+#endif
 	ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW,   _ASM_PTR __HYPERVISOR_VIRT_START)
 	ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET,   _ASM_PTR 0)
 
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread
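The decision this patch adds to xen_start_kernel() can be modeled in a few
lines: with SIF_MOD_START_PFN set, mod_start carries a page frame number
rather than a mapped virtual address, so the initrd needs no slot in the
initial mapping. The sketch below is a simplified stand-in (names mirror the
patch; the va_to_pa default is an illustrative substitute for the kernel's
__pa(), not the real mapping):

```python
# Simplified model of the initrd-location logic in xen_start_kernel():
SIF_MOD_START_PFN = 1 << 3   # "mod_start is a PFN" flag from start_info
PAGE_SHIFT = 12

def initrd_phys(mod_start, flags, va_to_pa=lambda va: va - 0xffffffff80000000):
    # va_to_pa stands in for __pa(); the offset used here is only an
    # illustrative placeholder for the kernel's direct-mapping offset.
    if mod_start == 0:
        return 0                         # no initrd supplied
    if flags & SIF_MOD_START_PFN:
        return mod_start << PAGE_SHIFT   # mod_start is a PFN, unmapped initrd
    return va_to_pa(mod_start)           # legacy: mod_start is a mapped VA

# A PFN of 0x1000 corresponds to physical address 0x1000000 (16 MiB):
assert initrd_phys(0x1000, SIF_MOD_START_PFN) == 0x1000000
```

On 32-bit, the patch instead BUG()s on SIF_MOD_START_PFN, since the PFN-based
scheme is only advertised (via the new ELF note) for 64-bit kernels.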

* [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 12:38 [PATCH 0/3] xen: remove memory limits from pv-domains Juergen Gross
  2014-09-04 12:38 ` [PATCH 1/3] xen: sync some headers with xen tree Juergen Gross
  2014-09-04 12:38 ` [PATCH 2/3] xen: eliminate scalability issues from initrd handling Juergen Gross
@ 2014-09-04 12:38 ` Juergen Gross
  2014-09-04 12:59   ` David Vrabel
  2 siblings, 1 reply; 19+ messages in thread
From: Juergen Gross @ 2014-09-04 12:38 UTC (permalink / raw)
  To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
	david.vrabel, jbeulich
  Cc: Juergen Gross

Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and its initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
 arch/x86/xen/mmu.c      | 116 +++++++++++++++++++++++++++++++++++++++++++++---
 arch/x86/xen/setup.c    |  65 +++++++++++++++------------
 arch/x86/xen/xen-head.S |   2 +
 3 files changed, 148 insertions(+), 35 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e8a1201..555d01f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1198,6 +1198,76 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
 	 * instead of somewhere later and be confusing. */
 	xen_mc_flush();
 }
+
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+	void *vaddr = __va(paddr);
+	void *vaddr_end = vaddr + size;
+
+	for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+		make_lowmem_page_readwrite(vaddr);
+
+	memblock_free(paddr, size);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+	unsigned long va = vaddr & PMD_MASK;
+	unsigned long pa;
+	pgd_t *pgd = pgd_offset_k(va);
+	pud_t *pud_page = pud_offset(pgd, 0);
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned int i;
+
+	set_pgd(pgd, __pgd(0));
+	do {
+		pud = pud_page + pud_index(va);
+		if (pud_none(*pud)) {
+			va += PUD_SIZE;
+		} else if (pud_large(*pud)) {
+			pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+			xen_free_ro_pages(pa, PUD_SIZE);
+			va += PUD_SIZE;
+		} else {
+			pmd = pmd_offset(pud, va);
+			if (pmd_large(*pmd)) {
+				pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+				xen_free_ro_pages(pa, PMD_SIZE);
+			} else if (!pmd_none(*pmd)) {
+				pte = pte_offset_kernel(pmd, va);
+				for (i = 0; i < PTRS_PER_PTE; ++i) {
+					if (pte_none(pte[i]))
+						break;
+					pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+					xen_free_ro_pages(pa, PAGE_SIZE);
+				}
+				pa = __pa(pte) & PHYSICAL_PAGE_MASK;
+				ClearPagePinned(virt_to_page(__va(pa)));
+				xen_free_ro_pages(pa, PAGE_SIZE);
+			}
+			va += PMD_SIZE;
+			if (pmd_index(va))
+				continue;
+			pa = __pa(pmd) & PHYSICAL_PAGE_MASK;
+			ClearPagePinned(virt_to_page(__va(pa)));
+			xen_free_ro_pages(pa, PAGE_SIZE);
+		}
+
+	} while (pud_index(va) || pmd_index(va));
+	pa = __pa(pud_page) & PHYSICAL_PAGE_MASK;
+	ClearPagePinned(virt_to_page(__va(pa)));
+	xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
 static void __init xen_pagetable_p2m_copy(void)
 {
 	unsigned long size;
@@ -1217,18 +1287,23 @@ static void __init xen_pagetable_p2m_copy(void)
 	/* using __ka address and sticking INVALID_P2M_ENTRY! */
 	memset((void *)xen_start_info->mfn_list, 0xff, size);
 
-	/* We should be in __ka space. */
-	BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
 	addr = xen_start_info->mfn_list;
-	/* We roundup to the PMD, which means that if anybody at this stage is
+	/* We could be in __ka space.
+	 * We roundup to the PMD, which means that if anybody at this stage is
 	 * using the __ka address of xen_start_info or xen_start_info->shared_info
 	 * they are in going to crash. Fortunatly we have already revectored
 	 * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
 	size = roundup(size, PMD_SIZE);
-	xen_cleanhighmap(addr, addr + size);
 
-	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
-	memblock_free(__pa(xen_start_info->mfn_list), size);
+	if (addr >= __START_KERNEL_map) {
+		xen_cleanhighmap(addr, addr + size);
+		size = PAGE_ALIGN(xen_start_info->nr_pages *
+				  sizeof(unsigned long));
+		memblock_free(__pa(addr), size);
+	} else {
+		xen_cleanmfnmap(addr);
+	}
+
 	/* And revector! Bye bye old array */
 	xen_start_info->mfn_list = new_mfn_list;
 
@@ -1529,6 +1604,22 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
 #else /* CONFIG_X86_64 */
 static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
 {
+	unsigned long pfn;
+
+	if (xen_feature(XENFEAT_writable_page_tables) ||
+	    xen_feature(XENFEAT_auto_translated_physmap) ||
+	    xen_start_info->mfn_list >= __START_KERNEL_map)
+		return pte;
+
+	/*
+	 * Pages belonging to the initial p2m list mapped outside the default
+	 * address range must be mapped read-only.
+	 */
+	pfn = pte_pfn(pte);
+	if (pfn >= xen_start_info->first_p2m_pfn &&
+	    pfn < xen_start_info->first_p2m_pfn + xen_start_info->nr_p2m_frames)
+		pte = __pte_ma(pte_val_ma(pte) & ~_PAGE_RW);
+
 	return pte;
 }
 #endif /* CONFIG_X86_64 */
@@ -1885,7 +1976,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * mappings. Considering that on Xen after the kernel mappings we
 	 * have the mappings of some pages that don't exist in pfn space, we
 	 * set max_pfn_mapped to the last real pfn mapped. */
-	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+	if (xen_start_info->mfn_list < __START_KERNEL_map)
+		max_pfn_mapped = xen_start_info->first_p2m_pfn;
+	else
+		max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
 
 	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
@@ -1928,6 +2022,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	copy_page(level2_fixmap_pgt, l2);
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
+
+	/* Copy the initial P->M table mappings if necessary. */
+	i = pgd_index(xen_start_info->mfn_list);
+	if (i && i < pgd_index(__START_KERNEL_map))
+		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
 		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1967,6 +2067,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 	/* Our (by three pages) smaller Xen pagetable that we are using */
 	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	/* protect xen_start_info */
+	memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
 	/* Revector the xen_start_info */
 	xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
 }
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 2e555163..6412367 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -333,6 +333,41 @@ void xen_ignore_unusable(struct e820entry *list, size_t map_size)
 	}
 }
 
+/*
+ * Reserve Xen mfn_list.
+ * See comment above "struct start_info" in <xen/interface/xen.h>
+ * We tried to make the memblock_reserve more selective so
+ * that it would be clear what region is reserved. Sadly we ran
+ * into the problem wherein on a 64-bit hypervisor with a 32-bit
+ * initial domain, the pt_base has the cr3 value which is not
+ * necessarily where the pagetable starts! As Jan put it: "
+ * Actually, the adjustment turns out to be correct: The page
+ * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+ * "first L2", "first L3", so the offset to the page table base is
+ * indeed 2. When reading xen/include/public/xen.h's comment
+ * very strictly, this is not a violation (since there nothing is said
+ * that the first thing in the page table space is pointed to by
+ * pt_base; I admit that this seems to be implied though, namely
+ * do I think that it is implied that the page table space is the
+ * range [pt_base, pt_base + nt_pt_frames), whereas that
+ * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
+ * which - without a priori knowledge - the kernel would have
+ * difficulty to figure out)." - so let's just fall back to the
+ * easy way and reserve the whole region.
+ */
+static void __init xen_reserve_xen_mfnlist(void)
+{
+	if (xen_start_info->mfn_list >= __START_KERNEL_map) {
+		memblock_reserve(__pa(xen_start_info->mfn_list),
+				 xen_start_info->pt_base -
+				 xen_start_info->mfn_list);
+		return;
+	}
+
+	memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
+			 PFN_PHYS(xen_start_info->nr_p2m_frames));
+}
+
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
@@ -467,32 +502,7 @@ char * __init xen_memory_setup(void)
 	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
 			E820_RESERVED);
 
-	/*
-	 * Reserve Xen bits:
-	 *  - mfn_list
-	 *  - xen_start_info
-	 * See comment above "struct start_info" in <xen/interface/xen.h>
-	 * We tried to make the the memblock_reserve more selective so
-	 * that it would be clear what region is reserved. Sadly we ran
-	 * in the problem wherein on a 64-bit hypervisor with a 32-bit
-	 * initial domain, the pt_base has the cr3 value which is not
-	 * neccessarily where the pagetable starts! As Jan put it: "
-	 * Actually, the adjustment turns out to be correct: The page
-	 * tables for a 32-on-64 dom0 get allocated in the order "first L1",
-	 * "first L2", "first L3", so the offset to the page table base is
-	 * indeed 2. When reading xen/include/public/xen.h's comment
-	 * very strictly, this is not a violation (since there nothing is said
-	 * that the first thing in the page table space is pointed to by
-	 * pt_base; I admit that this seems to be implied though, namely
-	 * do I think that it is implied that the page table space is the
-	 * range [pt_base, pt_base + nt_pt_frames), whereas that
-	 * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
-	 * which - without a priori knowledge - the kernel would have
-	 * difficulty to figure out)." - so lets just fall back to the
-	 * easy way and reserve the whole region.
-	 */
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
+	xen_reserve_xen_mfnlist();
 
 	sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
 
@@ -522,8 +532,7 @@ char * __init xen_auto_xlated_memory_setup(void)
 	for (i = 0; i < memmap.nr_entries; i++)
 		e820_add_region(map[i].addr, map[i].size, map[i].type);
 
-	memblock_reserve(__pa(xen_start_info->mfn_list),
-			 xen_start_info->pt_base - xen_start_info->mfn_list);
+	xen_reserve_xen_mfnlist();
 
 	return "Xen";
 }
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index a458fd7..2998033 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -112,6 +112,8 @@ NEXT_HYPERCALL(arch_6)
 	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      _ASM_PTR __PAGE_OFFSET)
 #else
 	ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      _ASM_PTR __START_KERNEL_map)
+	/* Map the p2m table to a 512GB-aligned user address. */
+	ELFNOTE(Xen, XEN_ELFNOTE_INIT_P2M,       .quad PGDIR_SIZE)
 #endif
 	ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          _ASM_PTR startup_xen)
 	ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] xen: eliminate scalability issues from initrd handling
  2014-09-04 12:38 ` [PATCH 2/3] xen: eliminate scalability issues from initrd handling Juergen Gross
@ 2014-09-04 12:52   ` David Vrabel
  2014-09-04 14:29     ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: David Vrabel @ 2014-09-04 12:52 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 04/09/14 13:38, Juergen Gross wrote:
> Size restrictions that native kernels wouldn't have resulted from the
> initrd getting mapped into the initial mapping. The kernel doesn't really need
> the initrd to be mapped, so use infrastructure available in Xen to avoid
> the mapping and hence the restriction.
[...]
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
[...]
> @@ -1667,10 +1668,20 @@ asmlinkage __visible void __init xen_start_kernel(void)
>  	new_cpu_data.x86_capability[0] = cpuid_edx(1);
>  #endif
>  
> +	if (xen_start_info->mod_start)
> +		initrd_start = __pa(xen_start_info->mod_start);
> +#ifdef CONFIG_BLK_DEV_INITRD
> +#ifdef CONFIG_X86_32
> +	BUG_ON(xen_start_info->flags & SIF_MOD_START_PFN);
> +#else
> +	if (xen_start_info->flags & SIF_MOD_START_PFN)
> +		initrd_start = PFN_PHYS(xen_start_info->mod_start);
> +#endif
> +#endif

Remove these unnecessary #ifdefs and the BUG_ON().  We can trust Xen to
not set SIF_MOD_START_PFN if we haven't asked for it.

> --- a/arch/x86/xen/xen-head.S
> +++ b/arch/x86/xen/xen-head.S
> @@ -124,6 +124,9 @@ NEXT_HYPERCALL(arch_6)
>  	ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
>  		.quad _PAGE_PRESENT; .quad _PAGE_PRESENT)
>  	ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long 1)
> +#ifdef CONFIG_X86_64
> +	ELFNOTE(Xen, XEN_ELFNOTE_MOD_START_PFN,  .long 1)
> +#endif

Why X86_64 only?  If there's a good reason the commit message needs to
explain why.

David


* Re: [PATCH 1/3] xen: sync some headers with xen tree
  2014-09-04 12:38 ` [PATCH 1/3] xen: sync some headers with xen tree Juergen Gross
@ 2014-09-04 12:52   ` Jan Beulich
  2014-09-05  8:06     ` Juergen Gross
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2014-09-04 12:52 UTC (permalink / raw)
  To: Juergen Gross
  Cc: david.vrabel, xen-devel, boris.ostrovsky, konrad.wilk, linux-kernel

>>> On 04.09.14 at 14:38, <"jgross@suse.com".non-mime.internet> wrote:
> As the KEXEC and DUMPCORE related ELFNOTES are not relevant for the
> kernel they are omitted from elfnote.h.

But the defines are still in the patch:

> @@ -153,6 +192,65 @@
>   */
>  #define XEN_ELFNOTE_SUPPORTED_FEATURES 17
>  
> +/*
> + * The number of the highest elfnote defined.
> + */
> +#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES
> +
> +/*
> + * System information exported through crash notes.
> + *
> + * The kexec / kdump code will create one XEN_ELFNOTE_CRASH_INFO
> + * note in case of a system crash. This note will contain various
> + * information about the system, see xen/include/xen/elfcore.h.
> + */
> +#define XEN_ELFNOTE_CRASH_INFO 0x1000001
> +
> +/*
> + * System registers exported through crash notes.
> + *
> + * The kexec / kdump code will create one XEN_ELFNOTE_CRASH_REGS
> + * note per cpu in case of a system crash. This note is architecture
> + * specific and will contain registers not saved in the "CORE" note.
> + * See xen/include/xen/elfcore.h for more information.
> + */
> +#define XEN_ELFNOTE_CRASH_REGS 0x1000002
> +
> +
> +/*
> + * xen dump-core none note.
> + * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_NONE
> + * in its dump file to indicate that the file is xen dump-core
> + * file. This note doesn't have any other information.
> + * See tools/libxc/xc_core.h for more information.
> + */
> +#define XEN_ELFNOTE_DUMPCORE_NONE               0x2000000
> +
> +/*
> + * xen dump-core header note.
> + * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_HEADER
> + * in its dump file.
> + * See tools/libxc/xc_core.h for more information.
> + */
> +#define XEN_ELFNOTE_DUMPCORE_HEADER             0x2000001
> +
> +/*
> + * xen dump-core xen version note.
> + * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_XEN_VERSION
> + * in its dump file. It contains the xen version obtained via the
> + * XENVER hypercall.
> + * See tools/libxc/xc_core.h for more information.
> + */
> +#define XEN_ELFNOTE_DUMPCORE_XEN_VERSION        0x2000002
> +
> +/*
> + * xen dump-core format version note.
> + * xm dump-core code will create one XEN_ELFNOTE_DUMPCORE_FORMAT_VERSION
> + * in its dump file. It contains a format version identifier.
> + * See tools/libxc/xc_core.h for more information.
> + */
> +#define XEN_ELFNOTE_DUMPCORE_FORMAT_VERSION     0x2000003
> +
>  #endif /* __XEN_PUBLIC_ELFNOTE_H__ */
>  
>  /*

Jan



* Re: [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 12:38 ` [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
@ 2014-09-04 12:59   ` David Vrabel
  2014-09-04 13:02     ` [Xen-devel] " Andrew Cooper
  2014-09-05  8:03     ` Juergen Gross
  0 siblings, 2 replies; 19+ messages in thread
From: David Vrabel @ 2014-09-04 12:59 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 04/09/14 13:38, Juergen Gross wrote:
> Direct Xen to place the initial P->M table outside of the initial
> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
> restriction on the size of the initial mapping limits the amount
> of memory a domain can be handed initially.

The three level p2m limits memory to 512 GiB on x86-64 but this patch
doesn't seem to address this limit and thus seems a bit useless to me.

David


* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 12:59   ` David Vrabel
@ 2014-09-04 13:02     ` Andrew Cooper
  2014-09-04 14:31       ` Jan Beulich
  2014-09-05  8:03     ` Juergen Gross
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2014-09-04 13:02 UTC (permalink / raw)
  To: David Vrabel, Juergen Gross, linux-kernel, xen-devel,
	konrad.wilk, boris.ostrovsky, jbeulich

On 04/09/14 13:59, David Vrabel wrote:
> On 04/09/14 13:38, Juergen Gross wrote:
>> Direct Xen to place the initial P->M table outside of the initial
>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>> restriction on the size of the initial mapping limits the amount
>> of memory a domain can be handed initially.
> The three level p2m limits memory to 512 GiB on x86-64 but this patch
> doesn't seem to address this limit and thus seems a bit useless to me.
>
> David

Any increase of the p2m beyond 3 levels will need to come with
substantial libxc changes first.  3 level p2ms are hard coded throughout
all the PV build and migrate code.

~Andrew


* Re: [PATCH 2/3] xen: eliminate scalability issues from initrd handling
  2014-09-04 12:52   ` David Vrabel
@ 2014-09-04 14:29     ` Jan Beulich
  2014-09-04 14:53       ` [Xen-devel] " David Vrabel
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2014-09-04 14:29 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, boris.ostrovsky, konrad.wilk, Juergen Gross, linux-kernel

>>> On 04.09.14 at 14:52, <david.vrabel@citrix.com> wrote:
> On 04/09/14 13:38, Juergen Gross wrote:
>> --- a/arch/x86/xen/xen-head.S
>> +++ b/arch/x86/xen/xen-head.S
>> @@ -124,6 +124,9 @@ NEXT_HYPERCALL(arch_6)
>>  	ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
>>  		.quad _PAGE_PRESENT; .quad _PAGE_PRESENT)
>>  	ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long 1)
>> +#ifdef CONFIG_X86_64
>> +	ELFNOTE(Xen, XEN_ELFNOTE_MOD_START_PFN,  .long 1)
>> +#endif
> 
> Why X86_64 only?  If there's a good reason the commit message needs to
> explain why.

Does native 32-bit support huge initrd?

Jan



* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 13:02     ` [Xen-devel] " Andrew Cooper
@ 2014-09-04 14:31       ` Jan Beulich
  2014-09-04 14:43         ` Andrew Cooper
  2014-09-04 15:13         ` David Vrabel
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Beulich @ 2014-09-04 14:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: David Vrabel, xen-devel, boris.ostrovsky, konrad.wilk,
	Juergen Gross, linux-kernel

>>> On 04.09.14 at 15:02, <andrew.cooper3@citrix.com> wrote:
> On 04/09/14 13:59, David Vrabel wrote:
>> On 04/09/14 13:38, Juergen Gross wrote:
>>> Direct Xen to place the initial P->M table outside of the initial
>>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>>> restriction on the size of the initial mapping limits the amount
>>> of memory a domain can be handed initially.
>> The three level p2m limits memory to 512 GiB on x86-64 but this patch
>> doesn't seem to address this limit and thus seems a bit useless to me.
> 
> Any increase of the p2m beyond 3 levels will need to come with
> substantial libxc changes first.  3 level p2ms are hard coded throughout
> all the PV build and migrate code.

> No, there is no such dependency - the kernel could use 4 levels at
any time (sacrificing being able to get migrated), making sure it
only exposes the 3 levels hanging off the fourth level (or not
exposing this information at all) to external entities making this
wrong assumption.

Jan



* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 14:31       ` Jan Beulich
@ 2014-09-04 14:43         ` Andrew Cooper
  2014-09-05  7:55           ` Juergen Gross
  2014-09-04 15:13         ` David Vrabel
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2014-09-04 14:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: David Vrabel, xen-devel, boris.ostrovsky, konrad.wilk,
	Juergen Gross, linux-kernel

On 04/09/14 15:31, Jan Beulich wrote:
>>>> On 04.09.14 at 15:02, <andrew.cooper3@citrix.com> wrote:
>> On 04/09/14 13:59, David Vrabel wrote:
>>> On 04/09/14 13:38, Juergen Gross wrote:
>>>> Direct Xen to place the initial P->M table outside of the initial
>>>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>>>> restriction on the size of the initial mapping limits the amount
>>>> of memory a domain can be handed initially.
>>> The three level p2m limits memory to 512 GiB on x86-64 but this patch
>>> doesn't seem to address this limit and thus seems a bit useless to me.
>> Any increase of the p2m beyond 3 levels will need to come with
>> substantial libxc changes first.  3 level p2ms are hard coded throughout
>> all the PV build and migrate code.
> No, there is no such dependency - the kernel could use 4 levels at
> any time (sacrificing being able to get migrated), making sure it
> only exposes the 3 levels hanging off the fourth level (or not
> exposing this information at all) to external entities making this
> wrong assumption.
>
> Jan
>

That would require that the PV kernel must start with a 3 level p2m and
fudge things afterwards.

At a minimum, I would expect a patch to libxc to detect a 4 level PV
guest and fail with a meaningful error, rather than an obscure "m2p
doesn't match p2m for mfn/pfn X".

~Andrew


* Re: [Xen-devel] [PATCH 2/3] xen: eliminate scalability issues from initrd handling
  2014-09-04 14:29     ` Jan Beulich
@ 2014-09-04 14:53       ` David Vrabel
  2014-09-05  8:04         ` Juergen Gross
  0 siblings, 1 reply; 19+ messages in thread
From: David Vrabel @ 2014-09-04 14:53 UTC (permalink / raw)
  To: Jan Beulich, David Vrabel
  Cc: Juergen Gross, boris.ostrovsky, xen-devel, linux-kernel

On 04/09/14 15:29, Jan Beulich wrote:
>>>> On 04.09.14 at 14:52, <david.vrabel@citrix.com> wrote:
>> On 04/09/14 13:38, Juergen Gross wrote:
>>> --- a/arch/x86/xen/xen-head.S
>>> +++ b/arch/x86/xen/xen-head.S
>>> @@ -124,6 +124,9 @@ NEXT_HYPERCALL(arch_6)
>>>  	ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
>>>  		.quad _PAGE_PRESENT; .quad _PAGE_PRESENT)
>>>  	ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long 1)
>>> +#ifdef CONFIG_X86_64
>>> +	ELFNOTE(Xen, XEN_ELFNOTE_MOD_START_PFN,  .long 1)
>>> +#endif
>>
>> Why X86_64 only?  If there's a good reason the commit message needs to
>> explain why.
> 
> Does native 32-bit support huge initrd?

Does that matter? If the MOD_START_PFN option works with a 32-bit guest
then it should use it, regardless of whether it is essential or not,
because this reduces the #ifdef'ery.

David


* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 14:31       ` Jan Beulich
  2014-09-04 14:43         ` Andrew Cooper
@ 2014-09-04 15:13         ` David Vrabel
  1 sibling, 0 replies; 19+ messages in thread
From: David Vrabel @ 2014-09-04 15:13 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: xen-devel, boris.ostrovsky, konrad.wilk, Juergen Gross, linux-kernel

On 04/09/14 15:31, Jan Beulich wrote:
>>>> On 04.09.14 at 15:02, <andrew.cooper3@citrix.com> wrote:
>> On 04/09/14 13:59, David Vrabel wrote:
>>> On 04/09/14 13:38, Juergen Gross wrote:
>>>> Direct Xen to place the initial P->M table outside of the initial
>>>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>>>> restriction on the size of the initial mapping limits the amount
>>>> of memory a domain can be handed initially.
>>> The three level p2m limits memory to 512 GiB on x86-64 but this patch
>>> doesn't seem to address this limit and thus seems a bit useless to me.
>>
>> Any increase of the p2m beyond 3 levels will need to come with
>> substantial libxc changes first.  3 level p2ms are hard coded throughout
>> all the PV build and migrate code.
> 
> No, there is no such dependency - the kernel could use 4 levels at
> any time (sacrificing being able to get migrated), making sure it
> only exposes the 3 levels hanging off the fourth level (or not
> exposing this information at all) to external entities making this
> wrong assumption.

I don't think we want a kernel that may or may not be saved or migrated
based on how much memory it has.

Nor do we want a kernel that has even more differences between dom0 and
domU.

David


* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 14:43         ` Andrew Cooper
@ 2014-09-05  7:55           ` Juergen Gross
  2014-09-05  9:05             ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Juergen Gross @ 2014-09-05  7:55 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: David Vrabel, xen-devel, boris.ostrovsky, konrad.wilk, linux-kernel

On 09/04/2014 04:43 PM, Andrew Cooper wrote:
> On 04/09/14 15:31, Jan Beulich wrote:
>>>>> On 04.09.14 at 15:02, <andrew.cooper3@citrix.com> wrote:
>>> On 04/09/14 13:59, David Vrabel wrote:
>>>> On 04/09/14 13:38, Juergen Gross wrote:
>>>>> Direct Xen to place the initial P->M table outside of the initial
>>>>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>>>>> restriction on the size of the initial mapping limits the amount
>>>>> of memory a domain can be handed initially.
>>>> The three level p2m limits memory to 512 GiB on x86-64 but this patch
>>>> doesn't seem to address this limit and thus seems a bit useless to me.
>>> Any increase of the p2m beyond 3 levels will need to come with
>>> substantial libxc changes first.  3 level p2ms are hard coded throughout
>>> all the PV build and migrate code.
>> No, there is no such dependency - the kernel could use 4 levels at
>> any time (sacrificing being able to get migrated), making sure it
>> only exposes the 3 levels hanging off the fourth level (or not
>> exposing this information at all) to external entities making this
>> wrong assumption.
>>
>> Jan
>>
>
> That would require that the PV kernel must start with a 3 level p2m and
> fudge things afterwards.

I always thought the 3 level p2m is constructed by the kernel, not by
the tools.

It starts with the linear p2m list anchored at xen_start_info->mfn_list,
constructs the p2m tree and writes the p2m_top_mfn mfn to
HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list

See comment in the kernel source arch/x86/xen/p2m.c

So booting with a larger p2m list can be handled completely by the
kernel itself.

>
> At a minimum, I would expect a patch to libxc to detect a 4 level PV
> guest and fail with a meaningful error, rather than an obscure "m2p
> doesn't match p2m for mfn/pfn X".

I'd rather fix it in a clean way.

I think the best way to do it would be an indicator in the p2m array
anchor, e.g. setting 1<<61 in pfn_to_mfn_frame_list_list. This will
result in an early error with old tools:
"Couldn't map p2m_frame_list_list"


Juergen


* Re: [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-04 12:59   ` David Vrabel
  2014-09-04 13:02     ` [Xen-devel] " Andrew Cooper
@ 2014-09-05  8:03     ` Juergen Gross
  1 sibling, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2014-09-05  8:03 UTC (permalink / raw)
  To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
	boris.ostrovsky, jbeulich

On 09/04/2014 02:59 PM, David Vrabel wrote:
> On 04/09/14 13:38, Juergen Gross wrote:
>> Direct Xen to place the initial P->M table outside of the initial
>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>> restriction on the size of the initial mapping limits the amount
>> of memory a domain can be handed initially.
>
> The three level p2m limits memory to 512 GiB on x86-64 but this patch
> doesn't seem to address this limit and thus seems a bit useless to me.

Yeah, there seem to be some bits missing...

I'll add another patch to support a 4 level p2m scheme in the kernel.
For the Xen tools I'll do it, too.


Juergen


* Re: [Xen-devel] [PATCH 2/3] xen: eliminate scalability issues from initrd handling
  2014-09-04 14:53       ` [Xen-devel] " David Vrabel
@ 2014-09-05  8:04         ` Juergen Gross
  0 siblings, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2014-09-05  8:04 UTC (permalink / raw)
  To: David Vrabel, Jan Beulich; +Cc: boris.ostrovsky, xen-devel, linux-kernel

On 09/04/2014 04:53 PM, David Vrabel wrote:
> On 04/09/14 15:29, Jan Beulich wrote:
>>>>> On 04.09.14 at 14:52, <david.vrabel@citrix.com> wrote:
>>> On 04/09/14 13:38, Juergen Gross wrote:
>>>> --- a/arch/x86/xen/xen-head.S
>>>> +++ b/arch/x86/xen/xen-head.S
>>>> @@ -124,6 +124,9 @@ NEXT_HYPERCALL(arch_6)
>>>>   	ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
>>>>   		.quad _PAGE_PRESENT; .quad _PAGE_PRESENT)
>>>>   	ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long 1)
>>>> +#ifdef CONFIG_X86_64
>>>> +	ELFNOTE(Xen, XEN_ELFNOTE_MOD_START_PFN,  .long 1)
>>>> +#endif
>>>
>>> Why X86_64 only?  If there's a good reason the commit message needs to
>>> explain why.
>>
>> Does native 32-bit support huge initrd?
>
> Does that matter? If the MOD_START_PFN option works with a 32-bit guest
> then it should use it, regardless of whether it is essential or not,
> because this reduces the #ifdef'ery.

Okay, I'll verify it's working on 32-bit, too.


Juergen



* Re: [PATCH 1/3] xen: sync some headers with xen tree
  2014-09-04 12:52   ` Jan Beulich
@ 2014-09-05  8:06     ` Juergen Gross
  0 siblings, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2014-09-05  8:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: david.vrabel, xen-devel, boris.ostrovsky, konrad.wilk, linux-kernel

On 09/04/2014 02:52 PM, Jan Beulich wrote:
>>>> On 04.09.14 at 14:38, <"jgross@suse.com".non-mime.internet> wrote:
>> As the KEXEC and DUMPCORE related ELFNOTES are not relevant for the
>> kernel they are omitted from elfnote.h.
>
> But the defines are still in the patch:

Oops, old header version. I'll correct it.


Juergen



* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-05  7:55           ` Juergen Gross
@ 2014-09-05  9:05             ` Andrew Cooper
  2014-09-05  9:44               ` Juergen Gross
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2014-09-05  9:05 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: David Vrabel, xen-devel, boris.ostrovsky, konrad.wilk, linux-kernel

On 05/09/14 08:55, Juergen Gross wrote:
> On 09/04/2014 04:43 PM, Andrew Cooper wrote:
>> On 04/09/14 15:31, Jan Beulich wrote:
>>>>>> On 04.09.14 at 15:02, <andrew.cooper3@citrix.com> wrote:
>>>> On 04/09/14 13:59, David Vrabel wrote:
>>>>> On 04/09/14 13:38, Juergen Gross wrote:
>>>>>> Direct Xen to place the initial P->M table outside of the initial
>>>>>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>>>>>> restriction on the size of the initial mapping limits the amount
>>>>>> of memory a domain can be handed initially.
>>>>> The three level p2m limits memory to 512 GiB on x86-64 but this patch
>>>>> doesn't seem to address this limit and thus seems a bit useless to
>>>>> me.
>>>> Any increase of the p2m beyond 3 levels will need to come with
>>>> substantial libxc changes first.  3 level p2ms are hard coded
>>>> throughout
>>>> all the PV build and migrate code.
>>> No, there is no such dependency - the kernel could use 4 levels at
>>> any time (sacrificing being able to get migrated), making sure it
>>> only exposes the 3 levels hanging off the fourth level (or not
>>> exposing this information at all) to external entities making this
>>> wrong assumption.
>>>
>>> Jan
>>>
>>
>> That would require that the PV kernel must start with a 3 level p2m and
>> fudge things afterwards.
>
> I always thought the 3 level p2m is constructed by the kernel, not by
> the tools.
>
> It starts with the linear p2m list anchored at xen_start_info->mfn_list,
> constructs the p2m tree and writes the p2m_top_mfn mfn to
> HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list
>
> See comment in the kernel source arch/x86/xen/p2m.c
>
> So booting with a larger p2m list can be handled completely by the
> kernel itself.

Ah yes - I remember now.  All the toolstack does is create the linear
p2m.  In which case building such a domain will be fine.

>
>>
>> At a minimum, I would expect a patch to libxc to detect a 4 level PV
>> guest and fail with a meaningful error, rather than an obscure "m2p
>> doesn't match p2m for mfn/pfn X".
>
> I'd rather fix it in a clean way.
>
> I think the best way to do it would be an indicator in the p2m array
> anchor, e.g. setting 1<<61 in pfn_to_mfn_frame_list_list. This will
> result in an early error with old tools:
> "Couldn't map p2m_frame_list_list"

No it won't.  The is_mapped() macro in the toolstack is quite broken.  It
stems from a lack of Design/API/ABI concerning things like the p2m.  In
particular, INVALID_MFN is not an ABI constant, nor is any notion of
mapped vs unmapped.

Its current implementation is a relic of 32-bit days, and only checks bit
31.  It also means that it is impossible to migrate a PV VM with pfns
above the 43-bit limit; a restriction which is lifted by my migration v2
series.  A lot of the other migration constructs are in a similar state,
which is why they are being deleted by the v2 series.

The clean way to fix this is to leave pfn_to_mfn_frame_list_list as
INVALID_MFN. Introduce two new fields beside it named p2m_levels and
p2m_root, which then cater for levels greater than 4 in a compatible
manner.

~Andrew


* Re: [Xen-devel] [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup
  2014-09-05  9:05             ` Andrew Cooper
@ 2014-09-05  9:44               ` Juergen Gross
  0 siblings, 0 replies; 19+ messages in thread
From: Juergen Gross @ 2014-09-05  9:44 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: David Vrabel, xen-devel, boris.ostrovsky, konrad.wilk, linux-kernel

On 09/05/2014 11:05 AM, Andrew Cooper wrote:
> On 05/09/14 08:55, Juergen Gross wrote:
>> On 09/04/2014 04:43 PM, Andrew Cooper wrote:
>>> On 04/09/14 15:31, Jan Beulich wrote:
>>>>>>> On 04.09.14 at 15:02, <andrew.cooper3@citrix.com> wrote:
>>>>> On 04/09/14 13:59, David Vrabel wrote:
>>>>>> On 04/09/14 13:38, Juergen Gross wrote:
>>>>>>> Direct Xen to place the initial P->M table outside of the initial
>>>>>>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>>>>>>> restriction on the size of the initial mapping limits the amount
>>>>>>> of memory a domain can be handed initially.
>>>>>> The three level p2m limits memory to 512 GiB on x86-64 but this patch
>>>>>> doesn't seem to address this limit and thus seems a bit useless to
>>>>>> me.
>>>>> Any increase of the p2m beyond 3 levels will need to come with
>>>>> substantial libxc changes first.  3 level p2ms are hard coded
>>>>> throughout
>>>>> all the PV build and migrate code.
>>>> No, there no such dependency - the kernel could use 4 levels at
>>>> any time (sacrificing being able to get migrated), making sure it
>>>> only exposes the 3 levels hanging off the fourth level (or not
>>>> exposing this information at all) to external entities making this
>>>> wrong assumption.
>>>>
>>>> Jan
>>>>
>>>
>>> That would require that the PV kernel must start with a 3 level p2m and
>>> fudge things afterwards.
>>
>> I always thought the 3 level p2m is constructed by the kernel, not by
>> the tools.
>>
>> It starts with the linear p2m list anchored at xen_start_info->mfn_list,
>> constructs the p2m tree and writes the p2m_top_mfn mfn to
>> HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list
>>
>> See comment in the kernel source arch/x86/xen/p2m.c
>>
>> So booting with a larger p2m list can be handled completely by the
>> kernel itself.
>
> Ah yes - I remember now.  All the toolstack does is create the linear
> p2m.  In which case building such a domain will be fine.
>
>>
>>>
>>> At a minimum, I would expect a patch to libxc to detect a 4 level PV
>>> guest and fail with a meaningful error, rather than an obscure "m2p
>>> doesn't match p2m for mfn/pfn X".
>>
>> I'd rather fix it in a clean way.
>>
>> I think the best way to do it would be an indicator in the p2m array
>> anchor, e.g. setting 1<<61 in pfn_to_mfn_frame_list_list. This will
>> result in an early error with old tools:
>> "Couldn't map p2m_frame_list_list"
>
> No it won't.  The is_mapped() macro in the toolstack is quite broken.  It
> stems from a lack of Design/API/ABI concerning things like the p2m.  In
> particular, INVALID_MFN is not an ABI constant, nor is any notion of
> mapped vs unmapped.

That's not relevant here. map_frame_list_list() in xc_domain_save.c
reads pfn_to_mfn_frame_list_list and tries to map that mfn directly.
This will fail and result in the above error message.
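The failure path being described here can be sketched as follows. map_foreign_mfn() is a stand-in for the real foreign-mapping call in libxc, and MAX_REAL_MFN is an assumed host limit; all names in this sketch are illustrative, not the actual xc_domain_save.c code.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_REAL_MFN (1ULL << 40)   /* assumed host limit, illustrative */

/* Stand-in for the toolstack's foreign-page mapping: a bogus mfn
 * (e.g. one with bit 61 set as a marker) simply fails to map. */
static void *map_foreign_mfn(uint64_t mfn)
{
    static char page[4096];
    return mfn < MAX_REAL_MFN ? page : NULL;
}

/* Sketch of the saver's early bail-out on an unmappable anchor mfn. */
static int map_frame_list_list(uint64_t frame_list_list_mfn)
{
    void *p = map_foreign_mfn(frame_list_list_mfn);

    if (!p) {
        fprintf(stderr, "Couldn't map p2m_frame_list_list\n");
        return -1;               /* old tools abort the save here */
    }
    return 0;
}
```

The point being made: the error comes from the failed map itself, before any is_mapped()-style bit check is ever reached.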

> Its current implementation is a relic of 32-bit days, and only checks bit
> 31.  It also means that it is impossible to migrate a PV VM with pfns
> above the 43-bit limit; a restriction which is lifted by my migration v2
> series.  A lot of the other migration constructs are in a similar state,
> which is why they are being deleted by the v2 series.
>
> The clean way to fix this is to leave pfn_to_mfn_frame_list_list as
> INVALID_MFN. Introduce two new fields beside it, named p2m_levels and
> p2m_root, which then cater for levels greater than 4 in a compatible
> manner.

I don't mind doing it this way.


Juergen



end of thread, other threads:[~2014-09-05  9:44 UTC | newest]

Thread overview: 19+ messages
2014-09-04 12:38 [PATCH 0/3] xen: remove memory limits from pv-domains Juergen Gross
2014-09-04 12:38 ` [PATCH 1/3] xen: sync some headers with xen tree Juergen Gross
2014-09-04 12:52   ` Jan Beulich
2014-09-05  8:06     ` Juergen Gross
2014-09-04 12:38 ` [PATCH 2/3] xen: eliminate scalability issues from initrd handling Juergen Gross
2014-09-04 12:52   ` David Vrabel
2014-09-04 14:29     ` Jan Beulich
2014-09-04 14:53       ` [Xen-devel] " David Vrabel
2014-09-05  8:04         ` Juergen Gross
2014-09-04 12:38 ` [PATCH 3/3] xen: eliminate scalability issues from initial mapping setup Juergen Gross
2014-09-04 12:59   ` David Vrabel
2014-09-04 13:02     ` [Xen-devel] " Andrew Cooper
2014-09-04 14:31       ` Jan Beulich
2014-09-04 14:43         ` Andrew Cooper
2014-09-05  7:55           ` Juergen Gross
2014-09-05  9:05             ` Andrew Cooper
2014-09-05  9:44               ` Juergen Gross
2014-09-04 15:13         ` David Vrabel
2014-09-05  8:03     ` Juergen Gross
