* [PATCH] xen: core dom0 support
@ 2009-02-28  1:59 Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen dom0: Make hvc_xen console work for dom0 Jeremy Fitzhardinge
                   ` (20 more replies)
  0 siblings, 21 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel


Hi,

This series implements the core parts of Xen dom0 support; that is, just
enough to get the kernel started when booted by Xen as a dom0 kernel.

The Xen dom0 kernel runs as a normal paravirtualized Xen kernel, but
it also has the additional responsibility of managing all the machine's
hardware, as Xen itself has almost no internal driver support (it barely
even knows about PCI).

This series includes:
 - setting up a Xen hvc console
 - initializing Xenbus
 - enabling IO permissions for the kernel
 - adding MTRR setup hooks
 - using _PAGE_IOMAP to allow direct hardware mappings
 - adding a paravirt op for page_is_ram, to allow Xen to exclude granted pages
 - enabling the use of a VGA console

Not included in this series are the hooks into APIC setup; those come next.

This may be pulled from:

The following changes since commit cc2f3b455c8efa01c66b8e66df8aad1da9310901:
  Ingo Molnar (1):
        Merge branch 'sched/urgent'

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git push/xen/dom0/core

Ian Campbell (4):
      xen: disable PAT
      xen/dom0: Use host E820 map
      xen: implement XENMEM_machphys_mapping
      xen: clear reserved bits in l3 entries given in the initial pagetables

Jeremy Fitzhardinge (6):
      xen dom0: Make hvc_xen console work for dom0.
      xen-dom0: only selectively disable cpu features
      xen/dom0: use _PAGE_IOMAP in ioremap to do machine mappings
      paravirt/xen: add pvop for page_is_ram
      xen/dom0: add XEN_DOM0 config option
      xen: allow enable use of VGA console on dom0

Juan Quintela (2):
      xen dom0: Initialize xenbus for dom0.
      xen dom0: Set up basic IO permissions for dom0.

Mark McLoughlin (5):
      xen mtrr: Use specific cpu_has_foo macros instead of generic cpu_has()
      xen mtrr: Kill some unneccessary includes
      xen mtrr: Use generic_validate_add_page()
      xen mtrr: Implement xen_get_free_region()
      xen mtrr: Add xen_{get,set}_mtrr() implementations

Stephen Tweedie (2):
      xen dom0: Add support for the platform_ops hypercall
      xen mtrr: Add mtrr_ops support for Xen mtrr

 arch/x86/include/asm/page.h             |    9 +-
 arch/x86/include/asm/paravirt.h         |    7 +
 arch/x86/include/asm/pat.h              |    5 +
 arch/x86/include/asm/xen/hypercall.h    |    8 +
 arch/x86/include/asm/xen/interface.h    |    6 +-
 arch/x86/include/asm/xen/interface_32.h |    5 +
 arch/x86/include/asm/xen/interface_64.h |   13 +--
 arch/x86/include/asm/xen/page.h         |   15 +--
 arch/x86/kernel/cpu/mtrr/Makefile       |    1 +
 arch/x86/kernel/cpu/mtrr/amd.c          |    1 +
 arch/x86/kernel/cpu/mtrr/centaur.c      |    1 +
 arch/x86/kernel/cpu/mtrr/cyrix.c        |    1 +
 arch/x86/kernel/cpu/mtrr/generic.c      |    1 +
 arch/x86/kernel/cpu/mtrr/main.c         |   11 +-
 arch/x86/kernel/cpu/mtrr/mtrr.h         |    7 +
 arch/x86/kernel/cpu/mtrr/xen.c          |  120 ++++++++++++++++
 arch/x86/kernel/paravirt.c              |    1 +
 arch/x86/mm/ioremap.c                   |    2 +-
 arch/x86/mm/pat.c                       |    5 -
 arch/x86/xen/Kconfig                    |   26 ++++
 arch/x86/xen/Makefile                   |    3 +-
 arch/x86/xen/enlighten.c                |   58 ++++++--
 arch/x86/xen/mmu.c                      |  135 ++++++++++++++++++-
 arch/x86/xen/setup.c                    |   51 ++++++-
 arch/x86/xen/vga.c                      |   65 +++++++++
 arch/x86/xen/xen-ops.h                  |   12 ++
 drivers/char/hvc_xen.c                  |  101 +++++++++-----
 drivers/xen/events.c                    |    2 +-
 drivers/xen/xenbus/xenbus_probe.c       |   30 ++++-
 include/xen/events.h                    |    2 +
 include/xen/interface/memory.h          |   42 ++++++
 include/xen/interface/platform.h        |  232 +++++++++++++++++++++++++++++++
 include/xen/interface/xen.h             |   41 ++++++
 33 files changed, 931 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mtrr/xen.c
 create mode 100644 arch/x86/xen/vga.c
 create mode 100644 include/xen/interface/platform.h

Thanks,
	J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* [PATCH] xen dom0: Make hvc_xen console work for dom0.
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge, Jeremy Fitzhardinge, Juan Quintela

Use the console hypercalls for dom0 console.
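The dom0 path funnels all output through the CONSOLEIO_write hypercall, which may consume fewer bytes than requested, so callers such as the (removed) raw_console_write loop retry until everything is written. A standalone sketch of that retry pattern, with the hypercall replaced by a stub of our own naming (a hypothetical hypervisor that accepts at most 8 bytes per call):

```c
#include <assert.h>

/* Stub standing in for HYPERVISOR_console_io(CONSOLEIO_write, ...):
 * pretend the hypervisor's internal buffer takes at most 8 bytes
 * per call.  (Stub name and limit are ours, for illustration.) */
static int stub_console_io_write(int len, const char *str)
{
	(void)str;
	return len > 8 ? 8 : len;
}

/* Retry loop in the style of the removed raw_console_write(): keep
 * issuing the hypercall until all bytes are consumed or it errors.
 * Returns the number of bytes successfully written. */
static int raw_console_write(const char *str, int len)
{
	int written = 0;

	while (len > 0) {
		int rc = stub_console_io_write(len, str);
		if (rc <= 0)
			break;
		str += rc;
		len -= rc;
		written += rc;
	}
	return written;
}
```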

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 drivers/char/hvc_xen.c |  101 ++++++++++++++++++++++++++++++++----------------
 drivers/xen/events.c   |    2 +-
 include/xen/events.h   |    2 +
 3 files changed, 70 insertions(+), 35 deletions(-)

diff --git a/drivers/char/hvc_xen.c b/drivers/char/hvc_xen.c
index eba999f..81d9186 100644
--- a/drivers/char/hvc_xen.c
+++ b/drivers/char/hvc_xen.c
@@ -55,7 +55,7 @@ static inline void notify_daemon(void)
 	notify_remote_via_evtchn(xen_start_info->console.domU.evtchn);
 }
 
-static int write_console(uint32_t vtermno, const char *data, int len)
+static int domU_write_console(uint32_t vtermno, const char *data, int len)
 {
 	struct xencons_interface *intf = xencons_interface();
 	XENCONS_RING_IDX cons, prod;
@@ -76,7 +76,7 @@ static int write_console(uint32_t vtermno, const char *data, int len)
 	return sent;
 }
 
-static int read_console(uint32_t vtermno, char *buf, int len)
+static int domU_read_console(uint32_t vtermno, char *buf, int len)
 {
 	struct xencons_interface *intf = xencons_interface();
 	XENCONS_RING_IDX cons, prod;
@@ -97,28 +97,63 @@ static int read_console(uint32_t vtermno, char *buf, int len)
 	return recv;
 }
 
-static struct hv_ops hvc_ops = {
-	.get_chars = read_console,
-	.put_chars = write_console,
+static struct hv_ops domU_hvc_ops = {
+	.get_chars = domU_read_console,
+	.put_chars = domU_write_console,
 	.notifier_add = notifier_add_irq,
 	.notifier_del = notifier_del_irq,
 	.notifier_hangup = notifier_hangup_irq,
 };
 
-static int __init xen_init(void)
+static int dom0_read_console(uint32_t vtermno, char *buf, int len)
+{
+	return HYPERVISOR_console_io(CONSOLEIO_read, len, buf);
+}
+
+/*
+ * Either for a dom0 to write to the system console, or a domU with a
+ * debug version of Xen
+ */
+static int dom0_write_console(uint32_t vtermno, const char *str, int len)
+{
+	int rc = HYPERVISOR_console_io(CONSOLEIO_write, len, (char *)str);
+	if (rc < 0)
+		return 0;
+
+	return len;
+}
+
+static struct hv_ops dom0_hvc_ops = {
+	.get_chars = dom0_read_console,
+	.put_chars = dom0_write_console,
+	.notifier_add = notifier_add_irq,
+	.notifier_del = notifier_del_irq,
+	.notifier_hangup = notifier_hangup_irq,
+};
+
+static int __init xen_hvc_init(void)
 {
 	struct hvc_struct *hp;
+	struct hv_ops *ops;
 
-	if (!xen_pv_domain() ||
-	    xen_initial_domain() ||
-	    !xen_start_info->console.domU.evtchn)
-		return -ENODEV;
+	if (!xen_pv_domain())
+		return -ENODEV;
+
+	if (xen_initial_domain()) {
+		ops = &dom0_hvc_ops;
+		xencons_irq = bind_virq_to_irq(VIRQ_CONSOLE, 0);
+	} else {
+		if (!xen_start_info->console.domU.evtchn)
+			return -ENODEV;
+
+		ops = &domU_hvc_ops;
+		xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
+	}
 
-	xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
 	if (xencons_irq < 0)
 		xencons_irq = 0; /* NO_IRQ */
 
-	hp = hvc_alloc(HVC_COOKIE, xencons_irq, &hvc_ops, 256);
+	hp = hvc_alloc(HVC_COOKIE, xencons_irq, ops, 256);
 	if (IS_ERR(hp))
 		return PTR_ERR(hp);
 
@@ -135,7 +170,7 @@ void xen_console_resume(void)
 		rebind_evtchn_irq(xen_start_info->console.domU.evtchn, xencons_irq);
 }
 
-static void __exit xen_fini(void)
+static void __exit xen_hvc_fini(void)
 {
 	if (hvc)
 		hvc_remove(hvc);
@@ -143,29 +178,24 @@ static void __exit xen_fini(void)
 
 static int xen_cons_init(void)
 {
+	struct hv_ops *ops;
+
 	if (!xen_pv_domain())
 		return 0;
 
-	hvc_instantiate(HVC_COOKIE, 0, &hvc_ops);
+	ops = &domU_hvc_ops;
+	if (xen_initial_domain())
+		ops = &dom0_hvc_ops;
+
+	hvc_instantiate(HVC_COOKIE, 0, ops);
+
 	return 0;
 }
 
-module_init(xen_init);
-module_exit(xen_fini);
+module_init(xen_hvc_init);
+module_exit(xen_hvc_fini);
 console_initcall(xen_cons_init);
 
-static void raw_console_write(const char *str, int len)
-{
-	while(len > 0) {
-		int rc = HYPERVISOR_console_io(CONSOLEIO_write, len, (char *)str);
-		if (rc <= 0)
-			break;
-
-		str += rc;
-		len -= rc;
-	}
-}
-
 #ifdef CONFIG_EARLY_PRINTK
 static void xenboot_write_console(struct console *console, const char *string,
 				  unsigned len)
@@ -173,19 +203,22 @@ static void xenboot_write_console(struct console *console, const char *string,
 	unsigned int linelen, off = 0;
 	const char *pos;
 
-	raw_console_write(string, len);
+	dom0_write_console(0, string, len);
+
+	if (xen_initial_domain())
+		return;
 
-	write_console(0, "(early) ", 8);
+	domU_write_console(0, "(early) ", 8);
 	while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
 		linelen = pos-string+off;
 		if (off + linelen > len)
 			break;
-		write_console(0, string+off, linelen);
-		write_console(0, "\r\n", 2);
+		domU_write_console(0, string+off, linelen);
+		domU_write_console(0, "\r\n", 2);
 		off += linelen + 1;
 	}
 	if (off < len)
-		write_console(0, string+off, len-off);
+		domU_write_console(0, string+off, len-off);
 }
 
 struct console xenboot_console = {
@@ -197,7 +230,7 @@ struct console xenboot_console = {
 
 void xen_raw_console_write(const char *str)
 {
-	raw_console_write(str, strlen(str));
+	dom0_write_console(0, str, strlen(str));
 }
 
 void xen_raw_printk(const char *fmt, ...)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index 30963af..f46e880 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -404,7 +404,7 @@ static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu)
 }
 
 
-static int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
+int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
 {
 	struct evtchn_bind_virq bind_virq;
 	int evtchn, irq;
diff --git a/include/xen/events.h b/include/xen/events.h
index 0d5f1ad..0397ba1 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -12,6 +12,8 @@ int bind_evtchn_to_irqhandler(unsigned int evtchn,
 			      irq_handler_t handler,
 			      unsigned long irqflags, const char *devname,
 			      void *dev_id);
+int bind_virq_to_irq(unsigned int virq, unsigned int cpu);
+
 int bind_virq_to_irqhandler(unsigned int virq, unsigned int cpu,
 			    irq_handler_t handler,
 			    unsigned long irqflags, const char *devname,
-- 
1.6.0.6



* [PATCH] xen dom0: Initialize xenbus for dom0.
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59   ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Juan Quintela, Jeremy Fitzhardinge

From: Juan Quintela <quintela@redhat.com>

Do initial xenbus/xenstore setup in dom0.  In dom0 we need to actually
allocate the xenstore resources, rather than being given them from
outside.
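In outline, the dom0 branch does three things: allocate a zeroed page to back the store, publish its machine frame number, and ask Xen for an unbound event channel that xenstored can later bind. A sketch of that sequence with the hypercall and page allocator stubbed out (all stub names and the port number are ours, for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

struct evtchn_alloc_unbound {
	uint16_t dom, remote_dom;
	uint32_t port;
};

/* Stub for EVTCHNOP_alloc_unbound: hand out a fixed port. */
static int stub_alloc_unbound(struct evtchn_alloc_unbound *op)
{
	op->port = 3;
	return 0;
}

/* Stub for pfn_to_mfn(): identity mapping, purely illustrative. */
static unsigned long stub_pfn_to_mfn(unsigned long pfn)
{
	return pfn;
}

static uint32_t xen_store_evtchn;
static unsigned long xen_store_mfn;

/* Mirrors the shape of the dom0 branch in xenbus_probe_init():
 * allocate the xenstore page, record its (stubbed) mfn, then get an
 * unbound event channel, freeing the page on the error path.
 * On success the page intentionally stays allocated: it is the live
 * xenstore ring. */
static int dom0_xenstore_setup(void)
{
	struct evtchn_alloc_unbound alloc_unbound = { 0, 0, 0 };
	void *page = calloc(1, PAGE_SIZE);	/* get_zeroed_page() stand-in */

	if (!page)
		return -1;

	xen_store_mfn = stub_pfn_to_mfn((uintptr_t)page / PAGE_SIZE);

	if (stub_alloc_unbound(&alloc_unbound) != 0) {
		free(page);			/* error path frees the page */
		return -1;
	}

	xen_store_evtchn = alloc_unbound.port;
	return 0;
}
```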

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 drivers/xen/xenbus/xenbus_probe.c |   30 +++++++++++++++++++++++++++++-
 1 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 773d1cf..38aaec3 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -821,6 +821,7 @@ void xenbus_probe(struct work_struct *unused)
 static int __init xenbus_probe_init(void)
 {
 	int err = 0;
+	unsigned long page = 0;
 
 	DPRINTK("");
 
@@ -841,7 +842,31 @@ static int __init xenbus_probe_init(void)
 	 * Domain0 doesn't have a store_evtchn or store_mfn yet.
 	 */
 	if (xen_initial_domain()) {
-		/* dom0 not yet supported */
+		struct evtchn_alloc_unbound alloc_unbound;
+
+		/* Allocate Xenstore page */
+		page = get_zeroed_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+
+		xen_store_mfn = xen_start_info->store_mfn =
+			pfn_to_mfn(virt_to_phys((void *)page) >>
+				   PAGE_SHIFT);
+
+		/* Next allocate a local port which xenstored can bind to */
+		alloc_unbound.dom        = DOMID_SELF;
+		alloc_unbound.remote_dom = 0;
+
+		err = HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound,
+						  &alloc_unbound);
+		if (err == -ENOSYS)
+			goto out_unreg_front;
+
+		BUG_ON(err);
+		xen_store_evtchn = xen_start_info->store_evtchn =
+			alloc_unbound.port;
+
+		xen_store_interface = mfn_to_virt(xen_store_mfn);
 	} else {
 		xenstored_ready = 1;
 		xen_store_evtchn = xen_start_info->store_evtchn;
@@ -877,6 +902,9 @@ static int __init xenbus_probe_init(void)
 	bus_unregister(&xenbus_frontend.bus);
 
   out_error:
+	if (page != 0)
+		free_page(page);
+
 	return err;
 }
 
-- 
1.6.0.6




* [PATCH] xen dom0: Set up basic IO permissions for dom0.
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59   ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Juan Quintela, Jeremy Fitzhardinge

From: Juan Quintela <quintela@redhat.com>

Add the direct mapping area for ISA bus access, and enable IO space
access for the guest when running as dom0.
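The ISA identity map in this patch covers the machine range from ISA_START_ADDRESS (0xa0000) up to ISA_END_ADDRESS (0x100000) one page at a time; with 4 KiB pages that is 96 update_va_mapping calls. A sketch of the loop with the hypercall stubbed out to count mappings (stub names are ours; the address constants match the kernel's):

```c
#include <assert.h>

#define ISA_START_ADDRESS 0xa0000UL	/* same values the kernel uses */
#define ISA_END_ADDRESS   0x100000UL
#define PAGE_SIZE         4096UL

static int mapped_pages;

/* Stub for HYPERVISOR_update_va_mapping(): just count the calls. */
static int stub_update_va_mapping(unsigned long va, unsigned long mfn)
{
	(void)va;
	(void)mfn;
	mapped_pages++;
	return 0;
}

/* Mirrors the loop in xen_ident_map_ISA(): map each ISA machine page
 * at its identity virtual address, failing hard if any call errors. */
static int ident_map_isa(void)
{
	unsigned long pa;

	for (pa = ISA_START_ADDRESS; pa < ISA_END_ADDRESS; pa += PAGE_SIZE)
		if (stub_update_va_mapping(pa, pa / PAGE_SIZE) != 0)
			return -1;
	return 0;
}
```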

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 arch/x86/xen/enlighten.c |    8 ++++++++
 arch/x86/xen/mmu.c       |   24 ++++++++++++++++++++++++
 arch/x86/xen/setup.c     |    6 +++++-
 arch/x86/xen/xen-ops.h   |    1 +
 4 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 95ff6a0..d5fc434 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -941,6 +941,7 @@ asmlinkage void __init xen_start_kernel(void)
 
 	xen_raw_console_write("mapping kernel into physical memory\n");
 	pgd = xen_setup_kernel_pagetable(pgd, xen_start_info->nr_pages);
+	xen_ident_map_ISA();
 
 	init_mm.pgd = pgd;
 
@@ -950,6 +951,13 @@ asmlinkage void __init xen_start_kernel(void)
 	if (xen_feature(XENFEAT_supervisor_mode_kernel))
 		pv_info.kernel_rpl = 0;
 
+	if (xen_initial_domain()) {
+		struct physdev_set_iopl set_iopl;
+		set_iopl.iopl = 1;
+		if (HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl) == -1)
+			BUG();
+	}
+
 	/* set the limit of our address space */
 	xen_reserve_top();
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index d2e8ed1..36125ea 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1572,6 +1572,7 @@ static void *m2v(phys_addr_t maddr)
 	return __ka(m2p(maddr));
 }
 
+/* Set the page permissions on an identity-mapped page */
 static void set_page_prot(void *addr, pgprot_t prot)
 {
 	unsigned long pfn = __pa(addr) >> PAGE_SHIFT;
@@ -1789,6 +1790,29 @@ static void xen_set_fixmap(unsigned idx, unsigned long phys, pgprot_t prot)
 #endif
 }
 
+__init void xen_ident_map_ISA(void)
+{
+	unsigned long pa;
+
+	/*
+	 * If we're dom0, then linear map the ISA machine addresses into
+	 * the kernel's address space.
+	 */
+	if (!xen_initial_domain())
+		return;
+
+	xen_raw_printk("Xen: setup ISA identity maps\n");
+
+	for (pa = ISA_START_ADDRESS; pa < ISA_END_ADDRESS; pa += PAGE_SIZE) {
+		pte_t pte = mfn_pte(PFN_DOWN(pa), PAGE_KERNEL_IO);
+
+		if (HYPERVISOR_update_va_mapping(PAGE_OFFSET + pa, pte, 0))
+			BUG();
+	}
+
+	xen_flush_tlb();
+}
+
 __init void xen_post_allocator_init(void)
 {
 	pv_mmu_ops.set_pte = xen_set_pte;
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 15c6c68..3e4cf46 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -51,6 +51,9 @@ char * __init xen_memory_setup(void)
 	 * Even though this is normal, usable memory under Xen, reserve
 	 * ISA memory anyway because too many things think they can poke
 	 * about in there.
+	 *
+	 * In a dom0 kernel, this region is identity mapped with the
+	 * hardware ISA area, so it really is out of bounds.
 	 */
 	e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
 			E820_RESERVED);
@@ -188,7 +191,8 @@ void __init xen_arch_setup(void)
 
 	pm_idle = xen_idle;
 
-	paravirt_disable_iospace();
+	if (!xen_initial_domain())
+		paravirt_disable_iospace();
 
 	fiddle_vdso();
 }
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2f5ef26..33f7538 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -29,6 +29,7 @@ void xen_setup_machphys_mapping(void);
 pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
 void xen_ident_map_ISA(void);
 void xen_reserve_top(void);
+void xen_ident_map_ISA(void);
 
 void xen_leave_lazy(void);
 void xen_post_allocator_init(void);
-- 
1.6.0.6




* [PATCH] xen-dom0: only selectively disable cpu features
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (2 preceding siblings ...)
  2009-02-28  1:59   ` Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen dom0: Add support for the platform_ops hypercall Jeremy Fitzhardinge
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge, Jeremy Fitzhardinge

Dom0 kernels actually want most of the CPU features to be enabled.
Some, like MCA/MCE, are still handled by Xen itself.

We leave the APIC feature enabled, even though we don't really have a
functional local APIC, so that the ACPI code will parse the
corresponding tables properly.
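The masking logic in the patched xen_cpuid() can be exercised on its own. The bit positions below are the standard CPUID leaf-1 EDX feature bits (the values the kernel's X86_FEATURE_* constants resolve to for this word); the dom0/domU distinction is passed in as a flag:

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf-1 EDX bit positions. */
#define FEATURE_MCE   7
#define FEATURE_APIC  9
#define FEATURE_MCA   14
#define FEATURE_ACPI  22
#define FEATURE_ACC   29	/* thermal monitoring */

/* Build the EDX clear-mask the way the patched xen_cpuid() does:
 * always hide MCE/MCA/ACC; additionally hide APIC and ACPI when we
 * are not the initial domain. */
static uint32_t xen_cpuid_mask(int initial_domain)
{
	uint32_t maskedx = (1u << FEATURE_MCE) |
			   (1u << FEATURE_MCA) |
			   (1u << FEATURE_ACC);

	if (!initial_domain)
		maskedx |= (1u << FEATURE_APIC) |
			   (1u << FEATURE_ACPI);

	return maskedx;
}
```

The caller then applies `*dx &= ~xen_cpuid_mask(...)`, so a set bit in the mask means the feature is hidden from the guest.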

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/enlighten.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d5fc434..468aa23 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -171,18 +171,23 @@ static void __init xen_banner(void)
 static void xen_cpuid(unsigned int *ax, unsigned int *bx,
 		      unsigned int *cx, unsigned int *dx)
 {
-	unsigned maskedx = ~0;
+	unsigned maskedx = 0;
 
 	/*
 	 * Mask out inconvenient features, to try and disable as many
 	 * unsupported kernel subsystems as possible.
 	 */
-	if (*ax == 1)
-		maskedx = ~((1 << X86_FEATURE_APIC) |  /* disable APIC */
-			    (1 << X86_FEATURE_ACPI) |  /* disable ACPI */
-			    (1 << X86_FEATURE_MCE)  |  /* disable MCE */
-			    (1 << X86_FEATURE_MCA)  |  /* disable MCA */
-			    (1 << X86_FEATURE_ACC));   /* thermal monitoring */
+	if (*ax == 1) {
+		maskedx =
+			(1 << X86_FEATURE_MCE)  |  /* disable MCE */
+			(1 << X86_FEATURE_MCA)  |  /* disable MCA */
+			(1 << X86_FEATURE_ACC);   /* thermal monitoring */
+
+		if (!xen_initial_domain())
+			maskedx |=
+				(1 << X86_FEATURE_APIC) |  /* disable local APIC */
+				(1 << X86_FEATURE_ACPI);  /* disable ACPI */
+	}
 
 	asm(XEN_EMULATE_PREFIX "cpuid"
 		: "=a" (*ax),
@@ -190,7 +195,7 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx,
 		  "=c" (*cx),
 		  "=d" (*dx)
 		: "0" (*ax), "2" (*cx));
-	*dx &= maskedx;
+	*dx &= ~maskedx;
 }
 
 static void xen_set_debugreg(int reg, unsigned long val)
-- 
1.6.0.6



* [PATCH] xen dom0: Add support for the platform_ops hypercall
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (3 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen-dom0: only selectively disable cpu features Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Stephen Tweedie, Jeremy Fitzhardinge

From: Stephen Tweedie <sct@redhat.com>

Minimal changes to get platform ops (formerly dom0_ops) working on
pv_ops builds.  Pulls in platform.h from the upstream
linux-2.6.18-xen.hg tree.
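Beyond issuing the raw hypercall, the HYPERVISOR_dom0_op() wrapper's one job is stamping interface_version so Xen can reject callers built against a mismatched ABI. A sketch with the hypercall stubbed to record what it saw (stub names are ours; the version constant is the one defined in platform.h):

```c
#include <assert.h>
#include <stdint.h>

#define XENPF_INTERFACE_VERSION 0x03000001	/* from platform.h */

struct xen_platform_op {
	uint32_t cmd;
	uint32_t interface_version;	/* XENPF_INTERFACE_VERSION */
};

static uint32_t last_seen_version;

/* Stub for the dom0_op hypercall: just record the version field. */
static int stub_hypercall_dom0_op(struct xen_platform_op *op)
{
	last_seen_version = op->interface_version;
	return 0;
}

/* Mirrors HYPERVISOR_dom0_op(): stamp the interface version on the
 * caller's request, then trap into the (stubbed) hypervisor. */
static int dom0_op(struct xen_platform_op *op)
{
	op->interface_version = XENPF_INTERFACE_VERSION;
	return stub_hypercall_dom0_op(op);
}
```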

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/include/asm/xen/hypercall.h |    8 ++
 include/xen/interface/platform.h     |  232 ++++++++++++++++++++++++++++++++++
 include/xen/interface/xen.h          |    2 +
 3 files changed, 242 insertions(+), 0 deletions(-)
 create mode 100644 include/xen/interface/platform.h

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index 5e79ca6..2200a72 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -45,6 +45,7 @@
 #include <xen/interface/xen.h>
 #include <xen/interface/sched.h>
 #include <xen/interface/physdev.h>
+#include <xen/interface/platform.h>
 
 /*
  * The hypercall asms have to meet several constraints:
@@ -282,6 +283,13 @@ HYPERVISOR_set_timer_op(u64 timeout)
 }
 
 static inline int
+HYPERVISOR_dom0_op(struct xen_platform_op *platform_op)
+{
+	platform_op->interface_version = XENPF_INTERFACE_VERSION;
+	return _hypercall1(int, dom0_op, platform_op);
+}
+
+static inline int
 HYPERVISOR_set_debugreg(int reg, unsigned long value)
 {
 	return _hypercall2(int, set_debugreg, reg, value);
diff --git a/include/xen/interface/platform.h b/include/xen/interface/platform.h
new file mode 100644
index 0000000..da548f3
--- /dev/null
+++ b/include/xen/interface/platform.h
@@ -0,0 +1,232 @@
+/******************************************************************************
+ * platform.h
+ *
+ * Hardware platform operations. Intended for use by domain-0 kernel.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2002-2006, K Fraser
+ */
+
+#ifndef __XEN_PUBLIC_PLATFORM_H__
+#define __XEN_PUBLIC_PLATFORM_H__
+
+#include "xen.h"
+
+#define XENPF_INTERFACE_VERSION 0x03000001
+
+/*
+ * Set clock such that it would read <secs,nsecs> after 00:00:00 UTC,
+ * 1 January, 1970 if the current system time was <system_time>.
+ */
+#define XENPF_settime             17
+struct xenpf_settime {
+    /* IN variables. */
+    uint32_t secs;
+    uint32_t nsecs;
+    uint64_t system_time;
+};
+typedef struct xenpf_settime xenpf_settime_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_settime_t);
+
+/*
+ * Request memory range (@mfn, @mfn+@nr_mfns-1) to have type @type.
+ * On x86, @type is an architecture-defined MTRR memory type.
+ * On success, returns the MTRR that was used (@reg) and a handle that can
+ * be passed to XENPF_DEL_MEMTYPE to accurately tear down the new setting.
+ * (x86-specific).
+ */
+#define XENPF_add_memtype         31
+struct xenpf_add_memtype {
+    /* IN variables. */
+    unsigned long mfn;
+    uint64_t nr_mfns;
+    uint32_t type;
+    /* OUT variables. */
+    uint32_t handle;
+    uint32_t reg;
+};
+typedef struct xenpf_add_memtype xenpf_add_memtype_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_add_memtype_t);
+
+/*
+ * Tear down an existing memory-range type. If @handle is remembered then it
+ * should be passed in to accurately tear down the correct setting (in case
+ * of overlapping memory regions with differing types). If it is not known
+ * then @handle should be set to zero. In all cases @reg must be set.
+ * (x86-specific).
+ */
+#define XENPF_del_memtype         32
+struct xenpf_del_memtype {
+    /* IN variables. */
+    uint32_t handle;
+    uint32_t reg;
+};
+typedef struct xenpf_del_memtype xenpf_del_memtype_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_del_memtype_t);
+
+/* Read current type of an MTRR (x86-specific). */
+#define XENPF_read_memtype        33
+struct xenpf_read_memtype {
+    /* IN variables. */
+    uint32_t reg;
+    /* OUT variables. */
+    unsigned long mfn;
+    uint64_t nr_mfns;
+    uint32_t type;
+};
+typedef struct xenpf_read_memtype xenpf_read_memtype_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_read_memtype_t);
+
+#define XENPF_microcode_update    35
+struct xenpf_microcode_update {
+    /* IN variables. */
+    GUEST_HANDLE(void) data;          /* Pointer to microcode data */
+    uint32_t length;                  /* Length of microcode data. */
+};
+typedef struct xenpf_microcode_update xenpf_microcode_update_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_microcode_update_t);
+
+#define XENPF_platform_quirk      39
+#define QUIRK_NOIRQBALANCING      1 /* Do not restrict IO-APIC RTE targets */
+#define QUIRK_IOAPIC_BAD_REGSEL   2 /* IO-APIC REGSEL forgets its value    */
+#define QUIRK_IOAPIC_GOOD_REGSEL  3 /* IO-APIC REGSEL behaves properly     */
+struct xenpf_platform_quirk {
+    /* IN variables. */
+    uint32_t quirk_id;
+};
+typedef struct xenpf_platform_quirk xenpf_platform_quirk_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_platform_quirk_t);
+
+#define XENPF_firmware_info       50
+#define XEN_FW_DISK_INFO          1 /* from int 13 AH=08/41/48 */
+#define XEN_FW_DISK_MBR_SIGNATURE 2 /* from MBR offset 0x1b8 */
+#define XEN_FW_VBEDDC_INFO        3 /* from int 10 AX=4f15 */
+struct xenpf_firmware_info {
+    /* IN variables. */
+    uint32_t type;
+    uint32_t index;
+    /* OUT variables. */
+    union {
+        struct {
+            /* Int13, Fn48: Check Extensions Present. */
+            uint8_t device;                   /* %dl: bios device number */
+            uint8_t version;                  /* %ah: major version      */
+            uint16_t interface_support;       /* %cx: support bitmap     */
+            /* Int13, Fn08: Legacy Get Device Parameters. */
+            uint16_t legacy_max_cylinder;     /* %cl[7:6]:%ch: max cyl # */
+            uint8_t legacy_max_head;          /* %dh: max head #         */
+            uint8_t legacy_sectors_per_track; /* %cl[5:0]: max sector #  */
+            /* Int13, Fn41: Get Device Parameters (as filled into %ds:%esi). */
+            /* NB. First uint16_t of buffer must be set to buffer size.      */
+            GUEST_HANDLE(void) edd_params;
+        } disk_info; /* XEN_FW_DISK_INFO */
+        struct {
+            uint8_t device;                   /* bios device number  */
+            uint32_t mbr_signature;           /* offset 0x1b8 in mbr */
+        } disk_mbr_signature; /* XEN_FW_DISK_MBR_SIGNATURE */
+        struct {
+            /* Int10, AX=4F15: Get EDID info. */
+            uint8_t capabilities;
+            uint8_t edid_transfer_time;
+            /* must refer to 128-byte buffer */
+            GUEST_HANDLE(uchar) edid;
+        } vbeddc_info; /* XEN_FW_VBEDDC_INFO */
+    } u;
+};
+typedef struct xenpf_firmware_info xenpf_firmware_info_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_firmware_info_t);
+
+#define XENPF_enter_acpi_sleep    51
+struct xenpf_enter_acpi_sleep {
+    /* IN variables */
+    uint16_t pm1a_cnt_val;      /* PM1a control value. */
+    uint16_t pm1b_cnt_val;      /* PM1b control value. */
+    uint32_t sleep_state;       /* Which state to enter (Sn). */
+    uint32_t flags;             /* Must be zero. */
+};
+typedef struct xenpf_enter_acpi_sleep xenpf_enter_acpi_sleep_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_enter_acpi_sleep_t);
+
+#define XENPF_change_freq         52
+struct xenpf_change_freq {
+    /* IN variables */
+    uint32_t flags; /* Must be zero. */
+    uint32_t cpu;   /* Physical cpu. */
+    uint64_t freq;  /* New frequency (Hz). */
+};
+typedef struct xenpf_change_freq xenpf_change_freq_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_change_freq_t);
+
+/*
+ * Get idle times (nanoseconds since boot) for physical CPUs specified in the
+ * @cpumap_bitmap with range [0..@cpumap_nr_cpus-1]. The @idletime array is
+ * indexed by CPU number; only entries with the corresponding @cpumap_bitmap
+ * bit set are written to. On return, @cpumap_bitmap is modified so that any
+ * non-existent CPUs are cleared. Such CPUs have their @idletime array entry
+ * cleared.
+ */
+#define XENPF_getidletime         53
+struct xenpf_getidletime {
+    /* IN/OUT variables */
+    /* IN: CPUs to interrogate; OUT: subset of IN which are present */
+    GUEST_HANDLE(uchar) cpumap_bitmap;
+    /* IN variables */
+    /* Size of cpumap bitmap. */
+    uint32_t cpumap_nr_cpus;
+    /* Must be indexable for every cpu in cpumap_bitmap. */
+    GUEST_HANDLE(uint64_t) idletime;
+    /* OUT variables */
+    /* System time when the idletime snapshots were taken. */
+    uint64_t now;
+};
+typedef struct xenpf_getidletime xenpf_getidletime_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_getidletime_t);
+
+struct xen_platform_op {
+    uint32_t cmd;
+    uint32_t interface_version; /* XENPF_INTERFACE_VERSION */
+    union {
+        struct xenpf_settime           settime;
+        struct xenpf_add_memtype       add_memtype;
+        struct xenpf_del_memtype       del_memtype;
+        struct xenpf_read_memtype      read_memtype;
+        struct xenpf_microcode_update  microcode;
+        struct xenpf_platform_quirk    platform_quirk;
+        struct xenpf_firmware_info     firmware_info;
+        struct xenpf_enter_acpi_sleep  enter_acpi_sleep;
+        struct xenpf_change_freq       change_freq;
+        struct xenpf_getidletime       getidletime;
+        uint8_t                        pad[128];
+    } u;
+};
+typedef struct xen_platform_op xen_platform_op_t;
+DEFINE_GUEST_HANDLE_STRUCT(xen_platform_op_t);
+
+#endif /* __XEN_PUBLIC_PLATFORM_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 2befa3e..18b5599 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -461,6 +461,8 @@ typedef uint8_t xen_domain_handle_t[16];
 #define __mk_unsigned_long(x) x ## UL
 #define mk_unsigned_long(x) __mk_unsigned_long(x)
 
+DEFINE_GUEST_HANDLE(uint64_t);
+
 #else /* __ASSEMBLY__ */
 
 /* In assembly code we cannot use C numeric constant suffixes. */
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH] xen mtrr: Add mtrr_ops support for Xen mtrr
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59   ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Stephen Tweedie, Jeremy Fitzhardinge

From: Stephen Tweedie <sct@redhat.com>

Add a Xen mtrr type, and reorganise mtrr initialisation slightly to allow
the mtrr driver to set up num_var_ranges (Xen needs to determine this by
querying the hypervisor itself).

Only the boot path is handled for now: we install a Xen-specific mtrr_if
and populate the mtrr tables from hypervisor information, but we don't
yet handle mtrr entry add/delete.

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/cpu/mtrr/Makefile  |    1 +
 arch/x86/kernel/cpu/mtrr/amd.c     |    1 +
 arch/x86/kernel/cpu/mtrr/centaur.c |    1 +
 arch/x86/kernel/cpu/mtrr/cyrix.c   |    1 +
 arch/x86/kernel/cpu/mtrr/generic.c |    1 +
 arch/x86/kernel/cpu/mtrr/main.c    |   11 +++++--
 arch/x86/kernel/cpu/mtrr/mtrr.h    |    5 +++
 arch/x86/kernel/cpu/mtrr/xen.c     |   59 ++++++++++++++++++++++++++++++++++++
 8 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mtrr/xen.c

diff --git a/arch/x86/kernel/cpu/mtrr/Makefile b/arch/x86/kernel/cpu/mtrr/Makefile
index 191fc05..a836f72 100644
--- a/arch/x86/kernel/cpu/mtrr/Makefile
+++ b/arch/x86/kernel/cpu/mtrr/Makefile
@@ -1,3 +1,4 @@
 obj-y		:= main.o if.o generic.o state.o
 obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o
+obj-$(CONFIG_XEN_DOM0) += xen.o
 
diff --git a/arch/x86/kernel/cpu/mtrr/amd.c b/arch/x86/kernel/cpu/mtrr/amd.c
index ee2331b..7bf23de 100644
--- a/arch/x86/kernel/cpu/mtrr/amd.c
+++ b/arch/x86/kernel/cpu/mtrr/amd.c
@@ -108,6 +108,7 @@ static struct mtrr_ops amd_mtrr_ops = {
 	.get_free_region   = generic_get_free_region,
 	.validate_add_page = amd_validate_add_page,
 	.have_wrcomb       = positive_have_wrcomb,
+	.num_var_ranges	   = common_num_var_ranges,
 };
 
 int __init amd_init_mtrr(void)
diff --git a/arch/x86/kernel/cpu/mtrr/centaur.c b/arch/x86/kernel/cpu/mtrr/centaur.c
index cb9aa3a..7e3f74f 100644
--- a/arch/x86/kernel/cpu/mtrr/centaur.c
+++ b/arch/x86/kernel/cpu/mtrr/centaur.c
@@ -213,6 +213,7 @@ static struct mtrr_ops centaur_mtrr_ops = {
 	.get_free_region   = centaur_get_free_region,
 	.validate_add_page = centaur_validate_add_page,
 	.have_wrcomb       = positive_have_wrcomb,
+	.num_var_ranges	   = common_num_var_ranges,
 };
 
 int __init centaur_init_mtrr(void)
diff --git a/arch/x86/kernel/cpu/mtrr/cyrix.c b/arch/x86/kernel/cpu/mtrr/cyrix.c
index ff14c32..c7bb5e3 100644
--- a/arch/x86/kernel/cpu/mtrr/cyrix.c
+++ b/arch/x86/kernel/cpu/mtrr/cyrix.c
@@ -263,6 +263,7 @@ static struct mtrr_ops cyrix_mtrr_ops = {
 	.get_free_region   = cyrix_get_free_region,
 	.validate_add_page = generic_validate_add_page,
 	.have_wrcomb       = positive_have_wrcomb,
+	.num_var_ranges	   = common_num_var_ranges,
 };
 
 int __init cyrix_init_mtrr(void)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 0c0a455..b06febd 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -663,4 +663,5 @@ struct mtrr_ops generic_mtrr_ops = {
 	.set               = generic_set_mtrr,
 	.validate_add_page = generic_validate_add_page,
 	.have_wrcomb       = generic_have_wrcomb,
+	.num_var_ranges	   = common_num_var_ranges,
 };
diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
index 236a401..a66a54d 100644
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/main.c
@@ -99,7 +99,7 @@ static int have_wrcomb(void)
 }
 
 /*  This function returns the number of variable MTRRs  */
-static void __init set_num_var_ranges(void)
+int __init common_num_var_ranges(void)
 {
 	unsigned long config = 0, dummy;
 
@@ -109,7 +109,7 @@ static void __init set_num_var_ranges(void)
 		config = 2;
 	else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
 		config = 8;
-	num_var_ranges = config & 0xff;
+	return config & 0xff;
 }
 
 static void __init init_table(void)
@@ -1673,12 +1673,17 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
 void __init mtrr_bp_init(void)
 {
 	u32 phys_addr;
+
 	init_ifs();
 
 	phys_addr = 32;
 
 	if (cpu_has_mtrr) {
 		mtrr_if = &generic_mtrr_ops;
+#ifdef CONFIG_XEN_DOM0
+		xen_init_mtrr();
+#endif
+
 		size_or_mask = 0xff000000;	/* 36 bits */
 		size_and_mask = 0x00f00000;
 		phys_addr = 36;
@@ -1736,7 +1741,7 @@ void __init mtrr_bp_init(void)
 	}
 
 	if (mtrr_if) {
-		set_num_var_ranges();
+		num_var_ranges = mtrr_if->num_var_ranges();
 		init_table();
 		if (use_intel()) {
 			get_mtrr_state();
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
index ffd6040..eb23ca2 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -41,6 +41,8 @@ struct mtrr_ops {
 	int	(*validate_add_page)(unsigned long base, unsigned long size,
 				     unsigned int type);
 	int	(*have_wrcomb)(void);
+
+	int	(*num_var_ranges)(void);
 };
 
 extern int generic_get_free_region(unsigned long base, unsigned long size,
@@ -52,6 +54,8 @@ extern struct mtrr_ops generic_mtrr_ops;
 
 extern int positive_have_wrcomb(void);
 
+extern int __init common_num_var_ranges(void);
+
 /* library functions for processor-specific routines */
 struct set_mtrr_context {
 	unsigned long flags;
@@ -88,3 +92,4 @@ void mtrr_wrmsr(unsigned, unsigned, unsigned);
 int amd_init_mtrr(void);
 int cyrix_init_mtrr(void);
 int centaur_init_mtrr(void);
+void xen_init_mtrr(void);
diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
new file mode 100644
index 0000000..db3ef39
--- /dev/null
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -0,0 +1,59 @@
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/ctype.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <asm/uaccess.h>
+#include <linux/mutex.h>
+
+#include <asm/mtrr.h>
+#include "mtrr.h"
+
+#include <xen/interface/platform.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/xen/hypercall.h>
+
+static int __init xen_num_var_ranges(void);
+
+/* DOM0 TODO: Need to fill in the remaining mtrr methods to have full
+ * working userland mtrr support. */
+static struct mtrr_ops xen_mtrr_ops = {
+	.vendor            = X86_VENDOR_UNKNOWN,
+//	.set               = xen_set_mtrr,
+//	.get               = xen_get_mtrr,
+	.get_free_region   = generic_get_free_region,
+//	.validate_add_page = xen_validate_add_page,
+	.have_wrcomb       = positive_have_wrcomb,
+	.use_intel_if	   = 0,
+	.num_var_ranges	   = xen_num_var_ranges,
+};
+
+static int __init xen_num_var_ranges(void)
+{
+	int ranges;
+	struct xen_platform_op op;
+
+	for (ranges = 0; ; ranges++) {
+		op.cmd = XENPF_read_memtype;
+		op.u.read_memtype.reg = ranges;
+		if (HYPERVISOR_dom0_op(&op) != 0)
+			break;
+	}
+	return ranges;
+}
+
+void __init xen_init_mtrr(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+
+	if (!xen_initial_domain())
+		return;
+
+	if ((!cpu_has(c, X86_FEATURE_MTRR)) &&
+	    (!cpu_has(c, X86_FEATURE_K6_MTRR)) &&
+	    (!cpu_has(c, X86_FEATURE_CYRIX_ARR)) &&
+	    (!cpu_has(c, X86_FEATURE_CENTAUR_MCR)))
+		return;
+
+	mtrr_if = &xen_mtrr_ops;
+}
-- 
1.6.0.6


* [PATCH] xen: disable PAT
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59   ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Ian Campbell

From: Ian Campbell <ian.campbell@citrix.com>

Xen imposes a particular PAT layout on all paravirtual guests, which
does not match the layout Linux would like to use.

Force PAT to be disabled until this is resolved.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 arch/x86/include/asm/pat.h |    5 +++++
 arch/x86/mm/pat.c          |    5 -----
 arch/x86/xen/enlighten.c   |    3 +++
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index 9709fdf..d8be231 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -5,8 +5,13 @@
 
 #ifdef CONFIG_X86_PAT
 extern int pat_enabled;
+extern void pat_disable(const char *reason);
 #else
 static const int pat_enabled;
+static inline void pat_disable(const char *reason)
+{
+	(void)reason;
+}
 #endif
 
 extern void pat_init(void);
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 05f9aef..37df685 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -42,11 +42,6 @@ static int __init nopat(char *str)
 	return 0;
 }
 early_param("nopat", nopat);
-#else
-static inline void pat_disable(const char *reason)
-{
-	(void)reason;
-}
 #endif
 
 
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 468aa23..1b89d1c 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -48,6 +48,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <asm/reboot.h>
+#include <asm/pat.h>
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -986,6 +987,8 @@ asmlinkage void __init xen_start_kernel(void)
 		add_preferred_console("hvc", 0, NULL);
 	}
 
+	pat_disable("PAT disabled on Xen");
+
 	xen_raw_console_write("about to get started...\n");
 
 	/* Start the world */
-- 
1.6.0.6



* [PATCH] xen/dom0: use _PAGE_IOMAP in ioremap to do machine mappings
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (6 preceding siblings ...)
  2009-02-28  1:59   ` Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] paravirt/xen: add pvop for page_is_ram Jeremy Fitzhardinge
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge, Jeremy Fitzhardinge

In a Xen domain, ioremap operates on machine addresses, not
pseudo-physical addresses.  We use _PAGE_IOMAP to determine whether a
mapping is intended for machine addresses.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/include/asm/xen/page.h |    8 +---
 arch/x86/xen/enlighten.c        |    4 ++-
 arch/x86/xen/mmu.c              |   70 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 73 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 4bd990e..20c3872 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -112,13 +112,9 @@ static inline xpaddr_t machine_to_phys(xmaddr_t machine)
  */
 static inline unsigned long mfn_to_local_pfn(unsigned long mfn)
 {
-	extern unsigned long max_mapnr;
 	unsigned long pfn = mfn_to_pfn(mfn);
-	if ((pfn < max_mapnr)
-	    && !xen_feature(XENFEAT_auto_translated_physmap)
-	    && (get_phys_to_machine(pfn) != mfn))
-		return max_mapnr; /* force !pfn_valid() */
-	/* XXX fixme; not true with sparsemem */
+	if (get_phys_to_machine(pfn) != mfn)
+		return -1; /* force !pfn_valid() */
 	return pfn;
 }
 
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 1b89d1c..c12a3c8 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -938,7 +938,9 @@ asmlinkage void __init xen_start_kernel(void)
 
 	/* Prevent unwanted bits from being set in PTEs. */
 	__supported_pte_mask &= ~_PAGE_GLOBAL;
-	if (!xen_initial_domain())
+	if (xen_initial_domain())
+		__supported_pte_mask |= _PAGE_IOMAP;
+	else
 		__supported_pte_mask &= ~(_PAGE_PWT | _PAGE_PCD);
 
 	/* Don't do the full vcpu_info placement stuff until we have a
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 36125ea..6aa6d55 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -336,6 +336,28 @@ static bool xen_page_pinned(void *ptr)
 	return PagePinned(page);
 }
 
+static bool xen_iomap_pte(pte_t pte)
+{
+	return xen_initial_domain() && (pte_flags(pte) & _PAGE_IOMAP);
+}
+
+static void xen_set_iomap_pte(pte_t *ptep, pte_t pteval)
+{
+	struct multicall_space mcs;
+	struct mmu_update *u;
+
+	mcs = xen_mc_entry(sizeof(*u));
+	u = mcs.args;
+
+	/* ptep might be kmapped when using 32-bit HIGHPTE */
+	u->ptr = arbitrary_virt_to_machine(ptep).maddr;
+	u->val = pte_val_ma(pteval);
+
+	MULTI_mmu_update(mcs.mc, mcs.args, 1, NULL, DOMID_IO);
+
+	xen_mc_issue(PARAVIRT_LAZY_MMU);
+}
+
 static void xen_extend_mmu_update(const struct mmu_update *update)
 {
 	struct multicall_space mcs;
@@ -416,6 +438,11 @@ void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
 	if (mm == &init_mm)
 		preempt_disable();
 
+	if (xen_iomap_pte(pteval)) {
+		xen_set_iomap_pte(ptep, pteval);
+		goto out;
+	}
+
 	ADD_STATS(set_pte_at, 1);
 //	ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep));
 	ADD_STATS(set_pte_at_current, mm == current->mm);
@@ -488,8 +515,25 @@ static pteval_t pte_pfn_to_mfn(pteval_t val)
 	return val;
 }
 
+static pteval_t iomap_pte(pteval_t val)
+{
+	if (val & _PAGE_PRESENT) {
+		unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
+		pteval_t flags = val & PTE_FLAGS_MASK;
+
+		/* We assume the pte frame number is a MFN, so
+		   just use it as-is. */
+		val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
+	}
+
+	return val;
+}
+
 pteval_t xen_pte_val(pte_t pte)
 {
+	if (xen_initial_domain() && (pte.pte & _PAGE_IOMAP))
+		return pte.pte;
+
 	return pte_mfn_to_pfn(pte.pte);
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_pte_val);
@@ -502,7 +546,11 @@ PV_CALLEE_SAVE_REGS_THUNK(xen_pgd_val);
 
 pte_t xen_make_pte(pteval_t pte)
 {
-	pte = pte_pfn_to_mfn(pte);
+	if (unlikely(xen_initial_domain() && (pte & _PAGE_IOMAP)))
+		pte = iomap_pte(pte);
+	else
+		pte = pte_pfn_to_mfn(pte);
+
 	return native_make_pte(pte);
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_make_pte);
@@ -558,6 +606,11 @@ void xen_set_pud(pud_t *ptr, pud_t val)
 
 void xen_set_pte(pte_t *ptep, pte_t pte)
 {
+	if (xen_iomap_pte(pte)) {
+		xen_set_iomap_pte(ptep, pte);
+		return;
+	}
+
 	ADD_STATS(pte_update, 1);
 //	ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
 	ADD_STATS(pte_update_batched, paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU);
@@ -574,6 +627,11 @@ void xen_set_pte(pte_t *ptep, pte_t pte)
 #ifdef CONFIG_X86_PAE
 void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
+	if (xen_iomap_pte(pte)) {
+		xen_set_iomap_pte(ptep, pte);
+		return;
+	}
+
 	set_64bit((u64 *)ptep, native_pte_val(pte));
 }
 
@@ -1770,12 +1828,20 @@ static void xen_set_fixmap(unsigned idx, unsigned long phys, pgprot_t prot)
 #ifdef CONFIG_X86_LOCAL_APIC
 	case FIX_APIC_BASE:	/* maps dummy local APIC */
 #endif
+		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
 
-	default:
+	case FIX_PARAVIRT_BOOTMAP:
+		/* This is an MFN, but it isn't an IO mapping from the
+		   IO domain */
 		pte = mfn_pte(phys, prot);
 		break;
+
+	default:
+		/* By default, set_fixmap is used for hardware mappings */
+		pte = mfn_pte(phys, __pgprot(pgprot_val(prot) | _PAGE_IOMAP));
+		break;
 	}
 
 	__native_set_fixmap(idx, pte);
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH] paravirt/xen: add pvop for page_is_ram
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (7 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen/dom0: use _PAGE_IOMAP in ioremap to do machine mappings Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-03-10  1:07   ` H. Peter Anvin
  2009-02-28  1:59 ` [PATCH] xen/dom0: Use host E820 map Jeremy Fitzhardinge
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

A guest domain may have external pages mapped into its address space
in order to share memory with other domains.  These shared pages are
more akin to I/O mappings than real RAM, and should not pass the
page_is_ram test.  Add a paravirt op for this so that a hypervisor
backend can decide whether a page should be considered RAM or not.
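In isolation, the indirection this patch adds works like a plain function-pointer table. A minimal sketch follows, with hypothetical sk_-prefixed names and mock backends; the real kernel routes the call through the PVOP_CALL1 macro and the pv_mmu_ops structure, not a bare pointer:

```c
#include <assert.h>

/* Hypothetical mini pv_mmu_ops: one function pointer per operation. */
struct sk_pv_mmu_ops {
	int (*page_is_ram)(unsigned long pfn);
};

/* Mock native backend: pretend the first 1024 frames are RAM. */
static int sk_native_page_is_ram(unsigned long pfn)
{
	return pfn < 1024;
}

/* Mock pfn->mfn translation: pfn 7 stands in for a granted page,
   which has no identity pfn<->mfn mapping. */
static unsigned long sk_pfn_to_mfn(unsigned long pfn)
{
	return pfn == 7 ? 0xdeadbeef : pfn;
}

/* Xen-style backend: reject pages without an identity mapping,
   then fall back to the native check. */
static int sk_xen_page_is_ram(unsigned long pfn)
{
	if (sk_pfn_to_mfn(pfn) != pfn)
		return 0;
	return sk_native_page_is_ram(pfn);
}

static struct sk_pv_mmu_ops sk_pv_mmu_ops = {
	.page_is_ram = sk_native_page_is_ram,
};

/* What callers see: a single page_is_ram() dispatching through the ops. */
static int sk_page_is_ram(unsigned long pfn)
{
	return sk_pv_mmu_ops.page_is_ram(pfn);
}
```

Swapping the .page_is_ram member is the moral equivalent of what the Xen backend does when it installs xen_mmu_ops at boot.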

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

Conflicts:

	arch/x86/include/asm/page.h
---
 arch/x86/include/asm/page.h     |    9 ++++++++-
 arch/x86/include/asm/paravirt.h |    7 +++++++
 arch/x86/kernel/paravirt.c      |    1 +
 arch/x86/mm/ioremap.c           |    2 +-
 arch/x86/xen/mmu.c              |   11 +++++++++++
 5 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 05f2da7..719b9aa 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -56,7 +56,14 @@
 typedef struct { pgdval_t pgd; } pgd_t;
 typedef struct { pgprotval_t pgprot; } pgprot_t;
 
-extern int page_is_ram(unsigned long pagenr);
+extern int native_page_is_ram(unsigned long pagenr);
+#ifndef CONFIG_PARAVIRT
+static inline int page_is_ram(unsigned long pagenr)
+{
+	return native_page_is_ram(pagenr);
+}
+#endif
+
 extern int devmem_is_allowed(unsigned long pagenr);
 extern void map_devmem(unsigned long pfn, unsigned long size,
 		       pgprot_t vma_prot);
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b788dfd..d07eea5 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -350,6 +350,8 @@ struct pv_mmu_ops {
 	   an mfn.  We can tell which is which from the index. */
 	void (*set_fixmap)(unsigned /* enum fixed_addresses */ idx,
 			   unsigned long phys, pgprot_t flags);
+
+	int (*page_is_ram)(unsigned long pfn);
 };
 
 struct raw_spinlock;
@@ -1452,6 +1454,11 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 	pv_mmu_ops.set_fixmap(idx, phys, flags);
 }
 
+static inline int page_is_ram(unsigned long pfn)
+{
+	return PVOP_CALL1(int, pv_mmu_ops.page_is_ram, pfn);
+}
+
 void _paravirt_nop(void);
 u32 _paravirt_ident_32(u32);
 u64 _paravirt_ident_64(u64);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 6dc4dca..62e00cc 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -504,6 +504,7 @@ struct pv_mmu_ops pv_mmu_ops = {
 	},
 
 	.set_fixmap = native_set_fixmap,
+	.page_is_ram = native_page_is_ram,
 };
 
 EXPORT_SYMBOL_GPL(pv_time_ops);
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 433f7bd..28ac8d0 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -97,7 +97,7 @@ EXPORT_SYMBOL(__virt_addr_valid);
 
 #endif
 
-int page_is_ram(unsigned long pagenr)
+int native_page_is_ram(unsigned long pagenr)
 {
 	resource_size_t addr, end;
 	int i;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 6aa6d55..f0d8190 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1856,6 +1856,16 @@ static void xen_set_fixmap(unsigned idx, unsigned long phys, pgprot_t prot)
 #endif
 }
 
+static int xen_page_is_ram(unsigned long pfn)
+{
+	/* Granted pages are not RAM.  They will not have a proper
+	   identity pfn<->mfn translation. */
+	if (mfn_to_local_pfn(pfn_to_mfn(pfn)) != pfn)
+		return 0;
+
+	return native_page_is_ram(pfn);
+}
+
 __init void xen_ident_map_ISA(void)
 {
 	unsigned long pa;
@@ -1984,6 +1994,7 @@ const struct pv_mmu_ops xen_mmu_ops __initdata = {
 	},
 
 	.set_fixmap = xen_set_fixmap,
+	.page_is_ram = xen_page_is_ram,
 };
 
 
-- 
1.6.0.6



* [PATCH] xen/dom0: Use host E820 map
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (8 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] paravirt/xen: add pvop for page_is_ram Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen: implement XENMEM_machphys_mapping Jeremy Fitzhardinge
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Ian Campbell, Jeremy Fitzhardinge

From: Ian Campbell <ian.campbell@citrix.com>

Unlike the non-paravirt Xen port, we do not have distinct pseudo-physical
and I/O memory resource spaces, and therefore resources in the two
can clash. Fix this by registering a memory map which matches the
underlying I/O map. Currently this wastes the memory in the reserved
regions; eventually we should remap this memory to the end of the
address space.
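The clamping that xen_memory_setup() performs on the returned map can be sketched on its own. This reproduces only the truncate-RAM-to-mem_end step, with hypothetical sk_-prefixed names (SK_E820_RAM mirrors E820_RAM):

```c
#include <assert.h>

#define SK_E820_RAM 1	/* stands in for E820_RAM */

struct sk_e820entry {
	unsigned long long addr, size;
	int type;
};

/* Clamp RAM entries to mem_end, leaving non-RAM (reserved/ACPI) entries
   intact so they continue to match the host I/O map.  Returns how many
   entries survive with a non-zero size. */
static int sk_clamp_e820(struct sk_e820entry *map, int n,
			 unsigned long long mem_end)
{
	int i, kept = 0;

	for (i = 0; i < n; i++) {
		unsigned long long end = map[i].addr + map[i].size;

		if (map[i].type == SK_E820_RAM) {
			if (map[i].addr > mem_end) {
				map[i].size = 0;	/* entirely above the limit */
				continue;
			}
			if (end > mem_end)
				map[i].size -= end - mem_end;	/* truncate */
		}
		if (map[i].size > 0)
			kept++;
	}
	return kept;
}
```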

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/setup.c           |   45 +++++++++++++++++++++++++++++++++++++--
 include/xen/interface/memory.h |   29 +++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 3e4cf46..175396c 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -19,6 +19,7 @@
 
 #include <xen/page.h>
 #include <xen/interface/callback.h>
+#include <xen/interface/memory.h>
 #include <xen/interface/physdev.h>
 #include <xen/features.h>
 
@@ -36,16 +37,54 @@ extern void xen_syscall32_target(void);
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
-
 char * __init xen_memory_setup(void)
 {
 	unsigned long max_pfn = xen_start_info->nr_pages;
+	unsigned long long mem_end;
+	int rc;
+	struct xen_memory_map memmap;
+	/*
+	 * This is rather large for a stack variable but this early in
+	 * the boot process we know we have plenty of slack space.
+	 */
+	struct e820entry map[E820MAX];
+	int op = xen_initial_domain() ?
+		XENMEM_machine_memory_map :
+		XENMEM_memory_map;
+	int i;
 
 	max_pfn = min(MAX_DOMAIN_PAGES, max_pfn);
+	mem_end = PFN_PHYS((u64)max_pfn);
+
+	memmap.nr_entries = E820MAX;
+	set_xen_guest_handle(memmap.buffer, map);
+
+	rc = HYPERVISOR_memory_op(op, &memmap);
+	if (rc == -ENOSYS) {
+		memmap.nr_entries = 1;
+		map[0].addr = 0ULL;
+		map[0].size = mem_end;
+		/* 8MB slack (to balance backend allocations). */
+		map[0].size += 8ULL << 20;
+		map[0].type = E820_RAM;
+		rc = 0;
+	}
+	BUG_ON(rc);
 
 	e820.nr_map = 0;
-
-	e820_add_region(0, PFN_PHYS((u64)max_pfn), E820_RAM);
+	for (i = 0; i < memmap.nr_entries; i++) {
+		unsigned long long end = map[i].addr + map[i].size;
+		if (map[i].type == E820_RAM) {
+			if (map[i].addr > mem_end)
+				continue;
+			if (end > mem_end) {
+				/* Truncate region to max_mem. */
+				map[i].size -= end - mem_end;
+			}
+		}
+		if (map[i].size > 0)
+			e820_add_region(map[i].addr, map[i].size, map[i].type);
+	}
 
 	/*
 	 * Even though this is normal, usable memory under Xen, reserve
diff --git a/include/xen/interface/memory.h b/include/xen/interface/memory.h
index af36ead..e6c6bcb 100644
--- a/include/xen/interface/memory.h
+++ b/include/xen/interface/memory.h
@@ -142,4 +142,33 @@ struct xen_translate_gpfn_list {
 };
 DEFINE_GUEST_HANDLE_STRUCT(xen_translate_gpfn_list);
 
+/*
+ * Returns the pseudo-physical memory map as it was when the domain
+ * was started (specified by XENMEM_set_memory_map).
+ * arg == addr of struct xen_memory_map.
+ */
+#define XENMEM_memory_map           9
+struct xen_memory_map {
+    /*
+     * On call the number of entries which can be stored in buffer. On
+     * return the number of entries which have been stored in
+     * buffer.
+     */
+    unsigned int nr_entries;
+
+    /*
+     * Entries in the buffer are in the same format as returned by the
+     * BIOS INT 0x15 EAX=0xE820 call.
+     */
+    GUEST_HANDLE(void) buffer;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_memory_map);
+
+/*
+ * Returns the real physical memory map. Passes the same structure as
+ * XENMEM_memory_map.
+ * arg == addr of struct xen_memory_map.
+ */
+#define XENMEM_machine_memory_map   10
+
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
-- 
1.6.0.6



* [PATCH] xen: implement XENMEM_machphys_mapping
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (9 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen/dom0: Use host E820 map Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen: clear reserved bits in l3 entries given in the initial pagetables Jeremy Fitzhardinge
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Ian Campbell, Jeremy Fitzhardinge

From: Ian Campbell <ian.campbell@citrix.com>

This hypercall allows Xen to specify a non-default location for the
machine-to-physical mapping table. This capability is used when running
a 32-bit domain 0 on a 64-bit hypervisor, to shrink the hypervisor hole
to exactly the size required.
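The patch below sizes the lookup bound with fls(nr_ents - 1), rounding the table up to a power of two so that a single shift (mfn >> machine_to_phys_order) can reject out-of-range MFNs. A sketch under those assumptions, with hypothetical sk_-prefixed names and a portable stand-in for the kernel's fls():

```c
#include <assert.h>

/* Portable fls(): 1-based index of the highest set bit; sk_fls(0) == 0.
   Mirrors the kernel helper used by xen_setup_machphys_mapping(). */
static int sk_fls(unsigned long x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Order chosen so that any mfn with (mfn >> order) != 0 lies outside
   the machine-to-physical table and must be treated as invalid. */
static int sk_machphys_order(unsigned long nr_entries)
{
	return sk_fls(nr_entries - 1);
}
```

This is why mfn_to_pfn() in the same series can use `(mfn >> machine_to_phys_order) != 0` as its bounds check.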

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/include/asm/xen/interface.h    |    6 +++---
 arch/x86/include/asm/xen/interface_32.h |    5 +++++
 arch/x86/include/asm/xen/interface_64.h |   13 +------------
 arch/x86/include/asm/xen/page.h         |    7 ++++---
 arch/x86/xen/enlighten.c                |    7 +++++++
 arch/x86/xen/mmu.c                      |   15 +++++++++++++++
 include/xen/interface/memory.h          |   13 +++++++++++++
 7 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index e8506c1..1c10c88 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -61,9 +61,9 @@ DEFINE_GUEST_HANDLE(void);
 #define HYPERVISOR_VIRT_START mk_unsigned_long(__HYPERVISOR_VIRT_START)
 #endif
 
-#ifndef machine_to_phys_mapping
-#define machine_to_phys_mapping ((unsigned long *)HYPERVISOR_VIRT_START)
-#endif
+#define MACH2PHYS_VIRT_START  mk_unsigned_long(__MACH2PHYS_VIRT_START)
+#define MACH2PHYS_VIRT_END    mk_unsigned_long(__MACH2PHYS_VIRT_END)
+#define MACH2PHYS_NR_ENTRIES  ((MACH2PHYS_VIRT_END-MACH2PHYS_VIRT_START)>>__MACH2PHYS_SHIFT)
 
 /* Maximum number of virtual CPUs in multi-processor guests. */
 #define MAX_VIRT_CPUS 32
diff --git a/arch/x86/include/asm/xen/interface_32.h b/arch/x86/include/asm/xen/interface_32.h
index 42a7e00..8413688 100644
--- a/arch/x86/include/asm/xen/interface_32.h
+++ b/arch/x86/include/asm/xen/interface_32.h
@@ -32,6 +32,11 @@
 /* And the trap vector is... */
 #define TRAP_INSTR "int $0x82"
 
+#define __MACH2PHYS_VIRT_START 0xF5800000
+#define __MACH2PHYS_VIRT_END   0xF6800000
+
+#define __MACH2PHYS_SHIFT      2
+
 /*
  * Virtual addresses beyond this are not modifiable by guest OSes. The
  * machine->physical mapping table starts at this address, read-only.
diff --git a/arch/x86/include/asm/xen/interface_64.h b/arch/x86/include/asm/xen/interface_64.h
index 100d266..839a481 100644
--- a/arch/x86/include/asm/xen/interface_64.h
+++ b/arch/x86/include/asm/xen/interface_64.h
@@ -39,18 +39,7 @@
 #define __HYPERVISOR_VIRT_END   0xFFFF880000000000
 #define __MACH2PHYS_VIRT_START  0xFFFF800000000000
 #define __MACH2PHYS_VIRT_END    0xFFFF804000000000
-
-#ifndef HYPERVISOR_VIRT_START
-#define HYPERVISOR_VIRT_START mk_unsigned_long(__HYPERVISOR_VIRT_START)
-#define HYPERVISOR_VIRT_END   mk_unsigned_long(__HYPERVISOR_VIRT_END)
-#endif
-
-#define MACH2PHYS_VIRT_START  mk_unsigned_long(__MACH2PHYS_VIRT_START)
-#define MACH2PHYS_VIRT_END    mk_unsigned_long(__MACH2PHYS_VIRT_END)
-#define MACH2PHYS_NR_ENTRIES  ((MACH2PHYS_VIRT_END-MACH2PHYS_VIRT_START)>>3)
-#ifndef machine_to_phys_mapping
-#define machine_to_phys_mapping ((unsigned long *)HYPERVISOR_VIRT_START)
-#endif
+#define __MACH2PHYS_SHIFT       3
 
 /*
  * int HYPERVISOR_set_segment_base(unsigned int which, unsigned long base)
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 20c3872..95a3122 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/spinlock.h>
 #include <linux/pfn.h>
+#include <linux/mm.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -35,6 +36,8 @@ typedef struct xpaddr {
 #define MAX_DOMAIN_PAGES						\
     ((unsigned long)((u64)CONFIG_XEN_MAX_DOMAIN_MEMORY * 1024 * 1024 * 1024 / PAGE_SIZE))
 
+extern unsigned long *machine_to_phys_mapping;
+extern unsigned int   machine_to_phys_order;
 
 extern unsigned long get_phys_to_machine(unsigned long pfn);
 extern void set_phys_to_machine(unsigned long pfn, unsigned long mfn);
@@ -62,10 +65,8 @@ static inline unsigned long mfn_to_pfn(unsigned long mfn)
 	if (xen_feature(XENFEAT_auto_translated_physmap))
 		return mfn;
 
-#if 0
 	if (unlikely((mfn >> machine_to_phys_order) != 0))
-		return max_mapnr;
-#endif
+		return ~0;
 
 	pfn = 0;
 	/*
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index c12a3c8..62d229a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -62,6 +62,11 @@ DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
 enum xen_domain_type xen_domain_type = XEN_NATIVE;
 EXPORT_SYMBOL_GPL(xen_domain_type);
 
+unsigned long *machine_to_phys_mapping = (void *)MACH2PHYS_VIRT_START;
+EXPORT_SYMBOL(machine_to_phys_mapping);
+unsigned int   machine_to_phys_order;
+EXPORT_SYMBOL(machine_to_phys_order);
+
 struct start_info *xen_start_info;
 EXPORT_SYMBOL_GPL(xen_start_info);
 
@@ -890,6 +895,8 @@ asmlinkage void __init xen_start_kernel(void)
 
 	xen_setup_features();
 
+	xen_setup_machphys_mapping();
+
 	/* Install Xen paravirt ops */
 	pv_info = xen_info;
 	pv_init_ops = xen_init_ops;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f0d8190..367a7d2 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -57,6 +57,7 @@
 #include <xen/page.h>
 #include <xen/interface/xen.h>
 #include <xen/interface/version.h>
+#include <xen/interface/memory.h>
 #include <xen/hvc-console.h>
 
 #include "multicalls.h"
@@ -1866,6 +1867,20 @@ static int xen_page_is_ram(unsigned long pfn)
 	return native_page_is_ram(pfn);
 }
 
+__init void xen_setup_machphys_mapping(void)
+{
+	struct xen_machphys_mapping mapping;
+	unsigned long machine_to_phys_nr_ents;
+
+	if (HYPERVISOR_memory_op(XENMEM_machphys_mapping, &mapping) == 0) {
+		machine_to_phys_mapping = (unsigned long *)mapping.v_start;
+		machine_to_phys_nr_ents = mapping.max_mfn + 1;
+	} else {
+		machine_to_phys_nr_ents = MACH2PHYS_NR_ENTRIES;
+	}
+	machine_to_phys_order = fls(machine_to_phys_nr_ents - 1);
+}
+
 __init void xen_ident_map_ISA(void)
 {
 	unsigned long pa;
diff --git a/include/xen/interface/memory.h b/include/xen/interface/memory.h
index e6c6bcb..f548f7c 100644
--- a/include/xen/interface/memory.h
+++ b/include/xen/interface/memory.h
@@ -97,6 +97,19 @@ struct xen_machphys_mfn_list {
 DEFINE_GUEST_HANDLE_STRUCT(xen_machphys_mfn_list);
 
 /*
+ * Returns the location in virtual address space of the machine_to_phys
+ * mapping table. Architectures which do not have an m2p table, or which do not
+ * map it by default into guest address space, do not implement this command.
+ * arg == addr of xen_machphys_mapping_t.
+ */
+#define XENMEM_machphys_mapping     12
+struct xen_machphys_mapping {
+    unsigned long v_start, v_end; /* Start and end virtual addresses.   */
+    unsigned long max_mfn;        /* Maximum MFN that can be looked up. */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_machphys_mapping);
+
+/*
  * Sets the GPFN at which a particular page appears in the specified guest's
  * pseudophysical address space.
  * arg == addr of xen_add_to_physmap_t.
-- 
1.6.0.6



* [PATCH] xen: clear reserved bits in l3 entries given in the initial pagetables
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (10 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen: implement XENMEM_machphys_mapping Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Ian Campbell, Jeremy Fitzhardinge

From: Ian Campbell <ian.campbell@citrix.com>

In native PAE, the only flag that may be legitimately set in an L3
entry is Present.  When Xen grafts the top-level PAE L3 pagetable
entries into the L4 pagetable, it must also set the other permissions
flags so that the mapped pages are actually accessible.

However, due to a bug in the hypervisor, it validates updates to the L3
entries as formal PAE entries, so it will refuse to validate these
entries with the extra bits required for 4-level pagetables.

This patch simply masks the entries back to the bare PAE level,
leaving Xen to add whatever bits it feels are necessary.
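The masking amounts to one AND per pgd entry. A sketch with illustrative constants (the sk_ flag values mirror the real low flag bits, but the PFN mask is simplified to a 4KB-page 32-bit view here):

```c
#include <assert.h>

/* Illustrative x86 PTE bits (sk_-prefixed; low flags match the real
   headers, SK_PTE_PFN_MASK is a simplified frame-number mask). */
#define SK_PAGE_PRESENT		0x001UL
#define SK_PAGE_RW		0x002UL
#define SK_PAGE_ACCESSED	0x020UL
#define SK_PAGE_DIRTY		0x040UL
#define SK_PTE_PFN_MASK		(~0xfffUL)

/* Strip everything but the frame number and Present, as the patch does
   for each swapper_pg_dir entry before the pagetable is pinned. */
static unsigned long sk_mask_l3_entry(unsigned long pgd)
{
	return pgd & (SK_PTE_PFN_MASK | SK_PAGE_PRESENT);
}
```

Xen then re-adds whatever permission bits the 64-bit layout needs when it grafts the L3 into its L4.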

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/mmu.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 367a7d2..5f034a1 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1778,6 +1778,7 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
 					 unsigned long max_pfn)
 {
 	pmd_t *kernel_pmd;
+	int i;
 
 	init_pg_tables_start = __pa(pgd);
 	init_pg_tables_end = __pa(pgd) + xen_start_info->nr_pt_frames*PAGE_SIZE;
@@ -1789,6 +1790,20 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
 	xen_map_identity_early(level2_kernel_pgt, max_pfn);
 
 	memcpy(swapper_pg_dir, pgd, sizeof(pgd_t) * PTRS_PER_PGD);
+
+	/*
+	 * When running a 32 bit domain 0 on a 64 bit hypervisor a
+	 * pinned L3 (such as the initial pgd here) contains bits
+	 * which are reserved in the PAE layout but not in the 64 bit
+	 * layout. Unfortunately some versions of the hypervisor
+	 * (incorrectly) validate compat mode guests against the PAE
+	 * layout and hence will not allow such a pagetable to be
+	 * pinned by the guest. Therefore we mask off only the PFN and
+	 * Present bits of the supplied L3.
+	 */
+	for (i = 0; i < PTRS_PER_PGD; i++)
+		swapper_pg_dir[i].pgd &= (PTE_PFN_MASK | _PAGE_PRESENT);
+
 	set_pgd(&swapper_pg_dir[KERNEL_PGD_BOUNDARY],
 			__pgd(__pa(level2_kernel_pgt) | _PAGE_PRESENT));
 
-- 
1.6.0.6



* [PATCH] xen/dom0: add XEN_DOM0 config option
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59   ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge, Jeremy Fitzhardinge

Allow dom0 to be configured.  Requires more patches to do something useful.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/Kconfig     |   26 ++++++++++++++++++++++++++
 arch/x86/xen/enlighten.c |    5 +++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 87b9ab1..e5c141a 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -36,3 +36,29 @@ config XEN_DEBUG_FS
 	help
 	  Enable statistics output and various tuning options in debugfs.
 	  Enabling this option may incur a significant performance overhead.
+
+config XEN_DOM0
+	bool "Enable Xen privileged domain support"
+	depends on XEN && X86_IO_APIC && ACPI
+	help
+	  The Xen hypervisor requires a privileged domain ("dom0") to
+	  actually manage the machine, provide device drivers, etc.
+	  This option enables dom0 support.  A dom0 kernel can also
+	  run as an unprivileged domU kernel, or natively on bare
+	  hardware.
+
+# Dummy symbol since people have come to rely on the PRIVILEGED_GUEST
+# name in tools.
+config XEN_PRIVILEGED_GUEST
+	def_bool XEN_DOM0
+
+config XEN_PCI_PASSTHROUGH
+       bool #"Enable support for Xen PCI passthrough devices"
+       depends on XEN && PCI
+       help
+         Enable support for passing PCI devices through to
+	 unprivileged domains. (COMPLETELY UNTESTED)
+
+config XEN_DOM0_PCI
+       def_bool y
+       depends on XEN_DOM0 && PCI
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 62d229a..676aaf8 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -169,9 +169,10 @@ static void __init xen_banner(void)
 
 	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
 	       pv_info.name);
-	printk(KERN_INFO "Xen version: %d.%d%s%s\n",
+	printk(KERN_INFO "Xen version: %d.%d%s%s%s\n",
 	       version >> 16, version & 0xffff, extra.extraversion,
-	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
+	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "",
+	       xen_initial_domain() ? " (dom0)" : "");
 }
 
 static void xen_cpuid(unsigned int *ax, unsigned int *bx,
-- 
1.6.0.6




* [PATCH] xen: allow enable use of VGA console on dom0
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (12 preceding siblings ...)
  2009-02-28  1:59   ` Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen mtrr: Use specific cpu_has_foo macros instead of generic cpu_has() Jeremy Fitzhardinge
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge, Jeremy Fitzhardinge

If we're booting a privileged domain, then set up the VGA console for use.
Xen provides us with all the information about the current VGA state, with
no need to use the BIOS.  We use that information to populate screen_info,
which allows the rest of the kernel to carry on as normal.
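The vga.c code below never reads a field unless the size Xen passed proves the field was actually written, using offsetof() + sizeof() guards; that keeps the parser safe against older hypervisors that provide a shorter dom0_vga_console_info. The pattern in isolation, with a hypothetical two-version struct:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical versioned info struct: v1 producers end after `height`,
   v2 producers also write `caps`. */
struct sk_vga_info {
	unsigned short width, height;
	unsigned int caps;	/* only present from v2 onwards */
};

struct sk_screen {
	unsigned short width, height;
	unsigned int caps;
};

/* Only read a field when `size` proves the producer actually wrote it. */
static void sk_parse_vga(struct sk_screen *out,
			 const struct sk_vga_info *info, size_t size)
{
	memset(out, 0, sizeof(*out));

	if (size >= offsetof(struct sk_vga_info, height)
		    + sizeof(info->height)) {
		out->width = info->width;
		out->height = info->height;
	}
	if (size >= offsetof(struct sk_vga_info, caps)
		    + sizeof(info->caps))
		out->caps = info->caps;
}
```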

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/xen/Makefile       |    3 +-
 arch/x86/xen/enlighten.c    |   10 ++++++
 arch/x86/xen/vga.c          |   65 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h      |   11 +++++++
 include/xen/interface/xen.h |   39 +++++++++++++++++++++++++
 5 files changed, 127 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/xen/vga.c

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..c4cda96 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -10,4 +10,5 @@ obj-y		:= enlighten.o setup.o multicalls.o mmu.o irq.o \
 			grant-table.o suspend.o
 
 obj-$(CONFIG_SMP)		+= smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
\ No newline at end of file
+obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
+obj-$(CONFIG_XEN_DOM0)		+= vga.o
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 676aaf8..639eeb1 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -995,6 +995,16 @@ asmlinkage void __init xen_start_kernel(void)
 		add_preferred_console("xenboot", 0, NULL);
 		add_preferred_console("tty", 0, NULL);
 		add_preferred_console("hvc", 0, NULL);
+
+		boot_params.screen_info.orig_video_isVGA = 0;
+	} else {
+		const struct dom0_vga_console_info *info =
+			(void *)((char *)xen_start_info +
+			         xen_start_info->console.dom0.info_off);
+
+		xen_init_vga(info, xen_start_info->console.dom0.info_size);
+		xen_start_info->console.domU.mfn = 0;
+		xen_start_info->console.domU.evtchn = 0;
 	}
 
 	pat_disable("PAT disabled on Xen");
diff --git a/arch/x86/xen/vga.c b/arch/x86/xen/vga.c
new file mode 100644
index 0000000..f4a038a
--- /dev/null
+++ b/arch/x86/xen/vga.c
@@ -0,0 +1,65 @@
+#include <linux/screen_info.h>
+#include <linux/init.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+#include "xen-ops.h"
+
+void __init xen_init_vga(const struct dom0_vga_console_info *info, size_t size)
+{
+	struct screen_info *screen_info = &boot_params.screen_info;
+
+	/* This is drawn from a dump from vgacon:startup in
+	 * standard Linux. */
+	screen_info->orig_video_mode = 3;
+	screen_info->orig_video_isVGA = 1;
+	screen_info->orig_video_lines = 25;
+	screen_info->orig_video_cols = 80;
+	screen_info->orig_video_ega_bx = 3;
+	screen_info->orig_video_points = 16;
+	screen_info->orig_y = screen_info->orig_video_lines - 1;
+
+	switch (info->video_type) {
+	case XEN_VGATYPE_TEXT_MODE_3:
+		if (size < offsetof(struct dom0_vga_console_info, u.text_mode_3)
+		           + sizeof(info->u.text_mode_3))
+			break;
+		screen_info->orig_video_lines = info->u.text_mode_3.rows;
+		screen_info->orig_video_cols = info->u.text_mode_3.columns;
+		screen_info->orig_x = info->u.text_mode_3.cursor_x;
+		screen_info->orig_y = info->u.text_mode_3.cursor_y;
+		screen_info->orig_video_points =
+			info->u.text_mode_3.font_height;
+		break;
+
+	case XEN_VGATYPE_VESA_LFB:
+		if (size < offsetof(struct dom0_vga_console_info,
+		                    u.vesa_lfb.gbl_caps))
+			break;
+		screen_info->orig_video_isVGA = VIDEO_TYPE_VLFB;
+		screen_info->lfb_width = info->u.vesa_lfb.width;
+		screen_info->lfb_height = info->u.vesa_lfb.height;
+		screen_info->lfb_depth = info->u.vesa_lfb.bits_per_pixel;
+		screen_info->lfb_base = info->u.vesa_lfb.lfb_base;
+		screen_info->lfb_size = info->u.vesa_lfb.lfb_size;
+		screen_info->lfb_linelength = info->u.vesa_lfb.bytes_per_line;
+		screen_info->red_size = info->u.vesa_lfb.red_size;
+		screen_info->red_pos = info->u.vesa_lfb.red_pos;
+		screen_info->green_size = info->u.vesa_lfb.green_size;
+		screen_info->green_pos = info->u.vesa_lfb.green_pos;
+		screen_info->blue_size = info->u.vesa_lfb.blue_size;
+		screen_info->blue_pos = info->u.vesa_lfb.blue_pos;
+		screen_info->rsvd_size = info->u.vesa_lfb.rsvd_size;
+		screen_info->rsvd_pos = info->u.vesa_lfb.rsvd_pos;
+		if (size >= offsetof(struct dom0_vga_console_info,
+		                     u.vesa_lfb.gbl_caps)
+		            + sizeof(info->u.vesa_lfb.gbl_caps))
+			screen_info->capabilities = info->u.vesa_lfb.gbl_caps;
+		if (size >= offsetof(struct dom0_vga_console_info,
+		                     u.vesa_lfb.mode_attrs)
+		            + sizeof(info->u.vesa_lfb.mode_attrs))
+			screen_info->vesa_attributes = info->u.vesa_lfb.mode_attrs;
+		break;
+	}
+}
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 33f7538..414236b 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -4,6 +4,8 @@
 #include <linux/init.h>
 #include <linux/clocksource.h>
 #include <linux/irqreturn.h>
+
+#include <xen/interface/xen.h>
 #include <xen/xen-ops.h>
 
 /* These are code, but not functions.  Defined in entry.S */
@@ -75,6 +77,15 @@ static inline void xen_smp_init(void) {}
 #endif
 
 
+#ifdef CONFIG_XEN_DOM0
+void xen_init_vga(const struct dom0_vga_console_info *, size_t size);
+#else
+static inline void xen_init_vga(const struct dom0_vga_console_info *info,
+				size_t size)
+{
+}
+#endif
+
 /* Declare an asm function, along with symbols needed to make it
    inlineable */
 #define DECL_ASM(ret, name, ...)		\
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 18b5599..6c0af21 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -449,6 +449,45 @@ struct start_info {
 	int8_t cmd_line[MAX_GUEST_CMDLINE];
 };
 
+struct dom0_vga_console_info {
+    uint8_t video_type; /* DOM0_VGA_CONSOLE_??? */
+#define XEN_VGATYPE_TEXT_MODE_3 0x03
+#define XEN_VGATYPE_VESA_LFB    0x23
+
+    union {
+        struct {
+            /* Font height, in pixels. */
+            uint16_t font_height;
+            /* Cursor location (column, row). */
+            uint16_t cursor_x, cursor_y;
+            /* Number of rows and columns (dimensions in characters). */
+            uint16_t rows, columns;
+        } text_mode_3;
+
+        struct {
+            /* Width and height, in pixels. */
+            uint16_t width, height;
+            /* Bytes per scan line. */
+            uint16_t bytes_per_line;
+            /* Bits per pixel. */
+            uint16_t bits_per_pixel;
+            /* LFB physical address, and size (in units of 64kB). */
+            uint32_t lfb_base;
+            uint32_t lfb_size;
+            /* RGB mask offsets and sizes, as defined by VBE 1.2+ */
+            uint8_t  red_pos, red_size;
+            uint8_t  green_pos, green_size;
+            uint8_t  blue_pos, blue_size;
+            uint8_t  rsvd_pos, rsvd_size;
+
+            /* VESA capabilities (offset 0xa, VESA command 0x4f00). */
+            uint32_t gbl_caps;
+            /* Mode attributes (offset 0x0, VESA command 0x4f01). */
+            uint16_t mode_attrs;
+        } vesa_lfb;
+    } u;
+};
+
 /* These flags are passed in the 'flags' field of start_info_t. */
 #define SIF_PRIVILEGED    (1<<0)  /* Is the domain privileged? */
 #define SIF_INITDOMAIN    (1<<1)  /* Is this the initial control domain? */
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread
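[Editor's note: the xen_init_vga() hunk above only copies a field out of dom0_vga_console_info when the size passed by the hypervisor covers that field, so older hypervisors that pass a shorter structure still work. A hypothetical, cut-down C sketch of that size-gating pattern (the demo_* names are invented for illustration and are not the real Xen interface):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy stand-in for dom0_vga_console_info: older callers pass a shorter
 * structure, so a field is only trusted when the reported size covers
 * it entirely.  Invented names, not the real Xen ABI.
 */
struct demo_info {
	uint16_t width, height;	/* present in all versions */
	uint32_t gbl_caps;	/* added in a later version */
};

/* True iff 'field' lies entirely within the first 'size' bytes. */
#define FIELD_PRESENT(type, field, size) \
	((size) >= offsetof(type, field) + sizeof(((type *)0)->field))

static uint32_t demo_read_caps(const struct demo_info *info, size_t size)
{
	uint32_t caps = 0;	/* sensible default when the field is absent */

	if (FIELD_PRESENT(struct demo_info, gbl_caps, size))
		caps = info->gbl_caps;
	return caps;
}
```

The same offsetof-plus-sizeof check appears once per optional field in the real patch, which is how the interface can grow without breaking old hypervisor/kernel combinations.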

* [PATCH] xen mtrr: Use specific cpu_has_foo macros instead of generic cpu_has()
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (13 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen: allow enable use of VGA console on dom0 Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Mark McLoughlin

From: Mark McLoughlin <markmc@redhat.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 arch/x86/kernel/cpu/mtrr/xen.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
index db3ef39..e03532c 100644
--- a/arch/x86/kernel/cpu/mtrr/xen.c
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -44,15 +44,13 @@ static int __init xen_num_var_ranges(void)
 
 void __init xen_init_mtrr(void)
 {
-	struct cpuinfo_x86 *c = &boot_cpu_data;
-
 	if (!xen_initial_domain())
 		return;
 
-	if ((!cpu_has(c, X86_FEATURE_MTRR)) &&
-	    (!cpu_has(c, X86_FEATURE_K6_MTRR)) &&
-	    (!cpu_has(c, X86_FEATURE_CYRIX_ARR)) &&
-	    (!cpu_has(c, X86_FEATURE_CENTAUR_MCR)))
+	if (!cpu_has_mtrr &&
+	    !cpu_has_k6_mtrr &&
+	    !cpu_has_cyrix_arr &&
+	    !cpu_has_centaur_mcr)
 		return;
 
 	mtrr_if = &xen_mtrr_ops;
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH] xen mtrr: Kill some unnecessary includes
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
@ 2009-02-28  1:59   ` Jeremy Fitzhardinge
  2009-02-28  1:59   ` Jeremy Fitzhardinge
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Mark McLoughlin

From: Mark McLoughlin <markmc@redhat.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 arch/x86/kernel/cpu/mtrr/mtrr.h |    2 ++
 arch/x86/kernel/cpu/mtrr/xen.c  |    8 +-------
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
index eb23ca2..6142d6e 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -5,6 +5,8 @@
 #include <linux/types.h>
 #include <linux/stddef.h>
 
+#include <asm/mtrr.h>
+
 #define MTRRcap_MSR     0x0fe
 #define MTRRdefType_MSR 0x2ff
 
diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
index e03532c..7a25f88 100644
--- a/arch/x86/kernel/cpu/mtrr/xen.c
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -1,12 +1,6 @@
 #include <linux/init.h>
-#include <linux/proc_fs.h>
-#include <linux/ctype.h>
-#include <linux/module.h>
-#include <linux/seq_file.h>
-#include <asm/uaccess.h>
-#include <linux/mutex.h>
-
-#include <asm/mtrr.h>
+#include <linux/mm.h>
+
 #include "mtrr.h"
 
 #include <xen/interface/platform.h>
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread


* [PATCH] xen mtrr: Use generic_validate_add_page()
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (15 preceding siblings ...)
  2009-02-28  1:59   ` Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen mtrr: Implement xen_get_free_region() Jeremy Fitzhardinge
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Mark McLoughlin

From: Mark McLoughlin <markmc@redhat.com>

The hypervisor already performs the same validation, but it is
better to do it early, before getting to the range-combining
code.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 arch/x86/kernel/cpu/mtrr/xen.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
index 7a25f88..f226044 100644
--- a/arch/x86/kernel/cpu/mtrr/xen.c
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -16,7 +16,7 @@ static struct mtrr_ops xen_mtrr_ops = {
 //	.set               = xen_set_mtrr,
 //	.get               = xen_get_mtrr,
 	.get_free_region   = generic_get_free_region,
-//	.validate_add_page = xen_validate_add_page,
+	.validate_add_page = generic_validate_add_page,
 	.have_wrcomb       = positive_have_wrcomb,
 	.use_intel_if	   = 0,
 	.num_var_ranges	   = xen_num_var_ranges,
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread

* [PATCH] xen mtrr: Implement xen_get_free_region()
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (16 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen mtrr: Use generic_validate_add_page() Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  1:59 ` [PATCH] xen mtrr: Add xen_{get,set}_mtrr() implementations Jeremy Fitzhardinge
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Mark McLoughlin

From: Mark McLoughlin <markmc@redhat.com>

When an already-set MTRR is being changed, we need to unset it
first, since Xen also maintains a usage count.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 arch/x86/kernel/cpu/mtrr/xen.c |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
index f226044..d715843 100644
--- a/arch/x86/kernel/cpu/mtrr/xen.c
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -9,13 +9,38 @@
 
 static int __init xen_num_var_ranges(void);
 
+static int xen_get_free_region(unsigned long base, unsigned long size, int replace_reg)
+{
+	struct xen_platform_op op;
+	int error;
+
+	if (replace_reg < 0)
+		return generic_get_free_region(base, size, -1);
+
+	/* If we're replacing the contents of a register,
+	 * we need to first unset it since Xen also keeps
+	 * a usage count.
+	 */
+	op.cmd = XENPF_del_memtype;
+	op.u.del_memtype.handle = 0;
+	op.u.del_memtype.reg    = replace_reg;
+
+	error = HYPERVISOR_dom0_op(&op);
+	if (error) {
+		BUG_ON(error > 0);
+		return error;
+	}
+
+	return replace_reg;
+}
+
 /* DOM0 TODO: Need to fill in the remaining mtrr methods to have full
  * working userland mtrr support. */
 static struct mtrr_ops xen_mtrr_ops = {
 	.vendor            = X86_VENDOR_UNKNOWN,
 //	.set               = xen_set_mtrr,
 //	.get               = xen_get_mtrr,
-	.get_free_region   = generic_get_free_region,
+	.get_free_region   = xen_get_free_region,
 	.validate_add_page = generic_validate_add_page,
 	.have_wrcomb       = positive_have_wrcomb,
 	.use_intel_if	   = 0,
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread
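[Editor's note: the unset-before-replace rule in the commit message can be modeled with a toy per-register reference count. This is purely illustrative; the real code issues XENPF_del_memtype through HYPERVISOR_dom0_op(), and the demo_* names below are invented:]

```c
#include <assert.h>

#define NR_REGS 8

/* Toy usage counts standing in for the hypervisor's own bookkeeping. */
static int demo_usage[NR_REGS];

static int demo_del_memtype(int reg)
{
	if (reg < 0 || reg >= NR_REGS || demo_usage[reg] == 0)
		return -1;
	demo_usage[reg]--;
	return 0;
}

static int demo_add_memtype(int reg)
{
	demo_usage[reg]++;
	return 0;
}

/*
 * Mirrors the shape of xen_get_free_region(): when asked to replace an
 * existing register, drop the hypervisor's reference first, otherwise
 * the usage count would leak when the register is re-added.
 */
static int demo_get_free_region(int replace_reg)
{
	int reg;

	if (replace_reg < 0) {
		for (reg = 0; reg < NR_REGS; reg++)
			if (demo_usage[reg] == 0)
				return reg;
		return -1;
	}
	if (demo_del_memtype(replace_reg) < 0)
		return -1;
	return replace_reg;
}
```

The replace path returns the same register number it was given, just as the real xen_get_free_region() returns replace_reg after the delete hypercall succeeds.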

* [PATCH] xen mtrr: Add xen_{get,set}_mtrr() implementations
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (17 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen mtrr: Implement xen_get_free_region() Jeremy Fitzhardinge
@ 2009-02-28  1:59 ` Jeremy Fitzhardinge
  2009-02-28  5:28 ` [PATCH] xen: core dom0 support Andrew Morton
  2009-02-28  6:17 ` Boris Derzhavets
  20 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  1:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Mark McLoughlin

From: Mark McLoughlin <markmc@redhat.com>

Straightforward apart from the hack to turn mtrr_ops->set()
into a no-op on all but one CPU.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 arch/x86/kernel/cpu/mtrr/xen.c |   52 ++++++++++++++++++++++++++++++++++++---
 1 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
index d715843..50a45db 100644
--- a/arch/x86/kernel/cpu/mtrr/xen.c
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -9,6 +9,52 @@
 
 static int __init xen_num_var_ranges(void);
 
+static void xen_set_mtrr(unsigned int reg, unsigned long base,
+			 unsigned long size, mtrr_type type)
+{
+	struct xen_platform_op op;
+	int error;
+
+	/* mtrr_ops->set() is called once per CPU,
+	 * but Xen's ops apply to all CPUs.
+	 */
+	if (smp_processor_id())
+		return;
+
+	if (size == 0) {
+		op.cmd = XENPF_del_memtype;
+		op.u.del_memtype.handle = 0;
+		op.u.del_memtype.reg    = reg;
+	} else {
+		op.cmd = XENPF_add_memtype;
+		op.u.add_memtype.mfn     = base;
+		op.u.add_memtype.nr_mfns = size;
+		op.u.add_memtype.type    = type;
+	}
+
+	error = HYPERVISOR_dom0_op(&op);
+	BUG_ON(error != 0);
+}
+
+static void xen_get_mtrr(unsigned int reg, unsigned long *base,
+			 unsigned long *size, mtrr_type *type)
+{
+	struct xen_platform_op op;
+
+	op.cmd = XENPF_read_memtype;
+	op.u.read_memtype.reg = reg;
+	if (HYPERVISOR_dom0_op(&op) != 0) {
+		*base = 0;
+		*size = 0;
+		*type = 0;
+		return;
+	}
+
+	*size = op.u.read_memtype.nr_mfns;
+	*base = op.u.read_memtype.mfn;
+	*type = op.u.read_memtype.type;
+}
+
 static int xen_get_free_region(unsigned long base, unsigned long size, int replace_reg)
 {
 	struct xen_platform_op op;
@@ -34,12 +80,10 @@ static int xen_get_free_region(unsigned long base, unsigned long size, int repla
 	return replace_reg;
 }
 
-/* DOM0 TODO: Need to fill in the remaining mtrr methods to have full
- * working userland mtrr support. */
 static struct mtrr_ops xen_mtrr_ops = {
 	.vendor            = X86_VENDOR_UNKNOWN,
-//	.set               = xen_set_mtrr,
-//	.get               = xen_get_mtrr,
+	.set               = xen_set_mtrr,
+	.get               = xen_get_mtrr,
 	.get_free_region   = xen_get_free_region,
 	.validate_add_page = generic_validate_add_page,
 	.have_wrcomb       = positive_have_wrcomb,
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 121+ messages in thread
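[Editor's note: the "no-op on all but one CPU" hack in xen_set_mtrr() above exists because the generic MTRR code invokes mtrr_ops->set() once per CPU, while a single Xen platform hypercall already applies to the whole machine. A toy model of that gating (demo_* names invented; not the real kernel API):]

```c
#include <assert.h>

static int demo_hypercalls;	/* counts stand-in dom0 hypercalls */

/*
 * Toy version of xen_set_mtrr(): every CPU except CPU 0 returns
 * immediately, so exactly one machine-wide hypercall is issued
 * no matter how many CPUs run the update.
 */
static void demo_set_mtrr(int cpu)
{
	if (cpu != 0)
		return;
	demo_hypercalls++;	/* stands in for HYPERVISOR_dom0_op() */
}

/* What the generic MTRR code effectively does during an update. */
static void demo_update_all_cpus(int nr_cpus)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		demo_set_mtrr(cpu);
}
```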

* Re: [PATCH] xen: core dom0 support
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (18 preceding siblings ...)
  2009-02-28  1:59 ` [PATCH] xen mtrr: Add xen_{get,set}_mtrr() implementations Jeremy Fitzhardinge
@ 2009-02-28  5:28 ` Andrew Morton
  2009-02-28  6:52     ` Jeremy Fitzhardinge
                     ` (2 more replies)
  2009-02-28  6:17 ` Boris Derzhavets
  20 siblings, 3 replies; 121+ messages in thread
From: Andrew Morton @ 2009-02-28  5:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

On Fri, 27 Feb 2009 17:59:06 -0800 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This series implements the core parts of Xen dom0 support; that is, just
> enough to get the kernel started when booted by Xen as a dom0 kernel.

And what other patches can we expect to see to complete the xen dom0
support?


and..

I hate to be the one to say it, but we should sit down and work out
whether it is justifiable to merge any of this into Linux.  I think
it's still the case that the Xen technology is the "old" way and that
the world is moving off in the "new" direction, KVM?

In three years time, will we regret having merged this?

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
                   ` (19 preceding siblings ...)
  2009-02-28  5:28 ` [PATCH] xen: core dom0 support Andrew Morton
@ 2009-02-28  6:17 ` Boris Derzhavets
  2009-02-28  6:23     ` Jeremy Fitzhardinge
  20 siblings, 1 reply; 121+ messages in thread
From: Boris Derzhavets @ 2009-02-28  6:17 UTC (permalink / raw)
  To: H. Peter Anvin, Jeremy Fitzhardinge
  Cc: Xen-devel, the arch/x86 maintainers, Linux Kernel Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 5252 bytes --]

 
Does it mean that the Wiki page should now look like this?
----------------------------------------------------------------------------------------------------------------------
git   clone git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git   linux-2.6-xen
cd   linux-2.6-xen
git   checkout   origin/push/xen/dom0/core   -b    push/xen/dom0/core
----------------------------------------------------------------------------------------------------------------------

--- On Fri, 2/27/09, Jeremy Fitzhardinge <jeremy@goop.org> wrote:

From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: [Xen-devel] [PATCH] xen: core dom0 support
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Xen-devel" <xen-devel@lists.xensource.com>, "the arch/x86 maintainers" <x86@kernel.org>, "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>
Date: Friday, February 27, 2009, 8:59 PM

Hi,

This series implements the core parts of Xen dom0 support; that is, just
enough to get the kernel started when booted by Xen as a dom0 kernel.

The Xen dom0 kernel runs as a normal paravirtualized Xen kernel, but
it also has the additional responsibility for managing all the machine's
hardware, as Xen itself has almost no internal driver support (it barely
even knows about PCI).

This series includes:
 - setting up a Xen hvc console
 - initializing Xenbus
 - enabling IO permissions for the kernel
 - MTRR setup hooks
 - Use _PAGE_IOMAP to allow direct hardware mappings
 - add a paravirt-ops for page_is_ram, to allow Xen to exclude granted pages
 - enable the use of a vga console

Not included in this series are the hooks into APIC setup; those are next.

This may be pulled from:

The following changes since commit cc2f3b455c8efa01c66b8e66df8aad1da9310901:
  Ingo Molnar (1):
        Merge branch 'sched/urgent'

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
push/xen/dom0/core

Ian Campbell (4):
      xen: disable PAT
      xen/dom0: Use host E820 map
      xen: implement XENMEM_machphys_mapping
      xen: clear reserved bits in l3 entries given in the initial pagetables

Jeremy Fitzhardinge (6):
      xen dom0: Make hvc_xen console work for dom0.
      xen-dom0: only selectively disable cpu features
      xen/dom0: use _PAGE_IOMAP in ioremap to do machine mappings
      paravirt/xen: add pvop for page_is_ram
      xen/dom0: add XEN_DOM0 config option
      xen: allow enable use of VGA console on dom0

Juan Quintela (2):
      xen dom0: Initialize xenbus for dom0.
      xen dom0: Set up basic IO permissions for dom0.

Mark McLoughlin (5):
      xen mtrr: Use specific cpu_has_foo macros instead of generic cpu_has()
      xen mtrr: Kill some unnecessary includes
      xen mtrr: Use generic_validate_add_page()
      xen mtrr: Implement xen_get_free_region()
      xen mtrr: Add xen_{get,set}_mtrr() implementations

Stephen Tweedie (2):
      xen dom0: Add support for the platform_ops hypercall
      xen mtrr: Add mtrr_ops support for Xen mtrr

 arch/x86/include/asm/page.h             |    9 +-
 arch/x86/include/asm/paravirt.h         |    7 +
 arch/x86/include/asm/pat.h              |    5 +
 arch/x86/include/asm/xen/hypercall.h    |    8 +
 arch/x86/include/asm/xen/interface.h    |    6 +-
 arch/x86/include/asm/xen/interface_32.h |    5 +
 arch/x86/include/asm/xen/interface_64.h |   13 +--
 arch/x86/include/asm/xen/page.h         |   15 +--
 arch/x86/kernel/cpu/mtrr/Makefile       |    1 +
 arch/x86/kernel/cpu/mtrr/amd.c          |    1 +
 arch/x86/kernel/cpu/mtrr/centaur.c      |    1 +
 arch/x86/kernel/cpu/mtrr/cyrix.c        |    1 +
 arch/x86/kernel/cpu/mtrr/generic.c      |    1 +
 arch/x86/kernel/cpu/mtrr/main.c         |   11 +-
 arch/x86/kernel/cpu/mtrr/mtrr.h         |    7 +
 arch/x86/kernel/cpu/mtrr/xen.c          |  120 ++++++++++++++++
 arch/x86/kernel/paravirt.c              |    1 +
 arch/x86/mm/ioremap.c                   |    2 +-
 arch/x86/mm/pat.c                       |    5 -
 arch/x86/xen/Kconfig                    |   26 ++++
 arch/x86/xen/Makefile                   |    3 +-
 arch/x86/xen/enlighten.c                |   58 ++++++--
 arch/x86/xen/mmu.c                      |  135 ++++++++++++++++++-
 arch/x86/xen/setup.c                    |   51 ++++++-
 arch/x86/xen/vga.c                      |   65 +++++++++
 arch/x86/xen/xen-ops.h                  |   12 ++
 drivers/char/hvc_xen.c                  |  101 +++++++++-----
 drivers/xen/events.c                    |    2 +-
 drivers/xen/xenbus/xenbus_probe.c       |   30 ++++-
 include/xen/events.h                    |    2 +
 include/xen/interface/memory.h          |   42 ++++++
 include/xen/interface/platform.h        |  232 +++++++++++++++++++++++++++++++
 include/xen/interface/xen.h             |   41 ++++++
 33 files changed, 931 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mtrr/xen.c
 create mode 100644 arch/x86/xen/vga.c
 create mode 100644 include/xen/interface/platform.h

Thanks,
	J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel



      


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Xen-devel] [PATCH] xen: core dom0 support
  2009-02-28  6:17 ` Boris Derzhavets
@ 2009-02-28  6:23     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  6:23 UTC (permalink / raw)
  To: bderzhavets
  Cc: H. Peter Anvin, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List

Boris Derzhavets wrote:
>  
> Does it mean that the Wiki page should now look like this?
> ----------------------------------------------------------------------------------------------------------------------
> git   clone 
> git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git   
> linux-2.6-xen
> cd   linux-2.6-xen
> git   checkout   origin/push/xen/dom0/core   -b    push/xen/dom0/core
>

No.  That branch is far from complete; the push/* branches are just for 
upstreaming things to the kernel, and are not independently useful.

xen/dom0/hackery continues to be the best place to get complete (as far 
as it goes) dom0 support.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] xen: core dom0 support
  2009-02-28  6:23     ` Jeremy Fitzhardinge
  (?)
@ 2009-02-28  6:28     ` Boris Derzhavets
  -1 siblings, 0 replies; 121+ messages in thread
From: Boris Derzhavets @ 2009-02-28  6:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: the arch/x86 maintainers, Xen-devel, Linux Kernel Mailing List,
	H. Peter Anvin


[-- Attachment #1.1: Type: text/plain, Size: 1234 bytes --]

Thank you for the quick response.

--- On Sat, 2/28/09, Jeremy Fitzhardinge <jeremy@goop.org> wrote:

From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: [Xen-devel] [PATCH] xen: core dom0 support
To: bderzhavets@yahoo.com
Cc: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>, "Xen-devel" <xen-devel@lists.xensource.com>, "the arch/x86 maintainers" <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>
Date: Saturday, February 28, 2009, 1:23 AM

Boris Derzhavets wrote:
>  Does it mean that the Wiki page should now look like this?
>
----------------------------------------------------------------------------------------------------------------------
> git   clone git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git  
linux-2.6-xen
> cd   linux-2.6-xen
> git   checkout   origin/push/xen/dom0/core   -b    push/xen/dom0/core
> 

No.  That branch is far from complete; the push/* branches are just for
upstreaming things to the kernel, and are not independently useful.

xen/dom0/hackery continues to be the best place to get complete (as far as it
goes) dom0 support.

   J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel



      


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  5:28 ` [PATCH] xen: core dom0 support Andrew Morton
@ 2009-02-28  6:52     ` Jeremy Fitzhardinge
  2009-02-28  8:42     ` Ingo Molnar
  2009-03-05 13:52   ` Morten P.D. Stevens
  2 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  6:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Andrew Morton wrote:
> On Fri, 27 Feb 2009 17:59:06 -0800 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> This series implements the core parts of Xen dom0 support; that is, just
>> enough to get the kernel started when booted by Xen as a dom0 kernel.
>>     
>
> And what other patches can we expect to see to complete the xen dom0
> support?
>   

There's a bit of a gradient.  There's probably another 2-3 similarly 
sized series to get everything so that you can boot dom0 out of the box 
(core, apic, swiotlb/agp/drm, backend drivers, tools).  And then a 
scattering of smaller things which may or may not be upstreamable.  The 
vast majority of it is Xen-specific code, rather than changes to core 
kernel.   I'm in no particular rush to get it all into the kernel, but I 
would like to get the core parts in for .30 so that it's basically 
useful, and the delta to feature-complete isn't very large (a big reason 
is to keep the out-of-tree patch size down for distros).

> I hate to be the one to say it, but we should sit down and work out
> whether it is justifiable to merge any of this into Linux.  I think
> it's still the case that the Xen technology is the "old" way and that
> the world is moving off in the "new" direction, KVM?
>   

I don't think that's a particularly useful way to look at it.  They're 
different approaches to the problem, and have different tradeoffs.  

The more important question is: are there real users for this stuff?   
Does not merging it cause more net disadvantage than merging it?  
Despite all the noise made about kvm in kernel circles, Xen has a large 
and growing installed base.  At the moment its all running on massive 
out-of-tree patches, which doesn't make anyone happy.  It's best that it 
be in the mainline kernel.  You know, like we argue for everything else.

> In three years time, will we regret having merged this?
>   

It's a pretty minor amount of extra stuff on top of what's been added 
over the last 3 years, so I don't think it's going to tip the scales on 
its own.  I wouldn't be comfortable in trying to merge something that's 
very intrusive.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] xen: core dom0 support
  2009-02-28  6:52     ` Jeremy Fitzhardinge
@ 2009-02-28  7:20       ` Ingo Molnar
  -1 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-02-28  7:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> [...] At the moment its all running on massive out-of-tree 
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> patches, which doesn't make anyone happy.  It's best that it 
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> be in the mainline kernel.  You know, like we argue for 
> everything else.
>
>> In three years time, will we regret having merged this?
>
> Its a pretty minor amount of extra stuff on top of what's been 
> added over the last 3 years, so I don't think it's going to 
> tip the scales on its own.  I wouldn't be comfortable in 
> trying to merge something that's very intrusive.

Hm, how can the same code that you call "massive out-of-tree 
patches which doesn't make anyone happy" in an out of tree 
context suddenly become non-intrusive "minor amount of extra 
stuff" in an upstream context?

I wish the upstream kernel was able to do such magic, but i'm 
afraid it is not.

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] xen: core dom0 support
  2009-02-28  7:20       ` Ingo Molnar
@ 2009-02-28  8:05         ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  8:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> Hm, how can the same code that you call "massive out-of-tree 
> patches which doesn't make anyone happy" in an out of tree 
> context suddenly become non-intrusive "minor amount of extra 
> stuff" in an upstream context?
>
> I wish the upstream kernel were able to do such magic, but I'm 
> afraid it is not.

No, but I am ;)  The current out of tree Xen patches are very intrusive 
because there hasn't been much incentive to reduce their impact.  I've 
been going through it all and very carefully rewriting it to 1) be cleaner, 2) 
enable/disable itself at runtime, 3) have clean interfaces and 
interactions with the rest of the kernel, and 4) address any concerns 
that others have.  In other words, make Xen a first-class kernel citizen.

Most of the intrusive stuff has already been merged (and merged for some 
time now), but without dom0 support it's only half done; as it stands 
people are using mainline Linux for their domUs, but are still limited 
to patched up (old) kernels for dom0.  This is a real problem because 
all the drivers for interesting new devices are in the new kernels, so 
there's an additional burden of backporting device support into old kernels.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  8:05         ` Jeremy Fitzhardinge
@ 2009-02-28  8:36           ` Ingo Molnar
  -1 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-02-28  8:36 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> Hm, how can the same code that you call "massive out-of-tree patches 
>> which doesn't make anyone happy" in an out of tree context suddenly 
>> become non-intrusive "minor amount of extra stuff" in an upstream 
>> context?
>>
>> I wish the upstream kernel were able to do such magic, but I'm afraid it 
>> is not.
>
> No, but I am ;) The current out of tree Xen patches are very 
> intrusive because there hasn't been much incentive to reduce 
> their impact.  I've been going through it all and very carefully 
> rewriting it to 1) be cleaner, 2) enable/disable itself at 
> runtime, 3) have clean interfaces and interactions with the 
> rest of the kernel, and 4) address any concerns that others 
> have.  In other words, make Xen a first-class kernel citizen.
>
> Most of the intrusive stuff has already been merged (and 
> merged for some time now), but without dom0 support it's only 
> half done; as it stands people are using mainline Linux for 
> their domUs, but are still limited to patched up (old) kernels 
> for dom0.  This is a real problem because all the drivers for 
> interesting new devices are in the new kernels, so there's an 
> additional burden of backporting device support into old 
> kernels.

This means that the "massive out-of-tree patches which doesn't 
make anyone happy" argument above is really ... hyperbole, and 
should be replaced with: "small, unintrusive out-of-tree patch"?

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  5:28 ` [PATCH] xen: core dom0 support Andrew Morton
@ 2009-02-28  8:42     ` Ingo Molnar
  2009-02-28  8:42     ` Ingo Molnar
  2009-03-05 13:52   ` Morten P.D. Stevens
  2 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-02-28  8:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Fitzhardinge, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> I hate to be the one to say it, but we should sit down and 
> work out whether it is justifiable to merge any of this into 
> Linux.  I think it's still the case that the Xen technology is 
> the "old" way and that the world is moving off in the "new" 
> direction, KVM?
> 
> In three years time, will we regret having merged this?

Personally I'd like to see a sufficient reply to the mmap-perf 
paravirt regressions pointed out by Nick and reproduced by 
myself as well. (They were in the 4-5% macro-performance range 
iirc, which is huge.)

So I haven't seen any real progress on reducing native kernel 
overhead with paravirt. Patches were sent but no measurements 
were done and it seemed to have all fizzled out while the dom0 
patches are being pursued.

Which is not a particularly good basis on which to add even 
_more_ paravirt stuff, is it?

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  8:42     ` Ingo Molnar
@ 2009-02-28  9:46       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  9:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> Personally I'd like to see a sufficient reply to the mmap-perf 
> paravirt regressions pointed out by Nick and reproduced by 
> myself as well. (They were in the 4-5% macro-performance range 
> iirc, which is huge.)
>
> So I haven't seen any real progress on reducing native kernel 
> overhead with paravirt. Patches were sent but no measurements 
> were done and it seemed to have all fizzled out while the dom0 
> patches are being pursued.
>   

Hm, I'm not sure what you want me to do here.  I sent out patches, they 
got merged, I posted the results of my measurements showing that the 
patches made a substantial improvement.  I'd love to see confirmation 
from others that the patches help them, but I don't think you can say 
I've been unresponsive about this.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  8:36           ` Ingo Molnar
@ 2009-02-28  9:57             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-28  9:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> This means that the "massive out-of-tree patches which doesn't 
> make anyone happy" argument above is really ... hyperbole, and 
> should be replaced with: "small, unintrusive out-of-tree patch"?

Well at the moment we're in the "doesn't make anybody happy" state.  The 
dom0 changes I have are, I'll admit, non-trivial.  I don't think they're 
unreasonable or particularly intrusive, but they are large enough to be 
awkward to maintain out of tree.  What I'm looking to achieve now is to 
get enough into the kernel so that the remaining patches are a "small 
unintrusive out-of-tree patch" (but ultimately I'd like to get 
everything in).

But I think that's sort of beside the point.  It's not like we're talking 
about something extremely obscure here; these changes do serve a large 
existing user-base.  The (often repeated) kernel policy is "merge it".  
I'm happy to talk about the specifics of how all this stuff can be made 
to fit together - and whether the current approach is OK or if something 
else would be better, but ultimately I think this functionality does 
belong in mainline.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  6:52     ` Jeremy Fitzhardinge
@ 2009-02-28 12:09       ` Nick Piggin
  -1 siblings, 0 replies; 121+ messages in thread
From: Nick Piggin @ 2009-02-28 12:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
> Andrew Morton wrote:

> > I hate to be the one to say it, but we should sit down and work out
> > whether it is justifiable to merge any of this into Linux.  I think
> > it's still the case that the Xen technology is the "old" way and that
> > the world is moving off in the "new" direction, KVM?
>
> I don't think that's a particularly useful way to look at it.  They're
> different approaches to the problem, and have different tradeoffs.
>
> The more important question is: are there real users for this stuff?
> Does not merging it cause more net disadvantage than merging it?
> Despite all the noise made about kvm in kernel circles, Xen has a large
> and growing installed base.  At the moment its all running on massive
> out-of-tree patches, which doesn't make anyone happy.  It's best that it
> be in the mainline kernel.  You know, like we argue for everything else.

OTOH, there are good reasons not to duplicate functionality, and many
many times throughout the kernel history competing solutions have been
rejected even though the same arguments could be made about them.

There have also been many times duplicate functionality has been merged,
although that does often start with the intention of eliminating
duplicate implementations and ends with pain. So I think Andrew's
question is pretty important.

The user issue aside -- that is a valid point -- you don't really touch
on the technical issues.  What the tradeoffs are, and where Xen does
better than KVM, would be interesting to know; can Xen tools and users
ever be migrated to KVM, or vice versa?  (I know very little about this
myself, so I'm just an interested observer.)

Ideally of course, consensus would be reached that one or the other is the
better technical solution, and we should encourage developers to improve
that one and users to use it.  Although obviously a consensus can't always
be reached (usually when there is no right answer -- different tradeoffs
etc).


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  6:52     ` Jeremy Fitzhardinge
                       ` (2 preceding siblings ...)
  (?)
@ 2009-02-28 16:14     ` Andi Kleen
  2009-03-01 23:34         ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2009-02-28 16:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Jeremy Fitzhardinge <jeremy@goop.org> writes:

> Andrew Morton wrote:
>> On Fri, 27 Feb 2009 17:59:06 -0800 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>>
>>
>>> This series implements the core parts of Xen dom0 support; that is, just
>>> enough to get the kernel started when booted by Xen as a dom0 kernel.
>>>
>>
>> And what other patches can we expect to see to complete the xen dom0
>> support?
>>
>
> There's a bit of a gradient.  There's probably another 2-3 similarly
> sized series to get everything so that you can boot dom0 out of the
> box (core, apic, swiotlb/agp/drm, backend drivers, tools).  And then a
> scattering of smaller things which may or may not be upstreamable.
> The vast majority of it is Xen-specific code, rather than changes to
> core kernel.  

I would say the more interesting question is less how much additional
code it is or even how much it changes the main kernel, but more how
different the code execution paths in interaction with Xen are
compared to what a native kernel would do. Because such differences
always would need to be considered in future changes.

For example, things like not using PAT under Xen, or the apparently very
different routing, are somewhat worrying, because they mean a completely
different mode of operation with Xen that needs to be taken care of
later, adding to complexity.

Unfortunately it also looks like Xen the hypervisor does things more
and more differently from what the mainline kernel does, so these
differences will likely continue to grow over time.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Xen-devel] Re: [PATCH] xen: core dom0 support
  2009-02-28 12:09       ` Nick Piggin
@ 2009-02-28 18:11         ` Jody Belka
  -1 siblings, 0 replies; 121+ messages in thread
From: Jody Belka @ 2009-02-28 18:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeremy Fitzhardinge, Xen-devel, Andrew Morton,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	H. Peter Anvin

On Sat, Feb 28, 2009 at 11:09:07PM +1100, Nick Piggin wrote:
> On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
> > Andrew Morton wrote:
> 
> > > I hate to be the one to say it, but we should sit down and work out
> > > whether it is justifiable to merge any of this into Linux.  I think
> > > it's still the case that the Xen technology is the "old" way and that
> > > the world is moving off in the "new" direction, KVM?
> >
> > I don't think that's a particularly useful way to look at it.  They're
> > different approaches to the problem, and have different tradeoffs.
> >
> > The more important question is: are there real users for this stuff?
> > Does not merging it cause more net disadvantage than merging it?
> > Despite all the noise made about kvm in kernel circles, Xen has a large
> > and growing installed base.  At the moment its all running on massive
> > out-of-tree patches, which doesn't make anyone happy.  It's best that it
> > be in the mainline kernel.  You know, like we argue for everything else.
> 
> OTOH, there are good reasons not to duplicate functionality, and many
> many times throughout the kernel history competing solutions have been
> rejected even though the same arguments could be made about them.

Is it duplication though? I personally have machines with older processors
that don't have hvm support. I plan on keeping these around for a good amount
of time, and would love to be running them on mainline. So for me, unless KVM
is somehow going to support para-virtualisation, this isn't duplication.

Just my own personal viewpoint as a user of xen.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28 18:11         ` Jody Belka
  (?)
@ 2009-02-28 18:15         ` Andi Kleen
  2009-03-01 23:38             ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 121+ messages in thread
From: Andi Kleen @ 2009-02-28 18:15 UTC (permalink / raw)
  To: Jody Belka
  Cc: Nick Piggin, Jeremy Fitzhardinge, Xen-devel,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	H. Peter Anvin, Andrew Morton

Jody Belka <lists-xen@pimb.org> writes:
>
> Is it duplication though? I personally have machines with older processors
> that don't have hvm support. I plan on keeping these around for a good amount
> of time, and would love to be running them on mainline. So for me, unless KVM
> is somehow going to support para-virtualisation, this isn't duplication.

The old systems will continue to run fine with a 2.6.18 Dom0 though.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28 12:09       ` Nick Piggin
@ 2009-03-01 23:27         ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-01 23:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Nick Piggin wrote:
> On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
>   
>> Andrew Morton wrote:
>>     
>
>   
>>> I hate to be the one to say it, but we should sit down and work out
>>> whether it is justifiable to merge any of this into Linux.  I think
>>> it's still the case that the Xen technology is the "old" way and that
>>> the world is moving off in the "new" direction, KVM?
>>>       
>> I don't think that's a particularly useful way to look at it.  They're
>> different approaches to the problem, and have different tradeoffs.
>>
>> The more important question is: are there real users for this stuff?
>> Does not merging it cause more net disadvantage than merging it?
>> Despite all the noise made about kvm in kernel circles, Xen has a large
>> and growing installed base.  At the moment its all running on massive
>> out-of-tree patches, which doesn't make anyone happy.  It's best that it
>> be in the mainline kernel.  You know, like we argue for everything else.
>>     
>
> OTOH, there are good reasons not to duplicate functionality, and many
> many times throughout the kernel history competing solutions have been
> rejected even though the same arguments could be made about them.
>
> There have also been many times duplicate functionality has been merged,
> although that does often start with the intention of eliminating
> duplicate implementations and ends with pain. So I think Andrew's
> question is pretty important.
>   

Those would be pertinent questions if I were suddenly popping up and 
saying "hey, let's add Xen support to the kernel!"  But Xen support has 
been in the kernel for well over a year now, and is widely used, enabled 
in distros, etc.  The patches I'm proposing here are not a whole new 
thing, they're part of the last 10% to fill out the kernel's support to 
make it actually useful.

> The user issue aside -- that is a valid point -- you don't really touch
> on the technical issues. What tradeoffs, and where Xen does better
> than KVM would be interesting to know, can Xen tools and users ever be
> migrated to KVM or vice versa (I know very little about this myself, so
> I'm just an interested observer).
>   

OK, fair point, its probably time for another Xen architecture refresher 
post.

There are two big architectural differences between Xen and KVM:

Firstly, Xen has a separate hypervisor whose primary role is to context 
switch between the guest domains (virtual machines).   The hypervisor is 
relatively small and single purpose.  It doesn't, for example, contain 
any device drivers or even much knowledge of things like pci buses and 
their structure.  The domains themselves are more or less peers; some 
are more privileged than others, but from Xen's perspective they are 
more or less equivalent.  The first domain, dom0, is special because it's 
started by Xen itself, and has some inherent initial privileges; its 
main job is to start other domains, and it also typically provides 
virtualized/multiplexed device services to other domains via a 
frontend/backend split driver structure.

KVM, on the other hand, builds all the hypervisor stuff into the kernel 
itself, so you end up with a kernel which does all the normal kernel 
stuff, and can run virtual machines by making them look like slightly 
strange processes.

Because Xen is dedicated to just running virtual machines, its internal 
architecture can be more heavily oriented towards that task, which 
affects everything from how its scheduler works to how it uses and 
multiplexes physical memory.  For example, Xen manages to use new hardware 
virtualization features pretty quickly, partly because it doesn't need 
to trade-off against normal kernel functions.  The clear distinction 
between the privileged hypervisor and the rest of the domains makes the 
security people happy as well.  Also, because Xen is small and fairly 
self-contained, there's quite a few hardware vendors shipping it burned 
into the firmware so that it really is the first thing to boot (many of 
the instant-on features that laptops have are based on Xen).  Both HP and 
Dell, at least, are selling servers with Xen pre-installed in the firmware.


The second big difference is the use of paravirtualization.  Xen can 
securely virtualize a machine without needing any particular hardware 
support.  Xen works well on any post-P6 or any ia64 machine, without 
needing any virtualzation hardware support.  When Xen runs a kernel in 
paravirtualized mode, it runs the kernel in an unprivileged processor 
state.  The allows the hypervisor to vet all the guest kernel's 
privileged operations, which are carried out are either via hypercalls 
or by memory shared between each guest and Xen.

By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv 
is called) being available in the CPUs, and needs the most modern of 
hardware to get the best performance.

Once important area of paravirtualization is that Xen guests directly 
use the processor's pagetables; there is no shadow pagetable or use of 
hardware pagetable nesting.  This means that a tlb miss is just a tlb 
miss, and happens at full processor performance.  This is possible 
because 1) pagetables are always read-only to the guest, and 2) the 
guest is responsible for looking up in a table to map guest-local pfns 
into machine-wide mfns before installing them in a pte.  Xen will check 
that any new mapping or pagetable satisfies all the rules, by checking 
that the writable reference count is 0, and that the domain owns (or has 
been allowed access to) any mfn it tries to install in a pagetable.

The other interesting part of paravirtualization is the abstraction of 
interrupts into event channels.  Each domain has a bit-array of 1024 
bits which correspond to 1024 possible event channels.  An event channel 
can have one of several sources, such as a timer virtual interrupt, an 
inter-domain event, an inter-vcpu IPI, or mapped from a hardware 
interrupt.  We end up mapping the event channels back to irqs and they 
are delivered as normal interrupts as far as the rest of the kernel is 
concerned.

The net result is that a paravirtualized Xen guest runs a very close to 
full speed.  Workloads which modify live pagetables a lot take a bit of 
a performance hit (since the pte updates have to trap to the hypervisor 
for validation), but in general this is not a huge deal.  Hardware 
support for nested pagetables is only just beginning to get close to 
getting performance parity, but with different tradeoffs (pagetable 
updates are cheap, but tlb misses are much more expensive, and hits 
consume more tlb entries).

Xen can also make full use of whatever hardware virtualization features 
are available when running an "hvm" domain.  This is typically how you'd 
run Windows or other unmodified operating systems.

All of this is stuff that's necessary to support any PV Xen domain, and 
has been in the kernel for a long time now.


The additions I'm proposing now are those needed for a Xen domain to 
control the physical hardware, in order to provide virtual device 
support for other less-privileged domains.  These changes affect a few 
areas:

    * interrupts: mapping a device interrupt into an event channel for
      delivery to the domain with the device driver for that interrupt
    * mappings: allowing direct hardware mapping of device memory into a
      domain
    * dma: making sure that hardware gets programmed with machine memory
      address, nor virtual ones, and that pages are machine-contiguous
      when expected

Interrupts require a few hooks into the x86 APIC code, but the end 
result is that hardware interrupts are delivered via event channels, but 
then they're mapped back to irqs and delivered normally (they even end 
up with the same irq number as they'd usually have).

Device mappings are fairly easy to arrange.  I'm using a software pte 
bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping.  This 
bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu 
code just uses the pfn in the pte as-is, rather than doing the normal 
pfn->mfn translation.

DMA is handled via the normal DMA API, with some hooks to swiotlb to 
make sure that the memory underlying its pools is really DMA-ready (ie, 
is contiguous and low enough in machine memory).

The changes I'm proposing may look a bit strange from a purely x86 
perspective, but they fit in relatively well because they're not all 
that different from what other architectures require, and so the 
kernel-wide infrastructure is mostly already in place.


I hope that helps clarify what I'm trying to do here, and why Xen and 
KVM do have distinct roles to play.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
@ 2009-03-01 23:27         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-01 23:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Xen-devel, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Nick Piggin wrote:
> On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
>   
>> Andrew Morton wrote:
>>     
>
>   
>>> I hate to be the one to say it, but we should sit down and work out
>>> whether it is justifiable to merge any of this into Linux.  I think
>>> it's still the case that the Xen technology is the "old" way and that
>>> the world is moving off in the "new" direction, KVM?
>>>       
>> I don't think that's a particularly useful way to look at it.  They're
>> different approaches to the problem, and have different tradeoffs.
>>
>> The more important question is: are there real users for this stuff?
>> Does not merging it cause more net disadvantage than merging it?
>> Despite all the noise made about kvm in kernel circles, Xen has a large
>> and growing installed base.  At the moment its all running on massive
>> out-of-tree patches, which doesn't make anyone happy.  It's best that it
>> be in the mainline kernel.  You know, like we argue for everything else.
>>     
>
> OTOH, there are good reasons not to duplicate functionality, and many
> many times throughout the kernel history competing solutions have been
> rejected even though the same arguments could be made about them.
>
> There have also been many times duplicate functionality has been merged,
> although that does often start with the intention of eliminating
> duplicate implementations and ends with pain. So I think Andrew's
> question is pretty important.
>   

Those would be pertinent questions if I were suddenly popping up and 
saying "hey, let's add Xen support to the kernel!"  But Xen support has 
been in the kernel for well over a year now, and is widely used, enabled 
in distros, etc.  The patches I'm proposing here are not a whole new 
thing; they're part of the last 10% needed to fill out the kernel's 
support and make it actually useful.

> The user issue aside -- that is a valid point -- you don't really touch
> on the technical issues. What tradeoffs, and where Xen does better
> than KVM would be interesting to know, can Xen tools and users ever be
> migrated to KVM or vice versa (I know very little about this myself, so
> I'm just an interested observer).
>   

OK, fair point, it's probably time for another Xen architecture 
refresher post.

There are two big architectural differences between Xen and KVM:

Firstly, Xen has a separate hypervisor whose primary role is to context 
switch between the guest domains (virtual machines).   The hypervisor is 
relatively small and single-purpose.  It doesn't, for example, contain 
any device drivers or even much knowledge of things like pci buses and 
their structure.  The domains themselves are more or less peers; some 
are more privileged than others, but from Xen's perspective they are 
more or less equivalent.  The first domain, dom0, is special because it's 
started by Xen itself, and has some inherent initial privileges; its 
main job is to start other domains, and it also typically provides 
virtualized/multiplexed device services to other domains via a 
frontend/backend split driver structure.

KVM, on the other hand, builds all the hypervisor stuff into the kernel 
itself, so you end up with a kernel which does all the normal kernel 
stuff, and can run virtual machines by making them look like slightly 
strange processes.

Because Xen is dedicated to just running virtual machines, its internal 
architecture can be more heavily oriented towards that task, which 
affects everything from how its scheduler works to its use and 
multiplexing of physical memory.  For example, Xen manages to use new hardware 
virtualization features pretty quickly, partly because it doesn't need 
to trade-off against normal kernel functions.  The clear distinction 
between the privileged hypervisor and the rest of the domains makes the 
security people happy as well.  Also, because Xen is small and fairly 
self-contained, there are quite a few hardware vendors shipping it burned 
into the firmware so that it really is the first thing to boot (many of 
the instant-on features that laptops have are based on Xen).  Both HP and 
Dell, at least, are selling servers with Xen pre-installed in the firmware.


The second big difference is the use of paravirtualization.  Xen can 
securely virtualize a machine without needing any particular hardware 
support.  Xen works well on any post-P6 or any ia64 machine, without 
needing any virtualization hardware support.  When Xen runs a kernel in 
paravirtualized mode, it runs the kernel in an unprivileged processor 
state.  This allows the hypervisor to vet all the guest kernel's 
privileged operations, which are carried out either via hypercalls 
or by memory shared between each guest and Xen.

By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv 
is called) being available in the CPUs, and needs the most modern of 
hardware to get the best performance.

One important aspect of paravirtualization is that Xen guests directly 
use the processor's pagetables; there is no shadow pagetable or use of 
hardware pagetable nesting.  This means that a tlb miss is just a tlb 
miss, and happens at full processor performance.  This is possible 
because 1) pagetables are always read-only to the guest, and 2) the 
guest is responsible for looking up guest-local pfns in a translation 
table to map them to machine-wide mfns before installing them in a 
pte.  Xen will check 
that any new mapping or pagetable satisfies all the rules, by checking 
that the writable reference count is 0, and that the domain owns (or has 
been allowed access to) any mfn it tries to install in a pagetable.

The other interesting part of paravirtualization is the abstraction of 
interrupts into event channels.  Each domain has a bit-array of 1024 
bits which correspond to 1024 possible event channels.  An event channel 
can have one of several sources, such as a timer virtual interrupt, an 
inter-domain event, an inter-vcpu IPI, or mapped from a hardware 
interrupt.  We end up mapping the event channels back to irqs and they 
are delivered as normal interrupts as far as the rest of the kernel is 
concerned.

The net result is that a paravirtualized Xen guest runs very close to 
full speed.  Workloads which modify live pagetables a lot take a bit of 
a performance hit (since the pte updates have to trap to the hypervisor 
for validation), but in general this is not a huge deal.  Hardware 
support for nested pagetables is only just beginning to approach 
performance parity, but with different tradeoffs (pagetable updates are 
cheap, but tlb misses are much more expensive, and hits consume more 
tlb entries).

Xen can also make full use of whatever hardware virtualization features 
are available when running an "hvm" domain.  This is typically how you'd 
run Windows or other unmodified operating systems.

All of this is stuff that's necessary to support any PV Xen domain, and 
has been in the kernel for a long time now.


The additions I'm proposing now are those needed for a Xen domain to 
control the physical hardware, in order to provide virtual device 
support for other less-privileged domains.  These changes affect a few 
areas:

    * interrupts: mapping a device interrupt into an event channel for
      delivery to the domain with the device driver for that interrupt
    * mappings: allowing direct hardware mapping of device memory into a
      domain
    * dma: making sure that hardware gets programmed with machine memory
      addresses, not virtual ones, and that pages are machine-contiguous
      when expected

Interrupts require a few hooks into the x86 APIC code; the end 
result is that hardware interrupts are delivered via event channels, but 
then they're mapped back to irqs and delivered normally (they even end 
up with the same irq number as they'd usually have).

Device mappings are fairly easy to arrange.  I'm using a software pte 
bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping.  This 
bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu 
code just uses the pfn in the pte as-is, rather than doing the normal 
pfn->mfn translation.

DMA is handled via the normal DMA API, with some hooks to swiotlb to 
make sure that the memory underlying its pools is really DMA-ready (ie, 
is contiguous and low enough in machine memory).

The changes I'm proposing may look a bit strange from a purely x86 
perspective, but they fit in relatively well because they're not all 
that different from what other architectures require, and so the 
kernel-wide infrastructure is mostly already in place.


I hope that helps clarify what I'm trying to do here, and why Xen and 
KVM do have distinct roles to play.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28 16:14     ` Andi Kleen
@ 2009-03-01 23:34         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-01 23:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Andi Kleen wrote:
> I would say the more interesting question is less how much additional
> code it is or even how much it changes the main kernel, but more how
> different the code execution paths in interaction with Xen are
> compared to what a native kernel would do. Because such differences
> always would need to be considered in future changes.
>   

Yes.  A big part of what I'm doing is trying to keep the Xen changes 
self-contained to try and minimize their system-wide impact.  Basically 
it comes down to this: if you use (mostly existing) kernel APIs in the 
way they're intended to be used, then things just work out for both Xen 
and native cases.  The whole point of keeping the kernel modular is 
that if people implement and use the interfaces correctly, the 
internal details shouldn't matter very much.  Often the process of 
adding Xen support has resulted in putting clear, well-defined 
interfaces into parts of the kernel where previously things were, well, 
in need of cleaning up.

> For example things like: doesn't use PAT with Xen or apparently very
> different routing are somewhat worrying because it means it's a
> completely different operation modus with Xen that needs to be taken
> care of later, adding to complexity.
>   

Unless we're planning on dropping support for processors with no or 
broken PAT support, we're always going to have to deal with the non-PAT 
case.  Xen just falls into the "processor with no PAT" case.  And 
if/when we work out how to paravirtualize PAT, it will no longer be in 
that case.

> Unfortunately it also looks like that Xen the HV does things
> more and more different from what mainline kernel does so 
> these differences will likely continue to grow over time.

I hope that won't be the case.  Part of considering any change to Xen 
is considering what changes would be needed to the guest operating 
systems to make use of that feature.

    J



* Re: [PATCH] xen: core dom0 support
  2009-02-28 18:15         ` Andi Kleen
@ 2009-03-01 23:38             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-01 23:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jody Belka, Nick Piggin, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin, Andrew Morton

Andi Kleen wrote:
> Jody Belka <lists-xen@pimb.org> writes:
>   
>> Is it duplication though? I personally have machines with older processors
>> that don't have hvm support. I plan on keeping these around for a good amount
>> of time, and would love to be running them on mainline. So for me, unless KVM
>> is somehow going to support para-virtualisation, this isn't duplication.
>>     
>
> The old systems will continue to run fine with a 2.6.18 Dom0 though.

But that suggests the *only* reason to update kernels is to get new 
hardware support.  Or conversely, we should stop trying to be backwards 
compatible with old hardware in new kernels because there's no reason to 
keep it.

While a lot of the delta since 2.6.18 has been hardware support 
updates, there have been a lot of other useful things: a new CPU 
scheduler, tickless operation (which is directly important for 
virtualization), all the cgroups stuff, new filesystems, IO schedulers, 
etc, etc.  All good things to have, even on older hardware.

    J



* Re: [PATCH] xen: core dom0 support
  2009-03-01 23:34         ` Jeremy Fitzhardinge
@ 2009-03-01 23:52         ` H. Peter Anvin
  2009-03-02  0:08             ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-01 23:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List

Jeremy Fitzhardinge wrote:
> 
> Unless we're planning on dropping support for processors with no or 
> broken PAT support, we're always going to have to deal with the non-PAT 
> case.  Xen just falls into the "processor with no PAT" case.  And 
> if/when we work out how to paravirtualize PAT, it will no longer be in 
> that case.
> 

In this particular case, this is actually false.  "No PAT" in the 
processor is *not* the same thing as "no cacheability controls in the 
page tables".  Every processor since the 386 has had UC, WT, and WB 
controls in the page tables; PAT only added the ability to do WC (and 
WP, which we don't use).  Since the number of processors which can do WC 
at all but don't have PAT is a small set of increasingly obsolete 
processors, we may very well choose to simply ignore the WC capabilities 
of these particular processors.

	-hpa



* Re: [PATCH] xen: core dom0 support
  2009-03-01 23:52         ` H. Peter Anvin
@ 2009-03-02  0:08             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-02  0:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List

H. Peter Anvin wrote:
> In this particular case, this is actually false.  "No PAT" in the 
> processor is *not* the same thing as "no cacheability controls in the 
> page tables".  Every processor since the 386 has had UC, WT, and WB 
> controls in the page tables; PAT only added the ability to do WC (and 
> WP, which we don't use).  Since the number of processors which can do 
> WC at all but don't have PAT is a small set of increasingly obsolete 
> processors, we may very well choose to simply ignore the WC 
> capabilities of these particular processors. 

I'm not quite sure what you're referring to with "this is actually 
false".  Certainly we support cacheability control in ptes under Xen.  We 
just don't support full PAT because Xen uses PAT for itself.

    J



* Re: [PATCH] xen: core dom0 support
  2009-03-01 23:34         ` Jeremy Fitzhardinge
@ 2009-03-02  0:10         ` Andi Kleen
  -1 siblings, 0 replies; 121+ messages in thread
From: Andi Kleen @ 2009-03-02  0:10 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

> Yes.  A big part of what I'm doing is trying to keep the Xen changes 
> self-contained to try and minimize their system-wide impact.  Basically 
> it comes down to that if you use (mostly existing) kernel APIs in the 
> way they're intended to be used, then things just work out for both Xen 
> and native cases.  The whole point of keeping the kernel modular is so 
> that if people implement and use the the interfaces correctly, the 

That's a big if.  It sounds good in theory, but in practice it will be 
different.  Kernel interfaces tend to have hidden assumptions that 
matter too, and the more special-case code is in there, the more 
additional hidden assumptions will be there as well.

> internal details shouldn't matter very much.  Often the process of 
> adding Xen support has resulted in putting clear, well defined 
> interfaces into parts of the kernel where previously things were, well, 
> in need of cleaning up.

That's true, but it's still much more complex than before semantically.
> 
> >For example things like: doesn't use PAT with Xen or apparently very
> >different routing are somewhat worrying because it means it's a
> >completely different operation modus with Xen that needs to be taken
> >care of later, adding to complexity.
> >  
> 
> Unless we're planning on dropping support for processors with no or 
> broken PAT support, we're always going to have to deal with the non-PAT 
> case.

These are all really old hardware[1], no modern 3d chips etc. Xen on the 
other hand ..

[1] afaik you have to go back to PPro to get real PAT bugs.

> >Unfortunately it also looks like that Xen the HV does things
> >more and more different from what mainline kernel does so 
> >these differences will likely continue to grow over time.
> 
> I hope that won't be the case. As part of considering any change to Xen 

That is at least my impression from looking occasionally at the Xen 
source.  It used to be that Xen was basically Linux 2.4 with some 
tweaks in many ways, but now it's often completely new code doing 
things in very different ways.  Basically a real fork, diverging more 
and more.

That said there's probably no way around merging the Dom0 support too,
but I think it should be clearly said that it has a quite high
long term cost for Linux. Hopefully it's worth it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] xen: core dom0 support
  2009-03-01 23:38             ` Jeremy Fitzhardinge
@ 2009-03-02  0:14             ` Andi Kleen
  -1 siblings, 0 replies; 121+ messages in thread
From: Andi Kleen @ 2009-03-02  0:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Jody Belka, Nick Piggin, Xen-devel,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	H. Peter Anvin, Andrew Morton

> But that suggests the *only* reason to update kernels is to get new 

Wait, it was about Dom0. They could still get all the features 
of the new kernels in a guest DomU.

> While a lot of the delta between 2.6.18 has been hardware support 
> updates, there have been a lot of other useful things: a new CPU 
> scheduler, tickless operation (which directly important for 

The old Dom0s never ran an idle tick anyway.  And I suspect most 
of the other things don't matter very much in a minimal Dom0.

That said I'm not arguing that it shouldn't be merged, but it 
seems like the "old hardware" argument is not very strong.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] xen: core dom0 support
  2009-03-02  0:08             ` Jeremy Fitzhardinge
@ 2009-03-02  0:14             ` H. Peter Anvin
  2009-03-02  0:42                 ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-02  0:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> In this particular case, this is actually false.  "No PAT" in the 
>> processor is *not* the same thing as "no cacheability controls in the 
>> page tables".  Every processor since the 386 has had UC, WT, and WB 
>> controls in the page tables; PAT only added the ability to do WC (and 
>> WP, which we don't use).  Since the number of processors which can do 
>> WC at all but don't have PAT is a small set of increasingly obsolete 
>> processors, we may very well choose to simply ignore the WC 
>> capabilities of these particular processors. 
> 
> I'm not quite sure what you're referring to with "this is actually 
> false".  Certainly we support cacheability control in ptes under Xen.  
> just don't support full PAT because Xen uses PAT for itself.
> 

What do you define as "full PAT"?  If what you mean is that Xen lays 
claim to the PAT MSR and only allows a certain mapping, that's hardly a 
problem... other than that, it's not an exhaustible resource, so I guess 
I really don't understand what you're trying to say here.

	-hpa


* Re: [PATCH] xen: core dom0 support
  2009-03-02  0:14             ` H. Peter Anvin
@ 2009-03-02  0:42                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-02  0:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> In this particular case, this is actually false.  "No PAT" in the 
>>> processor is *not* the same thing as "no cacheability controls in 
>>> the page tables".  Every processor since the 386 has had UC, WT, and 
>>> WB controls in the page tables; PAT only added the ability to do WC 
>>> (and WP, which we don't use).  Since the number of processors which 
>>> can do WC at all but don't have PAT is a small set of increasingly 
>>> obsolete processors, we may very well choose to simply ignore the WC 
>>> capabilities of these particular processors. 
>>
>> I'm not quite sure what you're referring to with "this is actually 
>> false".  Certainly we support cachability control in ptes under Xen.  
>> We just don't support full PAT because Xen uses PAT for itself.
>>
>
> What do you define as "full PAT"?  If what you mean is that Xen lays 
> claims to the PAT MSR and only allows a certain mapping that's hardly 
> a problem... other than that it's not an exhaustible resource so I 
> guess I really don't understand what you're trying to say here.

It does not allow guests to set their own PAT MSRs.  It can't easily be 
multiplexed either, as all CPUs must have the same settings for their 
PAT MSRs.  I guess it could be handled by allowing domains to set their 
own virtual PAT MSRs, and then rewriting the ptes to convert from the 
guest PAT settings to Xen's, but I don't know if this is possible in 
general (and it poses some problems because the pte modifications would 
be guest-visible).
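The virtual-PAT idea sketched above would amount to rewriting the PAT/PCD/PWT bits of each guest pte so the memory type the guest asked for (under its virtual PAT layout) lands on a hypervisor PAT slot with the same type. A toy model of that translation (bit positions follow x86 4K ptes; the PAT layouts and all function names here are hypothetical, not Xen's actual scheme):

```python
# Illustrative sketch of rewriting a guest pte's cache-attribute bits
# from a guest's (virtual) PAT layout into the hypervisor's fixed one.
# The layouts and names are hypothetical, not Xen's real encoding.

_PAGE_PWT, _PAGE_PCD, _PAGE_PAT = 1 << 3, 1 << 4, 1 << 7
CACHE_BITS = _PAGE_PWT | _PAGE_PCD | _PAGE_PAT

def pat_index(pte):
    """Decode the 3-bit PAT index from a 4K pte (PAT:PCD:PWT)."""
    return ((pte >> 5) & 4) | ((pte >> 3) & 3)

def encode_index(idx):
    """Re-encode a 3-bit PAT index back into pte bits."""
    return ((idx & 4) << 5) | ((idx & 3) << 3)

def rewrite_pte(pte, guest_pat, host_pat):
    """Make the guest's requested memory type hit the same type
    in the host's PAT, by swapping the pte's index bits."""
    wanted = guest_pat[pat_index(pte)]      # type the guest asked for
    host_idx = host_pat.index(wanted)       # first host slot with it
    return (pte & ~CACHE_BITS) | encode_index(host_idx)

# Hypothetical layouts: guest uses the x86 reset default; the host
# has repurposed slot 1 for WC, as a hypervisor keeping PAT for
# itself might.
GUEST_PAT = ["WB", "WT", "UC-", "UC", "WB", "WT", "UC-", "UC"]
HOST_PAT  = ["WB", "WC", "UC-", "UC", "WB", "WT", "UC-", "UC"]

pte = 0x2000 | _PAGE_PCD     # guest index 2 -> UC-
assert rewrite_pte(pte, GUEST_PAT, HOST_PAT) == pte  # UC- is slot 2 in both
```

As the message notes, the catch is that this rewriting is guest-visible: a guest reading back its own ptes would see the host's index bits, not the ones it wrote.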

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  0:42                 ` Jeremy Fitzhardinge
  (?)
@ 2009-03-02  0:46                 ` H. Peter Anvin
  -1 siblings, 0 replies; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-02  0:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Andrew Morton, Xen-devel, the arch/x86 maintainers,
	Linux Kernel Mailing List

Jeremy Fitzhardinge wrote:
>>
>> What do you define as "full PAT"?  If what you mean is that Xen lays
>> claims to the PAT MSR and only allows a certain mapping that's hardly
>> a problem... other than that it's not an exhaustible resource so I
>> guess I really don't understand what you're trying to say here.
> 
> It does not allow guests to set their own PAT MSRs.  It can't easily be
> multiplexed either, as all CPUs must have the same settings for their
> PAT MSRs.  I guess it could be handled by allowing domains to set their
> own virtual PAT MSRs, and then rewriting the ptes to convert from the
> guest PAT settings to Xen's, but I don't know if this is possible in
> general (and it poses some problems because the pte modifications would
> be guest-visible).
> 

It would make a lot more sense to simply specify a particular set of
mappings.  Since the only one anyone cares about that isn't in the
default set is WC anyway, it's easy to do.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-01 23:27         ` Jeremy Fitzhardinge
@ 2009-03-02  6:37           ` Nick Piggin
  -1 siblings, 0 replies; 121+ messages in thread
From: Nick Piggin @ 2009-03-02  6:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

On Monday 02 March 2009 10:27:29 Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > On Saturday 28 February 2009 17:52:24 Jeremy Fitzhardinge wrote:
> >> Andrew Morton wrote:
> >>> I hate to be the one to say it, but we should sit down and work out
> >>> whether it is justifiable to merge any of this into Linux.  I think
> >>> it's still the case that the Xen technology is the "old" way and that
> >>> the world is moving off in the "new" direction, KVM?
> >>
> >> I don't think that's a particularly useful way to look at it.  They're
> >> different approaches to the problem, and have different tradeoffs.
> >>
> >> The more important question is: are there real users for this stuff?
> >> Does not merging it cause more net disadvantage than merging it?
> >> Despite all the noise made about kvm in kernel circles, Xen has a large
> >> and growing installed base.  At the moment its all running on massive
> >> out-of-tree patches, which doesn't make anyone happy.  It's best that it
> >> be in the mainline kernel.  You know, like we argue for everything else.
> >
> > OTOH, there are good reasons not to duplicate functionality, and many
> > many times throughout the kernel history competing solutions have been
> > rejected even though the same arguments could be made about them.
> >
> > There have also been many times duplicate functionality has been merged,
> > although that does often start with the intention of eliminating
> > duplicate implementations and ends with pain. So I think Andrew's
> > question is pretty important.
>
> Those would be pertinent questions if I were suddenly popping up and
> saying "hey, let's add Xen support to the kernel!"  But Xen support has
> been in the kernel for well over a year now, and is widely used, enabled
> in distros, etc.  The patches I'm proposing here are not a whole new
> thing, they're part of the last 10% to fill out the kernel's support to
> make it actually useful.

As a guest, I guess it has been agreed that guest support for all
different hypervisors is "a good thing". dom0 is more like a piece
of the hypervisor itself, right?


> > The user issue aside -- that is a valid point -- you don't really touch
> > on the technical issues. What tradeoffs, and where Xen does better
> > than KVM would be interesting to know, can Xen tools and users ever be
> > migrated to KVM or vice versa (I know very little about this myself, so
> > I'm just an interested observer).
>
> OK, fair point, it's probably time for another Xen architecture refresher 
> post.

Thanks.


> There are two big architectural differences between Xen and KVM:
>
> Firstly, Xen has a separate hypervisor whose primary role is to context 
> switch between the guest domains (virtual machines).   The hypervisor is
> relatively small and single purpose.  It doesn't, for example, contain
> any device drivers or even much knowledge of things like pci buses and
> their structure.  The domains themselves are more or less peers; some
> are more privileged than others, but from Xen's perspective they are
> more or less equivalent.  The first domain, dom0, is special because it's 
> started by Xen itself, and has some inherent initial privileges; its
> main job is to start other domains, and it also typically provides
> virtualized/multiplexed device services to other domains via a
> frontend/backend split driver structure.
>
> KVM, on the other hand, builds all the hypervisor stuff into the kernel
> itself, so you end up with a kernel which does all the normal kernel
> stuff, and can run virtual machines by making them look like slightly
> strange processes.
>
> Because Xen is dedicated to just running virtual machines, its internal
> architecture can be more heavily oriented towards that task, which
> affects things from how its scheduler works, its use and multiplexing of
> physical memory.  For example, Xen manages to use new hardware
> virtualization features pretty quickly, partly because it doesn't need
> to trade-off against normal kernel functions.  The clear distinction
> between the privileged hypervisor and the rest of the domains makes the
> security people happy as well.  Also, because Xen is small and fairly
> self-contained, there's quite a few hardware vendors shipping it burned
> into the firmware so that it really is the first thing to boot (many of 
> the instant-on features that laptops have are based on Xen).  Both HP and 
> Dell, at least, are selling servers with Xen pre-installed in the firmware.

That would kind of seem like Xen has a better design to me, OTOH if it
needs this dom0 for most device drivers and things, then how much
difference is it really? Is KVM really disadvantaged by being a part of
the kernel?


> The second big difference is the use of paravirtualization.  Xen can
> securely virtualize a machine without needing any particular hardware
> support.  Xen works well on any post-P6 or any ia64 machine, without
> needing any virtualization hardware support.  When Xen runs a kernel in 
> paravirtualized mode, it runs the kernel in an unprivileged processor 
> state.  This allows the hypervisor to vet all the guest kernel's 
> privileged operations, which are carried out either via hypercalls 
> or by memory shared between each guest and Xen.
>
> By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv
> is called) being available in the CPUs, and needs the most modern of
> hardware to get the best performance.
>
> One important area of paravirtualization is that Xen guests directly 
> use the processor's pagetables; there is no shadow pagetable or use of
> hardware pagetable nesting.  This means that a tlb miss is just a tlb
> miss, and happens at full processor performance.  This is possible
> because 1) pagetables are always read-only to the guest, and 2) the
> guest is responsible for looking up in a table to map guest-local pfns
> into machine-wide mfns before installing them in a pte.  Xen will check
> that any new mapping or pagetable satisfies all the rules, by checking
> that the writable reference count is 0, and that the domain owns (or has
> been allowed access to) any mfn it tries to install in a pagetable.
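The pfn-to-mfn step described above can be modeled very roughly as a table lookup folded into pte construction (names and the table layout here are purely illustrative; the real code maintains a phys-to-machine table in the kernel's Xen MMU layer):

```python
# Toy model of the pfn -> mfn translation a PV guest performs before
# installing a pte.  All names and values are illustrative.

P2M = {0: 0x1234, 1: 0x2000, 2: 0x0042}   # guest pfn -> machine mfn

PAGE_SHIFT = 12
FLAGS_MASK = (1 << PAGE_SHIFT) - 1

def make_pte(pfn, flags):
    """Build a machine pte: look up the mfn, then merge in the flags."""
    mfn = P2M[pfn]                # guest-maintained lookup table
    return (mfn << PAGE_SHIFT) | (flags & FLAGS_MASK)

# Xen, not the guest, then validates the proposed pte: the writable
# refcount of the target frame must be 0, and the domain must own
# (or have been granted) the mfn.
pte = make_pte(1, 0x063)          # PRESENT|RW|ACCESSED|DIRTY-style flags
assert pte >> PAGE_SHIFT == 0x2000
```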

Xen's memory virtualization is pretty neat, I'll give it that. Is it
faster than KVM on a modern CPU? Would it be possible I wonder to make
a MMU virtualization layer for CPUs without support, using Xen's page
table protection methods, and have KVM use that? Or does that amount
to putting a significant amount of Xen hypervisor into the kernel..?


> The other interesting part of paravirtualization is the abstraction of
> interrupts into event channels.  Each domain has a bit-array of 1024
> bits which correspond to 1024 possible event channels.  An event channel
> can have one of several sources, such as a timer virtual interrupt, an
> inter-domain event, an inter-vcpu IPI, or mapped from a hardware
> interrupt.  We end up mapping the event channels back to irqs and they
> are delivered as normal interrupts as far as the rest of the kernel is
> concerned.
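The dispatch loop implied above (scan the pending bitmap, map each set channel back to an irq) can be sketched like this; the bitmap word size, the binding table, and all names are illustrative, not the layout of Xen's shared-info page:

```python
# Toy model of event-channel delivery: scan a shared pending bitmap,
# map each set channel to its bound irq, and "deliver" it.

NR_EVENT_CHANNELS = 1024
WORD_BITS = 64

def pending_channels(words):
    """Yield the channel numbers of all set bits in the bitmap."""
    for i, w in enumerate(words):
        while w:
            bit = (w & -w).bit_length() - 1   # lowest set bit
            yield i * WORD_BITS + bit
            w &= w - 1                        # clear it and continue

# Hypothetical channel -> irq bindings established at bind time.
evtchn_to_irq = {3: 16, 70: 17}

words = [0] * (NR_EVENT_CHANNELS // WORD_BITS)
words[0] |= 1 << 3     # channel 3 pending (e.g. a timer VIRQ)
words[1] |= 1 << 6     # channel 70 pending (e.g. a device interrupt)

delivered = [evtchn_to_irq[ch] for ch in pending_channels(words)]
assert delivered == [16, 17]
```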
>
> The net result is that a paravirtualized Xen guest runs very close to 
> full speed.  Workloads which modify live pagetables a lot take a bit of
> a performance hit (since the pte updates have to trap to the hypervisor
> for validation), but in general this is not a huge deal.  Hardware
> support for nested pagetables is only just beginning to get close to
> getting performance parity, but with different tradeoffs (pagetable
> updates are cheap, but tlb misses are much more expensive, and hits
> consume more tlb entries).
>
> Xen can also make full use of whatever hardware virtualization features
> are available when running an "hvm" domain.  This is typically how you'd
> run Windows or other unmodified operating systems.
>
> All of this is stuff that's necessary to support any PV Xen domain, and
> has been in the kernel for a long time now.
>
>
> The additions I'm proposing now are those needed for a Xen domain to
> control the physical hardware, in order to provide virtual device
> support for other less-privileged domains.  These changes affect a few
> areas:
>
>     * interrupts: mapping a device interrupt into an event channel for
>       delivery to the domain with the device driver for that interrupt
>     * mappings: allowing direct hardware mapping of device memory into a
>       domain
>     * dma: making sure that hardware gets programmed with machine memory
>       addresses, not virtual ones, and that pages are machine-contiguous 
>       when expected
>
> Interrupts require a few hooks into the x86 APIC code, but the end
> result is that hardware interrupts are delivered via event channels, but
> then they're mapped back to irqs and delivered normally (they even end
> up with the same irq number as they'd usually have).
>
> Device mappings are fairly easy to arrange.  I'm using a software pte
> bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping.  This
> bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu
> code just uses the pfn in the pte as-is, rather than doing the normal
> pfn->mfn translation.
>
> DMA is handled via the normal DMA API, with some hooks to swiotlb to
> make sure that the memory underlying its pools is really DMA-ready (ie,
> is contiguous and low enough in machine memory).
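The swiotlb idea referenced there is a bounce buffer: if a buffer's machine address is out of the device's reach, it is staged through a pool known to be low and machine-contiguous. A minimal sketch, with a hypothetical 32-bit device limit and illustrative addresses (the copy itself is elided):

```python
# Minimal bounce-buffer sketch: use the buffer directly when the
# device can address it, otherwise carve space from a low pool.
# Addresses, names, and the 4 GiB limit are illustrative.

DMA_LIMIT = 1 << 32             # device can only address 32 bits

bounce_pool = {"base": 0x00100000, "next": 0x00100000, "size": 1 << 20}

def map_for_dma(machine_addr, size):
    """Return an address the device may be handed for this buffer."""
    if machine_addr + size <= DMA_LIMIT:
        return machine_addr     # directly reachable: use as-is
    # Bounce: allocate from the low pool (data copy elided here).
    addr = bounce_pool["next"]
    bounce_pool["next"] += size
    return addr

assert map_for_dma(0x1000, 0x1000) == 0x1000             # no bounce
assert map_for_dma(DMA_LIMIT + 0x5000, 0x1000) == 0x00100000
```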
>
> The changes I'm proposing may look a bit strange from a purely x86
> perspective, but they fit in relatively well because they're not all
> that different from what other architectures require, and so the
> kernel-wide infrastructure is mostly already in place.
>
>
> I hope that helps clarify what I'm trying to do here, and why Xen and
> KVM do have distinct roles to play.

Thanks, it's very informative to me and hopefully helps others with
the discussion (I don't pretend to be able to judge whether your dom0
patches should be merged or not! :)). I'll continue to read with
interest.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  6:37           ` Nick Piggin
@ 2009-03-02  8:05             ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-02  8:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Nick Piggin wrote:
>> Those would be pertinent questions if I were suddenly popping up and
>> saying "hey, let's add Xen support to the kernel!"  But Xen support has
>> been in the kernel for well over a year now, and is widely used, enabled
>> in distros, etc.  The patches I'm proposing here are not a whole new
>> thing, they're part of the last 10% to fill out the kernel's support to
>> make it actually useful.
>>     
>
> As a guest, I guess it has been agreed that guest support for all
> different hypervisors is "a good thing". dom0 is more like a piece
> of the hypervisor itself, right?
>   

Hm, I wouldn't put it like that.  dom0 is no more part of the hypervisor 
than the hypervisor is part of dom0.  The hypervisor provides one set of 
services (domain isolation and multiplexing).  Domains with direct 
hardware access and drivers provide arbitration for virtualized device 
access.  They provide orthogonal sets of functionality which are both 
required to get a working system.

Also, the machinery needed to allow a kernel to operate as dom0 is more 
than that: it allows direct access to hardware in general.  An otherwise 
unprivileged domU can be given access to a specific PCI device via 
PCI-passthrough so that it can drive it directly.  This is often used 
for direct access to 3D hardware, or high-performance networking (esp 
with multi-context hardware that's designed for virtualization use).

>> Because Xen is dedicated to just running virtual machines, its internal
>> architecture can be more heavily oriented towards that task, which
>> affects things from how its scheduler works, its use and multiplexing of
>> physical memory.  For example, Xen manages to use new hardware
>> virtualization features pretty quickly, partly because it doesn't need
>> to trade-off against normal kernel functions.  The clear distinction
>> between the privileged hypervisor and the rest of the domains makes the
>> security people happy as well.  Also, because Xen is small and fairly
>> self-contained, there's quite a few hardware vendors shipping it burned
>> into the firmware so that it really is the first thing to boot (many of
>> instant-on features that laptops have are based on Xen).  Both HP and
>> Dell, at least, are selling servers with Xen pre-installed in the firmware.
>>     
>
> That would kind of seem like Xen has a better design to me, OTOH if it
> needs this dom0 for most device drivers and things, then how much
> difference is it really? Is KVM really disadvantaged by being a part of
> the kernel?
>   

Well, you can lump everything together in dom0 if you want, and that is 
a common way to run a Xen system.  But there's no reason you can't 
disaggregate drivers into their own domains, each with the 
responsibility for a particular device or set of devices (or indeed, any 
other service you want provided).  Xen can use hardware features like 
VT-d to really enforce the partitioning so that the domains can't 
program their hardware to touch anything except what they're allowed to 
touch, so nothing is trusted beyond its actual area of responsibility.  
It also means that killing off and restarting a driver domain is a 
fairly lightweight and straightforward operation because the state is 
isolated and self-contained; guests using a device have to be able to 
deal with a disconnect/reconnect anyway (for migration), so it doesn't 
affect them much.  Part of the reason there's a lot of academic interest 
in Xen is because it has the architectural flexibility to try out lots 
of different configurations.

I wouldn't say that KVM is necessarily disadvantaged by its design; it's 
just a particular set of tradeoffs made up-front.  It loses Xen's 
flexibility, but the result is very familiar to Linux people.  A guest 
domain just looks like a qemu process that happens to run in a strange 
processor mode a lot of the time.  The qemu process provides virtual 
device access to its domain, and accesses the normal device drivers like 
any other usermode process would.  The domains are as isolated from each 
other as much as processes normally are, but they're all floating around 
in the same kernel; whether that provides enough isolation for whatever 
technical, billing, security, compliance/regulatory or other 
requirements you have is up to the user to judge.

>> One important area of paravirtualization is that Xen guests directly
>> use the processor's pagetables; there is no shadow pagetable or use of
>> hardware pagetable nesting.  This means that a tlb miss is just a tlb
>> miss, and happens at full processor performance.  This is possible
>> because 1) pagetables are always read-only to the guest, and 2) the
>> guest is responsible for looking up in a table to map guest-local pfns
>> into machine-wide mfns before installing them in a pte.  Xen will check
>> that any new mapping or pagetable satisfies all the rules, by checking
>> that the writable reference count is 0, and that the domain owns (or has
>> been allowed access to) any mfn it tries to install in a pagetable.
>>     
>
> Xen's memory virtualization is pretty neat, I'll give it that. Is it
> faster than KVM on a modern CPU?

It really depends on the workload.  There are three cases to consider: 
software shadow pagetables, hardware nested pagetables, and Xen direct 
pagetables.  Even now, Xen's (highly optimised) shadow pagetable code 
generally out-performs modern nested pagetables, at least when running 
Windows (for which that code was most heavily tuned).  Shadow pagetables 
and nested pagetables will generally outperform direct pagetables when 
the workload does lots of pagetable updates compared to accesses.  (I 
don't know what the current state of kvm's shadow pagetable performance 
is, but it seems OK.)

But if you're mostly accessing the pagetable, direct pagetables still 
win.  On a tlb miss, it gets 4 memory accesses, whereas a nested 
pagetable tlb miss needs 24 memory accesses; and a nested tlb hit means 
that you have 24 tlb entries being tied up to service the hit, vs 4.  
(Though the chip vendors are fairly secretive about exactly how they 
structure their tlbs to deal with nested lookups, so I may be off 
here.)  (It also depends on whether you arrange to put the guest, host 
or both memory into large pages; doing so helps a lot.)
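The 4-versus-24 figure follows from the two-dimensional walk: with 4 levels, the walk touches 5 guest-physical addresses (4 table entries plus the final data address), and each needs its own 4-level host walk. A quick check of that arithmetic under the usual simplified model (no walk caches, no large pages):

```python
# Memory accesses for a full pagetable walk, in the standard
# two-dimensional-walk model (simplified: no walk caches or
# large pages, so this is an upper bound).

def native_walk(levels):
    return levels               # one access per pagetable level

def nested_walk(levels):
    # (levels + 1) guest-physical addresses are touched, each needing
    # a full host walk plus the access itself; subtract the final
    # data access, which isn't part of the walk.
    return (levels + 1) ** 2 - 1

assert native_walk(4) == 4      # direct (Xen PV) pagetables
assert nested_walk(4) == 24     # nested pagetables
```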

>  Would it be possible I wonder to make
> a MMU virtualization layer for CPUs without support, using Xen's page
> table protection methods, and have KVM use that? Or does that amount
> to putting a significant amount of Xen hypervisor into the kernel..?
>   

At one point Avi was considering doing it, but I don't think he ever 
made any real effort in that direction.  KVM is pretty wedded to having 
hardware support anyway, so there's not much point in removing it in 
this one area.

The Xen technique gets its performance from collapsing a level of 
indirection, but that has a cost in terms of flexibility; the hypervisor 
can't do as much mucking around behind the guest's back (for example, 
the guest sees real hardware memory addresses in the form of mfns, so 
Xen can't move pages around, at least not without some form of explicit 
synchronisation).

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
@ 2009-03-02  8:05             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-02  8:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Xen-devel, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Nick Piggin wrote:
>> Those would be pertinent questions if I were suddenly popping up and
>> saying "hey, let's add Xen support to the kernel!"  But Xen support has
>> been in the kernel for well over a year now, and is widely used, enabled
>> in distros, etc.  The patches I'm proposing here are not a whole new
>> thing, they're part of the last 10% to fill out the kernel's support to
>> make it actually useful.
>>     
>
> As a guest, I guess it has been agreed that guest support for all
> different hypervisors is "a good thing". dom0 is more like a piece
> of the hypervisor itself, right?
>   

Hm, I wouldn't put it like that.  dom0 is no more part of the hypervisor 
than the hypervisor is part of dom0.  The hypervisor provides one set of 
services (domain isolation and multiplexing).  Domains with direct 
hardware access and drivers provide arbitration for virtualized device 
access.  They provide orthogonal sets of functionality which are both 
required to get a working system.

Also, the machinery needed to allow a kernel to operate as dom0 is more 
than that: it allows direct access to hardware in general.  An otherwise 
unprivileged domU can be given access to a specific PCI device via 
PCI-passthrough so that it can drive it directly.  This is often used 
for direct access to 3D hardware, or high-performance networking (esp 
with multi-context hardware that's designed for virtualization use).

>> Because Xen is dedicated to just running virtual machines, its internal
>> architecture can be more heavily oriented towards that task, which
>> affects things from how its scheduler works to its use and multiplexing of 
>> physical memory.  For example, Xen manages to use new hardware
>> virtualization features pretty quickly, partly because it doesn't need
>> to trade-off against normal kernel functions.  The clear distinction
>> between the privileged hypervisor and the rest of the domains makes the
>> security people happy as well.  Also, because Xen is small and fairly
>> self-contained, there's quite a few hardware vendors shipping it burned
>> into the firmware so that it really is the first thing to boot (many of the 
>> instant-on features that laptops have are based on Xen).  Both HP and
>> Dell, at least, are selling servers with Xen pre-installed in the firmware.
>>     
>
> That would kind of seem like Xen has a better design to me, OTOH if it
> needs this dom0 for most device drivers and things, then how much
> difference is it really? Is KVM really disadvantaged by being a part of
> the kernel?
>   

Well, you can lump everything together in dom0 if you want, and that is 
a common way to run a Xen system.  But there's no reason you can't 
disaggregate drivers into their own domains, each with the 
responsibility for a particular device or set of devices (or indeed, any 
other service you want provided).  Xen can use hardware features like 
VT-d to really enforce the partitioning so that the domains can't 
program their hardware to touch anything except what they're allowed to 
touch, so nothing is trusted beyond its actual area of responsibility.  
It also means that killing off and restarting a driver domain is a 
fairly lightweight and straightforward operation because the state is 
isolated and self-contained; guests using a device have to be able to 
deal with a disconnect/reconnect anyway (for migration), so it doesn't 
affect them much.  Part of the reason there's a lot of academic interest 
in Xen is because it has the architectural flexibility to try out lots 
of different configurations.

I wouldn't say that KVM is necessarily disadvantaged by its design; it's 
just a particular set of tradeoffs made up-front.  It loses Xen's 
flexibility, but the result is very familiar to Linux people.  A guest 
domain just looks like a qemu process that happens to run in a strange 
processor mode a lot of the time.  The qemu process provides virtual 
device access to its domain, and accesses the normal device drivers like 
any other usermode process would.  The domains are as isolated from each 
other as processes normally are, but they're all floating around 
in the same kernel; whether that provides enough isolation for whatever 
technical, billing, security, compliance/regulatory or other 
requirements you have is up to the user to judge.

>> One important area of paravirtualization is that Xen guests directly
>> use the processor's pagetables; there is no shadow pagetable or use of
>> hardware pagetable nesting.  This means that a tlb miss is just a tlb
>> miss, and happens at full processor performance.  This is possible
>> because 1) pagetables are always read-only to the guest, and 2) the
>> guest is responsible for looking up in a table to map guest-local pfns
>> into machine-wide mfns before installing them in a pte.  Xen will check
>> that any new mapping or pagetable satisfies all the rules, by checking
>> that the writable reference count is 0, and that the domain owns (or has
>> been allowed access to) any mfn it tries to install in a pagetable.
>>     
>
> Xen's memory virtualization is pretty neat, I'll give it that. Is it
> faster than KVM on a modern CPU?

It really depends on the workload.  There are three cases to consider: 
software shadow pagetables, hardware nested pagetables, and Xen direct 
pagetables.  Even now, Xen's (highly optimised) shadow pagetable code 
generally out-performs modern nested pagetables, at least when running 
Windows (for which that code was most heavily tuned).  Shadow pagetables 
and nested pagetables will generally outperform direct pagetables when 
the workload does lots of pagetable updates compared to accesses.  (I 
don't know what the current state of kvm's shadow pagetable performance 
is, but it seems OK.)

But if you're mostly accessing the pagetable, direct pagetables still 
win.  On a tlb miss, it gets 4 memory accesses, whereas a nested 
pagetable tlb miss needs 24 memory accesses; and a nested tlb hit means 
that you have 24 tlb entries being tied up to service the hit, vs 4.  
(Though the chip vendors are fairly secretive about exactly how they 
structure their tlbs to deal with nested lookups, so I may be off 
here.)  (It also depends on whether you arrange to put the guest, host 
or both memory into large pages; doing so helps a lot.)

>  Would it be possible I wonder to make
> a MMU virtualization layer for CPUs without support, using Xen's page
> table protection methods, and have KVM use that? Or does that amount
> to putting a significant amount of Xen hypervisor into the kernel..?
>   

At one point Avi was considering doing it, but I don't think he ever 
made any real effort in that direction.  KVM is pretty wedded to having 
hardware support anyway, so there's not much point in removing it in 
this one area.

The Xen technique gets its performance from collapsing a level of 
indirection, but that has a cost in terms of flexibility; the hypervisor 
can't do as much mucking around behind the guest's back (for example, 
the guest sees real hardware memory addresses in the form of mfns, so 
Xen can't move pages around, at least not without some form of explicit 
synchronisation).

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  8:05             ` Jeremy Fitzhardinge
@ 2009-03-02  8:19               ` Nick Piggin
  -1 siblings, 0 replies; 121+ messages in thread
From: Nick Piggin @ 2009-03-02  8:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

On Monday 02 March 2009 19:05:10 Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > That would kind of seem like Xen has a better design to me, OTOH if it
> > needs this dom0 for most device drivers and things, then how much
> > difference is it really? Is KVM really disadvantaged by being a part of
> > the kernel?
>
> Well, you can lump everything together in dom0 if you want, and that is
> a common way to run a Xen system.  But there's no reason you can't
> disaggregate drivers into their own domains, each with the
> responsibility for a particular device or set of devices (or indeed, any
> other service you want provided).  Xen can use hardware features like
> VT-d to really enforce the partitioning so that the domains can't
> program their hardware to touch anything except what they're allowed to
> touch, so nothing is trusted beyond its actual area of responsibility.
> It also means that killing off and restarting a driver domain is a
> fairly lightweight and straightforward operation because the state is
> isolated and self-contained; guests using a device have to be able to
> deal with a disconnect/reconnect anyway (for migration), so it doesn't
> affect them much.  Part of the reason there's a lot of academic interest
> in Xen is because it has the architectural flexibility to try out lots
> of different configurations.
>
> I wouldn't say that KVM is necessarily disadvantaged by its design; its
> just a particular set of tradeoffs made up-front.  It loses Xen's
> flexibility, but the result is very familiar to Linux people.  A guest
> domain just looks like a qemu process that happens to run in a strange
> processor mode a lot of the time.  The qemu process provides virtual
> device access to its domain, and accesses the normal device drivers like
> any other usermode process would.  The domains are as isolated from each
> other as processes normally are, but they're all floating around
> in the same kernel; whether that provides enough isolation for whatever
> technical, billing, security, compliance/regulatory or other
> requirements you have is up to the user to judge.

Well what is the advantage of KVM? Just that it is integrated into
the kernel? Can we look at the argument the other way around and
ask why Xen can't replace KVM? (is it possible to make use of HW
memory virtualization in Xen?) The hypervisor is GPL, right?


> >  Would it be possible I wonder to make
> > a MMU virtualization layer for CPUs without support, using Xen's page
> > table protection methods, and have KVM use that? Or does that amount
> > to putting a significant amount of Xen hypervisor into the kernel..?
>
> At one point Avi was considering doing it, but I don't think he ever
> made any real effort in that direction.  KVM is pretty wedded to having
> hardware support anyway, so there's not much point in removing it in
> this one area.

Not removing it, but making it available as an alternative form of
"hardware supported" MMU virtualization. As you say if direct protected
page tables often are faster than existing HW solutoins anyway, then it
could be a win for KVM even on newer CPUs.


> The Xen technique gets its performance from collapsing a level of
> indirection, but that has a cost in terms of flexibility; the hypervisor
> can't do as much mucking around behind the guest's back (for example,
> the guest sees real hardware memory addresses in the form of mfns, so
> Xen can't move pages around, at least not without some form of explicit
> synchronisation).

Any problem can be solved by adding another level of indirection... :)


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  8:19               ` Nick Piggin
  (?)
@ 2009-03-02  9:05               ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-02  9:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Nick Piggin wrote:
>> I wouldn't say that KVM is necessarily disadvantaged by its design; its
>> just a particular set of tradeoffs made up-front.  It loses Xen's
>> flexibility, but the result is very familiar to Linux people.  A guest
>> domain just looks like a qemu process that happens to run in a strange
>> processor mode a lot of the time.  The qemu process provides virtual
>> device access to its domain, and accesses the normal device drivers like
>> any other usermode process would.  The domains are as isolated from each
>> other as processes normally are, but they're all floating around
>> in the same kernel; whether that provides enough isolation for whatever
>> technical, billing, security, compliance/regulatory or other
>> requirements you have is up to the user to judge.
>>     
>
> Well what is the advantage of KVM? Just that it is integrated into
> the kernel? Can we look at the argument the other way around and
> ask why Xen can't replace KVM?

Xen was around before KVM was even a twinkle, so KVM is redundant from 
that perspective; they're certainly broadly equivalent in 
functionality.  But Xen has had a fairly fraught history with respect to 
being merged into the kernel, and being merged gets your feet into a lot 
of doors.  The upshot is that using Xen has generally required some 
preparation - like installing special kernels - before you can use it, 
and so tends to get used for servers which are specifically intended to 
be virtualized.  KVM runs like an accelerated qemu, so it's easy to just 
fire up an instance of windows in the middle of a normal Linux desktop 
session, with no special preparation.

But Xen is getting better at being on laptops and desktops, and doing 
all the things people expect there (power management, suspend/resume, 
etc).  And people are definitely interested in using KVM in server 
environments, so the lines are not very clear any more.

(Of course, we're completely forgetting VMI in all this, but VMware seem 
to have as well.  And we're all waiting for Rusty to make his World 
Domination move.)

>  (is it possible to make use of HW
> memory virtualization in Xen?)

Yes, Xen will use all available hardware features when running hvm 
domains (== fully virtualized == Windows).

>  The hypervisor is GPL, right?
>   

Yep.

>>>  Would it be possible I wonder to make
>>> a MMU virtualization layer for CPUs without support, using Xen's page
>>> table protection methods, and have KVM use that? Or does that amount
>>> to putting a significant amount of Xen hypervisor into the kernel..?
>>>       
>> At one point Avi was considering doing it, but I don't think he ever
>> made any real effort in that direction.  KVM is pretty wedded to having
>> hardware support anyway, so there's not much point in removing it in
>> this one area.
>>     
>
> Not removing it, but making it available as an alternative form of
> "hardware supported" MMU virtualization. As you say if direct protected
> page tables often are faster than existing HW solutoins anyway, then it
> could be a win for KVM even on newer CPUs.
>   

Well, yes.  I'm sure it will make someone a nice little project.  It 
should be fairly easy to try out - all the hooks are in place, so it's 
just a matter of implementing the kvm bits.  But it probably wouldn't be 
a comfortable fit with the rest of Linux; all the memory mapped via 
direct pagetables would be solidly pinned down, completely unswappable, 
giving the VM subsystem much less flexibility about allocating 
resources.  I guess it would be no worse than a multi-hundred 
megabyte/gigabyte process mlocking itself down, but I don't know if 
anyone actually does that.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  7:20       ` Ingo Molnar
@ 2009-03-02  9:26         ` Gerd Hoffmann
  -1 siblings, 0 replies; 121+ messages in thread
From: Gerd Hoffmann @ 2009-03-02  9:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
>>> In three years time, will we regret having merged this?
>> Its a pretty minor amount of extra stuff on top of what's been 
>> added over the last 3 years, so I don't think it's going to 
>> tip the scales on its own.  I wouldn't be comfortable in 
>> trying to merge something that's very intrusive.
> 
> Hm, how can the same code that you call "massive out-of-tree 
> patches which doesn't make anyone happy" in an out of tree 
> context suddenly become non-intrusive "minor amount of extra 
> stuff" in an upstream context?

The current, out-of-tree xen kernel stuff is based on 2.6.18.  That
predates pv_ops and is quite intrusive stuff, with a lot of cut+paste
programming and dirty hacks.

A lot has happened in x86 land since 2.6.18.  Being one of the x86 arch
maintainers you should know that very well.  Most notably:

  * pv_ops.  Point of adding these is to allow virtualization-friendly
    kernels *without* being intrusive as hell.
  * x86 arch merge, followed up by tons of cleanups and code
    reorganizations.  These changes also make it easier to merge xen
    support in a non-intrusive manner.

Also the xen support code in the linux kernel itself is basically a
rewrite from scratch; it hasn't much in common with the 2.6.18 code base.

> I wish the upstream kernel was able to do such magic, but i'm 
> afraid it is not.

It's no magic, it's a lot of hard work.

cheers,
  Gerd



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  9:26         ` Gerd Hoffmann
@ 2009-03-02 12:04           ` Ingo Molnar
  -1 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-02 12:04 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel


* Gerd Hoffmann <kraxel@redhat.com> wrote:

> Ingo Molnar wrote:
> >>> In three years time, will we regret having merged this?
> >> Its a pretty minor amount of extra stuff on top of what's been 
> >> added over the last 3 years, so I don't think it's going to 
> >> tip the scales on its own.  I wouldn't be comfortable in 
> >> trying to merge something that's very intrusive.
> > 
> > Hm, how can the same code that you call "massive out-of-tree 
> > patches which doesn't make anyone happy" in an out of tree 
> > context suddenly become non-intrusive "minor amount of extra 
> > stuff" in an upstream context?
> 
> The current, out-of-tree xen kernel stuff is based on 2.6.18. 
> [...]

Sure, but what i'm pointing out is the following aspect of 
communication:

>>> [...] At the moment its all running on massive out-of-tree 
>>> patches, which doesn't make anyone happy.  It's best that it 
>>> be in the mainline kernel.  You know, like we argue for 
>>> everything else.

Comparing it to a 2.6.18 base is simply misleading when it comes 
to upstreaming something. Enterprise distros will rebase, and 
their out-of-tree pile of patches will shrink.

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-02-28  9:46       ` Jeremy Fitzhardinge
@ 2009-03-02 12:08         ` Ingo Molnar
  -1 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-02 12:08 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> Personally i'd like to see a sufficient reply to the 
>> mmap-perf paravirt regressions pointed out by Nick and 
>> reproduced by myself as well. (They were in the 4-5% 
>> macro-performance range iirc, which is huge.)
>>
>> So i havent seen any real progress on reducing native kernel 
>> overhead with paravirt. Patches were sent but no measurements 
>> were done and it seemed to have all fizzled out while the 
>> dom0 patches are being pursued.
>>   
>
> Hm, I'm not sure what you want me to do here.  I sent out 
> patches, they got merged, I posted the results of my 
> measurements showing that the patches made a substantial 
> improvement.  I'd love to see confirmation from others that 
> the patches help them, but I don't think you can say I've been 
> unresponsive about this.

Have i missed a mail of yours perhaps? I dont have any track of 
you having posted mmap-perf perfcounters results. I grepped my 
mbox and the last mail i saw from you containing the string 
"mmap-perf" is from January 20, and it only includes my numbers.

What i'd expect you to do is to proactively measure the 
CONFIG_PARAVIRT overhead of the native kernel, and analyze 
and address the results. Not just minimalistically reply to my 
performance measurements - as that does not really scale in the 
long run.

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02 12:04           ` Ingo Molnar
@ 2009-03-02 12:26             ` Gerd Hoffmann
  -1 siblings, 0 replies; 121+ messages in thread
From: Gerd Hoffmann @ 2009-03-02 12:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andrew Morton, H. Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> * Gerd Hoffmann <kraxel@redhat.com> wrote:
>> The current, out-of-tree xen kernel stuff is based on 2.6.18. 
>> [...]
> 
> Sure, but what i'm pointing out is the following aspect of 
> communication:
> 
>>>> [...] At the moment its all running on massive out-of-tree 
>>>> patches, which doesn't make anyone happy.  It's best that it 
>>>> be in the mainline kernel.  You know, like we argue for 
>>>> everything else.
> 
> Comparing it to a 2.6.18 base is simply misleading when it comes 
> to upstreaming something. Enterprise distros will rebase, and 
> their out-of-tree pile of patches will shrink.

I think Jeremy refers to the 2.6.18 kernel though.  And IMHO it isn't
misleading as this is the only option for a dom0 kernel.  Well, was
until very recently, now you can also run the latest pv_ops bits with
dom0 support.  That is still very young code though and I wouldn't use
that (yet) for production systems.  Works fine on my development box though.

The bits needed for pv_ops based dom0 support in arch/x86 are small
compared to what is already there for domU support.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  6:37           ` Nick Piggin
@ 2009-03-04 17:31             ` Anthony Liguori
  -1 siblings, 0 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-03-04 17:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jeremy Fitzhardinge, Xen-devel, Andrew Morton,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	H. Peter Anvin

Nick Piggin wrote:
> On Monday 02 March 2009 10:27:29 Jeremy Fitzhardinge wrote:

>> One important area of paravirtualization is that Xen guests directly
>> use the processor's pagetables; there is no shadow pagetable or use of
>> hardware pagetable nesting.  This means that a tlb miss is just a tlb
>> miss, and happens at full processor performance.  This is possible
>> because 1) pagetables are always read-only to the guest, and 2) the
>> guest is responsible for looking up in a table to map guest-local pfns
>> into machine-wide mfns before installing them in a pte.  Xen will check
>> that any new mapping or pagetable satisfies all the rules, by checking
>> that the writable reference count is 0, and that the domain owns (or has
>> been allowed access to) any mfn it tries to install in a pagetable.
> 
> Xen's memory virtualization is pretty neat, I'll give it that. Is it
> faster than KVM on a modern CPU?

There is nothing architecturally that prevents KVM from making use of 
Direct Paging.  KVM doesn't use Direct Paging because we don't expect it 
to be worth it.  Modern CPUs (Barcelona and Nehalem class) include 
hardware support for MMU virtualization (via NPT and EPT respectively).

I think that for the most part (especially with large page backed 
guests), there's wide agreement that even within the context of Xen, 
NPT/EPT often beats PV performance.  TLB miss overhead increases due to 
additional memory accesses but this is largely mitigated by large pages 
(see Ben Serebrin's SOSP paper from a couple years ago).

> Would it be possible I wonder to make
> a MMU virtualization layer for CPUs without support, using Xen's page
> table protection methods, and have KVM use that? Or does that amount
> to putting a significant amount of Xen hypervisor into the kernel..?

There are various benchmarks out there (check KVM Forum and Xen Summit 
presentations) showing NPT/EPT beating Direct Paging, but FWIW direct 
paging could be implemented in KVM.

A really unfortunate aspect of direct paging is that it requires the 
guest to know the host physical addresses.  This requires the guest to 
cooperate when doing any fancy memory tricks (live migration, 
save/restore, swapping, page sharing, etc.).  This introduces guest code 
paths to ensure that things like live migration work, which is extremely 
undesirable.

FWIW, I'm not arguing against taking the Xen dom0 patches.  Just pointing 
out that direct paging is orthogonal to the architectural differences 
between Xen and KVM.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02  8:05             ` Jeremy Fitzhardinge
@ 2009-03-04 17:34               ` Anthony Liguori
  -1 siblings, 0 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-03-04 17:34 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Xen-devel, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:

> It really depends on the workload.  There's three cases to consider: 
> software shadow pagetables, hardware nested pagetables, and Xen direct 
> pagetables.  Even now, Xen's (highly optimised) shadow pagetable code 
> generally out-performs modern nested pagetables, at least when running 
> Windows (for which that code was most heavily tuned).

Can you point to benchmarks?  I have a hard time believing this.

How can shadow paging beat nested paging assuming the presence of large 
pages?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-04 17:34               ` Anthony Liguori
@ 2009-03-04 17:38                 ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-04 17:38 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Nick Piggin, Xen-devel, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
>> Nick Piggin wrote:
>
>> It really depends on the workload.  There's three cases to consider: 
>> software shadow pagetables, hardware nested pagetables, and Xen 
>> direct pagetables.  Even now, Xen's (highly optimised) shadow 
>> pagetable code generally out-performs modern nested pagetables, at 
>> least when running Windows (for which that code was most heavily tuned).
>
> Can you point to benchmarks?  I have a hard time believing this.

Erm, not that I know of off-hand.  I don't really have any interest in 
Windows performance, so I'm reduced to repeating (highly reliable) Xen 
Summit corridor chat.

> How can shadow paging beat nested paging assuming the presence of 
> large pages? 

I think large pages do turn the tables, and it's close to parity with 
shadow paging with 4k pages on recent CPUs.  But see above for the reliability of 
that info.

    J


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-01 23:27         ` Jeremy Fitzhardinge
  (?)
  (?)
@ 2009-03-04 19:03         ` Anthony Liguori
  2009-03-04 19:16           ` H. Peter Anvin
  -1 siblings, 1 reply; 121+ messages in thread
From: Anthony Liguori @ 2009-03-04 19:03 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nick Piggin, Xen-devel, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, H. Peter Anvin

Jeremy Fitzhardinge wrote:

> OK, fair point, it's probably time for another Xen architecture refresher 
> post.
> 
> There are two big architectural differences between Xen and KVM:
> 
> Firstly, Xen has a separate hypervisor whose primary role is to context 
> switch between the guest domains (virtual machines).   The hypervisor is 
> relatively small and single purpose.  It doesn't, for example, contain 
> any device drivers or even much knowledge of things like pci buses and 
> their structure.  The domains themselves are more or less peers; some 
> are more privileged than others, but from Xen's perspective they are 
> more or less equivalent.  The first domain, dom0, is special because it's 
> started by Xen itself, and has some inherent initial privileges; its 
> main job is to start other domains, and it also typically provides 
> virtualized/multiplexed device services to other domains via a 
> frontend/backend split driver structure.
> 
> KVM, on the other hand, builds all the hypervisor stuff into the kernel 
> itself, so you end up with a kernel which does all the normal kernel 
> stuff, and can run virtual machines by making them look like slightly 
> strange processes.
> 
> Because Xen is dedicated to just running virtual machines, its internal 
> architecture can be more heavily oriented towards that task, which 
> affects everything from how its scheduler works to its use and multiplexing of 
> physical memory.  For example, Xen manages to use new hardware 
> virtualization features pretty quickly, partly because it doesn't need 
> to trade-off against normal kernel functions.  The clear distinction 
> between the privileged hypervisor and the rest of the domains makes the 
> security people happy as well.  Also, because Xen is small and fairly 
> self-contained, there's quite a few hardware vendors shipping it burned 
> into the firmware so that it really is the first thing to boot (many of 
> the instant-on features that laptops have are based on Xen).  Both HP and 
> Dell, at least, are selling servers with Xen pre-installed in the firmware.

I think this is a bit misleading.  I think you can understand the true 
differences between Xen and KVM by s/hypervisor/Operating System/. 
Fundamentally, a hypervisor is just an operating system that provides a 
hardware-like interface to its processes.

Today, the Xen operating system does not have that many features so it 
requires a special process (domain-0) to drive hardware.  It uses Linux 
for this and it happens that the Linux domain-0 has full access to all 
system resources so there is absolutely no isolation between Xen and 
domain-0.  The domain-0 guest is like a Linux userspace process with 
access to an old-style /dev/mem.

You can argue that in theory, one could build a small, decoupled 
domain-0, but you could also do this, in theory, with Linux and KVM.  It 
is not necessary to have all of your device drivers in your Linux 
kernel.  You could build an initramfs that passed all PCI devices 
through (via VT-d) to a single guest, and then provided an interface to 
allow that guest to create more guests.  This is essentially what dom0 
support is.

The real difference between KVM and Xen is that Xen is a separate 
Operating System dedicated to virtualization.  In many ways, it's a fork 
of Linux since it uses quite a lot of Linux code.

The argument for Xen as a separate OS is no different than the argument 
for a dedicated Real Time Operating System, a dedicated OS for embedded 
systems, or a dedicated OS for a very large system.

Having the distros ship Xen was a really odd thing from a Linux 
perspective.  It's as if Red Hat started shipping VXworks with a Linux 
emulation layer as Real Time Linux.

The arguments for dedicated OSes are well-known.  You can do a better 
scheduler for embedded/real-time/large systems.  You can do a better 
memory allocator for embedded/real-time/large systems.  These are the 
arguments that are made for Xen.

In theory, Xen, the hypervisor, could be merged with upstream Linux but 
there are certainly no parties interested in that currently.

My point is not to rail on Xen, but to point out that there isn't really 
a choice to be made here from a Linux perspective.  It's like saying do 
we really need FreeBSD and Linux, maybe those FreeBSD guys should just 
merge with Linux.  It's not going to happen.

KVM turns Linux into a hypervisor by adding virtualization support.  Xen 
is a separate hypervisor.

So the real discussion shouldn't be should KVM and Xen converge because 
it really doesn't make sense.  It's whether it makes sense for upstream 
Linux to support being a domain-0 guest under the Xen hypervisor.

Regards,

Anthony Liguori

> 
> The second big difference is the use of paravirtualization.  Xen can 
> securely virtualize a machine without needing any particular hardware 
> support.  Xen works well on any post-P6 or any ia64 machine, without 
> needing any virtualization hardware support.  When Xen runs a kernel in 
> paravirtualized mode, it runs the kernel in an unprivileged processor 
> state.  This allows the hypervisor to vet all the guest kernel's 
> privileged operations, which are carried out either via hypercalls 
> or by memory shared between each guest and Xen.
> 
> By contrast, KVM relies on at least VT/SVM (and whatever the ia64 equiv 
> is called) being available in the CPUs, and needs the most modern of 
> hardware to get the best performance.
> 
> One important area of paravirtualization is that Xen guests directly 
> use the processor's pagetables; there is no shadow pagetable or use of 
> hardware pagetable nesting.  This means that a tlb miss is just a tlb 
> miss, and happens at full processor performance.  This is possible 
> because 1) pagetables are always read-only to the guest, and 2) the 
> guest is responsible for looking up in a table to map guest-local pfns 
> into machine-wide mfns before installing them in a pte.  Xen will check 
> that any new mapping or pagetable satisfies all the rules, by checking 
> that the writable reference count is 0, and that the domain owns (or has 
> been allowed access to) any mfn it tries to install in a pagetable.
> 
> The other interesting part of paravirtualization is the abstraction of 
> interrupts into event channels.  Each domain has a bit-array of 1024 
> bits which correspond to 1024 possible event channels.  An event channel 
> can have one of several sources, such as a timer virtual interrupt, an 
> inter-domain event, an inter-vcpu IPI, or mapped from a hardware 
> interrupt.  We end up mapping the event channels back to irqs and they 
> are delivered as normal interrupts as far as the rest of the kernel is 
> concerned.
> 
> The net result is that a paravirtualized Xen guest runs very close to 
> full speed.  Workloads which modify live pagetables a lot take a bit of 
> a performance hit (since the pte updates have to trap to the hypervisor 
> for validation), but in general this is not a huge deal.  Hardware 
> support for nested pagetables is only just beginning to get close to 
> getting performance parity, but with different tradeoffs (pagetable 
> updates are cheap, but tlb misses are much more expensive, and hits 
> consume more tlb entries).
> 
> Xen can also make full use of whatever hardware virtualization features 
> are available when running an "hvm" domain.  This is typically how you'd 
> run Windows or other unmodified operating systems.
> 
> All of this is stuff that's necessary to support any PV Xen domain, and 
> has been in the kernel for a long time now.
> 
> 
> The additions I'm proposing now are those needed for a Xen domain to 
> control the physical hardware, in order to provide virtual device 
> support for other less-privileged domains.  These changes affect a few 
> areas:
> 
>    * interrupts: mapping a device interrupt into an event channel for
>      delivery to the domain with the device driver for that interrupt
>    * mappings: allowing direct hardware mapping of device memory into a
>      domain
>    * dma: making sure that hardware gets programmed with machine memory
>      addresses, not virtual ones, and that pages are machine-contiguous
>      when expected
> 
> Interrupts require a few hooks into the x86 APIC code, but the end 
> result is that hardware interrupts are delivered via event channels, but 
> then they're mapped back to irqs and delivered normally (they even end 
> up with the same irq number as they'd usually have).
> 
> Device mappings are fairly easy to arrange.  I'm using a software pte 
> bit, _PAGE_IOMAP, to indicate that a mapping is a device mapping.  This 
> bit is set by things like ioremap() and remap_pfn_range, and the Xen mmu 
> code just uses the pfn in the pte as-is, rather than doing the normal 
> pfn->mfn translation.
> 
> DMA is handled via the normal DMA API, with some hooks to swiotlb to 
> make sure that the memory underlying its pools is really DMA-ready (ie, 
> is contiguous and low enough in machine memory).
> 
> The changes I'm proposing may look a bit strange from a purely x86 
> perspective, but they fit in relatively well because they're not all 
> that different from what other architectures require, and so the 
> kernel-wide infrastructure is mostly already in place.
> 
> 
> I hope that helps clarify what I'm trying to do here, and why Xen and 
> KVM do have distinct roles to play.
> 
>    J


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-04 19:03         ` Anthony Liguori
@ 2009-03-04 19:16           ` H. Peter Anvin
  2009-03-04 19:33               ` Anthony Liguori
  0 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-04 19:16 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jeremy Fitzhardinge, Nick Piggin, Xen-devel, Andrew Morton,
	the arch/x86 maintainers, Linux Kernel Mailing List

Anthony Liguori wrote:
> 
> I think this is a bit misleading.  I think you can understand the true 
> differences between Xen and KVM by s/hypervisor/Operating System/. 
> Fundamentally, a hypervisor is just an operating system that provides a 
> hardware-like interface to it's processes.
> 
[...]

> 
> The real difference between KVM and Xen is that Xen is a separate 
> Operating System dedicated to virtualization.  In many ways, it's a fork 
> of Linux since it uses quite a lot of Linux code.
> 
> The argument for Xen as a separate OS is no different than the argument 
> for a dedicated Real Time Operating System, a dedicated OS for embedded 
> systems, or a dedicated OS for a very large system.
> 

In particular, Xen is a microkernel-type operating system.  The dom0 
model is a classic single-server, in the style of Mach.  A lot of the 
"Xen could use a distributed dom0" arguments were also done with Mach 
("the real goal is a multi-server") but such a system never materialized 
(Hurd was supposed to be one.)  Building multiservers is *hard*, and 
building multiservers which don't suck is even harder.

	-hpa

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-04 19:16           ` H. Peter Anvin
@ 2009-03-04 19:33               ` Anthony Liguori
  0 siblings, 0 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-03-04 19:33 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeremy Fitzhardinge, Nick Piggin, Xen-devel, Andrew Morton,
	the arch/x86 maintainers, Linux Kernel Mailing List

H. Peter Anvin wrote:
> In particular, Xen is a microkernel-type operating system.  The dom0 
> model is a classic single-server, in the style of Mach.  A lot of the 
> "Xen could use a distributed dom0" arguments were also done with Mach 
> ("the real goal is a multi-server") but such a system never 
> materialized (Hurd was supposed to be one.)  Building multiservers is 
> *hard*, and building multiservers which don't suck is even harder.

A lot of the core Xen concepts (domains, event channels, etc.) were 
present in the Nemesis[1] exo-kernel project.

Two other interesting papers on the subject are "Are virtual machine monitors 
microkernels done right?"[2] from the Xen folks and a rebuttal from the 
l4ka group[3].

[1] http://www.cl.cam.ac.uk/research/srg/netos/old-projects/nemesis/
[2] http://portal.acm.org/citation.cfm?id=1251124
[3] http://l4ka.org/publications/paper.php?docid=2189

Regards,

Anthony Liguori
>     -hpa


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Xen-devel] Re: [PATCH] xen: core dom0 support
  2009-03-04 17:34               ` Anthony Liguori
@ 2009-03-05 10:59                 ` George Dunlap
  -1 siblings, 0 replies; 121+ messages in thread
From: George Dunlap @ 2009-03-05 10:59 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jeremy Fitzhardinge, Nick Piggin, Xen-devel,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	H. Peter Anvin, Andrew Morton

On Wed, Mar 4, 2009 at 5:34 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> Can you point to benchmarks?  I have a hard time believing this.
>
> How can shadow paging beat nested paging assuming the presence of large
> pages?

If these benchmarks would help this discussion, we can certainly run
some.  As of last Fall, even with superpage support, certain workloads
perform significantly less well with HAP (hardware-assisted paging)
than with shadow pagetables.  Examples are specjbb, which does almost
no pagetable updates, but totally thrashes the TLB.  SysMark also
performed much better with shadow pagetables than HAP.  And of course,
64-bit is worse than 32-bit.  (It's actually a bit annoying from a
default-policy perspective, since about half of our workloads perform
better with HAP (up to 30% better) and half of them perform worse (up
to 30% worse)).

Our comparison would, of course, be comparing Xen+HAP to Xen+Shadow,
which isn't necessarily comparable to KVM+HAP.

Having HAP work well would be great for us as well as KVM.  But
there's still the argument about hardware support: Xen can run
paravirtualized VMs on hardware with no HVM support, and can run fully
virtualized domains very well on hardware that has HVM support but not
HAP support.

 -George Dunlap

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE: [PATCH] xen: core dom0 support
  2009-02-28  5:28 ` [PATCH] xen: core dom0 support Andrew Morton
  2009-02-28  6:52     ` Jeremy Fitzhardinge
  2009-02-28  8:42     ` Ingo Molnar
@ 2009-03-05 13:52   ` Morten P.D. Stevens
  2009-03-08 14:25     ` Manfred Knick
  2 siblings, 1 reply; 121+ messages in thread
From: Morten P.D. Stevens @ 2009-03-05 13:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, George.Dunlap

Hi,

I think Xen is a great virtualization technology.

Many companies work with Xen, and the performance is better than KVM's.

Here are some benchmarks of Citrix XenServer 5.0 vs. KVM with Linux and Windows guests on an IBM x3400 server:

HDD: XEN | KVM

Write: 110 MB/s | 60 MB/s
Read: 130 MB/s  | 80 MB/s

Network performance (downloading a 4 GB ISO image from an Apache web server):

XEN | KVM

Download speed: 105 MB/s | 50 MB/s

Xen saturates the full 1000 Mbit network; great performance!

On our IBM servers Xen is still faster than KVM.

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Andrew Morton
Sent: Saturday, February 28, 2009 6:28 AM
To: Jeremy Fitzhardinge
Cc: H. Peter Anvin; the arch/x86 maintainers; Linux Kernel Mailing List; Xen-devel
Subject: Re: [PATCH] xen: core dom0 support

On Fri, 27 Feb 2009 17:59:06 -0800 Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> This series implements the core parts of Xen dom0 support; that is, just
> enough to get the kernel started when booted by Xen as a dom0 kernel.

And what other patches can we expect to see to complete the xen dom0
support?


and..

I hate to be the one to say it, but we should sit down and work out
whether it is justifiable to merge any of this into Linux.  I think
it's still the case that the Xen technology is the "old" way and that
the world is moving off in the "new" direction, KVM?

In three years time, will we regret having merged this?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [Xen-devel] Re: [PATCH] xen: core dom0 support
  2009-03-05 10:59                 ` George Dunlap
@ 2009-03-05 14:37                   ` Anthony Liguori
  -1 siblings, 0 replies; 121+ messages in thread
From: Anthony Liguori @ 2009-03-05 14:37 UTC (permalink / raw)
  To: George Dunlap
  Cc: Jeremy Fitzhardinge, Nick Piggin, Xen-devel,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	H. Peter Anvin, Andrew Morton

George Dunlap wrote:
> On Wed, Mar 4, 2009 at 5:34 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
>   
>> Can you point to benchmarks?  I have a hard time believing this.
>>
>> How can shadow paging beat nested paging assuming the presence of large
>> pages?
>>     
>
> If these benchmarks would help this discussion, we can certainly run
> some.  As of last Fall, even with superpage support, certain workloads
> perform significantly less well with HAP (hardware-assisted paging)
> than with shadow pagetables.  One example is specjbb, which does almost
> no pagetable updates but totally thrashes the TLB.

I suspected specjbb was the benchmark.  specjbb is an anomaly in that 
it's really the only benchmark where even a naive shadow paging 
implementation performs very close to native.

specjbb also turns into a pathological case with HAP.  In my 
measurements, HAP with 4k pages was close to 70% of native for specjbb.  
Once you enable large pages though, you get pretty close to native.  
IIRC, around 95%.  I suspect that over time as the caching algorithms 
improve, this will approach 100% of native.

Then again, there are workloads like kernbench that are pathological for 
shadow paging in a much more dramatic way.  At least on shadow2, I was 
seeing around 60% of native with kernbench.  With direct paging, it goes 
to about 85% of native.  With NPT and large pages, it's almost 100% of 
native.

>   SysMark also
> performed much better with shadow pagetables than HAP.  And of course,
> 64-bit is worse than 32-bit.  (It's actually a bit annoying from a
> default-policy perspective, since about half of our workloads perform
> better with HAP (up to 30% better) and half of them perform worse (up
> to 30% worse)).
>
> Our comparison would, of course, be comparing Xen+HAP to Xen+Shadow,
> which isn't necessarily comparable to KVM+HAP.
>
> Having HAP work well would be great for us as well as KVM.  But
> there's still the argument about hardware support: Xen can run
> paravirtualized VMs on hardware with no HVM support, and can run fully
> virtualized domains very well on hardware that has HVM support but not
> HAP support.
>   

Xen is definitely not going away and as such, supporting it in Linux 
seems like a good idea to me.  I'm just refuting claims that the Xen 
architecture has intrinsic advantages wrt MMU virtualization.  It's 
simply not the case :-)

Regards,

Anthony Liguori

>  -George Dunlap
>   


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-02 12:08         ` Ingo Molnar
@ 2009-03-07  9:06           ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-07  9:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

Ingo Molnar wrote:
> Have i missed a mail of yours perhaps? I dont have any track of 
> you having posted mmap-perf perfcounters results. I grepped my 
> mbox and the last mail i saw from you containing the string 
> "mmap-perf" is from January 20, and it only includes my numbers.


Yes, I think you must have missed a mail. I've attached it for 
reference, along with a more complete set of measurements I made 
regarding the series of patches applied (series ending at 
1f4f931501e9270c156d05ee76b7b872de486304) to improve pvops performance.

My results showed a dramatic drop in cache references (from about 300% 
pvop vs non-pvop, down to 125% with the full set of patches applied), 
but it didn't seem to have much effect on the overall wallclock 
time. I'm a bit sceptical of the numbers here because, while each run's 
passes are fairly consistent, booting and remeasuring seemed to cause 
larger variations than the effects we're looking at. It would be easy to 
handwave it away with "cache effects", but it's not very satisfying.

I also didn't find the measurements very convincing because the CPU 
cycle and instruction counts are effectively unchanged (ie, the 
baseline non-pvops vs original pvops apparently execute exactly 
the same number of instructions, but we know that there's a lot more 
going on), and they show no change even though each added patch 
definitely removes some amount of pvops overhead in terms of 
instructions in the instruction stream. Is it just measuring usermode 
stats? I ran it as root, with the command line you suggested 
("./perfstat -e -5,-4,-3,0,1,2,3 ./mmap-perf 1"). Cache misses wandered 
up and down in a fairly non-intuitive way as well.

I'll do a rerun comparing current tip.git pvops vs non-pvops to see if I 
can get some better results.

J

[-- Attachment #2: pvops-mmap-measurements.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 20038 bytes --]

[-- Attachment #3: Attached Message --]
[-- Type: message/rfc822, Size: 51779 bytes --]

[-- Attachment #3.1.1: Type: text/plain, Size: 1640 bytes --]

Ingo Molnar wrote:
> ping?
>
> This is a very serious paravirt_ops slowdown affecting the native kernel's 
> performance to the tune of 5-10% in certain workloads.
>
> It's been about 2 years ago that paravirt_ops went upstream, when you told 
> us that something like this would never happen, that paravirt_ops is 
> designed so flexibly that it will never hinder the native kernel - and if 
> it does it will be easy to fix it. Now is the time to fulfill that 
> promise.

I couldn't exactly reproduce your results, but I guess they're similar 
in shape.  Comparing 2.6.29-rc2-nopv with -pvops, I saw this ratio (pass 
1-5).  Interestingly I'm seeing identical instruction counts for pvops 
vs non-pvops, and a lower cycle count.  The cache references are way up 
and the miss rate is up a bit, which I guess is the source of the slowdown.

With the attached patch, I get a clear improvement; it replaces the 
do-nothing pte_val/make_pte functions with inlined movs that move the 
argument to the return register, overpatching the 6-byte indirect call 
(on i386 it would just be all nopped out).  CPU cycles and cache misses 
are way down, and the tick count is down from ~5% worse to ~2%.  But the 
cache reference rate is even higher, which really doesn't make sense to 
me.  Still, the patch is a clear improvement, and it's hard to see how 
it could make anything worse (it's always going to replace an indirect 
call with simple inlined code).

(Full numbers in spreadsheet.)

I have a couple of other patches to reduce the register pressure of the 
pvops calls, but I'm trying to make sure it's not all too complex and/or 
fragile.

    J

[-- Attachment #3.1.2: pvops-mmap-measurements.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 30546 bytes --]

[-- Attachment #3.1.3: paravirt-ident.patch --]
[-- Type: text/plain, Size: 6903 bytes --]

Subject: x86/pvops: add paravirt_ident functions to allow special patching

Several paravirt ops implementations simply return their arguments,
the most obvious being the make_pte/pte_val class of operations on
native.

On 32-bit, the identity function is literally a no-op, as the calling
convention uses the same registers for the first argument and return.
On 64-bit, it can be implemented with a single "mov".

This patch adds special identity functions for 32- and 64-bit arguments,
and machinery to recognize them and replace them with either nops or a
mov as appropriate.

At the moment, the only users for the identity functions are the
pagetable entry conversion functions.

The result is a measurable improvement on pagetable-heavy benchmarks
(2-3%, reducing the pvops overhead from 5% to 2%).

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/include/asm/paravirt.h     |    5 ++
 arch/x86/kernel/paravirt.c          |   75 ++++++++++++++++++++++++++++++-----
 arch/x86/kernel/paravirt_patch_32.c |   12 +++++
 arch/x86/kernel/paravirt_patch_64.c |   15 +++++++
 4 files changed, 98 insertions(+), 9 deletions(-)

===================================================================
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -390,6 +390,8 @@
 	asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
 
 unsigned paravirt_patch_nop(void);
+unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len);
+unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len);
 unsigned paravirt_patch_ignore(unsigned len);
 unsigned paravirt_patch_call(void *insnbuf,
 			     const void *target, u16 tgt_clobbers,
@@ -1378,6 +1380,9 @@
 }
 
 void _paravirt_nop(void);
+u32 _paravirt_ident_32(u32);
+u64 _paravirt_ident_64(u64);
+
 #define paravirt_nop	((void *)_paravirt_nop)
 
 void paravirt_use_bytelocks(void);
===================================================================
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -44,6 +44,17 @@
 {
 }
 
+/* identity function, which can be inlined */
+u32 _paravirt_ident_32(u32 x)
+{
+	return x;
+}
+
+u64 _paravirt_ident_64(u64 x)
+{
+	return x;
+}
+
 static void __init default_banner(void)
 {
 	printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
@@ -138,9 +149,16 @@
 	if (opfunc == NULL)
 		/* If there's no function, patch it with a ud2a (BUG) */
 		ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
-	else if (opfunc == paravirt_nop)
+	else if (opfunc == _paravirt_nop)
 		/* If the operation is a nop, then nop the callsite */
 		ret = paravirt_patch_nop();
+
+	/* identity functions just return their single argument */
+	else if (opfunc == _paravirt_ident_32)
+		ret = paravirt_patch_ident_32(insnbuf, len);
+	else if (opfunc == _paravirt_ident_64)
+		ret = paravirt_patch_ident_64(insnbuf, len);
+
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
@@ -373,6 +391,45 @@
 #endif
 };
 
+typedef pte_t make_pte_t(pteval_t);
+typedef pmd_t make_pmd_t(pmdval_t);
+typedef pud_t make_pud_t(pudval_t);
+typedef pgd_t make_pgd_t(pgdval_t);
+
+typedef pteval_t pte_val_t(pte_t);
+typedef pmdval_t pmd_val_t(pmd_t);
+typedef pudval_t pud_val_t(pud_t);
+typedef pgdval_t pgd_val_t(pgd_t);
+
+
+#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE)
+/* 32-bit pagetable entries */
+#define paravirt_native_make_pte	(make_pte_t *)_paravirt_ident_32
+#define paravirt_native_pte_val		(pte_val_t *)_paravirt_ident_32
+
+#define paravirt_native_make_pmd	(make_pmd_t *)_paravirt_ident_32
+#define paravirt_native_pmd_val		(pmd_val_t *)_paravirt_ident_32
+
+#define paravirt_native_make_pud	(make_pud_t *)_paravirt_ident_32
+#define paravirt_native_pud_val		(pud_val_t *)_paravirt_ident_32
+
+#define paravirt_native_make_pgd	(make_pgd_t *)_paravirt_ident_32
+#define paravirt_native_pgd_val		(pgd_val_t *)_paravirt_ident_32
+#else
+/* 64-bit pagetable entries */
+#define paravirt_native_make_pte	(make_pte_t *)_paravirt_ident_64
+#define paravirt_native_pte_val		(pte_val_t *)_paravirt_ident_64
+
+#define paravirt_native_make_pmd	(make_pmd_t *)_paravirt_ident_64
+#define paravirt_native_pmd_val		(pmd_val_t *)_paravirt_ident_64
+
+#define paravirt_native_make_pud	(make_pud_t *)_paravirt_ident_64
+#define paravirt_native_pud_val		(pud_val_t *)_paravirt_ident_64
+
+#define paravirt_native_make_pgd	(make_pgd_t *)_paravirt_ident_64
+#define paravirt_native_pgd_val		(pgd_val_t *)_paravirt_ident_64
+#endif
+
 struct pv_mmu_ops pv_mmu_ops = {
 #ifndef CONFIG_X86_64
 	.pagetable_setup_start = native_pagetable_setup_start,
@@ -424,21 +481,21 @@
 	.pmd_clear = native_pmd_clear,
 #endif
 	.set_pud = native_set_pud,
-	.pmd_val = native_pmd_val,
-	.make_pmd = native_make_pmd,
+	.pmd_val = paravirt_native_pmd_val,
+	.make_pmd = paravirt_native_make_pmd,
 
 #if PAGETABLE_LEVELS == 4
-	.pud_val = native_pud_val,
-	.make_pud = native_make_pud,
+	.pud_val = paravirt_native_pud_val,
+	.make_pud = paravirt_native_make_pud,
 	.set_pgd = native_set_pgd,
 #endif
 #endif /* PAGETABLE_LEVELS >= 3 */
 
-	.pte_val = native_pte_val,
-	.pgd_val = native_pgd_val,
+	.pte_val = paravirt_native_pte_val,
+	.pgd_val = paravirt_native_pgd_val,
 
-	.make_pte = native_make_pte,
-	.make_pgd = native_make_pgd,
+	.make_pte = paravirt_native_make_pte,
+	.make_pgd = paravirt_native_make_pgd,
 
 	.dup_mmap = paravirt_nop,
 	.exit_mmap = paravirt_nop,
===================================================================
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -12,6 +12,18 @@
 DEF_NATIVE(pv_cpu_ops, clts, "clts");
 DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
 
+unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
+{
+	/* arg in %eax, return in %eax */
+	return 0;
+}
+
+unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
+{
+	/* arg in %edx:%eax, return in %edx:%eax */
+	return 0;
+}
+
 unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 		      unsigned long addr, unsigned len)
 {
===================================================================
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -19,6 +19,21 @@
 DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl");
 DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
 
+DEF_NATIVE(, mov32, "mov %edi, %eax");
+DEF_NATIVE(, mov64, "mov %rdi, %rax");
+
+unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
+{
+	return paravirt_patch_insns(insnbuf, len,
+				    start__mov32, end__mov32);
+}
+
+unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
+{
+	return paravirt_patch_insns(insnbuf, len,
+				    start__mov64, end__mov64);
+}
+
 unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
 		      unsigned long addr, unsigned len)
 {

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-07  9:06           ` Jeremy Fitzhardinge
@ 2009-03-08 11:01             ` Ingo Molnar
  -1 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-08 11:01 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, H. Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> Have i missed a mail of yours perhaps? I dont have any track of you 
>> having posted mmap-perf perfcounters results. I grepped my mbox and the 
>> last mail i saw from you containing the string "mmap-perf" is from 
>> January 20, and it only includes my numbers.
>
>
> Yes, I think you must have missed a mail. I've attached it for 
> reference, along with a more complete set of measurements I 
> made regarding the series of patches applied (series ending at 
> 1f4f931501e9270c156d05ee76b7b872de486304) to improve pvops 
> performance.

Yeah - indeed i missed those numbers - they were embedded in a 
spreadsheet document attached to the mail ;)

> My results showed a dramatic drop in cache references (from 
> about 300% pvop vs non-pvop, down to 125% with the full set of 
> patches applied), but it didn't seem to make much of an effect 
> on the overall wallclock time. I'm a bit sceptical of the 
> numbers here because, while each run's passes are fairly 
> consistent, booting and remeasuring seemed to cause larger 
> variations than we're looking at. It would be easy to handwave 
> it away with "cache effects", but its not very satisfying.

Well it's the L2 cache references which are being measured here, 
and the L2 cache is likely very large on your test-system. So we 
can easily run into associativity limits in the L1 cache while 
still being mostly in L2 cache otherwise.

Associativity effects do depend on the kernel image layout and 
on the precise kernel data structure allocations we do during 
bootup - and they don't really change after that.

> I also didn't find the measurements very convincing because 
> the number of CPU cycles and instructions executed count is 
> effectively unchanged (ie, the baseline non-pvops vs original 
> pvops apparently execute exactly the same number of 
> instructions, but we know that there's a lot more going on), 
> and with no change as each added patch definitely removes some 
> amount of pvops overhead in terms of instructions in the 
> instruction stream. Is it just measuring usermode stats? I ran 
> it as root, with the command line you suggested ("./perfstat 
> -e -5,-4,-3,0,1,2,3 ./mmap-perf 1"). Cache misses wandered up 
> and down in a fairly non-intuitive way as well.

It's measuring kernel stats too - and i very much saw the 
instruction count change to the tune of 10% or so.

> I'll do a rerun comparing current tip.git pvops vs non-pvops 
> to see if I can get some better results.

Thanks - i'll also try your patch on the same system i measured 
for my numbers so we'll have some comparison.

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-05 13:52   ` Morten P.D. Stevens
@ 2009-03-08 14:25     ` Manfred Knick
  2009-03-09 19:51       ` Morten P.D. Stevens
  2009-03-09 20:00       ` Morten P.D. Stevens
  0 siblings, 2 replies; 121+ messages in thread
From: Manfred Knick @ 2009-03-08 14:25 UTC (permalink / raw)
  To: linux-kernel

Morten P.D. Stevens <mstevens <at> win-professional.com> writes:

> Here are some benchmarks with Citrix XenServer 5.0 vs KVM with Linux and
> Windows guests on an IBM x3400 server:

Thanks!
Just out of curiosity, to complete the picture: do you perhaps have
the corresponding figures for e.g. VMware ESX(i) available?

Thanks in advance!
Manfred



^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-08 11:01             ` Ingo Molnar
  (?)
@ 2009-03-08 21:56             ` H. Peter Anvin
  2009-03-08 22:06                 ` Ingo Molnar
  -1 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-08 21:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> 
> Associativity effects do depend on the kernel image layout and 
> on the precise allocations of kernel data structure allocations 
> we do during bootup - and they dont really change after that.
> 

By the way, there is a really easy way (if a bit time-consuming) to get
the actual variability here -- you have to reboot between runs, even for
the same kernel.  It makes the data collection take a long time, but at
least it can be scripted.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-08 21:56             ` H. Peter Anvin
@ 2009-03-08 22:06                 ` Ingo Molnar
  0 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-08 22:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeremy Fitzhardinge, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* H. Peter Anvin <hpa@zytor.com> wrote:

> Ingo Molnar wrote:
> > 
> > Associativity effects do depend on the kernel image layout 
> > and on the precise allocations of kernel data structure 
> > allocations we do during bootup - and they dont really 
> > change after that.
> > 
> 
> By the way, there is a really easy way (if a bit time 
> consuming) to get the actual variability here -- you have to 
> reboot between runs, even for the same kernel.  It makes the 
> data collection take a long time, but at least it can be 
> scripted.

Since it's the same kernel image, I think the only truly reliable 
method would be to reboot between _different_ kernel images: 
same instructions, but with variables randomly re-aligned both in 
terms of absolute address and in terms of relative position to 
each other. Plus randomized bootmem allocs and never-really-freed 
boot-time allocations.

Really hard to do, I think ...

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-08 22:06                 ` Ingo Molnar
  (?)
@ 2009-03-08 22:08                 ` H. Peter Anvin
  2009-03-08 22:12                     ` Ingo Molnar
  -1 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-08 22:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jeremy Fitzhardinge, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> 
> Since it's the same kernel image i think the only truly reliable 
> method would be to reboot between _different_ kernel images: 
> same instructions but randomly re-align variables both in terms 
> of absolute address and in terms of relative position to each 
> other. Plus randomize bootmem allocs and never-gets-freed-really 
> boot-time allocations.
> 
> Really hard to do i think ...
> 

Ouch, yeah.

On the other hand, the numbers made sense to me, so I don't see why
there is any reason to distrust them.  They show a 5% overhead with
pv_ops enabled, reduced to a 2% overhead with the changes.  That is more
or less what would match my intuition from seeing the code.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-08 22:08                 ` H. Peter Anvin
@ 2009-03-08 22:12                     ` Ingo Molnar
  0 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-08 22:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeremy Fitzhardinge, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* H. Peter Anvin <hpa@zytor.com> wrote:

> Ingo Molnar wrote:
> > 
> > Since it's the same kernel image i think the only truly reliable 
> > method would be to reboot between _different_ kernel images: 
> > same instructions but randomly re-align variables both in terms 
> > of absolute address and in terms of relative position to each 
> > other. Plus randomize bootmem allocs and never-gets-freed-really 
> > boot-time allocations.
> > 
> > Really hard to do i think ...
> > 
> 
> Ouch, yeah.
> 
> On the other hand, the numbers made sense to me, so I don't 
> see why there is any reason to distrust them.  They show a 5% 
> overhead with pv_ops enabled, reduced to a 2% overhead with 
> the changes.  That is more or less what would match my 
> intuition from seeing the code.

Yeah - it was Jeremy who expressed doubt in the numbers, not me.

And we need to eliminate that 2% as well - 2% is still an awful 
lot of native kernel overhead from a kernel feature that 95%+ of 
users do not make any use of.

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] xen: core dom0 support
  2009-03-08 22:12                     ` Ingo Molnar
@ 2009-03-09 18:06                       ` Jeremy Fitzhardinge
  -1 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-09 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel

Ingo Molnar wrote:
> * H. Peter Anvin <hpa@zytor.com> wrote:
>
>   
>> Ingo Molnar wrote:
>>     
>>> Since it's the same kernel image i think the only truly reliable 
>>> method would be to reboot between _different_ kernel images: 
>>> same instructions but randomly re-align variables both in terms 
>>> of absolute address and in terms of relative position to each 
>>> other. Plus randomize bootmem allocs and never-gets-freed-really 
>>> boot-time allocations.
>>>
>>> Really hard to do i think ...
>>>
>>>       
>> Ouch, yeah.
>>
>> On the other hand, the numbers made sense to me, so I don't 
>> see why there is any reason to distrust them.  They show a 5% 
>> overhead with pv_ops enabled, reduced to a 2% overhead with 
>> the changes.  That is more or less what would match my 
>> intuition from seeing the code.
>>     
>
> Yeah - it was Jeremy who expressed doubt in the numbers, not me.
>   

Mainly because I was seeing the instruction and cycle counts completely 
unchanged from run to run, which is implausible.  They're not zero, so 
they're clearly measurements of *something*, but not cycles and 
instructions, since we know that they're changing.  So what are they 
measurements of?  And if they're not what they claim, are the other 
numbers more meaningful?

It's easy to read the numbers as confirmations of preconceived 
expectations of the outcomes, but that's - as I said - unsatisfying.

> And we need to eliminate that 2% as well - 2% is still an awful 
> lot of native kernel overhead from a kernel feature that 95%+ of 
> users do not make any use of.
>   

Well, I think there's a few points here:

   1. the test in question is a bit vague about kernel and user
      measurements.  I assume the stuff coming from perfcounters is
      kernel-only state, but the elapsed time includes the usermode
      component, and so will be affected by the usermode page placement
      and cache effects.  If I change the test to copy the test
      executable (statically linked, to avoid libraries), then that
      should at least fuzz out user page placement.
   2. It's true that the cache effects could be due to the precise layout
      of the kernel executable; but if those effects are swamping the
      effects of the changes to improve pvops, then it's unclear what the
      point of the exercise is.  Especially since:
   3. It is a config option, so if someone is sensitive to the
      performance hit and it gives them no useful functionality to
      offset it, then it can be disabled.  Distros tend to enable it
      because they tend to value function and flexibility over raw
      performance; they tend to enable things like audit, selinux,
      modules which all have performance hits of a similar scale (of
      course, you could argue that more people get benefit from those
      features to offset their costs).  But,
   4. I think you're underestimating the number of people who get
      benefit from pvops; the Xen userbase is actually pretty large, and
      KVM will use pvops hooks when available to improve Linux-as-guest.
   5. Also, we're looking at a single benchmark with no obvious
      relevance to a real workload.  Perhaps there are workloads which
      continuously mash mmap/munmap/mremap(!), but I think they're
      fairly rare.  Such a benchmark is useful for tuning specific
      areas, but if we're going to evaluate pvops overhead, it would be
      nice to use something a bit broader to base our measurements on. 
      Also, what weighting are we going to put on 32 vs 64 bit?  Equally
      important?  One more than the other?

All that said, I would like to get the pvops overhead down to 
unmeasurable - the ideal would be to be able to justify removing the 
config option altogether and leaving it always enabled.

The tradeoff, as always, is how much other complexity are we willing to 
stand to get there?  The addition of a new calling convention is already 
fairly esoteric, but so far it has got us a 60% reduction in overhead 
(in this test).  But going further is going to get more complex.

For example, the next step would be to attack set_pte (including 
set_pte_*, pte_clear, etc), to make them use the new calling convention, 
and possibly make them inlineable (ie, to get it as close as possible to 
the non-pvops case).  But that will require them to be implemented in 
asm (to guarantee that they only use the registers they're allowed to 
use), and we already have 3 variants of each for the different pagetable 
modes.  All completely doable, and not even very hard, but it will be 
just one more thing to maintain - we just need to be sure the payoff is 
worth it.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread

* RE:  Re: [PATCH] xen: core dom0 support
  2009-03-08 14:25     ` Manfred Knick
@ 2009-03-09 19:51       ` Morten P.D. Stevens
  2009-03-09 20:00       ` Morten P.D. Stevens
  1 sibling, 0 replies; 121+ messages in thread
From: Morten P.D. Stevens @ 2009-03-09 19:51 UTC (permalink / raw)
  To: Manfred Knick; +Cc: linux-kernel

Hi,

> Do you perhaps have the corresponding figures regarding e.g. VMware
> ESX(i) available?

Yes.

HDD:             Xen       | KVM     | ESXi

Write:           110 MB/s  | 60 MB/s | 35 MB/s
Read:            130 MB/s  | 80 MB/s | 160 MB/s

Network performance (downloading a 4 GB ISO image from an Apache web server):

                 Xen       | KVM     | ESXi

Download speed:  105 MB/s  | 50 MB/s | 43 MB/s

Overall, Xen is the best-performing virtualization platform on our IBM
servers.


Best regards,

Morten

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Manfred Knick
Sent: Sunday, March 08, 2009 3:25 PM
To: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] xen: core dom0 support

Morten P.D. Stevens <mstevens <at> win-professional.com> writes:

> Here are some benchmarks with Citrix XenServer 5.0 vs KVM with Linux and
> Windows guests on an IBM x3400 server:

Thanks!
Just out of curiosity, to complete your appreciated impression:
Do you perhaps have the corresponding figures regarding e.g. VMware
ESX(i) available?

Thanks in advance!
Manfred


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] paravirt/xen: add pvop for page_is_ram
  2009-02-28  1:59 ` [PATCH] paravirt/xen: add pvop for page_is_ram Jeremy Fitzhardinge
@ 2009-03-10  1:07   ` H. Peter Anvin
  2009-03-10 21:19       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-10  1:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge

Jeremy Fitzhardinge wrote:
> From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> 
> A guest domain may have external pages mapped into its address space,
> in order to share memory with other domains.  These shared pages are
> more akin to io mappings than real RAM, and should not pass the
> page_is_ram test.  Add a paravirt op for this so that a hypervisor
> backend can validate whether a page should be considered ram or not.
> 
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> 

Why are these pages mapped as RAM in the memory map?  Fixing the memory
map is the right way to handle that, not adding yet another bloody hook...

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-09 18:06                       ` Jeremy Fitzhardinge
@ 2009-03-10 12:44                         ` Ingo Molnar
  -1 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-10 12:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: H. Peter Anvin, Andrew Morton, the arch/x86 maintainers,
	Linux Kernel Mailing List, Xen-devel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>> Yeah - it was Jeremy who expressed doubt in the numbers, not me.
>
> Mainly because I was seeing the instruction and cycle counts 
> completely unchanged from run to run, which is implausible.  
> They're not zero, so they're clearly measurements of 
> *something*, but not cycles and instructions, since we know 
> that they're changing.  So what are they measurements of?  And 
> if they're not what they claim, are the other numbers more 
> meaningful?

The cycle count not changing in a macro-workload is not 
plausible. The instruction count not changing can happen 
sometimes - if the workload is deterministic (which this one is) 
and we happen to get exactly the same number of timer irqs 
during the test. But more commonly it varies slightly - 
especially on SMP, where task balancing can be timing-dependent 
and hence noisy.

	Ingo

^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] xen: core dom0 support
  2009-03-09 18:06                       ` Jeremy Fitzhardinge
@ 2009-03-10 12:49                         ` Nick Piggin
  -1 siblings, 0 replies; 121+ messages in thread
From: Nick Piggin @ 2009-03-10 12:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, H. Peter Anvin, Andrew Morton,
	the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel

On Tuesday 10 March 2009 05:06:40 Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > * H. Peter Anvin <hpa@zytor.com> wrote:
> >> Ingo Molnar wrote:
> >>> Since it's the same kernel image i think the only truly reliable
> >>> method would be to reboot between _different_ kernel images:
> >>> same instructions but randomly re-align variables both in terms
> >>> of absolute address and in terms of relative position to each
> >>> other. Plus randomize bootmem allocs and never-gets-freed-really
> >>> boot-time allocations.
> >>>
> >>> Really hard to do i think ...
> >>
> >> Ouch, yeah.
> >>
> >> On the other hand, the numbers made sense to me, so I don't
> >> see why there is any reason to distrust them.  They show a 5%
> >> overhead with pv_ops enabled, reduced to a 2% overhead with
> >> the changes.  That is more or less what would match my
> >> intuition from seeing the code.
> >
> > Yeah - it was Jeremy who expressed doubt in the numbers, not me.
>
> Mainly because I was seeing the instruction and cycle counts completely
> unchanged from run to run, which is implausible.  They're not zero, so
> they're clearly measurements of *something*, but not cycles and
> instructions, since we know that they're changing.  So what are they
> measurements of?  And if they're not what they claim, are the other
> numbers more meaningful?
>
> It's easy to read the numbers as confirmations of preconceived
> expectations of the outcomes, but that's - as I said - unsatisfying.
>
> > And we need to eliminate that 2% as well - 2% is still an awful
> > lot of native kernel overhead from a kernel feature that 95%+ of
> > users do not make any use of.
>
> Well, I think there's a few points here:
>
>    1. the test in question is a bit vague about kernel and user
>       measurements.  I assume the stuff coming from perfcounters is
>       kernel-only state, but the elapsed time includes the usermode
>       component, and so will be affected by the usermode page placement
>       and cache effects.  If I change the test to copy the test
>       executable (statically linked, to avoid libraries), then that
>       should at least fuzz out user page placement.
>    2. Its true that the cache effects could be due to the precise layout
>       of the kernel executable; but if those effects are swamping
>       effects of the changes to improve pvops then its unclear what the
>       point of the exercise is.  Especially since:
>    3. It is a config option, so if someone is sensitive to the
>       performance hit and it gives them no useful functionality to
>       offset it, then it can be disabled.  Distros tend to enable it
>       because they tend to value function and flexibility over raw
>       performance; they tend to enable things like audit, selinux,
>       modules which all have performance hits of a similar scale (of
>       course, you could argue that more people get benefit from those
>       features to offset their costs).  But,
>    4. I think you're underestimating the number of people who get
>       benefit from pvops; the Xen userbase is actually pretty large, and
>       KVM will use pvops hooks when available to improve Linux-as-guest.
>    5. Also, we're looking at a single benchmark with no obvious
>       relevance to a real workload.  Perhaps there are workloads which
>       continuously mash mmap/munmap/mremap(!), but I think they're
>       fairly rare.  Such a benchmark is useful for tuning specific
>       areas, but if we're going to evaluate pvops overhead, it would be
>       nice to use something a bit broader to base our measurements on.
>       Also, what weighting are we going to put on 32 vs 64 bit?  Equally
>       important?  One more than the other?

I saw _most_ of the extra overhead show up in the page fault path. And also
don't forget that fork/exit workloads are essentially mashing mmap/munmap.

So things which mash these paths include kbuild, scripts, and some malloc
patterns (like you might see in MySQL running OLTP).

Of course they tend to do other stuff as well, so 2% in a
microbenchmark will be much smaller, but that was never in dispute. One of the
hardest problems is adding lots of features to critical paths that
individually "never show a statistical difference on any real workload",
but combine to slow things down. It really sucks to have people upgrade
and performance go down.

As an anecdote, I had a problem where an ISV upgraded SLES9 to SLES10
and their software's performance dropped 30% or so. And there were like
3 or 4 things that could be bisected to show a few % of that. This was
without pvops mind you, but in very similar paths (mmap/munmap/page
fault/teardown). The pvops stuff was basically just an extension of that
saga.

OK, that's probably an extreme case, but any of this stuff must always
be considered a critical fastpath IMO. We know any slowdown is going to
hurt in the long run.


> All that said, I would like to get the pvops overhead down to
> unmeasurable - the ideal would be to be able to justify removing the
> config option altogether and leave it always enabled.
>
> The tradeoff, as always, is how much other complexity are we willing to
> stand to get there?  The addition of a new calling convention is already
> fairly esoteric, but so far it has got us a 60% reduction in overhead
> (in this test).  But going further is going to get more complex.

If the complexity is not in generic code and constrained within pvops
stuff, then from my POV "as much as it takes", and you get to maintain
it ;)

Well, that's a bit unfair. From a distro POV, I'd love that to be the
case because we ship pvops. From a kernel.org point of view, you provide
a service that inevitably will have some cost but can be configured out.
But I do think that it would be in your interest too because the speed
of these paths should be important even for virtualised systems.


> For example, the next step would be to attack set_pte (including
> set_pte_*, pte_clear, etc), to make them use the new calling convention,
> and possibly make them inlineable (ie, to get it as close as possible to
> the non-pvops case).  But that will require them to be implemented in
> asm (to guarantee that they only use the registers they're allowed to
> use), and we already have 3 variants of each for the different pagetable
> modes.  All completely doable, and not even very hard, but it will be
> just one more thing to maintain - we just need to be sure the payoff is
> worth it.

Thanks for what you've done so far. I would like to see this taken as
far as possible. I think it is very worthwhile although complexity is
obviously a very real concern too.


^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] paravirt/xen: add pvop for page_is_ram
  2009-03-10  1:07   ` H. Peter Anvin
@ 2009-03-10 21:19       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-10 21:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>   
>> From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>>
>> A guest domain may have external pages mapped into its address space,
>> in order to share memory with other domains.  These shared pages are
>> more akin to io mappings than real RAM, and should not pass the
>> page_is_ram test.  Add a paravirt op for this so that a hypervisor
>> backend can validate whether a page should be considered ram or not.
>>
>> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>>
>>     
>
> Why are these pages mapped as RAM in the memory map?  That is the right
> way to handle that, not by adding yet another bloody hook...
>   
Granted pages can turn up anywhere dynamically, since they're pages 
borrowed from other domains for the purposes of IO.  They're not static 
regions of non-RAM like the other cases page_is_ram() tests for.

They can't be mapped via normal pte operations (because they have 
additional state associated with them, like the grant handle), so 
/dev/mem can't just create an aliased mapping by copying the pte.

page_is_ram is used to:

   1. prevent /dev/mem from mapping non-RAM pages
   2. prevent ioremap from mapping any RAM pages
   3. test for RAMness in PAT

3) isn't yet relevant to Xen; ioremap can't map granted pages either, so 
2) isn't terribly relevant, so the main motivation for this patch is 
1).  This allows us to reject usermode attempts to map granted pages, 
rather than oopsing (as a failed set_pte will raise a page fault).

So, more cosmetic than essential, but I don't see a better way to 
implement this functionality if it's to be there at all.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread


* Re: [PATCH] paravirt/xen: add pvop for page_is_ram
  2009-03-10 21:19       ` Jeremy Fitzhardinge
  (?)
@ 2009-03-10 22:21       ` H. Peter Anvin
  2009-03-10 22:44           ` Jeremy Fitzhardinge
  -1 siblings, 1 reply; 121+ messages in thread
From: H. Peter Anvin @ 2009-03-10 22:21 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge

Jeremy Fitzhardinge wrote:
>>
>> Why are these pages mapped as RAM in the memory map?  That is the right
>> way to handle that, not by adding yet another bloody hook...
>>   
> Granted pages can turn up anywhere dynamically, since they're pages
> borrowed from other domains for the purposes of IO.  They're not static
> regions of non-RAM like the other cases page_is_ram() tests for,
> 
> They can't be mapped via normal pte operations (because they have
> additional state associated with them, like the grant handle), so
> /dev/mem can't just create an aliased mapping by copying the pte.
> 
> page_is_ram is used to:
> 
>   1. prevent /dev/mem from mapping non-RAM pages
>   2. prevent ioremap from mapping any RAM pages
>   3. testing for RAMness in PAT
> 
> 3) isn't yet relevant to Xen; ioremap can't map granted pages either, so
> 2) isn't terribly relevent, so the main motivation for this patch is
> 1).  This allows us to reject usermode attempts to map granted pages,
> rather than oopsing (as a failed set_pte will raise a page fault).
> 
> So, more cosmetic than essential, but I don't see a better way to
> implement this functionality if its to be there at all.
> 

OK, that is a valid use case and I agree about repurposing the
existing interface.  However, it is also a definition change in the
interface, so it really should be renamed first.

Would you be willing to break this patch up into one which renames the
interface and then a second which adds the pv hook?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 121+ messages in thread

* Re: [PATCH] paravirt/xen: add pvop for page_is_ram
  2009-03-10 22:21       ` H. Peter Anvin
@ 2009-03-10 22:44           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 121+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-10 22:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Xen-devel,
	Jeremy Fitzhardinge

H. Peter Anvin wrote:
>> 3) isn't yet relevant to Xen; ioremap can't map granted pages either, so
>> 2) isn't terribly relevent, so the main motivation for this patch is
>> 1).  This allows us to reject usermode attempts to map granted pages,
>> rather than oopsing (as a failed set_pte will raise a page fault).
>>
>> So, more cosmetic than essential, but I don't see a better way to
>> implement this functionality if its to be there at all.
>>
>>     
>
> OK, that is a valid usage case and I agree about repurposing the
> existing interface.  However, it is also a definition change in the
> interface, so it really should be renamed first.
>
> Would you be willing to break this patch up into one which renames the
> interface and then a second which adds the pv hook?
>   

Well, on reflection, given that the thing we're testing for is "is this 
page allowed to be mapped by /dev/mem?", and devmem_is_allowed() already 
exists for precisely that reason, the answer is to put the hook there...

But, it seems I got the logic wrong anyway.  /dev/mem doesn't allow RAM 
pages to be mapped anyway, so granted pages masquerading as RAM will not 
be mappable via /dev/mem.  So I think we can safely drop this patch with 
no further ado.

    J

^ permalink raw reply	[flat|nested] 121+ messages in thread


end of thread, other threads:[~2009-03-10 22:45 UTC | newest]

Thread overview: 121+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-28  1:59 [PATCH] xen: core dom0 support Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen dom0: Make hvc_xen console work for dom0 Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen dom0: Initialize xenbus " Jeremy Fitzhardinge
2009-02-28  1:59   ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen dom0: Set up basic IO permissions " Jeremy Fitzhardinge
2009-02-28  1:59   ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen-dom0: only selectively disable cpu features Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen dom0: Add support for the platform_ops hypercall Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen mtrr: Add mtrr_ops support for Xen mtrr Jeremy Fitzhardinge
2009-02-28  1:59   ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen: disable PAT Jeremy Fitzhardinge
2009-02-28  1:59   ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen/dom0: use _PAGE_IOMAP in ioremap to do machine mappings Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] paravirt/xen: add pvop for page_is_ram Jeremy Fitzhardinge
2009-03-10  1:07   ` H. Peter Anvin
2009-03-10 21:19     ` Jeremy Fitzhardinge
2009-03-10 21:19       ` Jeremy Fitzhardinge
2009-03-10 22:21       ` H. Peter Anvin
2009-03-10 22:44         ` Jeremy Fitzhardinge
2009-03-10 22:44           ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen/dom0: Use host E820 map Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen: implement XENMEM_machphys_mapping Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen: clear reserved bits in l3 entries given in the initial pagetables Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen/dom0: add XEN_DOM0 config option Jeremy Fitzhardinge
2009-02-28  1:59   ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen: allow enable use of VGA console on dom0 Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen mtrr: Use specific cpu_has_foo macros instead of generic cpu_has() Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen mtrr: Kill some unneccessary includes Jeremy Fitzhardinge
2009-02-28  1:59   ` Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen mtrr: Use generic_validate_add_page() Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen mtrr: Implement xen_get_free_region() Jeremy Fitzhardinge
2009-02-28  1:59 ` [PATCH] xen mtrr: Add xen_{get,set}_mtrr() implementations Jeremy Fitzhardinge
2009-02-28  5:28 ` [PATCH] xen: core dom0 support Andrew Morton
2009-02-28  6:52   ` Jeremy Fitzhardinge
2009-02-28  6:52     ` Jeremy Fitzhardinge
2009-02-28  7:20     ` Ingo Molnar
2009-02-28  7:20       ` Ingo Molnar
2009-02-28  8:05       ` Jeremy Fitzhardinge
2009-02-28  8:05         ` Jeremy Fitzhardinge
2009-02-28  8:36         ` Ingo Molnar
2009-02-28  8:36           ` Ingo Molnar
2009-02-28  9:57           ` Jeremy Fitzhardinge
2009-02-28  9:57             ` Jeremy Fitzhardinge
2009-03-02  9:26       ` Gerd Hoffmann
2009-03-02  9:26         ` Gerd Hoffmann
2009-03-02 12:04         ` Ingo Molnar
2009-03-02 12:04           ` Ingo Molnar
2009-03-02 12:26           ` Gerd Hoffmann
2009-03-02 12:26             ` Gerd Hoffmann
2009-02-28 12:09     ` Nick Piggin
2009-02-28 12:09       ` Nick Piggin
2009-02-28 18:11       ` [Xen-devel] " Jody Belka
2009-02-28 18:11         ` Jody Belka
2009-02-28 18:15         ` Andi Kleen
2009-03-01 23:38           ` Jeremy Fitzhardinge
2009-03-01 23:38             ` Jeremy Fitzhardinge
2009-03-02  0:14             ` Andi Kleen
2009-03-01 23:27       ` Jeremy Fitzhardinge
2009-03-01 23:27         ` Jeremy Fitzhardinge
2009-03-02  6:37         ` Nick Piggin
2009-03-02  6:37           ` Nick Piggin
2009-03-02  8:05           ` Jeremy Fitzhardinge
2009-03-02  8:05             ` Jeremy Fitzhardinge
2009-03-02  8:19             ` Nick Piggin
2009-03-02  8:19               ` Nick Piggin
2009-03-02  9:05               ` Jeremy Fitzhardinge
2009-03-04 17:34             ` Anthony Liguori
2009-03-04 17:34               ` Anthony Liguori
2009-03-04 17:38               ` Jeremy Fitzhardinge
2009-03-04 17:38                 ` Jeremy Fitzhardinge
2009-03-05 10:59               ` [Xen-devel] " George Dunlap
2009-03-05 10:59                 ` George Dunlap
2009-03-05 14:37                 ` [Xen-devel] " Anthony Liguori
2009-03-05 14:37                   ` Anthony Liguori
2009-03-04 17:31           ` Anthony Liguori
2009-03-04 17:31             ` Anthony Liguori
2009-03-04 19:03         ` Anthony Liguori
2009-03-04 19:16           ` H. Peter Anvin
2009-03-04 19:33             ` Anthony Liguori
2009-03-04 19:33               ` Anthony Liguori
2009-02-28 16:14     ` Andi Kleen
2009-03-01 23:34       ` Jeremy Fitzhardinge
2009-03-01 23:34         ` Jeremy Fitzhardinge
2009-03-01 23:52         ` H. Peter Anvin
2009-03-02  0:08           ` Jeremy Fitzhardinge
2009-03-02  0:08             ` Jeremy Fitzhardinge
2009-03-02  0:14             ` H. Peter Anvin
2009-03-02  0:42               ` Jeremy Fitzhardinge
2009-03-02  0:42                 ` Jeremy Fitzhardinge
2009-03-02  0:46                 ` H. Peter Anvin
2009-03-02  0:10         ` Andi Kleen
2009-02-28  8:42   ` Ingo Molnar
2009-02-28  8:42     ` Ingo Molnar
2009-02-28  9:46     ` Jeremy Fitzhardinge
2009-02-28  9:46       ` Jeremy Fitzhardinge
2009-03-02 12:08       ` Ingo Molnar
2009-03-02 12:08         ` Ingo Molnar
2009-03-07  9:06         ` Jeremy Fitzhardinge
2009-03-07  9:06           ` Jeremy Fitzhardinge
2009-03-08 11:01           ` Ingo Molnar
2009-03-08 11:01             ` Ingo Molnar
2009-03-08 21:56             ` H. Peter Anvin
2009-03-08 22:06               ` Ingo Molnar
2009-03-08 22:06                 ` Ingo Molnar
2009-03-08 22:08                 ` H. Peter Anvin
2009-03-08 22:12                   ` Ingo Molnar
2009-03-08 22:12                     ` Ingo Molnar
2009-03-09 18:06                     ` Jeremy Fitzhardinge
2009-03-09 18:06                       ` Jeremy Fitzhardinge
2009-03-10 12:44                       ` Ingo Molnar
2009-03-10 12:44                         ` Ingo Molnar
2009-03-10 12:49                       ` Nick Piggin
2009-03-10 12:49                         ` Nick Piggin
2009-03-05 13:52   ` Morten P.D. Stevens
2009-03-08 14:25     ` Manfred Knick
2009-03-09 19:51       ` Morten P.D. Stevens
2009-03-09 20:00       ` Morten P.D. Stevens
2009-02-28  6:17 ` Boris Derzhavets
2009-02-28  6:23   ` [Xen-devel] " Jeremy Fitzhardinge
2009-02-28  6:23     ` Jeremy Fitzhardinge
2009-02-28  6:28     ` Boris Derzhavets
